According to Google’s Director of Research, Peter Norvig, if you look at Google Trends for terms like “full moon” or “ice cream”, you’ll see that searches for those terms mirror actual physical cycles in the world. With a very large number of queries performed for those terms, searches for “full moon” peak every 28 days, and searches for “ice cream” peak every summer, 365 days apart. Large amounts of data make interesting things possible.
If you’re interested in how search engines work, and how large amounts of data can help them do what they do more effectively, it’s highly recommended that you read the paper The Unreasonable Effectiveness of Data (pdf), written by Alon Halevy, Peter Norvig, and Fernando Pereira, from Google. Even more highly recommended is a presentation by Peter Norvig of the same name, from a Distinguished Lecture Series at the University of British Columbia last fall, which sadly has fewer than 1,000 views on YouTube at present:
In the presentation, Norvig uses mostly plain language and great examples to describe many areas where more data overcomes problems with algorithms such as:
Word Sense Disambiguation – Large amounts of data can help determine which meaning is intended when a word with more than one sense appears in a document on the Web.
Word Segmentation – Not as helpful in languages like English as it is in languages such as Chinese, where words aren’t separated by spaces, but it can also be helpful in the world of domain names, where there usually aren’t separations between words.
Statistical Machine Translation – As Norvig noted in the presentation, “We’ve been able to build models for languages that no one on the team speaks.”
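The word segmentation example from the presentation can be sketched in a few lines of Python. This is a minimal illustration of the idea, not Google’s implementation: it scores each candidate split by the product of unigram word probabilities, using a tiny made-up frequency table where a real system would use web-scale counts.

```python
from functools import lru_cache
from math import prod

# Hypothetical counts standing in for web-scale unigram data.
COUNTS = {"choose": 50, "spain": 30, "cho": 5, "ose": 2, "pain": 20, "s": 1}
TOTAL = sum(COUNTS.values())

def pword(word):
    # Unseen words get a probability that shrinks with length,
    # so the segmenter prefers known words over random fragments.
    return COUNTS.get(word, 0.1 / 10 ** len(word)) / TOTAL

@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable segmentation of text as a tuple of words."""
    if not text:
        return ()
    splits = ((text[:i], text[i:]) for i in range(1, len(text) + 1))
    candidates = [(first,) + segment(rest) for first, rest in splits]
    return max(candidates, key=lambda words: prod(pword(w) for w in words))
```

With the toy table above, `segment("choosespain")` comes out as `("choose", "spain")` rather than, say, `("cho", "ose", "spain")` — the point being that with enough data behind the frequency table, a very simple algorithm does the job.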
There are several papers that cite the Unreasonable Effectiveness of Data paper, and I found a couple that might be of interest to people who want more examples of how that might be applied to search and to search engines. These are worth a look:
On the Value of Page-Level Interactions in Web Search (pdf)
Exploring Web Scale Language Models for Search Query Processing (pdf)
38 thoughts on “Big Data at Google”
Thanks Bill, for taking an hour of my Saturday night! Still – the family were watching Hugh Grant on TV, so watching Peter on an iPhone – not so bad.
The lecture was really interesting. At Majestic we are wrestling with big data analysis at the moment – though in different contexts – and seeing the thinking as to how we can use large data to create short, iterative routines to predict outcomes is pretty interesting. I should say, though, that fortunately they don’t let me anywhere near the maths on our algorithms. I hope one day we’ll be wrestling with similar predictive modelling.
Thanks for taking up my evening Peter Norvig!
Awesome information shared by Peter. Word sense disambiguation in particular was very interesting, but I’m not sure how DIY SEOs could use that for real-world implementation.
Funny, but I was planning on writing about something completely different this morning, except the USPTO website wasn’t working well. I had wanted to return to the Unreasonable Effectiveness of Data paper for a while, and noticed the video. Seeing that it was almost an hour long, it wasn’t something I was going to write a post about — until I got about halfway through and saw the examples that Peter Norvig provided for text segmentation and for disambiguation of query terms.
I guess the thing I really liked about the presentation was that it didn’t require a math or computer science degree. The algorithms in the examples were pretty simple ones, and proved the point that lots of data can overcome limitations in algorithms that might be used to solve some problems. Glad to hear that you enjoyed them as well.
What I think might be useful in the presentation for DIY SEOs are some of the insights that it gives on how Google does some of the things they do.
Most of the papers I’ve seen about algorithms for word sense disambiguation or text segmentation or machine translation tend to become pretty complicated pretty quickly, but this presentation did a great job of avoiding that level of complexity. As an introduction to the topics, it’s a pretty good one.
I’m still going to use hyphens between words in directory names, and in file names for pages and images, rather than relying upon Google to segment that text themselves, but it’s still good to know how Google might try to do that segmentation on their own. When I do keyword research and come up with candidate keyword terms, I keep a close eye on terms that might have more than one meaning. We see that Google has an approach to disambiguating those words, but again, I’m going to be careful about where they might try to do that.
The word predictions also give great insight into how Google Suggest works, which could affect how the DIY SEO conducts their keyword strategy. I have, for ages, been wondering why, when I start to type, say, “murder mystery…” into the search box, Google wants to suggest “murder mystery party” over “murder mystery games”. I used to suspect that if I created lots of spurious searches for “murder mystery games” I might persuade Google’s algo to change its mind. I now see that is fruitless, because this data is taken from known occurrences of the three-word phrase on pages globally. Since “murder mystery party” is a brand of game that is available in stores, and murdermysterygames.co.uk is a brand with only one online store, it is at a disadvantage. Not a critical one, but before Google Suggest, the phrase “murder mystery games” was much more popular when users did not have a brand in mind. Now that is starting to change, because although the algo might be giving the “wrong” suggestion, the synonym is close enough for jazz, and so people go for it.
Frankly speaking, I don’t really know much about such things, but it’s true that the world is expanding rapidly, and because of search engines it’s shrinking virtually. The terms Word Sense Disambiguation and Statistical Machine Translation are new to me; thanks for making me aware of them.
Just the other day I was using Google Trends and wrote an article around a trending search term. There was a visible spike in traffic indeed, so now I’ve been wondering how I can utilize Trends more to my advantage. You posted this article at the perfect time for me. Thank you!
As a Spanish language SEO the disambiguation issue is one I deal with every day. This is because there are many words that have completely distinct meanings for different dialects. As you say, I will continue doing my own legwork to ensure that I am serving the interests of my clients, but it is always nice to know that the Big People are worried about the same things as the little people!
Lol, thought I would only watch that for a few minutes, and watched the whole thing. As someone new to blogging and SEO I found it to be very helpful.
You are right, Wil, the video really draws you in. I was only planning to watch a few minutes of it yet an hour later I was finishing up.
GREAT video. Thoroughly enjoyed it. Also dug the page level interaction pdf. Many thanks!
I think that an algorithm, even if advanced and “smart” (AI), can never fully grasp the semantics or meaning of a word, even if it is able to interpret the context in which it is inserted.
From an SEO point of view I believe that metaphors, for example, should be used with caution.
However, I know that the field of robotics is also working hard in this direction.
I printed out “The Unreasonable Effectiveness of Data” and will read it on today’s lunch break. Physics explained easily by math? Fascinating already.
That’s crazy that they can build models for languages none of the team speak. I’m endlessly impressed by that kind of capability. Also somewhat concerned.
Google has so much data and they’re not even sure what to do with it, especially in the social arena. However there isn’t anyone else out there as equipped (both academically and technically) to tackle this problem so we wait to see what Google comes up with next.
Hi Bill. You replied:
>>>What Google suggest shows as suggested queries probably goes beyond what Google sees on pages published on the Web, or in their query logs, and may also involve other user data such as what people tend to click upon, which pages they tend to stay upon longer, how they refine their searches in query sessions, and more.<<<
But it’s the same maths in the algorithm if that’s true! Just a different total when they count "SEO By… the sea" vs "SEO by Joast". But my guess is that when you look at phrases that start with "SEO by"… the ranking when you compare online mentions makes for a pretty good suggestion box, without much extra manipulation.
What Google suggest shows as suggested queries probably goes beyond what Google sees on pages published on the Web, or in their query logs, and may also involve other user data such as what people tend to click upon, which pages they tend to stay upon longer, how they refine their searches in query sessions, and more.
It is amazing that the kinds of data that Google collects comes from around the globe, and enables an increasingly intelligent approach to things like statistical translation. Google’s not perfect at it, but I think they get better on a consistent basis.
You’re welcome. It is interesting what you can learn using tools like Google Trends, and how helpful it can be when you’re writing something to see whether terms related to your topic are interesting to a lot of people. Or in choosing a topic. 🙂
It’s interesting that some languages have significant dialect differences. From a few of the papers and patents that I’ve read about local search around the world, I understand that’s a problem in Japan and China as well. There are often so many synonyms for terms like “restaurant” that showing relevant results can be difficult.
Hi wilhouse and Wailea,
I know the feeling. I started watching the video, expecting to stop after a few minutes myself. An hour later, I decided to write this post. 🙂
There are definitely many pitfalls in trying to understand language, and analogies and metaphors, but if a search engine is going to do what it does effectively, it may be something that they have to learn.
Glad to hear that you enjoyed the video, and the papers. The page interaction paper was pretty interesting. I’ve had someone point me to a few white papers that are pretty interesting about how people interact in different ways with search results pages, and will probably write about those sometime soon.
A very interesting paper. I hope that you enjoyed it.
Funny but that’s a little like the reaction I had when I heard that in the presentation.
Regardless of how well a statistical machine translation model might work, I think I’d like to have some linguists on hand to do some double checking. 🙂
I’ve heard that from at least one search engineer in the past as well. Big data might enable you to do a lot of things, but I suspect that sometimes looking at it all might leave you scratching your head wondering if you really are looking at the right things, and trying to find ways to use that data. But having too little data might be a bigger problem.
The data that Google uses for predictive queries in a dropdown go beyond just looking at online mentions, and can involve a number of other factors as well.
For instance, if there are suddenly a lot of searches for a particular query by searchers, Google may adjust the suggested queries shown to include some of those suddenly very popular queries.
I’ve also seen a query suggestion appear where the suggested query term appeared on just one single page on the Web, and nowhere else. The person involved was a well-known figure who had an active public life and was mentioned very frequently on the Web in relation to many different activities. If Google based suggested queries solely upon the frequency of mentions on the Web, the suggested query in question wouldn’t have appeared in the dropdown. If Google also based what they displayed on the frequency of queries submitted to the search engine, it wouldn’t have been appearing either. It was a controversial enough suggested query, though, that it was probably clicked upon frequently by searchers, and that may have played a large role in why it kept on appearing.
Yes, OK. QDF is likely to affect Google Suggest just like it affects the SERPs. I think the phrase “Katrina” (before Google Suggest existed, I think) showed that a phrase can literally change meaning within the algorithm based on user behaviour.
Grr… now I find myself doing tests on Google Suggest 🙂 – I now see Google flipping the order of words, which completely breaks my suggestion that Google uses this methodology for Google Suggest… or rather, there is at the very least another iteration that takes the phrases and checks the ordering of the suggested words. So well done Bill – I bow to you, sir.
Hey Bill and Dixon, in regard to your discussion about Google Suggest:
Experiments (in the context of online reputation management) have shown that the number of results Google finds for a certain phrase definitely increases the chance of that phrase being shown as a suggestion. Obviously it is not the only factor, but the impact is measurable. Search volume is a factor as well.
> It was controversial enough suggested query though that
> it was probably clicked upon frequently by searchers,
> and that may have played a large role in why it kept on appearing.
Clickrate is, with high confidence, a factor as well. That is (again from the perspective of online reputation management instead of classic SEO) the reason why negative phrases (like “company x sux”) tend to fuel themselves to the top, once Google starts showing them as suggestions.
I did an interview with a company who tested the manipulation of Google Suggest quite a bit in German a couple of months ago: link no longer available
Sadly I could not publish everything, since this technique is prone to be abused :-/
As we have learned from Peter Norvig, Google Translate should help you get the facts from said interview 😀
If you are looking for material in your native tongue about this topic, I recommend Bret D Payne’s attempt at manipulating Google Suggest with the help of Amazon Turk http://baldseo.com/presentations/Google-Suggest-Manipulation.pdf
I’ve read a number of patents involving query refinements and query suggestions from Google, Yahoo, and Bing, and we’re only really seeing some of the things within those suggestions that Google might offer at this point. At first, most of them tended to be suggestions based upon completing the words that you’ve typed into your search box, but we’re going to see more things like synonyms, or at least synonyms within the same or a similar context, as well.
Thank you for expanding on our discussion about Google Suggest. Both number of results and search volume do seem like they would play a role in what shows up as suggestions, but things like clickrates on previously shown suggestions definitely can play a part as well.
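To make the discussion above concrete, here is a toy sketch of how factors like result count, search volume, and clickrate might be blended into a single suggestion score. The weights, field names, and numbers are entirely hypothetical illustrations of the idea, not anything Google has published.

```python
from math import log1p

def suggest_score(result_count, search_volume, click_rate,
                  w_results=0.3, w_volume=0.4, w_clicks=0.3):
    # Log-scale the raw counts so one very large value can't dominate;
    # click_rate is already a 0..1 proportion, scaled into a similar range.
    return (w_results * log1p(result_count)
            + w_volume * log1p(search_volume)
            + w_clicks * click_rate * 10)

# Hypothetical data: (web result count, monthly search volume, clickrate).
candidates = {
    "murder mystery party": (120_000, 5_000, 0.35),
    "murder mystery games": (80_000, 4_000, 0.10),
}
ranked = sorted(candidates, key=lambda q: suggest_score(*candidates[q]),
                reverse=True)
```

In this made-up example, the phrase with more mentions, more searches, and a higher clickrate ranks first, which matches the feedback-loop behaviour described in the comments: once a suggestion starts getting clicked, its score keeps it in the dropdown.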
Thank you for the links to the articles. This is definitely one area where Google has some work to do to prevent people from trying to manipulate what shows up as suggestions, and as I mentioned to Dixon in my comment above, there are patent filings out there that show a number of additional possibilities that Google hasn’t incorporated into Google Suggest yet, but likely will. Some of those are reliant on big data, like n-gram analysis, and I imagine that part of the delay in seeing them does have to do with finding ways to keep them from being manipulated.
Trend or topic modeling is just getting started. At the university where I work we have an ongoing project with the NSF where we look at trends in grant proposals by topic (not keyword). This type of data is very useful not just for commercial purposes but to see what direction science and research is going in and trying to understand why.
With the incredible amount of content being created on the Web every day, it’s hard to keep track of it all. Being able to visualize in some manner the topics and concepts being discussed even on a site like Twitter or Digg is a challenge. I know that many health organizations are working to understand the direction of medical research much in the manner that it sounds like you’re attempting to do with science and research proposals.
Large amounts of data present some significant challenges, but as the paper and presentation show, if you have an ability to look at that data in meaningful ways, it can be really helpful as well.
Comments are closed.