Google and Large Scale Data Models Like Panda

Search engine optimization grows and changes much as the Web itself does. With the recent addition of Google Plus to the services that Google offers, and this year’s introduction of the Big Panda updates, one of the growing areas of SEO involves seeing how Google and other search engines might incorporate more user information into how they rank webpages. The introduction of Google Plus has highlighted the importance of looking at how the search engine collects information about how people search, how they browse the Web, what they publish online, and how they interact with others in social networks, and at what the search engine might do with that information.

With the Panda updates, we’ve seen Google introducing a way of modeling information in large scale data sets, like the Web, to try to identify and predict features of webpages that can be used to rank pages not only on the basis of relevance and popularity (based upon the links pointing to those pages), but also upon a range of other features such as credibility, trust, originality, range of coverage of a topic, usability, and more.

I’ve been looking back at some of the patents that Google published, and ran into a couple that really weren’t discussed much when they were originally published, and probably should be talked about a little more.

One of them oddly is very similar to a patent from Microsoft that I wrote about back in 2007, in a post titled Personalization Through Tracking Triplets of Users, Queries, and Web Pages. The Google patent involves ranking documents on the Web by predicting which page might be selected by searchers faced with a set of search results. That prediction is based upon the collection of data in the form of tens of millions of “instances,” or information collected about queries, users, and documents. This patent was originally filed back in 2003, and was granted in 2007.

Around the same time that patent was granted, the same group of inventors from Google published another patent that focused less specifically on user data and more on building useful prediction models using machine learning that could help identify spam in emails, or predict which ads people might click upon in paid search, or how webpages should be ranked in organic search.

Instances of Data

In the first Google patent, the model being built looked at a combination of data from users, the queries that they used, and the documents that they may or may not have selected. Each of these combinations is referred to as an “instance.” An instance is a “triple” of data: (u, q, d), where u is user information, q is query data from the user, and d is document information relating to pages returned from the query data.

Some examples include:

  • Country where user u is located,
  • Time of day user u provided query q,
  • Language of country where user u is located,
  • Each of previous three queries that user u provided,
  • Language of query q,
  • Exact string of query q,
  • Word(s) in query q,
  • Number of words in query q,
  • Each of the words in document d,
  • Each of the words in the Uniform Resource Locator (URL) of document d,
  • The top level domain in the URL of document d,
  • Each of the prefixes of the URL of document d,
  • Each of the words in the title of document d,
  • Each of the words in the links pointing to document d,
  • Each of the words in the title of the documents shown above and below document d for query q,
  • The number of times a word in query q matches a word in document d,
  • The number of times user u has previously accessed document d, and
  • Other information.

This is just a small handful of the types of information that could be stored by the search engine, and the patent notes that it’s possible that the data repository may collect more than 5 million distinct features.
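
To make the idea of an instance more concrete, here is a minimal sketch in Python of how a (u, q, d) triple might be turned into a sparse set of named features. This is only an illustration under my own assumptions: the `Instance` class, its field names, the feature naming scheme, and the example URL are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One (u, q, d) triple. All field names are hypothetical."""
    user_country: str   # part of the user information u
    query: str          # the query string q
    doc_url: str        # part of the document information d
    doc_title: str      # part of the document information d


def extract_features(inst: Instance) -> set:
    """Turn an instance into a sparse set of named features."""
    query_words = inst.query.lower().split()
    title_words = inst.doc_title.lower().split()
    features = {
        f"user_country={inst.user_country}",
        f"query_string={inst.query.lower()}",
        f"query_length={len(query_words)}",
    }
    # One feature per word in the query and per word in the title.
    features.update(f"query_word={w}" for w in query_words)
    features.update(f"title_word={w}" for w in title_words)
    # Number of query words that also appear in the document title.
    matches = sum(1 for w in query_words if w in title_words)
    features.add(f"query_title_matches={matches}")
    return features


example = Instance(
    user_country="US",
    query="panda update",
    doc_url="http://example.com/panda",
    doc_title="Google Panda Update Explained",
)
features = extract_features(example)
```

Because every distinct word, query string, or country becomes its own feature name, it is easy to see how a repository built this way could grow to millions of distinct features.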

The patent is:

Ranking documents based on large data sets
Invented by Jeremy Bem, Georges R. Harik, Joshua L. Levenberg, Noam Shazeer, and Simon Tong
Assigned to Google
US Patent 7,231,399
Granted June 12, 2007
Filed: November 14, 2003

Abstract

A system ranks documents based, at least in part, on a ranking model. The ranking model may be generated to predict the likelihood that a document will be selected. The system may receive a search query and identify documents relating to the search query. The system may then rank the documents based, at least in part, on the ranking model and form search results for the search query from the ranked documents.

In addition to collecting large amounts of information about each instance, the model works to find connections between that data to build a model about how people search the web, the queries that they use, and the pages that they choose or decide not to click upon.

So, the query data collected might include search terms previously provided by users to find specific pages, the user data might include Internet Protocol addresses, cookie information, query languages, and/or geographical information associated with the users, and the document information may include data about specific pages that were presented to users in search results, and which positions those documents were at when they were selected or passed by.
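
As a rough sketch of what "ranking by predicted likelihood of selection" could look like, the Python below scores each candidate document with a weighted sum of features and squashes that score into a probability. The logistic function, the feature names, the weights, and the bias are all my own assumptions for illustration; the patent does not tell us what the real model or its values look like.

```python
import math

# Hypothetical learned weights (feature name -> weight). In a real system
# these would be learned from logged instances, not set by hand.
WEIGHTS = {
    "query_title_matches": 1.2,
    "query_url_matches": 0.5,
    "prior_visits_by_user": 0.8,
}
BIAS = -1.0


def selection_probability(features: dict) -> float:
    """Estimate the probability that a searcher selects a document."""
    score = BIAS + sum(
        WEIGHTS.get(name, 0.0) * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-score))


def rank(candidates: dict) -> list:
    """Order candidate URLs by predicted selection probability, highest first."""
    return sorted(
        candidates,
        key=lambda url: selection_probability(candidates[url]),
        reverse=True,
    )


candidates = {
    "a.example/page": {"query_title_matches": 2, "query_url_matches": 1, "prior_visits_by_user": 0},
    "b.example/page": {"query_title_matches": 0, "query_url_matches": 0, "prior_visits_by_user": 1},
}
ranked = rank(candidates)
```

In this toy example the first page outranks the second because its title and URL match the query, even though the user has visited the second page before.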

One of the focuses of this prediction approach relies considerably upon whether or not a page was selected in search results. That seems like a potential problem.

When someone chooses a page to look at from search results, all they see is a page title, a snippet, and a URL. They aren’t making a judgment based upon the documents themselves. Personally, when I perform a search, I’ll often open a number of results in a new tab if they look somewhat relevant to my informational need. I like to have more than one source of information, and my expectation is that having a few pages to look at is going to provide a better answer to any questions I might have than just looking at one. Those selections don’t necessarily mean that I found one document more relevant or higher quality than any of the others.

Models Based upon Document Features

While the idea of looking at instances and triples of data involving users, queries and documents is interesting and potentially a useful way of ranking documents, the model building aspect of that patent might be useful if focused in other areas as well. The second patent from Google sounds like a document classification model approach that could potentially power an update like Google’s Panda.

It doesn’t focus specifically upon ranking Web pages, but it tells us that this kind of model building could be useful in a number of ways:

Different models may be generated for use in different contexts.

For example, in an exemplary e-mail context, a model may be generated to classify e-mail as either spam or normal (non-spam) e-mail.

In an exemplary advertisement context, a model may be generated to estimate the probability that a user will click on a particular advertisement.

In an exemplary document ranking context, a model may be generated in connection with a search to estimate the probability that a user will find a particular search result relevant.* Other models may be generated in other contexts where a large number of data items exist as training data to train the model.

(*My emphasis)

The patent is:

Large scale machine learning systems and methods
Invented by Jeremy Bem, Georges R. Harik, Joshua L. Levenberg, Noam Shazeer, and Simon Tong
Assigned to Google
US Patent 7,769,763
Granted August 3, 2010
Filed: April 17, 2007

Abstract

A system for generating a model is provided. The system generates, or selects, candidate conditions and generates, or otherwise obtains, statistics regarding the candidate conditions. The system also forms rules based, at least in part, on the statistics and the candidate conditions and selectively adds the rules to the model.

The classification models described in this patent are built from training data that includes multiple attributes or features. The patent mostly provides examples involving email and spam detection, but as the inventors note, the same approach could be used to predict which ads people click upon or how relevant a particular search result might be to a searcher.

For example, in an email context, one of the things that the classification system might look for is mentions of the word “free”. Or it might look for strings of exclamation points!!!!!!!!! Or it might look for combinations of features, such as mentions of the word “free” coming from the Hotmail domain. A large number of features might be considered in a set of training data as candidates that might indicate whether an email is or isn’t spam.
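
The general flow from the abstract (generate candidate conditions, gather statistics about them, and selectively add rules to the model) can be sketched in a few lines of Python using the email examples above. Everything specific here is my own invention for illustration: the tiny training set, the condition names, and the simple threshold test in `select_rules`. The patent describes its own methods for deciding which rules to add, and this sketch does not reproduce them.

```python
import re

# Tiny made-up training set: (sender domain, body text, is_spam).
TRAINING = [
    ("hotmail.com", "Get your FREE prize now!!!!", True),
    ("hotmail.com", "free pills, click here", True),
    ("example.org", "Are you free for lunch on Friday?", False),
    ("example.org", "Meeting notes attached", False),
    ("hotmail.com", "Photos from the weekend", False),
]

# Candidate conditions: name -> predicate over (domain, body).
CONDITIONS = {
    "mentions_free": lambda domain, body: "free" in body.lower(),
    "many_exclamations": lambda domain, body: re.search(r"!{3,}", body) is not None,
    "free_and_hotmail": lambda domain, body: "free" in body.lower()
    and domain == "hotmail.com",
}


def condition_statistics(training, conditions):
    """For each condition, count the emails it matches and how many were spam."""
    stats = {}
    for name, predicate in conditions.items():
        matched = [is_spam for domain, body, is_spam in training
                   if predicate(domain, body)]
        stats[name] = (len(matched), sum(matched))
    return stats


def select_rules(stats, min_matches=2, min_spam_rate=0.9):
    """Keep conditions that match enough emails and almost always mean spam."""
    return [name for name, (matches, spam) in stats.items()
            if matches >= min_matches and spam / matches >= min_spam_rate]


stats = condition_statistics(TRAINING, CONDITIONS)
rules = select_rules(stats)
```

On this toy data, the word “free” alone is not reliable enough (one of its three matches is a lunch invitation), and the exclamation point condition matches too few emails, so only the combined condition of “free” plus the Hotmail domain is kept as a rule.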

This patent also tells us that one of the difficulties with using training sets in a classification model like this is that present day classification systems can only handle small quantities of training data.

The breakthrough of using MapReduce to handle large training sets as described in PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce (pdf), by Biswanath Panda, Joshua S. Herbach, Sugato Basu, and Roberto J. Bayardo, could be one of the technological solutions that has helped Google overcome that limitation. The PLANET paper describes an experiment involving predicting click throughs on advertisements based upon a prediction model based upon features associated with those ads and the landing pages they point to. The experiment is detailed more fully in Predicting Bounce Rates in Sponsored Search Advertisements.

The Bounce Rate paper describes looking at triples of data that Google collected involving (q, c, p) query terms (q), creatives or advertisements (c) and landing pages (p). The paper also describes some of the specific features that they might rate sites upon, such as terms used in advertisements and landing pages, related terms used in those documents, categories that pages might fit into, and more.

The patent also describes three different approaches to models that might be created, and how new features that could be added might be identified and tested.

Conclusion

If you’re interested in Google’s Panda updates, it’s worth spending some time looking through the PLANET and the Bounce Rate papers, and these Large Scale Data Model patent filings to get a sense of how Google may have developed the models that they are using to classify pages based upon features found within or associated with pages in the seed sets they’ve used to rank pages.

It’s possible that Google may have built classification models that work somewhat differently than those described in these documents, but the end result is the same. The method to build those models is probably less important than the pages that they chose as training sets, and the features that they identified as important – those that might be used to define quality, credibility, originality, topic coverage, and other qualities that were hinted at in the Google Webmaster Central blog post that I linked to above from Amit Singhal.

31 thoughts on “Google and Large Scale Data Models Like Panda”

  1. Interesting how you’re tying these patents together across the years. I agree that the Panda technology may indeed have made it possible for Google to integrate signals together in a way it previously could not.

    I think, however, that the critical breakthrough is more likely documented in “The Model-Summary Problem and a Solution for Trees” as that case study deals directly with organizing large-scale data for the purpose of identifying the correct model to match real-world observations.

  2. Thanks Bill,

    A combination of user data, query data and document data to define the relevant information sounds like a neat way to start a machine learning principle. I would hazard a guess that correlating the click through data against the bounce rate / time spent on page / next user search would go some way toward suggesting the precision of the response from the engine. I’m not sure how this would work with tabbed browsing though but something could be said for the order in which the links are clicked.

    I suppose then – that by taking this localised data for a single search query from a single user and cross-correlating this against another search query where the u, d and q data types aren’t identical wouldn’t give a true comparison of effectiveness but could be used to define a trend. The number of variables in the u field could be reduced using the autofill option google has enabled, the user trend data could be defined using the user profile through historic reaches.

    It’s certainly one to give more considered thought to – I know I’ll be pondering this one for a little while.

  3. Hi Michael,

    I’m definitely doing all I can to build some context for the Panda updates. From what I’ve read in the Wired interview with Matt Cutts and Amit Singhal, the updates were being worked upon before anyone named Panda was involved. The technological breakthrough that appears to have involved Panda does seem to be able to build a classification system that can handle very large datasets, or at least to focus upon observational data involving websites themselves rather than user activity, and then possibly involve that user data as a feedback mechanism to see whether or not the predictive model behind Panda was actually finding what might be the most relevant sites. The Planet paper points to the first – using MapReduce in a way that can handle that large of a model. The interview with Cutts and Singhal points to the second – Panda helped them to focus upon the features on pages and sites rather than upon user clicks, to see if those features could help predict the clicks. So, it’s possible that his influence helped in both ways.

    The Model Summary paper, The Model-Summary Problem and a Solution for Trees, primarily focuses upon looking at a very large collection of observational data about bird ecologies and finding ways to pull out meaningful and useful patterns from that data.

    I wish it focused more directly upon how predictive models based upon ensemble decision trees could be used to try to identify pages or sites that might be the most relevant ones for a particular query. The context differs enough that the first 4 or 5 times that I read it, I wondered if Biswanath Panda was purposefully obfuscating what he was writing, but the facts that one of his co-authors is an ornithologist and a National Science Foundation grant was awarded for the research behind the paper led me to believe that his intent was really to help make it easier for scientists to interpret large amounts of observational data in meaningful ways.

    The Model Summary paper also involves building summaries that might yield useful data from the large set of observational data, and doesn’t involve the process of incorporating observations about pages into actual boosting or reductions of rankings for specific pages or sites. It may not be a far step from building that kind of summary to incorporating it into rankings, but that’s one of the interesting parts. What actual features on pages made them more likely to be relevant for specific queries? What features make it less likely that a page might be relevant for certain queries? What combinations of features would do the same? Why might some sites appear to be penalized as a whole, while others might have only seen some rankings reduced?

    If the process of classifications of documents for purposes of determining relevance of particular pages creates predictions, is actual user data being used as feedback for the Panda classifications? If so, that might explain why Panda isn’t an ongoing update, but seems to take place just every so often, so that actual user data could possibly be collected involving Panda produced changes, to suggest possible additional rules or changes to sites included within training sets.

    How would you interpret the Model Summary paper as it might apply to relevance and search rankings?

  4. Hi Tom,

    I think the original idea, in the patent filed in 2003 might have been to use that user information to classify and rank pages exactly as you point out, but it might suffer from a couple of problems, which I think the 2007 patent may have been intended to fix.

    One of them is that if a page isn’t ranked highly already, it might not get a lot of visitors and there might not be much user behavior data associated with it. The second is that people looking through search results and clicking on one aren’t necessarily making a decision to click on the document itself, but rather the summary in the snippet that the search engine produces. The snippet itself could be very very good, but the page that it leads to might not be anywhere near as good.

    So the question is, would it be better to rerank results based upon a prediction of relevance, based upon analyzing the features of some known very high quality websites, and then use the user information as feedback to see if your predictions were somewhat accurate?

  5. When you’re generating model summaries, it doesn’t matter if you’re looking at observations of birds at different elevations around the world or if you’re looking at scores for documents that were generated by document classifiers.

    Each model summary hypothesizes how the data might be sorted. The algorithm finds a way to quickly compute the model summaries so that a best match with a human division of data can be determined.

    The PLANET paper laid the foundation. The model summary paper showed how the process works. I consider it to be a proof of concept.

  6. Hi Michael,

    Thanks. Those are good points. The patent filings I wrote about are aspirational – hoping to be able to take incredibly large amounts of observational data to build a predictive model, but they don’t describe that final step of building actionable summaries that can be used to decide which pages might be the most relevant for which queries. Much like being able to predict which birds might be seen where and when in the model summaries paper.

    As a proof of concept, it does work pretty well.

  7. Bill, I always felt that google is using a predictive model for user metrics like CTR, bounce etc. using the features extracted from the page. This was ever since I read that paper on bounce rate and had even shared it with other folks on webmasterworld. It all makes sense considering the fact that panda is run separately and is triggered manually.

    They seem to be deriving these metrics by running some math on parsed terms extracted from the documents. Since most of these are predicted, the chances of false positives are high and Google acknowledged this fact while informing about the panda update.

    What kind of features do you think they extract from documents and how do you think they are doing it? Are things like keywords density influencing these metrics directly or indirectly?

  8. I honestly wouldn’t doubt that the bounce rate is already factored into a site’s rankings, because my pages that have videos seem to always crawl to the top of their targeted keywords, because of this (I think): My hypothesis is that the people that land on pages of my site that have a video, watch the whole thing, and then click back to google, which tells google how long they were on my site. If this has been a long time, google sees this, and believes the site was helpful since the user spent a lot of time on it. That’s my evidence that bounce rate already does have an effect on search rankings.

  9. It’s incredible to imagine that Google can store so much data about each individual from their searches!
    What power and storage that must use!
    I think that since the publication of these patents, Google has tested and analyzed which data is the most interesting to store, relative to the size of storage and added value. Information about a document or a search can be more easily aggregated.

  10. I think Google’s concession this week that they have added more signals to the Panda process supports the research you’re doing, Bill. Even if you haven’t identified the right signals, you’re on the right track.

    People need to think about how the search engines are evaluating their Websites in a totally different way now. Bing may not yet have something like Panda but I’m sure they will go that route. Yandex appears to be the first major search engine to let artificial intelligence do the evaluating.

  11. Sheesh. It sounds like I’m saying you’re NOT identifying the right signals. Didn’t mean for it to come out that way. Of course we don’t know, Mr. Spock, but I trust your guesses far better than many other people’s facts.

  12. It will be interesting to see how Google uses the data it will now be collecting from Google+ profiles to improve upon personal search experience.

    Is this classification system for large data-sets similar to PLSA and information retrieval, and if so, how does Google take user information such as preferences or things like search intent and apply it to their algorithm?

  13. Fully understanding the metrics and how Google does the math is really a very important thing to look at if you’re serious on successfully ranking your sites.

    With the coming of Google+ in the scene, I’m really curious how Google would handle the information flowing in.

  14. Hi Rajesh,

    It’s possible that they may be using predictive models involving data on the pages themselves. For example, if they were to pick out terms that might be related to the topic the page may be about and the frequency of terms used on those pages (especially for terms that don’t appear at a high level of frequency on the Web), that might tell them something about the range and coverage of the topic involved. That might be one element involved in Panda.

  15. Hi David,

    How would you define a “bounce”? If they just view a single page, and then return to Google, wouldn’t that be a bounce, even if they spent time on that page watching a video? It’s possible that instead of defining that user activity as whether or not it was a bounce, they might instead be looking at the duration of a stay on a page, or what sometimes might be referred to as a “long click.” There’s a mention of this possibly being a signal that Google looks at in the book “In the Plex” by Steven Levy.

  16. Hi Nicolas,

    I do believe that Google stores more data involving usage information about the Web than they do about the Web itself. It is incredible to think about.

    The challenge might not be so much whether or not to use that kind of data, but rather which data to use, and how to use it.

  17. Hi Michael,

    Thanks. I agree that people should be thinking differently about how search engines are evaluating their pages. Some of the issues that are problems when it comes to Panda are issues that I would have recommended be addressed before Panda, such as eliminating as much as possible the chance that a page can be accessed at more than one URL on a site, or avoiding very thin pages as much as possible. But, the range of signals to be concerned about is likely much broader now.

  18. Hi Gary,

    It will be interesting to see how Google might incorporate the additional signals they receive from Google Plus profiles and their other social initiatives.

    It is possible that something like PLSA is involved. Some previous posts I wrote that consider PLSI or a probabilistic generative model:

    http://www.seobythesea.com/2011/02/document-level-classifiers-and-google-spam-identification/
    http://www.seobythesea.com/2011/01/why-a-search-engine-might-cluster-concepts-to-improve-search-results/
    http://www.seobythesea.com/2007/11/google-and-personalization-in-rankings/
    http://www.seobythesea.com/2007/04/google-news-personalization/
    http://www.seobythesea.com/2007/03/google-patent-application-clustering-users-for-personalization/

  19. Hi Andrew,

    While I agree with you, I think the best chance you have of understanding the metric in use and the math is in working for Google. I don’t think it hurts to try to understand the possibilities of what they are doing as much as possible, though.

  20. Thanks Bill. If they are using user engagement data, I am almost certain that they are using some sort of predictive model to compute those metrics. Any analytics tool can be set up differently by different users to report bounce, etc. in different ways.

    The predictive model ensures that there is uniformity in the way these metrics are computed. People think that google is using their own analytics data to pandalize sites and many are even abandoning those tools. But many still assume that google will have to manually collect these data to know them.

  21. “the range and coverage of the topic involved”

    You seem to be saying something very similar to what they have as reading levels (Basic, intermediate and advanced). Is this any different from that? Any ideas on how those reading levels are determined?

  22. Hi Rajesh

    My suspicion is that Google is using data related to a page itself and its content, possibly site level information related to the page, and possibly even information about affiliated sites and features associated with those to predict a relevance for the page to specific topics and query terms. These features and metrics associated with them are identified based upon a seed set of sites. User information associated with the sites might possibly be then used to measure the value of those predictions rather than being used directly to impact the rankings of those sites.

    Google has so many ways to collect user information that I don’t think they need to look at Google Analytics data at all.

  23. Hi Rajesh,

    What I was talking about involves Google possibly looking at the meaningful terms and phrases on a page and their usage to determine how well a topic might be covered.

    For example, on a fairly thin page about a baseball stadium, we might be told about the location of the stadium and how many people it might hold and a few other tidbits of information. On a better quality page, we might be told the dimensions of the park, the records of the teams that played there, some historic events (Hank Aaron getting a home run record there, perfect games pitched at the stadium, etc.), statues around the park, and more. This might be done by looking at the range of terms included on a page, and possibly some co-occurrence data.

  24. Thanks again Bill. That was a very good example. But I am still not sure how this is different from the reading level classification. The sample elements you suggested in that example for a thin page are what I thought that I would see in a basic reading level page, and the elements about the park dimensions, history of events etc. are something that you will probably find in an intermediate or an advanced page, depending upon the coverage.

    I am in agreement with you on the fact that they use several data points to correlate things rather than using them directly.

  25. Hi Rajesh,

    Simple reading level algorithms don’t really look at the specific terms used in a body of content to determine a reading level, but rather look at things like word length and sentence length. For example:

    http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_test

    Google looks like they came up with a slightly different approach:

    http://www.seroundtable.com/google-reading-level-algorithm-12638.html

    Reading level isn’t the same as the depth of coverage of a topic, which is what I was describing. A thin content page can use long sentences and words with a lot of syllables in them and be considered thin content with a very high reading level. A page that covers a wide range of aspects of a topic in detail but uses smaller words and shorter sentences might be considered a thick page with a low reading level.

  26. Thanks for the article links Bill, much appreciated, I will have a good read through them

    A really good blog, glad I came across it!

    Best

    Gary

  27. Hi Bill,
    Thank you for looking out. Your articles are wonderful. I look forward to checking in with you periodically. Thank you for sharing.
    Liz
