How Google Might Classify Queries Differently at Different Data Centers

There’s some evidence that the Panda updates to Google’s ranking algorithm may be based upon a decision tree approach to classifying and creating quality scores for web pages and sites. Curious as to whether Google might be using a decision tree approach to classify other information, I went digging through some of Google’s other patent filings that I might not have covered here in the past.

I found one that may have interesting implications regarding how queries are classified and different data that may be stored and emphasized at different Google data centers or data partitions.

An overview from the patent filing of how queries may be classified so that one or more data sources might be selected to return search results from.

When we think of how a search engine works, we usually focus upon the results that they present to us, and not the methods that they use to deliver those results. In addition to displaying relevant results to us quickly, search engines are also concerned about things such as how they use their resources to provide results to us.

If you’ve looked at Google’s Pagespeed or Yahoo!’s YSlow, one of the recommendations that they will often make to large sites that have audiences around the globe is the use of content delivery networks (also called content distribution networks), which enable those sites to distribute content such as images, downloads, or streaming media from servers closer to the users asking for it.

Different Results at Different Data Centers

If you perform a search in Google from one location (maybe home), and then perform the same search later in the same day from a different location (maybe work), you may notice that some of the rankings of results have changed, that some pages that were listed in the first search might not be visible in the second search, and that some new results have been added. If you then return to the first location (after a day of work), and perform the same search, you may see the same results from that first search again.

While Google’s localization (video) and personalization functions might play a role in the different results that you see, another thing to consider is that you might be looking at results from different data centers.

People on SEO and webmaster forums have noted for years that you might see different results if they are being served from different data centers. You might receive results from different data centers because you’ve changed locations, or because one data center is busy, and your query has been routed over to another one. The “reasons” for the differences are often cited as one data center testing a different algorithm or having been updated in some manner while the other hasn’t. Another possibility that I haven’t really seen mentioned is that different data centers may optimize their search results to make searches better for people located closer to each data center. This patent filing describes how that might be done.

Google has a number of wholly owned as well as rented data centers around the globe, and they try to deliver services to searchers from data centers that are close to them, though in some cases you may receive results from a data center that is more distant from you when the closest data center is busy.


Classifying Queries at Different Data Centers

A patent application published by Google last year describes how the search engine might classify queries, so that it can store data responsive to those queries in places where it may be closest to those searchers who request the information.

The patent filing tells us that this system may use a hierarchical, tree-shaped architecture to accept queries and return results to searchers in a way that both “optimizes a usefulness/accuracy” of the results while also managing resources and costs associated with running the system. The results that are delivered to a searcher may come from more than one data source, but the idea behind the system is to make searches both more efficient and effective.

Under this system, a query might be sent to a “distributor node,” which would decide between a number of “producer nodes” to send that query to. Producer nodes are associated with indexes of data sources, which would search words or phrases within documents or metadata about those documents (including audio, video, and images).
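To make the distributor/producer relationship a little more concrete, here is a minimal Python sketch. It isn't taken from the patent filing: the class names ProducerNode and DistributorNode, the in-memory dictionary index, and the example domains are all assumptions made purely for illustration. This naive version also sends every query to every producer node, which is exactly the expense that the classification step described in the filing is meant to avoid.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProducerNode:
    """Hypothetical producer node: a name plus a tiny in-memory index."""
    name: str
    index: Dict[str, List[str]] = field(default_factory=dict)

    def search(self, query: str) -> List[str]:
        # Return documents indexed under any term of the query, in order found.
        results: List[str] = []
        for term in query.lower().split():
            for doc in self.index.get(term, []):
                if doc not in results:
                    results.append(doc)
        return results


@dataclass
class DistributorNode:
    """Hypothetical distributor node: forwards a query and compiles results."""
    producers: List[ProducerNode]

    def distribute(self, query: str) -> List[str]:
        # Naive strategy (an assumption): ask every producer, then merge
        # the results while dropping duplicates.
        compiled: List[str] = []
        for producer in self.producers:
            for doc in producer.search(query):
                if doc not in compiled:
                    compiled.append(doc)
        return compiled


# Example with made-up data: a small "local" index and a larger remote one.
local = ProducerNode("local", {"football": ["afl.example"]})
remote = ProducerNode(
    "remote", {"football": ["nfl.example"], "soccer": ["fifa.example"]}
)
distributor = DistributorNode([local, remote])
print(distributor.distribute("football"))
```

The interesting question, which the rest of the filing addresses, is how the distributor could avoid calling the second producer at all when the first one is likely to be enough.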

A flowchart from the patent filing showing a query sent to a distributor node, and then to a producer node to see how well it fits with a query classification, and then to one or more producer nodes to gather results to return to a searcher.

It might be more expensive to access some of those producer nodes than others, because a producer node may be physically distant from a distributor node, may have a limited ability to respond to queries because it’s busy, or may hold a database so large that search times become unreasonably long.

To reduce the costs associated with creating results for queries, these producer nodes may be set up to be more “local,” providing the more widely accessed and more frequently desired results for queries they receive from nearby searchers. They may also be set up to contain fewer possible total results, so that they are “relatively fast and easy to update, access, and search.”

Example

When someone in Australia searches for [Football], they may expect to see search results involving Australian Rules Football, and a data center or source near them (in Australia) may be set up so that it is faster and less costly to access information about Australian Rules Football than information about American Football, or the Football that most Americans refer to as Soccer. Europeans searching for information about [Football] expect a whole different set of results, and data centers set up in Europe may be optimized to respond to queries involving those results. Americans anticipate a completely different set of results on their search for [Football] and data centers in the US might be optimized for their queries.

The patent filing is:

Productive Distribution for Result Optimization Within a Hierarchical Architecture
Invented by John Kolen, Kacper Nowicki, Nadav Eiron, Viktor Przebinda, William Neveitt, and Cos Nicolaou
Assigned to Google
US Patent Application 20100318516
Published December 16, 2010
Filed: October 30, 2009

Abstract

A producer node may be included in a hierarchical, tree-shaped processing architecture, the architecture including at least one distributor node configured to distribute queries within the architecture, including distribution to the producer node and at least one other producer node within a predefined subset of producer nodes.

The distributor node may be further configured to receive results from the producer node and results from the at least one other producer node and to output compiled results therefrom.

The producer node may include a query pre-processor configured to process a query received from the distributor node to obtain a query representation using query features compatible with searching a producer index associated with the producer node to thereby obtain the results from the producer node, and a query classifier configured to input the query representation and output a prediction, based thereon, as to whether processing of the query by the at least one other producer node within the predefined subset of producer nodes will cause results of the at least one other producer node to be included within the compiled results.

Intelligently Deciding Which Queries are Routed Where

Attempting to make an educated guess regarding which queries should be routed to which producer node or nodes isn’t really an option. There are too many queries, and too much information involved. A system like this is only going to work well if it produces search results that provide the “best-available query results for the query.”

There may be times when a query can only be adequately answered by gathering information from more than one producer node, but the more likely it is that a query can be answered by a single producer node, the better – it reduces both computational expense and access latency, or time.

So, this system will attempt to predict how likely it is that relevant information may be found at a particular producer node, and may or may not take into account particular topics or subject matter that may be relevant to the query. Remember, this is a “prediction” made before the query is processed, so the more that can be done to predict whether or not the query may have to be sent to more than one producer node without actually finding all of the results, the better.

This prediction is based upon query pre-processing that might look at query features such as:

  • Length of the query (i.e., a number of characters),
  • Number of terms in the query,
  • Boolean structure of the query,
  • Synonyms of one or more terms of the query,
  • Words with similar semantic meaning to that of terms in the query,
  • Words with similar spelling (or misspelling) to terms in the query, and/or
  • A phrase analysis of the query.

The analysis of phrases may include:

  • The length of each phrase,
  • An analysis of which words are close to one another within the query, and/or
  • An analysis of how often two or more words which are close within the query tend to appear closely to one another in other settings (e.g., on the Internet at large).
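A rough Python sketch of what collecting a few of these features might look like follows. The function name extract_query_features, the dictionary keys, and the choice to treat uppercase AND, OR, and NOT as the Boolean structure and double-quoted text as phrases are my own assumptions for illustration; the patent filing doesn't spell out an implementation. Synonyms, similar spellings, and Web-wide co-occurrence counts would need outside data, so they are left out here.

```python
import re

# Assumed Boolean operators; the filing does not list specific ones.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def extract_query_features(query: str) -> dict:
    """Collect a few simple, illustrative features from a raw query string."""
    tokens = query.split()
    operators = [t for t in tokens if t in BOOLEAN_OPERATORS]
    # Strip quote marks so quoted phrases still contribute their words.
    words = [t.strip('"').lower() for t in tokens if t not in BOOLEAN_OPERATORS]
    words = [w for w in words if w]
    phrases = re.findall(r'"([^"]+)"', query)
    return {
        "char_length": len(query),                 # length in characters
        "term_count": len(words),                  # number of terms
        "boolean_operators": operators,            # crude Boolean structure
        "phrase_lengths": [len(p.split()) for p in phrases],
        # Words that sit next to each other once operators are removed.
        "adjacent_pairs": list(zip(words, words[1:])),
    }


features = extract_query_features('"aussie rules" football OR soccer')
print(features)
```

A set of values like this is the kind of query representation that could then be handed to a classifier.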

One of the inventors listed, Nadav Eiron, is also one of the inventors on a second generation phrase-based indexing patent filing from Google, Index Server Architecture Using Tiered and Shared Phrase Posting Lists. It’s possible that there might be a relationship between that type of indexing and the phrase analysis mentioned in this Result Optimization patent filing.

Including synonyms and possible misspellings in the prediction, and then in the pages returned from the index may result in a larger set of search results from the data source.

Looking at these types of features related to queries enables the search engine to classify the query to predict whether a specific producer node might contain sufficient search results to satisfy a searcher.

A flowchart from the patent filing showing part of the process of classifying a query based upon features related to that query.

Machine learning techniques might be used to build a model regarding the probability that each producer node contains a useful amount of results to respond to a query, as well as determining whether other producer nodes should be included.

Using a query model like this enables the system to make more intelligent decisions about whether or not there might be sufficient results from a particular data source. The features related to each query can also be used to retrieve information from the index in each producer node as well.

A table showing how query features might be collected for a query to determine whether or not a particular data source might be appropriate for a particular query.

In addition to looking at information about the query itself, the search engine might perform a cost/benefit analysis, looking at other factors such as whether or not the network is currently congested, and how costly it might be to access a particular producer node for a particular query.
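One simple way to picture that cost/benefit analysis is as a comparison between an expected gain and an adjusted cost. The Python sketch below is only an illustration under my own assumptions: the function should_forward, its parameters, and the formula itself (probability times benefit, compared against access cost scaled up by congestion) are hypothetical and are not described in the patent filing.

```python
def should_forward(
    p_productive: float,
    benefit: float,
    access_cost: float,
    congestion: float = 0.0,
) -> bool:
    """Decide whether to send a query on to another producer node.

    All inputs are hypothetical quantities in the same arbitrary units:
      p_productive: predicted probability the other node adds useful results
      benefit:      value of getting those extra results
      access_cost:  baseline cost (latency, computation) of asking that node
      congestion:   0.0 for an idle network, higher when it is busy
    """
    expected_gain = p_productive * benefit
    effective_cost = access_cost * (1.0 + congestion)
    return expected_gain > effective_cost


# The same prediction can lead to different decisions as congestion changes.
print(should_forward(0.8, benefit=10.0, access_cost=5.0))
print(should_forward(0.8, benefit=10.0, access_cost=5.0, congestion=1.0))
```

The point of the sketch is only that a confident prediction might still lose out when the network is busy or the producer node is expensive to reach.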

We’re told that the classification algorithm that might be used with query predictions for different producer nodes might be a decision tree algorithm:

Thus, a classification algorithm may be selected which attempts to maximize the sending of productive queries, while minimizing lost queries/results. Such examples may include, e.g., a decision tree algorithm in which query results are sorted based on query feature values, so that nodes of the decision tree represent a feature in a query result that is being classified, and branches of the tree represent a value that the node may assume.

Then, results may be classified by traversing the decision tree from the root node through the tree and sorting the nodes using their respective values. Decision trees may then be translated into a set of classification rules (which may ultimately form the classification model), e.g., by creating a rule for each path from the root node(s) to the corresponding leaf node(s).
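The idea of turning each root-to-leaf path into a classification rule can be shown with a toy tree. In this Python sketch the tree is a nested dictionary, and the feature names (has_phrase, short_query) and labels (local_only, send_to_other_node) are invented for the example; a real model would be learned from training data rather than written by hand.

```python
def classify(node: dict, features: dict) -> str:
    """Walk from the root to a leaf, following the branch for each feature value."""
    while "label" not in node:
        node = node["branches"][features[node["feature"]]]
    return node["label"]


def tree_to_rules(node: dict, conditions: tuple = ()) -> list:
    """Create one (conditions, label) rule for each root-to-leaf path."""
    if "label" in node:  # leaf node: the path so far is a complete rule
        return [(list(conditions), node["label"])]
    rules = []
    for value, child in node["branches"].items():
        rules.extend(
            tree_to_rules(child, conditions + ((node["feature"], value),))
        )
    return rules


# Hand-written toy tree (hypothetical features and labels).
toy_tree = {
    "feature": "has_phrase",
    "branches": {
        True: {"label": "send_to_other_node"},
        False: {
            "feature": "short_query",
            "branches": {
                True: {"label": "local_only"},
                False: {"label": "send_to_other_node"},
            },
        },
    },
}

for conditions, label in tree_to_rules(toy_tree):
    print(conditions, "->", label)
```

Each printed line corresponds to one path through the tree, which is the sense in which a decision tree can be "translated into a set of classification rules."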

Other possible classification algorithms might also be used, and a training dataset might be used to work with that classification system.

Actual results produced from a particular producer node may be compared to the predictions on a regular basis to make sure that the classification model is working well, or needs to be updated.
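A check like that could be as simple as counting how often the predictions and the actual outcomes disagree. In the Python sketch below, the function names, the record format, and the 5% threshold are all assumptions of mine; I'm borrowing the filing's idea of "lost" queries for cases where a node would have contributed results but the query wasn't predicted to be productive there.

```python
def evaluate_predictions(records) -> dict:
    """Summarize (predicted_productive, actually_productive) boolean pairs."""
    total = lost = wasted = 0
    for predicted, actual in records:
        total += 1
        if actual and not predicted:
            lost += 1      # useful results existed but the query was not sent
        elif predicted and not actual:
            wasted += 1    # the query was sent but added nothing useful
    if total == 0:
        return {"lost_rate": 0.0, "wasted_rate": 0.0}
    return {"lost_rate": lost / total, "wasted_rate": wasted / total}


def needs_retraining(stats: dict, max_lost_rate: float = 0.05) -> bool:
    """Flag the model for an update when too many queries are lost (assumed threshold)."""
    return stats["lost_rate"] > max_lost_rate


# Made-up log of four queries.
log = [(True, True), (False, True), (True, False), (False, False)]
stats = evaluate_predictions(log)
print(stats, needs_retraining(stats))
```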

A side note

When looking for more information about the inventors involved in this patent, I came across the resume of one of the inventors, and thought his section involving what he did at Google might be interesting for those looking for more information about Panda:

John Frederick Kolen

Employment

8/2008-5/2011 Senior Software Engineer (Web Search Infrastructure), Google, Inc., Mountain View, CA.

Developed machine learning approaches for optimizing quality of and resources used for web search. Experience with large-scale distributed systems. Developed metrics for web page similarity.

We know little about the people involved in the Panda upgrades, except that someone with the name of “Panda” provides some of the impetus behind the development of the system involved.

How similar web pages may be on a site or upon multiple sites appears to be an important issue with the Panda upgrades, as do machine learning approaches to classifying web pages and web sites based upon quality. It’s possible that one or more of the people who were involved with this results optimization patent may have also been involved with Panda.


37 thoughts on “How Google Might Classify Queries Differently at Different Data Centers”

  1. To be totally honest, Bill, I really didn’t entirely understand everything in this post, but I will say this…this setup sounds like many nodes are being used to almost cache search queries. Additionally, if different nodes are used to effect different results, then it seems that something like this may indeed be the source of the dreaded “Google Dance” that we are all familiar with.

    Interesting.

    Mark

  2. Wow Bill, glad to see such a technical look at this and not just a general guess (as some other places may have written).

    Certainly helps explain some of the randomness between searches on different machines / locations. Although part of it is just Google seeing which results get the best CTR.

  3. Well, it’s very difficult to understand the workings of different data centers, but your post has answered a lot of my questions and cleared up some of my understanding of how data centers work.

  4. Hi Ben,

    There may be a couple of reasons why you may see different numbers of results for most queries, including those using the site operator. It’s possible that different data centers may list different amounts. But, according to at least a couple of Google patents, usually the numbers of results you see reported for a query are estimates (at least on the first page of results) based upon a look at possibly around 2% to 10% of Google’s index.

    Again, that’s a matter of the search engine attempting to do the most efficient thing rather than the most accurate. It costs a search engine less money in terms of computation and speed to only look at and calculate a ranking order and return a limited number of results for any one query, rather than for all of them. So, the results numbers we see for queries aren’t completely reliable.

  5. Hi Mark,

    Thanks for your feedback. I’ve added some images from the patent, and expanded some of the post to be a little more explanatory, but the topic is in some ways a complex one.

    Caching of popular queries definitely happens, but this isn’t really about caching. It’s more about making search more efficient by optimizing data centers for queries that are more likely to come from searchers nearer where they may be located.

    The “Google Dance” was a near monthly update of data into Google’s index that happened until around 2003, when Google started adding content to its index in a much more timely manner. One thing that is interesting about this patent application though is that the classification of queries isn’t something that happens upon an ongoing basis, but is run every so often – it might be done daily or weekly or monthly, but it isn’t constantly updated the way that Google’s indexes are.

  6. That was a great look at the way Google processes search queries and how they are processed through their different data centers. I’m eager to learn more and look forward to reading more of this blog!

  7. Hi Mike,

    I love it when I run across something in a whitepaper or patent from one of the search engines that helps explain and detail some of the things that we may have learned about search engines from our own experiences or those of others, but don’t have much detail beyond those observations and conclusions that we may make about them.

    It’s definitely more than just click throughs – it seems like there’s a role in this classification approach for drawing associations between different terms searched for in query sessions, and how words might be associated through things like phrase-based indexing, and more.

  8. Hi Jamie,

    You’re welcome.

    I did work in one location that was definitely tied to one data center, and live in another location with a different data center, and it was a fun thing to see what kinds of differences would happen when I performed the same searches from the different locations.

    Not only would I see different results, but I remember even seeing a different Toolbar Pagerank score for one set of pages based upon those locations. I’m not quite sure why on that particular point. I suspected a glitch, but have wondered if there was something more to it.

  9. Hi Vishal

    Thank you. I have a whole bunch of new questions now after reading the patent filing, and writing this post. One of them is whether or not different data centers have different supplemental indexes with different contents in each (it sounds like it). Another is whether or not the caches that hold results from popular queries tend to be different from one data center to another. What roles do language and country preferences play in the data that might be located at different data centers?

  10. Hi Jonathan,

    Thanks. I tend to look at this blog partially as my notebook for things that I learn about search and search engines, so hopefully we can learn together.

  11. So can we interpret that there might be a decision tree in terms of the type or classification of that query? Might that help in determining when to present a Onebox result and what Onebox it should present?

    From a high level, it seems clear that Google understands the nature of queries – from Local to Product to Health to Music. But as you mention, understanding how Google came to those conclusions is rather interesting.

    Panda seems to be part of a decision tree, acting as a filter before it enters the rest of the algorithm proper, essentially divvying up sites into quality buckets. It also explains why Panda produces impact at the site level and not page level.

    The ‘web page similarity’ bit is very interesting. I’ve seen some anecdotal evidence that makes me believe that Google may frown on sites that thwart it from applying keyword clustering. This seems to occur when a site consistently targets slightly different modifiers of a root term. So, a page for both ‘top rated cameras’ and ‘best cameras’ might be too similar.

    True or not, it’s fun stuff to think about.

  12. Hi AJ,

    It is quite possible that a decision tree would be used to identify features associated with queries to use to classify those queries.

    I’m not sure if that would be a driving force behind determining whether or not a OneBox result should be shown. Some of the stuff I’ve read about OneBox results seems to indicate that a situational relevance score might be the driving force behind including one of those (such as an intent to take a flight, see the score or schedule of a sporting event, find local weather, etc.). Another Google patent filing indicated that clicks and mouseovers might be used to determine whether the search engine might continue to show OneBox results as well.

    Panda does seem to be identifying features associated with specific pages and sites, and scoring those based upon a quality score associated with those features. It’s definitely possible that Google is using a decision tree process, though it could potentially use other classification methods as well. It was interesting to come across this “results optimization” patent, and see something about how decision trees could be used to classify queries on different data sources.

    The questions that Amit Singhal posed about Panda did come at web page similarity from at least a couple of different directions, asking about both the originality of content, and whether or not a site contains “duplicate, overlapping, or redundant articles on the same or similar topics with slightly different keyword variations.” I’d like to see the metrics for web page similarity that John Kolen produced while at Google as well.

    Definitely fun to think about.

  13. Bill,

    I hadn’t read that the ongoing presentation of Onebox might be influenced by user behavior. Thanks for that information.

    One of the theories around Panda (or maybe it’s just my theory) is that Panda samples a representative amount of pages from a site and scores them on quality to produce a site-level score for use as a type of filter on results. If too many pages score poorly, Google assumes the general quality of the site is poor and won’t apply as many algorithmic resources against that site.

    That might explain why Panda is a site and not page level attribute and would make the advice to remove shallow content from your site corpus make more sense.

    The question that keeps coming up for me is: “Why is Panda applied at the site level and not the page level?”

    Is it purposeful to encourage better content or a necessity because of computational constraints?

  14. Hi AJ,

    I’ve written a few articles about the OneBox in the past. On Search Engine Land back in 2007, I wrote about a patent that introduced it, and described how user activity on Google’s vertical search engines might influence whether or not a OneBox result might be shown in Google’s Web results:

    Google’s OneBox Patent Application

    Of course, One box results for weather, for flights, for sports scores, don’t really rely upon user behavior to be selected as a onebox result, but might be triggered other ways. For example, a [define:xxxxxx] or a [what is xxxxxx] might trigger a definition OneBox result.

    I wrote about these types of OneBox results in response to patterned queries in the post:

    Google Subscribed Links Patent: Why Do Some OneBox Results Require No Subscription?

    Google did publish a patent filing which describes how they might use mouseovers to determine how much interest and attention searchers might have in both search results and OneBox results:

    Where you Point Your Mouse May Influence Google Search Rankings, Advertisement Placement, and Oneboxes

    Not sure that Google has been looking closely at mouseovers, but it’s quite likely that user behavior influences whether or not a OneBox result hangs around, at least for those that aren’t triggered by patterned queries.

    It’s possible that the quality scores from Panda may be implemented on a page-by-page basis or a sitewide basis, or potentially even across more than one domain. When I wrote about the Google Patent, Deriving and using document and site quality signals from search query streams in Google’s Quality Score Patent: The Birth of Panda?, one of the things that I noted about the patent is that it might be applied per page or per site, or on an even wider basis, so that some types of sites might gain or lose quality points based upon such things as whether they are in a certain language, hosted in a particular country, or even focusing upon a specific topic, or a set of sites under common ownership.

    The process described in that patent appears to be more focused upon associating a specific site or sites with a certain query, based upon a quality score for those sites based upon a few different features.

    It’s possible that the Panda quality scores may also be applied very narrowly, or upon a much wider scope. Those scores may be based upon a wide range of features derived from machine understandable factors responding to the questions that Amit Singhal published about site quality relating to Panda.

    I’m not sure that Google is just taking a representative sampling of pages from sites to come up with a sitewide score, but there has been a lot of emphasis from Googlers writing about the update that suggest blocking or removing or improving lower quality pages on a site, and also suggesting that bad pages could drag down good ones.

  15. Great article as always Bill.

    The decision tree development is interesting and it’s refreshing to see the google view rather than an implication focused article.

    On a slightly different note though – the example regarding a search on “football” certainly covers the European, Australian and American perspectives, but couldn’t this issue be resolved with a single IP lookup on the user’s location and a redirect to the corresponding google page – .co.uk in the UK, .com.au in Oz and .com in the US?

    If I do a search for my site in the UK AWE using the term AWE:

    The first thing I see is Atomic Weapons Establishment in the UK.
    In the US we see AWE Tuning.
    In Australia we see the Australian Oil and
    Gas Exploration and production company.

    In this example if the geography is the deciding factor in displayed indexed results then by calibrating a machine learning code against the geography I suppose it could be a way of streamlining the data centre resource.

    You’ve definitely given me something to think about Bill!

  16. Hi Bill,

    Thank you for such a technical and insightful post. I appreciate your effort. I normally tend to use the Yoast Unpersonalized Search plugin to check the results. Do you think there would be a possibility of a data center impact on this as well?

  17. I like the examples of Football, Here in Europe you wouldn’t expect to see many results on Aussie rules or American Football so Google must be reading the location of the user and sending the information as per Locale

  18. Hi Bill,

    Fascinating. I think it’s entirely rational for Google to load balance between data centres. Do you think that pre-fetching via instant pages in Chrome adds to this load?

    Cheers,

  19. Hi Tom,

    Thank you. The decision tree approach may show up in some other places in Google patents, like the features that Google might use to weigh links in their Reasonable Surfer approach to PageRank. My post on that is:

    Google’s Reasonable Surfer: How the Value of a Link May Differ Based upon Link and Document Features and User Data

    I think Google wants people to be able to make a decision about whether they use Google.com in their searches or one of the country based Google domains, and isn’t in a hurry to force a redirect upon them. Also, IP addresses aren’t necessarily always the best indication of where people are actually located. For example, people accessing the Web through a service like AOL would go through a proxy server located in Virginia, even if they weren’t anywhere near the state.

    For a term like your initialism [AWE], chances are that there are a variety of results that are responsive to it, and they might be coming from more than one data source. There may not be an overwhelming single result for that query, and Google may also be attempting to make sure that they provide a diversity of results in the top ten or top 50 or top 100 that are responsive to it. Since the term isn’t very specific, and it may be difficult to associate an intent with the term, or imply a geographical intent like you might be able to do with something like [Pizza] or [plumber], it’s not a bad idea to build a classification model that might be based upon actual data found on the Web, and possibly even usage data, to decide which data center to access.

    If you’re in the UK, and you perform a search for the Richard M. Nixon Presidential Library and Museum in California, it might not be a bad guess that a data center near the West Coast of the US may not be a bad choice as a source. But that’s much more tightly tied to a specific geographic location. With more ambiguous queries, a system like the one described in this patent might start looking at data centers nearer you first.

  20. Hi Stefan,

    You’re welcome. I haven’t looked at Yoast’s plugin, so I’m not sure how it works. If it functions by adding the particular parameter that removes personalization from results, that may not impact how Google might choose which data center that you may receive results from.

  21. Hi Philip,

    I’ve been attributing a preferred country bias in the past to the different results that people see when they search for [football] in different parts of the World, but hadn’t anticipated that the choice of data center might also play a role. At least until now.

    See my post: Changing Google Rankings in Different Countries for Different Searchers.

    It’s possible that a preferred country biasing could work with this approach towards predicting which data source to show information from.

  22. Hi Chris,

    It’s quite possible that prefetching does add something, which would mean that an approach like this one, that would help identify the best data source, and maybe optimise results so that a nearby data source is the best option for the majority of queries for people it services, would help in reducing the bandwidth required somewhat.

  23. Bill – a very interesting article.

    Do you know how many data centres Google owns/rents? Do you think that there is at least one data centre for each country specific Google version?

  24. Hi Blaine,

    Thank you. Google does share some information about the data centers that they own, but they seem to be pretty quiet about the ones that they don’t.

    This 2008 article points out some locations that Google doesn’t list on their pages:

    Where Are All The Google Data Centers?

    They tell us:

    There are 36 data centers in all—19 in the U.S., 12 in Europe, 3 in Asia, and one each in Russia and South America. Future data center sites may include Taiwan, Malaysia, Lithuania, and Blythewood, South Carolina, where Google has reportedly bought 466 acres of land.

    That was 3 years ago, and I expect Google has probably spread out even more, including a data center in Finland that will be cooled completely by ocean water.

    Google’s Language Tools page lists 184 ccTLDs that Google is available at, so they might not have one data center for each of the country specific local domains.

  25. Wow so much to learn for today and I am already overflowing with information. Thank you Bill for this information. There are so many comments as always.

  26. I had never thought of this matter in such depth. The post was very technical, unlike others I’ve read that’s more or less a guess. This explained and cleared quite a few things for me in regard to how data centers may or may not affect search results.

  27. Hi George,

    I found the approach described in this patent pretty fascinating as well. I can’t say that I had seen something from any of the search engines before that described in such great detail how they might decide to send someone to one data center or another in response to a query, though I had read a good number of blog and forum posts that described generally how things like load balancing were likely involved. The patent went beyond load balancing to describe other reasons, including efficient uses of resources, for deciding whether they might send a searcher to one data center or another.

  28. Pingback: searchfanatics.org
  29. Bill,

    I understand all of this. I am in a fix though. I have a site which ranks superbly well in one part of the country (the North) while in the South, it is on the 2nd, 3rd and even the 8th page. Some competitors (and i know that they are doing a pretty good job at their SEO) are uniformly placed everywhere.

    Let’s put it this way

    The 3 competitors are invariably ranked 1,2 and 3 for most keywords in the north and my site is ranked 4th. In the south, I am nowhere.

    Any light on how to resolve this would be immense.

    Thanks!
