Google’s Human Evaluators, Local Search, and Web Search Algorithm Testing

Google employs human evaluators to judge the relevance of web pages in search results, but according to Google’s Matt Cutts, usually only when engineers from the search engine are testing a new algorithm and want to compare its results with those of the ranking algorithms it might replace. (We’ve also seen that Google likely uses human evaluators to uncover web spam as well.) Matt Cutts answered a question on how Google uses human evaluators in a video filmed last month.

Google was granted a patent today, originally filed in July of 2005, that describes how human evaluators might be used to test algorithms, as well as in actual live ranking systems for local search and for web search. Those evaluations of search results pages for specific queries could be used in a statistical model that might influence search results. Google may only be using human evaluators for purposes of testing search results (and finding web spam), but it’s interesting to see both the testing and ranking approaches described within a patent from Google.

One of the things that I enjoyed about the patent was how it describes some of the possible ranking signals that might be used in ranking Google Maps type results, and how important the relevance of a business name to a query might be in the process of ranking that result. A couple of months ago, Google was granted another patent that looked at business names as local ranking signals.

This newly granted patent is:

Prediction of human ratings or rankings of information retrieval quality
Invented by Michael Dennis Riley and Corinna Cortes
Assigned to Google
US Patent 8,195,654
Granted June 5, 2012
Filed: July 13, 2005

Abstract

A statistical model may be created that relates human ratings of documents to objective signals generated from the documents, search queries, and/or other information (e.g., query logs). The model can then be used to predict human ratings/rankings for new documents/search query pairs. These predicted ratings can be used to, for example, refine rankings from a search engine or assist in evaluating or monitoring the efficacy of a search engine system.

We’re given some examples of how a human evaluator might work. In the first, someone might be shown a page corresponding to the home page of the store “Home Depot,” and may be asked to rate how relevant it is to the search query “home improvement store.” They might provide a rating between a certain range of numbers such as 1-5. Multiple evaluators might be asked the same question, and those might be collected, along with ratings for other pairs of queries and pages.
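To make the collection step a little more concrete, here is a minimal sketch of how ratings from several evaluators might be gathered per query and page pair. The patent doesn't say how multiple ratings are combined; averaging them is my assumption, and the URLs and ratings below are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_ratings(judgments):
    """Group (query, url, rating) judgments by pair and average them.

    Averaging is an assumption; the patent only says ratings are collected.
    """
    grouped = defaultdict(list)
    for query, url, rating in judgments:
        grouped[(query, url)].append(rating)
    return {pair: mean(ratings) for pair, ratings in grouped.items()}


# Hypothetical judgments on a 1-5 scale from different evaluators.
judgments = [
    ("home improvement store", "homedepot.example", 5),
    ("home improvement store", "homedepot.example", 4),
    ("home improvement store", "lawoffice.example", 1),
]

for pair, rating in average_ratings(judgments).items():
    print(pair, rating)
```

Each averaged rating could then serve as the target value that a statistical model tries to predict from signals like the ones described later in this post.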

The results of those ratings might be used to automatically generate a number of statistical signals.

In another example, we’re told that someone might be shown a query of “maternity clothes,” along with four results, each associated with a web page. Two of those might be for stores that actually sell maternity clothes, and they might be given the highest rating (5). The next might be a business that “caters to pregnant women, but does not specialize in selling maternity clothes.” It might be given a lower score (1). A fourth page might be that of a law office. Since it’s likely not going to be relevant to a searcher looking for maternity clothes, it might be given a relevance rating of 0.

In a slightly different approach, the human evaluators might be given a specific query term, and a list of pages. Instead of giving those pages scores, they might be asked to list them in order from most relevant to least relevant.

Examples of Possible Local Search Signals from Evaluators’ Scores and Rankings

In a local search arena, the following are some examples of rules that might be generated from evaluators:

Number of Words Matching a Business Name

The number of words in a search query matching the business name from a search result might be scored from zero to one. A one would indicate that all the words in the search query match the business name. A zero would indicate that none of the words in the search query match the business name. If the query is two words long, and one of the words matches, that pair would be given a value of 0.5.
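Here is a small sketch of how that fraction might be computed. The function name and the simple lowercase, whitespace-based word splitting are my assumptions, not something spelled out in the patent, and “Tony's Pizza” is a made-up business.

```python
def name_match_fraction(query: str, business_name: str) -> float:
    """Fraction of query words that also appear in the business name."""
    query_words = query.lower().split()
    name_words = set(business_name.lower().split())
    if not query_words:
        return 0.0
    matched = sum(1 for word in query_words if word in name_words)
    return matched / len(query_words)


print(name_match_fraction("home depot", "Home Depot"))        # all words match
print(name_match_fraction("pizza delivery", "Tony's Pizza"))  # one of two words
print(name_match_fraction("law office", "Home Depot"))        # no words match
```

A real system would likely need more careful tokenizing (punctuation, plurals, abbreviations), which this sketch ignores.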

Number of Words Matching a Relevant Category

Instead of looking at the business name, the category and/or subcategories that a business might be listed in might be paired up with a query to see how many words match. A search for [pizza restaurant] might find a partial match with the category “Italian restaurant,” and be given a 0.5 score, for example.
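A business might be listed under more than one category or subcategory, so a sketch of this signal needs some way to handle several of them. Taking the best score across all listed categories is my assumption here; the patent example only shows a single category.

```python
def category_match_fraction(query: str, categories: list) -> float:
    """Best fraction of query words found in any one category name.

    Using the maximum over categories is an assumption for illustration.
    """
    query_words = query.lower().split()
    if not query_words or not categories:
        return 0.0
    best = 0.0
    for category in categories:
        category_words = set(category.lower().split())
        matched = sum(1 for word in query_words if word in category_words)
        best = max(best, matched / len(query_words))
    return best


# The example from the patent: a partial match with "Italian restaurant".
print(category_match_fraction("pizza restaurant", ["Italian restaurant"]))
```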

Prefixes, Suffixes, and Substrings

I’m bundling three possible signals here. A prefix might be the first word in a business name. A suffix might be the last word. A substring might be any word within a business name.

Imagine that someone searches for [lowe's], and one of the choices is “lowe’s home improvement.” The match of the query term with the prefix portion of the business name is a positive match, with a score of one. It’s also a substring of the business name, so it also might have a value of one. Since it’s not a suffix, it gets a value of zero. If the query was [home improvement] instead, it would have a prefix value of zero, a suffix value of one, and a substring value of one.
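The three signals can be sketched together by comparing the query, word by word, against the start, the end, and any run of words inside the business name. Treating the query as a sequence of whole words (so a multi-word query like [home improvement] can count as a suffix) is my reading of the example, not a definition taken from the patent.

```python
def affix_signals(query: str, business_name: str) -> dict:
    """Return 0/1 prefix, suffix, and substring signals at the word level."""
    query_words = query.lower().split()
    name_words = business_name.lower().split()
    k = len(query_words)
    if k == 0 or k > len(name_words):
        return {"prefix": 0, "suffix": 0, "substring": 0}
    prefix = int(name_words[:k] == query_words)
    suffix = int(name_words[-k:] == query_words)
    # Substring: the query words appear as a contiguous run anywhere in the name.
    substring = int(any(
        name_words[i:i + k] == query_words
        for i in range(len(name_words) - k + 1)
    ))
    return {"prefix": prefix, "suffix": suffix, "substring": substring}


print(affix_signals("lowe's", "lowe's home improvement"))
print(affix_signals("home improvement", "lowe's home improvement"))
```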

Exactly Matching a Business Name

If someone searches for [home depot] and the business name is “home depot,” the score for an exact match would be a one. But if the query was [home depot garden] and the business name is “Home Depot,” the score for an exact business name match might be a zero.
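As a quick sketch, an exact match signal could be as simple as comparing the two strings after normalizing them. Ignoring case and extra whitespace is my assumption; the patent example doesn't say how strict the comparison is.

```python
def exact_name_match(query: str, business_name: str) -> int:
    """Return 1 if the query equals the business name, ignoring case and spacing."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return 1 if normalize(query) == normalize(business_name) else 0


print(exact_name_match("home depot", "Home Depot"))
print(exact_name_match("home depot garden", "Home Depot"))
```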

Best Match of Query to Business Name rather than Category Name

When the best match of a query is to a business name, it might be given a relevance score of one. If instead it’s a better match for the category name, it might be given a relevance score of zero.
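One way to picture this signal is to score the query against both the business name and the category, and then compare the two. Using word overlap as the measure of “best match,” and scoring a tie as zero, are both assumptions on my part. The “hardware store” category below is made up.

```python
def overlap(query: str, text: str) -> float:
    """Fraction of query words that appear in the given text."""
    query_words = query.lower().split()
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return sum(word in text_words for word in query_words) / len(query_words)


def best_match_is_name(query: str, business_name: str, category: str) -> int:
    """Return 1 when the name matches the query better than the category does.

    Ties are scored 0 here, which is an arbitrary choice for illustration.
    """
    return 1 if overlap(query, business_name) > overlap(query, category) else 0


print(best_match_is_name("home depot", "Home Depot", "home improvement store"))
print(best_match_is_name("hardware store", "Home Depot", "hardware store"))
```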

Dynamic Signals

In addition to a statistical model of signals like those above, created from human evaluation scores on a test set of local search results and queries, the search engine might also look at a set of “dynamic” signals. These signals might be taken from query log files from prior local search sessions. For example, we’re told that quickly clicking on “a business listing or clicking on a phone number link, directions link, or other link associated with the business listing may indicate that the business listing is ‘good.’”

We are also told that when there’s a decision as to which signals to include in the statistical model, this system might err on the side of over-including many signals rather than under-including signals. Signals that might not be statistically relevant may end up being de-emphasized by the model.
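To give a feel for what a statistical model relating signals to human ratings could look like, here is a minimal sketch that fits a plain linear model with gradient descent. The patent doesn't commit to this particular kind of model, so the linear form, the three signals chosen, and the training data are all assumptions made for illustration.

```python
def fit_linear_model(rows, ratings, learning_rate=0.1, epochs=5000):
    """Fit weights and a bias by batch gradient descent on squared error."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, ratings):
            error = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Predicted human rating for one vector of signals."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


# Hypothetical training data.
# Signals per row: [name match fraction, category match fraction, exact match]
rows = [
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0],
]
ratings = [5, 5, 3, 3, 2, 0]  # averaged evaluator ratings (made up)

weights, bias = fit_linear_model(rows, ratings)
print("weights:", weights, "bias:", bias)
print("exact name match:", predict(weights, bias, [1.0, 0.0, 1.0]))
print("no match at all:", predict(weights, bias, [0.0, 0.0, 0.0]))
```

In a model like this, a signal that doesn't help explain the ratings would tend to end up with a weight near zero, which is one way of picturing how over-included signals might get de-emphasized.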

Web Search Results

Like the local search signals that might be generated as relevance signals, it’s possible that other signals might be identified from human evaluators. One might be whether or not the query term appears within the URL of the page. While that might not be one of the actual signals that might be part of a statistical model created from human evaluator rankings, it is a possibility under the approach in this patent.

In addition to the human evaluator-based statistical model, web search results would also likely be ranked based upon “query dependent signals,” like an information retrieval relevance score, and “query independent signals,” which don’t change based upon what the query might be, such as PageRank.

Again, another aspect of a final ranking score may also involve “dynamic features,” such as clicks and time variations between clicks. The patent tells us:

Signal 925 may define how long it takes (i.e., the duration between when a user first views the result document set and selects the document) an average user to select the document when it is returned to the user at a particular location in a list of search results or how long a user spends viewing the document based on the sequence of user click times. Signal 926 may define the fraction of users that first selected another document before selecting this document.
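Both of those signals could be computed from logged click sequences. Here is a rough sketch, assuming each session is recorded as an ordered list of (document, seconds since the results were shown) pairs. That log format, the document labels, and the choice to divide by the number of sessions that clicked the document are my assumptions.

```python
from statistics import mean


def click_signals(sessions, doc_id):
    """Return (average seconds to first select doc_id,
    fraction of selecting sessions that clicked another document first)."""
    times = []
    clicked_other_first = 0
    for clicks in sessions:
        for position, (clicked_doc, seconds) in enumerate(clicks):
            if clicked_doc == doc_id:
                times.append(seconds)
                if position > 0:  # at least one other document was clicked earlier
                    clicked_other_first += 1
                break  # only the first selection of doc_id counts
    if not times:
        return None, None
    return mean(times), clicked_other_first / len(times)


# Hypothetical click logs for one query.
sessions = [
    [("A", 4.0)],
    [("B", 3.0), ("A", 20.0)],
    [("B", 5.0)],
    [("A", 6.0), ("B", 15.0)],
]

average_time, fraction_other_first = click_signals(sessions, "A")
print(average_time, fraction_other_first)
```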

The patent does tell us that, instead of serving as a ranking signal for web search results shown to searchers, the statistical model might be used in the manner described by Matt Cutts in the video I started this post with, as a way to compare the results from one algorithm with those from a newer one.

Takeaways

We know, based upon the video from Matt Cutts and this patent, that Google likely uses human evaluators to help them test the relevance of search results. We don’t know if the kind of statistical model described in this patent is part of that process.

The signals that are described within the patent seem reasonable, but somewhat simple. They are examples for purposes of helping describe how the process described within this patent might work, but the patent does tell us that other signals might be created using this approach as well.

I’m not sure how well an approach like this would work given how many documents there are on the Web, and how many queries are performed on a regular basis on both web searches and local searches. Actual query log information and user behavior data such as clicks seem to potentially provide a much wider and more scalable range of information that could be useful in making relevancy determinations.

A Google Lat Long post from December of 2010, How Local Search Ranking Works, tells us that key elements involving how local search results are ranked include location prominence, distance from some geographic centerpoint, and relevance.

We don’t know if that “relevance” is determined in a method like that described within the patent, or by using an approach like I wrote about recently involving business names, or in another manner.


41 thoughts on “Google’s Human Evaluators, Local Search, and Web Search Algorithm Testing”

  1. I have not commented for a while but wanted to say Thank You Bill for always explaining these things in a way I can understand and learn from. Your dedication is very much appreciated.

  2. Hi Michael,

    It is good to see you. Thank you for stopping by and commenting. SEO by the Sea has been my online workbook, and I learn best by trying to explain some of the things I run across in a way that others might be able to understand them. It’s a gift to be able to share. :)

  3. Hi Frank,

    Thanks for sharing the link to your article. I do think it helps to have a variety of ways to measure the quality of your results, and I can see the value in using human evaluators as one of a number of approaches in trying to do that. Great to have some insight from someone who has experienced it first hand.

  4. Great Google! At least you’ll get rid of those unnecessary spammers out there. I see websites being de-indexed after Panda, and I don’t feel bad for them. The search engine rules are there for a good reason; if you can’t follow them, then that’s too bad for you.

  5. Good report, Bill. Google should employ more human evaluators to check websites which are ranked within the top 5 for popular niche-specific keywords like website design, SEO, etc.
    Testing algorithms is OK, but they should make use of human evaluators for checking top search results also, because there are many spammy websites within Google’s top 5 results which are not helpful or useful for normal people.
    Also I think they should be more active on spam report and reconsideration requests

  6. It is great that Google is trying to get rid of spammers. But it seems a bit unfair to devalue thousands of websites without any warning. Some businesses don’t have any work because of the devaluation of their sites. Not all of them are just spammers or unfair players.

  7. Bill awesome, awesome article. The best and I mean, the best thing I like about you is, the fact you really understand the technology behind search, and this article proves it. I do not know of anyone else who understands it like you, well except maybe Danny S.

    You are a true Gem my friend!

  8. Thanks a lot for this wonderful post. It is a great way for Google to get rid of spammers. Let’s hope this will make our life better.

  9. Has Google become human!?
    Thanks for making this so clear as usual.
    Well done for finding the video, I haven’t seen it anywhere else yet.

  10. We share the view on this. Wanted to drop by and give you my takeaway from SMX in Seattle as well. It’s a quote you might like :)

    “A lot of references to Bill Slawski but that’s ok because he’s a genius”

  11. I think as we progress and SE’s are jumping around signal to signal, changing relevancy, they’ll realize that human moderation is the easiest method to ensure integrity within ranking results. The landscape changes quicker than the test periods for algo updates, let’s be real!

  12. Hey Bill, if I were an online business owner of a company selling physical goods, then I’d be real happy with the local results relevancy.

    First off, because that company that relates to pregnant women may actually have better stuff than those maternity clothing stores.

    Secondly, because the Venice update is far from accurate: if Google can’t find any store for your specific area then it should display stores within a certain radius of your location… call it radius relevancy. I don’t see that happening any time soon, but they should make it happen.

    Good example: I’m NOT from Amsterdam (1 hour away) but live in Holland and yet Google shows me Amsterdam stuff while the nearest city is 5 minutes away. Like I said: we need radius relevancy haha.

    Regards,

    Dennis Miedema

  13. Although Google is getting smarter everyday, nothing has been developed yet that can replace Human Editing. Glad to see that they are testing some of their results with humans.

  14. Matt Cutts said recently that all emails to webmasters regarding unnatural links were sent after a manual review of the concerned websites. Since there seem to have been around 700K emails of this kind in a relatively short time, that’s a lot of work :-)

  15. Bianca, I totally agree. Though Google is incredibly smart and important, they can still never completely replace human editing. There are just things that humans know more than Google bots.

    Great video.

  16. The human raters are just a control group test for the actual algorithm changes. I don’t think their data / opinions of which site is better is factored into the algorithm itself. It is a litmus test, essentially

  17. Very nice article. It’s nice to see a bit of insight as to how Google works. Human evaluators can only be good for everyone, as the number of spammy trash sites that offer nothing in terms of quality or content is getting out of hand. Thanks, Bill.

  18. Hey Bill, great blog you have here. There is just so much “average” information out there, so when I come across something really interesting then I bookmark it immediately.

    Everyone is always pretending they know how Google’s algorithms work, but they are all guessing. However, keeping informed about their patents gives a good way to preempt changes to the algorithms, or even align yourselves with new products or services that are in the pipeline.

  19. Hi James,

    While Google likely uses some human evaluators to help identify spam, I would suspect that the approach described in this patent is aimed more at trying to improve the quality of search results.

    It may have the impact of having some sites not rank as well for certain queries when they aren’t very good matches, but its aim isn’t directly in identifying and removing spam from search results.

  20. Hi Kushal,

    Given the amount of actual queries that get performed on a daily basis, including a good percentage that are likely new everyday, I’m not sure that Google could use human evaluators to check all those pages.

  21. Hi Dirk,

    I really just have to believe that many of the people who were actively engaged in buying links and participating in blog link networks knew the risk that they were taking, and understood that those links could disappear overnight, and their sites could be penalized.

    I do feel sorry for site owners who didn’t have any knowledge and were harmed as well. For anyone who launches a site on the Web though, you can’t just rely upon one source of traffic, like Google, though.

  22. Hi sajan,

    It is more likely that the approach described that involves human evaluators is aimed at improving search quality rather than identifying spam. It may have the result of moving results that are spammy lower in search results, though.

  23. Hi Geoff,

    Not human, but humans may play a little larger role in what Google does than we might have thought in the past.

    I remembered the video as I started writing this post. It did get some attention when it first came out, but I don’t know how much attention it did get.

  24. Hi Brent,

    The search engineers do seem to place a lot of value in human judgments. I know there are automated methods to try to judge the quality of search results as well, and people at the search engines can look at what people are clicking upon, and not clicking upon as well. But the human evaluator still seems to have a role in deciding upon the quality of results.

  25. Hi Dennis,

    Google does have something they call location sensitivity, where they will broaden the scope of a map that they might show results from. So a search in an urban area for Pizza might only show you a few blocks around you, while the same search conducted in a rural area might have a scope of miles. Likewise, an urban search for antique car dealers might show a map radius of miles as well. It is a kind of radius relevancy.

    It does sound like Google has some issues though with showing relevant Google Maps results in Holland.

    As for giving a higher relevance score for stores that are completely about maternity clothes, as opposed to one that doesn’t specialize in just clothes, I’m not sure that I have a problem with that decision.

  26. Hi Eliseo,

    I suspect that a lot of those notices were ones that came with a lot of supporting documentation from analysis programs, and likely used templating systems that made it easy for a lot of those emails to go out quickly.

  27. Hi Vince

    That is possible. Matt does tell us that human raters don’t directly impact the rankings of specific pages. He tells us that human raters are used for testing new algorithms.

    But nothing Matt said tells us whether or not those ratings might not be used as part of a statistical model that could be used as part of a relevance signal for other web sites, as described in this patent. Not a direct rating, but rather an indirect one.

  28. Hi Antony,

    Thank you. If I’m going to make guesses, I want the most reliable information that I can find to make them with. I like patents and papers from the search engines themselves as sources of that kind of information. :)

  29. Very useful and informative article. Especially interested in how google are using expert raters/evaluators to improve on the algorithm. I wonder whether they are studying the specific ways in which the evaluators are making their decisions and the particular instances which the algorithm fails.

  30. Very interesting… Well, these humans are not doing a good job! After the last update a few weeks ago I was seeing some really strange results. I was finding craigslist ads ranking higher than high quality aged pages. It was going on for about a week and then it seems like they figured it out and reverted the changes. I am just not sure they have the ability to really weed out low quality sites. The future of SEO will be interesting!
