How Human Evaluators Might Help Decide Upon Rankings for Search Results at Google

A Google patent granted last week describes how the search engine might enable people to experiment with changing the weight and value of different ranking signals for web pages to gauge how those changes might influence the quality of search results for specific queries. The patent lists Misha Zatsman, Paul G. Haahr, Matthew D. Cutts, and Yonghui Wu amongst its inventors, and doesn’t provide much in the way of context as to how this evaluation system might be used. As it’s written, it seems like something the search engine could potentially make available to the public at large, but I’m not sure if they would do that.

In the blog post Google Raters – Who Are They?, Potpiegirl writes about the manual reviewers used by Google to evaluate the relevance and quality of search results by parsing through a forum where people have been discussing their experiences as reviewers for Google search results and collecting information about how the review program works. It contains some interesting information about the processes used by people who have been working to provide human evaluations for Google’s results, including a discussion of two different types of reviews that they participate in. One of those involves being given a particular keyword and a URL, and deciding how relevant that page is for that keyword. The other involves being given two different sets of search results for the same query, and deciding which set of results provides the best results for the query term.

A screenshot from a Google patent describing a framework for evaluating search results generated using different scoring weights.

In the YouMoz blog post Introducing SERP Turkey: A Free Tool to Split-Test and Gather CTR Analytics of SERP Entries, Tom Anthony describes a process someone can use to measure changes in click-through rates after taking a set of search results from Google and making changes to it. It’s an interesting idea, and may be worth experimenting with.

The patent examiner involved in prosecuting the patent included a link to a Google Custom Search Engine page which describes how users of Google Custom Search Engines can tweak the results returned by those in Changing the Ranking of Your Search Results. The process involved doesn’t allow the custom search builders to give different weights to different types of signals, such as more weight to words found within page titles, and less to words found in links to pages, but it does enable the application of different weights to labels used for some pages.

The patent, Framework for evaluating web search scoring functions (US Patent 8,060,497), was filed on July 23, 2009 and was granted on November 15, 2011. In its abstract, we are told that it covers:

Methods, systems, and apparatus, including computer program products, for testing web search scoring functions.

A query is received. A first and a second scoring function are selected by receiving search results responsive to the query; applying candidate scoring functions to the search results to determine scores for the search results for each candidate scoring function; identifying pairs of the candidate scoring functions, and calculating a diversity score for each of the pairs. A pair of candidate scoring functions is chosen from the one or more pairs of candidate scoring functions based on the diversity scores, and the alpha function is selected as the first scoring function and the beta function is selected as the second scoring function.

The plurality of search results are presented in an order according to scores from the first scoring function and are presented in an order according to scores from the second scoring function.
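
To make the pair selection in that abstract a little more concrete, here is a minimal sketch in Python. The patent abstract doesn't say how the diversity score is calculated, so the measure used here (the fraction of result pairs that two scoring functions put in opposite order) is purely my assumption, as are the names diversity and pick_most_diverse_pair.

```python
from itertools import combinations


def diversity(scores_a, scores_b):
    """Assumed diversity measure: the fraction of result pairs that the
    two scoring functions place in opposite order."""
    pairs = list(combinations(sorted(scores_a), 2))
    if not pairs:
        return 0.0
    disagreements = sum(
        1
        for x, y in pairs
        if (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y]) < 0
    )
    return disagreements / len(pairs)


def pick_most_diverse_pair(candidate_scores):
    """candidate_scores maps a (hypothetical) function name to a dict of
    {result_id: score}. Returns the (alpha, beta) pair of names whose
    rankings disagree the most."""
    return max(
        combinations(sorted(candidate_scores), 2),
        key=lambda pair: diversity(
            candidate_scores[pair[0]], candidate_scores[pair[1]]
        ),
    )


candidate_scores = {
    "f1": {"a": 3, "b": 2, "c": 1},
    "f2": {"a": 3, "b": 1, "c": 2},
    "f3": {"a": 1, "b": 2, "c": 3},
}
alpha, beta = pick_most_diverse_pair(candidate_scores)
```

Picking the pair that disagrees the most would give an evaluator two noticeably different result sets to compare side by side, which seems to be the point of the diversity score.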

It’s hard to tell exactly who the processes in this patent might be written for. Is it for human evaluators who are hired by Google to experiment with ranking signals to find the ones they like best? Is it for creators of Google Custom Search Engines? Is it for site owners and searchers to explore and experiment with?

We are told about the advantages of this framework, which include:

  1. Existing search engine infrastructure can be leveraged to allow users to experiment with various scoring functions without large implementation overhead.
  2. The performance of different scoring functions can be compared.
  3. An ordering can be generated for scoring functions based on pair-wise comparisons of the scoring functions, even if all of the scoring functions have not been compared to each of the other scoring functions.
  4. Evaluations from questionable evaluators can be discounted.
  5. A market for scoring function evaluations, in which evaluators are rewarded with incentives, can be generated.
  6. A contest, in which teams of users submit scoring functions and evaluate each other’s scoring functions, can be run.
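
The third advantage in the list above, ordering scoring functions from incomplete pair-wise comparisons, can be sketched with an Elo-style rating. This is only an illustration: the post doesn't cover which method the patent uses, so the rating scheme, the function name elo_order, and the constants are all my assumptions.

```python
def elo_order(comparisons, k=32.0, base=1500.0):
    """Order scoring functions from (winner, loser) evaluator judgments.

    Elo-style ratings are an assumed stand-in, not the patent's method.
    Functions that were never compared directly still end up in one
    ordering because ratings carry information across comparisons.
    """
    ratings = {}
    for winner, loser in comparisons:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        # Probability the winner was expected to win, given current ratings.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        ratings[winner] = rw + k * (1.0 - expected_w)
        ratings[loser] = rl - k * (1.0 - expected_w)
    # Highest rating first; ties broken by name for a stable result.
    return sorted(ratings, key=lambda name: (-ratings[name], name))


# "A" and "C" are never compared head to head, yet both get a position.
ordering = elo_order([("A", "B"), ("B", "C")])
```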

Would Google open up evaluation of search results to the public at large, with the possibility of incentives as a draw?

The patent does tell us about a number of different signals that might be considered and compared in a side-by-side comparison of search results for specific queries.

Those might include doing things like giving different weights to words found within a page title, or within anchor text pointing to a page, or whether terms from queries are found in the title, or in a URL of a page in the search result, or in the body of a result, or even how many times a term can appear within the body of a page before subsequent appearances of that term are discounted.
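
As a rough illustration of what one of these scoring functions might look like, here is a small Python sketch. The Weights class, the field names, the document layout, and the numbers are all hypothetical. I've also simplified the "discounting" of repeated body terms to ignoring every occurrence past a cap, which is my assumption rather than something the patent spells out.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    """Hypothetical signal weights; names and values are illustrative."""
    title: float = 3.0
    anchor: float = 2.0
    url: float = 1.5
    body: float = 1.0
    body_cap: int = 5  # body occurrences beyond this count add nothing


def tokens(text):
    return text.lower().split()


def score(query, doc, w):
    """Sum weighted matches of each query term across the page's fields."""
    total = 0.0
    for term in tokens(query):
        if term in tokens(doc["title"]):
            total += w.title
        if term in tokens(doc["anchor_text"]):
            total += w.anchor
        if term in doc["url"].lower():
            total += w.url
        total += w.body * min(tokens(doc["body"]).count(term), w.body_cap)
    return total


doc = {
    "title": "Cheap Flights",
    "anchor_text": "flights deals",
    "url": "http://example.com/flights",
    "body": "flights " * 10,
}
default_score = score("flights", doc, Weights())
tweaked_score = score("flights", doc, Weights(title=0.0, body_cap=2))
```

Two Weights instances amount to two candidate scoring functions, so scoring the same results with each would produce the two orderings an evaluator could compare.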

The patent goes into detail on different types of signals that might be involved in ranking pages in search results, such as:

  • The terms of the query
  • A geographic location from where a query was submitted
  • The language of the user submitting the query
  • Interests of the user submitting the query
  • Type of client device used to submit a query (mobile device, laptop, desktop)
  • Locations where the query term appears within a document (title, body, text of anchors pointing to a page)
  • The term frequency (how frequently the term appears in the document compared to how frequently it appears on the Web in documents in the same language)
  • A document frequency (the number of documents that contain the query term divided by the total number of documents on the Web)
  • A measure of the quality of the individual search result
  • The geographic location where the search result is hosted
  • When the search system first added the search result to its index
  • The language of the search result
  • The size of the search result
  • The length of the title of the search result
  • The length of anchor text in links pointing to the search result
  • The number of documents in the domain of the search result that have a link pointing to that document using certain anchor text
  • The number of documents on other domains that have a link pointing to that document using certain anchor text
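
Two of the signals in that list, term frequency and document frequency, are defined as ratios, so they are easy to sketch on a toy corpus. In the Python below, a three-document list stands in for "the Web", whitespace splitting stands in for real tokenization, and the function names are mine; all of those are assumptions for illustration only.

```python
def document_frequency(term, corpus):
    """Documents containing the term, divided by all documents."""
    if not corpus:
        return 0.0
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    return containing / len(corpus)


def relative_term_frequency(term, doc, corpus):
    """How often the term appears in the document, compared with how
    often it appears across the whole (same-language) corpus."""
    doc_tokens = doc.lower().split()
    corpus_tokens = [t for d in corpus for t in d.lower().split()]
    if not doc_tokens or not corpus_tokens:
        return 0.0
    doc_rate = doc_tokens.count(term) / len(doc_tokens)
    corpus_rate = corpus_tokens.count(term) / len(corpus_tokens)
    return doc_rate / corpus_rate if corpus_rate else 0.0


corpus = [
    "cheap flights to paris",
    "paris hotels",
    "flights flights flights deals",
]
df = document_frequency("flights", corpus)
tf = relative_term_frequency("flights", corpus[2], corpus)
```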

An API, or application programming interface, might be set up to easily allow different weights to be applied to different signals in a system like this. It’s possible that this system might be set up to be used by Google employees who are tweaking different ranking signals to be compared by evaluators that Google has hired.

Conclusion

I can’t say that I’ve seen anything from Google before on how they might present different sets of search results to their evaluators for comparisons like those described in the blog post from Potpiegirl that I’ve linked to above, and it’s possible that this patent may give us a few hints at how the results being compared might be generated and tracked.

One thing that the patent does point towards is the possibility that Google may use these evaluations to create unique mixes of scoring signal weights for different query terms or different classifications of queries.

The patent describes a few different ways of classifying queries based upon either characteristics associated with those queries or their subject matter:

Long queries – Having more than a certain threshold number of characters.
Short queries – Having less than a certain threshold number of characters.
Popular queries – These show up in recent query logs more than a threshold number of times.
Unpopular queries – These appear in recent query logs less than a certain number of times.
Commercial queries – These contain terms that indicate commercial activity such as “deal,” “price,” “buy,” “store,” etc.
Noncommercial queries – These don’t contain terms that indicate some kind of commercial intent.

Queries might also be classified based upon topics such as travel, food, current events, etc.
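
The characteristic-based classifications above are simple threshold tests, which a short Python sketch can show. The thresholds, the recent_log_counts dictionary, and the exact-word matching against the commercial term list are all hypothetical choices of mine; the patent only says that thresholds exist.

```python
COMMERCIAL_TERMS = {"deal", "price", "buy", "store"}


def classify_query(query, recent_log_counts,
                   length_threshold=20, popularity_threshold=100):
    """Return the set of labels that apply to a query.

    recent_log_counts is a hypothetical {query: count} dict built from
    recent query logs. Both thresholds are made-up example values.
    """
    labels = set()
    if len(query) > length_threshold:
        labels.add("long")
    elif len(query) < length_threshold:
        labels.add("short")

    count = recent_log_counts.get(query, 0)
    if count > popularity_threshold:
        labels.add("popular")
    elif count < popularity_threshold:
        labels.add("unpopular")

    # Exact word match only, so "deals" would not match "deal" here.
    if COMMERCIAL_TERMS & set(query.lower().split()):
        labels.add("commercial")
    else:
        labels.add("noncommercial")
    return labels


logs = {"buy cheap flights": 500}
first = classify_query("buy cheap flights", logs)
second = classify_query("history of the roman empire in britain", logs)
```

Labels like these could then be used as keys for picking a different mix of signal weights for each class of query.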

The patent also describes how it might statistically identify when the results of comparisons received from evaluators might be suspicious in some way, such as failures to objectively evaluate scoring functions. For instance, if an evaluator often or only selects results shown on one side when comparing two different sets of results, there’s likely something funny going on.
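
One way to put a number on that kind of one-sided behavior is a two-sided binomial test, sketched below in Python. It assumes the two result sets are assigned to the left and right sides at random, so an honest evaluator should pick each side about half the time. The test itself, the function names, and the 0.01 cutoff are my assumptions and not details from the patent.

```python
from math import comb


def side_bias_p_value(left_picks, total):
    """Two-sided binomial p-value for choosing one side at least this
    often, assuming each side is equally likely for an honest evaluator."""
    k = max(left_picks, total - left_picks)
    tail = sum(comb(total, i) for i in range(k, total + 1)) / 2 ** total
    return min(1.0, 2 * tail)


def is_suspicious(left_picks, total, alpha=0.01):
    """Flag an evaluator whose side preference is very unlikely by chance."""
    return total > 0 and side_bias_p_value(left_picks, total) < alpha


always_left = is_suspicious(20, 20)   # picked the left side every time
balanced = is_suspicious(11, 20)      # roughly even split
```

Evaluations from a flagged evaluator could then be given less weight, in line with the "questionable evaluators" advantage the patent lists.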

If you’re interested in how human evaluators might be involved in helping to improve the quality of search results at Google, this patent appears to contain some hints at how that may be done.

25 thoughts on “How Human Evaluators Might Help Decide Upon Rankings for Search Results at Google”

  1. I have heard through at least a few people that Google regularly combs through trending and popular searches with these types of manual reviews, but I didn’t think they had the manpower to extend far beyond that (such as unpopular or long tail keywords).

    It’s apparent though during the holidays especially, that a Google search for “black friday deals” could stand some manual review :)

    Enjoyed reading Bill!

  2. Bill,

    Like all ranking metrics, I am sure that this one will also be tested in terms of synthetic manipulation in coordination with outsourcing.

    As in the case of link building, it seems that all a team would need to do would be to find a natural-in-appearance voting system where votes are simply sold to the highest bidder or a constantly changing package is sold to anyone who will pay for it.

    This is always a game of “cat and mouse” it seems.

    Very interesting, as usual :)

    Mark

  3. I’m pretty sure Google will never reveal the results of their “human” tests. They want to keep SEO’ers confused and guessing. Real human reviews are the best way to really determine the best site for a particular topic. Google wants to do whatever they can to make that “real” and not manipulated.

  4. Very interesting indeed, Bill. I think if Google was concerned about a decline in search volume they very well could move towards incentivized input and search.

    It might keep people around.

    Darren

  5. Wow, I never thought a day would come when Google would become even partially manual! I thought Google was the big robot controlling the internet. Thanks for the post. This really makes me want to learn what Google plans for the future of search results.

  6. Doesn’t Google really already have human evaluators every time someone clicks on a SERP link? If they didn’t muddy the water so much with Adwords and other junk their click through data would be more reliable.

  7. Hi Keith,

    Thank you.

    There are a number of different ways that Google might try to evaluate the quality of the search results that it serves, and I suspect that while a number of them are automated (for instance, see the patent I wrote about in the post How Google Might Suggest Topics for You to Write About), I think Google also places some value in having people manually review queries for relevance as well. Of course, that doesn’t scale well, but it may help surface some problems that an automated system just can’t.

    The automated systems Google has may also suggest that some of the long tail and unpopular queries they see get looked at as well, and using both systems to check on each other sounds reasonable, especially if they help find queries that deserve different algorithmic approaches than what Google might be using now.

    A fairly recent interview with Google’s head of research, Peter Norvig, indicated that Google performs a very large number of experiments on their core search system every year, on the scale of tens of thousands. We’ve also been hearing from Google’s Matt Cutts and Amit Singhal that Google implements about 500 changes a year to those algorithms as well. The idea of human evaluators being able to see some of those changes before they might be tested live at Google makes sense.

    I’m going to stay away from that “black friday deals” search. :)

  8. Hi Mark,

    I wasn’t too surprised to see that the patent had included a paragraph that described some of the signals that Google might be looking for from their evaluators that might indicate that they might be biased in some way, or taking some shortcuts, or doing other things that might be suspicious. Of course, in most patent applications, the description provides some details, but likely is purposefully leaving out others. While you do need to provide a reasonable amount of detail in a patent description, you don’t need to include a step-by-step guide on how to build the invention described within the patent.

    I’m sure that Google would try to do all that they could to keep a system like this from being gamed if they were using it in a way that could directly impact search results, and that they would limit how much of an impact that any single evaluation might have.

  9. Hi Alan,

    Google did publish a paper last year which describes some of their framework for testing new algorithms and approaches, which I wrote about in We’re All Google’s Lab Rats.

    Chances are good that Google will test new ranking signals or different weights on old signals in a number of different ways, possibly using some small control groups, and then human evaluators, and in a number of cases by showing results live to a small percentage of people accessing Google at a specific data center. By running a number of different tests under different conditions, it gives them the chance to compare the results of those tests.

    Google can also use automated evaluation systems along with those, and work with others who may not have been involved directly with those changes and experiments to evaluate how well they might work. Interesting though that this patent seems to give us some insight into one aspect of testing.

  10. Hi Raymond,

    I’m not sure if Google would open up this type of testing to the public, but if you ever get the chance, you should look into Google Custom Search Engines, and the many things you can do with it while setting up your own search engine.

    It’s possible that some of the information that gets created through Google Custom Search may end up as part of how Google creates things like Query Refinements and may even play a role in the rankings of some search results under Google’s approach to trust rank.

  11. Hi Steve

    Google does collect a very large amount of data within their query logs about the results that people see, the results that they click upon, how long people might spend on pages they do visit, where they are visiting from, what language they have set as their preferred language, and much much more.

    That data can be used to personalize search, to show customized results based upon a web history or location, to generate query refinements and spell correction, to offer queries that can be used in Google suggest, to determine how relevant results might be for queries, and so on.

    But, Google has also developed a framework for testing new or altered algorithms as well, and having some manual reviews as part of that framework provides them with a chance to test changes before they might make them live.

  12. The sheer amount of websites that are out there makes it difficult to have human evaluators in place.

    Perhaps human evaluators are used to test the effectiveness of their algorithm, and rate the results that they obtain from search queries.

  13. Hi Talha,

    The Web is too large to rely upon human evaluators to determine the relevancy of each page, but they definitely could be used to test the algorithms in place for different queries, and for different classifications of queries like I described in the post.

    I think that’s why we’re seeing less reliance on sites like the Yahoo Directory and DMOZ – they just can’t keep up.

  14. It’s good to know that humans are being used as evaluators but as you say, that shouldn’t scale very well unless they are used to test and validate algorithms which are then improved and deployed for massive automated evaluations. The use of social media and other indicators does seem a better indicator of quality and relevant sites, using the concept of crowdsourcing to establish what is popular and what isn’t.
    Heading over to Potpiegirl’s blog!

  15. I wonder how much demography they have on search graders. Certainly SERPS preferences can be subjective in some regards (even though the grader’s document I read seemed very explicit). What controls in place could be there to evaluate the influence of personal opinions? Perhaps just a large sample set?

  16. I’m pretty sure Google will never reveal the results of their “human” tests.

    Btw, Nice post. Waiting to get your next. Thanks for sharing. :-)

  17. Hi Eliseo

    The only reason to use human evaluators is to test changes to algorithms, to do checks on some results where automated evaluations show there might be some problems, and to do some spot checks on other queries or categories of queries.

  18. Hi Chris,

    I would suspect that they would ask for that kind of information when hiring people, so that they do have demographics for people they are signing up as evaluators.

    Search results evaluations are likely somewhat subjective, but chances are that they would use a fairly large sample. The patent also notes that Google would look for unusual or suspicious patterns and activities in the evaluations that are provided by evaluators as well.

  19. Hi Al-Amin,

    Thanks.

    I’m not sure that there’s really much value in sharing specific data about human evaluations of search results, for Google and for people looking at the results. If Google is testing the algorithms that they use to rank pages, the algorithms are something they would likely want to keep from sharing with the public, with possible competitors, and with people who might want to try to manipulate rankings of pages.

    Google has published the results of some studies where they’ve conducted usability tests with people, and provided some details of those tests.

  20. Definitely agree with Alan here. While the idea of Google actually sourcing the man-power to maintain a remotely influential evaluation system seems a bit far fetched, what little tasks they do perform ‘manually’ will be most likely kept secret/used as a scare tactic.

  21. Hi Dan,

    These human evaluators are used to test different mixes and versions of rankings when Google tests new algorithms. I don’t believe that their evaluations directly affect search results ever. The Web is too large to rely upon human evaluators when it comes to ranking pages.

    No scare tactic or even secret on the part of Google. They’ve been making it clear for years that they have human evaluators who look at some search results.

    Of course, Google has also made it pretty clear that they will perform manual spam review checks when appropriate, but the people performing those aren’t these human evaluators.

  22. Hi Fergus,

    Thank you. I did manage to catch that video when it came out, but it’s a nice one to include in this discussion, so thanks for thinking to post it.
