How a Search Engine Might Determine the Relevance of Search Results from Related Queries

It’s interesting to see how a search engine might try to ensure the relevancy of its own search results.

A recently granted Yahoo patent describes an approach that might help the search engine gauge how relevant the results it displays to searchers actually are, and how well those results cover a variety of topics when a searcher uses a query term that could have several meanings.

Before presenting its automated approach for checking relevance and variety, the patent describes some of the limitations it sees in using manual review or click data to determine how relevant results might be.

Human Reviewers

One option for checking on the relevancy of search results would be to manually screen results for each query. That might be pretty time-consuming, would involve the possibility of human error, and wouldn’t even begin to cover all of the queries that are conducted on the web.

I did see an ad on Craigslist a few weeks ago from Lionbridge Technologies, Inc., asking for part-timers to act as Internet Judges. A little sleuthing on the Web revealed that Google may have used Lionbridge in the past to hire people to rank the relevancy of search results, though the Craigslist posting didn’t identify the ultimate employer. From the job description in the ad:

Position Description

Relevance measurement is the foundation of all search engines, without it no one can tell whether a change has made the system better or worse. As an Internet Judge you will be a key participant in helping determine the relevance of search engines. We are looking for Internet Judges who would work from home; review and rate websites based on an objective set of guidelines. Candidates must be avid internet enthusiasts. If you love browsing the web and can follow specific set of guidelines to rate websites, then we want to hear from you.

Search engines do use manual reviewers. So does baseball. They never make mistakes, now do they?

An umpire judging a play at second base.

Tracking Clicks

In a recent post, I described a patent filing from Yahoo, where they presented a method of ranking images based upon a method of predicting clickthroughs of images at different positions in search results.

The assumption behind that approach was that images that appeared to be relevant for a query would be clicked upon. A predicted click rate for each position in the search results could then be used to identify images that overperformed for the position where they appeared and move those up, and to find images that underperformed for their position and move them down. With image search results showing a thumbnail of each image, that might work well for images.
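To make that idea a little more concrete, here is a minimal Python sketch of comparing observed clicks against the clicks expected at a position. It is only an illustration: the `EXPECTED_CTR` numbers, the `ImageStats` record, and the rule of sorting purely by the observed-to-expected ratio are all assumptions of mine, not details taken from the Yahoo filing.

```python
from dataclasses import dataclass

# Hypothetical expected click-through rates by result position (1-based).
# These numbers are invented for illustration only.
EXPECTED_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}


@dataclass
class ImageStats:
    image_id: str
    position: int      # where the image currently appears in the results
    impressions: int   # how many times it was shown at that position
    clicks: int        # how many times it was clicked


def performance_ratio(stats: ImageStats) -> float:
    """Observed CTR divided by the CTR expected at that position."""
    if stats.impressions == 0:
        return 1.0  # no evidence either way, treat as performing as expected
    observed = stats.clicks / stats.impressions
    return observed / EXPECTED_CTR[stats.position]


def rerank(results: list[ImageStats]) -> list[str]:
    """Order images so over-performers rise and under-performers fall."""
    ranked = sorted(results, key=performance_ratio, reverse=True)
    return [r.image_id for r in ranked]


results = [
    ImageStats("a", 1, 1000, 200),  # 0.20 observed vs 0.30 expected
    ImageStats("b", 2, 1000, 300),  # 0.30 observed vs 0.15 expected
    ImageStats("c", 3, 1000, 100),  # 0.10 observed vs 0.10 expected
]
new_order = rerank(results)
print(new_order)
```

In this made-up data, image "b" gets twice the clicks expected at its position and would move up, while image "a" falls short of what the top position usually earns and would move down.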

Would tracking how often web search results get clicked upon reveal whether those results are relevant for the query terms that they rank for?

A problem with that approach is that searchers only see a page title, abstract (or snippet), and URL for web pages, and those may not accurately reflect the content that appears upon the pages they represent. That limitation means that clicks upon search results for web pages may not be a good indication of how relevant those results are for a particular query.

An automated system for judging balls and strikes over home plate.

The image above is from a patent for an Automated Baseball Umpiring System. While it may possibly do a good job of calling balls and strikes, it probably isn’t going to be helpful with other tasks, such as determining whether a hitter was hit by a pitch, or if a runner is safe or out on a close play at the plate.

An Algorithm to Determine the Relevancy and Variety of Search Results

Yahoo’s patented process uses information from recent searches to see if search results match up well with the searches that people are performing at the search engine.

Automatic relevance and variety checking for web and vertical search engines
Invented by Jignashu G. Parikh
Assigned to Yahoo
US Patent 7,558,787
Granted July 7, 2009
Filed July 5, 2006

Abstract

Techniques for automatically checking the relevance and variety of search results are provided.

A query is submitted to a search engine, which uses a search algorithm to obtain search results based on the query. A set of the top n related terms for the query is identified. For each related term in the set of terms, its relative frequency in relation to all terms in the set of terms is determined. If the term does not occur in any of the results, then a loss in variety proportional to the relative term frequency for the term has occurred.

Otherwise, the relevance of the search results is calculated by comparing the proportion of results containing the term with the relative term frequency for a term. This process is repeated for all terms in the set of related terms to produce a total variety and relevance for the results.

When someone searches at a search engine, they enter a query term into the search box, and hit enter.

A set of results is returned by the search engine which ranks those results according to search algorithms. The actual algorithms used to rank those results usually include elements that measure both the relevance and the importance of pages matching the query searched for.

This patent filing describes a testing interface that search algorithm and search engine developers can use to test the variety and relevance of search results.

As I noted at the start of this post, it’s interesting to see how a search engine might attempt to determine how relevant search results might be.

Using Related Terms

This process of determining relevancy and variety in search results starts by identifying terms that might be related to a searcher’s query.

Someone searches for [Amazon] and the search engine retrieves results related to the query and displays results to the searcher.

The results that appear may be relevant to the online store at “Amazon.com” or to the “Amazon river.”

There’s no way to automatically determine whether the searcher wants information about one or the other, or about something else entirely.

But, the search engine might look at query logs and session based search data and other data sets to determine sub-concepts for a query.

Those sub-concepts might be the kind that you see offered as query suggestions by a search engine. See my previous post, How Search Engines May Decide Upon and Optimize Query Suggestion, for some ideas on how a search engine might identify and optimize query suggestions for a specific query.

The same kind of data that Yahoo might use to offer “Also Try” type queries or Yahoo predictive search suggestions might also be used to identify sets of related terms for a searcher’s query.

Yahoo auto-completion query suggestions under the search box.

A search engine also tracks the times at which queries are submitted, which may be helpful in identifying time-sensitive queries.

Related terms may be collected from search engine query log data from the last week, rather than the last year, to make sure that the information is timely.

So, if an earthquake took place a couple of months ago, the query logs around that time might have included many searches for [Amazon earthquake].

A month or more later, there might be far fewer searches for that term, and [amazon earthquake] might no longer be considered a related query the way it would have been shortly after the event.

A search through recent query logs might show how many times queries that included, or “co-occurred” with, “Amazon” appeared in that data. So queries such as “amazon books,” “amazon river,” and “amazon rainforest” might be determined to be related queries if they show up frequently enough in the query logs that are examined.
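As a rough illustration of that counting step, here is a small Python sketch. The log layout (a list of timestamp and query pairs), the seven-day window, the `min_count` threshold, and the sample dates are all hypothetical choices of mine; the patent doesn't prescribe them in the passages covered here.

```python
from collections import Counter
from datetime import datetime, timedelta


def related_terms(log, primary, now, window_days=7, min_count=2):
    """Count terms that co-occur with `primary` in recent queries.

    `log` is a list of (timestamp, query) pairs; this layout is hypothetical.
    Only queries newer than `window_days` are considered, so stale topics
    drop out of the related-term set.
    """
    cutoff = now - timedelta(days=window_days)
    counts = Counter()
    for timestamp, query in log:
        if timestamp < cutoff:
            continue  # too old to be considered timely
        words = query.lower().split()
        if primary in words:
            # Count each co-occurring word once per query.
            counts.update({w for w in words if w != primary})
    # Keep only terms that show up often enough to matter.
    return {term: n for term, n in counts.items() if n >= min_count}


now = datetime(2009, 7, 14)
log = [
    (datetime(2009, 7, 13), "amazon books"),
    (datetime(2009, 7, 12), "Amazon books coupon"),
    (datetime(2009, 7, 12), "amazon river"),
    (datetime(2009, 7, 11), "amazon river cruise"),
    (datetime(2009, 5, 1), "amazon earthquake"),   # outside the window
    (datetime(2009, 5, 2), "amazon earthquake"),   # outside the window
    (datetime(2009, 7, 10), "yahoo mail"),         # unrelated query
]
recent = related_terms(log, "amazon", now)
print(recent)
```

With this invented log, "books" and "river" survive as related terms, while the two-month-old "earthquake" searches fall outside the window and are ignored.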

The search engine may also look at search sessions from searchers in the query logs, to see how often other queries appear in the same search sessions as queries for, or that contain “Amazon.”

A search session might be defined as multiple searches from a searcher within a specific amount of time, such as an hour or a day.
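Here is one way that session-based grouping might be sketched in Python. The row layout (user, timestamp, query), the one-hour window measured from the first query in a session, and the sample data are assumptions made for illustration; a real system could define sessions quite differently.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def split_sessions(log, window=timedelta(hours=1)):
    """Group (user, timestamp, query) rows into per-user sessions.

    A session here is every query a user issues within `window` of the
    first query in that session; this is one possible definition.
    """
    by_user = defaultdict(list)
    for user, timestamp, query in sorted(log, key=lambda row: row[1]):
        by_user[user].append((timestamp, query))

    sessions = []
    for rows in by_user.values():
        start, current = None, []
        for timestamp, query in rows:
            if start is None or timestamp - start > window:
                if current:
                    sessions.append(current)
                start, current = timestamp, []
            current.append(query)
        if current:
            sessions.append(current)
    return sessions


def session_neighbors(sessions, primary):
    """Count queries sharing a session with a query that contains `primary`."""
    counts = Counter()
    for session in sessions:
        if any(primary in q.lower().split() for q in session):
            counts.update(
                q for q in set(session) if primary not in q.lower().split()
            )
    return counts


log = [
    ("u1", datetime(2009, 7, 13, 9, 0), "amazon"),
    ("u1", datetime(2009, 7, 13, 9, 10), "cheap textbooks"),
    ("u1", datetime(2009, 7, 13, 15, 0), "weather"),  # a later session
    ("u2", datetime(2009, 7, 13, 9, 5), "amazon river"),
    ("u2", datetime(2009, 7, 13, 9, 20), "rainforest tours"),
]
sessions = split_sessions(log)
neighbors = session_neighbors(sessions, "amazon")
print(dict(neighbors))
```

In this toy data, "cheap textbooks" and "rainforest tours" each share a session with an [amazon] search, while "weather" was searched hours later and isn't counted.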

Relative Term Frequency and Checking for Relevancy

Once a search engine has come up with a set of related terms for a query, it might calculate the relative frequency of each of those related terms among all of the terms that co-occur with the searcher’s original query in the query logs that were examined. Here’s an example from the patent filing of how that calculation might work.

For example, referring to table 216, the F_term of the term “books” is 25, meaning that “books” co-occurred with “Amazon” 25 times within the selected portion of Query Log 210, represented by table 212. Further, the F_total is 50, corresponding to the total number of co-occurrences for all terms within the set of table 216.

Therefore, a determination can be made that the F_relative of the term “books” is 25/50 or 50%. Table 216 further contains the relative term frequencies of all the other terms within the set of related terms. Specifically, the term frequency of “rainforest” is 12/50, or 24%; of “river” is 8/50, or 16%; and of “fish” is 5/50, or 10%.

The relative term frequency of each related term in the set is used to determine both the relevance and variety of search results for a primary query as further described herein.
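The arithmetic in that example is simple enough to sketch in a few lines of Python. The counts come straight from the patent's example; the function and variable names are my own.

```python
def relative_frequencies(counts):
    """Divide each term's co-occurrence count by the total for the set."""
    total = sum(counts.values())
    if total == 0:
        return {term: 0.0 for term in counts}
    return {term: n / total for term, n in counts.items()}


# Co-occurrence counts with "Amazon" from the patent's example (table 216).
cooccurrence = {"books": 25, "rainforest": 12, "river": 8, "fish": 5}
f_relative = relative_frequencies(cooccurrence)

for term, share in f_relative.items():
    print(f"{term}: {share:.0%}")
```

Because every count is divided by the same total, the relative frequencies always add up to 100% across the set of related terms.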

Those ratios might be used in looking at the search results for the original search query.

If you look at the titles and snippets (or the actual content) of the top ten results in a search for [amazon], do half of those results contain the word “books,” in line with the query logs that were examined? Do a quarter of them contain the word “rainforest”? Is there a mention of the word “river” in one or two of them? Is there at least one with the word “fish” in it?

If the ratios between the query logs and the search results match up well, it might be an indication that the relevance of those results is pretty good. It may also indicate that the variety of the results is good as well.
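Here is a hedged Python sketch of what such a check could look like. The abstract only says that a missing term costs variety in proportion to its relative frequency, and that relevance comes from comparing the proportion of results containing a term with that frequency. The specific scoring rule below (giving each term credit up to its expected share) is my own assumption, as are the function name and the sample result titles.

```python
import re


def check_results(results, f_relative):
    """Return (relevance, variety) scores between 0 and 1.

    `results` is a list of title/snippet strings; `f_relative` maps each
    related term to its relative frequency in the query logs.
    The scoring rule is an illustrative assumption, not the patent's formula.
    """
    if not results:
        return 0.0, 0.0
    tokenized = [set(re.findall(r"[a-z]+", text.lower())) for text in results]
    relevance, variety = 0.0, 1.0
    for term, expected in f_relative.items():
        hits = sum(term in tokens for tokens in tokenized)
        if hits == 0:
            variety -= expected  # variety loss proportional to frequency
            continue
        proportion = hits / len(results)
        relevance += min(proportion, expected)  # credit capped at expected share
    return relevance, variety


f_relative = {"books": 0.50, "rainforest": 0.24, "river": 0.16, "fish": 0.10}
top_ten = [
    "Amazon.com books and more",
    "Amazon books bestsellers",
    "Used books at Amazon",
    "Amazon books for kids",
    "Amazon Kindle store",
    "Amazon rainforest facts",
    "Amazon rainforest conservation",
    "Amazon river cruise",
    "Amazon web services",
    "Amazon seller central",
]
relevance, variety = check_results(top_ten, f_relative)
print(f"relevance={relevance:.2f} variety={variety:.2f}")
```

In this invented result set, no title mentions "fish," so variety drops by that term's 10% share, and the other three terms appear a little less often than the query logs would suggest, which pulls the relevance score below 1.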

The patent does warn that some search results may be very relevant, but may also completely lack variety if a searcher’s query doesn’t contain many sub-topics, or related terms involving different topics.

Conclusion

I thought it was interesting that this patent describes a way of finding related query suggestions that is very similar to the method described in a Microsoft patent application in my last post.

The idea that the frequency of appearance of words from related queries could be used to gauge the relevancy and variety of results for a searcher’s query is also worth thinking about.

If half the people using [amazon] in their searches include the word “books” in those searches, should half the search results in a search for [amazon] contain the word “books”? If 20 percent of searchers looking for [amazon] include the word “rainforest” in those searches, should two of the top ten search results be results about the Amazon rainforest?

Presently, the top ten search results at Yahoo for [amazon] contain two results for the .com version of the bookstore, followed by two results for the .ca version of the bookstore, then the Wikipedia page for amazon.com, an entry about the Amazon river, a couple of pages about Amazon’s web services, a result for the co.uk Amazon store, and a final result for Amazon seller services, which lets people sell their products through Amazon.

Do these results reflect recent searches contained in Yahoo’s query logs that include the word “amazon,” or that show up in the same search sessions as a search for [amazon]?

Should the relevancy of search results be based upon the frequency of related terms in recent query logs? Is that a good measure of how relevant those results might be?

I’ve actually written an earlier post about this patent when it was published as a patent application in January of 2008. I didn’t realize that until I was most of the way done with this post, but I think that the two posts actually complement each other, so I decided to go ahead and publish this post.

I think the two posts do a good job of emphasizing the importance of trying to understand what a search engine might see as “related queries” for a specific query, and how those might not only influence which search suggestions might be shown in a set of search results, but also how relevant a search engine might believe those search results to be based upon those related queries.


8 thoughts on “How a Search Engine Might Determine the Relevance of Search Results from Related Queries”

  1. Great post. Your posts on patents always help us to understand Google algorithms.

  2. Hi Lohith,

    Thank you. This particular patent is from Yahoo, but I think it does a good job of showing how methods that might be used for something like finding related queries, which all of the major search engines do, could be used in other ways as well.

  3. Very interesting Bill. Related queries are certainly something we pay a lot of attention to here. Getting a handle on what a search engine sees as related and relevant for a particular query provides great insight for content development.

  4. Hi Bullaman,

    We’re thinking along the same lines. I really enjoy going through search results for queries, and exploring different meanings for them, seeing what shows up as query suggestions, looking at plurals and singular versions, joining and separating and hyphenating words when appropriate, and so on. I like seeing what triggers the appearance of images and videos and news and local results as well. That examination can give you lots of ideas when creating content.

  5. Hey Bill! I have a question and I think you would be the man to answer it. How much do you think that visitor tracking in terms of Time on Site, Bounce Rates and other factors influences sites’ rankings in the search engines? If very little, will that change in the near future? I am guessing you have already addressed this but I can’t recall your opinion.

  6. Hi Joel,

    There are a lot of hints, scattered through a number of patent filings, that user behavior information such as time spent on site, amount scrolled down a page, mouse movements on a page, and more may play a role in the rankings of pages, and in personalization. I think that user behavior may play an ever-increasing role in the future. I’ll point to a couple of those here:

    Google’s patent, Information retrieval based on historical data, has a section on how user behavior might be used. Here’s a snippet:

    User Behavior

    According to an implementation consistent with the principles of the invention, information corresponding to individual or aggregate user behavior relating to a document over time may be used to generate (or alter) a score associated with the document. For example, search engine 125 may monitor the number of times that a document is selected from a set of search results and/or the amount of time one or more users spend accessing the document. Search engine 125 may then score the document based, at least in part, on this information.

    If a document is returned for a certain query and over time, or within a given time window, users spend either more or less time on average on the document given the same or similar query, then this may be used as an indication that the document is fresh or stale, respectively. For example, assume that the query “Riverview swimming schedule” returns a document with the title “Riverview Swimming Schedule.” Assume further that users used to spend 30 seconds accessing it, but now every user that selects the document only spends a few seconds accessing it. Search engine 125 may use this information to determine that the document is stale (i.e., contains an outdated swimming schedule) and score the document accordingly.

    In summary, search engine 125 may generate (or alter) a score associated with a document based, at least in part, on information corresponding to individual or aggregate user behavior relating to the document over time.

    Another Google patent describes how that kind of tracking user behavior could work with personalization of search results. See: Personalization of placed content ordering in search results.

    This patent describes how information such as choices of pages from search results, amount of scrolling upon a page, browsing habits, and other user information can help a search engine personalize search results for a particular query. Here’s a snippet from that patent that explains why they would do that:

    [0009] For example, assume that a user submits to a search engine a search query having only one term “blackberry”. Without any other context, on the top of a list of documents returned by a PageRank-based search engine may be a link to “www.blackberry.net,” because this web page has the highest page rank. However, if the query requester is a person with interests in foods and cooking, it would be more useful to order the search results so as to include at the top of the returned results web pages with recipes or other food related text, pictures or the like. It would be desirable to have a search engine that is able to reorder its search results, or to otherwise customize the search results, so as to emphasize web pages that are most likely to be of interest to the person submitting the search query. Further, it would be desirable for such a system to require minimal input from individual users, operating largely or completely without explicit input from the user with regard to the user’s preferences and interests. Finally, it would be desirable for such a system to meet users’ requirements with respect to security and privacy.
