When someone performs a search at a search engine, they tend to use only a handful of words, or fewer, to try to find information about a topic. That presents a search engine with the challenge of finding web pages and other results in response while also trying to understand the intent behind that search.
If someone enters “new york pizza sunnyvale” (without the quotation marks) into a search box at Google or Yahoo or Bing, it’s not quite clear whether they are looking for: (1) pizza in New York, in a neighborhood or area referred to as Sunnyvale, (2) New York style pizza in a place called Sunnyvale, (3) a place called “New York Pizza,” in Sunnyvale, or (4) some other result.
One approach that could be followed to try to understand the intent behind a query like this is to break down the words in the query into entity types, and apply labels to those entities. With the “new york pizza sunnyvale” example, that could be done a few ways:
[new york pizza]/food [sunnyvale]/location
[new york pizza]/business [sunnyvale]/location
[new york]/location [pizza]/food [sunnyvale]/location
This kind of attempt to disambiguate, or find the meanings or senses behind the words and phrases used in a query, could be helpful in finding results that better match what a searcher may be looking for.
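To make the labeling idea a little more concrete, here is a minimal sketch of how candidate interpretations like the three above might be enumerated. The `LEXICON` table, its entries, and the function names are my own assumptions for illustration; the patent filing does not describe this particular code, and a real system would draw on far larger sources than a four-entry dictionary.

```python
from itertools import product

# Hypothetical toy lexicon (an assumption for this sketch):
# phrase -> entity labels that phrase could plausibly carry.
LEXICON = {
    "new york pizza": ["food", "business"],
    "new york": ["location"],
    "pizza": ["food"],
    "sunnyvale": ["location"],
}


def segmentations(tokens):
    """Yield every split of tokens into contiguous phrases found in LEXICON."""
    if not tokens:
        yield []
        return
    for end in range(1, len(tokens) + 1):
        phrase = " ".join(tokens[:end])
        if phrase in LEXICON:
            for rest in segmentations(tokens[end:]):
                yield [phrase] + rest


def interpretations(query):
    """Return all labeled interpretations as lists of (phrase, label) pairs."""
    results = []
    for phrases in segmentations(query.lower().split()):
        label_choices = [LEXICON[p] for p in phrases]
        for labels in product(*label_choices):
            results.append(list(zip(phrases, labels)))
    return results


def render(interp):
    """Format an interpretation in the [phrase]/label notation used above."""
    return " ".join(f"[{phrase}]/{label}" for phrase, label in interp)


if __name__ == "__main__":
    for interp in interpretations("new york pizza sunnyvale"):
        print(render(interp))
```

With this toy lexicon, the query produces exactly the three labeled readings listed earlier. Words that are missing from the lexicon simply yield no interpretation, which a real system would have to handle more gracefully.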
When I perform the search “new york pizza sunnyvale” in Google, the top result is Giovanni’s New York Pizzeria in Sunnyvale, California. At Yahoo, my top result is a place called New York Pizza in Mountain View, California. A search at Bing gives me a top result showing a directory of pizza places in Sunnyvale that serve New York style pizza. Most of the other top ten results at all three search engines are about pizza in California, rather than results about pizza in New York.
If a search engine were to try to break a query down into entities and apply labels to them, it would then have to choose among those candidate interpretations to decide which might be closest to the intent of a searcher. It could potentially do that by creating a confidence score for each of the possible interpretations, based upon information found in online dictionaries or encyclopedias, web pages, and other kinds of documents found online.
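One simple way to picture a confidence score is to give each labeled entity a confidence value and multiply those values together for each interpretation. The sketch below does that. Multiplying is my own simplifying assumption rather than anything stated in the patent filing, and every number in `ENTITY_CONFIDENCE` is invented purely for illustration.

```python
from math import prod

# Hypothetical per-entity confidences. In practice these might be derived from
# dictionaries, encyclopedias, or web pages; the values here are made up.
ENTITY_CONFIDENCE = {
    ("new york pizza", "business"): 0.80,
    ("new york pizza", "food"): 0.60,
    ("new york", "location"): 0.80,
    ("pizza", "food"): 0.90,
    ("sunnyvale", "location"): 0.95,
}

# The three candidate interpretations of "new york pizza sunnyvale".
CANDIDATES = [
    [("new york pizza", "food"), ("sunnyvale", "location")],
    [("new york pizza", "business"), ("sunnyvale", "location")],
    [("new york", "location"), ("pizza", "food"), ("sunnyvale", "location")],
]


def score(interp):
    """Multiply entity confidences; unknown (phrase, label) pairs score 0."""
    return prod(ENTITY_CONFIDENCE.get(pair, 0.0) for pair in interp)


def best_interpretation(candidates):
    """Pick the candidate with the highest combined confidence."""
    return max(candidates, key=score)


if __name__ == "__main__":
    for interp in sorted(CANDIDATES, key=score, reverse=True):
        print(round(score(interp), 3), interp)
```

With these invented numbers, the "business in Sunnyvale" reading comes out on top, but different evidence would produce a different winner, which is really the point of scoring rather than hard-coding an answer.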
The tags assigned to different entities found within queries could cover a wide range of labels, such as:
- Product names
- Person names
This kind of query interpretation system could be built from training data, possibly collected from human judges, that is used to train a model to score interpretations of queries.
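As a rough illustration of what training on judged data could look like, the sketch below fits a tiny logistic regression model to a few hand-labeled interpretations. Logistic regression is my own choice for the example, not a technique named in the patent filing, and the features, the judgments in `JUDGED`, and the hyperparameters are all hypothetical.

```python
import math


def features(interp):
    """Hypothetical features: bias, segment count, has-business, has-location."""
    labels = [label for _, label in interp]
    return [
        1.0,
        float(len(interp)),
        float("business" in labels),
        float("location" in labels),
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, interp):
    """Return the model's estimate that a judge would accept this interpretation."""
    return sigmoid(sum(w * x for w, x in zip(weights, features(interp))))


def train(examples, epochs=500, learning_rate=0.1):
    """Fit weights with plain stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * 4
    for _ in range(epochs):
        for interp, judged_good in examples:
            error = judged_good - predict(weights, interp)
            weights = [
                w + learning_rate * error * x
                for w, x in zip(weights, features(interp))
            ]
    return weights


# Invented judgments: 1 means a human judge accepted the interpretation.
JUDGED = [
    ([("new york pizza", "business"), ("sunnyvale", "location")], 1),
    ([("new york pizza", "food"), ("sunnyvale", "location")], 0),
    ([("new york", "location"), ("pizza", "food"), ("sunnyvale", "location")], 0),
]

if __name__ == "__main__":
    weights = train(JUDGED)
    for interp, _ in JUDGED:
        print(round(predict(weights, interp), 3), interp)
```

Three examples are far too few to learn anything meaningful, of course. The sketch is only meant to show the shape of the loop: judged examples go in, and a function that scores new interpretations comes out.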
A Yahoo patent application published last week explores how such a system might be used:
Search Query Disambiguation
Inventors: Gilad Mishne, Raymond Stata, and Fuchun Peng
US Patent Application 20100205198
Published August 12, 2010
Filed: February 6, 2009
Disclosed herein is a system and method of query disambiguation. At least one model is generated using training data, which model can be used to score, or rank, possible interpretations identified for a query, which can be used to select an interpretation from a number of possible interpretations.
A selected interpretation can be used to process a web search request, e.g., to generate search results that relate to the selected query interpretation, rank or order the items in the search result based on relevance to the selected query interpretation, and/or identify a presentation to be used to display the search results based on the selected query interpretation.
The patent filing goes into a fair amount of detail about how a system like this might be used, but the basic concept, that entities might be identified in query terms and labeled, is at the heart of the approach.
For some queries, more than one interpretation may be identified with a certain level of confidence, and search results might contain pages covering those interpretations.
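Here is one way that might work in practice, sketched under my own assumptions: keep every interpretation whose score clears a threshold, then interleave the results for each so that all of the surviving readings are represented. The threshold value, the scores, the round-robin merge, and the placeholder result names are all hypothetical and are not taken from the patent filing.

```python
from itertools import zip_longest


def confident(scored, threshold=0.3):
    """Keep (interpretation, score) pairs at or above the threshold, best first."""
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def blend(result_lists, limit=10):
    """Round-robin merge of per-interpretation result lists, dropping duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for result in tier:
            if result is not None and result not in seen:
                seen.add(result)
                merged.append(result)
    return merged[:limit]


if __name__ == "__main__":
    # Invented scores for the three readings of "new york pizza sunnyvale".
    scored = [
        ("business in sunnyvale", 0.50),
        ("ny-style pizza in sunnyvale", 0.35),
        ("pizza in new york", 0.15),
    ]
    # Placeholder result lists, one per interpretation.
    results = {
        "business in sunnyvale": ["page-a", "page-b"],
        "ny-style pizza in sunnyvale": ["page-c", "page-a", "page-d"],
        "pizza in new york": ["page-e"],
    }
    kept = confident(scored)
    print(blend([results[name] for name, _ in kept]))
```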
In addition to helping decide which web pages to return in search results, query interpretations might sometimes trigger specialized results, such as a local search map result, or certain kinds of advertisements.
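A search engine could decide on those specialized results by looking at which labels appear in the chosen interpretation. The rules and feature names in this sketch (`local_map`, `shopping_ads`) are hypothetical examples of my own, not ones spelled out in the patent filing.

```python
def triggered_features(interp):
    """Return hypothetical specialized result types suggested by an interpretation."""
    labels = {label for _, label in interp}
    features = []
    # Assumed rule: a place plus something to find there suggests a map result.
    if "location" in labels and labels & {"business", "food"}:
        features.append("local_map")
    # Assumed rule: a product entity suggests product-related advertisements.
    if "product" in labels:
        features.append("shopping_ads")
    return features


if __name__ == "__main__":
    pizza = [("new york pizza", "business"), ("sunnyvale", "location")]
    print(triggered_features(pizza))
```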
The patent filing also branches off to explore how numeric terms might be interpreted when found in queries, and provides a large number of examples. For instance, “Godfather 3” might be interpreted to be equivalent to “Godfather III,” but “firefox 3” might not be seen to be equivalent to “firefox III.”
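The difference between those two examples could come down to the type of entity the number is attached to. The sketch below expands a trailing number into a Roman numeral only when the entity type is assumed to allow it. The `ENTITY_TYPES` and `ROMAN_ALLOWED` tables are hypothetical stand-ins of my own; the patent filing's actual rules for numeric terms are more extensive than this.

```python
def to_roman(number):
    """Convert a positive integer to an uppercase Roman numeral."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in numerals:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


# Hypothetical lookup tables, assumed for this sketch only.
ENTITY_TYPES = {"godfather": "movie", "firefox": "software"}
ROMAN_ALLOWED = {"movie": True, "software": False}


def variants(query):
    """Return the set of query forms treated as equivalent to the input."""
    tokens = query.lower().split()
    results = {" ".join(tokens)}
    if len(tokens) >= 2 and tokens[-1].isdecimal() and int(tokens[-1]) > 0:
        head = " ".join(tokens[:-1])
        entity_type = ENTITY_TYPES.get(head)
        if ROMAN_ALLOWED.get(entity_type, False):
            results.add(f"{head} {to_roman(int(tokens[-1])).lower()}")
    return results


if __name__ == "__main__":
    print(variants("Godfather 3"))
    print(variants("firefox 3"))
```

Queries are lowercased before comparison here, so the Roman numeral variant comes back as "godfather iii" rather than "Godfather III."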