Search for the word “automobile” at Google, and the search engine might expand your search to include results for the word “car” as well, since it is a synonym of the word automobile. Accidentally misspell the word as “automoble” and Google might automatically correct your spelling error and search for “automobile.”
Follow that up with a search for the word “driving,” and Google could expand your query through a process called stemming, which looks at the root of the word (driv-) and adds common endings to it, so that words such as “driving” and “driver” are included in the search.
This kind of query expansion is aimed at providing searchers with better search results. Google might not be expanding queries in all of these ways yet (though it sometimes appears to for spelling corrections, at least), and expansion might not happen in every search.
Typical approaches to query expansion include:
- Stemming of words,
- Correction of spelling errors, and
- Augmenting search queries by doing things such as using synonyms of words that occur in the original query.
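To make the three approaches above a little more concrete, here is a minimal sketch in Python. The vocabulary, synonym table, and suffix list are toy data invented for illustration, and the function names are hypothetical. Nothing here reflects how Google actually implements these steps. Spelling correction leans on the standard library's `difflib.get_close_matches`, and the "stemmer" is a crude suffix stripper rather than a real stemming algorithm.

```python
import difflib

# Toy data, invented for illustration only.
VOCABULARY = ["automobile", "car", "driving", "driver", "drive"]
SYNONYMS = {"automobile": ["car"]}
SUFFIXES = ["ing", "er", "es", "e", "s"]


def correct_spelling(word):
    """Return the closest vocabulary word, or the word itself if none is close."""
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else word


def stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_term(word):
    """Return the corrected word, then words sharing its stem, then its synonyms."""
    corrected = correct_spelling(word.lower())
    root = stem(corrected)
    variants = [corrected]
    variants += [v for v in VOCABULARY if stem(v) == root and v != corrected]
    variants += [s for s in SYNONYMS.get(corrected, []) if s not in variants]
    return variants
```

With this toy data, `expand_term("automoble")` returns `["automobile", "car"]`, and `expand_term("driving")` returns `["driving", "driver", "drive"]`.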
A couple of white papers from Google and a newly published patent application explore some of the ways that Google might use machine translation to find synonyms for words to expand the search terms that you might use.
There are a few different ways that expanding queries using synonyms can be done.
1) Synonyms for a word might be found in a thesaurus where those synonyms have been identified by experts, or a lexical ontology (an organized vocabulary of words).
2) Synonyms might be identified from other search queries that are syntactically similar to the original query (that is, the ordering of, and relationships between, the words in the phrases are similar).
One challenge for those methods arises when a word has multiple potential synonyms with widely varying meanings. For example, in the query “How to ship a box,” the word “ship” could have synonyms such as “boat” and “send.”
If that query is expanded based upon the boat meaning, it might provide irrelevant search results to a searcher, who probably doesn’t expect to see search results related to fishing trawlers.
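The problem is easy to reproduce with a context-blind thesaurus lookup. The sketch below is only an illustration: the thesaurus entries and both function names are made up for this example.

```python
# Hypothetical thesaurus entries, invented for this example.
THESAURUS = {"ship": ["boat", "send"], "box": ["carton"]}


def naive_expand(query):
    """Expand every term with all of its thesaurus synonyms, ignoring context."""
    expanded = []
    for term in query.lower().split():
        expanded.append([term] + THESAURUS.get(term, []))
    return expanded


def to_boolean_query(groups):
    """Render the expansion as a simple OR-grouped query string."""
    parts = []
    for group in groups:
        if len(group) == 1:
            parts.append(group[0])
        else:
            parts.append("(" + " OR ".join(group) + ")")
    return " ".join(parts)
```

Calling `to_boolean_query(naive_expand("How to ship a box"))` gives `how to (ship OR boat OR send) a (box OR carton)`, which pulls in the unwanted “boat” sense along with the useful “send” sense.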
The Google patent application involves methods that are also explored in two papers from Google: Translating Queries into Snippets for Improved Query Expansion (pdf) and Statistical Machine Translation for Query Expansion in Answer Retrieval (pdf).
The patent application lists a couple of inventors who were also authors of those papers:
Machine Translation for Query Expansion
Invented by Stefan Riezler, Alexander L. Vasserman
Assigned to Google
US Patent Application 20080319962
Published December 25, 2008
Filed: March 17, 2008
Methods, systems and apparatus, including computer program products, for expanding search queries. One method includes receiving a search query, selecting a synonym of a term in the search query based on a context of occurrence of the term in the received search query, the synonym having been derived from statistical machine translation of the term, and expanding the received search query with the synonym and using the expanded search query to search a collection of documents.
Alternatively, another method includes receiving a request to search a corpus of documents, the request specifying a search query, using statistical machine translation to translate the specified search query into an expanded search query, the specified search query and the expanded search query being in the same natural language, and in response to the request, using the expanded search query to search a collection of documents.
Using Statistical Machine Translation (SMT) to Find Synonyms
The patent application goes into a good amount of detail on how Google might use statistical machine translation to translate a sequence of words from one language to another, to learn how words in different languages are related. If you want a detailed version of how statistical machine translation works, it’s worth looking at the patent filing for their description.
The Google Research Blog, in a post from 2006 titled Statistical machine translation live, provides a much simpler explanation:
Several research systems, including ours, take a different approach: we feed the computer with billions of words of text, both monolingual text in the target language, and aligned text consisting of examples of human translations between the languages. We then apply statistical learning techniques to build a translation model.
So, how does SMT help find synonyms?
The word “ship” in a particular context can be translated to another language the same way that the word “transport” can be. In that context, the word “ship” is synonymous with the word “transport”. So, our example above of a query such as “how to ship a box” might have the same translation as “how to transport a box.”
The search might be expanded to include both queries – “how to ship a box” as well as “how to transport a box.”
A machine translation system may also collect information about words in the same language, to learn about how those words might be related.
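One way to picture the idea of two words sharing a translation is to "pivot" through the other language: score each same-language candidate by how much translation probability it shares with the original word. The sketch below assumes a tiny hand-made table of probabilities p(foreign | english). The German words and all of the numbers are toy values chosen for illustration, not output from any real translation system, and the function names are hypothetical. Choosing among the candidates by context is a separate step that this sketch does not attempt.

```python
from collections import defaultdict

# Toy probabilities p(foreign | english), invented for this example.
P_FOREIGN_GIVEN_EN = {
    "ship": {"transportieren": 0.6, "Schiff": 0.4},
    "transport": {"transportieren": 1.0},
    "boat": {"Boot": 0.7, "Schiff": 0.3},
}


def invert(table):
    """Estimate p(english | foreign) by normalizing over the English words
    that share each foreign word (assumes a uniform prior over English words)."""
    totals = defaultdict(float)
    for translations in table.values():
        for foreign, prob in translations.items():
            totals[foreign] += prob
    inverted = defaultdict(dict)
    for english, translations in table.items():
        for foreign, prob in translations.items():
            inverted[foreign][english] = prob / totals[foreign]
    return inverted


def pivot_synonyms(word, table):
    """Score same-language candidates that share a translation with `word`:
    score(candidate) = sum over foreign f of p(f | word) * p(candidate | f)."""
    inverted = invert(table)
    scores = defaultdict(float)
    for foreign, p_f in table.get(word, {}).items():
        for candidate, p_e in inverted[foreign].items():
            if candidate != word:
                scores[candidate] += p_f * p_e
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With these toy numbers, `pivot_synonyms("ship", P_FOREIGN_GIVEN_EN)` ranks “transport” ahead of “boat,” because more of the probability mass for “ship” flows through the shared translation “transportieren.”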
Approaches for Training a Statistical Machine Translation Model
The first step is collecting a training set of words, possibly from a number of different sources, such as the following:
1) Looking at Question-Answer Pairs
Imagine looking at as many frequently asked questions pages as possible, and comparing how the same questions are answered differently (or similarly). Taking those question and answer pairs, and using them as a training body for statistical machine learning, might be helpful.
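As a rough illustration of what "training" on such pairs could look like, the sketch below runs a simplified version of IBM Model 1, a classic word-alignment model from the statistical machine translation literature, over a few made-up question and answer pairs. This is a stand-in chosen for brevity: nothing in this post says it is the model Google uses, the training data is invented, and the sketch omits the NULL source word of the full model.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(answer_word, question_word)], an approximation of
    p(answer_word | question_word), with expectation-maximization."""
    answer_vocab = {w for _, answer in pairs for w in answer}
    uniform = 1.0 / len(answer_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform distribution
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for question, answer in pairs:
            for a in answer:
                # E-step: share each answer word among the question words.
                norm = sum(t[(a, q)] for q in question)
                for q in question:
                    share = t[(a, q)] / norm
                    count[(a, q)] += share
                    total[q] += share
        # M-step: renormalize the expected counts per question word.
        t = defaultdict(float, {(a, q): c / total[q] for (a, q), c in count.items()})
    return t


def best_translation(t, question_word):
    """Return the answer word with the highest probability for a question word."""
    candidates = {a: p for (a, q), p in t.items() if q == question_word}
    return max(candidates, key=candidates.get)


# Invented (question tokens, answer tokens) pairs.
TOY_PAIRS = [
    (["ship", "box"], ["send", "package"]),
    (["ship", "letter"], ["send", "mail"]),
    (["wrap", "box"], ["tape", "package"]),
]
```

On this toy data, `best_translation(train_model1(TOY_PAIRS), "ship")` comes out as “send,” since “send” is the only answer word that appears every time “ship” does.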
2) Looking at Query and Snippet Pairs
Look at the search results for a query in a search engine, and the snippets of those results. Perhaps look even closer at the results that have been selected and viewed more frequently and/or longer by people who searched using those query terms (possibly indicating that those snippets are more relevant for the query term searched with).
Those query and snippet pairs might also be used as a training body for statistical machine learning. Other information may be considered as well: text from the documents themselves, anchor text in links pointing to those documents, and details about the words appearing in those results, such as whether they were used in the page title or are part of a string of text that is relevant to the query.
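Collecting those pairs could be as simple as filtering a log of search impressions. The sketch below assumes a hypothetical `Impression` record with click and viewing-time fields. The field names and both thresholds are invented for illustration and are not taken from the patent application.

```python
from dataclasses import dataclass


@dataclass
class Impression:
    """One result shown for a query (hypothetical log record)."""
    query: str
    snippet: str
    clicks: int
    avg_dwell_seconds: float


def select_training_pairs(impressions, min_clicks=5, min_dwell=30.0):
    """Keep (query, snippet) pairs whose results were selected often
    or viewed for a long time; both thresholds are arbitrary."""
    return [
        (imp.query, imp.snippet)
        for imp in impressions
        if imp.clicks >= min_clicks or imp.avg_dwell_seconds >= min_dwell
    ]


# Invented sample log.
SAMPLE_LOG = [
    Impression("how to ship a box", "Pack and send a box in three steps", 12, 45.0),
    Impression("how to ship a box", "Cargo ship schedules", 0, 2.0),
]
```

Here `select_training_pairs(SAMPLE_LOG)` keeps only the first snippet, since the second was never clicked and barely viewed.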
3) Looking at Phrase and Paraphrase Pairs
As with our examples above, “how to ship a box” and “how to transport a box,” a phrase and its paraphrase can be translated into the same phrase in another language, and that phrase might reasonably be translated back into either one.
Phrases and paraphrases might also be supplied manually by language experts. A body of synonyms and similar phrases might be collected from that approach.
A query such as “how to become a mason” might yield a translated search query of “how to be a bricklayer” using this approach.
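Once paraphrases like these have been collected, "translating" a query into an expanded query in the same language can be pictured as a table lookup. The sketch below assumes a small same-language phrase table has already been learned. The table entries, probabilities, threshold, and function names are all hypothetical, and a real system would handle multi-word phrases and reordering that this term-by-term version ignores.

```python
# Hypothetical same-language phrase table: term -> [(paraphrase, probability)].
PHRASE_TABLE = {
    "become": [("be", 0.6), ("turn into", 0.2)],
    "mason": [("bricklayer", 0.7), ("stoneworker", 0.2)],
}


def translate_query(query, table, threshold=0.5):
    """Replace each term with its most probable paraphrase
    when that paraphrase clears the (arbitrary) threshold."""
    out = []
    for term in query.lower().split():
        options = table.get(term, [])
        best = max(options, key=lambda option: option[1], default=None)
        if best is not None and best[1] >= threshold:
            out.append(best[0])
        else:
            out.append(term)
    return " ".join(out)


def expand_query(query, table):
    """Return the original query plus its translated form, if different."""
    translated = translate_query(query, table)
    original = query.lower()
    return [original] if translated == original else [original, translated]
```

With this table, `expand_query("how to become a mason", PHRASE_TABLE)` returns both “how to become a mason” and “how to be a bricklayer.”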
Using Context Maps with Synonyms
Synonyms might be found during a search, or they might be prepared beforehand and used with a context map that pays attention to words that might appear to the left and right of one of the words in a query phrase. The context map might be prepared before a search is ever conducted.
For example, with the query “how to tie a bow,” the left and right context of the word “tie” in that query is “how to” and “a bow.”
In the context map, the word “tie” may be associated with two synonyms, “equal” and “knot.” The word “knot” could be chosen as a synonym for “tie” since it also fits well within the context of “how to” and “a bow” found in the context map. The query might be expanded to something like [how to (tie or knot) a bow].
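A context map along these lines could be sketched as a lookup keyed on a term, with each entry recording a left context, a right context, and the synonym that fits between them. The layout of the map, the second entry for “equal,” the two-word context window, and the function name are all assumptions made for this example. The patent application's actual data structure may look quite different.

```python
# Hypothetical context map: term -> list of (left context, right context, synonym).
CONTEXT_MAP = {
    "tie": [
        ("how to", "a bow", "knot"),
        ("to", "the score", "equal"),
    ],
}


def expand_with_context(query, context_map, window=2):
    """Add a synonym to a term only when the words around it match
    a left/right context recorded for that synonym."""
    tokens = query.lower().split()
    parts = []
    for i, term in enumerate(tokens):
        left = " ".join(tokens[max(0, i - window):i])
        right = " ".join(tokens[i + 1:i + 1 + window])
        synonyms = [
            synonym
            for ctx_left, ctx_right, synonym in context_map.get(term, [])
            if ctx_left == left and ctx_right == right
        ]
        if synonyms:
            alternatives = " OR ".join([term] + synonyms)
            parts.append("(" + alternatives + ")")
        else:
            parts.append(term)
    return " ".join(parts)
```

For the example query, `expand_with_context("how to tie a bow", CONTEXT_MAP)` produces `how to (tie OR knot) a bow`. The “equal” entry is skipped because its recorded context does not match the words around “tie.”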
When a misspelling is entered into Google as part of a search query, the search engine will sometimes show a message at the top of the results asking if you meant the correct spelling. Other times Google will just show a mix of results for the misspelled version and the corrected version, expanding the query. And sometimes, Google will just show results from the corrected version of the word.
We don’t know for certain if Google is using stemming for query expansion, or if they are using synonyms for query expansion. But these are very real possibilities. If you search for a query that includes the word “automobile” and the word “car” produces very relevant results as well, then that kind of query expansion is very reasonable.
If you’re interested in how a search engine might expand the words used in a search, and how they might decide upon what words to use in that query expansion, it’s worth spending some time going through this patent application to learn how that kind of expansion might be done.
Added January 1, 2009:
Google does tell us that they use stemming on their Web Search Help page:
Word variations (stemming)
Google now uses stemming technology. Thus, when appropriate, it will search not only for your search terms, but also for words that are similar to some or all of those terms. If you search for pet lemur dietary needs, Google will also search for pet lemur diet needs, and other related variations of your terms. Any variants of your terms that were searched for will be highlighted in the snippet of text accompanying each result.