How Google May Expand Searches Using Synonyms for Words in Queries
When someone searches the Web, one of the challenges that they often face is using the right words in their search to find what they are looking for.
Search engines usually rank pages based upon how prominently terms from a searcher’s query appear on those pages, and if a searcher doesn’t use the right words in their search, they may miss the pages and the information that they would like to find.
For example, someone looking for web hosting in the City of Ft. Wayne may type the query [Web hosting Fort Wayne] into a search engine, and not see many pages about hosting in that location because the City is usually referred to as “Ft. Wayne” rather than “Fort Wayne.” I find myself frequently challenged by this kind of problem when looking for information about Washington, D.C., or the District of Columbia, or DC.
A patent granted to Google this week explores how the search engine might expand the search terms that searchers use to include synonyms in searches, to make it easier for searchers to locate information on the Web. In the Ft. Wayne example, this could mean that Google would look for pages on the Web that were relevant for both [web hosting Fort Wayne] and [web hosting Ft. Wayne].
The Fort Wayne example is taken from the patent, and the authors of the patent provide another example of a search query that someone looking for music for a video they are making might use in a search – [free loops for flash movie]. Chances are that most people offering music that can be used for free for videos are going to use the word “music” rather than “loops.” They may also use the word “animation” rather than “movie.” When that searcher types [free loops for flash movie] into Google’s search box, the search engine might not return pages that provide free music for flash animations because those pages don’t use the words “loops” or “movie,” or because those words appear only on pages that aren’t very prominent and don’t rank very well in Google for those terms.
We’re told by the inventors of the patent that as the number of terms in a query increases, this problem becomes more serious:
Thus, documents that satisfy a user’s information need may use different words than the query terms chosen by the user to express the concept of interest. Since search engines typically rate documents based on how prominently the user’s query terms are in the documents, this means that a search engine may not return the most relevant documents in such situations (since the most relevant documents may not contain the user’s query terms prominently, or at all).
This problem becomes progressively more serious as the number of terms in a query increases. For queries longer than three or four words, there is a strong likelihood that one of the words is not the best phrase to describe the user’s information need.
Synonyms and Context
One of the simpler ways for a search engine to find synonyms for the terms that people use in queries, and expand those queries, would be to come up with a thesaurus or database of synonyms, and look up the words in a query to identify possible synonyms. But there are some limitations to that approach. The most significant is that the meaning of a term often relies upon the context of how it is used.
For example, “music” is not usually a good synonym for “loops,” but it is a good synonym in the context of the example query above. Further, this case is sufficiently special that “music” is not listed as a synonym for “loop” in standard thesauruses; many other examples of contextually dependent non-traditional synonyms can be easily identified.
And even when conventional synonyms can be identified for a term, it can be difficult to identify which particular synonyms to use in the particular context of the query.
The patent presents a process for finding synonyms for words that appear in search queries, evaluating the quality of those synonyms within the context of a particular query, and using those synonyms to expand queries and return relevant pages to searchers.
It starts by finding queries that are alike, and performing tests upon the terms and phrases in those queries, while looking at information related to those queries, such as:
- The number or percentage of times both terms appeared in search queries within a certain amount of time.
- The number or percentage of times both terms appeared within a particular user search session.
- How similar the search results are for the original search query and for a search where a candidate synonym is substituted.
Once synonyms are found that might be good replacements within a query, the search engine might offer a modified query using the synonym as a search suggestion, or the revised query might be used to expand the scope of the search results presented to a searcher.
So, someone searching for [Web hosting Fort Wayne] might be shown a set of search results with a query suggestion at the top of the results with a link to results for [Web hosting Ft Wayne], or they might see a set of search results that includes pages that are good matches for both [Web hosting Fort Wayne] and [Web hosting Ft Wayne].
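As a rough illustration of that last step, here is a minimal Python sketch of how a query might be expanded once synonyms have been validated. It is not taken from the patent. The function name `expand_query` and the `validated` dictionary are assumptions made up for this example, and a real search engine would almost certainly do something more involved.

```python
import re

def expand_query(query, synonyms):
    """Return the original query plus one variant per validated synonym.

    `synonyms` maps a phrase to a list of replacements that were judged
    acceptable for this query's context (a hypothetical structure).
    """
    variants = [query]
    for phrase, alternatives in synonyms.items():
        # Match the phrase as whole words only, ignoring case.
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        if pattern.search(query):
            for alt in alternatives:
                # A function replacement avoids treating `alt` as a regex template.
                variants.append(pattern.sub(lambda match: alt, query))
    return variants

# Hypothetical validated synonym for the Fort Wayne example.
validated = {"Fort Wayne": ["Ft Wayne"]}
variants = expand_query("Web hosting Fort Wayne", validated)
print(variants)
```

The variants could then be offered as a query suggestion, or searched alongside the original query so that results for both appear together.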
The patent is:
Determining query term synonyms within query context
Invented by John Lamping and Steven Baker
Assigned to Google
US Patent 7,636,714
Granted December 22, 2009
Filed: March 31, 2005
A method is applied to search terms for determining synonyms or other replacement terms used in an information retrieval system. User queries are first sorted by user identity and session. For each user query, a plurality of pseudo-queries is determined, each pseudo-query derived from a user query by replacing a phrase of the user query with a token.
For each phrase, at least one candidate synonym is determined. The candidate synonym is a term that was used within a user query in place of the phrase, and in the context of a pseudo-query. The strength or quality of candidate synonyms is evaluated. Validated synonyms may be either suggested to the user or automatically added to user search strings.
How the Process Works
Someone enters a query at the search engine, and a set of pages which are relevant for the query are retrieved and ranked based upon their perceived relevance and importance.
The search engine then looks at the query terms, and might attempt to identify possible synonyms for words or phrases within that query from a list that might have been created from analyzing the search engine’s query logs.
To create that list, all queries received over a certain period of time might be reviewed, and potential (or candidate) synonyms may then be identified.
For example, the original query might have been [free loops for flash movie], and there might be previous queries within the log such as [free music for flash movie] that may be worth reviewing.
Or, query fragments with wildcard tokens within them might be used, such as [free * for flash movie].
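The wildcard idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the patent's implementation: the query log is a plain list of strings, phrases are at most two words long, and the names `pseudo_queries` and `candidate_synonyms` are invented for the example.

```python
from collections import defaultdict

WILDCARD = "*"  # placeholder token; the symbol is an assumption

def pseudo_queries(query, max_phrase_len=2):
    """Yield (pseudo_query, phrase) pairs, replacing each phrase with a token."""
    words = query.lower().split()
    for n in range(1, max_phrase_len + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            context = words[:i] + [WILDCARD] + words[i + n:]
            yield " ".join(context), phrase

def candidate_synonyms(query_log, max_phrase_len=2):
    """Group the phrases that filled the same wildcard slot in logged queries."""
    fillers = defaultdict(set)
    for query in query_log:
        for context, phrase in pseudo_queries(query, max_phrase_len):
            fillers[context].add(phrase)
    return fillers

# A tiny, made-up query log based on the examples in the patent.
log = [
    "free loops for flash movie",
    "free music for flash movie",
    "free loops for flash animation",
]
fillers = candidate_synonyms(log)
print(sorted(fillers["free * for flash movie"]))
print(sorted(fillers["free loops for flash *"]))
```

Phrases that share a slot, such as “loops” and “music” here, are only candidates at this point. They would still need to be tested before being treated as synonyms in that context.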
Information from the query logs about the queries with the candidate synonyms in them might then be analyzed.
For instance, how frequently has someone searching for [free loops for flash movie] then searched, within a short period of time, for [free music for flash movie] or [free loops for flash animation]?
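A simple way to picture that kind of test is to count how often a session containing the original query goes on to include the altered one. The Python sketch below assumes each session is an ordered list of query strings and treats a session as the “short period of time”. The function name `switch_rate` and the sample sessions are made up for illustration.

```python
def switch_rate(sessions, original, variant):
    """Fraction of sessions containing `original` where `variant` follows it."""
    with_original = 0
    switched = 0
    for queries in sessions:
        if original in queries:
            with_original += 1
            first = queries.index(original)
            # Only count the variant if it was issued after the original query.
            if variant in queries[first + 1:]:
                switched += 1
    return switched / with_original if with_original else 0.0

# Hypothetical sessions, each an ordered list of queries from one searcher.
sessions = [
    ["free loops for flash movie", "free music for flash movie"],
    ["free loops for flash movie", "flash tutorials"],
    ["free music for flash movie"],
    ["free loops for flash movie", "free loops for flash animation",
     "free music for flash movie"],
]
rate_music = switch_rate(sessions, "free loops for flash movie",
                         "free music for flash movie")
rate_animation = switch_rate(sessions, "free loops for flash movie",
                             "free loops for flash animation")
print(rate_music, rate_animation)
```

A higher rate suggests that searchers often treat the two queries as interchangeable, which is one signal that the swapped words might work as synonyms in that context.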
Other tests may be performed as well, such as estimating the probability that both queries would have a number of top search results in common. So, if a search for [free loops for flash movie] and a search for [free loops for flash animation] share a certain number of pages in the top 10 (or some other number), then “movie” and “animation” may be treated as good synonyms within the context of that query.
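The overlap test can be sketched like this in Python. The threshold of three shared pages, the function names, and the example.com URLs are all assumptions chosen for illustration, since the patent leaves the exact numbers open.

```python
def top_k_overlap(results_a, results_b, k=10):
    """Count the pages that appear in the top k results of both queries."""
    return len(set(results_a[:k]) & set(results_b[:k]))

def shares_enough_results(results_a, results_b, k=10, min_shared=3):
    """Apply a hypothetical threshold to the overlap count."""
    return top_k_overlap(results_a, results_b, k) >= min_shared

# Made-up ranked results for the two queries.
movie_results = [
    "example.com/a", "example.com/b", "example.com/c",
    "example.com/d", "example.com/e",
]
animation_results = [
    "example.com/c", "example.com/x", "example.com/a",
    "example.com/y", "example.com/e",
]
shared = top_k_overlap(movie_results, animation_results)
print(shared, shares_enough_results(movie_results, animation_results))
```

Comparing sets ignores the order of the results. A version that weighted pages by their position would be a reasonable refinement.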
The patent includes a number of examples of how synonyms might be selected for words that appear in queries, and is worth spending a good amount of time with if you’re interested in how a search engine like Google might expand search results for searchers to include those synonyms.
When I search for [district of columbia museums], the top result after local results is a page that doesn’t include the word “Columbia.” If I look at the cached copy of the page at Google, I am told that “Columbia” does appear within anchor text in links to the page, which may be why it shows up as the top result for my query. But, there are plenty of pages that are also good matches for the words I used to search with.
Is Google deciding that there are other words or phrases on that page that are synonyms for “district of columbia” such as “D.C.”, and modifying my search results to include that page?
While not conclusive evidence by any means, it is interesting that in the top search result (past the local results) for my query, the abbreviation “D.C.” is bolded as if it were one of my query terms. Google usually highlights query terms when they appear in search results using bold text to show searchers that the pages they are returning are relevant for the query used in a search.
There’s no mention in this patent that Google might highlight or display synonyms in bold text in search results if they are used to expand search results for a query, and the highlighting process used by Google is a separate process, but it is interesting that the search engine bolded the synonym for District of Columbia.
What does this mean for you as a searcher or as a site owner if Google is using this process?
For searchers, it might mean that Google may add pages to your search results based upon words it perceives as synonyms for words you used in your query. Search for something while including the words “District of Columbia” in your search, and you may also see pages that use “Washington, D.C.” or “D.C.” instead of “District of Columbia.”
For site owners, it could mean that if you target specific keyword phrases on your pages for searchers, other sites that use synonyms for some of the words in your chosen keyword phrases may also show up in the same search results as your pages.
Added – January 19, 2010 – An Official Google Blog post was just published which describes a recent change at Google on how Google handles synonyms, as well as the use of bold in search results to highlight those synonyms. The description sounds very much like the process above, with the use of synonyms determined in context.
Note that the author of that Official Google Blog post, Steven Baker, is one of the named inventors on this patent as well.
Matt Cutts also follows up with More info about synonyms at Google.
Google also published a patent filing which looks at synonyms in context, but also uses statistical language models to translate a query into another language and then back into the first language to attempt to find more than one phrase or term that may include synonyms within the same context. That approach and the one that I described above could be seen to be related in a number of ways. I describe it in the post: How a Search Engine Might Find Synonyms to Use to Expand Search Queries.