Google Tells Us About Ways to Rewrite Search Queries Using Synonyms
Search for the word “automobile” at Google, and the search engine might rewrite your search to include results for the word “car” as well since it is a synonym of the word automobile. Accidentally misspell the word “automobile,” and Google might automatically correct your spelling error and search for “automobile.”
Follow that up with a search for the word “driving.” Google could expand your query through a process called stemming, which looks at the root of the word (drive-) and adds common endings to it to come up with related words to include in the search. Those would be words such as “driving” and “driver.”
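To make the idea of stemming concrete, here is a minimal sketch in Python. It is not Google's stemmer: the suffix list, the `stem` and `expand_by_stem` functions, and the tiny vocabulary are all assumptions made up for illustration, and a production stemmer uses many more rules.

```python
# Hypothetical suffix list for illustration; real stemmers use far more rules.
SUFFIXES = ("ing", "ers", "er", "es", "ed", "e", "s")


def stem(word: str) -> str:
    """Strip the longest matching suffix, keeping at least three letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_by_stem(term: str, vocabulary: set[str]) -> set[str]:
    """Return every vocabulary word that shares a stem with the term."""
    root = stem(term)
    return {w for w in vocabulary if stem(w) == root}


# Made-up vocabulary standing in for the words a search index knows about.
vocabulary = {"drive", "driving", "driver", "drives", "drove", "car"}
print(sorted(expand_by_stem("driving", vocabulary)))
```

Note that a purely suffix-based approach like this one misses irregular forms such as “drove,” which is one reason stemming alone doesn't cover every useful variation of a query term.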
This kind of query expansion is aimed at providing searchers with better search results. But this method of expanding queries might not be in use yet (though it sometimes appears to be, for spelling corrections at least), and it might not happen in all searches.
Typical approaches to rewrite search queries include:
- Stemming of words
- Correction of spelling errors
- Augmenting search queries, for example by adding synonyms of words that occur in the original query
A couple of white papers from Google and a newly published patent application explore some of the ways that Google might use machine translation to find synonyms for words to expand the search terms that you might use.
There are a few different ways to rewrite search queries using synonyms.
1) Synonyms for a word might be found in a thesaurus, where those synonyms have been identified by experts, or in a lexical ontology (an organized vocabulary of words).
2) Synonyms might be identified from other search queries that are syntactically similar to the original query (that is, similar in the ordering of, and relationships between, the words in their phrases).
One challenge to those methods is when a word has multiple potential synonyms with widely varying meanings. For example, in the query “How to ship a box,” the word “ship” could have synonyms such as “boat” and “send.”
If that query is rewritten based upon the boat meaning, it might return irrelevant search results to a searcher who probably doesn’t expect to see search results related to fishing trawlers.
The Google patent application involves methods that are also explored in two papers from Google: Translating Queries into Snippets for Improved Query Expansion (pdf) and Statistical Machine Translation for Query Expansion in Answer Retrieval (pdf).
The patent application lists a couple of inventors who were also authors of those papers:
Machine Translation for Query Expansion
Invented by Stefan Riezler, Alexander L. Vasserman
Assigned to Google
US Patent Application 20080319962
Published December 25, 2008
Filed: March 17, 2008
Methods, systems, and apparatus, including computer program products, expand search queries. One method, for example, includes receiving a search query, selecting a synonym of a term in the search query based on a context of occurrence of the term in the received search query, the synonym having been derived from statistical machine translation of the term, and expanding the received search query with the synonym and using the expanded search query to search a collection of documents.
Alternatively, another method includes receiving a request to search a corpus of documents, the request specifying a search query, using statistical machine translation to translate the specified search query into an expanded search query, the specified search query and the expanded search query being in the same natural language, and in response to the request, using the expanded search query to search a collection of documents.
Using Statistical Machine Translation (SMT) to rewrite search queries
The patent application goes into a good amount of detail on how Google might use statistical machine translation to translate a sequence of words from one language to another, or to learn how words in different languages are related. So if you want a detailed version of how statistical machine translation works, it’s worth looking at the patent filing for their description.
The Google Research Blog, in a post from 2006 titled Statistical machine translation live, provides a much simpler explanation:
Several research systems, including ours, take a different approach: we feed the computer with billions of words of text, monolingual text in the target language, and aligned text consisting of examples of human translations between the languages. We then apply statistical learning techniques to build a translation model.
So, how does SMT help rewrite search queries?
The word “ship” in a particular context can be translated into another language the same way that “transport” can be. In that context, the word “ship” is synonymous with the word “transport.” So, our example above of a query such as “how to ship a box” might have the same translation as “how to transport a box.”
The search might be rewritten to include both queries – “how to ship a box” as well as “how to transport a box.”
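The idea that two phrases sharing a translation can be treated as synonyms is sometimes called pivoting. Below is a minimal sketch of it, not the method from the patent filing. The `PHRASE_TABLE` entries are toy, hand-written examples (a real phrase table would be learned from aligned text and carry probabilities), and the function names `paraphrases` and `rewrite` are hypothetical.

```python
from collections import defaultdict

# Toy, hand-written phrase table: English phrase -> pivot-language phrases.
# A real table would be learned from aligned text and would carry probabilities.
PHRASE_TABLE = {
    "ship a box": {"ein Paket versenden"},
    "transport a box": {"ein Paket versenden"},
    "sail a boat": {"ein Boot segeln"},
}


def paraphrases(phrase, table):
    """Find other phrases that share at least one translation with `phrase`."""
    pivot_index = defaultdict(set)
    for english, foreign_set in table.items():
        for foreign in foreign_set:
            pivot_index[foreign].add(english)
    found = set()
    for foreign in table.get(phrase, ()):
        found |= pivot_index[foreign]
    found.discard(phrase)
    return found


def rewrite(query, table):
    """Return the original query plus one variant per paraphrase found."""
    variants = [query]
    for phrase in table:
        if phrase in query:  # naive substring match, fine for a sketch
            for alternative in sorted(paraphrases(phrase, table)):
                variants.append(query.replace(phrase, alternative))
    return variants


print(rewrite("how to ship a box", PHRASE_TABLE))
```

Because the table is keyed on whole phrases rather than single words, “ship” only picks up the “transport” sense when it appears with “a box,” which is the point of using translation in context.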
A machine translation system may also collect information about words in the same language to learn how those words might be related.
Approaches for Training a Statistical Machine Translation Model
The first step is collecting a training set of words, possibly from many different sources, such as the following:
1) Looking at Question-Answer Pairs to rewrite search queries
Imagine looking at as many frequently asked questions pages as possible and comparing how the same questions are answered differently (or similarly). Taking question and answer pairs and using them as a training body for statistical machine learning might be helpful.
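As a rough illustration of what training on question and answer pairs could look like, here is a sketch of IBM Model 1, a classic word-alignment model from statistical machine translation. It is used here only as a simple stand-in; the patent filing doesn't confirm this specific model, and the three training pairs and the `train_model1` function are made up for the example.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[q][a] = P(answer word a | question word q) with EM."""
    answer_vocab = {a for _, answer in pairs for a in answer}
    uniform = 1.0 / len(answer_vocab)
    # Start from a uniform distribution over answer words.
    t = defaultdict(lambda: defaultdict(lambda: uniform))
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: spread each answer word's count over the question words.
        for question, answer in pairs:
            for a in answer:
                norm = sum(t[q][a] for q in question)
                for q in question:
                    share = t[q][a] / norm
                    count[q][a] += share
                    total[q] += share
        # M-step: renormalize the expected counts into probabilities.
        t = {q: {a: c / total[q] for a, c in row.items()}
             for q, row in count.items()}
    return t


# Made-up question/answer pairs, already split into words.
pairs = [
    ("ship box".split(), "send package".split()),
    ("ship letter".split(), "send envelope".split()),
    ("wrap box".split(), "tape package".split()),
]
table = train_model1(pairs)
best_for_ship = max(table["ship"], key=table["ship"].get)
print(best_for_ship)
```

With these toy pairs, the model learns to associate “ship” most strongly with “send,” because the two words keep showing up together while their neighbors change.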
2) Looking at Query and Snippet Pairs to rewrite search queries
Look at the search results for a query in a search engine and the snippets of those results. Perhaps look even closer at the results that have been selected and viewed more frequently and/or longer by people who searched using those query terms (possibly indicating that those snippets are more relevant for the query terms used).
Those query and snippet pairs might also be used as a training body for statistical machine learning. Information from the documents themselves, from anchor text in links pointing to those documents, and other information about words appearing in those results, such as whether they are used in the page title or are part of a string of text that is relevant to the query used, may also be considered.
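Here is a small sketch of how query and snippet pairs might be pulled from a search log and filtered by how often, and for how long, searchers viewed a result. Everything in it is an assumption made for illustration: the `LogEntry` fields, the thresholds, and the sample log entries are hypothetical and don't come from the patent filing.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    """One hypothetical row of a search log."""
    query: str
    snippet: str
    clicks: int
    avg_dwell_seconds: float


def training_pairs(log, min_clicks=5, min_dwell=30.0):
    """Keep (query, snippet) pairs for results that were clicked often
    and viewed for a while; both thresholds are made-up values."""
    return [
        (entry.query, entry.snippet)
        for entry in log
        if entry.clicks >= min_clicks and entry.avg_dwell_seconds >= min_dwell
    ]


# Made-up log entries for illustration only.
log = [
    LogEntry("how to ship a box", "Tips for sending a package by mail", 40, 75.0),
    LogEntry("how to ship a box", "Fishing trawlers for sale", 1, 3.0),
    LogEntry("how to ship a box", "Box dimensions chart", 12, 8.0),
]
print(training_pairs(log))
```

Only the first entry passes both thresholds here, so the fishing trawler snippet never becomes training data for the query.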
3) Looking at Phrase and Paraphrase Pairs to rewrite search queries
Like our examples above of “how to ship a box” and “how to transport a box,” these phrases can be translated into the same phrase in another language, and that phrase might be reasonably translated back into either one.
Phrases and paraphrases might also be supplied manually by language experts. A body of synonyms and similar phrases might be collected from that approach.
A query such as “how to become a mason” might yield a translated and rewritten search query of “how to be a bricklayer” using this approach.
Using Context Maps with Synonyms to Rewrite Search Queries
Synonyms might be found during a search, or they might be prepared beforehand and used with a context map that pays attention to words that might appear to the left and right of one of the words in a query phrase. The context map might be prepared before a search is ever conducted.
For example, with the query “how to tie a bow,” the left and right context of the word “tie” in that query is “how to” and “a bow.”
In the context map, the word “tie” may be associated with two synonyms, “equal” and “knot.” The word “knot” could be chosen as a synonym for “tie” since it also fits in well within the context found in the context map of “how to” and “a bow.” The question is rewritten to something like [how to (tie or knot) a bow].
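The context map lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the structure of `CONTEXT_MAP`, the two-word context window, and the `expand_query` function are hypothetical, and the patent filing doesn't spell out a data structure like this.

```python
# Hypothetical context map: term -> synonym -> set of (left, right) contexts
# in which that synonym is an acceptable substitute for the term.
CONTEXT_MAP = {
    "tie": {
        "knot": {("how to", "a bow"), ("how to", "a rope")},
        "equal": {("going to", "the score")},
    }
}


def expand_query(query, context_map, window=2):
    """Add synonyms whose stored context matches the words around a term."""
    words = query.split()
    expanded = []
    for i, word in enumerate(words):
        left = " ".join(words[max(0, i - window):i])
        right = " ".join(words[i + 1:i + 1 + window])
        matches = sorted(
            synonym
            for synonym, contexts in context_map.get(word, {}).items()
            if (left, right) in contexts
        )
        if matches:
            expanded.append("(" + " or ".join([word] + matches) + ")")
        else:
            expanded.append(word)
    return " ".join(expanded)


print(expand_query("how to tie a bow", CONTEXT_MAP))
# prints: how to (tie or knot) a bow
```

The same map produces a different synonym when the surrounding words change: `expand_query("going to tie the score", CONTEXT_MAP)` picks “equal” rather than “knot.”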
When a misspelling gets entered into Google as part of a search query, the search engine will sometimes show a message at the top of the results. This is what they refer to as a prompt. The search engine may ask if you meant the correct spelling. Google will also show a mix of results for the misspelled version and the corrected version, expanding the query. And sometimes, Google will show results from the corrected version of the word.
We don’t know for certain whether Google uses stemming or synonyms for query expansion, but these are real possibilities. So, for example, if you search for a query that includes the word “automobile” and the word “car” produces very relevant results, it is reasonable to rewrite search queries like that.
You may be interested in how a search engine might rewrite queries used in a search and how they might decide upon what words to use in that query expansion. It’s worth spending some time going through this patent application to learn how that rewriting might get done.
Added: January 1, 2009 –
Google does tell us that they use stemming on their Web Search Help page:
Word variations (stemming)
Google now uses stemming technology. Thus, when appropriate, it will search not only for your search terms but also for words that are similar to some or all of those terms. If you search for pet lemur dietary needs, Google will also search for pet lemur diet needs and other related variations of your terms. Any variants of your terms that get searched for will be highlighted in the snippet of text accompanying each result.
I’ve written a few posts about synonyms in search. Here are some of those:
- 2/19/2006 – Multi-Stage Query Processing at Google
- 5/25/2007 – Refining Queries Using a Local Category Synonym
- 12/29/2008 – How a Search Engine Might Use Synonyms to Rewrite Search Queries
- 1/23/2009 – Google to Expand Language Search and Shrink Our World?
- 6/29/2009 – Semantic Relations from Query Logs
- 12/22/2009 – Google Search Synonyms Can Get Found in Queries
- 1/19/2010 – Google Synonyms Update
- 1/27/2010 – Paid Search Results and Query Expansion using Synonyms and Related Concepts
- 2/16/2011 – More Ways Search Engine Synonyms Might Rewrite Queries
- 8/12/2013 – How Google May Substitute Query Terms with Co-Occurrence
- 9/27/2013 – The Google Hummingbird Patent?
- 12/8/2013 – How Google May Rewrite Queries
- 9/9/2013 – How Google May Reform Queries Based on Co-Occurrence in Query Sessions
- 10/15/2013 – Google’s Hummingbird Algorithm Ten Years Ago
- 12/21/2015 – How Google Might Make Better Synonym Substitutions Using Knowledge Base Categories
Last Updated July 4, 2019.
22 thoughts on “How a Search Engine Might Use Synonyms to Rewrite Search Queries”
Interesting. Yes, Google often does this, they definitely use synonyms, especially when the search retrieves a small set of results.
For this reason, I generally include as many synonyms as possible for keywords in my articles. If Google uses synonyms, then my article will have the upper hand because Google will be able to empirically verify the subject of my article – perhaps make a stronger case for my article than others – and match it to the query… Maybe. 🙂 Either way, it doesn’t hurt. lol.
I wonder if machine translation could return some perverse results in the case of hospice, which is widely misunderstood and therefore possibly often misrepresented in web content. Hospice sites themselves universally devote high-placed content to misconceptions about hospice, and stating those misconceptions in doing so.
This thought was sparked by an experience I had a couple weeks ago when I queried Princeton’s WordNet lexical database for “hospice” and found it associated with “hospital” and “medical,” which in some ways are antithetical to hospice in that they practice curative care rather than palliative care.
Some algorithm misled WordNet into making these off-target associations, no?
Thanks. I think including synonyms for words that you think are important on a page is a pretty good practice regardless of whether Google is presently using a method like the one described here. It’s something that you would do naturally in developing strong content for a page as well.
You raise a really good point. I went to Wordnet to try out some other terms, but the online version seems to be offline tonight.
Machine translation, and the use of synonyms to expand queries might lead to results that aren’t completely on target when it comes to the actual meanings of words. We can easily exchange the word automobile and car in a conversation, and likely not have any problems when it comes to the meanings of those words. But others aren’t so closely related.
Wordnet did get many of its synonyms from thesauruses, and if I look up hospice at thesaurus.com, it provides “hospital” as one synonym.
Wordnet is still a work in progress, and it appears to be partially funded by Google these days.
But the point that you bring up is a really good one – we have to take care not to confuse synonyms with definitions, and keep in mind that a synonym might be useful in expanding some queries, but potentially could also result in misleading searchers if used.
A “passenger vehicle” could also be considered a synonym for “car” or “automobile,” but planes, trains, bicycles, stage coaches, and boats are all passenger vehicles as well. Part of why I like looking at patent filings like this one is to keep an eye on things that could possibly be a problem, like hospital search results when someone is looking for a hospice.
Very interesting, thank you. While people point to the (sometimes laughable) inaccuracies caused by statistical machine analysis, in many cases this can be more accurate and consistent than human analysis. At Jinni we’ve used a combination to build the synonym vocabulary of our semantic movie search engine.
I’m ok with the spelling correction…but from an SEO standpoint the synonyms make life much more difficult – how can we guess which words Google will produce? From a searcher’s standpoint – synonyms may help produce a wider range of results.
Isn’t this LSI? I think all search engines have been using it for quite a while now…
I don’t know if you mentioned this, but there is a query in Google that goes like this: you add ~ in front of a word and Google returns synonyms. For example, if you write ~car, it will return several synonyms like cars, BMW and so on.
Thank you. Jinni looks interesting.
Good points. Query expansions with synonyms might make optimizing a single page for a single term more difficult, because better results associated with synonyms might be available. The solution might be to build higher quality pages that include synonyms as well.
Thanks for asking about Latent Semantic Indexing. This process is not LSI. There are some good resources on LSI that are worth looking at:
Latent Semantic Indexing – from people who originated LSI, wrote about it in white papers, and patented it.
SVD and LSI Tutorial 1: Understanding SVD and LSI
A tutorial on Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI), its advantages, applications and limitations. Covers LSI myths and misconceptions from search engine marketers.
If you want to learn the math behind LSI, and how it works, this set of tutorials explores the math and many of the myths behind latent semantic indexing.
I didn’t mention the use of the tilde (~) by Google, but they do tell us that it can be used to search for synonyms. I’m not sure if it follows the same processes as described in this patent application, or whether it works with a context map like this process does. I will be looking around to see if there’s more on how the tilde synonym finder works.
I am wondering if we are seeing the first instances of Google’s incorporation of the Orion algorithm purchased in 2006. Here not only linguistic relationships are exploited but also conceptual relationships.
I might be wrong but on the face of it, it seems this solution would homogenize search results and make it harder and harder to rank for keywords. It also seems this would help their ad revenue by driving up the demand for major keywords rather than having them more segmented like they are now.
Good question. When the Orion algorithm was purchased, the inventor made a statement that he expected his work on it to be completed in about 18 months. It’s been longer than that.
However, the algorithm was aimed at taking keywords that appeared on a page found in search results, and providing suggested search queries based upon those keywords. It’s a different concept than found in this patent filing.
You raise an interesting point. I’m not sure that it’s in the best interests of the search engines to lessen the number of potential results that appear in response to queries. If the quality of search results seems to diminish drastically, there’s a good chance that people will turn to other search engines.
Wow, thanks for this. The Google ~ query does not exactly search for synonyms…try, for example, typing “~mobile” and you’ll get Nokia! Nokia is not a synonym for mobile!
You’re right – the tilde (~) doesn’t always provide synonyms, or at least what we might consider synonyms.
While many people might think of “mobile” when you mention Nokia, it isn’t really a word that could be interchanged and produce the exact same meaning (or even a substantially similar meaning). A Nokia phone is usually a mobile phone, but a mobile phone doesn’t have to be a Nokia phone.
I used to do SEO a year or two ago and I wish I had this information at my disposal then. I have a current site that I paid someone to optimize, because I figured my practices were a little out of date since the industry is ever changing, especially with Google’s continually changing search algorithm. We use such keywords as boat buy and sell and I never really thought of using synonyms. I guess it makes sense. It’s kind of simple really. Anyway, I’m rambling. Thank you for a great post.
Do you have any articles regarding on-site vs. off-site SEO?
Thanks for your kind words.
There are some aspects of SEO that just don’t change, I think mainly because the goals of search really haven’t changed much – search engines want to provide pages to searchers that match the intents of their searches.
If you create pages that a search engine can crawl and index, provide content that searchers who want to find what you offer will find value in, and use words on your pages that they will search for and expect to see on your site, you’ve done most of what you want to do with SEO.
In researching keywords for your pages, it’s always been helpful to consider a range of words that your audience might use to find your pages, including synonyms. It was nice to see this patent describe how a search engine might look at those, but not a surprise.
I’ve written a number of posts about both on-site and off-site SEO, but not really anything that compares different ranking signals from either classification. I do believe that search engines will give more weight to pages where it sees on-page and off-page ranking signals together that reinforce each other.
Google values semantic content. The proof is that if your content appears in a variety of semantic content related to your subject, rather than as mere repetition of keywords, your content becomes more relevant not only for visitors but for search engines.
It makes sense to write content that is engaging, interesting, and informative, that convinces a visitor to return, and to look at other pages on a site as well. Providing rich content on your pages provides a search engine with more to index, and if a search engine is using the kind of expansion of queries described in this patent, it may help your pages rank well for a wider range of keywords that are related.
Any word on whether this was put into place? Also, I wonder whether Google would be using a stochastic algorithm to do this.
It appears that Google is using the query expansion/synonym approach described in this patent filing. See my post Google Synonyms Update. It does appear that Google is using an N-gram approach to identify the language of documents they find on the Web and to provide translation services in places like the latest release of Google Chrome, though the patents related to the statistical machine translation approach to query expansion and synonyms don’t provide much in the way of details.