How a Search Engine Might Find Synonyms to Use to Expand Search Queries

Search for the word “automobile” at Google, and the search engine might expand your search to include results for the word “car” as well, since it is a synonym of the word automobile. Accidentally misspell the word as “automoble” and Google might automatically correct your spelling error and search for “automobile.”

Follow that up with a search for the word “driving,” and Google could expand your query through a process called stemming: it looks at the root of the word (driv-) and adds common endings to it, so that the search includes such words as “driving” and “driver.”

This kind of query expansion is aimed at providing searchers with better search results. Google might not be expanding queries in all of these ways yet (though it sometimes appears to do so for spelling corrections, at least), and expansion might not happen in every search.

Typical approaches to query expansion include:

  • Stemming of words,
  • Correction of spelling errors, and
  • Augmenting search queries by doing things such as using synonyms of words that occur in the original query.
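The first two approaches in that list can be sketched in a few lines of Python. This is only an illustration, not a description of how Google does it: the vocabulary, the suffix list, and the function names are all made up for the example, and the spelling step simply leans on the standard library's `difflib.get_close_matches` rather than anything described in the patent filing.

```python
import difflib

# Hypothetical vocabulary and suffix list, chosen only for illustration.
VOCABULARY = ["automobile", "car", "driving", "driver", "drive"]
SUFFIXES = ["ing", "er", "es", "e", "s"]


def correct_spelling(word, vocabulary=VOCABULARY):
    """Return the closest known word, or the word itself if nothing is close."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else word


def stem(word):
    """Strip the first matching suffix to get a crude root (not a real stemmer)."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_by_stem(word, vocabulary=VOCABULARY):
    """Return the vocabulary words that share the query word's crude root."""
    root = stem(word)
    return sorted(w for w in vocabulary if stem(w) == root)


print(correct_spelling("automoble"))  # automobile
print(stem("driving"))                # driv
print(expand_by_stem("driving"))      # ['drive', 'driver', 'driving']
```

A production stemmer would use linguistic rules rather than a fixed suffix list, but the shape is the same: reduce the query word to a root, then pull in other words that reduce to the same root.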

A couple of white papers from Google and a newly published patent application explore some of the ways that Google might use machine translation to find synonyms for words to expand the search terms that you might use.

There are a few different ways that expanding queries using synonyms can be done.

1) Synonyms for a word might be found in a thesaurus where those synonyms have been identified by experts, or a lexical ontology (an organized vocabulary of words).

2) Synonyms might be identified from other search queries that are syntactically similar to the original query (that is, the ordering of, and relationships between, the words in the phrases are similar).

One challenge for those methods arises when a word has multiple potential synonyms with widely varying meanings. For example, in the query “How to ship a box,” the word “ship” could have synonyms such as “boat” and “send.”

If that query is expanded based upon the boat meaning, it might provide very irrelevant search results to a searcher, who probably doesn’t expect to see search results related to fishing trawlers.

The Google patent application involves methods that are also explored in two papers from Google: Translating Queries into Snippets for Improved Query Expansion (pdf), and Statistical Machine Translation for Query Expansion in Answer Retrieval (pdf).

The patent application lists a couple of inventors who were also authors of those papers:

Machine Translation for Query Expansion
Invented by Stefan Riezler, Alexander L. Vasserman
Assigned to Google
US Patent Application 20080319962
Published December 25, 2008
Filed: March 17, 2008

Abstract

Methods, systems and apparatus, including computer program products, for expanding search queries. One method includes receiving a search query, selecting a synonym of a term in the search query based on a context of occurrence of the term in the received search query, the synonym having been derived from statistical machine translation of the term, and expanding the received search query with the synonym and using the expanded search query to search a collection of documents.

Alternatively, another method includes receiving a request to search a corpus of documents, the request specifying a search query, using statistical machine translation to translate the specified search query into an expanded search query, the specified search query and the expanded search query being in the same natural language, and in response to the request, using the expanded search query to search a collection of documents.

Using Statistical Machine Translation (SMT) to find Synonyms

The patent application goes into a good amount of detail on how Google might use statistical machine translation to translate a sequence of words from one language to another, to learn how words in different languages are related. If you want a detailed version of how statistical machine translation works, it’s worth looking at the patent filing for their description.

The Google Research Blog, in a post from 2006 titled Statistical machine translation live, provides a much simpler explanation:

Several research systems, including ours, take a different approach: we feed the computer with billions of words of text, both monolingual text in the target language, and aligned text consisting of examples of human translations between the languages. We then apply statistical learning techniques to build a translation model.

So, how does SMT help find synonyms?

The word “ship” in a particular context can be translated to another language the same way that the word “transport” can be. In that context, the word “ship” is synonymous with the word “transport”. So, our example above of a query such as “how to ship a box” might have the same translation as “how to transport a box.”

The search might be expanded to include both queries – “how to ship a box” as well as “how to transport a box.”
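The idea that two words are synonyms in a context when they share a translation can be sketched with a toy lookup table. Everything here is an assumption made for illustration: the table, the “context” hint, and the foreign tokens (`F_SEND`, `F_VESSEL`) are placeholders rather than real translations, and a real system would learn these relationships statistically instead of reading them from a hand-written dictionary.

```python
from collections import defaultdict

# Toy "translation table": (English word, context hint) -> foreign token.
# The foreign tokens are placeholders, not real translations.
TOY_TRANSLATIONS = {
    ("ship", "box"): "F_SEND",
    ("transport", "box"): "F_SEND",
    ("send", "box"): "F_SEND",
    ("ship", "harbor"): "F_VESSEL",
    ("boat", "harbor"): "F_VESSEL",
}


def build_pivot_index(translations):
    """Group English words by the foreign token they map to in a given context."""
    index = defaultdict(set)
    for (word, context), foreign in translations.items():
        index[(foreign, context)].add(word)
    return index


def synonyms_in_context(word, context, translations=TOY_TRANSLATIONS):
    """Return other words that share this word's translation in this context."""
    foreign = translations.get((word, context))
    if foreign is None:
        return []
    index = build_pivot_index(translations)
    return sorted(index[(foreign, context)] - {word})


def expand_query(query, target, context):
    """Return the original query plus one variant per contextual synonym."""
    tokens = query.split()
    variants = [query]
    for synonym in synonyms_in_context(target, context):
        variants.append(" ".join(synonym if t == target else t for t in tokens))
    return variants


print(expand_query("how to ship a box", "ship", "box"))
# ['how to ship a box', 'how to send a box', 'how to transport a box']
```

Because the lookup is keyed on context, “boat” never shows up as a synonym for “ship” in the box query, which is the point of choosing synonyms by context rather than from a flat thesaurus.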

A machine translation system may also collect information about words in the same language, to learn about how those words might be related.

Approaches for Training a Statistical Machine Translation Model

The first step is collecting a training set of words, possibly from a number of different sources, such as the following:

1) Looking at Question-Answer Pairs

Imagine looking at as many frequently asked questions pages as possible, and comparing how the same questions are answered differently (or similarly). Taking those question-and-answer pairs and using them as a training body for statistical machine learning might be helpful.

2) Looking at Query and Snippet Pairs

Look at the search results for a query in a search engine, and the snippets of those results. Perhaps look even closer at the results that have been selected and viewed more frequently and/or longer by people who searched using those query terms (possibly indicating that those snippets are more relevant for the query term searched with).

Those query and snippet pairs might also be used as a training body for statistical machine learning. Text from the documents themselves, from anchor text in links pointing to those documents, and other information about words appearing in those results such as whether they were used in the page title, or if they are part of a string of text that is relevant to the query used may also be considered.
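To make the idea of training on query and snippet pairs a little more concrete, here is a minimal sketch that treats queries as the “source language” and snippets as the “target language.” It uses IBM Model 1, a standard textbook word-alignment model, as a stand-in; the patent filing does not say that this particular model is what Google uses. The three training pairs and the function names are invented for the example.

```python
from collections import defaultdict

# Invented (query tokens, snippet tokens) pairs standing in for real logs.
PAIRS = [
    (["ship", "box"], ["send", "package"]),
    (["ship", "letter"], ["send", "mail"]),
    (["open", "box"], ["unwrap", "package"]),
]


def train_model1(pairs, iterations=10):
    """Estimate t[(s, q)] = P(snippet word s | query word q) with EM."""
    snippet_vocab = {s for _, snippet in pairs for s in snippet}
    uniform = 1.0 / len(snippet_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform guess
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for query, snippet in pairs:
            for s in snippet:
                # E-step: share each snippet word among the query words.
                norm = sum(t[(s, q)] for q in query)
                for q in query:
                    fraction = t[(s, q)] / norm
                    count[(s, q)] += fraction
                    total[q] += fraction
        # M-step: renormalize the expected counts per query word.
        t = defaultdict(
            float, {(s, q): c / total[q] for (s, q), c in count.items()}
        )
    return t


def most_likely_snippet_word(t, query_word):
    """Return the snippet word with the highest probability for a query word."""
    candidates = {s: p for (s, q), p in t.items() if q == query_word}
    return max(candidates, key=candidates.get)


model = train_model1(PAIRS)
print(most_likely_snippet_word(model, "ship"))  # send
print(most_likely_snippet_word(model, "box"))   # package
```

Even with three pairs, the model learns that “ship” lines up with “send” because they keep appearing together while the other words change. With real click data, pairs whose snippets were selected more often could be given more weight.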

3) Look at phrase and paraphrase pairs

Like our examples above of “how to ship a box,” and “how to transport a box,” these phrases can be translated into the same term in another language, and that term might be reasonably translated back into either phrase.

Phrases and paraphrases might also be supplied manually by language experts. A body of synonyms and similar phrases might be collected from that approach.

A query such as “how to become a mason” might yield a translated search query of “how to be a bricklayer” using this approach.

Using Context Maps with Synonyms

Synonyms might be found during a search, or they might be prepared beforehand and used with a context map that pays attention to words that might appear to the left and right of one of the words in a query phrase. The context map might be prepared before a search is ever conducted.

For example, with the query “how to tie a bow,” the left and right context of the word “tie” in that query is “how to” and “a bow.”

In the context map, the word “tie” may be associated with two synonyms, “equal” and “knot.” The word “knot” could be chosen as a synonym for “tie” since it also fits in well within the context found in the context map of “how to” and “a bow.” The query might be expanded to something like [how to (tie or knot) a bow].
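A context map lookup like the one just described might look something like the following sketch. The data structure, the two-word context window, and the second entry’s context (“the teams” / “the game”) are assumptions made for the example; the patent filing describes the idea of left and right contexts but this is not its implementation.

```python
# Hypothetical context map: word -> list of (left context, right context, synonym).
# The "equal" entry's context is invented for illustration.
CONTEXT_MAP = {
    "tie": [
        ("how to", "a bow", "knot"),
        ("the teams", "the game", "equal"),
    ],
}


def select_synonyms(tokens, index, context_map=CONTEXT_MAP, window=2):
    """Return synonyms whose stored left/right context matches the query."""
    word = tokens[index]
    left = " ".join(tokens[max(0, index - window): index])
    right = " ".join(tokens[index + 1: index + 1 + window])
    return [
        synonym
        for stored_left, stored_right, synonym in context_map.get(word, [])
        if stored_left == left and stored_right == right
    ]


def expand(query, context_map=CONTEXT_MAP):
    """Rewrite a query so matched words become (word or synonym) groups."""
    tokens = query.split()
    parts = []
    for i, token in enumerate(tokens):
        synonyms = select_synonyms(tokens, i, context_map)
        if synonyms:
            parts.append("(" + " or ".join([token] + synonyms) + ")")
        else:
            parts.append(token)
    return " ".join(parts)


print(expand("how to tie a bow"))   # how to (tie or knot) a bow
print(expand("how to tie a shoe"))  # how to tie a shoe
```

Since the map can be built ahead of time, the work at query time is just a lookup, which fits the point above that the context map might be prepared before a search is ever conducted.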

Conclusion

When a misspelling is entered into Google as part of a search query, the search engine will sometimes show a message at the top of the results asking if you meant the correct spelling. Other times Google will just show a mix of results for the misspelled version and the corrected version, expanding the query. And sometimes, Google will just show results from the corrected version of the word.

We don’t know for certain if Google is using stemming for query expansion, or if they are using synonyms for query expansion. But these are very real possibilities. If you search for a query that includes the word “automobile” and the word “car” produces very relevant results as well, then that kind of query expansion is very reasonable.

If you’re interested in how a search engine might expand the words used in a search, and how they might decide upon what words to use in that query expansion, it’s worth spending some time going through this patent application to learn how that kind of expansion might be done.

Added: January 1, 2009 -

Google does tell us that they use stemming on their Web Search Help page:

Word variations (stemming)

Google now uses stemming technology. Thus, when appropriate, it will search not only for your search terms, but also for words that are similar to some or all of those terms. If you search for pet lemur dietary needs, Google will also search for pet lemur diet needs, and other related variations of your terms. Any variants of your terms that were searched for will be highlighted in the snippet of text accompanying each result.


23 thoughts on “How a Search Engine Might Find Synonyms to Use to Expand Search Queries”

  1. Interesting. Yes, Google often does this, they definitely use synonyms, especially when the search retrieves a small set of results.

    For this reason, I generally include as many synonyms as possible for keywords in my articles. If Google uses synonyms, then my article will have the upper hand because Google will be able to empirically verify the subject of my article – perhaps make a stronger case for my article than others – and match it to the query… Maybe. :-) Either way, it doesn’t hurt. lol.

  2. I wonder if machine translation could return some perverse results in the case of hospice, which is widely misunderstood and therefore possibly often misrepresented in web content. Hospice sites themselves universally devote high-placed content to misconceptions about hospice, and stating those misconceptions in doing so.

    This thought was sparked by an experience I had a couple weeks ago when I queried Princeton’s WordNet lexical database for “hospice” and found it associated with “hospital” and “medical,” which in some ways are antithetical to hospice in that they practice curative care rather than palliative care.

    Some algorithm misled WordNet into making these off-target associations, no?

  3. Hi Shirley,

    Thanks. I think including synonyms for words that you think are important on a page is a pretty good practice regardless of whether Google is presently using a method like the one described here. It’s something that you would do naturally in developing strong content for a page as well.

  4. Hi Michael,

    You raise a really good point. I went to Wordnet to try out some other terms, but the online version seems to be offline tonight.

    Machine translation, and the use of synonyms to expand queries might lead to results that aren’t completely on target when it comes to the actual meanings of words. We can easily exchange the word automobile and car in a conversation, and likely not have any problems when it comes to the meanings of those words. But others aren’t so closely related.

    Wordnet did get many of its synonyms from thesauruses, and if I look up hospice at thesaurus.com, it provides “hospital” as one synonym.

    Wordnet is still a work in progress, and it appears to be partially funded by Google these days.

    But the point that you bring up is a really good one – we have to take care not to confuse synonyms with definitions, and keep in mind that a synonym might be useful in expanding some queries, but potentially could also result in misleading searchers if used.

    A “passenger vehicle” could also be considered a synonym for “car” or “automobile,” but planes, trains, bicycles, stage coaches, and boats are all passenger vehicles as well. Part of why I like looking at patent filings like this one is to keep an eye on things that could possibly be a problem, like hospital search results when someone is looking for a hospice.

  5. Very interesting, thank you. While people point to the (sometimes laughable) inaccuracies caused by statistical machine analysis, in many cases this can be more accurate and consistent than human analysis. At Jinni we’ve used a combination to build the synonym vocabulary of our semantic movie search engine.

  6. I’m ok with the spelling correction…but from an SEO standpoint the synonyms make life much more difficult – how can we guess which words Google will produce? From a searcher’s standpoint – synonyms may help produce a wider range of results.

  7. Isn’t this LSI? I think all search engines have been using it for quite a while now…

    I don’t know if you mentioned this, but there is a query operator in Google that goes like this: you add ~ in front of a word and Google returns synonyms. For example, if you write ~car it will return several synonyms like cars, BMW and so on.

  8. Hi Phoebe

    Thank you. Jinni looks interesting.

    Hi James,

    Good points. Query expansions with synonyms might make optimizing a single page for a single term more difficult, because better results associated with synonyms might be available. The solution might be to build higher quality pages that include synonyms as well.

    Hi Dare,

    Thanks for asking about Latent Semantic Indexing. This process is not LSI. There are some good resources on LSI that are worth looking at:

    Latent Semantic Indexing – from people who originated LSI, wrote about it in white papers, and patented it.

    SVD and LSI Tutorial 1: Understanding SVD and LSI
    A tutorial on Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI), its advantages, applications and limitations. Covers LSI myths and misconceptions from search engine marketers.

    If you want to learn the math behind LSI, and how it works, this set of tutorials explores the math and many of the myths behind latent semantic indexing.

    I didn’t mention the use of the tilde (~) by Google, but they do tell us that it can be used to search for synonyms. I’m not sure if it follows the same processes as described in this patent application, or if it works with a context map like this process does. I will be looking around to see if there’s more on how the tilde synonym finder works.

  9. I am wondering if we are seeing the first instances of Google’s incorporation of the Orion algorithm purchased in 2006. Here not only linguistic relationships are exploited but also conceptual relationships.

  10. I might be wrong but on the face of it, it seems this solution would homogenize search results and make it harder and harder to rank for keywords. It also seems this would help their ad revenue by driving up the demand for major keywords rather than having them more segmented like they are now.

  11. Hi Marianne,

    Good question. When the Orion algorithm was purchased, the inventor made a statement that he expected his work on it to be completed in about 18 months. It’s been longer than that.

    However, the algorithm was aimed at taking keywords that appeared on a page found in search results, and providing suggested search queries based upon those keywords. It’s a different concept than found in this patent filing.

    Hi Mike,

    You raise an interesting point. I’m not sure that it’s in the best interests of the search engines to lessen the number of potential results that appear in response to queries. If the quality of search results seems to diminish drastically, there’s a good chance that people will turn to other search engines.

  12. Wow, thanks for this. The Google ~ query does not exactly search for synonyms…try for example typing “~mobile” and you’ll get Nokia! Nokia is not a synonym for mobile!

  13. Hi Mark,

    You’re right – the tilde (~) doesn’t always provide synonyms, or at least what we might consider synonyms.

    While many people might think of “mobile” when you mention Nokia, it isn’t really a word that could be interchanged and produce the exact same meaning (or even a substantially similar meaning). A Nokia phone is usually a mobile phone, but a mobile phone doesn’t have to be a Nokia phone.

  14. Great article!

    I used to do SEO a year or two ago and I wish I had this information at my disposal then. I have a current site that I paid someone to optimize, because I figured my practices were a little out of date since the industry is ever changing, especially with Google’s continually changing search algorithm. We use such keywords as boat buy and sell and I never really thought of using synonyms. I guess it makes sense. It’s kind of simple really. Anyway, I’m rambling. Thank you for a great post.

    Do you have any articles regarding on-site vs. off-site SEO?

  15. Hi Raine,

    Thanks for your kind words.

    There are some aspects of SEO that just don’t change, I think mainly because the goals of search really haven’t changed much – search engines want to provide pages to searchers that match the intents of their searches.

    If you create pages that a search engine can crawl and index, provide content that searchers who want to find what you offer will find value in, and use words on your pages that they will search for and expect to see on your site, you’ve done most of what you want to do with SEO.

    In researching keywords for your pages, it’s always been helpful to consider a range of words that your audience might use to find your pages, including synonyms. It was nice to see this patent describe how a search engine might look at those, but not a surprise.

    I’ve written a number of posts about both on-site and off-site SEO, but not really anything that compares different ranking signals from either classification. I do believe that search engines will give more weight to pages where it sees on-page and off-page ranking signals together that reinforce each other.

  16. Google values semantic content. The proof is that if your content uses a variety of terms semantically related to your subject, rather than mere repetition of keywords, it becomes more relevant not only for visitors but for search engines.

  17. It makes sense to write content that is engaging, interesting, and informative, that convinces a visitor to return, and to look at other pages on a site as well. Providing rich content on your pages provides a search engine with more to index, and if a search engine is using the kind of expansion of queries described in this patent, it may help your pages rank well for a wider range of keywords that are related.

  18. Any word on whether this was put into place? Also, I wonder whether Google would be using a stochastic algorithm to do this.

  19. Hi Anthony,

    It appears that Google is using the query expansion/synonym approach described in this patent filing. See my post Google Synonym Update. It does appear that Google is using an N-gram approach to identify the language of documents they find on the Web and to provide translation services in places like the latest release of Google Chrome, though the patents related to the statistical machine translation approach to query expansion and synonyms don’t provide much in the way of details.
