More Ways a Search Engine Might Identify Synonyms to Expand Queries With

A couple of years back, Google was granted a patent on an approach to identifying synonyms by looking at and comparing queries that searchers used to find information. The patent was Determining query term synonyms within query context, and I covered it in my post How Google May Expand Searches Using Synonyms for Words in Queries.

A month or so after that patent was granted and I wrote my post, Google researcher Steven Baker published a blog post at the Official Google Blog titled Helping computers understand language, where he announced that Google would start including synonyms for query terms in search results when the search engine thought that the synonym was a good match for a query term.

A mechanic working on a car (or auto).
Car Mechanic or Auto Mechanic or Both?

The Google blog post included some examples of when replacing a query term with a synonym might be useful (such as replacing “words” with “lyrics” when someone searches for [song words]), and when it might not be as helpful (such as replacing “pictures” with “photos” on a search for [motion pictures]). Steven Baker not only wrote the blog post, but he was also one of the inventors listed on the query synonyms patent.

Google was granted a new patent this week on synonyms, and Steven Baker is again one of the named inventors on the patent.

At the beginning of the patent’s description, some of the difficulties mentioned in Steven Baker’s blog post are repeated: it is hard to find helpful synonyms that can replace other words in a query and still bring meaningful results to a searcher.

The earlier patent described some of the ways that Google might identify synonyms from query logs, and this patent explores some additional approaches that look at how those words are used within pages online.

For example, the synonym identification method might look at how frequently words tend to appear together in documents on the Web.
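
As a rough illustration only (the post doesn't give a formula for this), a co-occurrence frequency could be approximated as the share of documents that contain both words. The function name and the tokenized-document input in this Python sketch are my own assumptions, not something taken from the patent:

```python
def cooccurrence_frequency(docs, w1, w2):
    """Fraction of documents that contain both w1 and w2.

    docs is a list of documents, each already split into a list of
    lowercase tokens (a simplifying assumption for this sketch).
    """
    if not docs:
        return 0.0
    both = sum(1 for tokens in docs if w1 in tokens and w2 in tokens)
    return both / len(docs)


docs = [
    ["car", "repair"],
    ["auto", "car"],
    ["ice", "cream"],
    ["car", "auto", "shop"],
]
print(cooccurrence_frequency(docs, "car", "auto"))  # 0.5
```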

It might also look at how close together those words appear within those documents. It seems that words that frequently appear near each other on a page, but not too close, may stand a good chance of being synonyms.

So this process would look to see if those words that tend to co-occur in the same documents frequently tend to appear so close together that they might appear within the same sentence or phrase. If so, they might not be synonyms because, as we’re told in the patent, “synonyms rarely occur in the same sentence or phrase.”

For example, the words “ice” and “cream” tend to show up frequently in the same documents, but they often tend to be adjacent to each other in the phrase “ice cream.”

A closeness score might be calculated by dividing the probability that the words are very close to each other by the probability that the words are near each other.

Words that are “very close” to each other might be less than a certain number of words apart, such as 4 words. Words that are “near” each other might be within a certain other number of words, such as 100. Words that appear within the same documents frequently and are near each other, but not too close, might be synonyms.
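
To make the ratio a little more concrete, here is a minimal Python sketch of one way a closeness score like that could be computed. The thresholds of 4 and 100 words come from the examples above. Everything else is my own assumption rather than the patent's method: the function name, the way word pairs are counted by token position, and the shortcut of dividing raw counts (which equals the ratio of the two probabilities when both are measured over the same set of word pairs).

```python
from itertools import product


def closeness_score(docs, w1, w2, very_close=4, near=100):
    """Ratio of 'very close' pairings to 'near' pairings of w1 and w2.

    docs is a list of documents, each a list of tokens. A pairing is
    'very close' when the words are fewer than `very_close` positions
    apart, and 'near' when they are at most `near` positions apart.
    Returns None if the two words never appear near each other.
    """
    very_close_count = 0
    near_count = 0
    for tokens in docs:
        positions1 = [i for i, t in enumerate(tokens) if t == w1]
        positions2 = [i for i, t in enumerate(tokens) if t == w2]
        for i, j in product(positions1, positions2):
            distance = abs(i - j)
            if distance <= near:
                near_count += 1
                if distance < very_close:
                    very_close_count += 1
    if near_count == 0:
        return None
    return very_close_count / near_count


# Adjacent words, as in the phrase "ice cream", score high
print(closeness_score([["ice", "cream"]], "ice", "cream"))  # 1.0

# Words that are near but not very close score low
doc = ["car", "shops", "often", "advertise", "as", "auto", "specialists"]
print(closeness_score([doc], "car", "auto"))  # 0.0
```

Under this sketch, a high score suggests the words tend to sit in the same phrase (and so probably aren't synonyms), while a low score for a frequently co-occurring pair leaves them as synonym candidates.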

This system might also look at correlations between the appearance of certain words within the content of a page, and words in page titles and in anchor text pointing towards those pages.

While co-occurrence and closeness of words are two factors to be considered, another signal might involve looking at how the words are actually used on the page, referred to in the patent as “word forms.”

For instance, if the author of a page writes about “car mechanics” and then refers to “auto mechanics” on the same page, that usage might provide a clue that “car” and “auto” could be synonyms. The patent also tells us that the search engine might look at query logs to see if searchers sometimes replace one of the words with the other during the same query session.

So, if the words “car” and “auto” tend to appear frequently on the same page, and people searching for [car mechanics] also frequently change that search to [auto mechanics] during the same query session, then there’s a strong probability that the terms are synonyms.
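
The query session signal could be sketched along these lines in Python. This is only an illustration under my own assumptions: the session data format, the function name, and the rule that a substitution means two consecutive queries of the same length that differ in exactly one word are all hypothetical, and none of them are described in the patent.

```python
from collections import Counter


def substitution_counts(sessions):
    """Count single-word swaps between consecutive queries in a session.

    sessions is a list of sessions, each a list of query strings in the
    order they were issued. Word pairs are stored in sorted order so
    that car -> auto and auto -> car are counted together.
    """
    counts = Counter()
    for queries in sessions:
        for previous, current in zip(queries, queries[1:]):
            a, b = previous.split(), current.split()
            if len(a) != len(b):
                continue
            diffs = [(x, y) for x, y in zip(a, b) if x != y]
            if len(diffs) == 1:
                counts[tuple(sorted(diffs[0]))] += 1
    return counts


sessions = [
    ["car mechanics", "auto mechanics"],
    ["auto mechanics", "car mechanics"],
    ["car mechanics", "car insurance quotes"],
]
print(substitution_counts(sessions)[("auto", "car")])  # 2
```

A pair with a high substitution count that also co-occurs often on the same pages would be a stronger synonym candidate than a pair supported by only one of those signals.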

The patent, which provides much more detail, is at:

Document-based synonym generation
Invented by Oleksandr Grushetskyy and Steven D. Baker
Assigned to Google
US Patent 7,890,521
Granted February 15, 2011
Filed: February 7, 2008

Abstract

One embodiment of the present invention provides a system that automatically generates synonyms for words from documents. During operation, this system determines co-occurrence frequencies for pairs of words in the documents. The system also determines closeness scores for pairs of words in the documents, wherein a closeness score indicates whether a pair of words are located so close to each other that the words are likely to occur in the same sentence or phrase.

Finally, the system determines whether pairs of words are synonyms based on the determined co-occurrence frequencies and the determined closeness scores. While making this determination, the system can additionally consider correlations between words in a title or an anchor of a document and words in the document as well as word-form scores for pairs of words in the documents.


30 thoughts on “More Ways a Search Engine Might Identify Synonyms to Expand Queries With”

  1. I hope I would not be seeing “online scream” when I search for “online rant”. Insightful post. Thanks.

  2. Hi Bill,
    I don’t know how many patents have been granted to Google, 1000 or 1000000?
    I’m not crazy about it. If you try to understand how to incorporate these patents into your SEO strategy… Ouch.
    You can end up in a psychiatric hospital. I always try to be logical. Knowledge of consumer online behavior is more important than any Google patents.

  3. I first saw this when I Google’d ecommerce seo and I saw ecommerce optimization bolded in one of the SERPS. Google obviously knew what I was talking about even though I didn’t spell it out. This is a very interesting turn in the SEO world. Get ready to adjust your Title Tags.

  4. It’s kind of confusing the way they find out what words could be synonyms of each other, and I think the process used is kind of unreliable, but I’m not the expert so we’ll just have to wait and see.

  5. Hi Laurent,

    I agree completely. The n-gram approach could be very useful in a number of ways.

    One of them is in creating statistical language models for a number of languages, and then translating a phrase into one language and then back into the original language. If we translate “car mechanic” from English to French, and then translate it back into English, we might see that both “Car Mechanic” and “Auto Mechanic” are valid translations back into English. That might give us a hint that “car” and “auto” are synonyms.

    Google has a patent on that approach which I wrote about in How a Search Engine Might Find Synonyms to Use to Expand Search Queries

  6. Hi Manuel,

    It was interesting that when the Official Google Blog announcement came out, we were told that we might sometimes see some bad results like the one that you mention, but only in a very small percentage of searches. I wondered at the time what kinds of steps Google might take to try to solve that problem.

    When I saw this patent, I wondered if it was in response to those bad synonyms – another set of signals to look at to try to avoid things like an “online scream.” :)

  7. Hi Dimitry,

    Google has a little over 800 granted patents listed as being assigned to them in the US patent and trademark office assignment database.

    I think there’s an incredible amount of value in looking through those patents, as well as the published pending patent applications for a number of reasons:

    1. They are directly from Google itself, rather than SEO folklore or mythology, based upon educated guesses, uneducated guesses, anecdotal evidence, and so on.

    2. They provide some insight into areas where Google is conducting research, and give us hints about possible directions the search engine might take.

    3. They offer looks at things like the hardware Google is using and the database designs they’ve implemented.

    4. They offer us a chance to view the Web, searchers, and search from the perspective of search engineers.

    5. They sometimes give us a chance to learn about experiments that Google has conducted, as well as the results of those experiments.

    6. They show us how Google is addressing search related problems on a global scale, and in multiple languages.

    7. They raise questions about problems with search and search engines that many of us don’t even know exist.

    That’s just skimming at the surface.

    I consider SEO to be marketing with an insight into how search engines operate, and how they can impact the online efforts of people attempting to communicate, to conduct business, and to interact with others. If the search engines are providing useful information in the form of patent filings that can help us with that understanding and insight, then I think the question transforms from “why I would look at patents” to “why aren’t other people.”

    Google faces the same issues as other website owners, including understanding online consumer behavior. The patents often provide information about how Google understands and addresses online consumer behavior. There’s a great amount of value in that.

  8. Hi Ryan,

    I think the addition of synonyms in search results could be very helpful to searchers, creating the possibility that they may be more likely to find information that they are looking for. It does make SEO more interesting.

  9. Sifting through search patents often leads to wild, unfounded speculations but I agree with Bill that knowing about these proposed technologies offers some insight into how search engineers think and what the search engines may be attempting to do.

  10. Hey Bill Nice Post! I think this is Awesome that Google is trying to be logical and being closer to how a human brain can think.

    Hey Dimitry, you are right that knowledge of customers’ online behavior is way more important than Google patents, but I guess Google synonyms are there for the same reason: to get closer to humans and display results the way a user thinks.

  11. I disagree that synonyms are ‘rarely’ used closely to each other. What about all those instances when people type “x, also known as y”, where x is the first instance of the word and y is the synonym?

    Also, you would need to account for dialects, cultural difference and differing uses of each word and its corresponding synonyms (my Irish family refer to ‘swede’ as ‘turnip’).

    I also wonder if Google have looked into the Neuroscience of Language, including how words are ‘stored’ in the brain and how the brain itself finds synonyms.

  12. Very interesting! But it sounds like this might have a huge negative effect on the accuracy of the organic search results. And it will clearly make it more difficult to optimize your webpage for certain keywords, knowing that you have to consider synonyms as well.

    But I’m very curious to see how this might work.

  13. Hi Michael,

    I do try to keep from making any wild and unfounded speculations here when I see something in a patent that provides a new approach or possible insight into how a search engine might work. But that doesn’t keep me from considering how something I see might have an impact, and coming up with questions and things to look for and possibly experiment with. That’s the fun part that goes with the actual work of digging through a patent, and trying to make some sense of it. :)

  14. Hi Moosa,

    It is interesting to see how Google does try to consider how searchers think, and attempt to make it easier to find the information they are looking for. I’m not sure that Google always gets it right, but I think they do a good job most of the time.

  15. Hi Emma and Michael,

    I agree that sometimes synonyms do appear closer together than 4 words. I wondered at that myself as I was reading through the patent.

    I was happy to see Michael providing a link to the papers that Google has published on natural language processing. I know that Google has published a lot more papers than that, but many of them are internal documents that haven’t found their way onto the Web.

    I’d love to see a paper that describes how they came up with the statement about synonyms not appearing closely together. I would guess that they looked at a lot of pages before they came up with an assumption like that.

  16. Hi Petter,

    If the synonyms being shown to searchers are truly synonyms within the context that they are being searched for, then I’m not sure that the quality of results from Google would suffer. It’s that context that’s important though.

    The words “car” and “auto” are synonyms when they both appear in front of the word “mechanic.” But “auto” by itself is sometimes a synonym for the word “car,” and it’s sometimes a description of a setting. For instance, the machine was set to “auto” so that it went through two cycles each time it was started by default.

    It might make it harder to optimize pages for specific keywords, especially since it makes it more necessary to think about possible synonyms for phrases that you might want to optimize for. I’m not sure that’s a bad thing though if it can potentially bring more visitors to your pages.

  17. Bill would this type of technology fall into the ‘machine learning’ category? Whereby a search engine is capable of determining the relationships between words across multiple pages and subsequently gets smarter as it comes across more and more examples?

  18. Hi Jonathan,

    The patent filing doesn’t describe the process involved in this approach as a machine-learning based system, though it’s possible that it could be set up in a manner like that.

  19. Just a little insight to share: In undertaking search engine optimisation for a client in the ecommerce sector I’ve been noticing some interesting developments in the use of apostrophes in search terms.

    Taking a very recent example (in the UK), until very recently the Google AdWords Keyword Tool was displaying differentiated search volumes for ‘mothers day gifts’ and ‘mother’s day gifts’ (with the apostrophe). However they are now amalgamated. The Google Insights for Search tool is also currently displaying these as identical search terms. The SERPs for these two variants are different, so this amalgamation is not being reflected in actual results as yet.

    What I am currently observing is not new; in late 2010 the Google AdWords Keyword Tool started displaying amalgamated search volumes, before reverting back to displaying separate search volumes for terms with and without apostrophes.

  20. Hi Dave,

    Odd how the search engines sometimes treat things like punctuation, especially apostrophes. A word like “don’t” is a valid contraction of two separate words – “do not.” When it is included as part of a search query, the searcher isn’t looking for “don t,” so the search engine shouldn’t be treating it as if it were an empty space, or eliminating the space between the n and the t.

    When an apostrophe is used in a possessive, such as “mother’s”, the apostrophe is being used to create meaning that differs from the same word without the apostrophe and no space. But again, the searcher isn’t searching for “mother s.” But, how many people searching for “mother’s day,” or “valentine’s day” use the apostrophe in their query, and how many type in “mothers day,” or “valentines day.” If the intent within that context is similar, I would guess that the ideal thing to do would be to treat those phrases as if they were the same, both in the keywords tool, and in search results.

    But it’s interesting to see that Google is flip flopping on how they handle it.

    Thanks for sharing this with us.

  21. Hi Bill, to provide a little more detail to my original observations, here are the figures presented by the Google AdWords Keyword Tool at one point last year (note: I have seen these figures significantly downsized since) –

    ‘valentines day gifts’ = 49,500 global ‘Exact Match’ monthly average
    ‘valentine’s day gifts’ = 12,100 global ‘Exact Match’ monthly average

    ‘mothers day gifts’ = 74,000 global ‘Exact Match’ monthly average
    ‘mother’s day gifts’ = 27,100 global ‘Exact Match’ monthly average

    So apostrophes are reportedly featuring in approximately one third the volume of searches of their abbreviated cousins. The underlying intention behind these searches is almost certainly the same, and the figures simply imply that more people are inclined to drop the apostrophe when entering a search (speed over grammatical correctness).

    I would fully support the idea that the search engines should treat these examples as the same search terms, but whilst they fail to do so in the actual search results, it would be helpful if they kept their reported search volumes segmented into the two variations. Anyone researching these types of search terms afresh might fail to target the primary head terms. Maybe the alignment of reported search volumes and search results is imminent?

    I’ll continue to watch the SERPs, and your posts, with interest! All the best.

  22. Bill – this is a fantastic article!

    Considering the mathematical theory behind the proximity of words with similar meanings to one another in a document seems like a fairly viable way of understanding synonyms.

    I would assume that there would be a maximum and minimum threshold within the proximity as well as additional variables for paragraph separation.

    Very interesting stuff!

  23. Hi Dave,

    Thanks for the followup comment. Pretty intriguing. There are some significant differences between the versions of the terms with and without the apostrophes.

    As a search engine, I can understand wanting to track and follow search trends and understand how people search when it comes to terms like that. Reporting upon them in the keyword tool as separate terms is probably a good idea as well, but might it be a little misleading to people using the tool if Google provides both variations in search results as if they were the same term?

  24. Hi Tom,

    Thank you.

    I would also think that in addition to considering thresholds involving proximity and possibly separation within paragraphs, Google might use a page segmentation process to determine if the terms are appearing within the same blocks or segments.

    As for paragraphs, I wonder. When I write on paper or in a word processing document, I usually use longer paragraphs than when I write for the Web. While the sentences of a paragraph should ideally capture a specific concept and expand upon it, I find myself breaking my paragraphs down into smaller parts in blog posts in an attempt to avoid big blocks of text and make posts more readable. I suspect that I’m not alone in doing that.
