A couple of years back, Google was granted a patent on an approach to identifying synonyms by looking at and comparing queries that searchers used to find information. The patent was Determining query term synonyms within query context, and I covered it in my post How Google May Expand Searches Using Synonyms for Words in Queries.
A month or so after that patent was granted and I wrote my post, Google researcher Steven Baker published a blog post at the Official Google Blog titled Helping computers understand language, where he announced that Google would start including synonyms for query terms in search results when the search engine thought that the synonym was a good match for a query term.
Car Mechanic or Auto Mechanic or Both?
The Google blog post included some examples of when replacing a query term with a synonym might be useful (such as replacing “words” with “lyrics” when someone searches for [song words]), and when it might not be as helpful (such as replacing “pictures” with “photos” on a search for [motion pictures]). Steven Baker not only wrote the blog post, but he was also one of the inventors listed on the query synonyms patent.
Google was granted a new patent this week on synonyms, and Steven Baker is again one of the named inventors on the patent.
Near the beginning of its description, the patent repeats some of the difficulties mentioned in Steven Baker’s blog post about finding helpful synonyms that can replace words in a query and still bring meaningful results to a searcher.
The earlier patent described some of the ways that Google might identify synonyms from query logs, and this patent explores some additional approaches that look at how those words are used within pages online.
For example, the synonym identification method might look at how frequently words tend to appear together in documents on the Web.
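As a rough illustration of that first signal, here is a minimal sketch that measures how often two words show up in the same documents. The function name, the tokenized-list input, and the choice to report a simple fraction of documents are my own assumptions for the example; the patent doesn't spell out its calculation this way.

```python
def cooccurrence_frequency(docs, word_a, word_b):
    """Fraction of documents that contain both words.

    `docs` is a list of documents, each already split into lowercase tokens.
    This is a hypothetical measure chosen for illustration only.
    """
    if not docs:
        return 0.0
    both = 0
    for tokens in docs:
        vocabulary = set(tokens)  # set lookup avoids rescanning the token list
        if word_a in vocabulary and word_b in vocabulary:
            both += 1
    return both / len(docs)


docs = [
    ["car", "and", "auto", "repair"],
    ["car", "wash"],
    ["ice", "cream"],
    ["auto", "shop", "for", "any", "car"],
]
print(cooccurrence_frequency(docs, "car", "auto"))
```

A real system would work from counts across a very large collection of pages, and would probably normalize for how common each word is on its own.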
The method might also look at how close together those words appear within those documents. It seems that words that frequently appear near each other on a page, but not too close to each other, may stand a good chance of being synonyms.
So this process would check whether words that frequently co-occur in the same documents also tend to appear so close together that they fall within the same sentence or phrase. If so, they might not be synonyms because, as we’re told in the patent, “synonyms rarely occur in the same sentence or phrase.”
For example, the words “ice” and “cream” tend to show up frequently in the same documents, but they often tend to be adjacent to each other in the phrase “ice cream.”
A closeness score might be calculated by dividing the probability that the words are very close to each other by the probability that the words are near each other.
Words that are “very close” to each other might be less than a certain number of words apart, such as 4 words. Words that are “near” each other might be within a certain other number of words, such as 100. Words that frequently appear near each other within the same documents, but not very close to each other, might be synonyms.
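To make the closeness score a little more concrete, here is a minimal sketch under some assumptions of my own: documents arrive as lists of tokens, every pairing of the two words' positions counts as one observation, and the two probabilities are estimated from the same set of observations so that their ratio reduces to a ratio of counts. The function and parameter names are hypothetical; only the 4-word and 100-word windows come from the examples in the patent.

```python
def closeness_score(docs, word_a, word_b, very_close=4, near=100):
    """Ratio of 'very close' occurrences to 'near' occurrences for a word pair.

    A score near 1.0 suggests the words usually sit in the same phrase
    (like "ice cream"); a score near 0.0 suggests they share pages without
    sharing sentences. Returns None if the words are never near each other.
    """
    close_hits = 0
    near_hits = 0
    for tokens in docs:
        positions_a = [i for i, t in enumerate(tokens) if t == word_a]
        positions_b = [i for i, t in enumerate(tokens) if t == word_b]
        for i in positions_a:
            for j in positions_b:
                distance = abs(i - j)
                if distance <= near:            # within the "near" window
                    near_hits += 1
                    if distance < very_close:   # less than 4 words apart
                        close_hits += 1
    if near_hits == 0:
        return None
    return close_hits / near_hits


docs = [
    "i like ice cream in the summer".split(),
    "the car mechanic fixed the engine and the auto mechanic helped".split(),
]
print(closeness_score(docs, "ice", "cream"))
print(closeness_score(docs, "car", "auto"))
```

In this toy data, “ice” and “cream” are always adjacent, so their score is high and they look like a phrase rather than synonyms, while “car” and “auto” are near each other but never within 4 words, so their score is low.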
This system might also look at correlations between the appearance of certain words within the content of a page, and words in page titles and in anchor text pointing towards those pages.
While co-occurrence and closeness of words are two factors to be considered, another signal might involve looking at how the words are actually used on the page, referred to in the patent as “word forms.”
For instance, if the author of a page writes about “car mechanics” and then refers to “auto mechanics” on the same page, that usage might provide a clue that “car” and “auto” could be synonyms. The patent also tells us that the search engine might look at query logs to see if searchers sometimes replace one of the words with the other during the same query session.
So, if the words “car” and “auto” tend to appear frequently on the same page, and people searching for [car mechanics] also frequently change that search to [auto mechanics] during the same query session, then there’s a strong probability that the terms are synonyms.
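The query session signal could be sketched along these lines. The session format (a list of query strings in the order they were entered), the function names, and the rule that a substitution means exactly one word changed between consecutive queries are all assumptions for the sake of the example, not details from the patent.

```python
def is_substitution(query_1, query_2, word_a, word_b):
    """True if the two queries differ only by swapping word_a and word_b."""
    tokens_1, tokens_2 = query_1.split(), query_2.split()
    if len(tokens_1) != len(tokens_2):
        return False
    differences = [(x, y) for x, y in zip(tokens_1, tokens_2) if x != y]
    # Exactly one position changed, and the change is between the two words.
    return len(differences) == 1 and set(differences[0]) == {word_a, word_b}


def substitution_count(sessions, word_a, word_b):
    """Count consecutive query pairs in which one word replaces the other."""
    count = 0
    for session in sessions:
        for query_1, query_2 in zip(session, session[1:]):
            if is_substitution(query_1, query_2, word_a, word_b):
                count += 1
    return count


sessions = [
    ["car mechanics", "auto mechanics"],
    ["car mechanics", "car repair"],
    ["auto mechanics near me", "car mechanics near me"],
]
print(substitution_count(sessions, "car", "auto"))
```

A count like this could then be weighed alongside the document-based signals, so that a pair of words is only treated as synonyms when several of them agree.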
The patent, which provides much more detail, is at:
Document-based synonym generation
Invented by Oleksandr Grushetskyy and Steven D. Baker
Assigned to Google
US Patent 7,890,521
Granted February 15, 2011
Filed: February 7, 2008
One embodiment of the present invention provides a system that automatically generates synonyms for words from documents. During operation, this system determines co-occurrence frequencies for pairs of words in the documents. The system also determines closeness scores for pairs of words in the documents, wherein a closeness score indicates whether a pair of words are located so close to each other that the words are likely to occur in the same sentence or phrase.
Finally, the system determines whether pairs of words are synonyms based on the determined co-occurrence frequencies and the determined closeness scores. While making this determination, the system can additionally consider correlations between words in a title or an anchor of a document and words in the document as well as word-form scores for pairs of words in the documents.