How Google May Use Synonym Substitutions to Rewrite Queries
A couple of months ago, I wrote about a Google patent that involved rewriting queries, in a post titled Investigating Google RankBrain and Query Term Substitutions. There’s likely a lot more to how Google’s RankBrain approach works, but I came across a patent that seems related to the one I wrote about in that post, and I thought it was worth sharing and starting a discussion about. The patent I wrote about in that post was Using concepts as contexts for query term substitutions. The title of this new patent is very similar (Synonym identification based on categorical contexts), and it is the more recent of the two, granted on December 1st of this year.
The new synonym substitutions patent starts by describing a scenario that is a good example of how it works. The inventors tell us:
For example, learning that “restaurants” is a good synonym for “food” in the query [food in San Francisco] is relatively straightforward because the volume of query traffic, including the query term “San Francisco,” is huge. For much smaller cities, such as Grey Bull, Wyo., the query stream may have never seen any supporting evidence for this synonym substitution.
Both cities are entities that fit into the same category, “Cities,” which means that they could potentially be good synonyms for each other. That’s what the inventors of this patent tell us specifically, using the San Francisco and Grey Bull example:
For example, if “San Francisco” and “Grey Bull” are both cities, and “restaurants” is a good synonym for “food” in queries about San Francisco, the synonym relationship may apply to queries related to “Grey Bull” as well. Thus, the category “city” may be a useful category when identifying synonyms for query expansion in the circumstances such as this.
So, we are told that the process involved in this synonym substitutions patent is to identify categories from a knowledge base involving several entities. Other entities within that same category could be synonyms for each other in similar contexts. Thus, the process from the patent involves identifying those entities from a query stream and identifying the category as one that they call a “coherent” category.
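The San Francisco and Grey Bull example can be sketched in a few lines of Python. This is only an illustration of the idea, not the method claimed in the patent: the knowledge base, the rule table, and the function names (generalize_rules, expand_query) are all hypothetical, and the real system presumably works from a far larger query stream and knowledge base.

```python
# Hypothetical knowledge base mapping entities to a category (an assumption).
KNOWLEDGE_BASE = {
    "san francisco": "city",
    "grey bull": "city",
}

# Hypothetical rules learned from high-volume queries:
# (query term, entity it appeared with) -> substitute term.
ENTITY_RULES = {
    ("food", "san francisco"): "restaurants",
}


def generalize_rules(entity_rules, knowledge_base):
    """Lift entity-level synonym rules to the category of each entity."""
    category_rules = {}
    for (term, entity), synonym in entity_rules.items():
        category = knowledge_base.get(entity)
        if category is not None:
            category_rules[(term, category)] = synonym
    return category_rules


def expand_query(query, category_rules, knowledge_base):
    """Return the query plus variants produced by category-level synonyms."""
    text = query.lower()
    words = text.split()
    variants = [text]
    for entity, category in knowledge_base.items():
        if entity not in text:
            continue
        for (term, rule_category), synonym in category_rules.items():
            if rule_category == category and term in words:
                # Replace whole words only, so "food" does not match "seafood".
                variants.append(" ".join(synonym if w == term else w for w in words))
    return variants


if __name__ == "__main__":
    rules = generalize_rules(ENTITY_RULES, KNOWLEDGE_BASE)
    print(expand_query("food in Grey Bull", rules, KNOWLEDGE_BASE))
```

In this sketch, a rule that was only ever observed with San Francisco ends up applying to a Grey Bull query because both entities share the “city” category.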
A Coherent Category is One in Which a Certain Threshold of Terms Tends to Co-Occur in a Query Stream Involving Those Entities
The patent tells us that a coherent category is one in which a certain threshold of terms tends to co-occur in a query stream involving those entities. For instance, a category might include entities that are cities, villages, and towns, and queries about those entities might share a lot of co-occurring terms involving hotels and roads. If the number of co-occurring terms appearing in that query stream meets a certain threshold, it would be a coherent category. The entities from the same category could then be synonyms for each other.
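One way to picture the “coherent category” test is the rough sketch below. The patent does not spell out the exact measure in the passages quoted here, so the approach (counting terms shared by every entity in the category), the stopword list, the sample queries, and the min_shared_terms threshold are all assumptions made for illustration.

```python
# Small illustrative stopword list (an assumption, not from the patent).
STOPWORDS = {"in", "near", "the", "of", "to", "for"}


def cooccurring_terms(queries, entity):
    """Collect non-stopword terms from queries that mention the entity."""
    terms = set()
    for query in queries:
        text = query.lower()
        if entity in text:
            remainder = text.replace(entity, " ")
            terms.update(w for w in remainder.split() if w not in STOPWORDS)
    return terms


def coherent_category(queries, entities, min_shared_terms=2):
    """Treat a category as coherent when its entities share enough terms.

    Entities are expected in lowercase. Returns (is_coherent, shared_terms).
    """
    term_sets = [cooccurring_terms(queries, e) for e in entities]
    shared = set.intersection(*term_sets) if term_sets else set()
    return len(shared) >= min_shared_terms, shared


if __name__ == "__main__":
    # Made-up query stream for two hypothetical towns.
    sample_queries = [
        "hotels in springfield",
        "roads near springfield",
        "hotels in riverton",
        "riverton roads map",
    ]
    print(coherent_category(sample_queries, ["springfield", "riverton"]))
```

Here both towns co-occur with “hotels” and “roads,” so the category passes a threshold of two shared terms but would fail a stricter threshold of three.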
The synonym substitutions patent in question is:
Synonym identification based on categorical contexts
Invented by: Zachary A. Garrett, Takahiro Nakajima, Tasuku Oonishi
US Patent 9,201,945
Granted December 1, 2015
Filed: March 8, 2013
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training recognition canonical representations corresponding to named-entity phrases in a second natural language based on translating a set of allowable expressions with canonical representations from a first natural language, which may expand a context-free grammar for the allowable expressions for the first natural language.
Synonym Substitutions Takeaways
When I wrote about the query term substitution patent I referred to at the start of this post, I included several examples of queries that were rewritten based upon substitutions of query terms. Those substitutions may have seemed reasonable to a search engine looking at words that tended to show up, or co-occur, in a query stream involving those search terms.
For instance, someone searching for [New York Yankees stadium] was likely searching for results that involved “baseball.” That is because queries that included “New York Yankees” and “stadium” also often included the term “baseball.”
That patent didn’t use the term “co-occur,” nor did it explain, as this one does, how a knowledge base might support substitutions for entities that are in the same category. The idea that a shared context like entity categories can trigger synonym substitutions in a query is interesting.
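The co-occurrence idea behind the Yankees example can be shown with a simple count. This is a minimal sketch on made-up queries, with a hypothetical function name (cooccurrence_counts); it is not how either patent says the counting is done.

```python
from collections import Counter


def cooccurrence_counts(queries, required_terms):
    """Count extra terms in queries that contain every required term."""
    required = [t.lower() for t in required_terms]
    counts = Counter()
    for query in queries:
        text = query.lower()
        if all(t in text for t in required):
            remainder = text
            for t in required:
                remainder = remainder.replace(t, " ")
            # Count each extra term once per query.
            counts.update(set(remainder.split()))
    return counts


if __name__ == "__main__":
    # Made-up query stream for illustration only.
    sample_queries = [
        "new york yankees stadium baseball tickets",
        "new york yankees baseball stadium tour",
        "new york yankees stadium parking",
        "boston stadium baseball",
    ]
    counts = cooccurrence_counts(sample_queries, ["new york yankees", "stadium"])
    print(counts.most_common(1))
```

With this sample data, “baseball” is the term that most often accompanies both “New York Yankees” and “stadium,” which is the kind of signal that could support adding it to the rewritten query.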
Synonyms in Search Show Up Frequently
It’s worth spending time with both patents, reading through each of them several times, and thinking about how they might be used.
I’ve written a few posts about synonyms in search. Here are some of those:
- 2/19/2006 – Multi-Stage Query Processing at Google
- 5/25/2007 – Refining Queries Using a Local Category Synonym
- 12/29/2008 – How a Search Engine Might Use Synonyms to Rewrite Search Queries
- 1/23/2009 – Google to Expand Language Search and Shrink Our World?
- 6/29/2009 – Semantic Relations from Query Logs
- 12/22/2009 – Google Search Synonyms Are Found in Queries
- 1/19/2010 – Google Synonyms Update
- 1/27/2010 – Paid Search Results and Query Expansion using Synonyms and Related Concepts
- 2/16/2011 – More Ways Search Engine Synonyms Might be Used to Rewrite Queries
- 8/12/2013 – How Google May Substitute Query Terms with Co-Occurrence
- 9/9/2013 – How Google May Reform Queries Based on Co-Occurrence in Query Sessions
- 9/27/2013 – The Google Hummingbird Patent?
- 10/15/2013 – Google’s Hummingbird Algorithm Ten Years Ago
- 12/8/2013 – How Google May Rewrite Queries
- 12/21/2015 – How Google Might Make Better Synonym Substitutions Using Knowledge Base Categories
Last Updated July 13, 2019