How Google May Use Synonym Substitutions to Rewrite Queries
A couple of months ago, I wrote about a Google patent on rewriting queries in a post titled Investigating Google RankBrain and Query Term Substitutions. There’s likely a lot more to how Google’s RankBrain approach works, but I came across a patent that seems related to the one I covered in that post, and I thought it was worth sharing and discussing. That earlier patent was Using concepts as contexts for query term substitutions. The new patent has a very similar title (Synonym identification based on categorical contexts) and was granted on December 1st of this year.
The new synonym substitutions patent opens with a scenario that illustrates how it works. The inventors tell us:
For example, learning that “restaurants” is a good synonym for “food” in the query [food in San Francisco] is relatively straightforward because the volume of query traffic including the query term “San Francisco” is very large. For much smaller cities, such as Grey Bull, Wyo., the query stream may have never seen any supporting evidence for this synonym substitution.
Both cities are entities that fit into the same category, “Cities,” which means that a synonym substitution learned from queries about one could potentially apply to queries about the other. That’s what the inventors of this patent tell us specifically, using the San Francisco and Grey Bull example:
For example, if “San Francisco” and “Grey Bull” are both cities, and “restaurants” is a good synonym for “food” in queries about San Francisco, the synonym relationship may apply to queries related to “Grey Bull” as well. Thus, the category “city” may be considered a useful category when identifying synonyms for query expansion in circumstances such as this.
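The idea in that passage can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the patent's implementation: the tiny knowledge base `ENTITY_CATEGORY`, the rule table `CATEGORY_SYNONYMS`, and the function `expand_query` are all hypothetical names. The point is that the rule is keyed on the category ("city") rather than on a specific entity, so it can fire for a city that has little query traffic of its own.

```python
import re

# Hypothetical toy knowledge base: entity -> category.
ENTITY_CATEGORY = {
    "san francisco": "city",
    "grey bull": "city",
}

# Hypothetical rules learned from high-volume queries:
# (query term, context category) -> synonym.
CATEGORY_SYNONYMS = {
    ("food", "city"): "restaurants",
}


def expand_query(query):
    """Return the query plus variants produced by category-level synonym rules."""
    text = query.lower()
    variants = [text]
    for entity, category in ENTITY_CATEGORY.items():
        if entity not in text:
            continue
        # Look for substitutable terms outside the entity phrase itself.
        remainder = text.replace(entity, " ")
        for word in remainder.split():
            synonym = CATEGORY_SYNONYMS.get((word, category))
            if synonym:
                pattern = r"\b" + re.escape(word) + r"\b"
                variants.append(re.sub(pattern, synonym, text, count=1))
    return variants


print(expand_query("food in Grey Bull"))
```

Because the rule is stored against the category, no query evidence about Grey Bull itself is needed for the substitution to apply.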
So, we are told that the process in this synonym substitutions patent is to identify categories from a knowledge base whose entities can serve as similar contexts, so that a synonym substitution observed for one entity could potentially apply to queries involving other entities in that same category. The process involves identifying those entities in a query stream and determining whether their category is what the patent calls a “coherent” category.
The patent tells us that a coherent category is one in which a certain threshold of terms tends to co-occur in a query stream with the entities in that category. For instance, a category that includes cities, villages, and towns might see a lot of co-occurring terms involving hotels and roads. If the number of co-occurring terms in that query stream meets a certain threshold, the category would be considered coherent, and its entities could then be treated as interchangeable contexts for synonym substitutions.
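A rough way to picture that coherence test is the Python sketch below. The patent summary here does not give the actual thresholds or scoring, so the parameters `min_entity_share` and `min_shared_terms`, the function names, and the sample queries are all assumptions made up for illustration.

```python
from collections import defaultdict


def cooccurring_terms(queries, entities):
    """Map each entity to the set of other words seen in queries that mention it."""
    terms = defaultdict(set)
    for query in queries:
        text = query.lower()
        for entity in entities:
            if entity in text:
                terms[entity].update(text.replace(entity, " ").split())
    return terms


def is_coherent(queries, entities, min_entity_share=0.75, min_shared_terms=2):
    """Assumed test: enough terms co-occur with a large share of the entities."""
    terms = cooccurring_terms(queries, entities)
    counts = defaultdict(int)
    for entity in entities:
        for term in terms.get(entity, ()):
            counts[term] += 1
    shared = {t for t, n in counts.items() if n / len(entities) >= min_entity_share}
    return len(shared) >= min_shared_terms, shared


# Made-up query stream for three entities in a hypothetical "city" category.
queries = [
    "hotels in san francisco",
    "san francisco roads",
    "grey bull hotels",
    "roads near grey bull",
    "hotels in austin",
    "austin roads map",
]
cities = ["san francisco", "grey bull", "austin"]
coherent, shared = is_coherent(queries, cities)
print(coherent, sorted(shared))
```

In this toy data, "hotels" and "roads" appear with every city, so the category passes the assumed threshold. A real system would presumably also filter out stop words and weight terms by frequency.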
The synonym substitutions patent in question is:
Synonym identification based on categorical contexts
Invented by: Zachary A. Garrett, Takahiro Nakajima, Tasuku Oonishi
US Patent 9,201,945
Granted December 1, 2015
Filed: March 8, 2013
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training recognition canonical representations corresponding to named-entity phrases in a second natural language based on translating a set of allowable expressions with canonical representations from a first natural language, which may be generated by expanding a context-free grammar for the allowable expressions for the first natural language.
When I wrote about the query term substitution patent mentioned at the start of this post, I included a number of example queries that were rewritten through term substitutions. Those substitutions might seem reasonable to a search engine looking at words that tend to show up together, or co-occur, in a query stream involving those search terms.
For instance, someone searching for [New York Yankees stadium] was likely searching for results that involved “baseball” since queries that included “New York Yankees” and “stadium” also often included the term “baseball.”
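To make "often included" concrete, here is a minimal Python sketch that measures how frequently a candidate term appears in queries that already contain a set of required phrases. The function name `cooccurrence_rate` and the sample queries are hypothetical, and neither patent is quoted here as defining co-occurrence this way.

```python
def cooccurrence_rate(queries, required, candidate):
    """Share of queries containing every required phrase that also contain the candidate."""
    matching = [q.lower() for q in queries if all(r in q.lower() for r in required)]
    if not matching:
        return 0.0
    return sum(candidate in q for q in matching) / len(matching)


# Made-up query stream for illustration.
sample_queries = [
    "new york yankees stadium baseball tickets",
    "new york yankees baseball stadium seating",
    "new york yankees stadium parking",
    "boston stadium baseball",
]
rate = cooccurrence_rate(sample_queries, ["new york yankees", "stadium"], "baseball")
print(round(rate, 2))
```

A high rate would suggest that "baseball" is a reasonable context to attach to the query, which is the kind of signal a substitution rule could be built on.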
That patent didn’t use the term “co-occur,” nor did it explain, as this one does, how a knowledge base might be used to group entities that fall into the same categories. Still, the idea that a shared context such as an entity category can be used to trigger synonym substitutions in a query is interesting.
It’s worth spending time with both patents, reading through each of them more than once and thinking about how they might be used.