In 2006, Google battled Yahoo! and Microsoft for an algorithm developed by an Israeli Ph.D. student in Australia. The algorithm had a semantic element to it, and advanced Google in an algorithm arms race between the search giants (one of which doesn’t even have a search engine of its own now). We’ve seen the technology described in terms of how it is displayed in search results, but not how it does what it does. Until now.
Google was awarded a patent this week that looks at search results for specific queries and the entities that appear within them, to produce query refinements. This invention is from Google, but the lead inventor behind it was part of a bidding war between Google, Yahoo!, and Microsoft. In 2009, the breakthrough was made public on Google in the form of Orion technology.
The Orion approach involved both extended snippets for queries (three or more lines of descriptive snippet instead of two for some longer queries), and “more and better query refinements.” How this technology is displayed is described in a Google Official Blog post from March 24, 2009 titled Two new improvements to Google results pages.
One of the co-authors of that post is Ori Allon, who developed the Orion Technology as a student in Australia. (Ori has been busy since then, with stints at Google and Twitter, and a new project on his own.)
If you do some of the searches at Google described in that blog post, you’ll see both extended snippets and a good number of suggested query refinements. Try a search for [earth's rotation axis tilt and distance from sun] (without the brackets), for an example. Three of the top 10 results from my search have three lines of snippets, and another has 4 lines. Here’s an example of one of those extended snippets:
The patent provides us with a better look at how the Orion technology actually works:
Refining search queries
Invented by Ori Allon, Ugo Di Girolamo, Tomer Shmiel, Alexandre Petcherski, and Tzvika Hartman
Assigned to Google
US Patent 8,392,443
Granted March 5, 2013
Filed: March 17, 2010
Methods, systems, and apparatus, including computer program products, for refining search queries.
A method includes:
- Obtaining a submitted search query, and in response to obtaining the search query:
- Obtaining search results responsive to the search query;
- Selecting a document from a group of documents identified by the search results;
- Generating from a subset of one or more entities associated with the document one or more candidates for refined search queries, including:
- Identifying one or more terms in the search query, where the one or more terms occur in the search query in a particular order relative to each other, and
- Combining the one or more terms with the entity to generate a candidate, where the one or more terms occur in the particular order relative to each other; and identifying one or more of the candidates as being refined search queries for providing with the search results.
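The combining step in that claim can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's implementation: the function name `generate_candidates`, the lowercasing, and the choice to append the entity after the query terms are all hypothetical. The one thing taken from the claim is that the query terms keep their original relative order.

```python
def generate_candidates(query: str, entities: list[str]) -> list[str]:
    """Combine the query's terms, in their original order, with each entity."""
    terms = query.lower().split()
    candidates: list[str] = []
    for entity in entities:
        entity_terms = entity.lower().split()
        # An entity identical to the query would not refine anything.
        if entity_terms == terms:
            continue
        # Keep query terms in order, dropping any the entity already contains
        # (an assumption made here to avoid repeated words).
        kept = [t for t in terms if t not in entity_terms]
        candidate = " ".join(kept + entity_terms)
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates


print(generate_candidates("Mona Lisa", ["Louvre", "Leonardo da Vinci"]))
# ['mona lisa louvre', 'mona lisa leonardo da vinci']
```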
Generating Query Refinements
The patent itself focuses upon query refinements rather than upon extended snippets, and my guess is that there’s probably another unpublished patent out there focusing upon those extended snippets. But the query refinement approach is interesting in a few ways.
It refers to entities found in documents that rank for specific queries (a co-occurrence of entities), and those entities might be used in combination with words from the original query (or synonyms of those words) to provide query refinements.
Pages returned in a search for a query could be associated with specific entities, which are included in the documents returned for that search.
Entities make up a “meaningful, self-contained concept.”
An entity could be a single word, a phrase, or other character strings. An entity might be a sequence of one or more characters that show up in previously-submitted search queries at a frequency that is greater than a certain threshold of searches over a certain period of time. A document could be associated with more than one entity.
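One way to picture the frequency-based definition of an entity is a simple count over a query log. The sketch below assumes a plain list of query strings and a made-up threshold; the name `frequent_query_entities` and the normalization (trimming and lowercasing) are hypothetical, and the patent does not say what threshold or time period Google uses.

```python
from collections import Counter


def frequent_query_entities(query_log: list[str], threshold: int) -> set[str]:
    """Return query strings submitted more than `threshold` times in the log."""
    counts = Counter(q.strip().lower() for q in query_log)
    return {query for query, n in counts.items() if n > threshold}


# A toy log standing in for "searches over a certain period of time".
log = ["Louvre", "louvre", "louvre ", "mona lisa", "renaissance", "Renaissance"]
print(frequent_query_entities(log, threshold=2))
```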
Someone searches for [Mona Lisa] and Google returns search results pages in response. A number of other entities might appear in those search results, such as “Leonardo da Vinci”, “Louvre”, “renaissance”, and others.
These entities might be passed along to a query refinement server as parts of candidate query refinements.
Scoring Entities Associated with Documents
Refined search queries can be created in real time, because the entities that are used to generate those queries are associated with documents identified by search results in response to the query.
Entities associated with a document can also be previously-submitted search queries for which search results that identify the document have been returned more than a certain number of times.
Inverse Document Frequency (IDF) Score
Part of the process described in this patent involves filtering entities as possible query refinements that the search engine might list, to get just a small number of refinements.
An entity might have an IDF score, which is based upon counting the number of documents being searched “which contain (or are indexed by) the term in question. The intuition was that a query term which occurs in many documents is not a good discriminator, and should be given less weight than one which occurs in few documents.” See: Understanding Inverse Document Frequency: On theoretical arguments for IDF (pdf).
The score for the entity may be based on a sum of the IDFs of each word in an entity. The score of the entity “Mona Lisa” might be calculated by taking the sum of the IDF of “Mona” and the IDF of “Lisa”. (The British rock band from the 80s, “The The,” might not have the greatest IDF score in the world under that approach.)
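To make the sum-of-IDFs idea concrete, here is a small sketch. It assumes each document is represented as a set of lowercase words and uses a smoothed form of IDF, log((1 + N) / (1 + df)), which is one common variant and not necessarily the formula the patent has in mind. The function names are hypothetical.

```python
import math


def idf(term: str, documents: list[set[str]]) -> float:
    """Smoothed inverse document frequency of a single term."""
    df = sum(1 for doc in documents if term in doc)
    return math.log((1 + len(documents)) / (1 + df))


def entity_idf_score(entity: str, documents: list[set[str]]) -> float:
    """Score an entity as the sum of the IDFs of its words."""
    return sum(idf(word, documents) for word in entity.lower().split())


docs = [
    {"the", "mona", "lisa"},
    {"the", "louvre"},
    {"the", "band"},
    {"the", "lisa"},
]
print(entity_idf_score("Mona Lisa", docs))  # positive: both words are fairly rare
print(entity_idf_score("The The", docs))    # 0.0: "the" appears in every document
```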
As the IDF score of an entity increases, the likelihood that the entity is important or relevant to a document responsive to the search query also increases. Therefore, entities of a document with a higher score are also ranked higher than entities with a lower score.
If you look through the document, you may see an entity appear more than once. A score for an entity can be created from “determining a co-occurrence relationship between the entity and a search query.”
If an entity appears in a document more than once, its importance to the document is probably higher, and that score might be used with, or incorporated into, the IDF score.
Query Click and Dwell Time Score
The score for an entity could also be increased as the number of times the entity is found in a previously-submitted query increases.
Every selection of the document for the previously-submitted query within search results is counted as a click. The amount of time someone views or “dwells” on the document may also be tracked. The more time they spend there (i.e., a long click), the more relevant the document might be seen to be for that previous query.
If they don’t spend much time there, that might be perceived as a lack of relevance of the document.
The score for an entity can increase based upon long clicks, and/or based on an increase of the ratio of long clicks to total clicks for queries which use that entity.
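The ratio of long clicks to total clicks is easy to express in code. In this sketch the 30-second cutoff for a “long click” is purely an assumption for illustration (the patent doesn’t give a number), and `long_click_ratio` is a hypothetical helper name.

```python
def long_click_ratio(dwell_times_seconds: list[float],
                     long_click_threshold: float = 30.0) -> float:
    """Fraction of clicks whose dwell time meets the long-click threshold."""
    if not dwell_times_seconds:
        return 0.0  # no clicks recorded, so no evidence either way
    long_clicks = sum(1 for t in dwell_times_seconds if t >= long_click_threshold)
    return long_clicks / len(dwell_times_seconds)


# Dwell times (in seconds) for clicks on results of queries that use an entity.
print(long_click_ratio([5, 45, 120, 10]))  # 0.5
```

A higher ratio would nudge the entity’s score upward; how much weight it gets relative to the other signals isn’t spelled out.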
Candidate Query Refinements used in Titles
The score of an entity can be increased if the entity is found in the title of a document.
Previously Submitted Queries
The score for an entity can increase as the number of times it is found in previously-submitted queries increases, as the number of documents it currently appears in increases, as the number of document titles that include it increases, and as the number of terms (or tokens) in the entity increases.
Other Collected Information
It’s possible that some other information might be used to score an entity that might be used as part of a candidate query. Some of the information collected may include the:
- Search query
- Frequency of submission over a period of time
- Dates and times of submission
- Language of the search query, and/or
- Other information associated with the search query
Evaluating Candidate Refinements
The patent describes how these entities might be merged with the original query that refinements are being selected for, and how they might be selected from among all of the candidates. Some of this evaluation might involve looking at:
- A number of words in the candidate
- An amount of overlap between the candidate and an entity
- An amount of overlap between the candidate and the search query
- A number of times the candidate appears in the search logs
- A sum of the IDF of all the terms in the candidate
- An IDF of the most unique term in the candidate
The top 8 or so refinements might be selected to be shown in search engine results.
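Here’s a rough sketch of how a few of those signals might be folded into one ranking before cutting the list down to the top 8. Everything about the combination is an assumption: the `Candidate` fields, the unweighted sum, and the log damping of search-log counts are hypothetical choices for illustration, since the patent lists the signals without saying how they are weighted.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    log_count: int   # times the candidate appears in the search logs
    idf_sum: float   # sum of the IDFs of the candidate's terms


def overlap(a: str, b: str) -> int:
    """Number of distinct words two strings share."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def top_refinements(candidates: list[Candidate], query: str, k: int = 8) -> list[str]:
    """Rank candidates by a simple combined score and keep the top k."""
    def score(c: Candidate) -> float:
        # Unweighted sum of three signals; log1p damps very large log counts.
        return c.idf_sum + math.log1p(c.log_count) + overlap(c.text, query)

    ranked = sorted(candidates, key=score, reverse=True)
    return [c.text for c in ranked[:k]]


candidates = [
    Candidate("mona lisa louvre", log_count=100, idf_sum=3.0),
    Candidate("mona lisa xyz", log_count=0, idf_sum=3.0),
]
print(top_refinements(candidates, "mona lisa", k=1))  # ['mona lisa louvre']
```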
It’s hard to say how much of what Ori Allon developed is still in use in generating the query refinements that Google shows for queries. Ori Allon is no longer with Google, but it’s possible that others at Google have worked to improve this query refinement approach.
The “entities” described in this patent feel just a little different than the named entities described in Google’s knowledgebase approach to search. I’ve described some of the changes we’ve seen in search from keyword mapping to phrase-based indexing to concept matching in SEO is Undead Again (Profiles, Phrases, Entities, and Language Models). If we think of named entities as specific “people, places, and things,” then maybe entities found in documents to create query refinements aren’t so different, though.
If you were to take a set of query refinements suggested for a particular query, and start looking through the pages returned for that query, you might start seeing some of the entities that were used to generate those refinements.