Last year I wrote a post titled Google on Finding Entities: A Tale of Two Michael Jacksons. The post was about a Google patent that described how Google might tell apart different entities that share the same name. That patent was filed in 2012 and granted in 2014. This week Google was granted another patent on disambiguating entities, one originally filed in 2006. It is worth looking at this second one, given how important understanding entities is to Google.
It contains a pretty thoughtful approach to understanding and distinguishing between different entities within documents and queries.
The patent was filed in 2006, and the inventors also published a paper about named entities that same year, which shows what they were thinking about at the time:
Using Encyclopedic Knowledge for Named Entity Disambiguation, by Marius Pasca and Razvan Bunescu
The paper echoes the patent in many places, and it appears that the two documents were worked on at the same time.
The patent begins by telling us that searches for named entities are very common, and are among the most popular on the Web. They include searches for persons, for places, for businesses and other organizations, and for different types of products, such as books and movies. A named entity is a thing that has a proper noun or proper name associated with it. Some Microsoft research gave us impressive numbers about entities in queries:
According to an internal study of Microsoft, at least 20-30% of queries submitted to Bing search are simply named entities, and it is reported 71% of queries contain name entities.
Because of this, search engines find it useful to recognize named entities when they see them. Understanding when a query contains a named entity, and knowing which named entity is being referred to, can mean smarter, better answers than simply matching keywords in a query to keywords found in documents.
Named Entities with the Same Names
When someone searches for a named entity, the search results returned in response usually contain relevant information about any entities with the same name as the query, or even a portion of that name. A focus of this patent is telling such entities apart, and it provides some examples of searches for named entities that share a name:
Thus, a query for “Long Beach” is likely to return documents about the coastal city in Long Island, N.Y. as well as documents about the coastal city in Southern California, as well as documents that are relevant to the terms “long” and “beach”. Similarly, a query for “John Williams” will return documents about the composer as well as documents about the wrestler, and the venture capitalist, all of whom share this name; a query for “Python” will return documents pertaining to the programming language, as well as to the snake, and the movie. The underlying problem then is that queries for named entities are typically ambiguous, and may refer to different instances of the same class (e.g., different people with the same name), or to things in different classes (e.g., a type of snake, a programming language or a movie).
The patent tells us that the order of search results for named entities has typically been based upon the frequency of the query terms, PageRank, or other factors, regardless of which specific named entity is being referred to by a shared name. This is one of the problems that the patent attempts to solve.
How does the patent attempt to address named entities that may share the same name, and enable us to distinguish between them?
A Knowledge Base of Named Entities
Part of the solution involves the use of a knowledge base that contains articles about named entities to use to disambiguate entity names. As we learn from the paper, the knowledge base being referred to is Wikipedia.
This knowledge base is built from a database of documents about named entities, which are entities that have proper names, such as “John Williams” (a person), “Long Beach” (a place), and “Python” (a movie, a programming language, and a deadly snake).
This knowledge base has features that help to disambiguate entity names that would otherwise be ambiguous:
- Each of these articles presents a context that is associated with a particular meaning or particular sense of the name.
- These articles also contain links between instances of entity names and the article linked to the name.
- These articles also include redirect articles that associate an alternative name or alias with a particular named entity article, like Mark Twain (a pen name) for Samuel Langhorne Clemens.
- They include articles that disambiguate different senses of an ambiguous name, like adding a word to Danny Sullivan (technologist), to distinguish the Search Engine Land Editor from the race car driver of the same name.
In short, relationship information about a specific sense of a name for a named entity depends upon its context – how it is linked to, and which article it might be linked to as well. The relationship between a name of an entity, and the particular named entity may be determined based upon a scoring model.
A search query that includes an entity name and additional keywords can then be disambiguated by identifying the entity name within the query, and using the scoring model to identify the article(s) most closely associated with the entity name. The disambiguated name and identified article(s) can then be used to augment the search results, so that they can be grouped or organized according to the entities identified.
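To make the idea of scoring a query against candidate articles concrete, here is a minimal sketch. It is not the scoring model from the patent; it assumes a simple stand-in, cosine similarity between the keywords around the entity name and the words of each candidate article. The article texts, the ARTICLES table, and the function names are all made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical miniature knowledge base: article title -> article text.
ARTICLES = {
    "John Williams (composer)": "American composer of film scores such as Star Wars and conductor of orchestra music",
    "John Williams (wrestler)": "professional wrestler who competed in heavyweight wrestling matches",
    "John Williams (venture capitalist)": "venture capitalist who funds technology startup companies",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def disambiguate(query, name, articles=ARTICLES):
    """Rank articles whose title starts with `name` by similarity to the rest of the query."""
    name_tokens = set(tokenize(name))
    # The query context is every query word that is not part of the entity name.
    context = Counter(t for t in tokenize(query) if t not in name_tokens)
    scores = {
        title: cosine(context, Counter(tokenize(text)))
        for title, text in articles.items()
        if title.lower().startswith(name.lower())
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = disambiguate("john williams star wars score", "John Williams")
print(ranking[0][0])
```

In this toy setup, the extra keywords "star" and "wars" only overlap with the composer article, so that sense ranks first. A query with no keywords beyond the name gives every candidate a score of zero, which is the ambiguous case the patent is concerned with.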
Articles found in the knowledge base (and named entities as well) can be associated with specific categories. The strength of the relationship between a named entity and a category is learned and made part of the scoring model, and that can also be used to disambiguate queries containing entity names. Note that articles in Wikipedia are often assigned a category, too.
The patent is:
Disambiguation of named entities
Invented by: Razvan Constantin Bunescu and Alexandru Marius Pasca
Assigned to: Google
US Patent 9,135,238
Granted September 15, 2015
Filed: June 29, 2006
Named entities are disambiguated in search queries and other contexts using a disambiguation scoring model. The scoring model is developed using a knowledge base of articles, including articles about named entities.
Various aspects of the knowledge base, including article titles, redirect pages, disambiguation pages, hyperlinks, and categories, are used to develop the scoring model.
A Dictionary of Named Entities
Part of the process of telling apart named entities that share the same name is using information about named entities from the knowledge base to create a named entity dictionary. To build it, the Wikipedia articles associated with named entities are extracted from the knowledge base.
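A rough sketch of what such a dictionary could look like follows. It assumes the dictionary maps each name to the set of articles it might refer to, collected from article titles, redirect pages, and disambiguation pages. The three input tables are toy data that follow this post's examples rather than real Wikipedia contents, and base_name and build_dictionary are hypothetical helper names.

```python
import re
from collections import defaultdict

# Toy inputs (illustrative only, not real Wikipedia data).
TITLES = [
    "John Williams (composer)",
    "John Williams (wrestler)",
    "Python (programming language)",
    "Samuel Langhorne Clemens",
]
REDIRECTS = {"Mark Twain": "Samuel Langhorne Clemens"}  # alias -> article
DISAMBIGUATION = {"Python": ["Python (programming language)", "Python (film)"]}

def base_name(title):
    """Strip a trailing parenthetical qualifier such as ' (composer)'."""
    return re.sub(r"\s*\([^)]*\)$", "", title)

def build_dictionary(titles, redirects, disambiguation):
    """Map each entity name to the set of articles it may denote."""
    dictionary = defaultdict(set)
    for title in titles:
        dictionary[base_name(title)].add(title)
    for alias, target in redirects.items():
        dictionary[base_name(alias)].add(target)
    for name, targets in disambiguation.items():
        dictionary[name].update(targets)
    return dictionary

entity_dictionary = build_dictionary(TITLES, REDIRECTS, DISAMBIGUATION)
print(sorted(entity_dictionary["John Williams"]))
```

A name that maps to more than one article is ambiguous and needs the disambiguation step; a name that maps to exactly one article does not.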
The named entity dictionary along with information about the hyperlink structure between articles in the named entity knowledge base, and the context (features) of the named entity articles are used to create a disambiguation dataset.
This disambiguation dataset will likely also include category information identifying the categories that are associated with each named entity. A disambiguation module uses the dataset to learn the strength of the relationships between words from the query context and categories from the category taxonomy.
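As an illustration of learning word-to-category relationships, here is a small sketch. The patent describes a learned scoring model; this example swaps in plain co-occurrence counts, which is an assumption made to keep it short. The training examples, category names, and function names are invented.

```python
from collections import Counter

# Hypothetical training examples:
# (words found near a linked name, categories of the article the link points to).
EXAMPLES = [
    (["film", "score", "orchestra"], ["Composers"]),
    (["conducted", "orchestra"], ["Composers"]),
    (["ring", "match", "champion"], ["Wrestlers"]),
]

def learn_weights(examples):
    """Count how often each context word co-occurs with each category."""
    weights = Counter()
    for words, categories in examples:
        for word in words:
            for category in categories:
                weights[(word, category)] += 1
    return weights

def score(context, categories, weights):
    """Sum the learned weights over every (context word, category) pair."""
    return sum(weights[(word, category)] for word in context for category in categories)

weights = learn_weights(EXAMPLES)
print(score(["orchestra", "match"], ["Composers"], weights))
print(score(["orchestra", "match"], ["Wrestlers"], weights))
```

Working with categories rather than individual articles lets evidence be shared: a word seen near one composer's name also counts toward other articles in the same category.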
Augmenting Search Results by Senses of Entity Names
The patent tells us that it might augment the search results it shows based upon an identified named entity. Augmenting the search results can mean grouping search results by the different senses of the disambiguated names, and adding annotations, snippets or other content that further identify or describe the search results (individually or in groups) based on the disambiguated names.
On a search for “John Williams”, the results shown may be for:
(1) one set of documents pertaining to the composer John Williams,
(2) a second set of documents pertaining to the wrestler,
(3) a third set of documents pertaining to the venture capitalist,
(4) and on, for any number of the different senses of the name.
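The grouping described above can be sketched in a few lines. This assumes each result has already been given a sense label by some disambiguation step; the keyword-based assign_sense function and the SENSE_KEYWORDS table below are crude, made-up placeholders for that step, not anything taken from the patent.

```python
from collections import defaultdict

# Hypothetical keywords standing in for a real disambiguation model.
SENSE_KEYWORDS = {
    "composer": {"score", "orchestra", "soundtrack"},
    "wrestler": {"wrestling", "ring", "match"},
    "venture capitalist": {"startup", "funding", "investor"},
}

def assign_sense(snippet):
    """Pick the sense whose keywords overlap most with the snippet, or 'unknown'."""
    words = set(snippet.lower().split())
    best = max(SENSE_KEYWORDS, key=lambda sense: len(words & SENSE_KEYWORDS[sense]))
    return best if words & SENSE_KEYWORDS[best] else "unknown"

def group_by_sense(results, sense_of=assign_sense):
    """Group result snippets under the sense each one is assigned to."""
    groups = defaultdict(list)
    for snippet in results:
        groups[sense_of(snippet)].append(snippet)
    return dict(groups)

results = [
    "John Williams conducts the orchestra",
    "John Williams wins wrestling match",
    "John Williams leads startup funding round",
]
for sense, docs in group_by_sense(results).items():
    print(sense, docs)
```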
Keep in mind that this patent was originally filed before Google had created its Knowledge Graph, and when the patent refers to a knowledge base, it is most likely referring to Wikipedia, which it calls an “exemplary knowledge base.”
The patent takes a deep dive into how Wikipedia is organized to describe how that structure can help it identify different entities that may share the same name, such as a John Williams who composed a score for Star Wars, another John Williams who was a professional wrestler, and another John Williams who was a venture capitalist.
The patent tells us about features of Wikipedia that help it distinguish between entities, and that section of the patent is worth reading for how it covers articles about specific entities, what it calls redirect articles and disambiguation articles, and how categories are assigned to each entry in Wikipedia. The links pointing to and from articles about different named entities also help to provide details that can be used to distinguish one entity from another.
The other words in a query that includes a named entity may correlate with the Wikipedia articles about the candidate entities, and those correlations can help in selecting the right named entity.