Google on Finding Entities: A Tale of Two Michael Jacksons
I’ve been saying for at least a couple of years that Google’s local search is a proof of concept for how the search giant finds and understands entities.
With local search, Google goes out and looks for mentions of a business on the Web, especially when a mention is accompanied by geographic location information. It collects facts related to businesses (entities are people, places, and things) and then clusters the information it finds to make sure that those mentions across the Web are all referring to the same places.
If you start reading about local search, you’ll see people referring to the importance of consistency in how you present address information for a business, and the same thing is true for entities.
When there are two well-known people with the same name, such as head of Homeland Security Michael Jackson and pop star Michael Jackson, things could get a little confusing.
That is, until you start looking at the facts associated with each mention. One was a member of the group “Homeland Security” and the other was a member of the group “The Jackson Five.” One vowed on the Constitution that he would protect us from terrorists, and the other let us know, “I want you back.”
A patent granted to Google this June puts that problem into perspective for us by asking how to tell which Michael Jackson is which:
It is frequently useful to know the specific entity to which a document is referring. For example, if the goal is to extract, organize, and summarize information about Michael Jackson (the singer), one will want to look only at documents about Michael Jackson (the singer), and not at documents other Michael Jacksons. The ambiguity of language, of names, and of other common properties makes determining which entity a document is referring to a difficult task. Therefore, what is needed is a method for disambiguating references to entities in a document.
In Finding Entity Names in Google’s Knowledge Graph, I wrote about Google’s Data Janitors, which are tasked with taking data extracted from the web and performing multiple tasks with it, like cleaning it up, or collecting facts related to it. It’s ultimately their task to tell us which Michael Jackson is being referred to when one of them is mentioned on a web site.
One assumption uncovered about this process is that “the number of documents identified as referring to an entity is used to estimate the absolute and/or relative importance of the entity.” The patent quickly follows that statement up by telling us that the contribution of each of those documents might also be weighted by factors such as:
(1) how likely it is that the web page might be referring to that particular entity (some kind of confidence strength), or
(2) some metric of importance of the document itself regardless of the entity, such as “what is the PageRank of the page,” or
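To make the weighting ideas listed above a little more concrete, here is a minimal Python sketch. It is not taken from the patent. The `DocumentMention` class, its field names, and the `entity_importance` function are all hypothetical, and multiplying the two weights together when both flags are set is my own assumption about how they might be combined.

```python
from dataclasses import dataclass


@dataclass
class DocumentMention:
    """One document identified as referring to an entity (hypothetical structure)."""
    url: str
    confidence: float      # assumed: estimated likelihood (0..1) that the page is about the entity
    doc_importance: float  # assumed: page-level importance score, such as a PageRank-like value


def entity_importance(mentions, use_confidence=False, use_doc_importance=False):
    """Estimate entity importance from the documents that refer to it.

    With both flags off this is a plain document count. Each flag
    scales a document's contribution by the matching weight.
    """
    total = 0.0
    for mention in mentions:
        weight = 1.0
        if use_confidence:
            weight *= mention.confidence
        if use_doc_importance:
            weight *= mention.doc_importance
        total += weight
    return total


# Made-up example data, for illustration only.
mentions = [
    DocumentMention("https://example.com/a", confidence=0.5, doc_importance=2.0),
    DocumentMention("https://example.com/b", confidence=1.0, doc_importance=1.0),
]

plain_count = entity_importance(mentions)
by_confidence = entity_importance(mentions, use_confidence=True)
by_page_score = entity_importance(mentions, use_doc_importance=True)
```

In this toy data, a page that is only half likely to be about the entity counts for half a document under confidence weighting, while a page with a high page-level score counts for more under importance weighting.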
This patent describes how Google goes about identifying entities on the Web, and figuring out which Michael Jackson is which:
Finding and disambiguating references to entities on web pages
Invented by Leonardo A. Laroco, Jr., Nikola Jevtic, Nikolai V. Yakovenko, and Jeffrey Reynar
Assigned to Google
US Patent 8,751,498
Granted June 10, 2014
Filed: February 1, 2012
A system and method for disambiguating references to entities in a document. In one embodiment, an iterative process is used to disambiguate references to entities in documents. An initial model is used to identify documents referring to an entity based on features contained in those documents. The occurrence of various features in these documents is measured.
From the number occurrences of features in these documents, a second model is constructed. The second model is used to identify documents referring to the entity based on features contained in the documents.
The process can be repeated, iteratively identifying documents referring to the entity and improving subsequent models based on those identifications. Additional features of the entity can be extracted from documents identified as referring to the entity.
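The iterative process in that abstract can be illustrated with a toy Python sketch. This is only my reading of the general idea, not the patent's actual method. The function names, the sample documents, the threshold, and the scoring rule are all assumptions. In particular, a feature's weight here is how much more often it appears in identified documents than in the rest, which I chose so that the shared name "michael jackson" ends up carrying no weight.

```python
from collections import Counter


def score(features, model):
    """Sum the model weights of the features present in one document."""
    return sum(weight for feature, weight in model.items() if feature in features)


def identify(docs, model, threshold):
    """Return the ids of documents whose score reaches the threshold."""
    return {doc_id for doc_id, feats in docs.items() if score(feats, model) >= threshold}


def build_model(docs, identified):
    """Build a new model from feature occurrences in the identified documents.

    Assumed rule: weight = share of identified docs containing the feature
    minus share of the remaining docs containing it; keep positive weights.
    """
    others = [feats for doc_id, feats in docs.items() if doc_id not in identified]
    counts = Counter()
    for doc_id in identified:
        counts.update(docs[doc_id])  # feature sets, so one count per document
    model = {}
    for feature, count in counts.items():
        in_rate = count / len(identified)
        out_rate = sum(feature in feats for feats in others) / len(others) if others else 0.0
        if in_rate - out_rate > 0:
            model[feature] = in_rate - out_rate
    return model


def disambiguate(docs, initial_model, threshold, max_rounds=5):
    """Alternate between identifying documents and rebuilding the model."""
    model = dict(initial_model)
    identified = identify(docs, model, threshold)
    for _ in range(max_rounds):
        if not identified:
            break
        model = build_model(docs, identified)
        new_identified = identify(docs, model, threshold)
        if new_identified == identified:
            break  # no change, so further rounds would not help
        identified = new_identified
    return identified, model


# Made-up documents, each reduced to a set of features.
DOCS = {
    "d1": {"michael jackson", "thriller", "jackson five", "singer"},
    "d2": {"michael jackson", "jackson five", "i want you back"},
    "d3": {"michael jackson", "i want you back", "motown"},
    "d4": {"michael jackson", "homeland security", "terrorism"},
    "d5": {"michael jackson", "homeland security", "washington"},
}

# Start from a single seed fact about the singer.
singer_docs, singer_model = disambiguate(DOCS, {"jackson five": 1.0}, threshold=0.1)
```

The seed model only matches the two documents that mention "jackson five". The second model learns "i want you back" from those documents, which pulls in the third document, and the model built after that also picks up "motown" as an additional feature of the singer. The two Homeland Security documents share only the name with the singer, so they are never identified.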