Google on Finding Entities: A Tale of Two Michael Jacksons

I’ve been saying for at least a couple of years that Google’s local search is a proof of concept for how the search giant can find and understand entities.

With local search, Google goes out and looks for mentions of a business on the Web, especially when they are accompanied by geographic location information. It collects facts related to businesses (entities are people, places, and things) and then clusters the information it finds to make sure that those mentions across the Web all refer to the same place.

If you start reading about local search, you’ll see people referring to the importance of consistency in how you present address information for a business, and the same thing is true for entities.

Two Different Michael Jacksons

When there are two well-known people with the same name, such as head of Homeland Security Michael Jackson and pop star Michael Jackson, things could get a little confusing.

That is, until you start looking at the facts associated with each mention. One was a member of the group “Homeland Security” and the other was a member of the group “The Jackson Five.” One vowed to the constitution that he would protect us from terrorists, and the other let us know, “I want you back.”

A patent granted to Google this June puts that confusion into perspective by framing it as the problem of understanding which Michael Jackson is which:

It is frequently useful to know the specific entity to which a document is referring. For example, if the goal is to extract, organize, and summarize information about Michael Jackson (the singer), one will want to look only at documents about Michael Jackson (the singer), and not at documents other Michael Jacksons. The ambiguity of language, of names, and of other common properties makes determining which entity a document is referring to a difficult task. Therefore, what is needed is a method for disambiguating references to entities in a document.

In Finding Entity Names in Google’s Knowledge Graph, I wrote about Google’s Data Janitors, which are tasked with taking data extracted from the web and performing multiple tasks with it, like cleaning it up, or collecting facts related to it. It’s ultimately their task to tell us which Michael Jackson is being referred to when one of them is mentioned on a web site.

One assumption uncovered about this process is that “the number of documents identified as referring to an entity is used to estimate the absolute and/or relative importance of the entity.” The patent quickly follows that statement up by telling us that the importance of each of those documents might also be judged by factors such as:

(1) how likely it is that the web page might be referring to that particular entity (some kind of confidence strength), or

(2) some metric of importance of the document itself regardless of the entity, such as “what is the PageRank of the page,” or

(3) both.
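
To make those three options a little more concrete, here is a minimal sketch in Python. The patent does not give a formula, so everything here is an assumption on my part: the `DocumentMatch` fields, the `mode` names, and especially the choice to multiply confidence by document importance for the “both” case are hypothetical and only meant to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class DocumentMatch:
    url: str
    confidence: float      # hypothetical: how likely the page refers to this entity (0..1)
    doc_importance: float  # hypothetical: a PageRank-like score for the page itself


def entity_importance(matches, mode="both"):
    """Estimate an entity's importance from the documents matched to it.

    mode="count"          -> plain number of matching documents
    mode="confidence"     -> option (1): weight each document by match confidence
    mode="doc_importance" -> option (2): weight each document by its own importance
    mode="both"           -> option (3): assumed here to be the product of the two
    """
    total = 0.0
    for m in matches:
        if mode == "count":
            total += 1.0
        elif mode == "confidence":
            total += m.confidence
        elif mode == "doc_importance":
            total += m.doc_importance
        elif mode == "both":
            total += m.confidence * m.doc_importance
        else:
            raise ValueError(f"unknown mode: {mode}")
    return total


# Made-up numbers, purely for illustration.
singer_pages = [
    DocumentMatch("https://example.com/a", confidence=0.5, doc_importance=2.0),
    DocumentMatch("https://example.com/b", confidence=1.0, doc_importance=4.0),
]
print(entity_importance(singer_pages, mode="both"))
```

Comparing the totals for two entities that share a name would give a relative importance, which is the kind of estimate the quoted passage describes.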

When Google acquired MetaWeb’s Freebase, it was estimated to contain information about 12 million entities. The last I heard, that number is up over 250 million.

This patent describes how Google goes about identifying entities on the Web, and figuring out which Michael Jackson is which:

Finding and disambiguating references to entities on web pages
Invented by Leonardo A. Laroco, Jr., Nikola Jevtic, Nikolai V. Yakovenko, and Jeffrey Reynar
Assigned to Google
US Patent 8,751,498
Granted June 10, 2014
Filed: February 1, 2012

Abstract

A system and method for disambiguating references to entities in a document. In one embodiment, an iterative process is used to disambiguate references to entities in documents. An initial model is used to identify documents referring to an entity based on features contained in those documents. The occurrence of various features in these documents is measured.

From the number occurrences of features in these documents, a second model is constructed. The second model is used to identify documents referring to the entity based on features contained in the documents.

The process can be repeated, iteratively identifying documents referring to the entity and improving subsequent models based on those identifications. Additional features of the entity can be extracted from documents identified as referring to the entity.
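
The loop in that abstract can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the patent’s actual method: documents are reduced to sets of feature strings, a “model” is just a dictionary of feature weights, and the function names, threshold, and sample data are all hypothetical. Each document is assumed to already mention the ambiguous name, so the name itself is left out of the features.

```python
from collections import Counter


def identify(docs, model, threshold):
    """Return ids of documents whose summed feature weights meet the threshold."""
    return {
        doc_id
        for doc_id, features in docs.items()
        if sum(model.get(f, 0.0) for f in features) >= threshold
    }


def build_model(docs, matched_ids):
    """Weight each feature by the share of matched documents that contain it."""
    counts = Counter(f for doc_id in matched_ids for f in docs[doc_id])
    n = len(matched_ids)
    return {f: c / n for f, c in counts.items()} if n else {}


def disambiguate(docs, initial_model, threshold=0.5, rounds=3):
    """Alternate between identifying documents and rebuilding the model."""
    model = dict(initial_model)
    matched = set()
    for _ in range(rounds):
        new_matched = identify(docs, model, threshold)
        if not new_matched or new_matched == matched:
            break  # nothing found, or nothing changed since the last round
        matched = new_matched
        model = build_model(docs, matched)
    return matched, model


# Hypothetical pages that all mention "Michael Jackson".
docs = {
    "d1": {"thriller", "jackson five"},
    "d2": {"jackson five", "i want you back"},
    "d3": {"thriller", "moonwalk"},
    "d4": {"homeland security", "terrorism"},
}
matched, model = disambiguate(docs, {"jackson five": 1.0})
print(sorted(matched))
```

Starting from a single seed feature, the first pass finds the two “jackson five” pages, the rebuilt model learns “thriller” from them, and the second pass picks up the third page, which in turn contributes “moonwalk” as a newly learned feature of the singer. The Homeland Security page never shares a feature with the growing model, so it stays out.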

8 thoughts on “Google on Finding Entities: A Tale of Two Michael Jacksons”

  1. And the million dollar question is: “How many of Google’s 300+ ranking signals are the certainty scores for things like the disambiguation?”

    I think that in any of the many elements of search where Google has to deal with probabilities, the level of certainty Google is able to assign to such probabilities is highly likely to be factored into ranking somewhere, even if only a very minor factor.

    We know that in many searches, Google wants to provide a range of options, allowing the searcher to apply disambiguation and choice that the query could not indicate. You can only provide a range of Michael Jackson pages about a range of Michael Jacksons if you can be reasonably certain that they are *not* all talking about the same one.

  2. Thanks, Ammon.

    I suspect that we are in a transition stage at Google, where many of the signals we’ve been seeing for years belong to one type of ranking that is slowly being replaced by another, one that uses knowledge base results, searches through data about attributes for entities, and takes more of a semantic approach to better understand queries.

    We see that Google Maps uses one type of ranking of results that is different from organic search rankings, and it might be best to say that things like disambiguation of entities follow a similar path to Maps, with an evolving approach to deciding what to display to searchers.

  3. Have a look at my blog post “Information extraction, yes … but the right way !” (http://goo.gl/R8MDWg).
    I describe the way I have been doing entity extraction, or better, information extraction (not only entities), for 10 years.

    Not patented, but used.

    A multi-lingual semantic network (somewhat comparable to Google’s Knowledge Graph, but much older) is used at the same time as reference for previously extracted facts and a receptor of new facts.

  4. Thanks Bill –

    I did not know that about Google’s Data Janitors. It is good to know that manual action is being taken in a lot of these cases. I see this quite a bit in my own searches (can’t think of it, but I had a good example recently)

    This is another reason why citations are so important in local search. People think it is all about the links or authority, but really it is just about creating occurrences of a name alongside the phone number, address, and other relevant info like tags and meta info.

    I am personally fighting this battle against a deceased cricket player with my same first and last name :) I gotta say, out of all the blogging and linking I’ve done, getting included in freebase.com is really what sealed the deal. In the last few months for local business clients Freebase inclusion is almost one of the first tasks we do.

    Sorry if the comment was scattered; I wrote this all on mobile while waiting for my car to get repaired. Also, by the way, your site looks awesome on mobile!

    Patrick

  5. Good stuff as always B.

    As we have discussed, I think G are being too ambitious – in many ways :-)

    Understanding any text corpus or collection is very hard from many perspectives. But there are particular challenges with entity extraction, such as geo location. We have a Michael Jackson here in the UK, an ex-army general – so is he less or more – what happens when he goes to a Michael Jackson memorial concert (unlikely). The issue I have with any central entity base is that they are not properly localised and the data points are non-equalised input – they therefore have their own inherent bias fault. So if the entity base is not weighted properly by geography, then it too becomes useless. The query is in danger of being homogenous.

    There is another problem with the disambiguation of entities, this being the dimensions of nodes. The deeper you go in the node level, the more overlap you get and the less the disambiguation works. In other words, with all the combinations and permutations, there is a large chaos factorial. So Freebase is a good start, but it is probable that you need a multi-dimensional series of curated entity bases in order to checksum the decision.
