Identifying Entity Types and the Transfiguration of Search @Google

The World Wide Web is a vast resource for information. At the same time it is extremely distributed.

A particular type of data such as restaurant lists may be scattered across thousands of independent information sources in many different formats. In this paper, we consider the problem of extracting a relation for such a data type from all of these sources automatically.

We present a technique which exploits the duality between sets of patterns and relations to grow the target relation starting from a small sample. To test our technique we use it to extract a relation of (author, title) pairs from the World Wide Web.

Sergey Brin, Extracting Patterns and Relations from the World Wide Web (pdf), Stanford University, 1999

Torpedo as Art, in the Torpedo Factory in Alexandria
Entities Change – Torpedoes become art and Search Engines become Knowledge Repositories.

One of the early successes of Google’s search is how PageRank influenced the ordering of search results in response to queries. Google Founder Lawrence Page is credited with the invention of PageRank, but around the same time, Sergey Brin was digging into another approach for capturing and indexing content on the Web involving entities and knowledge.

The paper quoted at the start of this post contains substantially the same text as a provisional patent Brin filed with the USPTO in the year 2000, but he ended up rewriting it, and then filing 4 continuation patents of it up until 2012, when he filed the latest version of Information extraction from a database. The abstract from the versions of the patent reads:

Techniques for extracting information from a database are provided. A database such as the Web is searched for occurrences of tuples of information. The occurrences of the tuples of information that were found in the database are analyzed to identify a pattern in which the tuples of information were stored. Additional tuples of information can then be extracted from the database utilizing the pattern. This process can be repeated with the additional tuples of information, if desired.

A screenshot from Sergey Brin’s patent on Information extraction
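
The loop the abstract describes (find known tuples, learn the pattern they appear in, use that pattern to pull out more tuples, repeat) can be sketched in a few lines. This is only a toy illustration, not the method from the paper or the patent: the corpus, the function names, and the idea of treating a "pattern" as just the text between a title and an author are all assumptions made for the example.

```python
import re

def find_patterns(corpus, tuples):
    """Collect the short text found between the title and author of known tuples."""
    patterns = set()
    for author, title in tuples:
        probe = re.compile(re.escape(title) + r"(.{1,20}?)" + re.escape(author))
        for line in corpus:
            match = probe.search(line)
            if match:
                patterns.add(match.group(1))
    return patterns

def apply_patterns(corpus, patterns):
    """Use each learned 'middle' string to extract new (author, title) tuples."""
    found = set()
    for middle in patterns:
        regex = re.compile(r"^(?P<title>.+?)" + re.escape(middle) + r"(?P<author>.+?)$")
        for line in corpus:
            match = regex.match(line)
            if match:
                found.add((match.group("author"), match.group("title")))
    return found

def grow_relation(corpus, seeds, rounds=2):
    """Alternate between learning patterns and extracting tuples."""
    relation = set(seeds)
    for _ in range(rounds):
        patterns = find_patterns(corpus, relation)
        relation |= apply_patterns(corpus, patterns)
    return relation

# Hypothetical one-line "pages" standing in for the Web.
corpus = [
    "The Robots of Dawn, by Isaac Asimov",
    "Foundation, by Isaac Asimov",
    "Dune, by Frank Herbert",
    "Neuromancer (William Gibson)",
]
seeds = {("Isaac Asimov", "Foundation")}
relation = grow_relation(corpus, seeds)

for author, title in sorted(relation):
    print(author, "|", title)
```

Starting from one seed, the sketch learns the ", by " pattern and picks up the other two books written in that format. The fourth line uses a different format that no known tuple appears in, so it is missed, which hints at why a real system would need many seeds and much richer patterns.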

The Transfiguration of Search @Google

Signs are there that this is the direction that search and SEO are headed in, where the Semantic Web plays a larger and larger role. I’ve been writing for years about topics such as Google’s acquisition of Metaweb and Freebase, and more recently about how Google might use the Web as a database to identify and disambiguate different entities with the same names, in Finding Entities: A Tale of Two Michael Jacksons.

Google is going to be changing how they collect and distribute information on the Web in many ways. Think of it as an evolution, though. I’m not saying that SEO is dead, but I am saying that Sergey Brin’s vision of how information from the Web is extracted, in tuples that indicate relationships, might become more commonplace as Google gets better at it.

And Google has been getting better at it.

Last Monday, I asked the question which name do you prefer, Semantic Search or Semantic SEO?, while sharing links to tutorial presentations on Semantic SEO that both Barbara Starr and I had put together as an introduction to the topic, after Barbara asked me if I would be interested in joining her on the presentation.

I’m glad I did – it gave me a chance to hear from a lot of other people working on the Semantic Web, including search engineers and people offering services and tools in the field. It gave me a chance to hear Barbara’s thoughts (highly recommended). There were a lot of people from the search engines, and they were saying some interesting things. And there were a few signs of things to come at the 10th anniversary of the Semantic Technology and Business Conference.

If Google went from being a search engine to a knowledge engine tomorrow, would it surprise you?

Entity Type Assignments

We know how difficult it can be for a search engine to identify entities on the Web and recognize different entities that share a name, even in the absence of markup such as schema on pages. Another challenge that faces a search engine or knowledge base is in defining an entity by assigning it an entity “type”.

An entity type assigned to an entity can tell us about the attributes or properties, and the values associated with them, that help us learn more about them. For example, for an actor, you might want to collect information about roles and characters played in Movies and on TV and on Stage. You might want to collect additional features, such as other actors involved in a production with them, where it was performed, and what type of medium it was performed in. Who were the people they starred with? What awards might they have won?

Entities with types and the properties that go with those.
When types are defined for entities, those determine what attributes are collected for each.

Entity type assignments also imply other information. When an entity has the type “spouse,” we automatically understand that there is another entity related to it: the person they are a spouse of. If an entity is an author, there are related entities that are books or articles or blog posts or screenplays.

Google was granted a patent in 2011 that describes the assignment of entity types from models about those entities, with facts about them placed in fact repositories, and using these models to assign entity types to objects of unknown entity types in the fact repository.

Information extraction may be used to automatically identify and extract information in the form of facts, and can be performed on a variety of sources such as web pages to extract factual data.

The patent is:

Entity type assignment
Invented by Farhan Shamsi, Alex Kehlenbeck, David Vespe, and Nemanja Petrovic
Assigned to Google
US Patent 7,970,766
Granted June 28, 2011
Filed: July 23, 2007

Abstract

A repository contains objects including facts about entities. Objects may be of known or unknown entity type. An entity type assignment engine assigns entity types to objects of unknown entity type. A feature generation module generates a set of features describing the facts included with each object in the repository. An entity type model module generates an entity type model based on the sets of features generated for a subset of objects. An entity type model module generates entity type models, such as a classifier or generative models, based on the sets of features associated with objects of known entity type.

An entity type assignment module generates a value based on the sets of features associated with an object of unknown entity type and the entity type model. This value indicates whether the object of unknown entity type is of a known entity type. An object update module stores the object to which the known entity type was assigned in the repository in association with the assigned entity type.
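
To make the moving parts in that abstract a little more concrete, here is a minimal sketch of the same pipeline: generate features from an object's facts, build a model per known entity type, and score an object of unknown type against those models. The patent mentions classifiers or generative models without this level of detail, so the attribute-frequency scoring, the threshold, the function names, and the sample objects below are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def features(obj):
    """Feature generation: here, just the set of attribute names in the facts."""
    return {attribute for attribute, _value in obj["facts"]}

def build_models(known_objects):
    """For each known type, record how often each attribute appears (0.0 to 1.0)."""
    counts = defaultdict(Counter)
    totals = Counter()
    for obj in known_objects:
        totals[obj["type"]] += 1
        counts[obj["type"]].update(features(obj))
    return {
        entity_type: {attr: n / totals[entity_type] for attr, n in counts[entity_type].items()}
        for entity_type in totals
    }

def score(obj, model):
    """Average, over the object's attributes, of how typical each is for the type."""
    feats = features(obj)
    if not feats:
        return 0.0
    return sum(model.get(attr, 0.0) for attr in feats) / len(feats)

def assign_type(obj, models, threshold=0.5):
    """Return the best-scoring type, or None when the evidence is too weak."""
    best_type, best_score = max(
        ((entity_type, score(obj, model)) for entity_type, model in models.items()),
        key=lambda pair: pair[1],
    )
    return best_type if best_score >= threshold else None

# Hypothetical objects of known entity type.
known = [
    {"type": "person", "facts": [("date of birth", "1732"), ("spouse", "Martha")]},
    {"type": "person", "facts": [("date of birth", "1947"), ("occupation", "actor")]},
    {"type": "book", "facts": [("author", "Frank Herbert"), ("isbn", "0441013597")]},
    {"type": "book", "facts": [("author", "Isaac Asimov"), ("publication date", "1951")]},
]
models = build_models(known)

unknown_book = {"facts": [("author", "William Gibson"), ("isbn", "0441569595")]}
unknown_other = {"facts": [("height", "324 m")]}

print(assign_type(unknown_book, models))   # book
print(assign_type(unknown_other, models))  # None
```

The second object shares no attributes with any known type, so the sketch declines to assign one rather than guess.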

While this patent isn’t new, it’s important in its description of how Google collects information about entities it finds on the Web and tries to understand them. The growth of a collection of facts about entities into a knowledge base requires that those objects and facts be defined in a consistent manner.

I blogged about this patent because it felt like time that I did after my “Two Michael Jacksons” post. I also wanted you to see the words “information extraction” in relation to entities on the Web, as opposed to the “crawling” of Web pages, since that’s part of the evolution of SEO.

Entity Type Identification Problems

Sometimes entity type information might be unavailable on pages. Don’t let this happen to you and your pages.

Sometimes an entity might share a name with another entity, such as when the musician Peter Gabriel named his first four albums “Peter Gabriel.”

If you name your first four collections of music after yourself, it can be a little ambiguous which one people are writing about when they mention them.

Listeners might be confused, as well as readers, as well as search engines that may be trying to convert information on web pages into “Knowledge.”

Some Facts about Entities, Entity Type Assignments and their Fact Repositories

A repository includes facts, and each fact includes a unique identifier for that fact, such as a fact ID. The “object” that a fact has may be given a unique Object ID as well.

A fact about an entity includes at least an attribute and a value. For example, a fact associated with an object representing George Washington may include an attribute of “date of birth” and a value of “February 22, 1732.”

As described above, each fact is associated with an object ID that identifies the object that the fact describes. Thus, each fact that is associated with a same entity (such as George Washington), will have the same object ID.
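
As a rough picture of that structure, here is a small sketch of a fact repository in which every fact gets its own fact ID and all facts about the same entity share an object ID. The class names, the ID formats, and the in-memory storage are assumptions for the example; the patent does not describe the repository at this level.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class Fact:
    fact_id: int      # unique per fact
    object_id: str    # shared by every fact about the same entity
    attribute: str
    value: str

class FactRepository:
    """Toy in-memory repository that groups facts by the object they describe."""

    def __init__(self):
        self._next_id = count(1)
        self._by_object = defaultdict(list)

    def add(self, object_id, attribute, value):
        fact = Fact(next(self._next_id), object_id, attribute, value)
        self._by_object[object_id].append(fact)
        return fact

    def facts_for(self, object_id):
        return list(self._by_object.get(object_id, []))

repo = FactRepository()
# "obj-george-washington" is a made-up object ID.
repo.add("obj-george-washington", "name", "George Washington")
repo.add("obj-george-washington", "date of birth", "February 22, 1732")

for fact in repo.facts_for("obj-george-washington"):
    print(fact.fact_id, fact.attribute, "=", fact.value)
```

Looking up an object ID returns every fact recorded for that entity, which is what lets separate facts be understood as describing the same thing.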

The number of facts associated with an object may be unlimited – there could be hundreds.

Values associated with Facts, such as a fact about the Economy of China, could contain a lot of information.

In addition to facts about specific objects or entities, the fact repository may contain facts about the representation of the fact on the Web itself such as:

  • Language used to state the fact (English, etc.)
  • Importance of the fact
  • Source of the fact
  • Confidence value for the fact
  • And so on

This entity type assignment engine attempts to improve the quality of knowledge contained within the fact repository by assigning entity types to objects with unknown entity types.

An entity can have more than one entity type (multiple non-conflicting entity types) when they hold more than one role or did so over different periods of time. For example, Arnold Schwarzenegger was an Actor Entity Type, a Politician Entity type, and a Weight Lifter entity type, and could have done all of those at the same time, but didn’t usually present himself as a hybrid of all three entity types.

Entity models may be built for entities once, but are usually rebuilt over some time period.

Some entity types are binary in nature, as in an entity is of a certain type, or it is not. Fido might be a dog, but the alternative as a binary entity type is a “non-dog.” A multiclass entity is also possible, and a person might be two different entity types at the same time, such as Arnold Schwarzenegger being a business person and an actor simultaneously.

The patent describes these assignments of entity types as either unsupervised or semi-supervised approaches. If the facts used to assign an entity type are weak or conflict in some way, an assignment might not be made.
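
One way to picture the difference between a binary decision per type and an entity holding several types at once is to treat each type as its own yes/no test. The sketch below assumes some earlier step has already produced a score per type; the scores, the threshold, and the function name are made up for illustration and are not taken from the patent.

```python
def assign_types(scores, threshold=0.7):
    """Make a separate yes/no decision for each entity type."""
    return {entity_type for entity_type, value in scores.items() if value >= threshold}

# Hypothetical scores from some upstream entity type model.
arnold = {"actor": 0.95, "politician": 0.90, "dog": 0.01}
weak_evidence = {"actor": 0.40, "politician": 0.45}

print(sorted(assign_types(arnold)))   # ['actor', 'politician']
print(assign_types(weak_evidence))    # set()
```

With strong evidence for two types, both are assigned. With weak or conflicting evidence, nothing clears the threshold and no assignment is made.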

The patent provides more details and a number of additional examples.

This patent is a couple of years old, and while it describes an automated way to assign an entity type, there are methods from places such as Schema.org where site owners could use markup to assign entity types, or places such as freebase.com where a person could assign one.

Or approaches such as the Open Language Extraction developed by Wavii or other extraction approaches may get involved.


5 thoughts on “Identifying Entity Types and the Transfiguration of Search @Google”

  1. Bill, forgive me if this is too simplistic a question.

    Seems Google knows where they want to go with search. What’s holding them back?

    Do they have it figured out and are slowly transitioning?
    Are there conceptual issues that still need to be worked out?
    Is it a machine problem (ie computational power)?
    Or is it something else?

  2. Hi Bill

    My post shows that the two founders of Google had different ideas about where they wanted to go with search – one focusing upon ranking pages (Lawrence Page) and the other identifying and locating objects (Sergey Brin).

    There are many other search engineers at Google that have different ideas as well. Anna Lynne Patterson, a VP of Search at Google for many years and the builder of one of the largest search engines ever, with Recall, has developed a phrase-based index approach that has a number of differences as well. There are other search engineers that have developed their own approaches, such as Ramanathan Guha who started Google Custom Search engines and was a founder of Schema.org.

    If you asked most search engineers, I bet they would tell you that search is hard. How much effort does it take to get to something like the computer in Star Trek that Amit Singhal has professed he likes so much? Likely a lot of thought and a lot of sweat may get us there, but I suspect that search has decades and generations to go.

    There’s definitely a computational problem. Computers keep on getting better, leading to the possibility to do more with them.

    How much is “good enough” when the technology keeps getting better and the math does too?

    Who would have conceived of things such as personal assistants and predictive algorithms such as Siri, except maybe for science fiction writers?

  3. I’m intrigued with the idea of Google as a “knowledge engine.” Isn’t that what searchers really want? I mean, answers or knowledge. But how will Google be different than it is now? I assume the results will be accurate, more informed and helpful. Will the knowledge be more factual using these ids and entities? I suppose with the confidence value and source being part of the calculation that will help.

  4. We are trying to use attribute extraction in eCommerce from product description which is a very defined/limited problem. But given the way people enter product details esp. in ebay and others, you start getting attributes that don’t make any sense.

    We are getting too many false positives.

    It’s interesting to see this approach. Will try it out!

  5. Hi Bill,
    Brilliant post I can see how this will make the web better when Google knows what is fact and can use that to make the end-user experience much better.

    In some instances it will give legitimacy to what might be considered duplicate content because it is now considered fact and things like

    When you mention Peter Gabriel naming his first four albums after himself.

    I thought of George Foreman naming all of his male children George and one girl Georgette take a look at what occurred when I googled “George Foreman’s Children’s names”

    So anyone writing content on George Foreman’s family or children would be pushed down in the rankings if they gave incorrect information in addition to not being punished for duplicate content if what they are stating is fact and includes of course original content goes without saying..

    ( Not talking about anything more than the facts)

    Anyway I think I am onto something however I would like to ask your permission prior to completing and posting a theory I have because of what I have read here and because I do not want to do anything that would offend you or be considered the wrong thing to do.

    All I have written about an example of what you are talking about.
    You can see a photograph of George Foreman if type into Google “George Foreman’s Children’s names”
    compared to what you get for facts regarding people and some of the big differences that are yet to be as all-encompassing as what you will see here.

    http://tomzickell.com/reading-identity-types-on-seo-by-the-sea/

    If you have any issues with it please let me know and I will take it down immediately.

    Respectfully,
    Thomas
