Extracting Facts for Entities from Sources such as Wikipedia Titles and Infoboxes

There are a number of patents from Google, both granted patents and pending applications, that describe ways Google might learn about entities and the facts associated with them by extracting information from the Web itself, instead of relying upon people submitting information to knowledge bases such as Freebase.

We learned from Google’s recent announcement that they would be replacing the Google Knowledge Base with their Knowledge Vault, which supposedly brings with it a whole new set of extraction approaches, along with confidence levels indicating how accurate the extracted facts might be.

It’s hard to tell exactly which approaches Google might be relying upon, and which ones Google might have introduced in a patent but is no longer using. But it doesn’t hurt to learn some of the history and some of the approaches that might have been used in the past.

I’m blogging about a patent today that describes an approach many of us have assumed Google has been using for years to identify objects or entities, the attributes associated with them, and the values that fit those attributes.

Contextual Patterns – Titles and Infoboxes

Many sites follow certain practices that make it easy to learn about objects and facts from them. One example is Wikipedia, which tends to follow a specific pattern in how it titles pages. For example, the template that Wikipedia pages are based upon uses a pattern for titles such as:

[SUBJECT] – Wikipedia, the free encyclopedia

There is a structure to these pages that makes it easy to learn what they are about, which happens to fit what their titles say they are about. Wikipedia isn’t the only site like this, and you can see other sites doing something very similar.

Here is the title for the George Washington page at Wikipedia:

George Washington – Wikipedia, the free encyclopedia

On the Wikipedia “Disambiguation” page, some other people with similar names are linked to, and the titles for those pages follow a similar pattern. Having a Wikipedia disambiguation page helps Google tell entities apart when there might be confusion regarding who they are. Here are some titles for the other pages (and entities):

George Corbin Washington – Wikipedia, the free encyclopedia
George Washington Carver – Wikipedia, the free encyclopedia
George Washington (inventor) – Wikipedia, the free encyclopedia
George Washington (Washington pioneer) – Wikipedia, the free encyclopedia
George T. Washington (Liberia) – Wikipedia, the free encyclopedia
George Thomas Washington – Wikipedia, the free encyclopedia
George Washington (baseball) – Wikipedia, the free encyclopedia
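As a rough illustration (not from the patent), a title pattern like this can be captured with a regular expression that pulls the subject out of each title:

```python
import re

# A sketch of the Wikipedia title pattern "[SUBJECT] - Wikipedia, the free
# encyclopedia" as a regular expression; the named group captures the subject.
TITLE_PATTERN = re.compile(
    r"^(?P<subject>.+?)\s*[–-]\s*Wikipedia, the free encyclopedia$"
)

titles = [
    "George Washington – Wikipedia, the free encyclopedia",
    "George Washington Carver – Wikipedia, the free encyclopedia",
    "George Washington (inventor) – Wikipedia, the free encyclopedia",
]

for title in titles:
    match = TITLE_PATTERN.match(title)
    if match:
        print(match.group("subject"))
```

Every title in the list above yields its subject this way, which is exactly what makes a consistent title template so useful for extraction.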

Wikipedia pages often have tables or infoboxes that contain attributes related to the objects they are about, which consist of a specific label and a value. Here’s the one for George Washington:

Wikipedia infobox with facts for George Washington.
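As a sketch of what an infobox importer might do, label/value pairs can be read out of table rows. The markup here is simplified and hypothetical; real Wikipedia HTML is far messier, but the label/value idea is the same:

```python
import re

# Simplified, hypothetical infobox markup (real Wikipedia HTML is more complex)
infobox_html = """
<table class="infobox">
  <tr><th>Born</th><td>February 22, 1732</td></tr>
  <tr><th>Died</th><td>December 14, 1799</td></tr>
  <tr><th>Spouse</th><td>Martha Dandridge Custis</td></tr>
</table>
"""

# Each row pairs a label (the attribute) with a value
ROW = re.compile(r"<tr><th>(.*?)</th><td>(.*?)</td></tr>")
facts = {attribute: value for attribute, value in ROW.findall(infobox_html)}
print(facts["Born"])  # February 22, 1732
```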

Here’s the patent

Learning objects and facts from documents
Invented by Shubin Zhao
Assigned to Google
US Patent 8,812,435
Granted August 19, 2014
Filed November 16, 2007

Abstract

A system, method, and computer program product for learning objects and facts from documents. A source object and a source document are selected and a title pattern and a contextual pattern are identified based on the source object and the source document. A set of documents matching the title pattern and the contextual pattern are selected.

For each document in the selected set, a name and one or more facts are identified by applying the title pattern and the contextual pattern to the document. Objects are identified or created based on the identified names and associated with the identified facts.
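Reading the abstract as a loop, the key step is learning a pattern from one known object and its source document, then applying that pattern across other documents to discover new objects. A minimal sketch, with all helper names my own rather than the patent’s:

```python
import re

def derive_title_pattern(source_name, source_title):
    """Turn a known (object name, document title) pair into a reusable pattern
    by replacing the known name with a capture group."""
    prefix, suffix = source_title.split(source_name, 1)
    return re.compile(
        "^" + re.escape(prefix) + r"(?P<name>.+?)" + re.escape(suffix) + "$"
    )

# Learn the pattern from one source object and its source document's title
pattern = derive_title_pattern(
    "George Washington",
    "George Washington - Wikipedia, the free encyclopedia",
)

# Apply the learned pattern to other documents to find new objects
corpus = [
    "George Washington Carver - Wikipedia, the free encyclopedia",
    "Some unrelated page title",
]

for title in corpus:
    match = pattern.match(title)
    if match:
        print(match.group("name"))  # a newly discovered object name
```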

Components and Terms Used in this Extraction Process

Importers – Information from Wikipedia (and other pages) is grabbed by an importer, which reads the content of pages, determines the subjects (the entity or entities) they cover, and extracts facts from them as “individual items of data.”

Janitors – There are a number of different types of janitors that may perform different functions, but all act to process facts extracted by the importer, including data cleansing, object merging, and fact induction. Correcting spelling and grammar, translation, normalizing formats, removing duplicate facts, removing unwanted facts, and so on are among the tasks that different janitors perform.

Build engine – Builds and manages the repository.

Service engine – An interface for querying the repository. It processes queries, scores matching objects, and returns them to searchers asking for the information.

Fact repository – Stores factual information about entities. Each entity, or object, is a real-world or fictional person, place, or thing. The repository associates each fact with exactly one object; any number of facts may be associated with an individual object by including that object’s ID in the facts.

Attributes and Values – Facts associated with specific entities pair an attribute with a value. For George Washington, we have a “Date of Birth” attribute with a value of “Feb. 22, 1732.”

Tuples – The data structure of a fact might be represented by a tuple of information that includes a fact ID, an attribute, a value, and an object ID. It may actually include more information, such as a source of the fact on the Web, a language that it is stored in, and more.
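A minimal sketch of such a tuple (the field names are my own, not the patent’s):

```python
from typing import NamedTuple, Optional

class Fact(NamedTuple):
    # Core tuple described in the patent: fact ID, attribute, value, object ID
    fact_id: int
    attribute: str
    value: str
    object_id: int
    # Additional information a fact may carry, per the patent
    source_url: Optional[str] = None
    language: Optional[str] = None

fact = Fact(
    fact_id=1,
    attribute="Date of Birth",
    value="Feb. 22, 1732",
    object_id=42,  # hypothetical object ID for George Washington
    source_url="https://en.wikipedia.org/wiki/George_Washington",
    language="en",
)
print(fact.attribute, "->", fact.value)
```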

Metrics

The patent tells us that a couple of metrics, or indications of the quality of the facts, might also be associated and included with a fact. These include both a confidence level and an importance level. The confidence level indicates how likely it is that a fact is true, and the importance level tells how important the fact is to the object, or “how vital a fact is to an understanding of the entity associated with the object.”

A fact also includes a list of the sources from which it was extracted, identified by URL (Web address) or any other appropriate form of identification and/or location, such as a unique document identifier.

In addition, the information associated with the fact may also include the agent type of the importer that extracted the fact. So, this agent might be one that only imports facts from Wikipedia or the IMDB or some other site that may be used as a source of facts:

The facts illustrated in FIG. 2(d) include an agent field that identifies the importer that extracted the fact. For example, the importer may be a specialized importer that extracts facts from a specific source (e.g., the pages of a particular web site, or family of web sites) or type of source (e.g., web pages that present factual information in tabular form), or an importer that extracts facts from free text in documents throughout the Web, and so forth.
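Putting the metrics and provenance together, a stored fact record might look something like this (a sketch; the field names and values are assumptions, not from the patent):

```python
# Hypothetical fact record combining attribute/value with quality metrics,
# sources, and the agent type of the importer that extracted it
fact_record = {
    "object_id": 42,                 # hypothetical ID for George Washington
    "attribute": "Date of Birth",
    "value": "Feb. 22, 1732",
    "confidence": 0.98,              # how likely it is that the fact is true
    "importance": 0.85,              # how vital the fact is to the entity
    "sources": [
        "https://en.wikipedia.org/wiki/George_Washington",
    ],
    "agent": "wikipedia-infobox-importer",  # hypothetical importer agent type
}

# A janitor-style filter might drop low-confidence facts before storage
keep = fact_record["confidence"] >= 0.9
print(keep)  # True
```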

Name Facts and Property Facts

These are more specialized facts, with a name fact being one that conveys a name for the entity. For example, name facts for the United States Patent and Trademark Office might be “PTO” and “USPTO,” as well as the official name, “United States Patent and Trademark Office.” One might be designated as the primary name, with the others as secondary names. They may also be called synonymous names of the object.
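One simple way to picture this (a sketch, not the patent’s exact encoding) is as facts whose attribute is “name,” with one of them flagged as primary:

```python
# Hypothetical encoding of name facts for one object (the USPTO)
name_facts = [
    {"object_id": 7, "attribute": "name",
     "value": "United States Patent and Trademark Office", "primary": True},
    {"object_id": 7, "attribute": "name", "value": "USPTO", "primary": False},
    {"object_id": 7, "attribute": "name", "value": "PTO", "primary": False},
]

# The primary name is the official one; the rest are synonymous names
primary = next(f["value"] for f in name_facts if f["primary"])
synonyms = [f["value"] for f in name_facts if not f["primary"]]
print(primary)   # United States Patent and Trademark Office
print(synonyms)  # ['USPTO', 'PTO']
```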

Property facts generally provide summary information about an object, such as “Bill Clinton was the 42nd President of the United States from 1993 to 2001.”

Objects may also have additional special facts aside from name facts and property facts, such as facts conveying a type or category (for example, person, place, movie, actor, or organization) for categorizing the entity associated with the object. In Identifying Entity Types and the Transfiguration of Search @Google, I showed how Google might identify what type of entity an entity might be, based upon the range of facts available for it. An actor might have facts about movies acted in, television shows performed in, and plays appeared in. An athlete might be of a type that includes facts for statistical records in a certain sport, and other facts that tend to fit that type.

Takeaways

Google will give priority to documents from reputable websites, such as that of the Encyclopedia Britannica Online, when selecting a source document.

Or, if a certain type of fact is wanted for display in something like a knowledge panel (this example does not appear in the patent), the search engine might search for the entity and words representing the fact. It may attempt to find information using criteria such as “whether the object name matches document titles and whether the rest of the search terms match document contents.”

I will be writing about some other patents that were called “related” patents by the USPTO, and some of the concepts I cover will be similar.


12 thoughts on “Extracting Facts for Entities from Sources such as Wikipedia Titles and Infoboxes”

  1. Very interesting Bill – It is getting to the point where the only information regarding the future of search & SEO that I believe is in the patents. I mean, this really makes sense.

    The parts I found most interesting were the confidence and importance factors. Do you think if this were applied to an algorithm adjustment via Knowledge Vault sites will be given a confidence score / importance score?

    Also do you think all of this strictly relates to the Knowledge Graph or will they be using this for the main search algorithm scoring as well? Very cool.

  2. Hi Bill,

    Need some more time to configure all that in my mind.

    But helpful :)

  3. This sounds absolutely great! Thank you very much for this input.

    Do you think that this logic differs in different markets? I mean, do the knowledge bases differ in e.g. Germany?

    Heinz

  4. Hi Patrick,

    Good questions – thanks.

    I’m really happy that I started spending a lot of time with patents beginning in 2005 or so – they’ve been a tremendous source of information about how search and search engines work, and provide a lot of insights into assumptions that search engineers hold about searchers, and search, and the Web.

    The papers and presentations that I’ve seen on Google’s knowledge vault all discuss improved confidence levels to the point where attaining those seems like one of the primary reasons why Google made the change. The whole fusion process seems to be about trying to use the best information available, and merging it in a way that helps to surface the best, most correct, and most complete information about entities/objects.

    It’s hard to say how much of what is learned putting together the Google Vault will bleed over into web search scoring, but there are aspects of it that just have to, such as getting better about understanding the intent behind searches, and using query and click log and contextual searcher information to better understand what is being asked, and how to attempt to answer it.

  5. Hi Christopher,

    I’ve been taking a lot of long walks lately, trying to fit all the pieces together myself, and there are a lot of them.

    I’m working right now on another patent from the same inventor as this one, and as I scrolled into the patent, I was met with a list of 17 more patents that were “related” to that one. I went to the USPTO PAIR database, which you can use to learn more about the pending status of a patent, and it looks like that one is just about ready to be granted (I’m guessing tonight).

    From just the title, it sounds like it takes things a couple of steps further. We’ll get to it when we do. :)

  6. Hi Heinz,

    You’re welcome. I suspect that a lot of the processes behind how Google works in Germany and in the US are very similar, though part of the knowledge base from Google seems to have included data about what people search for and what people click upon in search results. It would make sense to use data from searchers in Germany for the German Knowledge Base, and searchers from the US for the US Knowledge Base.

    References have been made about the functions of some data janitors being translation, though I have no idea how much of that might go on. I suspect that it’s possible that some people in Germany perform queries where most of the information available might be in another language, and the same for searchers in the US, too.

  7. Hi Bill,

    Your recent posts highlighting Google’s interest in entities have caught my interest, and looking at their research papers as well shows just how serious they are about developing this area.

    From a cursory look through some of the more recent Data Mining papers they are publishing at http://research.google.com/pubs/DataMining.html, I can already spot a few from this year that could be applicable, including one using probabilities to score the confidence of facts, similar to what you have discussed.

  8. Hi Chris,

    I’m working my way there, kind of slowly, but there’s enough interesting stuff along the way that I’m trying not to cover this stuff too quickly and miss something because of it. Thanks for sharing the link so that anyone who wants can skip ahead to those papers.

    I’ve also seen a number of interesting looking papers cited as references in the patents as well, and I may start sharing those either here or on my Google+ page. I suspect that I’ll probably do that here.

  9. Fascinating stuff, Bill – thanks!

    Is Google really replacing the Knowledge Graph with the Knowledge Vault, though?

    The first table in the watershed paper from Dong et al. carries this caption: “Comparison of knowledge bases. KV, DeepDive, NELL, and PROSPERA rely solely on extraction,
    Freebase and KG rely on human curation and structured sources, and YAGO2 uses both strategies.”

    This suggests that the KV and KG are separate and not mutually exclusive knowledge bases. Or should I be reading the second KV extraction method described in the paper – “graph-based priors” – as a method by which Knowledge Graph entities are being digested (as it were) into the Knowledge Vault? I trust your reading of this paper and the patent you reference better than my own, so would value your opinion?

    As well, you say that “[w]e learned from Google’s recent announcement that they would be replacing the Google Knowledge Base with their Knowledge Vault….” What “announcement” is this? I’m familiar with the Dong paper and the New Scientist piece that talks about the Knowledge Vault, but not with any Google announcement on the subject. Thanks!

  10. Hi Aaron,

    The Knowledge Vault was presented at CIKM ’13 by Evgeniy Gabrilovich and also discussed at an industry session by Kevin Murphy. See: http://blog.west.uni-koblenz.de/2013-11-06/visiting-the-conference-on-information-and-knowledge-management-cikm-2013-san-francisco/ (Barbara found that one.) It was also presented a couple of Mondays ago by Kevin Murphy at KDD ’14. According to the New Scientist article:

    Google researcher Kevin Murphy and his colleagues will present a paper on Knowledge Vault at the Conference on Knowledge Discovery and Data Mining in New York on 25 August.

    I considered the paper itself an announcement, and, with statements within it like this one, an announcement that the Knowledge Vault was replacing the knowledge base:

    Therefore, we believe a new approach is necessary to further scale up knowledge base construction. Such an approach should automatically extract facts from the whole Web, to augment the knowledge we collect from human input and structured data sources. Unfortunately, standard methods for this task (cf. [44]) often produce very noisy, unreliable facts. To alleviate the amount of noise in the automatically extracted data, the new approach should automatically leverage already-cataloged knowledge to build prior models of fact correctness.
