Google Patents Extracting Facts from the Web

When Google crawls the Web, it extracts facts from the content of the pages it finds, as well as from the links on those pages. How much factual information does it extract from the Web? Microsoft showed off an object-based search about 10 years ago in the paper Object-Level Ranking: Bringing Order to Web Objects.

The team from Microsoft Research Asia tells us in that paper:

Existing Web search engines generally treat a whole Web page as the unit for retrieval and consuming. However, there are various kinds of objects embedded in the static Web pages or Web databases. Typical objects are products, people, papers, organizations, etc. We can imagine that if these objects can be extracted and integrated from the Web, powerful object-level search engines can be built to meet users’ information needs more precisely, especially for some specific domains.

This patent from Google focuses on extracting factual information about entities on the Web. The approach goes beyond building the Web index that we know Google for, because it collects facts that are related to one another. The patent tells us:

Information extraction systems automatically extract structured information from unstructured or semi-structured documents. For example, some information extraction systems that exist extract facts from collections of electronic documents, with each fact identifying a subject entity, an attribute possessed by the entity, and the value of the attribute for the entity.

I’m reminded of an early Google provisional patent that Sergey Brin came up with in the 1990s. I wrote about it in a post called Google’s First Semantic Search Invention was Patented in 1999. The patent was titled Extracting Patterns and Relations from Scattered Databases Such as the World Wide Web (pdf) (skip ahead to the third page, where it becomes much more readable), and it was also published as a paper on the Stanford website. It describes Sergey Brin taking some facts about some books and searching for those books on the Web; once they are found, patterns about the locations of those books are gathered, and information about other books is collected as well. That approach sounds much like the one from this patent, granted the first week of this month:

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of obtaining a plurality of seed facts, wherein each seed fact identifies a subject entity, an attribute possessed by the subject entity, and an object, and wherein the object is an attribute value of the attribute possessed by the subject entity; generating a plurality of patterns from the seed facts, wherein each of the plurality of patterns is a dependency pattern generated from a dependency parse, wherein a dependency parse of a text portion corresponds to a directed graph of vertices and edges, wherein each vertex represents a token in the text portion and each edge represents a syntactic relationship between tokens represented by vertices connected by the edge, wherein each vertex is associated with the token represented by the vertex and a part of speech tag, and wherein a dependency pattern corresponds to a sub-graph of a dependency parse with one or more of the vertices in the sub-graph having a token associated with the vertex replaced by a variable; applying the patterns to documents in a collection of documents to extract a plurality of candidate additional facts from the collection of documents; and selecting one or more additional facts from the plurality of candidate additional facts.
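
To make the idea of a dependency pattern more concrete, here is a minimal sketch of my own, not the patent's implementation. It assumes a hand-written parse instead of a real parser, treats each entity mention as a single token, uses the whole parse as the sub-graph, and replaces the subject, attribute, and object tokens with the variables ?S, ?A, and ?O. The names Parse, make_pattern, and apply_pattern, and the example sentences, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parse:
    tokens: tuple      # one token per vertex
    tags: tuple        # part-of-speech tag per vertex
    edges: frozenset   # {(head_index, dependent_index, label)}


def make_pattern(parse, seed):
    """Replace the seed fact's subject, attribute and object tokens with variables."""
    subject, attribute, obj = seed
    variables = {subject: "?S", attribute: "?A", obj: "?O"}
    if not all(token in parse.tokens for token in variables):
        return None  # the sentence does not mention the whole seed fact
    tokens = tuple(variables.get(token, token) for token in parse.tokens)
    return Parse(tokens, parse.tags, parse.edges)


def apply_pattern(pattern, parse):
    """Return a (subject, attribute, object) candidate if the parse fits the pattern."""
    # Assumption: a match needs identical tags and edges (a real matcher
    # would search for the pattern as a sub-graph of a larger parse).
    if pattern.tags != parse.tags or pattern.edges != parse.edges:
        return None
    bindings = {}
    for pattern_token, token in zip(pattern.tokens, parse.tokens):
        if pattern_token.startswith("?"):
            bindings[pattern_token] = token
        elif pattern_token != token:
            return None
    return (bindings["?S"], bindings["?A"], bindings["?O"])


# Hand-written parse shared by both example sentences (hypothetical data).
TAGS = ("NNP", "VBZ", "DT", "NN", "IN", "NNP")
EDGES = frozenset({
    (3, 0, "nsubj"), (3, 1, "cop"), (3, 2, "det"),
    (3, 4, "prep"), (4, 5, "pobj"),
})

# "Roe is the economist of Acme" with seed fact (Acme, economist, Roe)
seed_parse = Parse(("Roe", "is", "the", "economist", "of", "Acme"), TAGS, EDGES)
pattern = make_pattern(seed_parse, ("Acme", "economist", "Roe"))

# "Doe is the founder of Initech" yields a candidate additional fact
other = Parse(("Doe", "is", "the", "founder", "of", "Initech"), TAGS, EDGES)
candidate = apply_pattern(pattern, other)
```

The pattern learned from one sentence about an economist can pull a (subject, attribute, object) candidate out of a differently worded sentence about a founder, as long as the two sentences share the same syntactic shape.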

The patent lists a number of “Advantages” of the process it describes. They are worth keeping in mind because they sound a lot like how people who talk about the Semantic Web describe the Web as a web of data. These are the Advantages that the patent brings us:

(1) A fact extraction system can accurately extract facts, i.e., (subject, attribute, object) triples, from a collection of electronic documents to identify values of attributes, i.e., “objects” in the extracted triples, that are not known to the fact extraction system.

(2) In particular, values of long-tail attributes that appear infrequently in the collection of electronic documents relative to other, more frequently occurring attributes can be accurately extracted from the collection. For example, given a set of attributes for which values are to be extracted from the collection, the attributes in the set can be ordered by the number of occurrences of each of the attributes in the collection and the fact extraction system can accurately extract attribute values for the long-tail attributes in the set, with the long-tail attributes being the attributes that are ranked below N in the order, where N is chosen such that the total number of appearances of attributes ranked N and above in the ranking equals the total number of appearances of attributes ranked below N in the ranking.

(3) Additionally, the fact extraction system can accurately extract facts to identify values of nominal attributes, i.e., attributes that are expressed as nouns.
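
The long-tail definition in the second advantage is easier to follow with numbers. The sketch below is my own illustration and assumes made-up occurrence counts. Because the head and tail totals will rarely be exactly equal in practice, it picks the smallest N at which the head reaches at least half of all appearances, which is my assumption and not something the patent states. The function name split_long_tail is hypothetical.

```python
def split_long_tail(counts):
    """Split attributes into (head, tail) around the rank N described in the patent."""
    # Order attributes from most to least frequent.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    running = 0
    n = 0
    for n, (_, count) in enumerate(ranked, start=1):
        running += count
        # Stop at the first rank where the head covers at least half of
        # all appearances (assumption: exact equality is rarely possible).
        if running * 2 >= total:
            break
    head = [attribute for attribute, _ in ranked[:n]]
    tail = [attribute for attribute, _ in ranked[n:]]
    return head, tail


# Hypothetical occurrence counts for five attributes in a document collection.
counts = {
    "population": 50,
    "capital": 30,
    "mayor": 10,
    "chief economist": 6,
    "head chef": 4,
}
head, tail = split_long_tail(counts)
```

With these counts, "population" alone accounts for 50 of the 100 appearances, so N is 1 and the other four attributes make up the long tail.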

The patent is:

Extracting facts from documents
Inventors: Steven Euijong Whang, Rahul Gupta, Alon Yitzchak Halevy, and Mohamed Yahya
Assignee: Google Inc.
US Patent: 9,672,251
Granted: June 6, 2017
Filed: September 29, 2014

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for extracting facts from a collection of documents. One of the methods includes obtaining a plurality of seed facts; generating a plurality of patterns from the seed facts, wherein each of the plurality of patterns is a dependency pattern generated from a dependency parse; applying the patterns to documents in a collection of documents to extract a plurality of candidate additional facts from the collection of documents; and selecting one or more additional facts from the plurality of candidate additional facts.

The patent contains a list of “other references” that were cited by the applicants. These are worth spending some time with because they contain a lot of hints about the direction in which Google appears to be moving.

The patent tells us that entities identified by this extraction process may be stored in an entity database, and it points to the old Freebase site (which used to be run by Google).

They give us some insights into how the information extracted from the Web might be used by Google in a fact repository (which is the term they used to refer to an early version of their knowledge graph):

Once extracted, the fact extraction system may store the extracted facts in a facts repository or provide the facts for use for some other purpose. In some cases, the extracted facts may be used by an Internet search engine in providing formatted answers in response to search queries that have been classified as seeking to determine the value of an attribute possessed by a particular entity. For example, a received search query “who is the chief economist of example organization?” may be classified by the search engine as seeking to determine the value of the “Chief Economist” attribute for the entity “Example Organization.” By accessing the fact repository, the search engine may identify that the fact repository includes a (Example Organization, Chief Economist, Example Economist) triple and, in response to the search query, can provide a formatted presentation that identifies “Example Economist” as the “Chief Economist” of the entity “Example Organization.”
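
The example in that passage can be sketched in a few lines. This is only an illustration under my own assumptions: the query classifier is a single regular expression for "who is the X of Y" questions, and the fact repository is an in-memory dictionary. The names TRIPLES, QUERY_SHAPE, build_index, and answer are hypothetical, and the one triple is the patent's own placeholder example.

```python
import re

# The patent's placeholder triple: (subject entity, attribute, object).
TRIPLES = [("Example Organization", "Chief Economist", "Example Economist")]

# Assumption: one hard-coded question shape stands in for a real query classifier.
QUERY_SHAPE = re.compile(
    r"^who is the (?P<attribute>.+?) of (?P<entity>.+?)\??$",
    re.IGNORECASE,
)


def build_index(triples):
    """Index triples by lower-cased (entity, attribute) for lookup."""
    return {(s.lower(), a.lower()): (s, a, o) for s, a, o in triples}


def answer(query, index):
    """Return a formatted answer, or None if the query or the fact is unknown."""
    match = QUERY_SHAPE.match(query.strip())
    if match is None:
        return None  # not classified as an attribute-seeking query
    key = (match.group("entity").lower(), match.group("attribute").lower())
    hit = index.get(key)
    if hit is None:
        return None  # no matching triple in the repository
    subject, attribute, obj = hit
    return f"{obj} is the {attribute} of {subject}"


index = build_index(TRIPLES)
reply = answer("who is the chief economist of example organization?", index)
```

The point of the design is that the search engine never has to rank pages for this kind of query. It only needs to map the question onto an (entity, attribute) pair and read the value out of the repository.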

The patent tells us about how they use patterns to identify additional facts:

The system selects additional facts from among the candidate additional facts based on the scores (step 212). For example, the system can select each candidate additional fact having a score above a threshold value as an additional fact. As another example, the system can select a predetermined number of highest-scoring candidate additional facts as additional facts. The system can store the selected additional facts in a fact repository, e.g., the fact repository of FIG. 1, or provide the selected additional facts to an external system for use for some immediate purpose.
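
Both selection rules in that passage are simple to express. In this sketch the candidate facts and their scores are invented for illustration, and the function names select_by_threshold and select_top_k are hypothetical. How the scores themselves are computed is a separate step that this sketch does not cover.

```python
import heapq


def select_by_threshold(scored, threshold):
    """Keep every candidate fact whose score is above the threshold."""
    return [fact for fact, score in scored.items() if score > threshold]


def select_top_k(scored, k):
    """Keep a predetermined number of the highest-scoring candidate facts."""
    return heapq.nlargest(k, scored, key=scored.get)


# Hypothetical candidate additional facts with made-up scores.
scored = {
    ("Example Organization", "Chief Economist", "Example Economist"): 0.92,
    ("Example Organization", "Chief Economist", "Example Intern"): 0.31,
    ("Example City", "Mayor", "Example Mayor"): 0.77,
}

above_half = select_by_threshold(scored, 0.5)
best_one = select_top_k(scored, 1)
```

A threshold keeps the quality bar fixed and lets the number of new facts vary, while a top-k cut fixes the number of new facts and lets the quality bar vary.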

The patent also describes the process that might be followed to score candidate additional facts.

This fact extraction process does appear to be aimed at building a repository that might be capable of answering a lot of questions, using a machine learning approach and the kind of semantic vectors that the Google Brain team may have used to develop Google’s RankBrain approach.

16 thoughts on “Google Patents Extracting Facts from the Web”

  1. You have provided an nice article, Thank you very much for this one. And i hope this will be useful for many people.. and i am waiting for your next post keep on updating these kinds of knowledgeable things…

  2. Machine learning is totally a game changer lately. It’s still developing and so far the results are awesome! I cannot even comprehend complexity of machine learning algorithms 😀

  3. Another great write-up Bill. 🙂

    Seems to remind me a bit of the patent “Question answering using entity references in unstructured data” – at least at its core. A tough nut to crack to be sure but they do seem to be making some great progress and I can only imagine what we’ll be seeing in a couple more years.

  4. Hi Dave,

    Thank you. I thought it was really important to include links to the references that were cited in the patent because they do point out how this approach has grown and evolved. The open language learning paper and the Word Representations in vector space paper fit in well with an evolved approach. I was also a little surprised that this paper wasn’t cited in this patent either (but it’s worth pointing out):

    Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources
    http://www.vldb.org/pvldb/vol8/p938-dong.pdf

  5. is there any kind of different with ‘using a machine learning approach and the kind of semantic vectors that the Google Brain team’ ?

  6. Hi Omediapc News,

    If you read the patent (I’ve provided a link to it in the post) or the 5 papers that are cited as other references, you will find out more about the machine learning approach and semantic vectors. Those are new things in this patent.

  7. I am trying to learn about SEO; actually, I want to know how to rank a website properly, and I have gotten some ideas from here and I am inspired. Machine learning is totally a game changer lately. It’s still developing and so far the results are awesome! I cannot even comprehend the complexity of machine learning algorithms.

  8. You have provided a nice article, Thank you very much for this one. And I hope this will be useful for many people… Machine learning is totally a game changer lately. It’s still developing and so far the results are awesome! Seems to remind me a bit of the patent “Question answering using entity references in unstructured data” – at least at its core. I am trying to know about seo, actually I want to know how to rank a website properly, and I have get some idea from here and I have inspired.

  9. Hi Deanna,

    It has some similarities to the “Question Answering using entity references in unstructured data” patent, but it builds upon what is in that patent and uses machine learning, open language learning, and semantic vector processes to better understand what to extract from content found on the Web. This patent’s focus upon extracting facts is different from a common SEO approach to optimizing pages based upon information retrieval scores and link analysis approaches. I’ve written about those things on this site. SEO is evolving and transforming so that ranking web pages in response to a query and answering questions asked of a search engine are now different but related goals of a search engine.

  10. I am not a techie so could not understand much of the jargon involved, but definitely was curious after reading this, does this mean it treats every word or phrase as an independent entity and then creates and records multiple versions of it, to create a more intelligent extraction algorithm?

  11. It’s great to come across a blog every once in a while that isn’t the same out of date rehashed material. Fantastic read! I’ve saved your site and I’m adding your RSS feeds to my website.

  12. Machine learning is totally a game changer lately. It’s still developing and so far the results are awesome! I cannot even comprehend complexity of machine learning algorithms

  13. Hi Rummytoday,

    That is exactly why I wrote a post today about Rankbrain and artificial intelligence and machine learning. Hopefully the papers I linked to will give us more ideas about how machine learning works.
