Google Extracts Facts from the Web to Provide Fact Answers


When Google Crawls the Web, It Extracts Facts From Content on Pages It Finds

When Google crawls the Web, it extracts content and links from the pages it finds. How much information does it extract about facts on the Web so that it can provide fact answers? Microsoft showed off an object-based search about 10 years ago in the paper Object-Level Ranking: Bringing Order to Web Objects.

The team from Microsoft Research Asia tells us in that paper:

Existing Web search engines generally treat a whole Web page as the unit for retrieval and consuming. However, there are various kinds of objects embedded in static Web pages or Web databases. Typical objects are products, people, papers, organizations, etc. We can imagine that if these objects can be extracted and integrated from the Web, powerful object-level search engines can meet users’ information needs more precisely, especially for some specific domains.

Extracting Factual Information About Entities on the Web to Provide Fact Answers

This patent from Google focuses upon extracting factual information about entities on the Web to provide fact answers. It’s an approach that goes beyond building the Web index that we know Google for, because it collects more information, including how pieces of information relate to each other. The patent tells us:

Information extraction systems automatically extract structured information from unstructured or semi-structured documents. For example, some information extraction systems extract facts from collections of electronic documents. Each fact identifies a subject entity, an attribute possessed by the entity, and the attribute’s value for the entity.

Google’s First Semantic Search Invention From 1999

I’m reminded of an early Google provisional patent that Sergey Brin came up with in the 1990s. I called my post about that patent Google’s First Semantic Search Invention was Patented in 1999. The patent that post covers was titled Extracting Patterns and Relations from Scattered Databases Such as the World Wide Web (pdf) (skip ahead to the third page, where it becomes much more readable).

This was published as a paper on the Stanford website. It describes Sergey Brin taking some facts about some books and searching for those books on the Web; once they are found, patterns about the locations of those books are gathered, and information about other books is collected. That approach sounds much like the one from this patent granted the first week of this month:

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of obtaining a plurality of seed facts, wherein each seed fact identifies a subject entity, an attribute possessed by the subject entity, and an object, and wherein the object is an attribute value of the attribute possessed by the subject entity; generating a plurality of patterns from the seed facts, wherein each of the plurality of patterns is a dependency pattern generated from a dependency parse, wherein a dependency parse of a text portion corresponds to a directed graph of vertices and edges, wherein each vertex represents a token in the text portion and each edge represents a syntactic relationship between tokens represented by vertices connected by the edge, wherein each vertex is associated with the token represented by the vertex and a part of speech tag, and wherein a dependency pattern corresponds to a sub-graph of a dependency parse with one or more of the vertices in the sub-graph having a token associated with the vertex replaced by a variable; applying the patterns to documents in a collection of documents to extract a plurality of candidate additional facts from the collection of documents; and selecting one or more additional facts from the plurality of candidate additional facts.
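To make that claim language a little more concrete, here is a minimal Python sketch of the seed fact, pattern, and candidate fact loop. It is only an illustration under several assumptions of mine: the dependency parses are written by hand rather than produced by a real parser, every entity is a single token, the `Token` class, the function names, and the `?S`, `?A`, `?O` variable labels are invented for this example, and all three parts of the seed fact are replaced by variables. The patent does not spell out an implementation like this one.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Token:
    text: str   # the token itself
    pos: str    # part-of-speech tag
    head: int   # index of the head token, -1 for the root
    dep: str    # syntactic relation to the head


def ancestors(parse, i):
    """Return the chain of token indices from i up to the root."""
    chain = []
    while i != -1:
        chain.append(i)
        i = parse[i].head
    return chain


def path(parse, u, v):
    """Return the set of vertices on the tree path between u and v."""
    up_u, up_v = ancestors(parse, u), ancestors(parse, v)
    lca = next(n for n in up_u if n in up_v)  # lowest common ancestor
    return set(up_u[:up_u.index(lca) + 1]) | set(up_v[:up_v.index(lca)])


def dependency_pattern(parse, s, a, o):
    """Sub-graph connecting subject, attribute and object vertices,
    with those three tokens replaced by variables (an assumption)."""
    nodes = path(parse, s, a) | path(parse, a, o) | path(parse, s, o)
    variables = {s: "?S", a: "?A", o: "?O"}

    def label(i):
        return (variables.get(i, parse[i].text), parse[i].pos)

    return frozenset(
        (label(i), parse[i].dep, label(parse[i].head))
        for i in nodes
        if parse[i].head in nodes
    )


def generate_patterns(seed_facts, parses):
    """Build one pattern for each sentence that mentions a whole seed fact."""
    patterns = set()
    for s, a, o in seed_facts:
        for parse in parses:
            words = [t.text for t in parse]
            if s in words and a in words and o in words:
                patterns.add(dependency_pattern(
                    parse, words.index(s), words.index(a), words.index(o)))
    return patterns


def apply_patterns(patterns, parse):
    """Brute-force every (subject, attribute, object) vertex assignment."""
    return [
        (parse[s].text, parse[a].text, parse[o].text)
        for s, a, o in permutations(range(len(parse)), 3)
        if dependency_pattern(parse, s, a, o) in patterns
    ]


def copular_sentence(person, role, org):
    """Hand-built parse of '<person> is the <role> of <org>'."""
    return [
        Token(person, "PROPN", 3, "nsubj"),
        Token("is", "AUX", 3, "cop"),
        Token("the", "DET", 3, "det"),
        Token(role, "NOUN", -1, "root"),
        Token("of", "ADP", 5, "case"),
        Token(org, "PROPN", 3, "nmod"),
    ]


seed_parse = copular_sentence("Alice", "ceo", "Acme")
new_parse = copular_sentence("Bob", "founder", "Initech")

# Seed fact is (subject entity, attribute, object), all made-up names.
patterns = generate_patterns([("Acme", "ceo", "Alice")], [seed_parse])
candidates = apply_patterns(patterns, new_parse)
```

In this toy setup, the pattern learned from the sentence about Acme is enough to pull a (subject, attribute, object) candidate out of the sentence about Initech, even though neither the entity nor the attribute appeared in the seed facts.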

Advantages Of The Fact Extraction Patent

The patent lists several “Advantages” of the process it describes. They are worth keeping in mind because they sound a lot like how people who talk about the Semantic Web describe the Web as a web of data. These are the advantages that the patent tells us it brings to providing fact answers:

(1) A fact extraction system can accurately extract facts, i.e., (subject, attribute, object) triples, from a collection of electronic documents to identify values of attributes, i.e., “objects” in the extracted triples, that are not known to the fact extraction system.

(2) In particular, values of long-tail attributes that infrequently appear in the collection of electronic documents relative to other, more frequently occurring attributes can accurately be extracted from the collection. For example, given a set of attributes for which values are extracted from the collection, the attributes in the set can be ordered by the number of occurrences of each of the attributes in the collection. The fact extraction system can accurately extract attribute values for the long-tail attributes in the set, with the long-tail attributes being the attributes that are ranked below N in the order, where N is chosen such that the total number of appearances of attributes ranked N and above in the ranking equals the total number of appearances of attributes ranked below N.

(3) Additionally, the fact extraction system can accurately extract facts to identify values of nominal attributes, i.e., attributes that are expressed as nouns.
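The definition of long-tail attributes in advantage (2) comes down to a little arithmetic, which can be sketched in Python. The attribute names and counts below are invented for illustration, and the function name is my own. Since an exact split will rarely exist in real data, this sketch assumes N is the first rank at which the running total reaches at least half of all appearances.

```python
def long_tail_attributes(counts):
    """Return the attributes ranked below N, where N is the first rank
    at which the head of the ranking holds at least half of all
    appearances (an assumption for cases with no exact split)."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    total = sum(counts.values())
    running = 0
    for n, attribute in enumerate(ranked, start=1):
        running += counts[attribute]
        if 2 * running >= total:
            return ranked[n:]
    return []


# Hypothetical appearance counts for attributes in a document collection.
attribute_counts = {
    "ceo": 30,
    "headquarters": 20,
    "founder": 18,
    "chief economist": 17,
    "mascot": 10,
    "legal affairs correspondent": 5,
}

tail = long_tail_attributes(attribute_counts)
```

With these made-up numbers, the top two attributes account for 50 of the 100 appearances, so N is 2 and the remaining four attributes form the long tail.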

Selecting Facts to Provide Fact Answers

The patent is:

Extracting facts from documents
Inventors: Steven Euijong Whang, Rahul Gupta, Alon Yitzchak Halevy, and Mohamed Yahya
Assignee: Google Inc.
US Patent: 9,672,251
Granted: June 6, 2017
Filed: September 29, 2014

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, extract facts from a collection of documents. One of the methods includes obtaining a plurality of seed facts; generating a plurality of patterns from the seed facts, wherein each of the plurality of patterns is a dependency pattern generated from a dependency parse; applying the patterns to documents in a collection of documents to extract a plurality of candidate additional facts from the collection of documents, and selecting one or more additional facts from the plurality of candidate additional facts.

Other Citations Listed in This Patent About Extracting Facts

The patent contains a list of “other references” that the applicants cited. These are worth spending some time with because they contain many hints about the direction that Google appears to be taking in extracting facts from the Web to provide factual answers to questions.

The patent tells us that entities identified by this extraction process may be stored in an entity database. It points to the old Freebase site (which was run by Google).

Information Extracted From The Web

They give us some insights into how Google might use the information extracted from the Web in a fact repository. This is the term they used to refer to an early version of their knowledge graph:

Once extracted, the fact extraction system may store the extracted facts in a facts repository or provide the facts for use for some other purpose. In some cases, an Internet search engine may use the extracted facts to provide formatted answers in response to search queries that have been classified as seeking to determine the value of an attribute possessed by a particular entity. For example, a received search query “who is the chief economist of example organization?” may be classified by the search engine as seeking to determine the value of the “Chief Economist” attribute for the entity “Example Organization.” By accessing the fact repository, the search engine may identify that the fact repository includes an (Example Organization, Chief Economist, Example Economist) triple and, in response to the search query, can provide a formatted presentation that identifies “Example Economist” as the “Chief Economist” of the entity “Example Organization.”
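The lookup described in that passage can be sketched in a few lines of Python. This is a toy under assumptions of mine: the fact repository is a plain dictionary, and a single regular expression stands in for query classification, which in a real search engine would be far more involved. The names `FACT_REPOSITORY`, `WHO_IS_QUERY`, and `answer` are hypothetical.

```python
import re

# Toy fact repository keyed by (entity, attribute), holding the
# example triple from the patent.
FACT_REPOSITORY = {
    ("example organization", "chief economist"): "Example Economist",
}

# A single hand-written query shape stands in for query classification.
WHO_IS_QUERY = re.compile(
    r"^who is the (?P<attribute>.+?) of (?P<entity>.+?)\??$",
    re.IGNORECASE,
)


def answer(query, facts):
    """Return a formatted answer, or None if the query is not an
    attribute lookup or the fact is unknown."""
    match = WHO_IS_QUERY.match(query.strip())
    if match is None:
        return None
    entity, attribute = match["entity"], match["attribute"]
    value = facts.get((entity.lower(), attribute.lower()))
    if value is None:
        return None
    return f"{value} is the {attribute} of {entity}"


reply = answer(
    "who is the chief economist of example organization?",
    FACT_REPOSITORY,
)
```

A query that does not fit the pattern, or that asks about a triple missing from the repository, simply returns `None` here, which is where a search engine would fall back to ordinary ranked results.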

The patent tells us about how they use patterns to identify additional facts when providing fact answers:

Facts From Among Candidate Additional Facts

The system selects additional facts from the candidate additional facts based on the scores (step 212). For example, the system can select each candidate additional fact having a score above a threshold value as an additional fact. As another example, the system can select a predetermined number of highest-scoring candidate additional facts. The system can store the selected additional facts in a fact repository, e.g., the fact repository of FIG. 1, or provide the selected additional facts to an external system for use for some immediate purpose.
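Both selection strategies from that passage are easy to sketch in Python. The scores below are invented placeholders, since how the patent computes them is a separate step, and the function names are my own.

```python
import heapq


def select_by_threshold(scored_facts, threshold):
    """Keep every candidate fact whose score is above the threshold."""
    return [fact for fact, score in scored_facts.items() if score > threshold]


def select_top_k(scored_facts, k):
    """Keep a predetermined number of highest-scoring candidate facts."""
    return heapq.nlargest(k, scored_facts, key=scored_facts.get)


# Hypothetical candidate facts mapped to made-up scores.
scored_candidates = {
    ("Example Organization", "Chief Economist", "Example Economist"): 0.92,
    ("Example Organization", "Mascot", "Example Mascot"): 0.41,
    ("Example Organization", "Founder", "Example Founder"): 0.77,
}

above_threshold = select_by_threshold(scored_candidates, 0.5)
best_one = select_top_k(scored_candidates, 1)
```

A threshold lets the number of accepted facts grow with the quality of the evidence, while a fixed count caps how many new facts enter the repository on each pass.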

The patent also describes the process that might be followed to score candidate additional facts before they are shown as fact answers.

This fact extraction process does appear to build a repository capable of answering many questions. It uses a machine learning approach and the kind of semantic vectors that the Google Brain team may have used to develop Google’s RankBrain approach.

Some posts I’ve written about patents involving question answering:

Last Updated July 11, 2019.


16 thoughts on “Google Extracts Facts from the Web to Provide Fact Answers”

  1. You have provided an nice article, Thank you very much for this one. And i hope this will be useful for many people.. and i am waiting for your next post keep on updating these kinds of knowledgeable things…

  2. Machine learning is totally a game changer lately. It’s still developing and so far the results are awesome! I cannot even comprehend complexity of machine learning algorithms 😀

  3. Another great write-up Bill. 🙂

    Seems to remind me a bit of the patent “Question answering using entity references in unstructured data” – at least at its core. A tough nut to crack to be sure but they do seem to be making some great progress and I can only imagine what we’ll be seeing in a couple more years.

  4. Hi Dave,

    Thank you. I thought it was really important to include links to the references that were cited in the patent because they do point out how this approach has grown and evolved. The open language learning paper and the Word Representations in vector space paper fit in well with an evolved approach. I was also a little surprised that this paper wasn’t cited in this patent either (but it’s worth pointing out):

    Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources
    http://www.vldb.org/pvldb/vol8/p938-dong.pdf

  5. is there any kind of different with ‘using a machine learning approach and the kind of semantic vectors that the Google Brain team’ ?

  6. Hi Omediapc News,

    If you read the patent (I’ve provided a link to it in the post) or the 5 papers that are cited as other references, you will find out more about the machine learning approach and semantic vectors. Those are new things in this patent.

  7. I am trying to know about seo, actually i want to know how to rank a website properly, and i have get some idea from here and i have inspired.Machine learning is totally a game changer lately. It’s still developing and so far the results are awesome! I cannot even comprehend complexity of machine learning algorithms.

  8. You have provided a nice article, Thank you very much for this one. And I hope this will be useful for many people… Machine learning is totally a game changer lately. It’s still developing and so far the results are awesome! Seems to remind me a bit of the patent “Question answering using entity references in unstructured data” – at least at its core. I am trying to know about seo, actually I want to know how to rank a website properly, and I have get some idea from here and I have inspired.

  9. Hi Deanna,

    It has some similarities to the “Question Answering using entity references in unstructured data” patent, but it builds upon what is in that patent and uses machine learning, open language learning, and semantic vector processes to better understand what to extract from content found on the Web. This patent’s focus upon extracting facts is different from a common SEO approach to optimizing pages based upon information retrieval scores and link analysis approaches. I’ve written about those things on this site. SEO is evolving and transforming so that ranking web pages in response to a query and answering questions asked of a search engine are different but related goals of a search engine now.

  10. I am not a techie so could not understand much of the jargon involved, but definitely was curious after reading this, does this mean it treats every word or phrase as an independent entity and then creates and records multiple versions of it, to create a more intelligent extraction algorithm?

  11. It’s great to come across a blog every once in a while that isn’t the same out of date rehashed material. Fantastic read! I’ve saved your site and I’m adding your RSS feeds to my website.

  12. Machine learning is totally a game changer lately. It’s still developing and so far the results are awesome! I cannot even comprehend complexity of machine learning algorithms

  13. Hi Rummytoday,

    That is exactly why I wrote a post today about Rankbrain and artificial intelligence and machine learning. Hopefully the papers I linked to will give us more ideas about how machine learning works.

