When Google crawls the Web, it extracts links from the pages it finds, and it also extracts facts from the content on those pages. How much factual information does it extract from the Web? Microsoft showed off an object-based approach to search about 10 years ago, in the paper Object-Level Ranking: Bringing Order to Web Objects.
The team from Microsoft Research Asia tells us in that paper:
Existing Web search engines generally treat a whole Web page as the unit for retrieval and consuming. However, there are various kinds of objects embedded in the static Web pages or Web databases. Typical objects are products, people, papers, organizations, etc. We can imagine that if these objects can be extracted and integrated from the Web, powerful object-level search engines can be built to meet users’ information needs more precisely, especially for some specific domains.
A recently granted patent from Google focuses upon extracting factual information about entities on the Web. It’s an approach that goes beyond building the Web index that we know Google for, because it also collects pieces of information that are related to each other. The patent tells us:
Information extraction systems automatically extract structured information from unstructured or semi-structured documents. For example, some information extraction systems that exist extract facts from collections of electronic documents, with each fact identifying a subject entity, an attribute possessed by the entity, and the value of the attribute for the entity.
I’m reminded of an early Google provisional patent that Sergey Brin came up with in the 1990s. I called my post about that patent Google’s First Semantic Search Invention was Patented in 1999. The patent it covers was titled Extracting Patterns and Relations from Scattered Databases Such as the World Wide Web (pdf) (skip ahead to the third page, where it becomes much more readable). This was published as a paper on the Stanford website. It describes Sergey Brin taking some facts about some books and searching for those books on the Web; once they are found, patterns about the locations of those books are gathered, and information about other books is collected as well. That approach sounds much like the one from this patent, granted the first week of this month:
In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of obtaining a plurality of seed facts, wherein each seed fact identifies a subject entity, an attribute possessed by the subject entity, and an object, and wherein the object is an attribute value of the attribute possessed by the subject entity; generating a plurality of patterns from the seed facts, wherein each of the plurality of patterns is a dependency pattern generated from a dependency parse, wherein a dependency parse of a text portion corresponds to a directed graph of vertices and edges, wherein each vertex represents a token in the text portion and each edge represents a syntactic relationship between tokens represented by vertices connected by the edge, wherein each vertex is associated with the token represented by the vertex and a part of speech tag, and wherein a dependency pattern corresponds to a sub-graph of a dependency parse with one or more of the vertices in the sub-graph having a token associated with the vertex replaced by a variable; applying the patterns to documents in a collection of documents to extract a plurality of candidate additional facts from the collection of documents; and selecting one or more additional facts from the plurality of candidate additional facts.
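The passage above packs several steps into one sentence, so a small sketch may help. This is only an illustration in Python, not the patent's implementation: the dependency parses are written by hand rather than produced by a parser, each entity is assumed to be a single token, and the names `Edge`, `prune_to_seed`, `make_pattern`, and `apply_pattern` are hypothetical. It takes one seed fact, keeps the sub-graph of the parse that connects the seed tokens, swaps those tokens for variables, and then applies the resulting pattern to a second sentence.

```python
from collections import Counter, namedtuple
from itertools import permutations

# A vertex is a (token, part-of-speech tag) pair; an edge is a syntactic relation.
Edge = namedtuple("Edge", "head label dep")

def prune_to_seed(edges, seed_tokens):
    """Keep only the sub-graph connecting the seed tokens by repeatedly
    dropping edges that end in a non-seed leaf vertex."""
    edges = set(edges)
    while True:
        degree = Counter(v for e in edges for v in (e.head, e.dep))
        removable = {e for e in edges
                     if any(degree[v] == 1 and v[0] not in seed_tokens
                            for v in (e.head, e.dep))}
        if not removable:
            return edges
        edges -= removable

def make_pattern(edges, seed):
    """Turn a parse plus a (subject, attribute, object) seed fact into a pattern."""
    subject, attribute, obj = seed
    variables = {subject: "?S", attribute: "?A", obj: "?O"}
    def swap(vertex):
        token, pos = vertex
        return (variables.get(token, token), pos)
    kept = prune_to_seed(edges, set(seed))
    return frozenset(Edge(swap(e.head), e.label, swap(e.dep)) for e in kept)

def apply_pattern(pattern, edges):
    """Brute-force search for variable bindings that embed the pattern in a parse."""
    edge_set = set(edges)
    vertices = {v for e in edges for v in (e.head, e.dep)}
    facts = []
    for s, a, o in permutations(vertices, 3):
        # A variable only binds to a vertex with the same part-of-speech tag.
        binding = {("?S", s[1]): s, ("?A", a[1]): a, ("?O", o[1]): o}
        grounded = {Edge(binding.get(e.head, e.head), e.label,
                         binding.get(e.dep, e.dep)) for e in pattern}
        if grounded <= edge_set:
            facts.append((s[0], a[0], o[0]))
    return facts

def copula_parse(subj, verb, noun, org):
    """Hand-written parse of '<subj> <verb> the <noun> of <org>' (an assumption)."""
    n = (noun, "NN")
    return [Edge(n, "nsubj", (subj, "NNP")), Edge(n, "cop", (verb, "VBZ")),
            Edge(n, "det", ("the", "DT")), Edge(n, "prep", ("of", "IN")),
            Edge(("of", "IN"), "pobj", (org, "NNP"))]

# Seed fact (Acme, economist, Alice) found in "Alice is the economist of Acme".
pattern = make_pattern(copula_parse("Alice", "is", "economist", "Acme"),
                       ("Acme", "economist", "Alice"))
# Apply the pattern to "Bob is the founder of Initech".
facts = apply_pattern(pattern, copula_parse("Bob", "is", "founder", "Initech"))
print(facts)
```

The pattern keeps the word "of" and the three relations that tie the subject, attribute, and object together, while "is" and "the" fall away during pruning. The final step in the claim, selecting which candidate facts to keep, is not covered by this sketch.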
The patent lists a number of “Advantages” of the process it describes that are worth keeping in mind, because they sound a lot like how people talking about the Semantic Web describe the Web as a web of data. These are the advantages that the patent brings us:
(1) A fact extraction system can accurately extract facts, i.e., (subject, attribute, object) triples, from a collection of electronic documents to identify values of attributes, i.e., “objects” in the extracted triples, that are not known to the fact extraction system.
(2) In particular, values of long-tail attributes that appear infrequently in the collection of electronic documents relative to other, more frequently occurring attributes can be accurately extracted from the collection. For example, given a set of attributes for which values are to be extracted from the collection, the attributes in the set can be ordered by the number of occurrences of each of the attributes in the collection and the fact extraction system can accurately extract attribute values for the long-tail attributes in the set, with the long-tail attributes being the attributes that are ranked below N in the order, where N is chosen such that the total number of appearances of attributes ranked N and above in the ranking equals the total number of appearances of attributes ranked below N in the ranking.
(3) Additionally, the fact extraction system can accurately extract facts to identify values of nominal attributes, i.e., attributes that are expressed as nouns.
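The definition of “long-tail” in advantage (2) is easier to follow with numbers. Here is a minimal sketch in Python, with made-up attribute names and counts. Since an exact tie between the two halves will rarely exist in real data, this sketch assumes N is the first rank at which the head of the ranking covers at least half of all appearances; that tie-breaking rule is my assumption, not something the patent spells out.

```python
def split_long_tail(counts):
    """Rank attributes by occurrence count and split them at N, where the
    attributes ranked N and above account for (at least) half of all appearances."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(counts.values())
    running = 0
    n = 0
    for n, (_, count) in enumerate(ranked, start=1):
        running += count
        if running * 2 >= total:
            break
    head = [attr for attr, _ in ranked[:n]]
    tail = [attr for attr, _ in ranked[n:]]  # the long-tail attributes
    return n, head, tail

# Hypothetical occurrence counts for attributes of cities.
counts = {"population": 500, "capital": 300, "mayor": 100,
          "chief economist": 60, "coffee culture": 40}
n, head, tail = split_long_tail(counts)
print(n, head, tail)
```

With these counts, “population” alone appears 500 times, which equals the 500 combined appearances of everything ranked below it, so N is 1 and the other four attributes make up the long tail.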
The patent is:
Extracting facts from documents
Inventors: Steven Euijong Whang, Rahul Gupta, Alon Yitzchak Halevy, and Mohamed Yahya
Assignee: Google Inc.
US Patent: 9,672,251
Granted: June 6, 2017
Filed: September 29, 2014
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for extracting facts from a collection of documents. One of the methods includes obtaining a plurality of seed facts; generating a plurality of patterns from the seed facts, wherein each of the plurality of patterns is a dependency pattern generated from a dependency parse; applying the patterns to documents in a collection of documents to extract a plurality of candidate additional facts from the collection of documents; and selecting one or more additional facts from the plurality of candidate additional facts.
The patent contains a list of “other references” that were cited by the applicants. These are worth spending some time with because they contain a lot of hints about the direction that Google appears to be moving in.
- Finkel et al., Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the ACL, Ann Arbor, Michigan, USA, Jun. 2005, pp. 363-370.
- Gupta et al., Biperpedia: An Ontology for Search Applications. In Proceedings of the VLDB Endowment, 2014, pp. 505-516.
- Haghighi and Klein, Simple Coreference Resolution with Rich Syntactic and Semantic Features. In Proceedings of Empirical Methods in Natural Language Processing, Singapore, Aug. 6-7, 2009, pp. 1152-1161.
- Madnani and Dorr, Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods. In Computational Linguistics, 2010, 36(3):341-387.
- de Marneffe et al., Generating Typed Dependency Parses from Phrase Structure Parses. In Proceedings of Language Resources and Evaluation, 2006, pp. 449-454.
- Mausam et al., Open Language Learning for Information Extraction. In Proceedings of Empirical Methods in Natural Language Processing, 2012, 12 pages.
- Mikolov et al., Efficient Estimation of Word Representations in Vector Space. International Conference on Learning Representations (ICLR), Scottsdale, Arizona, USA, 2013, 12 pages.
- Mintz et al., Distant Supervision for Relation Extraction Without Labeled Data. In Proceedings of the Association for Computational Linguistics, 2009, 9 pages.
The patent tells us that entities identified by this extraction process may be stored in an entity database, and it points at the old Freebase site (which used to be run by Google).
It also gives us some insights into how the information extracted from the Web might be used by Google in a fact repository (which is the term they used to refer to an early version of their knowledge graph):
Once extracted, the fact extraction system may store the extracted facts in a facts repository or provide the facts for use for some other purpose. In some cases, the extracted facts may be used by an Internet search engine in providing formatted answers in response to search queries that have been classified as seeking to determine the value of an attribute possessed by a particular entity. For example, a received search query “who is the chief economist of example organization?” may be classified by the search engine as seeking to determine the value of the “Chief Economist” attribute for the entity “Example Organization.” By accessing the fact repository, the search engine may identify that the fact repository includes a (Example Organization, Chief Economist, Example Economist) triple and, in response to the search query, can provide a formatted presentation that identifies “Example Economist” as the “Chief Economist” of the entity “Example Organization.”
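To make that example concrete, here is a rough sketch in Python of how a lookup against a fact repository might work. Everything here is an assumption for illustration: the repository is a plain dictionary of triples, and the query classifier is a single regular expression that only recognizes questions shaped like “who is the X of Y?”. A real search engine would do far more than this.

```python
import re

# A tiny stand-in for a fact repository of (subject, attribute, object) triples.
TRIPLES = [("Example Organization", "Chief Economist", "Example Economist")]

# Index the triples by lowercased (entity, attribute) for case-insensitive lookup.
INDEX = {(s.lower(), a.lower()): (s, a, o) for s, a, o in TRIPLES}

# Hypothetical classifier: matches "who is the <attribute> of <entity>?".
QUERY = re.compile(r"^who is the (?P<attribute>.+?) of (?P<entity>.+?)\??$",
                   re.IGNORECASE)

def answer(query):
    """Return a formatted answer if the query asks for a known attribute value."""
    match = QUERY.match(query.strip())
    if match is None:
        return None  # not classified as an attribute-seeking query
    key = (match["entity"].lower(), match["attribute"].lower())
    triple = INDEX.get(key)
    if triple is None:
        return None  # the repository holds no such fact
    subject, attribute, obj = triple
    return f"{obj} is the {attribute} of {subject}"

print(answer("who is the chief economist of example organization?"))
```

The point of the sketch is the shape of the lookup: the query is reduced to an (entity, attribute) pair, and the object of the matching triple becomes the answer.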
The patent goes on to describe how patterns are used to identify additional facts, and the process that might be followed to score candidate additional facts.
This fact extraction process does appear to be aimed at building a repository that might be capable of answering a lot of questions, using a machine learning approach and the kind of semantic vectors that the Google Brain team may have used to develop Google’s RankBrain approach.