Exploring how the Google Knowledge Graph works provides insights into how it is growing and improving, and how it may influence what we see on the web. A Google patent granted last month describes how Google increases the amount of data the Google Knowledge Graph contains.
The process in that patent differs from the patent I wrote about in How the Google Knowledge Graph Updates Itself by Answering Questions. Taken together, they tell us how the knowledge graph is growing and improving. Part of the process involves entity extraction, which I covered in Entity Extractions for Knowledge Graphs at Google.
This new patent tells us that information making its way into the knowledge graph is not limited to content from the Web, but can also “originate from another document corpus, such as internal documents not available over the Internet or another private corpus, from a library, from books, from a corpus of scientific data, or some other large corpus.”
What Is Google Knowledge Graph Reconciliation?
The patent tells us how a knowledge graph is constructed and how it may update and improve itself.
The site WordLift offers some definitions related to entities and the Semantic Web. Their definition of reconciling entities is “providing computers with unambiguous identifications of the entities we talk about.” Google’s patent covers a broader use of the word “reconciliation” and how it applies to knowledge graphs: making sure that a knowledge graph uses all of the information from the web about those entities.
This may involve finding missing entities and missing facts about entities from a knowledge graph by using web-based sources to add information to a knowledge graph.
Problems with Knowledge Graphs
Large data graphs such as the Google Knowledge Graph store data and follow rules that describe knowledge about the data in a way that allows the information they provide to be built upon. A patent granted to Google tells us how Google may build upon that data within a knowledge graph to allow it to contain more information. The patent doesn’t just cover information from within the knowledge graph itself but can look to sources such as online news.
Tuples as Units of Knowledge Graphs
The patent presents some definitions that are worth learning. One of those is about facts involving entities:
A fact for an entity is an object related to the entity by a predicate. A fact for a particular entity may thus be described or represented as a predicate/object pair.
The relationship between an Entity (a subject) and a fact about that entity (a predicate/object pair) is known as a tuple.
In the Google Knowledge Graph, entities, such as people, places, things, concepts, etc., are stored as nodes and the connecting edges between those nodes indicate the relationship between the nodes.
For example, the nodes “Maryland” and “United States” are linked with the edges of “in-country” and/or “has state.”
A basic unit of a data graph can be a tuple including two entities, a subject entity and an object entity, and a relationship between the entities.
Tuples often represent real-world facts, such as “Maryland is a state in the United States.” (A Subject, A Verb, and an Object.)
A tuple may also include information, such as:
- Context information
- Statistical information
- Audit information
- Metadata about the edges
When a knowledge graph knows about a tuple, it may also know about the source of that tuple and include a score for the originating source of the tuple.
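To make the idea of a tuple more concrete, here is a minimal Python sketch of how one might be represented along with its source. The class name `FactTuple` and the fields `source_url` and `source_score` are illustrative assumptions, not names taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactTuple:
    """A subject, a predicate, and an object, plus where the fact came from."""
    subject: str
    predicate: str
    obj: str
    source_url: str = ""       # assumed field: document the tuple was extracted from
    source_score: float = 0.0  # assumed field: score for the originating source

    def as_triple(self):
        # Return only the subject/predicate/object part of the tuple.
        return (self.subject, self.predicate, self.obj)


fact = FactTuple(
    "Maryland", "is a state in", "United States",
    source_url="https://example.com/states",  # placeholder URL
    source_score=0.8,                         # made-up score for illustration
)
print(fact.as_triple())
```

Keeping the source alongside the triple is what later allows facts to be corroborated across documents.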
A knowledge graph may lack information about some entities. Those entities may be described in sources such as web pages, but adding that entity information manually is slow and does not scale.
This is a common problem for knowledge graphs: missing entities and missing relationships to other entities reduce the usefulness of querying the data graph. Knowledge graph reconciliation can make a knowledge graph richer and stronger.
The patent also tells us about inverse tuples, which reverse subject and object entities.
For example, if the potential tuples include the tuple <Maryland, is a state in, United States>, the system may generate an inverse tuple of <United States, has state, Maryland>.
Sometimes inverse tuples may be generated for some predicates but not for others. For example, tuples with a date or measurement as the object may not be good candidates for inverse occurrences, and may not have many inverse occurrences.
For example, the tuple <Planet of the Apes, released, 2001> is not likely to have an inverse occurrence of <2001, is the year of release, Planet of the Apes> in the target data graph.
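A rough sketch of inverse tuple generation in Python might look like the following. The `INVERSE_PREDICATES` table is a hypothetical lookup; the patent does not say how the system decides which predicates get an inverse, so this sketch assumes that any predicate missing from the table gets none.

```python
# Hypothetical table: predicate -> inverse predicate.
# Predicates with a date or measurement as the object are left out on purpose.
INVERSE_PREDICATES = {
    "is a state in": "has state",
}


def inverse_tuples(tuples):
    """Return inverse tuples for those predicates that have a known inverse."""
    inverses = []
    for subject, predicate, obj in tuples:
        inverse = INVERSE_PREDICATES.get(predicate)
        if inverse is not None:
            # Swap subject and object and use the inverse predicate.
            inverses.append((obj, inverse, subject))
    return inverses


potential = [
    ("Maryland", "is a state in", "United States"),
    ("Planet of the Apes", "released", "2001"),  # no inverse expected
]
print(inverse_tuples(potential))
```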
Clustering of Tuples is also discussed in the patent. We are told that the system may then cluster the potential tuples by:
- subject entity type
- subject entity name
This kind of clustering takes place to generate source data graphs.
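As a small illustration, clustering by subject entity type and name can be sketched as a simple grouping step. The dictionary keys used here (`subject_type`, `subject_name`, and so on) are assumptions made for the example.

```python
from collections import defaultdict


def cluster_by_type_and_name(potential_tuples):
    """Group potential tuples by (subject entity type, subject entity name)."""
    clusters = defaultdict(list)
    for t in potential_tuples:
        clusters[(t["subject_type"], t["subject_name"])].append(t)
    return dict(clusters)


potential = [
    {"subject_type": "Movie", "subject_name": "Planet of the Apes",
     "predicate": "release year", "object": "1968"},
    {"subject_type": "Movie", "subject_name": "Planet of the Apes",
     "predicate": "release year", "object": "2001"},
    {"subject_type": "State", "subject_name": "Maryland",
     "predicate": "is a state in", "object": "United States"},
]
clusters = cluster_by_type_and_name(potential)
for key, members in clusters.items():
    print(key, len(members))
```

Note that both “Planet of the Apes” movies land in the same cluster at this stage, because only the name and type are compared.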
The Process Behind the Google Knowledge Graph Reconciliation Patent:
- Potential entities may be identified from facts generated from web-based sources
- Those facts can be analyzed and cleaned, generating a small source data graph that includes entities and facts from those sources
- The source graph may be generated for a potential source entity that does not have a matching entity in the target data graph
- The system may repeat the analysis and generation of source data graphs for many source documents, generating many source graphs, each for a particular source document
- The system may cluster the source data graphs together by type of source entity and source entity name
- The entity name may be a string extracted from the text of the source
- Thus, the system generates clusters of source data graphs of the same source entity name and type
- The system may split a cluster of source graphs into buckets based on the object entity of one of the relationships, or predicates
- The system may use a predicate that is determinative for splitting the cluster
- A determinative predicate generally has a unique value, e.g., object entity, for a particular entity
- The system may repeat the dividing a predetermined number of times, for example using two or three different determinative predicates, splitting the buckets into smaller buckets. When the iteration is complete, graphs in the same bucket share two or three common facts
- The system may discard buckets without sufficient reliability and discard any conflicting facts from graphs in the same bucket
- The system may merge the graphs in the remaining buckets, and use the merged graphs to suggest new entities and new facts for the entities for inclusion in a target data graph
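The final merging step in the list above can be sketched as a union of the facts found in each source graph of a bucket, keeping track of which documents support each fact. The data layout (a `doc` name and a set of `facts` per graph) is an assumption for illustration only.

```python
def merge_bucket(source_graphs):
    """Merge the source graphs in one bucket.

    Returns a dict mapping each (predicate, object) fact to the set of
    source documents in which it was found.
    """
    merged = {}
    for graph in source_graphs:
        for fact in graph["facts"]:
            merged.setdefault(fact, set()).add(graph["doc"])
    return merged


# Two hypothetical source documents about the 2001 movie.
bucket = [
    {"doc": "a.example",
     "facts": {("released", "2001"), ("has actor", "Mark Wahlberg")}},
    {"doc": "b.example",
     "facts": {("released", "2001"), ("has actor", "Helena Bonham Carter")}},
]
merged = merge_bucket(bucket)
for fact, docs in sorted(merged.items()):
    print(fact, sorted(docs))
```

The merged graph holds both actors even though each document named only one, which is how merging can suggest facts that no single source contained.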
How Googlebot may be Crawling Facts to Build the Google Knowledge Graph
This is where some clustering comes into play. Imagine that the web sources are about science fiction movies, and they contain information about movies in the “Planet of the Apes” series, which has been remade at least once and includes several related movies as well as movies that share the same name. The information about those movies may be found in sources on the Web, clustered together, and put through a reconciliation process because of the similarities. Relationships between the many entities involved may be determined and captured. We are told about the following steps:
- Each source data graph is associated with a source document, includes a source entity with an entity type that exists in the target data graph, and includes fact tuples
- The fact tuples identify a subject entity, a relationship connecting the subject entity to an object entity, and the object entity
- The relationship is associated with the entity type of the subject entity in the target data graph
- The computer system also includes instructions that, when executed by the at least one processor, cause the computer system to perform operations that include generating a cluster of source data graphs, the cluster including source data graphs associated with a first source entity of a first source entity type that share at least two fact tuples that have the first source entity as the subject entity and a determinative relationship as the relationship connecting the subject entity to the object entity
- The operations also include generating a reconciled graph by merging the source data graphs in the cluster when the source data graphs meet a similarity threshold and generating a suggested new entity and entity relationships for the target data graph based on the reconciled graph
More Features to Google Knowledge Graph Reconciliation
There appear to be 9 movies in the Planet of the Apes series and the rebooted series. The first “Planet of the Apes” was released in 1968, and the second “Planet of the Apes” was released in 2001. Since they have the same name, things could get confusing if they weren’t separated from each other. Facts about those movies can be used to break the “Planet of the Apes” cluster down into buckets, separating the original series from the rebooted series.
I’ve provided details of an example that Google pointed out, but here is how they describe breaking a cluster down into buckets based on facts:
For example, generating the cluster can include generating the first bucket for source data graphs associated with the first source entities and the first source entity type, splitting the first bucket into second buckets based on a first fact tuple, the first fact tuple having the first source entity as the subject entity and a first determinative relationship, so that source data graphs sharing the first fact tuple are in a same second bucket; and generating final buckets by repeating the splitting many times, each iteration using another fact tuple for the first source entity that represents a distinct determinative relationship, so that source data graphs sharing the first fact tuple and the other fact tuples are in the same final bucket, wherein the cluster is one of the final buckets.
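The repeated splitting described in that passage can be sketched as follows. This is a simplified illustration under several assumptions: each source graph stores its facts as a predicate-to-object dictionary, “release year” and “director” are treated as determinative predicates, and graphs that lack a given fact are simply dropped.

```python
from collections import defaultdict


def split_into_buckets(source_graphs, determinative_predicates):
    """Split one cluster into final buckets, one predicate per iteration."""
    buckets = [list(source_graphs)]
    for predicate in determinative_predicates:
        next_buckets = []
        for bucket in buckets:
            groups = defaultdict(list)
            for graph in bucket:
                value = graph["facts"].get(predicate)
                if value is not None:  # assumption: drop graphs missing the fact
                    groups[value].append(graph)
            next_buckets.extend(groups.values())
        buckets = next_buckets
    return buckets


graphs = [
    {"doc": "a.example",
     "facts": {"release year": "1968", "director": "Franklin J. Schaffner"}},
    {"doc": "b.example",
     "facts": {"release year": "2001", "director": "Tim Burton"}},
    {"doc": "c.example",
     "facts": {"release year": "2001", "director": "Tim Burton"}},
]
buckets = split_into_buckets(graphs, ["release year", "director"])
for bucket in buckets:
    print([g["doc"] for g in bucket])
```

After two iterations, the graphs left in the same bucket share two facts, so the 1968 and 2001 movies end up separated even though they share a name.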
So this aspect of knowledge graph reconciliation involves understanding related entities, including some that may share the same name, and removing ambiguity from how they might be presented within a knowledge graph.
Another aspect of knowledge graph reconciliation may involve merging data, such as seeing when one of the versions of the movie “Planet of the Apes” has more than one actor who is in the movie and merging that information to make the knowledge graph more complete. The image below from the patent shows how that can be done:
The patent also tells us that fact tuples that represent conflicting facts from a particular data source may be discarded. Some types of facts about entities have only one answer, such as the birthdate of a person or the launch date of a movie. If more than one value appears for such a fact, they will be checked to see if one of them is wrong and should be removed. This may also happen with inverse tuples, which the patent also tells us about.
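One simple way to picture conflict handling is sketched below. It assumes a hypothetical set of single-valued predicates, and it takes the most conservative policy of discarding every conflicting value rather than trying to decide which one is wrong; the patent may well do something more refined.

```python
from collections import defaultdict

# Hypothetical predicates that should have only one value per entity.
SINGLE_VALUED = {"date of birth", "release date"}


def drop_conflicts(facts):
    """Discard single-valued facts that appear with more than one value."""
    values = defaultdict(set)
    for subject, predicate, obj in facts:
        values[(subject, predicate)].add(obj)
    return [
        (s, p, o) for s, p, o in facts
        if p not in SINGLE_VALUED or len(values[(s, p)]) == 1
    ]


# Made-up facts for illustration only.
facts = [
    ("Example Film", "release date", "2001"),
    ("Example Film", "release date", "2002"),  # conflicts with the line above
    ("Example Film", "has actor", "Actor A"),
    ("Example Film", "has actor", "Actor B"),  # multi-valued, so no conflict
]
print(drop_conflicts(facts))
```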
Inverse Tuples Generated and Discarded
When a tuple takes the form subject-verb-object, what are known as inverse tuples may be generated. If we have fact tuples such as “Maryland is a state in the United States of America,” and “California is a state in the United States of America,” we may generate inverse tuples such as “The United States of America has a state named Maryland,” and “The United States of America has a state named California.”
Sometimes tuples generated from one source conflict with tuples from another source when they are clustered by topic. An example comes from the recent trade deadline in Major League Baseball, where the right fielder Yasiel Puig was traded from the Cincinnati Reds to the Cleveland Indians. The tuple “Yasiel Puig plays for the Cincinnati Reds” conflicts with the tuple “The Cleveland Indians have a player named Yasiel Puig.” One of those tuples may be discarded during the knowledge graph reconciliation.
There is a reliability threshold for tuples, and tuples that don’t meet it may be discarded as having insufficient evidence. For instance, a tuple that is only from one source may not be considered reliable and may be discarded. If there are three sources for a tuple that are all from the same domain, that may also be considered insufficient evidence, and that tuple may be discarded.
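A reliability check along those lines might count how many distinct domains support a tuple. The function name `is_reliable` and the default of two domains are assumptions; the patent does not give a specific number here.

```python
from urllib.parse import urlparse


def is_reliable(source_urls, min_domains=2):
    """Return True if the sources span at least `min_domains` distinct domains."""
    domains = {urlparse(url).netloc.lower() for url in source_urls}
    domains.discard("")  # ignore strings that did not parse as full URLs
    return len(domains) >= min_domains


# Placeholder URLs for illustration.
one_source = ["https://example.com/a"]
same_domain = ["https://example.com/a", "https://example.com/b",
               "https://example.com/c"]
two_domains = ["https://example.com/a", "https://example.org/b"]

print(is_reliable(one_source))
print(is_reliable(same_domain))
print(is_reliable(two_domains))
```

This sketch compares full host names, so subdomains of the same site would count as separate domains; a real system would likely need something stricter.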
Advantages of the Google Knowledge Graph Reconciliation Patent Process
- A data graph may be extended more quickly by identifying entities in documents and facts concerning the entities
- The entities and facts may be of a high quality due to the corroborative nature of the graph reconciliation process
- The identified entities may be identified from news sources, to more quickly identify new entities to be added to the data graph
- Potential new entities and their facts may be identified from thousands or hundreds of thousands of sources, providing potential entities on a scale that is not possible with a manual evaluation of documents
- Entities and facts added to the data graph can be used to provide more complete or accurate search results
The Knowledge Graph Reconciliation Patent can be found here:
Automatic discovery of new entities using graph reconciliation
Inventors: Oksana Yakhnenko and Norases Vesdapunt
Assignee: GOOGLE LLC
US Patent: 10,331,706
Granted: June 25, 2019
Filed: October 4, 2017
Systems and methods can identify potential entities from facts generated from web-based sources. For example, a method may include generating a source data graph for a potential entity from a text document in which the potential entity is identified. The source data graph represents the potential entity and facts about the potential entity from the text document. The method may also include clustering a plurality of source data graphs, each for a different text document, by entity name and type, wherein at least one cluster includes the potential entity. The method may also include verifying the potential entity using the cluster by corroborating at least a quantity of determinative facts about the potential entity and storing the potential entity and the facts about the potential entity, wherein each stored fact has at least one associated text document.
It is interesting to see how Google can use sources such as news sites to add new entities and facts about those entities.
Being able to use web-based news to add to the knowledge graph means that it isn’t relying only upon human-edited sources such as Wikipedia to grow, and the knowledge graph reconciliation process was interesting to learn about as well.