Google Knowledge Graph Reconciliation

Google Knowledge Graph Reconciliation Patent Flowchart

Exploring how the Google Knowledge Graph works can provide some insight into how it is growing and improving, and how it may influence what we see on the web. A Google patent granted at the end of last month describes one approach Google may use to increase the amount of data that the Knowledge Graph contains.

The process in that patent doesn’t work quite the same way as the one I wrote about in the post How the Google Knowledge Graph Updates Itself by Answering Questions, but taken together, the two tell us how the knowledge graph is growing and improving. Part of the process also involves the entity extraction that I wrote about in Entity Extractions for Knowledge Graphs at Google.

This patent tells us that information that may make its way into Google’s knowledge graph isn’t limited to content on the Web, but may also “originate from another document corpus, such as internal documents not available over the Internet or another private corpus, from a library, from books, from a corpus of scientific data, or from some other large corpus.”

What Is Google Knowledge Graph Reconciliation?

The patent tells us how a knowledge graph is constructed and the processes it follows to update and improve itself.

The site Wordlift includes some definitions related to entities and the Semantic Web. The definition they provide for reconciling entities is “providing computers with unambiguous identifications of the entities we talk about.” This patent from Google focuses on a broader use of the word “reconciliation” and how it applies to knowledge graphs: making sure that they take advantage of all of the information about entities from web sources that may be entered into them.

This process involves finding entities and facts about entities that are missing from a knowledge graph, and using web-based sources to add that information.

Problems with knowledge graphs

Large data graphs like the Google Knowledge Graph store data, and rules that describe knowledge about the data, in a way that allows the information they provide to be built upon. This patent describes how Google may build upon the data within a knowledge graph so that it contains more information. It doesn’t just cover information from within the knowledge graph itself, but can also look to sources such as online news.

Tuples as Units of Knowledge Graphs

The patent presents some definitions that are worth learning. One of those is about facts involving entities:

A fact for an entity is an object related to the entity by a predicate. A fact for a particular entity may thus be described or represented as a predicate/object pair.

The relationship between the Entity (a subject) and a fact about the entity (a predicate/object pair) is known as a tuple.

In the Google Knowledge Graph, entities, such as people, places, things, concepts, etc., may be stored as nodes and the edges between those nodes may indicate the relationship between the nodes.

For example, the nodes “Maryland” and “United States” may be linked by the edges of “in country” and/or “has state.”

A basic unit of such a data graph can be a tuple that includes two entities, a subject entity and an object entity, and a relationship between the entities.

Tuples often represent real-world facts, such as “Maryland is a state in the United States.” (A Subject, A Verb, and an Object.)
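To make the idea of a tuple concrete, here is a minimal sketch in Python. The `Triple` type and the `facts_about` helper are names I made up for illustration; the patent does not describe any particular data structure or code.

```python
from collections import namedtuple

# A fact tuple: subject entity, predicate (relationship), object entity.
# "Triple" is a hypothetical name used only for this illustration.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# The Maryland example from above, stored as two edges between two nodes.
facts = [
    Triple("Maryland", "in country", "United States"),
    Triple("United States", "has state", "Maryland"),
]


def facts_about(entity, triples):
    """Return the predicate/object pairs stored for one subject entity."""
    return [(t.predicate, t.obj) for t in triples if t.subject == entity]


maryland_facts = facts_about("Maryland", facts)
print(maryland_facts)
```

Each fact for an entity comes back as a predicate/object pair, which matches the definition quoted from the patent.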

A tuple may also include information, such as:

  • Context information
  • Statistical information
  • Audit information
  • Metadata about the edges
  • etc.

When a knowledge graph contains information about a tuple, it may also know about the source of that tuple and a score for the originating source of the tuple.

A knowledge graph may lack information about some entities. Those entities may be described in document sources, such as web pages, but manual addition of that entity information can be slow and does not scale.

This is a problem facing knowledge graphs – missing entities and their relationships to other entities can reduce the usefulness of querying the data graph. Knowledge graph reconciliation provides a way to make a knowledge graph richer and stronger.

The patent tells us about inverse tuples, which reverse the subject and object entities.

For example, if the potential tuples include a given tuple, the system may generate an inverse tuple with its subject and object swapped.

Sometimes inverse tuples may be generated for some predicates but not for others. For example, tuples with a date or measurement as the object may not be good candidates for inverse occurrences, and may not have many inverse occurrences.

For example, a tuple whose object is a release year is not likely to have an inverse occurrence such as <2001, is the year of release, Planet of the Apes> in the target data graph.
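Here is a rough sketch of how inverse tuples might be generated for some predicates and skipped for others. The `INVERSE_PREDICATES` table, the `inverse_of` function, and the predicate name "released" are my own assumptions for illustration and are not taken from the patent.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical lookup from a predicate to its inverse. Predicates whose
# objects are dates or measurements (such as "released") are left out
# on purpose, so no inverse is generated for them.
INVERSE_PREDICATES = {
    "in country": "has state",
}


def inverse_of(t: Triple) -> Optional[Triple]:
    """Swap subject and object if the predicate has a known inverse."""
    inverse_predicate = INVERSE_PREDICATES.get(t.predicate)
    if inverse_predicate is None:
        return None  # not a good candidate for an inverse tuple
    return Triple(t.obj, inverse_predicate, t.subject)


state_fact = Triple("Maryland", "in country", "United States")
date_fact = Triple("Planet of the Apes", "released", "2001")

print(inverse_of(state_fact))
print(inverse_of(date_fact))
```

The first call returns the reversed Maryland tuple, while the second returns `None` because a year makes a poor subject entity.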

The patent also discusses clustering tuples. We are told that the system may then cluster the potential tuples by:

  • source
  • provenance
  • subject entity type
  • subject entity name

This kind of clustering takes place in order to generate source data graphs.
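As a small illustration of that clustering, the sketch below groups potential tuples into one source data graph per source document and subject entity. The `PotentialTuple` fields and the `build_source_graphs` function are hypothetical names of mine, and provenance is left out to keep the example short.

```python
from collections import defaultdict
from typing import NamedTuple


class PotentialTuple(NamedTuple):
    source: str        # document the tuple was extracted from
    subject_type: str  # entity type of the subject
    subject_name: str  # entity name as it appears in the text
    predicate: str
    obj: str


def build_source_graphs(tuples):
    """Group tuples by (source, subject type, subject name)."""
    graphs = defaultdict(list)
    for t in tuples:
        key = (t.source, t.subject_type, t.subject_name)
        graphs[key].append((t.predicate, t.obj))
    return dict(graphs)


tuples = [
    PotentialTuple("doc-a", "Movie", "Planet of the Apes", "release year", "1968"),
    PotentialTuple("doc-a", "Movie", "Planet of the Apes", "has actor", "Charlton Heston"),
    PotentialTuple("doc-b", "Movie", "Planet of the Apes", "release year", "2001"),
]

source_graphs = build_source_graphs(tuples)
for key, graph_facts in source_graphs.items():
    print(key, graph_facts)
```

Two source data graphs come out of this, one per document, even though both describe an entity with the same name and type.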

The process behind the Google Knowledge Graph reconciliation patent:

  1. Potential entities may be identified from facts generated from web-based sources
  2. Facts from those sources are analyzed and cleaned, generating a small source data graph that includes entities and facts from those sources
  3. The source graph may be generated for a potential source entity that does not have a matching entity in the target data graph
  4. The system may repeat the analysis and generation of source data graphs for many source documents, generating many source graphs, each for a particular source document
  5. The system may cluster the source data graphs together by type of source entity and source entity name
  6. The entity name may be a string extracted from the text of the source
  7. Thus, the system generates clusters of source data graphs of the same source entity name and type
  8. The system may split a cluster of source graphs into buckets based on the object entity of one of the relationships, or predicates
  9. The system may use a predicate that is determinative for splitting the cluster
  10. A determinative predicate generally has a unique value, e.g., object entity, for a particular entity
  11. The system may repeat the dividing a predetermined number of times, for example using two or three different determinative predicates, splitting the buckets into smaller buckets. When the iteration is complete, graphs in the same bucket share two or three common facts
  12. The system may discard buckets without sufficient reliability and discard any conflicting facts from graphs in the same bucket
  13. The system may merge the graphs in the remaining buckets, and use the merged graphs to suggest new entities and new facts for the entities for inclusion in a target data graph
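The clustering and bucket-splitting steps above can be sketched roughly as follows. This is only my reading of the process, not Google's code: the dictionary layout of a source graph, the function names, and the choice of "release year" and "director" as determinative predicates are all assumptions. In this sketch, a graph that lacks one of the determinative facts is simply dropped.

```python
from collections import defaultdict


def cluster_by_name_and_type(graphs):
    """Group source graphs that share a source entity name and type."""
    clusters = defaultdict(list)
    for g in graphs:
        clusters[(g["name"], g["type"])].append(g)
    return dict(clusters)


def split_by_predicates(cluster, determinative_predicates):
    """Repeatedly split a cluster so each bucket agrees on every predicate."""
    buckets = [cluster]
    for predicate in determinative_predicates:
        next_buckets = []
        for bucket in buckets:
            by_value = defaultdict(list)
            for g in bucket:
                value = g["facts"].get(predicate)
                if value is not None:  # graphs missing the fact are dropped
                    by_value[value].append(g)
            next_buckets.extend(by_value.values())
        buckets = next_buckets
    return buckets


# Illustrative source graphs, one per (hypothetical) source document.
graphs = [
    {"doc": "a", "name": "Planet of the Apes", "type": "Movie",
     "facts": {"release year": "1968", "director": "Franklin J. Schaffner"}},
    {"doc": "b", "name": "Planet of the Apes", "type": "Movie",
     "facts": {"release year": "1968", "director": "Franklin J. Schaffner"}},
    {"doc": "c", "name": "Planet of the Apes", "type": "Movie",
     "facts": {"release year": "2001", "director": "Tim Burton"}},
]

clusters = cluster_by_name_and_type(graphs)
cluster = clusters[("Planet of the Apes", "Movie")]
buckets = split_by_predicates(cluster, ["release year", "director"])
bucket_docs = [[g["doc"] for g in bucket] for bucket in buckets]
print(bucket_docs)
```

All three graphs start in one cluster because they share a name and type. After splitting on two determinative predicates, the two documents about the 1968 movie share a bucket and the document about the 2001 movie sits in its own.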

How Googlebot may be Crawling Facts to Build the Google Knowledge Graph

This is where some clustering comes into play. Imagine that the web sources are about science fiction movies, and that they contain information about movies in the “Planet of the Apes” series. The original movie has been remade at least once, there are a number of related movies in the series, and some movies share the same name. Information about those movies may be found in sources on the Web, clustered together, and put through a reconciliation process because of the similarities. Relationships between the many entities involved may be determined and captured. We are told about the following steps:

  1. Each source data graph is associated with a source document, includes a source entity with an entity type that exists in the target data graph, and includes fact tuples
  2. The fact tuples identify a subject entity, a relationship connecting the subject entity to an object entity, and the object entity
  3. The relationship is associated with the entity type of the subject entity in the target data graph
  4. The computer system also includes instructions that, when executed by the at least one processor, cause the computer system to perform operations that include generating a cluster of source data graphs, the cluster including source data graphs associated with a first source entity of a first source entity type that share at least two fact tuples that have the first source entity as the subject entity and a determinative relationship as the relationship connecting the subject entity to the object entity
  5. The operations also include generating a reconciled graph by merging the source data graphs in the cluster when the source data graphs meet a similarity threshold and generating a suggested new entity and entity relationships for the target data graph based on the reconciled graph
More Features to Google Knowledge Graph Reconciliation

There appear to be nine movies across the original Planet of the Apes series and the rebooted series. The first “Planet of the Apes” was released in 1968, and the second “Planet of the Apes” was released in 2001. Since they share a name, things could get confusing if they weren’t separated from each other. Facts about those movies can be used to break the “Planet of the Apes” cluster down into buckets, telling us that there was an original series and a rebooted series.

Google Knowledge Graph reconciliation planet of the apes

I’ve provided details of an example that Google pointed out, but here is how they describe breaking a cluster down into buckets based on facts:

For example, generating the cluster can include generating a first bucket for source data graphs associated with the first source entities and the first source entity type, splitting the first bucket into second buckets based on a first fact tuple, the first fact tuple having the first source entity as the subject entity and a first determinative relationship, so that source data graphs sharing the first fact tuple are in a same second bucket; and generating final buckets by repeating the splitting a quantity of times, each iteration using another fact tuple for the first source entity that represents a distinct determinative relationship, so that source data graphs sharing the first fact tuple and the other fact tuples are in the same final bucket, wherein the cluster is one of the final buckets.

So this aspect of knowledge graph reconciliation involves understanding related entities, including some that may share the same name, and removing ambiguity from how they might be presented within a knowledge graph.

Another aspect of knowledge graph reconciliation may involve merging data. For instance, when different sources each name a different actor in the same version of “Planet of the Apes,” that information can be merged to make the knowledge graph more complete. The image below from the patent shows how that can be done:

Knowledge graph reconciliation actors from Planet of the Apes

The patent also tells us that fact tuples representing conflicting facts from a particular data source may be discarded. Some types of facts about entities have only one answer, such as a person’s birthdate or the launch date of a movie. If more than one value appears, they will be checked to see if one of them is wrong and should be removed. This may also happen with inverse tuples, which the patent also tells us about.
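A simplified sketch of merging the graphs in a bucket and handling conflicting facts might look like this. The `SINGLE_VALUED` set and the `merge_bucket` function are my own hypothetical names, the "2002" value is a deliberately wrong value used to show a conflict, and dropping every conflicting value is a simplification on my part rather than what the patent specifies.

```python
from collections import defaultdict

# Hypothetical set of predicates that should have only one answer.
SINGLE_VALUED = {"release year"}


def merge_bucket(source_graphs):
    """Union the facts from several source graphs into one reconciled graph."""
    merged = defaultdict(set)
    for graph_facts in source_graphs:
        for predicate, obj in graph_facts:
            merged[predicate].add(obj)

    reconciled = {}
    for predicate, objects in merged.items():
        if predicate in SINGLE_VALUED and len(objects) > 1:
            continue  # conflicting values: discard rather than guess
        reconciled[predicate] = sorted(objects)
    return reconciled


doc_a = [("release year", "2001"), ("has actor", "Mark Wahlberg")]
doc_b = [("release year", "2001"), ("has actor", "Helena Bonham Carter")]
doc_c = [("release year", "2002"), ("has actor", "Mark Wahlberg")]

agreeing = merge_bucket([doc_a, doc_b])
conflicting = merge_bucket([doc_a, doc_c])
print(agreeing)
print(conflicting)
```

When the sources agree on the release year, the merged graph keeps it and gains both actors. When they disagree, the release year is left out and only the actor fact survives.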

Inverse Tuples Generated and Discarded

Knowledge Graph Reconciliation - Reverse Tuples

When a tuple takes a subject-verb-object form, what are known as inverse tuples may be generated. If we have fact tuples such as “Maryland is a state in the United States of America,” and “California is a state in the United States of America,” we may generate inverse tuples such as “The United States of America has a state named Maryland,” and “The United States of America has a state named California.”

Sometimes tuples generated from one source may conflict with tuples from another source when they are clustered by topic. An example might come from the recent trade deadline in Major League Baseball, when the right fielder Yasiel Puig was traded from the Cincinnati Reds to the Cleveland Indians. The tuple “Yasiel Puig plays for the Cincinnati Reds” conflicts with the tuple “The Cleveland Indians have a player named Yasiel Puig.” One of those tuples may be discarded during the knowledge graph reconciliation.

There is a reliability threshold for tuples, and tuples that don’t meet it may be discarded as having insufficient evidence. For instance, a tuple that is only from one source may not be considered reliable and may be discarded. If there are three sources for a tuple that are all from the same domain, that may also be considered insufficient evidence, and that tuple may be discarded.
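One way to picture that reliability check is to count the distinct domains behind a tuple instead of the raw number of sources. The sketch below is an assumption on my part: the function names, the example URLs, and the `min_domains=2` threshold are hypothetical, and the patent as described here does not give an exact number.

```python
from urllib.parse import urlparse


def distinct_domains(source_urls):
    """Collect the host names that the supporting sources come from."""
    return {urlparse(url).netloc.lower() for url in source_urls}


def is_reliable(source_urls, min_domains=2):
    """Treat a tuple as reliable only if enough separate domains back it."""
    return len(distinct_domains(source_urls)) >= min_domains


one_source = ["https://example.com/review"]
same_domain = [
    "https://example.com/review",
    "https://example.com/cast",
    "https://example.com/news",
]
two_domains = ["https://example.com/review", "https://example.org/movies"]

print(is_reliable(one_source))
print(is_reliable(same_domain))
print(is_reliable(two_domains))
```

Three pages from the same domain count as a single piece of evidence here, so that tuple is rejected just like the single-source one. Note that this simple version compares full host names, so `www.example.com` and `example.com` would count as different domains.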

Advantages of the Google Knowledge Graph Reconciliation Patent Process

  1. A data graph may be extended more quickly by identifying entities in documents and facts concerning the entities
  2. The entities and facts may be of high quality due to the corroborative nature of the graph reconciliation process
  3. The identified entities may be identified from news sources, to more quickly identify new entities to be added to the data graph
  4. Potential new entities and their facts may be identified from thousands or hundreds of thousands of sources, providing potential entities on a scale that is not possible with manual evaluation of documents
  5. Entities and facts added to the data graph can be used to provide more complete or accurate search results

The Knowledge Graph Reconciliation Patent can be found here:

Automatic discovery of new entities using graph reconciliation
Inventors: Oksana Yakhnenko and Norases Vesdapunt
Assignee: GOOGLE LLC
US Patent: 10,331,706
Granted: June 25, 2019
Filed: October 4, 2017

Abstract

Systems and methods can identify potential entities from facts generated from web-based sources. For example, a method may include generating a source data graph for a potential entity from a text document in which the potential entity is identified. The source data graph represents the potential entity and facts about the potential entity from the text document. The method may also include clustering a plurality of source data graphs, each for a different text document, by entity name and type, wherein at least one cluster includes the potential entity. The method may also include verifying the potential entity using the cluster by corroborating at least a quantity of determinative facts about the potential entity and storing the potential entity and the facts about the potential entity, wherein each stored fact has at least one associated text document.

Takeaways

The patent points out in one place that human evaluators may review additions to a knowledge graph. It is interesting to see how it can use sources such as news sources to add new entities and facts about those entities. Being able to use web-based news to add to the knowledge graph means that it isn’t relying upon human-edited sources such as Wikipedia to grow. The knowledge graph reconciliation process was interesting to learn about as well.

14 thoughts on “Google Knowledge Graph Reconciliation”

  1. I love your stuff. Kinda mind-bending but I like that kinda thing.
    Thank you.
    I tell my clients about the entity thing and their eyes glaze over. You have provided share-worthy material. I can say, “You don’t believe me? Read this.”
    I don’t anticipate that they will be able to follow along. Instead, they may just think I’m a genius.
    Thanks

  2. Hi Jennifer,

    Happy you like this post. It is intriguing the directions that Google is taking with their knowledge path, but it makes sense if they are going to be able to provide answers to the kinds of questions that people ask – the knowledge graph needs to be able to update itself.

  3. Thank you, Bill. Another very interesting read.

    I interpret the knowledge graph as Google’s path from being a search engine to becoming an (all-knowing) answer machine. This even more so, as with this patent it isn’t limited to content on the Web. All questions oriented towards the past (meaning established and well documented facts) could be directly answered based on a broad set of data sources. This, by the way, also puts Google’s efforts to scan entire libraries into a new perspective.

    Projecting this into the future, this might lead to a situation where displaying facts-based websites on SERPs is no longer needed, as their content is already represented in the knowledge graph. Thus, less and less traffic would be sent to those sources. Only up-to-the-minute and previously unknown information would have the chance to surface through searches, before it also will be incorporated into the graph.

    Brave New World …

  4. Hi Larry,

    That is part of the reason why I like going through newly granted Google patents every week – to get a glimpse behind the curtain as to what Google is doing.

  5. Hi Bill
    I think Google Advanced AI system is mainly focused to help the knowledge graph. As google new ai can easily find any data and combine them. As a result, Google Lens can understand ( if I am not wrong) 1 billion abstract. So I think they are trying to updating their knowledge graph.
    What do you think? Please share your opinion

  6. Hi Shahed,

    Improving the knowledge graph does appear to be a strong goal behind what Google is doing with search. There are ways to use the knowledge graph to return search results or to augment search results if a query has an entity in it that Google may be aware of because it is in Google’s knowledge graph. Improving the knowledge graph is a goal worth pursuing on Google’s part.

  7. Thanks, for the blog…
    I like your knowledge about this topic and most important, way you explain things are really great. Believe me before reading your blog I have no idea about the things related to knowledge graph reconciliation but now reading this I am sure that I can explain someone else too. As I Love your posts which give a better understanding of SEO practices. Keep up the good work..

  8. What a fantastic article.

    I think this is for sure my new favourite SEO blog.

    I’ve set aside my afternoon to read through your content and make notes (and my notebook is already half full).

  9. Hi David,

    You’re welcome. I appreciate you saying that about being able to explain what knowledge graph reconciliation is now. I’m trying to learn by trying to explain as best as I can. When we all learn, the industry gets better as a whole.
