
Google Knowledge Graph Reconciliation


Exploring how the Google Knowledge Graph works provides insights into how it is growing and improving, and how it may influence what we see on the web. A Google patent granted last month describes how Google increases the amount of data the Google Knowledge Graph contains.

The process in that patent differs from the patent I wrote about in How the Google Knowledge Graph Updates Itself by Answering Questions. Taken together, they tell us about how the knowledge graph is growing and improving. Part of the process involves entity extraction which I covered in Entity Extractions for Knowledge Graphs at Google.

This new patent tells us that information making its way into the knowledge graph is not limited to content from the Web; it can also “originate from another document corpus, such as internal documents not available over the Internet or another private corpus, from a library, from books, from a corpus of scientific data, or some other large corpus.”

What Is Google Knowledge Graph Reconciliation?

The patent tells us how a knowledge graph is constructed and how it may update and improve itself.

The site WordLift shares some definitions related to entities and the Semantic Web. Its definition of reconciling entities is “providing computers with unambiguous identifications of the entities we talk about.” Google’s patent covers a broader use of the word “reconciliation,” applying it to knowledge graphs to make sure that a knowledge graph uses all of the information from the web about those entities.

This may involve finding missing entities and missing facts about entities from a knowledge graph by using web-based sources to add information to a knowledge graph.

Problems with Knowledge Graphs

Large data graphs such as the Google Knowledge Graph store data and follow rules that describe knowledge about the data in a way that allows the information they provide to be built upon. This newly granted patent tells us how Google may build upon the data within a knowledge graph to allow it to contain more information. That process doesn’t just cover information from within the knowledge graph itself; it can also look to sources such as online news.

Tuples as Units of Knowledge Graphs

The patent presents some definitions that are worth learning. One of those is about facts involving entities:

A fact for an entity is an object related to the entity by a predicate. A fact for a particular entity may thus be described or represented as a predicate/object pair.

An entity (a subject) combined with a fact about that entity (a predicate/object pair) is known as a tuple.

In the Google Knowledge Graph, entities, such as people, places, things, concepts, etc., are stored as nodes and the connecting edges between those nodes indicate the relationship between the nodes.

For example, the nodes “Maryland” and “United States” are linked with the edges of “in-country” and/or “has state.”

A basic unit of a data graph can be a tuple including two entities, a subject entity and an object entity, and a relationship between the entities.

Tuples often represent real-world facts, such as “Maryland is a state in the United States” (a subject, a verb, and an object).

A tuple may also include information, such as:

  • Context information
  • Statistical information
  • Audit information
  • Metadata about the edges
  • etc.

When a knowledge graph knows about a tuple, it may also track the source of that tuple and include a score for that originating source.
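
To make that concrete, here is a minimal sketch, in Python, of how a fact tuple and the kinds of metadata the patent mentions might be represented. The field names (source_url, source_score, context) are my own illustrative assumptions rather than anything the patent specifies:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FactTuple:
    """One fact: a subject entity, a predicate, and an object entity."""
    subject: str    # e.g. "Maryland"
    predicate: str  # e.g. "is a state in"
    obj: str        # e.g. "United States"

@dataclass
class TupleRecord:
    """A fact tuple plus the bookkeeping the patent says may travel with it."""
    fact: FactTuple
    source_url: str                              # where the fact was extracted (illustrative)
    source_score: float                          # score for the originating source
    context: dict = field(default_factory=dict)  # context/statistical/audit metadata

record = TupleRecord(
    fact=FactTuple("Maryland", "is a state in", "United States"),
    source_url="https://example.com/us-states",  # hypothetical source
    source_score=0.9,
)
```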

A knowledge graph may lack information about some entities. Those entities may appear in sources such as web pages, but adding that entity information manually is slow and does not scale.

This is a common problem for knowledge graphs – missing entities and missing relationships to other entities reduce the usefulness of querying the data graph. Knowledge graph reconciliation can make a knowledge graph richer and stronger.

The patent also tells us about inverse tuples, which reverse subject and object entities.

For example, if the potential tuples include the tuple <Maryland, is a state in, United States>, the system may generate an inverse tuple of <United States, has a state named, Maryland>.

Inverse tuples may be generated for some predicates but not for others. For example, tuples with a date or measurement as the object are poor candidates for inversion, and may not have many inverse occurrences.

For example, the tuple <Planet of the Apes, released in, 2001> is not likely to have an inverse occurrence of <2001, is the year of release, Planet of the Apes> in the target data graph.

Clustering of tuples is also discussed in the patent. We are told that the system may then cluster the potential tuples by:

  • source
  • provenance
  • subject entity type
  • subject entity name

This kind of clustering takes place to generate source data graphs.
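
As a rough illustration of that clustering step, the sketch below groups source data graphs under a key built from the source entity’s type and name. The dict-based graph layout (entity_type, entity_name, and a set of FactTuples from the earlier sketch under 'facts') is an assumption for illustration:

```python
from collections import defaultdict

def cluster_source_graphs(source_graphs):
    """Group per-document source graphs by (entity type, entity name).

    Each source graph is assumed to be a dict holding 'entity_type',
    'entity_name', and a set of FactTuples under 'facts'."""
    clusters = defaultdict(list)
    for graph in source_graphs:
        key = (graph["entity_type"], graph["entity_name"].strip().lower())
        clusters[key].append(graph)
    return clusters
```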

The Process Behind the Google Knowledge Graph Reconciliation Patent

  1. Potential entities may be identified from facts generated from web-based sources
  2. Those Facts can be analyzed and cleaned, generating a small source data graph that includes entities and facts from those sources
  3. The source graph may be generated for a potential source entity that does not have a matching entity in the target data graph
  4. The system may repeat the analysis and generation of source data graphs for many source documents, generating many source graphs, each for a particular source document
  5. The system may cluster the source data graphs together by type of source entity and source entity name
  6. The entity name may be a string extracted from the text of the source
  7. Thus, the system generates clusters of source data graphs of the same source entity name and type
  8. The system may split a cluster of source graphs into buckets based on the object entity of one of the relationships, or predicates
  9. The system may use a predicate that is determinative for splitting the cluster
  10. A determinative predicate generally has a unique value, e.g., object entity, for a particular entity
  11. The system may repeat the dividing a predetermined number of times, for example using two or three different determinative predicates, splitting the buckets into smaller buckets. When the iteration is complete, graphs in the same bucket share two or three common facts
  12. The system may discard buckets without sufficient reliability and discard any conflicting facts from graphs in the same bucket
  13. The system may merge the graphs in the remaining buckets, and use the merged graphs to suggest new entities and new facts for the entities for inclusion in a target data graph
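
Steps 8 through 11 describe iteratively splitting a cluster on determinative predicates. The patent publishes no code, so the sketch below is one interpretation of that splitting, reusing the graph layout from the sketches above:

```python
def split_into_buckets(cluster, determinative_predicates, rounds=2):
    """Split a cluster of source graphs so that graphs in the same final
    bucket agree on `rounds` determinative facts (steps 8-11 above)."""
    buckets = [cluster]
    for predicate in determinative_predicates[:rounds]:
        next_buckets = []
        for bucket in buckets:
            by_object = {}
            for graph in bucket:
                # A graph's value for the predicate is the object of its
                # matching tuple, or None when the graph lacks that fact.
                value = next((f.obj for f in graph["facts"]
                              if f.predicate == predicate), None)
                by_object.setdefault(value, []).append(graph)
            next_buckets.extend(by_object.values())
        buckets = next_buckets
    return buckets
```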

How Googlebot May Be Crawling Facts to Build the Google Knowledge Graph

This is where clustering comes into play. Imagine that the web sources are about science fiction movies, and they contain information about the “Planet of the Apes” series, which has been remade at least once, so there are several related movies in the series, and movies that share the same names. Information about those movies may be found in sources across the Web, clustered together because of the similarities, and put through a reconciliation process. Relationships between the many entities involved may be determined and captured. We are told about the following steps:

  1. Each source data graph is associated with a source document, includes a source entity with an entity type that exists in the target data graph, and includes fact tuples
  2. The fact tuples identify a subject entity, a relationship connecting the subject entity to an object entity, and the object entity
  3. The relationship is associated with the entity type of the subject entity in the target data graph
  4. The computer system also includes instructions that, when executed by the at least one processor, cause the computer system to perform operations that include generating a cluster of source data graphs, the cluster including source data graphs associated with a first source entity of a first source entity type that share at least two fact tuples that have the first source entity as the subject entity and a determinative relationship as the relationship connecting the subject entity to the object entity
  5. The operations also include generating a reconciled graph by merging the source data graphs in the cluster when the source data graphs meet a similarity threshold and generating a suggested new entity and entity relationships for the target data graph based on the reconciled graph
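
The claim language above says graphs in a cluster are merged “when the source data graphs meet a similarity threshold,” without spelling out the measure. The Jaccard overlap below is just one plausible stand-in, not the patent’s actual formula:

```python
def jaccard_similarity(graph_a, graph_b):
    """Fraction of fact tuples that two source graphs share."""
    facts_a, facts_b = set(graph_a["facts"]), set(graph_b["facts"])
    if not facts_a and not facts_b:
        return 0.0
    return len(facts_a & facts_b) / len(facts_a | facts_b)

def meets_threshold(bucket, threshold=0.5):
    """True when every pair of graphs in the bucket is similar enough to merge."""
    return all(jaccard_similarity(a, b) >= threshold
               for i, a in enumerate(bucket) for b in bucket[i + 1:])
```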

More Features of Google Knowledge Graph Reconciliation

There appear to be nine movies across the “Planet of the Apes” series and the rebooted series. The first “Planet of the Apes” was released in 1968, and the second “Planet of the Apes” was released in 2001. Since they have the same name, things could get confusing if they weren’t separated from each other, so the system uses facts about those movies to break the “Planet of the Apes” cluster down into buckets, based on facts telling us that there was an original series and a rebooted series.

[Patent illustration: splitting a “Planet of the Apes” cluster into buckets]

I’ve provided details of the example Google pointed out, but here is how the patent describes breaking a cluster down into buckets based on facts:

For example, generating the cluster can include generating the first bucket for source data graphs associated with the first source entities and the first source entity type, splitting the first bucket into second buckets based on a first fact tuple, the first fact tuple having the first source entity as the subject entity and a first determinative relationship, so that source data graphs sharing the first fact tuple are in a same second bucket; and generating final buckets by repeating the splitting many times, each iteration using another fact tuple for the first source entity that represents a distinct determinative relationship, so that source data graphs sharing the first fact tuple and the other fact tuples are in the same final bucket, wherein the cluster is one of the final buckets.

So this aspect of knowledge graph reconciliation involves understanding related entities, including some that may share the same name, and removing ambiguity from how they might be presented within a knowledge graph.

Another aspect of knowledge graph reconciliation may involve merging data, such as seeing that sources for one version of the movie “Planet of the Apes” each list different actors appearing in the movie, and merging that information to make the knowledge graph more complete. The image below from the patent shows how that can be done:

[Patent illustration: merging actor facts for “Planet of the Apes”]
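
A minimal sketch of that merging step, under the same assumptions as the earlier sketches: unioning the facts of every graph in a bucket lets a multi-valued predicate, such as a hypothetical “has actor,” accumulate one tuple per distinct actor across sources:

```python
def merge_bucket(bucket):
    """Union the facts of every graph in a bucket into one reconciled graph."""
    merged_facts = set()
    for graph in bucket:
        merged_facts.update(graph["facts"])
    return {
        "entity_type": bucket[0]["entity_type"],
        "entity_name": bucket[0]["entity_name"],
        "facts": merged_facts,
    }
```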

The patent also tells us that fact tuples representing conflicting facts from a particular data source may be discarded. Some types of facts about entities have only one correct value, such as the birthdate of a person or the release date of a movie. If more than one value appears, they will be checked to see whether one of them is wrong and should be removed. This may also happen with inverse tuples, which the patent tells us about as well.
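
Here is one way that discarding might look for single-valued (determinative) predicates such as a release date; which predicates count as single-valued is my assumption for the example:

```python
def discard_conflicts(facts, single_valued_predicates):
    """Drop tuples for a single-valued predicate (e.g. 'released in') that
    appear with more than one distinct object for the same subject."""
    objects_seen = {}
    for f in facts:
        if f.predicate in single_valued_predicates:
            objects_seen.setdefault((f.subject, f.predicate), set()).add(f.obj)
    return {f for f in facts
            if f.predicate not in single_valued_predicates
            or len(objects_seen[(f.subject, f.predicate)]) == 1}
```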

Inverse Tuples Generated and Discarded

[Patent illustration: inverse tuples generated and discarded]

When a tuple is a subject-verb-object statement, what are known as inverse tuples may be generated. If we have fact tuples such as “Maryland is a state in the United States of America” and “California is a state in the United States of America,” we may generate inverse tuples such as “The United States of America has a state named Maryland” and “The United States of America has a state named California.”
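
A small sketch of inverse-tuple generation using that Maryland example, again reusing FactTuple from above. The predicate-to-inverse mapping is hand-written here, where the real system would presumably learn or curate such pairs, and objects that look like bare numbers or dates are skipped, since the patent says they make poor inversion candidates:

```python
import re

# Hypothetical mapping from a predicate to its inverse.
INVERSES = {"is a state in": "has a state named"}

def looks_like_date_or_measurement(obj):
    """Crude check: objects that are only digits/punctuation (e.g. '2001')."""
    return re.fullmatch(r"[\d.,/\s-]+", obj) is not None

def inverse_tuples(facts):
    """Generate inverse tuples, skipping date/measurement objects."""
    return {FactTuple(f.obj, INVERSES[f.predicate], f.subject)
            for f in facts
            if f.predicate in INVERSES
            and not looks_like_date_or_measurement(f.obj)}

facts = {FactTuple("Maryland", "is a state in", "United States of America"),
         FactTuple("California", "is a state in", "United States of America")}
print(inverse_tuples(facts))  # -> two inverse FactTuples, one per state
```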

Tuples generated from one source may sometimes conflict with tuples from another source when they are clustered by topic. An example comes from the recent trade deadline in Major League Baseball, when right fielder Yasiel Puig was traded from the Cincinnati Reds to the Cleveland Indians. The tuple “Yasiel Puig plays for the Cincinnati Reds” conflicts with the tuple “The Cleveland Indians have a player named Yasiel Puig.” One of those tuples may be discarded during knowledge graph reconciliation.

There is a reliability threshold for tuples, and tuples that don’t meet it may be discarded as having insufficient evidence. For instance, a tuple that is only from one source may not be considered reliable and may be discarded. If there are three sources for a tuple that are all from the same domain, that may also be considered insufficient evidence, and that tuple may be discarded.
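
Here is a sketch of such a reliability check over the TupleRecord objects from the first sketch. The exact thresholds are illustrative; the patent describes only the idea of insufficient evidence:

```python
from urllib.parse import urlparse

def is_reliable(records, min_sources=2, min_domains=2):
    """Treat a tuple as insufficiently evidenced when it has too few
    sources, or when all of its sources live on a single domain."""
    urls = {r.source_url for r in records}
    domains = {urlparse(u).netloc for u in urls}
    return len(urls) >= min_sources and len(domains) >= min_domains
```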

Advantages of the Google Knowledge Graph Reconciliation Patent Process

  1. A data graph may be extended more quickly by identifying entities in documents and facts concerning the entities
  2. The entities and facts may be of a high quality due to the corroborative nature of the graph reconciliation process
  3. Entities may be identified from news sources, allowing new entities to be added to the data graph more quickly
  4. Potential new entities and their facts may be identified from thousands or hundreds of thousands of sources, providing potential entities on a scale that is not possible with a manual evaluation of documents
  5. Entities and facts added to the data graph can be used to provide more complete or accurate search results

The Knowledge Graph Reconciliation Patent can be found here:

Automatic discovery of new entities using graph reconciliation
Inventors: Oksana Yakhnenko and Norases Vesdapunt
Assignee: GOOGLE LLC
US Patent: 10,331,706
Granted: June 25, 2019
Filed: October 4, 2017

Abstract

Systems and methods can identify potential entities from facts generated from web-based sources. For example, a method may include generating a source data graph for a potential entity from a text document in which the potential entity is identified. The source data graph represents the potential entity and facts about the potential entity from the text document. The method may also include clustering a plurality of source data graphs, each for a different text document, by entity name and type, wherein at least one cluster includes the potential entity. The method may also include verifying the potential entity using the cluster by corroborating at least a quantity of determinative facts about the potential entity and storing the potential entity and the facts about the potential entity, wherein each stored fact has at least one associated text document.

Takeaways

It is interesting to see how Google can use sources such as news to add new entities and facts about those entities.

Being able to use web-based news to add to the knowledge graph means that it isn’t relying upon human-edited sources such as Wikipedia to grow, and the knowledge graph reconciliation process was interesting to learn about as well.

Sharing is caring!

18 thoughts on “Google Knowledge Graph Reconciliation”

  1. I love your stuff. Kinda mind-bending but I like that kinda thing.
    Thank you.
    I tell my clients about the entity thing and their eyes glaze over. You have provided share-worthy material. I can say, “You don’t believe me? Read this.”
    I don’t anticipate that they will be able to follow along. Instead, they may just think I’m a genius.
    Thanks

  2. Hi Jennifer,

    Happy you like this post. It is intriguing to see the directions that Google is taking with its knowledge graph – it makes sense that if they are going to provide answers to the kinds of questions people ask, the knowledge graph needs to be able to update itself.

  3. Thank you, Bill. Another very interesting read.

    I interpret the knowledge graph as Google's path from being a search engine to becoming an (all-knowing) answer machine. Even more so since, with this patent, it isn't limited to content on the Web. All questions oriented toward the past (meaning established and well-documented facts) could be directly answered based on a broad set of data sources. This, by the way, also puts Google's efforts to scan entire libraries into a new perspective.

    Projecting this into the future, this might lead to a situation where displaying facts-based websites on SERPs is no longer needed, as their content is already represented in the knowledge graph. Thus, less and less traffic would be sent to those sources. Only up-to-the-minute and previously unknown information would have the chance to surface through searches, before it too is incorporated into the graph.

    Brave New World …

  4. Hi Larry,

    That is part of the reason why I like going through newly granted Google patents every week – to get a glimpse behind the curtain as to what Google is doing.

  5. Hi Bill
    I think Google's advanced AI system is mainly focused on helping the knowledge graph, since Google's new AI can easily find data and combine it. As a result, Google Lens can (if I am not wrong) understand 1 billion objects. So I think they are trying to update their knowledge graph.
    What do you think? Please share your opinion

  6. Hi Shahed,

    Improving the knowledge graph does appear to be a strong goal behind what Google is doing with search. There are ways to use the knowledge graph to return search results or to augment search results if a query has an entity in it that Google may be aware of because it is in Google’s knowledge graph. Improving the knowledge graph is a goal worth pursuing on Google’s part.

  7. Thanks for the blog…
    I like your knowledge of this topic and, most importantly, the way you explain things is really great. Believe me, before reading your blog I had no idea about the things related to knowledge graph reconciliation, but now that I have read it, I am sure I can explain it to someone else too. I love your posts, which give a better understanding of SEO practices. Keep up the good work.

  8. What a fantastic article.

    I think this is for sure my new favourite SEO blog.

    I’ve set aside my afternoon to read through your content and make notes (and my notebook is already half full).

  9. Hi David,

    You’re welcome. I appreciate you saying that about being able to explain what knowledge graph reconciliation is now. I’m trying to learn by trying to explain as best as I can. When we all learn, the industry gets better as a whole.

  10. This article was really informative, Bill. This is one of the most technical articles I've seen so far regarding ranking signals in Google. I appreciate your hard work compiling all this stuff to provide solid content for your readers.

  11. Thanks Travis,

    Happy that you liked this post. I thought it was pretty informative on how Google is expanding the information in its knowledge graph.

Comments are closed.