How Google Might Identify Synonyms for Entities Using Anchor Text

When Google indexes the Web, it’s often convenient to think of the search engine as running two different approaches in parallel. One of those involves the crawling, indexing, and ranking of pages on the Web (and images, videos, news, podcasts, and other documents).

The other approach doesn’t look at pages as much as it indexes objects it finds on the Web, or what we often refer to as named entities, which are specific people, places, or things – real or fictional. We see this second kind of crawling often referred to as fact extraction and see the results of such extraction as Knowledge Panel results or even things like Google’s OneBox Question & Answer results.

Not a web-based robot, but this image is from a Boston Dynamics patent.

When SEOs talk about Google and the programs it uses to crawl and index pages on the Web, we usually refer to those crawlers as robots or spiders or even Googlebot, and don’t differentiate these crawling programs much. Not the kind of robot above (which is a new twist from Google), but it’s probably time to start thinking of Googlebot differently.

I’ve written about both types of crawling, and for the second type of crawling and indexing, I’ve been placing those posts in my Fact Extraction and Knowledge Graphs category. (I added the “Knowledge Graphs” part of that a year or two ago, because it seemed to make sense.)

A newly granted patent for Google brings the two types of crawlers closer together, by having one of the fact extraction crawlers pay more attention to links and anchor text to learn more about the entities that those links might refer to, including new (synonymous) names for those entities. These fact extraction crawlers have been referred to by Google as “janitors” in the past, and here are some posts I’ve written that talk more about how these janitors work:

If you want even more, follow the category link above to “Fact Extraction”. A few years ago, Google acquired the patents from a company called MetaWeb. I wrote about that in the post Google Gets Smarter with Named Entities: Acquires MetaWeb. The newly granted patent talks about how it uses a feature of one of MetaWeb’s patents: assigning a unique ID to each named entity, so that if there are multiple names for the same specific entity, they can each be associated with that unique ID.

This patent describes how Google uses janitors to identify new names for an entity, and assigns them a unique ID so that Google understands that the names are synonyms for the same entities. An example of an entity in the patent that has multiple names is “International Business Machines Corporation” otherwise known as “IBM” or “Big Blue”.
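The idea of several names resolving to one unique ID can be sketched in a few lines of Python. This is only an illustration of the concept, not Google’s or MetaWeb’s implementation: the `EntityNameIndex` class, its methods, and the ID `entity:0001` are all made up for this example.

```python
from collections import defaultdict


class EntityNameIndex:
    """Toy index that maps many surface names to one entity ID (hypothetical)."""

    def __init__(self):
        self._id_by_name = {}                  # normalized name -> entity ID
        self._names_by_id = defaultdict(set)   # entity ID -> names as written

    @staticmethod
    def _normalize(name):
        # Case-fold and collapse whitespace so "IBM" and "  ibm " match.
        return " ".join(name.casefold().split())

    def add_name(self, entity_id, name):
        self._id_by_name[self._normalize(name)] = entity_id
        self._names_by_id[entity_id].add(name)

    def lookup(self, name):
        # Returns the entity ID for a name, or None if the name is unknown.
        return self._id_by_name.get(self._normalize(name))

    def names_for(self, entity_id):
        return sorted(self._names_by_id.get(entity_id, ()))


index = EntityNameIndex()
IBM_ID = "entity:0001"  # made-up ID, for illustration only
for name in ("International Business Machines Corporation", "IBM", "Big Blue"):
    index.add_name(IBM_ID, name)

print(index.lookup("big blue"))
print(index.names_for(IBM_ID))
```

Because every name points at the same ID, a newly learned name only needs one more `add_name` call to become a synonym of all the others.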

The knowledge panel that Google shows for IBM

The patent is:

Learning synonymous object names from anchor texts
Invented by Krzysztof Czuba, Jonathan T. Betz, Jeffrey C. Reynar
Assigned to Google
US Patent 8,738,643
Granted May 27, 2014
Filed: August 2, 2007

Abstract

A repository contains objects representing entities. The objects also include facts about the represented entities. The facts are derived from source documents.

A synonymous name of an object is determined by:

  • Identifying a source document from which one or more facts of the entity represented by the object were derived,
  • Identifying a plurality of linking documents that link to the source document through hyperlinks, each hyperlink having an anchor text,
  • Processing the anchor texts in the plurality of linking documents to generate a collection of synonym candidates for the entity represented by the object, and
  • Selecting a synonymous name for the entity represented by the object from the collection of synonym candidates.
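The four steps in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the method claimed in the patent: the abstract does not say how candidates are scored or selected, so the frequency threshold (`min_count`), the `GENERIC_ANCHORS` stop list, the function names, and the example URLs are all hypothetical.

```python
from collections import Counter

# Assumption: generic anchors say nothing about the entity, so skip them.
GENERIC_ANCHORS = {"click here", "here", "read more", "link", "website"}


def normalize(anchor_text):
    # Case-fold and collapse whitespace so variants count together.
    return " ".join(anchor_text.casefold().split())


def synonym_candidates(source_url, links):
    """Count anchor texts of links pointing at the source document.

    `links` is an iterable of (linking_url, target_url, anchor_text) tuples.
    """
    counts = Counter()
    for _linking_url, target_url, anchor_text in links:
        if target_url != source_url:
            continue  # only links to the fact's source document matter
        text = normalize(anchor_text)
        if text and text not in GENERIC_ANCHORS:
            counts[text] += 1
    return counts


def select_synonyms(candidates, known_names, min_count=2):
    # Assumption: keep candidates used by at least `min_count` links
    # that are not already known names for the entity.
    known = {normalize(name) for name in known_names}
    return [text for text, count in candidates.most_common()
            if count >= min_count and text not in known]


SOURCE = "https://ibm.example/about"  # made-up source document
links = [
    ("https://a.example/1", SOURCE, "Big Blue"),
    ("https://b.example/2", SOURCE, "big blue"),
    ("https://c.example/3", SOURCE, "click here"),
    ("https://d.example/4", SOURCE, "IBM"),
    ("https://e.example/5", SOURCE, "IBM"),
    ("https://f.example/6", "https://other.example/", "Big Blue"),
    ("https://g.example/7", SOURCE, "Armonk computer firm"),
]

candidates = synonym_candidates(SOURCE, links)
new_names = select_synonyms(
    candidates, ["International Business Machines Corporation", "IBM"])
print(new_names)
```

A real system would presumably weigh far more signals than raw counts, but the shape is the same: start from the source document, gather the anchor texts of the links pointing to it, and pick new names from that pool.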

I’ll be breaking the processes behind the patent down into specifics in my next post, but I’d recommend getting your head around fact extraction and the idea that Googlebot’s fact-extracting cousins are known as “janitors.” There are multiple kinds of janitors, including some that look at the anchor text in links pointing to pages about entities to find synonyms for those entities.


12 thoughts on “How Google Might Identify Synonyms for Entities Using Anchor Text”

  1. I’m trying to get my head around exactly where the knowledge graph janitors would be extracting facts. Are they working with Google’s index? Or out on the Web?

    This might be naive, but if they are on the Web, how do you see them working with nofollow (particularly thinking of the KG favorite – Wikipedia)?

  2. Interesting stuff, but I agree you’d expect Google to do this to make their data work better. I’m expecting Panda 4.0 to make some big changes (other than slamming eBay) in terms of keyword usage.

  3. Hi Nate,

    The different kinds of janitors would be extracting facts from web sites and web pages, like web indexing crawlers do. It’s possible, though, that some of the different types of crawlers process the content they find from copies of pages on Google’s servers rather than while connected to the sites themselves.

    I don’t know if we can get away with calling them “knowledge graph” janitors, since they work to collect data about named entities not only for the knowledge graph, but also to provide answers and results for Google OneBox type results as well, such as definitions, weather updates, sports scores and schedules, Q&A (question and answer) results, and others.

    The “nofollow” link rel attribute value works to tell search engines whether or not a web page owner wants PageRank and Hypertext relevance to be passed along by a link. If Google wanted that to apply to the use of anchor text as a source of alternative names for entities, it’s something they could likely do, but they don’t necessarily have to honor it. I’m not sure that I would worry too much about it.

  4. Hi Jan-Willem,

    I think you may have shared that link to the Microsoft paper yesterday on Twitter, and I may have responded to your tweet with something about the interestingness of entity linking. Given that I was working on this post, the timing was pretty good. I’d definitely recommend that people read that paper, too.

    What I’ve written about so far regarding the patent is something that you would expect Google to do to enrich their data, but I shared it because it’s not so obvious that they are doing this entity crawling that people are writing about it. I would suspect that the majority of blogs, article sites, and forums that focus upon SEO have little to nothing on them about Google’s janitors and how they work. I don’t have to suspect, though; I just have to do a little searching. It’s not that obvious.

  5. Hi Alex

    I don’t think that you would have to expect that Google would do something like this to make their data better. I suspect that most people wouldn’t, even people who do SEO. :)

  6. Hi JithinC

    Google started having patents granted around that time (2007) that discussed data janitors, so I have more than a couple of posts from back then. :)

  7. Two Important Updates that Rocked the entire SEO World –
    1. Hummingbird [Giving Emphasis to Long Tail Search]
    2. Knowledge Graph [Extracting "Delicate" + Authentic Information]

    Enjoyed reading your post! :-)

  8. Fascinating stuff, Bill – thanks for the great preliminary analysis!

    “I don’t know if we can get away with calling them ‘knowledge graph’ janitors, since they work to collect data about named entities not only for the knowledge graph…”

    Do you think, though, we might be able to get away with calling them “entity janitors”? Whether or not the entity in question is named, I can’t imagine that Google would disambiguate the same entity to different URIs depending on where data about that entity was to be presented.

    I don’t think it’ll be too long (if we haven’t arrived at that point already) before it will be impossible to distinguish Knowledge Graph results from any other sort of result that uses entity-based data.

    With Hummingbird, Google is certainly now using query entity extraction to fuel multiple types of results. For example, while In-depth Article results are quite different in multiple respects from an accompanying Knowledge Graph vertical, one wouldn’t see In-depth Article results unless Google was able to refashion the text query as an entity reference, almost certainly disambiguated to the same URI used by the Knowledge Graph results.

  9. Brilliant analysis, especially “the search engine running two different methods or approaches.” Thanks for sharing this valuable stuff. Also, your 2007 posts are fascinating even today.
