When Google indexes the Web, it’s often convenient to think of the search engine as running two different approaches in parallel. One of those involves crawling, indexing, and ranking pages on the Web (along with images, videos, news, podcasts, and other documents).
The other approach doesn’t look at pages as much as it indexes objects it finds on the Web, or what we often refer to as named entities, which are specific people, places, or things – real or fictional. We see this second kind of crawling often referred to as fact extraction and see the results of such extraction as Knowledge Panel results or even things like Google’s OneBox Question & Answer results.
When SEOs talk about the programs Google uses to crawl and index pages on the Web, we usually refer to those crawlers as robots, spiders, or Googlebot, and don’t differentiate among them much. It’s probably time to start thinking of Googlebot differently.
I’ve written about both types of crawling, and for the second type of crawling and indexing, I’ve been placing those posts in my Fact Extraction and Knowledge Graphs category. (I added the “Knowledge Graphs” part of that a year or two ago, because it seemed to make sense.)
A newly granted Google patent brings the two types of crawlers closer together, by having one of the fact extraction crawlers pay more attention to links and anchor text to learn more about the entities those links refer to, including new (synonymous) names for those entities. Google has referred to these fact extraction crawlers as “janitors” in the past, and here are some posts I’ve written that talk more about how these janitors work:
- June 29, 2007 – Google Janitors Clean Up Facts on the Web
- August 5, 2007 – Google on the Extraction and Visualization of Facts
- January 11, 2013 – Building Google’s Knowledge Base and Identifying Locations in Web Pages
If you want even more, follow the category link above to “Fact Extraction”. A few years ago, Google acquired the patents from a company called MetaWeb, which I wrote about in the post Google Gets Smarter with Named Entities: Acquires MetaWeb. The newly granted patent talks about how it uses a feature of one of MetaWeb’s patents: assigning a unique ID to each named entity, so that if there are multiple names for the same specific entity, they can each be associated with that unique ID.
This patent describes how Google uses janitors to identify new names for an entity, and associates them with the entity’s unique ID so that Google understands that the names are synonyms for the same entity. An example of an entity in the patent that has multiple names is “International Business Machines Corporation”, otherwise known as “IBM” or “Big Blue”.
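To make the unique ID idea concrete, here is a minimal sketch of an alias table in Python. The class name `EntityAliases`, its methods, and the integer IDs are my own illustrative assumptions, not anything described in the patent. It also assumes each name points to only one entity, which is a simplification.

```python
class EntityAliases:
    """Hypothetical alias table: many names, one ID per entity."""

    def __init__(self):
        self._next_id = 1
        self._id_by_name = {}   # normalized name -> entity ID
        self._names_by_id = {}  # entity ID -> list of names as given

    @staticmethod
    def _normalize(name):
        # Lowercase and collapse whitespace so "Big  Blue" matches "big blue".
        return " ".join(name.lower().split())

    def add_entity(self, name):
        """Register a new entity under its first name and return its ID."""
        entity_id = self._next_id
        self._next_id += 1
        self._names_by_id[entity_id] = []
        self.add_alias(entity_id, name)
        return entity_id

    def add_alias(self, entity_id, alias):
        """Associate another name with an existing entity ID."""
        self._id_by_name[self._normalize(alias)] = entity_id
        self._names_by_id[entity_id].append(alias)

    def resolve(self, name):
        """Return the entity ID for a name, or None if the name is unknown."""
        return self._id_by_name.get(self._normalize(name))

    def names(self, entity_id):
        """Return every name recorded for an entity ID."""
        return list(self._names_by_id[entity_id])
```

With a table like this, looking up “IBM” or “Big Blue” would return the same ID as “International Business Machines Corporation”, which is the point of tying synonyms to one identifier.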
The patent is:
Learning synonymous object names from anchor texts
Invented by Krzysztof Czuba, Jonathan T. Betz, Jeffrey C. Reynar
Assigned to Google
US Patent 8,738,643
Granted May 27, 2014
Filed: August 2, 2007
A repository contains objects representing entities. The objects also include facts about the represented entities. The facts are derived from source documents.
A synonymous name of an object is determined by:
- Identifying a source document from which one or more facts of the entity represented by the object were derived,
- Identifying a plurality of linking documents that link to the source document through hyperlinks, each hyperlink having an anchor text,
- Processing the anchor texts in the plurality of linking documents to generate a collection of synonym candidates for the entity represented by the object, and
- Selecting a synonymous name for the entity represented by the object from the collection of synonym candidates.
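The four steps in that abstract can be sketched roughly in Python. This is only an illustration under my own assumptions: the `Link` record, the `GENERIC_ANCHORS` stop list, and the `min_count` threshold are hypothetical, and the abstract doesn’t say how candidates are actually scored or selected.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source_url: str   # the linking document
    target_url: str   # the document the hyperlink points to
    anchor_text: str  # the visible text of the hyperlink


# Assumed stop list of anchors that say nothing about the entity.
GENERIC_ANCHORS = {"click here", "here", "home", "website", "link"}


def normalize(text):
    """Lowercase and collapse whitespace so variants count together."""
    return " ".join(text.lower().split())


def synonym_candidates(source_doc_url, links):
    """Count anchor texts of links pointing at the fact source document."""
    counts = Counter()
    for link in links:
        if link.target_url != source_doc_url:
            continue  # not a linking document for this source
        anchor = normalize(link.anchor_text)
        if anchor and anchor not in GENERIC_ANCHORS:
            counts[anchor] += 1
    return counts


def select_synonyms(candidates, known_names, min_count=2):
    """Keep frequent candidates that are not already names of the entity."""
    known = {normalize(name) for name in known_names}
    return [
        name
        for name, count in candidates.most_common()
        if count >= min_count and name not in known
    ]
```

Here `synonym_candidates` covers the “linking documents” and “processing the anchor texts” steps, and `select_synonyms` covers the final selection. A simple frequency cutoff stands in for whatever selection method Google really uses.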
I’ll be breaking the processes behind the patent down into specifics in my next post, but I’d definitely recommend getting your head around fact extraction. Googlebot’s fact extracting cousins are known as “janitors,” and there are multiple kinds of janitors, including some that look at the anchor text in links pointing to pages about entities to find synonyms for those entities.