Google on Crawling the Web of Data

A patent granted to Google this past fall explores how the search engine looks for patterns on Web pages and uses them to find facts to fill Google’s data repository (Knowledge Base).

An image from a local park in Carlsbad symbolizing the Sun.

I recently wrote a series of posts about Google collecting data to enable them to provide direct answers, starting with one titled Direct Answers – Natural Language Search Results for Intent Queries.

In one of those posts, I wrote about a paper (pdf) that the inventors of that patent co-authored, which describes ways that Google was finding and extracting facts from pages to include in a repository of facts.

That post is Direct Answers: How Answers are Extracted from Web Pages. Since then, I came across a patent granted to Google which describes this pattern-matching and extraction process.

In Google’s Financial Statement 10-K Filing for 2014, Google explained why they were adding more direct answers to their search results, telling us:

Imagining the ways things could be — without constraint — is the process we use to look for better answers to some of life’s everyday problems. It’s about starting with the “What if?” and then working relentlessly to see if we can find the answer.

It’s been that way from the beginning; providing ways to access knowledge and information has been core to Google and our products have come a long way in the last decade. We used to show just ten blue links in our results. You had to click through to different websites to get your answers, which took time. Now we are increasingly able to provide direct answers — even if you’re speaking your question using Voice Search — which makes it quicker, easier and more natural to find what you’re looking for.

The patent shows one path to collecting a fact database of the kind that can be used to provide direct answers to queries, and focuses upon extracting facts from documents. That patent is:

Unsupervised extraction of facts
Invented by Jonathan T. Betz and Shubin Zhao
Assigned to Google
United States Patent 8,825,471
Granted September 2, 2014
Filed: March 31, 2006

Abstract

A system and method for extracting facts from documents. A fact is extracted from a first document. The attribute and value of the fact extracted from the first document are used as a seed attribute-value pair. A second document containing the seed attribute-value pair is analyzed to determine a contextual pattern used in the second document. The contextual pattern is used to extract other attribute-value pairs from the second document. The extracted attributes and values are stored as facts.
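
To make the abstract a little more concrete, here is a minimal Python sketch of the seed idea. It is not the patent's actual implementation. The function names (learn_contextual_pattern, extract_pairs), the sample HTML, and the assumption that the "contextual pattern" is just the tag before the attribute, the text between attribute and value, and the tag after the value are all mine, made up for illustration.

```python
import re

def learn_contextual_pattern(document, seed_attribute, seed_value):
    """Return a regex that generalizes the context around the seed pair, or None."""
    seed_match = re.search(
        r"(<[^<>]+>)" + re.escape(seed_attribute)
        + r"(.*?)" + re.escape(seed_value) + r"(<[^<>]+>)",
        document,
        flags=re.DOTALL,
    )
    if seed_match is None:
        return None  # the seed pair does not appear in this document
    prefix, middle, suffix = seed_match.groups()
    # Keep the surrounding context fixed and let attribute and value vary.
    return re.compile(
        re.escape(prefix) + r"([^<>]+?)" + re.escape(middle)
        + r"([^<>]+?)" + re.escape(suffix)
    )

def extract_pairs(document, seed_attribute, seed_value):
    """Extract every attribute-value pair that shares the seed's context."""
    pattern = learn_contextual_pattern(document, seed_attribute, seed_value)
    return pattern.findall(document) if pattern else []

# Hypothetical "second document" that contains the seed pair.
second_document = (
    "<ul>"
    "<li>Name: Britney Spears</li>"
    "<li>Profession: actress, singer</li>"
    "<li>Sign: Sagittarius</li>"
    "</ul>"
)

# The seed pair is assumed to have been extracted from a first document.
pairs = extract_pairs(second_document, "Name", "Britney Spears")
```

In this sketch, one known pair (Name, Britney Spears) is enough to pick up the other pairs on the page that are laid out the same way.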

The idea behind looking for patterns to find facts at Google dates back to what may be their second patent, one filed by Sergey Brin in 1999. It involved looking for pages on the web that contained information about specific books, and for the patterns that information might be contained within. Being able to find similar patterns on other pages on the Web allowed for the collection of a fact repository drawn from the many pages on the web where that data appears.

Finding lots of pages on the Web that agree on the same data allows the facts about that data to be corroborated.
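
A rough way to picture corroboration is to count how many distinct sources report the same fact. The Python sketch below is only an illustration of that idea. The corroborate function, the min_sources threshold, the example.com-style source names, and the conflicting birth date are all assumptions of mine and do not come from the patent.

```python
from collections import defaultdict

def corroborate(extractions, min_sources=2):
    """Keep facts reported by at least `min_sources` distinct sources.

    `extractions` is an iterable of (source, object, attribute, value) tuples.
    Returns a dict mapping each normalized fact to its number of sources.
    """
    sources_by_fact = defaultdict(set)
    for source, obj, attribute, value in extractions:
        # Normalize case and surrounding whitespace so trivial differences
        # between pages do not split one fact into several.
        fact = tuple(part.strip().casefold() for part in (obj, attribute, value))
        sources_by_fact[fact].add(source)
    return {
        fact: len(sources)
        for fact, sources in sources_by_fact.items()
        if len(sources) >= min_sources
    }

# Hypothetical extractions; the third source disagrees on the date.
extractions = [
    ("site-a.example", "Britney Spears", "date of birth", "Dec. 2, 1981"),
    ("site-b.example", "Britney Spears", "date of birth", "Dec. 2, 1981"),
    ("site-c.example", "Britney Spears", "date of birth", "Dec. 12, 1981"),
    ("site-a.example", "Britney Spears", "sign", "Sagittarius"),
]

corroborated = corroborate(extractions)
```

Only the date that two sources agree on survives here. A real system would presumably weigh sources rather than simply count them.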

Here’s an example of a pattern that this patent is looking for, a colon-delimited pair of an attribute name followed by a value:

According to one embodiment of the present invention, the importer identifies a predefined pattern in the document and applies the predefined pattern to extract attribute-value pairs. The extracted attribute-value pairs are then stored as facts. A predefined pattern defines specific, predetermined sections of the document which are expected to contain attributes and values.

When items on a page are separated by colons like this, they are often related.

For example, in an HTML document, the presence of a text block such as “<BR>*:*<BR>” (where `*` can be any string) may indicate that the document contains an attribute-value pair organized according to the pattern “<BR>(attribute text):(value text)<BR>”.

Such a pattern is predefined in the sense that it is one of a known list of patterns to be identified and applied for extraction in documents.

Of course, not every predefined pattern will necessarily be found in every document; identifying the patterns contained in a document determines which (if any) of the predefined patterns may be used for extraction on that document with a reasonable expectation of producing valid attribute-value pairs. The extracted attribute-value pairs are stored in the facts.
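
The colon-delimited pattern quoted above can be sketched with a regular expression. This is a minimal Python illustration, assuming the "<BR>(attribute text):(value text)<BR>" layout from the patent. The regex itself, the name extract_colon_pairs, and the sample document are my own and are not taken from the patent.

```python
import re

# "<BR>(attribute text):(value text)<BR>". Lookarounds leave the <BR> tags
# unconsumed, so back-to-back pairs that share a <BR> are all matched.
COLON_PAIR = re.compile(
    r"(?<=<BR>)\s*([^<>:]+?)\s*:\s*([^<>]+?)\s*(?=<BR>)",
    re.IGNORECASE,
)

def extract_colon_pairs(document):
    """Return (attribute, value) tuples found between <BR> delimiters."""
    return COLON_PAIR.findall(document)

# Hypothetical document; the last block has no colon and is skipped.
sample_document = (
    "<BR>Date of birth: Dec. 2, 1981"
    "<BR>Place of birth: Kentwood, La."
    "<BR>No colon here<BR>"
)

colon_pairs = extract_colon_pairs(sample_document)
```

An empty result would suggest that this particular predefined pattern is not present in the document, which fits the patent's point that not every pattern is found on every page.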

Patterns identified on pages like this might be collected as Google crawls web pages and collects data about facts. The programs that check upon and extract facts from pages are known as Data Janitors at Google.

In addition to colon-delimited lists, Google might also look for attribute-value pairs set up in two-column tables, like the infoboxes you might see at Wikipedia. They tell us in the patent:

FIG. 4 is an example of a document containing attribute-value pairs organized according to a predefined pattern. According to one embodiment of the present invention, document may be analogous to the document described herein with reference to FIG. 3.

Document includes information about Britney Spears organized according to a two column table. According to one embodiment of the present invention, the two column table is a predefined pattern recognizable by the unsupervised fact extractor.

The pattern specifies that attributes will be in the left column and that corresponding values will be in the right column. Thus the unsupervised fact extractor may extract from the document the following attribute-value pairs using the predefined pattern: (name; Britney Spears), (profession; actress, singer), (date of birth; Dec. 2, 1981), (place of birth; Kentwood, La.), (sign; Sagittarius), (eye color; brown), and (hair color; brown). These attribute-value pairs can then be stored as facts, associated with an object, used as seed attribute-value pairs, and so on.
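
The two-column table pattern can be sketched in a similar way. The Python example below uses the standard library's html.parser and treats any table row with exactly two non-empty cells as an (attribute, value) pair. The class name, the helper function, and the sample table are hypothetical, and the "exactly two cells" rule is a simplifying assumption of mine rather than something the patent specifies.

```python
from html.parser import HTMLParser

class TwoColumnTableExtractor(HTMLParser):
    """Collect (attribute, value) pairs from rows that have exactly two cells."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._row = None   # cell texts of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Left column is the attribute, right column is the value.
            if len(self._row) == 2 and all(self._row):
                self.pairs.append((self._row[0], self._row[1]))
            self._row = None

def extract_table_pairs(html):
    parser = TwoColumnTableExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs

# Hypothetical markup loosely modeled on the patent's FIG. 4 example.
sample_table = """
<table>
  <tr><td>name</td><td>Britney Spears</td></tr>
  <tr><td>profession</td><td>actress, singer</td></tr>
  <tr><td>date of birth</td><td>Dec. 2, 1981</td></tr>
  <tr><td colspan="2">Photo gallery</td></tr>
</table>
"""

table_pairs = extract_table_pairs(sample_table)
```

The single-cell row is ignored because it does not fit the left-column attribute, right-column value layout.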

This Wikipedia infobox for San Diego shows the kind of pattern being talked about above as part of a two column table with attributes and values that could be pulled into a fact repository:

Attributes and values about San Diego can be extracted from an infobox like this one.

Take-Aways

This type of pattern-matching and extraction of facts is part of how Google uses the Web as a database of information. By extracting facts and storing them in a data repository, like Google’s Knowledge Graph, Google makes those facts available as direct answers.

population of san diego-city

31 thoughts on “Google on Crawling the Web of Data”

  1. Great effort by Google, it will make our lives easier. But how will Google verify the correctness of the facts? What if wrong information is given on more than one website?

  2. Hi Cathy,

    Google will look for the same attribute-value pairs on other websites, and attempt to corroborate those facts by finding them in multiple places. When they are the same in more than one place, that tends to increase confidence in the correctness of those facts.

  3. Reminds me of, a few months ago, Google saying they were going farther into a privately-compiled Knowledge Vault model because they couldn’t fully trust user generated annotation/input to provide good information. At that time, it was pointed out that they would pay lots of attention to extracting information from a page’s DOM.

  4. Hi Clay,

    There was some mention a few months back about a Knowledge Vault project from Google that pointed out a number of approaches that potentially might be used to improve the quality of facts coming out of Google. An article about it published at Search Engine Land contained a followup contact from Google which pointed out that the Knowledge Vault project was one of many projects going on at Google and that it shouldn’t be taken as a replacement for Google’s Knowledge Graph. I did some research on some of the approaches described under the Knowledge Vault (http://gofishdigital.com/good-bye-knowledge-graph-hello-google-knowledge-vault/), and what you mention about paying more attention to a page’s DOM when extracting information would make it less likely that information like user generated content on a page would be extracted as data.

  5. Agreed this release is related to Knowledge Vault which is a complement to, not a replacement of, KG sourced content. Found their paper (http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf) that says ” To create a knowledge base of such size, we extract facts from a large variety of sources of Web data, including free text, HTML DOM trees, HTML Web tables, and human annotations of Web pages. ” In my opinion, this is also related to why Google recently made an announcement about allowing CSS and JS to be crawled – because JS is rendered as part of the DOM (and, therefore, grist for the mill of Knowledge Vault). I did a few posts about those things around the time they were announced http://www.pmdigital.com/blog/2014/09/authorship-dead-long-live-knowledge-vault/ & http://www.pmdigital.com/blog/2014/11/google-want-crawl-css-js/

  6. Finding a pattern is different than finding a fact and having a model of the world we live in, or the way humans use knowledge to understand data coming their way. I did not see mention of organized creation and use of knowledge base by Google until Amit Singhal et al’s mention of “things not string” in 2013 (what we in research used to talk about as keyword -> entity -> relationship -> event progression in 1990s). They used Freebase to bootstrap their knowledge graph. [As I see it, 1999 Brin patent talks about stuff at IR/keywords and statistical support level, and does not attempt to understand real-world entities/things, as Google now (since 2013) does.] Freebase (which Google acquired) was manually built so naturally Google wants a scalable approach (now called Knowledge Vault). Never-Ending Language Learner (NELL), Yago, etc have demonstrated fact extraction from unstructured data, and Google is building on that lineage. I have not looked at this patent closely to judge what novelty there is with regards to many other machine learning approaches to extract facts from Wikipedia and general Web pages.

    The first patent that talks about concerted effort to create a large (domain specific) background knowledge base of facts (was called WorldModel just as Google use term Knowledge Graph for what many researchers called ontologies) and use it for semantic search, browsing, personalization, advertisement was discussed in http://www.google.com/patents/US6311194 (filed 2000/awarded 2001). Here is one look at evolution of semantic search and other efforts that particularly rely on creating knowledge bases and using to improve information processing (search and beyond): http://amitsheth.blogspot.com/2014/09/15-years-of-semantic-search-and.html

  7. I understand it is in Google’s interest to provide accurate data and facts ASAP to the searcher. But what is the impact on webmasters and their sites? I know you have been preaching the importance of going with Google’s plan, and helping them find the answer they need through your content, but what kind of impact or repercussions do you see on a site’s organic traffic if the site’s content is displayed in the SERPs? Do you think the user will click the website’s link anyway, or will he/she be satisfied with the answer directly in the SERP?

  8. Wow! Google is really doing their work just to provide a better service. I really love the techniques they have been using for better search answers and I found this article also very informative. Keep up the good work mate !

  9. well written.thats nice for they been working that is really great.i always hope that while im searching on google SERPs that must be helpful for me and for now i found google is the best place for know bout anything. 🙂

  10. Hi Amit,

    Thank you for your detailed comment. Google has been working towards the creation of something like a knowledge graph since before they acquired Metaweb and Freebase. They had a project that was being called the “Annotation Framework” which involved the creation of a Fact Repository, which this patent is part of. Yes, this patent focuses upon identifying patterns to extract attribute/value pairs from pages. It’s not a complete description of how a knowledge base might be constructed; it’s only part of it. Google had to build towards their “things, not strings” approach, and this was one of the steps in their process, as was the Brin patent.

  11. Hi David,

    It appears as if Google believes that searchers are sometimes better served by getting answers to some questions they ask without having to search through multiple sites to find those. What else can those site owners offer searchers? I don’t think in many cases that searchers will click through after having their answers presented to them – they may just be satisfied with the answers directly in the SERPS.

  12. Yes there might be different information available for a single fact in more than one websites. But will i go further once i found my desired result in very first search and that too on 1 ranked url.
    Yes most of the common users don’t do that. They just google, find and go. Even they are not aware of the word SERP. So user thinks Google knows better than anybody else. 😀

  13. It seems Google is now planning to show more efficient results. This will be certainly a good idea so that user can get complied SERPs results.

  14. I really like your blog.. very nice colors and theme. Did you make this website yourself or did you hire someone to do it for you? Plz reply as I’m looking to create my own blog and would like to know where u got this from. appreciate it

  15. Hi Bill,

    Have you ever analyzed a patent about how to become the reference site when Google gives a direct answer?

    Best,
    Tommy

  16. i think it is the best effort of google. Bill Slawski you also write well. Thanks

  17. Hmmm. If they master this, it seems to me that it will force all companies or businesses to buy Google adwords to get exposure, not a practice many small fledgeling businesses can immediately employ. The joy of the internet has been that it can level the playing field for companies large and small – Google wants to be a bit careful that it does not kill the goose that laid it the golden egg.

  18. So what do you see if Google using such wild cards to answer queries in search results. Will there be a day when Google will slowly accumulate all the vital data and will be giving direct answers to user queries? That would be an end to SEO in a way. What do you think?

  19. Ho! Google is indeed doing their task just to give a better service. I actually choose the techniques that they have been using for better inquiry answers and I think that this article also very instructive.

  20. It’s all fine and well that Google is shifting its algorithm in order to then try and get more “truthful” content for its users. That’s definitely a positive.

    The problem I have with this though is we are relying on what Google say here is “truthful” which will of course be opinion based.

  21. This was an awesome piece of info you wrote there.
    I actually couldn’t leave your blog without reading
    each and every words of your article. Keep it up blogger.

    Regards
    Ajay Kumar

  22. Great work by Google..!! I like this technique they have been using.
    Google is best search engine for knowing anything.
    It’s a very good article by you.
    thanks

  23. This post is full of details and I am a fan of details. Google is developing its service to provide a better service. However, if it succeeds in filtering all the gathered data and getting only what is true and influential 100%, I think that too many websites will be hurt, especially if they continue providing information on their SERPs. I must admit that Google is really doing a perfect job to select only the right information by comparing before stocking and showing.

  24. well written.thats nice for they been working that is really great.i always hope that while im searching on google SERPs that must be helpful for me and for now i found google is the best place for know bout anything. 🙂

  25. I’m so glad I stumbled on your blog! I’ve been devouring this Google patent stuff. Really interesting!

Comments are closed.