Fact extraction is growing as a method that search engines can use to identify and understand what the pages on a website are about, to collect facts about subjects, and to answer questions posed by people submitting queries to a search engine.
A recent paper from Google provides a nice overview of some methods being used for fact extraction. A Google patent application published last week explores looking at titles on pages, and at anchor text in related pages on the same domain, to determine a subject for a document.
The paper is Corroborate and Learn Facts from the Web (pdf), and the process described within it has been called GRAZER. Here’s a little about how it works:
It starts with facts imported from one website and takes them as known facts (seed facts). Then it tries to find mentions of the seed facts on other web sites. This involves retrieving relevant pages for each entity and then corroborating facts in them.
Once it finds mentions of facts in a page, a high-precision pattern discovery is applied to the surrounding area to find repeated HTML patterns. If a pattern can be found and it contains one of the example facts, GRAZER will extract all the facts that match the pattern and add them into the known fact set.
The enlarged known fact set will be used in the next learning step. This is a bootstrapping process and the known fact set keeps growing larger. The learning process continues until a stopping criterion is satisfied.
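To make that loop a little more concrete, here is a rough Python sketch of the bootstrapping idea. It is not GRAZER's actual implementation: the function names, the two sample pages, the regular-expression stand-in for pattern discovery, the "pattern must repeat at least twice" check, and the "stop when a round learns nothing new" criterion are all my own assumptions for illustration.

```python
import re

def learn_pattern(html, attribute, value):
    """Build a regex from the tags around one known (attribute, value) mention.

    Hypothetical stand-in for pattern discovery; returns None if the
    fact is not mentioned in this simple tag-wrapped form.
    """
    mention = re.search(
        r"(<[^<>]+>)" + re.escape(attribute)
        + r"(</[^<>]+>\s*<[^<>]+>)" + re.escape(value)
        + r"(</[^<>]+>)",
        html,
    )
    if mention is None:
        return None
    prefix, middle, suffix = (re.escape(g) for g in mention.groups())
    return re.compile(prefix + r"([^<>]+)" + middle + r"([^<>]+)" + suffix)

def bootstrap(seed_facts, pages_by_entity, max_rounds=5, min_repeats=2):
    """Grow a set of (entity, attribute, value) facts from seed facts."""
    known = set(seed_facts)
    for _ in range(max_rounds):
        new_facts = set()
        for entity, attribute, value in known:
            for html in pages_by_entity.get(entity, []):
                pattern = learn_pattern(html, attribute, value)
                if pattern is None:
                    continue
                matches = pattern.findall(html)
                # Assumed precision check: only trust patterns that repeat.
                if len(matches) < min_repeats:
                    continue
                for attr, val in matches:
                    new_facts.add((entity, attr.strip(), val.strip()))
        if new_facts <= known:
            break  # assumed stopping criterion: nothing new was learned
        known |= new_facts
    return known

# Made-up sample pages: the seed fact appears only on the first one.
page_one = ("<table><tr><td>Born</td><td>1879</td></tr>"
            "<tr><td>Died</td><td>1955</td></tr>"
            "<tr><td>Field</td><td>Physics</td></tr></table>")
page_two = ("<dl><dt>Died</dt><dd>1955</dd>"
            "<dt>Spouse</dt><dd>Elsa Einstein</dd></dl>")

facts = bootstrap(
    {("Albert Einstein", "Born", "1879")},
    {"Albert Einstein": [page_one, page_two]},
)
```

In this toy run, the first round learns the table-row pattern from the seed fact and picks up the other rows on the first page. One of those newly learned facts is mentioned on the second page, so the second round can learn that page's pattern as well, which is the sense in which the known fact set keeps growing.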
The patent application that I mentioned shares an author with the paper, Shubin Zhao. Its focus is on finding a subject for documents that have had facts extracted from them. It does this by looking at titles on the page and at anchor text from links pointing to the page from related pages within the same domain.
The paper provides a nice introduction to the methods described within the patent application.
The patent application:
Determining document subject by using title and anchor text of related documents
Invented by Shubin Zhao
US Patent Application 20070240031
Published October 11, 2007
Filed: March 31, 2006
A system and method identifies a subject for a source document. The system and method identifies a collection of peer documents from the same domain as the source document. For each of the peer documents, a collection of linking documents containing a hyperlink to the peer document is identified. For each of the peer documents, a label is generated by choosing the longest-match anchor text of the linking documents.
A pattern between the labels and the titles of the collection of peer documents is deduced. The subject of the source document is identified by applying the pattern to the title of the source document.
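Here is a small Python sketch of how the steps in that abstract might fit together. It is only one possible reading: the abstract does not define "longest-match" or the form a pattern takes, so I am assuming that a label is the longest anchor text found inside the peer document's title, and that a pattern is simply the text appearing before and after the label in the title. The function names and sample titles are hypothetical.

```python
from collections import Counter

def longest_match_label(title, anchor_texts):
    """Pick the longest anchor text that occurs inside the title (assumed reading)."""
    candidates = [a for a in anchor_texts if a and a in title]
    return max(candidates, key=len) if candidates else None

def deduce_pattern(peers):
    """peers: list of (title, anchor_texts) for documents on the same domain.

    Returns the most common (prefix, suffix) surrounding the label in
    the peer titles, or None if no label could be generated.
    """
    votes = Counter()
    for title, anchor_texts in peers:
        label = longest_match_label(title, anchor_texts)
        if label is None:
            continue
        start = title.index(label)
        votes[(title[:start], title[start + len(label):])] += 1
    return votes.most_common(1)[0][0] if votes else None

def subject_from_title(title, pattern):
    """Apply a (prefix, suffix) pattern to a title to pull out the subject."""
    prefix, suffix = pattern
    if (title.startswith(prefix) and title.endswith(suffix)
            and len(title) > len(prefix) + len(suffix)):
        return title[len(prefix):len(title) - len(suffix)]
    return None

# Made-up peer documents and the anchor text of links pointing to them.
peers = [
    ("Paris - Example Travel Guide", ["Paris", "Par", "click here"]),
    ("Rome - Example Travel Guide", ["Rome", "more"]),
    ("Berlin - Example Travel Guide", ["Berlin"]),
]

pattern = deduce_pattern(peers)
subject = subject_from_title("Lisbon - Example Travel Guide", pattern)
```

The point of the pattern step is that the source document itself may have no useful anchor text pointing at it. Once the site's title template has been learned from its peers, the subject can be read straight out of the source document's own title.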
Some Related Posts on Fact Extraction
This patent application goes into a lot of detail on the components of a fact repository. I’ve gone into a lot of detail on that subject in a previous post at: Google on the Extraction and Visualization of Facts
It also talks about Google janitors, which are software programs used to process data found on the Web. I came up with a list of some of the different types of janitor programs that might be used by Google in Google Janitors Clean Up Facts on the Web
This past summer saw a lot of patent applications published by Google involving fact extraction, and I created a list of many of them at: Google & Fact Extraction, Normalization, and Visualization