Google & Fact Extraction, Normalization, and Visualization

When we talk about how a search engine like Google crawls and indexes information from websites, it’s usually in the context of the Web results that the search engine shows to searchers.

Facts in Web Results

But with Universal Search and blended search results showing information from local search, question answering, definitions, and other sources, it may make sense to start paying more attention to how the search engine extracts facts from pages, creates “objects” from those facts, and ranks those objects.

In a post from last September, I went into a lot of detail on a Google patent application about data practices in Local Search, titled Generating Structured Information, which described how facts and information were taken from the Web and included in a local search repository.

Explosion of Patent Filings

That “Structured Information” document has been cited as a related patent filing in a number of new patent applications published this week on extracting facts, normalizing them, and visualizing them in different ways.

I’ve located a number of other related filings to try to get a broad overview of this data extraction approach, which is less concerned with indexing keywords that can be searched in a Web index, and more concerned with creating “objects” constructed of facts about specific people, places, businesses, and other entities.

Another older patent application that might add some context to these newer filings is Learning facts from semi-structured text (20060293879), which was filed back in 2005 and published this past December. The abstract from that document describes one approach to collecting facts from web pages:

A method and system of learning, or bootstrapping, facts from semi-structured text is described. Starting with a set of seed facts associated with an object, documents associated with the object are identified. The identified documents are checked to determine if each has at least a first predefined number of seed facts.

If a document does have at least a first predefined number of seed facts, a contextual pattern associated with the seed facts is identified and other instances of content in the document matching the contextual pattern are identified.

If the document includes at least a second predefined number of the other instances of content matching the contextual pattern, then facts may be extracted from the other instances.
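To make the steps in that abstract a little more concrete, here is a minimal Python sketch of the bootstrapping idea. It is my own illustration, not the method claimed in the patent application. The names Fact, find_separator, and learn_facts are hypothetical, and so is the assumption that the "contextual pattern" is just the separator text sitting between an attribute and its value on a single line (as in "Born: 14 March 1879"). The two thresholds, min_seeds and min_matches, stand in for the "first" and "second" predefined numbers.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A hypothetical attribute/value pair attached to some object."""
    attribute: str
    value: str


def find_separator(line, fact):
    """Return the text between attribute and value if the line states the fact."""
    if not line.startswith(fact.attribute):
        return None
    rest = line[len(fact.attribute):]
    if not fact.value or not rest.endswith(fact.value):
        return None
    separator = rest[:len(rest) - len(fact.value)]
    return separator or None


def learn_facts(seed_facts, document, min_seeds=2, min_matches=2):
    """Extract new facts from one document, or return an empty set."""
    lines = [ln.strip() for ln in document.splitlines() if ln.strip()]

    # Step 1: check that the document contains enough seed facts.
    separators = Counter()
    matched_seeds = set()
    seed_lines = set()
    for line in lines:
        for fact in seed_facts:
            separator = find_separator(line, fact)
            if separator:
                separators[separator] += 1
                matched_seeds.add(fact)
                seed_lines.add(line)
                break
    if len(matched_seeds) < min_seeds:
        return set()

    # Step 2: assume the most common separator is the contextual pattern.
    pattern, _ = separators.most_common(1)[0]

    # Step 3: find other lines in the document that match the pattern.
    candidates = set()
    for line in lines:
        if line in seed_lines:
            continue
        attribute, found, value = line.partition(pattern)
        if found and attribute and value:
            candidates.add(Fact(attribute.strip(), value.strip()))

    # Step 4: only trust the pattern if it matched often enough.
    if len(candidates) < min_matches:
        return set()
    return candidates


if __name__ == "__main__":
    page = """Albert Einstein
    Born: 14 March 1879
    Died: 18 April 1955
    Field: Physics
    Spouse: Mileva Maric"""
    seeds = {Fact("Born", "14 March 1879"), Fact("Died", "18 April 1955")}
    for fact in sorted(learn_facts(seeds, page), key=lambda f: f.attribute):
        print(fact.attribute, "=", fact.value)
```

With the two seed facts above, the sketch treats ": " as the pattern and picks up the "Field" and "Spouse" lines as new facts, while the heading line is ignored because it does not match. A real system would presumably need much richer patterns (HTML table cells, list markup, and so on) and some way to corroborate what it extracts.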

Patent Applications on Objects and Facts

I’ll likely be spending some time going over the following documents in the months to come, so I’m not going to summarize or analyze them in depth here, but from what I’ve skimmed and read, they present some ideas worth studying more closely.

Annotation Framework,
Filed on February 17, 2006, Published August 23, 2007, by Tom Richford and Jonathan T. Betz (20070198499)

Automatic Object Reference Identification and Linking in a Browseable Fact Repository,
Filed on Feb. 17, 2006, Published August 23, 2007, by Andrew W. Hogue and Jonathan T. Betz (20070198481)

Browseable fact repository,
Filed on Feb. 17, 2006, Published August 23, 2007, by Andrew W. Hogue and Jonathan T. Betz (20070198503)

Query Language,
Filed on February 17, 2006, Published August 23, 2007, by Andrew W. Hogue and Doug Rhode (20070198480)

Support for Object Search,
Filed on February 17, 2006, Published August 23, 2007, by Alex Kehlenbeck, Andrew W. Hogue, and Jonathan T. Betz (20070198451)

Unsupervised extraction of facts
Filed March 31, 2006, Published June 28, 2007, by Jonathan T. Betz and Shubin Zhao (20070150800)

Mechanism for managing facts in a fact repository
Filed April 7, 2006, Published June 21, 2007, by Andrew W. Hogue and Jonathan T. Betz (20070143317)

Anchor text summarization for corroboration
Filed on March 31, 2006, Published on June 21, 2007, by Shubin Zhao and Jonathan T. Betz (20070143282)

Visualizing Data Objects

User interface for facts query engine with snippets from information sources that include query terms and answer terms
Filed March 31, 2005, Published October 5, 2006, by Andrew William Hogue (20060224582)

Data Object Visualization Using Graphs,
Filed on Jan. 27, 2006, Published August 9, 2007, by Andrew W. Hogue, David Vespe, Alex Kehlenbeck, Mike Gordon, Jeffrey C. Reynar, David Alpert (20070185870)

Data Object Visualization Using Maps,
Filed on Jan. 27, 2006, Published August 9, 2007, by Andrew W. Hogue, David Vespe, Alex Kehlenbeck, Mike Gordon, Jeffrey C. Reynar, David Alpert (20070185895)

Designating Data Objects for Analysis,
Filed on Jan. 27, 2006, Published August 2, 2007, by Andrew W. Hogue, David Vespe, Alex Kehlenbeck, Mike Gordon, Jeffrey C. Reynar, David Alpert (20070179965)

Displaying Facts on a Linear Graph
Filed September 27, 2006, Published August 2, 2007, by David J. Vespe, Andrew W. Hogue, Alexander Kehlenbeck, Michael Gordon, Jeffrey C. Reynar, and David B. Alpert (20070179952)

Data Object Visualization,
Filed on Jan. 27, 2006, Unpublished, by Andrew W. Hogue, David Vespe, Alex Kehlenbeck, Mike Gordon, Jeffrey C. Reynar, David Alpert (Attorney Docket No. 24207-10946)

Object Categorization for Information Extraction,
Filed on Jan. 27, 2006, Unpublished, by Jonathan T. Betz (Attorney Docket No. 24207-10952)

Normalizing Data

Attribute Entropy as a Signal in Object Normalization,
Filed February 17, 2006, Published August 23, 2007, by Jonathan T. Betz, Vivek Menezes (20070198597)

Entity normalization via name normalization,
Filed March 31, 2006, Published August 23, 2007, by Jonathan T. Betz (20070198600)

ID Persistence Through Normalization,
Filed February 17, 2006, Published August 23, 2007, by Jonathan T. Betz, Andrew W. Hogue (20070198577)

Modular Architecture for Entity Normalization,
Filed February 17, 2006, Published August 23, 2007, by Jonathan T. Betz, Farhan Shamsi (20070198598)
