Google on the Extraction and Visualization of Facts

Yesterday, I wrote about how Google might present facts extracted from pages in timelines or maps, according to a patent application filed last week.

It wasn’t the only piece of intellectual property coming out of the US Patent and Trademark Office for Google on the extraction and visualization of facts. Another, which may be even more interesting, describes how a user might pull together facts returned by queries of the fact database and choose to present those facts in a number of ways.

Designating data objects for analysis
Invented by Andrew W. Hogue, David J. Vespe, Alexander Kehlenbeck, Michael Gordon, Jeffrey C. Reynar, and David B. Alpert
US Patent Application 20070179965
Published August 2, 2007
Filed: January 27, 2006

Abstract

A fact repository stores objects. Each object includes a collection of facts, where a fact comprises an attribute and a value. An object access module receives objects from the fact repository. The objects can result from multiple different queries executed against the fact repository. A user interface (UI) generation module provides a UI enabling an end-user to designate objects from multiple different queries for subsequent analysis by storing the objects in a virtual collection.

Components of a fact repository

The patent application goes into a lot of detail on how this system might work. Here are some of those details about the mechanical aspects of a fact repository.

The components used to manage facts in a fact repository include importers, janitors, a build engine, a service engine, and the fact repository itself. These can all be implemented as software modules (or programs).

Importers – process documents received from web pages by reading the data content of those pages and extracting facts from them. Importers also determine the subject or subjects with which the facts are associated, and extract such facts into individual items of data for storage in the fact repository. There may be different types of importers for different types of documents, depending upon the format or document type.

Janitors – process facts extracted by importers, in areas like data cleansing, object merging, and fact induction. It’s possible to have a number of janitors that perform different types of data management operations on the facts, such as:

– Finding duplicate facts (that is, facts that convey the same factual information) and merging them.
– Normalizing facts into standard formats.
– Removing unwanted facts from the repository (pornographic content, for instance).
– Performing other data management functions, such as translation, compression, and spelling or grammar correction.

Example of normalization – One page may list Britney Spears’ “birthday” as “12/2/1981” while another page has her “date of birth” listed as “Dec. 2, 1981.” One janitor could rewrite both “birthday” and “date of birth” as “birthdate.” Another janitor may notice that “12/2/1981” and “Dec. 2, 1981” are the same day, and could choose the preferred form, remove the other fact, and combine the source lists for the two facts. Looking at source pages for facts, some may be exact matches, while others may present the information in different forms.
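To make the normalization example a little more concrete, here is a minimal Python sketch of two janitors and a merge step. Everything in it is an assumption made for illustration: the `Fact` class, the synonym table, the two date layouts, the made-up source names, and the choice of a date object as the "preferred form" are not taken from the patent filing.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical synonym table; the filing does not list one.
ATTRIBUTE_SYNONYMS = {"birthday": "birthdate", "date of birth": "birthdate"}

# Date layouts this sketch can read (assumes US month-first order).
DATE_FORMATS = ("%m/%d/%Y", "%b. %d, %Y")


@dataclass
class Fact:
    object_id: str
    attribute: str
    value: object
    sources: set = field(default_factory=set)


def normalize_attribute(fact):
    """Janitor 1: rewrite known attribute synonyms to one name."""
    fact.attribute = ATTRIBUTE_SYNONYMS.get(fact.attribute.lower(), fact.attribute)
    return fact


def normalize_date(fact):
    """Janitor 2: parse recognizable date strings into date objects."""
    if isinstance(fact.value, str):
        for fmt in DATE_FORMATS:
            try:
                fact.value = datetime.strptime(fact.value, fmt).date()
                break
            except ValueError:
                continue  # try the next layout; leave the value alone if none fit
    return fact


def merge_duplicates(facts):
    """Keep one fact per (object, attribute, value) and pool the sources."""
    merged = {}
    for fact in facts:
        key = (fact.object_id, fact.attribute, fact.value)
        if key in merged:
            merged[key].sources |= fact.sources
        else:
            merged[key] = fact
    return list(merged.values())


facts = [
    Fact("britney_spears", "birthday", "12/2/1981", {"page-a.example"}),
    Fact("britney_spears", "date of birth", "Dec. 2, 1981", {"page-b.example"}),
]
cleaned = merge_duplicates([normalize_date(normalize_attribute(f)) for f in facts])
print(cleaned)
```

After both janitors run, the two facts share the same object, attribute, and value, so the merge step keeps a single "birthdate" fact that lists both source pages.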

Build engine – builds and manages the repository.

Service engine – an interface used to query the repository. It processes queries, scores matching objects, and returns them to the caller.

Repository – stores facts extracted from a number of pages. A page from which a particular fact may be extracted is considered a source document (or “source”) of that particular fact. In a repository, each fact may be associated with exactly one object, through an object ID that uniquely identifies that object. This way, any number of facts may be associated with an individual object, by including the object ID for that object in each of those facts.
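The relationship between facts and objects can be pictured with a small Python sketch. It assumes a simple in-memory store; the class names, field names, object IDs, values, and source names are hypothetical and only illustrate the idea that an object is whatever set of facts shares one object ID.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    object_id: str      # ties the fact to exactly one object
    attribute: str
    value: str
    sources: tuple = () # pages the fact was extracted from


class FactRepository:
    """Toy in-memory repository; not the storage design from the filing."""

    def __init__(self):
        self._facts_by_object = defaultdict(list)

    def add(self, fact):
        # Any number of facts can point at the same object ID.
        self._facts_by_object[fact.object_id].append(fact)

    def facts_for(self, object_id):
        # Return a copy so callers cannot modify the stored list.
        return list(self._facts_by_object.get(object_id, []))


repo = FactRepository()
repo.add(Fact("obj-1", "name", "Helium", ("page-a.example",)))
repo.add(Fact("obj-1", "atomic number", "2", ("page-b.example",)))
repo.add(Fact("obj-2", "name", "Lithium", ("page-a.example",)))

for fact in repo.facts_for("obj-1"):
    print(fact.attribute, "=", fact.value)
```

Nothing stores an "object" directly in this sketch. Looking up an object ID simply gathers every fact that carries it.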

Visualization of Facts

I’m going to let the pictures tell the story on this one. A collection of facts might be gathered after a search, or a combination of searches, and used to put together a table of data, like the following table of facts about atomic properties.

A Collection of Facts Involving Atomic Properties
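As a rough illustration of how objects from more than one query might be pooled into a single collection and laid out as a table, here is a short Python sketch. The `VirtualCollection` class, its method names, the dictionary layout, and the sample query results are all assumptions for this example; they are not the contents of the patent's figure or its actual interface.

```python
class VirtualCollection:
    """Objects a user has set aside for analysis, keyed by object ID."""

    def __init__(self):
        self._objects = {}

    def designate(self, objects):
        # An object returned by several queries is stored only once.
        for obj in objects:
            self._objects[obj["id"]] = obj

    def table(self, attributes):
        # One row per object: its name followed by the requested attribute values.
        return [
            [obj["name"]] + [obj["facts"].get(attr) for attr in attributes]
            for obj in self._objects.values()
        ]


# Results of two separate, made-up queries against a fact repository.
light_elements = [
    {"id": "el-1", "name": "Hydrogen", "facts": {"atomic number": 1, "atomic mass": 1.008}},
    {"id": "el-2", "name": "Helium", "facts": {"atomic number": 2, "atomic mass": 4.0026}},
]
noble_and_alkali = [
    {"id": "el-2", "name": "Helium", "facts": {"atomic number": 2, "atomic mass": 4.0026}},
    {"id": "el-3", "name": "Lithium", "facts": {"atomic number": 3, "atomic mass": 6.94}},
]

collection = VirtualCollection()
collection.designate(light_elements)
collection.designate(noble_and_alkali)

rows = collection.table(["atomic number", "atomic mass"])
for row in rows:
    print(row)
```

Helium shows up in both query results but appears in the table once, because the collection is keyed by object ID.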

Given a choice of ways to present the information about the facts collected, the following shows a visualization of the atomic mass of those elements:

A Visualization of Atomic Mass Facts

I can see how this would be fun to use in a lot of different ways, from grabbing and displaying baseball statistics to website visit information, from looking at the rise and fall of stock prices to building historical timelines and maps.

4 thoughts on “Google on the Extraction and Visualization of Facts”

  1. This is becoming the concept of what a ‘search’ engine really is: looking for real information from t’internet.

    This would have these effects:
    1) Cease trawling through pages based on some keyword phrase, rather than the actual content of the site

    2) Reinforce urban myths. A lot of the criticisms of Wikipedia can be brought to this, because after all, the Internet is one big repository of people’s opinions, rather than facts. Another form of Google bombing could take place:

    If you created many sites saying “George Bush is a monkey”, then did a search “What is George Bush?”, based on these algorithms, it would come up “George Bush is a monkey”.
