IBM’s Unstructured Information Management Patent

What are the differences between enterprise search and web search? Will developments in enterprise search someday make it possible to build search engines that index the web as well as, or better than, present web search engines?

IBM was granted a patent today on their Unstructured Information Management Architecture (UIMA), which was made available to open source developers last summer. SourceForge has more information about the open source nature of UIMA, as does IBM. IBM recently decided to move this open source development over to Apache.

Unstructured Information Management was the subject of a 2004 issue of the IBM Systems Journal, which contains some detailed articles on the topic. One by A. Z. Broder and A. C. Ciccolo, Towards the next generation of enterprise search technology, is highly recommended if you would like to get a grasp of the potential of this approach to indexing unstructured information. It describes some of the differences between enterprise search and web search, and provides summaries of the other articles in the issue. I found this snippet interesting:

The field of UIM may come full circle: while the unstructured search paradigm on the Web exploded in the consumer sphere before being adopted in the enterprise, we believe that the combination of semantic and linguistic annotations with unstructured search will follow the more conventional path of first being developed in the enterprise sphere before becoming pervasive in the Web world.

Both authors are also listed amongst the co-inventors on the newly granted patent, and Dr. Broder is now at Yahoo!

Some information about the patent:

System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations
Invented by Andrei Z Broder, David Carmel, Arthur C. Ciccolo, David Ferrucci, Yoelle Maarek, Yosi Mass, Aya Soffer, and Wlodek W. Zadrozny
Assigned to IBM
US Patent 7,139,752
Granted November 21, 2006
Filed May 30, 2003

Abstract

Disclosed is a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators and various adapters. The searching technique makes use of a two-level searching technique. Also disclosed is system, method and computer program product to process document data. The method includes inputting a document and operating at least one text analysis engine that comprises a plurality of coupled annotators for tokenizing document data for identifying and annotating a particular type of semantic content. Operating the at least one text analysis engine generates a plurality of views of a document, where each of the plurality of views are derived from a different tokenization of the document. The method further includes storing the plurality of views in a common data structure associated with the document.
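To make the abstract a little more concrete, here is a minimal sketch of the idea of pipelined annotators producing several views of one document from different tokenizations, all stored in a common data structure. This is not the patented method and not the UIMA API; every name in it (`DocumentStore`, `Token`, `run_pipeline`, and the three annotators) is a hypothetical one chosen for illustration, and the tokenization rules are deliberately simplistic.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Token:
    """One token with its character offsets and an optional semantic label."""
    text: str
    start: int
    end: int
    label: str = ""


@dataclass
class DocumentStore:
    """Hypothetical common data structure: the raw text plus named views."""
    text: str
    views: Dict[str, List[Token]] = field(default_factory=dict)


# An annotator reads the store and adds to it (assumed interface).
Annotator = Callable[[DocumentStore], None]


def whitespace_tokenizer(doc: DocumentStore) -> None:
    # View 1: split on whitespace, so punctuation stays attached to words.
    doc.views["whitespace"] = [
        Token(m.group(), m.start(), m.end())
        for m in re.finditer(r"\S+", doc.text)
    ]


def word_tokenizer(doc: DocumentStore) -> None:
    # View 2: runs of word characters only, so punctuation is dropped.
    doc.views["words"] = [
        Token(m.group(), m.start(), m.end())
        for m in re.finditer(r"\w+", doc.text)
    ]


def year_annotator(doc: DocumentStore) -> None:
    # A later pipeline stage: label tokens in the "words" view that
    # look like four-digit years.
    for token in doc.views.get("words", []):
        if re.fullmatch(r"(19|20)\d{2}", token.text):
            token.label = "year"


def run_pipeline(text: str, annotators: List[Annotator]) -> DocumentStore:
    # Run each annotator in order against the same shared store.
    doc = DocumentStore(text)
    for annotate in annotators:
        annotate(doc)
    return doc


doc = run_pipeline(
    "IBM filed the patent in 2003.",
    [whitespace_tokenizer, word_tokenizer, year_annotator],
)
for name, tokens in doc.views.items():
    print(name, [(t.text, t.label) for t in tokens])
```

The two views disagree about the last token of the sentence: the whitespace view keeps the trailing period, while the word view does not, and only the word view lets the year annotator find a match. Keeping both views side by side in one structure is the point of the sketch; a search engine could then choose whichever tokenization suits a given query.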

While IBM initially developed this technology, releasing some or all of it to open source developers is interesting because it may help spur adoption and further development of this unstructured information management architecture.

One person behind this research, Dr. Broder, is now with Yahoo!, and another, Dr. Maarek, now heads Google’s Haifa research center. It’s possible that ideas developed while this architecture was created are among those being considered as both search engines plan for the future.

Will technology developed first in the enterprise field overtake that developed on the web, or merge with it? It’s possible.