Take a collection of documents, say about the size of the Web, and try to organize them based upon the textual similarities between them. Can that organization provide a useful way to index the Web?
A new patent application from Microsoft explores the idea, and points to a paper from the year 2000 on the subject as an influence: Self Organization of a Massive Document Collection. The authors of the paper used this method to organize and index a sample set of documents, a group of 6,840,568 patent abstracts.
The benefit of this process, the authors note, is that it would do more than rely upon matching keywords in a document. It would allow a closer match to a searcher’s intent. Here’s an example from the patent application of the issue at hand:
One limitation of keyword searching is the difficulty in providing a context for the keywords.
For example, consider a search query containing the word “pizza.”
Documents that typically contain this word also have other words in common such as “delivery”, “pepperoni”, “sauce”, “restaurant” etc.
However, it is quite possible that there are documents that contain the word “pizza” prominently, but have nothing to do with the more common use of the word pizza.
For instance, a new software technology called “pizza” might be invented by a startup and, therefore, be featured prominently on that company’s web page.
If this invention is new and not well known then this use of the word “pizza” will not be the likely intent of users when they enter the query pizza, so the results for this search query should not feature this page prominently.
Unfortunately, a conventional search engine does not have the ability to distinguish between the new, uncommon usage of the word “pizza” and the usage that is probably desired by the person submitting the search query.
The concept of using self-organizing maps to address indexing is an interesting one. More information on the subject: Self-Organizing Maps: A Tourist’s Guide to Neural Network (re)Presentation(s)
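For readers who haven't met self-organizing maps before, here is a minimal sketch of how one might be trained. It is a generic one-dimensional map written for illustration, not the method from the paper or the patent application, and every name and parameter in it (train_som, best_matching_unit, the learning rate, the shrinking radius) is an assumption of mine.

```python
import random


def best_matching_unit(nodes, x):
    """Return the index of the node whose weights are closest to x."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(nodes)), key=lambda i: sq_dist(nodes[i]))


def train_som(samples, n_nodes=4, epochs=50, lr=0.5, seed=0):
    """Train a tiny 1-D self-organizing map (illustrative only)."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Start every node with random weights in [0, 1).
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs           # decays from 1 toward 0
        rate = lr * frac                      # shrinking learning rate
        radius = int((n_nodes / 2) * frac)    # shrinking neighborhood
        for x in samples:
            bmu = best_matching_unit(nodes, x)
            for i, w in enumerate(nodes):
                dist = abs(i - bmu)
                if dist <= radius:
                    # Nodes nearer the winner on the map move more.
                    influence = 1.0 / (1 + dist)
                    for d in range(dim):
                        w[d] += rate * influence * (x[d] - w[d])
    return nodes


# Toy feature vectors: two loose groups of "documents".
samples = [
    [1.0, 0.9, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
]
nodes = train_som(samples)
```

After training, calling best_matching_unit(nodes, vector) tells you which spot on the map a document lands on. The hope is that documents with similar features end up on the same or neighboring nodes, which is what makes the map usable as a way of organizing a collection.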
Here’s a link and some information about the patent application:
System and method for improving search relevance
Invented by Christopher Weare
Assigned to Microsoft
US Patent Application 20060218138
Published September 28, 2006
Filed on March 25, 2005
A system and method for performing context based document searching is provided. A grid of content tiles is constructed corresponding to a desired concept space. Each content tile is assigned a content tag and is associated with a series of feature values. The feature values are trained to correspond to various regions of the content space. Documents are associated with one or more content tags based on a comparison of document feature values with content tile feature values. A search query is modified to include one or more content tags based on the terms in the search query and/or user preferences. The search query is then matched to documents associated with content tags contained in the search query.
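To make the abstract a little more concrete, here is a rough sketch of the flow it describes, using the “pizza” example from above. It is only my reading of the abstract, not Microsoft's implementation: the vocabulary, the two content tiles and their hand-picked feature values (the application describes training them), the use of cosine similarity for the comparison, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical vocabulary used to turn text into feature values.
VOCAB = ["pizza", "delivery", "pepperoni", "sauce", "software", "compiler"]

# Hypothetical content tiles: content tag -> feature values.
# Hand-picked here; the application describes training these values.
TILES = {
    "food": [1.0, 0.8, 0.8, 0.6, 0.0, 0.0],
    "software": [0.3, 0.0, 0.0, 0.0, 1.0, 0.9],
}


def features(text):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_tag(text):
    """Pick the tag of the tile whose feature values best match the text."""
    vec = features(text)
    return max(TILES, key=lambda tag: cosine(vec, TILES[tag]))


DOCS = {
    "menu": "pizza delivery with pepperoni and extra sauce",
    "startup": "pizza is a new software compiler from our startup",
}

# Associate each document with a content tag ahead of time.
DOC_TAGS = {doc_id: content_tag(text) for doc_id, text in DOCS.items()}


def search(query):
    """Add a content tag to the query, then match on both tag and keywords."""
    tag = content_tag(query)
    terms = set(query.lower().split())
    return [
        doc_id
        for doc_id, text in DOCS.items()
        if DOC_TAGS[doc_id] == tag and terms & set(text.lower().split())
    ]
```

In this toy setup, a bare query for pizza gets the food tag, so only the menu page comes back even though the startup's page also contains the word. A query such as pizza compiler leans toward the software tile instead and returns the startup's page. The abstract also mentions user preferences as a source of content tags, which this sketch leaves out.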