Maybe there’s a little irony in the date on which United States Patent 6,990,628, Method and apparatus for measuring similarity among electronic documents, was granted: January 24, 2006.
After all, that’s the day when many were saying that Yahoo! was giving up on matching or beating Google in the field of search, based upon some comments from the company’s Chief Financial Officer. Does this new patent assigned to Yahoo! hold hope for them to keep up, and maybe even surpass the present king of the search mountain, Google?
We may only find that out in the future, but it is an interesting document, and it covers a lot of ground. It’s worth poking through, and getting a sense of what it covers. A little more about the patent itself, first. The named inventors are Michael E. Palmer, Gordon Sun, and Hongyuan Zha. While it was granted on January 24, 2006, it was originally filed on June 14, 1999.
That filing date may be a little misleading. Judging from the patents and other documents referenced in the patent (which I’ve provided links to at the end of this post), it appears that the document evolved between its original filing and its grant this week.
Here’s the abstract, which gives a taste of what this is about:
A method and apparatus are provided for determining when electronic documents stored in a large collection of documents are similar to one another. A plurality of similarity information is derived from the documents.
The similarity information may be based on a variety of factors, including hyperlinks in the documents, text similarity, user click-through information, similarity in the titles of the documents or their location identifiers, and patterns of user viewing.
The similarity information is fed to a combination function that synthesizes the various measures of similarity information into combined similarity information. Using the combined similarity information, an objective function is iteratively maximized in order to yield a generalized similarity value that expresses the similarity of particular pairs of documents.
In an embodiment, the generalized similarity value is used to determine the proper category, among a taxonomy of categories in an index, cache or search system, into which certain documents belong.
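To make the "combination function" in the abstract a little more concrete, here is a minimal sketch in Python. It assumes the simplest possible combination, a weighted average of per-signal similarity matrices. The function name, the signal names, and the weights are all made up for illustration, and the sketch leaves out the iterative maximization of an objective function that the abstract describes.

```python
from typing import Dict, List

Matrix = List[List[float]]


def combine_similarities(measures: Dict[str, Matrix],
                         weights: Dict[str, float]) -> Matrix:
    """Weighted average of several document-by-document similarity matrices.

    Assumption: a plain weighted average stands in for the patent's
    combination function, which this post does not spell out.
    """
    names = list(measures)
    n = len(measures[names[0]])
    total_weight = sum(weights[name] for name in names)
    combined = [[0.0] * n for _ in range(n)]
    for name in names:
        share = weights[name] / total_weight  # normalize so shares sum to 1
        for i in range(n):
            for j in range(n):
                combined[i][j] += share * measures[name][i][j]
    return combined


# Hypothetical similarity scores for three documents from two signals.
measures = {
    "text": [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]],
    "links": [[1.0, 0.4, 0.0], [0.4, 1.0, 0.6], [0.0, 0.6, 1.0]],
}
weights = {"text": 3.0, "links": 1.0}  # made-up weights: text counts 3x

combined = combine_similarities(measures, weights)
print(round(combined[0][1], 2))  # 0.7
```

With these weights, text similarity contributes three quarters of each combined score and link similarity one quarter.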
Here are some of the concepts expressed in the patent:
It tries to categorize pages based on their similarity to hand-classified training pages from different categories. To do that, it uses a document similarity matrix that combines two or more different measures of document similarity. This type of categorization should improve the relevance of results returned from a search.
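As a rough illustration of that idea, the sketch below assigns an unlabeled page to the category whose hand-classified pages it most resembles on average. The averaging rule, the function name, and the sample numbers are assumptions for illustration only. The patent describes a more involved iterative process.

```python
from collections import defaultdict
from typing import Dict, List


def categorize(doc: int,
               similarity: List[List[float]],
               labels: Dict[int, str]) -> str:
    """Return the category whose labeled pages are, on average, most similar to doc.

    Assumption: 'similarity' is an already-combined document similarity
    matrix, and 'labels' maps hand-classified page indexes to categories.
    """
    scores = defaultdict(list)
    for labeled_doc, category in labels.items():
        if labeled_doc != doc:  # don't compare a page with itself
            scores[category].append(similarity[doc][labeled_doc])
    return max(scores, key=lambda c: sum(scores[c]) / len(scores[c]))


# Hypothetical combined similarity matrix for four pages.
similarity = [
    [1.0, 0.7, 0.1, 0.2],
    [0.7, 1.0, 0.2, 0.6],
    [0.1, 0.2, 1.0, 0.3],
    [0.2, 0.6, 0.3, 1.0],
]
labels = {0: "Sports", 2: "Finance"}  # pages 0 and 2 were classified by hand

print(categorize(1, similarity, labels))  # Sports
print(categorize(3, similarity, labels))  # Finance
```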
These measures of similarity include looking at similarities in:
— Text within the documents
— Text word vectors
— Patterns in how people click through the pages
- frequency of clicks,
- click context,
- duration of viewing,
- proximity in time to other clicks, or
- proximity in context to other clicks.
— Patterns in how people view the pages
- frequency of viewing,
- viewing context,
- duration of viewing,
- proximity in time to other documents viewed by the same user, or
- similarity of patterns of viewing by all users.
— URL sub-components
— Multimedia components (audio, video, images, etc.)
— Titles of pages
— Cache log information from a web cache system
— Combinations of the above factors
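For a sense of what comparing text word vectors can look like in practice, here is a small sketch that turns each document into a bag of word counts and compares two of them with cosine similarity. Cosine similarity is a common choice for this, but it is an assumption on my part and not something the list above specifies. The helper names and sample strings are made up.

```python
import math
from collections import Counter


def word_vector(text: str) -> Counter:
    """Build a simple word-count vector from whitespace-separated text."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


page_a = word_vector("yahoo search patent")
page_b = word_vector("yahoo search engine")
page_c = word_vector("cooking pasta recipes")

print(round(cosine_similarity(page_a, page_b), 3))  # 0.667
print(cosine_similarity(page_a, page_c))            # 0.0
```

Pages that share two of their three words score about 0.667, while pages with no words in common score 0.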
The categories used will come from manually defined taxonomies, and from logs of user queries. The patent doesn’t give any sense of whether those taxonomies are of the type that might be created from social search services, like a Flickr or Del.icio.us, or from older forms such as a Yahoo! Directory or Open Directory Project or a Cadê? (which was merged into Yahoo! Brazil in 2002, but was hand-edited). Something to think about, though.
It’s not easy to tell whether Yahoo! is presently using the process described in this patent without spending some time looking at it much more closely. But a fair amount of effort and time seems to have gone into it.
On the same day that the statement I mentioned above from Yahoo!’s CFO was reported, two Yahoo! Vice Presidents responded on the Yahoo! search blog with a post titled Are you kidding?!, rebutting the claims that Yahoo! had given up on the possibility of being first in search. They cited a good number of reasons to show that Yahoo! is committed to doing well in search. This patent, granted on the same day, came into the world quietly.
I don’t know if this patent is a step in the right direction for supremacy in search or not. But I started clicking through some of the documents that it referenced when published, and found some of those documents pretty interesting, so I’ve included links to them in this post:
U.S. Patents Cited:
- 5,745,893 – April, 1998 – Process and system for arrangement of documents
- 5,799,292 – August, 1998 – Adaptive hypermedia presentation method and system
- 5,835,905 – November, 1998 – System for predicting documents relevant to focus documents by spreading activation through network representations of a linked collection of documents
- 5,895,470 – April, 1999 – System for categorizing documents in a linked collection of documents
- 5,920,859 – July, 1999 – Hypertext document retrieval system and method
- 5,931,907 – August, 1999 – Software agent for comparing locally accessible keywords with meta-information and having pointers associated with distributed information
- 5,960,422 – September, 1999 – System and method for optimized source selection in an information retrieval system
- 6,052,676 – April, 2000 – Adaptive hypermedia presentation method and system
- 6,112,203 – August, 2000 – Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis
- 6,115,718 – September, 2000 – Method and apparatus for predicting document access in a collection of linked documents featuring link probabilities and spreading activation
- 6,128,606 – October, 2000 – Module for constructing trainable modular network in which each module inputs and outputs data structured as a graph
- 6,182,066 – January, 2001 – Category processing of query topics and electronic document content topics
- 6,182,091 – January, 2001 – Method and apparatus for finding related documents in a collection of linked documents using a bibliographic coupling link analysis
- 6,282,549 – August, 2001 – A method and apparatus for searching for multimedia files in a distributed database and for displaying results of the search based on the context and content of the multimedia files
- 6,321,220 – November, 2001 – Method and apparatus for preventing topic drift in queries in hyperlinked environments
- 6,338,060 – January, 2002 – Data processing apparatus and method for outputting data on the basis of similarity
- 6,360,215 – March, 2002 – Method and apparatus for retrieving documents based on information other than document content
- 6,389,436 – May, 2002 – Enhanced hypertext categorization using hyperlinks
- 6,397,212 – May, 2002 – Self-learning and self-personalizing knowledge search engine that delivers holistic results
- 6,405,188 – June, 2002 – Information retrieval system
- 6,418,431 – July, 2002 – Information retrieval and speech recognition based on language models
- 6,418,432 – July, 2002 – System and method for finding information in a distributed information system using query learning and meta search
- 6,446,099 – September, 2002 – Document matching using structural information
Other References Cited:
- Hermann Kaindl, Stefan Kramer, Luis Miguel Afonso, Combining Structure Search and Content Search for the World-Wide Web
- Ron Weiss, Bienvenido Vélez, Mark A. Sheldon, HyPursuit: A Hierarchical Network Search Engine That Exploits Content-Link Hypertext Clustering