New Yahoo! Patent on Search and Similarity

Maybe there’s a little irony to the date that United States Patent 6,990,628, Method and apparatus for measuring similarity among electronic documents, was granted – January 24, 2006.

After all, that’s the day when many were saying that Yahoo! was giving up on matching or beating Google in the field of search, based upon some comments from the company’s Chief Financial Officer. Does this new patent assigned to Yahoo! hold hope for them to keep up, and maybe even surpass the present king of the search mountain, Google?

We may only find that out in the future, but it is an interesting document, and it covers a lot of ground. It’s worth poking through, and getting a sense of what it covers. A little more about the patent itself, first. The named inventors are Michael E. Palmer, Gordon Sun, and Hongyuan Zha. While it was granted on January 24, 2006, it was originally filed on June 14, 1999.

That filing date may be a little misleading. From the patents and other documents referenced in the patent (which I’ve provided links to at the end of this post), it appears that the document evolved over time between when it was originally filed and when it was granted this week.

Here’s the abstract, which gives a taste of what this is about:

A method and apparatus are provided for determining when electronic documents stored in a large collection of documents are similar to one another. A plurality of similarity information is derived from the documents.

The similarity information may be based on a variety of factors, including hyperlinks in the documents, text similarity, user click-through information, similarity in the titles of the documents or their location identifiers, and patterns of user viewing.

The similarity information is fed to a combination function that synthesizes the various measures of similarity information into combined similarity information. Using the combined similarity information, an objective function is iteratively maximized in order to yield a generalized similarity value that expresses the similarity of particular pairs of documents.

In an embodiment, the generalized similarity value is used to determine the proper category, among a taxonomy of categories in an index, cache or search system, into which certain documents belong.

Here are some of the concepts expressed in the patent:

It tries to categorize pages based upon their similarity to hand-classified training pages from different categories. To do so, it uses a document similarity matrix that combines two or more different measures of document similarity. This type of categorization should improve the relevance of results returned from a search.
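To make that idea a little more concrete, here’s a minimal sketch in Python. It is not the patent’s method: the patent describes iteratively maximizing an objective function, while this toy version just takes a weighted average of two made-up similarity matrices and gives each unlabeled page the category of the hand-classified pages it is most similar to. The function names (`combine`, `categorize`), the weights, the scores, and the category labels are all my own assumptions for illustration.

```python
def combine(matrices, weights):
    """Weighted average of same-sized similarity matrices (lists of lists)."""
    total = sum(weights)
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
         for j in range(n)]
        for i in range(n)
    ]


def categorize(sim, labels):
    """Assign each unlabeled document to the category whose hand-labeled
    documents it is most similar to (summed similarity per category).

    sim    -- combined similarity matrix
    labels -- dict mapping document index -> category for training pages
    """
    predictions = {}
    for i in range(len(sim)):
        if i in labels:
            continue  # already hand-classified
        scores = {}
        for j, category in labels.items():
            scores[category] = scores.get(category, 0.0) + sim[i][j]
        predictions[i] = max(scores, key=scores.get)
    return predictions


# Hypothetical scores for four pages; pages 0 and 1 are hand-classified.
link_sim = [
    [1.0, 0.1, 0.8, 0.2],
    [0.1, 1.0, 0.2, 0.6],
    [0.8, 0.2, 1.0, 0.1],
    [0.2, 0.6, 0.1, 1.0],
]
text_sim = [
    [1.0, 0.2, 0.6, 0.3],
    [0.2, 1.0, 0.1, 0.9],
    [0.6, 0.1, 1.0, 0.2],
    [0.3, 0.9, 0.2, 1.0],
]

combined = combine([link_sim, text_sim], weights=[0.5, 0.5])
labels = {0: "sports", 1: "finance"}
predictions = categorize(combined, labels)
print(predictions)
```

With these invented numbers, page 2 lands with the page it shares links and text with (page 0), and page 3 lands with page 1. The equal weights are arbitrary; a real system would presumably tune how much each measure counts.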

These measures of similarity include looking at similarities in:

– Hyperlinks

– Text within the documents

– Text word vectors

– Patterns in how people click through the pages:

  • frequency of clicks,
  • click context,
  • duration of viewing,
  • proximity in time to other clicks, or
  • proximity in context to other clicks.

– Patterns in how people view the pages:

  • frequency of viewing,
  • viewing context,
  • duration of viewing,
  • proximity in time to other documents viewed by the same user, or
  • similarity of patterns of viewing by all users.

– URL sub-components

– Multimedia components (audio, video, images, etc.)

– Titles of pages

– Cache log information from a web cache system

– Combinations of the above factors
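As a rough illustration of what two of these measures might look like in practice, here’s a small Python sketch. The patent summary above doesn’t spell out formulas, so both choices are my own assumptions: cosine similarity over simple word-count vectors for text, and Jaccard overlap of host and path pieces for URL sub-components. The function names and example URLs are hypothetical.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse


def text_similarity(a, b):
    """Cosine similarity between word-count vectors of two texts."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def url_components(url):
    """Split a URL into its host labels and path segments."""
    parsed = urlparse(url)
    parts = parsed.netloc.lower().split(".") + parsed.path.lower().split("/")
    return {p for p in parts if p}


def url_similarity(u1, u2):
    """Jaccard overlap of URL sub-components (assumed measure)."""
    a, b = url_components(u1), url_components(u2)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(text_similarity("Yahoo search patent", "new search patent from Yahoo"))
print(url_similarity("http://example.com/sports/nba",
                     "http://example.com/sports/nfl"))
```

Each function returns a number between 0 and 1 for a pair of pages, which is the kind of value that could fill one cell of a similarity matrix before the different measures are combined.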

The categories used will come from manually defined taxonomies, and from logs of user queries. The patent application doesn’t give any sense of whether those taxonomies are of the type that might be created from social search services, like a Flickr or Del.icio.us, or from older forms such as a Yahoo! Directory or Open Directory Project or a Cadê? (which was merged into Yahoo! Brazil in 2002, but was hand-edited). Something to think about though.

It’s not easy to tell whether or not Yahoo! is presently using the process described in this patent without spending some time looking at it much more closely. But a fair amount of effort and time seems to have been put into it.

On the same day that the statement I mentioned above from Yahoo!’s CFO was reported, two Yahoo! Vice Presidents responded in the Yahoo! search blog with a rebuttal to the claims that Yahoo! had given up on the possibility of being first in search, in a post titled “Are you kidding?!” They cited a good number of reasons that show that Yahoo! is committed to doing well in search, but this patent, granted on the same day, came into the world quietly.

I don’t know if this patent is a step in the right direction for supremacy in search or not. But I started clicking through some of the documents that it referenced when published, and found some of those documents pretty interesting, so I’ve included links to them in this post:

U.S. Patents Cited:

Other References Cited:
