Move over pagerank: Google’s looking at phrases?

Google isn’t the biggest search engine that Anna Lynn Patterson has worked on. That distinction probably falls to the Internet Archive, which she worked on before joining Google and which likely has a few billion more pages in its database than Google does (the archive holds 55 billion web pages right now).

In addition to that feat, Anna wrote a pretty good article on search engines for ACM Queue, titled “Why Writing Your Own Search Engine Is Hard.”

The latest search engine description from Anna Patterson, published yesterday, involves a search engine that is immune to Google Bombing. It could be said to reward authors for well-written HTML and good punctuation. Despite that immunity, it can still find relevant pages that don’t include the query terms. She also describes a way to personalize results with the search engine, and to detect and eliminate duplicates.

The search engine that she has conceived can also be set to serve searchers a mix of relevant pages from different topics. For example, a search for “Blues” could easily be set to show, on the first page of results, pages that lead to:

  • Information about the hockey team in Saint Louis
  • Articles on the medical condition
  • Essays on the music type
  • The color blue
  • News on the cricket team in Australia
  • Other relevant topics
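To make the idea of a mixed first page concrete, here is a minimal sketch. It assumes the results have already been grouped into topic clusters, each ranked internally, and it uses a simple round-robin to interleave them. The function name `mix_topics`, the page identifiers, and the round-robin rule itself are illustrative assumptions, not something taken from the patent application.

```python
from itertools import zip_longest

def mix_topics(clusters, page_size=10):
    """Interleave ranked pages from each topic cluster, round-robin.

    clusters: dict mapping a topic label to its ranked list of pages.
    Round-robin is an assumed strategy chosen for illustration only.
    """
    mixed = []
    # Take the best page from every topic, then the second best, and so on.
    for tier in zip_longest(*clusters.values()):
        mixed.extend(page for page in tier if page is not None)
    return mixed[:page_size]

# Hypothetical clusters for the query "Blues".
blues_clusters = {
    "hockey": ["h1", "h2"],
    "music": ["m1", "m2", "m3"],
    "color": ["c1"],
}
first_page = mix_topics(blues_clusters, page_size=5)
```

With this rule, every topic gets a slot on the first page before any topic gets a second one.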

It can also create document descriptions for a page – both general ones, and personalized descriptions.

This search engine and its related algorithms are described in a number of patent applications from Anna Patterson (some published, and some still unpublished). This one was assigned to Google on December 6, 2004:

Phrase-based searching in an information retrieval system

Inventor: Anna Lynn Patterson
US Patent Application 20060031195
Published February 9, 2006
Filed July 26, 2004

Abstract:

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results, and from the index.
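One way to picture a phrase that “predicts the presence of” another is to compare how often two phrases actually appear together with how often they would appear together by chance. The sketch below does that with a simple ratio. The ratio, the threshold of 1.5, the function name `related_phrases`, and the sample data are all assumptions made for illustration; the abstract doesn’t spell out the measure.

```python
from collections import Counter
from itertools import permutations

def related_phrases(docs, threshold=1.5):
    """Map each phrase to the phrases it co-occurs with more than chance predicts.

    docs: list of sets, each set holding the phrases found in one document.
    threshold: assumed cutoff for the actual/expected co-occurrence ratio.
    """
    n = len(docs)
    doc_freq = Counter()   # number of documents containing each phrase
    pair_freq = Counter()  # number of documents containing both phrases
    for phrases in docs:
        doc_freq.update(phrases)
        pair_freq.update(permutations(sorted(phrases), 2))

    related = {}
    for (a, b), together in pair_freq.items():
        # Co-occurrences expected if the two phrases were independent.
        expected = doc_freq[a] * doc_freq[b] / n
        if together / expected >= threshold:
            related.setdefault(a, set()).add(b)
    return related

# Hypothetical corpus: each document reduced to its set of phrases.
sample_docs = [
    {"blues music", "delta blues"},
    {"blues music", "delta blues"},
    {"hockey team", "saint louis"},
    {"hockey team", "saint louis"},
    {"blues music", "hockey team"},
]
related = related_phrases(sample_docs)
```

In this toy corpus “blues music” and “delta blues” end up related, while “blues music” and “hockey team” do not, even though they share one document.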

Some of the ideas behind the patent application are similar to those explored in a patent application from Yahoo! published last April, titled Systems and methods for search processing using superunits, though that one focused more on concepts within the queries people search with than on phrases used within documents on the web.

According to the patent application, the need that it addresses is for an information retrieval system and methodology that can:

(a) comprehensively identify phrases in a large scale corpus,
(b) index documents according to phrases,
(c) search and rank documents in accordance with their phrases, and
(d) provide additional clustering and descriptive information about the documents.
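Steps (b) and (c) can be sketched roughly as follows, assuming the set of valid phrases from step (a) is already known. The function names, the tokenizer, and the scoring rule (count how many query phrases a document contains) are assumptions for illustration and are far simpler than anything the patent application describes. Step (d) is not covered here.

```python
import re
from collections import Counter, defaultdict

def find_phrases(text, known_phrases, max_len=3):
    """Return the known phrases (up to max_len words long) that occur in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    found = set()
    for i in range(len(words)):
        for n in range(1, min(max_len, len(words) - i) + 1):
            candidate = " ".join(words[i:i + n])
            if candidate in known_phrases:
                found.add(candidate)
    return found

def build_index(docs, known_phrases):
    """(b) Index documents by phrase: phrase -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in find_phrases(text, known_phrases):
            index[phrase].add(doc_id)
    return index

def search(query, index, known_phrases):
    """(c) Rank documents by how many of the query's phrases they contain."""
    scores = Counter()
    for phrase in find_phrases(query, known_phrases):
        for doc_id in index.get(phrase, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return [d for d, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]

# Hypothetical phrase list and documents.
known = {"search engine", "internet archive", "patent application"}
pages = {
    "a": "She built a search engine at the Internet Archive.",
    "b": "The patent application describes a search engine.",
    "c": "Nothing relevant here.",
}
phrase_index = build_index(pages, known)
results = search("internet archive search engine", phrase_index, known)
```

The point of the sketch is the shape of the index: it is keyed by whole phrases rather than by individual words, so a page only matches when the phrase itself appears.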

Is this something that Google will use, or has already started to use? It’s difficult to tell. A search for “miserable failure” still returns results that don’t have those words on their pages, so it’s possibly not in use right now. Could Google turn to it in the future? It’s possible.

It is worth looking at, and thinking about. After all, the inventor of the system and method described has built at least one search engine bigger than Google.

10 thoughts on “Move over pagerank: Google’s looking at phrases?”

  1. MSN should look into that too. Many times, you see pages ranking for keywords that are nowhere on their pages. Again, it shows that links alone cannot prove the relevance of a page.

  2. Is Google trying to make it harder for companies to dominate search, or making it harder for the smaller businesses with all their expectations?

  3. Hi Jared,

    Google is trying to improve search results for people performing searches on Google, and this phrase-based indexing approach is an intelligently crafted method that could help more relevant and meaningful pages show up higher in those results.

    They aren’t putting this out there as an “expectation” that site owners have to meet, but if you write a highly informational page about a particular topic, chances are that there are certain topics and terms you would include on that page that show you’re providing quality content.

    Instead of Google showing searchers pages where certain words may just happen to appear somewhere on a page, the phrase-based indexing approach tries to make it more likely that results being returned to searchers are about what they are looking for.
