Will Phrase-Based Searching Become an Important Ranking Approach at Google?
Google isn’t the biggest search engine that Anna Lynn Patterson has worked on. That distinction probably falls to the Internet Archive, which she worked on before joining Google, and which likely has a few billion more pages in its database than Google does (the archive holds 55 billion web pages right now).
In addition to that feat, Anna is the writer of a pretty good article on search engines, over at ACM Queue, titled Why Writing Your Own Search Engine is Hard.
The latest search engine description from Anna Patterson, published yesterday, involves a search engine that is immune to Google Bombing. It could be said to reward authors for well-written HTML and good punctuation. Despite that immunity, it can still find relevant pages that don’t contain the query terms themselves. She also describes a way to personalize results with the search engine, and to detect and eliminate duplicates.
The search engine that she has conceived of can also be set to serve a mix of relevant pages from different topics in search results to searchers. For example, a search for “Blues” could easily be set to display pages on the first page of search results that lead to:
- Information about the hockey team in Saint Louis
- Articles on the medical condition
- Essays on the music type
- The color Blue
- News on the cricket team in Australia
- Other relevant topics
It can also create document descriptions for a page – both general ones, and personalized descriptions.
This search engine and its related algorithms are described in several patent applications from Anna Patterson (some published, and some still unpublished). This one, on phrase-based searching, was assigned to Google on December 6, 2004:
Phrase-based searching in an information retrieval system
Inventor: Anna Lynn Patterson
US Patent Application 20060031195
Published February 9, 2006
Filed July 26, 2004
An information retrieval system uses phrases to index, retrieve, organize, and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results, and the index.
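To make "phrases that predict the presence of other phrases" a little more concrete, here is a minimal sketch in Python. It assumes one simple reading of that idea: compare how often two phrases actually appear together with how often they would appear together by chance. The function name, the threshold value, and the sample phrases are all made up for illustration, and the patent application's actual measure may well differ.

```python
from collections import Counter
from itertools import permutations


def phrase_predictions(docs, threshold=1.5):
    """Return {(a, b): ratio} for phrase pairs that co-occur more than chance.

    docs is a list of sets of phrases, one set per document.
    ratio = actual co-occurrences / expected co-occurrences under independence.
    The threshold is an arbitrary illustrative cutoff, not a value from the patent.
    """
    n = len(docs)
    count = Counter()       # number of documents containing each phrase
    pair_count = Counter()  # number of documents containing both phrases
    for phrases in docs:
        count.update(phrases)
        pair_count.update(permutations(sorted(phrases), 2))

    predictions = {}
    for (a, b), together in pair_count.items():
        expected = count[a] * count[b] / n
        ratio = together / expected
        if ratio >= threshold:
            predictions[(a, b)] = ratio
    return predictions


# Hypothetical toy corpus: "tickets" shows up everywhere, so it predicts nothing.
sample_docs = [
    {"st. louis blues", "hockey team", "tickets"},
    {"st. louis blues", "hockey team", "tickets"},
    {"blues music", "guitar", "tickets"},
    {"blues music", "guitar", "tickets"},
]
sample_predictions = phrase_predictions(sample_docs)
```

In this toy corpus, "st. louis blues" and "hockey team" predict each other, while a phrase that appears in every document, such as "tickets", co-occurs with everything at exactly the chance rate and so falls below the cutoff.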
Some of the ideas behind the phrase-based searching patent application are similar to those explored in a patent application from Yahoo! published last April, titled Systems and methods for search processing using superunits, though that one focused more on concepts within the queries people search with than on phrases used within documents on the web.
According to the patent application, the need that it addresses is for an information retrieval system and methodology that can:
(a) comprehensively identify phrases in a large scale corpus,
(b) index documents according to phrases,
(c) search and rank documents according to their phrases, and
(d) provide additional clustering and descriptive information about the documents.
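Points (b) and (c) can be pictured with a small sketch in Python. It assumes the phrases in each document have already been identified, and it uses a made-up scoring rule: one point for each query phrase a document contains, and half a point for each related phrase. The names `build_index` and `search` and the weights are hypothetical and are not taken from the patent application.

```python
from collections import defaultdict


def build_index(docs):
    """Map each phrase to the set of document ids that contain it.

    docs is a dict of {doc_id: set of phrases}.
    """
    index = defaultdict(set)
    for doc_id, phrases in docs.items():
        for phrase in phrases:
            index[phrase].add(doc_id)
    return index


def search(index, query_phrases, related=None):
    """Rank documents by query phrases and, at lower weight, related phrases.

    related is an optional dict of {phrase: list of related phrases}.
    The 1.0 and 0.5 weights are illustrative assumptions only.
    """
    related = related or {}
    scores = defaultdict(float)
    for phrase in query_phrases:
        for doc_id in index.get(phrase, ()):
            scores[doc_id] += 1.0
        for other in related.get(phrase, ()):
            for doc_id in index.get(other, ()):
                scores[doc_id] += 0.5
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

With a related-phrase table in place, a page that mentions only "hockey team" can still surface for the query "st. louis blues", which is one way a phrase-based engine could return relevant pages that don’t contain the query terms.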
Is this something that Google will use, or has started to use? It’s difficult to tell. A search for the Google Bomb “miserable failure” still returns results that don’t have those words on their pages, so it’s possibly not in use right now. Will Google turn to phrase-based searching in the future? It’s possible.
It is worth looking at and thinking about. After all, the inventor of the phrase-based searching system and method described has built at least one search engine bigger than Google.