The First Phrase-Based Indexing Patent Application
Google isn’t the biggest search engine that Anna Lynn Patterson has worked on. That distinction probably falls to the Internet Archive, which she worked on before joining Google. It likely has a few billion more pages in its database than Google; the archive holds 55 billion web pages right now.
Besides that feat, Anna is the author of a pretty good article on search engines, published in ACM Queue and titled Why Writing Your Own Search Engine is Hard.
There are Many Benefits From Using a Search Engine that Ranks Pages Using Phrase-Based Indexing
The latest search engine description from Anna Patterson, published yesterday, involves a search engine that is immune to Google Bombing. It could reward authors for well-written HTML and good punctuation. It can find relevant pages that don’t include the query terms, while still remaining immune to Google Bombing. She also describes a way to perform personalization with the search engine and to detect and drop duplicates. The patent application introduces the concept of phrase-based indexing to search as well.
The search engine she conceived of can serve a mix of relevant pages from different topics in search results to searchers. For example, a search for “Blues” could easily be set to display pages on the first page of search results that lead to:
- Information about the hockey team in Saint Louis
- Articles on the medical condition
- Essays on the music type
- The color Blue
- News on the cricket team in Australia
- Other relevant topics
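One simple way to picture a mixed first page like the “Blues” example is to take results from each topic cluster in turn. The sketch below is only an illustration under that assumption: the round-robin approach, the function name interleave_clusters, and the sample cluster names and titles are all made up here, and nothing above says this is how the patent application mixes topics.

```python
from itertools import zip_longest

def interleave_clusters(clusters, page_size=10):
    """Round-robin across topic clusters until the results page is full.

    `clusters` maps a (hypothetical) topic label to its ranked results.
    """
    mixed = []
    # zip_longest yields the 1st result of every cluster, then the 2nd, etc.,
    # padding shorter clusters with None.
    for tier in zip_longest(*clusters.values()):
        for result in tier:
            if result is not None:
                mixed.append(result)
    return mixed[:page_size]

# Made-up clusters for a search on "Blues".
clusters = {
    "hockey": ["St. Louis Blues schedule", "St. Louis Blues roster"],
    "music": ["History of blues music"],
    "color": ["Shades of blue"],
}

first_page = interleave_clusters(clusters, page_size=3)
print(first_page)
```

With a page size of three, each topic gets one slot before any topic gets a second one, which is the kind of variety the “Blues” example describes.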
It can create document descriptions for a page. These can be both general descriptions and personalized ones.
The Patent was filed in 2004.
This search engine and its related algorithms are described in several patent applications from Anna Patterson (some published, and some still unpublished). This one, on phrase-based indexing, was assigned to Google on December 6, 2004:
Phrase-based searching in an information retrieval system
Inventor: Anna Lynn Patterson
US Patent Application 20060031195
Published February 9, 2006
Filed July 26, 2004
An information retrieval system uses phrases to index, retrieve, organize, and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and drop duplicate documents from the search results and the index.
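To make the idea of one phrase “predicting” another a little more concrete, here is a minimal sketch. It assumes a simple measure: how often two phrases appear together in documents, divided by how often they would appear together if they were unrelated. That measure, the function name prediction_scores, and the sample documents are assumptions for illustration, not details taken from the abstract.

```python
from collections import Counter
from itertools import combinations

def prediction_scores(docs):
    """Score phrase pairs by observed vs. expected co-occurrence.

    `docs` is a list of sets, each holding the phrases found in one document.
    A score above 1 means the pair shows up together more often than chance
    would suggest; below 1 means less often.
    """
    n = len(docs)
    single = Counter()
    pair = Counter()
    for phrases in docs:
        single.update(phrases)
        # Sort so each pair is always stored under the same key order.
        pair.update(combinations(sorted(phrases), 2))

    scores = {}
    for (a, b), together in pair.items():
        observed = together / n
        expected = (single[a] / n) * (single[b] / n)
        scores[(a, b)] = observed / expected
    return scores

# Made-up documents, already reduced to the phrases they contain.
docs = [
    {"blues music", "delta blues"},
    {"blues music", "delta blues"},
    {"blues music", "hockey team"},
    {"hockey team", "stanley cup"},
]

scores = prediction_scores(docs)
for pair_key, score in sorted(scores.items()):
    print(pair_key, round(score, 2))
```

In this toy data, “blues music” and “delta blues” score above 1, while “blues music” and “hockey team” score below 1, so only the first pair would count as related under this assumed measure.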
Information Retrieval Using Phrase-Based Indexing
Some of the ideas behind the phrase-based indexing patent application are like those explored in a patent application from Yahoo! published last April, titled Systems and methods for search processing using superunits. That one focused more on concepts within the queries people search with than on phrases used within documents on the web.
According to the patent application, the need that it addresses is for an information retrieval system and method that can:
(a) Identify phrases in a large scale corpus,
(b) Index documents according to phrases,
(c) Search and rank documents according to their phrases, and
(d) Provide more clustering and descriptive information about the documents.
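Points (b) and (c) can be pictured with a small phrase-keyed index. This is a rough sketch under heavy assumptions: the phrase list KNOWN_PHRASES is hard-coded here rather than discovered from a corpus, matching is plain substring matching, and ranking simply counts how many query phrases a document contains. All of the names and sample documents are hypothetical, and none of this is presented as the patent application's actual method.

```python
from collections import defaultdict

# Hypothetical phrase list; a real system would identify phrases from the corpus.
KNOWN_PHRASES = ["search engine", "phrase-based indexing", "duplicate documents"]

def find_phrases(text):
    """Return the known phrases that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [phrase for phrase in KNOWN_PHRASES if phrase in lowered]

def build_index(docs):
    """Map each phrase to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in find_phrases(text):
            index[phrase].add(doc_id)
    return index

def search(index, query):
    """Rank documents by how many of the query's phrases they contain."""
    hits = defaultdict(int)
    for phrase in find_phrases(query):
        for doc_id in index.get(phrase, ()):
            hits[doc_id] += 1
    # Most matching phrases first; ties broken by document id.
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))

# Made-up documents.
docs = {
    "a": "A search engine built on phrase-based indexing.",
    "b": "How a search engine handles crawling.",
    "c": "Removing duplicate documents from an index.",
}

index = build_index(docs)
print(search(index, "phrase-based indexing in a search engine"))
```

The point of keying the index on phrases rather than single words is that a query is matched against documents that are about the phrase, not just documents where the individual words happen to appear.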
Is this something that Google will use or has started to use? It isn’t easy to tell. A search for the googlebomb “miserable failure” still returns results that don’t have those words on their pages, so it’s possibly not in use right now. Is phrase-based indexing something that Google might turn to in the future? There’s a possibility that they could.
It is worth looking at and thinking about. After all, the inventor of the phrase-based indexing system and method described has built at least one search engine bigger than Google.
10 thoughts on “Move over PageRank: Google is Using Phrase-Based Indexing?”
MSN should look into that too. Many times, you see pages ranking for keywords that are nowhere on their pages. Again, it shows that links alone cannot prove the relevance of a page.
I’d imagine that the folks at Microsoft likely have some folks looking at ideas like this, like their group working on Text Mining Search and Navigation Research.
Some of the papers listed from that group, and from some of the other research groups are interesting.
Yes, they look smart! lol. Let’s hope they’ll surprise us very soon.
Is Google trying to make it harder for companies to dominate search, or making it harder for the smaller businesses with all their expectations?
Google is trying to improve search results for people performing searches on Google, and this phrase-based indexing approach is an intelligently crafted method that could help more relevant and meaningful pages show up higher in those results.
They aren’t putting this out there as an “expectation” that site owners have to meet, but if you write a highly informational page about a particular topic, chances are that there are certain topics and terms that you would include upon that page that show that you’re providing quality content.
Instead of Google showing searchers pages where certain words may just happen to appear somewhere on a page, the phrase-based indexing approach tries to make it more likely that results being returned to searchers are about what they are looking for.