The builder of the largest search engine in the world during the first decade of the 21st century joined Google shortly after building that search engine, and possibly licensed the technology behind it to Google. She worked for Google for a number of years, creating a way of indexing pages based upon the meaningful phrases that appear upon those pages, looking at how phrases co-occur on pages to cluster and rerank those pages, using the phrases to identify spam pages and pages with duplicate content, and creating taxonomies and snippets for pages using phrases. This phrase-based indexing system provided a way to defeat Googlebombing, and to determine how much anchor text relevance should be passed along with links.
Then Anna Patterson left Google to start the search engine Cuil, which was supposed to be a Google killer. Except it wasn’t. Now she’s back at Google, and looks to be working on phrases again.
Her phrase-based indexing system could be said to span three generations, each described in its own generation of patents.
The first generation of this patent family was filed on July 26, 2004, or within the couple of years that followed.
A second generation of phrase-based indexing patents appears to have been filed on March 30, 2007, and describes how phrase-based indexing could be implemented in a large-scale data system. A few of these second-generation patents appear to be still pending and haven’t been made public yet.
A third generation of phrase-based indexing patents is starting to make it onto the scene with the refiling and recent granting of a continuation version of one of the original first-generation patents.
Single Word Indexing
In addition to ranking documents based upon the quality and quantity of links pointing to a page, Google also looks at whether or not the query terms searched for also appear upon specific pages. Google’s Matt Cutts wrote one of the best descriptions of how Google may do this in the first Google Librarian Newsletter, which appears to have disappeared from the Web not too long ago. I found a copy on the University of Michigan website, and it’s a highly recommended document which I’ll build upon with the rest of this post.
That first newsletter asked and answered the question, How does Google collect and rank results? As you read it, pay special attention to where it talks about “posting lists.” If you start reading through the second generation of phrase based indexing patents, you’ll see references to how phrases may be included in posting lists as well.
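To make the idea of posting lists a little more concrete, here is a minimal sketch in Python. It is not how Google builds its index; the toy documents, the `KNOWN_PHRASES` set, and the function names are all assumptions made up for illustration. It only shows that a phrase can get its own posting list alongside single words.

```python
from collections import defaultdict

# Hypothetical toy corpus; the document IDs and text are made up.
DOCS = {
    1: "ice cream is a frozen dessert",
    2: "cream cheese and ice water",
    3: "the best ice cream shop",
}

# Assumed phrase list; a real system would discover phrases from the corpus.
KNOWN_PHRASES = {"ice cream"}


def build_posting_lists(docs, phrases):
    """Map each word and each known phrase to a sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for word in words:
            postings[word].add(doc_id)
        # Slide a window over the words to find each multi-word phrase.
        for phrase in phrases:
            parts = phrase.split()
            n = len(parts)
            for i in range(len(words) - n + 1):
                if words[i:i + n] == parts:
                    postings[phrase].add(doc_id)
                    break
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(postings, terms):
    """Return the documents that appear in the posting list of every term."""
    lists = [set(postings.get(term, [])) for term in terms]
    return sorted(set.intersection(*lists)) if lists else []


POSTINGS = build_posting_lists(DOCS, KNOWN_PHRASES)
print(intersect(POSTINGS, ["ice", "cream"]))  # documents containing both words
print(intersect(POSTINGS, ["ice cream"]))     # documents containing the phrase
```

In this toy example, intersecting the single-word lists also returns the page about cream cheese and ice water, while the phrase posting list returns only the pages that actually contain “ice cream.”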
Phrase Based Indexing
A number of the first generation phrase based indexing patents were filed on July 26, 2004, and the descriptions of most of those patents are substantially the same, though the claims differ.
I’ve written a number of posts about phrase based indexing, and the one that provides the most detailed look at this approach was one I published on December 29, 2006 – Phrase Based Information Retrieval and Spam Detection. (I highly recommend that you stop and read that post before moving on with this one.)
SEO by the Sea posts on the first generation of phrase-based indexing patents:
- February 10, 2006 – Move over pagerank: Google’s looking at phrases?
- May 19, 2006 – Google Aiming at 100 Billion Pages?
- September 16, 2009 – Google Phrase Based Indexing Patent Granted
SEO by the Sea posts on the second generation of phrase-based indexing patents:
- March 15, 2009 – What are the Top Phrases for Your Website?
- April 7, 2010 – Phrasification and Revisiting Google’s Phrase Based Indexing
Assumptions and Approaches behind Phrase Based Indexing
1) It’s possible to distinguish between a good phrase and one that isn’t so helpful. A good phrase has meaning in itself, like “ice cream” meaning something different than just “ice” and “cream.” A good phrase is a complete phrase, like “President of the United States,” as opposed to “President of the.” A phrase can be one word long. A phrase can have more than one meaning, such as “German Shepherd,” which can mean a sheep herder in Germany, or a specific breed of dog.
2) Certain phrases tend to co-occur with other phrases. So, for instance, if you did a search for “President of the United States,” and looked at the top 10, or top 100, or top 1,000 pages in that search, you would probably see a number of related terms that appear regularly on those pages, such as “White House,” “vice president,” “Oval Office,” “Washington, DC,” and so on. It might be possible to rerank those search results to boost ones that tend to have more of these commonly occurring related phrases. Pages that statistically have more of these phrases than they should might be considered spam.
3) Where there is a phrase that has more than one meaning, there might be “clusters” of related phrases of different types. So, when the phrase is “German Shepherd,” and one set of related phrases that appears in the top (10, 100, 1,000) search results involves terms like “kennel,” “dog collar,” “dog house,” “obedience training,” etc., that might indicate one meaning of the phrase. When a second group of documents that rank for the phrase “German Shepherd” includes terms like “sheep herding,” “Germany,” “large flock,” and “grazing space,” those phrases may indicate a second meaning, describing a person from Germany who herds sheep.
4) Anchor text in links pointing to a page that includes the phrase or a related phrase (one that tends to co-occur on pages that rank for that phrase) should be given more weight than anchor text that doesn’t. So if a page containing the biography of the President of the United States is the target of a Googlebomb using anchor text like “miserable failure,” that anchor text won’t help the page rank for the term “miserable failure” unless the page is somehow relevant for the term. A few years back, Google announced that they had defeated a specific Googlebomb for the biography page of George W. Bush using the phrase “miserable failure,” and it stopped ranking for the term, at least until someone at the White House inadvertently caused the Googlebomb to return by adding the word “failure” to the page during an update.
5) Google could also purposefully get a page to stop ranking for a specific phrase by removing the connection between the page and the phrase in its index, which might be a way to penalize a page for spam-type practices.
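Assumptions 2 and 4 above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the `RELATED` table is hand-written rather than learned from co-occurrence data, the 10% boost per related phrase and the 1.0 / 0.1 anchor weights are invented numbers, and none of the function names come from the patents.

```python
# Hypothetical related-phrase table; a real system would learn this from
# co-occurrence statistics over a very large set of pages.
RELATED = {
    "president of the united states": {
        "white house", "vice president", "oval office", "washington, dc",
    },
}


def related_phrase_count(query_phrase, page_text):
    """Count how many phrases related to the query phrase appear on the page."""
    text = page_text.lower()
    return sum(1 for phrase in RELATED.get(query_phrase, set()) if phrase in text)


def rerank(query_phrase, results):
    """Reorder (page_id, base_score, text) tuples, boosting related-phrase-rich pages."""
    def boosted(result):
        _, base_score, text = result
        # Assumed boost: 10% per related phrase found on the page.
        return base_score * (1 + 0.1 * related_phrase_count(query_phrase, text))
    return sorted(results, key=boosted, reverse=True)


def anchor_weight(anchor_text, page_phrases):
    """Give full weight only to anchors that match a page phrase or a related phrase."""
    anchor = anchor_text.lower()
    candidates = set(page_phrases)
    for phrase in page_phrases:
        candidates |= RELATED.get(phrase, set())
    # Assumed weights: 1.0 for a supported anchor, 0.1 for an unsupported one.
    return 1.0 if any(phrase in anchor for phrase in candidates) else 0.1


QUERY = "president of the united states"
RESULTS = [
    ("a", 1.0, "A page about the president."),
    ("b", 0.9, "The White House and the Oval Office, with the vice president."),
]
print([page_id for page_id, _, _ in rerank(QUERY, RESULTS)])
print(anchor_weight("miserable failure", {QUERY}))
print(anchor_weight("White House biography", {QUERY}))
```

Here page “b” starts with a lower base score but contains three related phrases, so it moves ahead of page “a” after reranking. The anchor text “miserable failure” matches neither the page’s phrase nor any related phrase, so it receives the lower weight.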
First Generation Phrase-Based Indexing Patent Filings
- Phrase-based indexing in an information retrieval system (US Patent No. 7,536,408)
- Phrase-based detection of duplicate documents in an information retrieval system (US Patent No. 7,711,679)
- Information retrieval system for archiving multiple document versions (US Patent No. 7,702,618)
- Detecting spam documents in a phrase based information retrieval system (US Patent No. 7,603,345)
- Phrase-based searching in an information retrieval system (US Patent No. 7,599,914)
- Phrase-based generation of document descriptions (US Patent No. 7,584,175)
- Phrase-based personalization of searches in an information retrieval system (US Patent No. 7,580,929)
- Phrase identification in an information retrieval system (US Patent No. 7,580,921)
- Multiple index based information retrieval system (US Patent No. 7,567,959)
- Automatic taxonomy generation in search results using phrases (US Patent No. 7,426,507)
Second Generation Phrase-Based Indexing Patent Filings
A number of these patent filings haven’t been published yet by the USPTO, and may not be until they are granted. While a couple of the published patents include Anna Patterson as an inventor, a number of them don’t. The first listed is a pending patent application, though many of the phrase-based indexing patents weren’t published until they were granted.
- Integrating External Related Phrase Information into a Phrase-based Indexing Information Retrieval System (US Patent Application 20090070312)
- Index server architecture using tiered and sharded phrase posting lists (US Patent 7,693,813)
- Index updating using segment swapping (US Patent 7,702,614)
- Query scheduling using hierarchical tiers of index servers (US Patent 7,925,655)
- Phrase Extraction Using Subphrase Scoring, filed Mar. 30, 2007 (unpublished)
- Bifurcated Document Relevance Scoring, filed Mar. 30, 2007 (unpublished)
- Index Server Architectures in Tiered and Sharded Phrase Posting Lists, filed Mar. 30, 2007 (unpublished)
- Query Phrasification, Ser. No. 11/694,845, filed Mar. 30, 2007 (unpublished)
Third Generation Phrase Based Indexing
There appears to be only one of these at this point, though it’s possible that Google might file more continuation patent filings which add additional claims to existing patents in this series, or divisional patent filings which might separate out some claims from a specific patent and focus upon expanding those claims.
Detecting spam documents in a phrase based information retrieval system (US Patent No. 8,078,629)
Invented by Anna Lynn Patterson
Granted December 13, 2011
Filed: October 13, 2009
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.
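The last sentence of that abstract suggests a simple statistical test. The sketch below is one way to picture it, not the method claimed in the patent: it assumes we already have a count of related phrases for each page, and it flags a page when its count sits far above a baseline sample. The function name, the baseline numbers, and the cutoff of three standard deviations are all assumptions for illustration.

```python
from statistics import mean, pstdev


def is_spam_candidate(related_count, baseline_counts, k=3.0):
    """Flag a page whose related-phrase count is far above a baseline sample.

    related_count: number of related phrases found on the page being tested.
    baseline_counts: counts from a sample of pages assumed to be ordinary.
    k: assumed cutoff, in standard deviations above the baseline mean.
    """
    mu = mean(baseline_counts)
    sigma = pstdev(baseline_counts)
    return related_count > mu + k * sigma


# Hypothetical related-phrase counts for ordinary pages on a topic.
BASELINE = [2, 3, 4, 3, 2, 4, 3]

print(is_spam_candidate(4, BASELINE))   # a typical page
print(is_spam_candidate(40, BASELINE))  # a page stuffed with related phrases
```

A page with four related phrases falls inside the normal range for this baseline, while a page with forty is well past the cutoff and would be flagged for a closer look.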
There are a lot of reasons to believe that Google is using Phrase Based Indexing beyond the sheer number of patents, and it’s worth spending some time experimenting with phrases to get an idea of how Google treats them.
If you perform keyword research, optimize web pages, and do link building, you’ll find that understanding how phrase-based indexing works will be helpful.
On the plus side, even if Google isn’t doing phrase-based indexing quite like what is described in these patents, understanding things like what terms might be “related” to terms or phrases that you might want to optimize a page for and working to include those related phrases on your page can result in richer and higher quality pages.
All parts of the 10 Most Important SEO Patents series:
Part 1 – The Original PageRank Patent Application
Part 2 – The Original Historical Data Patent Filing and its Children
Part 3 – Classifying Web Blocks with Linguistic Features
Part 4 – PageRank Meets the Reasonable Surfer
Part 5 – Phrase Based Indexing
Part 6 – Named Entity Detection in Queries
Part 7 – Sets, Semantic Closeness, Segmentation, and Webtables
Part 8 – Assigning Geographic Relevance to Web Pages
Part 9 – From Ten Blue Links to Blended and Universal Search
Part 10 – Just the Beginning