Google Phrase Based Indexing Patent Granted

Before becoming a co-founder of the new search engine Cuil, Anna Lynn Patterson worked at Google on a way of looking at how often different phrases appear together on pages on the Web. That work is described in a series of patent applications which share a common description, with different claims sections that each cover different parts of that description.

I summarized the description from one of the patent filings in my post from December 29, 2006, Phrase Based Information Retrieval and Spam Detection.

One of the patent applications from that series, Automatic taxonomy generation in search results using phrases, which I hadn’t originally come across back in 2006, was granted today. It covers the idea of taking documents that share related phrases and clustering them by those related phrases, to provide search results that might cover a range of categories related to search queries.


What do I mean by clustering?

I watched a great example of clustering this weekend during a television show on Global Warming.

The show used a golfing analogy to describe the differences between attempting to predict weather patterns for a few days in the future and for a much longer period of time.

Using different theories to try to forecast the weather for three or four days in the future is a little like trying to put a golf ball into the hole from about ten feet away. Imagine each variation of a theory to predict the weather as a swing, and a golf ball approaching the hole as a prediction.

Because the distance isn’t too far, most of the golf balls hit will come very close to the hole, clustering around it. So if you have 50 different theories, you may end up with 50 golf balls close together around the hole.

Predicting the weather a few years in the future based upon different theories may be more like trying to hit a golf ball into a hole from 250 feet away. Hit a few thousand golf balls towards the hole, and they will be spread further apart. You may see a pattern emerge, with some balls clustered together, and with some clusters closer to one another and some further apart.

Similar golf swings (or similar theories) may result in some golf balls being clustered closer together. When you have clusters of golf balls farther apart, they may be the result of golf swings (or theories) that are very different. The clusters could be said to cover different categories of theories related to predicting the weather.

Clusters and Phrases

One of the main ideas behind the phrase-based indexing system is to explore documents found on the Web, see how often the same phrases co-occur within the same documents, and mark phrases that frequently co-occur as related.

So, pages that include the phrase “baseball stadium” may include other phrases such as “ball game,” “bleachers,” “home plate,” and other related phrases.
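To make the co-occurrence idea a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not the method from the patent: the function names, the sample documents, the plain substring matching, and the "appears together in at least min_docs documents" threshold are all made up for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, phrases):
    """Count, for each pair of phrases, how many documents contain both."""
    pair_counts = Counter()
    for text in documents:
        lowered = text.lower()
        # Crude substring matching (an assumption for this sketch);
        # sorting gives every pair one canonical ordering.
        present = sorted(p for p in phrases if p in lowered)
        pair_counts.update(combinations(present, 2))
    return pair_counts


def related_phrases(pair_counts, phrase, min_docs=2):
    """Phrases that co-occur with `phrase` in at least `min_docs` documents.

    The raw-count threshold is a hypothetical stand-in for whatever
    relatedness measure a real system would use.
    """
    related = []
    for (a, b), n in pair_counts.items():
        if n >= min_docs and phrase in (a, b):
            related.append(b if a == phrase else a)
    return sorted(related)


# Made-up sample documents, for illustration only.
docs = [
    "The baseball stadium bleachers were full for the ball game.",
    "Home plate at the new baseball stadium, seen from the bleachers.",
    "A ball game recap with no other phrases from the list.",
]
phrases = ["baseball stadium", "ball game", "bleachers", "home plate"]

pair_counts = cooccurrence_counts(docs, phrases)
print(related_phrases(pair_counts, "baseball stadium"))
```

With a collection this tiny, only phrases that show up together in at least two of the documents get marked as related. A real index would be looking at far more pages and, presumably, a more careful measure than a raw count.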

Once that is done, clustering may be used to decide which pages to show in search results based on the phrases included in a search query and phrases related to those query phrases.

The patent provides an example of how clustering might be used to decide which pages to show and the order in which those are presented.

Someone searches for the query “blue merle agility training,” which is made up of the phrases “blue merle” and “agility training.” The search engine returns 100 results. In addition, clusters may be created based upon related phrases that may have been identified previously by the phrase-based indexing system.

Related phrases for “blue merle” might be “Australian Shepherd,” “red merle,” “tricolor,” and “Aussie.”

Related phrases for “agility training” might be “weave poles,” “teeter,” “tunnel,” “obstacle,” and “border collie.”

In the example, the patent tells us that the system will count the number of documents containing each related phrase. So, for example, if the phrase “weave poles” shows up in 75 of the 100 documents, and “teeter” appears in 60 documents, and “red merle” appears in 50 documents, then we have three clusters (or the first three clusters).

The first cluster would be named “weave poles,” and a certain number of documents from that cluster are presented in the search results. The second cluster would be named “teeter,” with a selected number of results presented from that cluster. Finally, the third would be named “red merle,” and a selected number of documents from that cluster would be included in search results as well.

So the search results include documents taken from the different clusters, based upon how frequently the related phrases show up in the documents returned for the query. Results from the largest clusters can be presented first. The number of documents shown from each cluster can either be in proportion to the size of that cluster, or be the same for every cluster.
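The counting and selection steps in that example can be sketched in a few lines of Python. This is a rough illustration using the numbers from the example above (75, 60, and 50 documents out of 100), not an implementation of the patent. The function names, the synthetic result set, the extra "tunnel" cluster, and the rounding used for the proportional option are my own assumptions.

```python
from collections import defaultdict


def build_clusters(results, related_phrases):
    """Group result ids by the related phrases their text contains.

    Returns (phrase, [doc ids]) pairs, largest cluster first.
    A document may land in more than one cluster.
    """
    clusters = defaultdict(list)
    for doc_id, text in results:
        lowered = text.lower()
        for phrase in related_phrases:
            if phrase in lowered:
                clusters[phrase].append(doc_id)
    # Ties are broken alphabetically so the order is stable.
    return sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))


def pick_results(ranked_clusters, top_clusters=3, per_cluster=None, total=10):
    """Choose documents from the top clusters.

    If per_cluster is given, take that many from each cluster; otherwise
    split `total` in proportion to cluster size (rounded, an assumption).
    """
    chosen = ranked_clusters[:top_clusters]
    size_sum = sum(len(ids) for _, ids in chosen)
    picked = {}
    for name, ids in chosen:
        if per_cluster is not None:
            quota = per_cluster
        else:
            quota = round(total * len(ids) / size_sum)
        picked[name] = ids[:quota]
    return picked


# Synthetic stand-in for the 100 results returned for the query.
results = []
for i in range(100):
    words = ["blue merle agility training"]
    if i < 75:
        words.append("weave poles")   # 75 documents
    if i >= 40:
        words.append("teeter")        # 60 documents
    if i % 2 == 0:
        words.append("red merle")     # 50 documents
    if i < 10:
        words.append("tunnel")        # 10 documents, outside the top three
    results.append((i, " ".join(words)))

related = ["weave poles", "teeter", "tunnel", "red merle"]
ranked = build_clusters(results, related)
equal = pick_results(ranked, per_cluster=2)
proportional = pick_results(ranked, total=10)

for name, ids in ranked:
    print(name, len(ids))
```

Passing per_cluster gives the "same number from each cluster" option, while leaving it out gives the proportional option, so the two presentation choices mentioned above are just two ways of setting each cluster's quota.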

The idea behind this approach is to provide searchers with results that contain the query terms that they searched for in a taxonomy – a classification of results based on the different clusters created from the related phrases.

The phrase-based indexing system can perform other functions, such as helping to find duplicate content on the Web and filtering out spam in search results. However, those aspects of the indexing system are still under review by the patent office.

14 thoughts on “Google Phrase Based Indexing Patent Granted”

  1. This makes a lot more sense. If we look at a page dedicated to pilot training, a good page will have content related to different types of licenses, different types of aircraft training, a flying club directory, instructors, etc.

    If a search engine can understand what constitutes a good page vs. a page with just 20 instances of pilot training, it should go a long way in serving relevant pages.

    As always, thanks for pointing out this information.


  2. Hi Rajat,

    I like the ideas behind these patent filings, too. It’s an approach that does seem to have some common sense behind it, and should return relevant results during searches. If it’s been used, I’d love to see some data behind how effective it has been in satisfying searchers’ queries.

    You’re welcome, and thanks for sharing your thoughts on this topic.

  3. Thanks for the golfing analogy, I rather enjoyed it and it makes perfect sense… now.

    I imagine that this has been used to some degree so far by many of the larger search engines. After all, isn’t anchor text to a related page supposed to contain keywords? Related pages, related keywords in an attempt to return the most relevant page.

  4. Hi Robert,

    Glad that you enjoyed the golfing analogy – I was impressed with the show it was presented upon.

    The phrase based indexing system that is covered in the description of the patent filings also pays attention to phrases in anchor text, and how related they might be to the page they are pointed at. Outside of this set of patent applications, we really haven’t seen anything from the search engines such as white papers or patents that describe the engines using anchor text in quite that manner.

  5. Hi Peoplefinder,

    There is a wisdom of the crowds element to this technology. The associations that the authors of pages make by choosing to include phrases together on pages can influence how search results are sorted and ranked. It’s an approach worth thinking about.

  6. Hi Robert,

    I know what you mean. The language in search patents can be a pretty arcane mix of legalese, math, and boilerplate language. I do like trying to go through them and attempting to translate them to a plainer version of English. The difficulty with that is trying to decide how much of the original language to keep sometimes.

    Thank you for your twitter follow. 🙂

  7. William,

    Sounds interesting. I must confess that I’ve read a few of the patents that have been filed and they make little or no sense to me. The smaller the words used the easier I find it 😉

    oh… thanks for the follow on twitter.

  8. Dare I say we need an algorithm to figure out their algorithm? 😐

    But then again I guess it’s really a case of weeding out the few points that make sense to us, working on that and then hoping that what we pulled out was the good stuff. Which takes me to another point, I wonder how often they file patents that they have no intention of ever making use of just to throw a red herring in once in a while?

  9. Hi Robert,

    The idea of a red herring has crossed my mind before. 🙂

    There may be a number of reasons why one of the search engines may file for a patent, though I suspect that the majority of them are applied for to protect intellectual property created during the development of processes that a search engine is exploring, and possible offshoots of those explorations.

    Not every patent filed for by Google or Yahoo or Microsoft or Ask will be used. Sometimes though, it’s pretty exciting to see one of the search engines release a new program or application or start a new process, and recognize that we’ve seen it in a patent filing first.

    Looking at these patent applications gives us a chance to think about some of the assumptions that search engineers make, and some of the perspectives that come from search engines on the issues they face involving providing search to the public.

