Google Phrase-Based Indexing Patent Granted


A Patent Related to Phrase-Based Indexing That Clusters the Concepts Pages Rank For

Before becoming a co-founder of the new search engine Cuil, Anna Lynn Patterson worked at Google on a way of looking at how often different phrases appeared together on pages of the Web. She worked on a series of patent applications with a common description, with different claims sections that itemize different parts of phrase-based indexing.

I summarized the description from one of the patent filings in my post from December 29, 2006, Phrase Based Information Retrieval and Spam Detection.

One of the patent applications from that series, Automatic taxonomy generation in search results using phrases, which I hadn’t originally come across back in 2006, came out today. It covers the idea of taking documents that share related phrases and clustering them by those related phrases, to provide search results that might cover a range of categories related to search queries.

Clustering under Phrase-Based Indexing

What do I mean by clustering?

I watched a great example of clustering this weekend during a television show on Global Warming.

The show used a golfing analogy to describe the differences between attempting to predict weather patterns for a few days in the future and for a much longer period of time.

Using different theories to try to forecast the weather for three or four days in the future is a little like trying to put a golf ball into the hole from about ten feet away. Imagine each variation of a theory to predict the weather as a swing, and a golf ball approaching the hole as a prediction.

Because the distance isn’t too far, most of the golf balls hit will come very close to the hole. They will cluster around it. So if you have 50 different theories, you may end up with 50 golf balls close together around the hole.

Predicting the weather a few years in the future based upon different theories may be more like trying to hit a golf ball into a hole from 250 feet away. Hit a few thousand golf balls towards the hole, and they will be further apart. You may see a pattern emerge. Some balls will be closer together, and some clusters will be further from one another.

Similar golf swings (or similar theories) may result in some golf balls landing closer together. When you have clusters of golf balls farther apart, they may result from golf swings (or theories) that are very different. The clusters could cover different categories of theories related to predicting the weather.

Clusters and Phrases

One of the main ideas behind the phrase-based indexing system is to explore documents found on the Web and see how often the same phrases co-occur within the same documents. As the indexing system goes out on the Web, it may identify how frequently certain phrases tend to co-occur in the same documents and mark those phrases as related.

So, pages that include the phrase “baseball stadium” may include other phrases such as “ball game,” “bleachers,” “home plate,” and other related phrases.
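To make the co-occurrence idea concrete, here is a minimal sketch in Python. It is only an illustration, not code from the patent: the function name `related_phrases`, the `min_docs` threshold, the sample documents, and the plain substring matching are all assumptions made for the example. A real system would likely use a more careful measure of relatedness than a raw count.

```python
from collections import Counter
from itertools import combinations


def related_phrases(docs, phrases, min_docs=2):
    """Return phrase pairs that appear together in at least `min_docs` documents.

    Hypothetical helper: the threshold and the simple substring test are
    assumptions for illustration, not the method described in the patent.
    """
    pair_counts = Counter()
    for text in docs:
        lowered = text.lower()
        # Phrases found in this document, sorted so each pair has one fixed order.
        present = sorted(p for p in phrases if p in lowered)
        pair_counts.update(combinations(present, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_docs}


# Made-up sample documents.
docs = [
    "The baseball stadium sold out; fans filled the bleachers behind home plate.",
    "A ball game at the baseball stadium, with a view of home plate.",
    "Our dog learned weave poles last week.",
]
phrases = ["baseball stadium", "ball game", "bleachers", "home plate", "weave poles"]

related = related_phrases(docs, phrases)
print(related)  # {('baseball stadium', 'home plate'): 2}
```

With only three sample documents, just one pair clears the threshold. Over a large crawl, the same kind of counting could surface the longer lists of related phrases described here.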

Once the system has found which documents contain which phrases, clustering may help decide which pages to show in search results. The clusters would come from the phrases included in a search query and the phrases related to those query phrases.

Clusters On the Web


The patent provides an example of how clustering might decide which pages to show, and the order in which they display.

Suppose someone searches for “blue merle agility training,” a query made up of the phrases “blue merle” and “agility training,” and the search engine returns 100 results. Clusters can then be built using the related phrases found earlier by the phrase-based indexing system.

Related phrases for “blue merle” might be “Australian Shepherd,” “red merle,” “tricolor,” and “Aussie.”

Related phrases for “agility training” might be “weave poles,” “teeter,” “tunnel,” “obstacle,” and “border collie.”

In the example, the patent tells us that the system will count the number of documents containing each related phrase. So, for example, if the phrase “weave poles” shows up in 75 of the 100 documents, and “teeter” appears in 60 documents, and “red merle” appears in 50 documents, then we have three clusters (or the first three clusters).

The first cluster would have the name “weave poles.” A certain number of documents from that cluster would be presented in the search results. The second cluster would have the name “teeter.” A selected number of results would be presented from that cluster. Finally, the third would have the name “red merle.” A number of documents from that cluster would appear in the search results as well.
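The counting step in that example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patent's implementation: `build_clusters`, `max_clusters`, and the synthetic result set are invented for the example, and the count of 10 documents for “tunnel” is made up to show a phrase that misses the top three.

```python
def build_clusters(results, related, max_clusters=3):
    """Group result documents by the related phrases they contain.

    results: dict of doc id -> page text (a hypothetical stand-in for the 100 results)
    related: related phrases for the phrases in the query
    Returns up to `max_clusters` (phrase, doc_ids) pairs, largest cluster first.
    """
    clusters = []
    for phrase in related:
        members = [doc_id for doc_id, text in results.items()
                   if phrase.lower() in text.lower()]
        if members:
            clusters.append((phrase, members))
    # Name each cluster after its phrase and rank by how many documents contain it.
    clusters.sort(key=lambda c: len(c[1]), reverse=True)
    return clusters[:max_clusters]


# Synthetic data: document i contains a phrase when i is below that phrase's count.
doc_counts = {"weave poles": 75, "teeter": 60, "red merle": 50, "tunnel": 10}
results = {i: "; ".join(p for p, n in doc_counts.items() if i < n) for i in range(100)}

clusters = build_clusters(results, ["red merle", "tunnel", "teeter", "weave poles"])
for name, members in clusters:
    print(name, len(members))
# weave poles 75
# teeter 60
# red merle 50
```

Note that a single document can land in more than one cluster, since a page about agility training may mention both “weave poles” and “teeter.”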

How Clusters Could Show in SERPs in Phrase-Based Indexing

The search results would then include documents taken from the different clusters, which are based upon how frequently related phrases show up in those documents. Documents from the most popular cluster can come first. The number of documents shown from each cluster could be in proportion to how large or small that cluster is, or the same number of documents could be presented to a searcher from each cluster.
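The two ways of choosing how many documents to show from each cluster can be sketched as follows. This is a rough Python illustration, not the patent's method: the function `pick_from_clusters`, its `total` and `proportional` parameters, and the use of floor division for quotas are all assumptions made for the example.

```python
def pick_from_clusters(clusters, total, proportional=True):
    """Choose documents to display from each cluster.

    clusters: list of (name, doc_ids) pairs, largest cluster first
    total: rough number of result slots to fill (hypothetical parameter)
    proportional: if True, bigger clusters get more slots; otherwise equal shares
    """
    if not clusters:
        return {}
    sizes = [len(ids) for _, ids in clusters]
    if proportional:
        # Floor division keeps this simple but may leave a slot or two unfilled.
        quotas = [total * size // sum(sizes) for size in sizes]
    else:
        quotas = [total // len(clusters)] * len(clusters)
    return {name: ids[:quota] for (name, ids), quota in zip(clusters, quotas)}


# Cluster sizes taken from the example in the patent; the doc ids are made up.
clusters = [
    ("weave poles", list(range(75))),
    ("teeter", list(range(60))),
    ("red merle", list(range(50))),
]

by_size = pick_from_clusters(clusters, total=10)
equal = pick_from_clusters(clusters, total=10, proportional=False)
print({name: len(ids) for name, ids in by_size.items()})
# {'weave poles': 4, 'teeter': 3, 'red merle': 2}
print({name: len(ids) for name, ids in equal.items()})
# {'weave poles': 3, 'teeter': 3, 'red merle': 3}
```

The sketch does not remove documents that belong to more than one cluster, so a real implementation would need some way to avoid showing the same page twice.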

This approach aims to provide searchers with results that contain the query terms they searched for, organized into a taxonomy. This would be a classification of results based on the different clusters created from the related phrases.

The phrase-based indexing system can perform other functions. It could help find duplicate content on the Web and filter spam out of search results. But those aspects of the phrase-based indexing system are still under review by the patent office.


14 thoughts on “Google Phrase-Based Indexing Patent Granted”

  1. This makes a lot more sense. If we look at a page dedicated to pilot training, a good page will have content related to different types of licenses, different types of aircraft training, flying club directories, instructors, etc.

    If search engines can understand what constitutes a good page vs a page with just 20 instances of pilot training, it should go a long way in serving relevant pages.

    As always, thanks for pointing out this information.

    Thanks,
    Rajat

  2. Hi Rajat,

    I like the ideas behind these patent filings, too. It’s an approach that does seem to have some common sense behind it, and should return relevant results during searches. If it’s been used, I’d love to see some data behind how effective it has been in satisfying searchers’ queries.

    You’re welcome, and thanks for sharing your thoughts on this topic.

  3. Thanks for the golfing analogy, I rather enjoyed it and it makes perfect sense… now.

    I imagine that this has been used to some degree so far by many of the larger search engines. After all, isn’t anchor text to a related page supposed to contain keywords? Related pages, related keywords in an attempt to return the most relevant page.

  4. Hi Robert,

    Glad that you enjoyed the golfing analogy – I was impressed with the show it was presented upon.

    The phrase-based indexing system that is covered in the description of the patent filings also pays attention to phrases in anchor text, and how related they might be to the page they are pointed at. Outside of this set of patent applications, we really haven’t seen anything from the search engines such as white papers or patents that describe the engines using anchor text in quite that manner.

  5. Hi Peoplefinder,

    There is a wisdom of the crowds element to this technology. The associations that the authors of pages make by choosing to include phrases together on pages can influence how search results are sorted and ranked. It’s an approach worth thinking about.

  6. Hi Robert,

    I know what you mean. The language in search patents can be a pretty arcane mix of legalese, math, and boilerplate language. I do like trying to go through them and attempting to translate them to a plainer version of English. The difficulty with that is trying to decide how much of the original language to keep sometimes.

    Thank you for your twitter follow. 🙂

  7. William,

    Sounds interesting. I must confess that I’ve read a few of the patents that have been filed and they make little or no sense to me. The smaller the words used the easier I find it 😉

    oh… thanks for the follow on twitter.

  8. Dare I say we need an algorithm to figure out their algorithm? 😐

    But then again I guess it’s really a case of weeding out the few points that make sense to us, working on that and then hoping that what we pulled out was the good stuff. Which takes me to another point, I wonder how often they file patents that they have no intention of ever making use of just to throw a red herring in once in a while?

  9. Hi Robert,

    The idea of a red herring has crossed my mind before. 🙂

    There may be a number of reasons why one of the search engines may file for a patent, though I suspect that the majority of them are applied for to protect intellectual property created during the development of processes that a search engine is exploring, and possible outshoots of those explorations.

    Not every patent filed for by Google or Yahoo or Microsoft or Ask will be used. Sometimes though, it’s pretty exciting to see one of the search engines release a new program or application or start a new process, and recognize that we’ve seen it in a patent filing first.

    Looking at these patent applications gives us a chance to think about some of the assumptions that search engineers make, and some of the perspectives that come from search engines on the issues they face involving providing search to the public.
