Before becoming a co-founder of the new search engine Cuil, Anna Lynn Patterson worked at Google on a way of looking at how often different phrases appear together on pages on the Web. That work is described in a series of patent applications that share a common description but have different claims sections, each itemizing a different part of that description.
I summarized the description from one of the patent filings in my post from December 29, 2006, Phrase Based Information Retrieval and Spam Detection.
One of the patent applications from that series, Automatic taxonomy generation in search results using phrases, which I hadn’t originally come across back in 2006, was granted today, and covers the idea of taking documents that share related phrases, and clustering them with the related phrases to provide search results that might cover a range of categories related to search queries.
What do I mean by clustering?
I watched a great example of clustering this weekend during a television show on Global Warming.
The show used a golfing analogy to describe the differences between attempting to predict weather patterns for a few days in the future, and for a much longer period of time.
Using different theories to try to forecast the weather for three or four days in the future is a little like trying to putt a golfball into the hole from about ten feet away. Imagine each variation of a theory to predict the weather as a swing, and a golfball approaching the hole as a prediction.
Because the distance isn’t too far, most of the golfballs hit will come very close to the hole, clustering around it. If you have 50 different theories, you may end up with 50 golfballs close together around the hole.
Predicting the weather a few years in the future based upon different theories may be more like trying to hit a golfball into a hole from 250 feet away. Hit a few thousand golfballs towards the hole, and they will be spread further apart. You may see a pattern emerge, with some balls clustered together, and some clusters closer to one another and others further apart.
Similar golf swings (or similar theories) may result in some golfballs being clustered closer together. When you have clusters of golfballs farther apart, they may be the result of golf swings (or theories) that are very different. The clusters could be said to cover different categories of theories related to predicting the weather.
Clusters and Phrases
One of the main ideas behind the phrase-based indexing system is to explore documents found on the Web, and see how often the same phrases co-occur within the same documents. Phrases that tend to co-occur frequently are marked as being related.
So, pages that include the phrase “baseball stadium” may also include related phrases such as “ball game,” “bleachers,” and “home plate.”
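To make the co-occurrence idea a little more concrete, here is a minimal sketch in Python. It is not taken from the patent. The function name related_phrases, the min_docs threshold, and the use of raw document counts to decide when two phrases are "related" are all my own assumptions for illustration, and the sample documents are made up.

```python
from collections import Counter
from itertools import combinations


def related_phrases(docs, min_docs=2):
    """Map each phrase to the phrases it co-occurs with in at least min_docs documents.

    docs is a list of sets, one set of phrases per document.
    The min_docs cutoff is a hypothetical stand-in for a real relatedness measure.
    """
    pair_counts = Counter()
    for phrases in docs:
        # Count every unordered pair of phrases that appears in the same document.
        for a, b in combinations(sorted(phrases), 2):
            pair_counts[(a, b)] += 1

    related = {}
    for (a, b), n in pair_counts.items():
        if n >= min_docs:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related


# Made-up documents, each reduced to the phrases found in it.
docs = [
    {"baseball stadium", "ball game", "bleachers"},
    {"baseball stadium", "ball game", "home plate"},
    {"baseball stadium", "bleachers", "home plate"},
    {"ball game", "weave poles"},
]

related = related_phrases(docs)
print(sorted(related["baseball stadium"]))
# prints ['ball game', 'bleachers', 'home plate']
```

In this toy data, “weave poles” shows up alongside “ball game” only once, so it falls below the threshold and isn't marked as related to anything.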
Once that is done, clustering may be used to decide which pages to show in search results, and in what order, based upon the phrases included in a search query and the phrases that are related to those query phrases.
The patent provides an example of how clustering might be used to decide which pages to show, and the order in which those are presented.
Someone searches for the query “blue merle agility training,” which is made up of the phrases “blue merle” and “agility training.” The search engine returns 100 results. Clusters may be created based upon related phrases that may have been identified previously by the phrase-based indexing system.
Related phrases for “blue merle” might be “Australian Shepherd,” “red merle,” “tricolor,” and “aussie.”
Related phrases for “agility training” might be “weave poles,” “teeter,” “tunnel,” “obstacle,” and “border collie.”
In the example, the patent tells us that the system will look at a count of the number of documents containing each related phrase. If the phrase “weave poles” shows up in 75 of the 100 documents, and “teeter” appears in 60 documents, and “red merle” appears in 50 documents, then we have three clusters (or the first three clusters).
The first cluster would be named “weave poles,” and a certain number of documents from that cluster would be presented in the search results. The second cluster would be named “teeter,” with a selected number of results presented from that cluster. The third would be named “red merle,” and a number of documents from that cluster would be included in the search results.
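The counting step in that example can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the top_clusters function and the way the 100 results are built are hypothetical, and the result set is synthetic, constructed so that the counts match the 75, 60, and 50 documents from the patent's example.

```python
from collections import Counter


def top_clusters(results, related_phrases, k=3):
    """Return the k related phrases found in the most result documents.

    results maps a document id to the set of phrases found in that document.
    Each returned (phrase, count) pair names one cluster and gives its size.
    """
    counts = Counter()
    for phrases in results.values():
        # Only related phrases that actually appear in this document are counted.
        for phrase in related_phrases & phrases:
            counts[phrase] += 1
    return counts.most_common(k)


# Related phrases for "blue merle" and "agility training" from the example.
related = {
    "Australian Shepherd", "red merle", "tricolor", "aussie",
    "weave poles", "teeter", "tunnel", "obstacle", "border collie",
}

# Synthetic result set of 100 documents, built only to reproduce the example counts.
results = {doc_id: set() for doc_id in range(100)}
for doc_id in range(0, 75):
    results[doc_id].add("weave poles")
for doc_id in range(20, 80):
    results[doc_id].add("teeter")
for doc_id in range(50, 100):
    results[doc_id].add("red merle")
for doc_id in range(90, 100):
    results[doc_id].add("tunnel")

clusters = top_clusters(results, related)
print(clusters)
# prints [('weave poles', 75), ('teeter', 60), ('red merle', 50)]
```

Note that a single document can contain more than one related phrase, so in this sketch the same page may be counted toward more than one cluster.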
By using clusters, we end up with a number of documents taken from the different clusters, based upon how frequently related phrases show up in those documents. The most popular clusters can be presented first. The number of documents shown from each cluster can either be in proportion to the size of that cluster, or be the same for every cluster.
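Here is a rough Python sketch of those two ways of dividing up a page of results. The allocate_slots function, the choice of 12 result slots, and the way leftover slots are handed out (largest fractional share first) are my own assumptions, since the description above doesn't go into that level of detail.

```python
def allocate_slots(cluster_sizes, slots, proportional=True):
    """Split a fixed number of result slots among clusters.

    cluster_sizes maps a cluster name to the number of documents in it.
    With proportional=False every cluster gets the same share.
    Rounding rules here are illustrative assumptions, not from the patent.
    """
    # Largest clusters first, so the most popular cluster leads.
    names = sorted(cluster_sizes, key=cluster_sizes.get, reverse=True)
    if not names:
        return {}

    if not proportional:
        base, extra = divmod(slots, len(names))
        return {n: base + (1 if i < extra else 0) for i, n in enumerate(names)}

    total = sum(cluster_sizes.values())
    exact = {n: slots * cluster_sizes[n] / total for n in names}
    alloc = {n: int(exact[n]) for n in names}
    # Hand any slots lost to rounding down to the largest fractional shares.
    leftover = slots - sum(alloc.values())
    by_fraction = sorted(names, key=lambda n: exact[n] - alloc[n], reverse=True)
    for n in by_fraction[:leftover]:
        alloc[n] += 1
    return alloc


sizes = {"weave poles": 75, "teeter": 60, "red merle": 50}

print(allocate_slots(sizes, 12))
# prints {'weave poles': 5, 'teeter': 4, 'red merle': 3}
print(allocate_slots(sizes, 12, proportional=False))
# prints {'weave poles': 4, 'teeter': 4, 'red merle': 4}
```

Either way, the clusters stay in order of size, so the “weave poles” results would come first.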
The idea behind this approach is to provide searchers with results that contain the query terms they searched for, organized into a taxonomy: a classification of results based on the different clusters created from the related phrases.
The phrase-based indexing system can perform other functions, such as helping to find duplicate content on the Web, and filtering out spam in search results, but those aspects of the indexing system are still under review by the patent office.