Two patents granted to Google today describe some interesting approaches to using large amounts of data about pages, queries, and user interactions to rank pages in search results.
Ranking documents based on large data sets
Invented by Jeremy Bem, Georges R. Harik, Joshua L. Levenberg, Noam Shazeer, and Simon Tong
Assigned to Google
US Patent 7,231,399
Granted June 12, 2007
Filed: November 14, 2003
Abstract
A system ranks documents based, at least in part, on a ranking model. The ranking model may be generated to predict the likelihood that a document will be selected. The system may receive a search query and identify documents relating to the search query. The system may then rank the documents based, at least in part, on the ranking model and form search results for the search query from the ranked documents.
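To make the idea in that abstract a little more concrete, here is a minimal Python sketch of ranking by predicted likelihood of selection. It is not the method from the patent: the log format, the function names (build_selection_model, rank), and the smoothed selection-rate estimate are all assumptions made up for illustration, standing in for whatever model Google actually trains.

```python
from collections import defaultdict

def build_selection_model(log, prior_selected=1, prior_shown=2):
    """Estimate P(selected | query, document) from logged records.

    log is a list of (query, document, was_selected) tuples. This record
    format is a hypothetical stand-in, not something the patent specifies.
    """
    shown = defaultdict(int)
    selected = defaultdict(int)
    for query, doc, was_selected in log:
        shown[(query, doc)] += 1
        selected[(query, doc)] += int(was_selected)

    def predict(query, doc):
        # Additive smoothing so rarely shown pairs get a moderate score.
        key = (query, doc)
        return (selected[key] + prior_selected) / (shown[key] + prior_shown)

    return predict

def rank(query, docs, predict):
    """Order documents by predicted likelihood of selection, highest first."""
    return sorted(docs, key=lambda d: predict(query, d), reverse=True)

# Hypothetical log: doc_a selected 2 of 2 times, doc_b 1 of 2, doc_c 0 of 1.
log = [
    ("q1", "doc_a", True),
    ("q1", "doc_a", True),
    ("q1", "doc_b", False),
    ("q1", "doc_b", True),
    ("q1", "doc_c", False),
]
predict = build_selection_model(log)
ranked = rank("q1", ["doc_c", "doc_b", "doc_a"], predict)
```

With this toy log, doc_a scores 0.75, doc_b 0.5, and doc_c about 0.33, so the documents come back in that order.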
Method and apparatus for learning a probabilistic generative model for text
Invented by Georges Harik and Noam Shazeer
Assigned to Google
US Patent 7,231,393
Granted June 12, 2007
Filed: February 26, 2004
Abstract
One embodiment of the present invention provides a system that learns a generative model for textual documents. During operation, the system receives a current model, which contains terminal nodes representing random variables for words and cluster nodes representing clusters of conceptually related words. Within the current model, nodes are coupled together by weighted links, so that if a cluster node in the probabilistic model fires, a weighted link from the cluster node to another node causes the other node to fire with a probability proportionate to the link weight. The system also receives a set of training documents, wherein each training document contains a set of words. Next, the system applies the set of training documents to the current model to produce a new model.
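The firing behavior described in that abstract can be sketched in a few lines of Python. This is only an illustration under assumptions: the abstract says a linked node fires with a probability proportionate to the link weight, and the sketch simply treats the weight itself as the probability. The cluster, its words, and the weights are hypothetical, and the training step that produces a new model is not shown.

```python
import random

def fire_cluster(links, rng):
    """Return the nodes that fire when a cluster node fires.

    links maps a target node name to a link weight in [0, 1]. Each target
    fires independently with probability equal to its weight (an assumed
    simplification of "proportionate to the link weight").
    """
    return [node for node, weight in links.items() if rng.random() < weight]

# Hypothetical cluster of conceptually related words with made-up weights.
car_cluster = {"engine": 0.6, "dealer": 0.4, "sedan": 0.3}

rng = random.Random(0)  # seeded so repeated runs give the same sample
sample_words = fire_cluster(car_cluster, rng)
```

Calling fire_cluster repeatedly generates different bags of words from the same cluster, which is the sense in which the model is "generative".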
I’m working my way through them in more depth, but I’m in agreement with you there.
Some fun stuff.
Generative model, training documents, producing new models …
Very interesting. I think this patent confirms their use of machine learning to improve their ranking algorithms.
This one looks like it’s designed to improve Personalized Search to favor user preferences.
Nice one, Shor.
There is some pretty good stuff in these. For instance, there’s a section of the second patent that uses an alternative version of this process to look at clustering with queries.
The example uses the word “jaguar” as a query term, which could mean the animal or the car. The model being used could identify clusters relating to both of those meanings when someone conducts that search.
Knowing that there is a “car” cluster and an “animal” cluster, results from both could be returned to the searcher in a ratio that matches the probability of each meaning, allowing for diversity in the search result set.
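Here is a rough Python sketch of what returning results in a ratio matching the cluster probabilities could look like. The probabilities, result lists, and function names are hypothetical, and the largest-remainder rounding is my own choice for turning probabilities into whole result slots, not something taken from the patent.

```python
from math import floor

def allocate_slots(cluster_probs, total_slots):
    """Split total_slots across clusters in proportion to their probabilities."""
    total = sum(cluster_probs.values())
    exact = {c: total_slots * p / total for c, p in cluster_probs.items()}
    slots = {c: floor(x) for c, x in exact.items()}
    leftover = total_slots - sum(slots.values())
    # Give remaining slots to the clusters with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - slots[c], reverse=True)
    for c in by_remainder[:leftover]:
        slots[c] += 1
    return slots

def diversify(results_by_cluster, cluster_probs, total_slots):
    """Take the top results from each cluster according to its slot share."""
    slots = allocate_slots(cluster_probs, total_slots)
    page = []
    for cluster, n in slots.items():
        page.extend(results_by_cluster.get(cluster, [])[:n])
    return page

# Hypothetical probabilities for the two senses of "jaguar".
probs = {"car": 0.75, "animal": 0.25}
results = {
    "car": ["car1", "car2", "car3", "car4", "car5", "car6", "car7"],
    "animal": ["cat1", "cat2", "cat3"],
}
page = diversify(results, probs, 8)
```

With these made-up numbers, an eight-result page holds six car results and two animal results.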
Understanding the possible clusters that results might fall within may also influence which advertisements are shown or targeted, could allow similar results to be grouped together, or could be used when applying an adult filter to some searches, for some clusters of results but not others.
The patent tells us that using clusters related to the different senses of the query terms might allow for expansion of results from a “parent” cluster, or perhaps help when there has been a misspelling.
There’s a lot to these patents.
Michael,
I think that the potential to match users who exhibit similar behavior around the same or similar pages, in order to identify shared interests, is one of the possibilities described.
The first patent tells us that they might use multiple elements in the data collected, which they refer to as instances in the patent filing:
The kinds of data that might be associated with each of those triples of data cover a lot of ground:
So, how might all of that information be used?
We’ve seen the use of triples of information in another patent filing from Google which I described here:
Refining Queries Using Category Synonyms for Local and Other Searches
Those triples involve query term/result business name/result business category instead of user information/query data/document data to find business categories for businesses being searched for.
More stuff to read while pretending to work 😉