New Google Ranking Patents

Two patents granted to Google today describe some interesting approaches to ranking pages in search results using large amounts of data about pages, queries, and user interactions.

Ranking documents based on large data sets
Invented by Jeremy Bem, Georges R. Harik, Joshua L. Levenberg, Noam Shazeer, and Simon Tong
Assigned to Google
US Patent 7,231,399
Granted June 12, 2007
Filed: November 14, 2003

Abstract

A system ranks documents based, at least in part, on a ranking model. The ranking model may be generated to predict the likelihood that a document will be selected. The system may receive a search query and identify documents relating to the search query. The system may then rank the documents based, at least in part, on the ranking model and form search results for the search query from the ranked documents.

Method and apparatus for learning a probabilistic generative model for text
Invented by Georges Harik and Noam Shazeer
Assigned to Google
US Patent 7,231,393
Granted June 12, 2007
Filed: February 26, 2004

Abstract

One embodiment of the present invention provides a system that learns a generative model for textual documents. During operation, the system receives a current model, which contains terminal nodes representing random variables for words and cluster nodes representing clusters of conceptually related words. Within the current model, nodes are coupled together by weighted links, so that if a cluster node in the probabilistic model fires, a weighted link from the cluster node to another node causes the other node to fire with a probability proportionate to the link weight. The system also receives a set of training documents, wherein each training document contains a set of words. Next, the system applies the set of training documents to the current model to produce a new model.
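To make the "firing" idea in that abstract a little more concrete, here is a minimal sketch of how words might be generated from cluster nodes. The cluster names, words, and weights are invented for illustration, and treating a link weight directly as a firing probability is a simplifying assumption (the abstract only says the probability is proportionate to the weight). It is not the patent's actual procedure.

```python
import random

# Hypothetical toy model: cluster node -> {child node: link weight}.
# Assumption: each weight is used directly as a firing probability.
MODEL = {
    "CAR": {"jaguar": 0.6, "engine": 0.5, "dealer": 0.3},
    "ANIMAL": {"jaguar": 0.6, "jungle": 0.4, "prey": 0.2},
}

def fire(model, node, rng):
    """Return the set of words that fire once `node` has fired."""
    if node not in model:
        # Terminal node: it represents a single word.
        return {node}
    words = set()
    for child, weight in model[node].items():
        # Each outgoing link fires independently with probability `weight`.
        if rng.random() < weight:
            words |= fire(model, child, rng)
    return words
```

Calling `fire(MODEL, "CAR", random.Random())` returns a random subset of the words linked to the hypothetical "CAR" cluster. The learning step the abstract describes, adjusting the model from training documents, is not shown here.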


6 thoughts on “New Google Ranking Patents”

  1. The system also receives a set of training documents, wherein each training document contains a set of words. Next, the system applies the set of training documents to the current model to produce a new model.

    Generative model, training documents, producing new models …

    Very interesting. I think this patent confirms their use of machine learning to improve their ranking algorithms.

  2. Nice one, Shor.

    There is some pretty good stuff in these. For instance, there’s a section of the second patent that uses an alternative version of this process to look at clustering with queries.

    The example uses the word “jaguar” as a query term, which could mean the animal or the car. The model being used could identify clusters relating to both of those meanings when someone conducts that search.

    Knowing that there is a “car” cluster and an “animal” cluster, pages from both clusters could be returned to the searcher in a ratio that matches the probability of each meaning of the query, allowing for diversity in the search result set.
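One way to picture that ratio is as a split of the available result slots. The sketch below is only an illustration: the function name, the cluster labels, and the rounding rule (largest remainder) are my own assumptions, and it assumes the cluster probabilities sum to 1.

```python
def allocate_slots(cluster_probs, total_slots):
    """Split result slots across clusters in proportion to their probability.

    Assumes the probabilities in `cluster_probs` sum to 1.
    """
    raw = {c: p * total_slots for c, p in cluster_probs.items()}
    # Start with the whole-number part of each cluster's share.
    slots = {c: int(v) for c, v in raw.items()}
    leftover = total_slots - sum(slots.values())
    # Hand any remaining slots to the clusters with the largest fractional parts.
    by_fraction = sorted(raw, key=lambda c: raw[c] - slots[c], reverse=True)
    for c in by_fraction[:leftover]:
        slots[c] += 1
    return slots
```

With a hypothetical even split, `allocate_slots({"car": 0.5, "animal": 0.5}, 10)` gives each cluster five of the ten slots.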

    Understanding the possible clusters that results might be found within may also influence which advertisements are shown or targeted, could allow similar results to be grouped together, or could be used when applying an adult filter for some searches, for some clusters of results but not others.

    The patent tells us that using clusters related to the different senses of the query terms might allow for expansion of results from a “parent” cluster, or perhaps help when there has been a misspelling.

    There’s a lot to these patents.

    Michael,

    I think that the potential to match users who exhibit similar behavior around the same or similar pages, in order to identify shared interests, is one of the possibilities described.

    The first patent tells us that they might use multiple elements in the data collected, which they refer to as instances in the patent filing:

    Each instance may include a triple of data: (u, q, d), where u refers to user information, q refers to query data provided by the user, and d refers to document information relating to documents retrieved as a result of the query data and which documents the user selected and did not select.

    The kinds of data that might be associated with each of those triples of data cover a lot of ground:

    These features might include one or more of the following:

    the country in which user u is located,
    the time of day that user u provided query q,
    the language of the country in which user u is located,
    each of the previous three queries that user u provided,
    the language of query q,
    the exact string of query q,
    the word(s) in query q,
    the number of words in query q,
    each of the words in document d,
    each of the words in the Uniform Resource Locator (URL) of document d,
    the top level domain in the URL of document d,
    each of the prefixes of the URL of document d,
    each of the words in the title of document d,
    each of the words in the links pointing to document d,
    each of the words in the title of the documents shown above and below document d for query q,
    the number of times a word in query q matches a word in document d,
    the number of times user u has previously accessed document d,
    and other information.

    In one implementation, repository 220 may store more than 5 million distinct features.
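As a rough illustration of how one (u, q, d) instance could be turned into features like those listed, here is a small sketch. The dictionary keys (`country`, `text`, `url`, `title`) and the feature name strings are hypothetical choices of mine, not anything specified in the patent, and only a handful of the listed features are covered.

```python
from urllib.parse import urlparse

def extract_features(u, q, d):
    """Turn one (u, q, d) instance into a set of sparse string features."""
    q_words = q["text"].lower().split()
    d_words = d["text"].lower().split()
    # Last label of the hostname, e.g. "org" for example.org.
    tld = urlparse(d["url"]).hostname.rsplit(".", 1)[-1]
    feats = {
        f"country={u['country']}",        # country in which user u is located
        f"query={q['text'].lower()}",     # exact string of query q
        f"query_len={len(q_words)}",      # number of words in query q
        f"tld={tld}",                     # top level domain of document d
    }
    feats |= {f"q_word={w}" for w in q_words}
    feats |= {f"title_word={w}" for w in d["title"].lower().split()}
    # Number of times a word in query q matches a word in document d.
    matches = sum(d_words.count(w) for w in q_words)
    feats.add(f"matches={matches}")
    return feats
```

Because every distinct string becomes its own feature, it is easy to see how a repository built this way could grow to millions of distinct features.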

    So, how might all of that information be used?

    A ranking model may be created from this data. The model uses the data in repository 220 as a way of evaluating how good the model is. The model may include rules that maximize the log likelihood of the data in repository 220. The general idea of the model is that given a new (u, q, d), the model may predict whether user u will select a particular document d for query q. As will be described in more detail below, this information may be used to rank document d for query q and user u.
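The patent does not reduce this to one simple formula, so the following is only a stand-in: a plain logistic regression, trained by gradient ascent on the log likelihood of past selections, then used to rank candidates by predicted selection probability. The function names, learning rate, and epoch count are my own assumptions, and each instance is assumed to be a set of feature strings paired with 1 (selected) or 0 (not selected).

```python
import math

def predict(weights, feats):
    """Estimated probability that the document is selected."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))

def train(instances, epochs=200, lr=0.5):
    """Gradient ascent on the log likelihood of (features, clicked) pairs."""
    weights = {}
    for _ in range(epochs):
        for feats, clicked in instances:
            # For logistic regression, the log-likelihood gradient for each
            # active feature is (observed outcome - predicted probability).
            error = clicked - predict(weights, feats)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * error
    return weights

def rank(weights, candidates):
    """Order (doc_id, features) pairs by predicted selection probability."""
    return sorted(candidates, key=lambda c: predict(weights, c[1]), reverse=True)
```

Given a new (u, q, d), the features would be extracted and passed to `predict`, and documents for the same query and user would be sorted by that estimate, which is the general idea the passage describes.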

    We’ve seen the use of triples of information in another patent filing from Google which I described here:

    Refining Queries Using Category Synonyms for Local and Other Searches

    Those triples involve query term/result business name/result business category instead of user information/query data/document data to find business categories for businesses being searched for.
