Conventional search engines focus upon the words that they find on a web page rather than the meanings of those words. So, when you search for something like [cooking classes Palo Alto], a search engine might look for all of the pages that it can find that include all of those words. If it doesn’t find many, it might do something called “backing off,” and also show some results that don’t include all of the words.
But, chances are that the search engine might not show results for a slightly different version of that search, such as [cooking class palo alto], where “classes” is replaced with “class.” While “class” and “classes” are related with class being a subpart, or stem, of the word classes, sometimes variations of words have very different meanings when used in different contexts.
Google was granted a patent this week that focuses upon more effectively capturing the underlying semantic meaning behind words within text. It builds upon a patent that Google was granted in 2008 which “characterizes a document with respect to clusters of conceptually related words.”
The patent granted this week is:
Selectively deleting clusters of conceptually related words from a generative model for text
Invented by Uri Lerner, Michael Jahr, and Vishal Kasera
Assigned to Google
US Patent 7,877,371
Granted January 25, 2011
Filed February 7, 2007
One embodiment of the present invention provides a system that selectively deletes clusters of conceptually-related words from a probabilistic generative model for textual documents.
During operation, the system receives a current model, which contains terminal nodes representing random variables for words and contains one or more cluster nodes representing clusters of conceptually related words. Nodes in the current model are coupled together by weighted links, so that if an incoming link from a node that has fired causes a cluster node to fire with a probability proportionate to a weight of the incoming link, an outgoing link from the cluster node to another node causes the other node to fire with a probability proportionate to the weight of the outgoing link.
Next, the system processes a given cluster node in the current model for possible deletion. This involves determining a number of outgoing links from the given cluster node to terminal nodes or cluster nodes in the current model. If the determined number of outgoing links is less than a minimum value, or if the frequency with which the given cluster node fires is less than a minimum frequency, the system deletes the given cluster node from the current model.
The Google patent granted in 2008 is:
Method and apparatus for characterizing documents based on clusters of related words
Invented by Georges Harik and Noam M. Shazeer
Assigned to Google
US Patent 7,383,258
Granted June 3, 2008
Filed September 30, 2003
One embodiment of the present invention provides a system characterizes a document with respect to clusters of conceptually related words. Upon receiving a document containing a set of words, the system selects “candidate clusters” of conceptually related words that are related to the set of words.
These candidate clusters are selected using a model that explains how sets of words are generated from clusters of conceptually related words. Next, the system constructs a set of components to characterize the document, wherein the set of components includes components for candidate clusters. Each component in the set of components indicates a degree to which a corresponding candidate cluster is related to the set of words.
Both build upon ways of understanding the meanings behind small groups of text, like you might find in searcher’s query suggestions during a query session, and trying to decide how those blocks of text might be related to each other.
The benefits of an approach like this can include:
- Help in guessing at concepts behind a piece of text. These concepts might be shown to a searcher during a search to help them better understand the meaning behind the text.
- Enabling the search engine to compare words and concepts found in a document and in a query. This can help the search engine come up with an information retrieval scoring function to help rank web pages in search results based upon those concepts.
- Expanding search results to include related words and concepts in a search by looking at clusters of potential results related to different concepts that might include a specific word. For instance, a search for the word [jaguar] could mean the car, the animal, or the NFL football team. Clusters created around each of these “concepts” associated with the term could lead the search engine to show a certain percentage of results covering the different concepts, and making sure of diversity in those results.
- Comparing the relationship between words and concepts on a web page and in an advertisement. This can stand in as a proxy for how well an advertisement might perform when displayed on a certain web page. For example, an advertisement for a jaguar car on a page about jaguar cats may not be very effective.
- Comparing the relationship between words and concepts in a query and in an advertisement. This can provide an idea of how well an advertisement might do on a search result page for a specific query.
- Comparing the relationship between words and concepts from different web pages. This can tell the search engine how far apart conceptually two pages might be when they are clustered together as “similar” documents.
- Classification of pages, and filtering of some kinds of pages, based upon which clusters words (that might be used withing queries) tend to appear within.
- Generalizing a search query to retrieve more results (similar to the kind of backing off that I mentioned above), by looking at the clusters that the query terms appear within and parent concepts for those clusters.
- Identifying whether a word is a misspelling of another word by looking at the concepts related to each of those two words. For example, “flicker” is conceptually related to lights and flames, and “Flickr” is conceptually related to photographs. There’s a good probability that “flickr” is not a misspelling of “flicker.”