Why a Search Engine Might Cluster Concepts to Improve Search Results

Conventional search engines focus upon the words that they find on a web page rather than the meanings of those words. So, when you search for something like [cooking classes Palo Alto], a search engine might look for all of the pages that it can find that include all of those words. If it doesn’t find many, it might do something called “backing off,” and also show some results that don’t include all of the words.

But, chances are that the search engine might not show results for a slightly different version of that search, such as [cooking class palo alto], where “classes” is replaced with “class.” While “class” and “classes” are related, with “class” being the stem of “classes,” variations of a word can sometimes have very different meanings when used in different contexts.

Google was granted a patent this week that focuses upon more effectively capturing the underlying semantic meaning behind words within text. It builds upon a patent that Google was granted in 2008 which “characterizes a document with respect to clusters of conceptually related words.”

The patent granted this week is:

Selectively deleting clusters of conceptually related words from a generative model for text
Invented by Uri Lerner, Michael Jahr, and Vishal Kasera
Assigned to Google
US Patent 7,877,371
Granted January 25, 2011
Filed February 7, 2007

Abstract

One embodiment of the present invention provides a system that selectively deletes clusters of conceptually-related words from a probabilistic generative model for textual documents.

During operation, the system receives a current model, which contains terminal nodes representing random variables for words and contains one or more cluster nodes representing clusters of conceptually related words. Nodes in the current model are coupled together by weighted links, so that if an incoming link from a node that has fired causes a cluster node to fire with a probability proportionate to a weight of the incoming link, an outgoing link from the cluster node to another node causes the other node to fire with a probability proportionate to the weight of the outgoing link.

Next, the system processes a given cluster node in the current model for possible deletion. This involves determining a number of outgoing links from the given cluster node to terminal nodes or cluster nodes in the current model. If the determined number of outgoing links is less than a minimum value, or if the frequency with which the given cluster node fires is less than a minimum frequency, the system deletes the given cluster node from the current model.
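The deletion test described in the abstract can be sketched in a few lines of Python. This is only an illustration of the two thresholds, not the patent's implementation: the `ClusterNode` structure, the `prune_clusters` function, and the threshold values are all hypothetical, and the firing frequency is assumed to have been measured elsewhere.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ClusterNode:
    """A cluster of conceptually related words (illustrative structure)."""
    name: str
    # Target node name (a word or another cluster) -> link weight.
    outgoing: Dict[str, float] = field(default_factory=dict)
    # How often the cluster fired, as a fraction of the texts examined.
    firing_frequency: float = 0.0


def prune_clusters(clusters, min_links, min_frequency):
    """Return only the cluster nodes that survive both deletion tests."""
    kept = {}
    for name, node in clusters.items():
        too_few_links = len(node.outgoing) < min_links
        fires_too_rarely = node.firing_frequency < min_frequency
        if too_few_links or fires_too_rarely:
            continue  # the abstract's rule: delete this cluster node
        kept[name] = node
    return kept


# Made-up example model; the names and numbers are assumptions.
model = {
    "cooking": ClusterNode("cooking", {"recipe": 0.6, "chef": 0.4, "oven": 0.3}, 0.05),
    "sparse": ClusterNode("sparse", {"zyx": 0.9}, 0.04),
    "rare": ClusterNode("rare", {"a": 0.2, "b": 0.2, "c": 0.1}, 0.0001),
}
pruned = prune_clusters(model, min_links=2, min_frequency=0.001)
print(sorted(pruned))
```

Here the “sparse” cluster is dropped for having too few outgoing links, and the “rare” cluster is dropped because it almost never fires.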

The Google patent granted in 2008 is:

Method and apparatus for characterizing documents based on clusters of related words
Invented by Georges Harik and Noam M. Shazeer
Assigned to Google
US Patent 7,383,258
Granted June 3, 2008
Filed September 30, 2003

Abstract

One embodiment of the present invention provides a system characterizes a document with respect to clusters of conceptually related words. Upon receiving a document containing a set of words, the system selects “candidate clusters” of conceptually related words that are related to the set of words.

These candidate clusters are selected using a model that explains how sets of words are generated from clusters of conceptually related words. Next, the system constructs a set of components to characterize the document, wherein the set of components includes components for candidate clusters. Each component in the set of components indicates a degree to which a corresponding candidate cluster is related to the set of words.
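To make the idea of “components” more concrete, here is a rough Python sketch. It swaps in a simple weight-overlap score as a stand-in for the patent's probabilistic generative model, so the scoring rule, the `characterize` function, and the example clusters are all assumptions made for illustration.

```python
def characterize(words, cluster_weights):
    """Score each candidate cluster by how much of its link weight the text covers.

    cluster_weights maps a cluster name to {word: link weight}. A cluster is
    a candidate only if at least one of its words appears in the text.
    """
    word_set = set(words)
    components = {}
    for cluster, weights in cluster_weights.items():
        matched = sum(w for term, w in weights.items() if term in word_set)
        if matched > 0:
            # Degree to which this cluster is related to the set of words.
            components[cluster] = matched / sum(weights.values())
    return components


# Hypothetical clusters with made-up weights.
clusters = {
    "cooking": {"cooking": 0.5, "recipe": 0.25, "chef": 0.25},
    "cars": {"engine": 0.5, "jaguar": 0.5},
}
components = characterize("cooking classes palo alto".split(), clusters)
print(components)
```

The query only touches the “cooking” cluster, so that is the only component returned, and the “cars” cluster never becomes a candidate.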

Both build upon ways of understanding the meanings behind small pieces of text, like those you might find in a searcher’s queries during a query session, and of deciding how those pieces of text might be related to each other.

The benefits of an approach like this can include:

  • Help in guessing at concepts behind a piece of text. These concepts might be shown to a searcher during a search to help them better understand the meaning behind the text.
  • Enabling the search engine to compare words and concepts found in a document and in a query. This can help the search engine come up with an information retrieval scoring function to help rank web pages in search results based upon those concepts.
  • Expanding search results to include related words and concepts in a search by looking at clusters of potential results related to different concepts that might include a specific word. For instance, a search for the word [jaguar] could mean the car, the animal, or the NFL team. Clusters created around each of these “concepts” associated with the term could lead the search engine to show a certain percentage of results covering each of the different concepts, ensuring diversity in those results.
  • Comparing the relationship between words and concepts on a web page and in an advertisement. This can stand in as a proxy for how well an advertisement might perform when displayed on a certain web page. For example, an advertisement for a jaguar car on a page about jaguar cats may not be very effective.
  • Comparing the relationship between words and concepts in a query and in an advertisement. This can provide an idea of how well an advertisement might do on a search result page for a specific query.
  • Comparing the relationship between words and concepts from different web pages. This can tell the search engine how far apart conceptually two pages might be when they are clustered together as “similar” documents.
  • Classification of pages, and filtering of some kinds of pages, based upon the clusters in which their words (which might also be used within queries) tend to appear.
  • Generalizing a search query to retrieve more results (similar to the kind of backing off that I mentioned above), by looking at the clusters that the query terms appear within and parent concepts for those clusters.
  • Identifying whether a word is a misspelling of another word by looking at the concepts related to each of those two words. For example, “flicker” is conceptually related to lights and flames, and “Flickr” is conceptually related to photographs. There’s a good probability that “flickr” is not a misspelling of “flicker.”
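The misspelling example in the last item can be illustrated with a small Python sketch. It assumes each word has already been mapped to a set of related concepts, and it uses plain set overlap (Jaccard similarity) with an arbitrary threshold. Both are stand-ins chosen for illustration, not something described in the patent, and the function names are hypothetical.

```python
def concept_overlap(a, b):
    """Jaccard similarity between two sets of concepts."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def may_be_misspelling(word, candidate, concepts, threshold=0.5):
    """Treat `word` as a possible misspelling of `candidate` only when
    the two words share enough related concepts."""
    overlap = concept_overlap(concepts.get(word, set()),
                              concepts.get(candidate, set()))
    return overlap >= threshold


# Made-up concept assignments for the example words.
concepts = {
    "flicker": {"lights", "flames"},
    "flickr": {"photographs"},
    "recieve": {"mail", "delivery"},
    "receive": {"mail", "delivery", "gift"},
}
print(may_be_misspelling("flickr", "flicker", concepts))
print(may_be_misspelling("recieve", "receive", concepts))
```

Because “flickr” and “flicker” share no concepts, the first check fails even though the spellings are close, while “recieve” and “receive” overlap heavily and pass.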

28 thoughts on “Why a Search Engine Might Cluster Concepts to Improve Search Results”

  1. I wonder if this will be automatically applied to all languages or to English only.

  2. Hey Bill,

    I’m all for it. I think this type of approach would be a good thing to implement. From reading this, I think the best benefit would be identifying whether a word is misspelled by looking at the concepts associated with each of those two words. I often run into this confusion when the word I’m searching for is not misspelled at all.

  3. I wonder if this has something to do with Google and its local search efforts. Recently we have seen in the UK that our key phrase “Hot Tubs” was being treated as a search term that would be classed as local.

    I could understand “Pool” or “Swimming Pool” being a local search term. However, in the UK we do not visit “Hot Tubs.” We do visit attractions such as the Roman Baths in Bath, but still this is a term that shouldn’t be locally related.

    After some time, our search phrase is now being treated as a “Semi Local” result, with Google Maps being included but not showing only local place results. Before, it was prioritising all local results ahead of ours, which was very frustrating because we are based in the north of the UK and would miss out on many national orders.

    In my view, Google could be considering local search as one point when registering this new patent, so they could then fully automate which terms are classed as local and which aren’t.

  4. I like how Google search is becoming more intelligent every year with techniques like this. They keep getting better at suggesting more productive variants and interpreting the meaning of your search query.

    Combined with other amazing recent developments in AI, like Wolfram Alpha and IBM Watson, I believe that in the not too distant future some SE will be able to give meaningful answers to natural language queries.

  5. I’m really excited for Google to fully integrate this into their search results. The whole idea of semantic clustering is just completely drool inducing to me. The ability for the search engine to understand meaning and context! It’s just one step closer to HAL 9000.

  6. Hmm, from the description it sounds like typical NLP (natural language processing) techniques. Maybe the trick they describe is how to make NLP scalable.

  7. It’s good news for a Google user. Now we’re able to find more accurate results, as Google has got a brain too. ;)

  8. Interesting. I wonder what the implications of “capturing the underlying semantic meaning behind words within text” will be for internet marketers who spin and submit tons of articles to directories that are spun with synonyms? It seems to me that this could also be directed towards the movement of tons of semi-duplicate content to the “supplemental” archives. Who knows.

  9. In order for Google to provide users with better, faster and more relevant search results, it must come up with new features like this. They have been making a lot of changes recently and all for the better, IMO.

    @Mark I think you could be right here. Google has publicly announced that it is going to take a harsher stance against duplicated content from things such as autoblogs. I’d like to see them go down in flames.

  10. I for one agree with Joseph. Google still has a lot to work on when it comes to pinpointing people’s searches. They have it down to a science; it’s just a matter of playing the waiting game. As technology improves, they improve. I also agree with autoblogs being taken down.

  11. Hi John,

    One of the other good things about this clustering by concepts is that it looks like it would work well with some of the other things that Google appears to be doing these days, from phrase-based indexing, to associating specific terms and phrases with particular pages, to providing diversity in search results. The spell correction approach is one that can help when a word appears to be misspelled but really isn’t.

  12. Hi Ross,

    Interesting. It sounds like Google was classifying most searches for “hot tubs” as presenting an intent by searchers to buy a hot tub, and visit a place that sells them. That would definitely be a local search if true. But, there’s also a chance that people searching for [hot tub] were also attempting to learn more about hot tubs. It does sound like Google recognized that there was more than one intent behind that search, and is now providing more diverse results in response.

  13. Hi Val,

    If you asked anyone working at a search engine whether or not search is a solved problem, I would expect them to tell you that it’s still something in its infancy. They’ve come a long way, but there’s still a long way to go.

  14. Hi Iouri,

    As with most problems, no matter how simple or difficult, scale can make everything much harder.

    Did you read the patent description?

  15. Hi Shailender,

    This is one of my favorite lines from the patent:

    In general, there is a lot of information in such a small piece of text, using which we can draw conclusions, but there is also a lot of uncorrelated junk. The main task our system has is to cull out the proper correlations from the junk, while looking at a large number (billions) of such pieces of text.

  16. Hi Mark,

    It’s possible that pages associated with specific concepts might be clustered together and compared, so that higher quality results tend to be the ones returned in searches, though I think there’s probably more going on to solve the kind of problem you’ve raised than this concept clustering approach.

  17. Hi Joseph,

    It’s hard to tell if and when Google may have started doing this. The older patent was first filed in 2003, so the ideas have been around a while. The benefits of using it when doing things like choosing which ads to show on which pages and in search results, or deciding whether or not a word in a query is a misspelling, might lead someone to the conclusion that Google is using it, and has been for a while.

  18. As explained by you I think this concept of clusters is very complicated! However, I understand that LSI becomes the crucial part of purifying results from this model. Am I correct?

  19. Hi Anil,

    There’s no reference to LSI in the patent at all, and I think it would be a mistake to assume that LSI would play any role in the way that this works. This process is based upon a probabilistic generative model rather than latent semantic indexing.

  20. Hi Tony,

    It’s quite possible that a system like this would handle slang better than a dictionary-based system that would have to be updated manually. For a word or phrase to be considered slang, it has to have been adopted by enough people that its slang meaning can be recognized by at least some number of the people who use the term or phrase.

  21. Is that generally how long it takes the patent office to hand out a patent?

    Granted June 3, 2008
    Filed September 30, 2003
    Granted January 25, 2011
    Filed February 7, 2007

    When I think of the word patent I think of Alexander Graham Bell and the guy he beat to the patent office running there in order to file 15 minutes before the other guy…. 4-5 years. Wow.

  22. Hi Isaiah,

    It isn’t unusual for the patent office to take a while to grant a patent. There needs to be time for people to challenge the patent, and time for a patent examiner to search for prior art as well. Some or all of the claims in a patent can be rejected, and there’s a chance for the inventors of the patent to amend claims as well. It does happen that the patent that is originally filed may be different in a number of ways when it is granted. And then, some patents end up rejected as well.

    If you find the Alexander Graham Bell patent disputes interesting, you might want to look into the patent battles between Tesla and Marconi.

Comments are closed.