Context Sensitive Stemming for Web Search


It is unclear how much most commercial search engines rely on stemming when returning search results, because stemming can reduce the relevance of those results and can be computationally expensive.

Researchers at Yahoo! take a second look at stemming, and at how it can be adapted to Web search, in Context Sensitive Stemming for Web Search (if that link doesn’t work, try this).

The paper explores using statistical language modeling to perform a context sensitive analysis and predict which variants of the words in a query will be useful when expanding a search for that query. This can result in far fewer bad expansions, which means less computation and better precision in the results.
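To make the idea concrete, here is a minimal sketch of scoring query variants with a language model. It is not the paper's model: the toy corpus, the add-one smoothed bigram model, the `margin` threshold, and all function names are my own assumptions, chosen only to show how context can accept one variant and reject another.

```python
from collections import Counter
import math

# Toy corpus standing in for a large query log or Web corpus (an assumption).
CORPUS = [
    "hotel price comparison",
    "hotel prices comparison",
    "compare hotel prices",
    "cheap hotel prices",
    "hotels in paris",
    "cheap hotels in rome",
    "hotel price comparison site",
]

def train_bigrams(sentences):
    """Count unigrams and bigrams, with start/end markers around each sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Log-probability of a token sequence under an add-one smoothed bigram model."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )

def useful_variants(query, position, variants, unigrams, bigrams, margin=1.0):
    """Keep variants whose substituted query scores within `margin` of the original.

    `margin` is a hypothetical tuning knob, in log-probability units.
    """
    tokens = query.split()
    base = log_prob(tokens, unigrams, bigrams)
    kept = []
    for variant in variants:
        candidate = tokens[:position] + [variant] + tokens[position + 1:]
        if log_prob(candidate, unigrams, bigrams) >= base - margin:
            kept.append(variant)
    return kept

UNIGRAMS, BIGRAMS = train_bigrams(CORPUS)

# "hotels price comparison" is unlikely in this corpus, so "hotels" is dropped.
print(useful_variants("hotel price comparison", 0, ["hotels"], UNIGRAMS, BIGRAMS))
# "hotel prices comparison" is plausible, so "prices" is kept.
print(useful_variants("hotel price comparison", 1, ["prices"], UNIGRAMS, BIGRAMS))
```

In this sketch each word is judged within the whole query rather than on its own, so the same plural rule can be a good expansion for one word and a bad one for its neighbor.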

Context sensitive document matching is also performed for the expanded variants.
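One way to picture context sensitive matching is sketched below. It is an illustration under my own assumptions rather than the procedure from the paper: a variant only counts as a match if it appears in the document near one of the words that sat next to the original word in the query. The function name and the `window` parameter are hypothetical.

```python
def matches_in_context(doc_tokens, query_tokens, position, variant, window=2):
    """Return True if `variant` occurs in the document within `window` tokens
    of a word that neighbors the original word in the query."""
    neighbors = set()
    if position > 0:
        neighbors.add(query_tokens[position - 1])
    if position < len(query_tokens) - 1:
        neighbors.add(query_tokens[position + 1])
    if not neighbors:
        # Single-word query: there is no context to check, so fall back to a plain match.
        return variant in doc_tokens

    for i, token in enumerate(doc_tokens):
        if token != variant:
            continue
        start = max(0, i - window)
        nearby = doc_tokens[start:i] + doc_tokens[i + 1:i + 1 + window]
        if neighbors & set(nearby):
            return True
    return False

query = ["hotel", "price", "comparison"]
relevant = "compare hotel prices from many sites".split()
unrelated = "oil prices rose while hotel bookings fell".split()

print(matches_in_context(relevant, query, 1, "prices"))   # "prices" sits next to "hotel"
print(matches_in_context(unrelated, query, 1, "prices"))  # "prices" appears, but not near "hotel"
```

Both documents contain "prices" and "hotel", but only the first uses "prices" in the same context as the query, which is the distinction this kind of matching is meant to capture.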

The focus of this approach is the use of stemming as part of a query, rather than during indexing of a page. Doing that allows for an analysis in the context of the full query term, to try to find relevant results for the searcher.

The authors use pluralization handling to demonstrate their stemming approach, and tell us that “…as far as we know, no previous research has systematically investigated the usage of pluralization in Web search.” The method that they describe isn’t limited to pluralization, however.

The processes of performing a context sensitive analysis of a query term and of matching documents are described in detail in the paper, which is definitely worth reading if you are interested in how a search engine might return relevant results to a searcher.

The idea of stemming based upon the words used in a query, rather than at the time of indexing, appears to be a good one.


9 thoughts on “Context Sensitive Stemming for Web Search”

  1. That’s interesting – so, for example, a search for “blued steel” could pull up relevant stemming such as “blueing” which wouldn’t (or rather, shouldn’t) show up for “chicago blues.”

    And, hopefully, would eliminate “blue steel”, which is a) the title of a movie and b) a drug which I shall choose not to describe on the basis that the description might cause this comment to be blocked as spam!

  2. Thanks Bill. Another sleepless night mulling over the processes of performing a context sensitive analysis of a query term and document matching.

    Why can’t you post a “How to get backlinks” post every once in a while 😉

  3. Hi Joe,

    From the Yahoo search results that I see for blued steel, they just might. At least for the search results. The paid results point right to the drug. Then again, it might be a coincidence that the second result is titled blueing steel.

    The “how to get backlinks” post was in the queue, Dave.

    Ok, it wasn’t. But it is interesting that stemming might be done on the query side. Thanks.

  4. I took a look at the “Context Sensitive Stemming for Web Search” PDF and there’s a lot of valuable information there. Thank you.

  5. The document seems very in-depth but it doesn’t mention the fact that context is also sensitive in terms of regional differences. When Indian telesales people were originally trained for the UK, no one took regional accents into consideration, and hence when they talked to someone from Scotland they hadn’t a clue what they were saying. They speak English just the same, but their accent is different, as well as small differences in regional dialects.

    In terms of effectiveness for searches it’s a big advancement, but I don’t think truly accurate results will be achieved until neural networks are trained to look for subtle changes based on region, dialect, culture etc. and are able to learn from their findings.

    Khaled

  6. Hi Khaled,

    You raise some good points, and I think that differences in culture, in dialect, and in region are things that a search engine should consider when building language models. I don’t think the fact that those differences weren’t taken into account negates the research or the conclusions drawn from it. Your points do suggest some interesting avenues that should probably be explored further.

  7. Yes, I agree with @bill: “You raise some good points, and I think that differences in culture, in dialect, and in region are things that a search engine should consider when building language models.”
