How a Search Engine Might Rerank Search Results Based upon Topics

If you search for the word “cold” using the search box on a health-related site, chances are you want to find out something about the illness. If you search for “cold” at Google or Yahoo or Bing, there’s a chance that you might be interested in weather or air conditioning or a cold war or a stuffy nose.

Different sites and pages might focus upon specific topics of interest, such as health or sports, or weather, or construction. One way a search engine might try to get around some of the limitations of words with multiple meanings is to assign domain or topical scores to web pages and other items found on the Web, regardless of which queries they might be good results for. Then, if a query seems to cover a specific domain or topic, the search engine could return pages that involve that topic, based upon a “domain score” for those pages.

Why Look at Domains (Categories of Interest) in Ranking Pages?

The patent’s description begins by describing conventional methods of ranking pages in search results. When a search engine attempts to match a query with a document, there are a number of steps that it may go through first.

One of those may be to “lemmatize,” or group together different forms or inflections of a word found in a query, and identify the lemma, or base version of that word. For example, the words swims, swimming, swum, and swam might be considered inflected forms of the word “swim.” The word swim might be considered the lemma of that group of words. This is different from stemming, in that stemming might reduce swims, swimming, and swimmer down to the root “swim,” but a stemming operator wouldn’t include “swam” or “swum” the way that lemmatizing those words would.

The lemma, or normal form of the word in the query, would then be used to see which documents contain that term.
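
The difference between lemmatizing and stemming can be sketched in a few lines of Python. This is only a toy illustration: the lookup table, the suffix list, and the function names are assumptions made for the example, not something taken from the patent or from any real library.

```python
# Toy comparison of lemmatizing and stemming. Both tables below are
# illustrative assumptions, not part of the patent or any real library.

# A lemmatizer needs a vocabulary that maps inflected forms to a lemma.
LEMMAS = {
    "swims": "swim",
    "swimming": "swim",
    "swam": "swim",
    "swum": "swim",
}

# A crude stemmer only strips suffixes it recognizes.
SUFFIXES = ("ming", "mer", "s")


def lemmatize(word: str) -> str:
    """Return the lemma if the word is in the vocabulary, else the word."""
    word = word.lower()
    return LEMMAS.get(word, word)


def stem(word: str) -> str:
    """Strip the first matching suffix, leaving at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

Here the stemmer reduces “swims,” “swimming,” and “swimmer” to “swim” but leaves “swam” and “swum” untouched, while the lemmatizer maps “swam” and “swum” to “swim” because it works from a vocabulary rather than from spelling alone.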

A term-based ranking system like this might also assign scores to web pages based upon statistics about how frequently query terms appear in documents that contain those terms. A large number of occurrences of the term in a document might increase the score for the document. If a rare term (one that doesn’t show up in too many documents on the web) appears in both the query and the document, that may increase the score for the page. If there are terms in the query that aren’t on a page, the score for that page may be decreased. If the terms appear in certain parts of a page, such as the page title, the score for the page may be increased.
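
A rough sketch of how those four adjustments might fit together follows. The function name, the logarithmic rarity weight, and the `title_boost` and `missing_penalty` values are all assumptions chosen for illustration; the patent’s description of conventional ranking doesn’t give a formula.

```python
import math
from collections import Counter


def term_score(query_terms, title, body, doc_freq, total_docs,
               title_boost=2.0, missing_penalty=0.5):
    """Toy term-based score; every weight here is a made-up assumption.

    doc_freq maps a term to the number of documents that contain it.
    """
    body_counts = Counter(body.lower().split())
    title_terms = set(title.lower().split())
    score = 0.0
    for term in query_terms:
        count = body_counts[term]
        in_title = term in title_terms
        if count == 0 and not in_title:
            # A query term that is missing from the page lowers the score.
            score -= missing_penalty
            continue
        # Rare terms (found in few documents) get a larger weight.
        rarity = math.log(total_docs / (1 + doc_freq.get(term, 0)))
        # More occurrences raise the score; a title match adds a boost.
        weight = count + (title_boost if in_title else 0.0)
        score += weight * rarity
    return score
```

With hypothetical counts in which “cold” appears in half of all documents and “remedy” in very few, a page titled “cold remedy” that repeats both words would outscore a weather page that mentions “cold” once and never mentions “remedy.”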

In addition to looking at whether or not a query term appears upon a page for purposes of ranking that page, popularity-based ranking signals are often also used in ranking web pages for query terms. These signals can include information about links to pages, which results people select when they see a page in search results for a particular query, and others.

But one of the main limitations of this kind of approach shows up when a word in a query might have more than one meaning, and the meaning isn’t very clear from the context of the query.

If each page had a domain-based score, and queries could be also given domain scores, it might help in finding results that might match queries well.

There might be a number of ways to calculate a domain score for a page. One might be to give a page a “medical” domain score based upon how many medical terms appeared on the page. Another might be to take a query into account and see how well the set of medical terms in a query match up with medical concepts appearing on a page.

The order of pages in search results might then be based upon both the traditional way of ranking pages, as well as a topical or domain score for those pages.
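
The two ways of calculating a domain score mentioned above might look something like the following sketch. The small `MEDICAL_TERMS` set and both function names are hypothetical; a real system would presumably draw on a much larger vocabulary or ontology, and the patent doesn’t say where that would come from.

```python
# Hypothetical medical vocabulary, used only for this illustration.
MEDICAL_TERMS = {"cold", "fever", "virus", "cough", "symptom", "symptoms",
                 "congestion", "treatment"}


def tokens(text):
    """Lowercase the words in a text and strip surrounding punctuation."""
    return [w.strip(".,;:!?\"'()").lower() for w in text.split()]


def page_domain_score(page_text, domain_terms=MEDICAL_TERMS):
    """Query-independent: share of the page's words that are domain terms."""
    words = [w for w in tokens(page_text) if w]
    if not words:
        return 0.0
    return sum(w in domain_terms for w in words) / len(words)


def query_domain_match(query, page_text, domain_terms=MEDICAL_TERMS):
    """Query-dependent: share of the query's domain terms found on the page."""
    query_terms = {w for w in tokens(query) if w in domain_terms}
    if not query_terms:
        return 0.0
    page_words = set(tokens(page_text))
    return len(query_terms & page_words) / len(query_terms)
```

A sentence such as “Cold symptoms include cough, congestion, and fever.” gets a far higher medical score under the first function than “A cold front brings snow and wind tonight.” even though both contain the word “cold.”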

Domain-Based Ranking in a Document Search
Invented by Alain Thierry Rappaport and Daniel Adamson
Assigned to Microsoft
US Patent Application 20100228743
Published September 9, 2010
Filed: March 3, 2009

Abstract

In one example, documents that are examined by a search process may be scored in a manner that is specific to a domain. A domain may be a substantive area, such as medicine, sports, etc. Different scoring approaches that take aspects of the domain into account may be applied to the documents, thereby producing different scores than might have been produced by a simple comparison of the terms in the query with the terms in the documents.

These domain-based approaches may take a query into account in scoring the documents, or may be query-independent. Each approach may be implemented by a scorer. The combined output of the scorers may be used to generate a score for each document. Documents then may be ranked based on the scores, and search results may be provided.

A page might cover a number of topics, or domains, and a number of levels. For instance, a page about baseball may contain a lot of terms and concepts related to baseball. Another page might cover sports more generally, including baseball, football, hockey, gymnastics, curling, and more.

The more general page about sports might be a much more popular page in terms of links pointing to it, and visits, and it might rank higher in search results for a query about baseball under a conventional term-based ranking system because of that popularity. But, if the domain scores for those two pages are taken into account, the less popular page that focuses more upon baseball may rank higher because it has a better domain score for baseball.
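
To make the baseball example concrete, here is a minimal sketch of that kind of reranking. The weighted sum, the `domain_weight` value, the page dictionaries, and every score in them are invented for illustration; the patent filing only says that the combined output of the scorers may be used to generate a score for each document.

```python
def rerank(pages, domain, domain_weight=0.6):
    """Order pages by a blend of conventional score and domain score.

    pages is a list of dicts with hypothetical keys: "url", "base_score"
    (a conventional term and popularity score between 0 and 1), and
    "domain_scores" (a mapping of domain name to a score between 0 and 1).
    """
    def blended(page):
        domain_score = page["domain_scores"].get(domain, 0.0)
        return ((1 - domain_weight) * page["base_score"]
                + domain_weight * domain_score)

    return sorted(pages, key=blended, reverse=True)


# Made-up scores: the general sports page is more popular, but the
# baseball page is more focused on the baseball domain.
pages = [
    {"url": "general-sports.example", "base_score": 0.9,
     "domain_scores": {"sports": 0.9, "baseball": 0.2}},
    {"url": "baseball-only.example", "base_score": 0.6,
     "domain_scores": {"sports": 0.7, "baseball": 0.95}},
]

for page in rerank(pages, "baseball"):
    print(page["url"])
```

With these numbers the baseball-focused page moves ahead of the general sports page for a baseball query, while setting `domain_weight` to 0 restores the popularity-driven order.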

Conclusion

The patent describes a number of approaches that could be used to understand the concepts presented on a page, and to rank a page based upon different domains or areas of interest that it might cover. These can include things like possibly boosting the value of terms found in a web page’s title if it seems to point out a topic for the page. But it also covers other areas.

Regardless of the approaches the search engine might use, an interesting exercise, if you’re a web page publisher, is to read through a number of pages on your site and see how well each of them indicates the categories that the page should be associated with. If you had to assign a number of categories to the pages of your site, and rank how well the pages fit into those categories, it might be something that you could do.

Now imagine a search engine trying to do the same thing. It would attempt to base its categorization of your pages upon what it knows about other pages on the Web that might cover similar categories.

How close would your categories be to the categories that the search engine came up with?

A search engine might use the categories (or domains) it chooses for a page, to rank how well that page might match the concepts or topics it identifies for a query.

29 thoughts on “How a Search Engine Might Rerank Search Results Based upon Topics”

  1. I’m a little concerned that, despite your repeated clarifications here, some people are going to read this post and come away with the idea that domain names are of particular importance. In a sense, some might misunderstand the domain of the page because of the use of the word “domain”.

    Apart from that, I’m curious about how specific a search engine would get in determining a page’s domain. In the example you use, a page about baseball would get more of a domain boost for baseball than a more general page about sports, and that implies that “sports” is too general a term to be treated as a domain itself, or at least that there would be levels of specificity in making such judgments about a page’s subject-matter.

  2. They say to place the keyword every 100 words in the post. I think this is sensible especially if search engines do it the way you described it here. It is really best to have the right categories for your posts, helps the search engines understand your pages better.

  3. Interesting stuff. Wish I was far enough along in my SEO knowledge to use it. I will keep plowing ahead.

  4. What do you think of the recent stuff on LDA from seomoz? Seemingly this solves the puzzle in the sense that it moves away from keywords being the ultimate definition.

  5. I agree with Bob that people may just scan the headline and the first paragraphs and conclude it’s all about domain names. The idea of using domain relevance to weight a page has been suggested before but I think it’s interesting to see that Microsoft is the source of this patent.

    Tim, SEOmoz’s LDA nonsense has no relevance to search. This patent deals with how a domain might be assigned one or more topic weights that could be used to adjust which pages are selected from the domain for inclusion in query-relevant results.

    LDA ignores word-order and proximity. While it might assist with domain topic weighting, SEOmoz is arguing (unconvincingly at the scientific level) that GOOGLE may be using LDA to order its search results.

    They have no idea of what they are doing at SEOmoz these days. That’s a very sad development.

  6. Interesting patent. I wonder how all the other factors add up and how they all weight together for a page to be then shown for a query. For instance #links.

  7. Hi Bob,

    I was concerned about readers misinterpreting the title of this post for that very reason, but I wanted to try to explain the concept using at least some of the words that the inventors of the idea did. I’ve changed the title of the post, and some of the wording of the first couple of paragraphs.

    Sometimes search engineers do make things confusing for the rest of us.

    I really hate that nofollow was chosen as a value for the link attribute “rel” even though there’s a robots meta “nofollow” value. And I don’t like that Google decided to use the name “XML sitemap” to refer to XML index files for URLs, even though it would create confusion between those and HTML sitemaps. The authors of this patent, in using the word “Domain” to mean a specific category of knowledge or interest must have known that it would be likely for many people coming across it to think of the sense of the word “domain” as in a domain name.

    The authors of the patent do tell us that they can drill down to fairly fine levels of granularity when determining “domains” for pages, but they really don’t focus within the claims or the description of the patent on telling us where this hierarchical pool of concepts might be taken from. Here’s where they mention looking at more specific concepts:

    [0024]Many of the examples herein refer to searches performed in the medical domain, where the medical domain may, for example, contain information that is descriptive of a human body, or that is descriptive of treatments of the human body. However, a domain could represent any substantive field of knowledge, and searches could be performed in any domain. For example, there could be domains for sports, cars, food, education, or any other area. Domains could be at any level of granularity–e.g., “sports” could be a domain, but so could “baseball” or “hockey.”

    It’s an aspect of the method described in the patent that I’d like to know more about as well, but one that they could possibly address in many different ways, and not really essential to granting of the patent based upon the idea of having separate scores based upon domains associated with a page that could be used to either rank or rerank search results based upon how well a query might match up with that topic (or domain of knowledge).

  8. Hi Andrew,

    I don’t think that I have heard anyone saying to try to place a keyword every 100 words or so in a post (or I guess any page). That might be a somewhat helpful rule of thumb if you’re just trying to make sure that you use a keyword a number of times throughout something that you’ve written, but I don’t think I’ve ever seen a suggestion like that from any of the search engines.

    In my conclusion, I probably should have stressed more that if you think that a page is appropriate for a specific category, you should go back through that page and make sure that your assessment of that category is matched by words that are appropriate for that category. If I believe that a page is about baseball, for instance, do I have words or phrases within the page that demonstrate that it is about baseball, such as “pitcher’s mound,” “left field,” “dugout,” “playing field,” “earned run average,” “runs batted in,” “uniform,” “fans,” “concession stands,” “Babe Ruth,” etc.? If I don’t, and I want the search engines to think that my post or article or page is about baseball, I may want to rewrite my page using more baseball related terms and phrases. It’s not your own tagging of a page with a category like “baseball” that is important here as much as it is making sure that the words you use make it obvious that the page is about baseball.

  9. Hi Tim,

    There are so many different signals and filters and reranking methods involved in how a page may be ranked or reranked by a search engine that it’s difficult to look at certain signals or methods and say how much importance each one may have. We know a lot about some of the things that search engines may be looking at, but there are many things that we don’t know as well.

    I think back to the “perfect page” approach that a number of SEO tool makers came out with many years ago (and may still use), where they tell you things like the perfect keyword density for the phrase that you’re trying to optimize a page for, and where to place keywords, all based upon the usage of that phrase on the top ranked sites, and think about how that method ignores the influence of things like PageRank, the influence of text in links pointing to a page, and other signals that may play a large role in helping a page rank for a specific phrase.

    I can’t see how you can effectively isolate one signal like LDA while failing to explore other potential influences upon ranking. Here are just a small handful that are potentially ignored by the approach:

    - PageRank,
    - Hypertext Relevance,
    - Location of terms and links upon pages, and how a visual segmentation of pages into parts can influence where a link or term might be considered much more important than others if it is within one block as opposed to another,
    - Stemming and lemmatizing of terms
    - The semantic ontology behind the Applied Semantics Circa Technology and the Google patents developed by that team for organic search,
    - Google’s Phrase-Based Indexing and the 9 or more (a number are pending and unpublished) patents from Google that describe how those might be used to rerank and filter search results based upon co-occurrence of phrases on top ranking pages, and use of related phrases in anchor text
    - The identification and association of specific named entities with attributes and values associated with those entities, and category labels that can be associated with different entities with the same names to determine confidence scores about which entity might be the one mentioned in a query.
    - The use of statistical machine translation tools to identify synonyms within context
    - The use of query logs (and query sessions) to identify synonyms within context, and misspellings
    - The use of query logs to identify different types of intent behind a query, such as geographic, transactional, commercial, informational, navigational, and a manipulation of results to show pages more closely matching those results.
    - The manipulation of results based upon a language preference, or a country preference.
    - The manipulation of results based upon the burstiness or consistency of a topic, as well as the freshness (possibly a positive influence or a negative influence, depending upon the context) or the maturity (possibly a positive influence or a negative influence, depending upon the context) of results within a set of results for that topic.
    - The prominence, proximity, and adjacency of terms on a page, and the semantic closeness of terms within a semantic construct such as an equal distance between a heading for a list, and all list items within that list.

    There are a number of different ways to define relevance, that can include things such as:

    Whether keywords in a query are also in a document,
    Whether a category that keywords could be placed within might match the category of a page,
    Whether a page is a good match for the intent behind a query
    Whether a page is a good match for the situational inquiry behind a search.

    A number of the things that I listed go beyond keyword matching to looking at meaning behind matching a query to a page.

  10. Hi Michael,

    Thanks for weighing in about the use of the word “Domain” in this page’s title and opening paragraphs. After reading your comment and Bob’s, I’ve made some slight changes. Hopefully there’s less potential to confuse people reading the post now.

    It is interesting to see Microsoft as the source of this patent. I didn’t dig around for possible related papers or patents involving this patent like I often do, but probably should to see if they’ve done more research that isn’t described in the patent – they often do. A very quick search reveals at least one possible related patent – I’ll have to look further.

    Not sure what to make of SEOmoz’s focus upon LDA, and I did search Google’s patents for any reference to LDA or Latent Dirichlet Allocation, and the only results that included “LDA” involved an image optimization process known as Linear Discriminant Analysis. Does Google use LDA? I don’t know. But they have other tools at hand that can help them with topic modeling that are discussed in patents and whitepapers. Google’s been using technology based upon Applied Semantics’ CIRCA technology since 2003, and the founders of that company published at least one Google patent describing a search based upon concepts rather than keywords back then.

  11. Hi Ernest

    It’s likely that at least one infrastructure change in Google’s past allowed them to switch in and out different ranking algorithms with the pressing of a button, enabling them to set up changes to test without rewriting everything every time they wanted to test something. It’s also possible that different kinds of search results or different categories of queries may be treated in different ways, with signals that might change from one category to another.

    For example, results on a search for “Boston Hotels” might be based upon different signals than one for “George Washington’s education,” or “gulf oil spill.” Some searches may be best served by fresher results, some by more informational results, and others still by more commercial results. The way those results are ranked may vary from one type to another.

  12. I think it is an incredibly interesting debate – clearly search engine topic modelling is of importance (as you describe Bill). Thanks for the interesting post, and apologies for going (very slightly) off topic.

  13. I had never really thought of it like this- that different search engines bring up different results. I always just assumed, like most of the population, that you search for one thing and you find it, no matter what the engine you’re using. Thanks for such an informative article!

  14. Hi Tim,

    Thank you. I don’t think that we ventured all that far off topic, and topic modeling is definitely something being widely discussed these days.

  15. Hi Andy,

    It’s definitely something that can be hard to get your head around. When we search at one of the search engines, we aren’t searching the Web, but rather that search engine’s index of the web, which means that all of the decisions that they make regarding crawling and indexing pages can influence what we see as well as how they might rank results.

    I think it’s actually a good thing that different search engines deliver different pages in response to a query.

  16. Hi Carl,

    A tutorial section is probably a good idea. It is something that I’ve considered in the past, as well as an SEO Glossary. And then there’s that book on SEO that I would like to write someday soon as well. Need to find more time in the day. :)

  17. Bill, Have you ever pondered a “tutorial” section for your site? Nothing extravagant, simple tips & tricks for SEO without giving away all your secrets of course. Possibly a tip, trick for each post (if applicable)? Always curious as to learning new techniques. Eager to hear back…

  18. Hmmm, so if I am clear, it is better to narrow the focus for the domain and keywords, rather than a general approach?

  19. I think that looking at the meaning behind a search rather than just the keywords themselves is crucial. If a search engine can derive the intent of the searcher (ie. to find an answer to a particular question) based on variables other than keywords, we are one step closer to being provided with information we want before we even know what we want. That would be a fun time to live in.

  20. Hi Paul,

    If a query and a page both were considered to be in the same topic or category, it’s possible that the ranking of that page might be boosted in search results on a search for that topic.

    So, for instance, if the front page of Microsoft included the word “baseball,” it might rank well for the term based upon things like the number and quality of links pointed to the Microsoft home page. But the “domain of interest” or topic of Microsoft’s home page really has nothing to do with baseball. The home page of Major League Baseball does, however, and so its “domain” might be considered to be “baseball” and it might be given a boost in rankings above the Microsoft page.

  21. Hi Barry,

    I agree with you on the value of search based upon meaning rather than just keywords. I think search is moving in that direction.

  22. Hi Tim,

    Thanks for the link to the SEOmoz update. I do appreciate that they are experimenting, and that they are being transparent about what they are doing, and open to acknowledging that they made a mistake. We all learn from it.

  23. I might be missing something,
    If the web page was about Chicago, and the web site was about theater; was Bing able to index it as the play?
    If the web page was about Chicago, and the web site was about DVD; was it able to index it as the movie?

  24. Hi SEO Toronto Canada,

    To use your question as an example, the search engine might look to the content of the page that the term appears upon, and attempt to identify a “domain” or topic based upon the other terms that might appear upon the page as well.

    If terms on a page are ones that are related to a theatrical presentation of “Chicago,” it might possibly identify it as related to a play. Those terms might include words or phrases such as “intermission,” “balcony,” “admission,” “cast members,” and “stage design.” It’s also possible that the search engine might identify terms that may help identify the page as being about a DVD, and those might include things like “special features,” “director’s commentary,” “running time,” and “two-disc set.”
