Ranking Search Results Using Feature-Based Rankings


How does one optimize pages for MSN, given that they use a machine-based ranking system for ranking search results and returning results to visitors?

Some new research from Microsoft and some recently released patent applications might provide some ideas.

Before I dive into this, I want to point out Search Engines and Algorithms: Optimizing for MSN’s RankNet Technology by Jennifer Sullivan Cassidy, which takes a look at Microsoft’s RankNet technology. It’s a good introduction to some of the research that Microsoft has been doing lately.

Query independent ranking

RankNet is discussed further in a paper to be presented in May at WWW2006, titled Beyond PageRank: Machine Learning for Static Ranking. It provides a detailed look at how human-ranked pages can be used to identify other high-quality pages, without relying upon the link structure of the web.

The document describes experiments conducted on an approach to ranking pages, rather than reporting how MSN might be using this ranking system. But it can help us understand some of the approaches that MSN may be taking to ranking pages and sites.

The focus of the paper is on a query-independent static ranking of pages, as opposed to a dynamic ranking. That doesn’t mean that a dynamic query-based ranking is no longer part of the equation determining which pages get returned in response to a query, but rather that the quality of the pages returned will play a larger part in the results. The authors tell us that:

In this paper, we show there are several simple URL or page-based features that significantly outperform PageRank (to statically rank Web pages) despite ignoring the structure of the Web.

We combine these and other static features using machine learning to achieve a ranking system that is significantly better than PageRank (in pairwise agreement with human labels).

One benefit mentioned from this result is that it would be harder for people to manipulate search results, because a static ranking of this kind is based upon a large number of factors on a page, rather than relying as heavily upon the number of links to a page, as a system that uses something like PageRank does. And the learning nature of the ranking system can allow features that are being manipulated to carry less weight, or to be ignored completely.

The paper looks at:

  1. PageRank (which it calls slow and computationally expensive)
  2. RankNet (a learning and ranking algorithm)
  3. Specific features used to rank pages
  4. Experiments
  5. Results
  6. Related and future work
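RankNet itself learns from pairs of human-labeled pages, modeling the probability that one page should outrank another as a sigmoid of the difference in their scores. Here is a minimal, illustrative pairwise ranker in that spirit (a linear model trained by gradient descent; the features, data, and function names are my own, not Microsoft’s):

```python
import math

def train_pairwise_ranker(pairs, num_features, lr=0.1, epochs=200):
    """Learn weights w so that score(better) > score(worse) for each
    (better_features, worse_features) pair. RankNet-style: the modeled
    probability that A outranks B is sigmoid(score(A) - score(B))."""
    w = [0.0] * num_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = sum(wi * (b - c) for wi, b, c in zip(w, better, worse))
            # gradient step on -log(sigmoid(diff)); 1/(1+e^diff) = 1 - sigmoid(diff)
            grad_scale = 1.0 / (1.0 + math.exp(diff))
            w = [wi + lr * grad_scale * (b - c)
                 for wi, b, c in zip(w, better, worse)]
    return w

def score(w, features):
    return sum(wi * f for wi, f in zip(w, features))

# Toy data: feature 0 (say, toolbar visits) predicts quality; feature 1 is noise.
pairs = [((5.0, 1.0), (1.0, 1.0)),
         ((4.0, 0.0), (2.0, 0.0)),
         ((3.0, 1.0), (0.5, 1.0))]
w = train_pairwise_ranker(pairs, num_features=2)
```

The real RankNet uses a neural network rather than a linear model, but the pairwise loss is the key idea: training data only needs relative human judgments, not absolute scores.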

The categories of specific features that they looked at for this paper:

  • Popularity,
  • Page-level,
  • Anchor text and inlinks, and
  • Domain-level.

Feature based rankings

The paper refers to the method it uses for ranking search results as fRank, or feature-based ranking. (As an aside, I’ll note that on my first reading of the paper, I had to scroll back up to the top and look at the authors’ names to see if any of them was named “Frank,” perhaps because “PageRank” takes its name from Larry Page’s surname.)

They list some, but not all, of the features from the categories above.

Popularity is interesting because they used MSN toolbar data to find out which pages were the most popular. Other sources they note for finding popularity include proxy logs (which tend to be on the small side for these purposes), and click-through rates, such as were used by older search engines like Direct Hit. The advantage of using toolbar visits is that it includes information about site visits beyond that gathered from the use of a search engine, such as clicking on favorites or following links from one page to another.

Page (and URL) features can include such things as the number of words on a page, frequency of the most common terms, and more.

Anchor text features look at the text pointing to pages, rather than the pages themselves, and can involve aspects like the number of words in a link and the words used in those links (for more details, see the “anchor text” patent application, below).

Domain-based features involve averages across all of the pages in a specific domain, such as the average number of words, average PageRank, and others.
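As a rough illustration of these categories (the exact feature definitions in the paper differ, and these names are my own), page-level features like word counts can be rolled up into domain-level averages:

```python
from collections import Counter
from statistics import mean
from urllib.parse import urlparse

def page_features(text):
    """Page-level features of the kind mentioned in the paper: number of
    words, and the frequency of the most common term on the page."""
    words = text.lower().split()
    top = Counter(words).most_common(1)
    return {
        "num_words": len(words),
        "top_term_frequency": top[0][1] / len(words) if words else 0.0,
    }

def domain_features(pages):
    """Domain-level features: averages of page-level features across all
    pages crawled from the same domain. `pages` maps URL -> page text."""
    by_domain = {}
    for url, text in pages.items():
        by_domain.setdefault(urlparse(url).netloc, []).append(page_features(text))
    return {
        domain: {
            "avg_num_words": mean(f["num_words"] for f in feats),
            "avg_top_term_frequency": mean(f["top_term_frequency"] for f in feats),
        }
        for domain, feats in by_domain.items()
    }

pages = {
    "http://example.com/a": "seo tips seo tricks",
    "http://example.com/b": "ranking pages with features",
}
feats = domain_features(pages)
```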

Human based judgments and changing features

For purposes of this experiment, approximately 500,000 human-based rankings were used for the results of 2800 queries on MSN.

They used these rankings to train fRank and then set it out to rank some pages. Details of the training methods and results are spelled out in the document and provide some interesting insights. One part of the training involved looking at how rankings changed as they added more features. Here’s one conclusion that might provide some room for thought:

For each URL in our train and test sets, we provided a feature to fRank which was how many times it had been visited by a toolbar user.

However, this feature was quite noisy and sparse, particularly for URLs with query parameters (e.g., http://search.msn.com/results.aspx?q=machine+learning&form=QBHP). One solution was to provide an additional feature which was the number of times any URL at the given domain was visited by a toolbar user.

Adding this feature dramatically improved the performance of fRank.

They note that a further refinement of that approach, involving the “hierarchical structure of URLs to construct many levels of backoff between the full URL and the domain,” increased accuracy even more (see the “click distance” patent application, below).
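One way to picture that hierarchical backoff (my own sketch, not the paper’s implementation) is to count each toolbar visit not only against the full URL, but against every ancestor path up to the bare domain, so that a sparse full-URL count can fall back to a denser prefix count:

```python
from collections import Counter
from urllib.parse import urlparse

def backoff_levels(url):
    """Yield progressively coarser keys for a URL: the full URL, then each
    ancestor path, ending at the bare domain - the "levels of backoff"
    between the full URL and the domain."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    yield url
    for i in range(len(parts) - 1, 0, -1):
        yield parsed.netloc + "/" + "/".join(parts[:i])
    yield parsed.netloc

def visit_counts(toolbar_visits):
    """Aggregate raw toolbar visit URLs into counts at every backoff level."""
    counts = Counter()
    for url in toolbar_visits:
        for level in backoff_levels(url):
            counts[level] += 1
    return counts

visits = [
    "http://example.com/a/page1.html",
    "http://example.com/a/page2.html",
    "http://example.com/b/page3.html",
]
counts = visit_counts(visits)
```

A ranker can then use the most specific level that has enough data, which is exactly the kind of smoothing that helps with noisy, sparse URLs such as ones carrying query parameters.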

Future work

The fRank approach presently uses only a small number of features, and the authors of the paper believe that they can improve its accuracy by adding more, possibly drawn from many other factors. A few are listed:

We believe we could achieve even more significant results with more features. In particular, the existence, or lack thereof, of certain words could prove very significant (for instance, “under construction” probably signifies a low-quality page).

Other features could include the number of images on a page, size of those images, number of layout elements (tables, divs, and spans), use of style sheets, conforming to W3C standards (like XHTML 1.0 Strict), background-color of a page, etc.
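Several of those candidate features are straightforward to compute with an ordinary HTML parser. The sketch below (my own, using Python’s standard library, not anything from the paper) counts images, layout elements, and linked style sheets:

```python
from html.parser import HTMLParser

class LayoutFeatureCounter(HTMLParser):
    """Count layout elements of the kinds the authors suggest as possible
    future features: images, tables, divs, spans, and style sheets."""
    def __init__(self):
        super().__init__()
        self.counts = {"img": 0, "table": 0, "div": 0, "span": 0, "stylesheet": 0}

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1
        elif tag == "link" and ("rel", "stylesheet") in attrs:
            self.counts["stylesheet"] += 1

page = """<html><head><link rel="stylesheet" href="s.css"></head>
<body><div><span>hi</span><img src="a.png"></div><table></table></body></html>"""
parser = LayoutFeatureCounter()
parser.feed(page)
```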

Other resources

At the end of Beyond PageRank are a number of citations to other papers. I’ve read through a few of them, and they seem like good followup material if you are interested in delving deeper into this topic. A couple of newer ones from Microsoft that I would recommend are these two:

Two newly published patent applications assigned to Microsoft also contain ideas that are incorporated into the Beyond PageRank paper:

System and method for incorporating anchor text into ranking search results

Inventors: Dmitriy Meyerzon, Stephen Edward Robertson, Hugo Zaragoza, and Michael J. Taylor (Cambridge, GB)
Assigned to Microsoft Corporation
US Patent Application 20060074871
Published April 6, 2006
Filed September 30, 2004


Search results of a search query on a network are ranked according to a scoring function that incorporates anchor text as a term.

The scoring function is adjusted so that the ranking of a target document reflects the use of the terms in the anchor text pointing at it. Initially, the properties associated with the anchor text are collected during a crawl of the network. A separate index is generated that includes an inverted list of the documents and the terms in the anchor text.

The index is then consulted in response to a query to calculate a document’s score. The score is then used to rank the documents and produce the query results.
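In outline (this is my own simplification, not the patent’s actual scoring function), the separate anchor-text index maps each term found in a link’s text to the documents the link points at, and a document’s score can then be boosted when query terms appear in anchors pointing to it:

```python
from collections import defaultdict

def build_anchor_index(links):
    """Build a separate inverted index from anchor-text terms to the target
    documents they point at. `links` is (source, anchor_text, target)."""
    index = defaultdict(set)
    for source, anchor_text, target in links:
        for term in anchor_text.lower().split():
            index[term].add(target)
    return index

def score_with_anchors(query, anchor_index, base_scores):
    """Hypothetical scoring tweak: boost a document's base score for each
    query term that appears in anchor text pointing at it."""
    scores = dict(base_scores)
    for term in query.lower().split():
        for target in anchor_index.get(term, ()):
            if target in scores:
                scores[target] += 1.0
    return scores

links = [
    ("a.html", "machine learning paper", "paper.html"),
    ("b.html", "great paper", "paper.html"),
    ("c.html", "home", "index.html"),
]
index = build_anchor_index(links)
scores = score_with_anchors("machine learning", index,
                            {"paper.html": 1.0, "index.html": 1.0})
```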

System and method for ranking search results using click distance

Inventors: Dmitriy Meyerzon, and Hugo Zaragoza
Assigned to Microsoft Corporation
United States Patent Application 20060074903
Published April 6, 2006
Filed September 30, 2004


Search results of a search query on a network are ranked according to an additional click distance property associated with each of the documents on the network.

The click distance is the measurement of the number of clicks or user navigations from a page or pages on the network designated as highest-authority or root pages.

The precision of the results is increased by the addition of the click distance term when the site or intranet where the search query takes place is hierarchically structured.
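Click distance as described here is a shortest-path measure over the site’s link graph, which a breadth-first search from the designated root pages computes directly (page names and graph shape below are my own illustration):

```python
from collections import deque

def click_distances(link_graph, roots):
    """Compute click distance: the minimum number of clicks from any
    designated root (highest-authority) page to each page, via BFS over
    the site's link graph. Unreachable pages get no entry."""
    dist = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        page = queue.popleft()
        for linked in link_graph.get(page, ()):
            if linked not in dist:
                dist[linked] = dist[page] + 1
                queue.append(linked)
    return dist

site = {
    "home": ["products", "about"],
    "products": ["widget"],
    "about": [],
    "widget": [],
}
dist = click_distances(site, roots=["home"])
```

Pages close to the root get a small click distance, which fits the patent’s observation that the measure works best on hierarchically structured sites and intranets.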


Microsoft regularly releases a large number of patent applications and research papers, and has a large number of teams working on different, but sometimes very related, topics. The research I’ve pointed at today is only a small slice of what they have going on behind the scenes.

It’s likely that the other major search engines, and others looking at search-related problems, are also striving to find more ways to rank documents that don’t rely as much upon link popularity, as measured by something like PageRank. Page quality and layout, site structure, and the words used in links are likely to be among the factors reviewed in these attempts to provide more relevant and meaningful results to searchers, as well as popularity measured through something like a toolbar.

I’ll be looking at some other recent Microsoft material very shortly which covers some other approaches to indexing and ranking pages.


14 thoughts on “Ranking Search Results Using Feature-Based Rankings”

  1. Great research, as usual. I really like their approach, even with the toolbar factor. The “click distance” is also very interesting.

  2. Thanks, Nadir.

    Nice to see someone publishing something about the on-page factors that they look at, and how they determine relevance based upon quality.

    As much as I’ve looked at MSN recently, I know I’ve only touched the surface of what they are doing.

  3. Pingback: RankNet Technology from MSN | inter:digital strategies
  4. Thank you, Jeff.

    These papers and patent applications from Microsoft include some interesting approaches. I’m excited to see them travel down some of these paths.

  5. You are a prince among men. An SEO poet. I couldn’t find anything that concisely cracks the MSN code and explains what I see happening in the MSN serps until I found your post. In my opinion, it is like the tortoise and the hare. Google should worry.

  6. I have been monitoring MSN for quite some time now, glad to see a post on it 🙂

  7. I was just saying I was glad to see a post on them. Not saying “finally a post” just glad to see one 🙂 easily misunderstood.

  8. Hi James,

    Sorry for the misinterpretation. 🙂

    I do look at patents every week from Google and Yahoo and Microsoft, and from a lot of other sources as well, when I’m deciding what to write about. I try not to favor one over the other, but rather focus upon what those patents describe, and it does seem like I end up choosing something from Google a little more frequently than from the others.

  9. Thanks, Bill, for pointing me to such a nice and interesting topic as RankNet. I just went through a few PDFs from Microsoft, and I loved the fRank idea versus PageRank; they really beat PageRank, but I wonder why people have almost forgotten it.

     Is Bing still using this RankNet technology? Are they taking the very same approach even today in 2012?

    I have so many things to ask, so I will try to write them as brief as I can so that you can easily answer them for me.

    Thanks again.

    now I am loving this SEO Field… seriously! I was looking for the real source of knowledge in SEO, and it all is available here at SEObythesea… 🙂

  10. Hi Asad,

You’re welcome. The RankNet approach is pretty interesting, and I think anyone who is serious about SEO should try to learn about it if they can.

     It’s possible that Bing is still using RankNet, but in a modified version. One of the more recent papers that I’ve read on RankNet is this one:

    From RankNet to LambdaRank to LambdaMART: An Overview (pdf)

    Here’s the abstract to the paper:

    LambdaMART is the boosted tree version of LambdaRank, which is based on RankNet. RankNet, LambdaRank, and LambdaMART have proven to be very successful algorithms for solving real world ranking problems: for example an ensemble of LambdaMART rankers won Track 1 of the 2010 Yahoo! Learning To Rank Challenge. The details of these algorithms are spread across several papers and reports, and so here we give a self-contained, detailed and complete description of them.

Comments are closed.