Link Analysis, Web Spam, and the Cautious Surfer

The Somewhat Random Surfer

In the patent Method for node ranking in a linked database, inventor Lawrence Page introduces the idea of a Random Surfer, who follows links from page to page, to describe a method of ranking Web pages.

Looked at another way, the importance of a page is directly related to the steady-state probability that a random web surfer ends up at the page after following a large number of links. Because there is a larger probability that a surfer will end up at an important page than at an unimportant page, this method of ranking pages assigns higher ranks to the more important pages.
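That steady-state probability can be computed by simulating the surfer's behavior until the ranks settle. Here's a minimal sketch in plain Python, using an invented four-page toy graph (the page names and link structure are made up for illustration; the damping factor of 0.85 is the commonly cited value, not something specified in the quote above):

```python
# Hypothetical toy web graph: links[page] -> pages that page links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
n = len(pages)
d = 0.85  # damping: probability of following a link vs. jumping at random

# Start with equal probability of being on any page, then repeatedly
# apply one surfing step (follow a link with probability d, otherwise
# jump to a random page) until the distribution stabilizes.
rank = {p: 1.0 / n for p in pages}
for _ in range(100):
    new = {p: (1 - d) / n for p in pages}
    for src, outs in links.items():
        share = d * rank[src] / len(outs)  # link mass split evenly over outlinks
        for dst in outs:
            new[dst] += share
    rank = new

print(sorted(rank, key=rank.get, reverse=True))  # page C, with the most inlinks, ranks first
```

Page C, which three of the four pages link to, ends up with the highest steady-state probability — the "important" page, in the patent's terms.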

There’s some fun stuff in that patent, filed in 1998, that doesn’t get a lot of attention.

For instance, the following factors are noted as possibilities that could affect the value of a link from a source page, increasing the probability that a surfer following the link would end up at an important page:

  • Whether the links are from different domains
  • If the links are located at different servers
  • Which institutions and authors maintain the links
  • Where those institutions and authors are located geographically
  • Whether the links are highly visible near the top of a document
  • If the links are found in web locations such as the root page of a domain
  • If the links are in large fonts, or are emphasized in other ways
  • If the pages that the links are upon have been modified recently

That doesn’t sound completely random, does it? But the idea behind this ranking method was to:

…be useful for estimating the amount of attention any document receives on the web since it models human behavior when surfing the web.

It also aims to avoid the possibility of “artificially inflated relevance” in pages that might be under the control of just one web site designer, thus the inclusion of different domains and different servers in that list.

The Cautious Surfer

A poster presented at WWW2007, A Cautious Surfer for PageRank, adds a twist, noting that the concepts behind PageRank and the Random Surfer model were based upon the assumption that the content and links of a page can be trusted, and that they are trusted equally.

The authors of the paper introduce the idea of a Cautious Surfer:

Unlike the random surfer described in the PageRank model, this cautious surfer carefully attempts to not let untrustworthy pages influence its behavior.

The paper uses the TrustRank (pdf) score of the page a surfer is currently on to determine how likely that surfer is to follow a link from the page, or to instead jump randomly to another page. The higher a page's TrustRank score, the more likely a surfer is to follow one of its links. The random jump is also biased, so that the surfer is more likely to land on a more trustworthy page.
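As a rough sketch of that idea (not the paper's exact formulation — the graph and trust scores below are invented, and real TrustRank values would come from a separate seed-based computation), the cautious surfer can be modeled by scaling the link-following probability on each page by that page's trust, and distributing the random-jump mass in proportion to trust:

```python
# Hypothetical toy graph and made-up TrustRank-like scores in [0, 1].
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A"]}
trust = {"A": 0.9, "B": 0.8, "C": 0.1, "D": 0.7}

pages = list(links)
total_trust = sum(trust.values())
jump = {p: trust[p] / total_trust for p in pages}  # trust-biased random jump

rank = {p: 1.0 / len(pages) for p in pages}
for _ in range(100):
    new = {p: 0.0 for p in pages}
    for src, outs in links.items():
        # Cautious: follow links out of src only in proportion to src's trust.
        follow = 0.85 * trust[src]
        for dst in outs:
            new[dst] += follow * rank[src] / len(outs)
        # The rest of the probability mass jumps, biased toward trusted pages.
        leak = (1 - follow) * rank[src]
        for p in pages:
            new[p] += leak * jump[p]
    rank = new

print(rank)
```

The effect is that a low-trust page like C passes along far less of its rank through its links than an equally linked high-trust page would — which is the "carefully attempts to not let untrustworthy pages influence its behavior" part of the quote above.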

The authors note that they could use estimates of trust other than TrustRank in this process. They also point out that the idea behind this approach isn't to use TrustRank to determine the authority of a page, but rather to continue using PageRank, influenced by estimations of trust, to avoid web spam.

So why not just rely upon TrustRank, since it can have the effect of demoting spam on its own? Here's the answer that we are given:

However, the goal of a search engine is to find good quality results; “spam-free” is a necessary but not sufficient condition for high quality. If we use a trust-based algorithm alone to simply replace PageRank for ranking purposes, some good quality pages will be unfairly demoted and replaced, for example, by pages within the trusted seed sets, even though they may be much less authoritative.

Considered from another angle, such trust-based algorithms propagate trust through paths originating from the seed set; as a result, some good quality pages may get low value if they are not well connected to those seeds.


6 thoughts on “Link Analysis, Web Spam, and the Cautious Surfer”

  1. The document goes a long way toward revealing just how naive he was in 1998, to believe some of the nonsense that is presented in the patent.

    That’s what happens when people construct things without taking reality into consideration. But he was tremendously lucky in that Google’s relevance scoring proved to be so good for so long in spite of PageRank.

  2. Hi Michael,

    There are a lot of aspects to that original patent that were forward thinking (or maybe wishful thinking). :)

    They were able to do enough of what was in that patent to stand out though.

  3. @Bill what would you say were the aspects that were luck, and which were forward thinking….

    Might make for a retrospective post about that patent? I love “antique” web stories….:)

  4. Hi JC,

    Good questions. Thanks.

    Using links as part of the ranking signal for web pages was a breakthrough in many ways because it took advantage of one of the unique aspects of the Web – people do link to things that they find interesting and useful.

    But, as helpful as that was, there are at least a couple of potential problems with those assumptions behind PageRank.

One of them is the assumption that when someone links to a page, they are endorsing it – the PageRank algorithm can’t distinguish between a positive and a negative sentiment associated with a link. A link could as easily be associated with a warning or a negative review or statement as it could be an endorsement.

    Another is that pages and sites that focus upon highly specialized fields of knowledge often will never attract the backlinks that more mainstream pages and sites do.

    For example, if you want highly technical information about black holes, and you search in Google’s web search, your search results will be biased towards more mainstream science sites that have attracted more quality links because of their broad coverage of topics.

    You’re right, a retrospective post may be a great idea.

  5. Thanks Bill – indeed, not all links should be treated equally. I often wonder if any links can be treated as bad? I say this because from an SEO pov, surely defaming a competitor would be easier than ranking a good site?

  6. Hi JC,

There have been some approaches trying to understand when links might exist primarily to increase rankings, like Yahoo’s link-based TrustRank approach, but when it comes to considering sentiment, I don’t think that a ranking algorithm should consider whether a page is associated with praise or with criticism.
