Trust and the Internet: Web Search Spam

Trust is a topic that has a profound effect on the way search engines work on the web.

How easy or difficult is it to come up with methods that don’t rely (much) on human judgment to identify spam-free pages that can be trusted? And how hard is it to locate pages that are intended solely to rank well in search engines without providing any value for visitors, except possibly ads on the topic of their search?

In a week, there will be a gathering in Edinburgh, Scotland, during the 15th Annual World Wide Web Conference, on the subject of Models of Trust for the Web. While I won’t be attending, it sounds interesting, and I wanted to take a look at some of the papers written by presenters at the conference. In this post, I’ll be looking at one of the papers to be presented, and listing some of the other work by its authors.

Problems with Yahoo’s Trustrank Assumptions

One of the presentations during the “Models of Trust for the Web” gathering is Propagating Trust and Distrust to Demote Web Spam, by Baoning Wu, Vinay Goel, and Brian Davison of Lehigh University.

This paper takes a critical look at the Trustrank approach that Yahoo has developed, which can be seen in a recent patent application from the company, Link Based Spam Detection, and in a paper on Trustrank from 2004.

The concept of Trustrank could be said to run on two assumptions. One is that good sites tend to point mostly at good sites. The second is that the more links a “good” page has to other sites, the less care went into selecting those sites, and the less likely they are to be checked on a regular basis.

The first assumption means that a seed set of good sites can be used as a starting point for finding and indexing other good sites. The second means that the trust passed along from a “good” seed site is divided among all of the links on that site, so each linked site receives only a share. The Lehigh paper challenges those assumptions:

This assumption is open to argument. Why should two equally trusted pages propagate different trust scores to their children just because one made more recommendations than the other? Also, with respect to the accumulation of trust scores from multiple parents, TrustRank puts forth just one solution, that of simple summation. Clearly, there are other alternatives.
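To make the two choices in that passage concrete, here is a minimal sketch in Python. It is not the code from the Trustrank paper or the Lehigh paper. The function name `propagate_trust`, the `split` and `accumulate` options, the decay value of 0.85, and the tiny example graph are all assumptions made for illustration. It only shows how splitting trust among outlinks, and summing trust from multiple parents, could each be swapped for an alternative.

```python
from typing import Dict, Iterable, List

Graph = Dict[str, List[str]]  # page -> pages it links to (hypothetical format)


def propagate_trust(graph: Graph, seeds: Iterable[str], split: bool = True,
                    accumulate: str = "sum", decay: float = 0.85,
                    steps: int = 3) -> Dict[str, float]:
    """Push trust from seed pages along outlinks for a fixed number of steps."""
    seed_set = set(seeds)
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    trust = {p: (1.0 if p in seed_set else 0.0) for p in pages}
    for _ in range(steps):
        incoming: Dict[str, List[float]] = {p: [] for p in pages}
        for source, targets in graph.items():
            if not targets:
                continue
            share = decay * trust[source]
            if split:
                # Trustrank-style assumption: more outlinks, less trust per link.
                share /= len(targets)
            for target in targets:
                incoming[target].append(share)
        updated = {}
        for page in pages:
            if page in seed_set:
                updated[page] = 1.0  # seeds keep full trust
            elif not incoming[page]:
                updated[page] = 0.0
            elif accumulate == "max":
                updated[page] = max(incoming[page])  # alternative: best parent only
            else:
                updated[page] = sum(incoming[page])  # simple summation
        trust = updated
    return trust


# Hypothetical example: "hub" recommends four pages, "picky" only two.
example = {"hub": ["a", "b", "c", "both"], "picky": ["e", "both"]}
with_split = propagate_trust(example, ["hub", "picky"])
no_split_max = propagate_trust(example, ["hub", "picky"],
                               split=False, accumulate="max")
```

With splitting turned on, page "a" ends up with less trust than page "e" purely because "hub" made more recommendations than "picky", which is the behavior the authors question. With splitting turned off, both receive the same score.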


In general, spam pages can be considered to be one type of untrustworthy pages. To elaborate on this idea, consider that a page links to another page and hence according to the above definition of trust, this page expresses trust towards the target page. But if this target page is known to be a spam page, then clearly the trust judgment of the source page is not valid. The source page needs to be penalized for trusting an untrustworthy page. It is likely that the source page itself is a spam page, or is a page that we believe should not be ranked highly for its negligence in linking to an untrustworthy page.

In addition to looking at alternative ways to propagate trust across the web, the paper also considers how to identify and propagate distrust, and how to use scores from both approaches to reduce the amount of web spam indexed in search engines.
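As a rough illustration of the distrust idea, the sketch below starts from a set of known spam pages and pushes distrust backwards, to the pages that link to them. The names `propagate_distrust` and `combined_score`, the decay value, the "worst link" rule, and the weighted difference used to combine the two scores are all assumptions made for this example, not details confirmed by the paper.

```python
from typing import Dict, Iterable, List

Graph = Dict[str, List[str]]  # page -> pages it links to (hypothetical format)


def propagate_distrust(graph: Graph, spam_seeds: Iterable[str],
                       decay: float = 0.85, steps: int = 3) -> Dict[str, float]:
    """Pass distrust from known spam pages back to the pages that link to them."""
    spam = set(spam_seeds)
    pages = set(graph) | {t for ts in graph.values() for t in ts} | spam
    distrust = {p: (1.0 if p in spam else 0.0) for p in pages}
    for _ in range(steps):
        updated = {}
        for page in pages:
            if page in spam:
                updated[page] = 1.0  # known spam keeps full distrust
                continue
            targets = graph.get(page, [])
            # Assumed rule: a page is penalized by the worst page it links to.
            worst = max((distrust[t] for t in targets), default=0.0)
            updated[page] = decay * worst
        distrust = updated
    return distrust


def combined_score(trust: Dict[str, float], distrust: Dict[str, float],
                   beta: float = 1.0) -> Dict[str, float]:
    """Assumed combination: trust minus a weighted distrust penalty."""
    pages = set(trust) | set(distrust)
    return {p: trust.get(p, 0.0) - beta * distrust.get(p, 0.0) for p in pages}


# Hypothetical example: "blog" links to a known link farm, "fan" links to "blog".
links = {"blog": ["farm"], "fan": ["blog"], "news": ["wiki"]}
distrust_scores = propagate_distrust(links, ["farm"])
scores = combined_score({"blog": 0.5, "news": 0.5}, distrust_scores)
```

In this toy graph, "blog" and "news" start with the same trust, but "blog" drops below "news" once its link to the spam page is taken into account, and "fan" picks up a smaller penalty for linking to "blog".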

Other works by the authors

Here are five other recent papers from Brian Davison and Baoning Wu which also look at web spam.

Also to be presented at the 15th Annual World Wide Web Conference:

Detecting Semantic Cloaking on the Web

Topical TrustRank: Using Topicality to Combat Web Spam (with Vinay Goel)

From the 21st ACM Symposium on Applied Computing, in April 2006:

Undue Influence: Eliminating the Impact of Link Plagiarism on Web Search Rankings

From the 14th International World Wide Web Conference:

Identifying Link Farm Spam Pages

From the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), held at WWW 2005:

Cloaking and Redirection: A Preliminary Study

3 thoughts on “Trust and the Internet: Web Search Spam”

  1. Lik Mui in his PhD thesis defined the “Ghandi or Christ Question”

    “Based on our models in this chapter, one can raise the following question:
    “Would Mahatma Ghandi (or Jesus Christ) get a lower reputation because he tends
    to err on the side of cooperation even when they ‘should’ defect?”
    The underlying concern for this question is about the mechanism for reciprocity.
    The questioner has in mind reciprocity in the form of a globally defined tit-for-tat strategy based on an action space with two actions.
    In the context of our models, this question is flawed.”

    Very interesting eh?

  2. Hi Paolo,

    It looks like you’ve found a way to fill up some of my spare time with a very interesting looking paper from Lik Mui. I found it online: Computational Models of Trust and Reputation: Agents, Evolutionary Games, and Social Networks (pdf).

    I’ve also subscribed to the RSS feed for your blog, which seems to do a very nice job of covering trust issues. I wish I were able to attend this conference in Edinburgh this week. It looks like they are going to be covering some great topics.

    The Ghandi or Christ Question does pose a challenge for our assumptions based upon linking and reciprocity.

    The Einstein Problem, mentioned in the paper, is also interesting. While we may agree that Einstein is an authority figure, we aren’t in a position to evaluate his theories and work. Yet in an embedded social network, where a link counts as a vote, we would be voting for him with a link, even if his theory didn’t work.

    Thanks very much for your thoughts on this topic.

  3. Pingback: SEOPittfall » Blog Archive » Spam Defined - Search Engine Optimization by pittfall!
