PageRank, Self-Serving Links, and Domain Trust

Two Microsoft patent applications, one from 2005 and a new one published last week, explore what they call a vulnerability in PageRank and propose a couple of solutions.

Here’s the problem, as they state it:

One way to increase the PageRank score of a web page v is by having many other pages link to it. This is inherent in the basic idea of web pages being able to endorse other web pages, which is at the heart of PageRank. If all of the pages that link to web page v have low PageRank scores, each individual page will contribute only very little.

However, since every page is guaranteed to have a minimum PageRank score of d/|V|, links from many such low quality pages can still contribute a sizable total.

It may not be completely obvious why that’s a problem. They go on to explain:

In practice, this vulnerability of PageRank is being exploited by web sites that contain a very large set of pages whose only purpose is to “endorse” a main home page.

This “home” page does not have to be on the same server, but can be a home page (or any page) of some other server. Typically, these endorsing pages contain a link to the page that is to be endorsed, and another link to another endorsing page. All the endorsing pages are created on the fly.

A web crawler, once it has stumbled across any of the endorsing pages, continues to download more endorsing pages (because of the fact that endorsing pages link to other endorsing pages), thereby accumulating a large number of endorsing pages.

This large number of endorsing pages, all of them endorsing a single page, artificially inflates the PageRank score of the page that is being endorsed.
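To get a feel for the size of that effect, here is a rough back-of-the-envelope sketch in Python. It assumes the common form of PageRank, R(v) = d/|V| + (1 - d) * sum(R(u) / outdegree(u)), where d is the random-jump probability, and it assumes each endorsing page has exactly two links, as in the description above. The function name, the value d = 0.15, and the page counts are all made up for illustration; none of them come from the patent filings.

```python
def farm_contribution_floor(total_pages, farm_pages, d=0.15, links_per_farm_page=2):
    """Lower bound on the score a link farm passes to the page it endorses.

    Assumes every page scores at least d / |V| (the guaranteed minimum) and
    that each endorsing page splits its score evenly across its links.
    total_pages is |V| and includes the endorsing pages themselves.
    """
    min_score = d / total_pages
    # Each endorsing page passes on (1 - d) of its score, divided among its links.
    per_page = (1 - d) * min_score / links_per_farm_page
    return farm_pages * per_page


if __name__ == "__main__":
    total = 1_000_000  # hypothetical size of the crawled web graph
    for farm in (10, 1_000, 100_000):
        floor = farm_contribution_floor(total, farm)
        # Second column: the floor expressed as a multiple of the minimum score.
        print(farm, floor / (0.15 / total))
```

Under these assumptions the floor grows in direct proportion to the number of endorsing pages, which is why pages generated on the fly are so attractive to a spammer.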

The patent filings provide a couple of potential solutions. The first is to split a minimum PageRank value amongst all of the indexed pages of a domain or an IP address rather than add more PageRank as new pages are created. The second is to assign a domain-based trust rank, based upon the web servers a domain is hosted on and the other domains that link to it.

If implemented, a system like this could mean that the number of domains sharing your site's server affects your rankings and the value of the links from your pages.

Here are the patent applications:

Systems and methods for ranking documents based upon structurally interrelated information (20050060297)
Published March 17, 2005; filed September 16, 2003

Abstract

Systems and methods for ranking Web pages based on hyperlink information in a manner that is resistant to nepotistic links are provided. In one embodiment, a Web search service is provided for returning quality query results. The vulnerability of existing ranking algorithms, such as PageRank, to Web pages that are artificially generated for the sole purpose of inflating the score of target page(s) is addressed. Intuitively, it is recognized that it is less likely to reach a particular page on a Web server having many pages via a random jump than it is to reach a particular page on a Web server having few pages, which implies that the influence of such a page upon another page by linking to, or endorsing, the other page is diminished. Thus, in various non-limiting embodiments, each Web server, not each Web page, is assigned a guaranteed minimum score. This minimum score assigned to a server can then be divided among all the pages on that Web server.
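The idea in this abstract can be tried out on a toy graph. The sketch below is not the method from the filing; it is a plain power-iteration PageRank in Python where the random-jump distribution is either uniform per page or, following the abstract, uniform per server and then divided among that server's pages. The graph, the host names, d = 0.15, and every function name are hypothetical, and the code assumes every page has at least one outgoing link.

```python
def pagerank(links, jump, d=0.15, iterations=200):
    """Power iteration with an arbitrary random-jump distribution.

    links: page -> list of pages it links to (every page is a key and has
           at least one out-link; dangling pages are not handled here).
    jump:  page -> probability that a random jump lands there (sums to 1).
    A page's guaranteed minimum score is d * jump[page].
    """
    rank = dict(jump)
    for _ in range(iterations):
        new = {page: d * jump[page] for page in links}
        for page, outs in links.items():
            share = (1 - d) * rank[page] / len(outs)
            for target in outs:
                new[target] += share
        rank = new
    return rank


def uniform_jump(server_of):
    """Classic PageRank: every page gets the same minimum score."""
    return {page: 1.0 / len(server_of) for page in server_of}


def per_server_jump(server_of):
    """Every server gets the same minimum, divided among its pages."""
    pages_on = {}
    for page, server in server_of.items():
        pages_on.setdefault(server, []).append(page)
    per_server = 1.0 / len(pages_on)
    return {page: per_server / len(pages_on[server])
            for page, server in server_of.items()}


def farm(farm_size):
    """Two ordinary pages on one host, plus a target and its link farm on another."""
    links = {"a": ["b"], "b": ["a"], "target": ["a"]}
    server_of = {"a": "honest-host", "b": "honest-host", "target": "spam-host"}
    for i in range(farm_size):
        name = f"farm{i}"
        # Each endorsing page links to the target and to the next endorsing page.
        links[name] = ["target", f"farm{(i + 1) % farm_size}"]
        server_of[name] = "spam-host"
    return links, server_of


def target_score(farm_size, make_jump):
    links, server_of = farm(farm_size)
    return pagerank(links, make_jump(server_of))["target"]


if __name__ == "__main__":
    for size in (5, 500):
        print(size, target_score(size, uniform_jump), target_score(size, per_server_jump))
```

In this toy setup, adding endorsing pages raises the target's score under the per-page minimum, but not under the per-server minimum, because the new pages only divide the same server allowance into smaller pieces.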

Domain-based spam-resistant ranking (20070067282)
Published March 22, 2007; filed September 20, 2005

Abstract

A domain-based spam-resistant ranking architecture that computes trust in a domain based on web-servers on which a domain is hosted and a set of other domains that link to the domain. The ranks of pages are computed based on how much trust there is in each domain and which pages link to it. Web documents are ranked in a spam-resistant manner by assigning uniform significance to each IP address of a network location and then assigning trust values to domains hosted on those IP addresses. Then, based on a domain graph, the invention constructs a domain-rank which is an estimate of how authoritative the domain is. The domain ranks are then used to assign a minimum rank to each document.
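The first and last steps in this abstract can be sketched in a few lines of Python. This is only a guess at one way they might work: the abstract does not say how an IP address's significance is divided among its domains or how a domain's rank becomes a per-document minimum, so the even splits below are assumptions, as are the function names, the IP addresses, and the domains. The domain-graph step that turns trust into a domain rank is not modeled here.

```python
from collections import defaultdict


def domain_trust(hosting):
    """Give each IP address equal significance and split it among its domains.

    hosting: IP address -> list of domains served from that address.
    The even split among domains is an assumption, not taken from the filing.
    """
    trust = defaultdict(float)
    ip_weight = 1.0 / len(hosting)
    for ip, domains in hosting.items():
        for domain in domains:
            trust[domain] += ip_weight / len(domains)
    return dict(trust)


def document_minimums(domain_rank, page_counts):
    """Divide each domain's rank evenly among its pages (also an assumption)."""
    return {domain: domain_rank[domain] / page_counts[domain]
            for domain in domain_rank}


if __name__ == "__main__":
    # Hypothetical hosting data: one site on its own IP, 100 throwaway domains sharing another.
    hosting = {
        "192.0.2.1": ["news.example"],
        "192.0.2.2": [f"spam{i}.example" for i in range(100)],
    }
    trust = domain_trust(hosting)
    print(trust["news.example"], trust["spam0.example"])

    # Trust stands in for domain rank here, since the domain graph is not modeled.
    page_counts = {domain: 10 for domain in trust}
    print(document_minimums(trust, page_counts)["spam0.example"])
```

Under these assumptions, registering many domains on one IP address does not buy any extra starting trust; the domains just share what that address already had.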


17 thoughts on “PageRank, Self-Serving Links, and Domain Trust”

  1. Neither of their proposed solutions would halt or even impede the manipulation of PageRank very much.

    The sooner search engines abandon the whole concept of PageRank, the sooner the quality of their search results — where PageRank has any real impact on them — will improve dramatically.

    But even sooner they should abandon the practice of passing link anchor text and just gauge relevance on the basis of on-page content. They are now far better able to distinguish between hidden and visible text than they were years ago.

  2. Hi Michael,

    I think that the search engines probably are better at understanding the quality of a page by looking at the page itself these days, and whatever role PageRank might have is likely reduced. Do you think that any of the citation models for importance from outside of a site still have much merit?

  3. IMHO citation models still have merit, and are constantly being redeveloped and thought of in new ways! Google are definitely the ones who take this most seriously though, and I reckon it is possible to see where they are going with it…

    Authorship trust with Google’s AgentRank patent application will be able to contribute another dimension to the whole issue of how valuable a link is and exactly who is making the citation, and would probably easily mix into the PageRank score.

    Also, if I remember right, Google was doing something with DNS which would mean it’s much more efficient for them to check IPs and look for any suspicious networks, much like what is described in the second abstract.

    Two possibilities, total conjecture but I reckon it shows some of the constant effort going into attempting to evolve the idea of the citation model, and its long-term future.

  4. Good examples. I think that there is still value in looking at factors that are external to a page to determine something about its quality.

    It is possible that Google is looking for suspicious networks based upon IP.

    Agent Rank is an interesting approach, but I think it would likely need the development and widespread adoption of a digital signature type system for it to work most effectively. (patent application, summary overview) I can see the value in it though.

  5. The basic concept here is distinguishing “internal” links, which aren’t valuable for ranking, from “external” links. The notion of “internal” can be broadened beyond “same domain” to at least “same IP address” and “same server”.

    One would like to get to “same ownership”; if the same person or organization controls multiple sites, links between them are internal links. We’re working on technology to do that.

  6. Hi John,

    The solution might be best focused upon finding a measure of importance that doesn’t place as much value upon links, but rather looks at other quality metrics that aren’t as easily subject to attack.

  7. Interesting Bill (yeah call me PageRank obsessed :D)

    Google already solved this problem by requiring a minimum PageRank threshold for indexing that is above the minimum PageRank score. A 1,000,000-page site with even a decent amount of inbound PageRank will split that PageRank into a million pieces, so that a large portion of the site doesn’t meet the minimum PageRank threshold. To add fuel to the fire, Google is more aggressively evaluating and devaluing links, so that the PageRank passed by manipulative links (footer links, blog comment spam, forum link injection, excessive reciprocal links, paid links, free directory links, crosslinks, etc.) is reduced. Those two mechanisms guarantee that a large spam site will go supplemental, as Matt Cutts said during SMX Seattle: a PageRank X site can only have Y number of pages in Google’s index.

  8. P.S. Yes, I know PageRank is attributed to a page, not a site :) When I say a PageRank X site, it’s shorthand for a site that has a total of X inbound PageRank, calculated by summing up all the inbound PageRank flowing into a domain.

  9. You’re PageRank obsessed, Halfdeck. :)

    Thanks for that statement from Matt. It sounds like he is saying that a domain-based PageRank is one of the importance metrics determining how deeply a site gets crawled.

  10. I’d like to know what “self-serving links” are, as it is mentioned in the title but not in the body of the post.
    Regards.

  11. Hi Charles,

    I found that phrase early on in the first of the patent applications that I linked to:

    This invention relates to the ranking of documents based upon structurally interrelated information. More particularly, this invention relates to the ranking of Web pages based upon hyperlink information in a manner that is resistant to nepotistic, or self-serving, links.

    In other words, links that appear to have been created (on the same domain or at the same IP address or server) only to increase the link popularity (or PageRank) score of the pages of a site are considered “self-serving” links.

    In their introduction of the problem that their patent application is supposed to solve, they go into more detail on what they consider self-serving links:

    [0014] Thus, while the basic idea is sound, the results of PageRank are subject to interference introduced by nepotistic links, i.e., a family of pages can be created for the purpose of self-endorsement and promotion without consideration of the real merit of the endorser or the endorsee. While it is known that the problem of link spam exists with respect to PageRank scores, a solution has eluded the art.

    [0015] Accordingly, an improved query-independent link-based ranking algorithm is desired. More particularly, improved ranking systems and methods are desired that significantly reduce the effect(s) of nepotistic links. Furthermore, improved ranking systems and methods are desired that reduce a link spammer’s incentive to create a family of self-endorsing Web pages for the purpose of artificially inflating PageRank scores associated with target Web page endorsee(s).

  12. Hi Enrique,

    Google is still using PageRank. They just aren’t reporting Toolbar PageRank values for pages on a site in the Google Webmaster Tools anymore.

  13. There’s some good insight here… and conflicting views. Is PageRank important? I used to work for an SEO firm, and I would say it definitely is something you should keep an eye on along with other factors. Funny to still see people on different sides.

  14. Hi Pat,

    Good question. PageRank is still important at this point, but we have to wonder what role it might play in the future. At this point, Google has an exclusive license to use PageRank from Stanford – the holder of the patent on PageRank. But that license is set to expire in 2011.

    Does this mean that Bing might start using PageRank sometime next year? Chances are that if they do, it will be a PageRank that looks different than the one written about originally over 10 years ago. There are a number of patent filings from Microsoft that describe other ways that PageRank might be used, like this one.

  15. I think that Google is trying to become more of a social search engine. With their new experiment Google +1, which launched recently, it sounds like they are trying to move away from PageRank and toward user recommendations.

  16. Hi Ray,

    Google’s been looking at social signals for a while now, and seems to be paying more attention to user-behavior data while ranking web pages.

    PageRank itself may be around for a few more years, but the number of other signals that Google and the other search engines are looking at has been growing. The new +1 feature might influence rankings in the future, but we do have to ask how likely it is that such a signal might be manipulated and gamed by the people who use it.
