Yahoo Replaces PageRank Assumptions with User Data

PageRank is an algorithm that measures the importance or quality of a Web document.

A search engine can use it in a number of ways: combining it with relevance factors to rank search results, deciding which web pages to crawl (pdf) and how frequently to crawl them, or determining which part of a database a document should be placed within.
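To make the underlying idea concrete, here is a minimal power-iteration sketch of the classic PageRank calculation. The four-page graph, page names, and damping value are invented for illustration; this is the textbook formulation, not any search engine's production implementation.

```python
# A tiny invented link graph: each page maps to the pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
pages = sorted(links)
n = len(pages)
damping = 0.85  # probability of following a link vs. "teleporting" at random

# Start with uniform rank and iterate: each page splits its rank
# evenly among its outbound links (the "random surfer" assumption).
ranks = {p: 1.0 / n for p in pages}
for _ in range(50):
    new = {p: (1 - damping) / n for p in pages}
    for p, outs in links.items():
        share = damping * ranks[p] / len(outs)
        for q in outs:
            new[q] += share
    ranks = new
```

After convergence the ranks sum to 1, page "c" (the most linked-to) scores highest, and page "d" (no inlinks) scores lowest, which is the behavior the article's later critiques take aim at.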

Search algorithms are based upon assumptions about how people use the Web, how they might search, what they might pay attention to, and what they might find important. That’s true of PageRank both in theory and in how it may be used in actual practice.

Challenging PageRank Assumptions

It’s good to see folks in the search community challenging some assumptions behind PageRank. A patent application from Yahoo, published last week, raises a number of issues, and it comes from people who know PageRank very well.

Here are some problems the inventors of the patent filing point to involving some basic assumptions about PageRank:

Not All Links are Equal — people don’t randomly choose links on pages that they visit – some links are more important than others, and some, like “disclaimer” links, are rarely followed at all.

The assumption that all the outgoing links in a Web page are followed by a random surfer uniformly randomly is unrealistic. In reality, links can be classified into different groups, some of which are followed rarely if at all (e.g., disclaimer links).

Such “internal links” are known to be less reliable and more self-promotional than “external links” yet are often weighted equally. Attempts to assign weights to links based on IR similarity measures have been made but are not widely used.

See, for example, The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank (pdf), M. Richardson and P. Domingos, Advances in Neural Information Processing Systems 14, MIT Press, 2002.
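One way to drop the uniform-following assumption is to split a page's outgoing transition probability according to observed clicks rather than evenly. The sketch below is only illustrative: the link targets and click counts are invented, and real systems would need smoothing and spam protection.

```python
def transition_probs(outlink_clicks):
    """Turn per-outlink click counts into transition probabilities.

    outlink_clicks: {target_page: observed_clicks}. A disclaimer link
    that almost nobody selects ends up with near-zero probability,
    instead of the uniform 1/n share classic PageRank would give it.
    """
    total = sum(outlink_clicks.values())
    if total == 0:
        # No user data at all: fall back to the uniform assumption.
        n = len(outlink_clicks)
        return {t: 1.0 / n for t in outlink_clicks}
    return {t: c / total for t, c in outlink_clicks.items()}

# Invented click data: users mostly follow the "article" link and
# almost never the "disclaimer" link on this hypothetical page.
probs = transition_probs({"article": 970, "about": 28, "disclaimer": 2})
```

Under uniform PageRank each of those three links would carry probability 1/3; with click weighting the disclaimer link contributes almost nothing.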

Bored Surfers Don’t Go to Random Pages — one of the assumptions of the PageRank formula is that sometimes, instead of following a link on a page, the “random surfer” will grow bored and just go anywhere else at random. The patent application notes that it is unrealistic to assume that most people using the web choose major portals and tiny home pages with an equal probability. When someone leaves a page to go somewhere else (a uniform teleportation jump to any random page under PageRank), the destination is unlikely to be a uniformly random page at all.
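This critique suggests replacing the uniform teleport jump with a distribution built from where real sessions actually start. The simulation below uses invented page names and session-start counts purely to show the mechanics:

```python
import random

# Hypothetical session-start counts: most real sessions begin at a
# major portal, very few at a tiny personal home page.
session_starts = {
    "portal.example": 880,
    "news.example": 100,
    "tiny-homepage.example": 20,
}

def teleport(starts, rng):
    """Pick a teleport landing page in proportion to observed
    session-start frequency, instead of uniformly at random."""
    pages = list(starts)
    weights = [starts[p] for p in pages]
    return rng.choices(pages, weights=weights, k=1)[0]

rng = random.Random(0)
landings = [teleport(session_starts, rng) for _ in range(10_000)]
portal_share = landings.count("portal.example") / len(landings)
```

With a uniform jump each page would get about a third of the teleports; here the portal absorbs roughly 88% of them, matching the observed behavior rather than the classic assumption.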

Bored Surfers Don’t Only Go to Trusted Pages — when that “random surfer” leaves instead of following links, it’s also unlikely that they will only go to a trusted set of pages or sites, under something like TrustRank (See, for example, Combating Web Spam with TrustRank – pdf). This assumption really has nothing to do with how people actually use the Web, but is instead retrofitted into PageRank to combat link spam instead of being “reflective of real-world user behavior.”

Pages Change and Lose Value at Different Rates — the PageRank process also ignores that pages are purchased and repurposed, or decay and become less valuable over time and do so at very different rates.

Sometimes PageRank Calculations Cheat — some uses of PageRank formulations in practice are “typically implemented with regard to aggregations of pages by site, host, or domain, also referred to as ‘blocked’ PageRank.” See Exploiting the Block Structure of the Web for Computing PageRank (pdf). This means that links between pages are somehow being aggregated to a block level. The patent application tells us that, “Unfortunately, most heuristics for performing this aggregation do not work well.”

User Sensitive PageRank Patent Application

I mentioned that the people behind the patent application know PageRank well. One of the most comprehensive and detailed documents I’ve seen on PageRank is A Survey on PageRank Computing, which was written by one of the named inventors on the following document. It’s also cited in the patent filing.

User-sensitive pagerank
Invented by Pavel Berkhin, Usama M. Fayyad, Prabhakar Raghavan, Andrew Tomkins
Assigned to Yahoo
US Patent Application 20080010281
Published January 10, 2008
Filed: June 22, 2006

Abstract

Techniques are described for generating an authority value of a first one of a plurality of documents. A first component of the authority value is generated with reference to outbound links associated with the first document. The outbound links enable access to a first subset of the plurality of documents.

A second component of the authority value is generated with reference to a second subset of the plurality of documents. Each of the second subset of documents represents a potential starting point for a user session.

A third component of the authority value is generated representing a likelihood that a user session initiated by any of a population of users will end with the first document.

The first, second, and third components of the authority value are combined to generate the authority value. At least one of the first, second, and third components of the authority value is computed with reference to user data relating to at least some of the outbound links and the second subset of documents.

The patent application adds elements of user behavior to the calculation of PageRank.

Link Weight — the weight or value of links can be influenced by actual “user data representing a frequency with which the corresponding outbound link was selected by a population of users.”

Likelihood of Randomly Leaving to a New Page — the chance that someone might leave (or teleport) to another page instead of following a link on a page is also influenced by user data.

Satisfaction with Found Pages — the probability that someone might stop, rather than visiting new pages by following links on the page they are on, is also calculated by looking at user data.

These three components can be used to create an “authority value” for a document on the Web.
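As a loose sketch of how such components might combine in one update step — emphatically not the patent's actual formulas — consider the toy calculation below. The pages, click-derived link weights, session-start distribution, and per-page stopping probabilities are all invented for illustration:

```python
pages = ["a", "b", "c"]

# Component 1: outbound-link weights from observed clicks (rows sum to 1).
click_probs = {
    "a": {"b": 0.7, "c": 0.3},
    "b": {"c": 1.0},
    "c": {"a": 1.0},
}
# Component 2: where user sessions tend to start (teleport distribution).
start_dist = {"a": 0.6, "b": 0.3, "c": 0.1}
# Component 3: probability that a session simply ends on each page.
stop_prob = {"a": 0.1, "b": 0.2, "c": 0.5}

def update(auth, damping=0.85):
    """One iteration: authority flows along click-weighted links from
    users who didn't stop, plus a session-start teleport contribution."""
    new = {}
    for q in pages:
        follow = sum(
            auth[p] * (1 - stop_prob[p]) * click_probs[p].get(q, 0.0)
            for p in pages
        )
        new[q] = damping * follow + (1 - damping) * start_dist[q]
    return new

auth = {p: 1.0 / len(pages) for p in pages}
for _ in range(100):
    auth = update(auth)
```

Note that, unlike classic PageRank, total authority here sums to less than 1, because the stopping probabilities deliberately let mass "leak" out as sessions end — one plausible reading of the third component.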

The importance of anchor text, and other text associated with a link, is also addressed in User Sensitive PageRank:

According to yet another embodiment, an authority value of a first one of a plurality of documents is generated.

Text associated with each of a plurality of inbound links enabling access to the first document is identified.

A weight is assigned to the text associated with each of the inbound links.

Each of the weights is derived with reference to user data representing a frequency with which the corresponding inbound link was selected by a population of users.

The authority value is generated with reference to the weights.
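The quoted embodiment can be sketched as weighting each anchor-text term by the share of clicks on the inbound links that carry it. The anchors and click counts below are invented, and real anchor-text processing would involve far more (stemming, stopwords, spam filtering):

```python
from collections import Counter

# Hypothetical inbound links to one document, with observed click counts.
inbound_links = [
    {"anchor": "pagerank survey", "clicks": 450},
    {"anchor": "pagerank survey", "clicks": 50},
    {"anchor": "click here", "clicks": 3},
]

def weighted_anchor_terms(links):
    """Each term's weight is the share of total clicks on the
    inbound links whose anchor text contains it."""
    total = sum(l["clicks"] for l in links)
    weights = Counter()
    for l in links:
        for term in l["anchor"].split():
            weights[term] += l["clicks"] / total
    return dict(weights)

weights = weighted_anchor_terms(inbound_links)
```

Here "pagerank" ends up with nearly all the weight, while "here" contributes almost nothing, because users rarely select the link carrying it — rather than every anchor counting equally.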

The Role of User Data

User data incorporated into this algorithm should “reflect the behavior and/or demographics of an underlying user population.” It’s actual real user data reflecting the way that people browse pages. User Sensitive PageRank can reflect “the navigational behavior of the user population with regard to documents, pages, sites, and domains visited, and links selected.”

Other Implications of a User Sensitive PageRank

The patent application describes a number of different mathematical formulations to calculate this User Sensitive PageRank. I’m not going to delve deeply into those. It also addresses some other interesting implications:

User Segment Personalized PageRank — user data from different demographic profiles (based upon age, gender, income, user location, user behavior, etc.) could be specified, so that search results could be different for people from those different demographics. This could be used with other approaches to personalized PageRank, like a Topic Sensitive PageRank.

People Visit Blocks — user behavior based upon visiting and browsing blocks (sites, hosts, or domains) may be helpful in understanding how people go from one block to another block, and augment a block level PageRank approach based solely upon links between those blocks.

How the Passage of Time Can Affect PageRank — PageRank should be updated regularly because the links between pages on the Web change over time. Pages that might be considered core pages can also change in significance, or go out of fashion even though the links to and from those pages haven’t changed. Incorporating user data into PageRank means that recent events can be emphasized, and older events discounted.
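One standard way to emphasize recent events and discount older ones is exponential decay of the observations. This is my illustration of the general technique, not anything specified in the filing; the half-life and click data are invented:

```python
import math

def decayed_clicks(observations, now_day, half_life_days=30.0):
    """Sum click observations with exponential decay, so a click
    observed half_life_days ago counts half as much as one today.

    observations: list of (day_observed, clicks) pairs.
    """
    rate = math.log(2) / half_life_days
    return sum(c * math.exp(-rate * (now_day - day)) for day, c in observations)

# 100 clicks observed today vs. 100 clicks observed 90 days ago
# (three half-lives): the stale signal is worth only 1/8 as much.
fresh = decayed_clicks([(100, 100)], now_day=100)
stale = decayed_clicks([(10, 100)], now_day=100)
```

Feeding decayed rather than raw counts into the link weights would let a page falling out of fashion lose authority even when its link structure hasn't changed.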

Choosing Pages to Crawl — PageRank can be used in determining whether to crawl and follow links associated with a page. The addition of user data in PageRank may make choosing easier.

Beyond PageRank to Analysis of Text Associated with Links — anchor text can be “one of the most useful features used in ranking retrieved Web search results.” The importance of anchor text (and related text) can be associated with user behavior scores much like the importance of link weights can vary in User Sensitive PageRank.

Conclusion

PageRank, in most of the different formulations that have been described in patent filings and papers, focuses upon links published upon the Web, and makes a number of assumptions about how people visit, browse, and use documents attached to those links.

User Sensitive PageRank attempts to replace some of those assumptions with actual user data about how people do travel to and use Web documents.

Highly recommended: David Harry digs pretty deeply into this patent application too, in Yahoo, Page Rank and Teleportation Oh My! and offers a view of this document from a different perspective. David pulls out a number of fascinating aspects of the document that I didn’t, like “The Web Garbage Collection Utility,” and explores the user data aspects of the patent filing.


52 thoughts on “Yahoo Replaces PageRank Assumptions with User Data”

  1. Well it’s all fairly obvious stuff really, I’m really surprised that Yahoo managed to patent this, I would have thought that Google did this years ago.

    After All,

    If Google hasn’t, then they must have performed tests and found that this wasn’t effective for some reason – or perhaps the iterative part takes too long to terminate.

    I can think of several ways to improve the behaviour of this, and I’m surprised that they haven’t been tried out – it seems Yahoo is just switching to another first-order assumption on probabilities.

  2. A great find, Bill, and of fundamental importance I believe. I don’t think any method that doesn’t include user actions will correctly separate the ‘ham’ from the spam. I’m sure Google must be doing some of this as well, even if it keeps PageRank out there for marketing differentiation reasons.

  3. Hi Bill,

    Thanks for unearthing this gem. The concepts discussed here are very similar to what I proposed on my October post.

    http://hamletbatista.com/2007/10/29/pagerank-caught-in-the-paid-link-crossfire/

    The need for an intelligent surfer

    In their paper, Google’s founders claim that PageRank tries to model actual user behavior, yet they model a random surfer. Regular users do not click on links at random! They do not click on irrelevant links, invisible links or links that are not in the main content area. They follow links that pique their interest. Most of the time they ignore ads, too.

    A couple of researchers from the University of Washington came up with an improved PageRank model they call the intelligent surfer. Their basic idea is that the transition probabilities are weighted according to the similarity of the pages at both ends of the link. The end result is that links between pages that are related receive more weight and irrelevant paid links on a page would not count as they do now.

    I like this intelligent surfer concept, but a simpler and more practical approach occurred to me. Instead of using the similarity between linked pages as the probabilities that will determine where the surfer jumps to, I’d use the click-through data of each page as measured by the Google toolbar.

    Think about it. Click-through data reflects real user behavior. Links with higher click-throughs would have greater weight because users click on what really interests them. On the other hand, the links that are not relevant or not visible to the user will receive few clicks or no clicks at all. Google can also measure the bounce rates of users clicking on links and reduce the click-through rates accordingly. For example, if I place a prominent link on my blog to a questionable site, many of my readers will follow it, but they will hit the back button if it is not as good as I recommended or if they felt they were tricked.

    And all this data is already available to Google via their toolbar!

    Cheers

  4. “recent events can be emphasized, and older events discounted.”

    One of the biggest weaknesses of PageRank is that there is no “voting period.” If the presidential election were run the way Google counts its “votes”, John McCain would not start at 0 votes (the votes he received in his previous presidential run would carry over). That would give him an unfair advantage against someone who has never run for president before, like Obama.

    Yet that’s how Google counts votes for websites. An old, outdated, less relevant website with millions of “votes” racked up over the last 7+ years will tower over a new, higher-quality, more relevant site with 0 backlinks. Googlers will claim that if you build something like MySpace, you can catch up pretty quick.

    But if we have site A and site B of equal quality but site A has 50,000,000 backlinks while site B has 10, site A will outrank site B. How is that fair? Both sites are equally valuable.

    The solution is to devalue links older than X days. That will open a window of opportunity for mom and pop sites competing against giants like amazon, about.com, and wikipedia.

  5. Ho Hum. IR researchers have rarely found that PageRank is helpful in building web search engines. How you handle partial phrase matching is more important.

    So far as the cat-n-mouse game you guys are playing with them, Google uses machine learning functions to find ranking heuristics, and there are all kinds of inputs — what they’ve published about PR is about a decade old and almost certainly obsolete. What you need to do is make your site match their profile of what a good site is, or at least avoid the profile of a bad site.

  6. Hi Tim,

    Something like this patent filing always seems more obvious in retrospect, but we have to keep in mind that it was originally filed in June of 2006 and work upon it probably started much longer before that.

    Pavel Berkhin’s Survey on PageRank is a pretty impressive document, and if you haven’t had a chance to look it over, it’s worth taking at least a peek at. I think it helps define how much work actually went into the patent filing, especially since it’s referred to within the document.

    One thing that we also have to keep in mind with patent filings is that when they explore the processes that they might use, and describe them, they don’t have to describe everything – rather just enough to make the processes defined appear new, nonobvious, and useful.

  7. Hamlet,

    Enjoyed your post very much. I like that the inventors of this patent filing also referred to the Intelligent Surfer paper. This was an interesting quote from your post:

    Essentially this means that in order to use this algorithm Google needs to increase its datacenter capacity by a couple of hundred times. It is definitely far cheaper and easier to scare the search crowd and continue the paid-link campaign propaganda.

    Would handling their data in other ways make a difference, too?

  8. Halfdeck,

    Some excellent points. I do think that there are older pages where the value of the links pointing towards them should still retain the same weight. If user information were to show that people used both pages equally – the one with more but older backlinks, and the one with fewer but newer backlinks – would it make sense to treat them similarly?

    I do like the phrase in the patent filing which I didn’t use, the “Web garbage collection utility.” Seems reasonable that every internet should have one.

  9. Hi Barry,

    Thanks.

    I don’t think any method that doesn’t include user actions will correctly separate the ‘ham’ from the spam. I’m sure Google must be doing some of this as well, even if it keeps PageRank out there for marketing differentiation reasons

    Looking at user behavior makes a lot of sense from the perspective of fighting spam. I think that the search engines having additional signals to look at can have a big impact, in a positive manner.

    The actual commercial implementation of PageRank likely has changed a great deal since the early days of Google. We’ve seen a lot of patent applications and papers that point at possible variations, and the Survey on PageRank Computing paper from Pavel Berkhin does a great job of exploring some of the possibilities.

  10. Masked Bandit

    Thanks for your comments. Your point about partial phrase matching is an excellent one. Have you been looking over my shoulder while I work?

    There’s no one playing any cat-and-mouse games with Google here. There’s a lot of respect and an interest in them working to increase both precision and recall in search results, and eliminating spam in those results, too.

    Yes, Google does work toward programmatic approaches to do what they do. Stanford’s PageRank has seen almost a decade of lip service, with whatever commercial installations of the algorithm that exist likely matured and modified in many ways. Again, Pavel Berkhin’s Survey is worth a look if you haven’t seen it before, at least as a starting point.

    The word Profile does show up in a number of patent filings and papers from the search engines, as well as language about a number of different statistical models. Profiles for websites, profiles for queries, profiles for searchers. There are some good sides to that, and some bad ones, but it’s interesting to watch it unfold.

    Appreciate your stopping by.

  11. Bill, I’ve been waiting to read your summary of this “teleportation” patent. This is one of the most interesting patents to come out of Yahoo in a while, IMHO. Thanks.

  12. You’re welcome, Jordan.

    I liked the dual TrustRank ones from Yahoo, too. A little surprised to see the critique here of TrustRank, but it’s a good point.

    Thanks.

  13. That is really interesting. And yeah, internal links (99% of pages will be interlinked) should be given less weight, and it’s a good decision to follow user responsiveness for a link.

  14. Even though a lot of this was over my head, trying to understand and comprehend the entire search world, I found a lot of value in this. Being merely an infant in search marketing, it’s good to find such information. I am still at the “I just don’t get it” stage: with all the backlinking scams, it does not seem right to me that you can continue to see these sites rank well, and yet it is touted that it’s not right. But they are not penalized.

  15. Great PageRank info. With the whole random surfing thing, where does StumbleUpon fit into this – surely there is a random element to it? PageRank seems a bit of a minefield, and I yearn for the day when I hit a PR6 or 7 :)

  16. Seems obvious, but glad to see work going on in this area. The point for me is that this is another signal they could use to improve relevance and context. I doubt there will ever be a single algorithm that provides a foolproof solution.

  17. Pingback: Yahoo Finds Fault with Google’s Secret Sauce | NoGray SEO
  18. Pingback: Seo Блог :: :: January :: 2008
  19. Hi Vishnu,

    It’s tempting to think that most search engines now give less weight to internal links within a site than to ones from outside.

    I think, to be fair to the original PageRank patent, it’s probably important to point out this passage from that document:

    The present method of ranking documents in a database can also be useful for estimating the amount of attention any document receives on the web since it models human behavior when surfing the web. Estimating the importance of each backlink to a page can be useful for many purposes including site design, business arrangements with the backlinkers, and marketing. The effect of potential changes to the hypertext structure can be evaluated by adding them to the link structure and recomputing the ranking.

    Real usage data, when available, can be used as a starting point for the model and as the distribution for the alpha factor.*

    This can allow this ranking model to fill holes in the usage data, and provide a more accurate or comprehensive picture.

    Thus, although this method of ranking does not necessarily match the actual traffic, it nevertheless measures the degree of exposure a document has throughout the web.

    * My emphasis

  20. Hi Anthony,

    That kind of incestuous linking is an issue that this might help to uncover. You might be interested in this post of mine from November: Google Patent on Web Spam, Doorway Pages, and Manipulative Articles.

    Here’s a snippet from it that discusses that kind of interlinking:

    One part of this process might be for the search engine to identify clusters of documents that may be related to each other somehow, such as being on the same web host, or being interlinked at doorway pages and articles targeted by those doorway pages.

    For clusters that are identified, signals that manipulation is happening on pages within those clusters are explored, and an overall signal for the cluster may be determined. Signals can exist within documents contained in a cluster, and from documents outside of the cluster pointing into it:

  21. Hi Sue,

    Glad to hear that you got some value out of this post (and welcome to search marketing). Stop by my friend Kim’s blog if you haven’t seen it before: Learning SEO Basics. Her focus is on people who are fairly new to internet marketing.

    Google does try to handle spam and backlinking scams programmatically if they can, so that they can address many pages at a time, instead of one at a time. So, sometimes people may get away with some things for a while.

  22. Hi PageRanker,

    StumbleUpon is interesting in that it lets people explore pages randomly, or at least somewhat randomly. But it also lets people create annotations for pages, and that might influence where people visit.

    I think that one of the original conceptions behind what led to PageRank was an annotation system. A snippet from that link:

    It wasn’t that we intended to build a search engine. We built a ranking system to deal with annotations. We wanted to annotate the web – build a system so that after you’d viewed a page you could click and see what smart comments other people had about it.

  23. Hi Nilhan,

    Agree completely. More signals that provide value are positive. One question might be, given so much user data collected, what is there in it that is useful as a signal? What provides value?

  24. Hi Bill,

    Something like this patent filing always seems more obvious in retrospect, but we have to keep in mind that it was originally filed in June of 2006 and work upon it probably started much longer before that.

    I understand your point, but from my point of view, I remember reading the original PageRank paper when it was released, and what was obvious then was that we are trying to model the probability of a user clicking on a particular link. As soon as spiders began to recognise text size etc., I just assumed that they were doing that to fit a better distribution of “probability of following a link”. When the Google toolbar came out, I assumed they were using that to provide feedback for a learning algorithm for this.

    I just can’t believe that someone has not done (internal) research on this before elsewhere.

    As I stated, I have several ideas for dramatic improvements on this, but I won’t mention them now in case they are patentable (I would have assumed Google would have done them early last year, but perhaps not if this is anything to go by :-) )

  25. Pingback: » Pandia Weekend Wrap-up Jan 20
  26. Hi,

    Thanks for this great article. Nice to see what Yahoo is trying to achieve. And it seems that they are getting closer to Google…

  27. Hello William,

    Great to read your article on PageRank on Google; it was most interesting. Regards, Marianne, G3 creative design, Scotland.

    G3 Creative

  28. I think if proven to work well, this could be the way of the future – or perhaps just another attempt to try something different that may or may not work?

    Does anyone think that Yahoo’s system is less effective than Google’s collection of data?

    Lately, though, I am beginning to think that optimised search terms find better results in Yahoo, and are much quicker to rank for than in Google.

  29. Hi Global SEO,

    It’s interesting to see the search engines paying more attention to user-based activity, and I like how the researchers involved in this patent filing gave us their thoughts on a lot of how PageRank works.

    Google has also been showing signs of paying more attention to how people use web pages, and we see that in things like the site links that they show under some of the top ranked results for certain queries, which indicate that they are paying attention to pages that people visit on those sites.

    I’d hate to say that one search engine is more effective than another… :)

  30. As we say, the first impression is the last impression; in the same way, I think a PageRank gives the first impression of a site. But in some cases I feel bad that even if a site is good, if it does not have a PageRank it is not given much weight. With Google doing updates more often now and getting tougher day by day, PageRank has become even more important.

  31. Hi Eva,

    It’s easy to confuse the PageRank that you see in the Google toolbar with the PageRank that a site actually has. The toolbar is only updated like 3 or 4 times a year. But I do agree that people sometimes pay that toolbar ranking more attention than they really should.

    Google does seem to be looking at more signals than just PageRank in the way that they order search results these days. It may still have value, but there are other factors that play a role in how a page ranks in those search results. This patent filing from Yahoo explores how user data may be one such factor.

  32. Good post, and I hope the PageRank concept goes out of the world like a thin arrow. It has been screwing people over a lot till now.

  33. As an outsider to the search engine world it is fascinating to see the kind of thought and process management that goes into this, mirroring in a linguistic and behaviourial way what we do in our area of industrial automation and process design. Remarkable similarities between our tangible processes and a world that seems much less tangible and more ethereal than ours.

    It’s also interesting that the point is made about paid links being a big no-no for Google, yet they allow and encourage AdSense themselves, and put a lot of effort into analysing the optimisation and behaviour associated with those links. Surely if PageRank is based on user data rather than some overly complex algorithm, then people will be free to buy and sell links without being traumatised by Google for trying to make a legitimate living?

  34. Hi Mark,

    In a number of ways, the search engines do try to emulate human judgment in the ways that they attempt to rank web pages. The processes may not always work as well as they should, but it’s interesting to see them incorporate more and more actual user-behavior data into their calculations.

    Google isn’t necessarily against paid links as much as they are against paid links for the purpose of manipulating search results and buying PageRank. Their “prohibition” against paid links involves links that don’t disclose in some way (such as with the use of a rel=“nofollow” value in a link).

    This post is about Yahoo finding ways to do something similar to PageRank by using user-behavior data, though it’s quite likely that Google will also find ways to do that as well.

  35. Hi William,

    It’s pretty much impossible to tell exactly how Google is using PageRank these days.

    If you look at the book, Google’s PageRank and Beyond, by Amy N. Langville and Carl D. Meyer, they list a lot of different possible variations of PageRank.

    Another paper that provides a detailed look at different variations of PageRank is Pavel Berkhin’s paper A Survey on PageRank Computing

    Taher Haveliwala, the inventor of Topic Sensitive PageRank, did join Google after Google acquired Kaltix Corp., which Haveliwala was a co-founder of. He worked there for around 5 years as well, so there’s a possibility that Google may have tried out Topic Sensitive PageRank or something like it.

    At this point, I wouldn’t be completely surprised if the version of PageRank that Google is running is closest to the one that I described in my post on the Reasonable Surfer

  36. Hi,

    I didn’t really know where to post this question, but I hope that someone can give me an answer here.

    Is Google PageRank still 100% the PageRank of incoming links, or has Google integrated “Topic Sensitive PageRank” with the old PageRank? Or are they both used, but as different ranking signals? In other words, is link relevancy a part of Google PageRank?

    I’m writing about PageRank and have not found an answer for this question anywhere else. Thank you in advance!

Comments are closed.