Google’s Reasonable Surfer: How the Value of a Link May Differ Based upon Link and Document Features and User Data
Not every link from a page in a link-based ranking system is equal, and a search engine might look at a wide range of factors to determine how much weight each link on a page may pass along.
One of the signals used by Google to rank web pages looks at the links to and from those pages, to see which pages are linked to by others. Links from “important” pages carry more weight than links from less important pages. An important page under this system is one that is linked to by other important pages, or by a large number of less important pages, or a combination of the two. This signal is known as PageRank, and it is only one of a large number of Google ranking signals used to rank web pages and determine how highly those pages show up in search results in response to a query from a searcher.
An early paper by the founders of Google, The Anatomy of a Large-Scale Hypertextual Web Search Engine, tells us:
PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back” but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank.
Under that approach, every link on the same page might carry the same amount of weight, or importance, when pointed to another page.
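To make the equal-split idea concrete, here is a minimal sketch of a random surfer calculation in Python. It is an illustration only, not Google's implementation: the function name, the toy link graph, the fixed number of iterations, and the way pages without outgoing links are handled are all assumptions. The damping value of 0.85 stands in for the surfer who "gets bored" and jumps to a random page.

```python
def random_surfer_rank(links, damping=0.85, iterations=50):
    """Toy PageRank where every link on a page gets an equal share.

    links: dict mapping a page to the list of pages it links to
           (a hypothetical in-memory link graph).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # The bored surfer jumps to a random page with probability 1 - damping.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Equal split: each outgoing link passes the same amount.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links sends the surfer anywhere.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = random_surfer_rank(toy_graph)
```

In this toy graph, page A splits its value evenly between B and C no matter where or how those two links appear on the page.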
A Google patent filed in 2004 and granted today takes a somewhat different approach to the value that links might have when they appear on the same page:
Systems and methods consistent with the principles of the invention may provide a reasonable surfer model that indicates that when a surfer accesses a document with a set of links, the surfer will follow some of the links with higher probability than others.
This reasonable surfer model reflects the fact that not all of the links associated with a document are equally likely to be followed. Examples of unlikely followed links may include “Terms of Service” links, banner advertisements, and links unrelated to the document.
The patent is:
Ranking documents based on user behavior and/or feature data
Invented by Jeffrey A. Dean, Corin Anderson and Alexis Battle
Assigned to Google Inc.
United States Patent 7,716,225
Granted May 11, 2010
Filed: June 17, 2004
A system generates a model based on feature data relating to different features of a link from a linking document to a linked document and user behavior data relating to navigational actions associated with the link. The system also assigns a rank to a document based on the model.
In this “reasonable surfer” model, not every link that appears upon a page is equal in value. Different features associated with links, and the pages they appear upon and point to, may determine how much value those links pass on to the pages to which they link.
Features of Links and Documents
Under this patent, when a search engine crawls and indexes pages on the Web, it may create a model to help rank those pages. The model looks at features associated with the source pages that links appear upon, the target pages that links point to, and the links themselves. The search engine may also collect data about how visitors to pages use those pages, such as which links they click upon, what query terms they use to find pages, and other information that could be collected from a web browser or an add-on to a browser, such as a toolbar.
The following lists provide examples of features, and not all features listed may be used, while other features could be considered as well.
Examples of features associated with a link might include:
- Font size of anchor text associated with the link;
- The position of the link (measured, for example, in an HTML list, in running text, above or below the first screenful viewed on an 800 x 600 browser display, side (top, bottom, left, right) of document, in a footer, in a sidebar, etc.);
- If the link is in a list, the position of the link in the list;
- Font color and/or other attributes of the link (e.g., italics, gray, same color as background, etc.);
- Number of words in anchor text of a link;
- Actual words in the anchor text of a link;
- How commercial the anchor text associated with a link might be;
- Type of link (e.g., text link, image link);
- If the link is an image link, what the aspect ratio of the image might be;
- The context of a few words before and/or after the link;
- A topical cluster with which the anchor text of the link is associated;
- Whether the link leads somewhere on the same host or domain;
- If the link leads to somewhere on the same domain, whether the link URL is shorter than the referring URL; and/or
- Whether the link URL embeds another URL (e.g., for server-side redirection)
Examples of features associated with a source document might include:
- The URL of the source document (or a portion of the URL of the source document);
- A web site associated with the source document;
- A number of links in the source document;
- The presence of other words in the source document;
- The presence of other words in a heading of the source document;
- A topical cluster with which the source document is associated; and/or
- A degree to which a topical cluster associated with the source document matches a topical cluster associated with anchor text of a link.
Examples of features associated with a target document might include:
- The URL of the target document (or a portion of the URL of the target document);
- A web site associated with the target document;
- Whether the URL of the target document is on the same host as the URL of the source document;
- Whether the URL of the target document is associated with the same domain as the URL of the source document;
- Words in the URL of the target document; and/or
- The length of the URL of the target document.
User behavior data associated with documents and links may also be considered, such as:
- Information about how people access and interact with documents, such as navigational actions (e.g., links selected, web addresses entered, forms completed, etc.),
- The language of the users,
- Interests of the users,
- Query terms entered,
- How often a link is selected,
- How often links aren’t selected when one link is chosen,
- How often no links are selected on a page.
This user behavior data could be obtained from a web browser or a browser assistant program such as Google’s Toolbar.
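As a rough illustration of how selection data like this might be summarized, the sketch below counts how often each link on a single page was chosen, and how often visitors left without choosing any link. The log format (one entry per page view, with None meaning no link was selected), the function name, and the sample link names are assumptions made for this example. The patent does not describe a specific data format.

```python
from collections import Counter


def selection_rates(page_views):
    """Share of page views in which each link was selected.

    page_views: list with one entry per view of a page, holding the
                identifier of the link that was clicked, or None when
                the visitor selected no link (hypothetical log format).
    """
    total = len(page_views)
    if total == 0:
        return {}
    counts = Counter(page_views)
    return {link: count / total for link, count in counts.items()}


# Four hypothetical views of one page.
views = ["pricing", "terms-of-service", None, "pricing"]
rates = selection_rates(views)
```

Rates like these are one possible input for a model that learns which link features tend to go along with links that actually get followed.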
How Features May Influence the Weight of a Link
This feature-based model is intended to estimate how likely it is that a link on a page will be selected, based upon positive and negative aspects of those features.
For example, a link with anchor text that is bigger than a certain size may have a higher probability of being selected than links with anchor text of a smaller size. Links positioned closer to the top of a page may also be more likely to be clicked upon. If the topic of the document being pointed to is related to the topic of the page the link appears upon, it may also have a higher probability of being selected by a visitor to the page. So, a link in a larger font, near the top of a page, leading to a page covering a similar topic as the page it appears upon may have a much higher probability of being chosen by a visitor than a link using smaller text, appearing at the bottom of a page, pointing to a page on an unrelated topic.
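The comparison above could be sketched as a simple scoring function. Everything specific here is an assumption made for illustration: the three features chosen, the numeric weights, the 12-pixel baseline, and the use of a logistic function to turn a score into a value between 0 and 1. The patent describes a model built from feature and user behavior data, but it does not publish weights or a formula like this one.

```python
import math
from dataclasses import dataclass


@dataclass
class LinkFeatures:
    font_size_px: float       # size of the anchor text
    position_fraction: float  # 0.0 = top of the page, 1.0 = bottom
    topic_match: float        # 0.0 = unrelated target, 1.0 = same topic


def selection_score(link):
    """Hypothetical estimate (0 to 1) of how likely a link is to be followed."""
    z = (
        0.15 * (link.font_size_px - 12.0)          # larger text scores higher
        + 2.0 * (0.5 - link.position_fraction)     # higher on the page scores higher
        + 2.5 * (link.topic_match - 0.5)           # related topics score higher
    )
    return 1.0 / (1.0 + math.exp(-z))


prominent = LinkFeatures(font_size_px=18, position_fraction=0.1, topic_match=0.9)
footer = LinkFeatures(font_size_px=9, position_fraction=0.95, topic_match=0.1)

prominent_score = selection_score(prominent)
footer_score = selection_score(footer)
```

With these made-up weights, the large, high, on-topic link scores well above the small, off-topic footer link, which matches the direction of the examples in the patent.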
The patent provides a number of other examples of rules that might be applied to different features to estimate how likely it is that a visitor will select each link on a page. Those probabilities are used to determine a dynamic weight for each of the links, which can influence how highly the pages they point to might rank in Google. The different weights for the links might determine how much PageRank each link passes along to other pages.
Or, as the patent filing tells us:
The rank of a document may be interpreted as the probability that a reasonable surfer will access the document after following a number of forward links.
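One way to picture that interpretation is a rank calculation in which each page's value is divided among its links in proportion to link weights, rather than evenly. The sketch below is an assumption-laden illustration, not the method claimed in the patent: the weights are invented, and the damping factor of 0.85 and the handling of pages with no links are borrowed from the usual random surfer formulation.

```python
def reasonable_surfer_rank(weighted_links, damping=0.85, iterations=50):
    """Toy rank calculation with unequal link weights.

    weighted_links: dict mapping a page to a dict of {target: weight},
                    where a weight is a hypothetical estimate of how
                    likely that link is to be followed.
    """
    pages = set(weighted_links) | {
        t for out in weighted_links.values() for t in out
    }
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            out = weighted_links.get(page, {})
            total = sum(out.values())
            if total > 0:
                # Each link passes value in proportion to its weight.
                for target, weight in out.items():
                    new_rank[target] += damping * rank[page] * weight / total
            else:
                # Assumption: no usable links means a jump to any page.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# A hub page with a prominent article link and a rarely followed terms link.
graph = {
    "hub": {"article": 0.9, "terms": 0.1},
    "article": {"hub": 1.0},
    "terms": {"hub": 1.0},
}
ranks = reasonable_surfer_rank(graph)
```

Here the article page ends up with a higher rank than the terms page, even though both are linked from the same hub page. If the two weights were equal, the two pages would rank the same.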
How much value might a link on a page pass along in a link-based ranking system like PageRank?
Under the patent filing granted today, the value of a link may be different based upon a large number of factors, such as where the link is located on a page, whether the link is a different color or font style than other links, how many words are used in the anchor text for the link, whether the link text used is commercial or not, what the topic of the page is that the link appears upon and the topic of the page pointed to by the link, and many others.
It’s likely that in the early days of Google, the search engine quickly moved past the 1999 description of PageRank in The PageRank Citation Ranking: Bringing Order to the Web, where the weight of a page was shown as split equally amongst the links pointing out from that page. This patent describes a number of approaches that Google may have used to weight the value of links differently, though it’s likely that the lists above provide more value as possible examples of how links might be weighted than as definitive guidelines.
It does offer one broad rule of thumb that might be helpful: the links on a page that a reasonable surfer is most likely to select are probably the links that carry the most weight.