When web sites link to each other, either directly or indirectly through a number of different pages, a search engine might consider those links to be reciprocal links. If you’re familiar with some of the mythology and folklore surrounding search engine optimization, you may have read or heard that reciprocal links are bad, and that search engines don’t like them.
The truth is more complicated than that.
What about blogs that link to each other on every page in their blog rolls? Or links between sites owned by the same owner that are reasonable, such as a storefront on a different domain, or a blog at a different domain or subdomain associated with a site, or a group of sites from the same company or organization that focus upon different topics?
What about sites that cover similar topics or provide complementary goods or services and find that it’s helpful to link to each other for the benefit of their visitors?
What do search engines think of resource pages, where sites include pages of links and descriptions to other sites that they think their visitors might find helpful and useful? What happens if some of those sites link back? Does it make a difference if those resource pages include a statement on them that they will list your site on their page in exchange for a reciprocal link back?
Search Engine Warnings on Links Between Pages
The major commercial search engines do provide some information about linking in their guidelines:
Google’s page on Link Schemes warns site owners that some kinds of linking might negatively impact the rankings of their web sites, including:
- Links intended to manipulate PageRank
- Links to web spammers or bad neighborhoods on the web
- Excessive reciprocal links or excessive link exchanging (“Link to me and I’ll link to you.”)
- Buying or selling links that pass PageRank
Yahoo, in their Search Content Quality Guidelines, provides examples of content that they don’t want included in their search engine, such as:
- Sites cross-linked excessively with other sites to inflate a site’s apparent popularity (link schemes)
Windows Live Help, in its page on Guidelines for successful indexing, includes among its list of “techniques that might prevent your website from appearing in Live Search results” the following:
- Using techniques, such as link farms, to artificially increase the number of links to your webpage.
How helpful are these guidelines to most searchers or webmasters or bloggers?
Chances are that some percentage of the people who use Google, or who have their websites indexed by the search engine, are familiar with PageRank, but they may not know what these guidelines mean by “link schemes” or “link farms”.
Why are search engines so concerned about links between pages?
Classifications for Search Ranking Signals
When you perform a search at a search engine, the pages that appear in response to your search are ranked and ordered based upon a large number of signals that the search engine uses to try to provide you with pages that best match what you intended to find on the Web.
That kind of ranking is a challenge for search engines because there can often be many thousands or millions of pages that might contain the words that you used to perform your search. They want to try to provide the best pages that they can at the top of the results, or at least better pages than the other search engines are showing.
These different signals that a search engine might use to determine the order of pages in search results could be classified in a few different ways.
Content Based, Link Based, and User Behavior Based Ranking Signals
One set of classifications consists of breaking those signals into three different types: content based, link based, and user behavior based.
Content based signals look at the actual content that appears upon the pages of a web site. Link based signals pay attention to the links between your site and other sites on the web. User behavior based signals look at data that indicates how people might react to the pages of your site, whether they are viewing the site directly, or seeing it in search results at a search engine.
Query Dependent and Query Independent Ranking Signals
Another way that search engines might classify the signals that they use to rank pages depends upon whether or not a signal is related to the query used to search. This way of classifying those signals breaks them down into two different groupings – how important they might consider a page to be, and how relevant a page might be to a specific search term or phrase.
Signals that look at the importance, or “quality,” of a page might consider the quality of the content of a page, the number and perceived importance of links to that page, or how people use the page, such as bookmarking it, spending time on it, or annotating it in some manner that isn’t tied to a specific query. These kinds of signals are often referred to by search engines as query independent signals, because they don’t rely upon a query that might have been used to find that page.
Signals that look at the relevance of a page might consider how relevant that page might be to a specific query term or phrase, what words appear in links pointing to the page and in the words surrounding and associated with those links, and how people use the page in ways tied to a specific query, such as clicking on a link to the page when it appears in search results for a specific term or phrase, or spending a certain amount of time on the page after a search brings them to it. These kinds of signals are often referred to by search engines as query dependent signals, because they do rely upon a specific query used to find a page.
Mixing Signals and Reordering Page Rankings
A search engine can use a mix of a good number of signals to determine in which order it might show pages to searchers in response to a search. It might also take those ordered results and reorder them before presenting them to searchers based upon other factors involving those pages, such as which country the searcher might be from, which language they have indicated they prefer to see results in, and many others.
I’ve written about how and why a search engine might reorder search results a number of times, including the following two posts:
- 20 Ways Search Engines May Rerank Search Results
- 20 More Ways that Search Engines May Rerank Search Results
Links Between Pages as a Ranking Signal
While there are many different kinds of signals that a search engine might use to rank web pages in response to a search, one of the important differences between web pages and documents in a collection on an intranet is that web pages can link to each other with hyperlinks. Those links can help a search engine identify which pages might be the most important ones, if it pays attention to those links. The premise behind using links as references to other pages comes from thinking about citations in academic papers and how they refer to other resources.
When someone writes an academic paper that will be reviewed by their peers, they will often include a list of citations to other academic papers as sources of references or data relied upon in their paper. It might be assumed that an academic paper that is referred to frequently by other papers is important. It might also be assumed that papers referred to by “important” papers are also important, even if they aren’t referred to by lots of other academic papers themselves.
Those assumptions about citations in academic papers are among the influences behind PageRank, which takes advantage of the hyperlinks between pages on the Web to determine which pages are important.
Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page’s importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page.
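The idea of splitting a page’s vote among its outlinks can be shown in a minimal sketch. This is an illustration of the normalization described above, not Google’s actual implementation; the damping factor and iteration count are assumptions for the example.

```python
# A minimal PageRank sketch (illustrative only, not any search engine's code).
# Each page splits its vote evenly among its outlinks; the damping factor
# models a random surfer who occasionally jumps to any page at random.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            # normalize by the number of links on the page
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# Page C is linked to by both A and B, so it accumulates the most rank.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))
```

Note that a link from A counts for less toward each target than the link from B does, because A divides its vote between two outlinks while B casts its whole vote for C.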
While links between pages on the Web might be helpful, search engines are also concerned and suspicious about links between web pages.
There are site owners who have worked to take advantage of links between pages to make their pages look more important than they actually might be. Their primary focus hasn’t been to share links that provide value to people who visit their sites, or transparently connect to other sites that might be under their ownership or control, or link to pages that they value based upon the content of those pages. Instead, they link solely to manipulate link based ranking signals to try to get their sites to rank more highly for search results.
Yahoo’s Patent Application on Excessive Reciprocal Links
A newly published patent application from Yahoo discusses how the search engine might examine links between pages to identify reciprocal links, and attempt to determine whether those links exist to manipulate search results. The patent filing is:
Identifying excessively reciprocal links among web entities
Invented by Timothy M. Converse, Priyank Shankar Garg, and Konstantinos Tsioutsiouliklis
Assigned to Yahoo
US Patent Application 20090013033
Published January 8, 2009
Filed July 6, 2007
A method for identifying reciprocal links is provided. At a particular host, the set of hosts which link to the particular host and the set of hosts to which the particular host links are determined. The intersection and union of the two sets of hosts are also determined, and the sizes of the intersection and union are calculated.
The concentration of reciprocal links at the particular host is calculated based on the sizes of the intersection and union. A ratio of the intersection size to the union size is used to determine the concentration of reciprocal links. The particular host’s rank in a list of ranked search results may be changed as a result of identification of a high concentration of reciprocal links.
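The intersection-over-union ratio described in the abstract can be sketched in a few lines. This is my reading of the patent filing, not Yahoo’s implementation; the host names are made up, and the patent does not specify what threshold would count as “high.”

```python
# Sketch of the reciprocal-link concentration from the patent abstract:
# compare the set of hosts linking TO a host with the set of hosts it
# links OUT to, and take the ratio of intersection size to union size.

def reciprocal_concentration(inlinks, outlinks):
    """inlinks, outlinks: sets of host names. Returns a ratio in [0, 1]."""
    union = inlinks | outlinks
    if not union:
        return 0.0
    return len(inlinks & outlinks) / len(union)

inlinks = {"a.com", "b.com", "c.com"}
outlinks = {"b.com", "c.com", "d.com"}
# Two of the four distinct hosts link both ways: 2 / 4 = 0.5
print(reciprocal_concentration(inlinks, outlinks))  # 0.5
```

This ratio is what is known elsewhere as a Jaccard similarity: a host that only links out, or only receives links, scores 0, while a host whose link partners all link back scores 1.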
Related Yahoo Patent Filings Involving Linking
This patent filing on excessive reciprocal links notes that it is related to a couple of other patent filings from Yahoo.
One method that search engines can use to keep an eye on who is linking to whom is the creation of something known as a link graph or web graph. A link graph is a representation of the web that views a web page as a node, and links between pages as edges, or lines between those nodes. The Exceptional Changes in Webgraph Snapshots patent application looks for changes to that link graph over time to try to identify suspicious activity. The abstract from the filing tells us:
Techniques are provided through which “suspicious” web pages may be identified automatically. A “suspicious” web page possesses characteristics that indicate some manipulation to artificially inflate the position of the web page within ranked search results.
Web pages may be represented as nodes within a graph. Links between web pages may be represented as directed edges between the nodes. “Snapshots” of the current state of a network of interlinked web pages may be automatically generated at different times. In the time interval between snapshots, the state of the network may change.
By comparing an earlier snapshot to a later snapshot, such changes can be identified. Extreme changes, which are deemed to vary significantly from the normal range of expected changes, can be detected automatically. Web pages relative to which these extreme changes have occurred may be marked as suspicious web pages which may merit further investigation or action.
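A rough sketch of that snapshot comparison might look like the following. The abstract doesn’t specify the change measure, so using inlink-count growth and a 10x threshold are my assumptions for illustration, as are the host names.

```python
# Illustrative snapshot comparison (assumed detail: inlink-count deltas;
# the patent abstract does not specify the exact change measure used).

def suspicious_hosts(snapshot_old, snapshot_new, threshold=10.0):
    """Each snapshot maps a host to its inlink count at that time.
    Flags hosts whose inlink count grew by more than `threshold` times."""
    flagged = []
    for host, new_count in snapshot_new.items():
        # treat hosts unseen in the old snapshot as having 1 inlink
        old_count = snapshot_old.get(host, 1)
        if new_count / old_count > threshold:
            flagged.append(host)
    return flagged

old = {"example.com": 120, "shop.example": 40, "new-site.tld": 2}
new = {"example.com": 130, "shop.example": 45, "new-site.tld": 500}
print(suspicious_hosts(old, new))  # ['new-site.tld']
```

Modest, organic growth passes unnoticed, while a site that jumps from 2 inlinks to 500 between snapshots gets marked for further investigation.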
The other patent filing is one that pays attention to links from sites that it has already identified as “suspicious,” Link-Based Spam Detection. A snippet from that one:
In this section, the concepts of a spam farm, inlink page ranking (commonly referred to as “PageRank”), and trust-ranking are described.
A spam farm is an artificially created set of pages that point to a spam target page to boost its significance. Trust-ranking (“TrustRank”) is a form of PageRank with a special teleportation (i.e., jumps) to a subset of high-quality pages.
Using techniques described herein, a search engine can automatically find bad pages (web spam pages) and more specifically, find those web spam pages created to boost their significance through the creation of artificial spam farms (collections of referencing pages). In specific embodiments, a PageRank process with uniform teleportation and a trust-ranking process are carried out and their results are compared as part of a test of the “spam-ness” of a page or a collection of pages.
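The comparison of uniform teleportation against trusted-seed teleportation can be sketched as follows. This is my illustration of the idea, not the patent’s code; the graph, the seed choice, and the scores are made-up assumptions.

```python
# TrustRank-style sketch: identical to PageRank except that the random
# jump can land only on a trusted seed set. Pages whose PageRank far
# exceeds their trust-based score are spam-farm candidates.

def rank(links, teleport_set, damping=0.85, iterations=50):
    """links: dict of page -> list of outlinked pages.
    teleport_set: pages the random jump may land on."""
    pages = list(links)
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    jump = 1.0 / len(teleport_set)
    for _ in range(iterations):
        new = {p: (1.0 - damping) * (jump if p in teleport_set else 0.0)
               for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * scores[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        scores = new
    return scores

graph = {"trusted": ["normal"], "normal": ["trusted"],
         "farm1": ["spam"], "farm2": ["spam"], "spam": ["farm1", "farm2"]}
pagerank = rank(graph, teleport_set=set(graph))    # uniform teleportation
trustrank = rank(graph, teleport_set={"trusted"})  # jumps only to seeds
# "spam" earns PageRank from its farm but almost no trust-based score.
print(pagerank["spam"], trustrank["spam"])
```

The gap between the two scores is the tell: the spam farm can manufacture links to itself, but it cannot manufacture a path from the trusted seeds.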
While these other two patent filings focus upon links between pages, they don’t look at how excessively pages or domains might link among themselves directly, or indirectly through a number of pages or domains, as this newly published patent application does.
It’s quite possible that the processes described in all three of these patent filings, as well as a number of others, might be used together to try to keep the use of linking as a ranking signal from being abused.
Reciprocal Links and “Suspicious Entities”
As I mentioned above, you may have read or heard that reciprocal links are bad, and that the truth of the matter is more complicated than that.
The Yahoo patent filing gives us their definition of a reciprocal link:
A web page contains a “reciprocal link” when one of its “outlinks” is also one of its “inlinks”. That is, a reciprocal link exists when a web page links to another web page which also links back to the web page.
So, a reciprocal link exists whenever two sites link back and forth to each other.
The patent also tells us that it will consider circular links to be reciprocal links. For example, a page from site A points to site B, a page from site B points to site C, and a page from site C points to site A.
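Spotting that kind of indirect reciprocity amounts to asking whether following links from a site eventually leads back to it. Here is a small sketch of that check; the three-hop limit is my assumption, since the patent filing doesn’t fix how long an indirect chain it considers.

```python
# Sketch of detecting circular (indirect reciprocal) links: follow the
# chain of outlinks from a host and see whether it returns to the start
# within a few hops. The hop limit is an assumption for illustration.

def links_back(graph, start, max_hops=3):
    """graph: dict mapping each host to the set of hosts it links to.
    True if some path of at most max_hops links returns to `start`."""
    frontier = {start}
    for _ in range(max_hops):
        # expand one hop: every host reachable from the current frontier
        frontier = set().union(*(graph.get(h, set()) for h in frontier))
        if start in frontier:
            return True
    return False

graph = {"A": {"B"}, "B": {"C"}, "C": {"A"}}
print(links_back(graph, "A"))  # True: A -> B -> C -> A
```

A direct reciprocal link is just the two-hop case (A points to B and B points back to A), so the same check covers both situations described in the patent filing.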
If the links between pages (or domains or hosts) are a small percentage of the links on each page or domain or host, the process described in this patent filing may not kick in. I say “kick in” because this is an automated process rather than a manual review at this point.
If the percentage of links is larger than that, a number of steps might be taken by the search engine.
The sites might be reviewed manually by “human investigators” or they might be examined by a program from the search engine that has been trained to look for signals of suspicious activity.
The patent application does tell us that pages or domains or hosts might have a high percentage of reciprocal links for legitimate reasons:
For example, a particular web page may have many reciprocal links with a group of web pages because these web pages discuss the same subject matter in a complementary fashion and the web page authors have found it expedient for those web pages to refer to each other.
In another example, two groups of web pages refer to each other reciprocally because those groups belong to two company web sites where the companies are part of the same conglomerate.
A review of those pages might lead to a determination that they are “suspicious,” which could lead to an automatic demotion of those pages or domains or hosts in search results.
Some pages might be included in a “white list” of web pages or hosts or domains that are automatically excluded from being identified as suspicious. These are sites that are known to be “popular” and “legitimate.” Not surprisingly, the patent application uses Yahoo.com as an example.
In an alternative approach, pages or domains that have been identified as “suspicious entities” might not be automatically excluded or demoted from search results, but may be further reviewed based upon their content. For example, the page may be explored to see if it contains words related to pornography or prescription drugs.
Links are just one of many ranking signals that a search engine may use to determine how well a page might rank in search results, and it is possible that site owners might attempt to have links pointed at their pages solely to increase the rankings of those pages.
The three patent filings that I’ve referred to in this post describe just some of the ways that a search engine might try to identify when people are linking solely for the purpose of manipulating their rankings.
Chances are that if the links on your blog or site are open, transparent, and reasonable (rather than excessive), provide value to your visitors, and point to sites covering similar or complementary topics, even sites that might link back to yours, a search engine will likely find them to be legitimate. If you include indications on your pages that you will link back to others who link to you to boost rankings in search results, you may have more reason to be concerned. If you engage in link farms or link schemes or reciprocal link programs, a search engine might find your pages to be “suspicious,” and may be taking a closer look.
If you want to dig deeper into this topic, here are some papers from Yahoo researchers on detecting spam pages by looking at links:
- Combating Web Spam with Trustrank (pdf)
- Link Spam Alliances
- Link Spam Detection Based on Mass Estimation (pdf)
- Link Based Characterization and Detection of Web Spam (pdf)
- Using rank propagation and probabilistic counting for link-based spam detection Slides (pdf)
- Link Based Spam Detection Slides (pdf)
- Know your Neighbors: Web Spam Detection using the Web Topology (pdf) slides(ppt)
- Link Analysis for Web Spam Detection (pdf)
- Web Spam Detection: link-based and content-based techniques (pdf)
- Technical Report YR-2008-001 – Witch: A New Approach to Web Spam Detection (pdf) (video)
- Web spam Identification Through Content and Hyperlinks (pdf)
I mentioned above that search engine ranking signals can be classified as content based, link based, and user behavior based. A few of the papers above look at both links and content to find web spam. Another recent approach from Yahoo looks at user behavior and query logs to find spam pages.
Added: David Harry also wrote about this patent application and reciprocal links in a thoughtful post titled: This just in; reciprocal links are pointless