What is Web Decay and How Can it Hurt Your Site?
How harmful are dead links to search engine rankings? Or pages filled with outdated information? Can internal redirects on a site also hurt rankings? What about the redirects used on parked domains?
A new patent application published last week at the US Patent and Trademark Office (USPTO), and assigned to IBM, Methods and apparatus for assessing web page decay, explores the topics of dead pages, web decay, soft 404 error messages, redirects on parked pages, and automated ways for search engines to look at these factors while ranking pages. I’ll explore a little of the patent application here, and provide some ideas on ways to avoid having decay harm the rankings of web sites.
The authors of the patent filing include:
- Andrei Broder, now with Yahoo,
- Ziv Bar-Yossef, Senior Lecturer at Israel Institute of Technology – check out his publications page for some interesting papers, (added – 9/2/2006 – Dr. Bar-Yossef joined Google in August, 2006)
- Shanmugasundaram Ravikumar, an IBM Researcher, and;
- Andrew Tomkins, also now with Yahoo.
Why should a search engine look at web decay?
A search engine wants to provide useful and timely information in the results it returns to searchers. One struggle is that the web is growing rapidly, so search engines have to try to keep up with new pages. But, just as importantly, recent studies show that many web pages don’t last long, and the web exhibits rapid decay as well.
Is it important for page creators to keep links up to date and avoid web decay?
From the aspect of providing a usable site, friendly to visitors, it certainly is, but dead links may also harm rankings. This patent application points to a number of ways for search engines to identify and use the concept of link decay to reduce the rankings of pages that exhibit significant link decay.
The document notes that significant decay can be seen on individual pages, collections of pages or even entire web neighborhoods, making them less effective as information resources. These neighborhoods frustrate searchers rather than providing value to them. This provides some strong incentive for search engines to stay away from these areas.
Avoiding Death and Web Decay
Perception plays a substantial role in what searchers consider irrelevant information. The immediacy and flexibility of the web create an expectation that content is up-to-date. The inventors tell us that in a library no one expects every book to be current. Most people, though, expect books not to change after they have been published, and it is easy to find a publication date in a book. They note that the web is different:
While there have been substantial efforts in mapping and understanding the growth of the web, there have been fewer investigations of its death and decay. Determining whether a URL is dead or alive is quite easy, at least in the first approximation, and, in fact, it is known that web pages disappear at a rate of 0.25-0.5%/week. However, determining whether a web page has been abandoned is much more difficult.
Regardless of whether or not the processes described in this patent are in use, it’s not a bad idea to give people a few clues that show that the pages of a site are being actively maintained and updated. Here are some ideas not listed in the patent filing for doing that:
- Use “last updated” statements at the bottoms of pages
- Publish articles, or blog posts, or other information that have dates attached to them, and include links to those on high traffic pages of your site. For example, if you have a blog on your site, use the RSS feed to display the titles of posts on other pages of the site. The same can be done with articles.
- Update the copyright notice date in the footers of your pages. There’s nothing wrong with including a range of years in that notice, such as “© 1996-2006.”
- Use a link checker periodically, and make sure that internal links within the site, and external links aren’t broken.
- Use a tool like Xenu Link Sleuth to check those links, and also look at internal and external redirects.
- If you are relying on a redirect when the URL of a page on your site changes, instead of updating the links that point to it, go ahead and change the links.
- If you see that a link which is external to your site gets redirected (which Xenu can tell you), take a look where the redirection takes you. If the new location of the page you pointed to is valid, change your link to that new destination. If the new page is an error message, or something other than what you linked to in the first place, make appropriate changes such as removing your link, or finding a new source for the information.
- Manually check the destination pages of external links from your site, to see if the pages you were originally linking to have changed in some way. When you provide external links from your site, it can be helpful to provide enough information with the link so that you can remember why you linked to the page. A benefit of that is that potential visitors to that page get a better understanding of why you linked to it in the first place, and can rely upon the links they find on your site to lead them to places that are what they say they are.
- If you sell outdated, old, discontinued, or refurbished products, make sure that it is clear that they are old in ways that will give visitors and search engines an understanding that the products are last year’s model, or the year before’s, or are even older.
- Write content in a timeless manner, so that your pages have value after some time has passed. For example, an event-based newsletter could simply be a listing of events, locations, dates, and links. Instead, include information about the event such as the past history of the event, and reasons why it might be remarkable this year. Think about whether or not the page will still have value to visitors a year or two or three in the future, and use words and text that achieve that as a goal. In addition to linking to pages created specifically for the event, also include links that may be more timeless, such as the organization holding the event. If the link to the event-specific page becomes outdated or even a dead link, but the organization still exists, remove the event page link and keep the organization link, especially if the page still provides value to searchers.
- That last suggestion can work equally well with links in blog posts and articles. If you write about some company or organization, and something that they are doing, link to their main page, and link to the page on their site that describes what they are doing. The second link is more likely to become a dead link than the first, and you can write in a way in which removing the link doesn’t harm your blog post or article much.
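The link-checking suggestions in the list above can be sketched in a few lines of Python. This is only a minimal sketch, not a replacement for a tool like Xenu: the function names `classify_link` and `check_link` are hypothetical, only the standard library is used, and treating every network failure as a dead link is a simplifying assumption.

```python
import urllib.error
import urllib.request


def classify_link(requested_url, status, final_url):
    """Label a link from the status code and the URL finally reached."""
    if status is None or status >= 400:
        return "dead"
    if final_url.rstrip("/") != requested_url.rstrip("/"):
        return "redirected"  # inspect final_url before deciding to relink
    return "ok"


def check_link(url, timeout=10):
    """Fetch url (following redirects) and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_link(url, response.status, response.geturl())
    except urllib.error.HTTPError as err:
        return classify_link(url, err.code, url)
    except (urllib.error.URLError, OSError):
        # Assumption: DNS failures, timeouts, and refused connections count as dead.
        return classify_link(url, None, url)
```

A "redirected" result is the cue to visit the destination by hand and decide whether to update the link, remove it, or find a new source, as described in the list.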
There are probably other things you can do as well to keep search engines and visitors from thinking that your pages are stale. Some of them require ongoing maintenance, and others demand presenting information in a manner in which it doesn’t become stale. One of the most important statements made in this patent document is that adjustments to rankings based upon decay are much easier to fix than rankings dependent upon something like PageRank. Checking links and fixing them can be a lot easier than pursuing links for your site.
Uses of the method described in the patent application
I mention manual checking of links in the section above. Actually, one of the anticipated uses of the methods described in this patent may help people check to see if links are really dead. Here’s the list of uses and descriptions of those that the document provides:
Webmaster and ontologist tools:
There are a number of tools made available to help webmasters and ontologists track dead links on their sites; however, for web sites that maintain resources, there are no tools to help understand whether the linked-to resources are decayed. The observation about Yahoo! leaf nodes suggests that such tools might provide an automatic or semi-automatic approach to addressing the decay problem.
Web Decay measures have not been used in ranking, but users routinely complain about search results pointing to pages that either do not exist (dead pages), or exist but do not reference valid current information (decayed pages). Incorporating the decay measure into the rank computation will alleviate this problem. Furthermore, web search engines could use the soft-404 detection algorithm to eliminate soft-404 pages from their corpus. Note that soft-404 pages indexed under their new content are still problematic since most search engines put a substantial weight on anchor text, and the anchor text to soft-404 pages is likely to be quite wrong.
The decay score can be used to guide the crawling process and the frequency of the crawl, in particular for topic sensitive crawling [1. For instance, one can argue that it is not worthwhile to frequently crawl a portion of the web that has sufficiently decayed; as seen in the described experiments, very few pages have valid last modified dates in them. The on-the-fly random walk algorithm for computing the decay score might be too expensive to assist this decision at crawl-time but post a global crawl one can compute the decay scores of all pages on the web at the same cost as PageRank. Heavily decayed pages can be crawled infrequently.
Web sociology and economics:
Measuring web decay score of a topic can give an idea of the “trendiness” of the topic.
Note – the first description includes a criticism of the Yahoo Directory, which is expanded upon later in the document. The authors say the Directory does a great job of removing dead links, and a terrible job of identifying links where the pages are still alive but may have changed purposes, or are significantly outdated. Now that at least two of the inventors work with Yahoo, it might be interesting to see if that has changed much.
How does the web decay process work?
There are four parts:
- A date threshold,
- A topicality threshold,
- A link threshold, and;
- The assignment of a decay score based upon the distance between dead links and pages.
Information about the age of a page is extracted from the page, or the “last-modified” information about the page is used. So things like dates on pages, copyright notices, and so on, can be important. If the page is older than a certain date threshold, then it is not considered current.
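As a rough illustration of a date threshold, the sketch below checks an HTTP `Last-Modified` value against a maximum age. The function name `is_current`, the 365-day default, and the choice to treat a missing or unreadable date as "not current" are all assumptions made for the example; the patent application does not give these specifics.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_current(last_modified, now, max_age_days=365):
    """Return True if the Last-Modified date is within max_age_days of now."""
    if not last_modified:
        return False  # assumption: no usable date means "not known to be current"
    try:
        modified = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        return False  # unparseable header
    if modified.tzinfo is None:
        # Headers without a usable zone parse as naive datetimes; assume UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return now - modified <= timedelta(days=max_age_days)
```

Passing `now` in explicitly, as a timezone-aware datetime, keeps the check repeatable, and the same comparison could be applied to a date extracted from the page text or a copyright notice.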
One of the examples they provide explains this well. A page describes a Pentium III computer for “non historical reasons,” such as selling it as if it were a new product. This may provide a clue that it is not current. Likewise, to add an example, a page where the current World Series Champions are listed as the New York Yankees might also be seen as dated.
If a certain percentage of links on a page are dead links – the page is not current.
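The link threshold amounts to a simple ratio test. Here is a minimal sketch, assuming the dead or alive status of each outgoing link has already been determined; the function name and the 0.25 default are placeholders, since the excerpted text does not state a specific percentage.

```python
def exceeds_dead_link_threshold(link_is_dead, threshold=0.25):
    """link_is_dead: list of booleans, one per outgoing link on the page."""
    if not link_is_dead:
        return False  # assumption: a page with no links is not flagged
    dead_fraction = sum(link_is_dead) / len(link_is_dead)
    return dead_fraction > threshold
```

A page with 3 dead links out of 77 would not be flagged at this placeholder threshold, which is one reason the document treats dead links on a single page as a noisy signal.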
This is a section worth thinking about. I’ll use the words from the document here to show some of the difficulties involved in identifying dead links.
This is also the part of the document that includes the critique of the Yahoo Directory, and an interesting insight about parked pages pointing to a different domain (e.g., Disney owns http://www.mickeymouse.com/ which is a parked domain and has a redirect to a page on the Disney domain – audio alert warning – the Disney site starts playing music when you visit, so only do so when that may not be an issue). While the Disney example is a reasonable use of a parked domain, the web decay patent identifies another use – when someone purchases a dead domain, and uses a redirect upon it to “profit from the prior promotional works of the previous owners of the dead sites.”
 (1) The first problem–determining whether a link is “dead”–is not trivial. According to the HTTP protocol  when a request is made to a server for a page that is no longer available, the server is supposed to return an error code, usually the HTTP return code 404. As discussed in the following sections, in fact many servers, including most reputable ones, do not return a 404 code–instead the servers return a substitute page and an OK code (200). The substitute page sometimes gives a written error indication, sometimes returns a redirect to the original domain home page, and sometimes returns a page which has absolutely nothing to do with the original page. Studies show that these type of substitutions, called “soft-404s,” account for more than 25% of the dead links. This issue is discussed in detail and a heuristic is proposed for the detection of servers that engage in soft 404s. The heuristic is effective for all cases except for one special case: a dead domain home page bought by a new entity and/or “parked” with a broker of domain names: in this special case it can be determined that the server engages in soft 404 in general but there is no way to know whether the domain home page is a soft 404 or not.
 (2) The second problem associated with dead links as a decay signal is that they are very noisy signals. One reason is because it is easy to manipulate. Indeed, many commercial sites use content management systems and quality check systems that automatically remove any link that results in a 404 code. For example, experiments indicate that the Yahoo! taxonomy is continuously purged of any dead links. However, this is hardly an indication that every piece of the Yahoo! taxonomy is up-to-date.
 Another reason for the noisiness is that pages of certain types tend to live “forever” even though no one maintains them: a typical example might be graduate students pages–many universities allow alumni to keep their pages and e-mail addresses indefinitely as long as they do not waste too much space. Because these pages link among themselves at a relatively high rate, they will have few dead links on every page, even long after the alumni have left the ivory towers; it is only as a larger radius is examined around these pages that a surfeit of dead links is observed.
The method used to identify “soft 404s” is clever, and not very complicated: see how each server handles errors within directories on a domain by asking for pages with names that are very unlikely to exist within that directory. The reason the process looks at directories, instead of whole domains, is that some sites treat directories as if they were separate web sites, with different error handling processes.
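The probing idea can be sketched as follows. This is a simplified illustration rather than the heuristic from the patent application, which also has to handle redirects and the parked-domain special case. The function names are hypothetical, and `fetch` is an assumed callable supplied by the caller that returns a status code and final URL.

```python
import uuid
from urllib.parse import urljoin


def random_sibling_url(url):
    """Build a URL in the same directory whose name is very unlikely to exist."""
    return urljoin(url, uuid.uuid4().hex + ".html")


def directory_serves_soft_404s(url, fetch):
    """fetch(url) must return (status_code, final_url) after following redirects."""
    status, _final_url = fetch(random_sibling_url(url))
    # A well-behaved server answers 404 (or 410) for a page that cannot exist.
    # An OK code for the random name suggests errors are masked in this directory.
    return status == 200
```

If the probe comes back with an OK code, a 200 response for a real link in that directory can no longer be taken as proof that the page is alive.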
There’s a fair amount of math in this section, but again, it isn’t that complicated. It may be best to illustrate it by using their Yahoo! example, which they describe in more detail:
 Thus, it can be concluded that many of the pages pointed by Yahoo! nodes, even though are not dead themselves yet, are littered with dead links and outdated. For example, consider the Yahoo! category Health/Nursing. Only three out of 77 links on this page are dead. However, the decay score of this page is 0.19. A few examples of dead pages that can be reached by browsing from the above Yahoo! page are: (1) the page http://www.geocities.com/Athens/4656/ has an ECG tutorial where all the links are dead; (2) the page http://virtualnurse.com/er/er.html has many dead links; (3) many of the links in the menu bar of http://www.nursinglife.com/index.php?n=1&id=1 are dead; and so on. It is believed that using decay scores in an automatic filtering system will improve overall quality of links in a taxonomy like Yahoo!.
They use a few other examples, with slightly different results. Rather than detail this process fully, I’m going to leave that to anyone interested in digging into this more deeply. But, I hope that this example makes it clear why it could be important to revisit pages that you may have linked to on your site to see how much they may have decayed.
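One way to picture a score that depends on the distance to dead pages is a damped average over outgoing links: a dead page scores 1, and a live page scores a fraction `sigma` of the average score of the pages it links to. The sketch below assumes that formulation. The damping value, the fixed number of iterations, and the toy graph are illustrative assumptions, not figures from the patent application.

```python
def decay_scores(outlinks, dead, sigma=0.5, iterations=50):
    """outlinks: {page: [linked pages]}; dead: set of pages known to be dead."""
    pages = set(outlinks) | dead | {q for links in outlinks.values() for q in links}
    scores = {p: (1.0 if p in dead else 0.0) for p in pages}
    for _ in range(iterations):
        updated = {}
        for p in pages:
            links = outlinks.get(p, [])
            if p in dead:
                updated[p] = 1.0
            elif links:
                # Damped average of the scores of the pages this page links to.
                updated[p] = sigma * sum(scores[q] for q in links) / len(links)
            else:
                updated[p] = 0.0
        scores = updated
    return scores


# Toy graph (hypothetical): "hub" links only to live pages, but "a" links to two dead ones.
graph = {"hub": ["a", "b"], "a": ["x", "y"], "b": ["c"], "c": []}
print(decay_scores(graph, dead={"x", "y"})["hub"])  # 0.125
```

In the toy graph, "hub" has no dead links of its own yet still receives a nonzero score because decay sits one click away, which is the same effect the Health/Nursing example describes.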
The issue of web decay hasn’t been explored as much as topics such as mapping the web, or other methods for ranking pages. But it’s important to recognize it as an area that will be attracting more attention. It’s also an area that a webmaster can have a fair amount of control over, if regular checks are performed on links and on the pages linked to, and if pages are written in a manner that keeps them from becoming dated. Sure, important historical documents don’t change, but introductions to them can show that they are still valuable, current, and worth ranking highly.
Consider the possibility that a search engine will start using some type of decay ranking adjustment, if they haven’t already, and take steps now to keep from being harmed in rankings if, and when, they do.
While patents often include lists of references such as other patents, papers, books, and web pages, patent applications rarely include such lists. This one is an exception, and has a list of 31 references. Many of those may be worth taking a look at, if you want more information on this topic.