Study Concludes Robots.txt Files Should be Replaced

When I assess a website for technical issues, one of my first stops is robots.txt, a text file in the root directory of the site.

Some sites don’t have a robots.txt file, and some don’t necessarily need one. But a dynamic site with endless loops that a search engine spider may get lost in should have a robots.txt file, with a disallow statement that keeps spidering programs from trying to crawl those pages.

A site that republishes the same content under different URLs, such as alternative print versions of pages, should also consider disallowing those pages.
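Here is a minimal sketch of what such disallow statements might look like, checked with Python’s standard urllib.robotparser module. The /calendar/ and /print/ directories, the ExampleBot user agent, and the example.com URLs are assumptions made up for illustration; they don’t come from any particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep all robots out of an endless calendar
# and out of the duplicate print versions of pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /print/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

for path in (
    "/articles/robots-study.html",
    "/print/robots-study.html",
    "/calendar/2007/01/",
):
    allowed = parser.can_fetch("ExampleBot", "https://www.example.com" + path)
    print(path, "allowed" if allowed else "disallowed")
```

Rules like these match by path prefix, so everything under /print/ and /calendar/ is kept away from well-behaved robots, while the main article URLs stay open to them.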

An error in a robots.txt file can have some serious implications for the indexing of a web site.
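To illustrate how small such an error can be, here is a rough sketch that compares two robots.txt files differing by a single path. It uses Python’s standard urllib.robotparser; the paths, the ExampleBot name, and the allowed_paths helper are hypothetical and only there for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_paths(robots_txt, paths, agent="ExampleBot"):
    """Return the subset of paths that the given robots.txt lets the agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        p for p in paths
        if parser.can_fetch(agent, "https://www.example.com" + p)
    ]


# What the site owner meant to write: block one private directory.
INTENDED = "User-agent: *\nDisallow: /private/\n"

# What was actually written: the directory name is missing,
# so the rule matches every URL on the site.
MISTAKE = "User-agent: *\nDisallow: /\n"

PATHS = ["/", "/products/widget.html", "/private/report.html"]

print(allowed_paths(INTENDED, PATHS))
print(allowed_paths(MISTAKE, PATHS))
```

With the intended file, only the private report is off limits. With the mistaken one, nothing on the site may be fetched by a robot that follows the protocol.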

A failure to have a robots.txt file, when one could be helpful, may mean that a site could have its internal link equity distributed poorly. A percentage of the important pages of a site also might not get indexed, while less meaningful pages may be.

A new paper, A Large-Scale Study of Robots.txt, describes one of the first detailed reviews of the usage of robots.txt files. It’s an interesting look at one of the most important pages on a Web site.

Combine it with a recent post at Search Engine Land, Up Close & Personal With Robots.txt, and you’ll come to the conclusion that not enough people are using a robots.txt page on their web sites, and of those using it, many are using it incorrectly or have problems with the ways that they have it set up.

The study involved five crawls of websites between Dec. 2005 and Oct. 2006, collecting the robots.txt files for those sites. The sites were chosen from the Open Directory Project, covering education, news, and government sites, and from a Fortune Top 1000 Company List for business sites. 7,593 sites were reviewed in total. Here’s the breakdown:

  • 600 government websites,
  • 2,047 newspaper websites,
  • 1,487 USA university websites,
  • 1,420 European university websites,
  • 1,039 Asian university websites, and,
  • 1,000 company websites

The study looks at the growth in the use of robots.txt files over that time period, which kinds of sites are more likely to have robots.txt files, which kinds of sites have the longest robots.txt files (government sites), mistakes in robots.txt files, and some other interesting stats.

The conclusion that the authors of the study came to was that the use of robots.txt should probably be replaced with a “better specified, official standard.”

Until that happens, it pays to know how to use a robots.txt file when you need to have one.

One of the best sources of information about robots.txt files is The Web Robots Pages, which contain robots.txt examples and a lot of information about the robots exclusion protocol.

The major search engines follow the rules set in the protocol, but there have been some additions. Here are links to pages where each search engine describes what it may look for in a robots.txt file:

Google: Webmaster Help Center – How Google crawls my site

Yahoo! Search > Yahoo! Search Help > Yahoo!’s Web Crawler How do I prevent my site or certain subdirectories from being crawled?

Bing – Control which pages of your website are indexed

Ask.com – Web Search Help

11 thoughts on “Study Concludes Robots.txt Files Should be Replaced”

  1. That’s a most useful article and set of resources, Bill. Thanks for putting it all in one place.

    One small item for me on robots.txt files is that often search engine robots seem to look only at the robots.txt file on some visits and nothing else. It presumably is to ensure the site is still live. I prefer that they don’t get a 404 error message when they do that. I’m not sure what harm that 404 error might do, if anything, but it just seems an obvious thing to avoid.

  2. Hi Barry,

    That’s a great point. Is there an impact when a search engine calls for a robots.txt file, and nothing else? I’m not sure.

    We do know that the major search engines seem to ask for the file, and then cache a copy of it for a short period of time – so that they aren’t asking for it every time they visit a page on a site and send off information about that page to have it indexed. The bandwidth required to index a site would more than double if they did that.

    It’s possible that sometimes those visits that involve just looking at the robots.txt file may be a check to see that it hasn’t changed, and that they don’t have pages indexed that have been recently disallowed since their last visit and check of that file.

    This paper is the first academic study that I know about which looks at robots.txt. As they write:

    In this poster, we present the first large-scale study of robots.txt files covering the domains of education, government, news, and business. We present our observations on a considerably larger scale data than previous studies.

    If I ran a search engine, information about how site owners use robots.txt might be something that I would want to know because of its potential impact on the quality of that search engine’s index. A search engine might conduct such a study and not make it public, though doing so might influence more people to use a robots.txt file, and use it correctly. Maybe that’s another reason why search engines sometimes grab just that file on a visit.

  3. Hey Bill,

    Just thought I’d leave a comment after replying to your post on V7N forums about the SEO book some guy linked to. As usual, we’ve got an underreported, well researched piece here! I’m favouriting this one for future reference and to pass around to clients and info seekers needing to get the 411 on the robots.txt file!

    Keep the classics coming!

    Bookworm SEO

  4. Thanks, Bookworm SEO and Joe.

    The robots.txt file is one of those often overlooked and underutilized parts of a site that can have a very large impact upon it. The percentage of sites failing to use one, mentioned in the report, was surprising.

  5. Thanks for the post Bill. I’m also surprised at the number of sites that don’t make use of robots.txt. I can understand a typical mom and pop not having one, but when I see large dynamic sites with duplicate content issues I have to wonder.

    Barry, that’s an interesting point about the 404 if the robots.txt isn’t present. I would think the error wouldn’t cause a problem, but you never know.

  6. You’re welcome, Steven.

    The lack of robots.txt on some of the types of sites that they mention, like newspaper sites, is a surprise. I would expect a lot of those to be dynamic, and to have a number of issues that could be resolved pretty cleanly with robots.txt.

  7. I completely agree that the robots.txt file should be replaced with something more standard. Search engines should offer clear-cut instructions for this, like the Google sitemap.

  8. I agree completely. When we first concocted the robots.txt spec, I don’t think any of us anticipated some of the ways the web would evolve. At the time, we needed something that (we thought) anyone could create and which would allow people to lock robots out of certain recursive applications. Hence the simple text format, although we didn’t really have the option of XML at the time. Even the file name is an artifact; there were web servers running on Windows systems at the time that were limited to 8+3 file names. Kind of dates the spec, doesn’t it?

    If there is interest in creating a straw man replacement, I’d be interested in being part of the effort.

  9. Hi Chakra,

    I agree that this should be easier for people to understand. The robots.txt file goes beyond just search engines to other sites that have robot programs visiting other people’s web pages, so hopefully if there are changes in the future, people from organizations other than search engines will get involved.

    Thanks John,

    Appreciate the historical perspective very much. Some of the questions I get about robots.txt files could probably be answered just by a rewrite of the protocol into a little simpler language. I think that there’s still a benefit to simplicity, and that a rewrite might have the biggest impact over many other options.