Search engines use programs to crawl the web and identify new pages, as well as newly updated pages, to include in their indexes. These programs are often referred to as robots, crawlers, or spiders. But crawling isn't the only way a search engine learns about pages that it might include in search results.
A whitepaper from Google, Sitemaps: Above and Beyond the Crawl of Duty (pdf), examines the effectiveness of XML sitemaps, which Google announced as an experiment called Google Sitemaps in 2005. The experiment seems to have been a success.
XML sitemaps give website owners a way to help search engines index the pages on their sites. Yahoo and Microsoft joined Google in supporting XML sitemaps not long after, and a set of pages explaining the sitemaps protocol was launched.
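For those who haven't seen one, a minimal sitemap under the protocol is a short XML file listing URLs, optionally with hints about when each last changed and how often it tends to change. A rough sketch (the URL here is made up for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- The page's full, preferred address -->
    <loc>http://www.example.com/products/widget</loc>
    <!-- Optional hints for crawlers -->
    <lastmod>2008-10-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

The file is typically placed at the root of the site and can also be announced to search engines through robots.txt or a ping URL.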
The paper tells us that approximately 35 million websites published XML sitemaps as of October 2008, providing data for several billion URLs. While XML sitemaps have been adopted by a large number of sites, we haven't had much information from any of the search engines on how helpful those sitemaps have been, how they might be used together with web crawling programs, and whether they make a difference in how many pages get indexed, and how quickly.
The paper answers some of those questions, with a look at how Google uses XML sitemaps in discovering new pages, and new content on already indexed pages, as well as a case study on three different websites – Amazon, CNN, and PubMed.
Amazon's approach to XML sitemaps revolves around the very large number of URLs listed – 20 million – as well as the addition of new products on a regular basis. They also take care to indicate the canonical, or preferred, URL versions of product pages in their XML sitemaps.
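In practice, that canonical signal is simply a matter of which URL gets listed. A sketch of the idea, with a hypothetical product URL rather than one taken from the paper:

```xml
<!-- Only the clean, canonical product URL appears in the sitemap,
     not variants carrying session or tracking parameters, such as
     /product/12345?ref=homepage or /product/12345?sessionid=abc -->
<url>
  <loc>http://www.example.com/product/12345</loc>
</url>
```

Since a crawler may reach the same product page through many parameterized links, the sitemap gives the search engine one clear vote for which version to index.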
CNN's approach to XML sitemaps focuses on helping a search engine find the many new URLs added daily, and on addressing canonical issues with their pages.
PubMed has a huge archive of URLs listed in their XML sitemaps, most of which change very little over time, with a listed change rate of monthly.
One part of the study was limited to 500 million URLs found in XML sitemaps, and it focused on whether the use of XML sitemaps led to the inclusion of higher-quality pages than crawling alone, without considering the sitemap information.
Another aspect of the study looked at 5 billion URLs that were seen both through XML sitemaps and through discovery by web crawling programs, to determine things such as which approach surfaced the freshest versions of those pages. It appears that the sitemap approach found new content more quickly:
Next, we study which of the two crawl systems, Sitemaps and Discovery, sees URLs first. We conduct this test over a dataset consisting of over five billion URLs that were seen by both systems. According to the most recent statistics at the time of the writing, 78% of these URLs were seen by Sitemaps first, compared to 22% that were seen through Discovery first.
The last section of the paper discusses how information from XML sitemaps might be used by a search engine to help decide which pages of a site to crawl first.
If you're using XML sitemaps on your website, you might find the case study section interesting, with its descriptions of how Amazon, CNN, and PubMed organize and use those sitemaps.
If you're not using XML sitemaps on your website, you may want to read through this paper and consider adding them.