Using an XML Sitemap to Index Fresh Content Quicker
It’s nice to find support documents from Google that point out the benefits of a particular approach, and Google has published one on the benefits of XML sitemaps.
Search engines use programs, often referred to as robots, crawlers, or spiders, to crawl the web and identify new and updated pages to include in their indexes. But crawling isn’t the only way a search engine learns about pages that it might include in search results.
A whitepaper from Google, Sitemaps: Above and Beyond the Crawl of Duty (pdf), examines the effectiveness of XML sitemaps, which Google announced as an experiment called Google Sitemaps in 2005. The experiment seems to have been a success.
XML sitemaps give website owners a way to help search engines index the pages on their sites. Yahoo and Microsoft joined Google in adding support for XML sitemaps not long after, and a set of pages explaining the sitemaps protocol was launched at sitemaps.org.
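The protocol itself is straightforward: a sitemap is a UTF-8 encoded XML file listing one entry per URL, where only the location tag is required and a few optional tags hint at freshness. A minimal example, using a placeholder domain:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page; only <loc> is required -->
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-10-01</lastmod>      <!-- optional: date the page last changed -->
    <changefreq>daily</changefreq>     <!-- optional: hint at how often it changes -->
    <priority>0.8</priority>           <!-- optional: relative priority within this site -->
  </url>
</urlset>
```

The file is typically placed at the root of the site (for example, /sitemap.xml) and submitted to the search engines or referenced from robots.txt.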
The paper tells us that, as of October 2008, approximately 35 million websites publish XML sitemaps, providing data for several billion URLs. While a large number of sites have adopted XML sitemaps, we haven’t had much information from any of the search engines on how helpful those sitemaps have been, how they might be used together with web crawling programs, and whether they make a difference in how many pages get indexed, and how quickly.
The paper answers some of those questions, looking at how Google may use an XML sitemap to discover new pages and new content on already indexed pages, and presenting a case study of three different websites – Amazon, CNN, and PubMed.
Amazon’s approach to XML sitemaps revolves around the huge number of URLs it lists – around 20 million, with new products added regularly. Amazon also makes an effort to indicate the canonical, or best, URL versions of product pages in its XML sitemap.
CNN’s approach to XML sitemaps focuses upon helping search engines find the many new URLs it adds daily, and upon addressing canonical issues with its pages.
PubMed lists a huge archive of URLs in its XML sitemaps; most of those URLs change very little over time, and the change rate listed for them is monthly.
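That pattern maps directly onto the protocol’s optional tags. A sketch of what a single archive-style entry might look like – the URL and dates here are placeholders, not taken from PubMed’s actual sitemaps:

```xml
<url>
  <loc>http://www.example.com/archive/article-12345</loc>
  <lastmod>2005-06-15</lastmod>      <!-- an archive page that rarely changes -->
  <changefreq>monthly</changefreq>   <!-- matches a monthly listed change rate -->
</url>
```

Entries like this let a crawler deprioritize stable archive pages and spend its effort on URLs that actually change.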
One part of the study was limited to 500 million URLs found in XML sitemaps. It focused upon deciding whether the use of XML sitemaps led to the inclusion of higher quality pages than the use of crawling programs alone, without considering the sitemap information.
Another aspect of the study looked at 5 billion URLs seen both through XML sitemaps and through discovery by web crawling programs, to determine which approach surfaced the freshest versions of those pages. It appears that the sitemap approach found new content more quickly:
Next, we study which of the two crawl systems, Sitemaps and Discovery, sees URLs first. We conduct this test over a dataset consisting of over five billion URLs seen by both systems. According to the most recent statistics at the time of the writing, 78% of these URLs were seen by Sitemaps first, compared to 22% seen through Discovery first.
The last section of the paper discusses how a search engine might use XML sitemaps to help decide which pages of a site to crawl first.
If you’re using XML sitemaps on your website, you might find the case study section interesting, with its descriptions of how Amazon, CNN, and PubMed organize and use their sitemaps.
If you’re not using XML sitemaps on your website, you may want to read through this paper and consider adding them.