Google Study Shows Use of XML Sitemaps Helps Index Fresh Content Quicker

Search engines use programs to crawl the web, and identify new pages and newly updated pages to include in their index. These are often referred to as robots, or crawlers, or spiders. But there are other ways that the search engine gets information about pages that it might include in search results.

A whitepaper from Google, Sitemaps: Above and Beyond the Crawl of Duty (pdf), examines the effectiveness of XML sitemaps, which Google announced as an experiment called Google Sitemaps in 2005. The experiment seems to have been a success.

XML sitemaps give web site owners a way to help search engines discover and index the pages on their sites. Yahoo and Microsoft joined Google in adding support for XML sitemaps not long after, and a set of pages explaining the sitemaps protocol was launched.
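For readers who haven't looked inside one, here is a minimal sketch of building a sitemap with Python's standard library. The element names and namespace come from the sitemaps.org protocol; the `build_sitemap` helper, the example.com URLs, and the dates are made up for illustration, and a real site would more likely rely on a plugin or generator.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a <urlset> document from a list of dicts (hypothetical helper).

    Each dict needs a 'loc' key and may carry the optional
    'lastmod', 'changefreq', and 'priority' fields.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


# Illustrative entries only; these pages do not exist.
sitemap_xml = build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2009-04-20", "changefreq": "daily"},
    {"loc": "http://www.example.com/archive/", "changefreq": "monthly"},
])
print(sitemap_xml)
```

Only `loc` is required for each URL; the other fields are hints that a search engine may or may not act on.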

The paper tells us that approximately 35 million websites publish XML sitemaps, as of October 2008, providing data for several billion URLs. While XML sitemaps have been adopted by a large number of sites, we haven’t had much information from any of the search engines on how helpful those sitemaps have been, how they might be used together with web crawling programs, and if they make a difference in how many pages get indexed, and how quickly.

The paper answers some of those questions, with a look at how Google uses XML sitemaps in discovering new pages, and new content on already indexed pages, as well as a case study on three different web sites – Amazon, CNN, and Pubmed.

Amazon’s approach to XML sitemaps revolves around the very large number of URLs listed (20 million), as well as the addition of new products on a regular basis. They also make an effort to indicate the canonical, or best, URL versions of product pages in their XML sitemap.

CNN’s approach to XML sitemaps focuses upon helping a search engine find the addition of many new URLs daily, and also addressing canonical issues with their pages.

Pubmed has a huge archive of URLs listed in their XML sitemaps, with very little change to most of them over time; the change rate for those URLs is listed as monthly.

One part of the study was limited to 500 million URLs found in XML sitemaps, and it focused on whether using XML sitemaps led to the inclusion of higher quality pages than crawling programs alone, without the sitemap information.

Another aspect of the study looked at 5 billion URLs that were seen by both XML sitemaps and by the discovery of pages through web crawling programs, to determine things such as which approach showed the freshest versions of those pages. It appears that the sitemap approach found new content quicker:

Next, we study which of the two crawl systems, Sitemaps and Discovery, sees URLs first. We conduct this test over a dataset consisting of over five billion URLs that were seen by both systems. According to the most recent statistics at the time of the writing, 78% of these URLs were seen by Sitemaps first, compared to 22% that were seen through Discovery first.

The last section of the paper discusses how information from XML sitemaps might be used by a search engine to help decide which pages of a site to crawl first.

If you’re using XML sitemaps on your website, you might find the case study section interesting, along with its descriptions of how Amazon, CNN, and Pubmed organize and use those sitemaps.

If you’re not using XML sitemaps on your web site, you may want to read through this paper, and consider adding them.

37 thoughts on “Google Study Shows Use of XML Sitemaps Helps Index Fresh Content Quicker”

  1. I add XML sitemaps to my sites as standard, and in fact the XML sitemap plugin for blogging software such as WordPress goes a long way towards getting new posts indexed as soon as possible. The trick is keeping them updated, something which the plugin does automatically. An XML sitemap should ideally update every day.

    It has been argued though that XML sitemaps can have detrimental effects on sites where pages that have been poorly optimised will be listed, but they won’t appear for their keywords. Basically, XML sitemaps remove any chance you might have of analyzing the optimisation quality on individual pages.

  2. Hi Adam,

    Thanks. You raise some thoughtful issues. I recommend the use of XML sitemaps as well, and it is great when there is a plugin or some other tool that can help create sitemaps for a site as it updates.

    I don’t believe that using an XML sitemap has too much of an impact on helping a page rank any higher in search results. Some actual optimization ideally needs to happen to make a difference. An analysis of the optimization of the pages of a site shouldn’t be hindered by whether or not there is an XML sitemap in place. I’ve analyzed many sites, with and without XML sitemaps, and haven’t found that the use of such a sitemap harms such optimization efforts. I’m not sure that I’ve seen the arguments that you mention that say that it might, and would be interested in seeing them. A page may be indexed by a search engine without ranking well, and an XML sitemap can help make that happen, but there are many other factors which determine how well a page ranks.

    It does appear, from Google’s research and statements about how they might use XML sitemaps in this paper, that an XML sitemap can have some positive results in indexing the pages of a site by showing a search engine the preferred canonical versions of the URLs for pages, and by making it easier for a search engine to see all of the URLs for a site, and helping the search engine define which pages might be crawled first, and most frequently for that site.

  3. I am undecided as to whether site maps help or hinder a website. I am conscious that on the one hand, correctly inserting a site map via GWebmaster Tools means confirming a Google account and WM Tools account. And that this then needs to be verified (read: G pushing importance of getting a site map as a strategy towards acquiring a certified mailing list of website owners for future marketing strategies – and there is nothing wrong with that, for those not specialized in this field).

    Equally I’ve seen sites perform just as well that have absolutely nothing to do with the specialist services offered by the search engine, so I’m still undecided.

    If I see a site is not getting indexed well, I’ll add a site map (and by that I mean XML sitemap loaded into Google webmaster tools, in all cases a site map is featured on a clients website)

  4. Hi Glyn,

    Some rationally applied skepticism isn’t a bad thing.

    In most cases, all things considered, I would prefer to work on making a site more search engine friendly before considering using an XML sitemap, but I think that there are some benefits that outweigh having to tell Google what all the URLs are for a site. I am also starting to see a lot more site owners asking for help who already have XML sitemaps.

  5. So my theory is correct. There is a need to resubmit again and again the xml file so that it will be crawled faster.

  6. Hi William, sorry I thought we were talking about XML sitemaps? I am with you on the fact that a website needs to be easy to navigate at two levels: first the user level, by way of clearly defined usability rules and navigational/information hierarchy; second, in terms of being completely crawlable by all search engines – not just Google. I’d actually go so far as to say that a website doesn’t need a sitemap if the information logic is properly mapped out, although you need to factor in how information is digested across cultures… so it’s probably a little arrogant not to include one at all. However I think site maps can be a good get-out clause for a poorly thought through website.

  7. We do use XML and urllist.txt style sitemaps submitted to google, yahoo and MSN webmaster areas. These have proved to be the quickest way of getting a new site listed, usually in less than a week, compared to the submit a site method which seems to take weeks or months, if at all

  8. Hi Arnold,

    Good point. I probably should have provided some resources in this post about how to submit an XML sitemap to the search engines.

    When you make changes to your XML sitemap or sitemaps, you’ll want the search engines to find out about those changes. There are a few different ways to let the search engines know about the changes. The sitemaps.org page tells us about them in a section titled Informing search engine crawlers.

    One of the easiest ways is to include a link to your XML sitemap or sitemaps in your robots.txt file, so that search engines can go to it when they look at your robots.txt file. Many XML sitemap plugins and XML sitemap generator programs have built in systems to ping the search engines when a new XML sitemap has been created. It is also possible to ping the search engines manually, or to submit through their tools sections.

    Google provides details on different ways to let them know about an updated XML sitemap here:

    Sitemap Submission Made Simple

    Yahoo! explains how to submit your XML sitemap on this page:

    Does Yahoo! support Sitemaps?

    Ask’s submission process is described on this page:

    Sitemaps Autodiscovery

    Microsoft has said that they will follow links to XML sitemaps found in robots.txt files, and that it’s also possible to submit the URL of an updated sitemap through their Bing URL submission page.
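As a rough illustration of the robots.txt approach mentioned above, here is a small sketch of how a program might pull sitemap locations out of a robots.txt file. The `Sitemap:` line is part of the sitemaps.org protocol; the `find_sitemaps` function and the example file are my own assumptions for demonstration, not how any search engine actually does it.

```python
def find_sitemaps(robots_txt):
    """Return the sitemap URLs listed in robots.txt text (hypothetical helper)."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps


# Illustrative robots.txt content; the URLs are placeholders.
example_robots = """\
User-agent: *
Disallow: /private/

Sitemap: http://www.example.com/sitemap.xml
sitemap: http://www.example.com/sitemap-news.xml
"""

found = find_sitemaps(example_robots)
print(found)
```

The field name is matched case-insensitively here, and the sitemap lines are read independently of any User-agent group.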

  9. Hi Glyn,

    I really wish that Google had used a different name than “Google Sitemaps,” and then “XML Sitemaps” to avoid confusion between HTML sitemaps and XML sitemaps. I’m not sure if they understood how much time and energy would be wasted in having people explain the difference between the two. Is that something that you find yourself spending time explaining, too?

    The paper linked to in my post is the first large-scale study from one of the search engines on how people are using XML sitemaps, and how Google is attempting to use information from those sitemaps. If I’m focusing a little more on Google in this post, it’s primarily because they are sharing information about XML sitemap usage that Yahoo or Microsoft haven’t.

    I’m not sure that an XML sitemap is absolutely necessary in most cases, if a site is created with care, and the information logic is set out well. XML sitemaps can help search engines find pages when the organization of a site is less than ideal, and sometimes can help a search engine identify the canonical, or best version, of a URL when there might be more than one URL pointing to the same page (or image, or video, etc.).

    However, for many sites, especially those with a large number of pages, from thousands to millions, it appears that an XML sitemap may also help speed up the discovery of new pages and changed pages, according to the paper from Google.

  10. Hi People Finder

    I agree. :)

    It can be a pain to manually update an XML sitemap or manually ping the search engines when you’ve made updates. I know that there are plugins for some ecommerce platforms as well as blogging platforms that automate the process, to make it easier.

    Google provides a sitemap generator here:

    http://code.google.com/p/googlesitemapgenerator/

    They also provide links to a number of XML sitemap generators and plugins here:

    http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators

  11. I use a wordpress plugin to update my xml sitemap and I must say that my pages get indexed extremely fast. I used to update it manually and what a pain that was to do on a daily basis.

  12. Hi pays to live green,

    I use a wordpress plugin here to update my xml sitemap as well. I’m seeing pages get indexed very quickly too, though I don’t know if it’s because of the XML sitemap or because googlebot visits frequently.

  13. Funny how the xml sitemap has come back in vogue. Seems not that long ago that everyone was poo-pooing them. I never saw how it could be detrimental, so I’m glad I stayed on the bandwagon.

  14. Hi rob,

    Interesting that you’re finding that brand new sites are making it into search engine indexes so quickly with the use of XML sitemaps.

    I’d also definitely recommend to people publishing new sites to work on getting some links to those pages as well, so that search crawling programs can find the pages too, and so that the pages can start building some link popularity (or PageRank in Google’s case).

  15. Pay to live green,
    Is that wordpress blog hosted on the WordPress server? I’ve seen search engines give priority crawls to in-vogue social media networks. Which is why the spammers are pretty active on them.

    G.

  16. The site map helps us to index new pages created on a site. We can see results on Google in two to three weeks. But you need to submit the sitemap.xml file to Google with Webmaster Tools.

    Serge

  17. Hi Serge,

    Two to three weeks isn’t bad. The times that I’ve seen vary, and I’m guessing that they rely upon factors such as how many links there are pointing to the pages of the site, how high quality those links are, and how frequently Google returns to check indexed pages and to see if there are new pages.

    While it isn’t necessary to create and use a Google Webmaster Tools account to use an XML sitemap with Google, and have them use it, it is something that they recommend on their page about Submitting a Sitemap:

    You can also tell Google and other search engines about your Sitemap by including the location of the Sitemap in your robots.txt file. We still recommend that you submit your Sitemap through your Webmaster Tools account so you can make sure that the Sitemap was processed without any issues, and to get additional statistics about your site.

  18. Hi Tom,

    In an ideal world, you would want to update your sitemap simultaneously with changes and additions to the pages of your site, or as soon as possible afterwards, so that the search engines might capture those changes as quickly as possible. For some sites, that might be easier to do because the content management systems or ecommerce platforms that they use offer plugins or modules that will allow them to update an XML sitemap or administrate it as they add pages, or whenever they might want.

    The three examples that were provided in the paper show a couple of other strategies that it’s possible to follow.

    For large sites that add lots of products over the course of a day, or add news pages every few hours, that might mean adding new XML sitemaps which contain new pages, updating old XML index files which have changed and updating a SitemapIndex file that shows the last modification time for all of the XML sitemaps.

    For smaller retail sites, it might mean just changing a single XML sitemap.

    Regardless of the approach that you take, if your main purpose behind using an XML sitemap is to help a search engine identify pages that have changed as quickly as possible, and new pages on your site, it makes sense to try to update your XML sitemaps as soon as you can.
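For the large-site strategy described above, a sitemap index file lists each individual sitemap along with the time it was last modified. Here is a minimal sketch of writing one; the `sitemapindex`, `sitemap`, `loc`, and `lastmod` elements come from the sitemaps.org protocol, while the `build_sitemap_index` helper, file names, and dates are invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemaps):
    """Build a <sitemapindex> document (hypothetical helper).

    `sitemaps` is a list of (location, last_modified) pairs, where
    last_modified is a W3C Datetime string such as '2009-04-20'.
    """
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in sitemaps:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="unicode")


# Illustrative only: one sitemap that rarely changes, one updated today.
index_xml = build_sitemap_index([
    ("http://www.example.com/sitemap-archive.xml", "2008-10-01"),
    ("http://www.example.com/sitemap-new-products.xml", "2009-04-20"),
])
print(index_xml)
```

With this layout, adding a batch of new pages means writing or updating one small sitemap and changing its `lastmod` in the index, rather than regenerating everything.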

  19. William

    I tend to set-up a xml sitemap for every site I manage. That’s the easy bit. But when you don’t have control over a site’s architecture, then it becomes a little tricky and time consuming.

    The reason for this, well site owners only want someone to look after their site once its gone wrong. Then you start to realise why. A whole cesspit of defunct pages sitting within the present sitemap, yes and there is a sitemap generator continually updating the search engines with these dead end pages.

    I have found myself literally removing these pages from a sitemap and then turning off the generator, which goes contra to the every day update I know, but how do you get around this? Site owners tend to want to keep everything they’ve ever produced; but that’s another story.

    Thanks for the article, good reading, I’ll now check out the report.

  20. Hi David,

    Very good points. I remember looking at one site from a fairly large organization, and running Xenu Link Sleuth on it to get an idea of how the site was organized, what the site contained, and what kinds of problems it might have.

    They commonly hired different marketing firms every time they wanted to make changes to their site, and I got a chance to dig more deeply into SEO for their site after a few dozen different marketing firms had made a mess of it. There were roughly 1,000 pages to the site, with more than 1,000 internal redirects between the pages of the site, many of which were 302 redirects. For some pages, when you clicked on a link to a page, your browser would go through half a dozen or more redirects before arriving at a final destination page. For example, here’s the kind of thing that I saw on their site:

    Link to FAQ with URL of “http://www.example.com/faq” -> 302 redirect to “http://www.example.com/faq/index.htm” -> 302 redirect to “http://faq.example.com/faq” -> 302 redirect to “http://faq.example.com/faq/” -> 302 redirect to “http://faq.example.com/faq/default.aspx” -> 301 redirect to “http://www.example.com/faq/default.aspx”

    Now imagine that the site had hundreds of multiple redirect paths like that, and what problems search engines might have with that kind of site structure. Ouch.

    While you could set up an XML sitemap that included the pages at the ends of those redirection trails, that’s not much help in getting those pages to rank if the search engines have to attempt to navigate through multiple temporary (302) redirect trails to those pages instead of seeing any direct internal text links to the pages.

    The solution is to convince the site owner that it could really make a difference in terms of rankings and traffic if they took the time to untangle the mess that their site had become, and that future development costs would be much less if they fixed the present problems and followed better practices in the future. If they don’t want to do that, the best case may be to find another client who will want to listen.
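To make the redirect trail described above concrete, here is a small sketch that walks a chain of redirects and counts the hops. It works on an in-memory table rather than live HTTP requests, so the `trace_redirects` function and the table format are assumptions for illustration; the URLs and status codes are the ones from the FAQ example.

```python
def trace_redirects(start, redirects, max_hops=20):
    """Follow a redirect table and return the hops as (status, target) pairs.

    `redirects` maps a URL to a (status_code, target_url) tuple.
    Stops at a URL with no redirect, on a loop, or after max_hops.
    """
    hops = []
    seen = {start}
    url = start
    while url in redirects and len(hops) < max_hops:
        status, target = redirects[url]
        hops.append((status, target))
        if target in seen:  # redirect loop, stop here
            break
        seen.add(target)
        url = target
    return hops


# The FAQ trail from the example above.
redirects = {
    "http://www.example.com/faq": (302, "http://www.example.com/faq/index.htm"),
    "http://www.example.com/faq/index.htm": (302, "http://faq.example.com/faq"),
    "http://faq.example.com/faq": (302, "http://faq.example.com/faq/"),
    "http://faq.example.com/faq/": (302, "http://faq.example.com/faq/default.aspx"),
    "http://faq.example.com/faq/default.aspx": (301, "http://www.example.com/faq/default.aspx"),
}

hops = trace_redirects("http://www.example.com/faq", redirects)
temporary = sum(1 for status, _ in hops if status == 302)
print(len(hops), "hops,", temporary, "of them temporary; final URL:", hops[-1][1])
```

Running the same kind of check against every internally linked URL on a site is one way to find the trails worth untangling first.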

  21. Hi Bill,

    Thanks for the link to that study. Indeed, an xml sitemap is a tool that will speed indexing, help with taxonomy ranking and link distribution, among other things. But this could also be a patch to a major architectural problem a site may have.

    This situation can be seen very often with poorly architected blogs and ecommerce sites that, even though they rank for a set of keywords, rely on the xml sitemap to assist their crawling rates and indexing. What they need is better site mapping, internal linking restructuring, and other improvements.

    From my perspective, keeping that xml sitemap will do more harm than good for a site with those problems in the long run.

    You also have to consider that most plugins tend to fail at some point or other. This creates a validation problem, something that site owners are not notified about unless they check their xml files or get flagged through Google Webmaster Tools. I’ve seen blogs going months without updating plugins and therefore having problems that they didn’t even think of.

    I’ll write a post about this issue, but would love to hear your feedback on the potential negative factors an xml sitemap may create and when not to use it.

    Regards,

    P.S. I found the other day, while reading your about page, that you worked for judicial before turning full time to SEO. Glad to hear that I was not the only one taking the same route and going through the same experience :)

  22. Hi Augusto,

    Good to see you. Many people do attempt to use XML sitemaps as bandaids, for sites with bigger problems.

    The real solution is to fix the architecture of a site instead of creating an XML sitemap showing the versions of pages that they really want indexed. An XML sitemap can help a search engine identify pages from a site that the web master would like to have indexed, but if duplicate URL problems, bad link structures, unnecessary internal 302 redirects, pages that aren’t accessible by text links, and other problems aren’t addressed, then pages that might be indexed because of the XML sitemap will continue having problems ranking well in search results.

    I’ve seen XML sitemap plugins fail as well. If the other problems that might keep the pages of a site from getting crawled have been addressed, that’s not a problem. In that instance, the only thing that the XML sitemap might be doing is to help pages get indexed more quickly.

    Some potential problems?

    I’ve seen pages that get updated daily or weekly with a monthly change frequency indicated in their XML sitemap.

    Including the same page under different URLs within an XML sitemap defeats one purpose of an XML sitemap – helping search engines identify the canonical version of a page.

    Not updating an XML sitemap when the pages of a site are updated is also something to avoid.

    It doesn’t hurt to make sure that you validate your sitemap through Google Webmaster Tools.

    If you create a video sitemap, it helps to include as much information as possible, including a thumbnail image.
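The duplicate URL problem above is one that a short script can flag. The sketch below parses a sitemap and groups entries that differ only by a trailing slash or by the case of the host name. Treating those as the same page is an assumption on my part (some servers do serve them differently), and the `normalize` and `find_duplicate_urls` helpers and the sample sitemap are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlsplit

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def normalize(url):
    """Reduce a URL to a comparison key (assumed equivalence rules)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return (parts.scheme.lower(), parts.netloc.lower(), path, parts.query)


def find_duplicate_urls(sitemap_xml):
    """Return groups of <loc> values that normalize to the same key."""
    root = ET.fromstring(sitemap_xml)
    groups = defaultdict(list)
    for loc in root.findall("sm:url/sm:loc", NS):
        if loc.text:
            groups[normalize(loc.text)].append(loc.text.strip())
    return [urls for urls in groups.values() if len(urls) > 1]


# Illustrative sitemap with one page listed under two URLs.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/faq</loc></url>
  <url><loc>http://www.example.com/faq/</loc></url>
  <url><loc>http://WWW.example.com/about</loc></url>
</urlset>"""

duplicates = find_duplicate_urls(sample)
print(duplicates)
```

A check like this only catches the simplest cases; session IDs, tracking parameters, and alternate host names would need rules specific to the site.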

    I’ll look forward to your post.

    Working for the Courts was an interesting experience. While I started out there because of my legal education, my work with the Court evolved in a technical direction as my interests started moving that way. I was fortunate to have had that opportunity. :) I’d like to hear more about your experiences with the judicial system.

  23. Hi effisk,

    The study does provide a breakdown of the different types of sitemap formats that it found in looking at around 35 million sites with sitemaps. Text-based sitemaps are referred to in their table as “URL List.” Here’s the breakdown by percentage:

    XML Sitemap 76.76
    Unknown 17.51
    Url List 3.42
    Atom 1.61
    RSS 0.11

    It doesn’t provide any other information on sitemaps that use a text-based format.

  24. Interesting read, I was looking through your archives and the title sounded interesting. Even without this study, I think xml sitemaps should be a part of every website. They tell search engines what new pages have been added to your site, and in the majority of cases will get your pages indexed faster.

    Regards,
    Omar

  25. Sitemaps are pretty useful to get content indexed quicker, especially for relatively new websites. Note that it does not mean your pages WILL be indexed, but increases the likelihood and speed.

    If you are running a blog, in addition to xml sitemaps I find pinging certain services (especially Google’s blog service) gets content indexed quickly. Sometimes I find my content indexed within minutes using this method.

    Regards,
    OZ

  26. Hi Omar and Oz,

    XML sitemaps can help a search engine discover new URLs that they may then consider indexing, but there are many other considerations that a search engine takes into account before they will index a page. The Google study appears to show that it might be more likely that a search engine will index pages from a site if it discovers a URL from that site in an XML sitemap than if it discovers it on a crawl through the web, and that it may index the page at that URL quicker.

    Pinging is definitely a good idea if you’re running a blog, and you want to let a search engine know that you’ve updated. I’ve seen pages show up in search engine indexes in minutes, most likely because the search engine has been pinged about a new post. But again, the rate at which a search engine will index pages from a site can rely upon more than whether you’ve pinged a search engine or not.

  27. Most of the clients that we work with have never even heard of a sitemap.xml, much less have one on their site. This is one of the first things we do when we start work on a website, to make sure that all of the pages are getting crawled and indexed.

  28. Hi Chris,

    One of my first steps as well – making sure that the pages that should be crawled and indexed on a site are in fact being crawled and indexed.

    An XML sitemap can help search engines discover URLs on a site that they might not have seen before, but it’s absolutely no substitute for an intelligent linking structure between the pages of a site. Given a choice between adding an HTML sitemap to a site and an XML sitemap, I’d probably go with the HTML sitemap most of the time.

  29. I’ve definitely found that for my wordpress blogs the xml sitemaps get me indexed very quickly. Plus, since I run a couple dozen blogs, the time savings of the xml sitemap plugin is a must have. As to non wordpress sites, the information provided on this post has pointed out some issues I hadn’t previously considered. Great Content. Thanks.

  30. Hi Pete,

    Thank you. The wordpress plugins for XML sitemaps can be really helpful. Really large sites, and large sites that are updated very frequently, do have some special issues that make planning for them very important. The Google paper provided some really nice examples of those.

  31. Dear Bill, we are now in 2012, and I think sitemaps are not beneficial to websites. XML sitemaps are indexed by Google. It means you see them in Google results. It means you lose Google juice because of them, as they don’t have any productive content.
    I did extensive tests. A sitemap is not beneficial at all. I even think it slows down indexing by adding one cycle. Try it in Webmaster Tools: comparing indexed versus submitted counts, you can see that pages or images submitted through a sitemap are slow to get indexed. A sitemap does not speed up indexing.
    It also does not help to get rid of duplicates. It also does not help to get more PageRank.
    The benefits of sitemaps are urban legends. I challenge everyone to give evidence that sitemaps speed up indexing and crawling.