Google Webmaster Tools Patent on Crawl Rates

Google’s Webmaster Tools offers website owners tools and reports for learning more about how the search engine views their sites, and for making it easier for the search engine to index the pages of a site.

A patent granted to Google today involves one of the tools included within Webmaster Tools, which lets webmasters set different crawling rates on their websites for Google’s crawling programs. The description from the patent doesn’t limit itself to that tool, and describes other processes involving the webmaster tools. These include:

  • The verification process used by owners to claim ownership of their site to use Webmaster tools,
  • The generation of XML sitemaps,
  • How XML sitemaps may be crawled by the search engine,
  • Setting a preferred version of a domain (such as with or without a “www”),
  • Informing the search engine of a move of a site to a new domain,
  • Setting a crawling rate for a site.

The patent is:

System and method for enabling website owners to manage crawl rate in a website indexing system
Invented by Vanessa Fox, Amanda Ann Camp, Maximilian Ibel, Patrik Rene Celeste Reali, Jeremy J. Lilley, Katherine Jane Lai, Ted J. Bonkenburg, and Neal Douglas Cardwell
Assigned to Google
US Patent 7,599,920
Granted October 6, 2009
Filed October 12, 2006

Abstract

Web crawlers crawl websites to access documents of the website for purposes of indexing the documents for search engines. The web crawlers crawl a specified website at a crawl rate that is based on multiple factors. One of the factors is a pre-set crawl rate limit. According to certain embodiments, an owner for a specified website is enabled to modify the crawl rate limit for the specified website when one or more pre-set criteria are met.

XML Sitemaps

The patent’s description section begins by providing a good amount of detail on how an XML sitemap generator might be used to create XML based files that a search engine can use to learn about pages that it might find on a web site. An XML sitemap doesn’t just include a list of URLs, or addresses of web pages found on a site, but could also provide metadata associated with those pages, such as the last time a page was modified or accessed.
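To make the idea of URLs plus metadata concrete, here is a minimal sketch in Python of a generator that writes a sitemap with a last-modified date for each page. It follows the public sitemaps.org format (the urlset, url, loc, and lastmod elements) rather than anything specific to the patent, and the function name build_sitemap and the example addresses are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace from the public sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified) pairs.

    last_modified is an ISO date string such as "2009-10-06", or None
    when no modification date is known for the page.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        if last_modified:
            # Optional metadata: when the page last changed.
            ET.SubElement(entry, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        ("http://www.example.com/", "2009-10-06"),
        ("http://www.example.com/about", None),
    ]))
```

Pages without a known modification date simply get no lastmod element, which keeps the file valid while leaving the crawler to decide how to treat them.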

While Google has provided a great deal of information about XML Sitemaps on the help pages of their site, the patent provides some additional features that we may or may not see in the future. Some of them are interesting.

For example, we’re told that a sitemap index might at some point contain specific information about a site, such as different crawling rates for the site at different times.

<crawl_rate from=08:00UTC to=17:00UTC>medium</crawl_rate>

<crawl_rate from=17:00UTC to=8:00UTC>fast</crawl_rate>
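As a rough illustration of how a crawler might read time windows like these, here is a sketch in Python. The patent doesn't describe a parser, so everything here is an assumption on my part: the function names parse_crawl_rates and rate_at, the regular expression (written to tolerate the unquoted attributes used in the patent's examples), and the choice to treat the start of a window as included and the end as excluded. The one detail worth noticing is that the second window runs past midnight, so it needs its own comparison.

```python
import re
from datetime import time

# Hypothetical pattern for the patent's example elements; it accepts
# unquoted or quoted attribute values and stray spaces around "=".
CRAWL_RATE_RE = re.compile(
    r'<crawl_rate\s+from\s*=\s*"?(\d{1,2}):(\d{2})UTC"?'
    r'\s+to\s*=\s*"?(\d{1,2}):(\d{2})UTC"?\s*>'
    r'\s*(\w+)\s*</crawl_rate>'
)


def parse_crawl_rates(text):
    """Return a list of (start, end, rate) windows found in text."""
    windows = []
    for h1, m1, h2, m2, rate in CRAWL_RATE_RE.findall(text):
        windows.append((time(int(h1), int(m1)), time(int(h2), int(m2)), rate))
    return windows


def rate_at(windows, now, default="default"):
    """Pick the crawl rate whose window contains the UTC time `now`."""
    for start, end, rate in windows:
        if start <= end:
            inside = start <= now < end
        else:
            # Window wraps past midnight, e.g. 17:00 to 08:00.
            inside = now >= start or now < end
        if inside:
            return rate
    return default


if __name__ == "__main__":
    example = (
        "<crawl_rate from=08:00UTC to=17:00UTC>medium</crawl_rate>\n"
        "<crawl_rate from=17:00UTC to=8:00UTC>fast</crawl_rate>"
    )
    windows = parse_crawl_rates(example)
    print(rate_at(windows, time(9, 30)))
    print(rate_at(windows, time(23, 0)))
```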

Other site specific information in an XML sitemap could include geographic location information associated with a site, or information about languages supported by a site:

<location>latitude, longitude</location>

<language>German</language>

The XML sitemap generator described in the patent might also look at the access logs for a site to find URLs that result in error messages, so that those aren’t included in the sitemap. It could check to see how popular some pages are by how often they are visited, and could schedule more popular pages to be crawled first and more frequently.
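Here is a small Python sketch of that access log idea. The patent doesn't give an algorithm or a log format, so I'm assuming a server log in the Common Log Format, a hypothetical function name rank_urls, and a deliberately simple policy: any URL that ever returned a 4xx or 5xx status is left out, and the rest are ordered by how many successful requests they received.

```python
import re
from collections import Counter

# Matches the request and status fields of a Common Log Format line,
# e.g. "GET /index.html HTTP/1.1" 200
LOG_LINE_RE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3})\b')


def rank_urls(log_lines):
    """Return paths ordered by successful hits, most visited first.

    Paths that produced any error response (status 400 or above) are
    excluded entirely, a simplifying assumption for this sketch.
    """
    hits = Counter()
    errored = set()
    for line in log_lines:
        match = LOG_LINE_RE.search(line)
        if not match:
            continue  # Skip lines that are not in the expected format.
        path, status = match.group(1), int(match.group(2))
        if status >= 400:
            errored.add(path)
        elif 200 <= status < 300:
            hits[path] += 1
    return [path for path, _ in hits.most_common() if path not in errored]


if __name__ == "__main__":
    sample = [
        '1.2.3.4 - - [06/Oct/2009:10:00:00 +0000] "GET /a HTTP/1.1" 200 100',
        '1.2.3.4 - - [06/Oct/2009:10:00:01 +0000] "GET /b HTTP/1.1" 200 100',
        '1.2.3.5 - - [06/Oct/2009:10:00:02 +0000] "GET /b HTTP/1.1" 200 100',
        '1.2.3.6 - - [06/Oct/2009:10:00:03 +0000] "GET /gone HTTP/1.1" 404 0',
    ]
    print(rank_urls(sample))
```

The resulting order could then feed a sitemap generator, with the most visited pages listed, or scheduled for crawling, first.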

Conclusion

The patent also provides a fair amount of detail on tools that could be used to set up a preferred version of a domain name, such as with or without a “www” in the name, or to tell the search engine that a redirect has been set up from one domain to another to change the address of a web site. While the webmaster tools have been around for a while, that change of address feature was only announced in June of this year. So, some aspects of Google’s webmaster tools described in this patent appear to still be rolling out.
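For readers who haven't worked with preferred domains, the following Python sketch shows the basic idea of mapping the “www” and non-“www” versions of an address to a single preferred form. It isn't taken from the patent; the function name to_preferred_host is hypothetical, and the sketch assumes the URL belongs to the site's main domain rather than to some other subdomain.

```python
from urllib.parse import urlsplit, urlunsplit


def to_preferred_host(url, prefer_www=True):
    """Rewrite url so its host uses the preferred www or non-www form.

    Assumes the URL is on the site's main domain; any user name or
    password in the URL is dropped.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is already lowercased
    bare = host[4:] if host.startswith("www.") else host
    preferred = "www." + bare if prefer_www else bare
    # Keep an explicit port if the original URL had one.
    netloc = preferred if parts.port is None else f"{preferred}:{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(to_preferred_host("http://example.com/page?x=1"))
    print(to_preferred_host("http://www.example.com/", prefer_www=False))
```

A site would normally pair a rule like this with a 301 redirect on the server, so that visitors and crawlers both end up at the preferred version.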

The process to verify that you are an “owner” of a site, so that you can use the webmaster tools, is described in great detail, as is information about setting crawling rates for a site.

If you use Google’s webmaster tools, you might recognize a number of the features described in the patent. If you haven’t used the webmaster tools, you might find the descriptions of those from the patent interesting.


29 thoughts on “Google Webmaster Tools Patent on Crawl Rates”

  1. Bill

    Really surprised, and left wondering whether one could dictate the site crawling rate at will. Rather, my school of SEO thought is that if you update the content of the website frequently (or rather, on a regular basis), or make major changes to the website equivalent to a content update, then you are inviting the search engine crawler to the party naturally. But lo! Here we have a technique that dictates the crawling rate! And how about this tag that acts as a complement to the Google Webmaster Tools setting for crawl rates?

    I am really unsure of the credibility of this tag. Perhaps, for a website (say a corporate website) that doesn’t change its content too much, can changing the crawl rate and inserting a content revisit tag influence the site crawling rate? But any site that gets updated regularly will, I feel, have a much better crawling rate.

    And Bill as ever excellent primer on Crawl Rate (but this time bit Google oriented!)

  2. And here is the tag I was referring to:

    Sorry for this tag on the previous comment altogether!

  3. Hi Shameer,

    Not seeing your tag. You may have to use &-l-t-; (remove the hyphens) for the opening brackets, and &-g-t-; (remove the hyphens) for the closing brackets of your HTML to show me which tag you mean. If you’re referring to a “revisit” tag like one that is sometimes seen in the head HTML sections on some sites, that originated with the Search BC directory out of British Columbia and was never used by any of the major search engines.

    All things considered, I think it’s a good idea for search engines to provide ways for webmasters to interact with them in meaningful ways. If you don’t block a search engine crawler in your robots.txt file, those programs should come to visit anyway, but they will likely limit their crawling to a polite level that doesn’t cause problems with other visitors seeing your pages.

    If you have a fairly robust server, and a lot of pages on your site, and can handle a faster crawling rate from a search engine, being able to communicate that to them is nice. That’s what this patent provides.

    The description from the patent also gives us some insight into other things going on in webmaster tools, and with Google’s XML sitemap generator and crawling process.

  4. Shameer.S,

    If you’re referring to the crawl_rate tag we mention in the patent, I don’t believe that’s been implemented.

    Currently, the primary explicit control mechanism for site owners regarding crawl rate is the feature in webmaster tools. One primary purpose is that Googlebot doesn’t infinitely crawl a site. One reason Googlebot might not crawl an entire site is that it doesn’t want to crawl the site too much and inadvertently take down the server. So if Googlebot is limiting the crawl for that reason, a large site can go into webmaster tools and indicate that the server can handle a higher load. That enables Googlebot to crawl more, and the site can be more comprehensively crawled/indexed.

  5. I am still finding that google is still crawling my sites regularly with the crawl rate left at default setting in webmaster tools. I cannot imagine server overload being a major issue for any but the biggest web sites.

  6. Yes me too, I observed the Big G’s little crawlers are on my FTP on IIS regularly and crawling every link in it, be it a picture, a file, just about everything. I do second the motion of pwilliams about server overload a major issue.

  7. From my perspective, I prefer that Google just spider my site whenever the bot feels like it. Trying to set up a time period might, to me, be gaming the system a little, and I really don’t see any value in it for a couple of reasons. If you are actively adding articles to your site and doing the necessary back linking, then G will find the site naturally. In fact, G has discussed at length the issue of making posting and linking to and from our own websites look natural.

    I have read in other threads that when you submit your URL to G, they may in fact delay spidering your site. So far, for my home based business sites, I notice the different bots coming to all of my sites on a regular basis.

    Will have to wait and see what happens with this new update and more importantly if someone does some testing to see any better rankings from this change.

    Submit good content all over the place, rinse and repeat – the bots will spider your site regularly.

    Cheers

  8. Hi Vanessa,

    Thanks for your comment and response to Shameer’s comment. I don’t believe that the crawl_rate element is something that webmasters can presently include in an XML sitemap for Google either. It is an interesting idea, to be able to set different crawl rates for different times of the day, and make it less likely that a site would be crawled heavily by Googlebot during times of a day when a site is typically at its busiest.

  9. Hi Darrell,

    Changing the crawl rate settings for a site through Google’s Webmaster Tools doesn’t tell Google when to crawl your site, nor does it even determine how deeply your site might get crawled or indexed. It’s not going to increase your rankings, or replace the need for backlinks or well written content, nor “game” Google in any way.

  10. Hi pwilliams,

    I think the default setting for the crawling of websites by most of the major commercial search engines tends to be pretty conservative. We’re told in the section of the Webmaster tools where crawl rate can be set as faster or slower:

    Our goal is to crawl as many pages from your site as we can on each visit without overwhelming your server’s bandwidth. You can change the crawl rate (the time used by Googlebot to crawl the site) for sites that are at the root level – for example, “www.example.com” and “http://subdomain.example.com.” The new custom crawl rate will be valid for 90 days.

    Chances are for some sites, that if they are on a robust server and are fairly well designed, that they can handle a faster crawl rate. That might be a good thing, especially if those sites are bigger and have lots of pages. We’re also told in that crawling rate section that:

    The crawl rate affects the speed of Googlebot’s requests during the crawl process. It has no effect on how often Googlebot crawls your site. Google determines the recommended rate based on the number of pages in your site.

    So larger sites will have faster default crawl rates, but chances are that the default rate at which those sites are crawled are also set so that they don’t overwhelm a server, either. If they are, it is possible to set a slower crawl rate in webmaster tools as well.

  11. Hi CI Web Studio,

    It’s great to be able to change settings like those in webmaster tools to avoid the possibility that a site might be crawled too fast. Changing a setting in webmaster tools to make the crawl rate slower might help. But, there are other steps that a webmaster can take, such as optimizing a site so that it can handle a default crawl rate – such as making sure that the hosting plan a site is using is adequate for the site’s use, and making sure that the site uses techniques that make it more capable of handling higher rates of traffic, such as using gzip compression, as well as intelligently compressed and properly sized images, and other things that were discussed in a recent post and comments here on the latency of a web site. See: Does Page Load Time influence SEO?.

  12. Hi Bill. Agree with your comments to Darrell above. Have you done any testing of this feature on large sites? We tried to increase the crawl rate on a site with over 1M URLs and found the actual crawl rate didn’t change much at all.

  13. Bill, so does this mean that the new or updated google webmaster tools, will allow for multiple domain names to point back to the same site? Specifically, with the ability to manage these domain names? for example “www.mysite.com” and/or “www.mysite.net”. I only ask, because i have several clients that are regularly buying up domains that are associated with their company/brand names. Thanks

  14. Hi Bullaman,

    Thank you. I haven’t tested the crawl rate feature with any large sites – with crawl rates set automatically to be proportional to the size of the site, it seemed pretty reasonable to let Google go at sites with the speed that it had been following.

  15. Hi Chicago Web Design,

    There is a feature that is now included in Google’s Webmaster Tools that allows you to “change” an address of a site so that if you have a number of pages from that site indexed by Google, and you set up a 301 redirect to the new domain, it lets Google update their index more quickly to reflect the new domain. See:

    http://www.google.com/support/webmasters/bin/answer.py?answer=83106

    If someone isn’t actually “moving” but has bought domains that they want to go to the main domain for a company, a 301 redirect by itself may be all that they really want to do. A good example is where Disney redirects “www.mickeymouse.com” (likely both protecting their brand, and anticipating some type-in traffic) to “http://disney.go.com/mickey/”

  16. Hi Bill,

    I recently watched a video that seomoz did with a top google engineer last friday and they were talking about the exact same thing. i.e. setting the crawl rate slower. SEOMOZ asked the question if you set the crawl rate slower does that mean that Google won’t index as many pages and the google engineer agreed. So the moral is! Don’t mess with the crawl rate! Thanks

  17. Hi Precise Internet Marketing,

    Thanks. I think it makes sense to not make changes, such as slowing down the crawl rate, unless it seems like it’s absolutely essential. And if it does seem to be essential, looking at other ways to optimize the site to reduce required bandwidth, or changing hosting to increase bandwidth available to you is a first step to take.

  18. pretty interesting idea. due to the international traffic we get, we don’t really have a slower time to allow for more crawling.

    as for setting a crawl rate having no effect on how fast google crawls the site, I can say that I upped it from “default” and the number of pages crawled every day went up. coincidence — must be.

  19. Hi Matt,

    Good to hear that the amount of pages crawled each day went up for you.

    Search engines will look at other things than just the Webmaster tools setting for crawling rates to determine how many pages of a site they will crawl, and how deeply through a site that they do. One of the early papers that Google relied upon when they first started was Efficient Crawling Through URL Ordering (pdf) which describes some of the importance metrics that they followed in deciding which URLs to crawl next when they had collected a bunch from other pages. While the paper is old, the ideas behind it still carry some weight. For instance, if Google had to decide between crawling a million pages from one web site, or a million home pages from a million web sites, it might choose to crawl the million home pages, because that would give it a much more diverse set of results that it could show searchers.

    The idea of being able to set a crawl rate isn’t so much tied to how many pages a search engine might crawl on your site, but rather how many it might crawl while being “polite” and not tying up so many of your server resources that it slows your site down considerably, or makes it inaccessible to other visitors. It sounds like Google might have been a little too conservative with its crawling of your site, meaning it was willing to crawl more pages than it had been based upon other things, like the importance metrics I mention, but was limiting itself to fewer pages to be “polite.” For some sites that have robust enough hosting and are optimized for page load times, increasing the webmaster tools crawl rate can be a good idea.

  20. Bill

    With reference to your last comment to Matt, I just have a creepy feeling [though I may be a simpleton at times :)] that Google uses one of its main servers to index home pages on a regular basis, and indexes sub-pages through its incremental servers, which might take some time to update. This technique, I presume, might influence the crawling rate, leading to different crawling times.

  21. Hi Shameer,

    That’s one possibility, though there are other metrics that a crawling program may follow as well, that can depend upon things such as PageRank, and a past frequency of updates. So it’s possible that some crawlers may focus upon home pages with high PageRanks that are frequently updated.

  22. Bill,

    If changing Google’s crawl rate settings doesn’t affect anything about how Google crawls your site, then what does it do? Can we gain something if we change it? I use Google’s default crawl settings on my websites, though, but it’s interesting to know what lies ahead of us if we change the settings.

  23. I believe the setting just works in one direction. In other words, you can tell Google to crawl your site less often than the default, but not more often.

  24. Hi Kevin,

    The Google Webmaster Tools settings for crawling shows a slider with “slower” on one end and “faster” on the other end. You can request that they crawl your pages faster, and that might not be a bad idea if you make some changes to your site, like moving to a much faster server. But chances are that Google would notice at some point that your server is more responsive.
