Google on the Crawling of Web Sites

When I talk with someone about how a search engine works, I find it convenient to break the process down into three parts, because there are three primary functions that a search engine performs.

These three parts are Crawling, Indexing, and Serving Results. I like using this three-part breakdown because it makes it easier to explain how each of those parts works by itself, and together with the other parts.

A patent granted to Google today, and originally filed in 2000, explores the first of those parts – the crawling of web pages.

This is an interesting area, because having some knowledge of it might help to explain why some pages on the Web get indexed, and why some other pages might not. There are a couple of links that I like to point people towards when I talk about Google and crawling web pages.

The first of those is a page on the Stanford web site, which provides us with a list of Working Papers Concerning the Creation of Google. If you would like to explore some of the technological approaches and processes behind Google, it doesn’t hurt to look at some of the papers listed on that page.

One of those listed papers focuses on how a crawling program might prioritize the URLs it finds, so that some pages get crawled before others. For instance, one choice might be to try to crawl and index as many pages as possible from the root directories of many websites instead of trying to crawl all of the pages of one large website. The paper is Efficient Crawling Through URL Ordering.

It’s a paper that I recommend highly for anyone wanting to learn about some of the potential hurdles that a search engine faces when crawling pages.
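To make the root-directory example above a little more concrete, here is a minimal Python sketch. It is only a toy illustration of that one ordering choice, not the method from the paper or anything Google is known to run. The function names `path_depth` and `order_frontier` and the example URLs are my own assumptions.

```python
from urllib.parse import urlsplit


def path_depth(url: str) -> int:
    """Return how many directories deep a URL's page sits (0 = root directory)."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    if path.endswith("/"):
        # A trailing slash means every segment is a directory.
        return len(segments)
    # Otherwise the last segment is the page itself, not a directory.
    return max(len(segments) - 1, 0)


def order_frontier(urls):
    """Order discovered URLs so that shallow pages are crawled first.

    sorted() is stable, so URLs at the same depth keep their discovery order.
    This depth-first-by-shallowness rule is a hypothetical stand-in for the
    richer importance metrics a real crawler would use.
    """
    return sorted(urls, key=path_depth)


if __name__ == "__main__":
    discovered = [
        "http://big.example/a/b/c/deep.html",
        "http://one.example/",
        "http://big.example/a/page.html",
        "http://two.example/index.html",
    ]
    for url in order_frontier(discovered):
        print(path_depth(url), url)
```

With an ordering like this, a crawler would touch the front pages of many sites before working its way down into the deeper pages of any single large site.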

The Google patent covers some issues that the paper doesn’t include, such as how crawling of pages from a specific domain or IP address might be scheduled so that it doesn’t impose too large an impact upon a server, by looking at a stall time specified between accesses of a page by a search engine. So the priority of what page might be crawled next may also be influenced by what kind of strain might be placed upon a server.
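As a rough illustration of the stall time idea, here is a short Python sketch that groups URLs by host and only hands out a URL from a host once that host's stall time has passed. This is my own guess at one way to do it under simple assumptions (a single fixed stall time, times passed in as plain numbers); the class name `HostScheduler` and its methods are hypothetical and do not come from the patent.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class HostScheduler:
    """Hypothetical per-host queue that respects a stall time between fetches."""

    def __init__(self, stall_seconds: float = 10.0):
        self.stall_seconds = stall_seconds  # assumed fixed delay per host
        self.queues = {}    # host -> deque of URLs waiting to be crawled
        self.ready_at = {}  # host -> earliest time the host may be crawled again
        self.heap = []      # (earliest crawl time, host) for hosts with waiting URLs

    def add(self, url: str, now: float = 0.0) -> None:
        host = urlsplit(url).netloc
        queue = self.queues.setdefault(host, deque())
        if not queue:
            # The host is not in the heap yet; schedule it, but never earlier
            # than a stall time left over from a previous fetch.
            start = max(now, self.ready_at.get(host, now))
            heapq.heappush(self.heap, (start, host))
        queue.append(url)

    def next_url(self, now: float):
        """Return a URL whose host may be crawled at time `now`, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        url = self.queues[host].popleft()
        self.ready_at[host] = now + self.stall_seconds
        if self.queues[host]:
            # More URLs are waiting on this host; it becomes eligible again later.
            heapq.heappush(self.heap, (self.ready_at[host], host))
        return url


if __name__ == "__main__":
    scheduler = HostScheduler(stall_seconds=10)
    for u in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
        scheduler.add(u, now=0)
    for t in (0, 0, 0, 10):
        print(t, scheduler.next_url(now=t))
```

The heap keeps the host with the earliest allowed crawl time on top, so picking the next host is cheap even when many hosts are waiting.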

The patent filing also discusses the need for a distributed set of crawling programs, so that different pages on the Web can be crawled in a timely manner.
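One simple way to split work among several crawling programs is to assign each host to exactly one crawler, so that a single program is responsible for not overloading that host. The sketch below does this with a stable hash of the host name. To be clear, this partitioning scheme is my own assumption for illustration, and the function name `crawler_for` is hypothetical; I am not describing how the patent divides the work.

```python
import hashlib
from urllib.parse import urlsplit


def crawler_for(url: str, num_crawlers: int) -> int:
    """Pick a crawler index for a URL based on a stable hash of its host.

    Every URL on the same host maps to the same crawler, which lets that
    crawler alone keep track of how often the host is visited.
    """
    host = urlsplit(url).netloc.lower()
    # md5 is used only as a stable, well-distributed hash, not for security.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


if __name__ == "__main__":
    for u in ["http://a.example/1", "http://a.example/2", "http://b.example/"]:
        print(crawler_for(u, num_crawlers=4), u)
```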

The patent discusses the role of PageRank in determining crawl prioritization, how PDF and PostScript files may be converted to text files before their content is sent to an indexing program, and how known high-output sites such as AOL might be crawled more quickly than other sites, since they can handle the impact of a greater number of crawler visits at one time.

Distributed crawling of hyperlinked documents

Inventors: Jeffrey A. Dean, Craig Silverstein, Benedict Gomes, and Sanjay Ghemawat
Assigned to Google
United States Patent 7,305,610
Granted December 4, 2007
Filed: August 14, 2000

Abstract

Techniques for crawling hyperlinked documents are provided. Hyperlinked documents to be crawled are grouped by host and the host to be crawled next is selected according to a stall time of the host. The stall time can indicate the earliest time that the host should be crawled and the stall times can be a predetermined amount of time, vary by host and be adjusted according to actual retrieval times from the host.
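The abstract says a stall time can be predetermined, can vary by host, and can be adjusted according to actual retrieval times. Here is a minimal Python sketch of what such an adjustment could look like. The formula, the default numbers, and the names `next_stall_time` and `HOST_BASE_STALL` are all my own assumptions for illustration; the abstract doesn't give a formula.

```python
# Hypothetical per-host minimum delays, in seconds. Hosts known to handle
# heavy traffic could be given a shorter delay than the default.
HOST_BASE_STALL = {"fast.example": 1.0}


def next_stall_time(host: str,
                    finished_at: float,
                    retrieval_seconds: float,
                    default_base: float = 5.0,
                    factor: float = 10.0,
                    max_stall: float = 300.0) -> float:
    """Return the earliest time `host` should be crawled again.

    The delay is at least the host's base stall, grows with how long the
    last retrieval took (a slow response suggests a busy server), and is
    capped at `max_stall` so a host is never postponed indefinitely.
    """
    base = HOST_BASE_STALL.get(host, default_base)
    delay = max(base, factor * retrieval_seconds)
    return finished_at + min(delay, max_stall)


if __name__ == "__main__":
    print(next_stall_time("slow.example", finished_at=100.0, retrieval_seconds=2.0))
    print(next_stall_time("fast.example", finished_at=100.0, retrieval_seconds=0.05))
```

The design choice here is that a server which answers slowly gets visited less often, while one that answers quickly falls back to its base delay.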

ps. It’s the first day of Pubcon, and I’m off to registration.

9 thoughts on “Google on the Crawling of Web Sites”

  1. This is a timely post for me, since I have learned the hard way about the Google search engine’s crawling, indexing and serving of results as they relate to recent changes to my own site.

    I made changes to my site several weeks ago that evidently did not please the Google search algorithm. Once the changed site was crawled by the Google spider, it took approximately 5 days for the new site to appear in Google’s search results. It then took only approximately 24 hours for me to have a major fall from some key search terms that I was ranking very highly on.

    Knowing that the changes did not work, I replaced them with the “old site”. Again it took 5-6 days for the “old site” to get crawled and appear in Google’s search results, and even though the “old site” is starting to climb again in the Google search rankings, it is taking a lot longer to get back on top than it took to fall.

    Remember it took around 24 hours to fall and now I am going on day 4 and I am slowly climbing back, but I still have a long way to go. If you have something that is working, don’t step on your own feet with major changes. I think incremental changes are more digestible for Google’s search engine.

  2. Good to hear that you seem to be moving back in the right direction.

    Changes that you make to your website are always going to carry some level of risk with them, and it really helps to check everything twice when you do make changes – including doing things like using HTML validation programs and link checking programs to add another way to try to find any mistakes that might have been missed.

    It can be difficult to tell how a search engine might react to changes that you make, so it does help to try to follow as many best practices as possible. If you’re not certain, then the incremental approach can be a good idea.

    Another thing to be concerned about is that you aren’t making those changes in a vacuum – other site owners are doing things with their sites, too.

  3. I’ve been told that the links above aren’t working right now. I don’t know if that is a temporary problem, or a long term one. Here are the names of the papers (and their authors) listed on that Stanford page:

    • The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin, Lawrence Page
    • Dynamic Data Mining: Exploring Large Rule Spaces by Sampling by Sergey Brin, Lawrence Page
    • Computing Iceberg Queries Efficiently by Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, Jeffrey D. Ullman
    • The PageRank Citation Ranking: Bringing Order to the Web by Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd
    • Extracting Patterns and Relations from the World Wide Web by Sergey Brin
    • Finding near-replicas of documents on the web by Narayanan Shivakumar, Hector Garcia-Molina
    • Efficient Crawling Through URL Ordering by Junghoo Cho, Hector Garcia-Molina, Lawrence Page

    Hopefully the Stanford server is just down temporarily.

  4. Hi,

    You have given a clear idea about the theoretical part, but what about the practical program? Try to post the code for crawling the webpages.

  5. Hi Sirisha,

    I can give you some information about the theoretical part, but the practical program isn’t something that I can provide.

    The patent application provides some information about a process that Google may have developed, but not the code itself.

  6. My site got indexed, but only 1 page. I have now been waiting for weeks for my updated website keywords and other pages to show up. It takes forever for Google to do anything with your site if it has no PR. www.workonlineathome.co.uk

  7. Hi Lee,

    That’s one of the paradoxes of search engine indexing when links play such an important role in the indexing of pages – regardless of how good your site or content might be, it really helps to have links pointing to your pages to convince Google to spend some time crawling your site and indexing that content. Having great content can really help you get links. But it’s hard for anyone to find your content if you don’t have links.

Comments are closed.