When I talk with someone about how a search engine works, I find it convenient to break the process down into three parts, because there are three primary functions that a search engine performs.
These three parts are Crawling, Indexing, and Serving Results. I like using this three-part breakdown because it makes it easier to explain how each of those parts works by itself, and together with the other parts.
A patent granted to Google today, and originally filed in 2000, explores the first of those parts – crawling websites.
This is an interesting area because having some knowledge of it might help to explain why some pages on the Web get indexed, and why some other pages might not. There are a couple of links that I like to point people towards when I talk about Google and crawling websites.
The first of those is a page on the Stanford website, which provides us with a list of Working Papers Concerning the Creation of Google. If you would like to explore some of the technological approaches and processes behind Google, it doesn’t hurt to look at some of the papers listed on that page.
One of those listed papers focuses on how a crawling program might prioritize the URLs that it finds on the pages it visits. For instance, one choice might be to crawl and index as many pages as possible from the root directories of many websites, instead of trying to crawl all of the pages of one large website. The paper is Efficient Crawling Through URL Ordering. One of the authors of the paper is Lawrence Page, a co-founder of Google.
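To make the root-directory example a little more concrete, here is a minimal sketch in Python of a crawl frontier that hands out shallow URLs before deep ones. It is only an illustration of the idea, not the method from the paper or from Google; the function names `path_depth` and `order_frontier` and the example URLs are my own assumptions.

```python
import heapq
from urllib.parse import urlsplit


def path_depth(url):
    """Count directory levels in a URL's path (0 for a page in the site root)."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    # The last segment names the page itself unless the path ends with "/".
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return len(segments)


def order_frontier(urls):
    """Yield URLs shallowest first; ties keep their discovery order.

    Depth is a stand-in priority chosen for this sketch (an assumption).
    """
    heap = [(path_depth(u), i, u) for i, u in enumerate(urls)]
    heapq.heapify(heap)
    while heap:
        _, _, url = heapq.heappop(heap)
        yield url


# Hypothetical URLs, used only to show the ordering.
found = [
    "http://a.example/x/y/deep.html",
    "http://b.example/index.html",
    "http://a.example/about.html",
    "http://b.example/docs/guide.html",
]
ordered = list(order_frontier(found))
```

With this ordering, the root-level pages of both example sites come out before any of the deeper pages, which is the trade-off described above: broad coverage of many sites first, depth on any one site later.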
It’s a paper that I recommend highly for anyone wanting to learn about some of the potential hurdles that a search engine faces when crawling websites.
The Google patent covers some issues that the paper doesn’t include, such as how the crawling of pages from a specific domain or IP address might be scheduled so that it doesn’t place too great a load on a server. It does this by looking at a stall time specified between accesses by a search engine. So the priority of which page might be crawled next may also be influenced by how much strain a crawl might place upon a server.
The patent filing also discusses the need for a distributed set of crawling programs, so that different pages on the Web can be crawled on time.
The patent also discusses the role of PageRank in determining crawl prioritization, how PDF and PostScript files may be converted to text files before their content is sent to an indexing program, and how known high output sites such as AOL might be crawled more quickly than other websites, since they can handle the impact of a greater number of crawler visits at one time.
Distributed crawling of hyperlinked documents
Inventors: Jeffrey A. Dean, Craig Silverstein, Benedict Gomes, and Sanjay Ghemawat
Assigned to Google
US Patent 7,305,610
Granted December 4, 2007
Filed: August 14, 2000
Abstract
Techniques for crawling hyperlinked documents are provided. Hyperlinked documents to be crawled are grouped by host and the host to be crawled next is selected according to a stall time of the host. The stall time can indicate the earliest time that the host should be crawled and the stall times can be a predetermined amount of time, vary by host and be adjusted according to actual retrieval times from the host.
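The abstract doesn’t come with code, but the idea it describes can be sketched in a few lines of Python: group URLs by host, always look at the host with the earliest stall time, and push that time further out when the host responds slowly. This is a rough illustration under my own assumptions, not Google’s implementation. The class name `HostScheduler`, the one-second default stall, and the rule of waiting ten times the retrieval time are all made up for the example.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class HostScheduler:
    """Toy crawl scheduler: one URL queue per host, hosts ordered by stall time."""

    def __init__(self, default_stall=1.0):
        self.default_stall = default_stall  # assumed minimum wait, in seconds
        self.queues = {}  # host -> deque of URLs waiting to be crawled
        self.ready = []   # heap of (earliest time the host may be crawled, host)

    def add(self, url, now=0.0):
        host = urlsplit(url).netloc
        if host not in self.queues:
            self.queues[host] = deque()
            heapq.heappush(self.ready, (now, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Return (host, url) for the host with the earliest stall time,
        or None if no host may be crawled yet."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        return host, self.queues[host].popleft()

    def done(self, host, now, retrieval_seconds):
        """Record a finished fetch and schedule the host's next visit."""
        if self.queues[host]:
            # Assumed rule for this sketch: a slower host gets a longer stall.
            stall = max(self.default_stall, 10 * retrieval_seconds)
            heapq.heappush(self.ready, (now + stall, host))
        else:
            del self.queues[host]  # nothing left to crawl on this host


# Hypothetical hosts and timings, used only to show the behavior.
s = HostScheduler(default_stall=1.0)
s.add("http://a.example/1")
s.add("http://a.example/2")
s.add("http://b.example/1")
first = s.next_url(now=0.0)
s.done("a.example", now=0.0, retrieval_seconds=0.5)   # slow host: stall 5s
second = s.next_url(now=0.1)
s.done("b.example", now=0.1, retrieval_seconds=0.05)
too_early = s.next_url(now=1.0)
later = s.next_url(now=5.0)
```

In this run the slow host is left alone for five seconds after its first fetch, while the second host is crawled right away. A real distributed crawler would also have to split hosts across many machines, which this sketch leaves out.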
ps. It’s the first day of Pubcon, and I’m off to registration.
This is a timely post for me, since I have learned the hard way about the Google search engine’s crawling, indexing and serving of results as they relate to recent changes to my own site.
I made changes to my site several weeks ago that evidently did not please the Google search algorithm. Once the changed site was crawled by the Google spider, it took approximately 5 days for the new site to appear in Google’s search results. It then took only approximately 24 hours for me to have a major fall from some key search terms that I was ranking very highly on.
Knowing that the changes did not work, I replaced them with the “old site”. Again, it took 5-6 days for the “old site” to get crawled and appear in Google’s search results. Even though the “old site” is starting to climb again in the Google search rankings, it is taking a lot longer to get back on top than it took to fall.
Remember it took around 24 hours to fall and now I am going on day 4 and I am slowly climbing back, but I still have a long way to go. If you have something that is working, don’t step on your own feet with major changes. I think incremental changes are more digestible for Google’s search engine.
Good to hear that you seem to be moving back in the right direction.
Changes that you make to your website are always going to carry some level of risk with them, and it really helps to check everything twice when you do make changes. That includes using HTML validation programs and link checking programs as another way to find any mistakes that might have been missed.
It can be difficult to tell how a search engine might react to changes that you make, so it does help to try to follow as many best practices as possible. If you’re not certain, then the incremental approach can be a good idea.
Another thing to be concerned about is that you aren’t making those changes in a vacuum – other site owners are doing things with their sites, too.
I’ve been told that the links above aren’t working right now. I don’t know if that is a temporary problem, or a long term one. Here are the names of the papers (and their authors) listed on that Stanford page:
Hopefully the Stanford server is just down temporarily.
Hi,
You have given a clear idea about the theoretical part, but what about the practical program? Try to post the code for crawling the webpages.
Hi Sirisha,
I can give you some information about the theoretical part, but the practical program isn’t something that I can provide.
The patent application provides some information about a process that Google may have developed, but not the code itself.
My site got indexed, but only 1 page. I have been waiting for weeks now for my updated website keywords and other pages to show up. It takes forever for Google to do anything with your site if it has no PR. www.workonlineathome.co.uk
Hi Lee,
That’s one of the paradoxes of search engine indexing when links play such an important role in the indexing of pages. Regardless of how good your site or content might be, it really helps to have links pointing to your pages to convince Google to spend some time crawling your site and indexing that content. Having great content can really help you get links. But it’s hard for anyone to find your content if you don’t have links.