How Do Search Spiders Crawl Your Site?
Every web page has at least one unique address that people can reach it by. Sometimes a web page has more than one address, and that can be a problem.
For example, if I look for a digital piano on the target.com web site, I might find one at the following address:
If I remove each of the following parameters (parts of the address), or combinations of parameters, from the Target URL for the piano, and place what is left of the URL into my browser’s address bar, I still get the same page each time, but with shorter URLs:
I can keep on removing more parts of that URL until I finally get down to the following address, which still shows me the very same page:
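To illustrate, here’s a minimal Python sketch of that kind of parameter removal, using a made-up product URL and made-up parameter names rather than Target’s actual address:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def strip_params(url, remove):
    """Return the URL with the named query parameters removed."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in remove]
    return urlunparse(parts._replace(query=urlencode(kept)))

# Hypothetical product URL -- the parameter names are made up:
url = "http://www.example.com/product?id=12345&ref=sr_1&sid=abc123"
print(strip_params(url, {"ref", "sid"}))
# prints http://www.example.com/product?id=12345
```

Each pass strips away parameters that don’t change what the server sends back, leaving a shorter address for the same page.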
If I search for the Target page for the piano at Google, Yahoo, and Bing, using the query “Suzuki Home Digital Piano” (without the quotation marks), the URL that those search engines show me for the piano at Target is:
Both Google and Yahoo were granted patents this week on how they might address the problem of multiple addresses for the same page on a dynamic site when their search spiders crawl or index pages on the web. Google’s patent dates back to 2003, and Yahoo’s was filed in 2006. Coincidentally, both were granted on the same day.
It’s possible that search spiders from all three search engines are using similar approaches to solve this problem, but it doesn’t always work as cleanly as in the Target example.
If I decide that I want to buy a welder, I might go to the Sears web site, and choose something like the Craftsman MIG Welder with Cart.
If I browse through the site, I might find that welder at the following URL:
If I search on Google for the welder at Sears, I see the following address:
I can remove the last parameter, “?prdNo=1”, and I still get the same page.
Yahoo and Bing both deliver me to the shorter version of the URL:
Potential Search Problems on Sites with Extra Parameters
Extra parameters like the ones above often serve one of two purposes.
- URL parameters may be used to control how content is displayed on a page, such as how it might be sorted, how many results to show, or whether to include some additional navigation (such as links to other related pages).
- URL parameters may also be used to track visitors and their movements through a site.
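A site (or a crawler) that knew which parameters fall into which group could canonicalize URLs by keeping only the content-affecting ones. A rough sketch, with entirely hypothetical parameter names (a real site would need its own lists):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical classification -- a real site would need its own lists.
CONTENT_PARAMS = {"id", "sort", "page"}       # change what the page displays
TRACKING_PARAMS = {"sid", "ref", "campaign"}  # only track the visitor

def canonical_query(url):
    """Keep only content-affecting parameters, in a stable order."""
    parts = urlparse(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k in CONTENT_PARAMS)
    return urlunparse(parts._replace(query=urlencode(kept)))

print(canonical_query("http://www.example.com/list?sid=xyz&sort=price&id=42"))
# prints http://www.example.com/list?id=42&sort=price
```

Tracking parameters drop out, and sorting the survivors means the same page always maps to the same string.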
A search engine might inadvertently index the same page under more than one URL, with additional parameters showing.
For instance, if I search for “site:www.sears.com “Craftsman MIG Welder with Cart”” (without the outside quotation marks), I see around 2,600 pages listed as results, which include that phrase, such as the main product page, a main category page, and search results pages. Many of them are duplicates of each other, such as the following:
Is Google splitting PageRank between these pages? It’s possible.
I wrote a detailed post about Yahoo’s approach back when it was first published as a pending patent application entitled Same-Site Duplicate Pages at Different URLs. Google’s patent uses somewhat different language, but the ultimate goal is the same – finding a canonical URL for pages that might include more than one parameter that isn’t essential for displaying the content of a page.
As a side note, I was excited to see the Yahoo patent reference another post of mine, Solving Different URLs with Similar Text (DUST), as an “Other Reference.” That post describes a paper, Do Not Crawl in the DUST: Different URLs with Similar Text Extended Abstract, by Uri Schonfeld, Ziv Bar-Yossef, and Idit Keidar, who provide another approach to identifying “Different URLs with Similar Text.” Ziv Bar-Yossef joined Google not too long after the paper was published.
Google’s search spiders patent is:
Automatic generation of rewrite rules for URLs
Invented by Craig Nevill-Manning, Chade-Meng Tan, Aynur Dayanik, and Peter Norvig
Assigned to Google
US Patent 7,827,254
Granted November 2, 2010
Filed December 31, 2003
A rewrite component automatically generates rewrite rules that describe how uniform resource locators (URLs) can be rewritten to reduce or eliminate different URLs that redundantly refer to the same or substantially the same content. The rewrite rules can be applied to URLs received when crawling a network to increase the efficiency of the crawl and the corresponding document index generated from the crawl.
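The abstract suggests learning which parameters are removable by checking whether dropping them changes the page. A very rough sketch of that idea (not Google’s actual method), assuming a `fetch` callable that returns a page body:

```python
from hashlib import md5
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def fingerprint(html):
    """Cheap content fingerprint for comparing two fetched pages."""
    return md5(html.encode("utf-8")).hexdigest()

def removable_params(url, fetch):
    """Guess which query parameters can be dropped without changing content.

    `fetch` is an assumed callable that returns the page body for a URL;
    a real crawler would also have to hedge against dynamic page elements.
    """
    parts = urlparse(url)
    params = parse_qsl(parts.query)
    base = fingerprint(fetch(url))
    removable = set()
    for name, _ in params:
        shorter = urlunparse(parts._replace(
            query=urlencode([(k, v) for k, v in params if k != name])))
        if fingerprint(fetch(shorter)) == base:
            removable.add(name)
    return removable
```

Parameters that can be dropped from enough URLs without changing the fetched content would become candidates for a site-wide rewrite rule.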
The Yahoo search spiders patent is:
Handling dynamic URLs in crawl for better coverage of unique content
Invented by Priyank S. Garg and Arnabnil Bhattacharjee
Assigned to Yahoo!
US Patent 7,827,166
Granted November 2, 2010
Filed October 13, 2006
Techniques for identifying duplicate webpages are provided. In one technique, one or more parameters of a first unique URL are identified where each of the one or more parameters do not substantially affect the content of the corresponding webpage.
The first URL and subsequent URLs may be rewritten to drop each of the one or more parameters. Each of the subsequent URLs is compared to the first URL. If a subsequent URL is the same as the first URL, then the corresponding webpage of the subsequent URL is not accessed or crawled. In another technique, the parameters of multiple URLs are sorted, for example, alphabetically.
If any URLs are the same, then the webpages of the duplicate URLs are not accessed or crawled.
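The comparison the abstract describes might look something like this sketch, which drops ignorable parameters, sorts the rest alphabetically, and skips any URL whose normalized form has already been seen (the URLs and the “sid” parameter are hypothetical):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def normalize(url, ignorable=frozenset()):
    """Drop ignorable parameters and sort the rest alphabetically."""
    parts = urlparse(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k not in ignorable)
    return urlunparse(parts._replace(query=urlencode(kept)))

def crawl_queue(urls, ignorable=frozenset()):
    """Yield only URLs whose normalized form has not been seen before."""
    seen = set()
    for url in urls:
        key = normalize(url, ignorable)
        if key not in seen:
            seen.add(key)
            yield url

# "sid" is a hypothetical session-tracking parameter:
urls = ["http://www.example.com/p?b=2&a=1",
        "http://www.example.com/p?a=1&b=2",
        "http://www.example.com/p?a=1&b=2&sid=9"]
print(list(crawl_queue(urls, ignorable={"sid"})))
# prints ['http://www.example.com/p?b=2&a=1']
```

The alphabetical sort is what lets `?b=2&a=1` and `?a=1&b=2` collapse to a single key, so only one of them gets crawled.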
The methods described in these patents seem to have some limitations, and the search engines have come up with other ways for webmasters to try to help themselves. Those include:
- The Canonical Link Element
- Parameter Handling from Google
- Dynamic URLs from Yahoo
- XML Sitemaps at Google, at Bing, and at Yahoo.
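As a small illustration of the last item, an XML sitemap is just an urlset of loc entries listing the canonical URLs you want crawled, and can be generated with a few lines of Python (the URL below is made up):

```python
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    """Serialize a list of canonical URLs as a minimal XML sitemap."""
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

print(build_sitemap(["http://www.example.com/product/12345"]))
```

Listing only one URL per page in the sitemap is one more hint to the engines about which version you consider canonical.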
Google introduced an improved parameter handling approach in early October, in Webmaster Tools, and announced it on the Official Google Blog in the post Webmaster Tools: Updates to Search queries, Parameter handling and Messages.
In an ideal world, it would be possible to have only one URL for each page on your website.
But that isn’t always possible.
The search engines strive to use only one URL per page, but that isn’t always possible for them, either.
There are other issues involving duplicate content that are more easily solved. If you have canonicalization issues because of multiple parameters in your URLs, you may want to spend some time learning about the canonical link element, parameter handling, and XML sitemaps. Search spiders will try to crawl every page of your site if they can, and a little self-help in making sure that they don’t waste that effort on duplicate URLs is recommended.
31 thoughts on “Should You Rely upon Search Spiders to Rewrite Your Web Addresses?”
Thanks for this interesting post,
An aspect you don’t talk about is that it is generally not a good idea to let crawlers spend time crawling useless pages; that’s time that is not used to crawl good pages.
That said, as you point out, in the real world it’s not always possible to have just one URL for one piece of content.
Hope this helps,
I wonder at what stage you think duplication is removed? (I’ve not seen anything that suggests when it happens.) For instance, Paul’s concern may be moot if the search engine uses this to exclude crawling of duplicate pages after initially discovering them; or it could be that duplicates are still included in the link graph, in which case link value and semantic meaning may still be spread over all those duplicate pages; or it may happen purely at the presentation layer, where a result set is returned and duplicates are removed from it.
What I understand is that it is very difficult for search engines to exclude a page from a crawl.
Pages are not static, so a crawler has to fetch them often, just to see whether they have changed or not.
Not sure my poor English is understandable; sorry for that.
That said, the question of when the duplicate filter is applied is even more interesting …
What I imagine – but just imagine, OK 😉 – is that it may vary.
The first reason is that there are many different kinds of duplicates.
For example, on some sites the site: command shows many duplicates (because of identical headers or meta descriptions); in my mind, that’s a filter on the results.
There are a number of things that I considered including in this post. How the search engines approach crawling the Web could possibly fill up a book or two. 🙂
One of the reasons why a search engine would try to understand whether or not some parameters can be removed or not is definitely to save them from the effort of crawling the same pages over and over at different URLs. It would be a waste of time and resources to do so.
I’d rather not leave it in the hands of the search engines to have to choose, regardless of whether that means spending the time to present pages so that they only have one URL, or using something like the canonical tag to point out which version amongst many is the one that ideally should be indexed.
You’re right – there are many kinds of duplicates, and a number of different ways that a search engine might attempt to identify them and not continue to crawl them, or to index them, or to display them.
The easiest is when pages are exact duplicates, but many pages are what might be called “near duplicates,” where things like templates and boilerplate content (headers, footers, and sidebars, copyright notices, etc.) differ, but the main content is the same.
One of the biggest challenges for the search engines isn’t necessarily whether or not there is duplication going on, but rather which site or which URL to show in the search results.
I’ve heard representatives from the search engines clearly state that duplicate content may be addressed at any stage, from crawling, to indexing, to displaying search results. I’ve also written a good number of blog posts describing patents and white papers that show examples of all three.
For instance, a search engine might analyze the link structure of a site that they are crawling to see if they’ve crawled another site with a very similar link structure. Chances are that if they come across something like that, they may be crawling a mirror site. If a search engine keeps coming across pages on a site during crawling that are substantially similar to one another, it might decide to allocate fewer resources to crawling the pages of that site, and move on to crawling a different site.
During the indexing phase, a search engine has a lot more time to analyze pages and compare content. Some pages may be excluded from being included in the index, or may be placed within a supplemental index.
If parts of pages are duplicated, and those pages are chosen for display in search results, when snippets are chosen from the pages, those might be compared to see if they are duplicates – if so, the search engines may filter out one or more. Sometimes you get to the end of a set of search results, and the search engine tells you that there are more results that aren’t being displayed because they are substantially similar to the results already shown. They do provide a link that you can click upon to see those, but it’s a clear sign that filtering is going on during the display portion of what a search engine does.
I have seen Google tell me that it indexed one page more than 15,000 times under different URLs, so the possibility of splitting PageRank does exist. If Google thinks a site is important enough, based upon things like the quantity and quality of links to the pages of the site as well as other factors, it might index a good number of pages with duplicated content.
Thanks Bill, this is a good read and helpful to some adjustments we will be making to our site.
Interesting post – and a subtle distinction I got out of it: websites like Sears and Target are so large that URL canonicalization causes more problems for usability than it does for their search ranking. For most websites, it’s the other way around.
Still, you’re better off optimizing a site so it doesn’t need those patents. At the very least, sites should be using something like htaccess rewrites for URL design.
Thanks taking time to answer me 😉
As you write at the end of your post, Google Webmaster Tools now has functionality to handle parameters during crawling.
For me, that’s evidence that search engines can’t handle even this very simple kind of duplicate on their own, algorithmically (even with patents).
One thing I do when working on the SEO of big websites is to find the unique identifier in a URL, and then see how many pages there are for each unique identifier … the results are sometimes … surprising 😉
In my mind, again, you are right that the two problems duplicates cause for search engines are:
- wasting time crawling useless pages,
- choosing which one to show.
thanks for this post, this blog and this discussion 😉
Thanks for sharing this content. I have been relying way too much on Google to figure out my site. I’ve also been relying heavily on a couple of plugins that I have running on WordPress. However, I don’t believe they use the rel=canonical syntax, so I will definitely include it in all of my search and archive templates.
You know … my site isn’t large enough for me to have to worry much about this. However, in my experience, the shorter the URL, the more useful. Also, I like to take the same approach to my URL structure that I do with my coding.
That is, the cleaner, simpler, and less duplication that I display, the better it seems to perform in the SERPs. I know that this may be counter intuitive, especially in dealing with a large ecommerce site, but would avoid as much of the clutter as possible. I strongly agree with the gentleman who recommended that you just do the htaccess rewrites to eliminate the slightest possibility of a problem.
Last but not least, it’s nuts that they approved both of those patents on the same day, especially given the difference in filing dates … Maybe that day was “Major Search Engine Patent Day”! 🙂
I agree with Mike. I think a shorter URL always helps users. My site also doesn’t have enough content to worry, but sites like Sears or other major online stores will always have this problem. I’ll keep this in mind as I add more pages and content to my site. Thanks for the information.
I can see how the shopping sites can be affected, but what about the small business? I use SEF URLs for my pages, in hopes that this means one URL – one page (1U1P).
Shouldn’t the burden for 1U1P be placed on the website’s webmaster?
Sorry, I’m not an SEO scientist. 🙂
I’m like Tina, and this post has revealed a lot to me. I will surely be reading and researching more on the canonical link element, parameter handling, and XML sitemaps.
Up until this point I hadn’t given much thought to URL parameters; this gives everyone a better understanding of the implications involved in multiple URLs. Very neat that Yahoo referenced your DUST post as well – congrats. Thanks again for the helpful info.
Since I use WordPress, I don’t think I need to worry much about the canonical link element, parameter handling, and XML sitemaps. Once you set up your settings in WordPress, you won’t have to worry about a search engine accidentally indexing the same page under more than one URL.
Thank you, Michael,
Glad to be able to help.
Many of the SEO problems I see on sites happen equally on large sites and small ones. It can be hard for a large site to develop a unified strategy to avoid issues like multiple URLs for the same page, but you would think they would have the resources at hand to solve problems like that.
If you can reach the point where you don’t need to have the search engines trying to make a best guess as to what URL is the canonical one, that’s a great place to be.
I am a big fan of helping search spiders find the pages that I’d like them to index, and not having them index the ones that they don’t need to visit.
You’re welcome. I’m not sure if you would want Google to index the search pages on your site.
That’s something that I’ve experienced with many sites that I’ve worked upon as well.
It was odd that both were granted on the same day, but I’m not willing to call it a major search patent day. 🙂
It doesn’t hurt to look for problems like this on small sites as well. As a matter of fact, the smaller a site, the bigger the impact that splitting PageRank between duplicates might have.
Search engine friendly URLs can be a start, but there can be other problems that a small site can have, such as being able to visit the same pages with and without a “www,” or when pages that aren’t supposed to have an “https:” in front of them do because of the way they are linked to from https pages.
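Those host-level variations can be collapsed in the same spirit as parameter stripping. A small sketch, assuming (as many sites did at the time) that the “www,” http version is the preferred one:

```python
from urllib.parse import urlparse, urlunparse

def normalize_host(url, scheme="http", prefer_www=True):
    """Collapse www/non-www and https/http variants to one preferred form.

    The defaults here are assumptions -- use whichever form your site
    actually redirects to.
    """
    parts = urlparse(url)
    host = parts.netloc.lower()
    if prefer_www and not host.startswith("www."):
        host = "www." + host
    return urlunparse(parts._replace(scheme=scheme, netloc=host))

print(normalize_host("https://Example.com/page"))
# prints http://www.example.com/page
```

Of course, the real fix is a site-wide 301 redirect to the preferred host, so that visitors and spiders alike only ever see one version.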
The burden of one URL per page should be on the webmaster, but the search engines invest a lot of time and money crawling sites, and if they end up crawling a large number of duplicates, they can end up wasting a considerable amount of effort.
Good to hear. Thanks.
Exploring how search engines may treat parameters in URLs, and how you can help them understand which ones to ignore can be useful in getting them to index all of the pages of your site that you would like to have them index.
It was fun to see the mention of my earlier post on the Yahoo patent. I’ve been starting to see a few more of those lately.
My experience has been that Google will index the same page under different URLs even in WordPress.
For example, when WordPress introduced comment pages (turned on by default in an earlier version of WordPress) and appended a “&cpage=1” (and “&cpage=2”, etc.) to post pages, both the versions of the pages without the comment parameter and the ones with a “cpage=” were being indexed by Google.
There are WordPress plugins that provide link elements with canonical values, and XML sitemaps for WordPress blogs. You don’t have to use them, but they could be helpful to you.
As already discussed, some of the WordPress plugins are phenomenal for this.
The post you wrote is good, but if you’re launching a project that is supposed to be SEO friendly, then you can treat URL rewriting as a safety measure and do it yourself, whether or not the search engines rewrite URLs for you …
This will come in handy as I’m working with SEO professionally. Thanks for the heads up.
There are some very helpful WordPress plugins out there that help address these kinds of problems. They don’t necessarily solve every problem, but they do go a long way toward fixing most of them.
I think it’s a very good development practice to strive to have one URL per page, regardless of whether search engines are trying to solve the problems on their own. I’d rather not rely upon the search engines to get it right if I can.