Same-Site Duplicate Pages at Different URLs

One technical issue that can make it hard for a search engine to crawl and index a site is when the content of its pages appears more than once on the site at different URLs (Uniform Resource Locators, or web page addresses).

Unfortunately, this problem happens more frequently than it should.

A new patent application from Yahoo explores how they might handle dynamic URLs to avoid this problem. What is nice about the patent application is that it identifies a number of the problems that might arise because of duplicate content at different web addresses on the same site, and some approaches that they might use to solve the problem.

While search engines like Yahoo can resolve some of the issues around duplicate content, it’s often in the best interest of site owners not to rely upon search engines to fix this problem on their own.

Avoiding the Crawling of Duplicate Pages

Crawling programs browse the world wide Web and identify and index as much information as possible. These programs locate new pages as well as updates on old pages, so that information can be indexed and available to searchers through the search engine.

Web crawlers often start crawling the web at one or more web pages, follow the links on those pages to other pages, and so on.

A strategy that these programs may follow to retrieve as much information as they can is to try to only “crawl” pages that provide unique content – pages that haven’t already been indexed, or that have been updated if they are already in the index.

One assumption that a web crawler could make while following this strategy is that a unique URL (Uniform Resource Locator) corresponds to a unique webpage. As I noted above, this isn’t always true.

A search engine doesn’t want to index the same page on a site more than once, but it happens, and often some pages of a site don’t get indexed while others are indexed multiple times under different URLs. I recall seeing at least one page on a site indexed many thousands of times in Google.

That problem can happen when a site uses a content management system or ecommerce platform that uses dynamic URLs.

A dynamic URL typically results from a search of a database-driven website, or is the URL of a website that runs a script. In contrast to static URLs, in which the contents of the webpage do not change unless the changes are coded into the HTML, dynamic URLs are typically generated from specific queries to a website’s database.

The webpage has some fixed content and some part of the webpage is a template to display the results of the query, where the content comes from the database that is associated with the website. This results in the page changing based on the data retrieved from the database per the dynamic parameter.

Dynamic URLs often contain characters such as ?, &, %, +, =, and $, or the string cgi. An example of a dynamic URL may be something like the following:

http://www.amazon.com/store?prod=camera&brand=sony&sessionid=7ek138-dje72931d91ds
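To make the pieces of a dynamic URL concrete, here is a minimal sketch that splits the example above into its host, path, and parameters using Python's standard library. It only illustrates how such a URL is structured; it is not taken from the patent application.

```python
from urllib.parse import urlsplit, parse_qsl

# The example URL from the text, joined onto one line.
url = ("http://www.amazon.com/store"
       "?prod=camera&brand=sony&sessionid=7ek138-dje72931d91ds")

parts = urlsplit(url)
# keep_blank_values=True so that empty parameters (e.g. "Sale=") are not lost.
params = parse_qsl(parts.query, keep_blank_values=True)

print(parts.netloc)  # the host name
print(parts.path)    # the script or page that receives the parameters
for name, value in params:
    print(f"{name} = {value}")
```

Each name/value pair after the question mark is one parameter, and it is these pairs that a crawler has to reason about when deciding whether two URLs lead to the same page.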

Multiple Parameters in URLs

The URL of a page can contain many pieces of information in different fields, which are referred to as parameters. These can define different characteristics and classifications of a product or service, or determine the order in which information might be displayed to a viewer. Here’s an example of a URL for a web page on the JCPenney web site for a Modular Storage Center:

http://www5.jcpenney.com/jcp/ProductsHOM.aspx

?DeptID=40525
&CatID=40681
&CatTyp=DEP
&ItemTyp=G
&GrpTyp=STY
&ItemID=11a46ae
&ProdSeq=5
&Cat=buffet%2bhutches
&Dep=Furniture&PCat=dining%2bkitchen
&PCatID=40530
&RefPage=ProductList
&Sale=
&ProdCount=26
&RecPtr=
&ShowMenu=
&TTYP=
&ShopBy=0
&RefPageName=CategoryAll%252Easpx
&RefCatID=40530
&RefDeptID=40525
&Page=1&CmCatId=EXTERNAL|40530|40681

A search engine may have problems indexing that page at that URL because it contains so many parameters, but it may try. Google has that same product listed seven times under different URLs, with different numbers and combinations of parameters in the URLs of each listing.

When more than one parameter is used in a dynamic URL, it’s possible that removing one or more parameters from the URL doesn’t change the content of the page in any way. The Amazon example above includes a sessionid that, if removed, doesn’t change the content of the page (a session ID is often used by sites to track the progress of a unique visitor through the pages of a site).

Another common parameter used by some dynamic sites is a source tracking parameter that lets a site owner know where a visitor has come from before arriving at the site.

So, every time people arrive at a site that uses session IDs and source IDs in URLs, they may be assigned unique values for those parameters, even though they may be visiting the same page. A search engine crawling program may also be given a session ID for a page, as well as a source ID.

If you look through search results in the major search engines, you may see pages in the index which have session IDs and source IDs in their URLs. A website shouldn’t be serving session IDs or source IDs to search engines. Because many do, the search engines may end up indexing pages from a site more than once.

It’s also possible that a URL may change for the same content because of the way that information on the page is sorted or displayed, or because of the path through a site that someone took to get to a particular product.

Even though the content of the page may sometimes be sorted differently, or include a little extra content, like a set of breadcrumb navigation that shows departments and categories, the overall content of the page at different URLs may be substantially the same. There’s a possibility that hundreds of duplicate webpages may exist that provide the same particular content.

And a web crawler may unintentionally send all of the duplicates to be crawled.
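A little arithmetic shows how quickly these variations multiply. The sketch below uses made-up parameter names and values (sort, ref, and sessionid on an example.com store, none of which come from a real site) and counts how many distinct URLs end up pointing at one product page.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical parameter values, chosen only for illustration.
sort_orders = ["price", "name", "newest"]
referrers = ["home", "category", "search", "email"]
session_ids = [f"s{n:04d}" for n in range(10)]  # ten separate visits

# Build every combination of the optional parameters for a single product.
urls = {
    "http://www.example.com/store?" + urlencode(
        {"prod": "camera", "sort": s, "ref": r, "sessionid": sid})
    for s, r, sid in product(sort_orders, referrers, session_ids)
}

print(len(urls))  # 3 * 4 * 10 = 120 different URLs for one product page
```

Three sort orders, four paths into the page, and only ten session IDs already give 120 addresses for substantially the same content.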

Why is Indexing Duplicates a Problem?

Wasting Time Comparing Pages

While a search engine might try to “intelligently analyze a particular webpage and compare the particular webpage against other webpages to determine whether the content of the particular webpage is truly unique,” it’s not unusual for errors to happen during such an analysis. And it takes up a lot of computational resources to access the web pages and compare them.

By spending time performing comparisons of pages on a site, a search engine might not spend time accessing other pages that are valid and non-duplicates.
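The patent application doesn’t spell out how such a comparison would be made. One common way to compare page text, shown here purely as an assumption and not as Yahoo’s method, is to break each page into overlapping word sequences (“shingles”) and measure how many the two pages share. The page texts below are invented.

```python
import re

def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word sequences of length `size`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets (1.0 means identical sets)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Invented page texts: page_b is page_a plus a breadcrumb trail.
page_a = "Sony digital camera with 10x zoom lens and carrying case"
page_b = "Home > Cameras > Sony digital camera with 10x zoom lens and carrying case"
page_c = "Modular storage center with two shelves and a hutch"

print(similarity(page_a, page_b))  # high: near duplicates
print(similarity(page_a, page_c))  # low: different products
```

Even with a cheap measure like this, both pages have to be fetched before they can be compared, and checking every page against every other page on a site with n pages means n(n-1)/2 comparisons. That is the cost the section above describes.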

Given a site with thousands, or perhaps even millions of pages, a search engine crawling program is only going to spend a certain amount of time on that site before it moves on to other sites. If it tries to index and compare pages of a site too quickly, it may negatively affect the performance of the site in serving pages to visitors. There are also a lot of web pages that need to be indexed on the web.

So a site that has the same content that can be accessed under a number of different versions of URLs may end up having the same page indexed a number of times, and have other pages of the site not indexed at all.

Strict Rules for Indexing Pages May Cause Problems

A crawling program may also come up with a set of rules to follow to try to avoid duplicates for particular web sites, such as only looking at a small number of pages that have “similar looking” URLs. Or it might not access URLs that are longer than a certain number of characters. Those rules may result in a significant amount of content being missed.
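Here is a rough sketch of what rules like those could look like, and how they can skip pages that aren’t duplicates at all. The length cutoff, the cap on similar URLs, and the example.com addresses are all assumptions chosen for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_URL_LENGTH = 80   # assumed cutoff, for illustration only
MAX_PER_PATTERN = 2   # assumed cap on "similar looking" URLs

def pattern(url: str) -> str:
    """Treat URLs that share a host and path as 'similar looking'."""
    parts = urlsplit(url)
    return parts.netloc + parts.path

def select_for_crawl(urls: list[str]) -> tuple[list[str], list[str]]:
    """Split URLs into those a rule-bound crawler would fetch and those it would skip."""
    seen = Counter()
    crawl, skipped = [], []
    for url in urls:
        key = pattern(url)
        if len(url) > MAX_URL_LENGTH or seen[key] >= MAX_PER_PATTERN:
            skipped.append(url)
        else:
            seen[key] += 1
            crawl.append(url)
    return crawl, skipped

urls = [
    "http://www.example.com/store?prod=camera",
    "http://www.example.com/store?prod=tripod",
    "http://www.example.com/store?prod=lens",  # unique page, third of its pattern
    "http://www.example.com/store?prod=flash&brand=sony&color=black&ref=category-listing-page",
]
crawl, skipped = select_for_crawl(urls)
print(skipped)
```

The lens and flash pages are unique products, yet both are skipped: one because it looks too much like the URLs before it, the other because its address is too long.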

The Yahoo Patent Application

Handling dynamic URLs in crawl for better coverage of unique content
Invented by Priyank S. Garg and Arnabnil Bhattacharjee
US Patent Application 20080091685
Published April 17, 2008
Filed: October 13, 2006

Abstract

Techniques for identifying duplicate webpages are provided. In one technique, one or more parameters of a first unique URL are identified where each of the one or more parameters do not substantially affect the content of the corresponding webpage. The first URL and subsequent URLs may be rewritten to drop each of the one or more parameters.

Each of the subsequent URLs is compared to the first URL. If a subsequent URL is the same as the first URL, then the corresponding webpage of the subsequent URL is not accessed or crawled. In another technique, the parameters of multiple URLs are sorted, for example, alphabetically. If any URLs are the same, then the webpages of the duplicate URLs are not accessed or crawled.
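The abstract doesn’t say how parameters that “do not substantially affect the content” would be found. One possible approach, offered only as an assumption, is to request a page with and without each parameter and see whether the content changes. In this sketch fake_fetch is a stand-in for a real HTTP request to a pretend site, and an exact text match stands in for “substantially the same.”

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request: this pretend site only reads prod and brand."""
    params = dict(parse_qsl(urlsplit(url).query))
    return f"Listing for {params.get('prod')} by {params.get('brand')}"

def without_param(url: str, name: str) -> str:
    """Return the URL with every occurrence of one parameter removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))

def droppable_params(url: str, fetch=fake_fetch) -> list[str]:
    """List the parameters whose removal leaves the fetched content unchanged."""
    baseline = fetch(url)
    names = [k for k, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)]
    return [n for n in names if fetch(without_param(url, n)) == baseline]

url = "http://www.example.com/store?prod=camera&brand=sony&sessionid=2k4gd0-3k9sx1zc8d"
print(droppable_params(url))  # ['sessionid']
```

A real crawler would need a looser comparison than exact equality, since pages often carry timestamps or rotating content, and it would want to test a parameter on more than one URL before deciding to drop it everywhere.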

The patent application provides some details on a number of strategies that the search engine might take to try to index the URLs of a site without capturing too many duplicate pages. The methods described include removing parameters in URLs that appear to be unnecessary, as well as session and source IDs, and sorting the remaining parameters in the URLs in numerical and alphabetical order.

Example:

This URL:

http://www.amazon.com/store?prod=camera&brand=sony&sessionid=2k4gd0-3k9sx1zc8d

might be rewritten to this form:

http://www.amazon.com/store?prod=camera&brand=sony

The other URLs found by the crawler are also rewritten and compared to the shorter form of the URL. If they match then those pages aren’t crawled and indexed.
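The rewrite-and-compare step can be sketched in a few lines. This is a minimal illustration, not Yahoo’s implementation: the list of ignored parameters is an assumption, and the second session ID in the sample data is made up.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of parameters already known not to affect page content.
IGNORED_PARAMS = {"sessionid", "source"}

def rewrite(url: str) -> str:
    """Drop ignored parameters and sort the rest alphabetically."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k.lower() not in IGNORED_PARAMS]
    params.sort()
    return urlunsplit(parts._replace(query=urlencode(params)))

def urls_to_crawl(found: list[str]) -> list[str]:
    """Keep one rewritten URL per group of URLs that rewrite to the same form."""
    seen, keep = set(), []
    for url in found:
        key = rewrite(url)
        if key not in seen:
            seen.add(key)
            keep.append(key)
    return keep

found = [
    "http://www.amazon.com/store?prod=camera&brand=sony&sessionid=2k4gd0-3k9sx1zc8d",
    "http://www.amazon.com/store?brand=sony&prod=camera&sessionid=9zz81x-0a0a0a",
    "http://www.amazon.com/store?prod=camera&brand=sony",
    "http://www.amazon.com/store?prod=camera&brand=canon",
]
for url in urls_to_crawl(found):
    print(url)
```

The first three URLs collapse into one, so only two pages are fetched. Because the parameters are sorted, the surviving form reads brand=sony&prod=camera rather than in the order shown in the example above.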

The search engine may display the shorter version of the URL in its index unless the server where the page is hosted needs to see the longer version to serve the page in question.

Conclusion

The process described in the patent filing may capture a number of URLs that contain duplicate content, but it stands a good chance of missing many others.

I’ve written previously about approaches from Google and Microsoft to attempt to solve this problem of the same content at different URLs of a site.

While it can take some careful work and planning, it’s recommended that web site owners work to avoid having the same content at different pages as much as possible, rather than relying on the search engines to figure out which URLs contain duplicate content or not.


29 thoughts on “Same-Site Duplicate Pages at Different URLs”

  1. Hi Bill, thanks for the article, I have been researching and reading about the subject as I’m checking a Site’s structure, here are some facts for the Site I’m working on:

    -The site has 3500 pages according to Google’s site command results, but it only displays 500 pages, after that it shows the following message:

    “In order to show you the most relevant results, we have omitted some entries very similar to the 500 already displayed.
    If you like, you can repeat the search with the omitted results included.”

    Is this an indicator of duplicate content issues?

    -The site has a Blog and most meta titles and descriptions are the same.

    -The Blog has several categories which show the same content at different URLs (tagging); sometimes an article can be tagged in 5 or 6 different categories. Will this also cause duplicate content issues? What’s the best way to handle this?

    -The Blog also displays archived pages, and a list of posts by week. I was told these issues could be solved via robots, so that the archived pages and posts-by-week lists are not indexed by Search Engines. How can this be done?

    -The Site also has a problem of canonicalization, when running the site command at Google it displays some of the URL’s as http and some as https.

    -I was also told to “nofollow” some pages in the Site like: contact us, privacy policy, etc, in order to avoid diluting PR.

    So as you can see there’s quite a few issues regarding the Site’s structure. Could you please give me some lights as how to solve these issues, without undermining the Site’s health? Thanks in advance.

  2. I have seen a number of high-profile sites that could correct this problem simply by not publishing the same content to multiple subdomains.

  3. Hi Peter,

    I’ve seen that too often, too. I wonder why so many sites make that mistake without trying to talk to someone who knows better first, or at least doing enough research on the topic to know that it might cause problems.

  4. Hi SEO Tips,

    You’re welcome.

    -The site has 3500 pages according to Google’s site command results, but it only displays 500 pages, after that it shows the following message:

    “In order to show you the most relevant results, we have omitted some entries very similar to the 500 already displayed. If you like, you can repeat the search with the omitted results included.”

    Is this an indicator of duplicate content issues?

    There are potentially a number of factors that go into the decisions that a search engine makes to crawl the pages of a site, and show them in its index. The text that appears tells us that

    …to show you the most relevant results, we have omitted some entries very similar to the 500 already displayed.

    When a search engine crawler visits pages, it follows a number of protocols that determine how deeply it might go through a site, and how many of the pages of a site it might display in its index.

    Those protocols, or rules about crawling, indexing, and displaying pages may depend upon a number of different factors, including the number of links to pages of a site, the number of links from the pages of a site, the pagerank for those pages, how unique the content might be on each page, and perhaps others.

    Regardless of those protocols, and how they might be applied to your site, you do want to go through it and clear up as many potential duplication problems as possible.

    There are a number of steps I would take to try to address this behavior. The first would be to use a tool like Xenu Link Sleuth to crawl the pages of the site and see what pages actually exist on the site.

    Spider Traps

    You may discover spider traps on the site, where a search engine crawler can get stuck in an endless loop for one reason or another. There are a number of situations where those can crop up, and they should be resolved so that a spidering program doesn’t get stuck.

    You may see Xenu continuing to crawl pages and not stopping. If so, you may have to stop Xenu and use its “do not check any URLs beginning with this” feature to keep it from crawling the directory where the spider trap appears, so that Xenu can finish crawling the site. If you have to do that, look back at that directory, and try to figure out why Xenu is getting stuck there – some calendar programs, breadcrumb programs, and page expansion widgets may cause the dynamic generation of new URLs when a spider attempts to crawl them, which would result in a spider trap. You may have to use a disallow statement in your robots.txt file to keep that from happening.

    A poor use of relative links may also cause the same page to get revisited over and over with a URL that continues to get longer and longer (“http://www.example.com/contact,” followed by “http://www.example.com/contact/contact,” followed by “http://www.example.com/contact/contact/contact,” and so on). Fixing the problem with the relative link may resolve that problem.
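    The relative link trap is easy to reproduce. This sketch assumes a server that answers any address under /contact/ with the same page, and that the page carries the relative link "contact/" (both are assumptions for illustration). It resolves the link the way a crawler would.

```python
from urllib.parse import urljoin

# Assumption: every page under /contact/ contains the relative link "contact/".
url = "http://www.example.com/contact/"
visited = []
for _ in range(3):
    visited.append(url)
    url = urljoin(url, "contact/")   # relative link: resolved against the current page

for v in visited:
    print(v)

# A root-relative link resolves to the same address no matter where it appears.
fixed = urljoin("http://www.example.com/contact/contact/", "/contact/")
print(fixed)
```

    Each pass adds another /contact/ to the address, while the link that starts with a slash always resolves to the one real page.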

    Canonicalization

    You may also see canonicalization problems, where some pages can be indexed under more than one URL, and ideally those should be fixed so that a search engine spider only sees one URL per page. The following example shows how a home page of a site might be seen by a search engine as four different pages:

    “http://www.example.com”
    “http://www.example.com/index.html”
    “http://example.com”
    “http://example.com/index.html”

    You’ll want to choose one version, and stick to it consistently in all internal links in your site. A good choice is to not use the default file name in those links (“index.html” in my example). That will leave you with:

    “http://www.example.com”
    “http://example.com”

    Pick one version, and use that in your internal links. Also use a permanent redirect so that the version you decide not to go with redirects to the version that you do. I like using “www” in pages because many people are used to seeing a “www” in a web address, but either version should work fine. People may link to your site from other pages using the version that you didn’t choose, but your permanent (301) redirect should enable you to get link popularity (PageRank) credit for those links.
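    As a rough sketch of the mapping that a permanent redirect would enforce, the function below folds the four home page variants from my example into one preferred address. The preferred host and the list of default file names are assumptions; on a live site the web server would issue the 301 redirect itself rather than run code like this.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.example.com"            # the version chosen for internal links
DEFAULT_FILES = ("index.html", "index.htm")   # assumed default file names

def canonical(url: str) -> str:
    """Return the single preferred form of a URL on this site."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == PREFERRED_HOST.removeprefix("www."):
        host = PREFERRED_HOST                 # add the missing "www."
    path = parts.path or "/"
    for name in DEFAULT_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]         # drop the default file name
            break
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))

variants = [
    "http://www.example.com",
    "http://www.example.com/index.html",
    "http://example.com",
    "http://example.com/index.html",
]
print({canonical(v) for v in variants})
```

    All four variants come out as the same address, which is the one a search engine spider should see.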

    There are other canonicalization problems that can spring up, like the same pages appearing with an “http” and an “https” and those problems should be fixed, too. If you have pages that use the https protocol, using absolute URLs in links from those pages is a good way of making sure that pages which aren’t supposed to have “https” in them don’t, and that pages which are supposed to have “https” in them do.

    The Yahoo patent application in this post discusses URLs that will resolve to the same content with and without some data parameters in their URLs or with data parameters that can appear in different orders. In an ideal world, and possibly with some smart programming, or some purchased middleware software, you can avoid using URLs on a site that use unnecessary data parameters. The best approach possible is to only have one URL on a site per page.

    Blog issues

    It’s a very good idea to have unique page titles and unique meta descriptions for every page, and a failure to do so may cause the “to show you the most relevant results, we have omitted some entries very similar to” result that you describe. If you are using a WordPress blog, there are plugins that you can use to make it very easy to have unique page titles and meta descriptions for each page of the blog.

    As for blog categories, I try to choose the best category, and only use one per blog post.

    Using robots.txt to disallow the crawling of archives pages isn’t a bad idea, either. The URLs for my monthly archives look like this: “http://www.seobythesea.com/?m=200604”

    The disallow statement in my robots.txt file for monthly archives looks like this:

    Disallow: /?m
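    A Disallow line works as a prefix match against the path and query string of a URL. The sketch below is a simplified check, assuming a single rule group and ignoring wildcards and Allow lines, and the ?p=123 post URL is hypothetical.

```python
from urllib.parse import urlsplit

DISALLOW_RULES = ["/?m"]  # the rule from the robots.txt line above

def is_disallowed(url: str, rules=DISALLOW_RULES) -> bool:
    """Simplified check: does the URL's path plus query start with any rule?"""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return any(target.startswith(rule) for rule in rules)

print(is_disallowed("http://www.seobythesea.com/?m=200604"))  # monthly archive
print(is_disallowed("http://www.seobythesea.com/?p=123"))     # hypothetical post URL
print(is_disallowed("http://www.seobythesea.com/"))           # home page
```

    Only the archive address is blocked. Because the match is on a prefix, the rule would also catch any other home page parameter whose name begins with “m,” which is worth keeping in mind when choosing a rule.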

    I don’t know how you’ve set up your “list of posts by week.” If it’s just an end-of-week blog post, listing the posts from the previous week, that shouldn’t be a problem.

    I also use the “post teaser” plugin to only show shortened versions of posts on the front page and on the category pages, and the full posts only appear on actual post pages. There’s some duplication of content that way, but it’s limited by using the shortened version.

    I don’t like the idea of using a rel=”nofollow” to try to limit the flow of pagerank to pages of a site. That’s a little outside the scope of this post, so I’m not going to go into depth on why here.

    Compare Pages on Your Site with Pages in Search Engine Indexes

    I’d also look at the results that appear in the indexes of the major search engines. Do a site:www.example.com search in each of them, paste the results for each into separate text files and then into spreadsheets, and sort them and delete everything that isn’t a link from the site.

    Then look to see what the search engines are indexing and failing to index. That may help you identify some duplication and indexing problems that you didn’t catch.
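    Once both lists are cleaned up, the comparison itself is just a pair of set differences. The URLs here are hypothetical; in practice one set would come from the crawl of the site and the other from the pasted site: results.

```python
# Hypothetical data standing in for a crawl export and a list of indexed URLs.
crawled = {
    "http://www.example.com/",
    "http://www.example.com/about",
    "http://www.example.com/products",
    "http://www.example.com/contact",
}
indexed = {
    "http://www.example.com/",
    "http://www.example.com/products",
    "http://www.example.com/products?sessionid=abc123",
    "http://example.com/",
}

not_indexed = sorted(crawled - indexed)    # pages on the site the engine is missing
indexed_only = sorted(indexed - crawled)   # indexed URLs the crawl never produced

print("Not indexed:", not_indexed)
print("Indexed only:", indexed_only)
```

    The second list is often the more interesting one, since URLs that are indexed but never turned up in the crawl tend to be session ID or canonicalization duplicates.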

  5. Good stuff, I know a number of people who fall into this trap. Duplicate content is such an easy thing to avoid, it just takes a little work on our part, but considering the long-term implications, it’s not really too much of a difficult decision to make. ;)

  6. Thanks Bill, thank you for taking the time explaining these issues, will follow your advice. Keep up the great work!

  7. Thanks, Morgan.

    The tricky part about duplicate content is often understanding why it might be a problem. A lot of ecommerce platforms and content management systems fail to address duplicate content and cause indexing problems with search engines.

    You’re welcome, SEO-Tips.

    I’m hoping that resolving those problems on your site will bring you better treatment from the search engines.

  8. If you have DC, it’s your own fault.. :-) SEO is hard work and if you try to cheat you can get your site banned.. and have to do the whole site over again..

  9. It’s not really hard to make a robots.txt to disallow indexing of some pages (produced by sorting routines for example // ?sort=new, ?sort=popular etc.) and make your site much clearer.

  10. Bill,

    I never thought of having blog entries in more than one category as causing the same kind of problems that e-commerce sites have with dynamic content, but now that you mention it that does happen to my blog. Not sure how much harm is done by the “omitted some entries very similar to” thing… hmm. The more I think about it, the more I suspect that *is* causing relevant results not to display. Yikes!

    What about posting an article at an entirely different site (say, on my own blog and at BlogHer)? I’ve heard different opinions on whether that’s a problem or not, and now that I’ve read this I am wondering again.

    Another very thought-provoking post. Thanks!

    Regards,

    Kelly

  11. Yes, that’s about what I got from this—I should probably rethink how I categorize posts!

    I don’t much like the idea of search engines choosing where I show up (in terms of two different sites), but I suppose as long as it’s me in both places, that’s not a problem.

    Thanks again,

    Kelly

  12. Thanks, Kelly.

    Categories can be a convenient way for people to find topically related posts on your site, but when you add a post to more than one category page, you are duplicating your content at more URLs. The impact of that could be minimized somewhat by publishing only part of the posts on the front page of your site and the category and archives pages, and the full post only on the post’s page. But I do try to avoid adding a post to more than one category.

    Not all duplicate content is bad – for instance, many news stories from wire services are published on many different sites. Republishing your posts at places like Blogher shouldn’t cause your site to be penalized, but it might result in only one version of your content showing up in search results – with the search engine deciding which one to show.

  13. Hi kszprychel,

    Being careful at the planning stages to avoid problems like these internal duplicate issues can be a good thing. Good luck with the launch of your site. :)

  14. But what if I was wanting to market an international company to local markets in several countries? I would want both URL’s recognised.

    Consider… I don’t have company.com but do have company.co.za, company.mu and company.co.uk. I would like to be able to market these to local searches but they are recognised as a single site with 2 duplicate sites. company.co.za won’t show up for a local search in google.co.uk but that’s the one that has been indexed. How can I set it so that each url will be returned for local searches?

  15. Hi Robert,

    If you found it necessary to have multiple websites at different URLs, there are a few things that you could try:

    1.) Custom tailor as much content as you can from each of the different domains to the local markets as much as possible.

    2.) Host each of the domains within the markets that they are relevant for.

    3.) Try to gain links from other sites within each of the markets for the domains from those markets.

    It can be considerably more work to develop unique content for different markets, but in writing for each of those different targeted markets, you likely do want to develop content that will appeal to each. The audiences are different, and the website content should be different, too.

  16. Hi Bill,
    First of all I’d like to say thanks for the good indepth and well explained article. I find duplicate content affects different websites differently. However I am convinced that it dilutes the effectiveness of your pages in SERPS. I think some of the points you mentioned in this article reinforce that fact.

    We currently run several websites, one of them being a restaurant review guide, which unfortunately lists the same restaurant under many categories. As a result there are several URLs, which are not dynamic in appearance (no parameters) but are clean URLs, that result in the same page.

    Our pages tend to rank quite well, but in the interest of expanding activity on our site and our rankings (not just in our local geographic region) but globally I think taking care of duplicate content is the only way forward.

    As you mentioned also including unique meta / title tags is obviously important.

    Again thanks for taking the time to write the article and nice website design also, very clean simple and easy to read. Well done!

  17. Hi Gavin,

    Thanks for your kind words.

    I agree – I believe that in most cases duplicate content on the same site can dilute the effectiveness of your pages in search results.

    Taking care of problems like duplicate content before expanding sounds like a very reasonable approach.

    I hope you come up with a solution that you are happy with as you tackle that problem on your review guide. :)

  18. Hey Bill -

    So I was reading this post and it brought to light, that much of the duplication discussion revolved around url parameters…many which tend to be in relation to CMSs and ecommerce platforms.

    Any takes on the new cloaking developments to assist such sites, the so-called “white-hat” cloaking methods?

    I’m soon to have a discussion and would love to refer to this article for some of the details, as I’d much rather share the love. Let me know, I’d love your input.

  19. Hi Jori,

    Yes, URL duplication does often tend to result from content management systems and ecommerce systems that don’t take the problem of the same content appearing upon different pages into account.

    I’m guessing that you are referring to things like geolocation and Google’s First Click Free programs with your reference to white-hat cloaking. This Official Google Webmaster Blog post might help:

    How Google defines IP delivery, geolocation, and cloaking

  20. Hi Bill, Great post. I’ve got a cold fusion site coming up and this hits the spot. Perfect timing, again.

  21. Hi shadu,

    Great to hear. I hope that you’re able to sidestep any internal duplicate content problems in the development of your pages.
