The Problem with Duplicate Webpages on Websites
One of the technical issues that can interfere with a search engine crawling a site and indexing its pages is when the content of pages on that site appears more than once at different URLs (Uniform Resource Locators, or web page addresses).
Unfortunately, this duplicate-webpage problem happens more frequently than it should.
A new patent application from Yahoo explores how they might handle dynamic URLs to avoid this problem. What is nice about the patent application is that it identifies a number of the problems that might arise because of duplicate web pages at different web addresses on the same site and some approaches that they might use to solve the problem.
While search engines like Yahoo can resolve some of the issues around duplicate webpage content, it is often in the best interest of site owners not to rely upon search engines, but rather to fix this problem themselves.
Avoiding the Crawling of Duplicate Webpages
Crawling programs browse the World Wide Web to identify and index as much information as possible. These programs locate new pages and update old pages so that their information can be indexed and made available to searchers through the search engine.
Web crawlers often start crawling the web at one or more web pages, follow links from those pages to other pages, and so on.
One strategy these programs may follow to retrieve as much information as they can is to try to “crawl” only pages that provide unique content – pages that haven’t already been indexed, or that have been updated since they were indexed.
One assumption that a web crawler could make while following this strategy is that a unique URL (Uniform Resource Locator) corresponds to a unique webpage. Unfortunately, as I noted above, this isn’t always true.
A search engine doesn’t want to index the same page on a site more than once, but it happens, and often some pages of a site don’t get indexed at all while others are indexed multiple times under different URLs. I recall seeing at least one page on a site indexed many thousands of times in Google.
That problem can happen when a site uses a content management system or eCommerce platform that uses dynamic URLs.
A dynamic URL typically results from a search of a database-driven website or the URL of a website that runs a script. In contrast to static URLs, in which the web page contents do not change unless the changes are coded into the HTML, dynamic URLs are typically generated from specific queries to a website’s database.
The web page has some fixed content, and some part of the web page is a template to display the query results, where the content comes from the database associated with the website. This results in the page changing based on the data retrieved from the database per the dynamic parameter.
Dynamic URLs often contain the following characters: ?, &, %, +, =, $, cgi. An example of a dynamic URL may be something like the following:
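The original example URL isn’t reproduced here, but as a sketch, here is a hypothetical dynamic URL of this shape (the address and its parameters are made up), pulled apart with Python’s standard `urllib.parse` module:

```python
from urllib.parse import urlparse, parse_qs

# A hypothetical dynamic URL (not a real page) containing several of the
# telltale characters mentioned above: ?, &, =, and a .cgi script.
url = "http://www.example.com/catalog/product.cgi?id=1234&category=storage&sessionid=A1B2C3"

parsed = urlparse(url)
params = parse_qs(parsed.query)

print(parsed.path)     # the script being run: /catalog/product.cgi
print(sorted(params))  # the parameter names: ['category', 'id', 'sessionid']
```

Each name=value pair after the `?` is a query to the site’s database; the page that comes back is assembled from the results.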
Multiple Parameters in URLs and Duplicate Webpages
The URL of a page can contain many pieces of information in different fields, referred to as parameters, which define different characteristics and classifications of a product or service, or determine the order in which information is displayed to a viewer. Here’s an example of a URL for a web page on the JCPenney website for a Modular Storage Center:
A search engine may have problems indexing that page at that URL because it contains so many parameters, but it may try. For example, Google has that same product listed seven times under different URLs, with different amounts and combinations of parameters in the URLs of each listing.
Source or Session Parameters Can Cause Duplicate Webpages
When more than one parameter is used in a dynamic URL, it’s possible that if one or more parameters are removed from the URL, the content of the page doesn’t change in any way. The example in the quote above includes a sessionid that, if removed, doesn’t change the content of the page (sites often use a session ID to track the progress of a unique visitor through the pages of a site).
Another common parameter used by some dynamic sites is a source tracking parameter that lets a site owner know where a visitor has come from before arriving at the site.
So, every time people arrive at a site that uses session IDs and source IDs in URLs, they may be assigned unique numbers for those parameters, even though they may be visiting the same page. In addition, a search engine crawling program may also be given a session ID for a page, as well as a source ID.
If you look through search results in the major search engines, you may see pages in the index with session IDs and source IDs in their URLs. A website shouldn’t be serving session IDs or source IDs to search engines, but because many do, the search engines may end up indexing pages from a site more than once.
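The way such URLs multiply can be sketched in a few lines of Python. The parameter names `sessionid` and `productid` here are assumptions for illustration, not names from the patent:

```python
from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl

# Parameter names assumed (for illustration) to be tracking-only:
# removing them does not change the content of the page served.
TRACKING = {"sessionid", "sourceid"}

def strip_tracking(url):
    """Remove session/source tracking parameters so different visits
    to the same page collapse to one URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in TRACKING]
    return urlunparse(parts._replace(query=urlencode(kept)))

# Two visitors (or a visitor and a crawler) get different session IDs
# for the same page; after stripping, the URLs are identical.
a = strip_tracking("http://example.com/item?productid=99&sessionid=XYZ123")
b = strip_tracking("http://example.com/item?productid=99&sessionid=ABC789")
print(a == b)  # True
```

Without the stripping step, a crawler that assumes one URL means one page would treat these as two distinct pages.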
It’s also possible that a URL may change for the same content because of how information on the page is sorted or displayed or because of the path through a site that someone took to get to a particular product.
Duplicate Webpages Because of Some Unique Content on Those Pages
While the page’s content may sometimes be sorted differently, or include a little extra content (like a set of breadcrumb navigation links showing departments and categories), the overall content of the page at different URLs may be substantially the same. As a result, hundreds of duplicate web pages may exist that provide the same particular content.
And a web crawler may unintentionally send all of the duplicate web pages to be crawled.
Why is Indexing Duplicate Webpages a Problem?
Wasting Time Comparing Pages
While a search engine might try to “intelligently analyze a particular webpage and compare the particular webpage against other webpages to determine whether the content of the particular webpage is truly unique,” it’s not unusual for errors to happen during such an analysis. And it can take up a lot of computational resources to access the web pages and compare them.
By spending time performing comparisons of pages on a site, a search engine might not spend time accessing other pages that are valid and non-duplicates.
Given a site with thousands, or perhaps even millions of pages, a search engine crawling program will only spend a certain amount of time on that site before it moves on to other sites. If it tries to index and compare pages of a site too quickly, it may negatively affect the site’s performance in serving pages to visitors. And there are many other web pages across the web that need to be indexed.
So a site that has the same content that can be accessed under many different versions of URLs may have the same page indexed several times and have other pages of the site not indexed at all.
Strict Rules for Indexing Pages May Cause Problems
A crawling program may also come up with a set of rules to follow to avoid duplicate web pages for particular websites, such as only looking at a small number of pages with “similar looking” URLs. Or it might not access URLs that are longer than a certain number of characters. Unfortunately, those rules may result in a significant amount of content being missed.
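A rule of this kind is easy to state in code, and the sketch below also shows why it can misfire. The length cutoff and the example URL are made up for illustration:

```python
# A crude crawl rule of the kind described above: assume that very long,
# parameter-laden URLs are duplicates and skip them.
MAX_URL_LENGTH = 100  # an arbitrary illustrative cutoff

def should_crawl(url):
    """Return True if the URL passes the strict length rule."""
    return len(url) <= MAX_URL_LENGTH

# A legitimate product page whose many parameters are all necessary
# fails the rule, so its unique content would never be indexed.
long_url = ("http://example.com/shop?dept=70750&cat=70752&subcat=70756"
            "&style=1962041&color=antique+white&size=full&qty=1")
print(should_crawl(long_url))  # False: skipped despite being unique content
```

The rule saves crawl time, but any real page that happens to exceed the cutoff is invisibly dropped, which is the missed-content risk the paragraph above describes.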
The Yahoo Patent Application
Handling dynamic URLs in crawl for better coverage of unique content
Invented by Priyank S. Garg and Arnabnil Bhattacharjee
US Patent Application 20080091685
Published April 17, 2008
Filed: October 13, 2006
Techniques for identifying duplicate Webpages are provided. In one technique, one or more parameters of a first unique URL are identified. Each of the one or more parameters does not substantially affect the content of the corresponding web page. Then, the first URL and subsequent URLs may be rewritten to drop each of the one or more parameters.
Each of the subsequent URLs is compared to the first URL. If a subsequent URL is the same as the first URL, then the corresponding webpage of the subsequent URL is not accessed or crawled. In another technique, the parameters of multiple URLs are sorted, for example, alphabetically. If any URLs are the same, then the webpages of the duplicate URLs are not accessed or crawled.
The patent application provides some details on many strategies that the search engine might take to index the URLs of a site without capturing too many duplicate web pages. The methods described include removing parameters in URLs that appear to be unnecessary, such as session and source IDs, and sorting the remaining parameters in the URLs in numerical and alphabetical order.
A long URL found by the crawler might be rewritten to a shorter form with the unnecessary parameters removed.
The other URLs found by the crawler are also rewritten and compared to the shorter form of the URL. If they match, then those pages aren’t crawled and indexed.
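This rewrite-and-compare step can be sketched in Python. It is a minimal reading of the patent’s abstract, not Yahoo’s implementation; the list of ignorable parameters and the example URLs are assumptions:

```python
from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl

# Parameters assumed not to substantially affect page content. The patent
# describes identifying such parameters; this fixed list is illustrative.
IGNORABLE = {"sessionid", "sourceid"}

def canonicalize(url):
    """Drop ignorable parameters and sort the rest, so equivalent
    dynamic URLs rewrite to a single canonical form."""
    parts = urlparse(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k.lower() not in IGNORABLE)
    return urlunparse(parts._replace(query=urlencode(kept)))

seen = set()
for url in [
    "http://example.com/p?b=2&a=1&sessionid=111",
    "http://example.com/p?a=1&b=2&sourceid=feed",
]:
    canonical = canonicalize(url)
    if canonical in seen:
        print("duplicate - skip crawling:", url)
    else:
        seen.add(canonical)
```

Both URLs rewrite to `http://example.com/p?a=1&b=2`, so the second one is recognized as a duplicate and never fetched.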
The search engine may display the shorter version of the URL in its index unless the server where the page is hosted needs to see the longer version to serve the page in question.
The process described in the patent filing may capture many URLs that contain duplicate content, but it stands a good chance of missing many others.
I’ve written previously about approaches from Google and Microsoft to attempt to solve this problem of the same content at different URLs of a site:
While it can take some careful work and planning, it’s recommended that website owners avoid having the same content on different pages as much as possible, rather than relying on the search engines to figure out which URLs are duplicates.
Last Updated May 26, 2019