One of the technical issues that can cause problems for a search engine crawling a site to index its pages is when the content of pages on that site appears more than once at different URLs (Uniform Resource Locators, or web page addresses).
Unfortunately, this problem happens more frequently than it should.
A new patent application from Yahoo explores how they might handle dynamic URLs to avoid this problem. What is nice about the patent application is that it identifies a number of the problems that might arise because of duplicate content at different web addresses on the same site, and some approaches that they might use to solve the problem.
While search engines like Yahoo can resolve some of the issues around duplicate content, it's often in the best interest of site owners not to rely upon search engines to fix this problem on their own.
Avoiding the Crawling of Duplicate Pages
Crawling programs browse the World Wide Web to identify and index as much information as possible. These programs locate new pages as well as updates to old pages, so that information can be indexed and made available to searchers through the search engine.
Web crawlers often start crawling the web at one or more web pages, follow the links on those pages to other pages, and so on.
A strategy that these programs may follow to retrieve as much information as they can is to try to only “crawl” pages that provide unique content – pages that haven’t already been indexed, or that have been updated if they are already in the index.
One assumption that a web crawler could make while following this strategy is that a unique URL (Uniform Resource Locator) corresponds to a unique webpage. As I noted above, this isn’t always true.
A search engine doesn’t want to index the same page on a site more than once, but it happens, and often some pages of a site don’t get indexed at all while others are indexed multiple times under different URLs. I recall seeing at least one page on a site indexed many thousands of times in Google.
That problem can happen when a site uses a content management system or ecommerce platform that uses dynamic URLs.
A dynamic URL typically results from a search of a database-driven website, or is the URL of a website that runs a script. In contrast to static URLs, in which the contents of the webpage do not change unless the changes are coded into the HTML, dynamic URLs are typically generated from specific queries to a website’s database.
The webpage has some fixed content, and part of the webpage is a template that displays the results of the query, with the content coming from the database associated with the website. As a result, the page changes based on the data retrieved from the database for the dynamic parameters.
Dynamic URLs often contain characters or strings such as the following: ?, &, %, +, =, $, cgi. An example of a dynamic URL may be something like the following:
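A made-up URL for illustration (not taken from any real site):

http://www.example.com/catalog/search.cgi?category=storage&item=4456&sessionid=8a2f91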
Multiple Parameters in URLs
The URL of a page can contain many pieces of information in different fields, which are referred to as parameters, and which define different characteristics and classifications of a product or service, or can determine the order in which information might be displayed to a viewer. Here’s an example of a URL for a web page on the JCPenney web site for a Modular Storage Center:
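A hypothetical URL of that general shape (not the actual JCPenney address), with several parameters strung together, might look something like this:

http://www.example.com/catalog/product.jsp?dept=furniture&cat=storage&subcat=modular&itemid=0987654&sortby=price&sessionid=2f81c0aa37&sourceid=homepage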
A search engine may have problems indexing that page at that URL because it contains so many parameters, but it may try. Google has that same product listed seven times under different URLs, with different numbers and combinations of parameters in the URL of each listing.
When more than one parameter is used in a dynamic URL, it’s possible that if one or more parameters are removed from the URL, the content of the page doesn’t change in any way. The example above includes a session ID which, if removed, doesn’t change the content of the page (a session ID is often used by sites to track the progress of a unique visitor through the pages of a site).
Another common parameter used by some dynamic sites is a source tracking parameter that lets a site owner know where a visitor has come from before arriving at the site.
So, every time people arrive at a site that uses session IDs and source IDs in its URLs, they may be assigned unique values for those parameters, even though they may be visiting the same page. A search engine crawling program may also be given a session ID for a page, as well as a source ID.
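As a made-up illustration, three visits to the same product page might produce three different URLs that differ only in those tracking parameters:

http://www.example.com/product?item=4456&sessionid=18273&sourceid=homepage
http://www.example.com/product?item=4456&sessionid=55102&sourceid=email
http://www.example.com/product?item=4456&sessionid=90417&sourceid=affiliate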
If you look through search results in the major search engines, you may see pages in the index which have session IDs and source IDs in their URLs. A website shouldn’t be serving session IDs or source IDs to search engines. Because many do, the search engines may end up indexing pages from a site more than once.
It’s also possible that a URL may change for the same content because of the way that information on the page is sorted or displayed, or because of the path through a site that someone took to get to a particular product.
The content of the page may be sorted differently at times, or include a little extra content, like a set of breadcrumb navigation that shows departments and categories, but the overall content of the page at the different URLs may be substantially the same. There’s a possibility that hundreds of duplicate webpages may exist that provide the same particular content.
And a web crawler may unintentionally send all of the duplicates to be crawled.
Why is Indexing Duplicates a Problem?
Wasting Time Comparing Pages
While a search engine might try to “intelligently analyze a particular webpage and compare the particular webpage against other webpages to determine whether the content of the particular webpage is truly unique,” it’s not unusual for errors to happen during such an analysis. And it takes up a lot of computational resources to access the web pages and compare them.
By spending time performing comparisons of pages on a site, a search engine might not spend time accessing other pages that are valid and non-duplicates.
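As a rough sketch of why that comparison is expensive, here is a simplified version in Python; a real search engine would use far more sophisticated similarity measures than a plain hash of the page body, but the point stands that every candidate URL has to be fetched before it can be compared:

```python
import hashlib
from urllib.request import urlopen

def page_fingerprint(url):
    """Fetch a page and reduce its body to a hash for comparison."""
    html = urlopen(url).read()
    return hashlib.md5(html).hexdigest()

def is_duplicate(url, known_fingerprints):
    """The fetch itself is the costly step the crawler would rather avoid."""
    return page_fingerprint(url) in known_fingerprints
```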
Given a site with thousands, or perhaps even millions of pages, a search engine crawling program is only going to spend a certain amount of time on that site before it moves on to other sites. If it tries to index and compare pages of a site too quickly, it may negatively affect the performance of the site in serving pages to visitors. There are also a lot of web pages that need to be indexed on the web.
So a site that has the same content that can be accessed under a number of different versions of URLs may end up having the same page indexed a number of times, and have other pages of the site not indexed at all.
Strict Rules for Indexing Pages May Cause Problems
A crawling program may also come up with a set of rules to follow to try to avoid duplicates for particular web sites, such as only looking at a small number of pages that have “similar looking” URLs. Or it might not access URLs that are longer than a certain number of characters. Those rules may result in a significant amount of content being missed.
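A hypothetical version of such rules, with cutoffs invented purely for illustration:

```python
MAX_URL_LENGTH = 200            # invented cutoff for illustration
MAX_SIMILAR_URLS_PER_SITE = 50  # invented cap on "similar looking" URLs

def crude_crawl_rules(url, similar_urls_seen):
    """Blunt per-site rules of the kind described above; anything they
    reject is simply never crawled, whether or not it is a duplicate."""
    if len(url) > MAX_URL_LENGTH:
        return False
    if similar_urls_seen >= MAX_SIMILAR_URLS_PER_SITE:
        return False
    return True
```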
The Yahoo Patent Application
Handling dynamic URLs in crawl for better coverage of unique content
Invented by Priyank S. Garg and Arnabnil Bhattacharjee
US Patent Application 20080091685
Published April 17, 2008
Filed: October 13, 2006
Techniques for identifying duplicate webpages are provided. In one technique, one or more parameters of a first unique URL are identified where each of the one or more parameters do not substantially affect the content of the corresponding webpage. The first URL and subsequent URLs may be rewritten to drop each of the one or more parameters.
Each of the subsequent URLs is compared to the first URL. If a subsequent URL is the same as the first URL, then the corresponding webpage of the subsequent URL is not accessed or crawled. In another technique, the parameters of multiple URLs are sorted, for example, alphabetically. If any URLs are the same, then the webpages of the duplicate URLs are not accessed or crawled.
The patent application provides some details on a number of strategies that the search engine might take to try to index the URLs of a site without capturing too many duplicate pages. The methods described include removing parameters in URLs that appear to be unnecessary, such as session and source IDs, and sorting the remaining parameters in the URLs in numerical and alphabetical order.
So a long URL containing a session ID and other unneeded parameters might be rewritten into a shorter form, with those parameters stripped out and the remaining ones put into a consistent order.

The other URLs found by the crawler are also rewritten and compared to that shorter form of the URL. If they match, those pages aren’t crawled and indexed.
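A minimal sketch of that kind of rewriting and comparison, assuming the crawler already has a list of parameters (session IDs, source IDs, and the like) known not to affect page content; the parameter names and the code itself are illustrative, not drawn from the patent filing:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Parameters assumed not to affect the content of a page.
IGNORED_PARAMS = {"sessionid", "sourceid", "jsessionid", "sid"}

def normalize_url(url):
    """Drop content-irrelevant parameters and sort the rest alphabetically."""
    parts = urlparse(url)
    params = sorted((k, v) for k, v in parse_qsl(parts.query)
                    if k.lower() not in IGNORED_PARAMS)
    return urlunparse(parts._replace(query=urlencode(params)))

seen = set()

def should_crawl(url):
    """Skip any URL whose rewritten form has already been seen."""
    key = normalize_url(url)
    if key in seen:
        return False
    seen.add(key)
    return True

# Two URLs that differ only in session ID and parameter order rewrite to the
# same form, so the second page would not be fetched again.
print(should_crawl("http://www.example.com/item?prod=123&cat=7&sessionid=abc"))  # True
print(should_crawl("http://www.example.com/item?cat=7&sessionid=xyz&prod=123"))  # False
```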
The search engine may display the shorter version of the URL in its index unless the server where the page is hosted needs to see the longer version to serve the page in question.
The process described in the patent filing may capture a number of URLs that contain duplicate content, but it stands a good chance of missing many others.
I’ve written previously about approaches from Google and Microsoft to attempt to solve this problem of the same content at different URLs of a site:
While it can take some careful work and planning, it’s recommended that web site owners work to avoid having the same content at different URLs as much as possible, rather than relying on the search engines to figure out which URLs contain duplicate content.