Three patents granted today to Google, Microsoft, and Yahoo all describe how each of the search engines might take a close look at page addresses, or URLs, on dynamic web sites.
I wrote about the patent from Microsoft back when it had just been published as a pending patent application, in Microsoft Creating Rules for Canonical URLs. It appears that the patent examiner who reviewed the patent saw my blog post, because it is referred to in the patent within the “other references” section (Slawski, “Microsoft Creating Rules for Canonical URLs,” Sep. 29th, 2006, pp. 1-5. cited by examiner.). I don’t know if it is the first blog post to be cited as a reference in a granted patent (probably not), but it’s the first of my posts to be listed in one.
All three patents take a close look at the structure of URLs on dynamic web pages, which can often pack large amounts of information into those URLs. For example, here’s a link to a page about a pair of jeans:
“http://www5.jcpenney.com/jcp/X6.aspx?DeptID=53006&CatID=53078&GrpTyp=PRD&ItemID=17bf470&attrtype=&attrvalue=&CMID=53006%7c53018&Fltr=&Srt=&QL=F&IND=3&cmVirtualCat=&CmCatId=53006|53018|53078”
It’s possible that many of the parts of that URL aren’t necessary, and can be removed to show the same page. I started removing a number of the different URL parameters from that URL, and I was still seeing the same content at this much shorter URL:
“http://www5.jcpenney.com/jcp/X6.aspx?&GrpTyp=PRD&ItemID=17bf470”
According to Google’s new patent, which is intended to help the search engine identify duplicate content at different URLs, Google might try to identify which parts of a URL can be dropped and which ones can’t. The parts, or parameters, of a dynamic URL that can be dropped would be considered content-irrelevant. In my JCPenney URL example, that would include the following parameters:
DeptID=53006
&CatID=53078
&attrtype=
&attrvalue=
&CMID=53006%7c53018
&Fltr=
&Srt=
&QL=F
&IND=3
&cmVirtualCat=
&CmCatId=53006|53018|53078
By understanding which parts of a URL need to be included to show the same content, and which parts don’t, it becomes easier for the search engine to identify duplicate content that might exist at different URLs showing the same page.
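The patent frames this in terms of equivalence rules, but a simple way to picture the idea is a brute-force test: drop each query parameter in turn, re-fetch the page, and see whether the content changes. The short Python sketch below is my own illustration of that idea, not the method described in the patent; the function names are hypothetical, and comparing exact byte hashes of page bodies is a simplification, since real pages often vary a little between fetches.

import hashlib
import urllib.request
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def content_hash(url):
    """Fetch a URL and return a hash of its body for a rough content comparison."""
    with urllib.request.urlopen(url) as response:
        return hashlib.sha256(response.read()).hexdigest()

def irrelevant_parameters(url):
    """Return the query parameters that can be dropped without changing the page."""
    parts = urlparse(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    baseline = content_hash(url)
    dropped = []
    for name, _ in params:
        trimmed = [(n, v) for n, v in params if n != name]
        candidate = urlunparse(parts._replace(query=urlencode(trimmed)))
        # If the page body is unchanged without this parameter, treat it as content-irrelevant.
        if content_hash(candidate) == baseline:
            dropped.append(name)
    return dropped

Run against the JCPenney URL above, a test like this would be expected to flag parameters such as DeptID, CatID, Fltr, and Srt as droppable, matching the list in the post.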
System for automatically managing duplicate documents when crawling dynamic documents
Invented by Anurag Acharya, Arvind Jain, and Arup Mukherjee
Assigned to Google Inc.
US Patent 7,680,773
Granted: March 16, 2010
Filed: March 31, 2005
Abstract
A system of reducing the possibility of crawling duplicate document identifiers partitions a plurality of document identifiers into multiple clusters, each cluster having a cluster name and a set of document parameters. The system generates an equivalence rule for each cluster of document identifiers, the rule specifying which document parameters associated with the cluster are content-relevant. Next, the system groups each cluster of document identifiers into one or more equivalence classes in accordance with its associated equivalence rule, each equivalence class including one or more document identifiers that correspond to a document content and having a representative document identifier identifying the document content.
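As a rough illustration of the equivalence-class idea in that abstract: once a rule says which parameters in a cluster of URLs are content-relevant, URLs that agree on those parameters fall into the same class, and a single representative URL can stand in for the rest during crawling. The sketch below is my own, and the rule treating ItemID and GrpTyp as the content-relevant parameters is just an assumption based on the JCPenney example above, not something taken from the patent.

from collections import defaultdict
from urllib.parse import urlparse, parse_qsl

# Hypothetical equivalence rule: only these parameters determine the page content.
CONTENT_RELEVANT = {"ItemID", "GrpTyp"}

def equivalence_key(url):
    """Reduce a URL to its host, path, and content-relevant parameters."""
    parts = urlparse(url)
    relevant = tuple(sorted(
        (name, value) for name, value in parse_qsl(parts.query)
        if name in CONTENT_RELEVANT
    ))
    return (parts.netloc, parts.path, relevant)

def group_into_classes(urls):
    """Group URLs into equivalence classes; the first URL seen acts as the representative."""
    classes = defaultdict(list)
    for url in urls:
        classes[equivalence_key(url)].append(url)
    return {members[0]: members for members in classes.values()}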
The Microsoft patent does something somewhat similar. My post linked to above, and mentioned as a reference in the patent, provides a more detailed description.
Systems and methods for inferring uniform resource locator (URL) normalization rules
Invented by Marc Alexander Najork
Assigned to Microsoft Corporation
US Patent 7,680,785
Granted: March 16, 2010
Filed: March 25, 2005
Abstract
Different URLs that actually reference the same web page or other web resource are detected and that information is used to only download one instance of a web page or web resource from a web site. All web pages or web resources downloaded from a web server are compared to identify which are substantially identical. Once identical web pages or web resources with different URLs are found, the different URLs are then analyzed to identify what portions of the URL are essential for identifying a particular web page or web resource, and what portions are irrelevant. Once this has been done for each set of substantially identical web pages or web resources (also referred to as an “equivalence class” herein), these per-equivalence-class rules are generalized to trans-equivalence-class rules.
There are two rule-learning steps:
step (1), where it is learned for each equivalence class what portions of the URLs in that class are relevant for selecting the page and what portions are not; and
step (2), where the per-equivalence-class rules constructed during step (1) are generalized to rules that cover many equivalence classes.
Once a rule is determined, it is applied to the class of web pages or web resources to identify errors. If there are no errors, the rule is activated and is then used by the web crawler for future crawling to avoid the download of duplicative web pages or web resources.
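To make those two steps a little more concrete, here is a rough sketch under my own simplifying assumptions: within an equivalence class (a set of URLs already known to show the same page), a parameter whose value varies is treated as irrelevant for that class (step 1), and a per-class finding is only promoted to a general rule if no other class contradicts it (step 2). The actual patent describes the rule learning and the error-checking far more carefully; the function names here are hypothetical.

from urllib.parse import urlparse, parse_qsl

def class_irrelevant_params(equivalence_class):
    """Step 1: parameters whose values differ across URLs that show the same page."""
    seen_values = {}
    for url in equivalence_class:
        for name, value in parse_qsl(urlparse(url).query, keep_blank_values=True):
            seen_values.setdefault(name, set()).add(value)
    return {name for name, values in seen_values.items() if len(values) > 1}

def generalize(equivalence_classes):
    """Step 2: keep only parameters found irrelevant in every class where they appear."""
    candidates, contradicted = set(), set()
    for eq_class in equivalence_classes:
        irrelevant = class_irrelevant_params(eq_class)
        used = {name for url in eq_class
                for name, _ in parse_qsl(urlparse(url).query, keep_blank_values=True)}
        candidates |= irrelevant
        contradicted |= used - irrelevant
    return candidates - contradicted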
The Yahoo patent also attempts to “normalize” URLs, but does so for a slightly different purpose, though it’s quite possible that duplicate content at different variations of a URL is also removed by Yahoo. The Yahoo patent describes another step where similarities in the structure of content on the pages themselves are also identified, so that similar pages may be grouped together.
Techniques for clustering structurally similar web pages
Invented by Krishna Leela Poola and Arun Ramanujapuram
Assigned to Yahoo!
US Patent 7,680,858
Granted: March 16, 2010
Filed: July 5, 2006
Abstract
Web page clustering techniques described herein are URL Clustering and Page Clustering, whereby clustering algorithms cluster together pages that are structurally similar. Regarding URL clustering, because similarly structured pages have similar patterns in their URLs, grouping similar URL patterns will group structurally similar pages.
Embodiments of URL clustering may involve: (a) URL normalization and (b) URL variation computation. Regarding page clustering, page feature-based techniques further cluster any given set of homogenous clusters, reducing the number of clusters based on the underlying page code. Embodiments of page clustering may reduce the number of clusters based on the tag probabilities and the tag sequence, utilizing an Approximate Nearest Neighborhood (ANN) graph along with evaluation of intra-cluster and inter-cluster compactness.
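A crude way to picture the URL-clustering half of that abstract: normalize each URL by replacing the pieces that tend to vary (numbers in the path, parameter values) with wildcards, and group URLs that collapse to the same pattern. The normalization below is my own simplification; the patent’s URL normalization and variation computation, and the page-clustering step that follows it, are considerably more involved.

import re
from collections import defaultdict
from urllib.parse import urlparse, parse_qsl

def url_pattern(url):
    """Turn a URL into a rough structural pattern by wildcarding the variable parts."""
    parts = urlparse(url)
    path = re.sub(r"\d+", "*", parts.path)
    params = "&".join(sorted(f"{name}=*" for name, _ in parse_qsl(parts.query)))
    return f"{parts.netloc}{path}?{params}"

def cluster_by_pattern(urls):
    """Group structurally similar URLs under their shared pattern."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return clusters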
In an ideal world, as a web site developer, you would set up the URLs for the pages of your site so that there is only one URL per page. In practice, many sites are set up so that people (and search engines) can access the same content at different URLs. While the patents above describe ways that the search engines may address the problem of multiple URLs for the same pages, eliminating the need for them to do so mitigates the risk that they won’t, for one reason or another.
Does this mean that duplicate content is a technical problem and not something considered by the text itself? Don’t create duplicate content because the search engine can’t technically handle it? Interesting post! Thanks for sharing!
Thanks.
This approach covers duplicate content within the same domain: parameters and page content.
It may be different across websites. Duplicate content across different domains is more difficult to detect. That second case often involves only parts of pages or portions of content. And spammers aim to hide their work. The increasing use of social media and RSS also makes duplication grow.
Duplicate content is rampant on the internet today. I hope these new patents will really help in reducing instances of duplicate content.
Hi Dries,
Search engines might attempt to identify duplicate content at a number of different stages, including during the crawling of pages, the indexing of content found on those pages, and the display of pages to searchers in search results.
These patents look at individual sites that use dynamic content which might be available to viewers (and search engines) at more than one URL, and try to identify when the same content can be seen at different URLs, so that the search engines could include only one URL per page of content within their indexes. At this stage, they may be looking at the content of the pages, but their interest is in avoiding indexing duplicate content from that single site.
Hi Renaud,
Exactly – attempting to find when there is duplicate or near-duplicate content on other sites presents some challenges that are different from the ones described in these patent filings.
One might be when there is a mirror site at a different domain, and that might be identified by seeing a very similar linking structure.
Another duplicate content issue might be when some or all of the content on one page has been published elsewhere, either with permission, or under fair use, or through the scraping of content and copyright infringement. I’ve written a few posts about different approaches the search engines might take to identify those. Here are some of them:
Hi Andrew,
There is a lot of duplicate content on the Web.
In some cases, much of it is legitimate and reasonable, such as news wire stories from different publishers, syndicated content, fair use quoting, and public domain content.
Some of it is questionable, such as stolen or scraped content taken without permission or a license, or used in a way that falls outside of fair use.
The patents I wrote about above, and most of the patent filings from the search engines, aren’t aimed as much at reducing duplicate content on the Web itself as they are at reducing the amount of duplicate content in their indexes, or displayed to searchers.
Bill,
Yeah, I have learned that Google, as it should be, has to be more concerned with the load on its index than anything else. My friend ran a search for backlinks to his site, and when it didn’t match how many he knew he had, he became concerned. I had to explain to him that there is no reason for Google to spend resources making sure it has an accurate backlink count in the index haha.
Worst kind of software patent, on things which are obvious and have been done for years. Patenting that means that every search engine robot that parses dynamic URLs with parameters is at risk of being shut down by any one of these three. This is getting pretty close to evil.
Hi Keith,
The more efficiently a search engine can crawl pages, index them, display them, and avoid recrawling the same pages at different URLs, the better. If the search engine can grab search results for popular queries out of a periodically refreshed cache instead of hitting the database for every query, that’s something else it will do.
Google has a long history of only showing a limited number of the backlinks that it might know about for a site. If you do a “link:domain name” search on a site, you’ll usually see far fewer links than if you log into Google’s Webmaster Tools and check to see how many links to pages they report there. Chances are that the Webmaster Tools link reports may still only show some of the links that the search engine is aware of. And some of that may definitely have to do with deciding that resources are better spent in other ways.
Hi Avi,
I was surprised that all three were granted on the same day, since they have so many similarities. At this point, I’m not sure if the problem might be that these patents could be used to exclude others from doing what they describe, or that anyone could patent very similar processes if they wanted to, as long as they were able to describe the processes differently enough.
Hi Bill, reading through this, I can’t help but feel that using something like this could increase the risk of the search engine returning the wrong pages?
Hi Web Design Horsham,
It’s possible that I introduced that element of concern into this post when I wrote it. I’m all for the search engines trying to find the right URL for a page when there is more than one possible option, but I’m a firm believer that if a developer or designer can make it so that a search engine doesn’t have to make a choice like that, then it’s better for everyone. One URL per page is ideal.
About duplicate content, it bugs me that people keep stealing my content. Do they get some kind of punishment for doing this? At all? I’ve talked to a lot of people about it, and they keep telling me that their site will be deindexed, but I’m not seeing it.
Hi forbrukslan1,
Good questions to ask when it comes to duplicate content.
There are a number of different ways that search engines respond to duplicate content. When it’s on the same site, they may try to identify one version that is the canonical one – or the best one to show in search results. A problem is that those duplications may mean that the PageRank or link equity to pages may not be as strong as it should be.
Search engines will look for duplicate content during the crawling of web pages, during indexing, and when they display search results. What they look for may vary at the different stages, and how they respond may as well.
For instance, if a search engine comes across a mirror of a site, with the same content and link structure (the URLs may vary, but the way pages are connected together may be the same), it may stop visiting and crawling one version. When search engines see a lot of duplicate pages on the same site (or pages that are very similar), the crawling programs may not index the pages of that site very deeply unless it has a lot of links pointing to it and/or a lot of PageRank or link equity. When smaller amounts of content are duplicated, and those parts might appear in the snippets for the pages as they would appear in search results, some of those results might be filtered from showing in search results.
Another issue is that while content may be duplicated on pages, it isn’t always an exact duplicate. There are often different headings, footers, and sidebars, and even some changes to the content in the main content areas of pages. “Near duplicate” content can be a little harder to identify than exact copies. That doesn’t mean that search engines can’t identify it, and the best result for a search engine may not always be to remove versions from their index.
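For what it’s worth, near-duplicate detection is often described in terms of comparing overlapping word sequences (shingles) rather than exact matches. The small sketch below shows that general idea, and isn’t drawn from any of the patents discussed in this post.

def shingles(text, size=4):
    """Break text into overlapping word n-grams ("shingles")."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def jaccard_similarity(text_a, text_b):
    """Fraction of shingles the two texts share (1.0 means effectively identical)."""
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0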
When content is duplicated on other sites, it isn’t always clear to search engines which version is the original and which might be a copy. Some duplication of content is “legitimate” in the eyes of search engines, such as the syndication of content or the use of wire stories by news sources. Other duplication may not be as legitimate. Search engines will try to show the version from the most authoritative page, or the one that has the highest PageRank, but that doesn’t mean that they will necessarily remove duplicate content from the Web. They will often try not to show duplicates in response to queries, so that searchers don’t see the same content in every search result.
Sometimes you have to take things into your own hands when someone duplicates your content without permission or a license, such as filing a DMCA complaint with the search engines or a web host, or contacting the host about an acceptable use policy violation.