The great thing about HTML is that it’s so flexible and offers so many ways to do things. The worst thing about HTML is that it’s so flexible and offers so many ways to do things. I’ve looked at a lot of websites, and I still see people doing things in new ways.
A common issue across many websites is that a page can be found at more than one URL. A site owner might set things up that way for a number of reasons, and in a number of ways; it can also be a side effect of the content management system being used.
A patent application published by Google explores how the search engine might recognize when a URL found through a web crawl and a URL found through a feed, such as a product feed, refer to the same page even though the two URLs are structured differently.
This seems like a lot of work to me, and the patent filing has me shaking my head at the thought of Google devoting resources to figuring out duplicated content on a site, even if doing so might help the search engine better understand URLs and the products and other information associated with them.
For example, let’s say that one version of the URL is what Googlebot discovers when it crawls a page. The other version is part of a feed generated from a database that lists products on an ecommerce site, and includes some information that might not be on the page itself, such as a price for the product featured on the page.
Crawling in the DUST: Different URLs with Similar Text
Sometimes those URLs might have tracking parameters attached to them such as session IDs that identify a visitor to a site as unique. Sometimes code is placed on a URL that tells someone where a link appears upon a page, such as in a heading at the top of a page, or in a sidebar, or in a footer.
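One common way site owners (and crawlers) deal with such URLs is to strip tracking parameters before comparing them. A minimal sketch using Python's standard library, assuming a hypothetical list of tracking parameter names (real sites vary widely):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical set of tracking parameters; the names real sites use vary.
TRACKING_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "ref"}

def normalize_url(url):
    """Strip known tracking parameters so that two URLs pointing at
    the same page compare equal."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in TRACKING_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

print(normalize_url("https://example.com/shoes?id=42&sessionid=abc123&ref=footer"))
# https://example.com/shoes?id=42
```

A rule set like this only works for parameters you already know about, which is part of why the more general matching problem described below is hard.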
Back in 2006, I wrote a post about a paper that described how search engines might try to understand URLs that differ but lead to the same page: Solving Different URLs with the Same Text (DUST).
In 2007, the team that produced the original poster expanded upon it in the paper Do Not Crawl in the DUST: Different URLs with Similar Text (pdf).
How well can a search engine match up different URLs on a site that point to the same page? Ideally, there would be only one URL for each page, but this patent finds value in situations where URLs found on a site through a web crawl can be matched with different URLs for the same pages that are uploaded as part of something like a product feed. When a product feed is uploaded, it may include additional information, such as prices for products. If Google can match the crawled URLs with the uploaded URLs, it might be able to display the crawled version of a URL with the price that accompanies the product feed URL.
Given that Google is now charging to present product results, I’m not sure that this patent filing will ever be implemented by Google. Can the ideas behind it be used with other feeds, like video feeds or XML news feeds?
The patent filing is:
Mapping Uniform Resource Locators of Different Indexes
Invented by Oskar Sandberg and Olivier Bousquet
Assigned to Google
United States Patent Application 20130103666
Published April 25, 2013
Filed: October 21, 2011
A server may:
- identify a first address stored in a first search index;
- determine one or more first identifiers associated with the first address;
- identify a second address stored in a second search index;
- determine one or more second identifiers associated with the second address;
- map the first address to the second address based on a first identifier, of the one or more first identifiers, and a second identifier, of the one or more second identifiers; and
- transmit the mapping, of the first address to the second address, to a first server associated with the first search index or to a second server associated with the second search index.
In addition to looking for a key string of numbers that might be unique to URLs matching specific patterns from a crawl and that also appears in URLs uploaded via a feed, the search engine might look at other information associated with those URLs, such as page titles and meta descriptions, which might also match.
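To make that concrete, here is a rough sketch of the kind of matching the claims describe: extract candidate identifiers (digit runs from the URL, plus a normalized title) from each index and map URLs that share one. The signals chosen and all URLs and titles here are invented for illustration; the patent leaves the exact identifiers open.

```python
import re

def identifiers(url, title=""):
    """Candidate identifiers for a URL: runs of three or more digits
    in the URL, plus a normalized page title if one is available."""
    ids = set(re.findall(r"\d{3,}", url))
    if title:
        ids.add(title.strip().lower())
    return ids

def map_indexes(crawled, feed):
    """Map each crawled URL to a feed URL that shares an identifier.
    Both arguments are dicts of URL -> page title."""
    mapping = {}
    for c_url, c_title in crawled.items():
        c_ids = identifiers(c_url, c_title)
        for f_url, f_title in feed.items():
            if c_ids & identifiers(f_url, f_title):
                mapping[c_url] = f_url
    return mapping

crawled = {"https://example.com/p/widget-12345.html": "Acme Widget"}
feed = {"https://example.com/products?sku=12345": "Acme Widget"}
print(map_indexes(crawled, feed))
```

Here the crawled product page and the feed URL share both the SKU-like number and the title, so the mapping succeeds even though the URL structures differ entirely.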
The patent filing does indicate that it might be used for things other than just product feeds, such as “items (e.g., products), published news stories, images, user groups, geographic areas, or any other type of data.” Given that Google product search has become a paid offering at Google, it seems less likely that Google will mine those product feed URLs for additional information to display.
For example, Google News sitemap URLs can be accompanied by a <geo_locations> tag in the feed, which Google could use to associate other versions of the URL for the same page with that location.
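A rough sketch of how such feed data could be pulled out and keyed by URL, using Python's standard XML parser; the sitemap snippet, URL, and location below are invented for illustration, and the tag names follow the Google News sitemap namespace:

```python
import xml.etree.ElementTree as ET

# Minimal, invented News-sitemap snippet carrying a geo_locations tag.
SITEMAP = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/news/story-987</loc>
    <news:news>
      <news:geo_locations>Detroit, Michigan, USA</news:geo_locations>
    </news:news>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
      "news": "http://www.google.com/schemas/sitemap-news/0.9"}

def locations_by_url(xml_text):
    """Return a dict mapping each sitemap URL to its geo_locations value."""
    root = ET.fromstring(xml_text)
    out = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        geo = url.findtext("news:news/news:geo_locations", namespaces=NS)
        if loc and geo:
            out[loc] = geo
    return out

print(locations_by_url(SITEMAP))
```

Once a crawled URL is mapped to the sitemap URL, the same location value could be attached to both versions of the page.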
Honestly, I’m much more likely to believe that Google will first develop self-driving cars, multi-functional Google Glass, smartwatches, and other things that seem to be in the realm of science fiction before it is able to figure out multiple URLs for the same page across a wide variety of sites.
There are just too many ways to do things with HTML.
14 thoughts on “Google Files Patent for Understanding Multiple URLs for the Same Page”
As much as I suspect that Google doesn’t like seeing the same pages at different URLs, it happens on a lot of sites. If recognizing that different URLs point to the same page enables Google to associate a crawled URL with a different URL for that page appearing in a feed, along with the other information in that feed, there’s value in that understanding and association.
And, if the URLs in feeds are the same as the URLs that are crawled on a site, then associated data from the feed should be even easier for Google to extract and associate with those URLs and the extra data in the feed.
Bill, this is a great find! We’ve debated things like first link priority and the reasonable surfer model at Powered by Search a lot, but so far we’ve seen a mix of results for one being preferred over the other. We’ve also seen a lot of instances where technically ‘duplicate’ content ranks really well. For example, I’ve seen category and tag pages on WordPress blogs rank better than a commercial site’s service page for the same keyword, because the category page tends to seem like it has more on-topic long-form content.
I’m not sure that a DUST rule set is possible given so much flexibility with HTML, but I agree that looking through the lens of Google/Local/Product/Social properties might make it easier.
It is possible that Google may have millions of self-driving cars on the roadways before they figure out how to tackle multiple URLs pointing to the same pages. I find myself surprised at least a few times a week at how creative people can be when it comes to setting up a site and introducing something new.
When you start to factor in the large number of CMSes, custom CMSes, forum vendors, directories, aggregators, apps, and frameworks, you’re going to spend quite a while building out your rule set. That DUST isn’t going to settle for quite some time.
It’s interesting to think what could be accomplished by working at the URL level throughout Google/Local/Product/Social properties though.
I think the aims of canonical link elements are different – they are a tool that someone (a site owner) can use to try to make it easier for search engines to understand when a URL for a page might contain substantially the same (or a subset of) the content on another page. With this patent, Google is admitting that sometimes sites are set up so that there are different URLs for the same pages, and sometimes some of those might be associated with additional information, such as genres or geolocation on Google News Sitemaps, or prices on product feeds, or so on. If Google can associate the different types of URLs appropriately to each other, it might be able to associate some of that other information that might come with a feed.
Props on that last note – I have to agree: Self-driving cars seem a LOT more likely. So does Google’s rumored Unicorn Cloning Program. It’s not Google’s fault – I just see web folks find incredibly creative ways to foster duplication on their sites.
Isn’t this what canonical tags are for?
Just when I think Google’s crawlers are becoming smarter, I find an article like this that makes me question their effectiveness. It’s kind of like when my 2 year old daughter says a bunch of new words in a day and then tries to put a piece of trash from the sidewalk in her mouth. The truth is that even with all of the updates and algorithm changes, it’s impossible for Google to keep up with (and properly rank) the sheer number of sites that exist today.
I believe that this patent was made for site owners who aren’t familiar with terms like duplicate content or canonical link. In addition, Google would reduce the number of pages in its index, which means savings.
Thanks for tackling this issue. Well, like what you’ve said, “An issue that’s often common to many websites is when a page on a site can be found at more than one URL.” This is totally true, and I can prove it. I’ve been searching for sites and came across a URL which, after clicking, led me to your site, which is really unusual since your site’s URL is very different from the one that I posted. So I asked myself, after this happened, would it affect a site’s ranking if two different URLs point to one page? Will the search engines see this as spam or something?
I guess this patent will help with canonicalization issues, and thus crawling bandwidth will be used to crawl unique content. Beyond URL discovery via different mediums, there is a lot of duplicate content across the web, so it is important to identify URLs with the same content and settle on one single canonical URL. It will help webmasters as well; canonicalization issues will be less tedious. As for the canonical tag one of the comments mentions, the problem with it is that people don’t use canonical tags as often as they should, and when tracking parameters and session identifiers are used, I haven’t seen many websites using the canonical tag.
Thanks for tackling this issue.
I can cite a great example of this: dailyjot.com, which completely dropped in the rankings. I watched how they manipulated a single keyword-rich blog post with fifteen hyperlinked texts that redirect to almost fifteen landing pages of their website.