Google Patent Granted on Duplicate Content Detection in a Web Crawler System

Some patents from the search engines provide detailed looks at how those search engines might perform some of the core functions behind how they work. By “core functions,” I mean some of the basics such as crawling pages, indexing those pages, and displaying the results to searchers.

For example, last December I wrote a post titled Google Patent on Anchor Text and Different Crawling Rates, about a Google patent filed in 2003 which gave us a look at how the search engine crawled web pages, and collected the web addresses, or URLs, of pages that it came across.

The patent the post covered was Anchor tag indexing in a web crawler system, and it revealed how Google may determine how frequently it might visit or revisit certain pages, including crawling some pages daily, and others even on a real-time or near real-time basis – every few minutes in some cases. While there’s been a lot of discussion in the past few months online about real-time indexing of web pages, it’s interesting to note that the patent was originally filed in 2003.

That older patent also covered topics such as how a search engine crawler might handle temporary (302) redirects differently than permanent (301) redirects, by noting and sometimes following the temporary redirects immediately (to make a decision as to what page to show in search results), and collecting the URLs associated with permanent redirects and putting them into a queue where they might be addressed later – up to a week or more later.

It discussed how text surrounding links and anchor text found during the crawling of a page might be used as annotations for those links, and detailed some of the attributes that the search engine might be looking at when determining whether to associate that text with nearby links.

The patent also covered another very important topic – how to identify duplicate content that it might come across when crawling web pages, and how to identify the best address, or canonical URL for content. This is very important for a search engine – if the same content is found at multiple pages, a search engine may decide that it doesn’t want to spend time and resources indexing and displaying more than one source for the same content.

A related Google patent was granted this week that goes into more detail on how the search engine might handle duplicate content. It shares a couple of inventors with the patent on anchor text, and was filed on the same day. We’re told in an early part of the description for this newly granted patent the reason why Google might look for duplicate content during the crawling of web pages:

Meanwhile, it is becoming more and more common that there are many duplicate copies of a document sharing identical content, even though they may be physically stored at different web servers.

On the one hand, these duplicate copies of a document are welcome because they reduce the possibility that shutting down one web server will render the documents on the web server unavailable; but on the other hand, they can significantly increase the workload and lower the efficiency of a search engine on both its front end and back end, if not dealt with appropriately.

For example, on the back end of a search engine, if duplicate copies of a same document are treated as different documents not related with one another in terms of their content, this would cause the search engine to waste resources, such as disk space, memory, and/or network bandwidth, in order to process and manage the duplicate documents.

On the front end, retaining duplicate documents would cause the search engine to have to search through large indices and to use more processing power to process queries. Also, a user’s experience may suffer if diverse content that should be included in the search results is crowded out by duplicate documents.

For these reasons, it would be desirable to develop a system and method of detecting duplicate documents crawled by a search engine before the search engine makes any further effort to process these documents.

It would also be desirable to manage these duplicate documents in an efficient manner such that the search engine can efficiently furnish the most appropriate and reliable content when responding to a query whose result set includes any of these duplicate documents.

The patent is:

Duplicate document detection in a web crawler system
Invented by Daniel Dulitz, Alexandre A. Verstak, Sanjay Ghemawat, Jeffrey A. Dean
Assigned to Google
US Patent 7,627,613
Granted December 1, 2009
Filed July 3, 2003

Abstract

Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents.

Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.

The description of the patent shares many details disclosed in the earlier granted patent on how Google might handle crawling and anchor text, describing for instance how some URLs for web pages are crawled on a periodic basis in a round-robin format over days or weeks or longer, some URLs are crawled daily, and other URLs are crawled multiple times during a day.

The duplicate document detection patent doesn’t focus too much upon anchor text, but instead provides more details on how a content filter from the search engine might work with a duplicate content server, or Dupserver as it’s called in the patent. The first step that the search engine may take after receiving a newly crawled page is to consult the Dupserver to see if it is a duplicate copy of another document, and if it is, to determine which version might be the canonical version.
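The flow just described might be sketched roughly as follows. This is a toy illustration only, assuming exact-match fingerprints and a highest-score canonical rule; the class and method names are invented for the sketch, not taken from the patent:

```python
from collections import namedtuple

# A crawled page: its address, a fingerprint of its content, and a
# query-independent score (such as PageRank). Purely illustrative.
Page = namedtuple("Page", "url fingerprint score")

class DupServer:
    """Toy stand-in for the patent's Dupserver."""

    def __init__(self):
        self.by_fingerprint = {}  # content fingerprint -> list of known pages

    def lookup(self, page):
        # Return the list of previously seen pages sharing this content.
        return self.by_fingerprint.setdefault(page.fingerprint, [])

    def choose_canonical(self, pages):
        # Simplified canonical choice: highest query-independent score wins.
        return max(pages, key=lambda p: p.score)

def handle_crawled_page(page, dupserver, index):
    """Consult the Dupserver, and index the page only if it is canonical."""
    equivalents = dupserver.lookup(page)
    equivalents.append(page)
    canonical = dupserver.choose_canonical(equivalents)
    if canonical is page:
        index.add(page.url)
    return canonical
```

A real system would also need to demote a previously indexed copy when a better canonical turns up later; that bookkeeping is omitted here.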

This patent likely doesn’t cover all types of duplicate content that Google might find – many pages that contain duplicate content may differ in a number of ways, such as including very different templates filled with boilerplate content such as headers and footers and sidebars that change from one URL to another, or pages that may contain some duplicated content and some unique content, or content duplicated from more than one source. The patent does define the kind of duplicate content that it does cover, and also tells us about how it might handle redirects and duplicate content associated with those:

Duplicate documents are documents that have substantially identical content, and in some embodiments wholly identical content, but different document addresses.

Accordingly, there are at least three scenarios in which duplicate documents are encountered by a web crawler:

two pages, comprising any combination of regular web page(s) and temporary redirect page(s), are duplicate documents if they share the same page content, but have different URLs;

two temporary redirect pages are duplicate documents if they share the same target URL, but have different source URLs; and

a regular web page and a temporary redirect page are duplicate documents if the URL of the regular web page is the target URL of the temporary redirect page or the content of the regular web page is the same as that of the temporary redirect page.

A permanent redirect page is not directly involved in duplicate document detection because the crawlers are configured not to download the content of the target page. However, a regular web page or a temporary redirect page may contain a URL in its content, which happens to be the source URL of a permanent redirect page. Therefore, besides detecting duplicate documents, the Dupserver is also tasked with the job of replacing source URLs embedded in the content of a regular web page or a temporary redirect page with the corresponding target URLs of permanent redirects known to (i.e., stored in) the Dupserver.
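The three scenarios above, and the permanent-redirect rewriting step, might look something like this in code. The data structures and field names here are assumptions made for illustration, not details from the patent:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CrawledPage:
    url: str
    fingerprint: Optional[str] = None       # content hash; None if not downloaded
    temp_redirect_to: Optional[str] = None  # target URL if this is a 302 page

def are_duplicates(a: CrawledPage, b: CrawledPage) -> bool:
    # Scenario 1: same content, different URLs.
    if a.fingerprint is not None and a.fingerprint == b.fingerprint and a.url != b.url:
        return True
    # Scenario 2: two temporary redirects sharing a target, with different sources.
    if a.temp_redirect_to and a.temp_redirect_to == b.temp_redirect_to and a.url != b.url:
        return True
    # Scenario 3: a regular page that is the target of a temporary redirect.
    return a.temp_redirect_to == b.url or b.temp_redirect_to == a.url

def rewrite_permanent_redirects(links, permanent_redirects):
    """Replace links that are sources of known 301s with their targets,
    following chains but stopping if a loop is detected."""
    rewritten = []
    for url in links:
        seen = set()
        while url in permanent_redirects and url not in seen:
            seen.add(url)
            url = permanent_redirects[url]
        rewritten.append(url)
    return rewritten
```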

The patent details some of the different duplicate content detection methods that it might use, such as taking fingerprints of the content found on pages to match content from one page to another, and how that information might be stored within content fingerprint tables, and the selection of canonical URLs for content.
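As a toy illustration of a content fingerprint table – the patent doesn’t specify the hash function, so SHA-256 is just a convenient stand-in here:

```python
import hashlib

def content_fingerprint(content: bytes) -> str:
    """Hash page content so identical documents map to identical keys."""
    return hashlib.sha256(content).hexdigest()

# A content fingerprint table: fingerprint -> the set of URLs (an
# "equivalence class") known to share that content.
table = {}

def record(url: str, content: bytes) -> set:
    fp = content_fingerprint(content)
    equivalence_class = table.setdefault(fp, set())
    equivalence_class.add(url)
    return equivalence_class
```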

A very quick reading of the patent might lead someone to think that the URL with the highest PageRank might be the version chosen as the canonical URL for that content, but the patent tells us that sometimes “a canonical page of an equivalence class is not necessarily the document that has the highest score (e.g., the highest page rank or other query-independent metric).”

We are given one example of this – Google might log all of the pages that it finds with duplicate content, and when it comes across a new duplicate, it might look at the PageRank (or some other query-independent ranking), and see if that new URL has a PageRank that is higher by some significant margin before it might name the new URL as the canonical URL. It’s possible that other factors are taken into consideration as well, though the patent doesn’t explicitly name them.
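That “significant margin” idea can be sketched like this. The margin value and function name are invented for illustration, since the patent doesn’t give concrete numbers:

```python
# Only promote a newly seen duplicate to canonical when its
# query-independent score (e.g., PageRank) beats the current canonical's
# score by a comfortable margin. The 1.5 factor is a made-up threshold.
MARGIN = 1.5

def maybe_update_canonical(current, new):
    """current and new are (url, score) pairs; return the winner."""
    if new[1] > current[1] * MARGIN:
        return new
    return current
```

A margin like this keeps the canonical choice stable – a newcomer that is only slightly better doesn’t cause the equivalence class to flip back and forth between representatives.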

Conclusion

Even though this patent on duplicate content, and the related patent on anchor text were filed more than 6 years ago, they are worth spending some time with because of the way that they lay out in detail the ways that Google might crawl web pages, and collect and process information from those pages. If you are interested in how search engines work, these two documents provide some interesting insights into issues during the crawling of web pages, such as:

  • How Google may handle temporary and permanent redirects,
  • How Google determines different crawling rates for pages,
  • How Google may decide which URL with duplicate content might be considered the canonical URL,
  • How text around links might be chosen to act as annotations for those links,
  • How anchor text pointing to duplicate documents might be associated with the canonical version of the document.

I’ve written a number of posts on duplicate content before.


49 thoughts on “Google Patent Granted on Duplicate Content Detection in a Web Crawler System”

  1. Interesting stuff, Bill. This information might be old, but it does give us some nice info about the way pages are (or were) crawled. I’m sure that this is still in use, but has been improved since the patent was filed.

    Why did it take so long for this patent to be granted? 6 years seems like quite a long time.

  2. I am far from an expert on this, but it seems to me that what Google is after is eliminating duplicate content stored across servers but supposedly served up through the same end website. Or at least that’s my humble interpretation.

  3. This is very interesting to me for a few reasons. I have currently locked horns with our SEO agency regarding pretty much everything they do, because they are just not very good. I have been struggling to get any support internally because they are cheaper than the other options.

    The agency in question still focuses heavily on article marketing as a link building tactic (which would possibly be okay as a supporting tactic but not the main/only source; it is also worth mentioning the articles written are terrible – I’ll stop dwelling on that now and make the point).

    If this happens – and the threat of it happening is implied by the information above – it would really reduce the effectiveness of article marketing in the traditional sense. I would think that if multiple copies of an article were not going to be stored, then the articles which contained the backlinks would only give 1x the credit no matter how many networks/dumping grounds they were syndicated over.

    A little off topic maybe but if anyone has any thoughts on my theory of the effect this will have on article style marketing then please let me know

  4. Thanks for the useful info. As someone previously posted, I, too, wonder how this will affect the article marketing set. In SEO, one must always be ready to change one’s sails to match the prevailing winds.

  5. Very interesting. I remember 10 years ago you could post the same article to a large number of websites and get tons of quality links back that shot you to the top; now, however, duplicate content is really being cut down on. Unfortunately, duplicate content is creating a cesspool of information that has to cause major problems by stealing search spider resources.

  6. I agree, and have experienced this fact. If you take an article by someone else and publish it again with your own link, you get nothing.

  7. Great post, Bill! I’ve been wondering what the deal is with Google and the new duplicate content rules; it seems like these rules change pretty frequently. Just another topic about Google that you have to stay on top of. I learned a lot on this topic, Bill, and thanks for the very informative post. Look forward to more great content!

    Jason Braud

  8. Hi Buzzlord,

    It’s quite possible that Google does still use a process like what is described in the patent, though as you note, it likely has been improved upon over time.

    Six years isn’t unusual when it comes to the granting of a patent. I looked over the history, and there were a number of claims within the original patent that faced rejections of one kind or another, based upon other patents that cover some similar ground.

  9. Hi Scott,

    This patent covers a range of duplicate content issues that might arise on the same web site, or different web sites. For example, it describes when pages might be redirected using a 302 redirect, and the content on those pages may be the same but the domains they appear upon might be different.

  10. Hi Jimmy,

    Google may look at duplicate content during crawling, during indexing, and during the serving of results to searchers. It’s possible that duplicate content from articles might not be impacted much during crawling because enough of the content on a page might be different – the heading sections, the footers, the sidebars, and even some of the main content.

    But it’s also possible, and I’ve seen it happen, that articles that are presented on more than one web site may be filtered out of search results in response to a query, so that only one shows. It’s possible that Google may determine a “canonical” URL (or web page) to show when crawling those pages that have duplicate (or very near duplicate content), and use that determination to decide what URL to show when displaying pages in search results, but the patent doesn’t go into detail on that aspect of duplicate content, so we don’t know for certain.

    We also aren’t given any information in the patent on the effect of the duplicate determination as to whether or not links from those pages are given weight as backlinks, but it’s possible that a link from a page considered a duplicate might not carry as much weight as it would if it wasn’t considered a duplicate.

    There’s likely some value to article marketing, but I’m not sure that it should be used as a primary source of acquiring links to a site – it’s better to have a link building strategy that is much more diverse.

  11. Hi Jim,

    To further your metaphor, it’s nice to have sails, an engine, oars, and other methods of making sure that you keep on moving forward in case one approach isn’t as effective as it might seem to be. Thanks. :)

  12. Hi Dataflurry,

    It really is necessary for search engines to pay more attention to the same content appearing on different web pages. It’s not just a matter of search engine resources, but also the quality of search results. No searchers want to come across the same information in most of the search results that they see for the same query. Unfortunately, I still see that happen sometimes.

  13. Hi Humza,

    If you take someone else’s article without permission, and you republish it elsewhere with a link to your site, I don’t think you should get any value out of it.

  14. Hi Jason,

    Thanks. This patent is fairly old, but I think it provides a nice framework to build upon when learning how search engines might treat duplicate content. I was happy to see more details on how duplicate content might be treated by Google during the crawling of web sites. Chances are, there have been some improvements and changes in the six years since the patent was filed.

  15. Hi Guy,

    There are a lot of places where duplicate content could possibly be called legitimate – online news sources that carry stories from wire services, ecommerce sites that carry publishers’ and manufacturers’ descriptions (that they have licensed to use), content in the public domain including historical documents, press releases, as well as others.

    I think part of the challenge that search engines face isn’t just determining what content is duplicate, but also what pages to show when it is duplicate. In one of the links at the end of my post, I mentioned an earlier post – Google to Help Content Creators Find Unauthorized Duplicated Text, Images, Audio, and Video?, in which Google mentions the possibility of releasing a duplicate content search engine to help publishers find when their content has been copied without permission and/or licensing.

    Making it easier for site owners to find when and where their content has been duplicated may make it a little easier. But it is a problem for searchers, site owners, and search engines.

  16. Hi SEO Canada,

    I think this has some interesting ramifications for other areas of the web as well, from news wire services, to repositories of public domain information, to people publishing the same content under different domain names for one reason or another, to government and educational sites that republish whitepapers and articles, to merchant sites that use manufacturer and publishers descriptions.

    Just for fun, it’s kind of interesting to look up something like “Declaration of Independence,” and dig carefully through the results, and try to understand why the sites included might be shown in search results, why some others might not be, and why some rank higher than others.

  17. Hi Bill,

    Great post, as usual. I consider you my go-to source for all things patent :) Because I’m a writer and I deal with a lot of article writing and ecommerce sites, the duplicate content issue as it pertains to articles and canned product descriptions is always a timely topic for me.

    I, personally, have not seen articles being filtered from the results. Likewise, although I always recommend fresh content to ecomm clients, I constantly see sites that use only canned product descriptions (no originally-written ones) rank highly in the SERPs. Here’s a recent example:

    When I entered “Le Creuset’s Stoneware Oval Dish is truly an All-in-One Dish” (a snippet from Le Creuset’s website) I found over 600 web pages that used this exact product description. My take, for ecomm, has always been that pages which rank highly are the ones that use additional content besides the actual product description. At least that seems to be the case when looking at the top 10 results for many search terms I’ve studied. But that wouldn’t explain why articles seem to skate by.

    Anyway, if you have any further insights or information that might possibly shed light on these two situations, I’m all ears :)

  18. Hi Karon,

    Thanks.

    I have seen articles filtered from search results, and we’ve also had pretty clear statements from the search engines that they will filter duplicate content out of search results. For instance, the Google Webmaster Central blog post from 2006, Deftly dealing with duplicate content provides us with the following warning about syndicated content:

    Syndicate carefully: If you syndicate your content on other sites, make sure they include a link back to the original article on each syndicated article. Even with that, note that we’ll always show the (unblocked) version we think is most appropriate for users in each given search, which may or may not be the version you’d prefer.

    I looked at a number of pages for the example query that you provided, “Le Creuset’s Stoneware Oval Dish is truly an All-in-One Dish,” and noticed that while that single string does appear on a good number of search results, the pages that I visited also included additional content on those pages, such as user generated content in the form of reviews as well as recommendations for other products. It’s a good idea to include additional content, like you note.

    The method described in this particular patent focuses upon duplicate content detection during the early stages of indexing web pages, and I suspect that it’s a fairly broad filter at this point. It may be more geared towards duplicate content that is substantially similar, such as mirrored pages and pages where there isn’t much in the way of additional content beyond boilerplate material such as different headers, footers, sidebars, etc.

    The search engines also look for duplicate content during the indexing and display of pages, and what is often called “near duplicate content” can be a more difficult problem. (See: Detecting Near Duplicates for Web Crawling pdf) There’s not only a challenge to the search engines in determining whether there is content that is substantially similar once you cut away things like the content found in boilerplate, but there can also be issues involved in attempting to find an automated way to determine which pages to show and which to filter.
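The general flavor of the approach in that near-duplicates paper can be sketched with a toy simhash. This is a heavy simplification for illustration – the paper uses weighted features and an efficient search for fingerprints within a small Hamming distance:

```python
import hashlib

def simhash(tokens, bits=64):
    # Sum +1/-1 votes per bit position across token hashes, then keep
    # the sign of each position as the fingerprint bit.
    votes = [0] * bits
    for token in tokens:
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming_distance(a, b):
    # Near-duplicate documents yield fingerprints differing in few bits.
    return bin(a ^ b).count("1")
```

Unlike the exact fingerprints used for identical copies, fingerprints like these let a search engine treat two documents as near-duplicates when their fingerprints differ in only a handful of bit positions.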

    There may be times when the same articles, with different boilerplate around them, do show up in search results. And the decision to show those may have to do with things such as a look at the linking structures and anchor text pointing to the pages and the sites that those appear upon, but chances are that anytime you use content that is duplicate or near duplicate on more than one page, there’s a possibility that one version will be filtered out of search results at some point (crawling, indexing, or displaying of results) by the search engines. Recommending fresh content is a good idea.

  19. This would be a major step forward in eliminating redundant search engine results as well as putting limits on sites that plagiarize and scrape content from original sources.

  20. Great article! Other than real time results, this is going to be the most important factor moving forward (in terms of indexing and showing content). The question I’ve always struggled with is how much content has to be unique in order to have Google consider it unique pages.

  21. Hi People Finder,

    I believe that Google has been using a process like the one described in this patent for a while. As I mentioned in the comments above, this is likely a fairly light filter, that may keep some content out of search results, or even from being indexed if some of the sites are pretty much mirrors. But, there still are issues with scraped and plagiarized content that this likely doesn’t address.

  22. Hi Matt,

    There are lots of places where content is duplicated that are legitimate, such as news wire stories, licensed and syndicated content, as well as the use of quotes pursuant to fair use. That’s one of the issues that a search engine has to contend with in deciding whether to show duplicated content in search results, so they struggle with the same thing – how much content has to be unique for Google to consider pages as unique pages.

    A more current paper, written by people from Google in 2009, looks at uniqueness in documents that have some duplicate content, and discusses some of the issues around that:

    Detecting the Origin of Text Segments Efficiently

    It looks at three different kinds of duplication:

    1. Near duplicate content
    2. Partial replication – one or more paragraphs are copied, but the new document is significantly different from the original.
    3. Semantic duplication – where the content is pretty much the same, but the words used are different.

  23. Hi Lowrider,

    Yes, if Jimmy isn’t satisfied with the SEO agency that he is using, he should consider some alternatives. There are many other ways to build links than just article marketing.

  24. I am doing! And thanks for the feedback. It’s tricky for me at the moment, as the organisation I am working in is very old school and has a long-standing partnership with the agency. I know they don’t know what they are doing, and I am talking people round, but in a company as big as the one I am in, someone who has not climbed the ladder too high is not often given much attention. It’s not ideal but it’s my job for now :)

  25. Hi Jimmy,

    You’re welcome. Sometimes there are things that you can’t control, but it sounds like you’re working on learning as much as you can when it comes to things like SEO that can have a positive impact when you get a chance, which is good to see. Keep on learning, and hopefully you’ll be given a chance to help make a difference.

  26. You know, I have never found any subject more complicated than SEO. Not really complicated, but full of myths and different beliefs. This duplicate content thing is one of those that has hundreds of different myths floating around. Thanks for the post, by the way.

  27. Hi Mobi,

    I agree – there’s a lot of mythology and folklore surrounding many SEO topics. That’s one of the reasons why I like looking at patents and whitepapers from the search engines. It’s quite possible that methods described in the patents and papers haven’t been implemented, but they often raise good questions and areas to explore, and they are directly from the search engines, which I think gives them some weight.

  28. Hey Bill,

    it’s a great post as usual ;). I also think that patents can be a very good source (in whole and/or in part) for getting a sense of the right way of thinking (particularly given the many existing unreliable sources), even if they have been transformed in the meantime.

  29. Hi Vaboos,

    Thank you. I find that I end up with a lot more questions than answers after reading most patents – but they are usually pretty interesting questions. :)

  30. This was a good read! Thank you, Bill. Maybe you should write about good website structure – how crawlers detect navigation and internal linking. You’re very insightful and a great writer. What I really enjoy here is that I need to keep checking back, because you reply to every single post!

  31. Hm! I find this topic of managing and monitoring SEO very interesting and frequently confusing. I would like to set up a framework to give structure to SEO. The goal behind this framework would be to offer a structure for analysing a website’s on-page optimisation for search engines.

    As a result, you could use this framework to analyze how well your own business’s website is optimized for search engines like Google, Yahoo and MSN, and more importantly an outline that highlights current rules and policies to enable professional Search Engine Optimization businesses to maintain professional integrity whilst reducing the image that all SEOs are spammers. An auditing tool for best-practicing SEO is also a real option.

  32. Hi Gareth,

    That sounds like an ambitious and aspirational project, and one that might take a lot of work. Given the ever changing state of the Web, and how search engines do what they do, it might be difficult. Best practices evolve over time, and sometimes rapidly. There definitely are certain baseline practices that you could define and describe, and many of those can be found in places like the guidelines from the different search engines.

    There are some aspects of what the search engines do that might be gleaned from places like the patent filings and whitepapers that the search engines apply for and publish, but those cover a wide range of possibilities. One statement that I see repeatedly in a number of the white papers is that the difference between SEO and spam is that SEO adds value to a web site, while spam doesn’t. I think that’s a good place to start from.

  33. Hi,

    I know it has been a while since this post was written. However, surely the whole duplicate content issue and Google removing duplicate content is going to be a tricky one. Take the example of a major event or disaster in the world, which is then broadcast by news channels in articles on their websites; they are likely to have similar titles and so on…

  34. Hi Mark,

    It is a tricky area, with people syndicating content, using news wire information, republishing quotes from articles under fair use, mashups that reuse mixes of facts and content from other sites, copyright infringement, and more. And the hardest issue a search engine might face is in not being able to decide which site should be displayed in search results after identifying duplicated content.

  35. Very informative. I pay great attention to ensuring that my articles are original, but I must admit to spinning some for entry into article submission websites. I take a lot of time and effort to ensure that all the articles read correctly and that grammar is adhered to. After all this, I am still concerned that web crawlers can detect spun articles – is this so? I don’t seem to have any problems at the moment with this, but how long will it be before someone learns how to identify spun articles?

  36. Hi JayJay,

    It’s possible at this point that the search engines could identify reused articles where words and phrases may have been somewhat altered, and it’s also possible that those may still appear in search results, but possibly carry less weight than they otherwise might if they were unique.

    There are a lot of papers that come out from the search engines and academia on a regular basis about duplicate and near duplicate content, and it seems to be something that will continue to be a focus for the search engines.

    I tend to stay away from article submission websites, even though there may be some value to using them. I’d definitely recommend a healthy, varied approach to gaining links and spreading out information that includes more than article submissions.

  37. If you take someone else’s article without permission, and you republish it elsewhere with a link to your site, I don’t think you should get any value out of it.

  38. I agree with Dallas SEO. I’ve recently seen this happen with an article I wrote on seomoz. It got picked up by about 50 other scraper websites with my link remaining intact. If they were to take snippets and add to it, then sure, that would be cool, but a carbon copy shouldn’t be rewarded at all!

  39. Hi Ed,

    I think Google is getting better at identifying duplicate and near duplicate content, as well as understanding synonyms and paraphrases, but they still have some way to go, and one of the most difficult challenges they seem to face is automating the process of identifying what the original source of content might be.

    It isn’t always clear or transparent when Google does index and display pages that contain duplicate content how well those pages might rank against each other, and how well Google might decide which site should show in response to queries, and which should be filtered out. I’ve seen Google fail at this in the past, and it’s possible that we’ll see that happen for some time in the future as well.

    Panda does seem like it was geared in part to tackle problems like this one, but it’s almost impossible to gauge how effective or ineffective it’s been.

  40. This is a very interesting post Bill. I agree with you that it would be a powerful tool for Google to crawl and also rank pages according to the content and page availability. Thanks for sharing this useful information.

  41. Hi Neeraj,

    Thank you. It’s quite possible that Google is following a number of the processes described in this patent these days, and what I liked about the patent is that it provided some insights into the issues that a search engine faces when it crawls pages and comes across duplicate content that it might not want to spend the resources upon to include within its index.

Comments are closed.