A new Google patent application on near-duplicate content explores using a combination of document similarity techniques to keep searchers from finding redundant content in search results.
The Web makes it easy for words to be copied and spread from one page to another, and the same content may be found at more than one web site, regardless of whether its author intended it to be or not.
How do search engines cope with finding duplicate websites – the same words, the same content, at different places?
How might they recognize that the pages that they would want to show searchers in response to a search query contain the same information and only show one version so that the searchers don’t get buried with redundant information?
Duplicate and Near-Duplicate Documents on the Web
Sometimes a creator of content might show that content on more than one page on purpose, such as when:
- They provide a “mirror” of a site – A site, or pages on a site, may be copied at different domains to avoid the delays that can happen when many people request the same document at once, or to keep the delivery of pages from becoming slow.
- There are differently formatted versions of the document – plain text, HTML, PDF, or a separate printable version of the same document may be available for viewers to choose from, and special versions for phones and PDAs may also be available.
- Sometimes content is shared with other sources, such as news wire articles, or blog posts that are published at both a group blog and an individual’s blog.
Sometimes content may be duplicated at other pages regardless of its creator’s intent, such as when:
- Someone took some or all of the content for republication pursuant to fair use, or in violation of copyright.
- The publishing system used shows the same content at more than one address on the same site, so that it may appear to be unique simply because it is located at a different URL.
- Content was aggregated or incorporated into another source on the Web, such as in a mashup or search results, or in some other form.
There are other instances where content is duplicated on more than one page, or where documents are very similar. It makes sense for a search engine to try not to show the same content over and over again to a searcher in a list of search results.
It’s a challenge that search engineers need to meet carefully, because there are instances where duplicated content is legitimately on the Web, and others where it is duplicated without permission and in disregard of its creator’s copyright.
Recent Google Efforts towards Duplicate and Near Duplicate Content
One of the more interesting papers from Google employees last year gave a very good overview of processes to detect duplicate and near-duplicate content on web pages – Detecting Near Duplicates for Web Crawling (pdf).
In that paper, one of the processes described in detail was developed by Moses Charikar, a Princeton professor who has worked for Google in the past. Charikar is also listed as the inventor of a Google patent granted early last year, which discusses ways to detect similar content on the Web – Methods and apparatus for estimating similarity.
This past week another Google patent application, from Monika H. Henzinger, explores how duplicate and near duplicate content might be detected at different web addresses. The patent application includes references to a number of different previous methods, including Dr. Charikar’s.
Detecting duplicate and near-duplicate files
Invented by Monika H. Henzinger
US Patent Application 20080044016
Published February 21, 2008
Filed August 4, 2006
The patent application explores how some different existing methods for detecting near duplicate content could be used together to try to identify near duplicates on the Web.
It provides citations to a number of documents on the Web that explore the topic of duplicate, and near duplicate content, including the following:
- Finding similar files in a large file system
- Scalable Document Fingerprinting
- Copy Detection Mechanisms for Digital Documents
- Syntactic Clustering of the Web
- Similarity Estimation Techniques from Rounding Algorithms (pdf)
- Similarity search system with compact data structures
- Methods for identifying versioned and plagiarised documents
From those documents, Dr. Henzinger tests and explores the approaches from Andrei Z. Broder (Syntactic Clustering of the Web) and Moses Charikar (Similarity Estimation Techniques from Rounding Algorithms), and compares the two.
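To give a rough sense of the first of those two approaches, here is a small Python sketch of my own (an illustration, not code from the patent or from Broder's paper): a document is broken into overlapping word "shingles," and two documents are compared by the Jaccard similarity of their shingle sets. Broder's actual method fingerprints and samples shingles so that full sets never need to be stored or compared; this sketch skips that step.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
sim = jaccard(shingles(doc1), shingles(doc2))  # one changed word -> 0.4
```

Changing a single word invalidates every shingle that contains it, which is why near-duplicate pages still score well below identical ones.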
While tests showed differences in how effective these approaches were, the patent application’s conclusion about their effectiveness was that “neither of the algorithms worked well for finding near-duplicate pairs on the same Website, though both achieved high precision for near-duplicate pairs on different Websites.”
We’re also told that:
In view of the foregoing, it would be useful to provide improved techniques for finding near-duplicate documents. It would be useful if such techniques improved the precision of the Broder and Charikar algorithms. Finally, it would be useful if such techniques worked well for finding near-duplicate pairs on the same Website, as well as on different Websites.
Using Multiple Similarity Techniques Together
Techniques similar to those described in the documents from Broder and Charikar could possibly be combined to work in sequence, to improve the detection of similar documents. The patent filing provides a nice overview of how that combined process would work, and why it would be an improvement.
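To make the idea of chaining concrete, here is a rough Python sketch of my own (not the patent's actual procedure, and with arbitrary, illustrative thresholds): a cheap Charikar-style simhash comparison filters candidate pairs, and a more expensive Broder-style shingle comparison is only run on pairs whose fingerprints are already close.

```python
import hashlib

def simhash(text, bits=64):
    """Charikar-style simhash: every token votes +1/-1 on each bit of
    its hash; the sign of each bit's total becomes the fingerprint."""
    v = [0] * bits
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def shingle_similarity(t1, t2, k=3):
    """Broder-style check: Jaccard similarity of k-word shingle sets."""
    def sh(t):
        w = t.lower().split()
        return {" ".join(w[i:i + k]) for i in range(len(w) - k + 1)}
    a, b = sh(t1), sh(t2)
    return len(a & b) / len(a | b) if a | b else 1.0

def near_duplicates(t1, t2, max_dist=16, min_sim=0.7):
    """Cheap simhash filter first; run the shingle comparison only
    when the fingerprints are close (thresholds are illustrative)."""
    if hamming(simhash(t1), simhash(t2)) > max_dist:
        return False
    return shingle_similarity(t1, t2) >= min_sim
```

Running the two stages in sequence trades a little recall for speed: most non-duplicate pairs are rejected by the bitwise comparison alone, and the set-based comparison only confirms the survivors.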
Boilerplate, Similarity Techniques, and Fingerprints
One reason why some of the near-duplicate document detection algorithms perform poorly on pairs of pages from the same site, according to Dr. Henzinger, is the boilerplate text that appears on those pages. A boilerplate detection process might be used to remove or ignore that boilerplate content in near-duplicate document analysis. I recently wrote about other reasons why Google might look for boilerplate on pages in Google Omits Needless Words (On Your Pages?).
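The patent filing doesn't spell out a particular boilerplate detection method, but a simple frequency heuristic conveys the general idea: lines that repeat across most pages of a site (navigation, footers) are probably boilerplate and can be dropped before similarity analysis. This Python sketch is my own illustration, with an arbitrary threshold:

```python
from collections import Counter

def strip_boilerplate(pages, min_fraction=0.8):
    """Remove lines that repeat across most pages of a site.
    A simple frequency heuristic, not the patent's method: any line
    appearing on at least `min_fraction` of the pages is dropped."""
    counts = Counter()
    for page in pages:
        counts.update(set(page.splitlines()))  # count each line once per page
    cutoff = min_fraction * len(pages)
    return [
        "\n".join(l for l in page.splitlines() if counts[l] < cutoff)
        for page in pages
    ]
```

With boilerplate removed, two pages from the same site are compared only on the text that actually distinguishes them.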
This patent application explored the possibility of using the Broder and Charikar processes together, but other or additional document similarity techniques could also be used.
One approach to “fingerprinting” the content on pages involves creating “tokens” from the content, as described in Rabin’s Fingerprinting By Random Polynomials. We’re told that different fingerprinting approaches might also be used, such as those described in the Hoad and Zobel paper Methods for identifying versioned and plagiarised documents.
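Rabin's scheme treats a string as the coefficients of a polynomial and reduces it modulo a fixed irreducible polynomial over GF(2), producing a short fingerprint for which different strings collide with provably small probability. As a loose arithmetic analogue (my own toy sketch, not Rabin's actual GF(2) construction), one can evaluate the bytes as a polynomial modulo a large prime:

```python
def fingerprint(text, base=257, mod=(1 << 61) - 1):
    """Toy polynomial fingerprint: bytes are treated as coefficients
    of a polynomial evaluated at `base`, reduced modulo a large prime.
    Rabin's real scheme works over GF(2) with a random irreducible
    polynomial, which is what gives the provable collision bounds."""
    fp = 0
    for byte in text.encode("utf-8"):
        fp = (fp * base + byte) % mod
    return fp

# Tokens such as shingles can then be stored and compared as compact
# integers rather than as full strings.
token_fps = {fingerprint(t) for t in ["the quick brown", "quick brown fox"]}
```

The payoff is space and speed: comparing two documents becomes a matter of intersecting sets of fixed-size integers instead of variable-length text.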
I wrote about some of the problems around duplicate websites in Duplicate Content Issues and Search Engines, and the post includes links to a number of white papers and patent filings about duplicate content.
The process described in this new patent application doesn’t so much introduce a brand new method of identifying near duplicate content on pages as present an approach that takes advantage of existing detection methods in a new way.
I didn’t go into a lot of specific detail on how the different similarity processes work, because those are covered fairly deeply in the papers that I linked to. The process described in this patent application doesn’t necessarily rely completely upon any one of those processes, but rather on the idea that multiple processes could be used together intelligently.
39 thoughts on “New Google Process for Detecting Near Duplicate Content”
This is going to be interesting, could set quite a cat amongst the pigeons….the bottom line is that quality, original content is invaluable!
I think the real challenge lies in deciding what to do with content that has been identified as near-duplicate. If the search engine only discards that content from the current query’s search results, I think everything will be okay. But if the search engine takes pre-emptive action against near-duplicate content by penalizing it or eliminating it from the index altogether, then it risks sacrificing relevance for specialized queries.
I often look for sites that have replicated my content, and have been doing so for years, in order to enforce my intellectual property rights. If Google were to exclude duplicate content from its index completely, or penalize it, I and many other people would be deprived of a useful research function.
Other legitimate needs for finding duplicate content include tracking trends of use (how widely distributed does distributed content get?), growth of special interest movements that promote specific blocks of shared content, and identifying members of coalitions, alliances, organizations, loose communities, and other groups that have somehow elected to use agreed-upon shared content.
Thank you so much for the Nicholas Taleb books. I received them yesterday – they look pretty interesting. I’m looking forward to spending some time with them.
I experienced the problem that you write about with another blog a couple of years ago, when my results were filtered out of Google and replaced by a Bloglines page showing summary feeds of posts from the blog. I was fortunate enough to resolve the problem – but it’s one that shouldn’t have happened in the first place.
@ William Rock, yes, a date stamp by itself isn’t sufficient, and neither is looking at something like pagerank. It’s a challenge.
Great Post, thx for the great clarification and insight into how Google is working on helping filter the data.
It will be very interesting to see the false positives of these results.
Thx Bill 🙂
Interesting stuff, but the really neat search algorithm trick will be how Google will determine original content credit along with natural search engine rankings between duplicate and “near duplicate” and the original source of the content.
It would be a shame if someone with a largely unknown blog or web site publishes original content first, only to have it scraped, recycled, reworded and then republished on more established sites that will then get the organic search engine rankings and traffic credit for the duplicate content, leaving the author and publisher of the original content in the search engine dust.
Bill, by the way, I sent the two Nicholas Taleb books to you last Thursday via USPS Media Mail. You should, hopefully, have them by the beginning of March. Thanks.
Yes, I agree with People Finder:
“unknown blog or web site publishes original content first, only to have it scraped, recycled, reworded and then republished on more established sites ”
This is currently a big issue in the industry already; let’s just hope it does not get worse. From what I understand from these documents, they should be able to detect the original based not only on a date stamp but on many other factors, using comparisons of data within a page’s attributes to make a smarter decision. With the engines’ crawling patterns getting faster and smarter, the little guys should have a chance to be found, as long as their pages are properly coded.
@ Jacques Snyman,
Quality original content is invaluable – at least as long as a search engine understands who the originator of that content is. I’ve seen them get it wrong a number of times.
@ Michael Martinez,
The most difficult issues may not come with the identification of duplicate and near duplicate content, as you point out, but rather with what to do once those duplicates and near duplicates are identified.
Google’s Agent Rank patent application provided some ideas involving the use of digital signatures associated with content, and the use of metadata associated with the publication of that content that might describe syndication and other aspects of content appearing at different addresses for different reasons. I don’t know if the agent rank approach is a viable solution, but it’s one that is out there.
Creative Commons licensing is another approach that can make the use of licensed content easier to identify.
Those are only beginnings. What a search engine does with near duplicates is the real challenge.
@ William Rock, Yes, I think that there will likely be false positives. What will Google do with duplicates and near duplicates when it finds them? I think that we need to be vigilant, and keep our eyes open for whatever approaches they may decide to take.
Hi, and thanks for the information.
This near duplicate content feature could be very bad news for webspammers.
Any ideas on how to make sure that search engines see your original content as yours?
@ Thomas, It could present some problems to them.
There are a few things that you could try to do.
Controlling the things that you do have actual control over is a start. Try to make sure that there aren’t any duplicate content issues involving your own content, such as different URLs showing the same content on the pages of your site or sites. Make sure that page titles and meta descriptions are unique for each page, and limit the amount of boilerplate that appears on each page.
Include some links to other content on your site within what you’ve written, if those links are relevant and helpful to visitors.
In the event that you do syndicate content from your site at other places, linking back to and citing the original isn’t a bad idea.
Some links to your content from other pages and other sources might be helpful in letting the search engines know that yours is the original.
If your content is scraped or duplicated without permission, considering options like the use of a DMCA notice is something to think about.
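As a practical starting point for the first of those suggestions, a site owner could audit their own pages for repeated titles. This is a hypothetical sketch of my own, assuming you already have a mapping of URLs to title text (for example, from a crawl of your own site):

```python
from collections import defaultdict

def find_duplicate_titles(pages):
    """Group URLs that share the same <title> text.
    `pages` maps URL -> title string (a hypothetical input format)."""
    by_title = defaultdict(list)
    for url, title in pages.items():
        by_title[title.strip().lower()].append(url)
    # Keep only titles used by more than one URL.
    return {t: urls for t, urls in by_title.items() if len(urls) > 1}

pages = {
    "/": "Acme Widgets",
    "/about": "Acme Widgets",   # duplicate title to be flagged
    "/contact": "Contact Acme",
}
dupes = find_duplicate_titles(pages)
```

The same grouping idea extends to meta descriptions or to the first paragraph of body text.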
Recently I put an article that is really mine on a couple of websites, by copying and pasting from my own website. Could that also be considered a violation of copyright? Like I said, it’s originally my text!
Madness. This vastly increases the importance of consistently producing fresh, quality content… leaving those who duplicate it behind.
@ Snooper, You should have control over content that you created, as long as it wasn’t something that you made for someone else in a work for hire situation.
If you publish it in more than one place on the Web, that is your right as the copyright holder. However, there is the possibility that the search engines might filter out one or more copies if they believe that it is duplicate (or near duplicate) content.
@ Shane, Some content demands to be repeated and shared with others, like the speeches of Dr. Martin Luther King, and the documents that grant us rights like the Bill of Rights in the US.
Fresh quality content does hold a lot of value, though. 🙂
Michael makes a great point about searching for content used from your site. If duplicated content is not in search engine indexes, how can you know if your content has been reused? Maybe some type of “web specific” copyright system can be created (it probably already exists) so that a flag goes up if duplicate content is found and the original creator is notified.
Hi Chris (Vertical Measures),
Michael’s point is a good one. Presently, we do see some filtered content show up in search results at Google, when they present a link at the end of the displayed results telling us that the remaining results are very similar to the ones shown. So it can be possible to find duplicated content by clicking on that link.
However, there may be times when instead of filtering duplicate content, a search engine decides not to crawl or index content that it believes may be duplicated. This could include sites that are mirrored elsewhere, or pages on a site that are so similar from one to another that the search engine decides to leave the site before indexing everything, and go elsewhere.
As for a Web specific copyright system, Google did file a patent application for something that they called Agent Rank, which would rely upon people using something like the OpenID initiative to label content they create (even as comments on someone’s blog). That labeling could also include meta data that might indicate when someone syndicated their content, and where. I don’t know if we will ever see that developed, but we may see similar ideas appearing on the Web.
LSI will help with finding near duplicate content, but the main thing Google needs to look at is content ownership. This is currently an issue due to indexing frequency: if your content is copied and yours isn’t indexed first, it isn’t attributed to you. I know Google is looking at this very complex issue.
It’s likely that Google is not using latent semantic indexing in what they do.
Content ownership is a big issue, and it’s going to be one for a while. I have seen a number of instances where content that was indexed later ended up replacing content indexed earlier in search results, and I’ve personally experienced that. It is a difficult problem that the search engines don’t always get correct.
Thx for this post!
No matter what techniques Google invents, there will always be a way to change or combine *syndicated* content automatically in order to create (not really) new content. There are even scraper sites saving historical SERPs from different search engines in their databases, re-offering this automatically remixed crap to Google, and Google takes it (most of the time into the sandbox at least, but this still means traffic). There are endless sites with similar concepts, and Google will never manage to handle this problem, because this is an open war, like in IT security.
There is always going to be duplicated content on the web, whether with permission or without, via infringement or fair use, by purposeful syndication or scraping.
There are mashups, archives, mirrors, wire service syndicated content, automated scraping of sites, scraping of search results, scraping of scraped sites…
There are public domain documents that are free to publish and share with others, variations of news stories, and content that is very similar from one page to another.
I’m not sure that the search engines really want to be copyright police at all. But they do want to try to display unique content in search results – if all they showed in the top ten results in response to a query was the same content, in different wrappings, at ten different pages, people would be less likely to use that search engine.
One difficulty is deciding which versions of pages to show, and that’s a problem that is going to plague search engines unless there’s some way to actually verify copyright ownership. Google did come out with a patent application on something that they call agent rank. I wrote about it here:
We are seeing a lot of people pushing for something like OpenID, also.
My post discusses ways that search engines may identify near duplicate content, but the bigger issue may just be what to do with it once it is identified. It’s going to be interesting…
Getting into duplicate content was a natural way forward for any search engine. However, it is still a controversial issue.
The best example would be press/news releases. How is the algorithm going to deal with these?
This is just one example.
The bad guys are always one step ahead, they can clone your website and you get banned for no reason. How does Google cope with this? I think it can’t.
Search engines are still in the infancy of their evolution, and they have a long way to travel before they can come up with something perfect, which will be as elusive as a mirage.
Hi Ajay Kumar Singh,
Some great questions. Thanks for asking.
The search industry is really young, and there are a lot of questions that do need to be ironed out.
With Google News results, one of the things that Google has done is to limit the number of sources that might be shown in results. That’s one step towards avoiding duplicate content. When you do a search in Google News, you are given the option of having duplicate and near duplicate results shown if you want.
Many news articles are published through wire services, and while papers can add content to wire stories, or even remove some content, often they don’t. A search engine being able to cluster all of those results together, so that searchers don’t have to see the same story over and over again, is a positive step.
Yes, there have been problems with sites being copied. I think in most instances, there isn’t a ban so much as a filter that may show the other site instead of yours. Unfortunately, search engines will sometimes get the wrong site.
Working on building strong content and attracting links to your pages can be a good first step towards avoiding someone’s scraped copy replacing yours in search results, with your pages filtered out of those results.
But, you are right – choosing the correct site may be the most difficult issue when it comes to a search engine handling duplicate content.
Duplicate content exists out there right now, and Google’s patents are not working at all. Many of my blog posts have been copied exactly, down to the points and commas, by people who cannot create original content. They are just trying to capture more traffic and earn extra cash. Fortunately, when people read both sites they will discover who the content’s owner is. Search engines seem to be sleeping on this illegal practice.
Web page content duplication is a very serious issue, but I’ve noticed that it’s not as much of an issue on English sites as on sites in other languages, where there is less Internet content. I do a search, and completely identical results come up on the first page of Google. This doesn’t happen in English. Does Google treat languages differently and “close its eyes” to duplication in other languages?
And one more point. While the content duplication still exists, it’s pretty easy to check who unlawfully copied it from your site and then – if one so desires – write a letter to Google and explain the situation. I use copyscape.com.
Some interesting questions and points.
I’m not sure if there is a major difference between how Google deals with duplicate content appearing in search listings in English, and in languages other than English. But I know that every language poses its own unique challenges, and that’s a possibility.
Google does provide information about how to file a DMCA complaint with them, which can be one way to address duplicate content that shows up in search results.
Hi SEO Mexico,
One of the big challenges that a search engine faces is in determining which version of content is the original. It isn’t really clear on many sites whether or not a copy has been made with permission or not, for instance. Google did write about this topic on their blog not too long ago:
Duplicate content due to scrapers
It’s worth a look if you haven’t seen it.
I realise this is quite an old thread now, but just a quick question really about duplicate content, and the “punishments” Google supposedly hands out to cheating sites. I was wondering whether this sort of website would constitute duplicating content (big style)?
Each and every one of the town links within this page has duplicate content, other than a difference in the town name.
I have seen many websites like this, and I have also informed Google by using the webmasters tools (twice) but nothing has ever happened.
It’s websites like this that spam their way to the top of Google search results, and it’s cutting out the ability of search engine users to find more relevant smaller web design agencies or freelancers in those local towns.
I am interested to hear any thoughts from either yourself or any other commenters!
It is frustrating to see sites rank well that offer many pages that are almost exact duplicates of each other, with town names or other keywords inserted into pages that are so substantially similar.
Over the years, it has appeared that Google doesn’t approach issues involving duplicate content and spam on a site-by-site basis, but rather in a way where they can make changes to their algorithms that can affect many sites and pages at the same time.
I did do some queries for a couple of the towns listed, with “web design” included in the query, and looked at the results. Some of the pages ranking ahead of the site that you listed really haven’t done much to help themselves compete against that site in search results, and yet they are still ranking ahead of it. For example, it wouldn’t hurt for them, and for others who may offer services in those towns, to put their actual addresses or locations on their pages, provide more content for search engines to index, and attract or find more links to point to their web sites. I only looked at a very few results, but it’s possible that the site ranks well because the design agencies and freelancers competing against it could do more to help themselves.
Many thanks for your reply, as I was half expecting not to get one, due to the thread being rather old now.
I appreciate you taking the time to analyse said website, and after your pointing it out, it does make sense that rather than Google sorting out each individual website that spams their index on this scale, it is easier to adjust their algorithm instead.
I take your point about websites in local areas not helping themselves (myself and my website very much included in that statement!), and it is definitely something I will work on, if only to better myself and my website’s ranking.
Once again, much appreciated response!
You’re welcome. I try to keep up with comments here, even from the older posts.
I do believe that the search engines pay attention to the comments and spam reports that they receive – and that those can help them refine their algorithms to address spam and other problems that they see. But yes, if they can make a change that can affect many thousands of sites rather than just one or a few, they will likely allocate their attention to what will have the greatest impact.
I think putting even a little effort into ranking for some of those location based queries would pay off for you. Good luck.
My major worry is that if Google uses such technology in its search engine, it may hurt websites which provide free content for others to use.
Maybe websites like Wikipedia won’t suffer, but large numbers of small websites, some of which may be providing content to Wikipedia, may get penalized. I hope Google won’t use such technology in its search engine, because it will do more harm than good.
I understand the dilemma that you’re writing about, but I think the problem might be more complicated than that.
Search engines don’t want to show search results pages that contain the same content at many different sites. To that end, they try to focus upon crawling and indexing unique content as much as possible, and will try to filter out of search results pages that contain substantially the same content. The search engines have been trying to do this for years, and are exploring new ways to do so all of the time.
While I think that it’s great that some sites do provide content that is free to use by others, the risk in offering such content, and in using it is that a search engine may not show it in search results.
As for Wikipedia, regardless of its mission and nonprofit status, if content in Wikipedia entries is taken directly from someone else’s web site, beyond what might be acceptable under fair use, then copyright infringement is happening. Such articles should be edited to remove the infringing material.