I’ve written in the past about many of the reasons why you might find the same content on different pages on the Web, and some of the problems that duplicate content might present to search engines.
When someone performs a search on the Web, a search engine doesn’t want to show more than one page that contains the same or very similar content to that searcher. A search engine also doesn’t want to spend time and effort in crawling and indexing the same content on different sites.
One of the challenges that a search engine faces when it sees duplicate content is deciding which page (or image or video or audio content) to show to a searcher in search results. If a search engine provided a way for creators of content to find unauthorized uses of their content on the Web, it might take some of that burden off the search engine.
A newly published patent application from Google describes a process that could be provided for people to search for duplicate copies of their content on the Web, even if their content isn’t readily available online.
Duplicate Content Search
Invented by Clarence Christopher Mysen and Johnny Chen
Assigned to Google
US Patent Application 20080288509
Published November 20, 2008
Filed May 16, 2007
Abstract
A system may store information regarding a set of items of content, receive sample content from a user, determine whether the sample content matches content of one or more of the items of content, and notify the user whether the sample content matches one or more of the items of content without identifying the one or more items of content to the user.
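Read as a workflow, the abstract describes storing information about a set of content items, accepting a sample from a user, checking the sample against the stored items, and telling the user only whether a match exists. Here is a minimal sketch of that flow in Python; the class and function names, and the choice of an exact hash as the stored "information," are my own assumptions for illustration (an exact hash would only catch verbatim copies, while the patent also covers near duplicates):

```python
import hashlib

def fingerprint(text: str) -> str:
    """Normalize whitespace and case, then return a stable hash of the text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class DuplicateContentIndex:
    """Toy index over a set of content items, keyed only by fingerprint."""

    def __init__(self):
        self._fingerprints = set()

    def add_item(self, text: str) -> None:
        # "Store information regarding a set of items of content"
        self._fingerprints.add(fingerprint(text))

    def check_sample(self, sample: str) -> bool:
        # Notify the user whether the sample matches any stored item,
        # without identifying which item matched.
        return fingerprint(sample) in self._fingerprints
```

In use, a publisher's copies would be added with `add_item`, and a suspicious page's text passed to `check_sample`, which answers yes or no without revealing the matched item, mirroring the "without identifying" language in the abstract.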
The patent application provides some details of how people could submit a sample of their text, images, video, or audio content, and perform a search on a duplicate content search engine to find places where their content might be used without permission.
While there are services online like Copyscape, which people can use to find out whether text on their pages has been plagiarized or duplicated, those services may not help with images, video, or audio.
The duplicate content search described in the patent application could also be used to find content that might be copied from sources that aren’t available online.
The patent application provides some detailed descriptions and examples of different techniques that it might use to detect duplicate content for text, images, videos, and audio.
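The filing doesn't commit to a single algorithm, but for text a widely used approach to near-duplicate detection (outside the patent, and not necessarily what Google would use) is to break each document into overlapping word shingles and compare the resulting sets. A small sketch, with the shingle length and similarity threshold chosen purely as examples:

```python
def shingles(text: str, k: int = 4) -> set:
    """Break text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard_similarity(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Two pages might be treated as near duplicates if, say,
# jaccard_similarity(page_a, page_b) > 0.9 -- the threshold is a guess.
```

The appeal of shingling is that small edits, reordered sentences, or boilerplate changes only disturb a fraction of the shingles, so copied passages still produce a high similarity score.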
While the technology behind detecting duplicates (and near duplicates) is interesting, the idea of providing a tool that content creators and owners can use to find duplicates is really smart. This approach has the potential to make it more likely that Google would show the original version of content in search results, at least for content creators who would take the time to use this tool.
It would be great if Google provided a duplicate content search engine like the one described in this patent filing. Until they do, they have provided some information that might be helpful in understanding some of the issues involving duplicate content:
- Deftly dealing with duplicate content
- Duplicate content due to scrapers
- Demystifying the “duplicate content penalty”
- Webmasters/Site Owners Help – Duplicate Content
Thanks for that post! It will be really interesting to see how the dup content will play out. How do you feel about mashups then..? Where content is scraped or added together like listpic, etc. What do you think about services like dapper.com that mash up RSS feeds and things like that? I don’t know how they can ever eliminate duplicate content…? Interesting though.
Hi Matthew,
You’re welcome. If Google moves forward with something like this duplicate search engine, it will be interesting to see what kind of impact it has.
Great question on mashups. There are lots of sites that take content from other sites, remix that content in useful ways, and do so without payment or express permission. Google could be said to do that themselves in their search results, and in the cached copies of pages that they make available.
I hadn’t seen listpic before, but I like what it adds to Craigslist listings.
The use of other people’s content in mashups might be said to fall under fair use, but that doctrine is somewhat ambiguous; the US Copyright Office discusses the distinction on its page about fair use.
Some organizations, like Google, encourage the creation of mashups, and even provide a Google Maps API to help people create map-related mashups. Others that don’t offer application programming interfaces the way Google does might still be open to a mashup, and may grant permission (or licensing terms and fees) to use their data.
I do think that mashups can be very innovative and interesting, but the legal issues that surround them could potentially be troublesome. Are they fair use, or are they copyright infringement?
By making a tool available to content creators to find duplicates of their content, Google doesn’t have to be the one in the middle of a dispute. I don’t think they want to be, either. 🙂
It will be interesting if Google really can help content creators find duplicate content. But I still don’t understand how “G” will do this
Hi DStudioBali,
The focus of this patent application wasn’t so much on how the search engines will uncover duplicate content as on providing tools to help people find out when their content has been duplicated by others without authorization.
But, if you look in the patent application, they do provide some information about methods that they might use to identify duplicated text, images, audio, and video. I’ve listed some links to documents about how a search engine might find duplicate and near duplicate content pages in this post: New Google Process for Detecting Near Duplicate Content.
The patent application also describes methods it might use for finding duplicates of images, video, and audio. I didn’t go into detail on those in my post, but it’s worth looking at the patent filing itself if you want to find out more.
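For images in particular, one common technique outside the patent (the filing doesn't name it, and Pillow here is my own assumption) is a perceptual hash: shrink the image, compare each pixel to the mean brightness, and pack the results into a bit string, so that resized or lightly edited copies produce nearly identical hashes. A rough sketch:

```python
from PIL import Image  # Pillow; an assumed dependency, not named in the patent

def average_hash(path: str, size: int = 8) -> int:
    """Tiny perceptual hash: shrink, grayscale, threshold each pixel at the mean."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for pixel in pixels:
        bits = (bits << 1) | (1 if pixel >= mean else 0)
    return bits

def hamming_distance(h1: int, h2: int) -> int:
    """Number of bits that differ between two hashes."""
    return bin(h1 ^ h2).count("1")

# Images whose 64-bit hashes differ by only a few bits (say, fewer than 5)
# are likely resized or lightly edited copies -- the cutoff is a guess.
```

Similar fingerprinting ideas exist for audio and video, where short segments are summarized and matched, but the details there are well beyond this sketch.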
Seems like this issue was tailor-made for a new WordPress plugin. Google Alerts is a potential solution, but it’s limited to short amounts of text. Doesn’t seem too hard to create a Google Alerts-type application that allows more than 32 words and is updated automatically when you post.
Hi Copywriter,
As I was reading through the patent filing, I was wondering if this duplicate content search engine might be something that Google would incorporate into their Blogger software. I’d imagine that plugins for other blogging tools would make a lot of sense as well. 🙂
I would expect that Google would want to make a tool like this available to a larger audience as well, but it wouldn’t hurt them to build it into the publishing tools that they have now.
Some kind of alert would be interesting as well….
It’d be interesting to know what parameters they put on finding this duplicate content.
For example, if there were just 2 copies versus 10,000 copies.
I read somewhere else that there is a bit of a vigilante group helping Microsoft find sites that sell links. I wonder if something similar might start up for this too.
Hi Mark,
If someone was copying the content of your site, what actions would you take? I’ve seen site owners send out cease and desist letters for a single copy.
Providing a tool that can help people find copyright infringement enables those people to make a decision on their own as to the steps that they might take, whether email, or cease and desist letter, or lawsuit.
The search engines are still going to use whatever algorithms they have to filter some duplicate results out of search results, and in some cases to stop indexing pages that appear to be duplicates. In some cases, the original creator of content might be the one filtered out (I’ve had it happen to me).
I think this is a smart move in that it could empower individuals and small businesses to find where their copyrights are being infringed, even though they might not have the resources of larger organizations. I wouldn’t call it vigilantism because it ultimately involves helping people protect their own copyrights in a legal manner.
People who post articles in article directories actually WANT people to use their articles on their sites so that they get a nice link back to their article or site from it.
Hi Joe,
That’s a really good point.
Articles from article directories are put there with the express purpose of others using those articles, and permission is given for that use. There’s a license for the syndication of those articles to other places.
Of course, that’s not copyright infringement, or the unauthorized use of content.
One thing that I do want to mention though, is that if someone writes an article for their own site, and submits the same article to an article directory for others to use, it’s possible that someone else might use it, and might rank higher in search results for the article than the original. While syndicating articles can be a way to earn a link, or a number of links, those links may come at the cost of having another page rank for the article with the original filtered out of search results.
While it’s good to get any type of link, I really think that the value of article directory links has diminished greatly over the last year or so, and especially with the last update. I have seen some boost to my site from article submissions, but nowhere near the level it was previously. It looks like Google is turning the knobs again.
Hi Jeff,
I haven’t used article submissions for linking and ranking purposes very much. After spending a fair amount of time writing something, it’s just hard for me to post it somewhere other than the site it was written for, especially if I spent a fair amount of that time considering different keywords.
I do think there’s some value in creating articles that link back to your site. As you note though, article submissions may not have the value that they might have in the past.