I’ve written in the past about many of the reasons why you might find the same content on different pages across the Web, and some of the problems that duplicate content might present to search engines.
When someone performs a search on the Web, a search engine doesn’t want to show that searcher more than one page containing the same or very similar content. A search engine also doesn’t want to spend time and effort crawling and indexing the same content on different sites.
One of the challenges that a search engine faces when it sees duplicate content is deciding which page (or image or video or audio content) to show to a searcher in search results. If a search engine provided a way for creators of content to find unauthorized uses of their content on the Web, it might take some of that burden off the search engine.
A newly published patent application from Google describes a process that could be provided for people to search for duplicate copies of their content on the Web, even if their content isn’t readily available online.
Duplicate Content Search
Invented by Clarence Christopher Mysen and Johnny Chen
Assigned to Google
US Patent Application 20080288509
Published November 20, 2008
Filed May 16, 2007
A system may store information regarding a set of items of content, receive sample content from a user, determine whether the sample content matches content of one or more of the items of content, and notify the user whether the sample content matches one or more of the items of content without identifying the one or more items of content to the user.
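The flow in that abstract (store information about items, accept a sample, check for a match, and answer without revealing which items matched) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the class name `DuplicateContentIndex`, its methods, and the use of an exact SHA-256 fingerprint are all hypothetical stand-ins, not something taken from the patent filing, which describes its own matching techniques.

```python
import hashlib


class DuplicateContentIndex:
    """Hypothetical sketch: stores fingerprints of known content items."""

    def __init__(self):
        # Maps a fingerprint to the ids of stored items that share it.
        self._items_by_fingerprint = {}

    @staticmethod
    def _fingerprint(content: bytes) -> str:
        # Assumption: an exact hash stands in for whatever matching
        # technique a real system would use (it misses near duplicates).
        return hashlib.sha256(content).hexdigest()

    def add_item(self, item_id: str, content: bytes) -> None:
        """Store information about one item of content."""
        key = self._fingerprint(content)
        self._items_by_fingerprint.setdefault(key, set()).add(item_id)

    def check_sample(self, sample: bytes) -> bool:
        """Report whether the sample matches any stored item.

        Only a yes/no answer is returned; the matching item ids are
        deliberately not exposed to the caller.
        """
        return self._fingerprint(sample) in self._items_by_fingerprint


index = DuplicateContentIndex()
index.add_item("page-1", b"An original paragraph of text.")
found = index.check_sample(b"An original paragraph of text.")
not_found = index.check_sample(b"Something entirely different.")
```

Returning a plain boolean mirrors the abstract's point that the user is told whether a match exists without being told which items matched.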
The patent application provides some details of how people could submit a sample of their text, images, video, or audio content to a duplicate content search engine to find places where their content might be used without permission.
While there are services online like Copyscape, which people can use to find out if text on their pages has been plagiarized or duplicated, those services may not help with images or video or audio.
The duplicate content search described in the patent application could also be used to find content that might be copied from sources that aren’t available online.
The patent application provides some detailed descriptions and examples of different techniques that it might use to detect duplicate content for text, images, videos, and audio.
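To give a feel for what near-duplicate detection of text can look like, here is a minimal sketch of one widely used general approach: break each text into overlapping word sequences (shingles) and compare the two sets with Jaccard similarity. This is not presented as the method in the patent filing. The function names, the shingle size of 3 words, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words yield one shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 0.0  # Treat two empty inputs as "no match".
    return len(a & b) / len(a | b)


def is_near_duplicate(sample: str, candidate: str,
                      threshold: float = 0.5, k: int = 3) -> bool:
    """Assumed rule: flag the pair when similarity reaches the threshold."""
    return jaccard(shingles(sample, k), shingles(candidate, k)) >= threshold


original = "the quick brown fox jumps over the lazy dog"
copied = "The quick brown fox jumps over the lazy dog!"
edited = "the quick brown fox leaps over the lazy dog"
unrelated = "search engines crawl and index pages on the web"

same_score = jaccard(shingles(original), shingles(copied))
edited_score = jaccard(shingles(original), shingles(edited))
unrelated_score = jaccard(shingles(original), shingles(unrelated))
```

Changing case or punctuation leaves the score untouched, while swapping a single word lowers it noticeably, which is why the threshold has to be tuned to how much editing should still count as a copy. Images, video, and audio would need different kinds of fingerprints altogether.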
While the technology behind detecting duplicates (and near duplicates) is interesting, the idea of providing a tool that content creators and owners can use to find duplicates is really smart. This approach has the potential to make it more likely that Google would show the original version of content in search results, at least for content creators who would take the time to use this tool.
It would be great if Google provided a duplicate content search engine like the one described in this patent filing. Until they do, they have provided some information that might be helpful in understanding some of the issues involving duplicate content: