Identifying Primary Versions of Duplicate Content
We know that Google doesn’t penalize duplicate content on the Web, but it may try to identify which version it prefers over other versions of the same page.
I came across this statement about duplicate content from Dejan SEO earlier this week, wondered about it, and decided to investigate further:
If there are multiple instances of the same document on the web, the highest authority URL becomes the canonical version. The rest are considered duplicates.
The above quote is from the post Link inversion, the least known major ranking factor. (It is not something I am claiming in this post; I wanted to see if there might be something similar in a patent. I found something close, but it doesn’t say quite the same thing that Dejan predicts.)
While looking through Google patents that include the word “authority,” I found one that doesn’t quite say what Dejan does, but it is interesting because it describes ways to distinguish between duplicate content on different domains based upon priority rules, which can help determine which copy of a document sits at the highest-authority URL.
The patent is:
Identifying a primary version of a document
Inventors: Alexandre A. Verstak and Anurag Acharya
Assignee: Google Inc.
US Patent: 9,779,072
Granted: October 3, 2017
Filed: July 31, 2013
Abstract
A system and method identify a primary version of different versions of the same document. The system selects a priority of authority for each document version based on a priority rule and information associated with the document version. It selects a primary version based on the priority of authority and information associated with the document version.
Since the claims of a patent are what examiners at the USPTO look at when they prosecute a patent and decide whether or not it should be granted, I thought it would be worth looking at the claims to see if they help encapsulate what the patent covers. The first one captures some aspects worth thinking about when talking about different versions of duplicate pages and how the metadata associated with a document might be used to determine which is the primary version:
What is claimed is:
1. A method comprising: identifying, by a computer system, a plurality of different document versions of a particular document; identifying, by the computer system, a first type of metadata that is associated with each document version of the plurality of different document versions, wherein the first type of metadata includes data that describes a source that provides each document version of the plurality of different document versions; identifying, by the computer system, a second type of metadata that is associated with each document version of the plurality of different document versions, wherein the second type of metadata describes a feature of each document version of the plurality of different document versions other than the source of the document version; for each document version of the plurality of different document versions, applying, by the computer system, a priority rule to the first type of metadata and the second type of metadata, to generate a priority value; selecting, by the computer system, a particular document version, of the plurality of different document versions, based on the priority values generated for each document version of the plurality of different document versions; and providing, by the computer system, the particular document version for presentation.
This doesn’t advance the claim that the primary version of a document is considered the canonical version, or that all links pointing to the duplicates are redirected to the primary version.
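To make the structure of that claim easier to follow, here is a minimal Python sketch of the selection it describes: two types of metadata feed a priority rule, which produces a priority value, and the version with the best value is chosen for presentation. The field names, the source priorities, and the weighting are all hypothetical illustrations, not details from the patent.

```python
from dataclasses import dataclass

@dataclass
class DocumentVersion:
    url: str
    source_metadata: dict   # first type of metadata: describes the source of the version
    feature_metadata: dict  # second type of metadata: other features of the version

# Hypothetical source priorities; the patent does not publish actual values.
SOURCE_PRIORITIES = {"publisher_site": 3, "aggregator": 2, "unknown": 1}

def priority_rule(version: DocumentVersion) -> float:
    """Apply a (made-up) priority rule to both types of metadata to get a priority value."""
    source_score = SOURCE_PRIORITIES.get(
        version.source_metadata.get("source_type", "unknown"), 1)
    citation_score = version.feature_metadata.get("citation_count", 0) / 100.0
    return source_score + citation_score

def select_primary(versions: list[DocumentVersion]) -> DocumentVersion:
    """Select the version with the highest priority value for presentation."""
    return max(versions, key=priority_rule)
```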
Another patent, which shares an inventor with this one, refers to one of the duplicate content URLs being chosen as a representative page, though it doesn’t use the word “canonical.” From that patent:
Duplicate documents sharing the same content are identified by a web crawler system. Upon receiving a newly crawled document, a set of previously crawled documents, if any, sharing the same content as the newly crawled document is identified. Next, information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Next, duplicate documents are included and excluded from the new set of documents based on a query-independent metric for each such document. Finally, a single representative document for the new set of documents is identified following a set of predefined conditions.
In some embodiments, a method for selecting a representative document from a set of duplicate documents includes: selecting the first document in a plurality of documents on the basis that the first document is associated with a query-independent score, where each respective document in the plurality of documents has a fingerprint that identifies the content of the respective document, the fingerprint of each respective document indicating that each respective document has substantially identical content to every other document in the plurality of documents, and a first document in the plurality of documents is associated with the query-independent score. The method further includes indexing, in accordance with the query-independent score, the first document, thereby producing an indexed first document, and including, for the plurality of documents, only the indexed first document in a document index.
This other patent is:
Representative document selection for a set of duplicate documents
Inventors: Daniel Dulitz, Alexandre A. Verstak, Sanjay Ghemawat, and Jeffrey A. Dean
Assignee: Google Inc.
US Patent: 8,868,559
Granted: October 21, 2014
Filed: August 30, 2012
Abstract
Systems and methods for indexing a representative document from a set of duplicate documents are disclosed. Disclosed systems and methods comprise selecting a first document in a plurality of documents because the first document is associated with a query independent score. Each respective document in the plurality of documents has a fingerprint that indicates that the respective document has substantially identical content to every other document in the plurality of documents. Disclosed systems and methods further comprise indexing, following the query independent score, the first document, thereby producing an indexed first document. Concerning the plurality of documents, only the indexed first document is included in a document index.
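As a rough illustration of how this representative selection might work, here is a short Python sketch that groups crawled documents by a content fingerprint and indexes only one document per group, chosen by a query-independent score. The hash used as a fingerprint and the score field are assumptions for the example; the patent does not spell out these details.

```python
import hashlib
from collections import defaultdict

def fingerprint(content: str) -> str:
    """Hypothetical content fingerprint; the patent does not specify the function used."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def select_representatives(documents: list[dict]) -> list[dict]:
    """Keep only one representative per set of substantially identical documents.

    Each document is assumed to be a dict with 'url', 'content', and a
    query-independent 'score' (a stand-in for something like PageRank).
    """
    duplicate_sets = defaultdict(list)
    for doc in documents:
        duplicate_sets[fingerprint(doc["content"])].append(doc)

    index = []
    for dup_set in duplicate_sets.values():
        # Only the document with the highest query-independent score is indexed.
        index.append(max(dup_set, key=lambda d: d["score"]))
    return index
```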
Regardless of whether the primary version of a set of duplicate pages is treated as the representative document as suggested in this second patent (whatever that may mean exactly), I think it’s important to understand what a primary version of a document might be.
Why One Version Among a Set of Duplicate Content Might Be Considered a Primary Version
The primary version patent provides some reasons why one of them might be considered a primary version:
(1) Including different versions of the same document does not provide additional useful information, and it does not benefit users.
(2) Search results that include different versions of the same document may crowd out diverse content that should be included.
(3) When multiple different versions of a document are present in the search results, the user may not know which version is most authoritative, complete, or best to access. Thus, they may waste time accessing the different versions to compare them.
Those are the three reasons this duplicate content patent gives for identifying a primary version from among the different versions of a document that appear on the Web. The search engine also wants to furnish “the most appropriate and reliable search result.”
How does it work?
The patent tells us that one method of identifying a primary version is as follows.
The different versions of a document are identified from several different sources, such as online databases, websites, and library data systems.
For each document version, a priority of authority is selected based on:
(1) The metadata information associated with the document version, such as
- The source
- Exclusive right to publish
- Licensing right
- Citation information
- Keywords
- Page rank
- The like
(2) As a second step, each document version is checked for length qualification using a length measure. The version with a high priority of authority and a qualified length is deemed the primary version of the document.
If none of the document versions has a high priority and a qualified length, then the primary version is selected based on the totality of information associated with each document version. A sketch of this two-step selection appears below.
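Here is the sketch mentioned above: a hedged Python outline of that two-step selection. The priority threshold, the length measure, and the fallback scoring are all assumptions made for illustration; the patent only describes these steps at a high level.

```python
def is_length_qualified(version: dict, min_length: int = 2000) -> bool:
    """Assumed length measure; the patent only says that a length measure is used."""
    return len(version.get("content", "")) >= min_length

def primary_version(versions: list[dict], high_priority: float = 3.0) -> dict:
    """Pick a primary version using priority of authority plus length qualification."""
    # Step 1 + 2: prefer versions that have both a high priority of authority
    # (derived from their metadata) and a qualified length.
    qualified = [v for v in versions
                 if v["priority_of_authority"] >= high_priority
                 and is_length_qualified(v)]
    if qualified:
        return max(qualified, key=lambda v: v["priority_of_authority"])

    # Fallback: no version is both high priority and length qualified, so use
    # the "totality of information" -- here approximated by a combined ordering.
    return max(versions,
               key=lambda v: (v["priority_of_authority"], len(v.get("content", ""))))
```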
The patent tells us that works of scholarly literature are well suited to the process it describes:
Because works of scholarly literature are subject to rigorous format requirements, documents such as journal articles, conference articles, academic papers, and citation records of journal articles, conference articles, and academic papers have metadata information describing the content and source of the document. As a result, works of scholarly literature are good candidates for the identification subsystem.
Metadata that might be looked at during this process could include such things as:
- Author names
- Title
- Publisher
- Publication date
- Publication location
- Keywords
- Page rank
- Citation information
- Article identifiers such as Digital Object Identifier, PubMed Identifier, SICI, ISBN, and the like
- Network location (e.g., URL)
- Reference count
- Citation count
- Language
- And so forth
The duplicate content patent goes into more depth about the methodology behind determining the primary version of a document:
The priority rule generates a numeric value (e.g., a score) to reflect how authoritative, complete, or easy to access a document version is. In one example, the priority rule determines the priority of authority assigned to a document version by the source of the document version, based on a source-priority list. The source-priority list comprises a list of sources, each source having a corresponding priority of authority. The priority of a source can be based on editorial selection, including consideration of extrinsic factors such as the reputation of the source, the size of the source’s publication corpus, recency or frequency of updates, or any other factors. Each document version is thus associated with a priority of authority; this association can be maintained in a table, tree, or other data structure.
The patent includes a table illustrating the source-priority list.
The patent includes some alternative approaches as well. For example, it tells us that “the priority measure for determining whether a document version has a qualified priority can be based on a qualified priority value.”
A qualified priority value is a threshold to determine whether a document version is authoritative, complete, or easy to access, depending on the priority rule. When the assigned priority of a document version is greater than or equal to the qualified priority value, the document is deemed to be authoritative, complete, or easy to access, depending on the priority rule. Alternatively, the qualified priority can be based on a relative measure. Given the priorities of a set of document versions, only the highest priority is deemed a qualified priority.
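The two ways of deciding whether a priority is “qualified” can be sketched as follows, again with made-up numbers: an absolute measure, where the priority must meet a qualified priority value, and a relative measure, where only the highest priority in the set qualifies. The source names and values are hypothetical, standing in for the source-priority table the patent illustrates.

```python
# Hypothetical source-priority list, standing in for the table the patent illustrates.
SOURCE_PRIORITY_LIST = {
    "journal_publisher": 5,
    "university_repository": 4,
    "citation_database": 2,
}

def priority_of_authority(version: dict) -> int:
    """Look up the priority of authority assigned to this version's source."""
    return SOURCE_PRIORITY_LIST.get(version["source"], 1)

def qualified_by_threshold(version: dict, qualified_priority_value: int = 4) -> bool:
    """Absolute measure: the priority must meet or exceed the qualified priority value."""
    return priority_of_authority(version) >= qualified_priority_value

def qualified_by_relative_measure(version: dict, versions: list[dict]) -> bool:
    """Relative measure: only the highest priority among the set is deemed qualified."""
    highest = max(priority_of_authority(v) for v in versions)
    return priority_of_authority(version) == highest
```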
Duplicate Content Takeaways
I was in a Google Hangout on Air within the last couple of years where several other SEOs (Ammon Johns, Eric Enge, and Jennifer Slegg) and I asked John Mueller and Andrey Lipattsev some questions about duplicate content. It seems to be something that still raises questions among SEOs.
The patent goes into more detail about determining which duplicate content might be the primary document. Unfortunately, we can’t tell whether that primary document might be treated as the canonical URL for all of the duplicate documents, as suggested in the Dejan SEO article I linked to at the start of this post. Still, it is interesting to see that Google has a way of deciding which version of a document might be the primary version. I didn’t go into much depth about qualified lengths being used to help identify the primary document, but the patent does spend some time going over that.
Is this a little-known ranking factor? The Google patent on identifying a primary version of duplicate content does seem to place some importance on identifying what it believes to be the most important version among many duplicate documents. I’m not sure there is anything here that most site owners can use to help their pages rank higher in search results, but it’s good to see that Google may have explored this topic in more depth.
Another page I wrote about duplicate content is this one: How Google Might Filter Out Duplicate Pages from Bounce Pad Sites
Hi Bill,
I am always thankful for the informative posts that you share.
I have been following your blog for a long time, and this time I got something that I was looking for.
Thanks for the great share.
Have a good weekend.
Hi Robin,
Glad that you liked this post. I didn’t know that Google would spend a lot of effort determining which page was a primary version of a set of duplicate pages. You have a good weekend, too.
Usually, I never comment on blogs, but your article convinced me to comment on it, as it is written so well. And telling someone how awesome they are is essential, so for my part I hope to convince you to write more often.
Hi Vivek,
Thank you. Glad you liked this post.
Bill
One of the factors you listed from the patent for determining the primary document is “the source.” To me, this means Google is using a quality, authority, and/or trustworthiness score as a factor for determining which domain wins the primary document SERP battle for duplicate content. Interestingly, hasn’t Google also been pushing published and modified dates in schema more recently too?
Hi there, I simply became aware of your blog through Google, and found that it’s really informative. I will be grateful if you continue this in the future. Lots of people will benefit from your writing. Cheers!
Google spends a lot of money on its search engine, so there’s a lot of effort put in to determine which page is the primary version of a set of duplicate pages. But many SEOs feel (argue) that they don’t see any of this in action… What do you think, Bill, is it true?
Found the information a worthy consideration. Thank you Bill…
Wow! I really appreciate the fact that you have written on this topic and made it so clear; it is a different topic, and very few people can write in a manner that makes everything clear. Also, I love the layout of your page, and the images used are very attractive.
Thank you Bill…
worth reading
Duplicate pages: I was always wondering how Google manages this. Thanks for helping me understand this issue better.
Thank you so much for posting this. I had heard that you “weren’t supposed to duplicate content,” but I didn’t know if that was true, or why it might be the case if it were. I now get it – finally – and thoroughly intend to read through lots more of your blog, seeing as I’m trying to sort out my own site. I made a lot of mistakes at the beginning, but you don’t know what it is that you don’t know! 🙂
Hi Erika,
Google spokespeople state that there is no such thing as a duplicate content penalty. Ideally, you don’t want to duplicate content, because every page on your site is a new opportunity to rank for something else that your visitors might be interested in visiting and reading. It’s possible that if Google sees the same title and snippet from two different pages, Google may filter one of those pages out of search results because they want to provide diverse results that are unique.
Hi Targetmedia,
SEOs wouldn’t see duplicated pages in search results if one or more of the duplicates have been filtered out of those search results.
I like your post, thanks for sharing.
Bill, thanks for adding another ranking factor to my website checks. Useful information like this must be kept and maintained, so I will put this one on my bookmark list! Thanks for this wonderful post, and I hope you post more like this!
It’s really a very helpful post for us. Thanks for sharing your great experience.
Hi Sir,
This is my first time visiting your website and reading a post, and it is really informative. I have 7 years of experience in SEO, but I learn daily.
I learned a lot of points from this post. A query that was already on my mind is now all cleared up.
Thank you so much, sir.
Hi Bill Slawski, your topic about identifying primary versions of duplicate pages is very informative, thanks for sharing this type of knowledge. Actually, I am a doctor and I have a website, which is why I am very curious about Google updates and algorithms. Keep sharing, Bill.
Wow, that’s interesting – I wasn’t even aware that Google is able to identify primary versions of duplicate pages (it shouldn’t be a surprise though, after all it’s Google…). But I wonder what the use of it is, and whether they actually act upon it and penalize the duplicates or not?
Hi Kas,
Google has told us that they don’t penalize duplicate content, but they have shown that they don’t like serving search results that appear to have multiple copies of the same page (showing off identical snippets), and might filter duplicates out of search results. Identifying a primary version of duplicates would give them an idea of which one to continue to show in search results when there is more than one version.
Plagiarism is the most dangerous thing for your website. If your website has duplicate content, Google’s robots will easily find it and send your website into the sandbox.
Hey, Bill. I came across this on Google, and I am stoked that I did. I will definitely be coming back here more often to read your quality content. This kind of quality content is hard to find. Thanks for sharing.
Just stumbled upon your website, and I’m so glad that I did; there are so many great articles here. I have 10 years of experience in SEO, but every day is a learning experience. That’s the best thing about SEO: you never stop learning. Just when you think that you know everything, Google throws you a curve ball. Great post, and I’m looking forward to reading more.
Hi Kumar,
I am not a fan of plagiarism, but I am also not a believer in a sandbox at Google. There is no duplicate content penalty at Google, even though duplicate content may be filtered from search results (which is why Google would try to identify a primary version of duplicate content). Google may start identifying sites that engage in plagiarism, and may consider those to be spamming the Web rather than performing some useful service, such as acting as a mirror when an original site becomes too busy for visitors (something that was more common before sites started figuring out load balancing better).
Some Google spokespeople have stated that there are times when Google does act as if there is a sandbox, but there is no sandbox penalty that new sites go into when they are just starting out (and the whole myth about a Google Sandbox has nothing to do with duplicate content).
Hi John,
Thanks. There is a lot of misinformation about things like duplicate content and how Google might handle it, so when I saw these two patents, I felt I had to write something about them. 🙂
Hi James,
Happy to hear that you are liking the site. I keep on looking for new stuff to write about, and keep on finding it week after week, especially looking at the patents that Google is being granted.
Nice one Bill! You made it so much easier to understand the whole procedure, thanks a lot.
Primary versions of duplicate pages are a hot topic recently. Your comments on this subject are very valuable, and I agree with you that it is crucial to be aware of the primary version of a document.
Thanks for sharing this Bill. It’s always interesting to find how Google works behind the scenes even for the simplest thing that I tend to ignore. Reading the blog has given me a great insight into understanding the whole concept of duplicate content and duplicate pages. Looking forward to reading more of your blogs.
I was confused for a long time about duplicate pages. Your article cleared all my doubts. Thanks a lot for sharing this article, Bill!
Such an informative blog. Thank you for sharing this post.
Hi Bill,
First of all, I want to thank you for writing this article in a simple way, so that new bloggers like me can understand it easily. Keep writing and helping us.
First of all, thanks for sharing such a great article with us; it’s really helpful for others. Thanks for sharing this amazing information.
Again, awesome post, @Bill. Today I actually have a question to share with you: if I post fresh content on my blog and after some time (at least a month) I want to publish the same content on another platform, how will it be taken or treated by the search engine?
Thanks for sharing such a great article with us.
Reading the blog has given me great insight into understanding the whole concept of duplicate content and duplicate pages. Thank you once again for sharing this valuable post.
I do not know what to say except that I have enjoyed reading your article. I will keep visiting this blog very often. I’ll use this information for my work.
I have been following your blog for a long time, and this time I got something that I was looking for. Thanks for the great share. I would like to appreciate your knowledge.
Duplicate pages have been on Google’s radar for a long time now, but there are still some duplicate pages that exist online, so this is a nice and informative way for us readers to learn how Google does its thing in identifying those types of pages.
Interesting post, I was curious how Google would filter out plagiarized content but I can see it becoming an issue for new sites to counter copied content.
Thanks for the info Bill!
I think for the most part, people are generally safe in that Google’s indexing/crawls are much faster, and the source of originating content is nearly always the author of the content rather than content scraping/aggregate scraping – although I think this varies depending on the niche. I would say content issues tend to arise more for sites in the “NEWS” sector, which can effectively scrape content before the original is indexed.
In terms of content and link authority – I agree that Google can and does favour content depending on internal and external link equity.
Things like SCHEMA (attribution for the author) are likely to help reduce issues with duplicate content being scraped.
I also wonder if Google relies on any third-party platforms like Copyscape or Siteliner?
Nice piece of information. Great content. Keep writing and sharing content like this.
Thank you for the information. I had the misconception that Google did penalize duplicate content.
Hi Cris,
Google spokespeople insist that they do not penalize duplicate content. However, there is a chance that they could filter duplicate content out of search results if they think some results contain the same content as something they are already showing. You can see this if you perform a search, and look through the results to the last page that shows results. There will sometimes be a linked statement there that says something like “There are more results for this query, but they are substantially similar to results we are already showing. Click here to see those results.”
Thanks Bill, we had issues with duplicate content for our sites, and what Google did was not index the supposedly duplicate pages. Now it has become a bit clearer to me what Google may consider duplicate.
Hey! I read this really useful and beneficial information. Thanks for sharing.
Hi Bill
Wonderful information you share about Identifying Primary Versions of Duplicate Pages. I am a big fan of yours, and I follow all your articles; they contain really awesome information. I am really thankful to you for sharing the amazing content you produce.
Keep it up!
What a great little article which immediately put my mind at rest about some duplication caused by our having a Dev website accidentally indexed on the Dev url before going live!
Oh boy, what a piece of information is given on this blog; it cleared up the concept of the patent for me. I was under the impression that if there are duplicate pages, then Google penalizes them. I’m glad to hear that Google decides based on URL authority which page to prefer.
Thanks Bill.
This is a very informative blog. I got insights into duplicate content and how Google identifies the primary version of a duplicate page.
This is certainly one of the most eloquent and educational pieces of content I’ve ever read about how Google identifies duplicates. I was wondering how an institution could be penalized for duplicating content without knowing the criteria used to do so.
Your work goes a long way toward enhancing my approach to blogging and searching for domains.
Thank you so much
I had gone through this article and found the analysis on why not to publish duplicate content in any blog or page. I’m grateful to the author for the analysis of how Google finds duplicate pages.
Duplicate content is an interesting topic for sure. I’ve personally viewed it as OK on a local level, e.g., using the same content for “Plumber London” and “Plumber Manchester” and just obviously swapping out the areas.
It’s certainly not OK for broader terms like “Best wireless mouse.” Copying someone else’s content will not get you far!
Great article and well worth the read.
Thank you Joshua.
I agree that there isn’t much value in copying other people’s words.
It’s true that Google always tends toward original content and ideas. The main motive is to provide relevant content to the person who searches. The copied-content identification process described here reflects that.
Excellent piece of information which has helped me to get rid of so many misconceptions. Keep sharing the high quality content.
Hi Bill,
All your content is very well explained, with figures. I enjoyed reading your articles, as they are easy to understand. Keep sharing awesome content like this!
Thanks for the article. I think that you could also add the topic of the website as a factor. For example, if you are a news website, you can have duplicate content without a penalty.