How Google might Identify Primary Versions of Duplicate Pages

We know that Google doesn’t penalize duplicate pages on the Web, but it may try to identify which version it prefers over other versions of the same page.

I came across this statement about duplicate pages on the Web earlier this week, wondered about it, and decided to investigate further:

If there are multiple instances of the same document on the web, the highest authority URL becomes the canonical version. The rest are considered duplicates.

~ Link inversion, the least known major ranking factor.

I read that article from Dejan SEO about duplicate pages and thought it was worth exploring further. While looking through Google patents that include the word “authority,” I found this patent, which doesn’t quite say the same thing that Dejan does, but which describes ways to distinguish between duplicate pages on different domains based upon priority rules. That makes it relevant to determining which duplicate page might be the highest-authority URL for a document.

The patent is:

Identifying a primary version of a document
Inventors: Alexandre A. Verstak and Anurag Acharya
Assignee: Google Inc.
US Patent: 9,779,072
Granted: October 3, 2017
Filed: July 31, 2013

Abstract

A system and method identifies a primary version out of different versions of the same document. The system selects a priority of authority for each document version based on a priority rule and information associated with the document version and selects a primary version based on the priority of authority and information associated with the document version.

Since the claims of a patent are what examiners at the USPTO review when prosecuting a patent and deciding whether or not it should be granted, I thought it would be worth looking at the claims to see if they encapsulate what the patent covers. The first claim captures some aspects worth thinking about when discussing different document versions of particular duplicate pages, and how the metadata associated with a document might be examined to determine which is the primary version:

What is claimed is:

1. A method comprising:

  • identifying, by a computer system, a plurality of different document versions of a particular document;
  • identifying, by the computer system, a first type of metadata that is associated with each document version of the plurality of different document versions, wherein the first type of metadata includes data that describes a source that provides each document version of the plurality of different document versions;
  • identifying, by the computer system, a second type of metadata that is associated with each document version of the plurality of different document versions, wherein the second type of metadata describes a feature of each document version of the plurality of different document versions other than the source of the document version;
  • for each document version of the plurality of different document versions, applying, by the computer system, a priority rule to the first type of metadata and the second type of metadata, to generate a priority value;
  • selecting, by the computer system, a particular document version, of the plurality of different document versions, based on the priority values generated for each document version of the plurality of different document versions; and
  • providing, by the computer system, the particular document version for presentation.

This patent doesn’t advance the claim that the primary version of a document is treated as the canonical version of that document, with all links pointing at the duplicates redirected to the primary version.

There is another patent, sharing an inventor with this one, that refers to one of the duplicate-content URLs being chosen as a representative page, though it doesn’t use the word “canonical.” From that patent:

Duplicate documents, sharing the same content, are identified by a web crawler system. Upon receiving a newly crawled document, a set of previously crawled documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.

In some embodiments, a method for selecting a representative document from a set of duplicate documents includes: selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score, where each respective document in the plurality of documents has a fingerprint that identifies the content of the respective document, the fingerprint of each respective document in the plurality of documents indicating that each respective document in the plurality of documents has substantially identical content to every other document in the plurality of documents, and a first document in the plurality of documents is associated with the query-independent score. The method further includes indexing, in accordance with the query independent score, the first document thereby producing an indexed first document; and with respect to the plurality of documents, including only the indexed first document in a document index.

This other patent is:

Representative document selection for a set of duplicate documents
Inventors: Daniel Dulitz, Alexandre A. Verstak, Sanjay Ghemawat and Jeffrey A. Dean
Assignee: Google Inc.
US Patent: 8,868,559
Granted: October 21, 2014
Filed: August 30, 2012

Abstract

Systems and methods for indexing a representative document from a set of duplicate documents are disclosed. Disclosed systems and methods comprise selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score. Each respective document in the plurality of documents has a fingerprint that indicates that the respective document has substantially identical content to every other document in the plurality of documents. Disclosed systems and methods further comprise indexing, in accordance with the query independent score, the first document thereby producing an indexed first document. With respect to the plurality of documents, only the indexed first document is included in a document index.
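To make the selection mechanism in this second patent more concrete, here is a minimal sketch in Python. The content-hash fingerprint and the dictionary format for documents are my own assumptions for illustration; the patent does not prescribe a particular fingerprint function or query-independent metric (a PageRank-style score is one obvious candidate).

```python
import hashlib
from collections import defaultdict

def fingerprint(content: str) -> str:
    # Hash whitespace-normalized, lowercased content so substantially
    # identical documents collide. (A content hash is an assumption;
    # the patent doesn't name a specific fingerprint function.)
    normalized = " ".join(content.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def select_representatives(docs):
    # Group documents by fingerprint, then keep only the document with
    # the highest query-independent score from each duplicate set.
    groups = defaultdict(list)
    for doc in docs:  # each doc: {"url": ..., "content": ..., "score": ...}
        groups[fingerprint(doc["content"])].append(doc)
    return [max(group, key=lambda d: d["score"]) for group in groups.values()]

docs = [
    {"url": "https://a.example/paper", "content": "Same text.", "score": 0.9},
    {"url": "https://b.example/copy", "content": "Same  text.", "score": 0.4},
    {"url": "https://c.example/other", "content": "Different text.", "score": 0.7},
]
for rep in select_representatives(docs):
    print(rep["url"])  # a.example and c.example survive; b.example is filtered out
```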

Regardless of whether the primary version of a set of duplicate pages is treated as the representative document as suggested in this second patent (whatever that may mean exactly), I think it’s important to get a better understanding of what a primary version of a document might be.

The primary version patent provides some reasons why one of them might be considered a primary version:

(1) Including different versions of the same document does not provide additional useful information, and it does not benefit users.
(2) Search results that include different versions of the same document may crowd out diverse content that should be included.
(3) Where there are multiple different versions of a document present in the search results, the user may not know which version is most authoritative, complete, or best to access, and thus may waste time accessing the different versions in order to compare them.

Those are the three reasons this duplicate-pages patent gives for identifying a primary version from among the different versions of a document that appear on the Web. The search engine also wants to furnish “the most appropriate and reliable search result.”

How does it work?

The patent tells us that one method of identifying a primary version is as follows.

The different versions of a document are identified from a number of different sources, such as online databases, websites, and library data systems.

For each document version, two steps are applied:

(1) First, a priority of authority is selected based on the metadata information associated with the document version, such as:

  • The source
  • Exclusive right to publish
  • Licensing right
  • Citation information
  • Keywords
  • Page rank
  • The like

(2) As a second step, the document versions are checked for length qualification using a length measure. The version with a high priority of authority and a qualified length is deemed the primary version of the document.

If none of the document versions has both a high priority and a qualified length, then the primary version is selected based on the totality of information associated with each document version.
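Read as pseudocode, that two-step decision might look something like the sketch below. This is my own reading of the patent’s structure: the length threshold, the per-version dictionaries, and the fallback weighting are invented placeholders, not values from the patent.

```python
def pick_primary(versions, priority_rule, length_threshold=2000):
    # Step 1: score each version with the priority rule.
    scored = [(priority_rule(v), v) for v in versions]
    top = max(p for p, _ in scored)
    # Step 2: prefer a top-priority version whose length qualifies.
    qualified = [v for p, v in scored
                 if p == top and v["length"] >= length_threshold]
    if qualified:
        return qualified[0]
    # Fallback: weigh the totality of information associated with each
    # version. This particular weighting is purely illustrative.
    return max(versions, key=lambda v: v["citations"] + v["length"] / 1000.0)
```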

The patent tells us that works of scholarly literature are particularly well suited to this process:

Because works of scholarly literature are subject to rigorous format requirements, documents such as journal articles, conference articles, academic papers and citation records of journal articles, conference articles, and academic papers have metadata information describing the content and source of the document. As a result, works of scholarly literature are good candidates for the identification subsystem.

Metadata that might be looked at during this process could include such things as (modeled as a simple record in the sketch after this list):

  • Author names
  • Title
  • Publisher
  • Publication date
  • Publication location
  • Keywords
  • Page rank
  • Citation information
  • Article identifiers such as Digital Object Identifier, PubMed Identifier, SICI, ISBN, and the like
  • Network location (e.g., URL)
  • Reference count
  • Citation count
  • Language
  • So forth
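As a rough illustration only, that metadata bundle could be modeled as a record like the one below; the field names are mine, not the patent’s.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class VersionMetadata:
    url: str                     # network location
    source: str                  # who provides this version
    title: str
    authors: List[str] = field(default_factory=list)
    publisher: Optional[str] = None
    publication_date: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    identifiers: Dict[str, str] = field(default_factory=dict)  # e.g. {"doi": "...", "pmid": "..."}
    reference_count: int = 0
    citation_count: int = 0
    page_rank: float = 0.0
    language: str = "en"
```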

The duplicate pages patent goes into more depth about the methodology behind determining the primary version of a document:

The priority rule generates a numeric value (e.g., a score) to reflect the authoritativeness, completeness, or best to access of a document version. In one example, the priority rule determines the priority of authority assigned to a document version by the source of the document version based on a source-priority list. The source-priority list comprises a list of sources, each source having a corresponding priority of authority. The priority of a source can be based on editorial selection, including consideration of extrinsic factors such as reputation of the source, size of source’s publication corpus, recency or frequency of updates, or any other factors. Each document version is thus associated with a priority of authority; this association can be maintained in a table, tree, or other data structures.

The patent includes a table illustrating the source-priority list.
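Since the patent doesn’t publish that table here, this invented stand-in shows the shape a source-priority list might take; the sources and numbers are mine. A rule like this could serve as the priority_rule argument in the earlier sketch.

```python
# Editorially assigned priorities of authority; higher is more authoritative.
# These sources and values are invented for illustration.
SOURCE_PRIORITY = {
    "journal-publisher": 5,   # e.g., the journal's own site
    "citation-database": 3,
    "preprint-server": 2,
    "unknown-mirror": 1,
}

def priority_of_authority(version) -> int:
    # A priority rule based purely on the version's source.
    return SOURCE_PRIORITY.get(version["source"], 0)
```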

The patent includes some alternative approaches as well. It tells us that “the priority measure for determining whether a document version has a qualified priority can be based on a qualified priority value.”

A qualified priority value is a threshold to determine whether a document version is authoritative, complete, or easy to access, depending on the priority rule. When the assigned priority of a document version is greater than or equal to the qualified priority value, the document is deemed to be authoritative, complete, or easy to access, depending on the priority rule. Alternatively, the qualified priority can be based on a relative measure, such as given the priorities of a set of document versions, only the highest priority is deemed as qualified priority.
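Both variants are straightforward to express; the threshold value below is an arbitrary stand-in, not a number from the patent.

```python
from typing import List

QUALIFIED_PRIORITY_VALUE = 4  # arbitrary threshold, for illustration only

def is_qualified_absolute(priority: int) -> bool:
    # Absolute measure: qualified at or above a fixed threshold.
    return priority >= QUALIFIED_PRIORITY_VALUE

def qualified_relative(priorities: List[int]) -> List[bool]:
    # Relative measure: only the highest priority in the set qualifies.
    top = max(priorities)
    return [p == top for p in priorities]
```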

Takeaways

I was in a Google Hangouts on Air session within the last couple of years where a number of other SEOs (Ammon Johns, Eric Enge, and Jennifer Slegg) and I asked John Mueller and Andrey Lipattsev some questions about duplicate pages. It seems to be something that still raises questions among SEOs.

The patent goes into more detail about determining which of a set of duplicate pages might be the primary document. We can’t tell whether that primary document is treated as if it sits at the canonical URL for all of the duplicates, as the Dejan SEO article linked at the start of this post suggests, but it is interesting to see that Google has a way of deciding which version of a document might be the primary version. I didn’t go into much depth about the qualified length being used to help identify the primary document, but the patent does spend some time going over that.

Is this a little-known ranking factor? The Google patent on identifying a primary version of duplicate documents does place some importance on identifying what Google believes to be the most important version among many duplicates. I’m not sure there is anything here that most site owners can use to help their pages rank higher in search results, but it is good to see that Google may have explored this topic in this much depth.

Another page I wrote about duplicate pages: How Google Might Filter Out Duplicate Pages from Bounce Pad Sites

22 thoughts on “How Google might Identify Primary Versions of Duplicate Pages”

  1. Hi Bill,
    I am always thankful for the informative posts that you share.
    I have been following your blog for a long time and this time I got something that I was looking for.
    Thanks for the great share.
    have a good weekend.

  2. Hi Robin,

    Glad that you liked this post. I didn’t know that Google would spend a lot of effort determining which page was a primary version of a set of duplicate pages. You have a good weekend, too.

  3. Usually, I never comment on blogs but your article convinced me to comment on it as is written so well. And telling someone how awesome they are is essential so that on my part I convince you to write more often.

  4. One of the factors you listed from the patent for determining the primary document is “the source”. This to me means Google is using a quality, authority, and/or trustworthiness score as a factor for determining which domain wins the primary document SERP battle for duplicate content. Interestingly hasn’t Google also been pushing published and modified date in schema more recently too?

  5. Hi there, simply became aware of your blog thru Google, and located that it’s really informative. I will be grateful for those who continue this in future. Lots of people will be benefited from your writing. Cheers!

  6. Google spends a lot of money when it comes to its search engine so there’s a lot of effort put in to determine which page is a primary version of a set of duplicate pages.. But many SEOers feels (argue) they don’t see any of this in action… What do you think Bill, is it true?

  7. Wow! I really appreciate the fact that you have written on topic and made it so clear, it is a different topic and very less people can write in a manner that everything gets clear. Also, I love the layout of your page and the images used are very attractive.

  8. Goodness! I truly welcome the way that you have composed on subject and made it so clear, it is an alternate theme and less individuals can write in a way that everything gets clear. Additionally, I adore the design of your page and the pictures utilized are exceptionally alluring.

  9. Thank you so much for posting this. I had heard that you “weren’t supposed to duplicate content” but I didn’t know if that were true, or why it might be the case if it were. I now get it – finally – and thoroughly intend to read through lots more of your blog seeing as I’m trying to sort out my own site. I made a lot of mistakes at the beginning but you don’t know what it is that you don’t know! 🙂

  10. Hi Erika,

    Google spokespeople state that there is no such thing as a duplicate content penalty. Ideally, you don’t want to duplicate content, because every page on your site is a new opportunity to rank for something else that your visitors might be interested in visiting and reading. It’s possible that if Google sees the same title and snippet from two different pages, Google may filter one of those pages out of search results because they want to provide diverse results that are unique.

  11. Hi Targetmedia,

    SEOs wouldn’t see duplicated pages in search results if one or more of the duplicates have been filtered out of those search results.

  12. Useful information like this one must be kept and maintained so I will put
    this one on my bookmark list! Thanks for this wonderful post and hoping to
    post more of this!

  13. Hi Sir,

    I am first time visit your website and read this post. really informative. I have 7 years experience in SEO. but daily learn.

    I learn lots of points from this post. this query already in my mind today clear all those things.

    Thanks you so much sir.

  14. Hii Bill Slawski, Your topic about identify primary versions of duplicate pages is very informatic thanks to sharing this type of knowledge, actually, I am a doctor and I have a website that’s why I am very curious about Google updates and algorithm. Keep sharing bill
