Google on Duplicate Content Filtering and News Attribution Metatags

If you have an interest in how Google addresses duplicate content on the Web, today’s been an interesting day.

Google was granted a patent this morning that describes how Google might identify duplicate or near duplicate web pages and decide which version to show in search results, and which ones to filter out. It’s a process possibly close to what Google has been using for a while.

But…

Identifying the original source of content can be a pretty hard problem to solve on the Web.

What if Google had a smaller search vertical, where they carefully screened and identified all of the web publishers involved, and could convince them to help identify which content is original, and which is copied or duplicated?

To do that, Google introduced a new set of Source Attribution Metatags for Google News articles, which allow the publishers of breaking stories to tag those stories with an “original-source” metatag, and publishers who are syndicating those stories to use a “syndication-source” metatag. Google controls which sources show up in Google News results, and they note in their page about the source attribution metatags that:

If we find sites abusing these tags, we may, at our discretion, either ignore the site’s source tags or remove the site from Google News entirely.

The metatags would look something like this example from Google’s Page on the attribution tags:

<meta name="syndication-source" content="http://www.example.com/wire_story_1.html">
<meta name="original-source" content="http://www.example.com/scoop_article_2.html">
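
Google hasn’t described how it reads these tags, but a minimal sketch of how any crawler or site audit script might pull them out of a page could look like the following. It uses Python’s standard html.parser module; the class and function names are my own, invented for illustration.

```python
from html.parser import HTMLParser


class SourceTagParser(HTMLParser):
    """Collects syndication-source and original-source meta tags from a page."""

    WANTED = {"syndication-source", "original-source"}

    def __init__(self):
        super().__init__()
        self.sources = {}  # tag name -> list of URLs found for that tag

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = (attr_map.get("name") or "").lower()
        content = attr_map.get("content")
        if name in self.WANTED and content:
            self.sources.setdefault(name, []).append(content)


def extract_source_tags(html):
    """Return a dict mapping each attribution tag name to its list of URLs."""
    parser = SourceTagParser()
    parser.feed(html)
    return parser.sources


page = """
<html><head>
<meta name="syndication-source" content="http://www.example.com/wire_story_1.html">
<meta name="original-source" content="http://www.example.com/scoop_article_2.html">
</head><body>Story text</body></html>
"""

tags = extract_source_tags(page)
print(tags)
```

Values are collected in lists so that a page carrying the same tag more than once doesn’t lose any of its URLs.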

Reading through Google’s help page for those tags, I realized that this wasn’t the first time I’ve seen something from Google about letting publishers use metadata to indicate whether content was original or syndicated.

A Google patent application from a few years ago titled Agent Rank took a much broader approach to the use of the metadata though. Instead of limiting it to a small search vertical like Google News, it could be applied to all content published on the Web. I wrote about it at Search Engine Land in Google’s Agent Rank Patent Application.

What I didn’t include in my Search Engine Land article was the mention of how Google might distinguish between original content and syndicated content. From the patent:

[0023] Signatures can be portable or fixed to a particular web page or uniform resource locator (URL). For example, a syndicated columnist may wish to sign a column once upon creation, and have the signature follow the document wherever it is published. In other cases, the agent signing the content may wish to prevent their reputation from being used to draw traffic to sites they do not control.

In either instance, the metadata associated with the digital signature can indicate whether or not the reputation associated with the signing agent is portable or not. For example, in one implementation, the signature is linked to the URL of the site where the content is located by including the URL as metadata within the signed content.

The Agent Rank approach hinges upon every publisher on the Web having a unique digital signature, that can follow them around from one site to another.

Write a blog post on your blog – you sign it with your digital signature.

Write a guest blog post on someone else’s blog – again, you sign it with your digital signature.

Leave a comment on a blog you’ve never seen before – you attach your digital signature to it.

Your “reputation” follows you around to different sources, and the ranking of things you write, whether on your own pages or those owned by others, can be influenced by a reputation score for your work.

You can also assign metadata, as noted in the passage above, to indicate the original source of your material, and to prevent your reputation score from being used to rank other pages where your material may appear, such as upon a copy of something that you’ve written.
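
To make the portable versus fixed distinction concrete, here is a rough sketch of a signed record that carries the URL as metadata. It is not the patent’s implementation. As a stand-in for a real public-key digital signature it uses HMAC from Python’s standard library, which relies on a shared secret, and every name in it (sign_content, reputation_applies, the "portable" flag) is an assumption made for illustration.

```python
import hashlib
import hmac
import json


def _payload(content, metadata):
    # Serialize content and metadata together so both are covered by the signature.
    return json.dumps({"content": content, "metadata": metadata}, sort_keys=True).encode("utf-8")


def sign_content(agent_key, content, url=None):
    """Sign content. If url is given, the signature is fixed to that URL;
    otherwise it is marked portable and may follow the content anywhere."""
    metadata = {"portable": url is None, "url": url}
    signature = hmac.new(agent_key, _payload(content, metadata), hashlib.sha256).hexdigest()
    return {"content": content, "metadata": metadata, "signature": signature}


def reputation_applies(agent_key, record, found_at_url):
    """True if the signature is valid and may lend reputation at found_at_url."""
    expected = hmac.new(
        agent_key, _payload(record["content"], record["metadata"]), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected, record["signature"]):
        return False  # content or metadata changed after signing
    if record["metadata"]["portable"]:
        return True
    return record["metadata"]["url"] == found_at_url


key = b"demo-key-for-illustration-only"
fixed = sign_content(key, "My blog post", url="http://www.example.com/post")
portable = sign_content(key, "My syndicated column")

print(reputation_applies(key, fixed, "http://www.example.com/post"))
print(reputation_applies(key, fixed, "http://scraper.example.net/copy"))
print(reputation_applies(key, portable, "http://paper.example.org/column"))
```

Because the URL sits inside the signed payload, a copy of a fixed post on another site either fails the URL check or, if the metadata is edited to match, fails signature verification.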

The source attribution metatags for Google News sound like a limited version of the Agent Rank approach, in a much more controlled environment, and focusing upon known news sources as a whole rather than individual authors.

If the metatags work well on Google News, will we see something like them spread to other Google properties, like Web search?

Are they a stepping stone towards the use of something like Agent Rank?

Regardless of whether they are or not, they may help Google decide whether a particular piece of content is the original, or a duplicate or near duplicate.

Which brings me back to the newly granted Google patent. The patent is:

Clustering by previous representative
Invented by Joachim Kupke, David Michael Proudfoot
Assigned to Google
US Patent 7,836,108
Granted November 16, 2010
Filed: March 31, 2008

Abstract

A method may include identifying documents in a current clustering operation:

  • Assigning the identified documents to one or more clusters,
  • Selecting a current representative document for each of the one or more clusters,
  • Determining whether the current representative document has been re-crawled,
  • Determining a previous representative document with which the current representative document was previously associated in a prior clustering operation,
  • If it is determined that the current representative document has not been re-crawled, determining one of the one or more clusters to which the previous representative document has been assigned in the current clustering operation,
  • Combining one of the one or more clusters associated with the current representative document that has not been re-crawled with the one of the one or more clusters associated with the previous representative document into a combined cluster, and
  • Storing information regarding the combined cluster.
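
The steps in the abstract can be sketched roughly as follows. This is my own reading of the abstract rather than code from the patent. The function name, the inputs (a set of re-crawled documents and a lookup of each document’s representative from the prior run), and the use of a small union-find structure to combine clusters are all assumptions.

```python
def merge_by_previous_representative(clusters, pick_representative, recrawled, previous_rep):
    """
    clusters: list of sets of document ids from the current clustering operation
    pick_representative: function taking a set of ids and returning one id
    recrawled: set of document ids that have been re-crawled
    previous_rep: dict mapping a document id to the representative it was
                  associated with in the prior clustering operation
    """
    clusters = [set(c) for c in clusters]
    cluster_of = {doc: i for i, c in enumerate(clusters) for doc in c}
    parent = list(range(len(clusters)))  # union-find over cluster indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, cluster in enumerate(clusters):
        rep = pick_representative(cluster)
        if rep in recrawled:
            continue  # the abstract only describes combining when it was NOT re-crawled
        old_rep = previous_rep.get(rep)
        if old_rep is None or old_rep not in cluster_of:
            continue  # no prior association, or the old representative is gone
        a, b = find(i), find(cluster_of[old_rep])
        if a != b:
            parent[a] = b  # combine the two clusters

    combined = {}
    for i, cluster in enumerate(clusters):
        combined.setdefault(find(i), set()).update(cluster)
    return list(combined.values())


current = [{"a", "b"}, {"c", "d"}, {"e"}]
prior = {"a": "a", "b": "a", "c": "a", "d": "a", "e": "e"}
result = merge_by_previous_representative(current, min, recrawled={"a", "e"}, previous_rep=prior)
print(sorted(sorted(c) for c in result))
```

In this toy run, the representative "c" has not been re-crawled and was previously associated with "a", so its cluster is combined with the cluster that now holds "a".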

The patent doesn’t pinpoint one particular approach to identifying duplicate or near duplicate content over another. Instead, it picks up after that determination has been made, and clusters together documents that are very similar.

Google then picks one of the documents in the cluster to be representative of the others. It will show that one in search results, and leave the rest out.

If you want, you can see some of the pages that have been filtered out of search results.

Chances are, you’ve seen a statement like this at the end of a set of search results from Google:

In order to show you the most relevant results, we have omitted some entries very similar to the 26 already displayed. If you like, you can repeat the search with the omitted results included.

In that message, the phrase “repeat the search with the omitted results included” is a link that you can click to see the other results.

How does Google decide which page becomes the representative page that it will show in search results?

Under this patent, it appears that Google will look for what they call “quality information.”

Quality information can include information associated with a page such as:

  • Link information – about links pointing to and from the page, to other pages or within the same page.
  • The date a document is created,
  • A page (or document) rank (possibly PageRank in some instances),
  • Anchor text information,
  • How the URL looks (a short and/or a word-based URL might be better than a long and/or non-word based URL),
  • Popularity information,
  • Quality of a web site,
  • Age of a web site, and/or
  • Other kinds of information

These aren’t signals about how well a page might rank for a particular query, but rather they are used to decide which page might be chosen in search results when there is more than one choice and those pages have duplicate or near duplicate content.
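
As a rough illustration of choosing a representative from a cluster, the sketch below combines a few of the listed signals into a single score. The patent doesn’t disclose weights or a formula, so the fields, the weights, the caps, and the 60-character URL cutoff are all hypothetical values invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    inbound_links: int
    page_rank: float       # 0 to 10 scale, purely illustrative
    site_age_years: float


# Hypothetical weights, made up for this illustration only.
WEIGHTS = {"links": 1.0, "rank": 2.0, "age": 0.5, "short_url": 1.0}


def quality_score(doc):
    """Combine a handful of 'quality information' signals into one number."""
    score = WEIGHTS["links"] * min(doc.inbound_links, 100) / 100
    score += WEIGHTS["rank"] * doc.page_rank / 10
    score += WEIGHTS["age"] * min(doc.site_age_years, 10) / 10
    # A short URL gets a bonus, echoing the "how the URL looks" signal.
    score += WEIGHTS["short_url"] * (1.0 if len(doc.url) <= 60 else 0.0)
    return score


def choose_representative(cluster):
    """Pick the candidate with the highest score to stand for the cluster."""
    return max(cluster, key=quality_score)


cluster = [
    Candidate("http://www.example.com/turkey-ideas", 40, 5.0, 6.0),
    Candidate(
        "http://copies.example.net/index.php?id=8812&ref=feed&session=abc123def456&sort=asc",
        2, 0.0, 1.0,
    ),
]
print(choose_representative(cluster).url)
```

Nothing in the score depends on a query, which matches the point that these signals decide between duplicates rather than rank a page for particular search terms.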

It’s possible that the decision on which pages to show in Google News when there are multiple articles that have the same or similar content follows a similar analysis.

With the new source attribution metatags, it looks like Google is adding another signal that may make that determination easier.

If those tags work well in Google News, might we see something like Google’s Agent Rank applied to a larger part of the Web in the future?

It’s possible.


39 thoughts on “Google on Duplicate Content Filtering and News Attribution Metatags”


  1. * Link information – about links pointing to and from the page, to other pages or within the same page.
    * The date a document is created,
    * A page (or document) rank (possibly PageRank in some instances),
    * Anchor text information,
    * How the URL looks (a short and/or a word-based URL might be better than a long and/or non-word based URL),
    * Popularity information,
    * Quality of a web site,
    * Age of a web site, and/or
    * Other kinds of information

    This sounds pretty much like the information that is already used to rank web pages in the “normal” web search, so it’s not exactly anything we didn’t know before, I guess.

    The interesting part – in my opinion – would be the actual algorithm used to identify near duplicates – which is not part of the patent if I got that right (“The patent doesn’t pinpoint one particular approach to identifying duplicate or near duplicate content over another.”). As far as I know, Google uses “some sort of” shingles like explained in Identifying and Filtering Near-Duplicate Documents from Andrei Z. Broder but that’s sure not all they use, I bet…

  2. I like the agent rank approach, having a digital signature can really distinguish you from the rest and with it, you can leave breadcrumbs so that Google would have a hint of what’s originally yours and what’s not. It’s kind of clever actually.

  3. Well, I am not sure that identifying which is the original content is in Google’s primary interest. Google is interested in serving the “most important” page or the most relevant. We might argue – and the authoring webmaster surely will – that the original rendition should be served, but the duplicate copy might be better for the user. For example, perhaps there is additional, relevant information on the same page or the website has far more resources on the topic. It’s an interesting issue.

  4. Hi Andrew,

    I like the Agent Rank approach as well. The only issue that I see is how likely might it be that enough people sign up for digital signatures to make it work well.

    With Google News, you have to be an approved site to have your stories included, so those sites are clearly identified even though it’s on an organizational level rather than an individual one. Something like the source attribution metatags may work better within the context of Google’s News search than an agent rank system might in Google’s Web search because of that.

  5. Hi M.-J. Taylor,

    In some instances, you may just be right. One example that I can think of is when a news wire story is added to by a paper local to where a news event took place, and the local reporter added significant details to the story. Or when a blogger takes part of a news story or a blog post and adds a good number of details, creating something that has just as much value.

    I know that is a concern to the people at Google as well, and a Google paper from last year explores that topic: Detecting the Origin of Text Segments Efficiently (pdf).

  6. Agreed, Bill Slawski. :) And it’s good to see that Google does make some effort to determine the original iteration. However, with links as such a central part of their algo, if content is published on a site with more authority the second instance will probably outrank the original content. Webmasters of newer sites who republish their own content in article directories run into this issue, for example.

  7. Hi Pascal,

    A number of the elements in that list of signals that Google might use to determine which document in a cluster of similar documents is the one which should be representative of the cluster are likely very similar to the algorithms that might be used to determine which page should be shown in search results, and which page or pages might be filtered out. But, there’s a good chance that there are some differences that may be significant.

    The focus of the patents that I mention in the post, and the new source attribution metatags really aren’t upon the methods used to detect similarity amongst web pages, even though that is a fascinating topic in its own right.

    I’ve personally had some pages containing my own original content filtered out of search results and replaced by sites that duplicated my content.

    One, for instance, was a blog home page with a PageRank of 5, being filtered out of Google’s search results by a public Bloglines page containing excerpts of posts from that blog, with a PageRank of 0. I happened to be sitting in the front row of a conference during a “meet the crawlers” presentation, which included representatives from Ask.com, Yahoo, Google, and Microsoft, and managed to raise that problem, and ask the Ask.com representative if it was unethical that they were allowing those pages to be indexed. The Google representative looked mystified that the Bloglines feed was PageRank 0, and my PageRank 5 page was being filtered out of Google’s results. Within a week, Bloglines stopped allowing those types of pages to be indexed, and my blog homepage started appearing again in Google’s search results.

    Since that experience, I’ve felt like I have a personal investment in understanding why the search engines might filter out pages that they shouldn’t be, and trying to make sure that similar experiences don’t happen to other creators of original content that is copied by others.

    As for some of the approaches that Google might take to identifying near duplicate content, the Broder approach isn’t bad, but it has some limitations. I wrote about a patent from Google that pokes a hole or two in that approach in New Google Process for Detecting Near Duplicate Content, but notes that Google might choose to use that shingles based method in some instances, and use other methods, such as one from Moses Charikar in others.

    Another paper from Google on the topic that’s worth a look is: Detecting Near Duplicates for Web Crawling (pdf), which explores Dr. Charikar’s approach more fully, and builds upon it.

  8. Interesting…I do get a lot of my content “duped” as I see it on a number of sites. I’ve always been worried that I am being hurt by it… good to know!

  9. Great info Bill. I have always liked to think that myself and others that put the effort into writing original and unique content will be rewarded. There are so many article spinners and non-original content creators online its scary!

  10. [Quote]which allow the publishers of breaking stories to tag those stories with an “original source” meta tag, and publishers who are syndicating those stories to use a “syndication-source” metatag[/Quote]

    …right…no potential for abuse here…

    IF this tagging protocol ultimately ends up applying to normal search in addition to news items, I can already see junior webmasters the world over applying the “original source” meta tag to their duplicate content in an attempt to claim authorship.

    Interesting list of attributes used to track the original source. I have been experimenting with article marketing to see if my article would/could outrank higher PR article directories and have found that it seems to require a link to the original article on my site as one of my two resource links…

    Another great article, Bill. Well researched as usual…:)

  11. Hey Bill,
    thank you very much for your response. I’ve been searching for a long time, but I never felt this close to the “real” information on how Google identifies duplicate or near duplicate content. I’ve just read some of your articles about it and I had a glance at the Detecting Near Duplicates for Web Crawling paper.

    This part is especially interesting, because it seems likely (imho) that Google makes use of it:

    […]we first convert a web-page into a set of features, each feature tagged
    with its weight. Features are computed using standard IR
    techniques like tokenization, case folding, stop-word removal,
    stemming and phrase detection.[…]

    Unfortunately I’m not really familiar with the techniques that are mentioned. If there happens to be a time when you feel really bored, it would be great if you could shed some more light on this :)

    Cheers
    Pascal

  12. Google was also granted the “Approximate hashing functions for finding similar content” patent, number 7,831,531, a few days ago (9 November 2010).

  13. There is a long standing myth about duplicate content in Google – I think this step will help clarify things. It’s scary knowing there are companies who spin articles for you, not knowing it’s such a bad step to quickly get a high number of links from article directories with the same content & anchor text!

  14. Hi MJ,

    I don’t usually recommend that site owners publish their content on their own sites, and then republish it elsewhere for that very reason – there’s really no control over what the search engines will do in terms of filtering duplicate content. Often a better approach is to create independent but related articles.

    For instance, if you write an article on “10 New Ideas for Thanksgiving Turkey,” to publish on your site, a syndicated article on “Great Ways to Start Your Thanksgiving Celebration,” can be a great lead-in to your Turkey article, and an ideal way to lead people to your site.

  15. Hi Steve,

    There is always the possibility that if someone copies your content and republishes it, that it may outrank you in search results, and may even cause your copy to be filtered out of those results.

  16. Hi Mike,

    Thanks. I know what you mean. I spend a lot of time trying to come up with something unique and different when I post here, and it’s disappointing seeing someone take my effort, and the credit for it, by posting it on their own site.

  17. Hi Mark,

    Thank you. I’m not sure that the source attribution metatags by themselves would work well the way that they are, outside of Google News. I’m not sure that Google is convinced that they will work well just within the confines of Google News, either.

    I’ve experimented a little with article marketing, but I’d honestly rather not try to compete with myself too much.

  18. Hi Durham

    There’s a very good reason for the continuation of the confusion and myth about duplicate content when it comes to Google.

    The myth is that Google will not penalize your site for duplicate content. But, saying that is a myth is like saying that swiss cheese isn’t filled with a lot of holes.

    If Google identifies a site or sites as mirrors of another site, chances are that it may stop crawling one or more of the copies.

    If a site has been found to purposefully duplicate content for purposes of spamming a search engine, it will be penalized on the basis of “spamming” the search engine, even though that method of spamming is duplicating content.

    If more than one page publishes the same content, or very nearly the same content, it’s possible that only one page will appear in search results, and the duplicates or near duplicates will be filtered out of search results.

    If a site contains pages that are very similar, and possibly differ in very limited ways (such as the use of an almost fully completed template where only a few keywords are added), the search engine may not spend much effort in crawling and indexing those pages, and may focus upon finding and crawling more unique pages either elsewhere on the site, or on some other site completely.

    If a page doesn’t show up in search results, regardless of whether it’s filtered or penalized, it’s still not showing up in search results. If you want it to show up in search results, it really doesn’t matter whether Google says the result is being filtered, or the page is being penalized. It’s still not there either way.

  19. Hi Pascal,

    You’re welcome. Those techniques mentioned in the Detecting Near Duplicates for Web Crawling paper are fairly standard information retrieval approaches. I’ve written about a number of them here either as the focus of a post, or to help explain the context of a method described in a patent.

    Here are blog posts or articles that might help.

    On stopwords:

    Google Stopwords Patent

    On Stemming:

    Context Sensitive Stemming for Web Search (pdf)

    On Tokenization (within the context of URLs):

    Do Search Engines Look at Keywords in URLs?

    On Phrase detection:

    I’ve written a number of articles on Google and phrase based indexing. Here’s a link to the latest:

    Phrasification and Revisiting Google’s Phrase Based Indexing

  20. Yet again another amazing post… I’ve become a fan of your blog… Keep up the good work!!! I can rank your blog as one of the most informative blogs I have ever read about SEO.

  21. Customers of mine forced to use horrible CMS systems as a function of a lack of what is typically ‘seo friendly’ versions don’t realize just how lucky they’ve been BEING indexed, as the host/root/parasite domain is first in line relative to link-juice.
    In the end though it’s all a subjective dance SEOS are forced to pretend they can manipulate. OH YEAH… links.

    BUT … Don’t stop writing or I’ll have no reason to use fire fox.

  22. Pingback: Contenuto duplicato e Meta Tag di Attribuzione della fonte | Posizionamento Zen
  23. I wonder how many publishers will use these News Attribution Metatags – wouldn’t that just make it less likely that their story will appear over the original article?

  24. I do hope Google gets better at handling duped content no matter which method they use. In the past it has seemed quite abundant and never accurately distinguished for most relevant. Great report.

  25. This is an interesting article. Sounds like it applies most directly to news content and other frequently re-published information, but I hope this affects big-bully SEO providers as well. It’s frustrating to see big companies scarfing up SERP real estate with poorly-designed landing pages that are merely re-posted copy from a company’s official website.

    Thanks for the quality post. Interesting analysis.

  26. The information you give out in every post is like a milestone in the history of seo.. how do you know so much?

  27. Hi Lisa,

    I’m not sure how many publishers will adopt these new meta tags – and I haven’t been looking too hard, but I haven’t seen anyone create any plugins for them in different content management systems.

    I don’t know how much of an impact these might have – Google already decides which versions of news articles to show, and which ones to not feature as the main version of an article.

  28. Hi Buffalo SEO,

    There are a number of issues that Google faces when it comes to choosing which version of duplicate or near duplicate content to display. One of those is that Google looks at more than just “relevance” in making that decision. If two pages have the same main content, but one has a higher PageRank, for instance, it might be the one displayed.

    At this time, the source attribution meta tags do only apply to articles appearing in Google News. It’s hard to tell if it might be applied beyond those any time soon.

  29. As good as they are at finding out the original publisher… there are still those sites that scrape content and get crawled first. I had a run-in with one of these websites and Google could not figure out their content was all stolen. And they would rank above my post (the original source).

    This site even included a link back to the original but I could not overtake them in the rankings! Finally I had to threaten legal actions.

    I hope Google figures this out. I worry that many will not use the new meta tags. How about instant crawling…that would alleviate much of the problem.

  30. Hi Thomas,

    Unfortunately there are sites using scraped content that end up ranked ahead of the same content from the original publishers. I’ve experienced that a few times myself.

    These new meta tags are only for sites that are included and ranked in Google News, and there’s a good chance that we may not see their use broadened for use with other kinds of web sites, and in Google’s organic search results.

    I don’t think instant crawling would be sufficient – if the pages are dynamic – created when clicked upon, the time stamp associated with those show the current time of the crawler’s visit. The page visited first would seem to be the oldest, even if it were a copy.

  31. Pingback: Google News Introduces Meta Tags for Source Attribution | Vertical Leap Blog
