How Google May Use Schema Vocabulary to Reduce Duplicate Content in Search Results

One of the challenges of optimizing an e-commerce site that has lots of filtering and sorting options is creating a click path through the site so that all of the pages you want indexed by a search engine get crawled and indexed. This can require setting up the site so that some URLs are kept from being crawled and indexed, through the site’s robots.txt file, through parameter handling, and through meta robots elements set to noindex on some pages.

If that kind of care isn’t taken on a site, a lot more URLs on the site might be crawled and indexed than there should be. I worked on one e-commerce site that offered around 3,000 products and category pages, and had around 40,000 pages indexed in Google. Those included versions of URLs with both HTTP and HTTPS protocols, www and non-www subdomains, and many URLs with sorting and filtering parameters. After I reduced the site to a number of URLs that was closer to the number of products it offered, those pages ended up ranking better in search results.
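
As a rough illustration of that kind of consolidation, the sketch below collapses protocol, subdomain, and sorting or filtering variants of a URL into a single preferred form. The preferred scheme and host, and the parameter names (`sort`, `order`, `filter`, `color`), are assumptions made up for the example and are not taken from any real site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names used only for sorting and filtering.
IGNORED_PARAMS = {"sort", "order", "filter", "color"}

def preferred_url(url, scheme="https", host="www.example.com"):
    """Collapse protocol, subdomain, and sort/filter variants into one URL."""
    parts = urlsplit(url)
    # Keep only the query parameters that change what the page is about.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in IGNORED_PARAMS]
    return urlunsplit((scheme, host, parts.path or "/", urlencode(kept), ""))

variants = [
    "http://example.com/cameras?sort=price",
    "https://www.example.com/cameras?filter=new&sort=name",
    "http://www.example.com/cameras",
]
unique = {preferred_url(u) for u in variants}
print(unique)  # {'https://www.example.com/cameras'}
```

Three indexed variants reduce to one URL here. On a live site the same decision would be expressed through redirects, canonical link elements, or parameter handling rather than a script.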

Faceted Search organization
The structure of a site, and filtering and sorting options may cause lots of duplication.

What if a search engine could better identify which products or entities the pages on a site are about? It appears that there may be a way to make that happen. And that could lead to fewer search results that contain a lot of duplicative content:

camera store duplicative search results
It’s possible for pages with duplicative content to rank in the same search results

A Google patent application published last year describes how the search engine could use Schema vocabulary that describes entities, like that found at Schema.org, to identify when pages are about the same entities, and reduce the rankings for a duplicated page, or remove it from search results, or possibly cluster it in results with other pages about the same entity. The patent tells us that these are the advantages of following the processes described within the patent:

A search system can map structured data items to entities to determine what entities, if any, are referenced by a web page. Reducing duplicative search results from the same site can provide a user with a greater diversity of search results that identify a larger number of sites.

The patent points out an example of web page content that is marked up with vocabulary from schema.org to refer to a specific camera model:

<div itemscope itemtype="http://schema.org/Product">
<div itemprop="name">CameraFX Q410 Digital Camera</div>
<div itemprop="manufacturer">CameraFX</div>
<a itemprop="url"
href="http://www.camerastore.com/products/CameraFX_Q410.html">
</a>
<div itemprop="description">The CameraFX Q410 Digital Camera is ideal for any photographer, combining both high quality imaging that makes taking pictures easy.
</div>
<div>Product ID: <span itemprop="productID">32720176</span></div>
</div>

The schema vocabulary may enable entities to be associated with specific web pages, and help the search engine to avoid showing a lot of pages about the same entities.
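
To make the idea of pulling structured data items out of a page more concrete, here is a minimal sketch that reads microdata like the camera example and collects its `itemprop` values. It uses Python's standard `html.parser` module; the `MicrodataParser` class is a hypothetical helper written for this post, it handles only a single, flat `itemscope`, and it is not a description of how Google parses pages.

```python
from html.parser import HTMLParser

class MicrodataParser(HTMLParser):
    """Collect itemprop values from a single, non-nested itemscope."""

    def __init__(self):
        super().__init__()
        self.item = {}          # itemprop name -> value
        self._open_props = []   # stack of (tag, prop) still collecting text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item["@type"] = attrs["itemtype"]
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if tag == "a" and "href" in attrs:
            # For links, the property value is the URL, not the link text.
            self.item[prop] = attrs["href"]
        else:
            self._open_props.append((tag, prop))
            self.item[prop] = ""

    def handle_data(self, data):
        if self._open_props:
            self.item[self._open_props[-1][1]] += data

    def handle_endtag(self, tag):
        # Simplification: assumes no same-named tags nested inside a property.
        if self._open_props and self._open_props[-1][0] == tag:
            _, prop = self._open_props.pop()
            self.item[prop] = self.item[prop].strip()

SAMPLE = """
<div itemscope itemtype="http://schema.org/Product">
  <div itemprop="name">CameraFX Q410 Digital Camera</div>
  <a itemprop="url" href="http://www.camerastore.com/products/CameraFX_Q410.html"></a>
  <div>Product ID: <span itemprop="productID">32720176</span></div>
</div>
"""

parser = MicrodataParser()
parser.feed(SAMPLE)
print(parser.item)
```

The resulting dictionary (type, name, URL, product ID) is the sort of structured data item that could then be matched against a known entity.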

associate entity with resource
Entities on pages might be identified and associated with those pages.

The patent is:

Using structured data for search result deduplication
Publication number US20140280084 A1
Publication type Application
Publication date Sep 18, 2014
Filing date Mar 15, 2013
Inventors Daniel W. Dulitz
Assigned to: Google

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for providing deduplicated search results. One of the methods includes receiving a plurality of search results obtained in response to a query, wherein the plurality of search results identify respective resources that include markup language structured data items, wherein each resource is associated with an entity set of entity identifiers corresponding to respective structured data items of the resource. If a particular entity set of the plurality of entity sets is duplicative, a ranking score of a particular search result that identifies a resource associated the particular entity set that is duplicative is modified.

Take-Aways

The text on a page might identify the entity a page is about, but that might be a challenge sometimes. The patent gives us the example of the real-world entity the Statue of Liberty, which could be associated with aliases “the Statue of Liberty” and “Lady Liberty.”

It also might be described as being related to other entities such as being in a “located in:” relationship with New York City. The alias information and location information could be more easily identified using schema, and pages about the same entity could be more easily identified.
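
A small sketch of that alias idea follows. The entity identifier `e1`, the field names, and the `resolve_entity` helper are all hypothetical; the patent does not specify a data format, so this only shows how different names could resolve to the same entity.

```python
# Hypothetical entity record; the identifier and field names are made up.
ENTITIES = {
    "e1": {
        "name": "Statue of Liberty",
        "aliases": {"statue of liberty", "the statue of liberty", "lady liberty"},
        "located_in": "New York City",
    },
}

def resolve_entity(text):
    """Return the id of the entity whose alias matches the text, or None."""
    key = " ".join(text.lower().split())  # normalize case and whitespace
    for entity_id, record in ENTITIES.items():
        if key in record["aliases"]:
            return entity_id
    return None

print(resolve_entity("Lady Liberty"))           # e1
print(resolve_entity("The Statue of Liberty"))  # e1
print(resolve_entity("Eiffel Tower"))           # None
```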

Including schema.org vocabulary markup on web pages makes it easier to recognize the structured data items found on those pages, including when the same items appear on more than one page.

On some sites, the same camera model might be presented in multiple ways, such as sorted A-Z by name, sorted from highest price to lowest price, or sorted by popularity. The search system might map the structured data items on those pages to entities to “remove potentially duplicative search results that refer to the same camera models.”

The patent talks about assigning reference scores for entities that may be seen as related entities and entities which are the same entities but which use alias names.

This does seem to be a transition to a ranking system that better understands the entities found on pages, can map those entities to pages on a site, and might then care less about links on pages. The entities found on a page might be grouped into an entity set for that page:

For example, suppose that web pages of a camera merchant’s website are associated with camera model entities to represent four camera models. Structured data items on web page A can be parsed and mapped to entities to generate an entity set with elements {c1, c2, c4}. Likewise, structured data items on web page B can be parsed to generate an entity set with elements {c1, c2}, and structured data items on web page C can be parsed to generate an entity set with elements {c3, c4}.
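
Here is one way the entity sets in that example might be used to adjust rankings. This is only a sketch under my own assumptions: it treats a page as duplicative when every entity on it already appears on higher-scoring pages from the same site, and it halves that page's score. The scores, the `penalty` value, and the subset test are invented for illustration; the patent may define "duplicative" differently.

```python
def demote_duplicates(results, penalty=0.5):
    """results: list of (page, score, entity_set) tuples from one site.

    Returns (page, adjusted_score) pairs sorted by adjusted score.
    """
    covered = set()   # entities already shown by higher-scoring pages
    adjusted = []
    for page, score, entities in sorted(results, key=lambda r: r[1], reverse=True):
        if entities and entities <= covered:
            score *= penalty  # assumed: demote rather than remove
        covered |= entities
        adjusted.append((page, score))
    return sorted(adjusted, key=lambda r: r[1], reverse=True)

results = [
    ("A", 0.9, {"c1", "c2", "c4"}),
    ("B", 0.8, {"c1", "c2"}),
    ("C", 0.7, {"c3", "c4"}),
]
ranked = demote_duplicates(results)
print(ranked)  # [('A', 0.9), ('C', 0.7), ('B', 0.4)]
```

Page B adds no entity that page A does not already cover, so it drops below page C, which is the only page that mentions c3.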

entity set pages
Entities may be identified on pages of the site, and sets may be identified as appearing on pages.

A search engine that is aware of which entities are covered on which pages, and which pages duplicate coverage of entities, can decide which pages to display to searchers in an intelligent manner that sounds similar to how Google uses pagination markup.

39 thoughts on “How Google May Use Schema Vocabulary to Reduce Duplicate Content in Search Results”

  1. Hi Bill,

    Thanks for sharing, your article is spot on. When I have to do site architecture analysis I take a siloing approach and get rid of all the non-valuable content, either by removing it or via robots.txt.

    Have you seen a big impact when moving a site from not using schema.org to using it?

    Cheers,
    Julio

  2. Hi Julio,

    I try to use robots.txt and parameter handling and meta robots elements to get rid of the non-valuable content still now, too.

    Using schema.org vocabulary on pages does seem to have an impact on the rankings of pages; I just came across this patent application, and there’s no way of telling if Google has implemented what it describes yet, but I would say it probably makes sense to use it, and give Google more certainty regarding what pages are about.

  3. Google will surely use schema markup in the coming years and try to utilise it more than it does today. Removing duplicate content is a challenge for Google, and that’s why it may use this approach to avoid it in future.

    Thanks for sharing such great article with us. Cheers!

  4. Great stuff Bill, thanks! If only there was a way to know if Google uses this, more SEOs would adopt it….

  5. Hi Jice,

    The title may sound obvious, but if no one has applied for the patent, and the process involved makes sense, and is actually innovative in some ways, it makes sense for Google to pursue it, and file for it.

  6. Hi BM Central,

    You’re welcome. Setting up a site so that content isn’t duplicated across a number of sorting and filtering pages can be a challenge, and I’ve seen sites where it hasn’t been done correctly, so this was good to see.

  7. Hi Brian,

    There seem to be a number of reasons to set up Schema vocabulary on pages of a site, from rich snippets involving ratings and events and the display of contact information and phone numbers, to more certainty that Google is indexing the right items, to the possibility that schema could help in the presentation of knowledge panel markup for a site. It’s worth trying out, for all of the SEOs that haven’t been.

  8. Great post. I think it must be helpful for us.
    Thanks for your informative post.

  9. Ugh ~ It would surely be nice if the winds of change would settle for microformats (et al.). Implementing template refactoring work for sites already using legacy markup is quite labor intensive…

  10. Trouble is implementing schema is a non trivial cost – hard to sell a client on spending limited dev resource on a hunch

  11. Hi Bill,
    Pairs like “MainEntityOfPage/MainEntity” will do the job. Looks very possible, even obvious.
    Great news if they do implement, IMhO this needs to happen and will happen, ugly side is Structured Data being abused in search of fake relevance aiming at “not being deduplicated”.
    Question is, will Google find a “nofollow” solution for Schema lovers? And will this solution be as controversial as everything behind link juice? Yes, there’s a clear statement about manually banning offending domains from SD processing, but will this ever become algo-driven? I say yes, and it will be worse than “link wars”.
    Thanks for sharing this patent, I feel on good track after reading this 🙂

  12. Great article Bill. I’m a fan of Schema & I’m in the middle of proposing a lot of Schema markup for my company’s site’s page elements.

    Doesn’t using the rel=”canonical” tag solve any duplication issues?

    When I change the view of a product listing page therefore changing the URL we have a canonical tag in place for the static URL and that’s what Google sees and has indexed.

  13. Hi Maurice,

    A patent for a process from Google is a little more than a hunch. But I understand why your clients might want a clear and positive statement from Google that they are going to do something like this for certain. They could say something like that at the Google Developer’s Blog, or one of their other blogs; there’s no way of knowing for certain if they might, or when they might. We have seen Google move towards better understanding entities on the web, with their move towards the Knowledge Graph. There is value in having Schema vocabulary on your pages when it results in rich snippets, in rich answers like showing a phone number when someone searches for the company name and the word “phone”, or in contact information showing for a company when the company name is searched for. We have also seen event information show up in Google search results when event schema has been added to pages of a site. So we do seem to be moving towards a time when there are visible benefits to having Schema vocabulary on your pages.

  14. Hi Matt,

    Good to hear that you’re a fan of Schema. It does seem to be a direction that the search engines are headed towards.

    I’m hesitant to say that the use of a canonical link element solves duplicate content issues, because it doesn’t necessarily solve all of them, though it does solve some:

    (1) It points out the preferred domain and protocol for a page when those might somehow change.
    (2) If additional parameters are added to a page for purposes of tracking (sessions or users) it can show the URL for a product, without those additional tracking parameters added to it.

    If a page is part of a pagination series, unless you have a view-all page for that series, it’s often recommended that you make canonical link elements for those pages self-referential (pointing to themselves). Many people instead point to the first page in that series with the canonical link element. If a search engine treats that like a 301 redirect, as has been suggested they should, it means that any products on that pagination page don’t get crawled and indexed.

    If you have a series of pagination pages that show off products priced high to low, and then again low to high, your pagination links and canonical links should show those off as two different sets of pages, and while they end up having duplicated content, the canonical link elements on those don’t tell the search engine that.

    If Google maps out schema for products to series of pages, it can tell that the series showing products priced high to low, and low to high, and also the same products sorted by popularity, all contain the same entities or products within them. Under this patent, with schema set up for all products, Google might decide that it could demote or remove a couple of those series in search results for the site they appear upon.

  15. Hello Bill.
    Very interesting as usual.
    But this time, I think you took a shortcut somewhere.
    Your title, the examples, and all of your paper talks about schemas, so I presume that you were thinking all the way about schema.org vocabulary.
    But even the abstract (I haven’t read the patent, so perhaps it’s more precise) is only talking about markup language structured data items, which could include any kind of documented rdf markup.
    Don’t you think that this patent could cover the use of >any< markup to dedup, not just schemas ?
    What about html for instance, that can easily be seen as rdf triples, and are structured data items from a structured language ?
    More. There are relationships between vocabularies, like the sameas predicate for example (but many more exist in the rdf world of data), that could be used to consider visually differently marked data as having a common signature. Don’t you think inference could be used to detect identity between entities marked with different vocabularies, and then used to dedup ?
    More again, but we go far beyond application of this patent:
    It could be used for any kind of data from which an algo could extract rdf triples carrying enough semantics…

    I must be already sleeping now, to imagine this: it’s very late in france…

  16. Hi Mathieu,

    The patent focuses upon the use of structured data and explicitly uses examples from schema.org vocabulary. We can read more into it, but need to be careful if we do that. It does tell us:

    The structured data item specified by the markup language can correspond to a real world person, place, thing, or idea, for example. An example of a markup language schema for defining structured data items can be found at http://schema.org.

    We don’t know if the process described in the patent involves using structured data of other types. If it did, it might possibly be expanded to tell us about those as well. At this point, it appears to protect a process of using structured data in the form of schema vocabulary only, the way that it is written.

    I agree with you about using other structured data in similar ways. The patent can point us to other possibilities, but it’s limited in what it can exclude others from doing to what it actually covers.

  17. Really interesting one, but I did not find any proven store for the same statement. May you give me some store name which is having profited from mentioned markup. There are lots of canonical issues and CMS pages that should not be indexed or count write here.

  18. Content is always the king in SEO strategy and your article has helped me a lot in understanding minor things,

  19. Well I have implemented schema on one of my blog but till now I have not received any positive response from Google in rankings.
    Hope it will work soon and hoping for my keywords to jump on top 🙂
    Thanks for this valuable post on Schema 🙂

  20. Hi Vikas,

    This patent application was only published last year, so it’s possible that there aren’t any stores publicly known to have benefited from this, nor any pointed to by Google. Google rarely points out when they are using an approach like this, and I haven’t seen them publish anything other than this patent that describes them doing so.

  21. You are absolutely right Bill Slawski!!! I will try to implement it and if I get rid of all canonical issues I will share it here 🙂

  22. This website is very informative to read. I am a huge follower of the things you talk about. I also love reading the comments, but it seems like a great deal of readers need to stay on topic to try and add new things in the original topic.

  23. Great post Bill! I agree that we don’t know; however, managing clients means trying to “future proof” their SEO. This is even more important with small businesses that compete against major players in an industry. Building for the future is how they maintain the edge. I don’t think there is any disagreement that structured data, identifying entities, and extraction of info for the knowledge graph is the future. It won’t be deprecated for many years, so any use of it in a site is looking to the future. I was sceptical at first, especially when people I know and respected claimed to see advantages in ranking by using it. However, I did start adding it to new sites and redevs and did implement Organization and location type schemas for monthly SEO clients and my own sites. After 3 years of implementing it and watching, I now see it as a viable reason to re-dev sites that need a way to reach the next level in their SEO. These would be ecommerce and niches where the data being used for the content may be duplicated across many sites.

    Just my .02 Ca.

  24. Simple, but useful things, you wrote in article. Thank you. I hope I’ll receive a lot of other interesting information from your posts

  25. It is really one of the more knowledgeable articles, and through it we came to know more details about this.

    Regards,
    T&S Associates

  26. Very informative. Thanks for sharing how schema vocabulary and embedded structured data might reduce duplicate content in search results. Very nice and great explanation.

  27. I am very happy to read your article. I had studied articles like this before, but I understood nothing. Your article is so awesome. Now I understand.

    Thank You Bill Slawski
