Twitter Poll – How Does Google Index Content on the Web?

Google Indexes by Websites, Pages, or URLs

I thought this was an interesting question to ask people because I think it’s often misunderstood. Google treats content found at different URLs as if it is different content, even though it might be the same, such as in the following examples:

http://www.example.com
https://www.example.com
http://example.com
http://example.com/index.htm
http://example.com/Index.htm
http://example.com/default.asp
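
To make the idea concrete, here is a minimal Python sketch that groups URLs by a checksum of the content they return. The `pages` mapping, its HTML strings, and the `about.htm` URL are hypothetical stand-ins for fetched responses, and SHA-256 is just a convenient hash for the illustration; this is not a description of how Google actually fingerprints content.

```python
import hashlib
from collections import defaultdict

# Hypothetical crawl results: URL -> response body (made up for illustration).
pages = {
    "http://www.example.com": "<html><body>Welcome</body></html>",
    "https://www.example.com": "<html><body>Welcome</body></html>",
    "http://example.com/index.htm": "<html><body>Welcome</body></html>",
    "http://example.com/about.htm": "<html><body>About us</body></html>",
}


def group_by_content(pages):
    """Group URLs whose response bodies are byte-for-byte identical."""
    groups = defaultdict(list)
    for url, body in pages.items():
        checksum = hashlib.sha256(body.encode("utf-8")).hexdigest()
        groups[checksum].append(url)
    return list(groups.values())


# Keep only the groups where more than one URL serves the same content.
duplicate_sets = [urls for urls in group_by_content(pages) if len(urls) > 1]
print(duplicate_sets)
```

Each URL is a distinct key in the mapping; only the checksum reveals that several of them serve the same content.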

One of the most interesting papers I’ve come across on this topic is this one (one of the authors, Ziv Bar-Yossef, joined Google shortly after it was released):

Do Not Crawl in the DUST: Different URLs with Similar Text

What do you think?

40 thoughts on “Twitter Poll – How Does Google Index Content on the Web?”

  1. Hi Bill, another thought-provoking and interesting topic. The research paper is really well structured and well researched. This is the type of problem that we all struggle with at some time or other.

  2. Thanks, Alan.

    That paper is one of my favorites on Search, because it is a problem that I have seen frequently and struggled with myself. I’ve also been happy to see the author who went from writing it to working for Google on some patents from Google as well.

  3. A lot of the crawling and indexing patents speak of a ‘URL fingerprint’ (i.e. all variants of equal content checksums assigned to a single URL)

  4. I think Google indexes documents (or maybe what you call pages). However, a document may have multiple URLs, and these URLs may be on different domains.
    I noticed that when the PageRank toolbar was still working: just copy a page from any website (with exactly the same content; it should be exactly the same, not a near-duplicate), put it on another website with a totally different domain, and wait a bit: the page will have exactly the same PageRank as the original page (I think you didn’t even have to wait for the next toolbar update to see the PageRank, but I’m really not sure about that).

  5. Yes Bill, Google is going to index those URLs into different indexes, like a main index and a secondary index, because they are duplicate content… Panda penalty.

  6. Hi serphacker,

    The Google toolbar is completely out of the picture now, and we will never see it used again. But I remember when it used to show the front page of the New York Times website at http://nytimes.com as PageRank 7 and http://www.nytimes.com as PageRank 9 (the site wasn’t set up so that the non-www version would redirect to the www version; they fixed that before the toolbar disappeared). The toolbar wasn’t showing the same amount of PageRank even though the content was the same. Google was creating PageRank amounts based upon the URLs that pages were showing at, and not the content on those pages (PageRank isn’t based upon the content showing on pages).

    In the earliest version of PageRank, it would show in the search results what the PageRank was for URLs in Google’s Index. See the images from the PDF I linked to from this page:

    http://www.seobythesea.com/2011/12/10-most-important-seo-patents-part-1-the-original-pagerank-patent-application/

    Google would not give the same content on different domains the same amount of PageRank. PageRank was based upon the URLs that were linked to, and not the content found on pages. Here is a paper from the earliest days of Google that describes PageRank in more detail that you may find value in looking at:

    The PageRank Citation Ranking: Bringing Order to the Web
    http://ilpubs.stanford.edu:8090/422/

    Google has stopped updating PageRank in the Google Toolbar, so there will no longer be any Toolbar updates.

  7. Hi Claude,

    Thanks for bringing those issues up.

    Google spokespeople have told us time after time that they do not have a duplicate content penalty. I asked a couple of Googlers in an on-air hangout in June:

    http://webpromo.expert/google-qa-duplicate-content/

    Where you may have the same content at more than one URL, it’s often a good idea to reduce those as much as possible so that people don’t link to the different versions, and you don’t dilute the amount of PageRank that may go to one of those. That could be done with redirects and canonical link elements. So, if you have the same content at http://example.com/ and http://www.example.com, you would want to 301 redirect pages on your site so that the non-www versions of pages no longer show and are redirected to the www versions (thus passing along PageRank as well).

    What the Panda Update was intended to stop was when someone had a lot of thin or low quality content on the pages of their site. The DUST (Different URLs with Similar Text) problems described in the paper I linked to in the post aren’t the types of things that lead to a decrease in traffic based upon the Panda update. It can take some work and effort, but it is possible to eliminate those DUST problems. Cleaning up low quality content or thin content on a site can involve different efforts.

  8. “reducing crawling and indexing overhead” in their conclusions is interesting since most think about it via duplicate content issues vs. saving money/resources, and how they can do it via urls vs. content is fascinating

  9. Hi Bill,
    I was amazed to see your post.
    I was thinking earlier that Google ranks content based on the website, although it emphasizes the URL much more than the website or webpage. This means a post that has different URLs will be indexed each time.
    Does having different URLs for the same webpage increase traffic or not?
    Thanks for sharing with us.
    With regards,
    Saurav

  10. Hi Saurav,

    When Google notices that the same content exists at more than one URL, it might only show one of those URLs instead of both of them within search results. So, results from http://www.example.com/ and http://www.example.com/index.htm may be identical (and may have different amounts of PageRank depending upon whether or not there are links to them from other URLs), but Google might only show one of those URLs in the set of search results it shows because it may consider the content at those URLs to be “substantially similar.” The possibility does exist that other URLs on the same or other domains may link to both URLs that have the same content, and the problem with that is that PageRank that might be accumulated by just one of those URLs could be split between them. This could mean that the URL with the most PageRank might not rank as highly as it could if it had all of that PageRank, so having two different URLs could mean less traffic to the higher ranking URL. That’s why it is important to set up 301 redirects so that pages on the site that aren’t set up the way you want redirect to versions that you prefer and pass along PageRank. So, I have this site set up so that http://seobythesea.com 301 redirects to http://www.seobythesea.com, so that the “www” version of the URL gets all of the PageRank that could otherwise be split between the two URLs.

    You’re welcome.

    Bill

  11. Interesting stuff Bill and also that Ziv joined Google shortly after is telling. Liked your response to serphacker! Thanks for your time as ever.
    Pete

  12. Hi Gary,

    You would want the version that you want indexed by Google to be in Google’s Search Console, and not the versions that you don’t want indexed, because those versions shouldn’t be getting any traffic if you have redirects set up correctly.

  13. After coming LSI consent I absolutely love your blog and find a lot of your posts to be exactly what I’m looking for. I am also a blogger and I was curious about your situation; many of us have created some nice practices and we are looking to swap solutions with other folks, please shoot me an e-mail if interested. Great work.

  14. Yes, but a page or a website is a URL (with good canonical and redirection), no? The address of the page or website is defined by the URL… maybe I am missing something.

  15. Hey Bill, Pleasure to meet you here.

    Amazing post indeed. Glad to know that Google indexes the content on the web by URL. It was difficult for me to find out before how Google indexes content. I used to think that Google indexed content by page.

    You have explained it very well, and looking at the examples given here, I understood it very well. Thanks for sharing.

    I highly appreciate you sharing such a wonderful post. It clears up my misunderstanding about Google’s index.
    Have a great day.
    – Ravi.

  16. Hi Julien,

    Pages are the content that may exist at URLs, but sometimes the same pages on the web exist at more than one URL, like a page that might appear at both http://example.com and http://www.example.com. It does happen that sometimes a site isn’t set up to avoid having pages appear at more than one URL like that, and it’s something that site owners should be careful about.

  17. Hi Bill,
    The duplicate content penalty is something I have misunderstood for so long – thanks for clearing it up – and it’s true – the results for my site show differently depending on the prefix.

  18. Hi Bren,

    It’s technically not really a penalty. Instead, the fact that there may be more than one URL linked to show the same page means that PageRank is being split between the two different URLs, which means that the page won’t rank as highly as you would want it to. If the page is linked to at http://example.com and http://www.example.com, and under both URLs it shows the same content and page, the fact that you have two different URLs means that PageRank is getting split between them. If you were to 301 redirect one of those URLs to the other one, you would consolidate PageRank to one version of the URL, which would mean that version would be getting all the PageRank it should be, and ranking highly.

  19. Hi Krafting Networks,

    I was in an on-air hangout with John Mueller and Andrey Lipattsev in June 2016, and asked them if there was such a thing as a duplicate content penalty at Google, and both Google spokesmen insisted that there was no such Google penalty. You can watch them answer here:

    https://www.youtube.com/watch?v=KxCAVmXfVyI

    Thanks. Google treats those URLs exactly the same and will use the links pointed at them to calculate PageRank for each. That difference in those URLs should be solved, but canonicalization with a canonical link element isn’t necessarily the best solution. A simple 301 redirect from one version of the URLs on the domain to the other version is a directive to treat the URLs that way, while a canonical link element is only a suggestion to treat the specified URL as the preferred version and send PageRank to it. I would much rather have the redirect in place than the canonical link element. I like having canonical link elements on my pages for when people append tracking codes, such as utm parameters, to URLs on my site to count how often those URLs are followed when referenced on social networks.
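
The tracking-parameter case mentioned above can be sketched in Python. This is only an illustration: the function names `strip_tracking` and `canonical_link_element` are hypothetical, and it assumes that every query parameter starting with `utm_` is a tracking code that does not change the content of the page.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking(url):
    """Return the URL without utm_* query parameters or a fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_element(url):
    """Build the canonical link element a page could emit for this URL."""
    return '<link rel="canonical" href="{}">'.format(escape(strip_tracking(url)))


tagged = "http://www.example.com/post/?utm_source=twitter&utm_medium=social&page=2"
print(strip_tracking(tagged))
print(canonical_link_element(tagged))
```

Every tagged variant of the URL then points to the same preferred address, which is the job the canonical link element does when a redirect is not practical.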

  20. Whenever I visit your site, all I find is information and only information. I like reading your articles and getting a lot of the information that I need to rank my website.

  21. I am really glad to have visited this page, I love your website and what I see and learn in it, thanks for this amazing post.

  22. Good read, all this mention of dust makes me think I’m watching the Golden Compass!

    Was thinking of making the switch to https but a little apprehensive as to the duplicate content issue for a while.

  23. Hi Bill, your way of simple explanation is ultimate. I was in search of a person like you who can clear my doubts.
    1. I would like to clear up a doubt about the limitations of crawl engines. Is it true that almost all search engines do not crawl more than 125 to 150 links linking to a URL? Almost all SEO people warn against excessive links on a page.
    2. How many links can we go up to with 301 redirects, and how much juice / power will a redirect pass to the link?
    Every expert gives a different opinion on this topic. I hope you can clear up my doubt.

  24. Hi Maqsood,

    If there are many links linking to a URL, it is quite possible that Google will find and identify many more than 125 to 150. In the past, Google engineer Matt Cutts used to warn people not to put more than 100 links on one page of a site, but as he admitted after doing so, that wasn’t because of a limitation on Google’s part, but rather because he felt that it was a bad user experience decision to offer that many links to a visitor to a page. I’m not quite sure I understand your question about redirects, but I’ll guess. You do want to avoid redirects that lead to redirects that are then again redirected. These are known as redirect chains, and we’ve been warned by Google spokespeople that Google may stop following redirects that are too many levels deep. I remember coming across a website from a Fortune 50 business that had redirect chains that were up to 8 redirects deep. They really would have been doing themselves a favor by changing those redirects so that the URLs pointed directly to their final destinations, instead of going through all of those redirects.
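
The redirect chain cleanup described above can be sketched in Python. The `redirects` mapping and its URLs are hypothetical, and the helper names are made up for the illustration; a real site would read its rules from the server configuration instead.

```python
# Hypothetical redirect rules: source URL -> target URL.
redirects = {
    "http://example.com/a": "http://www.example.com/a",
    "http://www.example.com/a": "https://www.example.com/a",
    "https://www.example.com/a": "https://www.example.com/b",
}


def final_destination(url, redirects):
    """Follow redirects until a URL with no rule is reached; return (url, hops)."""
    seen = {url}
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen:
            raise ValueError("redirect loop at " + url)
        seen.add(url)
    return url, hops


def flatten(redirects):
    """Rewrite every rule so it points straight at its final destination."""
    return {source: final_destination(source, redirects)[0] for source in redirects}


print(final_destination("http://example.com/a", redirects))
print(flatten(redirects))
```

After flattening, every source URL reaches its destination in a single hop, so neither visitors nor crawlers have to walk the chain.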

  25. Hey Bill, thanks for that, always interesting. Slight tangent but I was wondering how this affected sites with mobile versions. The Huffington Post has m.huffpost.com/story but generally the links in the mobile serps show up as http://www.huffingtonpost.com/story. Oddly their “tag” pages seem to show as m.

    Should they include the m.huffpost.com as a separate property in Search Console? It’s a different url but same content but even with a canonical and the two linked together as Google recommends you’d assume one would have more equity? So is it best to just have the www in the serps and allow the redirect to happen post click?
    Appreciate your thoughts

  26. Hi Pedro,

    Ideally, they would only have one URL per page of content, rather than two different URLs showing pages with the same content on each. I’m not quite sure I’m following your question. I would check to see if the tag pages on the huffingtonpost site are noindexed.
