For years, the New York Times website was a great example I could point people to of a very high-profile site getting one of the basics of SEO badly wrong.
If you visited the site at “http://newyorktimes.com/” you would see a toolbar PageRank of 7 for its homepage. If instead you visited it at “http://www.newyorktimes.com/” you would see a toolbar PageRank of 9. The New York Times pages resolved at both sets of URLs, with and without the “www” hostname. Because every indexed page of the site was accessible both ways, PageRank was split between the two versions of the site, those pages weren’t getting all the PageRank they should have, and that probably cost the Times rankings at Google and traffic from the Web. Google also likely wasted its own bandwidth, and the Times’ bandwidth, by returning to crawl both versions of the site instead of just one.
A few years ago, someone with at least a basic knowledge of SEO came along and fixed the New York Times site so that if you followed a link to a page on the site without the “www”, you would be sent to the “www” version via a status code 301 redirect. The change ruined an example I loved showing people, one that demonstrated that even very well-known websites make mistakes and ignore the basics. It’s one of the things that makes the Web a place where small businesses can compete against much larger companies with much bigger budgets.
Many SEO basics have no written documentation from the search engines saying that they work one way or another, and no video from Google’s head of webspam, Matt Cutts, weighing in on the issue. But I love it when a patent comes out and gives us a glimpse of problems like this from a search engine’s perspective. And I love it when it gives me a reference I can point to when people question something that so many within the SEO industry take for granted.
Yesterday, Google was granted a patent on Detecting mirrors on the web (US Patent 8,055,626), invented by Arvind Jain and originally filed on August 9, 2005. The patent describes the problems caused by having a site accessible under two different hostnames.
The patent doesn’t have the visual and visceral impact of pointing to a site that seems to be doing everything right on the surface and yet is making a surprisingly simple and easily fixable error that probably cost it a great deal of search traffic over a few years.
But it’s nice being able to point to some statements like the following from the search engine itself, which defines the problem.
When multiple hostnames refer to the same content (i.e., the multiple hostnames are “mirrors” of one another), problems can be created for search engines that “crawl” and index content associated with the multiple hostnames.
If, for example, a search engine does not recognize two hostnames, that refer to the same content, as being the same, the search engine will crawl and index pages from both hostnames. This wastes crawl bandwidth and index space, and puts twice the crawl load on the website with the two hostnames. Also, multiple hostnames that refer to the same content can create problems in ranking search results.
Using existing ranking techniques, a given web page will be more highly ranked among other search results if it is pointed to by a large number of other pages. Therefore, if two hostnames, that refer to the same content, are treated separately for the purpose of ranking, the ranking of each hostname may only actually be about half what it would be if the hostnames were ranked together. (my emphasis)
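To make that “about half” point a little more concrete, here is a toy illustration. This is a bare-bones PageRank-style calculation over a made-up link graph, nothing close to Google’s actual system, and the hostnames are hypothetical: six external pages all linking to one canonical homepage give that page roughly twice the link-derived score it gets when three of them link to the “www” version and three link to the non-“www” mirror.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Tiny PageRank sketch. `links` maps each page to the pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


WWW = "http://www.example.com/"
BARE = "http://example.com/"

# Case 1: all six external pages link to the single canonical homepage.
combined = pagerank({f"page{i}": [WWW] for i in range(6)})

# Case 2: the same six links, split between the "www" and non-"www" mirrors.
split_links = {f"page{i}": [WWW] for i in range(3)}
split_links.update({f"page{i}": [BARE] for i in range(3, 6)})
split = pagerank(split_links)

print(f"All links to one host:   {combined[WWW]:.3f}")
print(f"Links split, www mirror: {split[WWW]:.3f}")  # roughly half of the combined score
```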
While the patent describes a method Google might use to help solve this problem, Google has had questionable success in determining when this kind of mirroring takes place. The best approach to fixing a problem like this is not to rely upon the search engine getting it right, but to take action so that it doesn’t have to. This hostname mirroring is usually easily solved by setting up a server-based permanent (301) redirect so that either the version with the “www” or the version without the “www” is chosen, and that one version is the one accessible to visitors.
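In practice, that redirect almost always lives in the web server configuration (an .htaccess rule, an nginx rewrite, or a setting in a hosting control panel) rather than in application code. But for anyone curious about what the server is actually doing, here is a minimal sketch, using Python’s standard http.server and a hypothetical preferred hostname, of a handler that answers any request arriving on the wrong hostname with a permanent 301 redirect to the “www” version:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PREFERRED_HOST = "www.example.com"  # hypothetical preferred hostname


class RedirectToPreferredHost(BaseHTTPRequestHandler):
    """Send a 301 to the preferred hostname for requests that arrive on any other host."""

    def do_GET(self):
        host = self.headers.get("Host", "").split(":")[0].lower()
        if host != PREFERRED_HOST:
            # Permanent redirect, preserving the requested path.
            self.send_response(301)
            self.send_header("Location", f"http://{PREFERRED_HOST}{self.path}")
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.end_headers()
            self.wfile.write(b"You are on the preferred hostname.\n")


if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectToPreferredHost).serve_forever()
```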
Google has included information about this problem in a few places within its help pages, and Google Webmaster Tools offers a way to tell Google which version you prefer to be shown when your site is showing up both with and without a “www”, on the page Preferred domain (www or non-www).
At the bottom of that page, Google does provide the following helpful hint:
Note: Once you’ve set your preferred domain, you may want to use a 301 redirect to redirect traffic from your non-preferred domain, so that other search engines and visitors know which version you prefer.
The search engine also addresses this problem on a page titled Canonicalization.
Around a week ago, in the Google Webmaster Central blog post Raising awareness of cross-domain URL selections, Google also told us that it would be sending out messages to people with this problem, and related problems, when they made a difference in which pages might be selected to show in search results.
Of course, the best solution is not to rely on Google being able to recognize when the same site appears both with and without a “www”, but to fix the problem yourself with a 301 redirect. Sometimes Google gets it right, but I do see examples where it doesn’t.
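If you want to verify that a fix like that is actually in place, one quick check is to request the bare hostname and look at what comes back; it should be a 301 response with a Location header pointing at the “www” version (or vice versa, if that’s the version you prefer). Here is a rough sketch using Python’s standard http.client and a hypothetical domain, ignoring complications such as an intermediate redirect to https:

```python
import http.client


def check_non_www_redirect(domain: str) -> None:
    """Request the bare hostname and report whether it 301-redirects to the www version."""
    conn = http.client.HTTPConnection(domain, timeout=10)
    conn.request("HEAD", "/")
    response = conn.getresponse()
    location = response.getheader("Location", "") or ""
    conn.close()

    if response.status == 301 and location.startswith(
        (f"http://www.{domain}", f"https://www.{domain}")
    ):
        print(f"{domain}: OK, permanent redirect to {location}")
    else:
        print(f"{domain}: status {response.status}, Location: {location or '(none)'} "
              "- the redirect may be missing or not permanent")


check_non_www_redirect("example.com")  # hypothetical domain
```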
I no longer have the New York Times website as an example of a site handling this problem the wrong way, but I’m happy that they did fix the problem. Don’t make the same mistake, and don’t rely upon Google fixing it for you. This patent was filed more than six years ago, and Google still wasn’t doing well at fixing the problem for the New York Times. Fortunately for them, they fixed it for themselves…
I do agree with you that a lot of big companies get it so wrong. I used to work with a big brand a few years ago, and their website was so anti-SEO; the whole website was built in Flash.
Good post by the way.
What’s your guess on the percentage of webmasters who actually set their preferred domain in Google Webmaster Tools?
Hi Dan,
I would suspect that Google could produce that information for us with a couple of pushes of a button. Hopefully they do have those types of reporting tools at their fingertips. I only have access to a relatively small sample size in comparison, so I can only guess, and it’s likely that my guess would be off.
Ideally, people shouldn’t have to use that setting in Google’s Webmaster Tools if they take the time to set up 301 redirects so that it isn’t a problem. But sometimes that’s really hard to do, depending upon the ecommerce platform you might be using. Making that change only in Webmaster Tools also doesn’t help when it comes to other search engines such as Bing.
I would guess that it’s only a pretty small percentage that have set their preferred domain in Webmaster Tools though.
Hi Waiss,
Thank you.
I do try to take a look at some pretty big sites from time to time to see what kinds of things they might be doing positively and negatively from an SEO perspective, and I find myself surprised at some of the problems that I do see spring up that could be easily fixed.
A couple of those surprises included seeing a “revisit” meta tag on the site of one of the largest finance houses in the world, and an “invalid” security certificate on the non-www version of one of the largest banks in the world (which should also have been using a 301 redirect to point to the “www” version). The bank is such a household name that many people were likely visiting it by typing the bank’s name into their browser window and adding a “.com” to it. Chances are that many of the people who did that saw the invalid security certificate (tied to the “www” version of the name, but not the non-www version), left, and looked for another bank.
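Spotting that kind of certificate mismatch doesn’t require visiting the site in a browser; a small script can attempt a TLS handshake against each hostname and report whether the certificate actually validates for it. A rough sketch with Python’s standard ssl module, using hypothetical hostnames in place of the bank’s:

```python
import socket
import ssl


def cert_validates_for(hostname: str, port: int = 443) -> bool:
    """Attempt a TLS handshake and report whether the certificate presented
    by the server is valid for this exact hostname."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((hostname, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=hostname):
                return True
    except ssl.SSLCertVerificationError:
        return False


# A certificate issued only for the "www" hostname will fail for the bare domain.
for host in ("www.example.com", "example.com"):
    status = "valid" if cert_validates_for(host) else "NOT valid"
    print(f"Certificate for {host}: {status}")
```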
Hello Bill, thanks for finding the error in the New York Times website. I just checked for this PR difference, but they have fixed it. They might have read your post.
Thanks for helping them!
Hi Ashish,
As I noted in the post, the problem was one that the New York Times corrected on their own a couple of years ago by using a 301 redirect. But the problem did exist, and for a few years. I kept on expecting that they would figure it out on their own and fix it, but it took them much longer than it really should have.
We see this a lot, both with sites we work with and sites we build. With sites we build, this is a real concern because we are definitely only building one site to be hosted on one domain.
Organising a 301 redirect is an easy fix (in most but not all cases; some hosting companies make life difficult), but it would be better if the problem never existed in the first place. Why exactly does it happen? Is it associated with the hosting company’s protocols?
What if you add a rel=canonical pointing to the www version from the non-www version? Does that fix the issue?
I would love to know how much money and traffic that mistake cost the New York Times. It’s kinda cool that Google has come up with a way to save companies that do this from themselves. It seems like most companies are starting to realize the value of having a well-designed and well-run website, and are either hiring or outsourcing SEO tasks to make sure they get it done right.
It takes 5 seconds in cPanel to do a wildcard 301 redirect for a site, so it’s surprising when some companies don’t bother. What annoys me SO much is when you type in a site address leaving out the www (as, let’s face it, those three www chaps take so long to type) and you get an error. I usually abandon the site in a petulant, indignant huff…
Hi Bill! Have you ever written a book about everything you post here?
Hi Bill,
I understand what you are saying about rel=canonical. Actually, reading your comment put a question in my mind.
‘Some reasons from Matt Cutts for why Google doesn’t always trust the canonical tag:’
Most of us use rel=canonical to fix pagination issues, so does that mean that in some cases, even with rel=canonical in place, it may or may not fix the duplication issue?
Hi Andrew,
I’m not sure why this problem happens in the first place. I’ve asked a number of webhosts why they set sites up so that they can be accessed at both a version with and without a “www,” and the hosts all appear to be convinced that they are doing new site owners a favor by doing that.
Setting up a 301 redirect to one version or another isn’t that difficult to do, and in the instance where a host can’t or won’t do that for you, it’s probably a good time to think about getting a different host or ecommerce platform.
Ultimately, though, it’s the site owner’s choice whether or not to use a “www”, and it would be great if hosts asked new site owners, at the time a site is set up, which version they preferred and set it up that way for them.
Hi Saurav,
Trying to fix that issue with a canonical tag instead of a 301 redirect isn’t really the ideal approach. The search engines will definitely recognize that there should only be one version of the page if you use a 301, whereas they insist that a canonical tag is a “suggestion” that they may or may not follow.
Good advice here from Google’s Matt Cutts in the days before the canonical tag:
http://www.mattcutts.com/blog/seo-advice-url-canonicalization/
Some reasons from Matt Cutts for why Google doesn’t always trust the canonical tag:
http://www.mattcutts.com/blog/rel-canonical-html-head/
The official Google Help page on the canonical tag:
http://www.google.com/support/webmasters/bin/answer.py?answer=139394
As Google says on that last page:
If you use a 301, you are sending a much stronger signal that you prefer one version over the other, and your visitors will end up on the version that you’ve redirected to. It’s also much more likely that anyone who links to your pages will link to the version you’ve chosen via the 301 redirect. Don’t rely on just a canonical tag if you can help it when using a 301 is an available option.
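For anyone who wants to see which of those signals their own pages are sending, here is a rough sketch (standard library only, with hypothetical URLs) that requests a page, reports where any redirects end up, and reports what the rel=canonical tag on the final page says:

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class CanonicalFinder(HTMLParser):
    """Collect the href of any <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")


def audit_url(url: str) -> None:
    """Report the final URL after redirects and the page's canonical tag, if any."""
    with urlopen(url) as response:  # urlopen follows 301/302 redirects automatically
        final_url = response.geturl()
        body = response.read().decode("utf-8", errors="replace")
    finder = CanonicalFinder()
    finder.feed(body)
    print(f"Requested:     {url}")
    print(f"Ended up at:   {final_url}")
    print(f"rel=canonical: {finder.canonical or '(none found)'}")


# Hypothetical site: compare what the non-www and www versions report.
audit_url("http://example.com/")
audit_url("http://www.example.com/")
```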
Hi Saurav,
It appears that Google doesn’t want you to use the canonical tag when it comes to pagination, and has introduced rel=”next” and rel=”prev” for that purpose instead. See:
http://googlewebmastercentral.blogspot.com/2011/09/pagination-with-relnext-and-relprev.html
Paginated pages aren’t pages that contain duplicate content, but they will sometimes contain duplicated titles and duplicated meta descriptions, depending upon things like the content management system you might be using.
Hi Bill,
Yes, I agree there are some CMSs that create duplicate meta data, but also duplicate content, to be honest, unless I am talking about an entirely different thing here.
For example: link no longer available – a Volusion site. It has content on the page. If you go to page 2, the same content appears, and it has a rel=canonical tag pointing to the main page.
Hi Saurav,
No, I don’t think that you are talking about a different thing. It’s just that the combination of canonical and prev/next is a messy idea.
There are different sets of URLs associated with that main page. One is the set of URLs that can be generated from the different sorting methods available, and the other is the set of sequence pages you reach when you go to a second page under each sorting method.
When you go to the second page of the cuffsmart page you’ve linked to (link no longer available), it is the second in a sequence, showing the same boilerplate content but a new set of products. The rel=prev/next attributes mentioned by Google would fit there, indicating a sequence of content at different URLs, and the search engine would know that it should probably show people the first page in the sequence.
However, Google says to use a canonical tag for the same page sorted differently.
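As a rough sketch of how those two signals divide the work (the URL patterns here are hypothetical, and whether page one should carry a page parameter at all depends on your platform): each page in the sequence gets rel=prev/next tags pointing at its neighbours, while a sorted variant of a page gets a canonical tag pointing back at the default-sorted version.

```python
def pagination_link_tags(base_url: str, page: int, total_pages: int) -> list:
    """Build the rel="prev"/"next" link tags for one page in a paginated series."""
    tags = []
    if page > 1:
        prev_url = base_url if page == 2 else f"{base_url}?page={page - 1}"
        tags.append(f'<link rel="prev" href="{prev_url}">')
    if page < total_pages:
        tags.append(f'<link rel="next" href="{base_url}?page={page + 1}">')
    return tags


def canonical_for_sorted_view(default_url: str) -> str:
    """A sorted variant of a page points its canonical tag at the default-sorted URL."""
    return f'<link rel="canonical" href="{default_url}">'


# Hypothetical category page, currently on page 2 of 5:
for tag in pagination_link_tags("http://www.example.com/widgets", 2, 5):
    print(tag)

# Hypothetical "sorted by price" view of that same category page:
print(canonical_for_sorted_view("http://www.example.com/widgets"))
```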
Useful to know thanks. Think I’ll still be making sure all 301s are set up correctly for www/non-www versions though, just in case!
Hi Jeff,
I agree with your skepticism. I’m going to continue to make sure that 301s are set up correctly as well in the future, instead of relying upon any of the search engines getting it right.