How Search Engines Might Identify and Handle Soft 404s and Login-Required Pages

When people in the mideastern United States don’t catch something that someone says, they may say “excuse me” to ask the person they’re talking with to repeat it. If you’re having a conversation in the Southern United States and you say “excuse me” to get someone to repeat themselves, it might evoke a blank stare (I’ve seen it).

Non-verbal communication that doesn’t seem to match the message sent with words might also cause confusion and misunderstanding (been there, too).

Many websites are set up incorrectly: when a visitor or a search engine crawling program attempts to reach a URL that doesn’t exist on the site, it is redirected from that inaccessible URL to a dedicated error page showing a 404 (not found), 403 (forbidden), or 5xx (server error) message on screen, while the status in the header from the site’s server is a 200 (OK) message, which indicates that there isn’t a problem, even though there is. Some pages are only inaccessible temporarily, such as when a database is down. When a server error is shown for those, the status sent from the server shouldn’t be a 200 (OK) message either.
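That mismatch can be sketched as a simple heuristic: if the status code says OK but the body reads like an error page, the response is a soft-error candidate. This is a simplified illustration only; the phrase list and function name are my own, not from the patent or any search engine:

```python
# Heuristic check for a "soft" error: the HTTP status says OK,
# but the page body reads like an error page. The phrase list
# below is illustrative, not taken from the patent filing.
ERROR_PHRASES = (
    "404", "not found", "page not found",
    "forbidden", "server error", "login required",
)

def looks_like_soft_error(status_code: int, body: str) -> bool:
    """Return True when a 200 response carries error-page content."""
    if status_code != 200:
        return False  # a real error status was sent; nothing "soft" here
    text = body.lower()
    return any(phrase in text for phrase in ERROR_PHRASES)
```

For example, `looks_like_soft_error(200, "<h1>Page Not Found</h1>")` flags the response, while a proper 404 status passes through untouched.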

Sometimes visitors are redirected from inaccessible URLs to a site’s main homepage as well.

That kind of miscommunication creates confusion. It can mean that non-existent pages at accidentally mistyped or miswritten URLs, or pages that have been removed from a website, are added to or kept in a search engine’s index, even though those pages shouldn’t be included, and would likely be removed if the correct 404, 403, or 5xx error status were sent back to the search engine.

Other links found on the web may point to pages that aren’t accessible unless someone is logged in to a site; if they aren’t, a redirect may take them to a login page or to a page telling them that authorization is required to view the page. The pages on the other side of that redirect may also send 200 (OK) messages back to a search engine, which can’t log in. These links point to pages that also shouldn’t be included in a search engine’s index.

Because a search engine receives the 200 (ok) message, it may treat those pages as if they are actual live web pages.

When a visitor sees a page telling them there has been a 404 error, but the header sent from the server indicates a 200 (OK) response, such pages have been called “soft 404” pages.

A new patent application from Yahoo tells us that soft 404 error pages exist in large numbers on the Web:

According to one article, “Sic transit gloria telae: towards an understanding of the web’s decay” by Z. Bar-Yossef et al. (2004), it is estimated that soft 404s account for more than twenty-five percent of the dead links on the web. The Bar-Yossef article also proposes a method to detect whether a particular web page is a soft 404 page.

In an ideal World Wide Web, the right status would be sent in the server’s response, and this kind of miscommunication would be avoided. Site owners should check to make sure that this kind of misunderstanding doesn’t happen. But, as the quote above indicates, the soft 404 problem happens frequently. It is to the benefit of both site owners and search engines to avoid problems like that.

The patent application tries to identify soft 404 errors, redirects to login pages, and other similar problems by clustering together web pages from a site that share many similarities based upon “characteristics of the content of the web pages” in each of those clusters.

After pages are clustered together like that based upon their content, the process described in the patent filing looks at a metric involving the similarity between the URLs of the pages in each cluster. The similarities based upon content and URL structure can then be used to determine “similarity classes” for the URLs of pages on a site. For example, one such class might be a “soft 404 similarity class.”
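A rough sketch of that two-stage idea follows, assuming a greedy clustering and using Python’s standard-library difflib as a stand-in for the content-similarity measure (the patent points at shingling; the function names and thresholds here are illustrative assumptions, not the patent’s rules):

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

def content_similarity(a: str, b: str) -> float:
    """Similarity of two page bodies in [0, 1]. The patent points at
    shingling; difflib is a simple stdlib stand-in."""
    return SequenceMatcher(None, a, b).ratio()

def cluster_pages(pages, threshold=0.9):
    """Greedy clustering: a page joins the first cluster whose
    representative body it closely matches. `pages` is a list of
    (url, body) tuples."""
    clusters = []
    for url, body in pages:
        for cluster in clusters:
            if content_similarity(body, cluster[0][1]) >= threshold:
                cluster.append((url, body))
                break
        else:
            clusters.append([(url, body)])
    return clusters

def url_similarity(u1: str, u2: str) -> float:
    """Similarity of two URL paths, again via difflib."""
    return SequenceMatcher(None, urlparse(u1).path, urlparse(u2).path).ratio()

def likely_soft_404_cluster(cluster, url_threshold=0.5):
    """Flag a cluster whose bodies are near-identical but whose URLs
    are mutually dissimilar (an assumed rule, for illustration)."""
    if len(cluster) < 2:
        return False
    urls = [u for u, _ in cluster]
    pairs = [(a, b) for i, a in enumerate(urls) for b in urls[i + 1:]]
    avg = sum(url_similarity(a, b) for a, b in pairs) / len(pairs)
    return avg < url_threshold
```

The intuition behind the last function is that many dissimilar URLs all resolving to near-identical bodies is the signature of a site redirecting every bad request to a single error page.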

The patent application is:

Unsupervised Detection of Web Pages Corresponding to a Similarity Class
Invented by Mahesh Tiyyagura
Assigned to Yahoo
US Patent Application 20090157607
Published June 18, 2009
Filed December 12, 2007

In addition to a class for soft 404 error pages, other classes might also be determined, such as for pages that indicate:

  • Out of stock
  • Program exception
  • Permission denied, and
  • Login required

The crawling of web pages usually happens independently of the indexing of content on those pages. Before the pages are indexed, some analysis of the content and URLs found on a site may take place, including a process like the one described in this patent filing, which may determine similarity classes of the web pages.

Why a Search Engine Might Want to Identify Soft 404s

Some of the reasons why a search engine might want to determine if there are soft 404 pages on web sites can include:

1) A recognition that soft 404 pages and their URLs do not contain useful information, which means that a search engine wouldn’t need to index those pages.

2) Reducing (or decaying) a “freshness” value for pages linking to those soft 404 pages, a value those pages might have gained from a link-based ranking algorithm. In other words, pages with dead links may rank less highly in terms of “freshness.” If a search engine doesn’t recognize that one or more links on a page point to soft 404 pages, it might give that page a freshness boost it hasn’t earned; identifying soft 404s means that boost is withheld.

3) For pages on sites that might show advertising from search engines, where a soft 404 is shown or a requirement to login, or another similarity class that doesn’t provide useful information, the patent filing tells us that it is assumed that visitors are likely to want to navigate quickly away from such pages. We’re also told that more generic advertising might be shown on those pages, or ads that occupy more screen real estate than for other pages on a site.
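Item 2 above can be illustrated with a toy scoring function that discounts a page’s freshness contribution by the fraction of its outlinks found to be soft 404s. The formula and names are mine, purely for illustration; the patent filing doesn’t specify this calculation:

```python
def decayed_freshness(base_freshness: float, outlinks: int,
                      soft_404_links: int) -> float:
    """Discount a freshness score by the fraction of outlinks that
    point at soft 404 pages. Illustrative only, not the patent's math."""
    if outlinks == 0:
        return base_freshness  # nothing to decay against
    live_fraction = (outlinks - soft_404_links) / outlinks
    return base_freshness * live_fraction
```

A page with four outlinks, one of which is a soft 404, would keep three quarters of its freshness value under this toy rule.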

The patent filing provides some details on how pages might be clustered together based upon their content, and how URLs might be determined to be similar. The paper Syntactic Clustering of the Web is mentioned as an example of a clustering and shingling technique that could be used, as is the process described in the patent Method for Clustering Closely Resembling Data Objects.
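The shingling idea from Syntactic Clustering of the Web can be sketched in a few lines: break each document into overlapping word k-grams (“shingles”) and compare the resulting sets with a Jaccard coefficient. The real paper goes further, sampling shingles with min-hashing so the comparison scales; the parameter choice below is illustrative:

```python
def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word shingles from a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Resemblance of two shingle sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Two boilerplate error pages produce nearly identical shingle sets and a resemblance near 1.0, which is what lets a process like the one in the patent group them into one cluster.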

Conclusion

This patent application from Yahoo describes a process that might be used when a site isn’t set up to communicate a proper 404 (not found) server status: a visitor sees a 404 message on the page that they view, but their browser and search engine crawling programs receive a 200 (OK) message instead.

It’s recommended that site owners fix problems like soft 404s rather than relying upon processes like the ones described in this patent filing. It’s to the benefit of search engines and site owners to recognize when miscommunications like soft 404s happen, but it’s even better if the wrong messages aren’t sent in the first place.

16 thoughts on “How Search Engines Might Identify and Handle Soft 404s and Login-Required Pages”

  1. A very interesting filing. It is not surprising that Yahoo would look into this. In the past month I have seen two “search engine optimisation professionals” who have gone a step beyond not using canonical URLs and not bothered to add both variants of their domain as virtual hosts on their web server.

    There are a lot of people championing valid HTML, OOP and modern web design, while the not-so-glamorous world of sound URL architecture has been left on the slag heap. It is good to see that some of the new open source platforms are taking this into account though, with packages such as Magento and SymphonyCMS.

  2. Hi David,

    Very good points – thank you.

    The search engines have been trying to provide tools to site owners, so that they can help themselves. Over the past few years, we’ve seen tools for webmasters come along from Google, Yahoo, and Microsoft/Bing, where those site owners can create verified accounts that let them see some information that the search engines know about their sites, including robots.txt validation, some crawling and content errors, links pointing to their sites, and others. The canonical URLs tag and XML sitemaps were surprising introductions, adopted by the major search engines quickly.

    One of my concerns about things like the canonical link value and XML sitemaps is that site owners may be less inclined to fix their canonical and site structure problems, and use these methods instead. Much like many sites that were trying to use the nofollow link value to try to control the flow of PageRank through their pages instead of developing an intelligent site structure for their pages.

    I wasn’t surprised to see something like this patent application involving soft 404s from one of the search engines. Unfortunately, there are many sites that do send mixed signals – showing error pages or redirecting all errors to their home page, and sending a 200 (OK) message instead of the proper error message. While it’s a good sign that Yahoo published an approach like this, I’m hesitant in relying upon a search engine to get a problem like this right, especially when it’s often in the power of a site owner to fix a problem like this themselves.

    It is good that there are some open source packages becoming available that do address intelligently designed site and link structures. Hopefully that’s a trend that will grow.

  3. It is hard to know what SEs can do to educate people on these basics. Most people and web agencies don’t submit sites to Webmaster Tools, but pretty much everyone uses Google Analytics. I thought that it would be a good idea for Google to do something like the crawl stats for analytics; linked to documents explaining the problems certain architectural issues can cause them.

    Out of interest, what are the issues with 301ing all non-existent URLs to the homepage? I quite often do this on domains that have links pointing to removed URLs. I don’t actually have any 404s being fired, as all HTTP requests are fed to PHP rather than Apache.

  4. Hi David,

    One of the difficulties that SEs may face in providing information about architectural problems is that there is such a wide range of content management systems and home-made solutions available to people, and each may present its own unique problems.

    There are a number of issues in using a 301 redirect for all non-existent URLs to the homepage – some of them deal with the indexing of your content, and some involve providing a good experience for visitors of your pages:

    1) Redirecting someone to your home page instead of sending a 404 error message when you’ve removed a page from your site means that a search engine doesn’t know to remove the old URL from its index, and instead may see duplicated content at each of those removed URLs. It’s still a soft error of the type that search engines are trying to avoid, with processes like the one described in the patent filing above. Visitors are also likely to be confused about why they are on your homepage instead of the page that they thought they were going to be visiting.

    2) Redirecting all non-existent URLs to your home page means that links pointing to your pages from other sites that have errors in the URLs go to the home page instead of possibly a better page. If you can identify such bad links through something like Google’s webmaster tools, and there’s the potential to get some significant traffic and some possible PageRank from those, it might be better to set up a permanent 301 redirect to the intended page, or to create a new page with the bad address that benefits both visitors and your site.

    3) A well-done custom error page will send the right 404 error message, and it can give visitors an idea of what they can find on your site and provide strong navigation to travel there. A custom error page can include a friendly and positive message, a link to a sitemap (or an abbreviated sitemap), a link to a site search or a search form, and a way to contact the site owner in case visitors were looking for something specific. This can be a much better experience for a visitor because it avoids confusion about why they didn’t arrive at the page that they thought they were going to be visiting.
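    The advice in item 3 comes down to one technical point: serve the helpful custom page, but with a genuine 404 status line. A minimal sketch as a Python WSGI application (framework-agnostic; the page content and link targets are placeholders):

```python
# Placeholder error page; the links and wording are hypothetical.
FRIENDLY_404 = b"""<html><body>
<h1>Sorry, we couldn't find that page.</h1>
<p><a href="/sitemap">Browse the sitemap</a> or
<a href="/search">search the site</a>.</p>
</body></html>"""

def not_found_app(environ, start_response):
    """Serve a helpful custom error page, but with a real
    404 status line so crawlers are not misled by a 200."""
    start_response("404 Not Found", [("Content-Type", "text/html")])
    return [FRIENDLY_404]
```

    Any server or framework can do the equivalent; the point is simply that the friendly page and the correct status code are not mutually exclusive.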

  5. Search engines really don’t need to be involved in managing 404 codes but it’s not a big deal to comply with their expectations. In fact, I would say in retrospect that it probably works better for my own visitors (who come in through hundreds of mistyped links every month) than to simply throw up a sitemap or main index page with a 200 code.

    Google and Yahoo! therefore get a thumbs up from me for requiring that I do 404 codes right.

  6. Hi Symbyo Technologies

    I look at this patent application as somewhat of a warning shot fired across the bow. Yahoo’s saying that if site owners can’t properly manage the way that they handle soft errors, the search engine may try to do it on its own. I’d rather try to make sure that I’ve done it right than leave it to them, especially since they have billions of pages to try to include in their index…

  7. Hi jlbraaten,

    I was wondering that, too. I’d imagine that there might be some gains and some losses in terms of resources used. They would have to take some extra steps in their analysis of links before indexing, but they may end up with fewer indexed pages to sort through and present to searchers. If the 2004 statistic on the percentage of broken links on the Web is still around 25%, that could possibly mean some significant differences.

  8. Hello,

    I am an absolute beginner in the SEO space, and I’m wondering whether it would hurt my page that at the moment there is no 404 at all. Would it be a good choice to create a 404 page that just sends a 301 back to the homepage itself?

    Geirr

  9. Hi Geirr,

    It wouldn’t be a good choice to use a 301 to point all requests for missing pages to the homepage of a site. See my comment here for a few of the reasons why it’s better to have a custom 404 error page.

  10. Thanks for the nice tips. I would like to ask you something: I have a website that we redesigned 2-3 months back, and in the process we deleted 40-50 pages. We have a custom 404 error page for our website, but our ranking in Google on some keywords is falling dramatically. If I put those pages up again, would it make any change in my keyword rankings? Please suggest something.

  11. Hi Rohit,

    I would definitely need much more information about your redesign before I could even begin to answer your question. I don’t know how search engine friendly your old design was, or how your new design might have addressed any SEO issues that the old design may have missed out on. I can’t tell you that just re-adding pages that you may have deleted from the old site will fix falling rankings, but there are a number of steps that a site owner can take when redesigning that could end up resulting in stronger rankings rather than diminishing ones.

    In transitioning from one design to another, there are many issues that need to be addressed, especially if you might be changing domains, changing URLs for pages, removing old pages, rewriting content, and so on. Paying attention to every small detail, and every potential problem that could go wrong, is essential in making sure that you don’t make things worse instead of better.

    One insignificant-appearing mistake in the wrong place could have devastating effects. For instance, many developers will put up a redesign on a development server, and use robots.txt to block the indexing of all pages of the redesigned site. In moving the redesign out to a production server, where it replaces the old design, if the robots.txt file isn’t changed to allow search engine crawlers to start crawling the new pages (and properly blocking anything that shouldn’t be crawled), a site could disappear from search engines. It does happen.

    Things that need to be done include intelligently redirecting pages at old URLs to the appropriate new ones, avoiding duplicate content issues that might be caused by having the same content available at multiple URLs, creating a robots.txt file that avoids spider traps, planning the keywords and phrases to optimize pages for and checking on how well the old pages are optimized, and having a strong plan in place to address any other SEO issues that might arise in the transition.

    If your dropping rankings for keywords were for keywords that appear on the pages that were deleted, then re-adding those pages might be something to consider, but I don’t know enough about your old site and the transition to the new design to tell you anything with certainty.

  12. Thanks for these useful tips! If you’re going to change a lot of your URLs, for example 20% or more, you can of course expect a drop in the SERPs. For those particular words, you should write some more fresh content and make new content pages on your website! That worked for me after changing URLs.

  13. Hi Marco,

    I agree – changing URLs for pages is a risk. I don’t think it hurts to get some new links to your pages when you do that either. If you can attract those with your new content, that would be ideal.

Comments are closed.