As a webmaster, when you put a page up on the web, there may be parts of that page that you may not want to have indexed by a search engine.
Many web pages contain information that isn’t unique to each page, such as the navigation for a site, copyright notices, advertising, links to other sites such as blog rolls, and other sections that may not contain information about the main topic of the page itself.
Yahoo’s Robots-Nocontent Class
In May of 2007, the Yahoo Search Blog published a post titled Introducing Robots-Nocontent for Page Sections, explaining how webmasters could let the search engine know that content in certain sections of pages shouldn’t be returned in search results to searchers.
The Yahoo Search Help pages provided details on how to assign a class of “robots-nocontent” to HTML elements, so that the content inside those elements isn’t recalled by the search engine in response to a search, in “How do I mark web page content that is extraneous to the main unique content on the page?” (No longer available)
Yahoo Patent Filing on No-Recall Sections
A Yahoo patent application published last week looks more deeply into how the search engine would follow directions not to recall sections of pages marked with a “robots-nocontent” class.
It also describes how the search engine might decide on its own that some sections of web pages shouldn’t be returned to searchers, regardless of whether we use “robots-nocontent” or not, after it breaks a page down into sections and analyzes the content of those sections.
The Yahoo filing provides a way for it to rate different sections of a page against a main topic for a page, and designate some sections as no recall sections, that won’t be returned in a search for the content that they contain.
Method for improving quality of search results by avoiding indexing sections of pages
Invented by Priyank S. Garg, Amit J. Basu, Timothy M. Converse
US Patent Application 20080168053
Published July 10, 2008
Filed January 10, 2007
Abstract
A method and apparatus for improving search results is provided. The method works by delineating sections of a document that are not relevant to the main content. The document content is subjected to ranking analysis in entirety. In response to a query results are recalled omitting terms included in the no-recall sections.
Terms in the no-recall sections are not used in titles and abstracts of the results. The results are ordered at least in part by the rankings attributed to the identified no-recall sections.
An Overview of the No-Recall Sectioning Process
Some of the methods described in the patent filing:
1) When a search crawling program visits a web page, it might pay attention to the structure of the page, breaking it down into sections.
2) The crawling program may identify sections to ignore, so that their content is neither indexed by the search engine nor presented (recalled) to searchers.
3) Sections to be ignored may be referred to as “no-recall” sections, and sections of pages that are indexed may be referred to as “recall” sections.
4) The search engine crawling program may ignore sections of pages that have been marked by webmasters who have used a “robots-nocontent” class in an HTML tag around that section, such as a “div” or a “span” or other types of HTML elements that have opening and closing tags, such as paragraphs and other sections.
5) The search engine crawling program may also ignore sections of pages that may have been identified by analyzing section content rather than “robots-nocontent” classes.
6) Terms inside those no-recall sections do not contribute to the document term frequency counts in the search engine index, so words in those no-recall sections aren’t considered when determining which words a page may be relevant for in ranking a page for search queries.
The content in those no-recall sections is also not used for recalling the pages in response to search engine queries.
7) While the information in the no-recall sections is ignored in search results, it is included as input to the analysis of pages that can affect such things as a page’s ranking.
8) Links within no-recall sections may be followed by the search crawling program to discover new content.
9) A page may also be analyzed for the amount of advertisements or other features that it contains, even though those may have been placed in no-recall sections by a webmaster using a “robots-nocontent” class.
10) The reason why Yahoo might explore what is contained in these no-recall sections is to keep people from including search engine spam in those sections. For example, a page that contains a very large amount of advertisements or low quality links, even within no-recall sections, might be identified and “ranked accordingly.”
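To make the split between "used for recall" and "used for analysis" a little more concrete, here is a minimal Python sketch of steps 6 through 10. It is only an illustration under my own assumptions, not the method from the patent filing: the section dictionaries, the index_page function, and the idea of an "ad ratio" are all hypothetical names made up for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a string and split it into simple alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_page(sections):
    """Process a page that has already been broken into sections.

    Each section is a dict with hypothetical keys:
    'text', 'links', 'is_ad', and 'no_recall'.
    """
    term_counts = Counter()
    links_to_crawl = []
    ad_sections = 0

    for section in sections:
        # Links are followed, and ads are counted, in every section,
        # including the no-recall ones (steps 8 and 9).
        links_to_crawl.extend(section["links"])
        if section["is_ad"]:
            ad_sections += 1

        # Only recall sections contribute to term frequency counts (step 6).
        if not section["no_recall"]:
            term_counts.update(tokenize(section["text"]))

    # A page-level signal that still "sees" the no-recall sections (step 10).
    ad_ratio = ad_sections / len(sections) if sections else 0.0
    return {"terms": term_counts, "links": links_to_crawl, "ad_ratio": ad_ratio}


page = [
    {"text": "Hiking trails near Denver", "links": [],
     "is_ad": False, "no_recall": False},
    {"text": "Buy shoes today", "links": ["http://example.com/shoes"],
     "is_ad": True, "no_recall": True},
]
result = index_page(page)
```

In this toy page, the word "shoes" never reaches the term counts, but the ad section's link is still queued for crawling and the ad itself still counts toward the page-level ad ratio.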
Example of the Use of a No-Recall Tag
A webmaster uses a <div class="robots-nocontent"> tag around a page’s copyright notice, navigation pane section, links to related blogs, and an ad section.
Inside the ad section, the term “shoes” appears, and it doesn’t show up anywhere else on the page. The page will not be recalled by the search engine on a search query for the word “shoes”.
If the word “shoes” is included in other portions of that page, then the page will be recalled for the query.
Content within an HTML element marked with class="robots-nocontent" is not indexed for recall. When the page is recalled by the search engine for a query, though, that element is still considered in spam and relevancy analysis, along with attributes from all of the other sections of the page, such as links, frequency of terms, coloring, and fonts.
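The "shoes" example can be sketched in a few lines of Python using the standard library's html.parser module. This is a rough illustration only: the SectionSplitter class and would_recall function are names I made up, the sample page is invented, and the sketch assumes well-formed markup where every non-void element has a closing tag. A real crawler would be far more forgiving.

```python
from html.parser import HTMLParser
import re

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class SectionSplitter(HTMLParser):
    """Separate page text into recall and no-recall buckets."""

    def __init__(self):
        super().__init__()
        # One flag per open element: True if it, or any ancestor,
        # carries the robots-nocontent class.
        self.stack = []
        self.recall_text = []
        self.no_recall_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        inherited = self.stack[-1] if self.stack else False
        self.stack.append(inherited or "robots-nocontent" in classes)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        in_no_recall = bool(self.stack) and self.stack[-1]
        target = self.no_recall_text if in_no_recall else self.recall_text
        target.append(data)


def would_recall(html, term):
    """Return True if the term appears outside every no-recall section."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9]+", " ".join(parser.recall_text).lower())
    return term.lower() in words


PAGE = """
<div id="main"><p>Trail maps for weekend hikers.</p></div>
<div class="robots-nocontent"><p>Ad: discount shoes</p></div>
"""
```

With this page, would_recall(PAGE, "shoes") is False because the word only appears inside the marked div, while would_recall(PAGE, "hikers") is True. If "shoes" were added to the main section, the page would be recalled for it.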
When Yahoo Determines No-Recall Sections Itself
Webmasters can mark sections of pages so that content in those sections isn’t used to return the page for searches. The search engine may also decide to designate some sections of pages as no-recall sections on its own.
Here are the steps involved in this process:
1) The search engine parses through the HTML code of a page to determine various logical sections.
2) The content within each section is analyzed, by creating an abstract document model using a number of possible approaches. Attributes are looked at in that analysis, such as “the number, frequency and order of appearance of terms, fonts, and colors.” In addition, outgoing links within different sections are also analyzed, reviewing such things as “where the links lead, the link text and link quantity and quality.”
3) The sections of a page are rated based upon how relevant they may be to the main topic of the page, using a number of possible methods.
4) Some other methods may be used to identify a no-recall section, such as frequency of change of the contents of a section, compared to the contents of the rest of the page. For instance, a section for ads may change on every visit to a page, while the rest of the sections don’t change at all. Some sections may be the same on every page of a site, such as the copyright and title sections.
5) Sections with ratings that indicate they are no-recall sections are designated as such by the search engine, alongside the sections marked with a “robots-nocontent” class by webmasters.
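The patent filing leaves the rating method open ("a number of possible methods"), so the following Python sketch is purely my own stand-in for steps 3 and 5. It assumes the main topic can be approximated by a short piece of text such as the page title, rates each section by cosine similarity of term counts, and uses an arbitrary threshold. The function names, the "webmaster_marked" key, and the 0.1 threshold are all hypothetical, and the change-frequency signal from step 4 is not modeled.

```python
from collections import Counter
import math
import re


def term_vector(text):
    """Count the simple alphanumeric terms in a string."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def designate_no_recall(topic_text, sections, threshold=0.1):
    """Return the names of sections treated as no-recall.

    A section is flagged if the webmaster marked it, or if its
    similarity to the assumed main topic falls below the threshold.
    """
    topic = term_vector(topic_text)
    flagged = []
    for name, info in sections.items():
        score = cosine(topic, term_vector(info["text"]))
        if info.get("webmaster_marked") or score < threshold:
            flagged.append(name)
    return flagged


sections = {
    "article": {"text": "Choosing hiking boots for rocky mountain trails"},
    "footer": {"text": "Copyright 2008 Example Inc. All rights reserved"},
    "sidebar": {"text": "Hiking gear links", "webmaster_marked": True},
}
flagged = designate_no_recall("Hiking boots for mountain trails", sections)
```

Here the footer is flagged because it shares no terms with the assumed topic, and the sidebar is flagged because the webmaster marked it, even though its text is on topic. The article section keeps its recall status.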
Conclusion
I’ve written previously about how Yahoo might break down a page into sections, and attempt to find the “most important section” of that page in The Importance of Page Layout in SEO. This newer patent application shows us how a search engine might take an analysis like that, and use it to ignore some of the sections of pages.
When you put your pages together, keep in mind that a search engine might only be indexing some parts of your pages, regardless of whether you use something like a “robots-nocontent” class around some of those sections or not.
As always this is an excellent well written post. It opened my eyes to a lot of new ideas.
Thanks!
The Yahoo search engine is also supposed to respect the NoFollow attribute for links on web sites, but they haven’t in the past. There are numerous sites that use the NoFollow attribute for links, but Yahoo has counted the links anyway, including ( in the past ) links placed on Yahoo’s own site in Yahoo Answers – which also uses the NoFollow attribute.
Hi People Finder,
I’m not sure what to make of Yahoo and the nofollow attribute.
When it was first announced by the search engines in early 2005, a Yahoo search blog post told us that they would be honoring nofollow – A Defense Against Comment Spam.
The author of that post, Jeremy Zawodny, wrote a post a year later on his personal blog, titled Nofollow no Good? He writes that people changed the way that they linked as if they were rationing out some commodity that was limited in supply. He wrote:
A recent interview with Priyank Garg, who is one of the inventors listed on this patent application, provides some more information about how Yahoo treats anchor text, and links appearing in different sections of pages, as well as the use of a robots-nocontent tag. See: Director Of Yahoo! Search Talks Role Of Links In Algorithm!.
He makes some interesting statements too, about what he calls boilerplate pages, and how Yahoo detects which sections of pages are less important sections algorithmically, and how links might be valued differently in each of those sections.
Hi Chris.
Thanks. It’s always good to hear that a post inspired some new ideas. :)
I actually believe that no-follow is a big farce. Only people who actively use it are SEOs.
Most others don’t even know or care about it.
I still believe that internet was made to enable open and free communication (also, link to and be linked to) and search engines need to figure it better on their own.
In either case, no-follow or follow, search engines will continually get spammed, so they have to deal with one set of problem or other.
Rajat
Great article. I think the problem here is two-fold:
– There are valid reasons to exclude portions of a page from being indexed.
– There are people who might try to game this very mechanism for fun and profit.
OTOH, the latter part of the web community will find other ways to game the search engines anyway, won’t they?
Regarding valid reasons, it is a PITA when SERPs show completely unrelated hits containing the searched expression in some remote part of the page, be it either for navigation or a “most read” section. Given that, I’d rather have a way to exclude portions of a website from being indexed.
We have to agree with Rajat…most web owners have no idea what a no follow code is. We have heard of search engines not honoring the code so you never know with them…
Hi Rajat,
While SEOs may pay more attention to nofollow, a number of services include nofollow in links by default, such as wordpress and blogger. So it’s likely that nofollow is being used by many people who have no idea that it even exists.
Hi Erik,
Thank you.
Having a way to not have a search engine find a page for certain sections of that page makes sense to me, too. From the way that it is described, it doesn’t seem that someone can use a “robots-nocontent” section to hide web spam on a page. Yahoo applies the same spam analysis regardless of whether or not sections are within one of those “robots-nocontent” sections.
Hi Search Engine Optimization Journal,
As I responded to Rajat, while many webmasters may not know what the attribute value “nofollow” is, that doesn’t mean that it doesn’t possibly affect them.
Why is everything about Yahoo? Why not Google? Perhaps we can control bot access to certain content via Sitemaps, but I’d expect a similar article on how to do this for Google. There is a lot of difference between the two: Yahoo even counts a “no follow” flagged link as a link, but Google does not.
This is really great reading. I never knew there was such a thing as this. I thought it was always the meta tags that search engines were focusing on.
Rajat, I agree with you to a certain degree. I’m not too sure that no-follow is a farce, although I agree that it’s only SEOs that care about it. I know a few designers who don’t do SEO that haven’t heard of no-follow. I think it keeps us search engine optimisers busy …
great post, keep them coming =)
If this was google I would say they are going to use this to ignore sections of the page that they deem to be advertising. but since this is yahoo I don’t know what to think…
Hi Matt,
I’ve written more than a few posts about Google’s and Microsoft’s search patent filings here, about how both might segment pages into different sections. A couple of good starting points for the Microsoft research are: Object-Level Ranking: Bringing Order to Web Objects (pdf) and Object-level Vertical Search (pdf). A post I wrote on how Google might identify and ignore boilerplate might also be something you find interesting.
Hi Jun,
Thank you. Glad to hear that you found this post interesting. Search engines have been looking at a lot more signals than just meta data the past few years. Meta data has a lot less value than it may have had in the early days of search engines.
Hi Kevin,
Thanks. It would be great if more designers had the chance to follow some of the things that search engines are trying, though I suspect that it can sometimes become overwhelming with all of the things that search engines have been introducing.
Hi Oral,
They do mention in the patent application that one of the types of things that might be ignored are advertising sections. It’s hard to tell if they’ve done everything that’s mentioned in the patent filing, but it’s a possibility.
This is valuable, I also like your article on page layout. This noindex idea is good (well, good for my needs) for contact pages, privacy policy–pages that you don’t need, and aren’t going to get, search engine traffic for, but folks can find them once they are already at the site. I am using ideas like this to “channel” page rank to search-friendly landing pages, rather than “chaff pages”.
I personally think that the no follow attribute is useful in SEO, but is not the most important of attributes when it comes to indexing. As we know, spiders do not always spider the entire site for SERPs, and our SERPs listings usually carry the pages with the largest ‘popularity and content’ weight.
Although, I do agree that if there are some pages you do not want to have indexed, and they show up in the SERPs, then it is best to use a robots.txt. But if not, I don’t think it will be very important for the webmaster, and they should focus on content and building web popularity.
Hi David,
All very good points. I do think that focusing upon building strong content and upon building web popularity are two of the most important focuses that a webmaster should address.
Building a solid technical foundation for a site, where search engine spiders don’t run into problems crawling and indexing pages is a necessity if site owners want their pages and content found at all.
One of the things that I think we can take away from this patent filing is that search engines may purposefully decide not to index some parts of pages, or provide less weight to those parts.
If nothing else, that tells us that if you are going to put something on a page, and you think that it is important to the topic covered by that page, it’s worth considering including that information in the main content area rather than in some place like the footer of the page.
Hi Michael,
Thanks very much. I do think that contact pages and directions pages and other pages that might be considered “chaff pages” can provide some helpful and useful information to visitors of sites, and may be worth getting indexed. I’d rather spend the energy making those pages as valuable as possible than spending too much time figuring out how to divert PageRank to other parts of a site.
I was just going through your blog, you really have an informative and nice blog. I think your style is better than some of the so called professionals – all credit to you! These days there are so many blogs with content that is duplicated. I find relief while visiting your blog with its unique content and great topics, and it makes for good reading.
Good luck with your blog and keep up the good work.
Hi khemraj,
Thank you very much for your kind words, and well wishes. They are very much appreciated.
It makes it more interesting for me to research and write about things that aren’t being written about in other blogs, too.
I stumbled across this blog and boy have I found a gem. The information you are providing is very comprehensive and second to none. Definitely one for the favourites.
Thanks :)
Hi Justin,
Thanks for your kind words. :)
Although the discussion of nofollow is interesting, I came to this page looking for tips on how to properly implement “noindex” for sections of a page that you would prefer to be excluded from search engine results. Any thoughts?
Thanks,
Dan
Hi Daniel,
If you want to apply noindex to a page using a meta tag, there are a couple of ways to do that, by placing one of the following two meta tags in the head section of a page:
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="noindex, follow">
If you’re asking about how to apply a “noindex” to only parts of pages, there is no way presently to do that for web searches at the major commercial search engines.
It’s funny, this is one of those issues most people overlook. You’d think that the value of SEO would be enough that people would invest in SEO sitemaps before they design the same old boring pages that every site has. We, as an example, replaced “about_us” with “seo_ethics” since the page about our company, is about our approach to SEO Ethics, and “seo ethics” is a popular search term. I’m not expecting to rank #1 for “SEO Ethics” because of this, but every page we label properly, is one more page the search engines will not ignore.
Thanks for sharing this, this is a great reference point for rel no-recall.
Cheers,
Matthew
Hi Matthew,
Good points. I’m not sure that we will ever see this implemented now that Yahoo has let Microsoft take over their search database for them. But, I agree completely that it pays to be creative, and try to provide information in a way that might be a little different than what everyone else is doing.
I do think an “about us” page can be something people do with creativity and in a way that can make it a page that not only ranks well for a business, but also can influence visitors to become customers.
I had issues between different search engines. Basically I was targeting a certain group of keywords and Yahoo was indexing my pages and had me at the top for most of the keywords even though the site doesn’t have good metrics right now.
The fault I found was with Google. Even though I had submitted site map after sitemap, the pages were being indexed but did not rank anywhere for the terms I was targeting.
Bing also had me ranking highly but not as high as Google. I soon learned that all the search engines are completely different in the way that they rank pages. I am now in the top 50 of Google with my site and climbing.
As my site is news orientated if I ‘break’ a story then the pages rank high and I get lots of traffic, but if I miss out then I don’t rank and my overall rank goes down too…
Any ideas?
Hi Philip,
An XML sitemap can help search engines find the URLs of your pages, but it may not help much in having those pages rank well. For that to happen, it can help if the pages are both relevant to specific queries and have some text-based links pointing to them so that they can accrue some link equity/PageRank.
Google and Bing definitely do have different ranking algorithms in place. Many of the things that you do to rank well in one of those will help with the other as well, but not everything.
As for ranking well with news articles, there are somewhat different algorithms in place as well, that can value things like whether or not a story is topical – suddenly drawing the interest of lots of people as a current event, and backed with a sudden increase in interest and searches. It is possible to rank well with a news article temporarily for a query term that might be very competitive if you have fresh and timely information and you’re identified as the one who “broke” the story.
Sorry I may be a bit dumb, but I fail to understand exactly why you might want search engines to ignore certain sections of your site. Do you fear the duplicate content issue, or what?
As an aside I’m very impressed with the structure of your site. You’re fully DoFollow, and you have a very generous recent comments section. Very, Very nice. I’d love to move in :)
Hi Jo,
I don’t necessarily “want” search engines to ignore sections of my site. They may decide to pay more attention to some parts of a page, and less attention to other parts of a page.
For example, a search engine may make the assumption that stuff found within the footer of a page is boilerplate information, like a copyright notice, or that things in a sidebar might be more likely to be advertisements than the important content to be found on a page. When a search engine indexes the content of a page, it might identify some content as being for a main section of a page, and decide the content within that section should carry more weight than content found in the footer or sidebar.
Yahoo did experiment with this, and made it possible for people to mark content that they didn’t want indexed. By giving people this option, they made it possible for a search engine to focus upon the remaining content more, and to ignore the marked content. Yahoo’s program really didn’t get much attention, and it’s possible that most people didn’t like the idea.
But it’s also possible that the search engines are doing something like this anyway.