Yahoo Web Page Segmentation: Distinguishing Noise from Information

In a recent interview conducted by Eric Enge, Priyank Garg, Director of Product Management for Yahoo! Search Technology, explained that Yahoo breaks pages down into template sections to distinguish noisy, or boilerplate, content from unique content:

One of the things Yahoo! has done is look for template structures inside sites so that we can recognize the boiler plate pages and understand what they are doing. And as you can expect, a boiler plate page like a contact us or an about us is not going to be getting a lot of anchor text from the Web and outside of your site. So there is natural targeting of links to your useful content.

We are also performing detection of templates within your site and the feeling is that that information can help us better recognize valuable links to users. We do that algorithmically, but one of the things we did last year around this time is we launched the robots-NoContent tag, which is a tool that webmasters can use to identify parts of their site that are actually not unique content for that page or that may not be relevant for the indexing of the page.

If you have ads on a page, or if you have navigation that’s common to the whole site, you could take more control over our efforts to recognize templates by marking those sections with the robots-NoContent tag. That will be a clear indicator to us that as the webmaster who knows this content, you are telling us this part of the page is not the unique main content of this page and don’t recall this page for those terms.
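To make the webmaster side of that concrete, here is a minimal sketch of how a crawler might drop marked sections before indexing. It assumes the tag is applied as a class value, class="robots-nocontent", on the element to be ignored. The class and function names below (MainContentExtractor, indexable_text) are my own illustrations, and nothing here is taken from Yahoo's actual code.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainContentExtractor(HTMLParser):
    """Collects text, skipping anything inside a robots-nocontent element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a marked element
        self.chunks = []      # text kept for indexing

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start skipping at a marked element, and track nesting while skipping.
        if self.skip_depth or "robots-nocontent" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def indexable_text(html):
    """Return the text a hypothetical indexer would keep for this page."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    '<div class="nav robots-nocontent"><a href="/">Home</a><br>'
    '<a href="/about">About us</a></div>'
    '<div id="main"><p>Unique article text.</p></div>'
    '<div class="robots-nocontent">Copyright notice</div>'
)
print(indexable_text(page))
```

The sketch assumes reasonably well-formed HTML, since it relies on matching start and end tags to know when a marked section ends.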

I don’t completely agree with the idea that an “about us” page can’t be engaging and informative, or that people won’t link to one. Done right, with things like timelines, narratives, and insights into how an organization has developed, an “about us” page could potentially be one of the most interesting pages on a site.

But the general idea, that sites may contain content that isn’t compelling and informative, is true. For instance, copyright notices, advertisements, or navigational links are probably not what a person visiting a page wants to focus upon most.

The idea of breaking a page down into parts, or “segmenting” that page, is something that we’ve seen before from Microsoft and Google.

A Yahoo paper that I wrote about in my post Yahoo Research Looks at Templates and Search Engine Indexing explored how Yahoo might look at the features found on a page to see if that page was using a template, and to distinguish the “main content” on that page from template-type features, such as “site navigation links, sidebars, copyright notices, and timestamps.”

Other features that might be considered “noise” are ecommerce features such as “people who bought XXXXX also bought YYYYYY” sections of pages, and similar content that doesn’t actually focus upon the main content of a page.

I also wrote about another Yahoo patent filing that discussed automatic segmentation of web pages in my post Breaking Pages Apart: What Automatic Segmentation of Webpages Might Mean to Design and SEO.

Two patent applications from Yahoo, published earlier this month, take the idea of segmenting content on a web page further, to give us more information about ways that Yahoo might use web page segmentation.

The first one discusses a number of topics, including how multiple pages on a site might be compared, to see if some segments tend to show up on multiple pages, which could be an indication that those segments are boilerplate – content that may not be the main informational focus of each individual page.

Site-Specific Information-Type Detection Methods and Systems
Invented by Rupesh R. Mehta and Amit Madaan
Assigned to Yahoo
US Patent Application 20090248707
Published October 1, 2009
Filed: March 25, 2008

Abstract

Methods and systems are provided herein that may allow for pertinent information-type(s) of data to be located or otherwise identified within one or more documents, such as, for example, web page documents associated with one or more websites. For example, exemplary methods and systems are provided that may be used to determine if information may be more likely to be of an “informative” type of information or possibly more likely to be of a “noise” type of information.
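The cross-page comparison described for this filing can be illustrated with a very small sketch. It assumes each page has already been split into text segments, and it treats a segment as likely boilerplate when it appears on at least a chosen share of a site's pages. The function name, the 0.8 threshold, and the sample data are all hypothetical; the patent application itself may use quite different features and scoring.

```python
from collections import Counter


def find_boilerplate(pages, min_share=0.8):
    """Return segments that appear on at least min_share of the pages.

    pages maps a URL to the list of text segments found on that page.
    """
    counts = Counter()
    for segments in pages.values():
        counts.update(set(segments))  # count each segment once per page
    threshold = min_share * len(pages)
    return {segment for segment, n in counts.items() if n >= threshold}


site = {
    "/widgets": ["Home | Products | Contact", "All about widgets", "Copyright Example Co."],
    "/gadgets": ["Home | Products | Contact", "All about gadgets", "Copyright Example Co."],
    "/history": ["Home | Products | Contact", "How we got started", "Copyright Example Co."],
}

noise = find_boilerplate(site)
for url, segments in site.items():
    informative = [s for s in segments if s not in noise]
    print(url, informative)
```

Exact string matching is the simplest possible comparison. A real system would probably need to tolerate small differences between pages, such as a highlighted "current page" link in otherwise identical navigation.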

The second patent application looks for connections between content nodes (somewhat like Microsoft’s “blocks”) to see if those sections of a page should be contained in the same segments.

Method for Segmenting Webpages
Invented by Shanmugasundaram Ravikumar, Deepayan Chakrabarti and Kunal Punera
Assigned to Yahoo
US Patent Application 20090248608
Published October 1, 2009
Filed March 28, 2008

Abstract

A method of segmenting a webpage into visually and semantically cohesive pieces uses an optimization problem on a weighted graph, where the weights reflect whether two nodes in the webpage’s DOM tree should be placed together or apart in the segmentation; the weights are informed by manually labeled data.
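As a toy illustration of what "an optimization problem on a weighted graph" can look like, the sketch below assigns hand-picked weights to pairs of DOM nodes (positive means "should be together", negative means "should be apart") and then tries every possible grouping to find the one that violates the least weight. Both the objective and the brute-force search are my own simplifications, the node names and weights are invented, and the filing's weights come from manually labeled data rather than guesses like these. Exhaustive search is only workable for a handful of nodes.

```python
def partitions(items):
    """Yield every way to split items into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part  # first sits in a group of its own
        for i in range(len(part)):  # or first joins an existing group
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def cost(partition, weights):
    """Sum the weight of every preference the partition violates."""
    group_of = {node: i for i, group in enumerate(partition) for node in group}
    total = 0.0
    for (a, b), w in weights.items():
        together = group_of[a] == group_of[b]
        if w > 0 and not together:
            total += w    # wanted together, placed apart
        elif w < 0 and together:
            total += -w   # wanted apart, placed together
    return total


def best_segmentation(nodes, weights):
    """Exhaustively pick the lowest-cost grouping (small inputs only)."""
    return min(partitions(list(nodes)), key=lambda p: cost(p, weights))


nodes = ["nav1", "nav2", "para1", "para2", "footer"]
weights = {
    ("nav1", "nav2"): 2.0,     # adjacent menu items
    ("para1", "para2"): 3.0,   # consecutive paragraphs of the article
    ("nav2", "para1"): -1.5,   # menu versus article body
    ("para2", "footer"): -2.0,
    ("nav1", "footer"): -1.0,
}

segments = best_segmentation(nodes, weights)
print(segments)
```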

Conclusion

It’s likely that Google, Yahoo, and Microsoft are all giving different weights to the value of links in different segments of pages, so that a link from a main content area probably carries more weight than a link from a sidebar or a footer section of a page.

It’s also likely that the search engines are attempting to ignore boilerplate segments of pages when they try to determine if pages contain duplicate or near duplicate content, so that their decisions to filter some pages out of search results are based upon the main content of pages rather than duplicated content that may appear in places such as page footers or sidebars.
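A small sketch shows why ignoring boilerplate would matter for duplicate detection. It compares two pages by the overlap of their three-word shingles (Jaccard similarity), once with a shared footer included and once with main content only. Shingling is a common near-duplicate technique that I am using purely for illustration; I am not claiming any search engine works this way, and the sample text is made up.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def jaccard(text_a, text_b):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b)


footer = "copyright 2009 example company all rights reserved privacy policy terms of use"
main_a = "yahoo patent filings describe segmenting web pages into blocks"
main_b = "recipe for baking sourdough bread at home with a starter"

whole_page_score = jaccard(main_a + " " + footer, main_b + " " + footer)
main_only_score = jaccard(main_a, main_b)

print("whole pages:", round(whole_page_score, 2))
print("main content only:", round(main_only_score, 2))
```

With the footer included, two unrelated pages look partly alike. With it stripped, they share nothing, which is the judgment a duplicate filter would presumably want to make.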

Is it possible that search engines would consider words in the main content area of a page more important than words that appear in a sidebar, a footer, or a list of recommended similar products?

We’ve seen enough information about web page segmentation in white papers, patents, and interviews from the search engines over the past five or six years that it should be considered one of the basics of SEO at this point, though the topic often doesn’t show up in popular lists of search engine ranking factors published by some sites. Perhaps it’s time that it did.

21 thoughts on “Yahoo Web Page Segmentation: Distinguishing Noise from Information”

  1. So this means there will be no value in changing all links from saying “home” to “keyword” in your nav anymore. Good, because it’s confusing when you are on a site looking for how to get back to the home page and you can’t see the word “home” anywhere!!

  2. Definitely something that people should be paying attention to. Even a fairly simple algorithm such as tag density is reasonably effective at identifying the main content on a page – allowing a reasonably efficient way for even those with limited data (e.g. just a page) to identify the content and links that could be really carrying weight.

  3. Hi Stuart,

    There might be some ranking value in links found in navigation bars and footer links, but likely not as much as in the main content area of a site. The use of “Home” as anchor text for a link to a home page does give visitors to your site a pretty clear signal of where that link goes to.

  4. Hi Chris,

    Thanks. The whole concept of segmentation can impact which areas of a page a search engine might look for duplicate content, how much weight links within that area might carry, how much value the search engine may apply to words in different sections of a page, how a search engine might categorize a page that displays contextual advertising like adsense, and more. So it’s good to have even a rough idea of how segmentation may work.

  5. Pingback: Bookmarks for October 12th through October 13th | Francois is running on the Sidewalk
  6. Hey Bill,

    This web page segmentation by the Yahoo will surely mark a difference in the present day SEO scenario.

    This is really a thing worth noticing that Yahoo filter the main page content and the sidebar content and also considers it as duplicate.

    Thanks for this valuable information.

  7. Hi Rod,

    The interview that the quote was taken from took place in April of last year, so the timing of the introduction of the “robots-NoContent tag” was “last year around this time.” Regardless of whether many people are using that tag or not (and supposedly not many are), Priyank Garg admits that Yahoo is using an algorithmic approach to breaking pages into templates to weigh links in different areas of a page differently.

    But that tag isn’t the point of this post, and it’s not what I’ve focused upon. The two patent filings were published within the past couple of weeks (October 1, 2009). They provide a fairly deep view of segmentation processes that Yahoo may be using. Microsoft’s papers on Visual Segmentation processes go back to 2003, so the idea itself isn’t new. But, look at something like the SEOmoz list of ranking factors for 2009, and you won’t see the concept of segmentation represented in that list.

  8. Whenever I tackle SEO work I skip Yahoo / Bing etc. altogether; in fact, I was surprised with further googling to see so many people place such importance on Yahoo’s indexing methods, although I know they have come up with some leading concepts. I’ve always placed natural design and usability over SEO when creating sites, but this sort of information is always good to bear in mind and implement when practical. Cheers! :)

  9. Hi Julian,

    Yahoo has some very knowledgeable and talented people working for them in the field of search, and they have come up with some very interesting ideas explored in white papers and patent filings, or as you called them, “leading concepts.” One of the things that I find interesting are the different approaches that Google, Yahoo, Microsoft, and Ask may take to try to achieve similar results. That might be an attempt to provide the most important and relevant results to searchers, or a good mix of diverse and relevant results, or showing other kinds of media or results when it might be appropriate, such as images or videos or definitions. It might mean filtering out spam and duplicate content.

    I’m a firm believer in creating web pages for users, but I consider search engines to be just another user of a site, with its own special requirements. The purpose behind design is to present ideas in a way that makes it more likely that a visitor will understand what is being communicated on a page. A focus upon usability makes it more likely that a site owner and a visitor to a site meet the goals that both have when that person comes to visit, if the site does actually provide something that visitor may be interested in. The purpose behind SEO is to make it more likely that the right people, the ones interested in what a site offers, will find the sites that they are looking for through search engines (and through directories and direct links and referrals from others).

  10. I guess I’m just confused as to why Yahoo is still crawling things, etc. They are agreeing to return Bing results for the next several years. So if that’s the case, why then are they crawling, much less issuing statements about what they are doing to filter out the noise?

  11. Yahoo! is still its own search engine until the deal is approved and implementation of the agreement has taken full effect. It could take up to two years for everything to be switched over between the two — assuming the deal goes through.

    It’s not necessary to make that assumption, as many voices have been raised against the deal in its present form.

  12. Pingback: Bill Slawski answers: Does Google really use VIPS to sort the signal from the noise?
  13. I’ve noticed that search engines (particularly Google) appear to pay less attention to keywords in layout areas of a page. It would indeed seem they focus more now on the page body content – which I think makes complete sense.
