Yahoo Web Page Segmentation: Distinguishing Noise from Information
In a recent interview conducted by Eric Enge, Priyank Garg, Director of Product Management for Yahoo! Search Technology, explained that Yahoo breaks pages down into template sections to distinguish between noisy, or boilerplate, content and unique content:
One of the things Yahoo! has done is look for template structures inside sites so that we can recognize the boiler plate pages and understand what they are doing. And as you can expect, a boiler plate page like a contact us or an about us is not going to be getting a lot of anchor text from the Web and outside of your site. So there is natural targeting of links to your useful content.
We are also performing detection of templates within your site and the feeling is that that information can help us better recognize valuable links to users. We do that algorithmically, but one of the things we did last year around this time is we launched the robots-NoContent tag, which is a tool that webmasters can use to identify parts of their site that are actually not unique content for that page or that may not be relevant for the indexing of the page.
If you have ads on a page, or if you have navigation that’s common to the whole site, you could take more control over our efforts to recognize templates by marking those sections with the robots-NoContent tag. That will be a clear indicator to us that as the webmaster who knows this content, you are telling us this part of the page is not the unique main content of this page and don’t recall this page for those terms.
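To make the idea concrete, here is a minimal sketch of how an indexer might honor that marking. It assumes the tag is applied as a class attribute value, class="robots-nocontent", on the elements to be excluded. The NoContentFilter class and indexable_text function are hypothetical names of my own, and nothing here reflects Yahoo's actual implementation.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class NoContentFilter(HTMLParser):
    """Hypothetical filter: collects page text, skipping anything nested
    inside an element whose class list includes "robots-nocontent"."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # greater than 0 while inside a marked element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or "robots-nocontent" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def indexable_text(html):
    """Return only the text that sits outside marked sections."""
    parser = NoContentFilter()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Made-up page: marked navigation, unmarked article text, and a marked ad.
page = """
<div class="robots-nocontent"><a href="/">Home</a> <a href="/about">About</a></div>
<div><p>Unique article text about page segmentation.</p></div>
<div class="ad robots-nocontent">Buy widgets today</div>
"""
print(indexable_text(page))
```

Only the article sentence survives, since the navigation and the ad are both inside marked elements. The sketch assumes reasonably well-formed HTML; unbalanced tags would throw off the depth counter.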
I don’t completely agree with the idea that an “about us” page can’t be engaging and informative, or something that people will link to. Done right, with things like timelines, narratives, and insights into how an organization has developed, an “about us” page could potentially be one of the most interesting pages on a site.
But the general idea, that sites may contain content that isn’t compelling or informative, is true. For instance, copyright notices, advertisements, and navigational links are probably not what a person visiting a page wants to focus upon most.
A Yahoo paper that I wrote about in my post Yahoo Research Looks at Templates and Search Engine Indexing explored how Yahoo might look at the features found on a page to see if that page was using a template, and to distinguish the “main content” on that page from template type features, such as “site navigation links, sidebars, copyright notices, and timestamps.”
Other features that might be considered “noise” are ecommerce features such as “people who bought XXXXX also bought YYYYYY” sections of pages, and similar content that doesn’t actually relate to the main content of a page.
I also wrote about another Yahoo patent filing that discussed automatic segmentation of web pages in my post Breaking Pages Apart: What Automatic Segmentation of Webpages Might Mean to Design and SEO.
Two patent applications from Yahoo, published earlier this month, take the idea of segmenting content on a web page further, to give us more information about ways that Yahoo might use web page segmentation.
The first one discusses a number of topics, including how multiple pages on a site might be compared, to see if some segments tend to show up on multiple pages, which could be an indication that those segments are boilerplate – content that may not be the main informational focus of each individual page.
Site-Specific Information-Type Detection Methods and Systems
Invented by Rupesh R. Mehta and Amit Madaan
Assigned to Yahoo
US Patent Application 20090248707
Published October 1, 2009
Filed March 25, 2008
Methods and systems are provided herein that may allow for pertinent information-type(s) of data to be located or otherwise identified within one or more documents, such as, for example, web page documents associated with one or more websites. For example, exemplary methods and systems are provided that may be used to determine if information may be more likely to be of an “informative” type of information or possibly more likely to be of a “noise” type of information.
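The cross-page comparison described above can be sketched in a few lines. This is a simplified illustration and not the method claimed in the patent filing: it assumes each page has already been split into text segments, and it labels a segment as "noise" when it appears on a large share of a site's pages. The label_segments function, the noise_share threshold of 0.8, and the sample pages are all made up for this example.

```python
from collections import Counter

def normalize(segment):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(segment.lower().split())

def label_segments(pages, noise_share=0.8):
    """pages maps a URL to the list of text segments found on that page.
    A segment seen on at least `noise_share` of the pages is labelled
    "noise"; everything else is labelled "informative"."""
    page_counts = Counter()
    for segments in pages.values():
        # Count each distinct segment once per page.
        page_counts.update({normalize(s) for s in segments})

    threshold = noise_share * len(pages)
    return {
        url: [
            (s, "noise" if page_counts[normalize(s)] >= threshold else "informative")
            for s in segments
        ]
        for url, segments in pages.items()
    }

# Hypothetical three-page site sharing a navigation bar and a footer.
pages = {
    "/a": ["Home | Products | Contact", "Blue widgets ship in two sizes.", "Copyright 2009 Example Co."],
    "/b": ["Home | Products | Contact", "Red widgets are back in stock.", "Copyright 2009 Example Co."],
    "/c": ["Home | Products | Contact", "How to install a widget.", "Copyright 2009 Example Co."],
}
labels = label_segments(pages)
for segment, label in labels["/a"]:
    print(label, "-", segment)
```

Exact text matching is the crudest possible comparison. A real system would more likely compare positions in the page structure as well, since boilerplate such as a timestamp changes its text from page to page.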
The second patent application looks for connections between content nodes (somewhat like Microsoft’s “blocks”) to see if those sections of a page should be contained in the same segments.
Method for Segmenting Webpages
Invented by Shanmugasundaram Ravikumar, Deepayan Chakrabarti and Kunal Punera
Assigned to Yahoo
US Patent Application 20090248608
Published October 1, 2009
Filed March 28, 2008
A method of segmenting a webpage into visually and semantically cohesive pieces uses an optimization problem on a weighted graph, where the weights reflect whether two nodes in the webpage’s DOM tree should be placed together or apart in the segmentation; the weights are informed by manually labeled data.
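To give a feel for what a weighted graph over DOM nodes might look like, here is a toy sketch. It is not the optimization described in the patent filing. It swaps in a simple greedy merge, the node names are invented, and the weights are hand-picked, where the abstract says they would be informed by manually labeled data. Positive weights mean two nodes should be placed together, and negative weights mean they should be kept apart.

```python
from itertools import combinations

def segment(nodes, weights):
    """Greedy agglomerative grouping (a stand-in heuristic).
    weights maps frozenset({a, b}) to a score; missing pairs count as 0."""
    clusters = [{n} for n in nodes]

    def affinity(c1, c2):
        # Sum of pairwise weights between the two clusters.
        return sum(weights.get(frozenset((a, b)), 0.0) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Pick the pair of clusters that most wants to be together.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: affinity(clusters[ij[0]], clusters[ij[1]]))
        if affinity(clusters[i], clusters[j]) <= 0:
            break  # no remaining pair benefits from merging
        clusters[i] |= clusters[j]
        del clusters[j]  # safe because i < j
    return clusters

# Hypothetical DOM nodes and hand-picked weights.
nodes = ["nav1", "nav2", "para1", "para2", "footer"]
weights = {
    frozenset(("nav1", "nav2")): 0.9,
    frozenset(("para1", "para2")): 0.8,
    frozenset(("nav2", "para1")): -0.7,
    frozenset(("nav1", "para1")): -0.5,
    frozenset(("para2", "footer")): -0.6,
}
segments = segment(nodes, weights)
print(segments)
```

With these weights the two navigation nodes end up in one segment, the two paragraphs in another, and the footer on its own. A greedy merge can miss the best overall grouping, which is one reason to treat segmentation as an optimization problem instead.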
It’s likely that Google, Yahoo, and Microsoft are all giving different weights to the value of links in different segments of pages, so that a link from a main content area probably carries more weight than a link from a sidebar or a footer section of a page.
It’s also likely that the search engines are attempting to ignore boilerplate segments of pages when they try to determine if pages contain duplicate or near duplicate content, so that their decisions to filter some pages out of search results are based upon the main content of pages rather than duplicated content that may appear in places such as page footers or sidebars.
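A small sketch shows why that matters. It uses word shingles and Jaccard similarity, a common textbook approach to near-duplicate detection, and I'm not claiming any search engine does it exactly this way. The sample text, the shingle size of 3, and the function names are assumptions made for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical boilerplate shared by every page on a site.
boilerplate = "home products about contact copyright 2009 example co all rights reserved"
main_a = "blue widgets ship in two sizes and three colors"
main_b = "our returns policy covers thirty days after delivery"

whole_page = jaccard(boilerplate + " " + main_a, boilerplate + " " + main_b)
main_only = jaccard(main_a, main_b)

print(f"whole page: {whole_page:.2f}, main content only: {main_only:.2f}")
```

The two pages have nothing in common in their main content, yet comparing the whole pages reports meaningful overlap because of the shared boilerplate. On a site with heavy templates and thin pages, that inflation could be enough to make distinct pages look like near duplicates.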
Is it possible that search engines consider words in the main content area of a page more important than words that appear in a sidebar, a footer, or a list of recommended similar products?
We’ve seen enough information about web page segmentation in white papers, patents, and interviews from the search engines over the past five or six years that it should be considered one of the basics of SEO at this point, though the topic often doesn’t show up in popular lists of search engine ranking factors published by some sites. Perhaps it’s time that it did.