Breaking Pages Apart: What Automatic Segmentation of Webpages Might Mean to Design and SEO

Towards the end of 2003, researchers at Microsoft published a paper on a way to analyze the structure and content of Web pages which they called VIPS, or Vision-based Page Segmentation Algorithm. The approach looked at visual and structural aspects of web pages, and meant that a search engine could identify different parts of pages, and possibly understand that some parts could be more important and meaningful than others.

[Illustration: different segments of a web page, such as header, navigation, and main content.]

This could possibly have a number of implications for search and information retrieval, and for search engine optimization as well.

A newly published patent application from Yahoo provides another look at how web pages could be segmented into parts, and provides a number of heuristics, or rules, that a search engine might follow in segmenting the content of a web page, along with a number of examples.

Why Would a Search Engine Use Page Segmentation?

Being able to properly analyze the structure and content of a web page can be very useful to a search engine when it is trying to find pages relevant to search queries. Since search engines attempt to return pages in response to searches that are relevant for query terms that a searcher supplies, those terms should appear in the main content of a page.

The patent starts off with a couple of examples of problems that happen when a search engine can’t distinguish between the main content of a page and unrelated content that may appear in other parts of a page.

Consider, for instance, a webpage containing lyrics of a song X, but with links at the bottom of the page to other pages containing fragments from lyrics of other popular songs Y and Z. A search query for Y and Z will match this page, since both Y and Z are mentioned on the page; clearly, however, the page does not contain the information the user is looking for.

Similarly, Y and Z may be text in the advertisements appearing on the webpage.

In another instance, a search for “copyright for company X” ought to return the main legal webpage in the website for company X, and not every page in that website that has a small “copyright” disclaimer at the bottom.

Pages that contain query terms only in navigation pointing to other pages, in advertisements, or in boilerplate that appears on every page may not be ideal pages to show in search results. Likewise, a page that contains those terms may not even be the best choice among the pages of a web site for those terms:

As another example, a New York Times webpage may have a headline bar, sports, news items, and a copyright notice. A user may search for keywords such as “New York Times legal information.”

There is probably some webpage on the New York Times web site that provides much legal information. But the keywords may also match a news page that does not provide the relevant search results. To provide more meaningful information about a webpage, it is useful to figure out that the webpage is mainly about the news item, and that the other content available on that webpage is slightly relevant but not the most important in that webpage.

Thus, splitting up a webpage into different sections is useful to provide more relevant search results.
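To make the lyrics example concrete, here is a minimal Python sketch of how matches might be weighted by the segment they appear in. The segment labels, the weights, and the `segment_score` function are my own assumptions for illustration; the patent filing does not give numbers or a scoring formula.

```python
# Hypothetical weights per segment type; not taken from the patent filing.
SEGMENT_WEIGHTS = {"main": 1.0, "navigation": 0.2, "footer": 0.1, "advertisement": 0.0}


def segment_score(query, segments, weights=SEGMENT_WEIGHTS):
    """Sum the fraction of query terms found in each segment, scaled by that segment's weight."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    score = 0.0
    for label, text in segments.items():
        words = set(text.lower().split())
        matched = sum(1 for term in terms if term in words)
        # Unknown segment labels get a neutral weight of 0.5 (also an assumption).
        score += weights.get(label, 0.5) * matched / len(terms)
    return score


# A page about song X that only mentions Y and Z in footer links.
page_about_x = {"main": "lyrics of song x", "footer": "other lyrics y z"}
# A page whose main content is actually about song Y.
page_about_y = {"main": "lyrics of song y", "footer": "other lyrics x z"}

print(segment_score("y z", page_about_x))
print(segment_score("y z", page_about_y))
```

Under these made-up weights, the page that mentions Y in its main content scores higher for the query than the page that only mentions Y and Z in its footer, which is the behavior the patent's examples argue for.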

A page may be broken down into multiple blocks, such as main content, heading, footer, advertising, main navigation, and so on. Each of those blocks could be considered a separate segment and a separate semantic unit of a page that may be unrelated to the other segments. Some blocks could be joined together into one segment if they appear to be related. Other blocks may be broken down into smaller blocks. The patent filing describes rules that it might follow to join blocks together or to separate blocks into smaller units.

Another patent filing from Yahoo, which I discussed in The Importance of Page Layout in SEO, detailed how the search engine might identify different parts of pages to try to find the most important parts of the page. This segmentation approach takes that a few steps further.

Some Other Benefits of Using Web Page Segmentation

In addition to improving search by placing more weight upon content found in a main content area of a page, there are other reasons why segmentation of a page can be helpful.

For instance:

1) A page may contain segments that focus upon different topics, such as on a news page, and segments of that page may be given categories that differ from each other.

2) Web search results usually show the title to a page, a snippet about the page (which is sometimes taken from the page’s meta description and sometimes from other sources such as the content of the page) and a URL to the page. Segmenting a page into parts may improve the way a snippet is created for a page if it is taken from the content of the page, by concentrating upon using content found in an appropriate segment – such as the main content of the page rather than a sidebar or a footer.

3) An entry point into a page (perhaps links shown under a main search result like Google’s site links or Yahoo’s quick links) could be more easily found because the main navigation of the site is more easily identified during the segmentation process. If the main navigation of a site accurately identifies how the site is organized, then it could be helpful in providing ways into the main parts of a site.

4) Frequently Asked Questions (FAQs) pages can be more accurately segmented.

5) A page that has multiple parts, such as a review page that covers many products or restaurants, might be segmented into parts that could be used elsewhere. For instance, Google described how they might use a Visual Gap Segmentation process in a patent filed in 2006 for reviews in local search, which I wrote about in Google and Document Segmentation Indexing for Local Search.

6) Not specifically stated in the patent filing, but it’s possible that links found in different segments may be treated differently. A link from the main content of a page may be considered to be “higher quality” than a link from an advertisement or a sidebar, and may pass along more link equity or something like PageRank.

Page Segmentation Approaches

Document Object Model (DOM)

The patent also starts off with a discussion of some different approaches to segmenting a page into blocks, including looking at a DOM, or Document Object Model, for understanding the different parts of a web page. Just looking at the different elements of a web page to see how they might be organized may not provide enough information for a meaningful method of segmenting content found on a page. The patent provides the following example to tell us about the limitations of just using a document object model to segment a page:

DOM trees were not meant to describe semantic structure, but to merely describe presentation. Therefore, simply examining the DOM tree of a webpage to determine the segments of the webpage will result in some missed segments. For example, assume a table of camera models, camera descriptions, and camera prices, separated into the columns of the table. The column of prices should be a segment, because the column contains just numbers. However, nodes in the DOM tree of the webpage that represent the camera prices may not have the same parent in the DOM tree.

The reason that the nodes that represent different camera prices may have different parents is that the children nodes of the table node are row nodes, not column nodes. Under these circumstances, the price node may each have a different parent node because each price node is on a different row. Thus, due to the DOM specification, there is no one node in the DOM tree of the webpage that represents the camera prices column. The camera prices column, therefore, cannot easily be said to be a segment by looking at the DOM tree. Existing approaches fail on many such webpages.
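The camera table example is easy to check with a small Python sketch. The table markup, the `price` class name, and the variable names below are made up for illustration, and the markup is kept well-formed so that the standard library XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Made-up camera table; the "price" class is an assumption used to find the cells.
CAMERA_TABLE = """<table>
  <tr><td>Model A</td><td>Compact</td><td class="price">199</td></tr>
  <tr><td>Model B</td><td>Bridge</td><td class="price">349</td></tr>
  <tr><td>Model C</td><td>DSLR</td><td class="price">899</td></tr>
</table>"""

table = ET.fromstring(CAMERA_TABLE)

# ElementTree nodes do not store a parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in table.iter() for child in parent}

price_cells = [td for td in table.iter("td") if td.get("class") == "price"]
price_rows = [parent_of[td] for td in price_cells]

# Count how many different parent nodes the price cells have.
distinct_parents = len({id(row) for row in price_rows})
print([td.text for td in price_cells], distinct_parents)
```

Every price cell has a different row node as its parent, so there is no single node in the tree that stands for the price column. A segmentation that only groups nodes sharing a parent would never see the column as a unit.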

Visual Segmentation

In addition to looking at document object model information to break a page into blocks, it can be useful to look at the visual layout of a page, looking for things like visual lines between sections of pages or white space.

If a page is analyzed for the way it appears visually, and broken into little blocks, and then explored to see how the content of those different blocks relate to each other, then some sense of which blocks belong together and which don’t might make sense. Blocks with different background colors, or which are separated by horizontal or vertical lines, or which use different font sizes or colors or styles, or which are separated by white space may be within different segments.
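As a rough Python sketch of those visual cues, the function below compares two rendered blocks and reports whether any cue suggests they sit in different segments. The `Block` fields, the 20 pixel gap threshold, and the function name are assumptions of mine; a real renderer would expose many more properties, and a search engine would likely weigh cues rather than treat any single one as decisive.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A rendered block, with vertical position in pixels (hypothetical fields)."""
    top: int
    bottom: int
    background: str
    font_size: int


def visually_separated(upper, lower, min_gap=20):
    """Return True if white space, background color, or font size differs between the blocks.

    Assumes `upper` is rendered above `lower`. The min_gap default is an arbitrary choice.
    """
    gap = lower.top - upper.bottom
    return (
        gap >= min_gap
        or upper.background != lower.background
        or upper.font_size != lower.font_size
    )


header = Block(top=0, bottom=80, background="#003366", font_size=24)
intro = Block(top=90, bottom=300, background="#ffffff", font_size=14)
body = Block(top=305, bottom=900, background="#ffffff", font_size=14)

print(visually_separated(header, intro))
print(visually_separated(intro, body))
```

With these sample values, the header and intro differ in background and font size, while the intro and body share both and sit only 5 pixels apart, so only the first pair is flagged as separate.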

A visual segmentation of a web page may also miss some segments on some web page configurations.

The patent filing provides a set of five heuristics, or rules (listed in the abstract below), that it might follow in segmenting pages.

Automatic Visual Segmentation of Webpages
Invented by Deepayan Chakrabarti, Manav Ratan Mital, Swapnil Hajela, and Emre Velipasaoglu
Assigned to Yahoo!
US Patent Application 20090177959
Published July 9, 2009
Filed: January 8, 2008

Abstract

To provide valuable information regarding a webpage, the webpage must be divided into distinct semantically coherent segments for analysis. A set of heuristics allow a segmentation algorithm to identify an optimal number of segments for a given webpage or any portion thereof more accurately.

A first heuristic estimates the optimal number of segments for any given webpage or portion thereof.

A second heuristic coalesces segments where the number of segments identified far exceeds the optimal number recommended.

A third heuristic coalesces segments corresponding to a portion of a webpage with much unused whitespace and little content.

A fourth heuristic coalesces segments of nodes that have a recommended number of segments below a certain threshold into segments of other nodes.

A fifth heuristic recursively analyzes and splits segments that correspond to webpage portions surpassing a certain threshold portion size.
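The abstract doesn't say how segments are coalesced, so the following Python sketch is only a toy stand-in for the second heuristic: it greedily merges the smallest pair of adjacent segments until the count no longer far exceeds a recommended number. The area values, the `tolerance` factor, and the merge rule are all my assumptions, not the patent's method.

```python
def coalesce(segments, recommended, tolerance=1.5):
    """Greedily merge adjacent segments until the count is at most recommended * tolerance.

    `segments` is a list of (label, area) tuples in page order. Both the tolerance
    factor and the "smallest combined area first" rule are illustrative assumptions.
    """
    segments = list(segments)
    limit = max(1, int(recommended * tolerance))
    while len(segments) > limit:
        # Find the adjacent pair with the smallest combined area.
        i = min(
            range(len(segments) - 1),
            key=lambda k: segments[k][1] + segments[k + 1][1],
        )
        merged = (
            segments[i][0] + "+" + segments[i + 1][0],
            segments[i][1] + segments[i + 1][1],
        )
        segments[i:i + 2] = [merged]
    return segments


page = [
    ("logo", 10),
    ("tagline", 5),
    ("nav", 30),
    ("article", 400),
    ("related", 40),
    ("copyright", 8),
]

print(coalesce(page, recommended=2))
```

In this example the small blocks at the top and bottom of the page are folded together while the large article block is left alone, which is the general effect the coalescing heuristics seem to aim for.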

Brief Overview of the Process

1. A document object model (DOM) tree for the web page is created and annotated with information about where the contents for each element appear on a page that has been rendered visually.

2. HTML tags at each node are classified as block separators, text formatters, or text layouts.

A block separator node – these create divisions in web pages, such as a line break (br) which would create white space between parts of a page, or a horizontal rule (hr), which could be used to render a line between text.

A text formatter node – affects the display properties of text, such as bold (b), paragraph (p), italics (i), font style or size or color (font). These elements can be an indication that the text associated with them should not be separated into different blocks.

A text layout node – these group things together and can indicate the layout of a page, such as divisions or sections (div), cells in tables (td), and rows in tables (tr).

3. Nodes are then assigned blocks.

4. Blocks may be merged together to reduce the overall number of blocks.

5. Different heuristics detailed in the patent filing may be used to reduce or increase the number of blocks as necessary. Whatever blocks remain are determined to be segments.
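The tag classification in step 2 can be sketched as a simple lookup in Python. The three sets below contain only the example tags mentioned above; the patent filing may classify more tags than these, and the `classify_tag` function and the "unclassified" fallback are my own additions for illustration.

```python
# Example tags from the overview above; the full lists in the filing may differ.
BLOCK_SEPARATORS = {"br", "hr"}
TEXT_FORMATTERS = {"b", "p", "i", "font"}
TEXT_LAYOUTS = {"div", "td", "tr"}


def classify_tag(tag):
    """Map an HTML tag name to one of the three node classes, or 'unclassified'."""
    tag = tag.lower()
    if tag in BLOCK_SEPARATORS:
        return "block separator"
    if tag in TEXT_FORMATTERS:
        return "text formatter"
    if tag in TEXT_LAYOUTS:
        return "text layout"
    return "unclassified"


for name in ["hr", "B", "div", "span"]:
    print(name, "->", classify_tag(name))
```

A lookup like this would only be a starting point, since the later steps assign nodes to blocks and then merge or split those blocks using the rendered positions.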

Conclusion – Implications for Site Design and SEO

A long standing convention that many hold when thinking about how search engines index content found on the web is that search engines consider all of the content of a single page when indexing that page.

However, the Microsoft paper that I linked to at the start of this post on Vision-based Page Segmentation was published a full six years ago, and describes how a search engine might break a page down into parts to index content from those parts of pages instead of the full page itself. Microsoft has followed up that paper with a number of others that explore the segmentation of pages in more detail, and has developed other approaches that go in somewhat different directions such as object-level indexing – also see my post on Microsoft’s granted VIPS patent and on an approach from Microsoft for identifying the most important block.

The post I linked to about Google’s patent on identifying Visual Gaps in pages, which was granted earlier this year, not only discusses using page segmentation to identify different reviews that appear on the same page, but also notes in a short paragraph at the bottom of the patent that they may use visual segmentation to identify different parts of pages. Here’s that section:

[0047] Although the segmentation process described with reference to FIGS. 4-7 was described as segmenting a document based on geographic signals that correspond to business listings, the general hierarchical segmentation technique could more generally be applied to any type of signal in a document.

For example, instead of using geographic signals that correspond to business listings, images in a document may be used (image signals). The segmentation process may then be applied to help determine what text is relevant to what image.

Alternatively, the segmentation process described with reference to acts 403 and 404 may be performed on a document without partitioning the document based on a signal. The identified hierarchical segments may then be used to guide classifiers that identify portions of documents which are more or less relevant to the document (e.g., navigational boilerplate is usually less relevant than the central content of a page).

If you design web pages, or perform SEO on a site, getting a sense of how a search engine might segment the content it finds on web pages is something that you should investigate if you haven’t started already. It’s a concept that has been around since before 2003.

It can determine which content on a page might be indexed or ignored, how much weight different links may carry, where content for the creation of search result snippets may be taken from, what the most important image on a page might be, and other aspects of how a search engine interacts with the content it finds and segments on web pages.

37 thoughts on “Breaking Pages Apart: What Automatic Segmentation of Webpages Might Mean to Design and SEO”

  1. Some very interesting morsels here. I had been wanting to find out more about this sort of thing for some time. From my observations, I have deduced that Google have honed their ability to analyse page semantics a lot over the last year and a half. In particular, links in footers have become more and more devalued. In another instance, someone I know added external links to a number of sites and text-indented them off the page by a few thousand pixels, and found that Google had sandboxed all the sites at the next index.

    The main thing I would be interested to know is – what does this mean for atrocious HTML? Sites with tables for layout, MS Word formatting, swathes of empty tags, blocks of white space and so on.

  2. Although I’m sure more analysis and perhaps testing would be required to know, would a general rule of thumb be to place more of the desired SEO rated information in the normal title, description, H1, body and file names and stay away from footer links, and of course, menus? I have heard of SEO’s using footer text to emphasize anchor text and other internal links, but wondered whether this is effective?

  3. It’s something interesting here. I would be happy to wait for these types of posts…

  4. Hi David,

    One of the things that I like about working on the Web is the chance to learn and grow. It’s part of the challenge and the fun of doing SEO.

    Interesting observations about footer links and offset text.

    Regardless of how good or bad the code and layout of a page might be, the way that the page may render visually could influence how a search engine segments that page, in spite of whether the page is structured with tableless cascading style sheets or tables and spacing images and possibly deprecated HTML elements. The original VIPS paper tells us that:

    Due to the flexibility of HTML grammar, many web pages do not fully obey the W3C HTML specification, so the DOM tree can not always reflect the true relationship of the different DOM node.

    So, since there are many different possible ways that a designer may come up with the same design, using CSS or table layouts or some combination, and using HTML that may not follow standards from the W3C, a visual segmentation approach looks past the use of a DOM to the way that content actually renders on a page.

    In other words, the processes in this patent filing actually address issues of code such as empty tags and unused blocks of white space through their visual segmentation approach. :)

  5. I thought people started thinking about page segmentation a few years ago, when you wrote about search engines determining blocks of related links (be it navigation or in the sidebar) and the weight of those links (if I remember right).

    What interests me is if, considering page segmentation analysis by the search engines, it’s worth placing the content above the navigation in the code.

    On one hand, it’d give unique parts of the page content (and in-content links) more prominence, but on the other hand, if that is true, it’ll reduce the weight of site-wide navigation links (and it hurts screen readers, too).

    Do you have a formed opinion on the code order thing, Bill?

  6. Hi Rick,

    I would focus upon building good quality content in the main content area of a page, and using words and phrases that people searching for what that page has to offer might use within that content area, and in places like the page title and in links to the page.

    There’s nothing wrong with using appropriate and meaningful keywords in footer links and menus, especially if they are helpful to visitors in giving them an idea of what might appear on the other side of those links. The right trigger words are often good keywords as well. Your links in menus shouldn’t just focus upon being good anchor text, but also words and phrases that make it more likely that people will click on those links and visit pages of your site that you want them to see.

    We don’t know if search engines do give different weights to the relevance associated with anchor text in links based upon where they appear in the layout of a page, or different amounts of something like pagerank, but it’s possible that they do, and knowing that they may segment a page gives you an understanding of that possibility that you may not have had before.

    If you have a link in your footer on every page that goes to a privacy policy, and a link in the main navigation on your site to a support forum, it’s possible that a search engine might not pass along as much PageRank to that privacy policy page linked to in your footer as it does to the support forum page linked to in your navigation bar, even though the same number of links to each page may appear on your site.

    It’s possible that if you include links to other pages on your site within the main content area of your pages, that those links might carry more relevance weight for anchor text and more PageRank than links that appear on a global navigation or a template footer as well. We don’t know that for certain either, but regardless of whether or not they do, it’s not a bad idea to have some contextual links in the main content of your pages anyway. Too many might make the content of a page unreadable, so those should be used with that concern in mind.

    What I think is important here is to realize that not every link may have the same value, based upon where the link is located in the layout of a page.

  7. Hi Sanjay,

    Thanks. It can take a while to go through some of the patent filings that I write about, and think about some of the possible implications behind them.

  8. Hi Yura,

    Very good question. Hopefully people have started to think about the possibility of a block level PageRank of the kind described in a Microsoft page on Block Level Link Analysis.

    I think that you’re asking about approaches towards attempting to place content above other elements in the code of a page, while the page actually renders in the same place as it would regardless. In older days on the Web, this was done with a table trick that would make content appear higher in the HTML code, but in the same place on the page. The same kind of thing could be done with CSS, to make certain elements appear at certain places on a page, even though they may actually appear in a different order in the HTML of the page.

    I’m not sure that such an approach helps, and there might be a very small (possibly insignificantly small) chance that it could confuse things. The idea behind using a visual aspect of segmentation is to go beyond just how the code of a page might appear to seeing how the content of a page may actually render in a browser, and basing segmentation upon visual aspects of a page.

    So, the prominence of a main content segment in the HTML code of a page shouldn’t be a factor – the segmentation approach tries to break a page into blocks based upon how the page will be seen visually, and ideally use that main content area as a focal point. That’s true regardless of whether the contents of that main content area appear near the top of the HTML code for a page, or at the bottom.

    I do like to keep HTML code as simple as possible whenever possible. I believe that doing so has multiple benefits, such as making pages easier to maintain, and possibly decreasing download times and bandwidth. But the most important part of doing so is to mitigate the risk that some errors have snuck into the code of a page and that some of those errors could cause problems. The idea of intentionally manipulating the location of content so that it appears in a different area on a page than it does in HTML code through some kind of CSS positioning trick or table trick is something that I avoid.

  9. I thought that Google must be doing something to render pages after I found out about those sites being sandboxed instantly after the text indentation trick. However, I’m not sure what. Rendering a page with CSS is relatively resource intensive. Although, I guess SEs would just ignore many of the declarations.

  10. Hi David,

    I’ve wondered about the computational cost of rendering pages via a crawler as well. Google does some other things that might point towards rendering pages, such as warning about the possible presence of malicious code on pages. That could be done with a browser type rendering in a sandbox that would protect the browser from infection (via something like the Greenborder technology that appears to have made its way into Google Chrome).

    As just one possible alternative, Google does capture cached copies of pages it finds on the Web. In addition to providing those cached copies for users to view pages when they might be unavailable, it might also be possible for the search engine to compare (offline) what they see on the cached copies of pages with what their index contains about the pages. A visual segmentation of a local copy of a page might be less resource intensive.

  11. This is a neat idea, and I think search engines have to move towards this model of figuring out the visual layout of a page (compatible with the user’s experience). With web technology moving forward (Ajax etc.), Google is actively trying to crawl and index Ajax, JS, Flash and other types of content, and at the same time it is figuring out what images represent.

    Once they have the basic building block figured out, they will take it to next level and try to determine whether site is aesthetically pleasing (not content anymore) to user and if user is likely to stay or bounce off the website.

    Of course, things like the Google toolbar, Analytics, and the AdSense network give them a great idea of this also.

    Thanks for a great post again!

  12. The comments have been helpful. I was wondering about how a page loads and what effect that would have, and that was answered. I imagine then that blog formats are better appreciated by search engines, since it seems that the segmentation process would go quicker for those sites. Would easier segmentation provided by the blog site be better for SEO over a static site then?

    Since part of the segmentation uses the tag, would an id or class attribute help highlight certain segments? Should we be concerned how we name a class or an id for a search engine trying to find the relevant segment, or is this going too far?

  13. Hi Rajat,

    Good points on the other sources of information that search engines can use to find out more about pages, such as toolbars, etc. I do think we will see some more advances in crawling web pages over the next few years, including using optical character recognition software to read text in images, associating content from frames and iframes with the pages that frame them, and more.

  14. Hi Frank,

    Interesting questions. The use of well known and well constructed content management systems and publishing tools like blogging software can make it easier for a search engine to potentially understand the mechanics of pages on a site. It may be possible for a search engine to understand and anticipate what it might find in terms of segmentation when it’s seen many WordPress blogs, for instance.

    I think what’s important from the perspective of a web publisher is to recognize and anticipate the possibility that a search engine might segment the content of their pages based upon their physical layout. I know of at least one major household brand name that only uses its name in page titles, an image text logo, and in the copyright footer notice of its pages. Its affiliates tend to outrank it consistently on a majority of brand related searches. If it bothered to use its own name in the main content area of its pages, that might change.

    Using something like div class="footer" may not be a bad idea at all, especially if it helps a search engine focus upon the text and images that it might find in the main content area (div class="main-content") of a page.

  15. Pingback: SEO Daily Reading - Issue 152 « Internet Marketing Blog
  16. Bill, it did seem to me that changing code order was more problematic than helpful. I wasn’t aware that algorithms exist to correctly estimate visual placement, not just code (what was I thinking!).

    Thanks for the response.

  17. Hi Yura,

    There have been people who have been trying to make content appear more prominently in the code of pages for many years, on the assumption that it will help them in rankings. The use of a visual segmentation approach is intended in part to emulate the experience of a viewer of a page, who doesn’t look at the HTML behind the page. :)

    I think it can be easy to fall into that line of thinking when you start reading about something like VIPS, and there’s all this initial discussion of document object models (DOMS) and HTML elements before they start discussing actually looking at a visual representation of a page in question.

  18. I guess the take-home lesson here is to view your pages from the perspective of your customer. What segment of the page is important to them? What draws their attention? Where would they see the important content being? That’s where we should be placing our most critical anchor text links. If the engines are trying to think like customers then we should too.

  19. Yes, due to the flexibility of HTML grammar, many web pages do not fully obey the W3C HTML specification!

  20. Hi Bullaman,

    That’s a really useful perspective. Thank you. I think that take-away is one that is worth following regardless of whether search engines are using visual segmentation. Ultimately, the goal is connecting with visitors to your site, rather than ranking highly in search engines anyway.

  21. Hi Shadu,

    Yes, there are many ways to create the layout for a page, and many pages don’t follow standards. Search engines will still try to index the content they find on most of those pages. I do like trying to follow web standards. Mainly so that it’s more likely that your layout and design will be shown to visitors the way that you intended. Some HTML errors can cause problems with the indexing of pages, and others won’t. This visual segmentation approach attempts to avoid relying completely on the HTML code that it sees.

  22. Pingback: Bookmarks for July 23rd through July 24th | Francois is running on the Sidewalk
  23. Pingback: HTML5 et référencement, quel est le programme?
  24. Pingback: SEO – All 2010 Nominees » SEMMYS.org
  25. That’s an interesting article. It does make sense that a page would be broken down like that, but I never imagined the detail that Google might be able to go into to determine the value and relevancy of content and links.

    Thanks for the informative post.

  26. Hi Simon,

    You’re welcome. This kind of segmentation is something that Google, Yahoo, and Microsoft have been working on for years. It does make sense, especially considering that many sites have pages often with very similar or the same content found in headings, footers, and sidebars. If a search engine wants to focus primarily on the main content found on pages, they would definitely use something like a segmentation approach.
