
Microsoft Granted Patent on Vision-Based Document Segmentation (VIPS)


How Does Vision-Based Document Segmentation Work?

Web pages can be messy. A single page can cover more than one topic, and the template that surrounds those topics is often filled with links and labels, advertising and boilerplate, copyright and other notices that add little meaning to the meat of the content.

With such a mix of topics, those pages may not be easy for search engines to crawl and index, or for searchers to find.

When we think of search engines and how they work, we often break what they do down into three main parts – discovering new pages and new content on old pages, indexing content on those pages following rules that show a preference for important pages and unique content, and presenting relevant and meaningful information to searchers and their intents (or at least matching their keywords) in response to queries that they enter into a search box.

We usually don’t think of search engines as indexing parts of pages, chunks of information that might exist side-by-side with very different topics, and yet many pages are messy like that.

But white papers and patent filings from search engineers show signs that they might try to segment and capture information about different topics found on the same page, using a vision-based document segmentation process.

Vision-Based Document Segmentation

Researchers from Microsoft have been working on understanding and indexing different parts of pages, and a couple of white papers from a few years back (2003 – 2004) tell us about that approach:

The abstract to the second document gives us a nice summary:

A new web content structure analysis based on visual representation is proposed in this paper. Many web applications such as information retrieval, information extraction, and automatic page adaptation can benefit from this structure.

This paper presents an automatic top-down, tag-tree independent approach to detect web content structure. It simulates how a user understands the web layout structure based on his visual perception.

Comparing to other existing techniques such as the DOM tree, our approach is independent of the HTML documentation representation. Our method can work well even when the HTML structure is quite different from the visual layout structure. Several experiments show the effectiveness of our method.

At its simplest level, this vision-based document segmentation approach breaks down web pages into different blocks of meaning, based upon the way we might see a page, with blocks of text and pictures, and line breaks and white space, and other separators of text and images and other content.

These different sections of a page can be identified as portions of a page that might contain different meanings, sometimes completely unrelated to each other. Understanding that those blocks exist may be helpful to a search engine when it crawls a page and decides to index the content it finds upon that page, so that searchers may be able to find the information that it contains.

Microsoft was granted a patent this week on this vision-based document segmentation (VIPS) as it might be used during document retrieval. A hat tip to David Harry, who pointed the patent out to me. I had found another patent related to this one and stopped searching for more. I’ll get to that other patent with another post, but it makes sense to point this one out first since it focuses upon an earlier step – breaking pages down into blocks.

Vision-based document segmentation
Invented by Ji-Rong Wen, Shipeng Yu, Deng Cai, and Wei-Ying Ma
Assigned to Microsoft
US Patent 7,428,700
Granted September 23, 2008
Filed: July 28, 2003


Vision-based document segmentation identifies one or more portions of semantic content of a document. The one or more portions are identified by identifying a plurality of visual blocks in the document and detecting one or more separators between the visual blocks of the plurality of visual blocks.

A content structure for the document is constructed based at least in part on the plurality of visual blocks and the one or more separators, and the content structure identifies one or more portions of the semantic content of the document. The content structure obtained using the vision-based document segmentation can optionally be used during document retrieval.

We’re fortunate that we have the two papers I listed above to help us cut through some of the language of the patent, with some very helpful illustrations.

I’m going to skim over a lot of the details in the patent and try to keep this post fairly simple because of that.

Finding Blocks and Separators for Blocks Visually

One of the first steps in vision-based document segmentation is to try to break a page down into visual blocks, based upon visual cues within the document, such as:

  • Font sizes and/or types
  • Colors of fonts and/or background information
  • HTML tag type, and
  • Others

Once differences in sections of a page are identified, visual separations between them are looked for, such as:

  • Lines in the document
  • Blank space in the document
  • Different background colors for different blocks, and
  • Others

Based on the identified blocks and separators, a content structure for the page is created.
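
To make those steps a little more concrete, here is a minimal Python sketch of the idea. It is not the method from the patent or the papers. The VisualBlock fields, the weights of 20 and 10, and the threshold of 25 are all assumptions I made up for illustration, and it only looks at blocks stacked vertically on a page.

```python
from dataclasses import dataclass


@dataclass
class VisualBlock:
    """A rendered region of a page (all fields are illustrative assumptions)."""
    text: str
    top: int         # y coordinate of the top edge, in pixels
    bottom: int      # y coordinate of the bottom edge, in pixels
    font_size: int   # font size in points
    background: str  # background color, e.g. "#ffffff"


def separator_weight(above, below):
    """Score the visual break between two vertically adjacent blocks."""
    weight = max(0, below.top - above.bottom)  # blank space between them
    if above.background != below.background:   # background color changes
        weight += 20                           # assumed weight
    if above.font_size != below.font_size:     # font size changes
        weight += 10                           # assumed weight
    return weight


def segment(blocks, threshold=25):
    """Group blocks into segments, splitting wherever a separator is strong."""
    ordered = sorted(blocks, key=lambda block: block.top)
    segments = [[ordered[0]]] if ordered else []
    for previous, current in zip(ordered, ordered[1:]):
        if separator_weight(previous, current) >= threshold:
            segments.append([current])    # strong separator: new segment
        else:
            segments[-1].append(current)  # weak separator: same segment
    return segments
```

In this sketch, a headline followed closely by its introduction stays in one segment, while a large gap or a change in background color starts a new one. The result is a flat list of segments; the content structure described in the patent is richer than that.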

The patent goes into detail on how the HTML of a page can be explored to visualize the structure of a page, using something like the document object model to help understand parts of pages, and using several rules based upon visual cues such as font sizes and colors, background colors and more.

A Document Object Model is used by programmers and browsers to understand how the different HTML elements on a page, such as tables, paragraphs, images, and forms, might be related to one another, and the overall structure of a document. A web page is represented as if it were a tree, with each of the HTML elements being a branch or leaf on that tree. Those elements each have a name, and someone using something like JavaScript on the page can use those names to affect those named elements.
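
The tree idea is easy to see in a few lines of Python. This is only a rough sketch built on the standard library's html.parser module, not a real browser DOM. The Node and TreeBuilder classes, the parse and outline helpers, and the short list of void tags are my own assumptions for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One element in the tree: a branch if it has children, else a leaf."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # pieces of text found directly inside this element


class TreeBuilder(HTMLParser):
    """Build a simple element tree while the parser walks the HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def parse(html):
    """Return the root node of the tree for an HTML string."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one element per line."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Calling print("\n".join(outline(parse(page_html)))) on a page's HTML shows the nesting of its elements. The point of the vision-based approach is that this tag tree and the visual layout of a page do not always agree.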

Document Retrieval Based upon Blocks

Imagine a search index that not only indicates where words are found upon documents but also on a more finely tuned level – the blocks found on pages.
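
As a minimal sketch of what such an index might look like, assuming pages have already been split into blocks of text, the Python below records each word against a (page, block) pair rather than a page alone. The build_block_index and lookup functions and the input format are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def build_block_index(pages):
    """Map each term to the (page id, block number) pairs that contain it.

    pages maps a page id to its list of block texts (a hypothetical input).
    """
    index = defaultdict(set)
    for page_id, blocks in pages.items():
        for block_no, text in enumerate(blocks):
            for term in text.lower().split():
                index[term].add((page_id, block_no))
    return index


def lookup(index, query):
    """Return the (page, block) pairs that contain every term in the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()
```

With an index like this, a query only matches when all of its words appear in the same block, so two words that sit in unrelated sections of one page no longer count as a match for that page.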

Instead of ranking pages to present to searchers, the search engine may rank blocks of content to see how well they match a query.

Rankings for documents might then be created based upon the block rankings. The rank from the highest-ranking block might be used, or an average of the rankings of all of the blocks, or a combination of the rankings of the blocks might be used, or some other approach.
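
Those options can be sketched in a few lines of Python. The document_score function, the method names, and the alpha weight are assumptions for illustration only. The patent leaves open how block rankings might be combined, and the weighted blend here is just one possible combination.

```python
from statistics import mean


def document_score(block_scores, method="max", alpha=0.5):
    """Turn per-block relevance scores into one score for the document."""
    if not block_scores:
        return 0.0
    best = max(block_scores)      # score of the highest-ranking block
    average = mean(block_scores)  # average over all of the blocks
    if method == "max":
        return best
    if method == "mean":
        return average
    if method == "combined":
        # Assumed blend: alpha weights the best block against the average.
        return alpha * best + (1 - alpha) * average
    raise ValueError(f"unknown method: {method}")
```

For a page whose blocks score 1.0, 0.5, and 0.0 for a query, "max" gives 1.0, "mean" gives 0.5, and "combined" with an alpha of 0.5 gives 0.75. Using the best block favors a page with one very relevant section, while the average favors pages that stay on topic throughout.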

Vision-Based Document Segmentation Conclusion

I’m skipping over a large number of details described in the patent and the papers, but one of the most important takeaways from this patent is that the indexing of content on web pages may be based on parts of pages, rather than the whole page.

This patent was originally filed in 2003, and more papers have come out from Microsoft since then that build upon vision-based document segmentation and the idea of breaking pages into blocks. If you’d like to explore the topic further, here are some papers worth looking at:


18 thoughts on “Microsoft Granted Patent on Vision-Based Document Segmentation (VIPS)”

  1. Robert,

    I thought the papers might help. 🙂

    A simple illustration of how Google has said that they might use segmentation (Google and Document Segmentation Indexing for Local Search)

    A monthly New York magazine publishes their pages on the Web, and each month they have an article on restaurants in a different New York neighborhood. The article is on one page, and starts off by telling us something about the neighborhood. It then has a paragraph about each restaurant, with the name of the restaurant in bold at the start of the paragraph, and the address and a rating for the restaurant at the end of the paragraph.

    Google sees the white space between each review paragraph, sees the bold name of the restaurant at the start and the address and rating at the end. It segments the page based upon the layout of the page, the bold fonts and the kinds of data contained in each paragraph. It takes those reviews, and indexes them in local search as reviews about each of the restaurants.

    Microsoft might do something similar for product reviews in its object level ranking. It will look at pages that may contain reviews for multiple cameras, and segment parts of those pages based upon the kind of camera that it sees, and the layout of the page, including different kinds of fonts, page background and separations between the different reviews, using VIPS. In Object Level Ranking, it might then create a database including segments about the same model of camera found on different pages, and attributes associated with that object (or camera), such as price, level of zoom, and other features associated with the camera. When someone searches for the camera, it might provide a lot of details taken from more than one page, but it might link to the most informative page or pages about the camera (or offer a number of pages, and let the searcher decide which pages they want to look at).

    I hope those examples make things clearer, instead of more complex.

  2. Well we have spent a long time trying to tell people that they should optimise each individual page and not simply accept that a website is optimised… now we’re going to have to optimise segments of a page? I wonder just how this will work. Possibly with 3 good topics on a page links within those topics will now carry theme weight as well as link weight. If so then the real winner will be the page that has been linked to.

    But with the world wide web being so full of clutter could this now offer us results that are more cluttered and leaving us wondering why exactly we were sent to that page?

    Heck, perhaps I’m way off base on this one. But as with everything, it’s not what you’ve got but what you do with it. Let’s hope Microsoft make good sensible use of this.

  3. Hi Robert,

    Thanks. Very good questions.

    One of the ideas behind this VIPS or Object Level Ranking is that pages may be overlooked in search rankings and not show up in search results because they are about multiple topics. Even though they may have very helpful and relevant sections about something someone is searching for, the mix of topics on the page may make it seem less relevant for a query. As you note, that information may be diluted.

    If the little guy has a dedicated page, with more information about a product (or topic) in a format that is easy enough for a search engine to extract information from, he may be the one who ends up on top. If the search engine is gathering information from multiple pages about a product or a topic, it may decide that his is the most informative and helpful, and put his page first.

    I’m not sure that this segmentation hurts smaller businesses – it just looks more closely at the content on a page, and may index on a finer-grained level than pages.

  4. Hi Robert,

    In the list at the bottom of my post is a paper titled “Object-level Vertical Search,” which expands upon this VIPS method. Microsoft has used that approach in their product search, and in their equivalent of Google Scholar – Libra Academic Search. The Microsoft Research pages tell us that both have been incorporated into Windows Live, so we know that VIPS is being used for at least those types of results.

    Having to shift our thinking to segments of pages, and how a search engine might view those when crawling, indexing, and serving results is something to get our heads around, and we should probably expect that not only from Microsoft, but also from Google and Yahoo.

    If you want to explore it further, looking at the papers about object level searching may be a good place to look. Another couple of papers on that topic are:

    Object Level Ranking: Bringing Order to Web Objects (pdf)

    Web Object Retrieval (pdf)

  5. William

    hahahahahaha… okay I’m now about as confused as can be. Thankfully I have a development team that can make head or tail of all that. I guess it’s one of those things that you get a much better understanding of it when you see it in action.

    While I’m waiting for the penny to drop I’ll give those some reading and who knows, I might actually learn something.

    Thanks for your efforts 🙂

    Ah… plain English! Thanks, now it’s pretty crystal clear. However, if all the information is on several pages and not just one, we are left with the same eternal question: is this the most relevant page for a search? I imagine that this may be a much better idea when it comes to obscure searches. However, when it comes to products, wouldn’t this simply throw the ball straight to the big guys, so that the little guy who has a single dedicated page will start to lose out?

    However I do believe this could be a good idea for those that have a website that is just bursting with information and other relevant content that is being overlooked as it’s being diluted at the moment with other equally strong content but not 100% dedicated to a single point.

  7. It seems that this type of search engine technology and page indexing had better somehow take into account the numerous ways that seemingly different topics ( in the eyes of a search engine algorithm ) actually may be related and relevant to a human reader.

    As you noted in this post:

    … one of the most important takeaways from this patent is that the indexing of content on web pages may be based on parts of pages, rather than the whole page.

    I think the pitfall here is in the possibility of taking sections of a web page out of context and giving them too much importance as standalone bites of information, thereby actually decreasing the relevancy of the results returned by a search engine that uses this technology.

    Your thoughts?


  8. @ People Finder: isn’t that the problem that we find with searches that return blog posts. If for some reason we find ourselves on the blog home page and not the exact entry that meets our needs we often wonder why we are there. If we bother to scroll to the bottom of all that garble we eventually find the post that relates to our search.

    Could it be that this will be an even better way of determining boilerplate text? After all, if the same or similar code exists on many pages perhaps the snippet that varies is indeed more relevant than previously expected?

  9. Hi Peoplefinder,

    I think the pitfall here is in the possibility of taking sections of a web page out of context and giving them too much importance as standalone bites of information, thereby actually decreasing the relevancy of the results returned by a search engine that uses this technology.

    That may be a possibility, but the opposite also holds true – this approach may provide access to information that is helpful and on topic, but might be placed on a large page containing lots of information of different types, and may not be showing up in search results because the information that surrounds it is irrelevant to the query someone used in their search.

    I think that the issue you raise is one that the Microsoft researchers have been concerned about, which is why the Object Level Ranking methods have been developed. They use VIPS to isolate content on pages in different blocks, and then find other pages that contain related content about the same topics (or objects). The different standalone bites of information from different pages are pulled together that way.

  10. Hi Robert,

    I’d like to hear peoplefinder’s view on that topic, too. I think that it can be helpful exploring something like this method in the context of specific types of pages – blog home pages, product department pages, news and magazine portal pages, review sites, social networking portal pages.

    I do think that this is helpful in determining boilerplate text, without having to look at multiple pages of the same site to compare them, and see what content does appear over and over as boilerplate. That’s a very useful aspect of this approach. We also see Yahoo trying to do that by looking at the HTML of a page more closely, and Google’s identification of boilerplate can help them focus more upon the main content of a page. The approaches behind each method differ, but the result is very similar – focus upon the unique content found on a page, and pay less attention to content that may appear in many other places on the same site.

  11. Great post! Very useful information.
    I’ve already bookmarked and subscribed to the feeds.

  12. I think the pitfall here is in the possibility of taking sections of a web page out of context and giving them too much importance as standalone bites of information, thereby actually decreasing the relevancy of the results returned by a search engine that uses this technology.
