Search Engines, Web Page Segmentation, and the Most Important Block

Many web pages contain more than one topical section, or block, which can make it difficult for a search engine to tell what a page is about when it indexes that page.

These blocks may include such things as a main content area, navigation bars, headings, footers, advertisements, and other content that may refer to other pages on a site, or on other sites.

The Value of Knowing the Most Important Block

Being able to identify a block within a web page that represents the primary topic of that page may help a search engine decide which words are the most important ones on the page when it tries to associate the page with keywords that someone might search with to find that page.

Identifying that content might also help the search engine decide what topic is most relevant to any ads that it might show on the page if it is an advertising partner of the page’s publisher.

It might also help in deciding how to break a page into multiple parts when a proxy server displays the page in pieces on mobile devices, so that the most important parts of the page are shown first.

Knowing a primary topic for a page could help a search engine decide which pages across the web might be related by topic.

Determining the most important block on a page could influence the weight and importance of links from different blocks on a page, so that a link from the most important block on a page has more value than a link from the least important block – a Block Level Link Analysis.
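
As a rough illustration of that block-level idea, here is a minimal Python sketch. The Block class, the 1 to 4 importance scores, and the rule that a link's weight is proportional to the importance of its block are all assumptions made for the example; the patent does not spell out a formula like this one.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A page segment with a hypothetical importance score (1 = lowest, 4 = highest)."""
    name: str
    importance: float
    links: list[str] = field(default_factory=list)


def link_weights(blocks: list[Block]) -> dict[str, float]:
    """Split a page's link value across its links in proportion to block importance.

    Assumption: each link inherits the importance of the block it sits in,
    and the weights are normalized so they sum to 1.0 for the page.
    """
    raw: dict[str, float] = {}
    for block in blocks:
        for url in block.links:
            # If a URL appears in several blocks, keep the strongest signal.
            raw[url] = max(raw.get(url, 0.0), block.importance)
    total = sum(raw.values())
    return {url: score / total for url, score in raw.items()} if total else {}


page = [
    Block("main content", 4, ["https://example.com/related-story"]),
    Block("navigation", 2, ["https://example.com/home"]),
    Block("footer", 1, ["https://example.com/legal", "https://example.com/ads"]),
]
weights = link_weights(page)
for url, weight in weights.items():
    print(f"{weight:.3f}  {url}")
```

In this toy page, the single link in the main content carries as much weight as the navigation and footer links combined, which is the kind of effect a block-level link analysis could produce.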

A Microsoft patent granted in April explores the identification of a block within a web page that represents the primary topic for that page, and how identifying that block might be helpful in many ways.

Noise Information and Primary Topics

A page from a news site on the Web might contain an article about an international political event and “noise information” like a diet advertisement, a legal notices section, and a navigation bar.

A search engine attempting to index the full content of the page might choose keywords based upon the noise information instead of from text related to the primary topic of the page – the political event.

That news page shouldn’t rank well in a search for the keyword “diet,” but it might, even though the primary topic of the page involves an international political event.

Search results are likely to give a searcher a much better experience when the query terms match the primary content of the pages returned than when they match only noise on those pages.
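
One way to picture the difference is to count each word on a page in proportion to the importance of the block it appears in. The sketch below is only an illustration under that assumption; the sample text, the importance scores, and the weighted_term_scores function are made up for the example and do not come from the patent.

```python
import re
from collections import Counter


def weighted_term_scores(blocks):
    """Return term -> score, where each occurrence counts as its block's importance.

    blocks is a list of (text, importance) pairs; importance is a
    hypothetical score from 1 (noise) to 4 (primary content).
    """
    scores = Counter()
    for text, importance in blocks:
        for term in re.findall(r"[a-z]+", text.lower()):
            scores[term] += importance
    return scores


news_page = [
    ("Leaders sign treaty at international summit as treaty talks end", 4),  # article
    ("Diet plan! Try the new diet today. Diet now", 1),  # advertisement
    ("Home World Politics Sports", 2),  # navigation bar
]
scores = weighted_term_scores(news_page)
print(scores["treaty"], scores["diet"])
```

A plain word count would see "diet" three times and "treaty" only twice. With the block weights applied, "treaty" scores higher than "diet", so the noise in the advertisement no longer dominates.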

The Microsoft patent is:

Method and system for calculating importance of a block within a display page
Invented by Wei-Ying Ma, Ji-Rong Wen, Ruihua Song, Haifeng Liu
Assigned to Microsoft
US Patent 7,363,279
Granted April 22, 2008
Filed April 29, 2004

The abstract for the patent tells us that it describes:

A method and system for identifying the importance of information areas of a display page. An importance system identifies information areas or blocks of a web page.

A block of a web page represents an area of the web page that appears to relate to a similar topic.

The importance system provides the characteristics or features of a block to an importance function that generates an indication of the importance of that block to its web page.

The importance system “learns” the importance function by generating a model based on the features of blocks and the user-specified importance of those blocks.

To learn the importance function, the importance system asks users to provide an indication of the importance of blocks of web pages in a collection of web pages.

Identifying the importance of information areas of a web page

This system attempts to identify and understand different information areas, or blocks, of a web page – where a block represents an area on a page that seems to relate to a similar topic. For example, a news article might be one block, and an advertisement might be another.

After the blocks of a page are identified, an importance system might look at the characteristics or features of each block to determine how important each block might be.

This importance system would use an algorithm that builds a statistical model based upon features of blocks, and upon human input determining the importance of a number of blocks in a collection of web pages.

The model might help the search engine learn to determine which blocks are the most important, so that it can use that model on other pages that aren’t reviewed by people.

The kinds of features used in the model might include “spatial” features such as the size of blocks or their locations or both, as well as “content” features such as the number of links within a block or the number of words within the block.

A block located in the center of a page might be considered more important than another block at the bottom of a page.

Some “content” features might include:

  • The number and size of images in the block,
  • The number of links, and the number of words in each link, in the block,
  • The number of words in the text of the block,
  • User interaction in the block, looking at things like the number and size of input fields, and
  • Forms within the block, again looking at the number and size of input fields.
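
To make the spatial and content features above a little more concrete, here is a small Python sketch of how a block might be turned into a list of numbers for a model. The field names, the pixel units, and the choice to normalize position and size by the page dimensions are assumptions for illustration, not details taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class BlockFeatures:
    # Spatial features, in pixels (hypothetical field names).
    x: float
    y: float
    width: float
    height: float
    # Content features.
    image_count: int
    image_area: float
    link_count: int
    link_word_count: int
    text_word_count: int
    input_field_count: int
    form_count: int

    def to_vector(self, page_width: float, page_height: float) -> list[float]:
        """Flatten the block into numbers, scaling spatial values by page size."""
        center_x = (self.x + self.width / 2) / page_width
        center_y = (self.y + self.height / 2) / page_height
        area_ratio = (self.width * self.height) / (page_width * page_height)
        return [
            center_x, center_y, area_ratio,
            float(self.image_count), self.image_area,
            float(self.link_count), float(self.link_word_count),
            float(self.text_word_count),
            float(self.input_field_count), float(self.form_count),
        ]


article = BlockFeatures(
    x=200, y=150, width=600, height=900,
    image_count=1, image_area=120000.0,
    link_count=3, link_word_count=9, text_word_count=850,
    input_field_count=0, form_count=0,
)
vector = article.to_vector(page_width=1000, page_height=1500)
print(vector)
```

Scaling position and size by the page dimensions is one possible design choice; it lets blocks from pages of different sizes be compared on the same footing.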

When people review pages to provide input for the model, they might rate different kinds of blocks with different weights.

For example, an advertisement or copyright notice or decoration might be given a score of 1.

Navigation or directory information might be given a score of 2.

Information that is relevant to the primary topic but not of prominent importance such as “related topics” and “topic indexes” might be given a score of 3.

The most prominent part of the page, such as a headline or the main content, might be given a score of 4.
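
Here is a minimal sketch of how scores like these from human reviewers could be used to estimate the importance of a block that nobody has reviewed. It uses a simple nearest-neighbor lookup as a stand-in; this post does not name the learning method the patent uses, and the three features and all of the numbers below are invented for the example.

```python
import math

# Hypothetical training data: (feature vector, human-assigned score from 1 to 4).
# Assumed features: [relative vertical center, share of page area, links per word]
labeled_blocks = [
    ([0.95, 0.03, 0.00], 1),  # copyright notice
    ([0.30, 0.06, 0.05], 1),  # banner advertisement
    ([0.10, 0.05, 0.90], 2),  # navigation bar
    ([0.60, 0.10, 0.50], 3),  # related topics
    ([0.45, 0.50, 0.02], 4),  # main content
    ([0.50, 0.45, 0.03], 4),  # main content
]


def predict_importance(features, training, k=1):
    """Average the scores of the k labeled blocks closest to this one."""
    nearest = sorted(training, key=lambda item: math.dist(features, item[0]))[:k]
    return sum(label for _, label in nearest) / len(nearest)


unseen_block = [0.48, 0.48, 0.04]  # large, central, few links
score = predict_importance(unseen_block, labeled_blocks)
print(score)
```

The unseen block sits closest to the blocks that reviewers marked as main content, so it picks up their score. A real system would likely use many more features and labeled examples, and a more capable model.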

Conclusion

The patent describes some of the different methods that might be used to break a page up into segments, but a more detailed look at that process is available in a Microsoft paper titled Block-based Web Search (pdf).

A process for breaking web pages into smaller blocks for display on small screens is described in the Microsoft paper Adapting Web Pages for Small-Screen Devices (pdf).

The ideas of a “block-based” HITS algorithm, a block-level structure in an inverted index of web documents, a ranking of blocks similar to a ranking of pages, and expansion of query terms based upon content found in the most important blocks are discussed in Microsoft Research Asia at the Web Track of TREC 2003.

It shouldn’t be a surprise that Yahoo (pdf) and Google have been exploring the segmentation of pages to find the most important segments upon pages.


31 thoughts on “Search Engines, Web Page Segmentation, and the Most Important Block”

  1. WOW!!! This post is intense. The Microsoft PDF research file about the “Block-Level Link Analysis” is like nothing I have ever seen before. Although I don’t quite understand this new algorithm I plan on researching more to understand. Thanks for your take on this important topic.

  2. Thanks, Garrett.

    Many of the papers I linked to are two or three years old, and there’s been a little talk about them in forums and on blogs, but not a lot. I think there should be more – it’s worth pursuing more information about. :)

  3. Great stuff Bill as always… was an interesting read (as were a few the last couple weeks) and a very logical evolution IMO. Keep up the great work as I shall be reading more than writing for another 2 weeks or so… feed the need!

    (would write more but still gimpy over here)

    Dave

  4. Wonderful job as always, Bill.

    Ultimately, it’ll be best for all concerned if navbars & non-central design elements are discounted or devalued for establishing keyword relevancy. Maybe designers will find themselves going “back to the future” and only putting on the bare minimum of navigation elements that are needed for getting around their sites.

    Speaking of which…Planning to re-do your lengthy rightbar any time soon? [;-)]

  5. I was always under the impression that the google spiders read content like this.

    Header, Left Nav, Body, Right nav, Footer.

    There are some tricks to getting the spiders to read in whatever order you want them to read it as well. Some good info in your article, but it could have been a bit more in depth.

  6. Hi Dave,

    Thanks. I suspect that Microsoft is exploring ideas at this point that take the Block concepts even further. This is a push towards an “object” level ranking, where information from similar objects might be grouped together to provide even more information to a searcher.

    Hi Winooski,

    Thank you. I’m not sure that the block level concepts need to steer designers towards minimalistic approaches in navbars and non-central design elements, but it might be that semantically well coded pages may have a benefit under an approach like this one. Understanding that a search engine might treat the content in your sidebar differently than in your main content section might be helpful. Having said that, I’m fine with the lengthy sidebar at this point.

    Hi seoguys,

    That “reading” order, and the “tricks” that you mention are SEO folklore that has been around since the late 1990s. The tricks have been updated a little over time, so that one table based trick has been replaced by a CSS based trick by a few designers. I never felt a need to use either of those tricks.

    The nice thing about this granted patent is that it points to the idea that search engines and search engineers might be smarter than that, and they discuss a number of methods that have been developed that allow them to break a page into segments, and then go further into describing how they might identify the most important of those. In other words, that reading order that you mention isn’t a consideration here under this approach.

    I’m sorry that you feel that the post could have been more “in depth” but at almost 1,200 words, I thought that I did the topic some justice. If you want more details, please feel free to dig into the patent itself or the six other links that I included in the post. They do provide a fair amount of depth on this topic. :)

    And don’t be surprised to see some of these ideas spring up again in future posts.

  7. Interesting info there, Bill (as always). I’d like to read more about “important block” optimization and apply it in my sites.

  8. Pingback: Simple Block segmentation analysis • Tim Nash UK SEO Blog
  9. Solid post. I’m doing more work in an open source CMS that (interestingly enough) organizes its modules into blocks. I’m going to research this subject further as it applies to block elements from php-generated engine.

  10. Hi repair guy,

    Some open source content management systems do a nice job of breaking pages down into semantically meaningful blocks.

    As I pointed out in my comment above yours, it appears that Microsoft is evolving its research approach to one that focuses upon aggregating factual information from blocks (or objects) on different web pages. If you are going to research this subject further, it may be worth your time to learn more about how that object level ranking works.

    Another paper that is worth looking at on the topic is this one: Web Object Retrieval (pdf)

  11. Somebody already said it and I agree that this is pretty intense.

    I have been using segments for awhile for client explanations. The thought process is much easier to understand and explain with segments and personas.

  12. Hi Chris,

    Using the term “segments” sounds like a good idea. I’ve used “blocks” and “segments” interchangeably when describing the idea of a search engine breaking pages down into parts based upon topics or importance of the section or both. As long as the explanation is clear, I think most people I explain it to can capture the essence of the topic. :)

    Personas can be really fun and helpful in explaining things, too. :)

    Hi Pete,

    A header, the HTML element, can be very helpful to a search engine. When I referred to a “heading” section at the start of my post, I wasn’t talking about the heading elements (<h1>, <h2>, etc.) but rather an area on a web page at the top of the page. :)

    You are absolutely right – a header can give the basic theme of a webpage.

    Many sites use a heading block to display information that is the same from one page to another, such as the site logo and a set of primary navigation. While that information is important, it may not be as important as the unique content on each page, which may be shown in a main content area, with a main heading in that section.

    A search engine may want to focus more upon that unique content in the main content area than it does in a heading section, and may determine that it is a more important section.

  13. Isn’t this an instance of Microsoft attempting to patent “semantic mark-up”?

  14. Hi Jeff,

    Not so much an attempt to patent semantic markup as it is an effort to take advantage of it (and of understanding actual whitespace in the layout of a page), to identify which segment or segments should be indexed, or reviewed against other blogs on the web as duplicates or near duplicates, or to assign different weights to links from different blocks…

    The idea of semantic markup is that it can be read and used by programs, in methods like this. By itself though, it may not be sufficient, which is why a visual aspect of understanding segmentation upon a page is necessary.

  15. William

    As always a very in depth study of a field that does need analysis.

    When latent semantic indexing came on to the scene a few years back, I thought the content side of a page was becoming more structured/understood, but as you say (more so in your Google post) it is very difficult to prioritise the key-note of a page when the page ‘has’ to reference more than one topic. Let’s hope there are clearer guidelines in the future and not just DIV positioning!

  16. Hi David,

    It’s an interesting topic. We don’t know too much about the use of semantic indexing by search engines. There have been a number of patent filings by Google on the topic of Probabilistic Latent Semantic Indexing which are interesting and worth spending some time with if you want to jump into that line of research. Understanding the layout of a page may help a search engine index content on a level that is smaller than a page, but it’s possible that other approaches may be followed.

    Microsoft seems to have shifted slightly from looking at blocks of content on pages to trying to collect data about “objects” that appear on pages. See the link above in my response to Seo.Begun.

  17. I think there should be an HTML markup tag that groups semantic content together.
    Semantically separated content could also be nested inside of other semantic content. This would allow a page index ranking algorithm to use pre-separated semantic content. Since the page-ranking algorithm was designed to do this on abstracts that were already separated, it seems logical that a publisher should be able to indicate this as a hint as to how data is best found. If this tag is found on a page, the tag would take precedence over the segmentation algorithm unless it meets a better fit criteria.

    This idea clearly fits within the game theoretic idea of allowing a flexible system of published/searcher.

  18. Hi Michael,

    Interesting idea. It sounds a little like Yahoo!’s idea behind their Y!Q search system, which allowed site owners to tag content with special div elements within pages that they would like search engines to focus upon. It seems like Y!Q is no longer available, but I wrote about it in the following post:

    Tagging Content On Webpages, Print, and Television, with Yahoo’s Y!Q

    It was an interesting idea that wasn’t very well known. I wonder if we will see the ideas behind it re-emerge. The idea of giving a webmaster a little more control over suggesting what content on a page was the most important to index isn’t a bad idea.

  19. Hi William,

    Yesterday someone asked me this question in an interview: What is page segmentation? At that time I was not clear about this topic, and I was not able to answer the question. But now I can do that.

    Thanks again for providing such a simple and clear article on “Page Segmentation”

  20. Hi Manoj,

    You’re welcome. I’m happy to hear that my post was able to give you an understanding of how page segmentation works. Wish you had found it before that interview, though.

  21. That makes a lot of sense. It is commonly known that search engines view material near the top of a page as more credible; this is a great explanation as to why. It is also my understanding that Yahoo spiders only read 1000 lines of code on each web page. This post gives me some new ideas as to how to go about ranking SERPs!

  22. Hi Jeff,

    Thanks. I’m thinking that you might have meant that search engines would view material near the top of a page as more relevant to what the page is about, rather than as an indication of credibility. Chances are under this segmentation approach that Google would view material at the top of a main content area as more relevant to what a page is about than content in a heading section, or sidebars or a footer though.

    Not sure about Yahoo’s crawling habits, though we know that Microsoft is fulfilling Yahoo’s search database needs these days, so I wouldn’t rely too much on any rumors or information about the crawling habits of Yahoo.
