Classifying Web Blocks
In the earlier days of SEO, many search engine optimization consultants stressed placing important and valuable content toward the top of a page’s HTML code, based upon the idea that search engines would weigh content more heavily if it appeared early in a document. There are still very well known SEO consultants who include information about a “table trick” on their sites, describing how to use tables to move the main body content for a page above the sidebar navigation within the page’s HTML. I’ve also seen a similar trick used with CSS absolute positioning, where less important content appears higher on the page that visitors actually see, but lower in the HTML code for that page.
The abstract of the VIPS (Vision-based Page Segmentation) paper describes a different way of looking at the structure of a page:
A new web content structure analysis based on visual representation is proposed in this paper. Many web applications such as information retrieval, information extraction and automatic page adaptation can benefit from this structure. This paper presents an automatic top-down, tag-tree independent approach to detect web content structure. It simulates how a user understands web layout structure based on his visual perception. Comparing to other existing techniques, our approach is independent to underlying documentation representation such as HTML and works well even when the HTML structure is far different from layout structure.
Microsoft was granted a patent on the VIPS approach, and I was leaning heavily towards including that patent as one of the ten most important SEO patents, but there was something missing from it. While it described how pages might be segmented and different parts isolated from one another, it really didn’t describe how they might be differentiated from each other, or much about why the search engine would go through with a process like this.
In these days of headless browsers, search engine crawlers not only identify where content appears within the HTML of a page, but also have the ability to get an idea of where that content actually appears on a page by simulating how a browser might display that content. We’ve heard from Yahoo how they work to segment the content of web pages and interpret the layout of pages. See also my post on Breaking Pages Apart: What Automatic Segmentation of Web pages Might Mean to Design and SEO.
Google was also granted a patent on a page segmentation process earlier this year, even though the patent was originally filed way back in 2004. Page segmentation is something the search engines have been thinking about for a long time.
For example, you may have heard that links from some sections of a web page may carry different amounts of weight than links from other sections of that page. In a video, Matt Cutts has told us that links from footers on pages might not weigh as much as links from a paragraph of text in the middle of a page.
To carry that a step further, it’s possible that a search engine might break a page down into segments or blocks, like those described in the VIPS paper, and calculate independent PageRanks for each of those blocks, as described in the white paper, Block-level Link Analysis.
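The idea that a link's weight might depend on the block it sits in can be sketched in a few lines of Python. This is not the algorithm from the Block-level Link Analysis paper, and it is not how any search engine is known to work. It is a simplified PageRank where each link is scaled by an assumed importance for its block. The function name, the block labels, and the weights (1.0 for content, 0.2 for footers) are all hypothetical.

```python
def block_weighted_pagerank(links, block_weight, damping=0.85, iterations=50):
    """Simplified PageRank where each link is scaled by its block's weight.

    links: list of (source_page, block_label, target_page) tuples.
    block_weight: dict mapping a block label to an assumed importance.
    """
    pages = sorted({p for s, _, t in links for p in (s, t)})
    n = len(pages)

    # Total weighted out-links per page, used to normalize each page's votes.
    out_weight = {p: 0.0 for p in pages}
    for source, block, _ in links:
        out_weight[source] += block_weight[block]

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        # Pages with no weighted out-links spread their rank evenly.
        dangling = sum(rank[p] for p in pages if out_weight[p] == 0)
        for source, block, target in links:
            if out_weight[source] == 0:
                continue
            share = block_weight[block] / out_weight[source]
            new_rank[target] += damping * rank[source] * share
        for p in pages:
            new_rank[p] += damping * dangling / n
        rank = new_rank
    return rank


# Hypothetical example: page A links to B from its main content
# and to C from its footer.
example_links = [
    ("A", "content", "B"),
    ("A", "footer", "C"),
    ("B", "content", "A"),
    ("C", "content", "A"),
]
example_weights = {"content": 1.0, "footer": 0.2}
example_ranks = block_weighted_pagerank(example_links, example_weights)
```

With these made-up weights, page B ends up with more rank than page C, even though page A links to both, because the link to C comes from the footer.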
It was also tempting to point at the Google patent on page segmentation with this post, because it provided some ideas on how Google might use segmentation in indexing and ranking pages.
But I really liked the way the Microsoft patent, Classifying functions of web blocks based on linguistic features, described how a search engine might classify blocks that it found on web pages based upon features associated with those blocks, and I’m calling it one of the 10 most important SEO patents worth reading to help you understand search engine optimization better.
The patent does provide us with an idea of how a search engine might understand the different blocks that it finds on a page, and use those when it indexes, analyzes and classifies content on that page. For example, a section of a page that contains very short phrases, with each word capitalized and each phrase a link to another page on the site, and that appears near the top of the page or in a sidebar to the left, might be the main navigation for that page.
A section that contains full sentences, with punctuation, with capitalized first letters for each sentence, that appears in the center of a page might be a main content area of the page, and that content should be weighed more heavily when indexing the page, with no table or CSS tricks required.
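The two examples above can be turned into a rough rule-based sketch. This is only an illustration of the kind of linguistic and positional features involved, not the method claimed in the patent. The `Block` fields, the thresholds, and the labels are all assumptions made up for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    """A page segment. Every field is a hypothetical feature."""
    phrases: list[str]   # text units in the block (link labels or sentences)
    link_count: int      # how many of those phrases are links
    position: str        # "top", "left", "center" or "bottom"


def is_title_case(phrase: str) -> bool:
    """True when every word in the phrase starts with a capital letter."""
    words = phrase.split()
    return bool(words) and all(w[0].isupper() for w in words)


def ends_like_sentence(phrase: str) -> bool:
    """True when the phrase ends with sentence punctuation."""
    return bool(re.search(r"[.!?]$", phrase.strip()))


def classify(block: Block) -> str:
    """Guess a block's function from simple features (assumed thresholds)."""
    n = len(block.phrases)
    if n == 0:
        return "unknown"
    avg_words = sum(len(p.split()) for p in block.phrases) / n
    title_ratio = sum(is_title_case(p) for p in block.phrases) / n
    sentence_ratio = sum(ends_like_sentence(p) for p in block.phrases) / n
    link_ratio = block.link_count / n

    # Short, capitalized, linked phrases near the top or left: navigation.
    if (avg_words <= 3 and title_ratio >= 0.8 and link_ratio >= 0.8
            and block.position in ("top", "left")):
        return "navigation"
    # Full punctuated sentences in the center of the page: main content.
    if avg_words >= 8 and sentence_ratio >= 0.8 and block.position == "center":
        return "main_content"
    return "unknown"


nav_block = Block(["Home", "About Us", "Contact"], link_count=3, position="top")
body_block = Block(
    [
        "Search engines may segment a page into separate visual blocks.",
        "Each block can then be weighed differently during indexing.",
    ],
    link_count=0,
    position="center",
)
```

Here `classify(nav_block)` returns `"navigation"` and `classify(body_block)` returns `"main_content"`. A real system would more likely learn weights for features like these from labeled examples than rely on hand-picked thresholds.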
I wrote a fairly detailed post about the patent at How a Search Engine Might Identify the Functions of Blocks in Web Pages to Improve Search Results, and the post includes links to a number of other posts I’ve written about segmentation as well.
A search engine might also try to avoid having a page rank well for a specific multiple-word query when the words in that query appear in different blocks or segments of a page that covers multiple topics, such as a news page or a blog homepage that shows multiple posts about different topics. This is another problem that a page segmentation approach can help to address.
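To make that concrete, here is a small sketch that compares a page-level match with a block-level match for a multiple-word query. The function names and the sample blocks are hypothetical, and real matching would involve far more than checking for shared words.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased set of the words and numbers in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def page_level_match(query: str, blocks: list[str]) -> bool:
    """True when every query term appears somewhere on the page."""
    terms = tokens(query)
    return bool(terms) and terms <= tokens(" ".join(blocks))


def blocks_matching_all_terms(query: str, blocks: list[str]) -> list[int]:
    """Indexes of the blocks that contain every query term on their own."""
    terms = tokens(query)
    if not terms:
        return []
    return [i for i, text in enumerate(blocks) if terms <= tokens(text)]


# A made-up blog homepage with two unrelated posts, one per block.
homepage_blocks = [
    "New hiking trails opened in Colorado this spring.",
    "Our review of the best espresso machines.",
]
```

For the query "Colorado espresso", `page_level_match` returns `True`, but `blocks_matching_all_terms` returns an empty list, because the two words come from unrelated posts. For "espresso machines", the second block matches on its own.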
Having an idea of how search engines may work to understand the content they find on your pages, and the different sections on those pages, is essential to the practice of SEO.
All parts of the 10 Most Important SEO Patents series:
Part 1 – The Original PageRank Patent Application
Part 2 – The Original Historical Data Patent Filing and its Children
Part 3 – Classifying Web Blocks with Linguistic Features
Part 4 – PageRank Meets the Reasonable Surfer
Part 5 – Phrase Based Indexing
Part 6 – Named Entity Detection in Queries
Part 7 – Sets, Semantic Closeness, Segmentation, and Webtables
Part 8 – Assigning Geographic Relevance to Web Pages
Part 9 – From Ten Blue Links to Blended and Universal Search
Part 10 – Just the Beginning