Classifying Web Blocks
In the earlier days of SEO, many search engine optimization consultants stressed placing important and valuable content toward the top of a page’s HTML code, based upon the idea that search engines would weigh content more heavily if it appeared early in a document. There are still very well known SEO consultants whose sites describe a “table trick” for moving the main body content of a page above its sidebar navigation within the HTML by using tables. I’ve also seen a similar trick done with CSS absolute positioning, where less important content appears higher on the page that visitors see, but lower in the HTML code for the page.
Back in 2003, the folks at Microsoft Research Asia published a paper titled VIPS: a Vision-based Page Segmentation Algorithm. The abstract for the paper describes the approach, telling us that:
A new web content structure analysis based on visual representation is proposed in this paper. Many web applications such as information retrieval, information extraction, and automatic page adaptation can benefit from this structure. This paper presents an automatic top-down, tag-tree independent approach to detect web content structure. It simulates how a user understands web layout structure based on his visual perception. Comparing to other existing techniques, our approach is independent to underlying documentation representation such as HTML and works well even when the HTML structure is far different from layout structure.
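To give a rough feel for that top-down idea, here is a minimal sketch of a VIPS-like segmentation pass. It is not the actual VIPS algorithm: the rendered-node structure, the coherence measure, and the threshold below are all simplified illustrations of how visual cues might drive the splitting.

```python
from dataclasses import dataclass, field

@dataclass
class RenderedNode:
    """A DOM node annotated with visual cues from a rendered page."""
    tag: str
    width: int
    height: int
    children: list = field(default_factory=list)

def coherence(node: RenderedNode) -> float:
    """Toy stand-in for VIPS's 'degree of coherence': call a node
    coherent when its children are visually similar in size."""
    if not node.children:
        return 1.0
    widths = [c.width for c in node.children]
    heights = [c.height for c in node.children]
    return min(min(widths) / max(widths), min(heights) / max(heights))

def segment(node: RenderedNode, threshold: float = 0.5) -> list:
    """Top-down segmentation: keep a node as a single block if it
    looks coherent; otherwise recurse into its children."""
    if not node.children or coherence(node) >= threshold:
        return [node]
    blocks = []
    for child in node.children:
        blocks.extend(segment(child, threshold))
    return blocks

# A page with a header, plus a body area split into sidebar and main content
page = RenderedNode("body", 1000, 1200, [
    RenderedNode("div#header", 1000, 100),
    RenderedNode("div#wrap", 1000, 1100, [
        RenderedNode("div#sidebar", 200, 1100),
        RenderedNode("div#main", 800, 1100),
    ]),
])
print([block.tag for block in segment(page)])
# -> ['div#header', 'div#sidebar', 'div#main']
```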
Microsoft was granted a patent on the VIPS approach, and I was leaning heavily towards including it as one of the ten most important SEO patents, but something was missing from it. While it described how pages might be segmented and different parts isolated from one another, it didn’t describe how those parts might be differentiated from each other, or say much about why a search engine would go through a process like this.
In these days of headless browsers, search engine crawlers not only identify where content appears within the HTML of a page, but can also get an idea of where that content appears on the rendered page by simulating how a browser might display it. We’ve heard from Yahoo how they work to segment the content of web pages and interpret the layout of pages. See also my post Breaking Pages Apart: What Automatic Segmentation of Web pages Might Mean to Design and SEO.
Google was also granted a patent on a page segmentation process earlier this year, even though the patent was originally filed way back in 2004. Page segmentation is something the search engines have been thinking about for a long time.
For example, you may have heard that links from some sections of a web page may carry different amounts of weight than links from other sections of the page. Here’s a video of Matt Cutts telling us that links from the footer of a page might not weigh as much as links from a paragraph of text in the middle of the page:
To carry that a step further, a search engine might break a page down into segments or blocks, like those described in the VIPS paper, and calculate independent PageRanks for each of those blocks, as described in the paper Block-level Link Analysis.
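As a loose illustration of that block-level idea, here’s a short sketch that runs PageRank over blocks instead of whole pages, so a link from a footer block and a link from a main content block become different sources in the graph. The block names, link graph, and damping factor are invented for the example, not taken from the paper.

```python
def block_pagerank(links: dict, damping: float = 0.85, iters: int = 50) -> dict:
    """Power iteration over a graph whose nodes are page *blocks*.
    `links` maps each block id to the block ids it links to."""
    blocks = set(links) | {b for targets in links.values() for b in targets}
    rank = {b: 1.0 / len(blocks) for b in blocks}
    for _ in range(iters):
        new = {b: (1.0 - damping) / len(blocks) for b in blocks}
        for src, targets in links.items():
            if targets:
                # A block's rank is shared evenly among the blocks it links to
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank

# Links from a main content block vs. a footer block on the same page
links = {
    "pageA/main":   ["pageB/main"],
    "pageA/footer": ["pageC/main"],
    "pageB/main":   ["pageA/main"],
    "pageC/main":   [],
}
for block, score in sorted(block_pagerank(links).items()):
    print(f"{block}: {score:.3f}")
```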
It was also tempting to point at the Google patent on page segmentation in this post, because it provided some ideas on how Google might use segmentation in indexing and ranking pages.
But I liked the way the Microsoft patent, Classifying functions of web blocks based on linguistic features, described how blocks found on web pages might be classified based upon features associated with those blocks, and I’m calling it one of the 10 most important SEO patents worth reading to better understand search engine optimization.
The patent does provide us with an idea of how a search engine might understand the different blocks that it finds on a page, and use them when it indexes, analyzes, and classifies content on that page. For example, a section that contains very short phrases, with each word capitalized and each phrase a link to another page on the site, appearing near the top of the page or in a sidebar on the left, might be the main navigation for that page.
A section that contains full sentences, with punctuation and capitalized first letters, appearing in the center of a page might be a main content area of the page, and that content could be weighed more heavily when the page is indexed, with no table or CSS tricks required.
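Here’s a small sketch of what a classifier along those lines might look like. The specific features, thresholds, and labels are my own guesses at the kinds of linguistic cues the patent describes, not the patent’s actual classifier.

```python
import re

def classify_block(text: str, link_count: int, position: str) -> str:
    """Guess a block's function from simple linguistic cues:
    title-cased short phrases that are mostly links suggest navigation,
    while punctuated full sentences in the center suggest main content."""
    phrases = [p.strip() for p in re.split(r"[\n|]+", text) if p.strip()]
    avg_words = sum(len(p.split()) for p in phrases) / max(len(phrases), 1)
    title_cased = sum(p.istitle() for p in phrases) / max(len(phrases), 1)
    has_sentences = bool(re.search(r"[.!?]\s", text + " "))

    if avg_words <= 3 and title_cased > 0.5 \
            and link_count >= len(phrases) - 1 and position in ("top", "left"):
        return "navigation"
    if has_sentences and avg_words > 5 and position == "center":
        return "main content"
    return "other"

print(classify_block("Home\nAbout Us\nContact Us\nOur Services", 4, "left"))
# -> navigation
print(classify_block("This patent describes blocks. It uses features.", 0, "center"))
# -> main content
```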
I wrote a fairly detailed post about the patent at How a Search Engine Might Identify the Functions of Blocks in Web Pages to Improve Search Results, and the post includes links to several other posts I’ve written about segmentation as well.
Another thing that a search engine might try to avoid is having a page rank well for a specific multiple-word query when that page covers multiple topics in different main content sections, like a news page or a blog homepage showing posts about different topics, and the query’s words appear in different blocks or segments. This is another problem that a page segmentation approach can help address.
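As a toy example of why block boundaries matter for that kind of query, the sketch below requires all of the query terms to appear within a single block, which keeps a multi-topic page from matching a query whose words are scattered across unrelated sections. Both matching functions are purely illustrative.

```python
def matches_whole_page(blocks: list, query: str) -> bool:
    """Naive matching: treat the page as one bag of words."""
    terms = query.lower().split()
    words = " ".join(blocks).lower()
    return all(t in words for t in terms)

def matches_some_block(blocks: list, query: str) -> bool:
    """Block-aware matching: all terms must co-occur in one block."""
    terms = query.lower().split()
    return any(all(t in b.lower() for t in terms) for b in blocks)

# A blog homepage with two unrelated posts
blocks = [
    "Recipes for baking sourdough bread at home",
    "A review of electric guitar amplifiers",
]
query = "bread amplifiers"
print(matches_whole_page(blocks, query))  # True  - terms scattered across posts
print(matches_some_block(blocks, query))  # False - no single block has both
```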
Having an idea of how search engines may work to understand the content they find on your pages, and the different sections of those pages, is essential to the practice of SEO. This patent on classifying web blocks helps explain how search engines may view the different parts of web pages.
All parts of the 10 Most Important SEO Patents series:
Part 1 – The Original PageRank Patent Application
Part 2 – The Original Historical Data Patent Filing and its Children
Part 3 – Classifying Web Blocks with Linguistic Features
Part 4 – PageRank Meets the Reasonable Surfer
Part 5 – Phrase Based Indexing
Part 6 – Named Entity Detection in Queries
Part 7 – Sets, Semantic Closeness, Segmentation, and Webtables
Part 8 – Assigning Geographic Relevance to Web Pages
Part 9 – From Ten Blue Links to Blended and Universal Search
Part 10 – Just the Beginning
I have to say after reading a few of your articles that this is one of the most insightful SEO blogs I have read. Understanding patents can help everyone involved in internet marketing appreciate what it takes to get the most value out of their content and efforts. As a small business owner I am pretty much everything when it comes to my business, and that means I wear the SEO hat too.
Understanding how search engines might look at a page will help me write better content from the standpoint of getting better search exposure, but it will also help me provide a bit more value to clients.
The first paragraph reminds me of the first StomperNet course I bought several years ago. They taught doing exactly that, and I think the tactic worked well back then.
I would be curious to know what you think about this post: http://www.seomoz.org/blog/just-how-smart-are-search-robots
Bill, you’ve wowed me once again. I’m so excited to read this and feel it’s about time. I often wondered why Google was not doing this as it seemed pretty straightforward to me. Glad I trusted my gut and stayed away from page segmentation tactics.
Now if Google can apply some of the same or similar principles to whole websites and filter out those local business websites trying to optimize for every city within 300 miles, I’ll be even happier. Or maybe they already are. Do you know? If they aren’t, Google, if you’re listening, I can point the way to patterns that should help.
Is it important that a domain is paid for over 3 years? I have read that if Google detects that a domain is registered for several years, it is understood to be of higher quality. Is that true?
I read many articles related to SEO on many websites, but the information that you provide is unique and effective. I have read most of your articles and I think you are doing really well. Keep up the good work.
Hi Sonia,
Thank you for your kind words. Search related patents don’t hold all of the answers. A search engine might decide to change what a patent describes, or it might use a different approach entirely, or it could have followed what was in the patent and then come up with another approach.
But reading the patents can give you ideas of what search engineers are thinking about, and of many of the assumptions that they make about search and the Web. I like to read them to try to get the perspective of people working for the search engines. Sometimes they describe things that you can try, but don’t necessarily bet your business on them.
I know that SEO can be a challenge to small business owners, and that it’s one hat amongst many that small business owners wear. Definitely be cautiously skeptical when you read about SEO tactics, even if the source is something you might have learned directly from the search engines like patents are.
I’d definitely recommend that if you’re doing your own SEO, start a hobby website that you can experiment on, so that if something goes wrong, it won’t negatively impact the success of your business.
Hi Kathy,
Thank you. I’ve pretty much stayed away from those table tricks and CSS tricks that focus upon putting important content as close to the top of the HTML code as possible.
It’s the segmentation approach that actually makes those tricks worthless. If you rely upon tricks like that, and they work, it’s possible that Google might stop rewarding them, and your rankings could tank overnight.
I don’t think that page segmentation is really helpful for filtering out businesses that try to rank well in local search everywhere through Google Local. But there are other approaches that Google might take to try to find those businesses. I’m going to include at least one “local search” patent in this series of posts, one which Google is quite possibly using to try to understand the locations of businesses.
Hi Charles,
I saw that advice on table tricks on the website of a very well known SEO company this morning. It’s possible that the approach may have made a difference back then, but there are a lot of benefits to a search engine in using a page segmentation approach. It’s worth creating pages with the understanding that a search engine may be using that approach; a page can still rank well without a trick like that.
Funny that you mention the SEOmoz post on search robots. I don’t know if you noticed, but I linked to it in the post. The segmentation process described in the patents relies upon an understanding of how a browser might show a page, rather than just reading the HTML of a page. The Google patent refers to a simulated or pseudo browser used by a crawler, very much like what the SEOmoz post describes as a headless browser.
Hi javijredondo,
I don’t believe that it makes any difference at all how long a domain is registered for.
The historical data patent that I wrote about in the second post in this series tells us that spammers usually register domain names for a single year, while non-spammers usually register domain names for a number of years. But chances are that most domains are only registered for a year. I tried to register a domain name for longer than a year for one site, and my host treated me like I was crazy. They were responsible for a lot of sites (which typically were auto-renewed when they expired), but acted like it was the first time anyone had asked them to register a domain for longer than a year.
Go Daddy must have caught wind of the statement in the patent, and started telling people a few years ago on the Go Daddy site that they needed to register their domains for longer than a year or their rankings would drop in Google. Matt Cutts came out and debunked the idea that you need to buy a domain for more than a year or Google will think you are a spammer.
Matt specifically debunks this, and addresses the patent as well, in this video:
Is the time left before your domain registration expires an SEO factor?
Great post.
It’s been just a few days since I started coming to your website to get more relevant information about search engine techniques and the way they produce search results.
And I am glad to say that your posts are rocking: unique and more valuable than those of any others.
Thanks A Lot, Keep it up!
Bill,
Great post, and thanks for taking the time to break down what these patents mean for us link builders. I think you’re right; having a good understanding of how search engines work is key.
I also think it’s important that we know where search engines are going to be in a few years’ time, so we can give their customers what they want in terms of quality content and help them find it.
Great post. I’m trying to read every single one of your posts. They are all so informative. Keep going.
Mel
Hi Colm,
We can’t account for every change that might happen at the search engines, but updates like Panda and the recent freshness updates shouldn’t really be able to come along and surprise us, like they surprised many. Same with Google’s increased look at social signals.
Hi Bill,
Personally, I think that segmenting pages into blocs is a great idea for search engines. I have a few posts on my blog where the comments have really built up over the years, and I am pretty sure that they would be diluting the on-page SEO in terms of KW density and the like IF this bloc analysis technology were not in place.
It seems that this type of thing would buffer search results against web pages that were warped in this way, perhaps by UGC.
Mark
Hi Mark,
Interesting thoughts. There definitely is some value for blogs and ecommerce pages in allowing comments and letting people leave reviews on product pages. For ecommerce pages, especially ones that rely upon descriptions from publishers, manufacturers, or distributors, unique user generated content (UGC) can make them stand out from other pages that are licensed to use those same descriptions. And comments that are relevant and on topic can add to the value of a page by introducing related terms, broadening the discussion or helping to focus it more narrowly as well.
But yes, that additional user generated content also has the potential to dilute and diffuse the original message that might be presented on a blog post or on an ecommerce page.
So there’s both a blessing and a curse when it comes to user comments and reviews. It’s possible that search engines following a segmentation process wouldn’t ignore comments or reviews completely, but might instead give words found within them less weight than content found in a main content area of a page.
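To make that a little more concrete, here’s a rough sketch of that kind of discounting; it’s entirely my own illustration, with invented segment weights, rather than anything taken from a patent:

```python
# Illustrative only: the same term counts, weighted by segment, so
# words in comments still contribute, just less than main content.
SEGMENT_WEIGHTS = {"main content": 1.0, "comments": 0.3, "footer": 0.1}

def weighted_term_score(segments: dict, term: str) -> float:
    """`segments` maps a segment label to the text found in it."""
    score = 0.0
    for label, text in segments.items():
        count = text.lower().split().count(term.lower())
        score += SEGMENT_WEIGHTS.get(label, 0.5) * count
    return score

page = {
    "main content": "a post about page segmentation and web blocks",
    "comments": "bloc analysis bloc analysis bloc analysis",
}
print(weighted_term_score(page, "segmentation"))  # 1.0, from the post itself
print(weighted_term_score(page, "bloc"))          # 0.9, from comments only
```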
To use your comment specifically as an example now, it’s more likely that this page may rank more highly for “bloc analysis” or “bloc level analysis” than it would before you left your comment. Your comment might not be given as much weight as the original post, but you’ve broadened the topic linguistically in a way that the original post didn’t before.