Web Blocks Based On Linguistic Features
Over the past five or six years, the major search engines have published a number of patent filings and whitepapers describing how they might use Web page blocks or segments to understand the main topic or topics of a page, decide which block is the most important, choose what to show on the smaller screens of mobile devices, and weight links differently depending on the block in which they appear. This patent looks at linguistic features to understand Web blocks better.
I’ve written about a number of those in the past in posts such as:
- Breaking Pages Apart: What Automatic Segmentation of Webpages Might Mean to Design and SEO
- How a Search Engine Might Analyze the Linking Structure of a Web Site
- Microsoft Granted Patent on Vision-Based Document Segmentation (VIPS)
- Microsoft Playing with Blocks to Understand How Images Might be Related
- The Importance of Page Layout in SEO
- Smaller Screens Make Smarter Search Engines
- Yahoo Web Page Segmentation: Distinguishing Noise from Information
- Search Engines, Web Page Segmentation, and the Most Important Block
- Google and Document Segmentation Indexing for Local Search
One of the areas that most of the patents and papers that I wrote about didn’t delve into in much detail is how a search engine might understand the functions of blocks that they identify. Is a block primarily navigation, or advertising, or a footer? Does it contain mostly information or noise? Is it decorative in purpose, or primarily a way to interact with the site owner, such as a contact form?
A newly granted patent from Microsoft describes how they might look at linguistic and layout features associated with web page blocks to learn more about what their function is.
There are a few reasons for a search engine to break web pages into blocks.
One is to be able to return better search results.
If a page is about sports fishing, and there’s an advertisement on the same page for diet drinks, a search engine really wouldn’t want to show that page to searchers looking for diet drinks.
And speaking of ads, if the page is again about sports fishing and is showing ads from a search engine, it may want to classify the page as one about sports fishing based upon what’s found in the main content area of the page, and show ads about fishing poles, charter boats, and beach-based vacation spots, and not diet drinks.
Why Classify Blocks On Pages By Linguistic Features
Classifying the functions of different blocks on a page can help with:
- Classifying Pages
- Clustering the pages with other similar pages
- Extracting Topics from those pages
- Breaking Pages apart for display on handheld devices
- Highlighting Blocks that might be of interest to searchers
- Fragment-based caching
- Summarizing content, and
- Ranking pages
The patent is:
Classifying functions of web blocks based on linguistic features
Invented by Wei-Ying Ma, Xiangye Xiao, and Xing Xie
Assigned to Microsoft
US Patent 7,895,148
Granted February 22, 2011
Filed: April 30, 2007
A classification system trains a classifier to classify blocks of the web page into various classifications of the function of the block. The classification system trains a classifier using training web pages.
To train a classifier, the classification system identifies the blocks of the training web pages, generates feature vectors for the blocks that include a linguistic feature, and inputs classification labels for each block. The classification system learns the coefficients of the classifier using any of a variety of machine learning techniques. The classification system can then use the classifier to classify blocks of web pages.
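The patent leaves the choice of machine learning technique open, so the following is only a minimal sketch of the idea, using a simple multiclass perceptron as a stand-in. The four-value feature vectors, the feature names in the comments, and the three labels are all assumptions made up for illustration; none of them come from the patent.

```python
def predict(weights, x):
    """Return the label whose weight vector scores the feature vector highest."""
    def score(label):
        return sum(w * v for w, v in zip(weights[label], x))
    return max(sorted(weights), key=score)


def train_perceptron(examples, labels, epochs=20):
    """Learn one weight vector (the 'coefficients') per block label."""
    weights = {label: [0.0] * len(examples[0]) for label in set(labels)}
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            guess = predict(weights, x)
            if guess != y:
                # Nudge the correct label up and the wrongly chosen label down.
                for i, v in enumerate(x):
                    weights[y][i] += v
                    weights[guess][i] -= v
    return weights


# Hypothetical feature vectors for hand-labeled training blocks:
# [bias, share of phrases that are full sentences, share of words in links,
#  footer cue term present]
training_vectors = [
    [1.0, 0.0, 0.9, 0.0],
    [1.0, 0.1, 1.0, 0.0],
    [1.0, 0.9, 0.1, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.2, 0.3, 1.0],
    [1.0, 0.1, 0.4, 1.0],
]
training_labels = [
    "navigation", "navigation",
    "content", "content",
    "footer", "footer",
]

weights = train_perceptron(training_vectors, training_labels)
new_block = [1.0, 0.05, 0.95, 0.0]  # short phrases, almost all link text
print(predict(weights, new_block))
```

A real system would presumably use far more features and far more labeled blocks than this, and could swap in any other learner that produces coefficients from labeled feature vectors.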
The patent describes a machine-based training process that might look at a number of features related to blocks on web pages.
These can include:
- Linguistic features found within text in a block such as the parts of speech used (verbs, nouns, adjectives, etc.) and capitalization.
- Layout features about a block such as the size of the block, and the position of the block within the web page.
We’re told in the patent that web page blocks with different functions often have different linguistic features.
For example, a block containing navigation will usually have extremely short phrases and no sentences.
The main content, which includes the primary topic of the page, will usually contain complex sentences.
That main content section also often includes named entities, such as specific persons, places and things.
Some types of blocks contain specific terms, such as a footer using the words “copyright,” “privacy,” “rights,” “reserved,” and so on. The terms “sponsored,” “ad” or “advertisement” can help a search engine recognize an advertisement block.
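As a rough sketch of how cue terms like those might be turned into numbers, the snippet below counts them in the text of a block. The term lists are seeded from the examples just mentioned, and the dictionary and function names are hypothetical; a search engine would presumably learn which terms matter rather than hard-code them.

```python
import re

# Hypothetical cue-term lists, taken from the examples in the post.
CUE_TERMS = {
    "footer": {"copyright", "privacy", "rights", "reserved"},
    "advertisement": {"sponsored", "ad", "advertisement"},
}


def cue_term_counts(block_text):
    """Count how many words in a block match each list of cue terms."""
    # Split into whole lowercase words so "ad" does not match inside "adding".
    words = re.findall(r"[a-z]+", block_text.lower())
    return {
        label: sum(1 for word in words if word in terms)
        for label, terms in CUE_TERMS.items()
    }


print(cue_term_counts("Copyright 2011. All rights reserved. Privacy policy"))
print(cue_term_counts("Sponsored links: an ad for diet drinks"))
```

Counts like these would be just one group of entries in a block's feature vector, alongside the other features described here.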
The linguistic features may include parts-of-speech features, named entity features, symbolic features, and capitalization features.
The patent gives us some examples of those.
Parts-of-Speech Features – The text of a block might be submitted to a natural language processor which would tag each word as a specific part of speech, such as nouns, pronouns, verbs, adjectives, adverbs, foreign words, prepositions, conjunctions, and more. It might then count the occurrences of each part of speech within the text of a block. So, a block might have 10 nouns, 5 verbs, 7 adjectives, and 2 prepositional conjunctions. Those numbers might be considered a linguistic feature of a block to be compared against other blocks.
Named Entity Features – Named entities mentioned in a block may include references to specific persons, places, and things such as “Bill Gates,” “Redmond,” and “Microsoft.”
Symbolic Features – These symbols can be broken down into punctuation and non-punctuation.
Capitalization Features – Which words are capitalized? The first word in a sentence, or every word?
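The symbolic and capitalization features are simple enough to sketch with nothing but string handling. The function name, the feature names, and the particular set of characters treated as punctuation below are assumptions for illustration. Parts-of-speech and named entity counts would need a natural language processing library and aren't shown.

```python
import string

# Assumed set of characters to treat as punctuation; anything else that is
# neither a letter, a digit, nor whitespace is counted as "other symbols".
PUNCTUATION = set(".,;:!?'\"()-")


def surface_features(block_text):
    """Count capitalization and symbol features in the text of a block."""
    words = [token.strip(string.punctuation) for token in block_text.split()]
    words = [word for word in words if word]
    symbols = [ch for ch in block_text if not ch.isalnum() and not ch.isspace()]
    punctuation = [ch for ch in symbols if ch in PUNCTUATION]
    return {
        "words": len(words),
        "capitalized_words": sum(1 for word in words if word[0].isupper()),
        "punctuation": len(punctuation),
        "other_symbols": len(symbols) - len(punctuation),
        "sentence_enders": sum(block_text.count(ch) for ch in ".!?"),
    }


navigation = "Home | About Us | Contact"
content = "Bill Gates founded Microsoft. The company is based in Redmond, Washington."
print(surface_features(navigation))
print(surface_features(content))
```

In this toy comparison, every word in the navigation text is capitalized and there are no sentence-ending marks, while the main content text mixes capitalized and lowercase words and ends its sentences with periods.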
There are also a number of features related to the layout of a web page block that can provide clues to a search engine about the functions behind that block.
The patent provides an example of one classification system where those are categorized as spatial features, presentation features, tag features, and hyperlink features, and some examples of how those might be identified.
Spatial Features – involve the size and location of a block within the web page. For example, a copyright block is often at the bottom of a web page.
- X and Y coordinates of the center point of a block/page
- Width and height of a block/page
Presentation Features – involve a look at how content is presented on a page, such as font size, number of images in a block, number of words within a block, and so on.
- Maximum font size of the inner text in a block/page
- Maximum font weight of the inner text in a block/page
- Number of words in the inner text in a block/page
- Number of words in the anchor text in a block/page
- Number of images in a block/page
- Total size of images in pixels in a block/page
- Total size of form fields in pixels in a block/page
Tag Features – indicate the types of HTML elements in the markup language within a block. So, if you see a form element and an input element, that could indicate that the block those appear within is one involving interaction.
- Number of form and input tags in a block/page: <form>, <input>, <option>, <select>, etc.
- Number of table tags in a block/page: <table>, <tr>, <td>
- Number of paragraph tags in a block/page: <p>
- Number of list tags in a block/page: <li>,<dd>, <dt>
- Number of heading tags in a block/page: <h1>, <h2>, <h3>
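Counting tags like these is straightforward once a block's HTML has been isolated. Here is a small sketch using Python's standard library HTML parser. The grouping of tags and the names `TAG_GROUPS`, `TagCounter`, and `tag_features` are my own assumptions, and the form group uses the standard `select` element.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed grouping of tags, following the categories listed above.
TAG_GROUPS = {
    "form": {"form", "input", "option", "select"},
    "table": {"table", "tr", "td"},
    "paragraph": {"p"},
    "list": {"li", "dd", "dt"},
    "heading": {"h1", "h2", "h3"},
}


class TagCounter(HTMLParser):
    """Tally opening tags by the group they belong to."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for group, tags in TAG_GROUPS.items():
            if tag in tags:
                self.counts[group] += 1


def tag_features(block_html):
    """Return the number of tags from each group found in a block of HTML."""
    parser = TagCounter()
    parser.feed(block_html)
    parser.close()
    return {group: parser.counts[group] for group in TAG_GROUPS}


search_box = '<form><input name="q"><select><option>All</option></select></form>'
article = "<h1>Sports Fishing</h1><p>First.</p><p>Second.</p><ul><li>Bait</li></ul>"
print(tag_features(search_box))
print(tag_features(article))
```

A block with several form tags and nothing else looks like a candidate for an interaction block, while one with headings and paragraphs looks more like main content.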
Hyperlink Features – could indicate that a block is navigational in nature.
- Total number of hyperlinks in a block/page
- Number of intrasite hyperlinks in a block/page
- Number of inter-site hyperlinks in a block/page
- Number of hyperlinks on anchor text in a block/page
- Number of hyperlinks on images in a block/page
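The hyperlink counts could be gathered in a similar way. The sketch below assumes that "intrasite" means a link whose host name exactly matches the host of the page the block sits on, which is a simplification; the patent doesn't spell out that test, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Count links in a block, split by destination and by what is linked."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.site = urlparse(page_url).netloc.lower()
        self.features = {
            "total": 0, "intrasite": 0, "intersite": 0,
            "text_links": 0, "image_links": 0,
        }
        self._in_link = False
        self._link_has_image = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._in_link = True
            self._link_has_image = False
            self.features["total"] += 1
            # Resolve relative links against the page URL before comparing hosts.
            host = urlparse(urljoin(self.page_url, attrs["href"])).netloc.lower()
            self.features["intrasite" if host == self.site else "intersite"] += 1
        elif tag == "img" and self._in_link:
            self._link_has_image = True

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            key = "image_links" if self._link_has_image else "text_links"
            self.features[key] += 1
            self._in_link = False


def hyperlink_features(block_html, page_url):
    """Return hyperlink counts for a block of HTML found on page_url."""
    parser = LinkCollector(page_url)
    parser.feed(block_html)
    parser.close()
    return parser.features


sidebar = (
    '<a href="/about">About</a>'
    '<a href="http://example.com/contact">Contact</a>'
    '<a href="http://other.example.org/"><img src="banner.png"></a>'
)
print(hyperlink_features(sidebar, "http://example.com/index.html"))
```

Dividing the total by the number of words in the block would give a link density figure, which is one way a mostly-links navigation block might stand apart from main content.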
There’s a good possibility that both Google and Bing are looking at blocks or segments on pages to focus upon indexing the most important content on pages and devaluing the weight of content within boilerplate or less important blocks.
There’s also a good chance that they are focusing upon the main content of a page to determine which advertisements to present on pages for sites that use their advertising.
If you were to look carefully at the pages on your site, and break them down into blocks, what might each block be telling the search engines about your pages?
The features listed above are taken from the description of the Microsoft patent. Chances are that Microsoft is looking at other features as well, and may not be looking at some of the features listed above.
Chances are also good that Google has developed a way of understanding the different functions of segments that they see on pages.
Can you think of any potential problems with some of the features that are listed above?
Are there some other features that aren’t listed that might be helpful?
43 thoughts on “How a Search Engine Might Use Linguistic Features of Web Page Blocks to Improve Search Results”
It seems pretty obvious to me that blekko discounts links placed in template parts of a website such as footer and sidebar, while still counting the links to the same target site once placed into the main content area. I would be surprised if Google/Bing weren’t able to do the same (even though it seems that they both still give them some value). Nevertheless, for practical purposes it’s a clear hint where to put efforts when link building. I also have the impression that above the fold links count much more than the typical at the end of the article links. Problems that arise here are very legitimate mentions of sources, as they might resemble too much common article marketing schemes.
You can find out exactly which links blekko is ignoring (and why) by searching for an url and adding /sections:
(no longer active)
I’m sure competing search engines have a similar process.
I didn’t know that layout features are also important for search engines. Now I see that my site has to be changed. I knew before that frames are bad and shouldn’t be used, but things like the size of images surprised me!
Thank you for bringing up Blekko, and what they’ve been doing. I think your comment may have drawn Blekko’s CTO Greg Lindahl to comment.
It does seem like Google isn’t ignoring or excluding links that aren’t in the main content area of pages, but may not be giving them as much weight.
Chances are that content found in an important block will rank higher than content in less important blocks, and likewise, it’s also possible that images that are in “more important blocks” may rank higher in an image search than images in less important blocks as well.
Really appreciate your stopping by and sharing some information about how Blekko might view sections of pages, and which links that you might include or exclude.
I hadn’t seen the section tag search before, and it is very illuminating. I really appreciate that Blekko is sharing information like that.
I’m going to have to start including more references to Blekko in future posts.
Thank you very much.
I want to caution you before you start making too many changes to your pages.
The Microsoft patent that I’ve written about in this post provides a number of examples and ideas to describe how they might interpret different parts of pages, and how they may interpret the functions of blocks that they find, but we can’t be certain at all if they are using this the way that they describe in the patent.
When I see a patent like this, I try to use it as a springboard for testing things, and for experimenting. I don’t usually take it at face value. From the point that a patent like this is first written to the time that it may be implemented, there’s a good likelihood that the way it’s implemented has changed in any number of ways. And, once something like this is put into place, and actually used, there’s also a good chance that it’s been tested and changed as well.
The layout of a page can have a profound impact upon the indexing of a page, and the way that it may rank in search results. Some of the things discussed in the patent seem to make a lot of sense, such as the benefits of a search engine being able to identify which sections of pages are the most important, and can be viewed to see what the main topic of that page might be.
Here we’re given an insight into how a search engine might come up with rules to decide which part of a page might be that most important section, and how it might use a machine learning approach so that the process can be automated, and those rules can be adapted as the search engine comes across new pages. I wouldn’t call the “features” listed above regarding layout and linguistics to be guidelines to be followed, but rather things to be observed and tested.
Interesting implications for those using DoFollow blog comments for backlinking as well. If the search engine knows each part of the page, it can surely tell the difference between the main content and the comments and discount the weight of those links significantly. I’ll be very interested to see how this affects SERPs moving forward!
I didn’t know the blocks and their content were so integral to how my site placed on Google. Is this proven or still speculation at this point?
So content is King for both SEO and Advertisements? Nice.
Interesting question about dofollow blog commenting…it’s actually one that has been on my mind for a while now. The thing about blog commenting as a form of SEO is that lengthy comments with quality user-generated-content would seemingly differentiate themselves in value from seemingly low-quality site-wide links and the like. I think that comment blocks would/can actually be considered content blocks as opposed to simply link blocks.
As you mentioned, time will tell. Personally, I think that BC will continue to be a valuable SEO and networking technique that can never be overlooked. Additionally, much like link directories, they are, for the most part, moderated and if the comment is not spam AND unique, they will hold their value.
For Chinese search engines, linguistic features of web pages can be very powerful in identifying the functions of blocks.
Years ago I did an experiment with my friends, which used the frequency of comma and period to determine which block is the main content area of a page, and the accuracy rate was higher than 90%.
“a block containing navigation will usually have extremely short phrases and no sentences”
Besides what you have mentioned, I GUESS the density of hyperlinks can also be a characteristic that differentiates the navigation area from the main content block.
Good post – this kind of contextual segmentation makes a lot of sense. The thing is, a lot of this analysis will be complemented when/if HTML5 and more semantic page mark-up becomes more widely used. It will take a lot of the leg-work out of working out which sections do what.
For instance, a link in the will logically carry less weight than a link in an .
Interesting to see how this all develops.
Any kind of mark-up will be exploited by bad people. You can be sure that search engines won’t take it at face value all of the time.
Thanks, Greg, for your link. I find it very helpful. I also agree that classifying the functions of different blocks is really worth doing.
To be honest I thought this was something Google would be able to do already. Blocks are just as relevant as H1s, bolds, or italics in my opinion.
There are a number of implications for blog comments in a system like this.
One of them is that a search engine could segment blog comments and cluster them together. If the same comment was left on many blogs, regardless of the actual content of the original blog post, that comment and any links associated with it might be ignored completely, regardless of who the author was, or even where it links to.
It’s possible that links from most comments would carry less weight, if any at all.
A search engine might assign different weights from different authors based upon the quality and quantity of their content, especially if those comments could be associated with some type of trusted identification system, such as an Open ID initiative or Open social type approach to digitally signing comments.
A block that contains advertising may be treated as not relevant to the content of a page when it comes to indexing that page regardless of the quality of content within a block containing advertising.
Of the many reasons to comment on blogs that exist, I think the best of them may be to build relationships between the blogger and the commenter, and to add to a discussion in a way that it may draw the attention of other people who may read the blog post and the comments.
Relying upon them as a linkbuilding method may not be the strongest of reasons, especially considering that a search engine might not give them as much weight as other approaches. For instance, I’d often rather spend the time creating a good quality blog post in response to a post that I see, and attracting links to that rather than a comment in response.
Thanks for sharing your experience. The Microsoft patent does include a section on how they might use punctuation as a signal for identifying the function of a block on a page, but they didn’t include an indication of how effective that approach might be. Interesting that you saw a greater than 90% accuracy with that approach alone.
A main content area consisting mostly of hyperlinks would seem to indicate that a section might be navigational instead of main content, though I’ve sometimes seen exceptions to that. For instance, I’ve written a couple of posts about Google’s patents, where most of those posts were links to the patents themselves. I guess those could be distinguished from navigation since the links mostly led to pages on another site (in that case, the US Patent Office).
Looks like you tried to use some HTML in your comment, and it was stripped out of the comment.
One of the great things about HTML is that it presents a lot of flexibility in the way it allows you to present content on a page. One of the biggest problems with HTML is that it presents a lot of flexibility in the way it allows you to present content on a page.
HTML5 may make it more likely that people will use elements that better identify the function of a section of code, such as the article element or the footer element, but we can’t tell for certain how much easier that will make it for search engines to perform this kind of segmentation.
People have also tried to “trick” search engines in the past, believing that the proximity of some words to the top of an HTML page made the search engine assign more weight to that content. From table tricks to CSS tricks, people have been trying to get an advantage in the way that their pages have ranked by doing odd things to their code. It’s actually easier to create content that is relevant to a topic than allocate your time and attention to try to create that illusion.
HTML5 may help, by making it clearer to a designer where they need to focus upon when adding content to a page.
Unfortunately, it’s true that people often seek to take advantage of how they think search engines might work.
Personally, I look at my understanding of search and search engines as similar to learning about grammar and language and the rules of writing and communication. The better you understand the framework of the medium in which you create pages and content, the more effective your message becomes in terms of reaching the audience it is written for.
One of my links above is to a post, Google and Document Segmentation Indexing for Local Search, which I wrote about a Google patent filed in 2004, Document segmentation based on visual gaps.
While it primarily focuses upon being able to segment different reviews for different restaurants that appear on the same web page, it goes beyond that. A couple of statements from the patent filing:
Chances are good that this is something that Google is capable of doing, and is doing at this point. But it’s not something that is discussed very often when it comes to how the search engine indexes the content it finds on pages.
“Thanks for sharing your experience. The Microsoft patent does include a section on how they might use punctuation as a signal for identifying the function of a block on a page, but they didn’t include an indication of how effective that approach might be. Interesting that you saw a greater than 90% accuracy with that approach alone.”
One point to clarify – the experiment we did was for Chinese language and Chinese punctuation. And between comma and period, accuracy based on period is a little higher than comma. Things may be a little different for the English language, but I guess not much, 🙂
“A main content area consisting mostly of hyperlinks would seem to indicate that a section might be navigational instead of main content, though I’ve sometimes seen exceptions to that. For instance, I’ve written a couple of posts about Google’s patents, where most of those posts were links to the patents themselves. I guess those could be distinguished from navigation since the links mostly led to pages on another site (in that case, the US Patent Office).”
Definitely – to maximize the accuracy, we all believe there are always as many factors to be taken into consideration as possible. Length of phrases can be a sign, frequency of hyperlinks can be a sign, maybe position can be a sign too, the id of the div and use of “” and “” can be a sign too – but a good algorithm should take them all into consideration rather than 1 or 2, 🙂
Good stuff Bill. It certainly makes you look carefully at page structure and where you put those all-important internal links.
Thanks. Punctuation seems like a useful signal regardless of what language it might appear within.
I agree completely about the need for an algorithm to look at a range of signals rather than just one or two.
Definitely agree that content blocks are the key to understanding the true meaning of a page, and have a recent example that I hope you enjoy.
A month ago, we launched a WordPress theme designed for travel bloggers. While the example content was all placeholder information about wonderful places to travel to in order to demonstrate the capability of the site, it was all indexable. Would Google rank the site for travel information, even though the true purpose was to show theme users placeholder information? I wondered; and if you are too, the results are in.
Organic search sent 500+ visits, and all of the top referring keywords contained either “wordpress” or “theme” – or both. This leads me to believe the title tag, above all, is the most important feature for conveying the meaning of a page – is this perhaps a ‘content block’ too? Can the title tag be the dictator of how the content blocks on the page are relevant to a query, in this case the “free wordpress theme” title tag notifying Google that all the content blocks on the page are for demonstration purposes? I believe, yes!
All open questions. No need to answer, just food for thought.
“Punctuation seems like a useful signal regardless of what language it might appear within.”
Yeah, and you may not know, actually in Chinese, period is a little special – it is a small “o”, rather than a dot, haha. That is why it is more powerful than comma in my experiment,:)
Excellent post Bill, Bravo.
I think we (web developers and SEO experts) need to pay more attention to how we design the page layout and to the elements in the markup language that describe each block.
Ex.: it’s sound better for search engine than .
Thanks for the example.
It sounds like Google may have recognized what was going on with the page, and which block appeared to be the most important one, which seems to have been the title block.
I’m also guessing that stuff like links to the page probably focused more upon it being a page where people could get a wordpress theme rather than a page about travel.
Could Google have ranked this page well for wordpress themes without using a segmentation process? It’s possible, given the importance that search engines tend to give titles and hyperlink text. But, it’s not a bad example of how blocks might play a role in rankings as well.
Didn’t know that about the Chinese period. I guess if one wants to practice international SEO, they either need to learn a lot about languages, or turn to someone who does. I’m guessing that the second option is probably the wiser one.
I’m hearing more questions about layout from people asking questions about SEO, and I’m encouraged by that. I think the importance is starting to set in.
Love the post, but also the Blekko slashtag, this is great to see.
Thanks. I was really happy to see Greg Lindahl get involved in the comments to this post, and share some information about how Blekko works, and what links they might decide to not consider in their rankings.