How Page Segmentation May Be Understood by Google
The HTML5 standard being developed for the Web introduces a new set of HTML elements, such as section, header, footer, article, aside, and nav. Mark Pilgrim’s online book, Dive Into HTML5, gives us a look at the newest version of HTML, and shows how these new semantic elements might change the way we create web pages, in ways that could make it easier for automated programs like search engines to understand what the different sections of a page mean.
Interestingly, the search engines have been working hard at this themselves for many years, and a patent granted to Google today describes how the search engine might understand different parts of web pages and use that understanding to rank pages in search results, caption images, construct snippets for search result pages, and weigh links differently when they appear in different semantic parts of a page.
Microsoft has published several patents and whitepapers on how it might perform some of these activities while breaking pages down into smaller blocks. My most recent blog post on the Microsoft efforts covered a patent filing that gave us some insights into how they determined the functions of different blocks in web pages. That post links to other posts here involving papers and patent filings from Yahoo, Microsoft, and Google.
But we’ve never seen a full-blown description from Google before on how they might segment pages into different parts with different purposes.
The Google patent granted today does give us much more information on how the search engine interprets parts of pages and uses that information in multiple ways:
Determining semantically distinct regions of a document
Invented by Yonatan Zunger
Assigned to Google Inc.
US Patent 7,913,163
Granted March 22, 2011
Filed September 22, 2004
A structured document is translated into an initial hierarchical data structure in accordance with syntactic elements defined in the structured document. The initial hierarchical data structure includes a plurality of nodes, and each node corresponds to one of the syntactic elements. The method then annotates a node with a set of attributes including geometric parameters of semantic elements in the structured document that are associated with the node in accordance with a pseudo-rendering of the structured document.
Finally, the method merges the nodes in the initial hierarchical data structure into a tree of merged nodes in accordance with their respective attributes and a set of predefined rules such that each merged node is associated with a semantically distinct region of the pseudo-rendered document.
The predefined rules include rules for merging nodes associated with semantic elements that have nearby positions and/or compatible attributes in the pseudo-rendered document.
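To make the abstract a little more concrete, here is a rough Python sketch of its three stages: build a node tree from the markup, annotate each node with geometry from a pseudo-render, and merge nearby, compatible nodes into semantically distinct regions. The Node structure, the gap threshold, and the same-tag compatibility test are all invented stand-ins for the patent's rules, not Google's actual implementation.

```python
# Illustrative sketch of the patent's parse -> annotate -> merge pipeline.
# The Node class, geometry values, and merge rules are invented for this example.
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str                      # syntactic element (e.g. "p", "td", "div")
    bbox: tuple = (0, 0, 0, 0)    # (x, y, width, height) from a pseudo-render
    children: list = field(default_factory=list)

def nearby(a: Node, b: Node, gap: int = 20) -> bool:
    """Two nodes are 'nearby' if the vertical gap between them is small:
    a stand-in for the patent's geometric-position rules."""
    ay, ah = a.bbox[1], a.bbox[3]
    by = b.bbox[1]
    return abs(by - (ay + ah)) <= gap

def compatible(a: Node, b: Node) -> bool:
    """A stand-in for 'compatible attributes': here, simply the same tag."""
    return a.tag == b.tag

def merge_siblings(nodes):
    """Greedily merge adjacent sibling nodes that are nearby and compatible,
    so each merged node covers one semantically distinct region."""
    merged = []
    for node in nodes:
        if merged and nearby(merged[-1], node) and compatible(merged[-1], node):
            prev = merged[-1]
            x, y, w, h = prev.bbox
            nx, ny, nw, nh = node.bbox
            # Grow the previous region's bounding box to cover both nodes.
            prev.bbox = (min(x, nx), min(y, ny), max(w, nw), (ny + nh) - min(y, ny))
            prev.children.append(node)
        else:
            merged.append(node)
    return merged

# Three paragraphs stacked closely, then a footer far below them:
regions = merge_siblings([
    Node("p", (0, 0, 600, 40)),
    Node("p", (0, 50, 600, 40)),
    Node("p", (0, 100, 600, 40)),
    Node("footer", (0, 800, 600, 60)),
])
print(len(regions))  # the three paragraphs collapse into one region, plus the footer
```

In this toy run, the three closely spaced paragraphs merge into a single region while the distant footer stays separate, which is the flavor of the "nearby positions and/or compatible attributes" rule quoted above.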
Page Segmentation in an Earlier Google Patent
Back in 2006, a Google patent application was published describing how Google might take a page filled with restaurant reviews and break it apart so that it could associate each review with the restaurant being reviewed. Near the bottom of that patent is a paragraph telling us that Google could use that segmentation process for much more than reviews:
 Although the segmentation process described with reference to FIGS. 4-7 was described as segmenting a document based on geographic signals that correspond to business listings, the general hierarchical segmentation technique could more generally be applied to any type of signal in a document.
For example, instead of using geographic signals that correspond to business listings, images in a document may be used (image signals). The segmentation process may then be applied to help determine what text is relevant to what image.
Alternatively, the segmentation process described with reference to acts 403 and 404 may be performed on a document without partitioning the document based on a signal. The identified hierarchical segments may then be used to guide classifiers that identify portions of documents which are more or less relevant to the document (e.g., navigational boilerplate is usually less relevant than the central content of a page).
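The image-signal idea in the quoted passage can be illustrated with a toy example: once a page has been segmented, pick the text block whose pseudo-rendered position sits closest to an image. The block list and the vertical-distance metric here are assumptions made for illustration, not anything specified in the patent.

```python
# Hypothetical illustration: after segmentation, associate each image with the
# text block closest to it in the pseudo-rendered layout.
# The (text, bbox) block format and distance metric are assumptions.

def vertical_center(bbox):
    x, y, w, h = bbox
    return y + h / 2

def nearest_text(image_bbox, text_blocks):
    """text_blocks: list of (text, bbox) pairs; returns the text whose rendered
    position is closest to the image -- a crude proxy for relevance."""
    return min(
        text_blocks,
        key=lambda tb: abs(vertical_center(tb[1]) - vertical_center(image_bbox)),
    )[0]

blocks = [
    ("Navigation links", (0, 0, 600, 30)),
    ("A photo of the Golden Gate Bridge at sunset", (0, 350, 600, 20)),
    ("Copyright footer", (0, 900, 600, 30)),
]
print(nearest_text((0, 300, 400, 40), blocks))  # picks the caption-like block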
Page Segmentation In the Reasonable Surfer Patent
Google’s Reasonable Surfer patent described how Google might pass along different amounts of PageRank to different links based on several features associated with those links. Some of those features involved the location of those links on the pages where they were found. But that patent, like the review segmentation patent, really didn’t go into much detail on how Google might break a page into different parts.
This new patent does.
It looks at the HTML of a page and also looks at how the page might be displayed in a simulated browser to understand the different parts of a page, what their purpose might be, and where they are located on a page.
Examples of page segmentation processes driven by an understanding of these different parts of a page include:
Link analysis – links found in “different semantically distinct regions may be assigned different weights.” So, a link to another page that’s found in the footer of a page might be given less weight (or PageRank) than a link found in a more important section of the page.
Text analysis – Text found in some parts of a page might be given more weight than text found in others. So, a page with a certain keyword phrase in its footer might not rank as highly for a query matching that phrase as it would if the phrase appeared in a more important segment, like the main content area. A page where a query term appears in an important segment might also rank higher than another page where the same term appears in a much less important part of the page.
Image captioning – Text found near an image is usually more relevant to the image than text further away from it. This segmentation process can help identify which text is nearest an image, and that text could be used to help caption the image and help it rank in image search.
Snippet construction – When a search engine returns a page in search results for a particular query, it generates a snippet of text to describe what it found on that page. Sometimes, when the query terms appear in the meta description for a page, the search engine shows the meta description. But a search engine will also use text it finds on the page itself, and it might create a snippet based upon text in the section that is most relevant for the query.
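As a toy illustration of the first item above, region-dependent link weighting might look something like this. The region names and weight values are invented for the sketch, since the patent does not disclose any numbers.

```python
# Toy illustration of region-dependent link weighting.
# The region labels and weights are invented; the patent specifies no values.
REGION_WEIGHT = {
    "main": 1.0,      # main content area: full weight
    "sidebar": 0.4,
    "footer": 0.1,    # boilerplate regions: heavily discounted
}

def link_weight(href: str, region: str) -> float:
    """Weight a link by the semantic region it was found in,
    defaulting to a middling weight for unknown regions."""
    return REGION_WEIGHT.get(region, 0.5)

print(link_weight("/pricing", "main"))    # 1.0
print(link_weight("/privacy", "footer"))  # 0.1
```

The point of the sketch is only the shape of the idea: the same link passes along more or less value depending on which segment of the page it sits in.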
Page Segmentation Conclusion
The patent provides a considerable amount of detail on how Google might interpret different parts of pages. Chances are that HTML5 will make it even easier for a search engine like Google to do this in the future.
We’ve had a considerable amount of detail from both Microsoft and Yahoo on how they might break pages down into parts, but not very much on page segmentation from Google. Now we do.
Last Updated June 9, 2019
46 thoughts on “Google’s Page Segmentation Patent Granted”
As ever full of insight Bill.
Google trying to stay ahead of the curve with the eventual release of HTML5 will go a long way toward making their indexing and segmentation easier; however, I suspect the uptake of HTML5 will be fairly slow. I am still auditing sites to this very day that, frustratingly, use table-based design.
This patent is very interesting. With Google being the biggest search engine, I can see SEO needing to evolve to keep up. I wonder if some of the bigger sites will end up moving to HTML5 just because of the SEO benefits?
There are many wonderful reasons for a site to go through an upgrade to a new design, but I’m not sure that doing it just to make it a little easier for search engines to understand the semantic structure of a site a little better is one of the top ones that I would point to. The Google patent on segmentation was originally filed in 2004, and many of their examples of how they might interpret the structures of pages include table-based formats.
I do think that if a page is designed so that a search engine can understand what content on a page is the main content, and that main content is optimized well for keywords that both match the objective of the site owner and the intentions of the visitors to the site, then there is a benefit. We don’t necessarily need the site to be redesigned to HTML5 for that to happen, but it does look like HTML5 can make it easier to do that.
If a site moves to HTML5, they should do so because it will help make their pages load and render faster, make them easier to manage and maintain and update, give the site some beneficial capabilities that it didn’t possess before, and so on. SEO concerns should be part of that analysis, but I wouldn’t recommend updating a site to HTML5 solely for SEO purposes. There are many other potential benefits as well.
Thanks. The patent was filed early enough that I don’t think HTML5 had even been conceived of yet. Newer versions of XHTML were being worked on at the time, and seemed the more likely future back then.
The beauty of HTML is that it is so flexible that you can do a lot of the same things different ways, and the problem with HTML is that you can do a lot of the same things different ways. There may always be people who design websites with older versions of HTML, and continue to use table-based layouts. That’s one of the challenges that search engines have – index content and try to find the best and most helpful information regardless of whether the code behind a site is using the latest technology or something somewhat older.
I think understanding something like the segmentation process that a Google or Microsoft might use is like learning the rules of grammar, spelling, and writing when you want to become an author. Having a knowledge of those things gives you more tools and ideas to work with. You don’t need to use HTML5 to make it easier for a search engine to understand which parts of your pages are the main content areas, but if you can make it easier for a search engine to pick that section out and focus on that content, it can be easier for you to show up in search results for the things you intended.
I agree – the adoption of HTML5 will likely be slow, and I anticipate that a lot of sites won’t make the change anytime soon. Maybe once there are some really compelling reasons to do so, people will start making the change. There are things proposed in HTML5 that are pretty interesting, like the ability to embed videos much more easily. Guess we wait and see.
It makes a lot of sense for Google to do this as HTML5 won’t be pervasive until a very long time, and could possibly be gamed anyway. I think Google has probably been putting less emphasis on links and text in less relevant sections (e.g. the footer / sidebar / nav) for quite a while now and it’s now a case of refining that process and making it more reliable, semantic, and precise.
Really interesting stuff, once again.
I think this makes pure sense. Google are trying to rid their index of the poor sites and return the best possible pages for given terms. It is far too easy to game Google currently, and with HTML5 and prominence being key, it will hopefully rid the engine of the crap we all have to put up with. OK, the algos will always be gamed, but this will help, as rankings will rest far more on quality and not just link building.
Html5 is an easy upgrade for anyone with a small knowledge of web design. In fact I think it’s an easier language as it is far more logical.
Great article again. Time for me to do some testing with html5
Thank you Bill, as Dean said, insightful as ever.
I am inclined to agree with Twosteps here in his or her assessment that Google are likely to have been applying link weights based on structural, semantic location for some time, from an SEO standpoint.
Perhaps I’ve been reading too much of Aaron Wall’s seobook.com/blog (I do love that blog) because one of the first questions I find myself asking is:
How is Google going to use HTML 5 semantics to better scrape external sites to increasingly siphon traffic to their web properties, leading to a higher likelihood of Adsense revenue?
Great stuff (again!) Bill. One question would be: might this supersede any benefits that SEOs have seen from rearranging content via CSS to put the body content higher up the document? If they are able to better classify the body content as an item, especially with HTML5, would it matter less where in the document it appeared?
I totally agree that the focus should be on the quality of the page content and the use of prominence within the semantically written copy.
I am looking forward to test link positions and write up a case study on it. It will be a very interesting test.
I’ll make more comments here when I have completed the test.
I agree with you that Google has likely been using a system like this to put less emphasis in less relevant sections for at least a few years now. Not only for the four things that I mentioned in the post – links, relevance for query terms, captioning images, and deciding upon snippets, but probably for other things as well. For example, deciding what advertising to show on pages that carry adsense.
Perhaps I’m not quite as cynical as Aaron when it comes to Google’s motivations, but Google has a good number of economists on its staff these days, and I’m sure that they recognize that if they behave too selfishly when it comes to sending searchers to their own sites, that the eventual outcome is for people to either use another search engine or to start ignoring Google services and sites more frequently when those appear in search results.
While I started this post out with some thoughts about HTML5, Google doesn’t need HTML5 to do this kind of segmentation. It might help Google some when they do segment pages, but I think the real benefit of the semantic elements of HTML5 is to site designers, so that they can make it more likely that a search engine like Google is doing a better job of segmentation.
The nice thing about HTML5 is that it gives us a few more tools that we can use to build better pages, and some of those make it easier for programs like search engine crawlers and indexers to understand what our sites are about.
But, I do want to stress that irrelevant content is irrelevant content regardless of whether it’s displayed on web pages using HTML5 or HTML 4. 🙂
So, if someone makes the effort to change over to HTML5 thinking that it will help them rank better at Google, they should also be working on improving the relevance, quality, and usability of their pages as well.
People have been trying to play games with where their content is located in the code of their pages and on their pages as rendered by search engines for years. Before the CSS tricks using absolute positioning, people were using a similar table trick. At one point search engines were looking to see what words appeared closest to the top of a page to determine how important those words or phrases might be. It’s better to look at words that appear in a main content area of a page, and those CSS and table tricks really don’t help with that.
The effects on SEO should just be like any other change that we’ve experienced over the years. Adjusting is a part of the SEO world, so I don’t think this should be a big deal overall for SEO. Staying up to date is a main concern with our Chicago SEO company. By doing this we can strategically make adjustments according to the changes and stay on top of our clients’ sites.
Here’s a quote from JohnMu @ Google,
“In general, our crawlers are used to not being able to parse all HTML markup – be it from broken HTML, embedded XML content or from the new HTML5 tags. Our general strategy is to wait to see how content is marked up on the web in practice and to adapt to that. If we find that more and more content uses HTML5 markup, that this markup can give us additional information, and that it doesn’t cause problems if webmasters incorrectly use it (which is always a problem in the beginning), then over time we’ll attempt to work that into our algorithms. With that in mind, I definitely wouldn’t want to stand in the way of your implementing parts of your site with HTML5, but I also wouldn’t expect to see special treatment of your content due to the HTML5 markup at the moment. HTML5 is still very much a work in progress, so it’s great to see bleeding-edge sites making use of the new possibilities :)”
I’m glad they are looking at segmentation. The more we can verify to Google that we have an absolute priority in our layout, the easier it will be for them to identify what we think deserves to rank the best it possibly can within our own sites. This reminds me of Garrett French’s book on linkable assets. I think any SEOer who isn’t adopting this mindset is going to be left by the roadside.
Interesting tweaking going on with these patents. I would love to know how long it takes for these changes to actually filter down though.
What would be the consequences for small website businesses? Anything we need to plan for soon?
That’s what I love about Google and the search engines. How they evolve and help the Internet become a better place. Relevancy and click-through rate increase is what should be on every webmaster or blogger’s priority list.
I think this segmentation upgrade/update/game plan will make it harder for seo spammers to rank their sites, but at the same time, make listings and results more relevant.
Google and Amazon should work together as both are customer/visitor obsessed 🙂
Maybe linguistics studies are not so bad after all. We do not need to know a language to start parsing it once we begin to understand its underlying rules, so I am glad to see that semantic studies are still involved in the evolution of search. However, I wonder if search engines may be missing an element that could be important.

If we begin to map out elements in the header, footer, and sidebars, we may see something of note. A user may well pay attention to the content in these areas, as it may pertain to something they are seeking, or it may affirm an idea they have about the site, which causes them to read on to the content, which may be below the fold. For example, a large header blocks off available space before the fold, but you do see the title and maybe an excerpt of the content. If the sidebar is geared to advertising first, the user may begin to form a perception of the site (depending on the advertising, of course). If the sidebar begins with a bit of content explaining the site, or an author bio, the user forms a different opinion. Maybe more trusted. What if the sidebar is dedicated to UGC-type items? With such an emphasis on users creating authority for a site, and with this sidebar (or maybe footer) content changing, wouldn’t a semantic analysis reveal this to be important to the understanding of the site? Basically, search engines are on the right track, but they may have farther to go in replicating user understanding of what is of value.
This is interesting. It’s natural for them to protect their online integrity as they continuously get flocked with spam content. It’s good to know they’re holding their guard.
I think it is natural but of course interesting that it happens. I agree with you, Bill, that it is important to understand that crap content is crap content whatever format you read it in.
To be honest, Bill, I was of the impression that Google looked into everything when it crawled a website and valued it according to the relevance of that site, but I sit corrected. Thanks again. Will this mean people will need to look further into how they design websites in the future and be a little clever with it?
Good points. SEO is constantly changing and evolving, and I’ve suspected Google of using this type of segmentation of pages for at least 5 years. The patent is a nice validation of that assumption. Even if Google hadn’t been, the kinds of things that you might do on pages in case they were tend to result in higher quality pages anyway.
Your experimentation sounds interesting. There are so many different factors involved in the ways that Google ranks pages and values links that some experiments can be pretty difficult to perform – but that doesn’t necessarily mean they aren’t worth trying. I look forward to hearing about your results.
Thanks for the quote from John, and your thoughts on the subject.
I agree completely that the easier we can make it for a search engine to parse through the code on our pages and find the things that we hope they find, the better off we are, and the more control we have over the message that we are trying to send with our content. I’d much rather have a page rank well for something found in its main content than for something in its footer. I think that ends up as a better experience for the visitor of a page in most cases, and increases the chances that they might visit other pages that we might want them to see as well.
Sometimes when a patent is granted, a search engine may have been following and using the processes described in it for a while. Sometimes they’ve even moved on to something else that (they hope) might be better.
I’ve also seen some things announced on one of the Google blogs and implemented the same day, or shortly after a patent was published as a patent application or granted. For example, Google’s patents on logged-in personalized search and web history were pretty descriptive of the user-interface elements involved, and were published about 6 months before those features launched.
But, I’ve also seen patents filed or granted where it’s just impossible to tell whether or not a search engine might ever use it, or where it seems like a lot of other things may have to fall into place before they do.
I think that the segmentation process described in this patent has the potential to impact businesses and sites of all sizes.
A couple of things you should be asking yourself are whether the HTML of your pages makes it easier or more difficult for a search engine to understand what code goes with which parts of the page, and whether you’ve done a good job of putting important content (keywords, related terms, etc.) in the main content area of those pages.
I think in this particular instance, there’s a good chance that Google has been doing something like this for a few years. Because of that, I’m not sure that we will see a drastic impact from it. Of course, as Brent mentioned in his post, chances are that Google may be doing a better job with segmenting pages of some sites more than with other sites, and the advancement of standards like the semantic elements in HTML5 should make it easier for them to do it on many more sites.
Google and Amazon do both seem like they put “user experience” at the forefront of many things that they do.
This kind of page segmentation process is definitely the kind of thing that can help improve the relevance of search results, and enable search engines to provide better results.
I was thinking a few days ago if it would be better for someone working on a search engine if they had a strong background in linguistics or a strong background in computer science, and I came to the conclusion that it would be best to have people with both types of backgrounds collaborating together.
This segmentation process doesn’t necessarily ignore the content in headers, footers, and sidebars as if it were irrelevant to the content shown in a main content area, and I absolutely agree with you that if they did, they would be making a mistake.
I think when you set out to analyze the impact that your pages can have from a conversion and usability standpoint that you need to look at every area of real estate on a page, too.
Yes, indeed. It’s possible to get all the technical details on a page right, from clean and crisp code, fast loading pages, good user experience, and so on, and still create content that might not be all that interesting or engaging or persuasive or helpful. 🙂
I think it’s always been helpful for designers to pay careful attention to the ways that they design pages, and think about the roles and uses of content found in sidebars, headings, footers, and main content areas. Often those header, footer, and sidebar sections contain the same or very similar content on more than one page on a site, rather than offering unique and interesting content. A search engine should focus upon what is important and unique on pages, so the idea that the important stuff should usually be found in a main content area on a page is true regardless of whether or not a search engine might pay more attention to what it finds there.
Hi Bill. I agree upgrading to HTML5 purely for SEO benefit is probably folly but certainly for new site development anything that makes it easier for the engines to understand your content and improves user experience makes it a must do.
Interesting article! It seems HTML5 gives Google more options for identifying particular aspects of a page and the value of particular items like links. However, aren’t links in footers and sidebars already given less weight?! Is the new standard just giving Google and other SEs more options to segment? Grtz, Mark
There do seem to be a number of other benefits to developing in HTML5 that go beyond just SEO. I’m wavering on doing so a little because the standard still needs some work, but I’m trying to keep an eye on what develops with it.
It’s likely that links in footers and sidebars are given less weight, and this patent was filed way before anyone thought of coming up with HTML5. I introduced this post with some thoughts on HTML5 because it will provide designers with a way to make it much clearer which parts of a page are which. The new standards may make it much easier for the search engines to segment pages. I think that’s helpful for developers because they can focus more upon making the content found in the main content area of a page more focused.