10 Most Important SEO Patents: Part 7 – Sets, Semantic Closeness, Segmentation, and Webtables

Yesterday, I wrote about how Google may be looking at the semantics associated with HTML heading elements, and the content that they head, and how the search engine might be looking at such content with similar headings across the Web to determine how much weight to give words and phrases within those headings.

That post was originally part of the introduction to this post, but it developed a life of its own, and I ran with it. Here, we’re going to look at semantics related to other HTML structures, primarily lists and tables.

I’m going to bundle a handful of patents together for this choice of one of the 10 most important SEO patents, since I think they work together to illustrate how a search engine might use semantic structures to learn about how words and concepts might be related to each other on the Web. Some of these patents are older, and one of them is a pending patent application published this week. I’m also going to include a number of white papers which help define a process that might seem to be very much behind the scenes at Google. I’m going to focus upon Google with this post, though I expect that similar things may also be happening at other search engines.

Google Sets

Google Sets was a service retired earlier this year through Larry Page’s “more wood behind fewer arrows” initiative. While it was active, it was the longest-running beta service at Google, and it spent its last few years in Google Labs.

screenshot of Google Sets with 4 Delaware cities entered as a starter set

The process behind Google Sets sounds simple enough. You entered some terms that might possibly be members of the same set, and Google would tell you other terms that might also be members of that set.

The patent behind Google Sets was System and methods for automatically creating lists, originally filed in 2003. What was interesting about it is that it extracted information in list format from the Web to find terms that were related in some manner. Those lists weren’t limited to HTML list elements, but could also be content formatted in some list-like form, such as:

  • HTML tags (e.g., <UL>, <OL>, <DL>, <H1>-<H6> tags),
  • Items placed in a table,
  • Items separated by commas or semicolons,
  • Items separated by tabs,
  • Other ways.

Google was taking advantage of the formatting of HTML list items, but it wasn’t relying upon them solely.
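To make the idea a little more concrete, here is a minimal sketch of set expansion from lists. It is not the method claimed in the patent; it only assumes that lists have already been extracted from pages (from list elements, table columns, or comma-separated text), and it scores candidate terms by how often they share a list with the starter terms. The function name expand_set and the sample lists are hypothetical.

```python
from collections import Counter

def expand_set(seeds, lists, top_n=5):
    """Rank terms that share extracted lists with the seed terms.

    Assumption: a list that contains more of the seeds is stronger
    evidence, so each co-occurring term earns one point per seed
    found in that list.
    """
    seeds = {s.lower() for s in seeds}
    scores = Counter()
    for items in lists:
        items = {i.lower() for i in items}
        overlap = len(seeds & items)
        if overlap == 0:
            continue  # this list says nothing about our set
        for item in items - seeds:
            scores[item] += overlap
    return [term for term, _ in scores.most_common(top_n)]

# Hypothetical lists, as if pulled from <ul> elements, table columns,
# or comma-separated runs of text on different pages.
web_lists = [
    ["Dover", "Wilmington", "Newark", "Smyrna"],
    ["Wilmington", "Newark", "Middletown"],
    ["Dover", "Smyrna", "Milford", "Wilmington"],
    ["Newark", "Trenton", "Camden"],
]

suggestions = expand_set(["Dover", "Wilmington"], web_lists)
print(suggestions)  # ['smyrna', 'newark', 'milford', 'middletown']
```

Terms such as Trenton and Camden never appear in a list with either starter term, so they are not suggested.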

Lists, Titles, Headings and Semantic Closeness

When I originally planned this series and was trying to decide which patents to include, this particular patent was one of my first choices, not because it received much fanfare when I wrote about it in 2010 in Google Defines Semantic Closeness as a Ranking Signal, but rather because it illustrated so well how Google was considering the semantics of HTML elements.

Quite simply, every item in a list is equally close to the heading of that list. When we think about SEO, and how relevant a search engine might consider a page to be for a specific phrase, one of our first impulses is to believe that the closer the words within that phrase are to each other, the more relevant a search engine is going to find the page to be for the phrase.

But the semantics of a list confuses that somewhat.

The HTML list from above showing that Saturn and Distance are semantically closer than Days and rotation.

The heading for a list can be a heading element such as a <h2>, but it doesn’t have to be. It could also be regular paragraph text that stands out in some way, such as the use of a larger font.

The patent also goes on to tell us that it isn’t limiting semantic closeness to just lists:

For headings and titles, a term in the title of a document may be considered to be close to every other term in document regardless of the word count between the terms. Similarly, a term occurring in a heading may be considered to be very close to other terms that are below the heading in the tree structure.

So, words in the HTML title element of a page are the same distance from every word on that page.

Words in an HTML heading element on a page are the same distance from every word that they are a heading for.
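A rough sketch of how that kind of distance might be computed follows. The patent doesn’t give code, so everything here is an assumption made for illustration: the Term class, the role and section fields, and the constant CLOSE are all hypothetical. The only idea taken from the patent is that title terms are close to everything, heading terms are close to everything they head, and other pairs fall back to a plain word count.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Term:
    text: str
    position: int                   # word offset within the page
    role: str = "body"              # "title", "heading", or "body"
    section: Optional[int] = None   # id of the heading this term sits in or under

CLOSE = 1  # assumed constant distance for structurally "close" terms

def semantic_distance(a: Term, b: Term) -> int:
    # A title term is treated as close to every other term on the page.
    if "title" in (a.role, b.role):
        return CLOSE
    # A heading term is treated as close to every term in the section it heads.
    if ({a.role, b.role} == {"heading", "body"}
            and a.section is not None and a.section == b.section):
        return CLOSE
    # Otherwise fall back to the ordinary word-count gap.
    return abs(a.position - b.position)

# A list headed by "bicycles", with two of its items.
heading = Term("bicycles", 0, role="heading", section=1)
huffy = Term("Huffy", 2, section=1)
mongoose = Term("Mongoose", 6, section=1)

print(semantic_distance(heading, huffy))     # 1
print(semantic_distance(heading, mongoose))  # 1
print(semantic_distance(huffy, mongoose))    # 4
```

Under these assumptions, the last item in the list is exactly as close to the heading as the first, even though the raw word count between them differs.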

Semantically Distinct Regions of a Page

In addition to looking at explicit HTML semantic structures on a page, or implicit structures like lists within text that are separated by commas or HTML break elements <br />, a search engine might also attempt to understand how different parts of a page are grouped together, and understand how those parts are related to both each other, and the other groupings on a page.

In the Google patent, Determining semantically distinct regions of a document, we are told that Google may perform a pseudo-rendering of a web page when it crawls the page to determine “the approximate position and size of each element of the document” to identify semantically distinct regions of a document.

From the patent, a page broken down into visual semantic regions, including a heading, sidebars, a main content area, and a footer section.

Each of these regions, or “chunks” as the patent calls them, can play a role in a number of search related applications. For example:

Link analysis – Links in different semantic regions might be assigned different weights. Sounds like one of the features of the reasonable surfer patent.

Text analysis – Identical text in different semantic regions of a page might be assigned different weights based upon its location. So, if a query term is matched in the image blocks section of the picture above instead of in the footer, the page should be given a higher query score, and thus a higher position in the list of search results, because the main content area where those image blocks are is usually the primary target of a visitor to a page.

Image captioning – text close to an image is more likely relevant to the image than text farther away from the image, and might be used to create a caption for the image, which could then be used in applications like image search.

Snippet construction – Chunks in a document’s chunk tree might be assigned a pseudo-title, and those pseudo-titles might be relied upon to construct a more accurate snippet of the page for search results, capturing the major topics of the page and including at least one of the pseudo-titles.
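The text analysis example above can be sketched in a few lines. This assumes a page has already been split into labeled chunks; the region names, the weights in REGION_WEIGHTS, and the query_score function are invented for illustration and are not taken from the patent.

```python
# Assumed weights: a match in the main content counts far more than one in a footer.
REGION_WEIGHTS = {"main": 1.0, "header": 0.6, "sidebar": 0.3, "footer": 0.1}

def query_score(query_terms, chunks, weights=REGION_WEIGHTS):
    """Sum query-term matches, each scaled by the weight of its region.

    `chunks` is a list of (region, text) pairs for one page.
    """
    terms = {t.lower() for t in query_terms}
    score = 0.0
    for region, text in chunks:
        hits = sum(1 for word in text.lower().split() if word in terms)
        score += weights.get(region, 0.0) * hits
    return score

# Two hypothetical pages that contain the same query terms in different regions.
page_a = [("main", "hiking boots reviewed and compared"), ("footer", "contact us")]
page_b = [("main", "our company history"), ("footer", "hiking boots sitemap")]

query = ["hiking", "boots"]
score_a = query_score(query, page_a)
score_b = query_score(query, page_b)
print(score_a > score_b)  # True
```

Both pages match the query twice, but the page with the matches in its main content area comes out ahead.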

Google’s WebTables Project

Google Sets collected information that appears in implied and explicit lists on the Web to tell you about other terms that could be included in the same set of terms when you submitted at least a couple of terms to the application. Google Squared, before it became another victim of Google’s “more wood behind fewer arrows,” let you search for a specific topic, such as [Presidents of the United States], and you would have search results returned to you in the form of tables, with a first column listing the names of the Presidents, a second column that might tell you the dates of their births, a third column with place of birth, a fourth column with their political party, and so on.

Instead of searching lists, like Google Sets, Google Squared pulled information from tables across the Web. It collected not only subjects associated with your query, such as the names of the Presidents, but also attributes associated with those subjects.

An article in Communications of the ACM, Structured Data on the Web (pdf), provides a very readable and high level (and highly recommended) look at the WebTables (pdf) project behind Google Squared. By crawling and trying to make sense of data found in structures like tables across the Web, Google provides ways to:

  • Improve Web Search
  • Make question answering better
  • Integrate Data from multiple sources on the Web

The data shown in the spreadsheets returned by Google Squared would pull in information from more than one table, so that the birthdates for the Presidents might be found in a number of Web tables, and the political party affiliations of the Presidents might be pulled in from a number of other tables on the Web.

Not only did Google Squared take advantage of the structure of tables, but it also learned labels and schema from those tables related to the data and attributes that it discovered. Google uses those labels to create something it calls the Attribute Correlation Statistics Database (ACSDb).

The ACSDb calculates the probabilities that certain attributes will appear for a certain Google Squared query.

Chances are there may be a table somewhere on the Web listing the Presidents that includes an attribute like “favorite color.” Chances are also that there aren’t many Web tables like that, and because there’s so little corroboration amongst publishers of tables on the Web regarding Presidents’ favorite colors, that attribute is unlikely to show up in a Google Squared result.
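Here is a small sketch of the kind of statistic involved, assuming the database simply records how many tables use each set of column labels. The names build_acsdb and attribute_probability, and the sample schemas, are hypothetical and stand in for whatever Google actually stores.

```python
from collections import Counter

def build_acsdb(schemas):
    """Count how many tables use each distinct set of column labels."""
    return Counter(frozenset(label.lower() for label in s) for s in schemas)

def attribute_probability(acsdb, attr, given=None):
    """Estimate p(attr), or p(attr | given) when `given` is supplied."""
    matching = 0
    total = 0
    for schema, count in acsdb.items():
        if given is not None and given not in schema:
            continue  # only tables that contain the conditioning label
        total += count
        if attr in schema:
            matching += count
    return matching / total if total else 0.0

# Hypothetical column labels from five tables about Presidents.
tables = [
    ("name", "born", "party"),
    ("name", "party", "term"),
    ("name", "born", "birthplace"),
    ("name", "born", "party"),
    ("name", "favorite color"),
]
acsdb = build_acsdb(tables)

p_party = attribute_probability(acsdb, "party", given="name")
p_color = attribute_probability(acsdb, "favorite color", given="name")
print(p_party, p_color)  # 0.6 0.2
```

With counts like these, a rarely seen label such as “favorite color” scores low and would be less likely to be chosen as a column.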

The ACSDb could be used to suggest labels to database designers for certain types of attributes, in an “auto schema” approach somewhat like the “auto complete” that we see for Google Instant Queries.

Synonym finding is also aided by the ACSDb, so that when you decide to include phone numbers as a label in a table, Google Squared might offer “phone-#” as a synonym for “phone number.”
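One way synonym finding from table labels could work is sketched below. The heuristic is my assumption rather than something spelled out in this post: two labels are synonym candidates if they never appear together in the same table, yet tend to appear alongside the same other labels. The function names and sample schemas are hypothetical.

```python
def context(schemas, attr):
    """All labels that appear alongside `attr` in at least one schema."""
    return {a for s in schemas if attr in s for a in s if a != attr}

def synonym_score(schemas, a, b):
    """Score two labels between 0.0 and 1.0 as synonym candidates."""
    # Labels used side by side in one table probably mean different things.
    if any(a in s and b in s for s in schemas):
        return 0.0
    ca, cb = context(schemas, a), context(schemas, b)
    union = ca | cb
    # Jaccard overlap of the labels each one keeps company with.
    return len(ca & cb) / len(union) if union else 0.0

# Hypothetical sets of column labels from four tables.
schemas = [
    {"name", "address", "phone number"},
    {"name", "address", "phone-#"},
    {"name", "email", "phone number"},
    {"name", "address", "email"},
]

phone_score = synonym_score(schemas, "phone number", "phone-#")
other_score = synonym_score(schemas, "address", "email")
print(phone_score > other_score)  # True
```

Here “address” and “email” share a table, so they score zero, while the two phone labels never co-occur but keep similar company.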

If you’d like to delve more deeply into the kind of extraction of data that Webtables does, the following whitepapers are worth a look:

A related application is Google’s Fusion Tables, which allows you to upload spreadsheets and use visualization tools of different types to view data within your spreadsheets in different ways. One of the inventors behind Webtables and Fusion was interviewed about Google Fusion last August: Google Fusion Tables. Interview with Alon Y. Halevy

Google Fusion also enables you to merge data sets belonging to different owners, such as one that lets you see global earthquake activity since 1973 in the same image as the locations of nuclear power plants. Last summer, when I visited the town at the epicenter of an earthquake that had hit Virginia a few days before, I found myself rethinking the visit upon seeing a sign for a local nuclear power plant. I could have used that visualization.

Adding Semantics to Tables on the Web

Many tables you run across on the Web are missing header rows with attribute names, or have poorly named attribute labels in those rows. A Google paper presented at the 37th International Conference on Very Large Data Bases, held in late August and early September of 2011 in Seattle, Washington, describes how data collected from projects like the WebTables project can be used to annotate tables that are missing that kind of information. The paper is Recovering Semantics of Tables on the Web (pdf).

Here’s the “why” behind this approach:

In principle, we would like to associate semantics with each table in the corpus, and use the semantics to guide retrieval, ranking and table combination. However, given the scale, breadth and heterogeneity of the tables on the Web, we cannot rely on hand-coded domain knowledge. Instead, this paper describes techniques for automatically recovering the semantics of tables on the Web. Specifically, we add annotations to a table describing the sets of entities the table is modeling, and the binary relationships represented by columns in the tables.

So when someone searches for George Herbert Walker Bush’s favorite snack (pork rinds or popcorn?), and there are some tables on the Web that contain that kind of information, but most of them are unlabeled, this semantic recovery process could help return more results for that search, even if the term “favorite snack” doesn’t appear on the pages with those tables.

The paper does a pretty good job of presenting this semantic recovery process, but a patent application on the same topic also came out this week, and a number of the paper’s co-authors are listed as inventors on the patent filing.

Table Search Using Recovered Semantic Information
Invented by Jayant Madhavan, Chung M. Wu, Alon Halevy, Gengxin Miao, Marius Pasca, and Warren H. Y. Shen
US Patent Application 20120011115
Assigned to Google
Published January 12, 2012
Filed July 8, 2011

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for searching tables using recovered semantic information.

In general, one aspect of the subject matter described in this specification can be embodied in methods that include the actions of:

  • receiving a collection of tables, each table including a plurality of rows, each row including a plurality of cells;
  • recovering semantic information associated with each table of the collection of tables, the recovering including determining a class associated with each respective table according to a class-instance hierarchy including identifying a subject column of each table of the collection of tables; and
  • labeling each table in the collection of tables with the respective class.
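The steps in that abstract can be sketched roughly as follows. This is a simplified illustration and not the claimed method: the IS_A dictionary stands in for a class-instance hierarchy, the subject column heuristic (the first column whose values are all non-numeric and distinct) is my own assumption, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical class-instance pairs, standing in for a much larger hierarchy.
IS_A = {
    "george washington": ["president"],
    "john adams": ["president"],
    "thomas jefferson": ["president", "architect"],
}

def subject_column(rows):
    """Guess the subject column: the first one whose values are all
    non-numeric and distinct (an assumed heuristic)."""
    for col in range(len(rows[0])):
        values = [row[col] for row in rows]
        textual = all(not v.replace(".", "").replace("-", "").isdigit()
                      for v in values)
        if textual and len(set(values)) == len(values):
            return col
    return 0  # fall back to the leftmost column

def label_table(rows, is_a=IS_A):
    """Label a table with the class most of its subject cells belong to."""
    col = subject_column(rows)
    votes = Counter()
    for row in rows:
        for cls in is_a.get(row[col].lower(), []):
            votes[cls] += 1
    return votes.most_common(1)[0][0] if votes else None

# A table with no header row, like many found on the Web.
rows = [
    ["1789", "George Washington", "Virginia"],
    ["1797", "John Adams", "Massachusetts"],
    ["1801", "Thomas Jefferson", "Virginia"],
]

print(subject_column(rows))  # 1
print(label_table(rows))     # president
```

Once a table carries a class label like this, a search for that class could retrieve it even though the word never appears in the table itself.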

Takeaways

Google has published a number of patents and papers and launched a number of services that show how they might be using the semantics of content presentation on a web page to index that content, to answer questions, to even add labels to table columns that are missing attribute labels.

When Google first started extracting content from lists found on the Web to build Google Sets, they were taking advantage of the semantic structures of those lists, to understand which words might fit together as part of the same sets.

The “semantic closeness” of list items and the headings for those lists, or of HTML title elements and the words on those pages, or of HTML headings and the content they head, has some implications for copywriters who want to include a number of phrases on the same page.

For example, under this semantic closeness approach, the following two lists might be equally relevant for the phrases contained within them:

List # 1

bicycles
Schwinn
Huffy
Coker
Gitane
Kona
Mongoose

List # 2

Schwinn bicycles
Huffy bicycles
Coker bicycles
Gitane bicycles
Kona bicycles
Mongoose bicycles

Google might look at the locations of different sections on a page by using a crawling program that does a pseudo-rendering of a page to see where content on the page is located. Those locations could carry semantic meaning, such that, for example, links in a main content area might carry more weight than links in a footer on the page.

While Google has been introducing labels for different kinds of content for web publishers such as recipe or rating schema, they’ve created an Attribute Correlation Statistics Database powered by labels found on tables on the Web that enables them to do things like create labels for table columns that might be missing table headers with attribute labels.

Are you paying enough attention to the semantic structures on Web pages?

All parts of the 10 Most Important SEO Patents series:

Part 1 – The Original PageRank Patent Application
Part 2 – The Original Historical Data Patent Filing and its Children
Part 3 – Classifying Web Blocks with Linguistic Features
Part 4 – PageRank Meets the Reasonable Surfer
Part 5 – Phrase Based Indexing
Part 6 – Named Entity Detection in Queries
Part 7 – Sets, Semantic Closeness, Segmentation, and Webtables
Part 8 – Assigning Geographic Relevance to Web Pages
Part 9 – From Ten Blue Links to Blended and Universal Search
Part 10 – Just the Beginning

32 thoughts on “10 Most Important SEO Patents: Part 7 – Sets, Semantic Closeness, Segmentation, and Webtables”

  1. Yes, it’s valuable information, the same as your last six patent posts before.

    And these patents make it clearer that webmasters need to pay more attention to the “page segmentation” theory of Google bots, and try to make web pages more valuable with their content.

    Thanks!

  2. Great breakdown on the importance of lists. I have always preached to my clients to mix up their blog posts with paragraphs, ordered and un-ordered lists, images, and sub-headers.

    It seems only natural content has this sort of “mix” of content organization and markup. From my POV, using these markup styles reflects that the blog post is of higher quality than one not carrying such styles.

  3. Bill,

    This information actually has implications that, in my opinion, extend far beyond simple search.

    I can’t help but remember the scene out of the movie “The Matrix” when Morpheus explains to Neo how humans were in celebration of the birth of “Artificial Intelligence” right before things went “South”.

    This whole semantic relationship model sounds much like an infant making comparisons and associations and thereafter drawing unique “fill in the blank” conclusions. That’s how it starts. I am surprised people aren’t asking questions about this process from this standpoint.

    The entire Google algorithm is definitely heading in that direction I think. Bit by bit we are slowly building a mind, a true “decision engine”, but for real this time…not a marketing ploy.

    And what is this “entity” going to have in its possession the day it becomes sentient? . . . a picture perfect snapshot of humanity with all of its strengths and weaknesses. FWIW.

    Mark

  4. Thanks for publishing these SEO patents Bill, it’s been great following these posts. And helpful in figuring out the organized chaos we all know and love – Google’s algorithm.

  5. It seems to me that this and similar Google patents are aimed at returning more relevant results but not necessarily quality results. There may be millions of relevant results; the question of course is which are more useful. Although recent Google algorithm changes were claimed to lead to better quality results, in my personal view they are still low quality. When you search for information, really useful pages (such as .edu or .pdf pages) are often buried behind useless sites that provide just general words about the subject, but have the “right” domain name and keywords sprinkled across their pages.

  6. Pingback: The Week in Words - 1-13-2012: A Recap of Great Content We Read, Enjoyed & Shared.
  7. Hi Rajesh,

    Thanks for your kind words. Understanding things like how a search engine (not just Google) might segment a page, and how it might relate titles or headings or lists with other words on a page can be helpful.

  8. Hi Eric,

    Thank you.

    Using headings, subheadings, lists, and images can make a page more interesting, more readable, and the structure of that content has the potential to help the search engines understand your content better as well, when it goes to index it. They can make a page more fun to read as well.

  9. Hi Mark,

    I really liked this interview from last year with Google’s Head of Research, Peter Norvig:

    Search Algorithms with Google Director of Research Peter Norvig

    He is asked and answers a question or two about artificial intelligence there.

    His last statement in the interview says a lot about some of Google’s goals when it comes to search:

    And of course this gets to a deeper A.I. problem: not just understanding information and queries, but really understanding what the user needs and will find useful at a given moment, and serving it up in a way that’s perfectly digestible. It’s not just about human-computer interaction or information retrieval. It’s about how people learn and attain knowledge. We’re trying to move beyond just presenting information, and really focus on increasing people’s knowledge of the world. So Google needs to be “smart” in the sense of really understanding the user’s needs in order to help them build up their knowledge of the world.

  10. Hi Matt,

    Thank you. I’m constantly trying to make sense out of that chaos myself, and writing this series has been helpful in pulling some of the pieces together. Happy to hear that you’re enjoying it so far.

  11. Hi Lazar,

    Relevance is definitely one aspect of quality, and there are others that Google does seem to be starting to explore, but even defining relevance in the right context can be a challenge, and meeting that challenge well helps determine the quality of the results that you see.

    At one point in time, just retrieving pages on the Web that contained keywords found in a query was thought about as “relevance,” and that’s what search engines like AltaVista did. Google’s use of PageRank helped sort those by including a quality signal – PageRank, which may have some flaws in some ways, but still took advantage of a signal on the web – the links between pages, to include a quality signal that was independent of relevance. That’s part of the reason why I started with PageRank in this series of the 10 Most important SEO patent posts.

    Here are the others:

    Part 2 – The Original Historical Data Patent Filing and its Children – Search engines looking at temporal data associated with websites to try to weed out both spam and stale search results. Not so much about relevance as it was about getting rid of dated pages and pages that were changed to try to manipulate search results.

    Part 3 – Classifying Web Blocks with Linguistic Features – By using a visual segmentation approach, and trying to understand what the main content on a page might be, this approach focuses upon cutting down some of the noise on a page, and looking at what it has to offer that might be unique. It assumes that content and links within the main content area of a page might be more important and useful than content in footers or sidebars or head sections of a page. That might help relevance, but it also makes it less likely that search results will include pages where one of a searcher’s query terms is in a sidebar, another is in a footer, and another is in the main content area of a page.

    Part 4 – PageRank Meets the Reasonable Surfer – An improvement on Google’s early implementations of PageRank that would pass along more link value to links that were identified as likely being more important.

    Part 5 – Phrase Based Indexing – By looking at how good and meaningful phrases tend to co-occur on higher ranking web pages in response to certain queries, this approach focuses upon displaying higher quality pages in search results.

    Part 6 – Named Entity Detection in Queries – By understanding that some queries are about specific people, places, and things, this approach can take advantage of information about those named entities that exists on the Web, deliver searchers to pages that appear to be strongly associated with those people or places and things, and show better information about attributes and aspects of those entities in more meaningful ways.

    Part 7 – Sets, Semantic Closeness, Segmentation, and Webtables – By looking at the structure and organization of content found on the web, and the semantic properties of those structures, a search engine can create schema about different types of content, and relationships between words found within those structures.

    Search isn’t an easy problem, but I’m encouraged by a lot of the approaches that I’m seeing to try to deliver both relevant and quality results, and I think we will continue to see improvements.

  12. Thanks for a great article Bill. Your information is really useful and helpful to us. Google page segmentation is a great point to be discussed here. One thing I learned before I became a bestselling author and long before Inc Magazine voted my company as one of the fastest growing companies is that sometimes with the help of few articles like yours this one it could be a great help and specially the way you have explained it in diagram.

  13. Bill…I’m a newbie in seo. Never explored everything related to seo but your explanation on “page segmentation” would be a very useful thing for me I believe. I don’t know all seo professionals will/must know this.. but if they had a chance to read ..I think its the best.

  14. Hi Daniel,

    Thanks. I’ve seen a few copywriters who came to the Web from the print word, and have written blog posts, articles, and even books about writing copy for the Web who never mention or seem to grasp how the semantics of the web and different HTML elements might impact how both readers and search engines might interpret what they’ve written. Hopefully they will at some point.

  15. Hi Andrea,

    Thank you.

    When we learn to write in grade school, we learn about sentence structures, about grammar, about how to construct essays, and how to lead to a conclusion, and similar things.

    When it comes to the Web, we add headings and footers and sidebars to the information that we present, but we don’t always look at the “why” behind the best ways to try to do those things. Patents like the ones above have been having me asking myself and others a lot of questions about what the best way to present information on web pages might be and how search engines might interpret that information.

  16. You really always write great, comprehensive posts on SEO. I really appreciate the amount of work you put into your posts, thanks!

  17. Thanks very much, Jon.

    I’ve always been a believer that the more I put into a blog post, the more I can learn from it, and the more I can share. :)

  18. I followed all the articles and I really enjoyed this year, the SEO will be very controversial, we have to keep track of everything!
    Social media will come with everything, especially Google +, but the semantics, the on-page will never be left alone by their importance

    Katipsoi Zunontee

  19. Very interesting article. What are some products where you see Google using this functionality? Have you seen their best of the web 2011? Excellent use of these methods.

  20. Hey Bill, this is unrelated to the post; however, how did you get to PR 6 with this site so quickly? Are you doing outreach work or is your blog simply being linked to from a lot of sources?

    Thanks

  21. I think the list header format you mentioned is directly related to your previous post about headings. Search engines don’t necessarily need headers to determine the following content. It could be text that is bold, bigger, a different color, etc.

  22. Bill,

    Please, take a look also at another great Google invention, which has been patented in US, namely Patent Number: 7707157, filed on December 3, 2009.

    This patent is related to detection, so-called, near-duplicate components. In SEO terms this is about detection of a double-content. I looked through description of this Patent, and, it was really fascinating! In order to detect double-content they build a fingerprint corresponding to a webpage (based on checksum values). These fingerprints are being used for comparison with others web contents. Why is it great? The fingerprints are small, but provide great criteria for comparing different contents!

    OK. But, if you turn “around” such comparison to another goal of detection, namely to detect Semantic Closeness, then it works too!

    Thank you for your attention.

  23. Bill, fascinating read. You did a great job of describing the patents in a way that was useful yet not mind numbing. Prior to the read, I had never consciously thought that the order of my tables and words in that table had that large of an impact. It is also nice to see a deeper explanation as to the inner thinkings of Google and their technology.

  24. Completely agreed with your details, google search algorithm is based on set of relevant keywords which match the keywords searched on google. It is the only reason we find different results for “cure for arthritis” and “cure of arthritis” on google however meaning is same.

    It is believed that google also tracks activities and personal search is a result of that and probably future tool for sales. Company with actual data of users is King of tomorrow.

    Very well written by you “As you always do”,
    thanks a lot Bill……

    Your posts are always awaited..

  25. Hi Katipsoi,

    Thanks. Your feedback is appreciated.

    Social media does seem to be growing more and more important everyday, but I agree that understanding on page semantics and SEO is still very important to site owners and SEO.

  26. Hi Jacko,

    Google Sets and Google Squared have been retired from Google these days, and the announcement told us that technology would be used in other ways such as helping to identify query suggestions and related search suggestions that Google offers.

    The technology behind identifying semantically distinct regions of pages will likely have ongoing significant impacts to search, such as weighting content and links in different parts of pages differently, and even Google’s recent announcement about possibly devaluing sites that have too many advertisements above the fold is something possible through the use of that technology.

    The semantic closeness is quite possibly something built into Google’s web search at present as well.

    Not sure that I know what you mean by Google’s “best of the web 2011″.

  27. Hi Vince,

    This site has been around since 2005, and the homepage has been PageRank 6 before. It has been PageRank 5 for a while, but I was fortunate in attracting a fair number of links over the past year.

  28. Hi Ryan,

    When I started writing this post, that earlier post about headings was going to be the introduction to this post. It grew into a post of its own. :)

    I agree that Google doesn’t necessarily need to see an <h1> in use to treat something as a main heading, if it looks, acts, and feels like a main heading. The semantic closeness patent actually discusses that, and mentions that while they will look for explicit headings, they might do the same thing with “implicit” headings, where the structure presented on a page for a heading, and a list that follows it, for instance, doesn’t use either HTML element (headings or even list elements), but acts just like them.

  29. Hi Sergei,

    I’ve seen that patent before, but haven’t written about it. You might find this post interesting though:

    http://www.seobythesea.com/2008/02/new-google-process-for-detecting-near-duplicate-content/

    The finger print approach definitely could work to find similarities on web pages, but that would possibly be more a matter of the semantic closeness between different web pages, rather than a closeness between words on a single page based upon an explicit or implied semantic element (lists, titles, headings) or structure.

    It is pretty fascinating.

  30. Hi Jonathan,

    Thanks. Definitely one of the reasons why I like to dig through patents like these is to try to see the web and what search engineers do through their perspective. Sometimes I will run across a patent that describes something that I’ve seen or guessed Google was doing on my own, and the patent confirms it, and adds some vocabulary and some details that I hadn’t run across yet. I really like it when that happens.

  31. Hi Ankush

    One of the things that I find really fascinating though is that Google does some things that we might not anticipate or even expect, even if it is just a matter of them returning pages that have the keywords we searched for on them. We somewhat expect that the distance between two words matters, and then a patent like the semantic closeness one comes along, and tells us that every item in a list is the same distance away from the list’s heading as every other item.

    Personalized data about how people search is probably growing in importance to how Google does what they do, but they also are the ones who have the best access to that data.

    Thanks for your kind words.
