Search Engines Extracting Table Data on the Web

The Web is filled with page after page after page of data. That data is usually organized differently from one site and one page to another, and contained in text, in pictures, in videos, in audio, in columns, in rows, in frames, and many other formats.

When a search engine spider comes to a page on the Web, it will try to go through all of the text it finds, make note of links to other pages, consider alt text for images, and view meta data tags.

Search engine spiders decide whether or not the content of a page should be indexed by the search engine, and determine which links to follow next.

Sometimes search engine spiders will pick out part of a page to treat a little differently for one reason or another. They might extract specific types of information, or look for data in specific formats. For instance, Google might find a list on a page, and send information about the list to the database for Google Sets (no longer available). I wrote about some of the details in a post about the Google Sets patent.

Instead of looking for lists, what if Google focused upon taking information from tables that contain meaningful data (as opposed to tables that might be used on a web page to control the formatting of part or all of a page)?

What if it took all those data-filled tables, created a separate database just for them, and tried to understand which of those tables might be related to each other? What if it then allowed people to search through that data, or combine the data in those tables with other data that they own, or that they found elsewhere on the Web?

Why just look at tables of data? The answer to that has something to do with the structure of data in tables.

Because a table is structured like the one below, with labels for each column, a search engine might be able to extract that information with the associated labels, and store it in a database where it could be accessed by searchers later.

Popular top level domains and Google Results:

tld  | type                                       | Google Results
.com | commercial                                 | 6,930,000,000
.net | Network services                           | 1,980,000,000
.org | Noncommercial                              | 1,940,000,000
.jp  | Japan                                      | 1,760,000,000
.de  | Germany                                    | 1,660,000,000
.uk  | United Kingdom                             | 770,000,000
.fr  | France                                     | 583,000,000
.edu | US accredited postsecondary institutions   | 294,000,000
.ca  | Canada                                     | 291,000,000
.gov | United States Government                   | 185,000,000
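
To make the idea of extracting values together with their labels a little more concrete, here is a minimal sketch in Python. It assumes the table is marked up with ordinary tr, th, and td tags and that the first row holds the column labels. The names TableExtractor, table_to_records, and SAMPLE_HTML are made up for this illustration, and nothing here is taken from how Google actually does it.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> on a page, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Treat the first row as column labels and pair each later row with them."""
    parser = TableExtractor()
    parser.feed(html)
    labels, *data_rows = parser.rows
    return [dict(zip(labels, row)) for row in data_rows]


# Hypothetical markup for the first two data rows of the table above.
SAMPLE_HTML = """
<table>
  <tr><th>tld</th><th>type</th><th>Google Results</th></tr>
  <tr><td>.com</td><td>commercial</td><td>6,930,000,000</td></tr>
  <tr><td>.net</td><td>Network services</td><td>1,980,000,000</td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)
print(records[0])
```

Each record keeps the label next to its value, which is what would let a database answer a question like "which rows have a type of commercial?" later on.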

Structured and Unstructured Data

When you go from site to site on the World Wide Web, you see a wide range of formats and organization of content and information.

Many of the pages that you find on the web could be considered to consist of unstructured data, information that isn’t strictly tables of labels and values for those labels. But many web pages also contain tables that do contain structured data, which is much more organized. It might be interesting if those tables could be removed from web pages, and placed into an index where the data within them might be compared.

Google Research on Table Data

A couple of papers from Google researchers explore the extraction of data and labels for that data from HTML tables on the Web, so that the information found in those tables can be searched for by keywords, and used in other ways, such as using it to create mashups from information gathered from different tabular sources.
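
As a rough illustration of what a mashup of two tabular sources could look like, the sketch below joins two small tables on a column label they share. The function name join_tables and the second table are assumptions made for this example (the values are borrowed from the top level domain table earlier in this post), and the papers may well combine tables in a different way.

```python
def join_tables(left, right, key):
    """Inner-join two lists of row dicts on a shared column label."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the two rows; the shared key column appears once.
            joined.append({**row, **match})
    return joined


# Two hypothetical tables that might come from different pages.
tld_counts = [
    {"tld": ".jp", "Google Results": "1,760,000,000"},
    {"tld": ".de", "Google Results": "1,660,000,000"},
    {"tld": ".edu", "Google Results": "294,000,000"},
]
tld_countries = [
    {"tld": ".jp", "country": "Japan"},
    {"tld": ".de", "country": "Germany"},
]

mashup = join_tables(tld_counts, tld_countries, "tld")
for row in mashup:
    print(row)
```

The .edu row drops out because the second table has nothing to say about it, which is the usual behavior of an inner join.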

A good percentage of tables found on the Web were left out of the research, such as very small tables, those used for formatting of pages, calendars, and other uses that didn’t involve a meaningful display of related data.
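
The researchers describe using statistical classification to make that cut. The sketch below is not that classifier. It is a handful of hand-written rules, with thresholds I picked for illustration, meant only to show the kinds of signals that could separate a data table from a layout table. It makes no attempt to spot calendars.

```python
def looks_relational(rows, min_rows=3, min_cols=2):
    """Rough guess at whether a table holds related data rather than layout.

    `rows` is a list of rows, each a list of cell strings. The thresholds
    are illustrative assumptions, not values from the papers.
    """
    if len(rows) < min_rows:
        return False                  # very small table
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        return False                  # ragged rows suggest page layout
    if widths.pop() < min_cols:
        return False                  # a single column is closer to a list
    header = rows[0]
    if any(not cell.strip() for cell in header):
        return False                  # every column needs a label
    return True


data_table = [
    ["tld", "type"],
    [".com", "commercial"],
    [".net", "Network services"],
]
layout_table = [["navigation links", "main content"]]
ragged_table = [["logo", "banner"], ["sidebar"], ["footer", "copyright"]]

print(looks_relational(data_table))
print(looks_relational(layout_table))
print(looks_relational(ragged_table))
```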

A massive undertaking, the research provides a different way of thinking about how search engines might crawl web pages to find information to return to searchers. Here’s a description of the data used in this research:

We extracted 14.1 billion HTML tables from Google’s general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own “schema” of labeled and typed columns, each such table can be considered a small structured database.

The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude.
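
To show what a "schema" of labeled and typed columns might mean for a single table, here is a small sketch that reads the labels from the first row and guesses a type for each column. The two type names and the functions infer_type and infer_schema are assumptions for this example. The papers do not spell out their typing rules in the passage quoted above.

```python
def infer_type(values):
    """Return "integer" if every value is digits (commas allowed), else "text"."""
    cleaned = [value.replace(",", "") for value in values]
    if cleaned and all(value.isdigit() for value in cleaned):
        return "integer"
    return "text"


def infer_schema(rows):
    """Map each column label (first row) to a guessed type for its values."""
    labels, *data = rows
    columns = zip(*data)  # turn the data rows into columns
    return {label: infer_type(column) for label, column in zip(labels, columns)}


rows = [
    ["tld", "type", "Google Results"],
    [".com", "commercial", "6,930,000,000"],
    [".net", "Network services", "1,980,000,000"],
]

schema = infer_schema(rows)
print(schema)
```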

These tables are found on freely accessible web pages, and are not tables of data hidden behind logins and forms in the deep web. One example of the kind of table that might be included in this data is a table listing the Presidents of the United States.

The conclusion in one of the two papers notes that tables might not be the only kind of structured data that might be explored using a similar approach in the future.

Finally, we would like to also include relational data derived from more than just HTML tables. Potential data sources that researchers have studied include tabular layouts that do not use the table tag, deep web databases, socially-tagged data items, HTML-embedded lists, and natural language text.

Structured data from tables and other formats, found amongst the many different pieces of unstructured data found on pages of the Web, may lead to new ways to find information on those web pages. Something to think about the next time you see a table on the Web, or build a table for a web page.

15 thoughts on “Search Engines Extracting Table Data on the Web”

  1. This makes a lot of sense. I am actually surprised that this was not happening before….

    However, a key has to be put into context: if city is a key, then the page should show up only for a [City name] + [contextual] reference.

    I didn’t read the paper but I expect that they should have already covered it.

    Thanks for keeping all of us up to date

    ~r

  2. Hi Rajat,

    You’re welcome. Thanks for sharing your thoughts on this post.

    The papers are worth a look if you have some time. The WEBTABLE search system may use a different approach than you might expect if you don’t dive in, and learn more about it.

    For example:

    WebTables users can issue queries that include various spatial operators like samecol and samerow, which will only return results if the search terms appear in cells in the same column or row of the table. For example, a user can search for all tables that include Paris and France on the same row, or for tables with Paris, London, and Madrid in the same column.
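
    Here is a small sketch of how operators like those could behave on a single table. It assumes a table is a rectangular list of rows, that a term has to match a whole cell, and that case doesn't matter. Those are my assumptions for the example, and the real WebTables operators may match differently.

```python
def samerow(table, *terms):
    """True if some row contains every term (whole-cell, case-insensitive)."""
    wanted = {term.lower() for term in terms}
    return any(wanted <= {cell.lower() for cell in row} for row in table)


def samecol(table, *terms):
    """True if some column contains every term; assumes rectangular rows."""
    columns = list(zip(*table))  # transpose rows into columns
    return samerow(columns, *terms)


# A hypothetical table of capital cities.
capitals = [
    ["city", "country"],
    ["Paris", "France"],
    ["London", "United Kingdom"],
    ["Madrid", "Spain"],
]

print(samerow(capitals, "Paris", "France"))
print(samecol(capitals, "Paris", "London", "Madrid"))
print(samerow(capitals, "Paris", "Spain"))
```

    The last query comes back false because Paris and Spain both appear in the table, but never on the same row.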

  3. Although table data is structured, I’m not sure how a user will find a table that they’re looking for. A table with London, Paris, etc. could mean anything.

  4. Hi Adam

    Hopefully Google will release the Webtables application as a beta project at some point in time, and we can get a better sense of how queries operate.

    The Data Management Projects at Google report in the SIGMOD Record, March 2008 (Vol. 37, No. 1) mentions the need for the creation of a useful query tool to help search through the contents of the Webtable database.

    The papers describe the first stage of the webtable project – the challenges behind extracting the information. It seems like they have a number of ideas for making the data contained in the Webtables database useful that need to be fleshed out more fully.

    I would like to see the demos that are mentioned in this presentation related to the Webtables project, which hints at ways to query data within webtables:

    http://mit.edu/~y_z/www/slides/webtables-presentation-google07.pdf

  5. I am not sure, but someday we shall have to manage Google bots in that way. Perhaps the basic concepts of databases (primary/foreign keys) should be there to simplify the whole algorithm. Maybe we shall have to be janitors of our sites to block or unblock Google bots. I think the current sitemaps schema is not efficient enough to handle all this.

  6. Hi Tom,

    Some of the basic concepts of database administration, such as assigning primary keys, can be really useful. I wrote about Google trying to take that approach with local search in a post a while back – A Google Approach to Improving Location Information Accuracy

    There have also been more than a couple of papers from Google on extracting information about named entities from Web pages, which attempt to define an object – a specific person, place, or thing – and then associate pages, or parts of pages, with that object.

    And Google has their own set of janitors attempting to clean up data on the Web, from the facts they find out about those objects.

  7. This is an interesting post, especially considering that new CSS standards seem to be moving away from the use of standard html tables for laying out web pages and data.

    As more and more web designers use strict html with css to display web pages and data, it will be interesting to see how information from old html table formats is used by search engines like Yahoo and Google, especially if the data indexed is as outdated as the html markup itself.

  8. Hi People Finder,

    Using tables for layout has been a hack that people have been using ever since tables have been available, and I’ve done it more than a couple of times myself. Really good point.

    It does look like the recommendations for the latest CSS version heavily emphasize moving away from tables for the layout of pages; see this page on Advanced layout in CSS from April of 2008.

    But there does seem to be room for tables under the latest CSS, too. And they may be very much like the tables under the most current version of CSS. Using tables to display data still seems to be an acceptable practice under CSS standards.

    There’s also some discussion of the use of tables in HTML 5.

    The papers do describe an attempt to leave out tables that exist only for layout purposes, so the information that is extracted from tables may continue to be useful. And the authors do mention that they may try to extract data from other structured parts of pages found on the Web. It’s worth keeping an eye on what they may come up with in the future. :)

  9. Recently I came across a statement that content is king and links are queen for effective SEO. And I just wanted to say that structuring data is important. And this is not only for robots but also for users, to make the site convenient. After all, sites are made for people.

  10. Hi Web development,

    I’ve heard people use the content as king analogy, and discuss the importance of links.

    I’m a strong believer in context rather than just content – the right links in the right place, the right words in the right place, and so on, are important to searchers and search engines.

    Considering the structure of a page, and how a search engine might take advantage of that structure, is definitely something to think about. The right structure in the right place may have significant meaning for search engines.

  11. Reminds me about all the articles regarding “neural computing” we were flooded with a while back. When computers can handle a tiny fraction of what the human brain can store visually it will really change “the algorithm”.

  12. Hi Ed,

    Interesting comment. I think this research, attempting to find structured data on web pages filled with unstructured data, is an interesting approach because it’s working to make sense of the data it finds on the Web in a useful fashion. The idea is a little similar to Google Sets, but takes the concept further by looking at the labels associated with data found in tables to see if there might be a relationship between those labels.

    I’m looking forward to seeing where it leads.

  13. I think if some way could be developed to efficiently distinguish between tables for layout purposes and tables of data, this could be very interesting. With Google spanning the number of webpages it currently does, it would probably be able to collect a sample of data from the web on every category on Wikipedia.

    The great Google thinking machine!

  14. Hi Web Design Horsham,

    It does sound like they’ve done a lot of experimentation to back this process, and have found ways to keep tables out that are used solely for layout. I’m hoping that this webtable search is something we see available for public use on the Google experimental search pages sometime soon.
