If you’ve been doing SEO for a while, one of the papers that you may have read describes how Google was attempting to index content found on the Web that might be difficult for their crawlers to access, such as financial statements from the SEC. The search engine would have to try to access this information by filling out a form and guessing good queries, because that was the only way to access the information – they couldn’t crawl it without querying it first. This paper describes efforts that Google undertook to access that information:

Google’s Deep-Web Crawl

From the abstract to the paper:

The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content.

A few years ago, I wrote a blog post that I titled “Solving Different URLs with Similar Text (DUST)” which described some of the difficulties that a search engine might have indexing some URLs.

A paper I came across this week combines those topics. It uses information from sources such as Freebase to better guess queries and crawl deep-web pages that focus upon products on commerce sites, pages that could be difficult to reach otherwise because Google might have to fill out forms to access them. Names of entities from sources such as Freebase (such as names of phones, like “iphone”) could be used as queries to find those deep-web pages. The paper is:

Crawling Deep Web Entity Pages (pdf)

The image below is from the paper, and illustrates how the process described within it works:

Entity Crawl System
A diagram of the entity crawl system described in the paper

The abstract from the paper tells us:

Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter, etc.), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems including query generation, empty page filtering and URL deduplication in the specific context of entity oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective.

A search at Freebase for [smartphones] provides a list of entities that the knowledge base describes as competing entities:

A Freebase list of smartphones.

The paper tells us that this type of information can be helpful in identifying queries that can be used to crawl content from a product-based ecommerce site:

Our first contribution is to show how query logs and knowledge bases (e.g., Freebase) can be leveraged to generate entity queries for crawling. We demonstrate that classical techniques for information retrieval and entity extraction can be used to robustly derive relevant entities for each site, so that crawling bandwidth can be utilized efficiently and effectively.
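
To make the idea of entity queries a little more concrete, here is a minimal sketch in Python of how a list of entity names might be turned into search URLs for a site's search form. This is my own illustration, not the paper's method: the function name, the shop.example.com address, and the "q" parameter are all hypothetical, and the paper's step of picking which entities are relevant to each site is not shown here.

```python
from urllib.parse import urlencode


def build_entity_queries(search_url, param, entity_names):
    """Return one search URL per distinct entity name.

    search_url and param are hypothetical: they stand in for whatever
    address and field name a site's search form actually uses.
    """
    seen = set()
    urls = []
    for name in entity_names:
        # Lowercase and collapse whitespace so "iPhone" and "iphone " match.
        key = " ".join(name.lower().split())
        if not key or key in seen:
            continue
        seen.add(key)
        # urlencode escapes the query text (spaces become "+").
        urls.append(f"{search_url}?{urlencode({param: key})}")
    return urls


# Entity names of the kind a knowledge base might list for phones.
queries = build_entity_queries(
    "https://shop.example.com/search", "q", ["iPhone", "iphone ", "HTC One"]
)
```

Each URL is one guess at a query that might surface a product page. Collapsing repeated names up front is one simple way to avoid spending crawl bandwidth on the same query twice.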

The paper also describes how it might try to filter content from these deep-crawled pages to avoid empty pages (pages without content focusing upon a specific entity) or pages that duplicate content under a different name.
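
A rough sketch of what those two filters might look like follows. It assumes, purely for illustration, that some URL parameters (such as a sort order) don't change what a page is about, and that an empty page can be spotted by a phrase like "no results found". The parameter names, marker phrases, and function names are hypothetical, and the paper's own techniques may well differ from this.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters assumed not to change the page's content.
IGNORED_PARAMS = {"sort", "sessionid", "ref"}

# Hypothetical phrases assumed to mark a page with no entity on it.
EMPTY_MARKERS = ("no results found", "0 items")


def canonical_url(url):
    """Drop ignored parameters and sort the rest so equivalent URLs match."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def is_empty_page(html):
    """Return True if the page text contains any of the empty-page markers."""
    text = html.lower()
    return any(marker in text for marker in EMPTY_MARKERS)


def filter_pages(pages):
    """Given (url, html) pairs, return canonical URLs worth keeping."""
    seen = set()
    kept = []
    for url, html in pages:
        if is_empty_page(html):
            continue  # nothing about a specific entity on this page
        key = canonical_url(url)
        if key in seen:
            continue  # same content already seen under another URL
        seen.add(key)
        kept.append(key)
    return kept


pages = [
    ("https://shop.example.com/item?id=7&sort=price", "<h1>Phone A</h1>"),
    ("https://shop.example.com/item?sort=name&id=7", "<h1>Phone A</h1>"),
    ("https://shop.example.com/search?q=zzz", "<p>No results found</p>"),
]
kept_urls = filter_pages(pages)
```

In this example the first two URLs collapse into one, because they differ only in sort order, and the third is dropped as an empty page.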


If you’ve been looking for a connection between the SEO of web-page crawling and the use of data from sources like knowledge bases, this paper describes such a connection: using data from a knowledge base such as Freebase to query the content of a deep-web database, such as an ecommerce site where content doesn’t surface to be crawled unless it is queried first.

As I was looking through this paper, I was impressed by the papers that were cited within it, and I wanted to look them up. After looking at a few, I decided that I’m probably going to be spending some time reading through many of them, so I created links to them. A few of them require ACM Membership to read, but many of them are freely accessible on the Web. You may find them interesting, too.


