How Google May Index Deep Web Entities

If you’ve been doing SEO for a while, you may have read a paper describing how Google attempted to index content on the Web that is difficult for its crawlers to access, such as financial statements from the SEC. The search engine had to reach this information by filling out forms and guessing good queries, because that was the only way in – the content couldn’t be crawled without being queried first. This paper describes the efforts Google undertook to access that information:

Google’s Deep-Web Crawl

From the abstract to the paper:

The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content.
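To make that "pre-computing submissions" idea concrete, here is a minimal sketch for a simple GET form – the URL and field values are hypothetical, and a real surfacing system would also have to decide which forms and which value combinations are worth submitting:

```python
from itertools import product
from urllib.parse import urlencode

def surface_form_urls(action_url, field_values):
    """Pre-compute GET submissions for an HTML form by enumerating
    combinations of its field values, yielding URLs a crawler can fetch
    and a search engine can index like any other page."""
    names = list(field_values)
    for combo in product(*(field_values[name] for name in names)):
        yield action_url + "?" + urlencode(dict(zip(names, combo)))

# A hypothetical search form with two select menus:
urls = list(surface_form_urls(
    "http://example.com/search",
    {"make": ["honda", "toyota"], "zip": ["94301", "10001"]},
))
# Four URLs, e.g. http://example.com/search?make=honda&zip=94301
```

Each generated URL behaves like an ordinary static page once fetched, which is what lets the results be added to a normal search index.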

A few years ago, I wrote a blog post that I titled “Solving Different URLs with Similar Text (DUST)” which described some of the difficulties that a search engine might have indexing some URLs.

A paper I came across this week combines those topics. It uses information from sources such as Freebase to generate better queries and to crawl deep-web pages focused on products at commerce sites – pages that could be difficult to reach otherwise, and that Google might have to fill out forms to access. Names of entities found at sources such as Freebase (names of phones, like “iPhone,” for instance) can be used as queries to find those deep-web pages. The paper is:

Crawling Deep Web Entity Pages (pdf). The image below is from the paper, and illustrates how the process described within it works:

A diagram of the entity crawl system described in the paper.

The abstract from the paper tells us:

Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web. While many deep-web sites maintain document-oriented textual content (e.g., Wikipedia, PubMed, Twitter, etc.), which has traditionally been the focus of the deep-web literature, we observe that a significant portion of deep-web sites, including almost all online shopping sites, curate structured entities as opposed to text documents. Although crawling such entity-oriented content is clearly useful for a variety of purposes, existing crawling techniques optimized for document oriented content are not best suited for entity-oriented sites. In this work, we describe a prototype system we have built that specializes in crawling entity-oriented deep-web sites. We propose techniques tailored to tackle important subproblems including query generation, empty page filtering and URL deduplication in the specific context of entity oriented deep-web sites. These techniques are experimentally evaluated and shown to be effective.

A search at Freebase for [smartphones] provides a list of different entities described as competing entities at the knowledge base:

A Freebase list of smartphones.

The paper tells us that this type of information can be helpful in identifying queries that can be used to crawl content from a product-based ecommerce site:

Our first contribution is to show how query logs and knowledge bases (e.g., Freebase) can be leveraged to generate entity queries for crawling. We demonstrate that classical techniques for information retrieval and entity extraction can be used to robustly derive relevant entities for each site, so that crawling bandwidth can be utilized efficiently and effectively.
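The core of that contribution – matching knowledge-base entity names against a site to pick crawl queries – might be sketched very loosely like this. The scoring and all names here are illustrative assumptions, not the paper’s actual method:

```python
def rank_entity_queries(candidate_entities, sample_pages):
    """Score candidate entity names (e.g., pulled from a knowledge base
    or query logs) by how often each appears in pages sampled from the
    target site, so crawl bandwidth goes to queries likely to return
    real results rather than empty pages."""
    scores = {}
    for entity in candidate_entities:
        needle = entity.lower()
        scores[entity] = sum(page.lower().count(needle) for page in sample_pages)
    # Highest-scoring entities become the queries submitted to the site.
    return sorted(scores, key=scores.get, reverse=True)

pages = ["Compare iPhone deals and iPhone cases", "Galaxy S phones on sale"]
queries = rank_entity_queries(["iPhone", "Galaxy S", "ThinkPad"], pages)
# "iPhone" outscores "Galaxy S"; "ThinkPad" never appears on this site.
```

A phone retailer’s pages would rank phone entities highly and laptop entities low, which is the sense in which the relevant entities are derived “for each site.”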

The paper also describes how the system filters the deep-crawled pages to avoid empty pages (pages without content focused on a specific entity) and pages that duplicate the same content under a different URL.
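As a rough illustration of those two subproblems, here is one way empty-page filtering and URL deduplication could be approached – the heuristics, thresholds, and parameter names below are assumptions for the sketch, not the paper’s algorithms:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def looks_empty(page_text, empty_signature, threshold=0.9):
    """Flag likely no-result pages by Jaccard word-overlap with a known
    empty page (e.g., one fetched using a nonsense query)."""
    a, b = set(page_text.split()), set(empty_signature.split())
    if not a or not b:
        return True
    return len(a & b) / len(a | b) >= threshold

def canonicalize(url, significant_params=("q", "id")):
    """Collapse near-duplicate URLs by dropping parameters (session ids,
    tracking codes) that don't change page content, and sorting the rest."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in significant_params)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

So two result URLs differing only in a session id would canonicalize to the same string and be crawled once, and pages that closely match a known “no results” page would be kept out of the index.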

Take-Aways

If you’ve been looking for a connection between the SEO side of web-page crawling and the use of data from sources like knowledge bases, this paper describes such a connection: using data from a knowledge base such as Freebase to query the content of a deep-web database, such as an e-commerce site whose content doesn’t surface to be crawled unless it is queried first.

As I was looking through this paper, I was impressed by the papers cited within it, and I wanted to look them up. After reading a few, I decided that I’m probably going to spend some time reading through many of them, so I created links to them. A few require ACM membership to read, but many are freely accessible on the Web. You may find them interesting, too.

References

[1] HTML 4.01 Specification, W3C Recommendation.
[2] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: different URLs with similar text (pdf). In Proceedings of WWW, 2006.
[3] L. Barbosa and J. Freire. Siphoning hidden-web data through keyword-based interfaces (pdf). In Proceedings of SBBD, 2004.
[4] L. Barbosa and J. Freire. Searching for hidden-web databases (pdf). In Proceedings of WebDB, 2005.
[5] L. Barbosa and J. Freire. An adaptive crawler for locating hidden-web entry points (pdf). In Proceedings of WWW, 2007.
[6] K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge (pdf). In Proceedings of SIGMOD, 2008.
[7] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web (pdf). In Proceedings of WWW, 1997.
[8] A. Dasgupta, R. Kumar, and A. Sasturkar. De-duping URLs via rewrite rules. In Proceedings of KDD, 2008.
[9] J. Guo, G. Xu, X. Cheng, and H. Li. Named entity recognition in query. In Proceedings of SIGIR, 2009.
[10] B. He, M. Patel, Z. Zhang, and K. C.-C. Chang. Accessing the deep web. Communications of the ACM, 50, 2007.
[11] M. A. Hearst. UIs for faceted navigation: recent advances and remaining open problems (pdf). In Proceedings of HCIR, 2008.
[12] A. Jain and M. Pennacchiotti. Open entity extraction from web search query logs (pdf). In Proceedings of COLING, 2010.
[13] H. S. Koppula, K. P. Leela, A. Agarwal, K. P. Chitrapura, S. Garg, and A. Sasturkar. Learning URL patterns for webpage de-duplication (pdf). In Proceedings of WSDM, 2010.
[14] J. Madhavan, S. R. Jeffery, S. Cohen, X. Luna Dong, D. Ko, C. Yu, and A. Halevy. Web-scale data integration: you can only afford to pay as you go (pdf). In Proceedings of CIDR, 2007.
[15] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy. Google’s Deep-Web crawl (pdf). In Proceedings of VLDB, 2008.
[16] G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling (pdf). In Proceedings of WWW, 2007.
[17] A. Ntoulas. Downloading textual hidden web content through keyword queries (pdf). In Proceedings of JCDL, 2005.
[18] M. Paşca. Weakly-supervised discovery of named entities using web search queries. In Proceedings of CIKM, 2007.
[19] Y. Qiu and H.-P. Frei. Concept based query expansion (pdf). In Proceedings of SIGIR, 1993.
[20] S. Raghavan and H. Garcia-Molina. Crawling the hidden web. Technical report, Stanford, 2000.
[21] P.-N. Tan and V. Kumar. Introduction to Data Mining.
[22] Y. Wang, J. Lu, and J. Chen. Crawling deep web using a new set covering algorithm. In Proceedings of ADMA, 2009.
[23] P. Wu, J.-R. Wen, H. Liu, and W.-Y. Ma. Query selection techniques for efficient crawling of structured web sources (pdf). In Proceedings of ICDE, 2006.

24 thoughts on “How Google May Index Deep Web Entities”

  1. Thanks for writing this post, Bill.

    Now that Freebase is closing down, it would be interesting to hear your thoughts on Wikidata and the future of Google’s Knowledge Graph. Maybe an idea for a future blog post? 🙂

  2. You’re welcome, Trond. I was really excited when I saw this document among the Google Research papers, and I remembered vividly when the original Deep Web paper came out. It’s difficult to tell what the future of Google’s Knowledge Graph might be with Freebase closing down and being replaced by Wikidata. It might be the subject of a future blog post.

  3. Nice analysis!
    You have done great research on deep web entities.

    It is really interesting to read this article and to learn more about how the crawl system works.

  4. Thanks for sharing this important article; we love it. I am always following your articles and always find them very helpful. Thank you again.

  5. I’ve been saying to people for a while that deep-web entities are the next big thing at Google. Thanks, Bill, for keeping us one step ahead of Google – love this blog.

  6. It seems Google is taking on something that might be bigger than they can handle. They are already shooting for giving priority to mobile sites–which is overdue. I’d like to see the deep links succeed.

  7. Hello Bill,

    Thank you so much for sharing this most useful new information with us. I never miss your articles, Bill, because I know you will share something different and valuable. Thanks again.

  8. I am kinda on the fence on this one. The deep web is the deep web for a reason. Some of the things you can only find by being very diligent are that way for a good reason. I get that Google wants to be able to index everything but the deep web should be left alone.

  9. Hi Kevin,

    There’s a lot of data on the Web that was meant to be accessible to a larger audience, but hasn’t been set up that way. For instance, the SEC EDGAR database: https://www.sec.gov/edgar/searchedgar/companysearch.html – there really isn’t a good reason why that information should be difficult to get to, and it’s projects like Google’s Deep Web project that make it possible to find the information much more easily. See also the Google Webmaster Central blog post: Crawling through HTML forms

  10. Hi Bill,
    This is a really interesting and helpful article. Earlier, one of my colleagues described the indexing and data collection used by Google. I have one question: if there is a web application running alongside a website where only internal users and employees can log in, and some sensitive information is stored in that application, will Google be able to collect that information too?

  11. Hi Bill, Awesome post! You have shared pretty deep insights. Your post shows that you did a lot of research before writing this great article. Thanks for sharing with us

  12. Hi Soni.

    It’s possible that Google might be able to access such information; so if you have things like important financial information within it, you may want to remove that.

  13. Gotta say, I’m loving the stuff you turn up from Google’s research.

    This model is pretty intriguing!
