Direct Answers: Extracting Text from Pages Citations

This is the last post in a series about Google’s International patent application Natural Language Search Results for Intent Queries.

This section was inspired by the citations list at the end of a paper that the listed inventors used as a provisional patent, which preceded that patent application. The paper was Scalable Attribute-Value Extraction from Semi-Structured Text (pdf).

I sometimes like to start looking through the documents I see listed as citations or footnotes in a paper I find interesting. As I started looking at the documents in that paper, I found many of them to be very interesting.

And then an idea struck me.

Rather than trying to cover just one or two of these papers, I’d share the process. Since the original paper was a PDF without any links in it, the chances of most people exploring those citations were very limited.

And yet some of these papers should be read.

There’s one on the Semantic Web from 1975, created by the Department of the Navy. There’s another from the 80s, and three more from the early 90s. They cover some basic concepts, such as wrappers, that matter to people interested in the Semantic Web and in search engines.

I don’t know all of the classic papers of the Semantic Web, or whether many of these fit into that category. But that’s why I’m sharing links to them, so that we can work on learning that together.

If you see something that strikes you as really interesting, please let me know in the comments.

Thanks, and I hope you find something really interesting in these.

The key modules involved in TextRunner: from "Open Information Extraction from the Web."

[1] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the web. (pdf) In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07), pages 2670–2676, Hyderabad, India, January 2007.

From "Finding parts in very large corpora."

[2] M. Berland and E. Charniak. Finding parts in very large corpora (pdf). In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), pages 57–64, College Park, MD, June 1999.

[3] R. C. Bunescu and R. J. Mooney. Collective information extraction with relational Markov networks (pdf). In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 439–446, Barcelona, Spain, July 2004.

[4] S. A. Caraballo. Automatic construction of a hypernym labeled noun hierarchy from text (pdf). In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), pages 120–126, College Park, MD, June 1999.

From "Weaving a web of ideas."

[5] S. M. Cherry. Weaving a web of ideas. IEEE Spectrum, 39(9):65–69, September 2002.

[6] W. W. Cohen, M. Hurst, and L. S. Jensen. A flexible learning system for wrapping tables and lists in HTML documents (pdf). In Proceedings of the 11th International World Wide Web Conference (WWW-02), pages 232–241, Honolulu, HI, May 2002. (Presentation (PDF))

[7] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms (pdf). Journal of Machine Learning Research, 7:551–585, 2006.

[8] D. Freitag and N. Kushmerick. Boosted wrapper induction (pdf). In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI-00), pages 577–583, Austin, TX, July 2000.

[9] M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-92), Nantes, France, August 1992.

From Customizing a lexicon to better suit a computational task.

[10] M. A. Hearst and H. Schütze. Customizing a lexicon to better suit a computational task. In Proceedings of the ACL-SIGLEX Workshop on Acquisition of Lexical Knowledge from Text, Columbus, Ohio, June 1993.

[11] I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical wrapper induction for semistructured information sources (pdf). Journal of Autonomous Agents and Multi-Agent Systems, 4:93–114, 2001.

[12] M. Pasca and B. Van Durme. Weakly-supervised acquisition of open-domain classes and class attributes from web documents and query logs (pdf). In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL-HLT-08), pages 19–27, Columbus, OH, June 2008.

[13] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (book on Amazon). Morgan Kaufmann, San Mateo, CA, 1988.

[14] M. Poesio and A. Almuhareb. Identifying concept attributes using a classifier (pdf). In Proceedings of the ACL Workshop on Deep Lexical Semantics, Ann Arbor, Michigan, June 2005.

[15] J. Pustejovsky. The Generative Lexicon (book at MIT). MIT Press, Cambridge, MA, 1995.

[16] Y. Shinyama and S. Sekine. Preemptive information extraction using unrestricted relation discovery (pdf). In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL (HLT-NAACL-06), pages 304–311, New York City, NY, June 2006.

From What’s in a link: Foundations for semantic networks

[17] W. A. Woods. What’s in a link: Foundations for semantic networks (pdf). In D. G. Bobrow and A. M. Collins, editors, Representation and Understanding: Studies in Cognitive Science, pages 35–82. Academic Press, New York, 1975.

[18] S. Zhao and J. Betz. Corroborate and learn facts from the web (Paid ACM Access Only). In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 995–1003, San Jose, CA, August 2007.

14 thoughts on “Direct Answers: Extracting Text from Pages Citations”

  1. Hey Bill,

    I commend you for putting this all together. I can’t help but feel like I’m reading some classified Roswell-level information with the photocopies and legitimate seals, ha.

    I also think that a deeper understanding of the history of SEO helps to predict trends of best practice and what future updates may be about

    Thanks for the smoking gun!


  2. Reading What’s in a link: Foundations for semantic networks challenged some of my presuppositions concerning semantics. It seemed like part II-2 was describing in broad strokes what a SERP is and the components necessary to bring it into being. Fascinating read, blows my mind that it was written in 1975 when I was barely the outcome of my father’s pursuit of my mom 😉 Thanks Bill.

  3. Hi Bill:

    I don’t know how to thank you for your awesome blog posts. I am sure that “SEO by the Sea” is the only resource for this kind of in-depth content and insights. Have a nice day!

    Best Regards
    Miraj Gazi

  4. I find citation 12 to be quite interesting since it really does telegraph the entity extraction efforts that are now fully underway.

    For me, it’s also gratifying to see them use some of the same processes I use to identify patterns or templates, something I refer to as ‘query classes’ (i.e. – [song] lyrics, [ailment] diagnosis etc.).

    I’ve always found those to be of great use when working with large clients.

    Thanks for the reading list Bill!

  5. Hi A.J.

    I’ve run across papers and patents from both of the authors of that paper while they’ve been working for Google, and I’ve found most of those pretty interesting as well. I’ve seen the templates or patterns language used before too. In some instances, Google seems to refer to some of those as canonical query formations as well.

    I’m not sure that you’d remember this post, but it covers some of the same territory as well:

    Does Google Search Google? How Google May Create and Use Synthetic Queries

    Glad that you enjoyed the list of citations from that paper. I found enough of the ones listed to be interesting that I was sure others would too.

  6. Hi Dan,

    I’ve been returning to that one over and over. I’m not sure that I understand the context of a lot of the statements made in that document – but I’m working on it. I think I need to find some other things written about semantic networks during that time (the seventies) to give it more perspective. It is puzzling, and yet very interesting at the same time.

  7. Hi Miraj,

    Thank you, and I hope you’re having a wonderful week. I don’t really know if there are too many people writing about how Google and the other search engines may be attempting to integrate the semantic web into their approaches, especially since most people are still writing about keyword-based search approaches, and link-based rankings. There is more to Google and Bing and Yahoo than those methods, and more to SEO as well.

  8. Hi Daniel,

    I was really struck by the appearance of some of the documents I was finding after performing searches for them – I got the same sense of being in an episode of the X-Files, seeing top-secret clearance documents. Going with a “The Truth is Out There” motif, I decided to use the citation list with links added to it as a blog post and include some of those seals. As I noted in the post, I wanted to share the experience. It looks as if you felt it too, which makes me happy that this format for this post succeeded.

    I think that history really helps with things too.


  9. Bill, this is quite an interesting share. I can’t believe these notes were written a long time ago. I am reading “What’s in a link” now and it’s very interesting.

    – Hamayon

  10. I’m running through these with interest. I basically read and collect precisely this type of information thinking it might be useful in the future.

    Soon enough you’ll be able to share snippets from articles just like this through Patdek. Instead of just clicking the image of the three key modules for TextRunner, the reader will be placed right in the document where that snippet came from, with the snippet (and others) highlighted, and able to read any other part of the document.

    The more snippets of these key prior art documents can be linked together, the easier it will be to analyze past and future innovations – or just study the topic in general. At least that’s the hope.

    Some real interesting stuff going back a fair ways you’ve pointed out. Very interesting. Lots of reading ahead.

  11. I agree with Bill! I just want to share my thoughts about this blog. SEO By The Sea is providing detailed info on SEO facts and that’s the only thing that keeps it different from the existing mob!

    John Pereless
    CEO, Pereless Software

  12. That’s a lot of information for us, and a lot of typing for you Bill, again. Thank you for being such a remarkable and explanatory source!
