Mining User Queries to Extract Information From Web Pages

A new Google paper describes an interesting shift in how Google might find information on the web: applying artificial intelligence methods to query log files.

We often talk about how Google indexes web pages by looking at relevance factors (how well a page matches the query used) and quality factors (which look at the importance of pages, through something like PageRank).

But there are a number of Google databases which focus upon extracting information from web pages, and a few Google patent applications which describe such efforts by the search engine.

Examples include relating answers to questions in Google Q&A, joining business and location information to be used in local search, finding reviews to extract and aggregate, and pulling information from the Deep web.

Extracting Information from Users’ Queries

A Google paper presented at the International Joint Conference on Artificial Intelligence last month, What You Seek is What You Get: Extraction of Class Attributes from Query Logs (pdf), takes a different approach.

Instead of attempting to extract the information directly from web pages as Google crawls the web, it looks at the search engine’s log files and at users’ interactions with pages. The abstract to the paper tells us:

Within the larger area of automatic acquisition of knowledge from the Web, we introduce a method for extracting relevant attributes, or quantifiable properties, for various classes of objects.

The method extracts attributes such as capital city and President for the class Country, or cost, manufacturer and side effects for the class Drug, without relying on any expensive language resources or complex processing tools.

In a departure from previous approaches to large-scale information extraction, we explore the role of Web query logs, rather than Web documents, as an alternative source of class attributes.

The quality of the extracted attributes recommends query logs as a valuable, albeit little explored, resource for information extraction.

The authors state that they believe their work is the first effort of its type: extracting this kind of information from query logs rather than from a collection of documents.

Experimenting with Query Log Extraction

In addition to describing how this might be done, they report on an experiment that applied the method to a random sample of 50 million (fully anonymized) Google queries taken from the first few months of 2006.

These are the classes, and the attributes related to each, that they tried to extract from query logs:

Country: capital, population, president, map, capital city, currency, climate, flag, culture, leader

Drug: side effects, cost, structure, benefits, mechanism of action, overdose, long term use, price, synthesis, pharmacology

Company: ceo, future, president, competitors, mission statement, owner, website, organizational structure, logo, market share

City: population, map, mayor, climate, location, geography, best, culture, capital, latitude

Painter: paintings, works, portrait, death, style, artwork, bibliography, bio, autobiography, childhood
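
To make the idea a little more concrete, here is a rough sketch of how attributes like the ones above could be pulled out of queries. This is not the method from the paper, which the post does not walk through in detail. It is a toy illustration that assumes we already have a small seed list of instances for a class (here, three country names) and that we look for two simple query shapes: "attribute of instance" and "instance attribute". The queries, the instance list, and the function name extract_attributes are all made up for the example.

```python
import re
from collections import Counter

# Hypothetical toy data: made-up queries and a small seed set of instances
# for the class Country. Real query logs would be far larger and noisier.
COUNTRY_INSTANCES = {"france", "japan", "brazil"}

QUERIES = [
    "capital of france",
    "population of japan",
    "what is the capital of brazil",
    "japan currency",
    "france flag",
    "capital of japan",
    "cheap flights to brazil",
    "brazil population",
]


def extract_attributes(queries, instances):
    """Count candidate attributes that appear alongside known class instances."""
    counts = Counter()
    names = "|".join(re.escape(name) for name in sorted(instances))
    # Shape 1: "<attribute> of <instance>", optionally led by "what is the".
    of_pattern = re.compile(
        rf"^(?:what is )?(?:the )?(.+?) of (?:the )?({names})$"
    )
    # Shape 2: "<instance> <attribute>".
    suffix_pattern = re.compile(rf"^({names}) (.+)$")

    for query in queries:
        q = query.lower().strip()
        match = of_pattern.match(q)
        if match:
            counts[match.group(1)] += 1
            continue
        match = suffix_pattern.match(q)
        if match:
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    # Print candidate attributes, most frequent first.
    for attribute, count in extract_attributes(QUERIES, COUNTRY_INSTANCES).most_common():
        print(attribute, count)
```

On this toy input, "capital" and "population" come out on top, while a query like "cheap flights to brazil" is ignored because it fits neither shape. A real system would need much more than frequency counts to filter out noise, since a query such as "japan airlines" would otherwise turn "airlines" into a candidate attribute of Country.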


In the conclusion to the paper, the authors note that they may attempt to “incorporate additional evidence from more traditional sources, such as natural language text and semi-structured text (e.g., tables), which may help further improve the quality of the output.”

If you would like to learn a little more about some recent research on information extraction at Google, these papers are worth a look:

6 thoughts on “Mining User Queries to Extract Information From Web Pages”

  1. You’re right, Michael. There are going to be some limitations with an approach like this.

    But, it does provide an additional set of data points to look at, and compare to information extracted directly from documents on the web.

    And, just as Yahoo is tracking and trying to understand trends, it’s probably not a long stretch to say that Google is, too.

    One thing that I really appreciated about the paper and the research is that they did experiment with the approach, and if they were to start using this log analysis to extract information, I would imagine that they would do a good deal more.

    It may be that this works best with some classes and categories, and not others. They do discuss that in light of the results of their experiments.

    Might this be an effective approach in conjunction with other methods? Maybe.

  2. I see some limitations in this type of analysis. The queries would be largely driven by news trends, fads and fashions, etc. Hence the search engine would have to decide whether to limit the scope of query evaluation or to allow historical trends to dominate. Either way, you’ll end up with queries weighted toward a set of linguistic classes that overlap on names and attributes but are nonetheless distinct.

  3. 🙂

    I do like that they are exploring areas like this, Darren.

    They have all that data – if they can find a good use for it, it doesn’t hurt to try. I imagine that the sheer volume of information in log files might even be a little overwhelming.

  4. I think you’re right. It will be especially tough to correlate all of it. But they have a mountain of data and the real trick for them is making sense out of all of it.
