I’ve been writing recently about a patent from Google on Direct Answers, and how Google might take those from authoritative sources, using an intent template process (“what are the symptoms for [measles, flu, athlete’s foot,ebola]”) to include many direct answer responses to natural language queries, while also showing keyword-based search results.
The patent doesn’t tell us about how such natural language direct answers are chosen by the search engine, but the following document, which shares the same authors as the inventors of the patent, and which was filed by them as a provisional patent, does give us some ideas on how those are found on the web.
We know that Google is looking for responses from pages that they consider to be “authoritative” pages.
We also knew that Google uses query templates to help identify the right pages among those authoritative pages to use content from to answer questions such as:
- What are the symptoms for measles
- What are the symptoms for chicken pox
- What are the symptoms for the flu
When it was published, as we can see just below, the identities of the authors was protected since it was “submitted for blind review.”
The paper tells us about how Google might grab information from pages on the Web, and can be found at:
The authors/inventors names are on the version at the Google research abstract (in orange, below):
The paper tells us right up front that it uses a process that makes it easy to find information on web pages.
Extracting Information based on Structural Contexts
In this paper, we present a general framework for extracting attribute-value pairs from web pages.
Specifically, we restrict our attention to attribute-value pairs that are expressed in structural contexts such as tables and colon-delimited pairs.
The main motivation is that a large number of attribute-value pairs that exist on the Web are encoded in such formats, and identifying these formats is relatively straightforward.
So information might be extracted from tables like the following from a Wikipedia infobox:
In addition to two column tables like that, tables with additional rows are pointed to in the paper. It also tells us that it might grab attribute value information from pairs of things which are formatted and separated by colons, like this:
Extracting Information based upon Patterns
The paper points out another source that could be used to extract information in the form of patterns. These patterns are like the query intent templates that the patent points at:
Most such work has been devoted to the acquisition of WordNet-style relations between pairs of concepts. Work specifically directed towards extracting attributes of concepts was performed by Poesio and Almuhareb .
Their system generates candidates using the pattern “the X of the Y (is Z)”, the hypothesis being that X is an attribute of the concept described by the noun phrase Y, and Z, if it appears, is the corresponding value.
Google has published much more detailed looks at how they might capture information from patterns.
If you think you might like it if your pages were shown as the sources for direct answers, striving to make your pages seen as authoritative pages is a good first step.
Understanding how tables and colon delimited pairs might be used as sources for information can be important too.
Using patterns for content on your pages for related topics can be another way of enticing Google to extract information from your pages.
The paper also refers to a program called Text Runner, which involves an Open Information Extraction approach to learning from the Web. The processes described in the paper have a lot of parts and involve a lot of complex looks at the information being extracted to avoid extracting information that doesn’t answer questions.
The paper also describes the process of using wrappers, which I haven’t discussed here before. I will in the next and final post for this series.
Of course, we will probably look at many other posts and topics that involve how SEO and the Semantic Web are crossing paths and finding answers to questions that people might pose at the search engines.