Direct Answers: How Answers are Extracted from Web Pages

I’ve been writing recently about a patent from Google on Direct Answers, and how Google might take those answers from authoritative sources. The patent describes an intent template process (“what are the symptoms for [measles, flu, athlete’s foot, ebola]”) that lets Google show direct answers in response to many natural language queries, while also showing keyword-based search results.

The patent doesn’t tell us how the search engine chooses such natural language direct answers. But the following document, which was written by the inventors of the patent and filed by them as a provisional patent, does give us some ideas about how those answers are found on the web.

We know that Google is looking for responses from pages that they consider to be “authoritative” pages.

We also know that Google uses query templates to help identify which of those authoritative pages to take content from when answering questions such as:

  • What are the symptoms for measles
  • What are the symptoms for chicken pox
  • What are the symptoms for the flu
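
To make the template idea concrete, here is a minimal sketch of how a query might be matched against an intent template. This is my own illustration, not something taken from the patent or the paper: the template string, the `[entity]` slot syntax, and the function names are all assumptions.

```python
import re

# Hypothetical template; the bracketed slot stands for any entity (a disease here).
TEMPLATE = "what are the symptoms for [entity]"


def template_to_regex(template):
    """Turn a template with [slot] markers into a regex with named groups."""
    parts = re.split(r"\[(\w+)\]", template)
    # Even positions hold literal text, odd positions hold slot names.
    pattern = "".join(
        re.escape(part) if i % 2 == 0 else rf"(?P<{part}>.+?)"
        for i, part in enumerate(parts)
    )
    # Allow an optional trailing question mark and ignore case.
    return re.compile(rf"^{pattern}\??$", re.IGNORECASE)


def match_query(query, template=TEMPLATE):
    """Return the filled slots if the query fits the template, else None."""
    m = template_to_regex(template).match(query.strip())
    return m.groupdict() if m else None


print(match_query("What are the symptoms for the flu"))
print(match_query("measles vaccine schedule"))
```

The first call fills the `entity` slot with "the flu"; the second query does not fit the template, so it returns `None` and would presumably be handled as an ordinary keyword search.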

When the paper was published, as we can see just below, the identities of the authors were protected since it was “submitted for blind review.”

This paper is from the same authors as the natural language query patent I've been writing about and was filed as an early provisional patent by them.

The paper tells us about how Google might grab information from pages on the Web, and can be found at:

Scalable Attribute-Value Extraction from Semi-Structured Text

The authors/inventors names are on the version at the Google research abstract (in orange, below):

The paper was originally submitted for presentation at a conference on large-scale data mining. Perhaps that’s where it was “blindly submitted” to.

The paper tells us right up front that it uses a process that makes it easy to find information on web pages.

Extracting Information based on Structural Contexts

In this paper, we present a general framework for extracting attribute-value pairs from web pages.

Specifically, we restrict our attention to attribute-value pairs that are expressed in structural contexts such as tables and colon-delimited pairs.

The main motivation is that a large number of attribute-value pairs that exist on the Web are encoded in such formats, and identifying these formats is relatively straightforward.

So information might be extracted from tables like the following from a Wikipedia infobox:

On the left are attributes, and on the right are values for them.
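
Here is a rough sketch of what pulling attribute-value pairs out of a two-column table could look like. It is not the paper's method, just an illustration using Python's standard `html.parser` module; the class name, the function name, and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class TwoColumnTableParser(HTMLParser):
    """Collect (attribute, value) pairs from table rows with exactly two cells."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._row = None   # cell texts of the row being read
        self._cell = None  # text chunks of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Join the chunks and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Keep only rows that look like "attribute | value".
            if len(self._row) == 2 and all(self._row):
                self.pairs.append(tuple(self._row))
            self._row = None
            self._cell = None


def extract_pairs(html):
    parser = TwoColumnTableParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Made-up sample markup, loosely shaped like an infobox.
SAMPLE_HTML = """
<table class="infobox">
<tr><th>Capital</th><td>Paris</td></tr>
<tr><th>Currency</th><td>Euro</td></tr>
<tr><td colspan="2">A caption row with only one cell</td></tr>
</table>
"""

print(extract_pairs(SAMPLE_HTML))
```

Rows with one cell (captions, section headers) are skipped, which is one simple way to keep noise out of the extracted pairs.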

In addition to two-column tables like that, the paper also points to tables with additional rows. It tells us as well that it might grab attribute-value information from pairs of items that are formatted and separated by colons, like this:

When items on a page are separated by colons like this, they are often related.
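
A colon-delimited pair is easy to spot with a regular expression. The sketch below is only an illustration under my own assumptions (one pair per line, a short label before the colon, whitespace after it); the regex, the function name, and the sample text are not from the paper.

```python
import re

# Assumption: a short label, a colon, at least one space, then the value.
# Requiring whitespace after the colon keeps URLs like "http://..." out.
PAIR_RE = re.compile(r"^\s*([^:\n]{1,40}?)\s*:\s+(\S.*?)\s*$")


def extract_colon_pairs(text):
    """Return (attribute, value) tuples for lines shaped like 'Label: value'."""
    pairs = []
    for line in text.splitlines():
        m = PAIR_RE.match(line)
        if m:
            pairs.append((m.group(1), m.group(2)))
    return pairs


# Made-up sample text.
SAMPLE_TEXT = """
Color: Blue
Weight: 2 kg
This line has no delimiter
Visit http://example.com today
"""

print(extract_colon_pairs(SAMPLE_TEXT))
```

Only the first two lines of the sample are picked up. A real system would need far more filtering than this, since many colons on a page do not separate an attribute from its value.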

Extracting Information based upon Patterns

The paper points out another source that could be used to extract information in the form of patterns. These patterns are like the query intent templates that the patent points at:

Most such work has been devoted to the acquisition of WordNet-style relations between pairs of concepts. Work specifically directed towards extracting attributes of concepts was performed by Poesio and Almuhareb [14].

Their system generates candidates using the pattern “the X of the Y (is Z)”, the hypothesis being that X is an attribute of the concept described by the noun phrase Y, and Z, if it appears, is the corresponding value.
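
As a rough sketch of the “the X of the Y (is Z)” pattern quoted above, a regular expression can generate candidate attributes, concepts, and optional values from plain sentences. This is a simplification of my own, not the system from the cited work: the regex, the function name, and the sample sentences are assumptions.

```python
import re

# "the X of the Y" with an optional "is Z"; the match must end at
# punctuation or at the end of the text.
PATTERN = re.compile(
    r"\bthe (?P<attribute>[\w ]+?) of the (?P<concept>[\w ]+?)"
    r"(?: is (?P<value>[^.,;]+))?(?=[.,;]|$)",
    re.IGNORECASE,
)


def extract_candidates(text):
    """Return (attribute, concept, value) candidates; value may be None."""
    results = []
    for m in PATTERN.finditer(text):
        value = m.group("value")
        results.append(
            (m.group("attribute"), m.group("concept"), value.strip() if value else None)
        )
    return results


# Made-up sample sentences.
SAMPLE = "The capital of the country is Paris. We measured the height of the tower."

print(extract_candidates(SAMPLE))
```

The first sentence yields an attribute, a concept, and a value; the second yields only an attribute and a concept, with `None` for the value. These are just candidates, and a pattern this loose will also match plenty of phrases that are not really attributes.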

Google has published much more detailed looks at how they might capture information from patterns.

Takeaways

If you would like your pages to be shown as the sources for direct answers, striving to have those pages seen as authoritative is a good first step.

Understanding how tables and colon-delimited pairs might be used as sources for information can be important too.

Using patterns for content on your pages for related topics can be another way of enticing Google to extract information from your pages.

The paper also refers to a program called TextRunner, which takes an Open Information Extraction approach to learning from the Web. The processes described in the paper have a lot of parts, and they involve a lot of complex checks on the information being extracted, to avoid extracting information that doesn’t answer questions.

The paper also describes the process of using wrappers, which I haven’t discussed here before. I will in the next and final post for this series.

Of course, we will probably look at many other posts and topics that involve how SEO and the Semantic Web are crossing paths and finding answers to questions that people might pose at the search engines.

10 thoughts on “Direct Answers: How Answers are Extracted from Web Pages”

  1. Hi Bill:

    I think your blog is the only one source where we can find this type of insightful post. SEO guys are currently trying to figure out How Answers Are Extracted From Web Pages! So I think It will be very helpful for the community!

    Best Regards
    Miraj Gazi

  2. It is always a great pleasure to read your great blog. Thanks for these great insights. As a semantic copywriter I love your takeaways. It will be important to create well structured websites which are highly useful for readers. I recommend structuring relevant data in terms of bullet lists and well prepared columns.

  3. As usual, Bill, a post full of insights and possibilities. I’m, going to have to go read it, though, to understand the pattern aspect.

  4. Thanks, Doc.

    That pattern aspect shows up in some Google processes since the 90s, when Sergey Brin came up with his DIPRE algorithm:

    Google’s First Semantic Search Invention Was Patented In 1999
    http://www.seobythesea.com/2014/09/google-first-semantic-search-invention-patented-1999/

    I’ve seen it in a few other Google patents and papers as well. See also:

    Does Google Search Google? How Google May Create and Use Synthetic Queries
    http://www.seobythesea.com/2013/01/google-synthetic-queries/

  5. Bill, this is a post full of insights. I will read it a few times I guess though, to understand the exact pattern aspect. It will for sure be very helpful for me.

  6. I’ve noticed that Google loves to serve up blurbs from Wikipedia, which in my experience have been relevant about 75% of the time.

  7. Great article. The upgrade to Google Knowledge graph Unlike answer boxes based on the Knowledge Graph, this new format pulls its answer directly from third party websites, giving them attribution via the page title . Thanks for sharing.

  8. Thank you, Liam.

    I consider the patents to be a great source for learning about what search engines might be doing in the future, and getting an idea about how they feel about the Web itself. Appreciate your comment.

  9. The only blog where I see these search patents and mechanisms being dissected in a way I can understand. Love occasionally dipping my toe in here! (bit of sea related theming to my comment for you there)
