Featured Snippets: Extracting Answers from Web Pages


I’ve been writing recently about a patent from Google on Featured Snippets, and how Google might take those from authoritative sources, using an intent template process (“what are the symptoms for [measles, flu, athlete’s foot, ebola]”) to show direct answers to natural language queries alongside keyword-based search results.

The patent doesn’t tell us how the search engine chooses such natural language featured snippets. Still, the following document, which shares its authors with the inventors of the patent and was filed by them as a provisional patent, gives us some ideas on how those answers are found on the web.

We know that Google is looking for responses from pages that they consider “authoritative” pages.

We also know that Google uses query templates to help identify the right pages among those authoritative pages to take content from when answering questions such as:

  • What are the symptoms for measles
  • What are the symptoms for chicken pox
  • What are the symptoms for the flu
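
To make the template idea concrete, here is a minimal Python sketch of how one query intent template could be matched against incoming queries. It is only an illustration under my own assumptions: the regular expression, the `SYMPTOM_TEMPLATE` name, and the `match_symptom_query` function are hypothetical and do not come from the patent or the paper.

```python
import re

# Hypothetical template for "what are the symptoms for [entity]".
# The named group "entity" plays the role of the bracketed slot.
SYMPTOM_TEMPLATE = re.compile(
    r"^\s*what are the symptoms for (?:the )?(?P<entity>.+?)\s*\??\s*$",
    re.IGNORECASE,
)


def match_symptom_query(query):
    """Return the entity filling the template slot, or None if no match."""
    match = SYMPTOM_TEMPLATE.match(query)
    if match is None:
        return None
    return match.group("entity").lower()


if __name__ == "__main__":
    for query in [
        "What are the symptoms for measles",
        "What are the symptoms for chicken pox",
        "What are the symptoms for the flu",
        "cheap flights to Paris",
    ]:
        print(repr(query), "->", match_symptom_query(query))
```

A query that fits the template yields the entity to look up on authoritative pages; a query that doesn't fit returns `None` and would fall back to ordinary keyword results.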

When the paper was published, as we can see just below, the authors’ identities were protected since it was “submitted for blind review.”

This paper is from the same authors as the natural language query patent I’ve been writing about and was filed as an early provisional patent by them.

The paper tells us about how Google might be extracting answers from pages on the Web, and can be found at:

Scalable Attribute-Value Extraction from Semi-Structured Text

The authors’/inventors’ names appear on the version at the Google research abstract (in orange, below):

The paper was originally filed to be presented at a conference on large scale data mining. Perhaps that’s where it was “blindly submitted” to.

The paper tells us right up front that it focuses on formats that make extracting answers from pages on the Web relatively straightforward.

Extracting Answers based on Structural Contexts

In this paper, we present a general framework for extracting attribute-value pairs from web pages.

Specifically, we restrict our attention to attribute-value pairs that are expressed in structural contexts such as tables and colon-delimited pairs.

The main motivation is that a large number of attribute-value pairs that exist on the Web are encoded in such formats, and identifying these formats is relatively straightforward.

So information might be extracted from tables like the following from a Wikipedia infobox:

On the left are attributes, and on the right are values for them.
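
As a rough illustration of reading attribute-value pairs out of a two-column table, here is a short Python sketch that uses only the standard library’s `html.parser`. It is my own simplification, not the paper’s method: the `TwoColumnTableParser` class, the `extract_table_pairs` function, and the sample HTML are all hypothetical, and real infoboxes have nested markup that this ignores.

```python
from html.parser import HTMLParser


class TwoColumnTableParser(HTMLParser):
    """Collect (attribute, value) pairs from table rows with exactly two cells."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._row = None   # cell texts of the current row, or None outside a row
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Join the fragments and normalize runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Keep only rows that look like "attribute | value".
            if len(self._row) == 2 and all(self._row):
                self.pairs.append((self._row[0], self._row[1]))
            self._row = None
            self._cell = None


def extract_table_pairs(html):
    parser = TwoColumnTableParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Made-up sample markup, loosely shaped like an infobox.
SAMPLE = """
<table>
  <tr><th>Capital</th><td>Paris</td></tr>
  <tr><th>Currency</th><td><a href="#">Euro</a></td></tr>
  <tr><td colspan="2">A caption row with a single cell</td></tr>
</table>
"""

if __name__ == "__main__":
    print(extract_table_pairs(SAMPLE))
```

Rows that don’t have exactly two non-empty cells, such as the caption row in the sample, are skipped rather than guessed at.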

In addition to two-column tables like that, the paper also points to tables with additional rows. It also tells us that it might grab attribute-value information from pairs of items that are separated by colons, like this:

When items on a page are separated by colons like this, they are often related.
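
A colon-delimited pair is simple enough to sketch with a regular expression. The Python below is a hedged illustration only: the `COLON_PAIR` pattern, the 40-character cap on attribute length, and the `extract_colon_pairs` function are my own assumptions rather than anything the paper specifies.

```python
import re

# Assumed shape of a pair: a short label, a colon, whitespace, then a value.
# Requiring whitespace after the colon keeps URLs such as "http://..." out.
COLON_PAIR = re.compile(
    r"^\s*(?P<attribute>[^\s:][^:\n]{0,39}?)\s*:\s+(?P<value>\S.*?)\s*$"
)


def extract_colon_pairs(text):
    """Return (attribute, value) tuples for lines that look like 'label: value'."""
    pairs = []
    for line in text.splitlines():
        match = COLON_PAIR.match(line)
        if match:
            pairs.append((match.group("attribute"), match.group("value")))
    return pairs


# Made-up sample text for illustration.
SAMPLE_TEXT = """Born: 1912
Occupation: Mathematician
Visit http://example.com today
Opening time: 10:30
"""

if __name__ == "__main__":
    for attribute, value in extract_colon_pairs(SAMPLE_TEXT):
        print(attribute, "=>", value)
```

Only the first colon on a line splits the pair, so a value such as a time of day keeps its own colon.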

Extracting Answers based upon Patterns

The paper points out another way that information could be extracted: patterns in text. These patterns are like the query intent templates that the patent describes:

Most such work has been devoted to the acquisition of WordNet-style relations between pairs of concepts. Work specifically directed towards extracting attributes of concepts was performed by Poesio and Almuhareb [14].

Their system generates candidates using the pattern “the X of the Y (is Z)”, the hypothesis is that X is an attribute of the concept described by the noun phrase Y, and Z if it appears, is the corresponding value.
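
The “the X of the Y (is Z)” pattern quoted above can be approximated with a regular expression. This Python sketch is a deliberately simplified assumption of mine, not Poesio and Almuhareb’s system: it treats X and Y as single words, where the quoted description works with noun phrases, and the `ATTRIBUTE_PATTERN` and `find_attribute_candidates` names are hypothetical.

```python
import re

# Simplification: X and Y are single words; Z runs up to the next . , or ;
ATTRIBUTE_PATTERN = re.compile(
    r"\bthe (?P<attribute>\w+) of the (?P<concept>\w+)"
    r"(?: is (?P<value>[^.,;]+))?",
    re.IGNORECASE,
)


def find_attribute_candidates(text):
    """Return (attribute, concept, value) candidates; value is None if absent."""
    candidates = []
    for match in ATTRIBUTE_PATTERN.finditer(text):
        value = match.group("value")
        candidates.append(
            (
                match.group("attribute").lower(),
                match.group("concept").lower(),
                value.strip() if value else None,
            )
        )
    return candidates


if __name__ == "__main__":
    text = "The color of the sky is blue. She measured the height of the tower."
    for candidate in find_attribute_candidates(text):
        print(candidate)
```

These are only candidates; a pattern this loose will match plenty of phrases that are not real attributes, which is why further filtering would be needed.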

Google has published much more detailed looks at how they might capture information from patterns.

Takeaways

If you would like your pages to be shown as the sources for featured snippets, striving to have them seen as authoritative pages is a good first step.

Understanding how tables and colon-delimited pairs might be used as sources for information can be important too.

Using consistent patterns in the content on your pages about related topics can be another way of encouraging Google to extract information from your pages.

The paper also refers to a program called TextRunner, which involves an Open Information Extraction approach to learning from the Web. The processes described in the paper have a lot of parts. They involve many complex checks on the information being extracted, to avoid extracting information that doesn’t answer questions.

The paper also describes the process of using wrappers, which I haven’t discussed here before. I will in the next and final post of this series.

Of course, we will probably look at many other posts and topics that involve how SEO and the Semantic Web are crossing paths and finding answers to questions that people might pose at the search engines.

Featured Snippets – Natural Language Search Results for Intent Queries, Part 1
Featured Snippets – Taken from Authority Websites, Part 2
Featured Snippets – Using Query Intent Templates to Identify Answers, Part 3
Featured Snippets: How Answers are Extracted from Web Pages, Part 4
Featured Snippets: Extracting Text from Pages Citations, Part 5

Also see: Does Google Use Schema to Write Answer Passages for Featured Snippets?

Last Updated August 8, 2019.


10 thoughts on “Featured Snippets: Extracting Answers from Web Pages”

  1. Hi Bill:

    I think your blog is the only source where we can find this type of insightful post. SEO guys are currently trying to figure out how answers are extracted from web pages, so I think it will be very helpful for the community!

    Best Regards
    Miraj Gazi

  2. It is always a great pleasure to read your great blog. Thanks for these great insights. As a semantic copywriter I love your takeaways. It will be important to create well-structured websites that are highly useful for readers. I recommend structuring relevant data in bullet lists and well-prepared columns.

  3. As usual, Bill, a post full of insights and possibilities. I’m going to have to go read it, though, to understand the pattern aspect.

  4. Thanks, Doc.

    That pattern aspect has shown up in some Google processes since the 90s, when Sergey Brin came up with his DIPRE algorithm:

    Google’s First Semantic Search Invention Was Patented In 1999

    I’ve seen it in a few other Google patents and papers as well. See also:

    Does Google Search Google? How Google May Create and Use Synthetic Queries

  5. Bill, this is a post full of insights. I will read it a few times, I guess, to understand the exact pattern aspect. It will for sure be very helpful for me.

  6. I’ve noticed that Google loves to serve up blurbs from Wikipedia, which in my experience have been relevant about 75% of the time.

  7. Great article. This is an upgrade to the Google Knowledge Graph: unlike answer boxes based on the Knowledge Graph, this new format pulls its answer directly from third-party websites, giving them attribution via the page title. Thanks for sharing.

  8. Thank you, Liam.

    I consider the patents to be a great source for learning about what search engines might be doing in the future, and getting an idea about how they feel about the Web itself. Appreciate your comment.

  9. The only blog where I see these search patents and mechanisms being dissected in a way I can understand. Love occasionally dipping my toe in here! (bit of sea related theming to my comment for you there)
