BERT Question-Answering at Google

A patent granted to Google on May 11, 2021, involves performing natural language processing (“NLP”) tasks such as question answering. It relies on a language model pre-trained using world knowledge.

Recent advances in language-model pre-training have resulted in language models such as Bidirectional Encoder Representations from Transformers (“BERT”), which has come up often in discussions of Google over the past year. The patent also tells us that a model such as the Text-to-Text Transfer Transformer (“T5”) can capture a large amount of world knowledge, based on the massive text corpora on which it has been trained.

An issue exists that is related to using a language model that accrues more and more knowledge. The inventors noted that storing knowledge implicitly in the parameters of a neural network can cause the network to increase in size significantly. This increase could adversely impact system operation.

The patent begins with that information and then moves into a summary, which tells us how the process described in the patent might work.

Pre-Training and Fine-Tuning Neural-Network-Based Language Models

It describes systems and methods for pre-training and fine-tuning neural-network-based language models.

In more detail, it tells us that it relates to augmenting language model pre-training and fine-tuning by using a neural-network-based textual knowledge retriever trained along with the language model. It states:

During the process of pre-training, the knowledge retriever obtains documents (or portions thereof) from an unlabeled pre-training corpus (e.g., one or more online encyclopedias). The knowledge retriever automatically generates a training example by sampling a passage of text from one of the retrieved documents and randomly masking one or more tokens in the sampled piece of text (e.g., “The [MASK] is the currency of the United Kingdom.”).
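
To make that more concrete, here is a minimal sketch in Python of how such a training example might be generated from a sampled passage. The function name and the simple whitespace tokenization are my own illustration; the patent does not spell out a tokenizer, and a production system would work with subword tokens.

```python
import random

MASK_TOKEN = "[MASK]"

def make_mlm_example(passage, num_masks=1, rng=random):
    """Randomly hide one or more tokens in a sampled passage.

    Returns the masked passage and the hidden tokens, which become the
    prediction targets during pre-training.
    """
    tokens = passage.split()
    positions = rng.sample(range(len(tokens)), k=min(num_masks, len(tokens)))
    targets = [tokens[i] for i in positions]
    for i in positions:
        tokens[i] = MASK_TOKEN
    return " ".join(tokens), targets

masked, targets = make_mlm_example("The pound is the currency of the United Kingdom.")
print(masked)   # e.g. "The [MASK] is the currency of the United Kingdom."
print(targets)  # e.g. ["pound"]
```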

(Patent image: Knowledge Retriever Question Answering)

It points out this masking feature specifically:

The knowledge retriever also retrieves additional documents from a knowledge corpus to be used by the language model in predicting the word that should go in each masked token. The language model then models the probabilities of each retrieved document in predicting the masked tokens and uses those probabilities to continually rank and re-rank the documents (or some subset thereof) in terms of their relevance.
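
A natural way to realize those relevance scores and probabilities (and the assumption in the short sketch below) is an inner product between the embedded task and each embedded document, followed by a softmax that turns the scores into the probabilities used for ranking. The helper names here are mine, not the patent's.

```python
import numpy as np

def relevance_scores(input_vec, doc_vecs):
    """One relevance score per document: the inner product between the
    embedded masked-language-modeling task and each document vector."""
    return doc_vecs @ input_vec

def retrieval_distribution(scores):
    """Softmax over the scores, so documents judged more relevant receive
    more probability mass when predicting the masked tokens."""
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

def rank_documents(input_vec, doc_vecs, k=5):
    """Indices of the k most relevant documents; repeating this as the
    embeddings are updated gives the continual ranking and re-ranking."""
    probs = retrieval_distribution(relevance_scores(input_vec, doc_vecs))
    return np.argsort(-probs)[:k]
```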

(Patent image: Knowledge Retrieval BERT Question Answering)

A language model such as BERT can be used for many functions. The patent points those out as well, telling us specifically about BERT question-answering:

The knowledge retriever and language model are next fine-tuned using a set of different tasks. For example, the knowledge retriever may be fine-tuned using open-domain question and answering (“open-QA”) tasks, in which the language model must try to predict answers to a set of direct questions (e.g., What is the capital of California?). During this fine-tuning stage, the knowledge retriever uses its learned relevance rankings to retrieve helpful documents for the language model to answer each question. The framework of the present technology provides models that can intelligently retrieve helpful information from a large unlabeled corpus, rather than requiring all potentially relevant information to be stored implicitly in the parameters of the neural network. This framework may thus reduce the storage space and complexity of the neural network and also enable the model to more effectively handle new tasks that may be different than those on which it was pre-trained.

(Patent image: Knowledge Corpus BERT Question Answering)

How the Patent Defines Training a Language Model

The patent describes training a language model by the steps below (a rough code sketch follows the list):

  • Generating, using one or more processors of a processing system, a masked language modeling task using text from a first document
  • Generating an input vector by applying a first learned embedding function to the masked language modeling task
  • Generating a document vector for each document of a knowledge corpus by applying a second learned embedding function to each document of the knowledge corpus, the knowledge corpus comprising a first plurality of documents
  • Generating a relevance score for each given document of the knowledge corpus based on the input vector and the document vector for the given document
  • Generating a first distribution based on the relevance score of each document in a second plurality of documents, the second plurality of documents being from the knowledge corpus
  • Generating a second distribution based on the masked language modeling task and text of each document in the second plurality of documents
  • Generating a third distribution based on the first distribution and the second distribution
  • Modifying parameters of at least one of the first learned embedding function or the second learned embedding function to generate an updated first distribution and an updated third distribution.
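
To show how those steps fit together, here is a toy-sized Python sketch that follows the claim language, with random projections standing in for the learned encoders. The names, the tiny corpus, and the stubbed reader probabilities are my own illustration, not code from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two learned embedding functions. In the patent these
# would be neural encoders; fixed lookup tables are used here only to make
# the shapes and the arithmetic concrete.
D_MODEL, VOCAB = 16, 100
W_task = rng.normal(size=(VOCAB, D_MODEL))  # parameters of the first embedding function
W_doc = rng.normal(size=(VOCAB, D_MODEL))   # parameters of the second embedding function

def embed_task(token_ids, W):
    """First learned embedding function: masked language modeling task -> input vector."""
    return W[token_ids].mean(axis=0)

def embed_doc(token_ids, W):
    """Second learned embedding function: document -> document vector."""
    return W[token_ids].mean(axis=0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# A masked language modeling task x and a small knowledge corpus (the "first plurality").
task_ids = rng.integers(0, VOCAB, size=8)
corpus = [rng.integers(0, VOCAB, size=20) for _ in range(50)]

input_vec = embed_task(task_ids, W_task)                    # input vector
doc_vecs = np.stack([embed_doc(d, W_doc) for d in corpus])  # document vectors
scores = doc_vecs @ input_vec                               # relevance scores
top_k = np.argsort(-scores)[:5]                             # the "second plurality" of documents
p_retrieve = softmax(scores[top_k])                         # first distribution, p(z | x)

# Second distribution, p(y | z, x): for each retrieved document, the probability the
# language model assigns to the correct masked word. Stubbed with random numbers here;
# in the real system it comes from the model reading the task together with the document.
p_correct_given_doc = rng.uniform(size=top_k.size)

# Third distribution: marginalize over the retrieved documents, p(y | x) = sum_z p(y|z,x) p(z|x).
p_correct = float(np.sum(p_correct_given_doc * p_retrieve))

# Training then modifies the embedding (and reader) parameters to raise p_correct,
# which updates both the retrieval distribution and the marginal distribution.
print(round(p_correct, 4))
```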

(Patent image: language model and knowledge retriever)

When BERT came out, it was intended to help with 11 NLP tasks, including question answering. I have provided a look at the summary of the patent and some of the images that go with that summary, but the full process is much more detailed. This statement from the patent reminded me of Google's earlier acquisition of Wavii (With Wavii, Did Google Acquire the Future of Web Search?):

In some aspects, the system’s one or more processors are further configured to receive a query task, the query task comprising an open-domain question and answering task; generate a query input vector by applying the first learned embedding function to the query task, the first learned embedding function including one or more parameters modified as a result of the modifying; generate a query relevance score for each given document of the knowledge corpus based on the query input vector, and the document vector for the given document; and retrieve the third plurality of documents from the knowledge corpus based on the query relevance score of each document in the third plurality of documents.
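
In code terms, the fine-tuned retriever handles a question by embedding it with the updated first embedding function, scoring every document in the knowledge corpus, and keeping the top few. A minimal sketch, again with illustrative names and an assumed inner-product score:

```python
import numpy as np

def retrieve_for_question(question_vec, doc_vecs, k=5):
    """Return the indices of the k documents with the highest query relevance
    scores -- the documents handed to the language model to answer from."""
    query_scores = doc_vecs @ question_vec   # query relevance scores
    return np.argsort(-query_scores)[:k]     # the retrieved "third plurality of documents"

# Usage sketch: embed_question stands in for the fine-tuned first learned embedding
# function, and doc_vecs for precomputed document vectors of the knowledge corpus.
# top_docs = retrieve_for_question(embed_question("What is the capital of California?"), doc_vecs)
```

At the scale of a corpus like Wikipedia, scoring every document exhaustively is impractical, so an approximate maximum inner product search index would typically stand in for the brute-force scan shown above.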

You can find the patent at:

Retrieval-augmented language model pre-training and fine-tuning
Inventors: Kenton Chiu Tsun Lee, Kelvin Guu, Zora Tung, Panupong Pasupat, and Ming-Wei Chang
Assignee: Google LLC
US Patent: 11,003,865
Granted: May 11, 2021
Filed: May 20, 2020

Abstract

Systems and methods for pre-training and fine-tuning of neural-network-based language models are disclosed. A neural network-based textual knowledge retriever is trained along with the language model. In some examples, the knowledge retriever obtains documents from an unlabeled pre-training corpus, generates its own training tasks, and learns to retrieve documents relevant to those tasks. In some examples, the knowledge retriever is further refined using supervised open-QA questions. The framework of the present technology provides models that can intelligently retrieve helpful information from a large unlabeled corpus, rather than requiring all potentially relevant information to be stored implicitly in the parameters of the neural network. This framework may thus reduce the storage space and complexity of the neural network and also enable the model to more effectively handle new tasks that may be different than those on which it was pre-trained.

More Resources on BERT and on BERT question-answering

There is a paper about BERT which is worth reading to see how it is being used: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

One of the authors of the BERT paper was also one of the inventors of this patent. There looks to be some interesting reading on his homepage as well: Ming-Wei Chang’s Homepage

Reading the patent itself is also worth doing because, while I have presented a summary of how Google may use BERT question-answering, the patent provides a much more detailed look at how the search engine uses the technology behind it.

Some aspects behind this BERT question-answering approach include:

According to aspects of the technology, a neural-network-based language model resident on a processing system is pre-trained using masked language modeling tasks. Each masked language modeling task may be automatically generated by a neural-network-based knowledge retriever (also resident on the processing system), allowing pre-training to proceed unsupervised.

For example, the pre-training corpus may be an online encyclopedia such as Wikipedia, and the retrieved document may be a complete HTML page for a given entry, a selected section, or sections of the page (e.g., title, body, tables), a single paragraph or sentence, etc. In step 204, the knowledge retriever selects a passage of text from the document to be masked. For example, the knowledge retriever may select a single sentence from the document, such as “The pound is the currency of the United Kingdom.” Finally, in step 206, the knowledge retriever creates a masked language modeling task x by replacing one or more words of the selected passage with a masking token (e.g., “[MASK]” or any other suitable token). For example, continuing with the same example from step 204, the knowledge retriever may mask “pound” within the selected passage, such that masked language modeling task x becomes “The [MASK] is the currency of the United Kingdom.”

For example, knowledge corpus Z may be an unlabeled corpus such as Wikipedia or some other website. In that regard, knowledge corpus Z may be the same as pre-training corpus X, may have only some overlap with pre-training corpus X, or may be completely different from pre-training corpus X. In implementations where knowledge corpus Z is the same as pre-training corpus X, the particular document selected for generating masked language modeling task x may be removed from knowledge corpus Z before pre-training begins, to avoid the language model becoming too accustomed to finding answers through exact string matches.
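
That removal step is easy to overlook. As a hypothetical helper (the function name is mine), it amounts to filtering the source document out of the knowledge corpus before pre-training begins:

```python
def build_knowledge_corpus(pretraining_corpus, source_doc):
    """When knowledge corpus Z and pre-training corpus X are the same collection,
    drop the document the masked task was drawn from, so the model cannot
    'answer' by finding an exact string match against its own source text."""
    return [doc for doc in pretraining_corpus if doc != source_doc]
```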

This is a brief glimpse into how BERT question-answering works and what the patent offers. I recommend reading the rest of the patent and the paper to learn more about how this approach pre-trains a language model on a large corpus so that it can answer questions.

10 thoughts on “BERT Question-Answering at Google”

  1. Hi Bill,
    This is the best BERT post I have ever read.
    It also fits perfectly with your last three fantastic NLP & Semantic SEO posts.

    All the best,
    Tom

  2. Interesting read Bill.

    At the heart of this is the concept of entity salience (correct me if I’m wrong).

    We’re just seeing further rollout of BERT in search, and we’re dealing with larger data sets (rather than document-level ES).

  3. Hi Tom,

    Like other machine learning approaches that work with natural language processing, BERT identifies entities, but it does a lot more than understand the significance of entities in content. It can understand other features of language, such as grammar rules associated with the text and the co-occurrence of words that tend to appear near each other in large corpora of text, such as Wikipedia and the scanned books at Google. BERT is used to better understand a number of NLP processes beyond entity recognition and question answering.

  4. Hi Thomas,

    I was happy to find a patent from Google about BERT specifically because it does fit in well with the NLP patents that they have been coming out with. The one person who was on both this patent and the BERT paper (Chang) has written a number of papers that were recognized as best papers at the conferences where they appeared, and those are worth spending time with as well.

    Thanks.

    Bill

  5. Hi Bill,

    Very interesting and in-depth look at BERT. I know this isn’t a specific function of the transformer or the algorithm, but I did notice something interesting in my rankings at the end of last year/beginning of this year.

    I built several blog sites and blog posts in the same niche, some with the same keywords (boiler installation and sale in the UK, various models/brands). I tested these with general long form content (review/article 1,200 words +) and then did posts and articles just asking questions and answering them.

    Every time, the questions and answers ranked more highly and much faster than the other articles. I tried to keep things the same across both types of posts, with the same number of images, internal/external links, etc., but I was surprised at how much more the question-answering posts were valued by the algorithm than even the long review posts.

    Just thought I’d post that here to see if anyone else notices the same. (I sometimes think I am perhaps too much of an SEO geek, so you can imagine how enthralling I am at parties.)

    Thanks for the info!

    Regards,
    Ryan.

  6. Great article Bill, thanks.
    I was interested to read more about BERT after hearing about it in your write-up. Maybe Elon Musk’s AI nightmares will eventually come true! One thing I find interesting is that
    although BERT’s bidirectional approach (MLM) converges significantly more slowly than left-to-right approaches (because only 15% of words are predicted in each batch), bidirectional training still significantly outperforms left-to-right training after a small number of pre-training steps. Google certainly is pressing ahead with this tech, interesting times indeed.

    Regards

    Andy

  7. Hello Bill,

    As an SEO Company Owner, I just want to tell you that I really miss reading your articles. Probably, I will fire a couple of clients so that I can read your articles for hours.

  8. Great Article.
    These articles help me understand how BERT works to identify the right things to be seen on top.

    As a beginner in SEO, I usually think of getting to some genuine piece of content that shares some useful insights about content. Then there comes “SEObytheSEA” which helps me out every time especially via your Twitter updates.

    xD
