A patent granted to Google on May 11, 2021, involves performing natural language processing (“NLP”) tasks such as question answering. It relies on a language model pre-trained using world knowledge.
Recent advances in language-model pre-training have resulted in language models such as Bidirectional Encoder Representations from Transformers (“BERT”), which has been widely discussed in connection with Google over the past year. The patent also tells us that a language model such as the Text-to-Text Transfer Transformer (“T5”) can capture a large amount of world knowledge, based on the massive text corpora on which it has been trained.
A problem arises as a language model accrues more and more knowledge. The inventors noted that storing knowledge implicitly in the parameters of a neural network can cause the network to increase in size significantly. This increase could adversely impact system operation.
The patent begins with that information and then moves into a summary, which tells us how the process described in the patent might work.
Pre-Training and Fine-Tuning Neural-Network-Based Language Models
It tells us that it works with systems and methods for pre-training and fine-tuning neural-network-based language models.
In more detail, it tells us that it relates to augmenting language model pre-training and fine-tuning by using a neural-network-based textual knowledge retriever trained along with the language model. It states:
During the process of pre-training, the knowledge retriever obtains documents (or portions thereof) from an unlabeled pre-training corpus (e.g., one or more online encyclopedias). The knowledge retriever automatically generates a training example by sampling a passage of text from one of the retrieved documents and randomly masking one or more tokens in the sampled piece of text (e.g., “The [MASK] is the currency of the United Kingdom.”).
It then describes how those masked tokens get predicted:
The knowledge retriever also retrieves additional documents from a knowledge corpus to be used by the language model in predicting the word that should go in each masked token. The language model then models the probabilities of each retrieved document in predicting the masked tokens and uses those probabilities to continually rank and re-rank the documents (or some subset thereof) in terms of their relevance.
A language model such as BERT can be used for many functions. The patent points those out as well, telling us specifically about BERT question-answering:
The knowledge retriever and language model are next fine-tuned using a set of different tasks. For example, the knowledge retriever may be fine-tuned using open-domain question and answering (“open-QA”) tasks, in which the language model must try to predict answers to a set of direct questions (e.g., What is the capital of California?). During this fine-tuning stage, the knowledge retriever uses its learned relevance rankings to retrieve helpful documents for the language model to answer each question. The framework of the present technology provides models that can intelligently retrieve helpful information from a large unlabeled corpus, rather than requiring all potentially relevant information to be stored implicitly in the parameters of the neural network. This framework may thus reduce the storage space and complexity of the neural network and also enable the model to more effectively handle new tasks that may be different than those on which it was pre-trained.
How the Patent Defines Training a Language Model
The patent describes training a language model by:
- Generating, using one or more processors of a processing system, a masked language modeling task using text from a first document
- Generating an input vector by applying a first learned embedding function to the masked language modeling task
- Generating a document vector for each document of a knowledge corpus by applying a second learned embedding function to each document of the knowledge corpus, the knowledge corpus comprising a first plurality of documents
- Generating a relevance score for each given document of the knowledge corpus based on the input vector and the document vector for the given document
- Generating a first distribution based on the relevance score of each document in a second plurality of documents, the second plurality of documents being from the knowledge corpus
- Generating a second distribution based on the masked language modeling task and text of each document in the second plurality of documents
- Generating a third distribution based on the first distribution and the second distribution
- Modifying parameters of at least the first learned embedding function or the second learned embedding function to generate an updated first distribution and an updated third distribution.
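The steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the patent's implementation: the vectors are hand-written stand-ins for the outputs of the two learned embedding functions, the relevance score is assumed to be an inner product, the first distribution is assumed to be a softmax over those scores, and the third distribution is assumed to be the sum of each document's token prediction weighted by that document's retrieval probability. All function and variable names are hypothetical.

```python
import math


def relevance_score(input_vec, doc_vec):
    """Inner product of the input vector and one document vector (assumed scoring rule)."""
    return sum(a * b for a, b in zip(input_vec, doc_vec))


def softmax(scores):
    """Turn raw relevance scores into a probability distribution over documents."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine(first_dist, second_dists):
    """Third distribution: weight each document's token probabilities by the
    document's retrieval probability, then sum over documents."""
    combined = {}
    for p_doc, token_probs in zip(first_dist, second_dists):
        for token, p_token in token_probs.items():
            combined[token] = combined.get(token, 0.0) + p_doc * p_token
    return combined


# Toy stand-ins for learned embeddings (assumed values, not real model output).
input_vec = [1.0, 0.0]
doc_vecs = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# Toy stand-ins for the language model's prediction of the masked token
# given each retrieved document (the "second distribution").
second_dists = [
    {"pound": 0.9, "euro": 0.1},
    {"pound": 0.2, "euro": 0.8},
    {"pound": 0.5, "euro": 0.5},
]

scores = [relevance_score(input_vec, d) for d in doc_vecs]
first_dist = softmax(scores)
third_dist = combine(first_dist, second_dists)
```

The final step in the list, modifying the embedding parameters, is left out of the sketch. In a real system it would typically be done by gradient-based training that raises the probability the third distribution assigns to the correct token, which in turn changes the relevance scores and the first distribution.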
When BERT came out, it was intended to help with around 11 NLP tasks, including question answering. I provided a look at the summary of the patent and some of the images that go with that summary, but the final process is much more detailed. This statement from the patent reminded me of a previous acquisition by Google of Wavii (With Wavii, Did Google Acquire the Future of Web Search?):
In some aspects, the system’s one or more processors are further configured to receive a query task, the query task comprising an open-domain question and answering task; generate a query input vector by applying the first learned embedding function to the query task, the first learned embedding function including one or more parameters modified as a result of the modifying; generate a query relevance score for each given document of the knowledge corpus based on the query input vector and the document vector for the given document; and retrieve a third plurality of documents from the knowledge corpus based on the query relevance score of each document in the third plurality of documents.
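The query-time retrieval described in that passage can be pictured as a top-k lookup. The sketch below is only an illustration: the document vectors and query vector are made-up stand-ins for the outputs of the learned embedding functions, the inner-product score is an assumption, and the function name retrieve_top_k and the document ids are hypothetical.

```python
def retrieve_top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents with the highest relevance score,
    where the score is assumed to be an inner product."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical pre-computed document vectors for a tiny knowledge corpus.
doc_vecs = {
    "sacramento": [0.9, 0.1],
    "london": [0.1, 0.9],
    "california": [0.7, 0.3],
}

# Stand-in for the embedded question "What is the capital of California?"
query_vec = [1.0, 0.0]

top_docs = retrieve_top_k(query_vec, doc_vecs, k=2)
```

With a knowledge corpus the size of an online encyclopedia, scoring every document for every query would be expensive, so a production system would more likely rely on an approximate nearest-neighbor index than on the exhaustive loop shown here.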
You can find the patent at:
Retrieval-augmented language model pre-training and fine-tuning
Inventors: Kenton Chiu Tsun Lee, Kelvin Gu, Zora Tung, Panupong Pasupat, and Ming-Wei Chang
Assignee: Google LLC
US Patent: 11,003,865
Granted: May 11, 2021
Filed: May 20, 2020
Systems and methods for pre-training and fine-tuning of neural-network-based language models are disclosed. A neural network-based textual knowledge retriever is trained along with the language model. In some examples, the knowledge retriever obtains documents from an unlabeled pre-training corpus, generates its own training tasks, and learns to retrieve documents relevant to those tasks. In some examples, the knowledge retriever is further refined using supervised open-QA questions. The framework of the present technology provides models that can intelligently retrieve helpful information from a large unlabeled corpus, rather than requiring all potentially relevant information to be stored implicitly in the parameters of the neural network. This framework may thus reduce the storage space and complexity of the neural network and also enable the model to more effectively handle new tasks that may be different than those on which it was pre-trained.
More Resources on BERT and on BERT Question-Answering
There is a paper about BERT which is worth reading to see how it is being used: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
One of the authors of the BERT paper was also one of the inventors of this patent. There looks to be some interesting reading on his homepage as well: Ming-Wei Chang’s Homepage
Visiting this patent can also be worth doing because, while I have presented a summary of how Google may use BERT question-answering, the patent provides a much more detailed look at how the search engine might use the technology behind it.
Some aspects behind this BERT question-answering approach include:
According to aspects of the technology, a neural-network-based language model resident on processing system is pre-trained using masked language modeling tasks. Each masked language modeling task may be automatically generated by a neural-network-based knowledge retriever (also resident on processing system), allowing pre-training to proceed unsupervised.
For example, the pre-training corpus may be an online encyclopedia such as Wikipedia, and the retrieved document may be a complete HTML page for a given entry, a selected section or sections of the page (e.g., title, body, tables), a single paragraph or sentence, etc. In step 204, the knowledge retriever selects a passage of text from the document to be masked. For example, the knowledge retriever may select a single sentence from the document, such as “The pound is the currency of the United Kingdom.” Finally, in step 206, the knowledge retriever creates a masked language modeling task x by replacing one or more words of the selected passage with a masking token (e.g., “[MASK]” or any other suitable token). For example, continuing with the same example from step 204, the knowledge retriever may mask “pound” within the selected passage, such that masked language modeling task x becomes “The [MASK] is the currency of the United Kingdom.”
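The masking step can be illustrated with a short Python sketch. It assumes simple whitespace tokenization and uniformly random choice of which word to mask, both of which are simplifications chosen for illustration; the function name make_masked_task and its parameters are hypothetical and do not come from the patent.

```python
import random

MASK_TOKEN = "[MASK]"


def make_masked_task(passage, num_masks=1, rng=None):
    """Replace num_masks randomly chosen words with the mask token.

    Returns the masked passage and the list of original words that were
    hidden, in left-to-right order.
    """
    rng = rng or random.Random()
    tokens = passage.split()  # whitespace tokenization (assumption)
    positions = sorted(rng.sample(range(len(tokens)), num_masks))
    targets = [tokens[i] for i in positions]
    for i in positions:
        tokens[i] = MASK_TOKEN
    return " ".join(tokens), targets


passage = "The pound is the currency of the United Kingdom."
masked, targets = make_masked_task(passage, num_masks=1, rng=random.Random(0))
```

Because the word to hide is picked at random here, the masked word will not necessarily be “pound”; the hidden words are returned alongside the masked passage so they can serve as the prediction targets.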
For example, knowledge corpus Z may be an unlabeled corpus such as Wikipedia or some other website. In that regard, knowledge corpus Z may be the same as pre-training corpus X, may have only some overlap with pre-training corpus X, or may be completely different from pre-training corpus X. In implementations where knowledge corpus Z is the same as pre-training corpus X, the particular document selected for generating masked language modeling task x may be removed from knowledge corpus Z before pre-training begins, to avoid training the language model to become too accustomed to finding answers through exact string matches.
This is a brief glimpse into how BERT question-answering works and what the patent offers. I recommend reading the rest of the patent and the paper to learn more about how a language model pre-trained with retrieval over a large corpus can provide answers for question answering.
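The exclusion described in that passage amounts to filtering one entry out of the corpus. A minimal sketch, assuming the corpus is held as a dictionary keyed by document id (the ids, texts, and the function name drop_source_document are hypothetical):

```python
def drop_source_document(knowledge_corpus, source_doc_id):
    """Return a copy of the knowledge corpus without the document that was
    used to build the masked language modeling task."""
    return {
        doc_id: text
        for doc_id, text in knowledge_corpus.items()
        if doc_id != source_doc_id
    }


# Toy corpus used as both pre-training corpus X and knowledge corpus Z.
pretraining_corpus = {
    "uk_currency": "The pound is the currency of the United Kingdom.",
    "uk_capital": "London is the capital of the United Kingdom.",
}

filtered_corpus = drop_source_document(pretraining_corpus, "uk_currency")
```

Without this filter, the retriever could simply hand back the unmasked source sentence, and the language model would learn to copy the answer rather than to use related knowledge.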