BERT Question-Answering
A Google patent granted on May 11, 2021, covers natural language processing (“NLP”) tasks such as question answering. It relies on a language model pre-trained using world knowledge.
Advances in pre-training have made large language models practical for these tasks. The patent points to Bidirectional Encoder Representations from Transformers (“BERT”), a language model Google has developed. It also tells us that a model such as the Text-to-Text Transfer Transformer (“T5”) can capture a large amount of world knowledge from the massive text corpus it is trained on.
There is an issue with a language model that accrues more and more knowledge. The inventors noted that storing knowledge in the parameters of a neural network causes the network to grow in size, and that growth could hurt system operation.
The patent begins with that information and then moves into a summary. The summary tells us how the process described in the patent might work.
Pre-Training and Fine-Tuning Neural-Network-Based Language Models
The patent tells us that it provides a way of pre-training and fine-tuning neural-network-based language models.
It relates to augmenting language model pre-training and fine-tuning by using a neural-network-based textual knowledge retriever trained along with the language model. It states:
During the process of pre-training, the knowledge retriever obtains documents (or portions thereof) from an unlabeled pre-training corpus (e.g., one or more online encyclopedias). The knowledge retriever generates a training example by sampling a passage of text from one of the retrieved documents and masking one or more tokens in the sampled piece of text (e.g., “The [MASK] is the currency of the United Kingdom.”).
The Language Model Includes a Masking Feature
The pre-training step uses this masking feature:
The knowledge retriever also retrieves more documents from a knowledge corpus, to be used by the language model in predicting the word that should fill each masked token. The language model then models the probabilities of each retrieved document in predicting the masked tokens and uses those probabilities to rank and re-rank the documents (or some subset thereof) in terms of their relevance.
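To make that ranking step concrete, here is a minimal sketch, assuming the masking task and each candidate document have already been turned into fixed-length embedding vectors (the learned embedding functions themselves are not shown). Relevance is taken as the inner product of the two embeddings, and a softmax over the scores gives the probabilities used to rank the retrieved documents. The function names are mine, not the patent's.

```python
import numpy as np

def relevance_scores(input_vec, doc_vecs):
    """Relevance of each retrieved document to the masked task, taken here
    as the inner product of the document vector and the input vector."""
    return doc_vecs @ input_vec                      # shape: (num_docs,)

def rank_documents(input_vec, doc_vecs):
    """Rank the retrieved documents and return per-document probabilities."""
    scores = relevance_scores(input_vec, doc_vecs)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                             # softmax over the candidates
    order = np.argsort(-scores)                      # best document first
    return order, probs[order]

# Toy usage: four candidate documents with 8-dimensional embeddings.
rng = np.random.default_rng(0)
order, probs = rank_documents(rng.normal(size=8), rng.normal(size=(4, 8)))
print(order, probs)
```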
How BERT Handles Question-Answering
A language model such as BERT can serve many functions, and the patent points those out as well. Here is what it tells us about BERT question-answering:
The knowledge retriever and language model are next fine-tuned using a set of different tasks. For example, the knowledge retriever may get fine-tuned using open-domain question and answering (“open-QA”) tasks, in which the language model must try to predict answers to a set of direct questions (e.g., What is the capital of California?). During this fine-tuning stage, the knowledge retriever uses its learned relevance rankings to retrieve helpful documents for the language model to answer each question. The framework of the present technology provides models that can retrieve helpful information from a large unlabeled corpus, rather than requiring all relevant information in the parameters of the neural network. This framework may thus reduce the storage space and complexity of the neural network and also enable the model to more easily handle new tasks that may be different than those on which it was pre-trained.
How the Patent Defines Training a Language Model
The patent describes training a language model through the following steps (a toy sketch of how these pieces might fit together follows the list):
- Generating, using one or more processors of a processing system, a masked language modeling task using text from a first document
- Creating an input vector by applying a first learned embedding function to the masked language modeling task
- Picking a document vector for each document of a knowledge corpus by applying a second learned embedding function to each document of the knowledge corpus, the knowledge corpus comprising a first plurality of documents
- Developing a relevance score for each given document of the knowledge corpus based on the input vector and the document vector for the given document
- Choosing a first distribution based on the relevance score of each document in a second plurality of documents, the second plurality of documents being from the knowledge corpus
- Deciding on a second distribution based on the masked language modeling task and text of each document in the second plurality of documents
- Selecting a third distribution based on the first distribution and the second distribution
- Modifying parameters of at least the first learned embedding function or the second learned embedding function to generate an updated first distribution and an updated third distribution.
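Here is a toy sketch of how those steps might fit together, reading the "first distribution" as a retrieval distribution over the top-scoring documents, the "second distribution" as the masked-token prediction given each retrieved document, and the "third distribution" as their combination (marginalizing over documents). Every function and array below is a made-up stand-in; this is one plausible reading of the claim language, not the patent's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
EMBED_DIM, VOCAB = 16, 100

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Made-up stand-ins for the two learned embedding functions.
W_task = rng.normal(size=(EMBED_DIM, EMBED_DIM))   # "first learned embedding function"
W_doc = rng.normal(size=(EMBED_DIM, EMBED_DIM))    # "second learned embedding function"

def embed_task(task_features):
    return task_features @ W_task                  # input vector for the masked task

def embed_doc(doc_features):
    return doc_features @ W_doc                    # document vector

def predict_masked_token(task_features, doc_features):
    # Placeholder for the language model: a real model would condition its
    # masked-token prediction on both the task and the retrieved document.
    return softmax(rng.normal(size=VOCAB))

# One training step over a tiny knowledge corpus.
task = rng.normal(size=EMBED_DIM)                  # features of the masking task
corpus = rng.normal(size=(5, EMBED_DIM))           # "first plurality of documents"

input_vec = embed_task(task)
doc_vecs = np.stack([embed_doc(d) for d in corpus])
relevance = doc_vecs @ input_vec                   # relevance score per document

top_k = np.argsort(-relevance)[:3]                 # "second plurality of documents"
first_dist = softmax(relevance[top_k])             # distribution over retrieved docs
second_dist = np.stack([predict_masked_token(task, corpus[i]) for i in top_k])
third_dist = first_dist @ second_dist              # marginalize over documents

# A loss on third_dist (e.g., negative log-likelihood of the true masked
# token) would then drive gradient updates to W_task and W_doc, yielding
# the updated first and third distributions the claim describes.
print(third_dist.shape)                            # (VOCAB,)
```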
Pre-Training and Fine-Tuning with BERT
When BERT came out, it was intended to help with around 11 NLP tasks, including question answering. I have provided a look at the summary of the patent; the full process it describes is much more detailed. This statement from the patent reminded me of Google's 2013 acquisition of Wavii, which I wrote about in With Wavii, Did Google Get the Future of Web Search?:
In some aspects, the system’s one or more processors are further configured to receive a query task, the query task comprising an open-domain question and answering task; generate a query input vector by applying the first learned embedding function to the query task, the first learned embedding function including one or more parameters modified as a result of the modifying; generate a query relevance score for each given document of the knowledge corpus based on the query input vector, and the document vector for the given document; and retrieve the third plurality of documents from the knowledge corpus based on the query relevance score of each document in the third plurality of documents.
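Stripped down, the query path in that claim amounts to embedding the question with the fine-tuned first embedding function, scoring every document in the knowledge corpus against that query vector, and keeping the top-scoring documents. A brute-force sketch follows; the names are illustrative, and a corpus the size of Wikipedia would be searched with an approximate maximum inner product search index rather than the full scan shown here.

```python
import numpy as np

def retrieve_for_question(query_vec, doc_vecs, k=5):
    """Score every document in the knowledge corpus against the query
    input vector and return the indices of the k highest-scoring ones."""
    scores = doc_vecs @ query_vec                  # one relevance score per document
    top_k = np.argpartition(-scores, k)[:k]        # unordered top-k
    return top_k[np.argsort(-scores[top_k])]       # ordered best-first

# Toy usage: a 1,000-document corpus with 64-dimensional document vectors.
rng = np.random.default_rng(42)
doc_vecs = rng.normal(size=(1000, 64))
query_vec = rng.normal(size=64)
print(retrieve_for_question(query_vec, doc_vecs))
```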
Where You Can Find This Patent
You can find the patent at:
Retrieval-augmented language model pre-training and fine-tuning
Inventors: Kenton Chiu Tsun Lee, Kelvin Gu, Zora Tung, Panupong Pasupat, and Ming-Wei Chang
Assignee: Google LLC
US Patent: 11,003,865
Granted: May 11, 2021
Filed: May 20, 2020
Abstract
Systems and methods for pre-training and fine-tuning of neural-network-based language models are disclosed.
A neural-network-based textual knowledge retriever is trained along with the language model.
In some examples, the knowledge retriever obtains documents from an unlabeled pre-training corpus, generates its own training tasks, and learns to retrieve documents relevant to those tasks.
In some examples, the knowledge retriever is further refined using supervised open-QA questions.
The framework of the present technology provides models that can retrieve helpful information from a large unlabeled corpus, rather than requiring all relevant information in the parameters of the neural network.
This framework may thus reduce the storage space and complexity of the neural network and also enable the model to more easily handle new tasks that may be different than those on which it was pre-trained.
More Resources on BERT and BERT Question-Answering
There is a paper about BERT which is worth reading to see how it is being used. The paper is BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
One of the authors of the BERT paper was also one of the inventors of this patent. There is some interesting reading on his homepage as well. You can see it at: Ming-Wei Chang’s Homepage
This patent is also worth reading. I have presented a summary of how Google may use BERT question-answering. But, the patent provides a much more detailed look at how the search engine uses the BERT technology.
Some of the aspects behind this BERT question-answering approach include:
A Language Model is Pre-Trained Using Masked Language Modeling Tasks
According to aspects of the technology, a neural-network-based language model resident on a processing system is pre-trained using masked language modeling tasks. Each masked language modeling task may come from a neural-network-based knowledge retriever (also resident on the processing system), allowing pre-training to proceed unsupervised.
The Pre-Training Corpus May Be an Online Encyclopedia Such as Wikipedia
For example, the pre-training corpus may be an online encyclopedia such as Wikipedia, and the retrieved document may be a complete HTML page for a given entry, a selected section or sections of the page (e.g., title, body, tables), a single paragraph or sentence, etc. Next, the knowledge retriever selects a passage of text from the document to be masked.
For example, the knowledge retriever may select a single sentence from the document, such as “The pound is the currency of the United Kingdom.” Finally, the knowledge retriever creates a masked language modeling task x by replacing one or more words of the selected passage with a masking token (e.g., “[MASK]” or any other suitable token).
The knowledge retriever may mask “pound” within the selected passage, such that masked language modeling task x becomes “The [MASK] is the currency of the United Kingdom.”
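A masking task like that one can be built with very little code. The sketch below samples one word of a sentence at random and replaces it with the mask token; the function name and the single-whole-word masking choice are illustrative rather than taken from the patent.

```python
import random

def make_masked_task(sentence, mask_token="[MASK]", seed=None):
    """Replace one randomly chosen word of the sentence with a mask token,
    returning the masked sentence and the word that was hidden."""
    rng = random.Random(seed)
    words = sentence.split()
    idx = rng.randrange(len(words))
    target = words[idx]
    words[idx] = mask_token
    return " ".join(words), target

masked, answer = make_masked_task("The pound is the currency of the United Kingdom.")
print(masked)   # e.g. "The [MASK] is the currency of the United Kingdom."
print(answer)   # e.g. "pound"
```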
The Knowledge Corpus
For example, knowledge corpus Z may be an unlabeled corpus such as Wikipedia or some other website. The knowledge corpus Z may be the same as pre-training corpus X, may have only some overlap with pre-training corpus X, or may be completely different from pre-training corpus X.
In implementations where knowledge corpus Z is the same as pre-training corpus X, the particular document selected for generating masked language modeling task x may be removed from knowledge corpus Z before pre-training begins, to avoid the language model becoming too accustomed to finding answers through exact string matches.
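In code, that safeguard amounts to little more than a filter that withholds the source document from the retrieval candidates during pre-training; the identifiers below are illustrative, not from the patent.

```python
def retrieval_candidates(knowledge_corpus, source_doc_id):
    """Withhold the document that the masked task was sampled from, so the
    model cannot recover the masked word through an exact string match."""
    return [doc for doc_id, doc in knowledge_corpus.items()
            if doc_id != source_doc_id]

# Toy usage with a dict-shaped corpus keyed by document id.
corpus = {"uk-currency": "The pound is the currency of the United Kingdom.",
          "us-currency": "The dollar is the currency of the United States."}
print(retrieval_candidates(corpus, "uk-currency"))
```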
A Brief Summary of How BERT Question-Answering Works and What the Patent Offers
This is a brief glimpse into how BERT question-answering works and what the patent offers. You should read the rest of the patent and the paper to learn more about how a language model can be pre-trained on a large corpus to provide answers for question answering.
Hi Bill,
This is the best BERT post I have ever read.
It also fits perfectly with your last three fantastic NLP & Semantic SEO posts.
All the best,
Tom
Interesting read Bill.
At the heart of this is the concept of entity salience (correct me if I’m wrong).
We’re just seeing further rollout of BERT in search, and we’re dealing with larger data sets (rather than document-level ES).
Hi Tom,
Like any machine learning approach that works with natural language processing, entities are identified using BERT, but it does a lot more than understand the significance of entities in content. It can understand other aspects of language, such as the grammar rules associated with the text and the co-occurrence of words that tend to appear near each other in large corpora of text, such as Wikipedia and the scanned books at Google. BERT is used to better understand a number of NLP processes beyond entity recognition and question answering.
Hi Thomas,
I was happy to find a patent from Google about BERT specifically because it does fit in well with the NLP patents they are coming out with. The one person who was on both the patent and the paper for BERT (Chang) has written a number of papers that were recognized among the best papers at the conferences where they appeared, and they are worth spending time with as well.
Thanks.
Bill
Hi Bill,
Very interesting and in depth look at BERT. I know this isn’t a specific function of the transformer or the algorithm, but I did notice something interesting in my rankings at the end of last year/beginning of this year.
I built several blog sites and blog posts in the same niche, some with the same keywords (boiler installation and sale in the UK, various models/brands). I tested these with general long form content (review/article 1,200 words +) and then did posts and articles just asking questions and answering them.
Every time, the questions and answers ranked more highly and much faster than the other articles. I tried to keep things the same across both types of posts with the same amount of images, Internal/external links, etc but I was surprised at how the question answering posts were much more valued by the algorithm than even long review posts.
Just thought I’d post that here to see if anyone else notices the same. (I sometimes think I am perhaps too much of an SEO geek, so you can imagine how enthralling I am at parties.)
Thanks for the info!
Regards,
Ryan.
Great article Bill, thanks.
I was interested to read more into BERT after hearing about it in your write-up. Maybe Elon Musk's AI nightmares will eventually come true! One thing I think is interesting is that although BERT's bi-directional approach (MLM) converges significantly more slowly than left-to-right approaches (due to only 15% of words being predicted in each batch), bidirectional training still significantly outperforms left-to-right training after a small number of pre-training steps. Google certainly is pressing ahead with this tech, interesting times indeed.
Regards
Andy
Hello Bill,
As an SEO Company Owner, I just want to tell you that I really miss reading your articles. Probably, I will fire a couple of clients so that I can read your articles for hours.
Hi Koray,
Thanks for your kind words.
Great Article.
These articles help me understand how BERT works to identify the right things to show at the top of the results.
As a beginner in SEO, I am usually looking for a genuine piece of content that shares some useful insights. Then there comes “SEObytheSEA,” which helps me out every time, especially via your Twitter updates.
xD
Thanks for making this article. Now I know about BERT, and it is nice to meet your blog.
Wow, just discovered your website while diving through a competitor’s backlinks and am very impressed by the depth of your articles and actually going into Google’s patents to extract insights.
I’ll be back for future content!
Well-written post! Thanks for sharing.
I read your blog. It is very interesting and informative. I appreciate your work.