2019: The Year of BERT

The boom in deeper transfer learning in NLP

Natasha Latysheva
Towards Data Science


As we wrap up 2019, it’s interesting to reflect on the major recent trends in the field of machine learning for language. 2019 has been a landmark year for NLP, with new records across a variety of important tasks, from reading comprehension to sentiment analysis. The key research trend that stands out is the rise of transfer learning in NLP, which refers to using massive pre-trained models and fine-tuning them to your specific language-related task. Transfer learning allows you to reuse knowledge from previously built models, which can give you a boost in performance and generalisation, while demanding much less labelled training data.

The idea of pre-training models followed by task-specific fine-tuning is in itself not new — computer vision practitioners regularly use models pre-trained on large datasets like ImageNet, and in NLP we have been doing “shallow” transfer learning for years by reusing word embeddings.

But in 2019, with models like BERT, we saw a major shift towards deeper knowledge transfer by transferring entire models to new tasks — essentially using large pre-trained language models as reusable language comprehension feature extractors.

This was talked about as “NLP’s ImageNet moment” last year, and in 2019, research continued to build on this trend. BERT was remarkable for making transfer learning in NLP easy, and in the process producing state-of-the-art results for 11 sentence-level and word-level NLP tasks with minimal adaptation. This is exciting from the practical point of view, but BERT and related models are perhaps even more interesting because they are advancing our fundamental understanding of how we should represent language to computers, and which representations best allow our models to solve challenging language problems.

The emerging paradigm is: why constantly learn language syntax and semantics from scratch for every new NLP task when you can reuse BERT’s solid grasp of language?

This core concept, together with an easy fine-tuning procedure and open source code, means that BERT has spread rapidly — initially released in late 2018, BERT has become enormously popular in 2019. I did not realise just how popular until I attempted to compile a list of BERT-related papers published this past year. I gathered 169 BERT-related papers, and manually annotated these into a few different categories of research (e.g. building domain-specific versions of BERT, understanding BERT’s inner mechanics, building multilingual BERTs, etc.).

Here is a plot of all of these papers at once:

A collection of BERT-related papers published between November 2018 and December 2019. The y axis is the log of the citation count (as measured by Google Scholar), but with a floor of 0. The majority of these papers were found by searching for BERT in the title of arXiv papers.

This sort of information is often easier to explore interactively, so here is a GIF. You can also check out the Jupyter notebook to play with the plot yourself, and the raw data is here.

Mousing over a mass of BERT papers.

Now that’s a lot of BERT papers. Some notes on this plot:

  • It’s interesting to observe the (fairly short) lag between the publication of the original paper in November 2018, and the flood of papers starting around January 2019.
  • The initial wave of BERT papers tended to focus on immediate extensions and applications of the core BERT model (red, purple and orange dots), like adapting BERT for recommendation systems, sentiment analysis, text summarisation, and document retrieval.
  • Then, starting in April, a collection of papers probing the internal mechanisms of BERT was published (in green), like understanding how BERT models hierarchical linguistic phenomena and analysing the redundancy between attention heads. Of particular interest is the paper BERT Rediscovers the Classical NLP Pipeline, in which the authors find that BERT’s internal computations mirror the traditional NLP workflow (first do part-of-speech tagging, then dependency parsing, then entity tagging, etc.).
  • Around September, a collection of papers focused on compressing the model size of BERT was released (cyan), like the DistilBERT, ALBERT and TinyBERT papers. For instance, the DistilBERT model from HuggingFace is a compressed version of BERT with half the number of parameters (from 110 million down to 66 million) but 95% of the performance on important NLP tasks (see the GLUE benchmarks). The original BERT models are not exactly lightweight, and this is a problem in places where computational resources are not plentiful (like mobile phones).
  • This list of BERT papers is very likely to be incomplete. I wouldn’t be surprised if the true number of BERT-related papers were double my figure. As a rough upper bound, the number of papers that cite the original BERT paper is currently over 3100.
  • In case you’re curious about the names of some of these models — essentially NLP researchers are getting carried away with Sesame Street characters. We can blame the ELMo paper for getting this whole thing started, which made later models like BERT and ERNIE inevitable. I am eagerly awaiting a BIGBIRD model — and let’s call the compressed version SMALLBIRD?

A few lessons from the BERT literature

Going through this literature, a few general concepts emerged:

  • The value of open-sourcing machine learning models. The authors made the BERT model and relevant code freely available, and provided an easy, reusable fine-tuning procedure. This type of openness is vital for accelerating research, and I doubt the model would have been nearly as popular had the authors been less forthcoming.
  • The importance of taking hyperparameter tuning seriously. The RoBERTa paper made a splash by presenting a more principled approach to training BERT, with optimised design choices (such as dropping the next sentence prediction task) and more extensive hyperparameter tuning. This updated training regime, together with simply training the model for longer on more data, again pushed performance to record-breaking levels across various NLP benchmarks.
  • Thoughts on model size. The original BERT authors were intrigued to observe that simply increasing the size of the model can dramatically improve performance, even on a very small dataset. Perhaps this means that in some sense you “need” hundreds of millions of parameters to represent human language. Several other papers in 2019 found that simply scaling up the size of NLP models leads to improvements (famously, OpenAI’s GPT-2 model). And there are new tricks for training ridiculously huge NLP models, like NVIDIA’s 8-billion-parameter behemoth. But there has also been evidence of diminishing returns as model size increases, similar to the wall computer vision researchers hit when adding more convolutional layers. The successes of papers on model compression and parameter efficiency suggest that significantly more performance can be squeezed from models of a given size.
Our NLP models are getting bigger and bigger. From the DistilBERT paper.

What is BERT, exactly?

Let’s take a few steps back and discuss what BERT actually is. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model (LM) built by Google researchers. Language models are trained on tasks that incentivise the model to learn a deep understanding of language; a common training task for LMs is next word prediction (“the cat sat on the ___”).
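To make the next-word-prediction idea concrete, here is a toy, count-based sketch in Python. It is nothing like the neural language models discussed in this post (the tiny corpus is made up purely for illustration), but it shows the kind of training signal an LM learns from:

```python
from collections import Counter, defaultdict

# A made-up toy corpus; real language models train on billions of words.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams: how often each word follows a given previous word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(context_word):
    """Probability distribution over the next word, given one word of context."""
    counts = bigram_counts[context_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probs("the"))  # e.g. {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
```

A neural LM replaces the count table with a model that predicts the next word from the full preceding context, but the objective is the same: assign high probability to the word that actually comes next.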

BERT is based on a relatively new neural network architecture — Transformers, which use a mechanism called self-attention to capture the relationships between words. There are no convolutions (as in CNNs) or recurrence operations (as in RNNs) in Transformers (“Attention is All You Need”). There have been some excellent tutorials published explaining Transformers and self-attention, so I will not go into much detail here. But briefly:

  • Self-attention is a sequence-to-sequence operation that updates input token embeddings by baking each word’s context into its representation. This allows the model to capture the relationships between all input words simultaneously — contrast this with RNNs, in which input tokens are read in and processed sequentially. Self-attention computes the similarity between word vectors using dot products, and the resulting attention weights are often visualised as an attention weight matrix (a minimal sketch of this operation follows this list).
  • Attention weights capture the strength of relationships between words, and we allow the model to learn different types of relationships by using multiple attention heads. Each attention head often captures a particular type of relationship between words (with some redundancy). Some of these relationships are intuitively interpretable (like subject-object relationships, or keeping track of neighbouring words), and some are rather inscrutable. You can think of attention heads as being analogous to filters in convolutional networks, where each filter extracts a specific type of feature from the data — whichever feature will best help the rest of the neural network make better predictions.
  • This self-attention mechanism is the core operation in Transformers, but just to put it into context: Transformers were originally developed for machine translation, and they have an encoder-decoder structure. The building block of Transformer encoders and decoders is a Transformer block, which is itself generally composed of a self-attention layer, some amount of normalisation, and a standard feed-forward layer. Each block performs this sequence of operations on input vectors and passes the output to the next block. In Transformers, depth refers to the number of Transformer blocks.
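As promised above, here is a minimal single-head sketch of scaled dot-product self-attention in NumPy. The shapes and random weights are invented for illustration; a real Transformer block adds multiple heads, residual connections and layer normalisation on top of this:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) input token embeddings.
    Wq, Wk, Wv: (d_model, d_head) learned projection matrices.
    Returns context-aware token representations of shape (seq_len, d_head).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Pairwise similarities between every pair of tokens (the attention weight matrix).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # each output is a weighted mix of value vectors

# Tiny example: 4 tokens with 8-dimensional embeddings, projected to one 8-dim head.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```

Because every token attends to every other token in one matrix multiplication, all pairwise relationships are modelled in parallel, which is exactly the contrast with sequential RNN processing described above.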

Using this Transformer setup, the BERT model was trained on 2 unsupervised language tasks. The most important thing about BERT training is that it only requires unlabelled data — any text corpus can be used; you do not need any specially labelled dataset. The BERT paper used Wikipedia and a book corpus for training the model. As with “normal” language models, data comes cheap, and this is a huge advantage.

How is BERT trained?

But what tasks is BERT trained on that encourage it to learn such a good, generally useful understanding of language? Future work tweaked the learning strategy, but the original paper used two tasks:

  1. The Masked Language Model (MLM) task. This task encourages the model to learn good representations at the word level and at the sentence level (since a sentence is built up from its word representations). Briefly, 15% of the words in a sentence are randomly chosen and hidden (or “masked”) with a [MASK] token. The model’s job is to predict the identity of these hidden words, making use of both the words before and after the [MASK] — hence, we are trying to reconstruct the text from a corrupted input, and both left and right contexts are used to make predictions (a toy sketch of this corruption step follows this list). This allows us to build up representations of words that take all of the context into account. BERT learns its bidirectional representations simultaneously, in contrast to methods like ELMo (an RNN-based language model used for generating context-aware word embeddings), where left-to-right and right-to-left representations are learned independently by two language models and then concatenated. We could say that ELMo is a ‘shallow bidirectional’ model whereas BERT is a ‘deep bidirectional’ model.
  2. The Next Sentence Prediction (NSP) task. If our model is going to be used as the basis for language understanding, it would be good for it to have some knowledge of inter-sentence coherence. To encourage the model to learn about the relationship between sentences, we add the Next Sentence Prediction task, in which the model has to predict if a pair of sentences are related, namely if one is likely to come after another. Positive training pairs are real adjacent sentences in the corpus; negative training pairs are randomly sampled from the corpus. It’s not a perfect system, since randomly sampled pairs could actually be related, but it is good enough.
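Here is the toy sketch of the MLM corruption step mentioned in point 1. The tiny vocabulary is made up, and real implementations operate on WordPiece token ids rather than raw words, but the 80/10/10 replacement recipe below mirrors the one described in the BERT paper (80% of selected tokens become [MASK], 10% become a random token, 10% are left unchanged):

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "sat", "mat", "the", "on"]  # made-up vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Corrupt a token sequence for the masked-LM task.

    Returns the corrupted tokens and the prediction targets
    (None means the position is not predicted).
    """
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                       # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK)                # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(TOY_VOCAB))  # 10%: random token
            else:
                corrupted.append(tok)                 # 10%: keep as-is, but still predict it
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets

print(mask_tokens("the cat sat on the mat".split()))
```

For the NSP task, the training pairs are built analogously at the data-preparation stage: half the pairs are genuinely adjacent sentences from the corpus, the other half pair a sentence with a randomly sampled one.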

The model must learn to do both tasks simultaneously, since the actual training loss is the combination of the losses from the two tasks (namely it’s the sum of the mean MLM and NSP likelihoods).
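As a rough sketch of how the two objectives combine in code (the shapes, label values and use of -100 as an ignore marker below are illustrative assumptions, not the reference implementation):

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab_size = 2, 8, 100

# Pretend model outputs: vocabulary scores at every position, and a 2-way NSP score.
mlm_logits = torch.randn(batch, seq_len, vocab_size)
nsp_logits = torch.randn(batch, 2)

# MLM targets: -100 marks positions that were not masked and should be ignored.
mlm_labels = torch.full((batch, seq_len), -100, dtype=torch.long)
mlm_labels[:, 3] = 7                       # pretend position 3 was masked; true token id is 7
nsp_labels = torch.tensor([1, 0])          # 1 = real next sentence, 0 = random sentence

mlm_loss = F.cross_entropy(mlm_logits.view(-1, vocab_size),
                           mlm_labels.view(-1), ignore_index=-100)
nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
total_loss = mlm_loss + nsp_loss           # the two objectives are optimised jointly
print(float(total_loss))
```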

In case you spotted a bit of a problem with the masking approach: you’re right. Since a random 15% of words in a segment are masked, you’re likely to have multiple [MASK] occurrences present. This is fine, but BERT treats these masked tokens independently of one another, which is a bit limiting since they could easily be dependent. This is one of the points addressed by the XLNet paper, which some people consider the successor to BERT.

Fine-tuning BERT

Once the base BERT model is trained, you would usually fine-tune in 2 steps: first by continuing the “unsupervised” training on your unlabelled data, and then by adding an additional layer and training on your actual task objective (using not very many labelled examples). This approach has roots in this 2015 LSTM LM paper by Dai & Le from Google.

BERT fine-tuning will actually update all of the parameters of your model, not just the ones in the new task-specific layer, so this approach differs from techniques which completely freeze transferred layer parameters.
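As an illustration of the supervised step, here is a minimal fine-tuning sketch using the HuggingFace transformers library. It assumes a reasonably recent library version; the example sentences and labels are invented, and a real setup would of course loop over many batches and epochs:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# A small classification head is added on top of the pre-trained encoder.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["I loved this film.", "What a waste of two hours."]  # made-up examples
labels = torch.tensor([1, 0])

encodings = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Note: all parameters are trainable, not just the new classification layer.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**encodings, labels=labels)
loss = outputs.loss            # on older library versions, the loss is outputs[0]
loss.backward()
optimizer.step()
print(float(loss))
```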

In practice, with BERT transfer learning, it is the trained encoder stack that gets reused: BERT uses only the encoder half of the original Transformer architecture, and its stack of encoder blocks serves as a feature extractor, with the output heads used for the pre-training tasks chopped off. We don’t care about the predictions the model made on the tasks it was originally trained on; we are just interested in the way the model has learned to internally represent the textual input.
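For this feature-extractor style of reuse, a minimal sketch (again assuming a recent version of the transformers library; the example sentence is made up) might look like this:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")  # just the encoder stack, no task head
encoder.eval()

sentence = "BERT turns text into context-aware vectors."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# One 768-dimensional vector per WordPiece token (older versions return a tuple; use outputs[0]).
token_vectors = outputs.last_hidden_state
sentence_vector = token_vectors[:, 0, :]   # the [CLS] position is often used as a sentence summary
print(token_vectors.shape, sentence_vector.shape)
```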

BERT fine-tuning might take minutes or hours, depending on your task, data size and TPU/GPU resources. In case you’re interested in trying out BERT fine-tuning ASAP, you can use this ready-made code on Google Colab, which provides access to a free TPU.

What was around before BERT?

The original BERT paper is well-written and I recommend checking it out; the following bullet points summarise the paper’s account of the previous major approaches in the language model pretraining and fine-tuning space:

  • Unsupervised feature-based approaches (like ELMo), which use pre-trained representations as input features but use task-specific architectures (i.e. they change the model’s structure for each new task). All of your favourite word embeddings (from word2vec to GloVe to FastText), sentence embeddings, and paragraph embeddings fall into this category. ELMo also provides word embeddings, but in a context-sensitive manner: the embedding/representation for a token is the concatenation of the left-to-right and right-to-left language model hidden state vectors.
  • Unsupervised fine-tuning approaches (like OpenAI’s GPT model), which fine-tune all pre-trained parameters on a supervised downstream task and only minimally change the model structure by introducing a few task-specific parameters. Pre-training is on unlabelled text, and the learning tasks are usually either left-to-right language modelling or text compression (as with autoencoders, which compress text into a vector representation and reconstruct the text from that vector). However, the ability of these approaches to model context has been limited because they have generally been unidirectional, left-to-right models — for a given word, there was no way to incorporate the words that follow it into its representation.
  • Transfer learning from supervised data. There has also been some work on transferring the knowledge learned from supervised tasks that have lots of training data, e.g. using the weights from a machine translation model to initialise parameters for a different language problem.

Problems, or Things to Think About

  • There has been some work in computer vision suggesting that pre-training and fine-tuning mostly helps to speed up model convergence.

Conclusion

I hope this post has provided a reasonable recap of the BERT phenomenon, and shown just how popular and powerful this model has become in NLP. The field is progressing rapidly, and the results we are now seeing from state-of-the-art models would not have been believable even 5 years ago (e.g. superhuman performance in question answering). The two key trends in recent NLP progress are the rise of transfer learning and of Transformers, and I’m keen to see how these will develop in 2020.

Happy holidays!

Welocalize is an industry leader in NLP and translation technology. To chat with someone on our team about your NLP project, email Dave at david.clark@welocalize.com.
