
Robustly optimized BERT Pretraining Approaches

Summary of changes in pretraining approaches used to enhance the performance of the BERT model

Vikas Bhandary
Towards Data Science
5 min read · Aug 5, 2019


Natural language processing (NLP) has been on the rise recently, largely thanks to language model (LM) pretraining. A careful comparison between LM pretraining methods is nearly impossible for several reasons: differences in training data, computational costs, and hyperparameter choices. In this post, I will summarize the changes made to BERT's pretraining method that improved its final results, allowing RoBERTa to achieve state-of-the-art results on 4 out of 9 GLUE tasks. For a detailed understanding, please read the following paper.

BERT was undertrained

We almost thought the BooksCorpus and English Wikipedia datasets (a total of 16 GB) were enough for a language model to get a basic understanding of language, but that wasn't the case. Facebook and University of Washington researchers (the authors of the paper) trained RoBERTa on a total of 160 GB of uncompressed English text, using DGX-1 machines, each with 8 × 32 GB Nvidia V100 GPUs.

Table 1: Comparing the dataset size and the performance of various models. (image is taken from the paper)

In the table above, we can see that under similar settings RoBERTa outperforms the BERT-large model on the SQuAD and MNLI-m tasks in just 100K steps (compared to BERT's 1M steps).

Next Sentence Prediction Loss

BERT uses two training objectives: Masked Language Model (MLM) and Next Sentence Prediction (NSP). In MLM, BERT selects 15% of the tokens for possible replacement; of these, 80% are replaced with [MASK], 10% are left unchanged, and 10% are replaced with randomly selected vocabulary tokens.
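
To make the 80/10/10 corruption rule concrete, here is a minimal Python sketch of the MLM masking step. The toy vocabulary, the `mask_tokens` name, and the plain string tokens are illustrative assumptions; real implementations work on token ids from the model's tokenizer.

```python
import random

MASK_TOKEN = "[MASK]"
# toy vocabulary for the 10% random-replacement case (illustrative only)
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast"]

def mask_tokens(tokens, mask_prob=0.15):
    """Select ~15% of positions; of those, replace 80% with [MASK],
    10% with a random vocabulary token, and leave 10% unchanged.
    Returns the corrupted sequence and per-position labels
    (the original token where a prediction is required, else None)."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue                             # position not selected
        labels[i] = tok                          # model must recover the original
        roll = random.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN            # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = random.choice(VOCAB)  # 10%: random vocabulary token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

print(mask_tokens("the cat sat on the mat".split()))
```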

NSP is a binary classification task that predicts whether two input segments appear together in the original text. This objective was designed to improve downstream tasks, as it requires the model to understand the relationship between two given sequences. The original BERT paper observed that removing NSP hurts performance on Natural Language Understanding (NLU) tasks, but recent models such as XLNet and SpanBERT have questioned the necessity of the NSP loss.

Therefore, experiments were performed under four input settings (SEGMENT-PAIR, SENTENCE-PAIR, FULL-SENTENCES, DOC-SENTENCES) to compare the effect of the NSP loss on model performance. In the SEGMENT-PAIR setting (with NSP loss), BERT's original input format is used; in the SENTENCE-PAIR setting (with NSP loss), each input is a pair of natural sentences sampled either from contiguous portions of the same document or from different documents. In the FULL-SENTENCES setting (without NSP loss), each input consists of complete sentences sampled contiguously from one or more documents, while in the DOC-SENTENCES setting (without NSP loss), the input is packed with complete sentences from a single document. The maximum input length for these experiments is 512 tokens.
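
For intuition, here is a rough sketch of how complete sentences could be packed into 512-token inputs in the FULL-SENTENCES style, crossing document boundaries with an extra separator. The function name, the separator handling, and the pre-tokenized input format are simplifying assumptions, not the authors' implementation.

```python
MAX_LEN = 512      # maximum input length used in these experiments
DOC_SEP = "[SEP]"  # extra separator when an input crosses a document boundary

def pack_full_sentences(documents):
    """documents: list of documents, each a list of pre-tokenized sentences.
    Yields training inputs of roughly MAX_LEN tokens built from complete
    sentences, possibly spanning more than one document."""
    current = []
    for d, doc in enumerate(documents):
        for sentence in doc:
            # start a new input if the next full sentence would overflow
            if current and len(current) + len(sentence) > MAX_LEN:
                yield current
                current = []
            current.extend(sentence)  # oversized sentences kept whole for simplicity
        if current and d < len(documents) - 1:
            current.append(DOC_SEP)   # mark the boundary between documents
    if current:
        yield current
```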

Table 2: Comparison of performance of models with and without NSP loss (image is taken from the paper)

The table shows that the model trained in the DOC-SENTENCES setting (without NSP loss) outperforms every other setting; in the settings built from individual sentence pairs, the model is not able to learn long-range dependencies. Despite that, RoBERTa uses the FULL-SENTENCES setting, because DOC-SENTENCES results in variable batch sizes.

Static vs. Dynamic Masking

In the MLM training objective, BERT performs masking only once, during data preprocessing, which means the same masked inputs are fed to the model in every epoch. This is referred to as static masking. To avoid reusing exactly the same mask in every epoch, BERT's training data was duplicated 10 times, so each sequence is seen with 10 different masks over training. If instead the masking is performed every time a sequence is fed to the model, the model sees different versions of the same sentence with masks in different positions; this is referred to as dynamic masking.
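
Reusing the `mask_tokens` sketch from the MLM section above, the difference between the two strategies can be sketched as follows. Both helpers are illustrative, not actual library APIs.

```python
def static_masking(dataset, num_duplicates=10):
    """BERT-style static masking: corrupt each sequence up front and
    duplicate the data so every sequence gets num_duplicates fixed masks,
    which are then reused across epochs."""
    return [mask_tokens(seq) for seq in dataset for _ in range(num_duplicates)]

def dynamic_masking(dataset, num_epochs):
    """RoBERTa-style dynamic masking: draw a fresh mask every time a
    sequence is fed to the model, so no two epochs see the same corruption."""
    for _ in range(num_epochs):
        for seq in dataset:
            yield mask_tokens(seq)   # new random mask on every pass
```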

Table 3: Comparison of performance of models trained with static and dynamic masking

The performance of the model trained with dynamic masking is slightly better than, or at least comparable to, the original static-masking approach used in BERT, so RoBERTa is trained with dynamic masking. You can check this link if you are interested in understanding how masking works and how it improves the BERT model's overall performance.

Training with large mini-batches

Training with large mini-batches can improve both optimization speed and end-task performance if the learning rate is increased accordingly, as shown in previous work on machine translation (Ott et al., 2018). The computational cost of training a model for 1M steps with a batch size of 256 is roughly equivalent to training for 31K steps with a batch size of 8K. Training with large mini-batches improves the perplexity of the MLM objective, and large batches are also easier to parallelize via distributed data-parallel training. Even without large-scale parallelization, efficiency can be improved through gradient accumulation.
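
As a sanity check on the numbers, 1M steps with a batch of 256 and 31K steps with a batch of 8K both process roughly 250M sequences. Below is a minimal sketch of gradient accumulation, assuming PyTorch-style `model`, `optimizer`, `loss_fn`, and `data_loader` objects supplied by the reader; the names and constants are placeholders, not RoBERTa's actual training code.

```python
ACCUMULATION_STEPS = 32   # e.g. 32 micro-batches of 256 -> effective batch of 8192

def train_one_epoch(model, optimizer, loss_fn, data_loader):
    """Accumulate gradients over several small micro-batches and apply a
    single optimizer update per effective large batch."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data_loader):
        loss = loss_fn(model(inputs), targets)
        # scale so the accumulated gradient averages over the large batch
        (loss / ACCUMULATION_STEPS).backward()
        if (step + 1) % ACCUMULATION_STEPS == 0:
            optimizer.step()          # one update per effective large batch
            optimizer.zero_grad()
```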

Table 4: Effect of increasing the batch size (bsz) and learning rate on model performance (image is taken from the paper)

Byte-Pair Encoding

Byte-Pair Encoding (BPE) is a hybrid between character-level and word-level representations that relies solely on subword units. These subword units are extracted by performing a statistical analysis of the training corpus. BPE vocabulary sizes typically range from 10K to 100K subword units.

BERT uses a character-level BPE vocabulary of 30K units, which is learned after preprocessing the input with heuristic tokenization rules. RoBERTa instead uses the byte-level encoding introduced by Radford et al. (2019) for GPT-2: a BPE subword vocabulary of 50K units (still bigger than BERT's vocabulary) that can encode any text without producing unknown tokens and without any preprocessing or tokenization rules. Using this encoding slightly degraded end-task performance in some cases, but the method was still adopted because it is a universal encoding scheme that needs no preprocessing or tokenization rules.
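
As a toy illustration of what "learning a subword vocabulary from the data" means, the sketch below repeatedly merges the most frequent adjacent pair of symbols, starting from raw UTF-8 bytes so that any text can be represented without unknown tokens. The corpus, function name, and merge count are made up for illustration; the real GPT-2/RoBERTa tokenizer learns roughly 50K merges from a much larger corpus.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules over raw bytes: count adjacent symbol pairs,
    fuse the most frequent pair into a new symbol, and repeat."""
    # start from UTF-8 bytes, so there are never out-of-vocabulary symbols
    words = Counter(tuple(word.encode("utf-8")) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        new_words = {}
        for word, freq in words.items():          # apply the merge everywhere
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(best)           # fuse the pair into one symbol
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] = freq
        words = new_words
    return merges

print(learn_bpe_merges("low lower lowest low low", num_merges=5))
```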

Conclusions:

  1. Even the smallest decisions made when choosing a pretraining strategy (and its hyperparameters) play a crucial role in the model's performance on end tasks.
  2. Most state-of-the-art Transformer models are undertrained. Even the longest-trained RoBERTa model did not overfit after 500K steps.
