
Is BERT the answer to everything?

Know the flaws of BERT that gave rise to its cousins!

Introduction

With the advent of the Transformer, there has been a rapid rise in language models. In 2018, BERT arrived and broke all records. However, shortly after BERT, a long list of its cousins was born: RoBERTa, ALBERT, StructBERT, and DistilBERT, to name a few.

BERT is essentially trained to optimise two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP). Since there are already plenty of excellent blogs on BERT, this post focuses on understanding why BERT might not work in some cases and what alternatives are available.

Pre-Training Tasks

Before moving on to the limitations, let us briefly discuss how MLM and NSP work.

Masked Language Model (MLM)

MLM was originally introduced as a "Cloze task" by Taylor. In BERT, it is used as an extension of a unidirectional language model. A standard unidirectional language model takes context from one side only, yet the context from the opposite direction can be equally important. For example, when trying to predict the word "capital" in the sentence "Paris is the capital of France", the words to the left and to the right of "capital" are both needed to make the correct prediction. Since context from both directions cannot be fed into a traditional language model, BERT implements this as a masked language model.

In MLM, some of the words or tokens are masked and the task is to predict those tokens. You can think of it as a fill-in-the-blank task.
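To make this concrete, here is a minimal fill-in-the-blank sketch (my addition, not part of the original post) using the Hugging Face transformers library and a pre-trained BERT:

```python
# Fill-in-the-blank with a pre-trained BERT (Hugging Face transformers).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT should rank "capital" among its top predictions for the blank.
for pred in unmasker("Paris is the [MASK] of France."):
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")
```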

Next Sentence Prediction (NSP)

Given two sentences s1 and s2, NSP is the task to predict whether s1 is followed by s2 in the same document.

For example,

s1="I am thirsty" s2="Pour me a drink" Label = IsNextSentence

s1="I am thirsty" s2="This house needs cleaning" Label = Not IsNextSentence

For NSP, the inputs are combined as given below:

[CLS] I am thirsty [SEP] Pour me a drink [SEP]

[SEP] is the special token that separates the two sentences.

Combined with MLM, the same input will now have masked tokens, as shown below:

[CLS] I am thirsty [SEP] Pour me a [MASK] [SEP]
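If you are curious how this sentence-pair input is built in practice, here is a small sketch (my addition, using the Hugging Face transformers tokenizer) that encodes the same pair:

```python
# Encoding a sentence pair the way BERT expects it (Hugging Face transformers).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("I am thirsty", "Pour me a drink")

# [CLS] <tokens of sentence 1> [SEP] <tokens of sentence 2> [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Segment ids: 0 for the first sentence, 1 for the second.
print(encoded["token_type_ids"])
```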

Analysing MLM

MLM has proven to be a very useful task for pre-trained language models because of its bidirectional nature.

However, the main issue with it is the mismatch between the pre-training and fine-tuning phases: the [MASK] token appears during pre-training but never during fine-tuning.

To deal with this, for each token selected for prediction, BERT uses the [MASK] token 80% of the time, a random token 10% of the time, and keeps the original token the remaining 10% of the time.
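As a rough sketch (my own illustration, not BERT's actual implementation), the 80/10/10 rule for a selected token could look like this:

```python
# A sketch of the 80/10/10 rule, applied to each token selected for prediction
# (roughly 15% of the input positions).
import random

def corrupt(token, vocab):
    r = random.random()
    if r < 0.8:
        return "[MASK]"              # 80%: replace with the mask token
    if r < 0.9:
        return random.choice(vocab)  # 10%: replace with a random token
    return token                     # 10%: keep the original token

vocab = ["drink", "capital", "house", "paris", "thirsty"]
print([corrupt(t, vocab) for t in ["pour", "me", "a", "drink"]])
```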

However, even with this approach, some claim that the mismatch issue remains largely unsolved. This gave birth to a new pre-training task called the Permuted Language Model (PLM), introduced in XLNet. In short, PLM randomly samples a permutation from all possible permutations of the input sequence; some of the tokens are then masked and predicted by the model. Note that this does not change the original order of the tokens. Instead, the permutation only defines the order in which the tokens are predicted.
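Here is a toy illustration of the idea (my own, not XLNet's actual implementation): the input order stays fixed, only the prediction order is permuted:

```python
# A toy view of PLM: the token order stays fixed; only the order in which
# tokens are *predicted* is randomly permuted.
import random

tokens = ["Paris", "is", "the", "capital", "of", "France"]
order = list(range(len(tokens)))
random.shuffle(order)  # a randomly sampled prediction (factorisation) order

# Predict the last two positions of the sampled order, each one conditioned
# only on the tokens that appear earlier in that order.
for pos in order[-2:]:
    context = [tokens[i] for i in order[: order.index(pos)]]
    print(f"predict '{tokens[pos]}' given {context}")
```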

Another major improvement over MLM is dynamic masking. In BERT, all masked tokens are chosen once and for all before training (in the preprocessing step). In RoBERTa, by contrast, tokens are masked dynamically: across epochs, the same input sequence will have different tokens masked. RoBERTa has shown quite noticeable improvements over BERT on various downstream tasks.
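A minimal sketch of dynamic masking (my own simplification, not RoBERTa's exact procedure) could look like this:

```python
# A sketch of dynamic masking: a fresh random mask is drawn every time a
# sequence is fed to the model, instead of being fixed during preprocessing.
import random

def dynamic_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    return [mask_token if random.random() < mask_prob else t for t in tokens]

tokens = "pour me a drink before the show starts".split()
print(dynamic_mask(tokens))  # different positions masked ...
print(dynamic_mask(tokens))  # ... on every call
```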

There is a lot of research proposing enhanced versions of MLM. Some of them are:

  • UniLM: The Unified Language Model extends mask prediction to unidirectional, bidirectional and sequence-to-sequence prediction.
  • TLM: The Translation Language Model takes parallel bilingual data and randomly masks tokens in both the source and target languages.
  • Whole word masking: BERT originally uses a wordpiece tokenizer, and masking is done at the piece/token level. More recent research has shown improvements when masking the whole word instead of its individual pieces (see the sketch after this list).
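Here is the promised sketch of whole word masking (my own simplification, using the Hugging Face wordpiece tokenizer): if any piece of a word is selected, all of its pieces get masked:

```python
# A sketch of whole word masking: pieces starting with "##" belong to the
# previous word, so if a word is selected, all of its pieces are masked.
import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
pieces = tokenizer.tokenize("The philanthropist donated generously")

# Group wordpiece indices by the word they belong to.
words, current = [], []
for i, piece in enumerate(pieces):
    if piece.startswith("##") and current:
        current.append(i)
    else:
        if current:
            words.append(current)
        current = [i]
words.append(current)

# Mask roughly 15% of the *words*, i.e. every piece of each selected word.
masked = list(pieces)
for group in random.sample(words, max(1, round(0.15 * len(words)))):
    for i in group:
        masked[i] = "[MASK]"
print(masked)
```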

What’s wrong with NSP?

Many researchers have claimed that NSP is not a necessary task and that removing or altering it is a better option. Why is that so?

As mentioned by the authors of ALBERT, NSP conflates topic prediction and coherence prediction. They argue that NSP mostly learns whether the two sentences belong to the same topic, which is much easier than learning whether the sentences are coherent with each other.

Even when training BERT from scratch on custom data, we noticed the NSP accuracy spiking up pretty quickly. However, using the same model on a sentence-pair fine-tuning task gave poor performance.

What are the alternatives?

A lot of BERT’s cousins follow a pattern: remove NSP, use MLM (with or without modifications), and add a new task to replace NSP (or something that works for sentence/segment-pair tasks). Some of my personal favourite alternatives are:

Sentence Order Prediction (SOP)

SOP uses two consecutive segments from the same document as a positive example, and the same two segments with their order swapped as a negative example.
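A minimal sketch (my own) of how SOP training pairs can be built from a document:

```python
# Building SOP training pairs from a document: consecutive segments in their
# original order are positives, the same segments swapped are negatives.
def sop_examples(segments):
    examples = []
    for s1, s2 in zip(segments, segments[1:]):
        examples.append(((s1, s2), 1))  # correct order
        examples.append(((s2, s1), 0))  # swapped order
    return examples

doc = ["I am thirsty.", "Pour me a drink.", "Thanks a lot."]
for pair, label in sop_examples(doc):
    print(label, pair)
```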

Following ALBERT’s improvements over BERT on various downstream tasks, StructBERT also adopted SOP as a pre-training task.

Sequence to Sequence MLM

As used in MASS, Seq2Seq MLM uses encoder-decoder style training: the encoder is fed a masked sequence, and the decoder sequentially produces the masked tokens in an auto-regressive fashion.
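Here is a toy view (my own illustration, not the actual MASS implementation) of what the encoder and decoder see:

```python
# A toy view of Seq2Seq MLM: the encoder sees a sequence with a masked span,
# and the decoder generates that span token by token, each step conditioned
# on what it has produced so far.
encoder_input = ["Paris", "is", "[MASK]", "[MASK]", "of", "France"]
masked_span = ["the", "capital"]  # the span the decoder must reproduce

generated = []
for target in masked_span:
    print(f"decoder sees {generated} -> predicts '{target}'")
    generated.append(target)
```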

ELECTRA – Replaced Token Detection (RTD)

ELECTRA is a very interesting model that uses a generator to replace some tokens of a sequence. The job of the discriminator is then to identify whether each token is an original or a replaced one.
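A toy illustration (my own) of what the discriminator is asked to do:

```python
# A toy view of replaced token detection: a generator swaps in plausible
# tokens, and the discriminator labels each position as original (0) or
# replaced (1).
original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # "cooked" was replaced

labels = [int(o != c) for o, c in zip(original, corrupted)]
print(list(zip(corrupted, labels)))
```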

You would be amazed at the number of options available for language models. This paper does an excellent job of summarising many state-of-the-art language models from all aspects: pre-training tasks, architecture types, model extensions, etc. I would definitely suggest giving it a read.

Conclusion

BERT and other pre-trained NLP models have undoubtedly made their mark. More importantly, they have served as building blocks for a lot of other research. However, with so many models available, it becomes impractical to try out all of them for a given task.

I hope this post has made you more aware of why BERT might not work on your custom data and which alternatives are readily available. Personally, my vote goes to RoBERTa and ALBERT as the go-to models for most NLP fine-tuning tasks.

What’s your take on the plethora of NLP models available out there?

Cheers,

Eram
