
ELECTRA: Developments in Masked Language Modelling

This is a functional outline of the main points of ELECTRA, a model described here:

https://arxiv.org/pdf/2003.10555.pdf

With BERT and the BERT-derived transformers (XLNet, RoBERTa, ALBERT, any transformer named after a Sesame Street character), it became clear to me that, as an individual hobbyist with a single GPU and no industry backing, it would be impossible to train a deep learning language model myself. When I was preparing a live presentation for AISC on XLNet (on YouTube here), I came across this tweet by Elliot Turner:

Screencap of this tweet by Elliot Turner

Well, there went my vague dreams of training my own model. This was back in 2019, i.e., the Before Times. By that, I mean before GPT-3, which requires such a staggering amount of computational resources that training it currently takes an eight-figure budget. This model intrigued me with the claim, in the abstract, that ELECTRA-Small could outperform GPT-1 after being trained on a single GPU for four days. They argue that the full-scale model performs comparably to XLNet and RoBERTa while using approximately a quarter of their compute. Intriguing. What is this methodology?

The architecture and most hyperparameters for this model are the same as in BERT, so they won't be outlined here.

Replaced Token Detection is their pre-training methodology of choice. Whereas BERT replaces a subset of the tokens in a sequence with [MASK] and asks the model to recover them, ELECTRA replaces those tokens with plausible alternative words produced by a generator.

The input sequences are lists of tokens x = [x1, …, xn]. MLM first selects a set of k positions m = [m1, …, mk] and replaces the tokens there with [MASK]; the generator is then trained to predict plausible words for those positions. For this paper, they mask out about 15% of input tokens.

This process yields x^corrupt, the original token list with the generator's sampled words inserted at the masked positions m.

It is then the discriminator's job to determine, for every token in the sequence, whether it is the original word or a replacement.
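To make this concrete, here is a minimal PyTorch sketch of how one corrupted training example might be built, assuming a generator that returns per-token vocabulary logits (the function and variable names are mine, not the authors' code):

```python
import torch

def make_electra_example(input_ids, generator, mask_token_id, mask_prob=0.15):
    """Build (x_corrupt, labels) for replaced token detection.

    input_ids: LongTensor (batch, seq_len) holding the original tokens x.
    generator: a masked language model; assumed to return logits of shape
               (batch, seq_len, vocab_size) when called on token ids.
    """
    # 1. Pick roughly 15% of positions to mask.
    masked_positions = torch.rand(input_ids.shape) < mask_prob

    # 2. Replace those positions with [MASK] to form x_masked.
    x_masked = input_ids.masked_fill(masked_positions, mask_token_id)

    # 3. Let the generator sample plausible words at the masked positions.
    with torch.no_grad():
        logits = generator(x_masked)
        samples = torch.distributions.Categorical(logits=logits).sample()

    x_corrupt = input_ids.clone()
    x_corrupt[masked_positions] = samples[masked_positions]

    # 4. Discriminator labels: 1 where the token really changed, 0 otherwise.
    #    If the generator happens to sample the original word, the label stays 0,
    #    i.e. the token counts as "original" (see the GAN comparison below).
    labels = (x_corrupt != input_ids).long()
    return x_corrupt, labels
```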

The loss functions for the MLM and Discriminator are as follows:
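Restated from the paper, with p_G the generator's softmax distribution over the vocabulary, D(x, t) the discriminator's predicted probability that token t of x is original, and m the set of masked positions:

$$
\mathcal{L}_{\mathrm{MLM}}(x, \theta_G) = \mathbb{E}\Big[\sum_{i \in m} -\log p_G\big(x_i \mid x^{\mathrm{masked}}\big)\Big]
$$

$$
\mathcal{L}_{\mathrm{Disc}}(x, \theta_D) = \mathbb{E}\Big[\sum_{t=1}^{n} -\mathbb{1}\big(x_t^{\mathrm{corrupt}} = x_t\big)\log D\big(x^{\mathrm{corrupt}}, t\big) - \mathbb{1}\big(x_t^{\mathrm{corrupt}} \neq x_t\big)\log\big(1 - D\big(x^{\mathrm{corrupt}}, t\big)\big)\Big]
$$

The two are combined into a single objective, $\min_{\theta_G, \theta_D} \sum_{x} \mathcal{L}_{\mathrm{MLM}}(x, \theta_G) + \lambda \mathcal{L}_{\mathrm{Disc}}(x, \theta_D)$, with $\lambda = 50$ in the paper.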

Although this looks mightily like a GAN, the generator is trained with maximum likelihood rather than adversarially. Another difference is that if the generator happens to generate a token identical to the original one, that token is labelled as original rather than replaced.

Experiments

The pre-training data for the base model is the same as for BERT: 3.3 billion tokens from Wikipedia and BooksCorpus. ELECTRA-Large is trained on the XLNet data, which extends the BERT data with ClueWeb, CommonCrawl, and Gigaword.

The evaluation of the models is done on the GLUE benchmark, as well as SQuAD, both explained here.

Model Extensions

There are some changes to ELECTRA that improve model accuracy.

Weight Sharing: The token embeddings used by the generator and discriminator are shared. This yields a small boost in accuracy while training for the same number of epochs and with the same model parameters as no weight tying. Only the embedding weights are shared; sharing all of the weights has the significant drawback of requiring the two networks to be the same size. A theory for why this works well is that the input tokens seen by the generator and the corrupted tokens seen by the discriminator come from the same vocabulary and live in the same vector space, so it makes little sense for the generator to learn one embedding space and for the discriminator to effectively learn a second one plus the transformation between them, when a single shared space will do.
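A rough PyTorch illustration of embedding-only sharing (the class and variable names are hypothetical, not the paper's code; 30522 is BERT's WordPiece vocabulary size):

```python
import torch.nn as nn

vocab_size, embed_dim = 30522, 128

# A single embedding table, shared by both networks.
shared_token_embeddings = nn.Embedding(vocab_size, embed_dim)

class TinyEncoder(nn.Module):
    """Stand-in for a transformer encoder; only the embedding tying matters here."""
    def __init__(self, token_embeddings, hidden_size):
        super().__init__()
        self.token_embeddings = token_embeddings      # shared, not copied
        self.input_proj = nn.Linear(embed_dim, hidden_size)
        # ... transformer layers of width hidden_size would follow ...

generator = TinyEncoder(shared_token_embeddings, hidden_size=64)
discriminator = TinyEncoder(shared_token_embeddings, hidden_size=256)

# The two models now update the exact same embedding parameters.
assert generator.token_embeddings.weight is discriminator.token_embeddings.weight
```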

Smaller Generators:

This largely follows from the previous extension: if the generator and discriminator are the same size, the model needs roughly twice as much computation per training step as plain masked language modelling with [MASK] tokens. Alongside transformer generators of various sizes, they also try a simple unigram generator for producing the replacement tokens. What they settle on is a generator about 0.25–0.5 times the size of the discriminator.
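For instance, with the Hugging Face transformers configuration classes, a generator roughly a quarter the width of its discriminator could be set up like this (the specific numbers are illustrative, not necessarily those of the released checkpoints):

```python
from transformers import ElectraConfig

# Discriminator configuration, roughly ELECTRA-Small-sized (illustrative numbers).
disc_config = ElectraConfig(
    embedding_size=128, hidden_size=256,
    num_attention_heads=4, intermediate_size=1024)

# A generator about a quarter of the discriminator's width, keeping the same
# embedding size so the embedding tables can still be tied.
gen_config = ElectraConfig(
    embedding_size=128, hidden_size=64,
    num_attention_heads=1, intermediate_size=256)
```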

Training Algorithms:

Clark et al. tried multiple high-level training algorithms for ELECTRA as a whole. Jointly training the generator and discriminator, the paper's default, performed best, but they also experimented with adversarial training and with the following two-stage procedure (sketched in code after the list):

  1. Train only the generator (with maximum likelihood) for n steps.
  2. Initialize the weights of the discriminator with the weights of the generator. Then train the discriminator with the discriminator loss function for n steps, keeping the weights of the generator frozen.
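A compressed sketch of that two-stage procedure, reusing the make_electra_example helper from earlier (the optimizers, learning rates, and model interfaces here are my assumptions, not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def two_stage_pretraining(generator, discriminator, batches, n_steps,
                          mask_token_id, mask_prob=0.15):
    """Sketch of the two-stage alternative; assumes generator(x) returns MLM
    logits over the vocabulary and discriminator(x) returns one logit per token.
    The weight copy in stage 2 requires the two networks to be the same size."""
    gen_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)

    # Stage 1: train only the generator with maximum likelihood for n steps.
    for _, x in zip(range(n_steps), batches):
        mask = torch.rand(x.shape) < mask_prob
        x_masked = x.masked_fill(mask, mask_token_id)
        mlm_loss = F.cross_entropy(generator(x_masked)[mask], x[mask])
        gen_opt.zero_grad()
        mlm_loss.backward()
        gen_opt.step()

    # Stage 2: initialise the discriminator from the generator's weights,
    # then train only the discriminator while the generator stays frozen.
    discriminator.load_state_dict(generator.state_dict())
    generator.requires_grad_(False)
    disc_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

    for _, x in zip(range(n_steps), batches):
        x_corrupt, labels = make_electra_example(x, generator, mask_token_id)
        disc_loss = F.binary_cross_entropy_with_logits(
            discriminator(x_corrupt), labels.float())
        disc_opt.zero_grad()
        disc_loss.backward()
        disc_opt.step()
```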

Large Models:

To be compared with typically-sized SOTA models, ELECTRA has to be made comparably large. They train their own BERT-Large model using the same hyperparameters and training time as ELECTRA-400K.

A point I am curious about: the paper makes it a bit of a puzzle how long the models take to train relative to each other. ELECTRA-Large is the same size as BERT-Large, and the authors also trained their own BERT with the same hyperparameters and training time as ELECTRA-400K. The selling point they describe for ELECTRA is using far fewer physical computational resources, but I wish those physical and time resources were laid out more explicitly alongside the other large models, for comparison purposes. They do list train FLOPs, i.e. the total count of floating point operations used in training (not a rate like FLOPS, operations per second), and in the appendix they clarify that an "operation" is counted as a mathematical operation rather than a machine instruction. A side-by-side comparison of the hardware used and the time it took them to pre-train and fine-tune would have been appreciated.

Results on the GLUE dev set.

Their own BERT-Large was trained with the same hyperparameters used for ELECTRA-400K.

Efficiency Analysis:

Their suspicion is that ELECTRA's efficiency gains come from computing the loss over all the tokens in the input, rather than only over the gaps created by explicit masks. To test this, they created ELECTRA 15%, which only calculates the discriminator loss over the 15% of tokens that were masked out. Full ELECTRA scores 85.0 on GLUE, while ELECTRA 15% scores 82.4, comparable to BERT's 82.2.
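The difference between the two variants comes down to which positions contribute to the discriminator loss; a minimal sketch, assuming per-token logits and replaced/original labels as above:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(logits, labels, masked_positions, only_masked=False):
    """Per-token binary cross-entropy for the discriminator.

    logits, labels: (batch, seq_len); masked_positions: bool (batch, seq_len).
    only_masked=True mimics the 'ELECTRA 15%' ablation, which scores only the
    ~15% of positions that were masked out; the default scores every token.
    """
    per_token = F.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none")
    if only_masked:
        return per_token[masked_positions].mean()
    return per_token.mean()
```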

Negative Results:

In this study, they tried many variations (model sizes, generator architectures, weight tying, and more), and some ideas were unsuccessful enough that they did not make it into the main body of the paper:

  1. Strategically applying masking to rarer tokens. This did not result in any significant speedup over regular BERT.
  2. Raising the temperature of the generator, or disallowing it from outputting the correct word. Neither improved results.
  3. Adding a sentence-level contrastive objective over a series of tokens, something like a masked language modelling version of SpanBERT. This actually decreased model scores on the GLUE and SQuAD tasks.

Conclusion:

Though ELECTRA is not trained adversarially, it is an example of how contrastive learning can be applied to language with great effect. In broad terms, contrastive learning involves discriminating between observed data points and fictitious examples. BERT, ERNIE, XLNet, ELMo, RoBERTa, and SpanBERT have all introduced or extended the masked language modelling paradigm, where the model guesses the correct token for a single masked-out position; ELECTRA extends it further, since any token in the sequence received by the discriminator might be a replacement. ELECTRA is now available in Hugging Face Transformers and simpletransformers.
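For example, the released small discriminator can be loaded through the transformers library and used for replaced token detection; a rough sketch, assuming the publicly released google/electra-small-discriminator checkpoint and the current ElectraForPreTraining interface:

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

model_name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(model_name)
discriminator = ElectraForPreTraining.from_pretrained(model_name)

# A sentence where one word ("ate") has been swapped in as a plausible replacement.
fake_sentence = "The chef ate the meal quickly"
inputs = tokenizer(fake_sentence, return_tensors="pt")

with torch.no_grad():
    logits = discriminator(**inputs).logits    # one score per token

# Positive logits mean the token is predicted to be a replacement.
predictions = (logits > 0).long().squeeze().tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
print(list(zip(tokens, predictions)))
```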

