Train Large, Then Compress

How to train faster, better-performing transformers

Jonas Vetterle
Towards Data Science
5 min read · Apr 14, 2020



For over two years now, transformer models pretrained on large corpora of text have been the state of the art in all things NLP. Researchers and practitioners continue to push boundaries by inventing better architectures or training larger models on more data. Indeed, few would disagree that, all else equal, training larger models on more data increases performance. But what if one is time- or resource-constrained?

Common wisdom is to take the hit in accuracy and train smaller models. Not only are smaller models faster to train and to run inference with, they are also cheaper, right?

Recent research by Berkeley Artificial Intelligence Research (BAIR)¹ suggests this is not the case. Larger models train faster and can be compressed more effectively, thereby decreasing inference time. The authors therefore conclude that

“the best strategy for resource-constrained training is to train large models and then heavily compress them”

Contrary to common practice, in a resource-constrained setting it is optimal to train large models [Source]

Larger models train faster

The authors ran experiments with the following models:

  • a version of the RoBERTa model for self-supervised language modelling; and
  • the standard transformer model for machine translation.

In each experiment, the authors vary the size of the model in terms of its depth (2–24 layers) and width (hidden size 128–2048).
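
To make the setup concrete, here is a minimal sketch of how one might instantiate RoBERTa-style masked language models of different depths and widths. The Hugging Face transformers library and the particular size grid are my own illustration, not the authors' training code.

```python
# A rough sketch (not the authors' code): instantiating RoBERTa-style
# masked language models of different depths and widths with the
# Hugging Face `transformers` library.
from transformers import RobertaConfig, RobertaForMaskedLM

def build_model(num_layers: int, hidden_size: int) -> RobertaForMaskedLM:
    """Return a randomly initialised RoBERTa masked LM of the given size."""
    config = RobertaConfig(
        num_hidden_layers=num_layers,           # depth
        hidden_size=hidden_size,                # width
        num_attention_heads=hidden_size // 64,  # keep the per-head size at 64
        intermediate_size=4 * hidden_size,      # usual 4x feed-forward expansion
    )
    return RobertaForMaskedLM(config)

# Illustrative grid, loosely following the ranges mentioned above.
for depth in (2, 12, 24):
    for width in (128, 1024, 2048):
        model = build_model(depth, width)
        n_params = sum(p.numel() for p in model.parameters())
        print(f"depth={depth:2d}  width={width:4d}  ->  {n_params / 1e6:.1f}M parameters")
```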

Deeper models achieve better performance faster [Source]

The main results are that larger models:

1. are more sample-efficient: they obtain better results (lower perplexity on the language modelling task and a higher BLEU score on the translation task) after fewer gradient steps; and

2. train faster even in terms of wall-clock time: the reduction in the number of gradient steps needed more than offsets the higher computational cost per step of a larger model.

Wider models achieve better performance faster [Source]

Larger models compress better

Not only do larger models train faster, they also predict faster. This is because they are more compressible: you can trim them down to the same inference cost as small models while retaining higher accuracy.

To arrive at this result, the authors compress their models in two ways (a rough code sketch follows the list):

  • Quantization: they store the parameters at lower precision (down to 4 bits) to save computation time and memory; and
  • Pruning: they iteratively zero out the 15% of parameters with the lowest magnitudes and then finetune, which reduces the number of operations and the memory footprint (as the weight matrices become sparse).
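
As a rough illustration of both ideas, here is how post-training int8 quantization and one round of global magnitude pruning can be applied to a PyTorch model. This is a minimal sketch under my own assumptions, not the authors' implementation: they quantize down to 4 bits and interleave pruning with finetuning.

```python
# A rough PyTorch sketch of the two compression methods; the authors' setup
# (quantization down to 4 bits, iterative 15% pruning with finetuning after
# each round) is more involved than this.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for a trained transformer: a small stack of linear layers.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# Quantization: store the Linear weights as int8 instead of float32 using
# PyTorch's built-in post-training dynamic quantization.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Pruning: zero out the 15% lowest-magnitude weights across all Linear layers.
# In the paper this step is repeated, with finetuning in between.
params_to_prune = [
    (module, "weight") for module in model.modules() if isinstance(module, nn.Linear)
]
prune.global_unstructured(
    params_to_prune, pruning_method=prune.L1Unstructured, amount=0.15
)

total = sum(m.weight.numel() for m, _ in params_to_prune)
zeros = sum((m.weight == 0).sum().item() for m, _ in params_to_prune)
print(f"fraction of pruned weights: {zeros / total:.2%}")
```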

The authors find that:

1. with both compression methods, larger models provide the better accuracy-efficiency trade-off: the drop in accuracy when compressed is lower than for smaller models; and

2. both compression methods can be carried out with little additional computational overhead.

Larger models generally achieve better accuracy at any level of compression [Source]

When and why do we achieve these results?

While there is extensive literature on why bigger models obtain higher test accuracy, less research has been directed towards whether and why they also converge faster. The authors offer some explanations for why this might be the case:

1. These results hold on large datasets (where overfitting is less of an issue). Empirically, larger models decrease the training error faster. And since, for large datasets, the generalization gap (the difference between train and test error) is less of an issue, larger models also decrease the test error faster. The authors note that

“the challenge in the MLM task is not overfitting, but instead, it is fitting the data — even 8 billion parameter models do not overfit to large pretraining corpora”

When the training dataset is reduced to 1% or 5% of its original size, the authors find that the benefit of training large models vanishes.

2. Larger models use compute more efficiently. The bottleneck when training large models on tasks like masked language modelling is not compute but memory and storage, so larger models make better use of the compute that is available.

3. Larger models incur smaller compression errors. The authors show that the parameters resulting from quantization and pruning are closer to those of the original (uncompressed) models for large models than for small models (a toy sketch of this idea follows below).
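
As a toy illustration of what such a compression error looks like, the snippet below measures the relative distance between a weight matrix and its quantized or pruned counterpart; both the metric and the simulated compression are my own simplification, not the paper's exact definitions.

```python
# Toy illustration (not the paper's exact metric): how far are the compressed
# weights from the original ones?
import torch

def compression_error(original: torch.Tensor, compressed: torch.Tensor) -> float:
    """Relative L2 distance between original and compressed weights."""
    return ((original - compressed).norm() / original.norm()).item()

weights = torch.randn(2048, 2048)

# Simulated 4-bit uniform quantization.
levels = 2 ** 4
scale = weights.abs().max() / (levels / 2)
quantized = torch.clamp((weights / scale).round(), -levels / 2, levels / 2 - 1) * scale

# Simulated magnitude pruning of the 15% smallest weights.
k = int(0.15 * weights.numel())
threshold = weights.abs().flatten().kthvalue(k).values
pruned = torch.where(weights.abs() > threshold, weights, torch.zeros_like(weights))

print("quantization error:", compression_error(weights, quantized))
print("pruning error:     ", compression_error(weights, pruned))
```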

Why is this relevant?

This paper constitutes a potential paradigm shift in how models are trained under resource constraints. This has obvious economic benefits: less time and money spent training models, while achieving higher performance.

It should be noted that in all of the above experiments, the authors did not perform any hyperparameter optimization apart from choosing model depth and width. In a real-world setting, however, good hyperparameters are unknown, so a lot of time and resources are spent finding the best configuration.

The findings of this paper are important for another reason. Searching for good hyperparameters, or simply training the final model, can take many GPU-hours and can therefore have a substantial carbon footprint.² By reducing the time it takes to train a model to a given accuracy, it might be possible to reduce emissions. Check out the ML CO₂ Calculator tool by the Mila lab if you are interested in finding out more about your carbon footprint.

[1]: Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein and Joseph E. Gonzalez, Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers (2020).

[2]: Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt and Thomas Dandres, Quantifying the Carbon Emissions of Machine Learning (2019), Climate Change AI workshop at NeurIPS 2019.
