Opinion

DeepMind’s latest paper dismantles the tired trend of building larger and larger models to improve performance.
The company has identified a key aspect of scaling large language models that no one had properly applied before. OpenAI, Google, Microsoft, Nvidia, Facebook, and even DeepMind itself – all big tech companies committed to creating powerful language models – have been doing it wrong: Making models larger is neither the best nor the most efficient approach.
Increasing model size as a proxy for increasing performance was established in 2020 by Kaplan and colleagues at OpenAI. They found a power-law relationship between model size and performance and concluded that, as more budget becomes available to train models, the majority should be allocated to making them bigger.
That’s why we’ve seen ever-larger models being released every few months since 2020: GPT-3 (175B), LaMDA (137B), Jurassic-1 (178B), Megatron-Turing NLG (530B), Gopher (280B) – and that’s just the dense models. As predicted by Kaplan’s law, these models are significantly better than the previous generation (GPT-2, BERT), just not as good as they could’ve been.
They drew the wrong conclusion: that model size alone carried the responsibility for improving the models. They missed another key factor: data.
DeepMind’s findings will define language model scaling in the future
In a new paper ("Training Compute-Optimal Large Language Models" by Hoffmann et al.), DeepMind researchers revisited Kaplan’s conclusions and found that scaling the number of training tokens (that is, the amount of text data the model is fed) is as important as scaling model size.
Given a fixed compute budget, researchers should allocate it in similar proportions to increase model size and number of training tokens to reach the compute-optimal model (measured by minimal training loss). "For every doubling of model size the number of training tokens should also be doubled." This implies that a smaller model can vastly outperform a larger – but suboptimal – model if trained on a significantly higher number of tokens.
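As a back-of-the-envelope illustration (my own sketch, not code from the paper), here's what that rule implies under the standard approximation that training compute is C ≈ 6·N·D FLOPs for a model with N parameters trained on D tokens, together with Chinchilla's roughly 20-tokens-per-parameter ratio:

```python
# Sketch of the compute-optimal scaling rule (my own, not the paper's code).
# Standard approximation: training FLOPs C ~= 6 * N * D,
# where N = parameters and D = training tokens.

def compute_optimal(c_flops, tokens_per_param=20.0):
    """Split a FLOPs budget so N and D grow in equal proportion.

    tokens_per_param ~= 20 matches Chinchilla (1.4T tokens / 70B params).
    """
    n = (c_flops / (6 * tokens_per_param)) ** 0.5  # parameters
    d = tokens_per_param * n                       # tokens
    return n, d

# Chinchilla's approximate budget: 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs
n, d = compute_optimal(5.9e23)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")  # ~70B, ~1.4T

# Doubling N requires 4x the compute, which also doubles D:
n2, d2 = compute_optimal(4 * 5.9e23)  # -> ~140B params, ~2.8T tokens
```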
And they proved it. The star of the new paper is Chinchilla, a 70B-parameter model 4 times smaller than the previous leader in language AI, Gopher (also built by DeepMind), but trained on 4 times more data. Researchers found that Chinchilla "uniformly and significantly" outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG across a large set of language benchmarks.
The conclusion is clear: Current large language models are "significantly undertrained," which is a consequence of blindly following the scaling hypothesis – making models larger isn’t the only way toward improved performance.
And not only that. Because Chinchilla is smaller, inference and fine-tuning cost less, easing the use of these models for smaller companies or universities that may not have the budget or latest-generation hardware to run larger models. "The benefits of a more optimally trained smaller model, therefore, extend beyond the immediate benefits of its improved performance."
Compute-optimal large language models
Compute budget is usually the limiting factor: it's known in advance and independent of the other variables. Model size and number of training tokens are ultimately determined by how much money the company can spend on hardware. To study how these variables affect performance, DeepMind's researchers posed this question: "Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens?"
As stated above, models like GPT-3, Gopher, and MT-NLG follow the scaling laws devised by Kaplan (Table 1). To give a concrete example: if the compute budget increases by a factor of 10, Kaplan's law predicts optimal performance when model size is increased by 5.5x and the number of training tokens by 1.8x.

Kaplan and colleagues arrived at this conclusion because they fixed the number of training tokens in their analysis. That assumption prevented them from finding DeepMind's answer: that model size and the number of tokens should increase in parallel – each by roughly 3.16x (that is, √10) for every 10x increase in compute.
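To make the contrast concrete, here's the same 10x example under both prescriptions (my own arithmetic, not the paper's code; the exponents are approximate):

```python
# Comparing the two scaling prescriptions for a 10x compute increase
# (my arithmetic; exponents approximate those reported for each paper).
budget_factor = 10

# Kaplan et al. (2020): most of the extra compute goes into model size.
kaplan_size = budget_factor ** 0.74    # ~5.5x more parameters
kaplan_tokens = budget_factor ** 0.26  # ~1.8x more tokens

# Hoffmann et al. (2022): split the increase evenly between size and data.
chinchilla_size = budget_factor ** 0.5    # ~3.16x more parameters
chinchilla_tokens = budget_factor ** 0.5  # ~3.16x more tokens
```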
To study the relationship between compute budget, model size, and number of training tokens, researchers used three approaches (see section 3 of the paper for a more detailed explanation):
- Fixed model size: They defined a family of model sizes (70M-16B) and varied the number of training tokens (4 variations) for each model. Then they determined the optimal combination for each compute budget. Using this approach, a compute-optimal model trained with the same amount of compute as Gopher would have 67B params and 1.5T tokens.
- IsoFLOP curves: They fixed the compute budget (9 variations ranging from 6×10¹⁸ to 3×10²¹ FLOPs) and varied model size (the number of tokens follows automatically from the budget). Using this approach, a compute-optimal model trained with the same amount of compute as Gopher would have 63B params and 1.4T tokens.
- Fitting a parametric loss function: Using the results from approaches 1 and 2, they modeled the losses as a parametric function of model size and number of tokens (see the sketch after this list). Using this approach, a compute-optimal model trained with the same amount of compute as Gopher would have 40B params.
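For intuition, here's a minimal sketch of what such a fit enables (the functional form is the paper's; the constants are its published fit as I recall them, and the grid search is my own illustration):

```python
import numpy as np

# Parametric loss from Hoffmann et al.: L(N, D) = E + A / N**alpha + B / D**beta.
# Constants are the paper's published fit, quoted from memory.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, d_tokens):
    return E + A / n_params**alpha + B / d_tokens**beta

# Given a fixed budget C ~= 6 * N * D, find the N that minimizes predicted loss.
C = 5.76e23                          # roughly Gopher's training budget
n_grid = np.logspace(9, 12, 2000)    # candidate sizes: 1B to 1T parameters
d_grid = C / (6 * n_grid)            # tokens implied by the budget
best = np.argmin(loss(n_grid, d_grid))
# Lands in the tens of billions of parameters – far below Gopher's 280B.
print(f"{n_grid[best] / 1e9:.0f}B params, {d_grid[best] / 1e12:.1f}T tokens")
```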
In total, they evaluated over 400 models, ranging from 70M to 16B parameters and from 5B to 500B training tokens. All three approaches yielded similar predictions for optimal model size and number of training tokens – significantly different from Kaplan's.
These findings suggest that models from the current generation are "considerably over-sized, given their respective compute budgets" (figure 1).

As shown in table 3 (first approach), a 175B model (GPT-3-sized) should be trained with a compute budget of 3.85×10²⁴ FLOPs and on 3.7T tokens (more than 10 times what OpenAI used for GPT-3 175B). A 280B model (Gopher-sized) should be trained with 9.90×10²⁴ FLOPs and on 5.9T tokens (20 times what DeepMind used for Gopher).
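As a sanity check of my own (not from the paper), these figures line up with the common approximation that training compute is C ≈ 6·N·D:

```python
# My sanity check: the table's budgets match C ~= 6 * N * D.
print(6 * 175e9 * 3.7e12)  # ~3.9e24 FLOPs, vs the table's 3.85e24
print(6 * 280e9 * 5.9e12)  # ~9.9e24 FLOPs, vs the table's 9.90e24
```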

They took the conservative estimates (approaches 1 and 2) to determine the size and number of training tokens for a compute-optimal model trained on the budget they had used for Gopher. Chinchilla is the resulting model: 70B parameters trained on 1.4T tokens (4x smaller than Gopher and trained on 4x more data). Chinchilla outperformed Gopher – and all other previous language models – "uniformly and significantly."
They proved their hypothesis: Increasing the number of training tokens at the same rate as model size provides the best results, other things being equal.
Results comparison: Chinchilla vs Gopher & Co
Saying that Chinchilla outperformed Gopher feels like an understatement when we look at the results for each benchmark. To avoid overloading the article with graphs, I'll show below only the results for Massive Multitask Language Understanding (MMLU) and BIG-bench (which together amount to 80% of the tasks) and for ethics-related benchmarks – which always deserve special scrutiny. (See section 4 of the paper for a detailed analysis that includes reading, commonsense, and Q&A benchmarks.)
MMLU & BIG-bench
Chinchilla set new SOTA scores on both benchmarks: 67.6% average accuracy on MMLU and 65.1% on BIG-bench, versus Gopher's 60% and 54.4%, respectively (figures 2, 3). On MMLU, Chinchilla even surpassed the 63.4% mark that experts had predicted would be the SOTA in June 2023. No one was expecting such an improvement so soon.


Chinchilla uniformly outperforms previous LLMs across other benchmarks like commonsense reasoning and reading comprehension, undoubtedly claiming the throne of language AI.
However, its dominance was short-lived. Chinchilla was surpassed just a week after its release by Google's latest model, PaLM (at 540B parameters, now the largest and most performant language model). This continuous back-and-forth between companies illustrates the fast pace of the field. Google didn't fully apply DeepMind's findings when building PaLM because they were testing a different approach. (Expect a new article soon on PaLM!)
Gender bias and toxicity
Because Chinchilla shares its dataset and architecture with Gopher, it's expected to show similar behavior regarding bias and toxicity. It shows some improvements over Gopher on the Winogender benchmark of gender and occupation bias (table 7), but not equally across groups.

In the PerspectiveAPI toxicity benchmark, Chinchilla and Gopher show similar results: "The large majority of generated samples are classified as non-toxic, and the difference between the models is negligible." This also implies that training a model on more data doesn't necessarily make it more toxic.
Hypothesis: How could they further improve Chinchilla's performance?
DeepMind found a new relationship between compute budget, model size, and number of training tokens. But those aren’t the only parameters that affect performance and efficiency.
One key problem when training large models is finding the optimal hyperparameters (HPs). Current language models are so big that companies can only afford to train them once: Searching for the best set of HPs is infeasible. Researchers often have to make difficult assumptions – often wrong ones – to set them.
Recently, Microsoft and OpenAI studied a new type of parameterization (μP) that scales well across different-size models of the same family. The optimal HPs for a smaller model can be transferred to the larger model, yielding considerably better results.
DeepMind's paper mentions previous work on hyperparameter tuning but not this particular paper, which came out just a few weeks earlier. Combining the compute-optimal paradigm with μP would presumably yield even better results for any large language model.
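To give an idea of what combining them could look like, here's a rough sketch based on Microsoft's open-source mup package (API names quoted from its README as I recall them; treat this as illustrative, not DeepMind's or OpenAI's code):

```python
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam  # Microsoft's muP package

def make_model(width):
    # Toy network; the output layer must be muP-aware for HP transfer to hold.
    return nn.Sequential(nn.Linear(256, width), nn.ReLU(), MuReadout(width, 10))

base = make_model(width=64)      # small proxy model used for the HP sweep
model = make_model(width=4096)   # the large model we actually want to train
set_base_shapes(model, base)     # tell muP how widths scale between the two

# Under muP, the learning rate tuned on the cheap proxy transfers directly:
optimizer = MuAdam(model.parameters(), lr=3e-4)  # lr found by sweeping the proxy
```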
Another improvement could be a retrieval mechanism. RETRO matched GPT-3's performance across tasks despite being 25 times smaller. Its retrieval abilities let the model query a huge database (3T tokens) in real time, analogous to how we search the internet.
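As a toy illustration of the idea (not RETRO's actual architecture, which uses frozen BERT embeddings and an approximate nearest-neighbor index over its database), retrieval amounts to embedding a query and pulling the most similar chunks from a pre-embedded corpus to condition generation:

```python
import numpy as np

# Toy nearest-neighbor retrieval over a pre-embedded text corpus (illustrative).
rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 128))             # stand-in chunk embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # normalize for cosine sim

def retrieve(query_emb, k=5):
    """Return the indices of the k corpus chunks most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = corpus @ q                   # cosine similarities
    return np.argsort(scores)[-k:][::-1]  # top-k, best first

neighbors = retrieve(rng.standard_normal(128))
# The retrieved chunks would then be fed to the language model as extra context.
```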
Finally, if we wanted to go the last mile, an alignment technique could improve results not only on language benchmarks but in real-world situations. OpenAI used alignment methods to turn GPT-3 into InstructGPT, with notable gains in how well it follows users' intent. However, AI alignment is extremely complex, and InstructGPT doesn't seem to improve over previous models in safety or toxicity.
If a company combined all these features into one model, they’d create the best overall model possible with what we know today about large language models.
Four critical reflections from Chinchilla
A new trend
Chinchilla's performance is impressive not just for the magnitude of the improvement, but because the model is smaller than every SOTA large language model developed in the last two years. Instead of focusing on making models larger – an approach many AI experts have criticized – companies and researchers should focus on optimizing the resources and parameters they already have; otherwise, they're wasting their money.
Performance-wise and efficiency-wise, Chinchilla is a breakthrough.
Chinchilla’s performance is no longer the best in the field, as Google’s PaLM has achieved SOTA results in many benchmarks. However, Chinchilla’s main influence doesn’t lie in being the best model out there but in being extremely good while breaking the pattern of making models larger and larger.
The consequences of this will define the future of the field. First, companies should recognize that model size isn't the only variable that matters for performance, but one of many. Second, it may temper the public hype around ever-larger models – often taken as a sign that we're approaching AGI faster than we really are. Finally, it may help reduce the environmental impact of large models and lower the barriers to entry for smaller companies that can't keep up with Big Tech.
This last point brings me to the second reflection.
Limited reproducibility
Despite being smaller than other models, it's still infeasible for most companies and universities to train or study models like Chinchilla. Calling a 70B model "small" should make anyone realize how problematic this is. Most entities with the required human resources (researchers who can get the most out of studying these models) don't have the financial depth to carry out the necessary experiments. Because of that, current AI is being built on fragile foundations, driven by a few big companies that define the direction of the science.
But there’s another limiting factor unrelated to money.
DeepMind will most likely not release Chinchilla. Neither will Google release PaLM, nor OpenAI DALL·E – at least while they're relevant. These models are often published only as a means to signal who is advancing the state of the art, without the intention of letting others use them for research. To its credit, DeepMind is one of the AI companies that has made the greatest efforts to advance science by allowing others to build on its discoveries (it made AlphaFold's predictions freely available), but the tendency to show off still dominates the field.
DeepMind is trying to reverse a damaging trend by building a model that's better and smaller at the same time. But given that Chinchilla is still a huge model, we should realize how far we remain from democratizing a technology that will redefine our future. If we keep moving in a direction in which a few control the resources for scientific inquiry, the direction of research, and the resulting breakthroughs, creating AGI will not be worth it.
Data audit
Current models are undertrained (or oversized). To build compute-optimal models, companies will need larger datasets than those they currently use. Large, high-quality text datasets will be in high demand in the near future.
Emily M. Bender, a professor of linguistics at the University of Washington, criticized Google's approach to PaLM because 780B tokens (the amount of data used to train the model) is too much to be well documented, which makes the model "too big to deploy safely." Chinchilla was trained on nearly twice as many tokens. If we extrapolate Bender's criticism (which would depend on the process DeepMind followed to train the model), we can conclude that Chinchilla is also not safe enough to deploy.
To make models better while keeping them smaller, they need more data. But using more data makes the models less safe. We face a hard choice between making models larger (putting them increasingly out of reach for most players in the field while increasing their carbon footprint) and training them on more tokens (making data audits harder and the models less safe). Saying Chinchilla is better overall because it's smaller now seems a far-fetched statement.
The alternative is always to put more focus on other lines of research that don't involve training huge models on huge datasets. However, because Big Tech has the money to fund the research lines it wants, only those lines produce results – not because the others won't work, but because they aren't being explored as thoroughly.
Inherent bias
It seems that no matter how much researchers optimize models for performance or efficiency, they can't reach acceptable levels of bias and toxicity. Transformer-based large language models may be inherently subject to these issues, regardless of model size, dataset size, hyperparameter quality, or compute budget.
We won’t solve the ethical issues of language models simply by making them better at performance benchmarks.