![Scaling law behavior of LLMs— Image from [1]](https://towardsdatascience.com/wp-content/uploads/2024/07/1XD22gv-oEBkr-8KtVnaySA-1.png)
The world of artificial intelligence is witnessing a revolution, and at its forefront are Large Language Models that seem to grow more powerful by the day. From BERT to GPT-3 to PaLM, these AI giants are pushing the boundaries of what’s possible in natural language processing. But have you ever wondered what fuels their meteoric rise in capabilities?
In this post, we’ll embark on a fascinating journey into the heart of language model scaling. We’ll uncover the secret sauce that makes these models tick – a potent blend of three crucial ingredients: model size, training data, and computational power. By understanding how these factors interplay and scale, we’ll gain invaluable insights into the past, present, and future of AI language models.
So, let’s dive in and demystify the scaling laws that are propelling language models to new heights of performance and capability.
Table of contents: This post consists of the following sections:
1. Introduction
   - Overview of recent language model developments
   - Key factors in language model scaling
2. Power Law Distribution: A Quick Review
   - Understanding power law relationships
   - Visualizing power laws
3. Scaling Law Behavior in Language Models
   - Model size and performance
   - Dataset size and performance
   - Compute resources and performance
4. The Interplay of Scaling Factors
   - The ‘6 FLOPs’ rule
5. The Chinchilla Paper: A Game-Changer
   - Key findings and implications
   - The Chinchilla predictive formula
6. Closing Thoughts
   - The importance of understanding scaling laws
7. References
Introduction
As you know, language model development has scaled rapidly over the past few years. As the image below shows, models have grown from 109M parameters in BERT-base in 2018 to 540B parameters in PaLM in 2022. Each successive model increased not only in size (i.e. number of parameters), but also in the number of training tokens and the training compute (measured in floating point operations, or FLOPs).

A natural question that arises is: what is the relationship between these three factors? Do model size and training data contribute equally to model performance (i.e. test loss)? Which one is more important? If I want to reduce the test loss by 10%, should I increase the model size or the training data? And by how much?
The answer to these questions lies in the scaling law behavior of LLMs. But before we dive into the answer, let’s review the power law distribution.
Power Law Distribution: A Quick Review
A power law is a nonlinear relationship between two quantities x and y that can be modeled generically as:

$$y = a \, x^k \quad \text{(1)}$$

where a and k are constants. Notice that if we plot a power law relationship on a log-log plot, it becomes a straight line, because

$$\log y = \log a + k \, \log x \quad \text{(2)}$$
Let’s plot the power law for two different values of k to see how its behavior changes. If k is positive, there is an increasing relationship between y and x; if k is negative, the relationship is decreasing. Here is a simple piece of code to plot the power law curve.
```python
import numpy as np
import matplotlib.pyplot as plt

def plot_power_law(k, x_range=(0.1, 100), num_points=10000):
    """
    Plot the power law function y = x^k for any non-zero k.

    Parameters:
        k (float): The exponent for the power law (positive or negative, but not zero).
        x_range (tuple): The range of x values to plot (default is 0.1 to 100).
        num_points (int): Number of points to calculate for a smooth curve.
    """
    if k == 0:
        raise ValueError("k cannot be zero")

    # Generate evenly spaced x values over the requested range
    x = np.linspace(x_range[0], x_range[1], num_points)

    # Calculate y values according to the power law
    y = x**k

    # Create the plot
    plt.figure(figsize=(10, 6))
    plt.plot(x, y, 'b-', label=f'y = x^{k}')
    plt.title(f'Power Law: y = x^{k}')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.grid(True)
    plt.legend()
    plt.show()
```
Let’s plot it for a positive k as follows:

```python
plot_power_law(2)  # This will plot y = x^2
```

And if we choose a negative exponent, the relationship will be decreasing:

```python
plot_power_law(-0.5)  # This will plot y = x^(-0.5)
```

Note that the plots above use linear scales on both the x-axis and the y-axis. If we re-plot them on logarithmic scales, they become straight lines, as shown in equation 2. The short sketch below demonstrates this before we tie everything together and show how the power law relates to the test loss of LLMs.
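To see this straight-line behavior directly, here is a small sketch that re-plots the same curves on logarithmic axes. The helper `plot_power_law_loglog` is a name introduced here purely for illustration; it mirrors `plot_power_law` above, with both axes switched to log scale.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_power_law_loglog(k, x_range=(0.1, 100), num_points=10000):
    """Plot y = x^k on logarithmic axes, where it appears as a straight line with slope k."""
    if k == 0:
        raise ValueError("k cannot be zero")

    x = np.linspace(x_range[0], x_range[1], num_points)
    y = x**k

    plt.figure(figsize=(10, 6))
    plt.plot(x, y, 'b-', label=f'y = x^{k}')
    plt.xscale('log')   # logarithmic x-axis
    plt.yscale('log')   # logarithmic y-axis
    plt.title(f'Power Law on log-log axes: y = x^{k}')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.grid(True, which='both')
    plt.legend()
    plt.show()

plot_power_law_loglog(2)     # straight line with slope 2
plot_power_law_loglog(-0.5)  # straight line with slope -0.5
```

On these axes, y = x^2 shows up as a straight line with slope 2, and y = x^(-0.5) as a straight line with slope -0.5.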
Scaling Law Behavior in Language Models
Scaling law behavior in language models refers to the observed relationships between model performance and factors such as model size, dataset size, and compute resources. These relationships follow predictable patterns as models are scaled up. The key factors involved in scaling law behavior are as follows:
- Model size: As the number of parameters in a model increases, performance tends to improve following a power law.
- Dataset size: Larger training datasets generally lead to better performance, also following a power law relationship.
- Compute: The amount of computational resources (FLOPs) used for training correlates with improved performance.
The three plots below show the scaling law in LLMs.
![scaling law behavior in LLMs - Image from paper in [2]](https://towardsdatascience.com/wp-content/uploads/2024/07/10D3gBUBd4B87pZjY42nx8g-1.png)
All three plots are in log-log space and are linear, which confirms that test loss follows a power law relationship with each of compute, dataset size, and model parameters. In addition, these plots show that language modeling performance improves as we increase the model size, the dataset size, and the amount of compute used for training.
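To make the "linear in log-log space" observation concrete, here is a minimal sketch of how such a power law can be recovered from data. The (model size, test loss) points and the constants used to generate them are purely illustrative, not values taken from [1] or [2]; the point is that a degree-1 fit on the logarithms recovers the exponent.

```python
import numpy as np

# Illustrative synthetic data: test loss vs. number of parameters.
# The constants (6.0 and -0.08) and the noise level are made up for this demo.
rng = np.random.default_rng(0)
N = np.logspace(6, 10, 20)                                  # model sizes: 1M to 10B parameters
loss = 6.0 * N ** (-0.08) * rng.normal(1.0, 0.01, len(N))   # noisy power law

# A power law y = a * x^k is the straight line log y = log a + k * log x
# in log-log space, so a degree-1 fit on the logs recovers k and a.
k_fit, log_a_fit = np.polyfit(np.log(N), np.log(loss), 1)

print(f"fitted exponent k ≈ {k_fit:.3f}")              # close to -0.08
print(f"fitted constant a ≈ {np.exp(log_a_fit):.2f}")  # close to 6.0
```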
So far, we have seen the individual relationship between each of these three factors and the test loss. Now a few questions arise: what is the relationship between these three factors themselves? How do they contribute to the test loss? Do they contribute equally, or is one more important than the others?
The Interplay of Scaling Factors
In short, for every parameter in the model and every training example, approximately 6 floating point operations are needed. Therefore the relationship between the three factors is as follows:

$$C = \alpha \, N \, D, \quad \alpha \approx 6$$

where C is the training compute in FLOPs, N is the number of model parameters, and D is the number of training tokens.
The reason we need approximately 6 FLOPs for each parameter and each training example is as follows. Consider a parameter w during training:
- Exactly 2 FLOPs are needed in the forward pass to multiply w by its input node and add the result to the output node in the computational graph of the language model (1 FLOP for the multiplication, 1 FLOP for the addition).
- Exactly 2 FLOPs are needed to compute the gradient of the loss with respect to w.
- Exactly 2 FLOPs are needed to update the parameter w with the gradient of the loss.
If you want a more detailed explanation of this matter, please see this post.
With α ≈ 6, we can estimate the computational needs for training a language model if we know its size and the amount of training data used.
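As a quick sanity check of this rule, the sketch below implements the C ≈ 6·N·D estimate. The example uses Chinchilla's reported training configuration of roughly 70B parameters and about 1.4T tokens; treat the result as a back-of-the-envelope figure rather than an exact accounting.

```python
def estimate_training_flops(num_params: float, num_tokens: float, alpha: float = 6.0) -> float:
    """Back-of-the-envelope training compute estimate: C ≈ alpha * N * D."""
    return alpha * num_params * num_tokens

# Example: a roughly Chinchilla-sized run (~70B parameters, ~1.4T training tokens).
flops = estimate_training_flops(num_params=70e9, num_tokens=1.4e12)
print(f"Estimated training compute: {flops:.2e} FLOPs")   # ≈ 5.88e+23 FLOPs
```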
Next, let’s look at how the two factors of model size and training data contribute to model performance.
The Chinchilla Paper: A Game-Changer
Chinchilla [1] is a paper published by DeepMind in 2022. The authors found that contemporary large language models were under-trained, a consequence of the focus on scaling model size while keeping the amount of training data roughly constant. They trained over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, and concluded that for compute-optimal training, model size and the number of training tokens should be scaled equally.
They suggested the following empirical predictive formula, which connects model size and training data to model performance:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Here, N is the number of parameters (i.e. the model size) and D is the number of training tokens. The notation L(N, D) refers to the model performance, or test loss, of a model that has N parameters and is trained on D tokens. E is a constant representing the irreducible loss, the minimum loss the model can achieve given perfect training; it accounts for the inherent difficulty of the tasks the model is trained on and the noise in the data.
The constants A and B and the exponents α and β are determined empirically by fitting the formula to experimental data. Using these fits, the authors find that the compute-optimal model size and number of training tokens both grow with compute raised to a power of approximately 0.50. This reinforces the main finding of the paper: for every doubling of the model size, the number of training tokens should also be doubled to achieve compute-optimal training [1].
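To get a feel for how this predictive formula behaves, here is a minimal sketch that evaluates L(N, D) and shows the loss falling as N and D are doubled together. The constants E, A, B, α, and β below are placeholder values chosen only for illustration; they are not the fitted constants reported in the paper.

```python
def chinchilla_loss(N: float, D: float, E: float = 1.7,
                    A: float = 400.0, B: float = 400.0,
                    alpha: float = 0.5, beta: float = 0.5) -> float:
    """Chinchilla-style predictive loss L(N, D) = E + A / N**alpha + B / D**beta.

    All constants here are illustrative placeholders, not the paper's fitted values.
    """
    return E + A / N**alpha + B / D**beta

# Doubling both the model size and the training tokens lowers the predicted loss,
# which approaches the irreducible floor E as N and D grow.
for N, D in [(10e9, 200e9), (20e9, 400e9), (40e9, 800e9)]:
    print(f"N = {N:.1e} params, D = {D:.1e} tokens -> L(N, D) = {chinchilla_loss(N, D):.4f}")
```

With any positive exponents, the reducible terms A/N^α and B/D^β shrink as N and D grow, so the predicted loss moves toward the irreducible floor E.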
Closing Thoughts
The scaling laws of language models provide crucial insights into the development and optimization of these powerful AI systems. As we’ve explored, the relationships between model size, training data, and computational resources follow predictable power law patterns. These laws have significant implications for AI researchers and engineers:
- Balanced scaling: The Chinchilla findings emphasize the importance of scaling both model size and training data equally for optimal performance. This challenges the previous focus on increasing model size alone.
- Resource allocation: Understanding these relationships allows for more efficient allocation of computational resources, potentially leading to more cost-effective and environmentally sustainable AI development.
- Performance prediction: These laws enable researchers to make educated predictions about model performance based on available resources, helping to set realistic goals and expectations.
As the field of AI continues to evolve rapidly, keeping these scaling laws in mind will be crucial for making informed decisions about model development, resource allocation, and research directions. By understanding and leveraging these relationships, we can work towards creating more efficient, powerful, and responsible language models in the future.