
How to Estimate the Number of Parameters in Transformer models

An inside look at the Transformer Encoder/Decoder building blocks

Preview. Image by Author

Thanks to Regan Yue, you can read the Chinese version of this article at mp.weixin.qq.com, juejin.cn, segmentfault.com and xie.infoq.cn!

The most effective way to understand a new Machine Learning architecture (as well as any new technology in general) is to implement it from scratch. This approach helps you understand the implementation down to the smallest details, although it is very complex, time-consuming, and sometimes just impossible. For example, if you do not have comparable computing resources or data, you will not be able to make sure that there is no hidden bug in your solution.

However, there is a much easier way – counting the number of parameters. It’s not much harder than just reading the paper, but it allows you to dig quite deep and check that you fully understand the building blocks of the new architecture (the Transformer Encoder and Decoder blocks in our case).

You can think about this with the following diagram, which presents three ways to understand a new ML architecture – the size of the circle represents the level of understanding.

Ways to understand ML architecture. Calculating the number of parameters is not much more difficult than simply reading the paper, but it will allow you to delve deeper into the topic. Image by Author

In this article, we will take a look at the famous Transformer architecture and consider how to calculate the number of parameters in PyTorch TransformerEncoderLayer and TransformerDecoderLayer classes. Thus, we will make sure that there are no mysteries left for us about what this architecture consists of.

TL;DR

All formulas are summarized in the Conclusions section. You are welcome to take a look at them right now.

I present not only exact formulas but also their less accurate approximate versions, which will allow you to quickly estimate the number of parameters in any Transformer-based model.

Transformer Architecture

The famous Transformer architecture was presented in the breathtaking "Attention Is All You Need" paper in 2017 and became the de-facto standard in the majority of Natural Language Processing and Computer Vision tasks because of its ability to effectively capture long-range dependencies.

Now, in early 2023, diffusion models are gaining extreme popularity, mainly due to text-to-image generation. Maybe soon they will become the new state-of-the-art in various tasks, as happened with Transformers versus LSTMs and CNNs. But let’s take a look at Transformers first…

My article is not an attempt to explain the Transformer architecture, since there are enough articles that do it very well. It just might allow you to look at it from a different angle or clarify some details if you haven’t fully figured it out yet. So if you are seeking more resources to learn about this architecture, I refer you to some of them below; otherwise, you can just keep reading.

Resources to know more about Transformer

If you are looking for a more detailed Transformer architecture overview, take a look at these materials (note that there are plenty of other resources on the Internet; I just personally like these):

Original Transformer

To begin with, let’s remember Transformer basics.

The architecture of the Transformer consists of two components: the encoder (on the left) and the decoder (on the right). The encoder takes a sequence of input tokens and produces a sequence of hidden states, while the decoder takes this sequence of hidden states and produces a sequence of output tokens.

Transformer architecture. Figure 1 from the public domain paper

Both the encoder and decoder consist of a stack of identical layers. For the encoder, each layer includes multi-head attention (1 – here and below, the numbers refer to the image below) and a feed-forward neural network (2), with some layer normalizations (3) and skip connections.

The decoder is similar to the encoder, but in addition to the first multi-head attention (4) (which is masked for the machine translation task so the decoder doesn’t cheat by looking at future tokens) and a feed-forward network (5), it also has a second multi-head attention mechanism (6). It allows the decoder to use the context provided by the encoder when generating output. Like the encoder, the decoder also has layer normalization (7) and skip connection components.

Transformer architecture with signed components. Adapted from figure 1 from the public domain paper

I will not consider the input embedding layer (with positional encoding) and the final output layer (linear + softmax) as Transformer components, focusing only on the Encoder and Decoder blocks. I do so because these components are specific to the task and the embedding approach, while the Encoder and Decoder stacks later formed the basis of many other architectures.

Examples of such architectures include BERT-based models for the Encoder (BERT, RoBERTa, ALBERT, DeBERTa, etc.), GPT-based models for the Decoder (GPT, GPT-2, GPT-3, ChatGPT), and models built on the full Encoder-Decoder framework (T5, BART, and others).

Although we counted seven components in this architecture, we can see that there are only three unique ones:

  1. Multi-head attention;
  2. Feed-forward network;
  3. Layer normalization.
Transformer building blocks. Adapted from figure 1 from the public domain paper

Together they form the basis of a Transformer. Let’s look at them in more detail!

Transformer Building Blocks

Let’s consider the internal structure of each block and how many parameters it requires. In this section, we will also start using PyTorch to validate our calculations.

To check the number of parameters of a certain model block, I will use the following one-line function:
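The function itself is shown as an image in the original post; a minimal sketch of such a helper (the name count_parameters is my own choice) could look like this:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Sum the sizes of all trainable tensors registered in the module
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```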

Before we start, pay attention to the fact that all blocks are standardized and used with skip connections. This means that the shape of all inputs and outputs (more precisely, its last number, since the batch size and the number of tokens may vary) must be the same. In the original paper, this number (d_model) is 512.

Multi-Head Attention

The famous attention mechanism is the key to the Transformer architecture. But putting aside all the motivations and technical details, it is just a few matrix multiplications.

Transformer multi-head attention. Adapted from figure 2 from the public domain paper

After calculating attention for every head, we concatenate all heads together and pass the result through a linear layer (the W_O matrix). In turn, each head is scaled dot-product attention with three separate matrix multiplications for the query, key, and value (the W_Q, W_K, and W_V matrices, respectively). These three matrices are different for each head, which is why the subscript i is present.

The shape of the final linear layer (W_O) is d_model by d_model. The shape of the other three matrices (W_Q, W_K, and W_V) is the same: d_model by d_qkv.

Note that d_qkv in the image above is denoted as d_k or d_v in the original paper. I just find this name more intuitive because, although these matrices may have different shapes, they are almost always the same.

Also, note that d_qkv = d_model / num_heads (h in the paper). That’s why d_model must be divisible by num_heads: to ensure correct concatenation later.

You can test yourself by checking the shapes in all the intermediate phases in the picture above (the correct ones are indicated at the bottom right).

As a result, we need three smaller matrices for each head and one large final matrix. How many parameters do we need (do not forget biases)?

The formula for calculating the number of parameters in the Transformer attention module. Image by Author
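Because the derivation itself lives in the image above, here is my reconstruction of it, counting biases and using the fact that num_heads * d_qkv = d_model:

$$
N_{attn} = \underbrace{3 \cdot num\_heads \cdot (d_{model} \cdot d_{qkv} + d_{qkv})}_{W_Q,\ W_K,\ W_V\ \text{with biases}} + \underbrace{d_{model}^2 + d_{model}}_{W_O\ \text{with bias}} = 4 \cdot d_{model}^2 + 4 \cdot d_{model} \approx 4 \cdot d_{model}^2
$$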

I hope it’s not too tedious – I tried to make the derivation as clear as possible. Don’t worry! The formulas that follow will be much simpler.

The approximation holds because we can neglect 4*d_model compared to 4*d_model^2. Let’s test ourselves using PyTorch.
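The check in the original is again an image; a minimal equivalent, reusing the count_parameters helper from above with the hyperparameters of the original paper, might look like this:

```python
import torch.nn as nn

d_model = 512   # hidden size from the original paper
num_heads = 8   # number of attention heads from the original paper

multi_head_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads)

# Exact formula derived above: 4 * d_model^2 + 4 * d_model
print(count_parameters(multi_head_attention))  # 1050624
print(4 * (d_model**2 + d_model))              # 1050624
```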

The numbers match, meaning we are good!

Feed-forward Network

The feed-forward network in the Transformer consists of two fully connected layers with a ReLU activation in between. The network is built in such a way that its inner layer is wider, and thus more expressive, than the input and output (which, as we remember, must have the same dimension).

In the general case, it is MLP(d_model, d_ff) -> ReLU -> MLP(d_ff, d_model), and in the original paper d_ff = 2048.

Feed-forward neural network description. Public domain paper

A little visualization never hurts.

Transformer feed-forward network. Image by Author

The calculation of parameters is quite easy; the main thing, again, is not to get tripped up by the biases.

The formula for calculating the number of parameters in the Transformer feed-forward net. Image by Author
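Written out (my reconstruction of the formula shown in the image above), the count is:

$$
N_{ffn} = \underbrace{d_{model} \cdot d_{ff} + d_{ff}}_{\text{first linear layer}} + \underbrace{d_{ff} \cdot d_{model} + d_{model}}_{\text{second linear layer}} = 2 \cdot d_{model} \cdot d_{ff} + d_{model} + d_{ff} \approx 2 \cdot d_{model} \cdot d_{ff}
$$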

We can describe such a simple network and check the number of its parameters using the following code (note that the official PyTorch implementation also uses dropout, which we will see later in the Encoder/Decoder code; but as we know, the dropout layer has no trainable parameters, so I omit it here for simplicity):
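The original code block is an image; a sketch of an equivalent check (reusing the count_parameters helper defined earlier) could be:

```python
import torch.nn as nn

d_model = 512  # hidden size from the original paper
d_ff = 2048    # feed-forward inner size from the original paper

feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

# Exact formula derived above: 2 * d_model * d_ff + d_model + d_ff
print(count_parameters(feed_forward))       # 2099712
print(2 * d_model * d_ff + d_model + d_ff)  # 2099712
```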

The numbers match again, and only one component remains.

Layer Normalization

The last building block of the Transformer architecture is layer normalization. Long story short, it is just an intelligent (i.e. learnable) way of normalization with scaling that improves the stability of the training process.

Transformer layer normalization. Image by Author

The trainable parameters here are two vectors, gamma and beta, each of which has dimension d_model.

The formula for calculating the number of parameters in the Transformer layer normalization module. Image by Author

Let’s check our assumptions with code.
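A minimal equivalent of the check shown in the original (again reusing count_parameters):

```python
import torch.nn as nn

d_model = 512

layer_norm = nn.LayerNorm(d_model)

# Exact formula: 2 * d_model (the gamma and beta vectors)
print(count_parameters(layer_norm))  # 1024
print(2 * d_model)                   # 1024
```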

Good! In approximate calculations, this number can be neglected, since layer normalization has dramatically fewer parameters than the feed-forward network or the multi-head attention block (even though this module occurs several times).

Derive Complete Formulas

Now we have everything to count the parameters for the entire Encoder/Decoder block!

Encoder and Decoder in PyTorch

Let’s remember that the Encoder consists of an attention block, feed-forward net, and two layer normalizations.

Transformer Encoder. Adapted from figure 1 from the public domain paper

We can verify that all components are in place by looking inside the PyTorch code. Here multi-head attention is indicated in red (on the left), the feed-forward network in blue and layer normalizations in green (screenshot of the Python console in PyCharm).

PyTorch TransformerEncoderLayer. Image by Author

As noted above, this implementation includes dropout in the feed-forward net. Now we can see the dropout layers related to layer normalizations as well.
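If you want to reproduce this inspection without a debugger, simply printing the layer shows the same submodules; a quick sketch:

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
print(encoder_layer)
# self_attn          -> multi-head attention
# linear1, linear2   -> feed-forward network (with a parameter-free dropout in between)
# norm1, norm2       -> layer normalizations
# dropout1, dropout2 -> no trainable parameters
```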

The decoder, in turn, consists of two attention blocks, a feed-forward net, and three layer normalizations.

Transformer Decoder. Adapted from figure 1 from the public domain paper

Let’s look at PyTorch again (the colors are the same).

PyTorch TransformerDecoderLayer. Image by Author
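The same quick inspection works for the decoder layer (assuming the import from the previous snippet):

```python
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
print(decoder_layer)
# self_attn, multihead_attn -> the two attention blocks
# linear1, linear2          -> feed-forward network
# norm1, norm2, norm3       -> layer normalizations
# dropout layers            -> no trainable parameters
```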

Final Formula

After making sure all the components are in place, we can write the following function to calculate the number of parameters. In fact, it is just a few lines of code that can even be combined into one; the rest of the function is a docstring for clarification.
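Since the function in the original post is an image, here is a sketch of what it could look like, built directly from the formulas derived above (the name, arguments, and defaults are my own choices):

```python
def transformer_count_params(d_model=512, d_ff=2048, encoder=True, approx=False):
    """
    Estimate the number of parameters in a single Transformer Encoder/Decoder layer.

    Encoder layer = 1 multi-head attention + feed-forward net + 2 layer norms.
    Decoder layer = 2 multi-head attentions + feed-forward net + 3 layer norms.
    Set approx=True to keep only the dominant d_model^2 and d_model*d_ff terms.
    """
    attention = 4 * d_model**2 if approx else 4 * (d_model**2 + d_model)
    feed_forward = 2 * d_model * d_ff if approx else 2 * d_model * d_ff + d_model + d_ff
    layer_norm = 0 if approx else 2 * d_model
    return (attention + feed_forward + 2 * layer_norm if encoder
            else 2 * attention + feed_forward + 3 * layer_norm)
```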

Now it’s time to test it.
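The test in the original is also shown as an image; a sketch of an equivalent check against the actual PyTorch layers (default settings, biases enabled) could look like this:

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, dim_feedforward=2048)

# Exact formulas vs. actual PyTorch parameter counts
print(count_parameters(encoder_layer))          # 3152384
print(transformer_count_params(encoder=True))   # 3152384
print(count_parameters(decoder_layer))          # 4204032
print(transformer_count_params(encoder=False))  # 4204032

# Approximate formulas: off by only about 0.2%
print(transformer_count_params(encoder=True, approx=True))   # 3145728
print(transformer_count_params(encoder=False, approx=True))  # 4194304
```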

The exact formulas are correct, meaning we have correctly identified all building blocks and deconstructed them into their components. Interestingly, since we ignored relatively small values (thousands compared to millions) in the approximate formulas, the error is only about 0.2% compared to the exact results! But there is a way to make these formulas even simpler.

The approximate number of parameters for the attention block is 4*d_model^2. It sounds pretty simple, considering that d_model is an important hyperparameter. But for the feed-forward network, we also need to know d_ff, since the formula is 2*d_model*d_ff.

d_ff is a separate hyperparameter that you would have to memorize in the formula, so let’s think about how to get rid of it. In fact, as we saw above, d_ff = 2048 when d_model = 512, so d_ff = 4*d_model.

For many Transformer models, such an assumption makes sense, greatly simplifying the formula while still giving you an estimate of the approximate number of parameters. After all, no one wants to know the precise amount; it’s just useful to understand whether the number is in the hundreds of thousands or in the tens of millions.

Approximate Encoder-Decoder formulas. Image by Author
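Under the assumption d_ff = 4*d_model, my reconstruction of the formulas in the image above is:

$$
N_{encoder} \approx 4 \cdot d_{model}^2 + 2 \cdot d_{model} \cdot d_{ff} = 12 \cdot d_{model}^2, \qquad N_{decoder} \approx 8 \cdot d_{model}^2 + 2 \cdot d_{model} \cdot d_{ff} = 16 \cdot d_{model}^2
$$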

To get an understanding of the order of magnitude you are dealing with, you can also round the multipliers and you will get 10*d_model^2 for every Encoder/Decoder layer.

Conclusions

Here is a summary of all the formulas that we have deduced today.

Formulas recap. Image by Author
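Since the recap itself is an image, here is my plain reconstruction of it from the formulas derived in this article (the rightmost approximations for the Encoder/Decoder layers assume d_ff = 4*d_model):

$$
\begin{aligned}
\text{Multi-head attention:}\quad & 4 \cdot d_{model}^2 + 4 \cdot d_{model} && \approx 4 \cdot d_{model}^2 \\
\text{Feed-forward network:}\quad & 2 \cdot d_{model} \cdot d_{ff} + d_{model} + d_{ff} && \approx 2 \cdot d_{model} \cdot d_{ff} \\
\text{Layer normalization:}\quad & 2 \cdot d_{model} && \approx 0 \\
\text{Encoder layer:}\quad & 4 \cdot d_{model}^2 + 2 \cdot d_{model} \cdot d_{ff} + 9 \cdot d_{model} + d_{ff} && \approx 12 \cdot d_{model}^2 \\
\text{Decoder layer:}\quad & 8 \cdot d_{model}^2 + 2 \cdot d_{model} \cdot d_{ff} + 15 \cdot d_{model} + d_{ff} && \approx 16 \cdot d_{model}^2
\end{aligned}
$$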

In this article, we’ve calculated the number of parameters in Transformer Encoder/Decoder blocks, but of course, I’m not inviting you to count the parameters of every new model. I just chose this method because I was surprised that I couldn’t find such an article when I started studying Transformers.

While the number of parameters can give us an indication of the complexity of the model and the amount of data it will require to train, it is just one way to gain a deeper understanding of the architecture. I want to encourage you to explore and experiment: take a look at implementation, run the code with different hyperparameters, etc. So, keep learning and have fun!


Thank you for reading!

  • I hope these materials were useful to you. Follow me on Medium to get more articles like this.
  • If you have any questions or comments, I will be glad to get any feedback. Ask me in the comments, or connect via LinkedIn or Twitter.
  • To support me as a writer and to get access to thousands of other Medium articles, get Medium membership using my referral link (no extra charge for you).
