Counting No. of Parameters in Deep Learning Models by Hand
7 simple examples to count parameters in FFNN, RNN and CNN models
Why do we need to count the number of parameters in a deep learning model again? We don’t. But in cases where we need to reduce the file size of the model or even reduce the time taken for model inference, knowing the number of parameters before and after model quantization would come in handy. (See the video on Efficient Methods and Hardware for Deep Learning.)
Counting the number of trainable parameters of deep learning models is often considered too trivial, because your code can already do this for you. But I’d like to keep my notes here for us to refer to once in a while. We’ll run through FFNNs, RNNs and CNNs.
In parallel, I will build each model with the Keras API for easy prototyping and clean code, so let’s quickly import the relevant objects here:
from keras.layers import Input, Dense, SimpleRNN, LSTM, GRU, Conv2D
from keras.layers import Bidirectional
from keras.models import Model
After building the model, call model.count_params() to verify the total number of parameters; in the examples here, every parameter is trainable.
1. FFNNs
- i, input size
- h, size of hidden layer
- o, output size
For one hidden layer,
num_params
= connections between layers + biases in every layer
= (i×h + h×o) + (h+o)
Example 1.1: Input size 3, hidden layer size 5, output size 2
- i = 3
- h = 5
- o = 2
num_params
= connections between layers + biases in every layer
= (3×5 + 5×2) + (5+2)
= 32
input = Input((None, 3))
dense = Dense(5)(input)
output = Dense(2)(dense)
model = Model(input, output)
Example 1.2: Input size 50, hidden layers size [100,1,100], output size 50
- i = 50
- h = 100, 1, 100
- o = 50
num_params
= connections between layers + biases in every layer
= (50×100 + 100×1 + 1×100 + 100×50) + (100+1+100+50)
= 10,451
input = Input((None, 50))
dense = Dense(100)(input)
dense = Dense(1)(dense)
dense = Dense(100)(dense)
output = Dense(50)(dense)
model = Model(input, output)
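The same bookkeeping extends to any number of hidden layers. Here is a minimal sketch in plain Python; `ffnn_params` is a hypothetical helper name of my own (not a Keras function), and it assumes every layer is fully connected and has a bias.

```python
def ffnn_params(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the input size, each hidden size, then the output size.
    """
    # One weight per connection between consecutive layers.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias per unit in every layer except the input.
    biases = sum(layer_sizes[1:])
    return weights + biases


print(ffnn_params([3, 5, 2]))              # Example 1.1: 32
print(ffnn_params([50, 100, 1, 100, 50]))  # Example 1.2: 10451
```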
2. RNNs
- g, no. of FFNNs in a unit (RNN has 1, GRU has 3, LSTM has 4)
- h, size of hidden units
- i, dimension/size of input
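Since every FFNN has h(h+i) weights (it sees the h-dimensional hidden state together with the i-dimensional input) plus h biases, i.e. h(h+i) + h parameters, we have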
num_params = g × [h(h+i) + h]
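As a quick sanity check, the formula can be written as a small function. This is only a sketch in plain Python; `rnn_params` is a hypothetical helper name, and it assumes a single bias vector per FFNN, as the formula does.

```python
def rnn_params(g, h, i):
    """Parameters of a recurrent layer made of g FFNNs.

    g: number of FFNNs (1 for a simple RNN, 3 for a GRU, 4 for an LSTM)
    h: number of hidden units
    i: input dimension
    """
    input_weights = h * i      # input -> hidden
    recurrent_weights = h * h  # previous hidden state -> hidden
    biases = h                 # assumption: one bias vector per FFNN
    return g * (input_weights + recurrent_weights + biases)


# A simple RNN, a GRU and an LSTM, each with 2 hidden units on 3-dimensional input.
for name, g in [("SimpleRNN", 1), ("GRU", 3), ("LSTM", 4)]:
    print(name, rnn_params(g, h=2, i=3))
```

The sequence length never appears in the count because the same weights are reused at every time step.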
Example 2.1: LSTM with 2 hidden units and input dimension 3.
- g = 4 (LSTM has 4 FFNNs)
- h = 2
- i = 3
num_params
= g × [h(h+i) + h]
= 4 × [2(2+3) + 2]
= 48
input = Input((None, 3))
lstm = LSTM(2)(input)
model = Model(input, lstm)
Example 2.2: Stacked Bidirectional GRU with 5 hidden units and input size 8 (whose outputs are concatenated) + LSTM with 50 hidden units
Bidirectional GRU with 5 hidden units and input size 8
- g = 3 (GRU has 3 FFNNs)
- h = 5
- i = 8
num_params_layer1
= 2 × g × [h(h+i) + h] (first term is 2 because of bidirectionality)
= 2 × 3 × [5(5+8) + 5]
= 420
LSTM with 50 hidden units
- g = 4 (LSTM has 4 FFNNs)
- h = 50
- i = 5+5 (outputs from bidirectional GRU concatenated; output size of GRU is 5, same as no. of hidden units)
num_params_layer2
= g × [h(h+i) + h]
= 4 × [50(50+10) + 50]
= 12,200
total_params = 420 + 12,200 = 12,620
input = Input((None, 8))
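# reset_after=False keeps a single bias vector per gate, matching the formula above
layer1 = Bidirectional(GRU(5, reset_after=False, return_sequences=True))(input)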
layer2 = LSTM(50)(layer1)
model = Model(input, layer2)
merge_mode is concatenation by default.
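The totals for this stacked model can be reproduced in plain Python. This is a sketch with hypothetical names (`gate_params`, `bias_vectors`). One caveat, as far as I know: recent Keras releases default GRU to `reset_after=True`, which keeps two bias vectors per gate instead of one, so `model.count_params()` may report 12,650 rather than 12,620 unless the GRU is built with `reset_after=False`.

```python
def gate_params(g, h, i, bias_vectors=1):
    """g FFNNs, each with h*(h+i) weights and bias_vectors*h biases."""
    return g * (h * (h + i) + bias_vectors * h)


# Layer 1: bidirectional GRU, 5 units, input size 8 (two independent GRUs).
layer1 = 2 * gate_params(g=3, h=5, i=8)
# Layer 2: LSTM, 50 units, fed the concatenated 5+5 GRU outputs.
layer2 = gate_params(g=4, h=50, i=5 + 5)
total = layer1 + layer2
print(layer1, layer2, total)  # 420 12200 12620

# Assumed variant: GRU with two bias vectors per gate (reset_after=True).
layer1_reset_after = 2 * gate_params(g=3, h=5, i=8, bias_vectors=2)
print(layer1_reset_after, layer1_reset_after + layer2)  # 450 12650
```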
3. CNNs
For one layer,
- i, no. of input maps (or channels)
- f, filter size (just the side length, assuming square filters)
- o, no. of output maps (or channels; this is also the number of filters used)
One filter is applied to every input map.
num_params
= weights + biases
= [i × (f×f) × o] + o
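Written as a function, the CNN count looks like this. It is a sketch in plain Python with a hypothetical helper name (`conv2d_params`), assuming square f×f filters and one bias per output channel. Stride, padding and image size do not enter the count.

```python
def conv2d_params(i, f, o):
    """Parameters of a 2D convolution layer with square filters.

    i: input channels, f: filter side length, o: output channels
    """
    weights = i * (f * f) * o  # an f x f kernel per (input channel, output channel) pair
    biases = o                 # one bias per output channel
    return weights + biases


# A 3x3 convolution taking 64 channels to 128 channels.
print(conv2d_params(i=64, f=3, o=128))  # 73856
```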
Example 3.1: Greyscale image with 2×2 filter, output 3 channels
- i = 1 (greyscale has only 1 channel)
- f = 2
- o = 3
num_params
= [i × (f×f) × o] + o
= [1 × (2×2) × 3] + 3
= 15
input = Input((None, None, 1))
conv2d = Conv2D(kernel_size=2, filters=3)(input)
model = Model(input, conv2d)
Example 3.2: RGB image with 2×2 filter, output of 1 channel
There is 1 filter for each input feature map. The resulting convolutions are added element-wise, and a bias term is added to each element. This gives an output with 1 feature map.
- i = 3 (RGB image has 3 channels)
- f = 2
- o = 1
num_params
= [i × (f×f) × o] + o
= [3 × (2×2) × 1] + 1
= 13
input = Input((None, None, 3))
conv2d = Conv2D(kernel_size=2, filters=1)(input)
model = Model(input, conv2d)
Example 3.3: Image with 2 channels, with 2×2 filter, and output of 3 channels
There are 3 filters (purple, yellow, cyan) for each input feature map. The resulting convolutions are added element-wise, and a bias term is added to each element. This gives an output with 3 feature maps.
- i = 2
- f = 2
- o = 3
num_params
= [i × (f×f) × o] + o
= [2 × (2×2) × 3] + 3
= 27
input = Input((None, None, 2))
conv2d = Conv2D(kernel_size=2, filters=3)(input)
model = Model(input, conv2d)
That’s all for now! Do leave comments below if you have any feedback!
Follow me on Twitter @remykarem or LinkedIn. You may also reach out to me via raimi.bkarim@gmail.com. Feel free to visit my website at remykarem.github.io.