
Modern deep learning research is mainly focused on creating new, better, and more optimized solutions for problems such as object detection, segmentation, self-supervised learning, etc. This helps many businesses and startups adopt more accessible AI solutions for their products. However, when it comes to real-world applications, especially when models are deployed on edge devices, we face limitations such as large model size and strict latency requirements.
There are many methods to make AI models accessible to mobile and other edge devices. One option is to use small models designed for mobile devices (such as MobileNet and mobile-oriented versions of YOLO). Other methods optimize the model at the inference level; these include model pruning, quantization, module fusion, etc. In this blog post, we will look at quantization and fusion for convolutional neural networks. We are going to use PyTorch’s quantization module and compare the size and latency of models with and without quantization.
Blog overview:
- What is quantization?
- Quantization techniques
- What is module fusion?
- Application and comparison in PyTorch
What Is Quantization?
Quantization is a simple technique to speed up deep learning models at the inference stage. It is essentially a way of compressing information: model parameters are stored as floating-point numbers, and model operations are computed with these floating-point numbers. Floating-point numbers offer high precision, but they are memory-intensive and computationally expensive. Quantization converts 32-bit floating-point numbers to 8-bit integers and performs some or all of the operations on 8-bit integers, which can reduce the model size and memory requirements by a factor of 4.
However, this comes at a cost: to reduce the model size and improve execution time, we sacrifice some precision. There is therefore a trade-off between model accuracy and size/latency.
To do quantization, we need a mapping function that maps floating point numbers to integers. The most common and easiest way to do this is a linear transformation.
Q(r) = round(r / S + Z), where
- r is the input,
- S is a scaling factor: the ratio of the input range to the quantized output range. The easiest way to find the scaling factor is to use the minimum and maximum values of the input and output ranges. However, PyTorch also offers other methods such as MSE minimization, entropy minimization, etc. The process of finding the scaling factor is called calibration (a small numeric sketch follows this list).
- Z is a zero point; this parameter ensures that zero in the input space maps correctly to a point in the output space.
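To make the mapping concrete, here is a minimal sketch of min/max calibration for an unsigned 8-bit target range; the helper name quantize_minmax is hypothetical and not something the original post defines:
import torch

# Hypothetical helper: min/max calibration followed by the mapping Q(r) = round(r/S + Z)
def quantize_minmax(x, qmin=0, qmax=255):
    s = (x.max() - x.min()) / (qmax - qmin)       # scaling factor S
    z = qmin - torch.round(x.min() / s)           # zero point Z
    q = torch.clamp(torch.round(x / s + z), qmin, qmax)
    return q.to(torch.uint8), s.item(), int(z.item())

q, s, z = quantize_minmax(torch.randn(8))
print(q, s, z)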

PyTorch has an Observer module that can be used to calibrate a model. It collects statistics on the input values and calculates the S and Z parameters. The quantization parameters can be calculated per tensor or per channel: in the first case, the Observer extracts statistics from the entire tensor; in the second, S and Z are calculated for each channel separately.
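As a rough illustration of what happens during calibration, a MinMaxObserver can be fed a few batches of data (random tensors here, only as a stand-in for real inputs) and then asked for the resulting scale and zero point:
import torch

# Feed representative data through an observer, then read off S and Z
observer = torch.quantization.MinMaxObserver(dtype=torch.quint8)
for _ in range(10):
    observer(torch.randn(32, 3, 224, 224))
scale, zero_point = observer.calculate_qparams()
print(scale, zero_point)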
Quantization Techniques
In PyTorch there are 3 different ways to implement quantization:
- Dynamic Quantization: model weights are quantized ahead of time, while activations are quantized dynamically during inference. This can result in higher accuracy, since calibration is done for each input. This technique works on LSTM, GRU, and RNN layers (a short usage sketch follows this list).
- Post-Training Static Quantization: model weights and activations are pre-quantized, and calibration is done on validation data. This is faster than dynamic quantization; however, it may need re-calibration from time to time to stay robust.
- Quantization-Aware Training (QAT): this technique aims to improve on post-training quantization by adding the quantization error to the training loss, so that the S and Z parameters can be learned during training.
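For comparison with the static approach used below, dynamic quantization is essentially a one-liner; here is a sketch on a placeholder recurrent model (the layer sizes are just stand-ins, not the network from this post):
import torch
from torch import nn

# Placeholder model with a layer type supported by dynamic quantization
model = nn.LSTM(input_size=128, hidden_size=64, num_layers=2)
# Weights are converted to int8 now; activations are quantized on the fly at inference time
model_dq = torch.quantization.quantize_dynamic(model, {nn.LSTM}, dtype=torch.qint8)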
In the rest of this post, we will use post-training static quantization for a convolutional neural network.
What is module fusion?
Before moving on to the PyTorch code, let’s understand one more technique, called fusion. Fusion merges multiple layers into one, which saves inference time and reduces memory accesses. It sends the combined sequence of operations to a low-level library that can compute it in one go, without passing intermediate results back to PyTorch. However, this technique also comes at a cost: the model becomes harder to debug once layers are merged. Fusion only works for the following layer groups: [Conv, ReLU], [Conv, BatchNorm], [Conv, BatchNorm, ReLU], [Linear, ReLU].
Application and comparison in PyTorch
# Import packages
from torch import nn
from torchsummary import summary
import torch
import os
First, let’s create a simple convolutional neural network.
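Here is a minimal sketch of such a network; the layer names match the fusion step later in the post (conv1/relu1, conv2/relu2, fc1/relu3), while the channel counts and hidden size are assumptions chosen only to land near the roughly 70 million parameters reported below:
# Sketch of the network (layer names match the fusion step later; sizes are assumptions)
class Net(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(32 * 56 * 56, 700)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(700, num_classes)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.pool1(self.relu1(self.conv1(x)))
        x = self.pool2(self.relu2(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = self.relu3(self.fc1(x))
        return self.softmax(self.fc2(x))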
As you can see from the code snippet above, we have created a small convolutional network with two convolutional layers, each followed by ReLU and max pooling. At the end, we have two fully connected layers and a softmax output.
n = Net().cuda()
summary(n, (3, 224, 224))

The model summary shows that we have about 70 million parameters and an estimated model size of 294 MB.
To create a quantized version of the same model, we add two new attributes, a QuantStub and a DeQuantStub, which quantize and dequantize tensors. Then, in the forward pass, we quantize the network input and dequantize right before the softmax, as in the sketch below.
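A sketch of the quantization-ready version, assuming the Net class sketched above and PyTorch’s QuantStub/DeQuantStub (the exact original implementation may differ):
# Quantization-ready version of the network (a sketch building on the Net class above)
class NetQuant(Net):
    def __init__(self, num_classes=10):
        super().__init__(num_classes)
        self.quant = torch.quantization.QuantStub()      # quantizes the float input
        self.dequant = torch.quantization.DeQuantStub()  # dequantizes before the softmax

    def forward(self, x):
        x = self.quant(x)
        x = self.pool1(self.relu1(self.conv1(x)))
        x = self.pool2(self.relu2(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = self.relu3(self.fc1(x))
        x = self.dequant(self.fc2(x))
        return self.softmax(x)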
To run the quantized model in the eval() setting, we need to define a quantization configuration. Torch has 2 backends for quantization: 'fbgemm' for x86 CPUs with AVX2 support or higher, and 'qnnpack' for ARM CPUs (mobile/embedded devices).
Next, we need to prepare the model using torch.quantization.prepare. This inserts observers that collect statistics on the inputs, and torch.quantization.convert then converts the observed model into its quantized counterpart.
# Define original and quantized models and prepare for evaluation
net = Net()
net.eval()
net_quant = NetQuant()
net_quant.eval()
# Prepare Model Quantization and convert to quantized version
net_quant.qconfig = torch.quantization.get_default_qconfig("fbgemm")
torch.backends.quantized.engine = "fbgemm"
net_quant = torch.quantization.prepare(net_quant.cpu(), inplace=False)
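# Calibration pass (an assumption: the original post does not show this step).
# Run a few representative batches through the prepared model so the observers
# can collect activation statistics before conversion; random data is used here
# only as a stand-in for real validation samples.
with torch.no_grad():
    for _ in range(10):
        net_quant(torch.rand(8, 3, 224, 224))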
net_quant = torch.quantization.convert(net_quant, inplace=False)
Check model size
# Check model size
def print_model_size(mdl):
    torch.save(mdl.state_dict(), "tmp.pt")
    size = round(os.path.getsize("tmp.pt") / 1e6)
    os.remove("tmp.pt")
    return size
net_size = print_model_size(net)
quant_size = print_model_size(net_quant)
print(f'Size without quantization: {net_size} MB \nSize with quantization: {quant_size} MB')
print(f'Size ratio: {round(net_size/quant_size, 2)}')

Now we have a model that is about 4 times smaller than the original one.
Check Model Latency
# input for the model
inpp = torch.rand(32, 3, 224, 224)
# compare the performance
print("Floating point FP32")
%timeit net(inpp)
print("Quantized INT8")
%timeit net_quant(inpp)

On the CPU, the original model runs in about 162 ms, while the quantized model is about 1.7 times faster.
Fusion
Next, let’s implement fusion for further optimization. As we can see from our model structure, there are only 3 layer pairs we can fuse:
# Prepare blocks for the fusion
modules_to_fuse = [['conv1', 'relu1'],
                   ['conv2', 'relu2'],
                   ['fc1', 'relu3']]
net_quant_fused = torch.quantization.fuse_modules(net_quant, modules_to_fuse)
net_fused = torch.quantization.fuse_modules(net, modules_to_fuse)
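To see what fusion actually does, we can inspect the fused float model: the [Conv, ReLU] pair should now show up as a single combined module, while the absorbed ReLU is replaced by an identity (an illustrative check, not output from the original post):
# Inspect the fused float model
print(net_fused.conv1)  # expected: a fused ConvReLU2d module
print(net_fused.relu1)  # expected: Identity()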
Now let’s check the latency again:
print("Fused and quantized model latency")
%timeit net_quant_fused(inpp)
print("Fused model latency")
%timeit net_fused(inpp)

As we can see, fusion saved us some run time both with and without quantization. And unlike quantization, fusion does not affect model accuracy.
Conclusion
Optimizing models for inference is not an easy task. PyTorch makes it much easier to use these optimization techniques. However, always remember to carefully check the accuracy of the model before and after quantization to ensure that your model is not only fast but also accurate.
Find all the code on GitHub: https://github.com/LilitYolyan/inference_optimization_cnn