(part 1)

Introduction
Transformer-based models like BERT, RoBERTa, or T5 provide state-of-the-art solutions for many custom natural language processing problems. A common way to deliver such models to production is to build a Docker image that provides an API to the model. The image encapsulates the required dependencies, the model itself, and the code to process the input data with the model. Compared to large generative models (GenAI), these models are relatively small, from 0.5 to 2 GB. Nevertheless, when you follow the straightforward way of deploying the model as a Docker image, you may be surprised by the size of the image, which can reach 8 GB. Have you ever wondered why the resulting image is so large and whether there is a way to reduce its size? In this story, I will discuss why the Docker image might be so large and how to reduce its size.
The example Python scripts and Dockerfiles used in this story are also available in the repository [1].
Baseline Docker image
Let’s build a simple Docker image for a language detection model. Here are the assumptions:
- I will use a trained model: papluca/xlm-roberta-base-language-detection [2].
- I will utilize the GPU to get the best possible performance.
- I will use FastAPI to provide a simple endpoint to process a single text.
Here is the Dockerfile to build the image:
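The actual Dockerfile_cuda lives in the repository [1]. As a minimal sketch of what such an image typically contains (the file and directory names main.py and model/ below are illustrative assumptions, not taken from the repo), it might look like this:
FROM nvidia/cuda:11.8.0-base-ubuntu22.04
# The original image ships Python 3.9; this sketch simply installs the distro Python and pip
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
WORKDIR /workspace
# torch pulls in the nvidia and triton packages analyzed later in this story
RUN pip3 install torch transformers fastapi uvicorn
# The fine-tuned model (model.safetensors + tokenizer files) and the API code
COPY model/ ./model/
COPY main.py .
EXPOSE 8000
CMD ["python3", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]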
The code used to load the model and perform the inference is the following:
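The exact script is also in the repository [1]; a minimal sketch, assuming a FastAPI app wrapping the Transformers text-classification pipeline (the /process route and request schema below are assumptions consistent with the curl call later on), could look like this:
# main.py - a sketch, not the exact code from the repo [1]
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the language-detection model on the first GPU (device=0);
# in the image, the model files may instead be loaded from a local path baked into /workspace
classifier = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
    device=0,
)

class TextRequest(BaseModel):
    text: str

@app.post("/process")
def process(request: TextRequest) -> str:
    # Return only the predicted language code, e.g. "it"
    return classifier(request.text)[0]["label"]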
Here is a command used to build the image:
docker build -t language_detection_cuda . -f Dockerfile_cuda
… and run the image:
docker run --gpus 0 -p 8000:8000 language_detection_cuda
… let’s test the endpoint:
time curl -X 'POST' 'http://localhost:8000/process' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
"text": "Certo ci sono stati dei problemi - problemi che dovremo risolvere in vista, per esempio, dell'\''ampliamento - ma a volte ne esageriamo il lato negativo."
}'
We got the following output:
"it"
So far, there is nothing fancy. The endpoint does what it is supposed to do.
The model size is 1.11 GB (the model.safetensors file), plus another 10 MB for the tokenizer. Now, let’s see how large our Docker image is:
docker images | grep language_detection_cuda
… the output is:
language_detection_cuda latest 47f4c1c0de2d 33 minutes ago 7.05GB
The Docker image is 7.05 GB in total. Wow, that’s quite a lot, isn’t it? But why is the image so large? Let’s jump into the container and see what is inside.
docker run -it --gpus 0 -p 8000:8000 --entrypoint "/bin/bash" language_detection_cuda
To analyze the size of the image, I will use the du command and trace down the largest folders.
du -h --max-depth 1 /
The output for the root, including the largest folders:
5.9G /usr
1.1G /workspace
...
The workspace folder contains the model and the Python script, and its size is mostly the size of the model.safetensors file. Nothing surprising here.
The usr folder contains the dependencies required to run the Python code. Let’s examine what is in the folder.
5340M /usr/local/lib/python3.9/dist-packages/
2961M /usr/local/lib/python3.9/dist-packages/nvidia
1644M /usr/local/lib/python3.9/dist-packages/torch
439M /usr/local/lib/python3.9/dist-packages/triton
77M /usr/local/lib/python3.9/dist-packages/transformers
53M /usr/local/lib/python3.9/dist-packages/sympy
...
5.3G out of the 5.9G is for the Python modules. The largest packages are:
- 3.0 GB – nvidia (cuda, cudnn, cublas, and so on)
- 1.6 GB – torch,
- 0.4 GB – triton.
The nvidia and triton modules are torch dependencies. The nvidia module is required to run the inference on the GPU, and the torch module is, in turn, required by the transformers module to run the inference. The diagram below presents the contribution of these modules to the overall image size.

There is not much we can do to significantly reduce the image size if we want to run inference on the GPU. However, an alternative to GPU inference can help us reduce the size of the Docker image up to 10x: ONNX [4] with quantization.
Docker image with ONNX model
ONNX with int8 quantization can reduce the model size roughly 4-fold with only a slight loss in performance [5]. Another benefit is the possibility of reducing the size of the Docker image up to 10 times. How is this possible? Let’s see what is required to build the Docker image with the ONNX variant of the model:
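Again, the real Dockerfile_onnx is in the repository [1]; the sketch below reconstructs its typical shape under the same caveats as before (file and directory names are illustrative assumptions):
FROM python:3.9-slim
WORKDIR /workspace
# onnxruntime replaces torch; transformers is still needed for the tokenizer
RUN pip install onnxruntime transformers fastapi uvicorn
# The quantized model (model_quantized.onnx) plus tokenizer and config files
COPY model_onnx/ ./model_onnx/
COPY main_onnx.py .
EXPOSE 8000
CMD ["uvicorn", "main_onnx:app", "--host", "0.0.0.0", "--port", "8000"]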
Here is the Python code used to run inference with onnxruntime:
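The script itself is in the repository [1]; a minimal sketch, assuming plain onnxruntime plus the Hugging Face tokenizer and the local model_onnx/ directory from the Dockerfile sketch above, could look like this:
# main_onnx.py - a sketch, not the exact code from the repo [1]
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoConfig, AutoTokenizer

MODEL_DIR = "model_onnx"  # assumed location of model_quantized.onnx, tokenizer and config

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
config = AutoConfig.from_pretrained(MODEL_DIR)
session = ort.InferenceSession(f"{MODEL_DIR}/model_quantized.onnx")

app = FastAPI()

class TextRequest(BaseModel):
    text: str

@app.post("/process")
def process(request: TextRequest) -> str:
    # Tokenize to NumPy tensors and run the ONNX graph on the CPU
    inputs = tokenizer(request.text, return_tensors="np")
    logits = session.run(None, dict(inputs))[0]
    # Map the highest-scoring class id back to its language code
    return config.id2label[int(np.argmax(logits, axis=-1)[0])]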
First, let’s build the image and compare the sizes. Then, we will analyze the difference between this and the previous images.
docker build -t language_detection_onnx . -f Dockerfile_onnx
… and run the image:
docker run -p 8000:8000 language_detection_onnx
Let’s compare the sizes of the images:
docker images | grep language_detection
Output:
language_detection_cuda latest 47f4c1c0de2d 33 minutes ago 7.05GB
language_detection_onnx latest 3086089bd994 9 hours ago 699MB
7.05 GB vs 699 MB – that is indeed a 10x smaller Docker image. How was it possible?
There are three main differences between the two images.
1. Base Docker image
Instead of nvidia/cuda:11.8.0-base-ubuntu22.04, we used a much smaller base Docker image, python:3.9-slim. The first image contains all the Nvidia libraries required to run the inference on the GPU (CUDA, cuDNN, cuBLAS). With ONNX and the quantized model, we do not need the GPU to run the inference, so we do not need the Nvidia libraries either.
2. Python modules
Instead of torch, we used onnxruntime, which does not require the nvidia and triton modules. This way, we could get rid of the three large Python modules.
3. Quantized model in the ONNX format
The last significant difference is that we used a quantized model converted to the ONNX format [3]. The model_quantized.onnx file is only 279 MB, one-fourth of the original model’s size.
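The quantized model used here is taken ready-made from the Hugging Face Hub [3]. For completeness, here is a hedged sketch (not the script behind [3]) of how such a file can be produced with Hugging Face Optimum: export the checkpoint to ONNX, then apply dynamic int8 quantization.
# convert_and_quantize.py - an illustrative sketch using Hugging Face Optimum
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer

model_id = "papluca/xlm-roberta-base-language-detection"

# Export the PyTorch checkpoint to ONNX and save it together with the tokenizer
ORTModelForSequenceClassification.from_pretrained(model_id, export=True).save_pretrained("model_onnx")
AutoTokenizer.from_pretrained(model_id).save_pretrained("model_onnx")

# Apply dynamic int8 quantization; the result is saved as model_quantized.onnx
quantizer = ORTQuantizer.from_pretrained("model_onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="model_onnx", quantization_config=qconfig)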
Conclusions
ONNX with quantization can reduce the size of the production image by up to 10 times.
In some cases, the size of the production model might be more important than squeezing out the last bit of model performance. In such cases, model quantization and conversion to the ONNX format may come to the rescue. Quantization can not only reduce the size of the Docker image but also reduce costs, since running CPU instances is cheaper than GPU ones. Nevertheless, the final decision depends on several factors: the problem being addressed, the expected performance, the expected inference time, the performance loss of the quantized model, and the configuration of the production environment.
Troubleshooting
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
If you encounter a problem running Docker images with the --gpus parameter, check the following:
- Install the Nvidia Container Toolkit:
sudo apt install nvidia-container-toolkit
- Restart the Docker service:
sudo systemctl restart docker
The following command should print info about your GPUs:
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
References
[1] https://github.com/CodeNLP/codenlp-docker-ml
[2] https://huggingface.co/papluca/xlm-roberta-base-language-detection
[3] https://huggingface.co/protectai/xlm-roberta-base-language-detection-onnx
[4] https://onnx.ai/