Reducing the Size of Docker Images Serving Large Language Models

Have you encountered a situation where a 1 GB transformer-based model grows to as much as 8 GB once it is deployed in a Docker container?

(part 1)

Photo by Dominik Lückmann on Unsplash

Introduction

Transformer-based models like BERT, RoBERTa, or T5 provide state-of-the-art solutions for many custom problems in natural language processing. A common way to deliver such models to production is to build a Docker image that exposes an API to the model. The image encapsulates the required dependencies, the model itself, and the code that processes the input data with the model. Compared to large generative models (GenAI), these models are relatively small, from 0.5 to 2 GB. Nevertheless, when you follow the straightforward way of deploying the model as a Docker image, you may be surprised by the size of the image, which can reach 8 GB. In this story, I will discuss why the Docker image gets so large and how to reduce its size.

The Python scripts and Dockerfiles used in this story are also available in this repository [1]:

GitHub – CodeNLP/codenlp-docker-ml: This repository demonstrates how to create a small Docker image…


Baseline Docker image

Let’s build a simple Docker image for a language detection model. Here are the assumptions for the deployment:

  • I will use a trained model: papluca/xlm-roberta-base-language-detection [2].
  • I will utilize the GPU to get the best possible performance.
  • I will use FastAPI to provide a simple endpoint to process a single text.

Here is the Dockerfile to build the image:
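The full Dockerfile_cuda is available in the repository [1]. A minimal sketch of such a file, assuming the nvidia/cuda:11.8.0-base-ubuntu22.04 base image discussed below, a hypothetical requirements.txt (torch, transformers, fastapi, uvicorn), and a hypothetical main.py entry point, could look like this:

FROM nvidia/cuda:11.8.0-base-ubuntu22.04

# The base image ships without Python; the actual Dockerfile pins Python 3.9
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

# Install the Python dependencies (torch pulls in the nvidia and triton packages)
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy the model files and the API code
COPY model/ ./model/
COPY main.py .

EXPOSE 8000
ENTRYPOINT ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]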

The code used to load the model and perform the inference is the following:
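The actual script is in the repository [1]; the sketch below illustrates the idea, assuming a hypothetical main.py that wraps the transformers pipeline in a FastAPI endpoint and places the model on the first GPU:

# main.py (hypothetical sketch): FastAPI endpoint around the transformers pipeline
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# device=0 loads the model onto the first GPU
classifier = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
    device=0,
)

class Request(BaseModel):
    text: str

@app.post("/process")
def process(request: Request) -> str:
    # Return only the predicted language code, e.g. "it"
    return classifier(request.text, truncation=True)[0]["label"]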

Here is a command used to build the image:

docker build -t language_detection_cuda . -f Dockerfile_cuda

… and run the image:

docker run --gpus 0 -p 8000:8000 language_detection_cuda

… let’s test the endpoint:

time curl -X 'POST' \
  'http://localhost:8000/process' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "Certo ci sono stati dei problemi - problemi che dovremo risolvere in vista, per esempio, dell'\''ampliamento - ma a volte ne esageriamo il lato negativo."
}'

We got the following output:

"it"

So far, there is nothing fancy. The endpoint does what it is supposed to do.

The model size is 1.11 GB (the model.safetensors file), plus another 10 MB for the tokenizer. Now, let’s see how large our Docker image is:

docker images | grep language_detection_cuda

… the output is:

language_detection_cuda    latest   47f4c1c0de2d   33 minutes ago   7.05GB

The Docker image is 7.05 GB in total. Wow, that’s quite a lot, isn’t it? But why is the image so large? Let’s jump into the container and see what is inside.

docker run -it --gpus 0 -p 8000:8000 --entrypoint "/bin/bash"  language_detection_cuda

To analyze the size of the image, I will use the du command and trace down the largest folders.

du -h --max-depth 1 /

The output for the root, including the largest folders:

5.9G    /usr
1.1G    /workspace
 ...

The workspace folder contains the model and the Python script, and its size is mainly the size of the model.safetensors file. Nothing surprising here.

The usr folder contains the dependencies required to run the Python code. Let’s examine what is in the folder.

5340M /usr/local/lib/python3.9/dist-packages/
2961M /usr/local/lib/python3.9/dist-packages/nvidia
1644M /usr/local/lib/python3.9/dist-packages/torch
 439M /usr/local/lib/python3.9/dist-packages/triton
  77M /usr/local/lib/python3.9/dist-packages/transformers
  53M /usr/local/lib/python3.9/dist-packages/sympy
  ...

5.3G out of the 5.9G is for the Python modules. The largest packages are:

  • 3.0 GB – nvidia (cuda, cudnn, cublas, and so on)
  • 1.6 GB – torch
  • 0.4 GB – triton

The nvidia and triton modules are torch dependencies. The nvidia module is required to run the inference on the GPU. The torch module is, in turn, required by the transformers module to run the inference. The diagram below presents the contribution of these modules to the whole picture.

Image by author: Size of the nvidia, torch, and triton modules compared to other Python modules.

There is little we can do to significantly reduce the image size if we want to run the inference on the GPU. However, an alternative to GPU inference can reduce the size of the Docker image by up to 10x: ONNX [4] with quantization.


Docker image with ONNX model

ONNX with int8 quantization can reduce the model size four-fold with only a slight loss in performance [5]. Another benefit is the possibility of reducing the size of the Docker image by up to 10 times. How is this possible? Let’s see what is required to build the Docker image with the ONNX variant of the model:
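Again, the exact Dockerfile_onnx is in the repository [1]; a minimal sketch, assuming the python:3.9-slim base image described below and a hypothetical requirements.txt with onnxruntime, transformers, fastapi, and uvicorn (no torch), might look like this:

FROM python:3.9-slim

WORKDIR /workspace

# Install the CPU-only dependencies; no torch, no CUDA libraries
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the quantized ONNX model and the API code
COPY model/ ./model/
COPY main.py .

EXPOSE 8000
ENTRYPOINT ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]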

Here is the Python code used to run inference with onnxruntime:
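The repository [1] contains the real script; the sketch below shows the idea, assuming a hypothetical main.py and a local model directory holding the quantized ONNX file together with the tokenizer and config files:

# main.py (hypothetical sketch): inference with onnxruntime, no torch required
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoConfig, AutoTokenizer

app = FastAPI()

MODEL_DIR = "model"  # contains model_quantized.onnx, tokenizer and config files

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
config = AutoConfig.from_pretrained(MODEL_DIR)  # provides the id2label mapping
session = ort.InferenceSession(f"{MODEL_DIR}/model_quantized.onnx")

class Request(BaseModel):
    text: str

@app.post("/process")
def process(request: Request) -> str:
    # Tokenize to NumPy arrays, run the ONNX session, and map the top logit to a label
    inputs = tokenizer(request.text, return_tensors="np", truncation=True)
    logits = session.run(None, dict(inputs))[0]
    label_id = int(np.argmax(logits, axis=-1)[0])
    return config.id2label[label_id]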

First, let’s build the image and compare the sizes. Then, we will analyze the differences between this image and the previous one.

docker build -t language_detection_onnx . -f Dockerfile_onnx

… and run the image:

docker run -p 8000:8000 language_detection_onnx

Let’s compare the sizes of the images:

docker images | grep language_detection

Output:

language_detection_cuda    latest   47f4c1c0de2d   33 minutes ago   7.05GB
language_detection_onnx    latest   3086089bd994   9 hours ago       699MB

7.05 GB vs 699 MB – that is indeed a 10x smaller Docker image. How was it possible?

There are three main differences between the two images.

1. Base Docker image

Instead of nvidia/cuda:11.8.0-base-ubuntu22.04, we used a much smaller base Docker image, python:3.9-slim. The first image contains all the NVIDIA libraries required to run inference on the GPU (CUDA, cuDNN, cuBLAS). With ONNX and the quantized model, we do not need a GPU to run the inference, so we do not need the NVIDIA libraries either.

2. Python modules

Instead of torch, we used onnxruntime, which does not require the nvidia and triton modules. This way, we could get rid of the three large Python modules.

3. Quantized model in the ONNX format

The last significant difference is that we used a quantized model converted to the ONNX format [3]. The model_quantized.onnx file is only 279 MB, about one-fourth of the original model’s size.
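The story uses a ready-made quantized model [3], but if you need to produce such a model yourself, one possible route (an assumption on my side, not part of the original repository) is the Hugging Face Optimum library with dynamic int8 quantization:

# Hypothetical sketch: export the model to ONNX and quantize it with Hugging Face Optimum
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "papluca/xlm-roberta-base-language-detection"

# Export the PyTorch checkpoint to the ONNX format
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
model.save_pretrained("onnx_model")

# Apply dynamic int8 quantization
quantizer = ORTQuantizer.from_pretrained("onnx_model")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx_model_quantized", quantization_config=qconfig)

The resulting quantized ONNX file can then be copied into the slim image in place of the original model.safetensors.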


Conclusions

ONNX with quantization can reduce the size of the production image by up to 10 times.

In some cases, the size of the production image might be more important than squeezing out the last bit of model performance. In such cases, model quantization and conversion to the ONNX format may come to the rescue. Quantization can not only reduce the size of the Docker image but also reduce costs, since running CPU instances is cheaper than running GPU ones. Nevertheless, the final decision depends on several factors: the problem being addressed, the expected performance, the expected inference time, the performance loss of the quantized model, and the configuration of the production environment.


Troubleshooting

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

If you encounter a problem running Docker images with the --gpus parameter, check the following:

  1. Install the NVIDIA Container Toolkit:
sudo apt install nvidia-container-toolkit
  2. Restart the Docker service:
sudo systemctl restart docker

The following command should print info about your GPUs:

docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

References

[1] https://github.com/CodeNLP/codenlp-docker-ml

[2] https://huggingface.co/papluca/xlm-roberta-base-language-detection

[3] https://huggingface.co/protectai/xlm-roberta-base-language-detection-onnx

[4] https://onnx.ai/

[5] https://medium.com/codenlp/reducing-inference-time-of-t5-models-76e996523fb2?sk=f02379f5a8363d2de73a332fcef55f78

