
Serving TensorRT Models with NVIDIA Triton Inference Server

Achieving optimal throughput and latency with model inference on high client-server traffic

Image by Florian Krumm on Unsplash

When deploying real-time AI models at scale, the efficiency of model inference and of hardware/GPU usage is paramount. The speed of a single client-server inference request depends on the latency and throughput of the server, because deep learning models are typically mounted on a server, or a cluster of servers, that receives many incoming requests from client devices. Let me first define these two terms:

Latency: The time taken for a single request-response loop through the client-server connection. Assuming a steady internet connection, latency depends on the speed of model inference, the transfer of data packets, and other factors.

Throughput: The number of incoming requests the server can process in a single time instance. When incoming traffic exceeds the throughput, excess requests are queued, slowing down the request-response process.

In my previous article, we discussed setting up and running model inference on the edge (a stand-alone GPU Linux device) with TensorRT. As we saw, it is pretty straightforward:

Running Tensorflow Model on Edge with TensorRT for Fast Inference

In this article, on the other hand, we will discuss setting up and running model inference with TensorRT on a single server (a GPU workstation that receives requests from other devices over the network), namely the NVIDIA Triton Inference Server.

There are several merits to running deep learning models on the Triton server, and it is reportedly superior to other serving frameworks such as TensorFlow Serving and TorchServe. For instance, it can optimize throughput through dynamic batching and concurrent model execution across multiple requests. Combined with TensorRT to optimize latency, the Triton server offers blazing-fast inference en masse.

More information about Triton server’s performance can be found in articles like these:

Accelerating AI/Deep learning models using tensorRT & triton inference

NVIDIA Triton Inference Server Boosts Deep Learning Inference | NVIDIA Technical Blog


Prerequisites

  • A local workstation/laptop with an NVIDIA GPU
  • Basics of Docker containers and the Linux terminal
  • Knowledge of basic Python and deep learning libraries

Without further ado, let's get started!

1. An Example of a Tensorflow Model

As our example deep learning model to test on the Triton server, we choose a classic CNN – ResNet50 – pretrained on the ImageNet dataset, as shown in the code snippet below. We will then optimize this TensorFlow model into a TensorRT model.
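A minimal sketch of this baseline model, assuming the standard tf.keras applications API, could look like the following:

import tensorflow as tf

# Load a ResNet50 classifier pretrained on ImageNet; the default input shape is (224, 224, 3)
model = tf.keras.applications.ResNet50(weights="imagenet")

# Inspect the architecture; the default input/output layer names ("input_1" and "predictions")
# will reappear later in the Triton config.pbtxt
model.summary()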

2. Conversion to ONNX Model

While there are different TensorRT workflows, such as TensorFlow-TensorRT and ONNX-TensorRT, the route we take for the NVIDIA Triton server is ONNX-TensorRT, since Triton's tensorrt_plan platform expects a standalone serialized TensorRT engine. Therefore, we first need to convert our Keras/TensorFlow model to the ONNX format, as shown in the code snippet below. To follow along, you may want to save the resulting model.onnx in the following directory:

${PWD}/models/tensorrt_fp16_model/1/model.onnx
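A minimal conversion sketch, assuming the tf2onnx package (installable with pip install tf2onnx) and the Keras model from the previous step, might look like this:

import tensorflow as tf
import tf2onnx

# Recreate the pretrained Keras model from Step 1
model = tf.keras.applications.ResNet50(weights="imagenet")

# Declare the input signature; the name "input_1" matches the input name used later in config.pbtxt
spec = (tf.TensorSpec((None, 224, 224, 3), tf.float32, name="input_1"),)

# Convert and write the ONNX file directly into the model repository directory shown above
tf2onnx.convert.from_keras(
    model,
    input_signature=spec,
    opset=13,
    output_path="models/tensorrt_fp16_model/1/model.onnx",
)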

3. Conversion to TensorRT Model using Docker Container

Next, we want to convert the ONNX model model.onnx to the TensorRT model model.plan . If you have TensorRT installed locally, you might be tempted to attempt a local conversion. However, this is problematic because the TensorRT and CUDA software in the Triton server container – as we shall see later – might behave differently when running the model.plan file, even if your local TensorRT version matches that of the Triton container.

Fortunately, NVIDIA offers a Docker image for TensorRT with version tags that match those of the Triton server. Assuming you have Docker installed locally, run in the terminal:

docker pull nvcr.io/nvidia/tensorrt:22.11-py3

Once the Docker image is pulled, we will do a volume bind mount when running the container. Note that the argument for volume bind mount must be an absolute path.

docker run -it --gpus all --rm -v ${PWD}/models/tensorrt_fp16_model/1:/trt_optimize nvcr.io/nvidia/tensorrt:22.11-py3

This will start a terminal within the container, in which we perform the TensorRT conversion to create model.plan , which will also be available locally thanks to the bind mount.

# cd /trt_optimize
# /workspace/tensorrt/bin/trtexec --onnx=model.onnx --saveEngine=model.plan  --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16
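As an optional sanity check, still inside the container, you can load the generated engine back with trtexec to confirm that it deserializes and runs, and to get a rough latency benchmark:

# /workspace/tensorrt/bin/trtexec --loadEngine=model.plan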

4. Setting up Local Directory to Mirror Triton Server

The Triton server software is similarly distributed as a Docker image, which we will download and run with a volume bind mount. However, before pulling the image, we want to set up a specific local directory structure, plus the necessary config.pbtxt model configuration file, that the server expects.

The directory structure should be as follows:

${PWD}/models
         |
         |-- tensorrt_fp16_model
         |          |
         |          |-- config.pbtxt
         |          |-- 1
         |          |   |
         |          |   |--model.plan

An example of the config.pbtxt file for our use case is as follows:

name: "tensorrt_fp16_model"
platform: "tensorrt_plan"
max_batch_size: 32

input [
    {
        name: "input_1"
        data_type: TYPE_FP16
        dims: [ 224, 224, 3 ]
    }
]

output [
    {
        name: "predictions"
        data_type: TYPE_FP16
        dims: [ 1000 ]
    }
]
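To exploit the dynamic batching and concurrent model execution mentioned earlier, you can optionally append blocks like the following to config.pbtxt. The values shown are illustrative assumptions, not tuned settings:

# Optional: batch queued requests together and run two model instances per GPU
# (illustrative values; tune preferred_batch_size and the queue delay for your workload)
dynamic_batching {
    preferred_batch_size: [ 8, 16 ]
    max_queue_delay_microseconds: 100
}

instance_group [
    {
        count: 2
        kind: KIND_GPU
    }
]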

5. Setting up Triton Server Container

After making sure that the Docker image version tag of the Triton server matches that of the TensorRT image, we are ready to download it:

docker pull nvcr.io/nvidia/tritonserver:22.11-py3

Then run the Docker container in attached mode:

docker run --gpus=all --rm  --name triton_server -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}/models:/models nvcr.io/nvidia/tritonserver:22.11-py3 tritonserver --model-repository=/models --model-control-mode=poll --repository-poll-secs 30

If the Triton server container is set up correctly and is ready to serve inference requests, the terminal should print a model status table listing tensorrt_fp16_model with 'READY' in the Status column and no error messages.
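You can also confirm readiness from another terminal by querying Triton's standard HTTP health endpoint on the HTTP port (8000) we exposed above; it should return an HTTP 200 response once the model is loaded:

curl -v localhost:8000/v2/health/ready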

6. Inference from Client Devices

Here I will demonstrate sample code that runs inference through the Triton server from a client workstation. Of course, we also need the tritonclient library installed on the client workstation:

pip install tritonclient[all]
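A minimal sketch of such a client script, assuming the server's HTTP endpoint on port 8000, the input/output names and FP16 types from config.pbtxt above, and a hypothetical local test image sample.jpg, could look like this:

import numpy as np
import tritonclient.http as httpclient
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

# Connect to the Triton server's HTTP endpoint (replace localhost with the server's IP)
client = httpclient.InferenceServerClient(url="localhost:8000")
assert client.is_server_ready() and client.is_model_ready("tensorrt_fp16_model")

# Load and preprocess a sample image into the 224x224x3 input expected by ResNet50
img = image.load_img("sample.jpg", target_size=(224, 224))
x = preprocess_input(image.img_to_array(img))
batch = np.expand_dims(x, axis=0).astype(np.float16)  # FP16, shape (1, 224, 224, 3)

# Declare the input/output tensors using the names and types from config.pbtxt
infer_input = httpclient.InferInput("input_1", list(batch.shape), "FP16")
infer_input.set_data_from_numpy(batch)
infer_output = httpclient.InferRequestedOutput("predictions")

# Send the request and decode the top predictions
response = client.infer("tensorrt_fp16_model", inputs=[infer_input], outputs=[infer_output])
preds = response.as_numpy("predictions").astype(np.float32)
print(decode_predictions(preds, top=3))

The tritonclient package also ships a gRPC client (tritonclient.grpc) with an analogous API, which talks to the gRPC port (8001) exposed in the docker run command above.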

7. Triton TensorRT is Slower than Local TensorRT

Before we end the article, one caveat I have to mention is that the Triton server really shines when doing inference en masse across heavy client-server traffic, thanks to advantages like optimized GPU usage and batch inference.

However, for a single local inference, local TensorRT will still be faster than Triton TensorRT, because Triton incurs additional overheads from serving inferences over a network. Some discussion can be found in the following thread:

Incomprehensible overhead in Tritonserver inference · Issue #4812 · triton-inference-server/server

8. Conclusion

Thanks for reading this article! Deploying AI models at scale is an important skill set for any aspiring AI Engineer or Machine Learning Engineer, and learning about the NVIDIA Triton Server definitely gives you an edge in the competitive Data Science domain. Otherwise, powerful and accurate models would only languish in Jupyter Notebooks and VS Code scripts.

Check out the code for this article on my GitHub repository.

In addition, thanks for joining me in learning about AI and Data Science. If you have enjoyed the content, pop by my other articles on Medium and follow me on LinkedIn.

Before you go, you may also like to check out these recommended articles that I have curated below:

In MLOps, having the right data to train your Machine Learning model is critical. If you are interested in learning more about different strategies for data-centric Machine Learning, I would highly encourage you to explore the following article, which is curated based on my experience in Machine Learning:

Data-Centric AI – Data Collection and Augmentation Strategy

Also, if you wish to know how to learn fast and deeply in Data Science and Artificial Intelligence, I would encourage you to check out my top article in learning:

The Complete Guide to Effective Learning in Data Science

Support me! If you are not subscribed to Medium and like my content, do consider supporting me by joining Medium via my referral link.

Join Medium with my referral link – Tan Pengshi Alvin
