
In large-scale, real-time AI model deployment, the efficiency of model inference and hardware/GPU usage is paramount. The speed of a single client-server inference request depends on the Latency and Throughput of the server, because deep learning models are typically mounted on a server or a cluster of servers that receives multiple incoming requests and data from client devices. Let me define these terms first:
Latency: The time taken for a single request-response loop through the client-server connection. Assuming a steady internet connection, latency depends on the speed of model inference, the transfer of data packets, and a few other factors.
Throughput: The number of incoming requests the server can process per unit time. When the incoming traffic exceeds the Throughput, excess requests are queued, which slows down the request-response process.
In my previous article, we discussed setting up and running model inference on the edge (a stand-alone GPU Linux device) with TensorRT. As we saw, it is pretty straightforward:
Running Tensorflow Model on Edge with TensorRT for Fast Inference
In this article, on the other hand, we will discuss setting up and running model inference with TensorRT on a single server (a GPU workstation that receives requests from other devices over the network), namely the NVIDIA Triton Inference Server.
There are several merits to running deep learning models on the Triton server, and it is reportedly superior to other serving frameworks such as Tensorflow Serving and TorchServe. For instance, it can optimize Throughput through dynamic batching and concurrent model execution across multiple requests. Combined with TensorRT to optimize Latency, the Triton server offers blazingly fast inference en masse.
More information about Triton server’s performance can be found in articles like these:
Accelerating AI/Deep learning models using tensorRT & triton inference
NVIDIA Triton Inference Server Boosts Deep Learning Inference | NVIDIA Technical Blog
Prerequisites
- A local workstation/laptop with a NVIDIA GPU
- Basics of Docker container and Linux Terminal
- Knowledge of basic Python and deep learning libraries
Without further ado, let's get started!
1. An Example of a Tensorflow Model
For our example deep learning model to test on the Triton server, we choose a classic CNN model, ResNet50, pretrained on the ImageNet dataset, as shown in the code snippet below. Next, we will optimize this Tensorflow model into a TensorRT model.
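Below is a minimal sketch of this step, assuming Tensorflow 2.x with Keras; note that the default tensor names of a freshly built Keras ResNet50 (input_1 and predictions) are the ones referenced later in config.pbtxt:
import tensorflow as tf
from tensorflow.keras.applications import ResNet50

# Load ResNet50 pretrained on ImageNet. In a fresh session, Keras names the
# input tensor "input_1" and the final softmax layer "predictions".
model = ResNet50(weights="imagenet", input_shape=(224, 224, 3))
model.summary()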
2. Conversion to ONNX Model
While there are different TensorRT workflows, such as Tensorflow-TensorRT and ONNX-TensorRT, the one we adopt for the NVIDIA Triton server is ONNX-TensorRT. Therefore, we first need to convert any Keras or Tensorflow model to the ONNX format, as shown in the code snippet below. To follow along, you may want to save the model.onnx file in the following directory:
${PWD}/models/tensorrt_fp16_model/1/model.onnx
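Here is a sketch of the conversion, assuming the tf2onnx package (pip install tf2onnx) and the model object from the previous step; the ONNX model can stay in FP32, since the FP16 precision is applied later during the TensorRT conversion:
import tensorflow as tf
import tf2onnx

# Input signature matching the Keras ResNet50 input, with a dynamic batch dimension.
spec = (tf.TensorSpec((None, 224, 224, 3), tf.float32, name="input_1"),)

# Convert the Keras model to ONNX and save it into the model repository folder.
tf2onnx.convert.from_keras(
    model,
    input_signature=spec,
    opset=13,
    output_path="models/tensorrt_fp16_model/1/model.onnx",
)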
3. Conversion to TensorRT Model using Docker Container
Next, we want to convert the ONNX model model.onnx to the TensorRT model model.plan. If you have TensorRT installed locally, you might be tempted to attempt a local conversion. However, this is problematic, as the TensorRT and CUDA software in the Triton server container (as we shall see later) might behave differently when running the model.plan file, even if your local TensorRT version appears to match the one inside Triton.
Fortunately, NVIDIA offers a Docker image for TensorRT whose version tags match those of the Triton server. Assuming you have Docker installed locally, run in the terminal:
docker pull nvcr.io/nvidia/tensorrt:22.11-py3
Once the Docker image is pulled, we will do a volume bind mount when running the container. Note that the path given for the bind mount must be absolute.
docker run -it --gpus all --rm -v ${PWD}/models/tensorrt_fp16_model/1:/trt_optimize nvcr.io/nvidia/tensorrt:22.11-py3
This will start a terminal within the container, in which we perform the TensorRT conversion to create model.plan, which will also be available locally thanks to the bind mount.
# cd /trt_optimize
# /workspace/tensorrt/bin/trtexec --onnx=model.onnx --saveEngine=model.plan --explicitBatch --inputIOFormats=fp16:chw --outputIOFormats=fp16:chw --fp16
4. Setting up Local Directory to Mirror Triton Server
The Triton server software is likewise a Docker image, which we will download and run with a volume bind mount. However, before we do the Docker pull, we want to set up a specific local directory structure, along with the necessary config.pbtxt model configuration file.
The directory structure should be as follows:
${PWD}/models
|
|-- tensorrt_fp16_model
| |
| |-- config.pbtxt
| |-- 1
| | |
| | |--model.plan
An example of the config.pbtxt file for our use case is as follows:
name: "tensorrt_fp16_model"
platform: "tensorrt_plan"
max_batch_size: 32
input [
{
name: "input_1"
data_type: TYPE_FP16
dims: [ 224, 224, 3 ]
}
]
output [
{
name: "predictions"
data_type: TYPE_FP16
dims: [ 1000 ]
}
]
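Two things to note about this configuration: since max_batch_size is greater than zero, the dims exclude the batch dimension (Triton adds it automatically), and the input/output names must match the tensor names in the exported model; input_1 and predictions are the defaults for a freshly built Keras ResNet50, so adjust them if your export differs.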
5. Setting up Triton Server Container
After making sure that the Docker image version tag of the Triton server matches that of TensorRT, we are ready to download it:
docker pull nvcr.io/nvidia/tritonserver:22.11-py3
Then run the Docker container in attached mode:
docker run --gpus=all --rm --name triton_server -p 8000:8000 -p 8001:8001 -p 8002:8002 -v ${PWD}/models:/models nvcr.io/nvidia/tritonserver:22.11-py3 tritonserver --model-repository=/models --model-control-mode=poll --repository-poll-secs 30
If the Triton server container is set up correctly and ready for inference, the startup logs should end with a table of loaded models in which tensorrt_fp16_model shows READY under the 'Status' column, with no error messages.
6. Inference from Client Devices
Here I will demonstrate sample code for running inference through the Triton server from client workstations. Of course, we also need to have the tritonclient library installed on the client workstations with:
pip install tritonclient[all]
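Below is a minimal client-side sketch using the HTTP client; the server address localhost:8000, the image file cat.jpg, and the model/tensor names from the config.pbtxt above are assumptions, so adjust them to your setup:
import numpy as np
import tritonclient.http as httpclient
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

# Connect to the Triton server (replace localhost with the server's IP or hostname).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Load and preprocess an image into a (1, 224, 224, 3) FP16 batch.
img = image.load_img("cat.jpg", target_size=(224, 224))
batch = preprocess_input(image.img_to_array(img))[np.newaxis, ...].astype(np.float16)

# Describe the input/output tensors declared in config.pbtxt.
infer_input = httpclient.InferInput("input_1", list(batch.shape), "FP16")
infer_input.set_data_from_numpy(batch)
infer_output = httpclient.InferRequestedOutput("predictions")

# Send the request and decode the top-5 ImageNet predictions.
response = client.infer("tensorrt_fp16_model", inputs=[infer_input], outputs=[infer_output])
preds = response.as_numpy("predictions").astype(np.float32)
print(decode_predictions(preds, top=5))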
7. Triton TensorRT is Slower than Local TensorRT
Before we end the article, one caveat I have to mention is that the Triton server really shines when doing inference en masse under heavy client-server traffic, thanks to advantages like optimized GPU usage and batch inference.
However, for a single local inference, local TensorRT would still be faster than Triton TensorRT, because Triton incurs additional overhead from serving inferences over a network. Some discussion can be found in the following thread:
Incomprehensible overhead in Tritonserver inference · Issue #4812 · triton-inference-server/server
8. Conclusion
Thanks for reading this article! Deploying AI models at scale is an important skill set for any aspiring AI Engineer or Machine Learning Engineer, and learning about the NVIDIA Triton Server definitely gives you an edge in the competitive Data Science domain. Otherwise, powerful and accurate models would only languish in Jupyter Notebooks and VS Code scripts.
Check out the codes for this article on my GitHub repository.
In addition, thanks for joining me in learning about AI and Data Science. If you have enjoyed the content, pop by my other articles on Medium and follow me on LinkedIn.
Before you go, you may also like to check out these recommended articles that I have curated below:
In MLOps, having the right data to train your Machine Learning model is critical. If you are interested in learning more about different strategies for data-centric Machine Learning, I would highly encourage you to explore the following article, which is curated based on my experience in Machine Learning:
Also, if you wish to know how to learn fast and deeply in Data Science and Artificial Intelligence, I would encourage you to check out my top article on learning:
The Complete Guide to Effective Learning in Data Science
Support me! – If you are not subscribed to Medium, and like my content, do consider supporting me by joining Medium via my referral link.