
Machine Learning’s (ML) value is truly realized in real-world applications when we arrive at Model Hosting and Inference. It’s hard to productionize ML workloads without a highly performant model-serving solution that helps your model scale up and down.
What is a model server, and what is model serving? Think of a model server as roughly the ML world’s equivalent of a web server. It’s not enough to simply throw large amounts of hardware behind a model; you need a communication layer that processes your clients’ requests while efficiently allocating the hardware needed to handle the traffic your application sees. Model servers are also tunable: we can squeeze out latency improvements by controlling aspects such as the choice between gRPC and REST. Popular examples of model servers include TorchServe, TensorFlow Serving, and Nvidia Triton Inference Server.
The one we explore today is Nvidia Triton Inference Server, a highly flexible and performant model-serving solution. Each model server requires model artifacts and inference scripts to be presented in its own specific format that it can understand. In today’s article we take a sample PyTorch model and show how we can host it utilizing Triton Inference Server.
NOTE: This article assumes a basic understanding of Machine Learning and does not delve into any theory behind model building. Fluency in Python and a basic understanding of Docker containers are also assumed. We will be working in a SageMaker Classic Notebook Instance for development, so please create an AWS account if needed (you can also run this sample elsewhere if you prefer).
DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.
Why Triton Inference Server?
Triton Inference Server is an open source model serving solution that has a variety of benefits including the following:
- Framework Support: Triton natively supports a multitude of frameworks such as PyTorch, TensorFlow, ONNX, and custom Python/C++ environments. Triton refers to each of these framework integrations as a "backend". In this example we use the PyTorch backend to host our TorchScript model.
- Dynamic Batching: Inference performance tuning is an iterative experiment. Dynamic batching is one optimization technique where the server groups multiple incoming requests into a single batch. With Triton you can enable dynamic batching and control the maximum batch size to balance throughput and latency.
- Ensembles/Model Pipelines: Via ensembles and model pipelines you can stitch together multiple ML models along with any pre/post processing logic you have into a universal execution flow behind a single container.
- Hardware Support: Triton supports both CPU- and GPU-based workloads. This makes it easy to pair with a hosting platform such as SageMaker Real-Time Inference, where you can host thousands of models utilizing Multi-Model Endpoints with GPUs and Triton as the model serving stack.
Like other model servers, Triton Inference Server has its pros and cons depending on the use case, so it’s essential to choose the right model server based on your model and hosting requirements.
Local Model Setup
For this example, we will be working in a SageMaker Classic Notebook Instance on the PyTorch kernel. The instance type will be a g4dn.4xlarge so that we can run our Triton container on a GPU-based instance.
To get started, we will work with a sample PyTorch linear regression model that we train on some dummy data generated with numpy.
import numpy as np
import torch
import torch.nn as nn

# Dummy data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 1 + 2 * X + np.random.randn(100, 1)

# Convert the numpy arrays to float tensors for PyTorch
X_tensor = torch.from_numpy(X).float()
y_tensor = torch.from_numpy(y).float()

# Model
class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

model = LinearRegression()

# Loss, optimizer, and epoch count (hyperparameter values are illustrative)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
num_epochs = 1000

# Training
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X_tensor)
    loss = criterion(outputs, y_tensor)

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
We then save this model as a TorchScript model for the Triton PyTorch backend and run a sample inference so we can see what an input to the model looks like at inference time.
# save the model as a TorchScript model
torch.jit.save(torch.jit.script(model), 'model.pt')

# Load the saved model
loaded_model = torch.jit.load('model.pt')

# sample inference
test = torch.tensor([[2.5]])
pred = loaded_model(test)
print(pred)  # should be close to 6.0, since the data follows y = 1 + 2x (assuming training converged)
# note: test and pred both have shape [1, 1] -- we will reuse these dims in the Triton config
Now that we have our model.pt and understand how to run inference with this model, we can focus on setting up Triton to host this specific model.
Triton Inference Server Hosting
Before we can bring the Triton Inference Server up, we need to understand what artifacts it requires and the format it expects them to be presented in. We already have our model.pt, which is our serialized model artifact. Triton also expects a config.pbtxt file that essentially defines your serving properties. In this case we define a few fields that are necessary:
- name: This is the model’s name in our model repository; you can also version it in case there are multiple versions of the model.
- platform: This is the backend environment for the server; in this case we define the backend as pytorch_libtorch to set up the environment for our TorchScript model.
- input/output: For Triton we have to define our input/output names, data types, and shapes (you can check these with numpy on a sample input if you are unsure).
Optionally, you can also define more advanced parameters such as enabling dynamic batching, the maximum batch size, and backend-specific variables you would like to tweak for performance testing (a small illustrative fragment of these settings follows the base config below). Our base config.pbtxt looks like the following:
name: "linear_regression_model"
platform: "pytorch_libtorch"
input {
name: "input"
data_type: TYPE_FP32
dims: [ 1, 1 ]
}
output {
name: "output"
data_type: TYPE_FP32
dims: [ 1, 1 ]
}
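As a point of reference for the optional tuning parameters mentioned above, dynamic batching is enabled with a couple of extra entries in this same file. The values below are purely illustrative and are not part of this article’s example (also note that setting max_batch_size changes how Triton interprets your dims, since an implicit batch dimension is added):
max_batch_size: 8
dynamic_batching {
  max_queue_delay_microseconds: 100
}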
Triton also expects this file and the model.pt to be organized in a specific structure before we can start the server. For the PyTorch backend, the following model repository layout is expected for your artifacts:
- models/
  - linear_regression_model/
    - 1/
      - model.pt
    - model.py (optional, not included here)
    - config.pbtxt
  - (optionally, include any other models, for example if you have an ensemble)
We then shift our artifacts into this structure using the following bash commands:
mkdir linear_regression_model
mv config.pbtxt model.pt linear_regression_model
cd linear_regression_model
mkdir 1
mv model.pt 1/
cd ..
To start the Triton Inference Server, ensure you have Docker installed in your environment; this comes pre-installed in SageMaker Classic Notebook Instances (not SageMaker Studio at the time of this article).
To start our container, we first pull the Triton image (release 23.08 at the time of writing) utilizing the following command:
docker pull nvcr.io/nvidia/tritonserver:23.08-py3
We can then use Docker together with the NVIDIA Container Toolkit (for GPU access) to start our Triton Inference Server. Triton exposes three default ports for inference, which we specify in our Docker run command:
- 8000: HTTP REST API requests; we utilize this port in this example
- 8001: [gRPC requests](https://aws.amazon.com/compare/the-difference-between-grpc-and-rest/#:~:text=In%20gRPC%2C%20one%20component%20(the,updates%20data%20on%20the%20server.), this is especially useful for optimizing Computer Vision workloads
- 8002: Metrics and monitoring for Triton Inference Server via Prometheus
Make sure to replace the path provided below with the path to your own model repository directory; in this example it is ‘/home/ec2-user/SageMaker’.
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /home/ec2-user/SageMaker:/models nvcr.io/nvidia/tritonserver:23.08-py3 \
  tritonserver --model-repository=/models --exit-on-error=false --log-verbose=1
Once we run this command, you should see that the Triton Inference Server has started up.
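As a quick sanity check from the notebook (this is not part of the server logs), you can also poll Triton’s standard v2 health endpoint on the HTTP port to confirm the server is ready to accept requests:
import requests

# Triton exposes the KServe v2 health endpoints on the HTTP port (8000)
ready = requests.get("http://localhost:8000/v2/health/ready")
print(ready.status_code)  # 200 indicates the server (and its loaded models) are ready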

We can now make requests to the model server, which we can do in two different ways that we’ll explore:
- Python requests library: here you pass the inference URL for the Triton server’s address and specify your input parameters in the payload.
- Triton Client Library: a client provided by Triton that you instantiate and then call through its built-in APIs. We use Python in this case, but you can also use Java and C++.
For the requests library we pass in the appropriate URL with our model name and version like the following:
import json
import numpy as np
import requests

# sample data
input_data = np.array([[2.5]], dtype=np.float32)

# Specify the model name and version
model_name = "linear_regression_model"  # specified in config.pbtxt
model_version = "1"

# Set the inference URL based on the Triton server's address
url = f"http://localhost:8000/v2/models/{model_name}/versions/{model_version}/infer"

# payload with input params
payload = {
    "inputs": [
        {
            "name": "input",  # what you named the input in config.pbtxt
            "datatype": "FP32",
            "shape": input_data.shape,
            "data": input_data.tolist(),
        }
    ]
}

# sample invoke
response = requests.post(url, data=json.dumps(payload))
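The response body follows the v2 inference protocol, so the prediction comes back inside an "outputs" list in the returned JSON. A minimal way to pull the value out, assuming the request above succeeded, looks like this:
result = response.json()
# each entry in "outputs" mirrors the request format: name, datatype, shape, and flattened data
prediction = result["outputs"][0]["data"]
print(prediction)  # should be close to the local TorchScript prediction for an input of 2.5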
For the Triton Client Library we use the same values we specified in our config file and above, but rely on the HTTP client’s API calls.
import tritonclient.http as httpclient  # pip install tritonclient[http] if needed

# set up the Triton inference client
client = httpclient.InferenceServerClient(url="localhost:8000")

# Triton can infer the input metadata from your config values
inputs = httpclient.InferInput("input", input_data.shape, datatype="FP32")
inputs.set_data_from_numpy(input_data)  # we set a numpy array in this case

# output configuration
outputs = httpclient.InferRequestedOutput("output")

# sample inference
res = client.infer(
    model_name="linear_regression_model", inputs=[inputs], outputs=[outputs]
)
inference_output = res.as_numpy("output")  # convert the response output to a numpy array
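Since we also mapped Triton’s gRPC port (8001) earlier, it’s worth noting that the gRPC client mirrors the HTTP client almost call-for-call. A rough sketch of the equivalent request, assuming the tritonclient[grpc] extra is installed, would look like the following:
import tritonclient.grpc as grpcclient  # pip install tritonclient[grpc] if needed

# the gRPC client targets port 8001 instead of 8000
grpc_client = grpcclient.InferenceServerClient(url="localhost:8001")

grpc_input = grpcclient.InferInput("input", list(input_data.shape), "FP32")
grpc_input.set_data_from_numpy(input_data)
grpc_output = grpcclient.InferRequestedOutput("output")

grpc_res = grpc_client.infer(
    model_name="linear_regression_model", inputs=[grpc_input], outputs=[grpc_output]
)
print(grpc_res.as_numpy("output"))  # should match the result from the HTTP client above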
Additional Resources & Conclusion
triton-inference-server-examples/pytorch-backend/triton-pytorch-ann.ipynb at master ·…
The entire code for the example can be found at the link above. Triton Inference Server is a dynamic model serving option that can be used for advanced ML model hosting. For more Triton-specific examples, please refer to the following Github repository. In upcoming articles we will continue to explore how we can harness different model servers to host various ML models.
As always thank you for reading and feel free to leave any feedback.
If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter. If you’re new to Medium, sign up using my Membership Referral.