Training a Machine Learning (ML) model is only one step in the ML lifecycle. A model has little purpose if you cannot get predictions out of it, so you must be able to host your trained model for inference. There are a variety of hosting/deployment options for ML, and one of the most popular is TensorFlow Serving.
TensorFlow Serving takes your trained model’s artifacts and hosts them for inference, and the easiest way to use it is with Docker. In this article we’ll walk through training a model and then hosting it with TensorFlow Serving and Docker to serve a REST API for inference. If you would like to jump directly to the code, take a look at this repository.
Table of Contents
- Prerequisites/Setup
- TensorFlow Serving Architecture
- Training & Saving Your Model
- Hosting Your Model With TensorFlow Serving & Docker
- Credits + Additional Resources & Conclusion
Prerequisites/Setup
For this article we will not be diving into any actual model building. We’ll take the popular Boston Housing dataset and train a simple Artificial Neural Network. This article is centered on the infrastructure for hosting ML models, not on optimizing model performance itself.
Make sure Docker is installed and running. To fully follow along, you should also have a basic understanding of Docker commands as well as Python.
NOTE: The dataset was originally published and licensed by Harrison, D. and Rubinfeld, D.L., ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management.
TensorFlow Serving Architecture

The main concept to understand in TensorFlow Serving is the Servable, the object a client uses to perform inference. Typically this Servable is the SavedModelBundle that you create after training your model. Each Servable has an associated version, which you can specify while saving your model. Using versions, you can also serve multiple different models at once if you wish.
The flow generally works as follows:
- Source: Loads the model from the file path that is provided; this becomes the Servable. We’ll specify this path when we start our container.
- Loader: Takes a new model from the Source and contains the functionality for loading and unloading the Servable; an Aspired Version is created based off the new model.
- Model Manager: Maintains the lifecycle of the model, meaning any updates to model versions/metadata. The Manager essentially listens to the Source and fulfills requests if the necessary resources are available (e.g., metadata for a specific version that does or does not exist).
- ServableHandle: This is the interface the client code talks to; here the appropriate API is exposed (REST or gRPC).
For a full deep dive into the TensorFlow Serving architecture, make sure to read the official documentation here.
Training & Saving Your Model
We’ll first train a simple model on the Boston Housing dataset. All of this is provided in a train.py script; make sure to have the following imports installed.
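A minimal sketch of what those imports might look like (the exact versions live in the linked repository; note that load_boston has since been removed from newer scikit-learn releases):

```python
# Core imports for train.py (load_boston is deprecated/removed in scikit-learn >= 1.2,
# so an older scikit-learn version is assumed here).
import tensorflow as tf
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
```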
We can load the Boston dataset directly from Sklearn and split the dataset for training.
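Something along these lines; the test size and random seed are illustrative choices, not necessarily the ones used in the repository:

```python
# Load the Boston Housing data and create a train/test split.
boston = load_boston()
X, y = boston.data, boston.target  # 13 numeric features, median home value target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```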
We then create our neural network and train our model.
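A small feed-forward regression network is enough here; the layer sizes, loss, and epoch count below are illustrative and may differ from the repository’s train.py:

```python
# Simple ANN for regression on the 13 Boston features.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])

model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test))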
The last line of the script is key for serving your model properly.
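Based on the directory layout described next, the save call looks something like this:

```python
# Save in TensorFlow's SavedModel format under a versioned sub-directory,
# which is the layout TensorFlow Serving expects.
model.save("boston_model/0000001")
```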
The “boston_model” directory will serve as the model_name and is where we’ll point TensorFlow Serving to find our trained model artifacts. The sub-directory ‘0000001’ that is created holds your model metadata: assets, variables, keras_metadata.pb, and saved_model.pb. Make sure to run this script and verify that the following directories are created after execution, as shown below.
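Based on the files listed above, the resulting layout should look roughly like this:

```
boston_model/
└── 0000001/
    ├── assets/
    ├── variables/
    ├── keras_metadata.pb
    └── saved_model.pb
```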

Hosting Your Model With TensorFlow Serving & Docker
Make sure to have Docker up and running. The first thing we’ll do is pull the TensorFlow Serving Docker image with the following command.
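This is the standard pull of the official image:

```bash
docker pull tensorflow/serving
```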
After you’ve pulled the image, we need to start the container; for the REST API we need to expose port 8501. Note that gRPC is also supported; for that, expose port 8500.
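A sketch of the run command follows; the container name and the local source path are placeholders you should adjust to your own setup:

```bash
# Bind-mount the local boston_model directory into the container and
# tell TensorFlow Serving which model to load via MODEL_NAME.
docker run -d --name boston-serving \
  -p 8501:8501 \
  --mount type=bind,source=/path/to/your/project/boston_model,target=/models/boston_model \
  -e MODEL_NAME=boston_model \
  -t tensorflow/serving
```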
If you run the following Docker command, you should see the container name that you added displayed.
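```bash
docker ps
```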

Make sure to use the path to where your project lives in the source argument. The "MODEL_NAME" environment variable is the directory you saved your model to; this is where the latest version of your model lives. Note that you can also inject any other environment variables your container needs when running it.

Now that we have a REST endpoint, we can send a sample invocation using a curl command. For a single data point, run the following command.
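A sketch of the request is below; the 13 feature values are arbitrary sample inputs for illustration, not taken from the repository:

```bash
# POST a single data point to the TensorFlow Serving REST predict endpoint.
curl -X POST http://localhost:8501/v1/models/boston_model:predict \
  -d '{"instances": [[0.02, 18.0, 2.3, 0.0, 0.54, 6.5, 65.0, 4.1, 1.0, 296.0, 15.3, 396.9, 4.98]]}'
```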

If you want to send multiple invocations, you can also write a small shell script to iterate over the payload. We can package the sample data point as JSON and feed that to our endpoint, as in the sketch below.
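Here is one way such a script could look; the file name invoke.sh and the loop count of 10 are assumptions for illustration:

```bash
#!/bin/bash
# invoke.sh -- send the same sample payload to the endpoint 10 times.
PAYLOAD='{"instances": [[0.02, 18.0, 2.3, 0.0, 0.54, 6.5, 65.0, 4.1, 1.0, 296.0, 15.3, 396.9, 4.98]]}'

for i in $(seq 1 10); do
  curl -s -X POST http://localhost:8501/v1/models/boston_model:predict -d "$PAYLOAD"
  echo ""
done
```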
You can execute the shell script with the following command and see 10 results returned.
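Assuming the hypothetical invoke.sh name from the sketch above:

```bash
sh invoke.sh
```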
To get even more creative or to load test at a larger scale you can also use packages such as the Python requests library.
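A rough sketch of that idea, reusing the same endpoint and sample payload; the request count of 100 is arbitrary:

```python
# Simple load-test loop against the TensorFlow Serving REST endpoint.
import requests

URL = "http://localhost:8501/v1/models/boston_model:predict"
payload = {"instances": [[0.02, 18.0, 2.3, 0.0, 0.54, 6.5, 65.0,
                          4.1, 1.0, 296.0, 15.3, 396.9, 4.98]]}

for _ in range(100):
    response = requests.post(URL, json=payload)
    print(response.json())
```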
Conclusion
GitHub – RamVegiraju/TF-Serving-Demo: Example of Inference with TF Serving with Docker
For the entire code for this example, access the link above. I hope this article was a good introduction to TensorFlow Serving. It’s becoming more essential than ever for Data Scientists to also understand the infrastructure and hosting side of Machine Learning. There are a number of hosting options out there; I’ve attached my SageMaker Series that I’ve been working on below, along with some other resources that helped me out when I got started with model hosting and inference.
Credits/Additional Resources
Serving Models With TF Serving and Docker
TF Serving With Amazon SageMaker
If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter. If you’re new to Medium, sign up using my Membership Referral.