Deploy your machine learning models with TensorFlow Serving and Kubernetes

François Paupier
Towards Data Science
8 min read · Jan 25, 2019


Machine learning applications are booming, and yet there are few tools available for data engineers to integrate those powerful models into production systems. Here I discuss how TensorFlow Serving can help you accelerate delivering models to production. This blog post is about serving machine learning models: what does that mean?

"Serving is how you apply an ML model after you've trained it." (Noah Fiedel, software engineer working on TensorFlow Serving)

To illustrate the capabilities of TensorFlow Serving, I will go through the steps of serving an object detection model. You can find all the code related to this article on my GitHub: https://github.com/fpaupier/tensorflow-serving_sidecar

Summary of a machine learning pipeline — here we focus on serving the model

TensorFlow Serving in a nutshell

TensorFlow Serving enables you to serve your machine learning models seamlessly:

  • Deploy a new version of your model and let TensorFlow Serving gracefully finish the current requests while starting to serve new requests with the new model.
  • Separate concerns: data scientists can focus on building great models while Ops can focus on building highly resilient and scalable architectures that serve those models.

Part 1 — Warm up: Set up a local tensorflow server

Before going online, it's good to make sure your server works locally. I'm only giving the big steps here; you can find more documentation in the project README.
Take a look at the setup steps to make sure you get the most out of this tutorial:

  1. git clone https://github.com/fpaupier/tensorflow-serving_sidecar, create a Python 3.6.5 virtual env, and install the dependencies from requirements.txt
  2. Get the TensorFlow Serving Docker image: docker pull tensorflow/serving
  3. Get a model to serve. I use faster_rcnn_resnet101_coco, which performs object detection.
  4. Go to the model directory and rename the SavedModel subdirectory with a version number. Since this is our first version, let's call it 00001 (it has to be digits). We do this because the TensorFlow Serving Docker image looks for folders named with that convention when searching for a model to serve.
  5. Now run the tensorflow server:
# From tensorflow-serving_sidecar/
docker run -t --rm -p 8501:8501 \
-v "$(pwd)/data/faster_rcnn_resnet101_coco_2018_01_28:/models/faster_rcnn_resnet" \
-e MODEL_NAME=faster_rcnn_resnet \
tensorflow/serving &

Just a note before going further:

docker -v arg in our use case

Here we bind the container's port to the localhost. Thus, when we call for inference on localhost:8501, we actually reach the TensorFlow server.
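A quick way to check that the container is up and that the model has loaded is the model status endpoint exposed by TensorFlow Serving's REST API:

# Model status endpoint of the TensorFlow Serving REST API
curl http://localhost:8501/v1/models/faster_rcnn_resnet
# You should see a model_version_status entry with state "AVAILABLE"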

Notice also that we mount our local directory faster_rcnn_resnet101_coco_2018_01_28, where the model is stored, to the container's /models/faster_rcnn_resnet path.

Just keep in mind that, at this point, the saved_model.pb lives solely on your machine, not inside the container.
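If you want to double-check the layout, the mounted folder should contain the numeric version directory with the SavedModel files inside. A quick sanity check (the saved_model_cli part assumes TensorFlow is installed in your venv and that the model uses the standard serve tag and serving_default signature):

# From tensorflow-serving_sidecar/
ls data/faster_rcnn_resnet101_coco_2018_01_28/00001
# Expected: saved_model.pb and a variables/ folder

# Optional: inspect the serving signature
saved_model_cli show --dir data/faster_rcnn_resnet101_coco_2018_01_28/00001 \
--tag_set serve --signature_def serving_default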

6. Perform the client call:

# Don't forget to activate your python3.6.5 venv

# From tensorflow-serving_sidecar/
python client.py --server_url "http://localhost:8501/v1/models/faster_rcnn_resnet:predict" \
--image_path "$(pwd)/object_detection/test_images/image1.jpg" \
--output_json "$(pwd)/object_detection/test_images/out_image1.json" \
--save_output_image "True" \
--label_map "$(pwd)/data/labels.pbtxt"

Go check the path specified by --output_json and enjoy the result (both JSON and JPEG outputs are available).

expected inference with our object detection model
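Curious about the exact inputs and outputs the server expects? The REST API also exposes a metadata endpoint that dumps the model's serving signature:

curl http://localhost:8501/v1/models/faster_rcnn_resnet/metadata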

Great, now that our model works well, let’s deploy it on the cloud.

Part 2 — Serve your machine learning application on a Kubernetes cluster with TensorFlow Serving

In a production setting, you want to be able to scale as the load on your app increases, and you don't want your server to be overwhelmed.

An exhausted tensorflow server directly exposed over the network

To avoid this issue, you will use a Kubernetes cluster to serve your TensorFlow Serving app. The main improvements to expect:

  • The load will be balanced among your replicas without you having to think about it.
  • Do you want to deploy a new model with no downtime? No problem, Kubernetes has your back. Perform a rolling update to progressively serve your new model while gracefully terminating the current requests on the former model.
a tensorflow server application running on many replicas in a k8s cluster, ensuring high availability to users

Let’s dive in

First, we want to create a complete Docker image with the object detection model embedded. Once this is done, we will deploy it on a Kubernetes cluster. I run my example on Google Cloud Platform because its free tier makes it possible to follow this tutorial at no cost. To help you set up your cloud environment on GCP, you can check my tutorial here.

Create a custom TensorFlow Serving Docker image

  1. Run a serving image as a daemon:
docker run -d --name serving_base tensorflow/serving

2. Copy the faster_rcnn_resnet101_coco model data to the container's models/ folder:

# From tensorflow-serving_sidecar/
docker cp $(pwd)/data/faster_rcnn_resnet101_coco_2018_01_28 serving_base:/models/faster_rcnn_resnet
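Before committing, you can check that the model landed where TensorFlow Serving expects it inside the container:

docker exec serving_base ls /models/faster_rcnn_resnet
# Expected: 00001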

3. Commit the container to serve the faster_rcnn_resnet model:

docker commit --change "ENV MODEL_NAME faster_rcnn_resnet" serving_base faster_rcnn_resnet_serving

Note: if you use a different model, change faster_rcnn_resnet in the --change argument accordingly.

faster_rcnn_resnet_serving will be our new serving image. You can check this by running docker images; you should see a new Docker image:

docker images result after creating a custom tensorflow-serving image

4. Stop the serving base container

docker kill serving_base
docker rm serving_base

Great, the next step is to test our brand new faster_rcnn_resnet_serving image.

Test the custom server

Before deploying our app on kubernetes, let’s make sure it works correctly.

  1. Start the server:
docker run -p 8501:8501 -t faster_rcnn_resnet_serving &

Note: make sure you have stopped the previously running server (docker stop <CONTAINER_NAME>); otherwise port 8501 may still be in use.
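If you are not sure which container is holding the port, you can look it up by its published port:

# Find the container still bound to 8501, then stop it
docker ps --filter "publish=8501"
docker stop <CONTAINER_NAME>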

2. We can use the same client code to call the server.

# From tensorflow-serving_sidecar/
python client.py --server_url "http://localhost:8501/v1/models/faster_rcnn_resnet:predict" \
--image_path "$(pwd)/object_detection/test_images/image1.jpg" \
--output_json "$(pwd)/object_detection/test_images/out_image2.json" \
--save_output_image "True" \
--label_map "$(pwd)/data/labels.pbtxt"

We can check that we get the same result as before. OK, let's run this on a Kubernetes cluster now.
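Since it is the exact same model, the predictions should match the local run. A quick way to confirm it is to diff the two JSON outputs generated above:

# From tensorflow-serving_sidecar/
diff object_detection/test_images/out_image1.json object_detection/test_images/out_image2.json
# No output means both servers returned identical predictions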

Deploy our app on Kubernetes

Unless you have already run a project on GCP, I advise you to check the Google Cloud setup steps.

I assume you have created and are logged into a gcloud project named tensorflow-serving.

You will use the container image faster_rcnn_resnet_serving built previously to deploy a serving cluster with Kubernetes in the Google Cloud Platform.

  1. Log in to your project: first list the available projects with gcloud projects list, select the PROJECT_ID of your project, and run:
# Get the PROJECT_ID, not the name
gcloud projects list
# Set the project with the right PROJECT_ID, i.e. for me it is tensorflow-serving-229609
gcloud config set project tensorflow-serving-229609
gcloud auth login

2. Create a container cluster

  • First, we create a Google Kubernetes Engine cluster for the service deployment. Due to the free trial limitation, you cannot use more than two nodes here; you can either upgrade or go with two nodes, which is good enough for our use case. (You are limited to a quota of 8 CPUs in the free trial.)
gcloud container clusters create faster-rcnn-serving-cluster --num-nodes 2 --zone 'us-east1'

You may update the zone arg; you can choose among e.g. europe-west1, asia-east1. You can list all available zones with gcloud compute zones list.
You should see something like this:

kubernetes cluster creation output

3. Set the default cluster for the gcloud container command and pass the cluster credentials to kubectl.

gcloud config set container/cluster faster-rcnn-serving-cluster
gcloud container clusters get-credentials faster-rcnn-serving-cluster --zone 'us-east1'

You should have something like this afterward:

gcloud container clusters get-credentials output

4. Upload the custom TensorFlow Serving Docker image we built previously. Let's push our image to the Google Container Registry so that we can run it on Google Cloud Platform.

Tag the faster_rcnn_resnet_serving image using the Container Registry format and our project id; replace tensorflow-serving-229609 with your PROJECT_ID. Also change the tag at the end; since it's our first version, I set the tag to v0.1.0.

docker tag faster_rcnn_resnet_serving gcr.io/tensorflow-serving-229609/faster_rcnn_resnet_serving:v0.1.0

If you run docker images, you now see an additional gcr.io/tensorflow-serving-229609/faster_rcnn_resnet_serving:v0.1.0 image.

This gcr.io prefix allows us to push the image directly to the Container Registry:

# To do only once
gcloud auth configure-docker
docker push gcr.io/tensorflow-serving-229609/faster_rcnn_resnet_serving:v0.1.0

You have successfully pushed your image to the GCP Container Registry; you can check it online:

docker image successfully pushed on Google Container Registry
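You can also verify the push from the command line (replace the project id with yours):

gcloud container images list --repository gcr.io/tensorflow-serving-229609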

5. Create Kubernetes Deployment and Service

The deployment consists of a single replica of the faster-rcnn inference server, controlled by a Kubernetes Deployment. The replica is exposed externally by a Kubernetes Service along with an external load balancer.

Using a single replica does not really make sense; I only do so to stay within the free tier. Load balancing is useless when you have only one instance to direct queries to. In a production setup, use multiple replicas.
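For reference, once the Deployment below is created and you are past the free-tier limits, scaling out is a one-liner. A sketch, assuming the Deployment is named faster-rcnn-resnet-deployment in your faster_rcnn_resnet_k8s.yaml (check the actual name with kubectl get deployments):

kubectl scale deployment faster-rcnn-resnet-deployment --replicas=3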

We create the Deployment and the Service using the example Kubernetes config faster_rcnn_resnet_k8s.yaml. You simply need to update the Docker image to use in the file: replace the line image: <YOUR_FULL_IMAGE_NAME_HERE> with your image's full name. It looks something like this:

image: gcr.io/tensorflow-serving-229609/faster_rcnn_resnet_serving@sha256:9f7eca6da7d833b240f7c54b630a9f85df8dbdfe46abe2b99651278dc4b13c53

You can find it in your container registry:

find your docker full image name on google container registry

And then run the following command:

# From tensorflow-serving_sidecar/
kubectl create -f faster_rcnn_resnet_k8s.yaml

To check the status, use kubectl get deployments for the whole deployment, kubectl get pods to monitor each replica of your deployment, and kubectl get services for the service.

sanity check for deployment
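The three commands in one place:

# Monitor the Deployment, its pods, and the Service
kubectl get deployments
kubectl get pods
kubectl get services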

It can take a while for everything to be up and running. The service's external IP address is listed next to LoadBalancer Ingress. You can find it with the kubectl describe service command:

kubectl describe service faster-rcnn-resnet-service
find the IP address to query upon to perform inference

Query your online model

And finally, let's test this. We can use the same client code; simply replace the previously used localhost in the --server_url arg with the IP address of the LoadBalancer Ingress found above.

# From tensorflow-serving_sidecar/
python client.py --server_url "http://34.73.137.228:8501/v1/models/faster_rcnn_resnet:predict" \
--image_path "$(pwd)/object_detection/test_images/image1.jpg" \
--output_json "$(pwd)/object_detection/test_images/out_image3.json" \
--save_output_image "True" \
--label_map "$(pwd)/data/labels.pbtxt"
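Remember the rolling update mentioned at the start of Part 2? When you build and push a new image version, you can roll it out with no downtime. A sketch, assuming the Deployment and container names from the example faster_rcnn_resnet_k8s.yaml (check them with kubectl describe deployment); the v0.2.0 tag is hypothetical:

# Roll out a new image version with zero downtime
kubectl set image deployment/faster-rcnn-resnet-deployment \
faster-rcnn-resnet=gcr.io/tensorflow-serving-229609/faster_rcnn_resnet_serving:v0.2.0
# Watch the rollout progress
kubectl rollout status deployment/faster-rcnn-resnet-deployment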

Takeaways

TensorFlow Serving offers a solid basis you can rely on to quickly deploy your models to production with very little overhead.

  • Containerizing machine learning applications for deployment enables a clean separation of concerns between Ops and data scientists.
  • Container orchestration solutions such as Kubernetes, combined with TensorFlow Serving, make it possible to deploy highly available models in minutes, even for people not familiar with distributed computing.
