Deploy TensorFlow models with Istio on Kubernetes

Masroor Hasan
Towards Data Science
13 min read · May 19, 2019


Typical cloud application requirements such as deployment, versioning, scaling, and monitoring are also among the common operational challenges of Machine Learning (ML) services.

This post will focus on building an ML serving infrastructure to continuously update, version, and deploy models.

Infrastructure Stack

In building our ML serving infrastructure, we will set up a Kubernetes cluster in a cloud environment and leverage Istio to handle service-level operations. Next, we will use TensorFlow Serving to deploy and serve a ResNet model hosted on an S3 bucket. Lastly, we will look at how to perform staged canary rollouts of newer model versions and eventually automate the rollout process with Flagger.

At a high-level, our infrastructure stack includes:

  • Kubernetes: open-source container orchestration system for application infrastructure and management.
  • Istio: open-source “service-mesh” to enable operational management of micro-services in distributed environments.
  • TensorFlow Serving: open-source high-performance ML model serving system.
  • S3 Storage: AWS cloud object storage.
  • Flagger: open-source automated canary deployment manager as Kubernetes operator.

Kubernetes Cluster

Kubernetes has done wonders in re-shaping the cloud infrastructure landscape. Spinning up a cluster is supported on multiple environments, with almost all major cloud providers offering managed Kubernetes as hosted solutions.

For this post, we will take the opportunity to test-drive one of the newer solutions on the block, DigitalOcean’s managed Kubernetes. Get started by creating a new DigitalOcean Kubernetes cluster with your chosen datacenter and node-pool configuration.

Download the configuration file and add it to your bash session.

export KUBECONFIG=k8s-1-13-5-do-1-sfo2-1555861262145-kubeconfig.yaml

Check the status of the node and verify that it is healthy and ready to accept workloads.

kubectl get nodes
NAME                                        STATUS    ROLES     AGE       VERSION
k8s-1-13-5-do-1-sfo2-1555861262145-1-msa8   Ready     <none>    57s       v1.13.5

Istio

Istio is an open-source “service mesh” that layers itself transparently onto existing distributed infrastructure.

A “service mesh” is the abstraction for inter-connected services interacting with each other. This kind of abstraction helps reduce the complexity of managing connectivity, security, and observability of applications in a distributed environment.

Istio helps tackle these problems by providing a complete solution with insights and operational control over connected services within the “mesh”. Some of Istio’s core features include:

  • Load balancing on HTTP, gRPC, TCP connections
  • Traffic management control with routing, retry and failover capabilities
  • A monitoring infrastructure that includes metrics, tracing and observability components
  • End to end TLS security

Installing Istio on an existing Kubernetes cluster is pretty simple. For an installation guide, take a look at this excellent post by @nethminiromina:

Create the custom resource definitions (CRDs) from downloaded Istio package directory:

kubectl apply -f install/kubernetes/helm/istio/templates/crds.yaml

Next, deploy the Istio components to the cluster from the packaged “all-in-one” demo manifest:

kubectl apply -f install/kubernetes/istio-demo.yaml

Once deployed, we should see Istio’s control-plane services (Pilot, the ingress and egress gateways, telemetry, and the monitoring add-ons) in the istio-system namespace.
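
A quick way to verify is to list the services and pods in that namespace:

kubectl -n istio-system get svc
kubectl -n istio-system get pods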

Ensure all pods are in the Running state and the Istio services are available. Depending on your installation setup, enable automatic sidecar injection on the default namespace:

kubectl label namespace default istio-injection=enabled --overwrite

Istio Traffic Management

Istio provides easy rules and traffic routing configurations to set up service-level properties like circuit breakers, timeouts, and retries, as well as deployment-level tasks such as A/B testing, canary rollouts, and staged rollouts.

At the heart of Istio traffic management are Pilot and Envoy. Pilot is the central operator that manages service discovery and intelligent traffic routing between all services by translating high-level routing rules and propagating them to the necessary Envoy sidecar proxies.

Envoy is a high-performance proxy used to mediate all inbound and outbound traffic for services in the mesh. It is deployed as a sidecar container alongside every Kubernetes pod within the mesh. Some built-in features of Envoy include:

  • Service discovery
  • Load balancing
  • HTTP and gRPC proxies
  • Staged rollouts with %-based traffic split
  • Rich metrics
  • TLS termination, circuit breakers, fault injection and many more!

We will use Istio’s traffic management and telemetry features to deploy, serve and monitor ML models in our cluster.

Istio Egress and Ingress

Istio decouples traffic management from the underlying infrastructure through simple rule configuration to manage and control the flow of traffic between services.

In order for traffic to flow in and out of our “mesh” we must setup the following Istio configuration resources:

  • Gateway: load-balancer operating at the edge of the “mesh” handling incoming or outgoing HTTP/TCP connections.
  • VirtualService: configuration to manage traffic routing between Kubernetes services within “mesh”.
  • DestinationRule: policy definitions intended for a service after routing.
  • ServiceEntry: additional entry added to Istio’s internal service registry; can be specified for internal or external endpoints.

Configure Egress

By default, Istio-enabled applications are unable to access URLs outside the cluster. Since we will use an S3 bucket to host our ML models, we need to set up a ServiceEntry to allow outbound traffic from our TensorFlow Serving deployment to the S3 endpoint:
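
A minimal sketch of resnet_egress.yaml, assuming the standard s3.amazonaws.com endpoint over HTTPS and HTTP; regional or virtual-hosted-style bucket endpoints may need to be added as extra hosts:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: aws-s3
spec:
  hosts:
  # Add your bucket's regional endpoint here if it differs
  - s3.amazonaws.com
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  - number: 80
    name: http
    protocol: HTTP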

Create the ServiceEntry rule to allow outbound traffic to the S3 endpoint on the defined ports:

kubectl apply -f resnet_egress.yaml
serviceentry "aws-s3" created

Configure Ingress

To allow incoming traffic into our “mesh”, we need to set up an ingress Gateway. Our Gateway will act as a load-balancing proxy, exposing port 31400 to receive traffic:
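
A minimal sketch of resnet_gateway.yaml, assuming plain TCP passthrough on port 31400 (a port the default istio-ingressgateway service already exposes):

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: resnet-serving-gateway
spec:
  # Bind to the default Istio ingress gateway deployment
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 31400
      name: tcp
      protocol: TCP
    hosts:
    - "*"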

Use Istio’s default controller by specifying the label selector istio=ingressgateway, so that our ingress gateway Pod will be the one that receives this Gateway configuration and ultimately exposes the port.

Create the Gateway resource we defined above:

kubectl apply -f resnet_gateway.yaml
gateway "resnet-serving-gateway" created

Tensorflow Serving

TensorFlow Serving provides a flexible, high-performance serving architecture designed to serve ML models over gRPC and REST endpoints.

For this post, we will use a pre-trained, exported ResNet model as our example to deploy on serving infrastructure.

S3 Model Repository

To access ML models, TensorFlow Serving abstracts the model source behind a Loader interface for arbitrary file system paths, with out-of-the-box implementations for GCS and S3 cloud storage file systems.

Create an ml-models-repository S3 bucket using the AWS Console, and set up IAM users and roles with access to the bucket.

Upload the NHWC ResNet-50 v1 SavedModel into the ml-models-repository bucket with the following directory structure:

resnet/
  1/
    saved_model.pb
    variables/
      variables.data-00000-of-00001
      variables.index

To securely read from our bucket, create a Kubernetes Secret with the S3 access credentials:

kubectl create secret generic s3-storage-creds \
--from-literal=access_id=$AWS_ACCESS_KEY_ID \
--from-literal=access_key=$AWS_SECRET_ACCESS_KEY
secret "s3-storage-creds" created

Kubernetes Deployment

The following manifest defines Tensorflow Serving as a Kubernetes Deployment, fronted by a Service to expose the server’s gRPC and REST endpoints:
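
A condensed sketch of resnet_serving_v1.yaml is shown below. The names, labels, ports and version pin follow the notes below; the server flags, config mount path and config file name are assumptions, so treat this as a starting point rather than the exact manifest:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: resnet-serving
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: resnet-serving
        version: v1
    spec:
      containers:
      - name: tensorflow-serving
        image: tensorflow/serving:latest
        command:
        - /usr/bin/tensorflow_model_server
        args:
        - --port=9000
        - --rest_api_port=9001
        - --model_config_file=/config/models.config
        ports:
        - containerPort: 9000
        - containerPort: 9001
        env:
        # Credentials from the s3-storage-creds Secret created earlier.
        # Depending on the bucket's region, AWS_REGION or S3_ENDPOINT may also be needed.
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: s3-storage-creds
              key: access_id
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: s3-storage-creds
              key: access_key
        volumeMounts:
        - name: tf-serving-models-config
          mountPath: /config
      volumes:
      - name: tf-serving-models-config
        configMap:
          name: tf-serving-models-config
---
apiVersion: v1
kind: Service
metadata:
  name: resnet-serving
  labels:
    app: resnet-serving
spec:
  selector:
    app: resnet-serving
  ports:
  - name: grpc
    port: 9000
  - name: http
    port: 9001
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: tf-serving-models-config
data:
  models.config: |
    model_config_list: {
      config: {
        name: "resnet",
        base_path: "s3://ml-models-repository/resnet",
        model_platform: "tensorflow",
        model_version_policy: {
          specific: {
            versions: 1
          }
        }
      }
    }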

Some points to note from the manifest definitions above:

  • Deployment has two labels app=resnet-serving and version=v1.
  • The Service selects and exposes deployment on the app=resnet-serving label for gRPC port 9000 and REST port 9001.
  • The model_config_list protobuf in the ConfigMap defines the model’s base path and version policy; in our case we’ve pinned the server to load version 1 of the model.

Deploy the Tensorflow Serving resources into the cluster:

kubectl apply -f resnet_serving_v1.yaml
deployment "resnet-serving" created
service "resnet-serving" created
configmap "tf-serving-models-config" created

kubectl get pods
NAME                              READY     STATUS    RESTARTS   AGE
resnet-serving-65b954c449-6s8kc   2/2       Running   0          11s

As expected, the Pod shows 2 containers running in it: the main tensorflow-serving container and the istio-proxy sidecar.

Check the tensorflow-serving container logs in the Pod to verify the server is running and has successfully loaded the model specified from our S3 repository:

kubectl logs resnet-serving-65b954c449-6s8kc -c tensorflow-serving
2019-03-30 22:31:23.741392: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: s3://ml-models-repository/resnet/1
2019-03-30 22:31:23.741528: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: s3://ml-models-repository/resnet/1
2019-03-30 22:31:33.864215: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:285] SavedModel load for tags { serve }; Status: success. Took 10122668 microseconds.
2019-03-30 22:31:37.616382: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: resnet version: 1}
2019-03-30 22:31:37.622848: I tensorflow_serving/model_servers/server.cc:313] Running gRPC ModelServer at 0.0.0.0:9000 ...
2019-03-30 22:31:37.626888: I tensorflow_serving/model_servers/server.cc:333] Exporting HTTP/REST API at:localhost:9001 ...

A bug in TensorFlow Serving’s S3 client results in a large amount of warning-log spam. This can be silenced by setting the TF_CPP_MIN_LOG_LEVEL=3 environment variable.

Model Deployment and Canary Rollouts

So far we’ve set up our infrastructure with Kubernetes, Istio and TensorFlow Serving. Now we can start versioning our models, set up routing to specific deployments, and then leverage Istio’s traffic-splitting rules to perform staged canary rollouts of model server deployments.

Setup V1 Routing

We need to set up traffic rules in order for the Gateway to know which services to route to as it receives requests.

The rule is defined with a VirtualService, which routes traffic to destination “in-mesh” services without requiring knowledge of the underlying deployments in the infrastructure. The following VirtualService attaches itself to the Gateway we defined previously and routes 100% of the traffic to the v1 subset of resnet-serving on port 9000. The DestinationRule resource defines the routing policy, in this case the version subsets, for the VirtualService.
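
A sketch of resnet_v1_routing.yaml, assuming the plain-TCP Gateway port from earlier (the exact match rules in the original manifest may differ):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: resnet-serving
spec:
  hosts:
  - "*"
  gateways:
  - resnet-serving-gateway
  tcp:
  - match:
    - port: 31400
    route:
    - destination:
        host: resnet-serving
        port:
          number: 9000
        subset: v1
      weight: 100
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: resnet-serving
spec:
  host: resnet-serving
  subsets:
  - name: v1
    labels:
      version: v1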

Apply the VirtualService and DestinationRule manifest with the following:

kubectl apply -f resnet_v1_routing.yaml
virtualservice "resnet-serving" created
destinationrule "resnet-serving" created

For this post, we haven’t exposed any public load balancers or set up TLS on our cluster. So for now, test traffic can be sent over the port-forwarded gateway port:

kubectl -n istio-system port-forward istio-ingressgateway-5b64fffc9f-xh9lg 31400:31400
Forwarding from 127.0.0.1:31400 -> 31400

Use the following TensorFlow Serving Python gRPC client to make prediction requests with an input image:

python tf_serving_client.py --port=31400 --image=images/001.jpg
name: "resnet"
version {
  value: 1
}
signature_name: "serving_default"
dtype: DT_INT64
tensor_shape {
  dim {
    size: 1
  }
}
int64_val: 228

Run a small load test and observe the p50, p90 and p99 server-side latencies on the Istio Mesh Grafana dashboard.

That latency is a total bummer, and there are several reasons for it:

  • (Relatively) Low compute profile — 2vCPUs per node.
  • The vanilla TensorFlow Serving binary is not optimized for the underlying CPU platform.

Techniques for building a CPU-optimized TensorFlow Serving binary, along with performance tuning for latency and throughput, are explained in the following post on our Mux Engineering Blog.

Model Deployment V2

Say our v1 model is not performing well and we want to deploy a new model version as well as an optimized TensorFlow Serving binary. Changes to model deployments should always be made iteratively, so that the new model’s behavior and performance can be properly tested and validated before being promoted to GA for all clients.

Our new model deployment will use the new ResNet-50 v2 model and an updated, CPU-optimized TensorFlow Serving image.

Upload the ResNet-50 v2 SavedModel to the S3 bucket under the resnet/2/ path, with the same directory hierarchy as before. Then, create a new Deployment and ConfigMap that will load and serve version 2 of the resnet model.

The changes from the previous manifest are the following:

diff --git a/tf_serving.yaml b/tf_serving_v2.yaml
index 90d133d..05047a3 100644
--- a/tf_serving.yaml
+++ b/tf_serving_v2.yaml
@@ -1,10 +1,10 @@
 apiVersion: extensions/v1beta1
 kind: Deployment
 metadata:
-  name: resnet-serving
+  name: resnet-serving-v2

@@ -13,7 +13,7 @@ spec:
   metadata:
     labels:
       app: resnet-serving
-      version: v1
+      version: v2
   spec:
     containers:
     - name: tensorflow-serving
-      image: tensorflow/serving:latest
+      image: masroorhasan/tensorflow-serving:cpu
@@ -55,32 +55,15 @@ spec:
   volumes:
   - name: tf-serving-models-config
     configMap:
-      name: tf-serving-models-config
+      name: tf-serving-models-config-v2

 ---
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: tf-serving-models-config
+  name: tf-serving-models-config-v2

@@ -90,19 +73,8 @@ data:
       model_platform: "tensorflow",
       model_version_policy: {
         specific: {
-          versions: 1
+          versions: 2

The full manifest for the v2 Deployment and ConfigMap can be found here. Some important parts to note:

  • New deployment has updated version label version=v2.
  • Updated Docker image masroorhasan/tensorflow-serving:cpu is a pre-built CPU optimized binary.
  • The ConfigMap is also bumped with a new version policy to specifically pull version 2 of the model.

Apply the deployment to the cluster:

kubectl apply -f resnet_serving_v2.yaml
deployment "resnet-serving-v2" created
configmap "tf-serving-models-config-v2" created

Note that we did not have to update the resnet-serving Service, which at this point fronts both deployments via the label selector app=resnet-serving.

Setup V2 Canary Routing

Now that we have our new model deployment, we would like to gradually roll it out to a subset of the users.

This can be done by updating our VirtualService to route a small percentage of traffic to the v2 subset.

We will be cautious and update our VirtualService to route 30% of incoming requests to the v2 model deployment:
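
A sketch of resnet_v2_canary.yaml; compared with the v1 routing rules, the VirtualService now splits traffic 70/30 between the subsets and the DestinationRule gains a v2 subset (the exact match rules are assumptions, as before):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: resnet-serving
spec:
  hosts:
  - "*"
  gateways:
  - resnet-serving-gateway
  tcp:
  - match:
    - port: 31400
    route:
    - destination:
        host: resnet-serving
        port:
          number: 9000
        subset: v1
      weight: 70
    - destination:
        host: resnet-serving
        port:
          number: 9000
        subset: v2
      weight: 30
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: resnet-serving
spec:
  host: resnet-serving
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2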

kubectl replace -f resnet_v2_canary.yaml
virtualservice "resnet-serving" replaced
destinationrule "resnet-serving" replaced

Run another load test and observe the Istio Mesh dashboard for latency metrics across two versions of the resnet-serving workloads.

The requests-by-destination graph shows a similar pattern, with traffic split between the resnet-serving and resnet-serving-v2 deployments.

Setup V2 Routing

Once the canary version satisfies the model behavior and performance thresholds, the deployment can be promoted to GA for all users. The following VirtualService and DestinationRule are configured to route 100% of traffic to v2 of our model deployment.
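
A sketch of the VirtualService in resnet_v2_routing.yaml; the DestinationRule is unchanged from the canary step, and the single route now sends all traffic to the v2 subset:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: resnet-serving
spec:
  hosts:
  - "*"
  gateways:
  - resnet-serving-gateway
  tcp:
  - match:
    - port: 31400
    route:
    - destination:
        host: resnet-serving
        port:
          number: 9000
        subset: v2
      weight: 100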

Update the routing rules to promote v2 to be GA to all incoming traffic:

kubectl replace -f resnet_v2_routing.yaml
virtualservice "resnet-serving" replaced
destinationrule "resnet-serving" replaced

While a load test is running, the Mesh Dashboard shows traffic moving completely away from v1 and flowing into v2 of our deployment.

Automating Canary Releases

So far we’ve done a gradual, staged deployment of a new model version to the cluster and monitored model performance along the way. However, manually updating traffic rules is not in the spirit of operational scalability.

Istio traffic routing configuration can be used to perform canary releases by programmatically adjusting the relative weighting of traffic between downstream service versions. In this section, we will automate canary deployments using an open-source progressive canary deployment tool by Weaveworks: Flagger.

Flagger is a Kubernetes operator that automates iterative deployment and promotion of canary releases using Istio and App Mesh traffic routing features based on custom Prometheus metrics.

Let’s clone the Flagger repository and create the service accounts, CRDs and the Flagger operator:

git clone git@github.com:weaveworks/flagger.git

Create the service account, CRDs and operator in that order:

# service accounts
kubectl apply -f flagger/artifacts/flagger/account.yaml
# CRD
kubectl apply -f flagger/artifacts/flagger/crd.yaml
# Deployment
kubectl apply -f flagger/artifacts/flagger/deployment.yaml

The Flagger deployment should be created in the istio-system namespace. This is the main operator that will act on custom resources denoted with kind: Canary.

Flagger takes a Kubernetes deployment, like resnet-serving, and creates a series of resources: Kubernetes deployments (primary and canary), ClusterIP services, and Istio virtual services.

Since a lot of the manual traffic routing will now be taken care of by the Flagger operator, we need to clean up the previously created Istio traffic-routing resources and the v2 serving deployment from our cluster.

# clean up routing
kubectl delete -f resnet_v1_routing.yaml
kubectl delete -f resnet_serving_v2.yaml
kubectl delete -f resnet_v2_routing.yaml
# clean up svc
kubectl delete svc/resnet-serving

Define a Flagger Canary custom resource that will reference our resnet-serving model deployment:
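
A sketch of flagger_canary.yaml in the style of Flagger’s Istio examples; the analysis interval, step weights and metric thresholds below are illustrative assumptions, not values from the original manifest:

apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: resnet-serving
  namespace: default
spec:
  # Deployment that Flagger will manage as primary/canary pairs
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resnet-serving
  service:
    port: 9000
    gateways:
    - resnet-serving-gateway
    hosts:
    - "*"
  canaryAnalysis:
    # Check metrics every minute; roll back after 5 failed checks
    interval: 1m
    threshold: 5
    # Shift traffic to the canary in 10% steps, up to 50%
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      threshold: 99
      interval: 1m
    - name: request-duration
      threshold: 500
      interval: 1m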

Apply the manifest to create the Canary resource:

kubectl apply -f flagger_canary.yaml
canary "resnet-serving" created

As mentioned, this will initialize the Canary and create a series of Kubernetes resources.
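
To inspect the canary status and the resources Flagger generated:

kubectl get canary resnet-serving
kubectl get deployments,services,virtualservices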

To trigger an automated canary promotion, trigger a deployment by updating the container image:

kubectl set image deployment/resnet-serving \
tensorflow-serving=masroorhasan/tensorflow-serving:cpu

At this point, the Flagger operator will detect the change in the deployment revision and queue a new rollout.

When the canary deployment is promoted to GA, Flagger will scale down the old deployment automatically. Applying changes to the deployment during a rollout will trigger Flagger to restart the analysis and redeploy.

Wrapping Up

In this post, we looked at building an ML serving environment by deploying TensorFlow Serving on Kubernetes. Then, we leveraged Istio’s intelligent traffic routing features to manage model deployments in a staged-rollout fashion. Lastly, we deployed the Flagger operator on our cluster to automate the process of staged canary rollouts.

While we have chosen to focus on the serving aspect of ML models, there are many frameworks and tools that help bring together end-to-end ML pipelines. Some of them include (among many others):

  • Kubeflow: collection of tools, frameworks to provide cohesive training and serving infrastructure on Kubernetes.
  • SeldonCore: provides a runtime ML graph engine to manage serving ML models on Kubernetes.
  • Clipper: a prediction serving system for data scientists to get started on Docker-only or Kubernetes environments.
  • Pipeline.ai: brings together many popular open-source tools to build and experiment with end-to-end machine learning pipelines.
  • Cloud ML services: Google ML Engine, AWS SageMaker, and Azure ML service
