Deploy TensorFlow models with Istio on Kubernetes
The typical cloud application requirements of deployment, versioning, scaling, and monitoring are also among the common operational challenges of Machine Learning (ML) services.
This post will focus on building an ML serving infrastructure to continuously update, version, and deploy models.
Infrastructure Stack
In building our ML serving infrastructure, we will set up a Kubernetes cluster in a cloud environment and leverage Istio to handle service-level operations. Next, we will use TensorFlow Serving to deploy and serve a ResNet model hosted on an S3 bucket. Lastly, we will take a look at how to perform staged canary rollouts of newer model versions and eventually automate the rollout process with Flagger.
At a high-level, our infrastructure stack includes:
- Kubernetes: open-source container orchestration system for application infrastructure and management.
- Istio: open-source “service-mesh” to enable operational management of micro-services in distributed environments.
- TensorFlow Serving: open-source high-performance ML model serving system.
- S3 Storage: AWS cloud object storage.
- Flagger: open-source automated canary deployment manager as Kubernetes operator.
Kubernetes Cluster
Kubernetes has done wonders in re-shaping the cloud infrastructure landscape. Spinning up a cluster is supported on multiple environments, with almost all major cloud providers offering managed Kubernetes as hosted solutions.
For this post, we will take the opportunity to test-drive one of the newest solutions on the block, DigitalOcean’s managed Kubernetes. Get started by creating a new DigitalOcean Kubernetes cluster with your chosen datacenter and node pool configuration.
Download the cluster configuration file and export it in your bash session:
export KUBECONFIG=k8s-1-13-5-do-1-sfo2-1555861262145-kubeconfig.yaml
Check the status of the node and verify that it is healthy and ready to accept workloads.
kubectl get nodes
NAME                                        STATUS   ROLES    AGE   VERSION
k8s-1-13-5-do-1-sfo2-1555861262145-1-msa8   Ready    <none>   57s   v1.13.5
Istio
Istio is an open-source “service mesh” that layers itself transparently onto existing distributed infrastructure.
A “service mesh” is the abstraction for inter-connected services interacting with each other. This kind of abstraction helps reduce the complexity of managing connectivity, security, and observability of applications in a distributed environment.
Istio helps tackle these problems by providing a complete solution with insights and operational control over connected services within the “mesh”. Some of Istio’s core features include:
- Load balancing on HTTP, gRPC, TCP connections
- Traffic management control with routing, retry and failover capabilities
- A monitoring infrastructure that includes metrics, tracing and observability components
- End to end TLS security
Installing Istio on an existing Kubernetes cluster is pretty simple. For a step-by-step installation guide, take a look at this excellent post by @nethminiromina.
Create the custom resource definitions (CRDs) from downloaded Istio package directory:
kubectl apply -f install/kubernetes/helm/istio/templates/crds.yaml
Next deploy Istio operator resources to the cluster from the packaged “all-in-one” manifest:
kubectl apply -f install/kubernetes/istio-demo.yaml
Once deployed, we should see the following list of services in the istio-system namespace:

Ensure all pods are in Running state and the above services are available. Depending on your installation setup, enable side-car injection on the default namespace:
kubectl label namespace default istio-injection=enabled --overwrite
Istio Traffic Management
Istio provides simple rule and traffic routing configurations to set up service-level properties like circuit breakers, timeouts, and retries, as well as deployment-level tasks such as A/B testing, canary rollouts, and staged rollouts.
At the heart of Istio traffic management are Pilot and Envoy. Pilot is the central operator that manages service discovery and intelligent traffic routing between all services, translating high-level routing rules and propagating them to the necessary Envoy side-car proxies.
Envoy is a high-performance proxy used to mediate all inbound and outbound traffic for services in the mesh. It is deployed as a side-car container with all Kubernetes pods within the mesh. Some built-in features of Envoy include:
- Service discovery
- Load balancing
- HTTP and gRPC proxies
- Staged rollouts with %-based traffic split
- Rich metrics
- TLS termination, circuit breakers, fault injection and many more!
We will use Istio’s traffic management and telemetry features to deploy, serve and monitor ML models in our cluster.
Istio Egress and Ingress
Istio de-couples traffic management from infrastructure with easy rules configuration to manage and control the flow of traffic between services.
In order for traffic to flow in and out of our “mesh” we must setup the following Istio configuration resources:
- Gateway: a load balancer operating at the edge of the “mesh”, handling incoming or outgoing HTTP/TCP connections.
- VirtualService: configuration to manage traffic routing between Kubernetes services within the “mesh”.
- DestinationRule: policy definitions applied to a service’s traffic after routing.
- ServiceEntry: an additional entry to Istio’s internal service registry; can be specified for internal or external endpoints.
Configure Egress
By default, Istio-enabled applications are unable to access URLs outside the cluster. Since we will use an S3 bucket to host our ML models, we need to set up a ServiceEntry to allow outbound traffic from our TensorFlow Serving deployment to the S3 endpoint:
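The egress manifest itself isn’t shown inline here; a minimal sketch of resnet_egress.yaml might look like the following (the us-west-2 host is an assumption; substitute your bucket’s regional S3 endpoint):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: aws-s3
spec:
  hosts:
  - s3.us-west-2.amazonaws.com   # assumption: use your bucket's regional endpoint
  location: MESH_EXTERNAL        # the endpoint lives outside the mesh
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS
```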
Create the ServiceEntry rule to allow outbound traffic to the S3 endpoint on the defined ports:
kubectl apply -f resnet_egress.yaml
serviceentry "aws-s3" created
Configure Ingress
To allow incoming traffic into our “mesh”, we need to set up an ingress Gateway. Our Gateway will act as a load balancing proxy, exposing port 31400 to receive traffic:
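The gateway manifest isn’t shown inline here; based on the name and port described above, a sketch of resnet_gateway.yaml might look like (the TCP protocol is an assumption for carrying gRPC traffic):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: resnet-serving-gateway
spec:
  selector:
    istio: ingressgateway   # bind to Istio's default ingress gateway controller
  servers:
  - port:
      number: 31400
      name: tcp
      protocol: TCP         # assumption: plain TCP to carry gRPC requests
    hosts:
    - "*"
```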
We use Istio’s default controller by specifying the label selector istio=ingressgateway, so that our ingress gateway Pod will be the one that receives this gateway configuration and ultimately exposes the port.
Create the Gateway resource we defined above:
kubectl apply -f resnet_gateway.yaml
gateway "resnet-serving-gateway" created
Tensorflow Serving
Tensorflow Serving provides a flexible ML serving architecture designed to serve ML models on gRPC/REST endpoints.
For this post, we will use a pre-trained, exported ResNet model as our example to deploy on serving infrastructure.
S3 Model Repository
To access ML models, TensorFlow Serving abstracts model loading behind a file-system interface for arbitrary paths, with out-of-the-box implementations for GCS and S3 cloud storage.
Create an ml-models-repository S3 bucket using the AWS Console, and set up IAM users and roles on the bucket.
Upload the NHWC ResNet-50 v1 model into the ml-models-repository bucket, with the following directory structure:
resnet/
1/
saved_model.pb
variables/
variables.data-00000-of-00001
variables.index
To securely read from our bucket, create a Kubernetes Secret with the S3 access credentials:
kubectl create secret generic s3-storage-creds \
    --from-literal=access_id=$AWS_ACCESS_KEY_ID \
    --from-literal=access_key=$AWS_SECRET_ACCESS_KEY
secret "s3-storage-creds" created
Kubernetes Deployment
The following manifest defines TensorFlow Serving as a Kubernetes Deployment, fronted by a Service to expose the server’s gRPC and REST endpoints:
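The full manifest was embedded externally; a condensed sketch of resnet_serving_v1.yaml, with hypothetical server flags and the S3 credentials wired in from the Secret created earlier, might look like:

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: resnet-serving
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: resnet-serving
        version: v1
    spec:
      containers:
      - name: tensorflow-serving
        image: tensorflow/serving:latest
        args:                                # assumed flags for the model server
        - --port=9000
        - --rest_api_port=9001
        - --model_config_file=/config/models.config
        ports:
        - containerPort: 9000
        - containerPort: 9001
        env:
        - name: AWS_ACCESS_KEY_ID            # S3 credentials from the Secret above
          valueFrom:
            secretKeyRef:
              name: s3-storage-creds
              key: access_id
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: s3-storage-creds
              key: access_key
        volumeMounts:
        - name: tf-serving-models-config
          mountPath: /config
      volumes:
      - name: tf-serving-models-config
        configMap:
          name: tf-serving-models-config
---
apiVersion: v1
kind: Service
metadata:
  name: resnet-serving
spec:
  selector:
    app: resnet-serving
  ports:
  - name: grpc-tf-serving
    port: 9000
  - name: http-tf-serving
    port: 9001
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: tf-serving-models-config
data:
  models.config: |
    model_config_list: {
      config: {
        name: "resnet",
        base_path: "s3://ml-models-repository/resnet",
        model_platform: "tensorflow",
        model_version_policy: {
          specific: {
            versions: 1
          }
        }
      }
    }
```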
Some points to note from the manifest definitions above:
- The Deployment has two labels: app=resnet-serving and version=v1.
- The Service selects and exposes the deployment on the app=resnet-serving label, with gRPC port 9000 and REST port 9001.
- The model_config_list protobuf in the ConfigMap can be used to define the model path and version policies; in our case we’ve pinned the model to load version 1.
Deploy the Tensorflow Serving resources into the cluster:
kubectl apply -f resnet_serving_v1.yaml
deployment "resnet-serving" created
service "resnet-serving" created
configmap "tf-serving-models-config" created
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
resnet-serving-65b954c449-6s8kc   2/2     Running   0          11s
As expected, the Pod shows 2 containers running in it: the main tensorflow-serving container and the istio-proxy side-car.
Check the tensorflow-serving container logs in the Pod to verify the server is running and has successfully loaded the model from our S3 repository:
kubectl logs resnet-serving-65b954c449-6s8kc -c tensorflow-serving
2019-03-30 22:31:23.741392: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: s3://ml-models-repository/resnet/1
2019-03-30 22:31:23.741528: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: s3://ml-models-repository/resnet/1
2019-03-30 22:31:33.864215: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:285] SavedModel load for tags { serve }; Status: success. Took 10122668 microseconds.
2019-03-30 22:31:37.616382: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: resnet version: 1}
2019-03-30 22:31:37.622848: I tensorflow_serving/model_servers/server.cc:313] Running gRPC ModelServer at 0.0.0.0:9000 ...
2019-03-30 22:31:37.626888: I tensorflow_serving/model_servers/server.cc:333] Exporting HTTP/REST API at:localhost:9001 ...
A bug with the S3 client in TensorFlow Serving results in a large spam of warning logs. This can be turned off by setting the TF_CPP_MIN_LOG_LEVEL=3 environment variable.
Model Deployment and Canary Rollouts
So far we’ve set up our infrastructure with Kubernetes, Istio, and TensorFlow Serving. Now we can start versioning our models, set up routing to specified deployments, and then leverage Istio’s traffic-splitting rules to perform staged canary rollouts of model server deployments.
Setup V1 Routing
We need to set up traffic rules in order for the Gateway to know which services to route to as it receives requests.

The rule is defined with a VirtualService, which routes to destination “in-mesh” services without knowledge of the underlying deployments in the infrastructure. The following VirtualService attaches itself to the Gateway we defined previously and routes 100% of the traffic to the v1 subset of resnet-serving on port 9000.
The DestinationRule resource is used to define the routing policy for the VirtualService.
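The routing manifest isn’t shown inline here; a sketch of resnet_v1_routing.yaml consistent with the description above (TCP routing on the gateway port to the v1 subset) might look like:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: resnet-serving
spec:
  hosts:
  - "*"
  gateways:
  - resnet-serving-gateway
  tcp:
  - match:
    - port: 31400               # traffic arriving on the gateway port
    route:
    - destination:
        host: resnet-serving
        port:
          number: 9000          # the service's gRPC port
        subset: v1
      weight: 100               # 100% of traffic to v1
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: resnet-serving
spec:
  host: resnet-serving
  subsets:
  - name: v1
    labels:
      version: v1               # pods carrying the version=v1 label
```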
Apply the VirtualService and DestinationRule manifests:
kubectl apply -f resnet_v1_routing.yaml
virtualservice "resnet-serving" created
destinationrule "resnet-serving" created
For this post, we haven’t exposed any public load balancers or set up TLS on our cluster. So for now, test traffic can be sent over the port-forwarded gateway port:
kubectl -n istio-system port-forward istio-ingressgateway-5b64fffc9f-xh9lg 31400:31400
Forwarding from 127.0.0.1:31400 -> 31400
Use the following Tensorflow Serving python gRPC client to make prediction requests with an input image:
python tf_serving_client.py --port=31400 --image=images/001.jpg
name: "resnet"
version {
  value: 1
}
signature_name: "serving_default"
dtype: DT_INT64
tensor_shape {
  dim {
    size: 1
  }
}
int64_val: 228
Run a small load test and observe p50, p90 and p99 server-side latency on Istio Mesh Grafana dashboard:
That latency is a total bummer, and there are a couple of reasons for it:
- (Relatively) Low compute profile — 2vCPUs per node.
- Vanilla Tensorflow Serving binary not optimized for underlying CPU platform.
Techniques for building a CPU-optimized TensorFlow Serving binary, along with performance tuning for latency/throughput, are explained in a follow-up post on our Mux Engineering Blog.
Model Deployment V2
Say our v1 model is not performing well, and we want to deploy a new model version along with an optimized TensorFlow Serving binary. Changes to model deployments should always be made iteratively, so that the new model’s behavior and performance can be properly tested and validated before being promoted to GA for all clients.
Our new model deployment will use the new ResNet-50 v2 model and the updated CPU-optimized TensorFlow Serving image.

Upload the ResNet-50 v2 SavedModel to the S3 bucket under the resnet/2/ path, using the same directory hierarchy as before. Then, create a new Deployment and ConfigMap that will load and serve version 2 of the resnet model.
The changes from previous manifest are the following:
diff --git a/tf_serving.yaml b/tf_serving_v2.yaml
index 90d133d..05047a3 100644
--- a/tf_serving.yaml
+++ b/tf_serving_v2.yaml
@@ -1,10 +1,10 @@
 apiVersion: extensions/v1beta1
 kind: Deployment
 metadata:
-  name: resnet-serving
+  name: resnet-serving-v2
@@ -13,7 +13,7 @@ spec:
   metadata:
     labels:
       app: resnet-serving
-      version: v1
+      version: v2
   spec:
     containers:
     - name: tensorflow-serving
-      image: tensorflow/serving:latest
+      image: masroorhasan/tensorflow-serving:cpu
@@ -55,32 +55,15 @@ spec:
   volumes:
   - name: tf-serving-models-config
     configMap:
-      name: tf-serving-models-config
+      name: tf-serving-models-config-v2
 ---
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: tf-serving-models-config
+  name: tf-serving-models-config-v2
@@ -90,19 +73,8 @@ data:
       model_platform: "tensorflow",
       model_version_policy: {
         specific: {
-          versions: 1
+          versions: 2
The full manifests for the v2 Deployment and ConfigMap can be found here. Some important points to note:

- The new Deployment has an updated version label, version=v2.
- The updated Docker image masroorhasan/tensorflow-serving:cpu is a pre-built CPU-optimized binary.
- The ConfigMap is also bumped with a new version policy to specifically pull version 2 of the model.
Apply the deployment to the cluster:
kubectl apply -f resnet_serving_v2.yaml
deployment "resnet-serving-v2" created
configmap "tf-serving-models-config-v2" created
Note that we did not have to update the resnet-serving Service, which at this point fronts both deployments via the label selector app=resnet-serving.
Setup V2 Canary Routing
Now that we have our new model deployment, we would like to gradually roll it out to a subset of the users.
This can be done by updating our VirtualService to route a small percentage of traffic to the v2 subset.
We will be cautious and update our VirtualService to route 30% of incoming requests to the v2 model deployment:
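The canary manifest isn’t shown inline here; a sketch of resnet_v2_canary.yaml with the 70/30 split might look like the following (only the route weights and the new v2 subset change from the v1 routing):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: resnet-serving
spec:
  hosts:
  - "*"
  gateways:
  - resnet-serving-gateway
  tcp:
  - match:
    - port: 31400
    route:
    - destination:
        host: resnet-serving
        port:
          number: 9000
        subset: v1
      weight: 70                # 70% of traffic stays on v1
    - destination:
        host: resnet-serving
        port:
          number: 9000
        subset: v2
      weight: 30                # 30% canary traffic to v2
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: resnet-serving
spec:
  host: resnet-serving
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2               # matches the new deployment's version label
```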
kubectl replace -f resnet_v2_canary.yaml
virtualservice "resnet-serving" replaced
destinationrule "resnet-serving" replaced
Run another load test and observe the Istio Mesh dashboard for latency metrics across the two versions of the resnet-serving workloads.

The requests-by-destination panel shows a similar pattern, with traffic split between the resnet-serving and resnet-serving-v2 deployments.
Setup V2 Routing
Once the canary version satisfies the model behavior and performance thresholds, the deployment can be promoted to GA for all users. The following VirtualService and DestinationRule are configured to route 100% of traffic to v2 of our model deployment.
Update the routing rules to promote v2 to GA for all incoming traffic:
kubectl replace -f resnet_v2_routing.yaml
virtualservice "resnet-serving" replaced
destinationrule "resnet-serving" replaced
While a load test is running, the Mesh dashboard will show traffic completely moved away from v1 and flowing into v2 of our deployment.
Automating Canary Releases
So far we’ve done a gradual, staged deployment of a new model version to the cluster while monitoring model performance. However, manually updating traffic rules is not in the spirit of operational scalability.
Istio traffic routing configuration can be used to perform canary releases by programmatically adjusting the relative weighting of traffic between downstream service versions. In this section, we will automate canary deployments using an open-source progressive canary deployment tool by Weaveworks: Flagger.
Flagger is a Kubernetes operator that automates iterative deployment and promotion of canary releases using Istio and App Mesh traffic routing features based on custom Prometheus metrics.
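To make the mechanics concrete before we install anything: Flagger shifts traffic to the canary in increments of stepWeight per analysis interval, up to maxWeight, promoting or rolling back based on the metric checks. The following is a small, hypothetical Python sketch of that weight progression (not Flagger’s actual code):

```python
def canary_weights(step_weight, max_weight):
    """Simulate Flagger's staged traffic shift: the canary weight rises by
    step_weight per analysis interval until it reaches max_weight, after
    which the release is promoted (or rolled back if metric checks fail)."""
    weights = []
    weight = 0
    while weight < max_weight:
        weight = min(weight + step_weight, max_weight)
        weights.append((100 - weight, weight))  # (primary %, canary %)
    return weights

# With stepWeight=10 and maxWeight=50, traffic shifts in five stages:
print(canary_weights(10, 50))
# → [(90, 10), (80, 20), (70, 30), (60, 40), (50, 50)]
```

Each tuple is one analysis interval; if any interval fails the Prometheus checks more than the configured threshold, Flagger routes all traffic back to the primary.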
Let’s clone the Flagger repository:
git clone git@github.com:weaveworks/flagger.git
Create the service account, CRDs and operator in that order:
# service accounts
kubectl apply -f flagger/artifacts/flagger/account.yaml

# CRD
kubectl apply -f flagger/artifacts/flagger/crd.yaml

# Deployment
kubectl apply -f flagger/artifacts/flagger/deployment.yaml
The Flagger deployment should be created in the istio-system namespace. This is the main operator that acts on custom resources of kind: Canary.
Flagger takes a Kubernetes deployment, like resnet-serving, and creates a series of resources including Kubernetes deployments (primary vs. canary), a ClusterIP service, and Istio virtual services.
Since much of the manual traffic routing will now be taken care of by the Flagger operator, we need to clean up the previously created Istio traffic routing resources and the v2 serving deployment from our cluster.
# clean up routing
kubectl delete -f resnet_v1_routing.yaml
kubectl delete -f resnet_serving_v2.yaml
kubectl delete -f resnet_v2_routing.yaml

# clean up svc
kubectl delete svc/resnet-serving
Define a Flagger Canary custom resource that references our resnet-serving model deployment:
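The Canary manifest isn’t shown inline here; a sketch of flagger_canary.yaml, assuming Flagger’s v1alpha3 API and illustrative analysis values, might look like:

```yaml
apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: resnet-serving
  namespace: default
spec:
  targetRef:                      # the deployment Flagger manages
    apiVersion: apps/v1
    kind: Deployment
    name: resnet-serving
  service:
    port: 9000                    # gRPC port exposed by the generated service
    gateways:
    - resnet-serving-gateway
    hosts:
    - "*"
  canaryAnalysis:
    interval: 1m                  # analysis interval (illustrative)
    threshold: 5                  # failed checks before rollback
    maxWeight: 50                 # max canary traffic before promotion
    stepWeight: 10                # traffic increment per interval
    metrics:
    - name: request-success-rate
      threshold: 99               # minimum success rate (%)
      interval: 1m
```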
Apply the manifest to create the Canary resource:
kubectl apply -f flagger_canary.yaml
canary "resnet-serving" created
As mentioned, this will initialize the Canary and create a series of Kubernetes resources.
To trigger an automated canary promotion, trigger a deployment by updating the container image:
kubectl set image deployment/resnet-serving \
tensorflow-serving=masroorhasan/tensorflow-serving:cpu
At this point, the Flagger operator will detect the change in the deployment revision and queue a new rollout:
When the canary deployment is promoted to GA, Flagger will scale down the old deployment automatically. Applying changes to the deployment during a rollout will trigger Flagger to restart the analysis and re-deploy.
Wrapping Up
For this post, we looked at building an ML serving environment by deploying TensorFlow Serving on Kubernetes infrastructure. Then, we leveraged Istio’s intelligent traffic routing features to manage model deployments in a staged rollout fashion. Lastly, we deployed the Flagger operator on our cluster to automate the process of staged canary rollouts.
While we have chosen to focus on the serving aspect of ML models, there are many frameworks and tools that help bring together end-to-end ML pipelines, including:
- Kubeflow: a collection of tools and frameworks that provide cohesive training and serving infrastructure on Kubernetes.
- SeldonCore: provides a runtime ML graph engine to manage serving ML models on Kubernetes.
- Clipper: a prediction serving system for data scientists to get started on Docker-only or Kubernetes environments.
- Pipeline.ai: brings together many popular open-source tools to build and experiment with end-to-end machine learning pipelines.
- Cloud ML services: Google ML Engine, AWS SageMaker, and Azure ML service
Thank you for reading, and I hope this post helps with building TensorFlow model deployments on Kubernetes. I’d love to hear about your ML serving infrastructure setup; feel free to reach out on Twitter.