
KServe: Highly scalable machine learning deployment with Kubernetes

Kubernetes model inferencing made easy.

Image sourced from KServe

In the wake of the release of ChatGPT, it is becoming increasingly difficult to avoid technologies that leverage machine learning. From text prediction in your messaging app to facial recognition on your smart doorbell, machine learning (ML) can be found in almost every piece of tech we use today.

How machine learning technologies are delivered to consumers is one of the many challenges organisations have to address during development. The deployment strategy of an ML product has a significant impact on its end users; it can mean the difference between Siri on your iPhone and ChatGPT in your web browser.

Behind the sleek user interface and overly assertive chat dialogues of ChatGPT hide the complex mechanisms required to deploy the large language model. ChatGPT is built on a highly scalable framework designed to deliver and support the model during its exponential adoption. In reality, the ML model itself makes up only a small proportion of the whole project. Such projects are often cross-disciplinary and require expertise in data engineering, data science and software development. Therefore, frameworks that simplify the model deployment process are becoming increasingly vital in delivering models to production, helping organisations save time and money.

Without the proper operational framework to support and manage ML models, organisations will often hit bottlenecks when attempting to scale the number of ML models in production.

While no single tool has emerged as a clear winner in the highly saturated market of MLOps toolkits, KServe is becoming an increasingly popular option for helping organisations meet the scalability requirements of their ML models.

Note: I am not affiliated with KServe nor have been sponsored to write this article.

What is KServe?

KServe is a highly scalable machine learning deployment toolkit for Kubernetes. It is an orchestration tool built on top of Kubernetes that leverages two other open-source projects, Knative Serving and Istio; more on this later.

Image sourced from KServe

KServe significantly simplifies the deployment of ML models into a Kubernetes cluster by unifying the deployment into a single resource definition. It makes the machine learning deployment part of any ML project easy to learn and ultimately lowers the barrier to entry. As a result, models deployed using KServe can be easier to maintain than models deployed via a traditional Kubernetes deployment that requires a Flask or FastAPI service.

With KServe, there is no need to wrap your model inside a FastAPI or Flask app before exposing it to the internet over HTTPS. KServe has built-in functionality that essentially replicates this process, but without the overhead of maintaining API endpoints, configuring pod replicas, or configuring internal routing networks in Kubernetes. All you have to do is point KServe at your model, and it will handle the rest.

Beyond the simplification of the deployment processes, KServe also offers many features, including canary deployments, inference autoscaling, and request batching. These features will not be discussed as they are out of scope. However, this guide will hopefully set the foundation of understanding required to explore further.

First, let’s talk about the two key technologies, Istio and Knative, that accompany KServe.

Istio

Much of the functionality that KServe brings to the table would be difficult without Istio. Istio is a service mesh that extends your applications deployed in Kubernetes. It is a dedicated infrastructure layer that adds capabilities such as observability, traffic management, and security. For those familiar with Kubernetes, Istio replaces the standard ingress definitions typically found in a Kubernetes cluster.
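
To make that concrete, here is a minimal sketch of an Istio Gateway resource, Istio's equivalent of a Kubernetes Ingress entry point. The name and hosts below are purely illustrative; you will not normally write this by hand, as the KServe installation sets up its own gateway for you:

kubectl apply -n istio-system -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: example-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
EOF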

The complexity of managing traffic and maintaining observability grows as a Kubernetes-based system scales. One of the best features of Istio is its ability to centralize the controls of service-level communications. This gives developers greater control and transparency over communications among services.

With Istio, developers do not need to design applications that can handle traffic authentication or authorization. Ultimately, Istio helps reduce the complexity of deployed apps and allows developers to concentrate on the important components of the apps.

By leveraging the networking features of Istio, KServe can bring features that include canary deployment, inference graphs, and custom transformers.

Knative

Knative, on the other hand, is an open-source, enterprise-level solution for building serverless and event-driven applications. Knative builds on Kubernetes and integrates with Istio to bring serverless code execution capabilities similar to those offered by AWS Lambda and Azure Functions. Knative is a platform-agnostic solution for running serverless deployments in Kubernetes.

One of the best features of Knative is scale-to-zero. This is a critical component of KServe's ability to scale ML model deployments up or down, one that maximizes resource utilization and saves on costs.
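
If you ever need to inspect or tune how quickly Knative scales workloads down, the relevant settings live in the config-autoscaler ConfigMap in the knative-serving namespace. A quick sketch, assuming the default installation namespace and the key names documented by Knative:

# View the autoscaler settings (defaults appear in the ConfigMap's example data)
kubectl get configmap config-autoscaler -n knative-serving -o yaml

# Example: shorten the scale-to-zero grace period to 30 seconds
kubectl patch configmap config-autoscaler -n knative-serving \
  --type merge -p '{"data":{"scale-to-zero-grace-period":"30s"}}'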

Should I use KServe?

KServe, like many other tools, is not a one-size-fits-all solution that will fit every organisation's requirements. It has a high cost of entry because some experience working with Kubernetes is required. If you are just getting started with Kubernetes, there are many resources online, and I highly recommend checking out resources such as the DevOps guy on YouTube. Nonetheless, even without a deep understanding of Kubernetes, it is possible to learn to use KServe.

KServe will be ideal for organisations already leveraging Kubernetes, where there is existing knowledge of working with Kubernetes. It may also suit organisations looking to move away from, or complement, managed services like SageMaker or Azure Machine Learning in order to gain greater control over their model deployment process. The increase in ownership can result in significant cost reductions and increased configurability to meet project-specific requirements.

Nevertheless, the right cloud infrastructure decision must be made on a case-by-case basis, as infrastructure requirements differ across companies.

Prerequisites

This guide will take you through the steps required to get set up with KServe: you will install KServe and serve your first model.

There are several prerequisites that need to be met before proceeding. You will require the following:

Kubernetes Cluster

For this tutorial, I recommend experimenting with a Kubernetes cluster using Kind. It is a tool for running a local Kubernetes cluster without needing to spin up cloud resources. In addition, I highly recommend Kubectx as a tool to easily switch between Kubernetes contexts if you are working across multiple clusters.

However, when running production workloads, you will need access to a fully functioning Kubernetes cluster in order to configure DNS and HTTPS.

Deploy a Kubernetes cluster in Kind with:

kind create cluster --name kserve-demo

Switch to the correct Kubernetes context with:

kubectx kind-kserve-demo

Installation

The following steps will install Istio v1.16, Knative Serving v1.7.2 and KServe v0.10.0. These versions are best suited to this tutorial, as Knative v1.8 onwards requires DNS configuration for ingress, which adds a layer of complexity that is out of scope here.

  1. Install Istio.
# Install istio
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.16.0 TARGET_ARCH=x86_64 sh -
istioctl install --set profile=default -y
  2. Install Knative Serving.
# Install the Knative Serving component
export KNATIVE_VERSION="v1.7.2"
kubectl apply -f https://github.com/knative/serving/releases/download/knative-$KNATIVE_VERSION/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-$KNATIVE_VERSION/serving-core.yaml

# Install istio-controller for knative
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.7.0/net-istio.yaml
  3. Install cert-manager. Cert-manager is required to manage valid certificates for HTTPS traffic.
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --version v1.11.0 --set installCRDs=true
  4. Create a namespace for the model.
kubectl create namespace kserve
  5. Clone the KServe repository.
git clone [email protected]:kserve/kserve.git
  6. Install the KServe Custom Resource Definitions and KServe Runtimes into the model namespace in your cluster.
cd kserve
helm install kserve-crd charts/kserve-crd -n kserve
helm install kserve-resources charts/kserve-resources -n kserve
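
Before moving on, it is worth checking that the control-plane components came up cleanly. A quick sanity check, assuming the default namespaces used by the installation steps above:

kubectl get pods -n istio-system
kubectl get pods -n knative-serving
kubectl get pods -n cert-manager
kubectl get pods -n kserve

All pods should reach the Running state before you continue.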

Great! We now have KServe installed on the cluster. Let’s get deploying!

First Inference Service

In order to ensure that the deployment went smoothly, let us deploy a demo inference service. The source code for the deployment can be found here.

kubectl apply -n kserve -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF

The YAML resource definition above deploys a test inference service that sources a publicly available model trained using the scikit-learn library. KServe supports many different flavours of machine learning libraries, including MLflow, PyTorch and XGBoost, with more added at each release. If none of these out-of-the-box options meet your requirements, KServe also supports custom predictors.
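
For example, swapping to a different flavour is largely a matter of changing the modelFormat and storageUri fields. The snippet below is a hypothetical sketch; the storage URI is a placeholder that you would point at your own model artifacts:

kubectl apply -n kserve -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "xgboost-demo"
spec:
  predictor:
    model:
      modelFormat:
        name: xgboost
      storageUri: "gs://your-bucket/path/to/model"
EOF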

It is possible to monitor the status of the current deployment by getting the available pods in the namespace.

kubectl get pods -n kserve
Image by author

If you run into issues with the deployment, use the following to debug:

kubectl describe pod <name_of_pod> -n kserve
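
Inspecting the container logs can also surface errors that do not appear in the pod events. The --all-containers flag avoids having to know the individual container names that KServe creates:

kubectl logs <name_of_pod> -n kserve --all-containers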

We can also check the status of the inference service deployment with:

kubectl get isvc -A
Image by author

If the inference service reports READY as True, we are ready to perform our first prediction.

Performing a prediction

To perform a prediction, we will need to determine if our Kubernetes cluster is running in an environment that supports external load balancers.

kubectl get svc istio-ingressgateway -n istio-system

Kind Cluster

Clusters deployed using Kind do not support external load balancers, so you will have an ingress gateway that looks similar to the one below.

Kind External Load Balancer (Image by author)

In this case, we have to port-forward the istio-ingressgateway service, which allows us to access it via localhost.

Port-forward the istio ingress gateway service to port 8080 on localhost with:

kubectl port-forward -n istio-system service/istio-ingressgateway 8080:80

Then set the ingress host and port with:

export INGRESS_HOST=localhost
export INGRESS_PORT=8080

Kubernetes Cluster

If the external IP is valid and does not display <pending>, we are able to send an inference request over the internet to that IP address.

Ingress Gateway IP address (Image by author)

Set the ingress host and port with:

export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

Perform Inference

Prepare a JSON file containing the input for the inference request.

cat <<EOF > "./iris-input.json"
{
  "instances": [
    [6.8,  2.8,  4.8,  1.4],
    [6.0,  3.4,  4.5,  1.6]
  ]
}
EOF

Then perform an inference with curl:

SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -n kserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" -d @./iris-input.json 

The request will be sent to the KServe deployment through the Istio ingress gateway. If everything is in order, we will get a JSON reply from the inference service with a prediction of [1, 1] for the two instances.
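
The response body should look something along the lines of the following (exact formatting may vary between KServe versions):

{"predictions": [1, 1]}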

Scaling to Zero

By leveraging the features of Knative, KServe supports scale-to-zero capabilities. This feature effectively manages limited cluster resources by scaling pods that are not in use down to zero. Scale-to-zero allows the creation of a reactive system that responds to requests, as opposed to one that is always up, and makes it possible to deploy far more models on a cluster than traditional deployment configurations allow.

However, note that there is a cold start penalty for pods that have been scaled down. This will vary depending on the size of the image/model and the available cluster resources. A cold start can take 5 minutes if the cluster needs to scale an additional node or 10 seconds if the model is already cached on the node.

Let us modify the existing scikit-learn inference service and enable scale to zero by defining minReplicas: 0.

kubectl apply -n kserve -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    minReplicas: 0
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF

Setting minReplicas to 0 tells Knative to scale the inference service down to zero when there is no HTTP traffic. You will notice that after a period of 30 seconds, the pods for the sklearn-iris model are scaled down.

kubectl get pods -n kserve
The sklearn-iris predictor scales down to zero

To reinitialise the inference service, send a prediction request to the same endpoint.

SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -n kserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" -d @./iris-input.json

This will trigger pod initialisation from cold start and return a prediction.

Conclusion

KServe simplifies the process of machine learning deployment and shortens the path to production. When combined with Knative and Istio, KServe has the added bonus of being highly customisable, bringing many features that easily rival those offered by managed cloud solutions.

Of course, migrating the model deployment process in-house has its own innate complexities. However, the increase in platform ownership confers greater flexibility in meeting project-specific requirements. With the right Kubernetes expertise, KServe can be a powerful tool that allows organisations to easily scale their machine learning deployments across any cloud provider to meet increasing demand.

