Kubeflow (is not) for Dummies

Deploy and destroy Kubeflow on EKS with one script, no sweat

Mateusz Kwaśniak
Towards Data Science

--

Figure 1. Kubeflow Dashboard (Source: Kubeflow docs)

Tools, libraries, and frameworks are created to make our work easier. They introduce new functionality, simplify code, reduce boilerplate, and automate the tedious stuff.

Imagine a project with no dependencies at all: even a single function call (e.g. yaml.safe_load) would have to be replaced with your own code. All these tools make applications easier to build and maintain, develop and deploy. But what if the tools themselves are difficult to deploy? Ouch.

Introducing Kubeflow

I won’t lie: it’s no coincidence that I introduce Kubeflow right after writing that tools can be difficult to deploy. Ladies and gentlemen, meet Kubeflow, one of the most popular and yet one of the most irritating tools I have used in years (CMake, brother, I will never forget you).

Kubeflow has been one of the hottest things in the ML and MLOps area recently, with around 30 actively developed repositories and almost 20,000 stars from GitHub users. Does “The Machine Learning Toolkit for Kubernetes” make you think that it could be and do literally everything? Well, then you got it quite right. Kubeflow has a few key components:

  • Pipelines, with a YAML template or an SDK for creating them and a GUI (Fig. 2) for visualizing pipeline runs and their results,
  • Jupyter Notebook Server,
  • Katib for hyperparameter optimization and neural architecture search,
  • Artifact Store,
  • Dashboard (Fig. 1), a web app to manage all of the above, and much more.

Figure 2. Kubeflow Pipelines (Source: Pipelines Quickstart)

Most importantly, Kubeflow runs on top of Kubernetes. So when we speak of Kubeflow Jupyter notebooks, those notebooks run in a scalable Kubernetes cluster. So do Pipelines and Katib. This is a great advantage over many of the tools available so far, if you need to train and manage ML models at scale, of course.

An official list of use cases is available in the Kubeflow documentation.

Kubeflow (is not) for Dummies

Unlike other ML engineering tools such as MLflow or DVC, installing Kubeflow is not that easy and pip install won’t suffice. Not even close. First of all, you need a Kubernetes cluster where KF will be deployed. This is the very first obstacle on your way to giving Kubeflow a try in your project.

It can be deployed both in a local cluster and in the cloud. Of course, deploying it locally has its drawbacks (e.g. limited resources) but it can also be a lot cheaper. Setting up Kubeflow in the cloud can get expensive, especially if something goes wrong and you need to repeat the deployment a few times.

Been there. It took me a couple of attempts to finally set up Kubeflow on AWS (deploying it in an Amazon EKS cluster). A couple of attempts and around $70, as it turned out later.

Figure 3. Image by Ariel Biller (@LSTMeow)

After reading that other people struggle with deploying Kubeflow just like I did, I decided to come up with a solution that would minimize the number of mistakes and the effort the next time I want to deploy KF. I also wanted cleaning up the environment to be as quick and easy as setting it up.

And I made it, so let me show off now.

Amazon EKS + Terraform

Although an Amazon EKS cluster can be provisioned within a couple of minutes using the AWS console, I chose a different way to do it, so that I have more control over the resources I create (and delete).

Following the Infrastructure as Code approach, I used Terraform to set up the whole EKS cluster from scratch, including EC2 instances, a VPC and other necessary resources. Thanks to that, I write the Terraform script once and can set up the infrastructure and then tear it down with a single command, reasonably sure that all resources will be cleaned up nicely. This way I won’t be charged for an EC2 instance that I accidentally left running over the weekend.
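
To make the “single command” claim concrete, here is roughly what that workflow looks like (a sketch; the directory layout, cluster name and region are illustrative, not copied from my repo):

cd terraform/
terraform init                  # download the AWS provider and modules
terraform apply -auto-approve   # create the VPC, EKS cluster and node group

# point kubectl at the new cluster (name and region are examples)
aws eks update-kubeconfig --name kubeflow-cluster --region eu-west-1

# ...and when you are done, tear everything down just as easily
terraform destroy -auto-approve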

More than that, IaC gives me the possibility to keep the build instructions under version control in my GitHub repository. I can track and revert any changes made to the infrastructure, which would be hardly possible if I just kept clicking buttons in the AWS console UI.

Deploying Kubeflow in the EKS cluster

Once the EKS cluster is provisioned, Kubeflow can finally be deployed inside it. This can be done from soup to nuts using the shell, but I needed to make some tiny changes to get it working (e.g. replacing the region or cluster name in the YAML file, special thanks to sed).
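
To give you an idea of what those sed tweaks look like, here is a minimal sketch; the config file name and placeholder strings are assumptions, the real ones live in the repo:

export AWS_CLUSTER_NAME=kubeflow-cluster
export AWS_REGION=eu-west-1

# substitute the placeholders in the kfctl config file in place
sed -i "s/<CLUSTER_NAME>/${AWS_CLUSTER_NAME}/g" kfctl_aws.yaml
sed -i "s/<REGION>/${AWS_REGION}/g" kfctl_aws.yaml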

There are not many changes needed if you just want a little playground (and you don’t care about privacy and good infrastructure design). All adjustments are saved in my deploy_kubeflow script in the repo.

Then Kubeflow can be deployed using the kfctl CLI tool, as done in the script. It should take a couple of minutes and hopefully finish successfully. When all pods and services are ready, the dashboard UI should be available to you.
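
Roughly, this step boils down to something like the following (a sketch; the exact flags and config file name used in my script may differ):

# build and apply all Kubeflow resources described in the config
kfctl apply -V -f kfctl_aws.yaml

# watch the pods come up; the dashboard is usable once they are all Running
kubectl get pods -n kubeflow --watch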

Type

kubectl get service istio-ingressgateway -n istio-system

to find out where. When deployed with my script, the dashboard was available to me at localhost (via NodePort) and a LoadBalancer service was not necessary.
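
If the NodePort route does not work in your setup, port-forwarding the ingress gateway is a simple alternative (the ports below are the usual defaults, adjust if yours differ):

kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80
# then open http://localhost:8080 in your browser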

Figure 4. Image by Ariel Biller (@LSTMeow)

All at once: deploy_kubeflow.sh

If you would like to have some fun with Kubeflow components on EKS and want to follow in my footsteps, simply clone this repo and take a look at the README file for instructions.

Prerequisites: a configured AWS CLI and a profile with admin privileges.

All required variables have to be set in the set_env_variables script. Then you are good to go and just need to execute deploy_kubeflow.sh. The full list of required dependencies and actions is in the README.
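
End to end, the usage looks more or less like this (the exact variables expected by set_env_variables are listed in the README; the ones below are only examples):

export AWS_PROFILE=admin          # profile with admin privileges
source ./set_env_variables        # cluster name, region, instance types, ...
./deploy_kubeflow.sh              # provision EKS with Terraform and deploy Kubeflow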

Notes

  • I wanted to keep it simple and use Kubeflow only for my private purposes, so I didn’t care about IAM or authentication. I wouldn’t use this code in any real project, but I think it may be a good foundation,
  • There are other ways to create an EKS cluster and deploy Kubeflow on AWS; depending on your needs you may use CloudShell, eksctl or other tools,
  • If you are on a budget, play with KF locally: here are my instructions for WSL, and here my colleague Paweł wrote a similar tutorial for Windows,
  • The whole KF deployment is quite big and quite expensive (I had to use two m5.xlarge instances to handle it; I had no success with cheaper machines),
  • If you only need Kubeflow Pipelines, you can have just that. Pipelines is only a small piece of the whole KF system, so you can use cheaper instances with that standalone deployment,
  • I know I used the word Kubeflow too many times in this article, sorry.

--

Lead MLOps Engineer, ML Architect // I write about machine learning engineering, system design and platforms // linkedin.com/in/mtszkw