
Automate your ML model retraining with Kubeflow

Learn how to implement a cost-efficient and automated model retraining solution with Kubeflow Pipelines - Part 1

This is the first part of a three-part series in which I explain how you can build a cost-efficient and automated ML retraining system with Kubeflow. Along the way, we’ll also pick up some best practices for building pipelines.

While Kubeflow Pipelines isn’t yet the most popular batch job orchestrator, a growing number of companies are adopting it to orchestrate and monitor their data and ML jobs. Kubeflow is designed to build on Kubernetes’ strengths, and that’s what makes it so attractive.

In this article, I’ll show you how you can build an automated and cost-efficient ML model retraining pipeline using Kubeflow Pipelines. As you might already know, retraining ML models is necessary to keep them accurate and to counter model drift. If you are familiar with Airflow or planning to use it, have a look at this article, where I show how to build the same retraining system with Airflow.

Now, let’s say you have created a nice ML model to predict the fare of a taxi ride and are serving a first version of it. The retraining system you’ll be building is made of two pipelines:

  1. The first pipeline, taxi-fare-predictor, trains the ML model and serves it for prediction if it outperforms the live version of the model.
  2. The second pipeline, retraining-checker, runs periodically and checks whether the model has become obsolete. If so, it triggers the first pipeline.

To make things more realistic, you’ll also create two additional pipelines that send prediction requests to the live version of the model, one of which sends inputs purposely far from those in the training set. The goal here is to simulate model drift and have our retraining-checker pipeline trigger the taxi-fare-predictor pipeline. Bear with me until the end to build the solution pictured below.

Whenever possible, define your pipeline components as yaml files

There are three ways to build a Kubeflow pipeline component:

  1. You can write a self-contained Python function: that is, all import statements and helper functions must live inside the function definition. Then you convert the function into a component using the Kubeflow Pipelines SDK. This is the fastest way of building a component, provided you are comfortable writing Python code. However, I’d recommend using self-contained functions only for simple, straightforward components. Rest assured! You’ll see an example of a self-contained function in the taxi-fare-predictor pipeline.
  2. You can also package your component code in a docker image and instruct Kubeflow Pipelines to run that image inside a container in the Kubeflow cluster. The main advantage of this method is its flexibility: you can write your code in any programming language you like. In addition, this approach works better for complex components and produces more shareable and reusable components, since sharing a component boils down to sharing an image. This is also demonstrated in the taxi-fare-predictor pipeline.
  3. The third approach is to define your component as a yaml file using the Kubeflow Pipelines component specification. This is by far the recommended way of building components, as it makes them easily shareable and reusable between pipelines, but also between teams. Granted, it requires you to spend some time learning the component specification, but you’ll quickly make up for that with components that are easy to share and reuse. You’ll see an example of this approach in the taxi-fare-predictor pipeline, and a minimal sketch of all three approaches follows right after this list.

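To make the three options concrete, here is a minimal sketch using the kfp v1 SDK. The function names, image name, and yaml content below are illustrative placeholders, not the actual components from the repository.

```python
from kfp import dsl, components

# 1. Self-contained Python function: all imports live inside the function body.
def add(a: float, b: float) -> float:
    """Toy self-contained component."""
    return a + b

add_op = components.create_component_from_func(add, base_image='python:3.9')

# 2. Code packaged in a docker image and run as a container step
#    (the image name below is a hypothetical placeholder).
def store_metrics_step(metrics_path: str) -> dsl.ContainerOp:
    return dsl.ContainerOp(
        name='store-metrics',
        image='my-registry/store-metrics:latest',
        arguments=['--metrics-path', metrics_path])

# 3. Component described by a yaml specification. load_component_from_file
#    works the same way when the spec lives in a component.yaml file.
echo_op = components.load_component_from_text("""
name: Echo
inputs:
- {name: message, type: String}
implementation:
  container:
    image: alpine:3.18
    command: [echo, {inputValue: message}]
""")

@dsl.pipeline(name='component-styles-demo')
def demo_pipeline():
    add_op(1.0, 2.0)
    store_metrics_step('gs://my-bucket/metrics.csv')
    echo_op(message='three ways to build a component')
```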
What you will get

By the end of this article, you will have:

  1. a guide to implementing the aforementioned solution, as well as some best practices for writing Kubeflow pipelines.
  2. access to a gitlab repository containing all the material used to build the solution. You’ll then be able to use the repo codebase as a good starting point to build your own ML retraining system with Kubeflow Pipelines.

What we will discuss

The steps you take to build the concept drift mitigation solution include:

  1. Setting up Kubeflow Pipelines and Cloud SQL (Part 1)
  2. Building pipelines components (Part 1)
  3. Conditional Components and externally triggered pipelines (Part 1)
  4. Sharing data between components (Part 2)
  5. Passing secrets to components (Part 3)
  6. Designing pipelines and components inputs (Part 3)

Sounds great? Then let’s jump right into it!


Setting up Kubeflow Pipelines and Cloud SQL

Kubeflow is an open-source framework that can be installed wherever a Kubernetes cluster is up and running. You’ll find useful guidance here on installing Kubeflow in the way that best suits you. You can also install only Kubeflow Pipelines, the Kubeflow component this article is about.

To be able to detect and handle model drift, our system uses a Postgres database hosted in a Google Cloud SQL instance to store training job metrics and prediction results. A Cloud SQL proxy is deployed in the Kubeflow cluster as a service to provide an encrypted connection to the Postgres database. To learn more about connecting securely to Cloud SQL from Kubernetes, check this article.
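For illustration, here is a minimal sketch of how a pipeline component could reach that database through the in-cluster proxy. The service name, database name, and credentials are placeholders; in the actual setup, credentials are injected as secrets, which is covered in Part 3.

```python
import psycopg2

# Connect to Postgres through the Cloud SQL proxy exposed as a Kubernetes
# Service inside the Kubeflow cluster. All values below are placeholders.
conn = psycopg2.connect(
    host='cloudsql-proxy-service',  # in-cluster Service fronting the proxy
    port=5432,
    dbname='taxi_fare',             # hypothetical database name
    user='postgres',
    password='change-me')           # in practice, injected from a secret

with conn, conn.cursor() as cur:
    cur.execute('SELECT 1')
    print(cur.fetchone())
```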

It’s perfectly fine to choose another database as long as you plan for a secure way to allow any pod running on your Kubeflow cluster to connect to the database.

The database shall contain the following tables:


Building pipelines components

Model drift mitigation is achieved simply by retraining the model with newer data. Here, we assume that there is another process responsible for fetching new data and storing it in the same repository from which we read our training and evaluation data.

This retraining happens in the first step of the taxi-fare-predictor pipeline, in the component named ‘Submitting a Cloud ML training job as a pipeline step’. An ML training job is submitted to AI Platform, the managed ML service of Google Cloud Platform.

The ‘Submitting a Cloud ML training job as a pipeline step’ component is defined as a yaml file following the Pipelines component specification. All you have to do is load that yaml file into a component and use it in your pipeline.
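As a sketch of what this looks like with the kfp SDK (the file path, pipeline parameters, and argument names below are placeholders and depend on the actual component specification):

```python
from kfp import dsl, components

# Load the training component from its yaml specification
# (placeholder path; in practice it points to the component.yaml
# of the AI Platform training component).
train_op = components.load_component_from_file('components/train/component.yaml')

@dsl.pipeline(name='taxi-fare-predictor')
def taxi_fare_predictor(project_id: str = 'my-gcp-project',
                        training_data: str = 'gs://my-bucket/data/train.csv'):
    # The keyword arguments must match the inputs declared in the yaml spec.
    training_step = train_op(
        project_id=project_id,
        args=['--training-data', training_data])
```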

Once the model is trained, it exports its metrics as a csv file to Google Cloud Storage, which is why you need to download those metrics and store them in the Postgres database hosted in Cloud SQL. First, a self-contained Python function handles the file download.

Second, the code that stores the metrics in Cloud SQL is packaged in a docker image, which is then used to create a pipeline component.
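A hedged sketch of those two steps follows; the bucket layout, image name, and argument names are illustrative, not the repository’s exact code.

```python
from kfp import dsl, components

def download_metrics(metrics_uri: str) -> str:
    """Self-contained function: imports live inside the function body."""
    from google.cloud import storage
    bucket_name, blob_path = metrics_uri[len('gs://'):].split('/', 1)
    blob = storage.Client().bucket(bucket_name).blob(blob_path)
    return blob.download_as_text()

download_metrics_op = components.create_component_from_func(
    download_metrics,
    base_image='python:3.9',
    packages_to_install=['google-cloud-storage'])

def store_metrics_op(metrics: str) -> dsl.ContainerOp:
    # The image (hypothetical name) contains the code that writes the
    # metrics to the Postgres database behind the Cloud SQL proxy.
    return dsl.ContainerOp(
        name='store-metrics-in-cloud-sql',
        image='my-registry/store-metrics:latest',
        arguments=['--metrics', metrics])
```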

Remember, the goal here is not only to retrain a model, but also to serve it for predictions if the newly retrained model is better than the live one. That’s why there are three additional components: one to get the model location, one to check whether the new model is superior to the current one (check-deploy), and one to serve the new model if it turns out to be better. The whole model drift mitigation pipeline is pictured below.
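Here is a minimal sketch of that conditional logic using dsl.Condition; the check_deploy and deploy functions below are illustrative stand-ins for the repository’s components, which compare metrics stored in Cloud SQL.

```python
from kfp import dsl, components

def check_deploy(model_dir: str) -> str:
    # Stand-in: the real component compares the new model's metrics
    # (read from Cloud SQL) with those of the live model.
    return 'True'

def deploy(model_dir: str):
    print(f'Serving new model from {model_dir}')

check_deploy_op = components.create_component_from_func(check_deploy)
deploy_op = components.create_component_from_func(deploy)

@dsl.pipeline(name='conditional-deploy-demo')
def conditional_deploy(model_dir: str = 'gs://my-bucket/models/latest'):
    check = check_deploy_op(model_dir)
    # The serving step only runs when check-deploy reports a better model.
    with dsl.Condition(check.output == 'True'):
        deploy_op(model_dir)
```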


Conditional Components and externally triggered pipelines

You now have the retraining pipeline, but you are left with one question: how do you automate its execution?

The easiest answer is to schedule a recurring run for it. For instance, you specify that it should run every hour, day, or month, depending on how quickly you expect the model to drift. This solution comes with a major drawback, though: unnecessary model retraining can be costly.
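Whichever pipeline you end up scheduling, a recurring run can be created with the kfp client, along the lines of the sketch below; the endpoint, experiment name, cron expression, and package path are placeholders.

```python
import kfp

client = kfp.Client(host='http://my-kubeflow-endpoint')  # hypothetical endpoint
experiment = client.create_experiment('retraining')

client.create_recurring_run(
    experiment_id=experiment.id,
    job_name='taxi-fare-predictor-hourly',
    cron_expression='0 0 * * * *',  # every hour (Kubeflow uses a 6-field cron)
    pipeline_package_path='taxi_fare_predictor.yaml')
```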

A better approach is to check whether the model has actually drifted and launch the retraining pipeline only when it has. In a nutshell, we consider that the model has drifted if there is a large gap between the mean of the training set labels and the mean of the model’s predictions. This is achieved via the retraining-checker pipeline pictured below. You’ll want to schedule it to run often enough that model drift is cured quickly.
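Conceptually, the is-retraining-needed check boils down to something like the sketch below; the table and column names, as well as the threshold, are placeholders for the schema and tuning used in the repository.

```python
DRIFT_THRESHOLD = 0.2  # illustrative relative gap

def is_retraining_needed(conn) -> bool:
    """conn is a DB-API connection to the Postgres metrics database."""
    with conn.cursor() as cur:
        cur.execute('SELECT AVG(label) FROM training_metrics')         # hypothetical table
        training_mean = cur.fetchone()[0]
        cur.execute('SELECT AVG(prediction) FROM prediction_results')  # hypothetical table
        prediction_mean = cur.fetchone()[0]
    gap = abs(prediction_mean - training_mean) / abs(training_mean)
    return gap > DRIFT_THRESHOLD
```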

The trigger-model-training component thus runs conditionally on the result of the is-retraining-needed component.

Being able to trigger a pipeline externally is a really useful feature, and it is the sole purpose of the trigger-model-training component: it submits the inputs for the retraining pipeline and triggers a run. Depending on where your Kubeflow Pipelines instance runs, you might need to provide the Kubeflow endpoint (host) as an input to the component.
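In essence, the component does something like the following; the host, experiment name, package path, and parameters are placeholders.

```python
import kfp

def trigger_model_training(host: str):
    # Connect to the Kubeflow Pipelines API and start a run of the
    # retraining pipeline.
    client = kfp.Client(host=host)
    experiment = client.create_experiment('retraining')
    client.run_pipeline(
        experiment_id=experiment.id,
        job_name='taxi-fare-predictor-triggered',
        pipeline_package_path='taxi_fare_predictor.yaml',
        params={'project_id': 'my-gcp-project'})
```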


Summing up

Hopefully, I was able to show how I built an ML model retraining system using Kubeflow Pipelines and Cloud SQL. The system’s key differentiator is that it is automated and economical. As for best practices around writing Kubeflow pipelines, the main takeaway, I think, is that the best approach for building pipeline components is to define them as yaml files that can be shared and reused between pipelines and teams.

The complete code I used is readily available in this gitlab repository. Feel free to check it out as well as the readme.

