
End-to-End ML Pipelines with MLflow: Tracking, Projects & Serving

A Definitive Guide to Advanced Use of MLflow

Photo by Jeswin Thomas on Unsplash

Introduction

MLflow is a powerful tool that is often talked about for its experiment tracking capabilities. And it’s easy to see why – it’s a user-friendly platform for logging all the important details of your Machine Learning experiments, from hyper-parameters to models. But did you know that MLflow has more to offer than just experiment tracking? This versatile framework also includes features such as MLflow Projects, the Model Registry, and built-in deployment options. In this post, we’ll explore how to utilise all of these features to create a complete and efficient ML pipeline.

For complete MLflow beginners this tutorial might be too much, so I highly encourage you to watch these two videos before diving into this one!

Setup

For this project we’ll be working locally, so make sure to properly set up your local environment. There are three main dependencies that the project requires – MLflow, [pyenv](https://github.com/pyenv/pyenv#installation), and [kaggle](https://github.com/Kaggle/kaggle-api). While MLflow can be installed simply using pip, you’ll need to follow separate instructions to set up pyenv and kaggle.

Once you’re done with the installations, make sure to pull the latest version of this repo. When you have the repo on your laptop, we’re finally ready to begin!

Project Overview

Move to the mlflow_models folder and you’ll see the following structure:

mlflow_models folder structure. Image by author.

Here’s a brief overview of each file in this project:

  • MLProject – yaml-styled file describing the MLflow Project
  • python_env.yaml – lists all the environment dependencies to run the project
  • train_hgbt.py and train_rf.py – training scripts for the HistGradientBoostedTrees and RandomForest models using specific hyperparameters
  • search_params.py – script to perform the hyperparameter search
  • utils – folder containing all the utility functions used in the project

As stated before, this project is end-to-end, so we’re going to go from data download to model deployment. The approximate workflow looks like this:

  1. Download data from Kaggle
  2. Load and pre-process the data
  3. Tune Random Forest (RF) model for 10 iterations
  4. Register the best RF model and put it into the Production bucket
  5. Deploy the models using in-built REST API

After we’re done, you can repeat steps 2 to 5 for the HistGradientBoostedTrees model on your own. Before jumping into the project, let’s see how these steps can be supported by MLflow.

MLflow Components

Generally speaking, MLflow has 4 components – Tracking, Projects, Models, and Registry.

MLflow components. Screenshot from mlflow.org

Thinking back to the project steps, here’s how we’re going to be using each of them. First of all, I’ve used MLflow Projects to package up the code so that you or any other data scientist/engineer can reproduce the results. Second, the MLflow Tracking server is going to track your tuning experiments. This way, you’ll be able to retrieve the best experiments in the next step, where you’ll add your models to the Model Registry. From the registry, deploying the models will literally be a one-liner because of the MLflow Models format that they’re saved in and their in-built REST API functionality.

Project steps overview. Image by author.

Pipeline Overview

Data

Data will be downloaded automatically when you run the pipeline. As an illustrative example, I’ll be using a Loan Default dataset (CC0: Public Domain license), but you can adjust this by re-writing the training_data parameter and changing the column names to the relevant ones.

MLProject & Environment Files

The MLProject file gives you a convenient way to manage and organise your machine learning projects by allowing you to specify important details such as the project name, the location of your Python environment, and the entry points for your pipeline. Each entry point can be customised with a unique name, relevant parameters, and a specific command. The command serves as the executable shell line that will be run whenever the entry point is activated, and it can use the parameters defined above it.
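To make this concrete, here’s a minimal sketch of what such a file can look like. The entry point names, parameters and defaults below are illustrative, not copied from the repo:

```yaml
name: mlflow_models

python_env: python_env.yaml

entry_points:
  train_rf:
    parameters:
      max_depth: {type: int, default: 10}
      n_estimators: {type: int, default: 100}
    command: "python train_rf.py --max-depth {max_depth} --n-estimators {n_estimators}"
  search_params:
    parameters:
      model_type: {type: str, default: rf}
    command: "python search_params.py --model-type {model_type}"
```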

The python_env.yaml file outlines the precise version of Python necessary to execute the pipeline, along with a comprehensive list of all required packages.
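For reference, an MLflow python_env.yaml follows this shape (the Python version and packages here are illustrative, not the exact ones pinned in the repo):

```yaml
python: "3.10.8"
build_dependencies:
  - pip
dependencies:
  - mlflow
  - scikit-learn
  - pandas
  - hyperopt
  - kaggle
```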

These two files are needed to create the environment required for running the project. Now, let’s look at the actual scripts (entry points) that the pipeline will be executing.

Training and Experiment Tracking

Training is done in the train_rf.py and train_hgbt.py scripts. Both of them are largely the same, with the exception of the hyper-parameters that get passed in and the pre-processing pipelines. Consider the function below which downloads the data and trains a Random Forest model.
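The full script lives in the repo; here is a simplified sketch of the idea (the download step is omitted, and the target column, metric key and hyper-parameters are assumptions for illustration):

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def train_rf(data_path: str, max_depth: int, n_estimators: int, run_id: str = None):
    df = pd.read_csv(data_path)
    # "Status" is a hypothetical target column – adjust to your dataset
    X, y = df.drop(columns=["Status"]), df["Status"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

    # One-hot encode categorical columns, pass numeric columns through untouched
    preprocessor = make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_include="object")),
        remainder="passthrough",
    )
    pipeline = Pipeline([
        ("preprocessing", preprocessor),
        ("model", RandomForestClassifier(max_depth=max_depth, n_estimators=n_estimators)),
    ])

    # run_id lets search_params.py attach this training run to a pre-created child run
    with mlflow.start_run(run_id=run_id):
        pipeline.fit(X_train, y_train)
        test_scores = pipeline.predict_proba(X_test)[:, 1]
        mlflow.log_params({"max_depth": max_depth, "n_estimators": n_estimators})
        mlflow.log_metrics({"test_pr_auc": average_precision_score(y_test, test_scores)})
        # Logs the whole preprocessing + model pipeline as a single MLflow Model
        mlflow.sklearn.log_model(pipeline, "model")
```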

The experiment starts when we define MLflow context using with mlflow.start_run(). Under this context, we use mlflow.log_metrics to save the PR AUC metrics (check out the eval_and_log_metrics function for more information) and mlflow.sklearn.log_model to save the preprocessing and modelling pipeline. This way, when we load the pipeline, it will do all the pre-processing together with the inference. Quite convenient if you ask me!

Hyper-parameter Tuning

Hyper-parameter tuning is done using the Hyperopt package in search_params.py. A lot of the code is borrowed from the official MLflow repo, but I’ve tried to simplify it quite a bit. The trickiest part of this script is understanding how to structure the tuning rounds so that they appear connected to the "main" project run. Essentially, when we run search_params.py using MLflow, we want to make sure that the structure of experiments is as follows:

Experiment structure visualised. Image by author

As you can see, the search_params script does nothing else but specify which parameters train_rf.py should use next (e.g. depths of 10, 2 and 5) and what its parent run ID should be (in the example above it’s 1234). When you explore the script, make sure to pay attention to the following details.

  • When we define the mlflow.start_run context, we need to make sure that the nested parameter is set to True
  • When we run train_rf.py (or train_hgbt.py), we explicitly pass the run_id and make it equal to the previously created child_run
  • We also need to pass the correct experiment_id

Please see the example below to understand how it all works in code. The eval function is the one that will be optimised by the Hyperopt minimisation function.
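Here is a rough sketch of what such an objective function can look like (the entry point name, parameter names and metric key are assumed for illustration):

```python
import mlflow
import mlflow.projects
from hyperopt import STATUS_OK
from mlflow.tracking import MlflowClient


def build_eval_fn(experiment_id: str):
    """Returns the objective function that Hyperopt will minimise."""

    def eval_fn(params):
        # nested=True keeps this run attached to the parent tuning run
        with mlflow.start_run(nested=True) as child_run:
            # Launch the training entry point and reuse the child run's ID,
            # so the metrics logged by train_rf.py land in the run we just created
            submitted = mlflow.projects.run(
                uri=".",
                entry_point="train_rf",
                run_id=child_run.info.run_id,
                experiment_id=experiment_id,
                parameters={
                    "max_depth": str(int(params["max_depth"])),
                    "n_estimators": str(int(params["n_estimators"])),
                },
                synchronous=False,
            )
            submitted.wait()

        # Read back the logged metric and minimise its negative (higher PR AUC is better)
        metrics = MlflowClient().get_run(submitted.run_id).data.metrics
        return {"loss": -metrics["test_pr_auc"], "status": STATUS_OK}

    return eval_fn
```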

The actual tuning function is relatively simple. All we do is initialise an MLflow experiment run (parent run of all the other runs) and optimise the objective function using provided search space.
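A minimal sketch of that parent run plus Hyperopt loop, again with an assumed search space:

```python
import mlflow
from hyperopt import fmin, hp, tpe


def tune(max_runs: int = 10):
    # Assumed search space – the real one lives in search_params.py
    space = {
        "max_depth": hp.quniform("max_depth", 2, 20, 1),
        "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
    }

    # This run becomes the parent of every child run created inside eval_fn
    with mlflow.start_run() as parent_run:
        experiment_id = parent_run.info.experiment_id
        best = fmin(
            fn=build_eval_fn(experiment_id),
            space=space,
            algo=tpe.suggest,
            max_evals=max_runs,
        )
        mlflow.set_tag("best_params", str(best))
```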

Please note that these functions are just to illustrate the main parts of the code. Refer to the github repository for the full versions of the code.

Run the RF Pipeline

By now you should have a general idea about how the scripts work! So, let’s run the pipeline for Random Forest using this line:

mlflow run -e search_params --experiment-name loan . -P model_type=rf

Let’s decompose this command line:

  • mlflow run . means that we want to run the Project in this folder
  • -e search_params specifies which of the entry points in MLProject file we want to run
  • --experiment-name loan makes the experiment name equal to "loan". You can set it to whatever you want, but write it down since you’ll need it later
  • -P model_type=rf sets the model_type parameter in search_params script to "rf" (aka Random Forest)

When we run this line, four things should happen:

  1. Python virtual environment will get created
  2. New experiment called "loan" will get initialised
  3. Kaggle data will get downloaded into a newly created folder data
  4. Hyperparameter search will begin

When the experiments are done, we can check the results in the MLflow UI. To access it, simply use the command mlflow ui in your command line. In the UI, select the "loan" experiment (or whatever you’ve called it) and add your metric to the experiments view.

Screenshot of MLflow UI. Image by author.

The best RF model achieved a test PR AUC of 0.104 and took 1 minute to train. Overall, the hyper-parameter tuning step took roughly 5 minutes to complete.

Register the Model

By now, we have trained, evaluated and saved 10 Random Forest models. In theory, you could simply go to the UI, find the best model, manually register it in your Model Registry and promote it to Production. However, a better way is to do it in code, since then you can automate this step. This is exactly what the model_search.ipynb notebook covers. Use it to follow along with the sections below.

First of all, we need to find the best model. To do it programmatically, you need to gather all the hyperparameter tuning experiments (10 of them) and sort them by the test metric.
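A sketch of that lookup with the MLflow client API is shown below; the experiment name, metric key and parent run ID are placeholders you’d swap for your own:

```python
import mlflow

# Collect every run from the "loan" experiment as a pandas DataFrame
experiment = mlflow.get_experiment_by_name("loan")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# Keep only the children of the tuning parent run (its ID is visible in the UI)
parent_run_id = "1234abcd"  # hypothetical – replace with your parent run's ID
children = runs[runs["tags.mlflow.parentRunId"] == parent_run_id]

# The best run is the child with the highest test PR AUC (assumed metric key)
best_run = children.sort_values("metrics.test_pr_auc", ascending=False).iloc[0]
print(best_run["run_id"], best_run["metrics.test_pr_auc"])
```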

Your results will be different, but the main goal here is to end up with the correct best_run parameter. Please note that if you’ve changed the experiment name, you’ll need to change it in this script as well. The parent run ID can be looked up in the UI if you click on the parent run (in this case named "capable-ray-599").

Run ID lookup in MLflow UI. Screenshot by author.

To test if your model is working as expected, we can easily load it into the notebook.
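For example, something along these lines (assuming best_run from the previous step and a held-out X_test frame already in memory):

```python
import mlflow.sklearn

# Loads the full preprocessing + RandomForest pipeline logged under the "model" artifact path
model = mlflow.sklearn.load_model(f"runs:/{best_run['run_id']}/model")

# Sanity check: scores for the first few test rows
print(model.predict_proba(X_test.head())[:, 1])
```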

If you managed to get the prediction – congrats, you’ve done everything correctly! Finally, registering the model and promoting it to Production is a piece of cake as well.
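In essence it boils down to two calls – registering the run’s model and transitioning the new version to the Production stage. A sketch (the model name and stage are the ones used in this project, the rest is assumed):

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the best run's model under the name "loan_model"
model_version = mlflow.register_model(f"runs:/{best_run['run_id']}/model", "loan_model")

# Promote the freshly created version to the Production stage
MlflowClient().transition_model_version_stage(
    name="loan_model",
    version=model_version.version,
    stage="Production",
)
```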

Running these two calls registers your model and promotes it to the "Production" bucket internally. All this does is change the way we can access the models and their metadata, but it’s incredibly powerful in the context of model versioning. For example, at any point we can compare version 1 with version 2 when it comes out.

MLflow model registry. Screenshot by author.

If you go to the "Models" tab of the UI, you’ll indeed see that there is a model named loan_model and that its Version 1 is currently in the Production bucket. This means that we can now access the model by its name and stage, which is very convenient.

Serve the Model

The easiest way of serving the model is to do it locally. This is usually done to test the endpoint and to make sure that we get the expected outputs. Serving with MLflow is quite easy, especially when we’ve already registered the model. All you need to do is run this command:

mlflow models serve --model-uri models:/loan_model/Production -p 5001

This line will start a local server that will host your model (that’s called loan_model and is currently in Production stage) at the port 5001. This means that you’ll be able to send the requests to localhost:5001/invocations endpoint and get the predictions back (given that the requests are properly formatted).

To test the endpoint locally, you can use the requests library to call it and get the predictions back.
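A minimal sketch of such a request is below (X_test is assumed to be in memory). Note that the payload key depends on your MLflow version: 2.x expects the dataframe_split wrapper used here, while 1.x expects the split-oriented JSON directly.

```python
import json
import requests

# Build a split-oriented payload from a few test rows
sample = X_test.head()
payload = json.dumps({
    "dataframe_split": {
        "columns": sample.columns.tolist(),
        "data": sample.values.tolist(),
    }
})

response = requests.post(
    "http://localhost:5001/invocations",
    data=payload,
    headers={"Content-Type": "application/json"},
)
print(response.json())
```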

In the example above, we’re getting the same probability that we had before, but now these scores are produced by the local server and not your script. The inputs need to follow very specific guidelines, which is why we need a few lines of pre-processing before sending the request. You can read more about the expected formats for MLflow serving here.

Summary

If you’ve managed to get this far and everything is working – give yourself a nice round of applause! I know it was a lot to take in, so let’s summarise everything you’ve achieved so far.

  1. You saw and now understand how to structure your project with MLflow Projects
  2. You understand where in the script we log our parameters, metrics and models, and how search_params.py invokes train_rf.py
  3. You can now run the MLflow Projects and see the results in MLflow UI
  4. You know how to find the best model, how to add it to the model registry, and how to promote it to Production bucket programmatically
  5. You can serve the models from model registry locally and can call the endpoint to make a prediction

What Next?

I strongly recommend that you put your skills to the test by attempting to run the pipeline for the Gradient Boosted Trees model and then deploying the HGBT model. All the necessary scripts are available to you, so all that remains is for you to configure the pipeline and complete the deployment on your own. Give it a go and if you encounter any challenges or have any questions, don’t hesitate to leave them in the comments section.
