Model Management in productive ML software

- an introduction to MLflow -

Maximilian Beckers
Towards Data Science

Developing a good proof of concept for a machine learning problem can be hard. You work through tons and tons of data engineering, test many different models, and finally you have “cracked the code” and gotten a good score on your test set. Hurray! That is great news, because now the fun really starts: your model can potentially help your company make or save money. For that to happen, you have to build productive software around your model. What does that mean? You need a solution architecture that allows for a live data flow, a compute component that can scale with that flow, a real front end, and a good storage solution. And those are just the main components! You also need a monitoring solution in case any of your software pieces runs into an error, and a DevOps tool that takes care of testing and releasing newer code versions into the productive environment.

That is just your everyday data use case. If you are building AI software, you are probably dealing with massive amounts of data and a very intense data engineering layer to get the raw data into a form that can be used to train machine learning models and run inference. Most likely you will need an ETL pipelining tool to orchestrate the data engineering of your daily runs. But one thing might be the most important piece of the puzzle: you need to know what your ML model is doing over time!

  • How good is the model today vs how good was it yesterday?
  • Which features was it trained on?
  • What are the optimal hyperparameters? Do they change over time?
  • How do training and test data change over time?
  • Which model is in production/integration?
  • How do I bring a major model update into this set-up?

All these questions arise from the complexity of the machine learning lifecycle. Because data changes over time, we are, even in productive ML settings, pretty much constantly in a loop of collecting data, exploring models, refining models, testing/evaluating, deploying, and finally monitoring our models.

ML lifecycle

How can we keep control over this complex loop that is constantly evolving?

Actually there is a very simple solution — Write everything down! Or in computer terminology:

LOGGING OF ALL THE RELEVANT INFORMATION!

A fancy name for this is machine learning model management, a vital part of MLOps: the constant process of capturing relevant information while the software is executed and making automated decisions based on it. Metrics from preparation, training, and evaluation, as well as trained models, preprocessing pipelines, and hyperparameters, are all saved away neatly in a model database, ready to be compared, analyzed, and used to pick out the one specific model that will serve as the prediction layer. In theory you could use any normal database as a backend and write your own API to measure and store all this information. In practice, however, we can make use of some pretty cool pre-built frameworks designed for exactly this task. Azure Machine Learning, Polyaxon, and mldb.ai are just a few examples. This article focuses on MLflow. MLflow was developed by the folks at Databricks and integrates seamlessly into their ecosystem. As I am already doing all my ETL as well as hyperparameter tuning on Databricks (using the hyperopt framework), it was an easy choice of model management framework.

MLflow Deep Dive

MLflow is an “open source platform for the machine learning lifecycle”. The managed version on Databricks is especially easy to use and you can even create an empty MLflow experiment using the user interface by clicking on “New MLflow Experiment” below.

Azure Databricks

You don’t have to worry about setting up a backend database, everything is taken care of automatically. Once the experiment is available, the MLflow tracking API can be used in your code to log various information. For a quick walk-through, let’s have a look at an example:

Sklearn’s Wine Dataset

I had a look at the open source wine dataset from sklearn. It contains the results of a chemical analysis of wines grown by three different Italian cultivators. The goal is to create a classifier that predicts the wine class from the given features, such as alcohol, magnesium, and color intensity.

You can start by opening up a new Databricks notebook and load the following packages (make sure they are installed):

import warnings
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
warnings.filterwarnings("ignore")

Now you can load the data into a Pandas dataframe:

# Load wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target
#create pandas df:
Y = np.array([y]).transpose()
d = np.concatenate((X, Y), axis=1)
cols = ["Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins", "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline", "class"]
data = pd.DataFrame(d, columns=cols)
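
Just as a quick sanity check (not part of the original walkthrough), we can confirm that the dataframe has the expected shape and class distribution:

# Optional sanity check: 178 samples, 13 features plus the target column
print(data.shape)
# Three classes with roughly 59/71/48 samples each
print(data["class"].value_counts())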

Set up logging

First we need to decide what we would like to log during the training phase. In principle, MLflow lets us log anything. Yes, anything! In the most basic form, we can write any information to a file and have MLflow log it as an artifact (a minimal sketch follows the list below). Pre-built logging functionality includes:

  • run ID
  • date
  • parameters (e.g.: hyperparameters)
  • metrics
  • models (trained models)
  • artifacts (e.g.: trained encoders, preprocessing pipelines etc)
  • tags (e.g.: model version, model name, other meta information)
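
As a minimal sketch of the “log anything” idea (the file name and values below are made up purely for illustration), any file written to disk can be attached to a run as an artifact, alongside parameters, metrics, and tags:

import json
import mlflow

with mlflow.start_run():
    # any file on disk can be logged as an artifact of the run
    with open("feature_list.json", "w") as f:
        json.dump({"features": ["Alcohol", "Magnesium", "Color intensity"]}, f)
    mlflow.log_artifact("feature_list.json")
    # parameters, metrics, and free-form tags use dedicated logging calls
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.98)
    mlflow.set_tag("model_version", "1.0")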

In this example we will stick to the basics and log two metrics, the hyperparameters, and the trained model, and give the run a model name tag. To simplify things a bit, we build a small wrapper around scikit-learn’s .fit() method that takes care of the MLflow logging. We can then call the wrapper with different hyperparameters, see how performance varies, and have everything logged to our MLflow backend.

Let’s start with an evaluation function, including accuracy and the F1 score:

from sklearn.metrics import accuracy_score, f1_score

def eval_metrics(actual, pred):
    acc = accuracy_score(actual, pred)
    f1 = f1_score(actual, pred, average="weighted")
    return acc, f1

and then write our wrapper around scikit-learn’s RandomForestClassifier, giving it two hyperparameters to tweak: the number of trees in the forest and the fraction of feature columns to consider at each decision split within a tree.

def train_winedata(data, in_n_estimators, in_max_features):

    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # Split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data, stratify=data["class"])

    # The predicted column is "class"
    train_x = train.drop(["class"], axis=1)
    test_x = test.drop(["class"], axis=1)
    train_y = train[["class"]]
    test_y = test[["class"]]

    # start run
    with mlflow.start_run():
        rf = RandomForestClassifier(n_estimators=in_n_estimators,
                                    max_features=in_max_features)
        rf.fit(train_x, train_y)
        acc, f1 = eval_metrics(test_y, rf.predict(test_x))
        print(" Accuracy: %s" % acc)
        print(" F1 Score: %s" % f1)

        # log metrics
        mlflow.log_metric("F1", f1)
        mlflow.log_metric("Accuracy", acc)

        # log params
        mlflow.log_param("n_estimator", in_n_estimators)
        mlflow.log_param("max_features", in_max_features)

        # add tags
        mlflow.set_tag("model_name", "Wineforest")

        # save the trained model
        mlflow.sklearn.log_model(rf, "model")

Training the model

Before running the training we need to tell our MLflow Client where to log all this info. We will have it point to the empty MLflow experiment we created earlier via the Databricks UI:

mlflow.set_experiment("/Users/<your user>/WineClassifier")
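
If you are running this code outside a Databricks notebook (for example locally), you would additionally point the MLflow client at the Databricks workspace first. A minimal sketch, assuming the Databricks CLI is configured (or DATABRICKS_HOST and DATABRICKS_TOKEN are set):

import mlflow

# only needed outside a Databricks notebook; inside one the tracking URI is already set
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Users/<your user>/WineClassifier")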

Now we can run our first set of hyperparameters:

train_winedata(data, in_n_estimators=1, in_max_features=0.1)

Accuracy: 0.733
F1 Score: 0.731

We can see that a fairly decent accuracy and F1 score are achieved by a single tree using only 10% of the features per split.

Rerunning the training with max_features raised to 50%:

train_winedata(data, in_n_estimators=1, in_max_features=0.5)

Accuracy: 0.933
F1 Score: 0.931

This already gives a really good score. Including 100 trees in the random forest, however, increases the performance even more:

train_winedata(data, in_n_estimators=100, in_max_features=0.5)

Accuracy: 0.98
F1 Score: 0.98

Let’s have a quick look at the outcome in our MLflow experiment:

MLflow user interface

We can see a nice GUI that includes all our logged info. We can also deep-dive into a single run by clicking on its date. There we find the logged model as an artifact, ready to be loaded and used to make predictions. We can now set up our serving layer to fetch the best model (e.g. by F1 score) and automatically deploy it to our production environment. We can also explore the effect of the different logged hyperparameters and potentially adjust the search space when doing an extensive grid search.
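
As a sketch of such a selection step (one possible implementation, not the only way to do it, reusing the experiment name and the "F1" metric key from above), we can query the tracking backend for the best run and load its model:

# find the run with the highest logged F1 score in our experiment
exp = mlflow.get_experiment_by_name("/Users/<your user>/WineClassifier")
runs = mlflow.search_runs(experiment_ids=[exp.experiment_id],
                          order_by=["metrics.F1 DESC"], max_results=1)
best_run_id = runs.loc[0, "run_id"]

# load the model artifact logged under the "model" path and use it for predictions
best_model = mlflow.sklearn.load_model(f"runs:/{best_run_id}/model")
predictions = best_model.predict(data.drop(["class"], axis=1))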

Architecture setup for production-grade machine learning

When building productive software we always have to keep in mind that one single environment most likely won’t do the trick. If we have a productive setup running and we want to make changes to the code we need at least an integration environment where we can first test any changes.

As training data grows and changes over time, we also need to define cycles in which we retrain our model, redo the hyperparameter tuning, or even re-evaluate the type of machine learning model altogether. This needs to happen in an automated fashion, and the best model (evaluated on a fixed external dataset) that has been successfully tested in the integration environment needs to be deployed in production at all times. If we change the external test dataset, or the code changes in a major way, we might want to introduce a new model version and restrict which model versions are even eligible for deployment. A somewhat complicated process.

The model registry within MLflow provides a tool that helps with this process. Here we can register trained models from our MLflow backend under different names, versions, and stages, with stages ranging from “None” through “Staging” to “Production”.

Model Registry entry for model “winetest”

To register our wine classifier from earlier, we only need to grab its model_uri from MLflow and run:

mlflow.register_model(model_uri=model_uri, name="winetest")
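
For completeness, a small sketch of where the model_uri can come from and how the registered version can then be moved through the stages (reusing the best_run_id from the selection sketch above; the client call shown is the standard Model Registry API):

from mlflow.tracking import MlflowClient

# the model_uri points at the "model" artifact of a specific run
model_uri = f"runs:/{best_run_id}/model"
registered = mlflow.register_model(model_uri=model_uri, name="winetest")

# afterwards the new version can be promoted, e.g. to "Staging"
MlflowClient().transition_model_version_stage(
    name="winetest", version=registered.version, stage="Staging")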

Let’s have a closer look at how we can manage two basic ML lifecycle scenarios with this setup: periodically retraining and retuning the already deployed PROD model, and deploying new or updated model code into the productive system. The main difference is that the latter needs to go through a testing phase on INT because the code has changed, while the former does not: the code stays the same, only the performance of the model might have changed.

Model Management setup using MLflow

Think of the following scenario: a model named AImodel, currently on version 1.0, is registered as “Production” and also deployed on PROD. We retrain and retune the model every x days. Once retraining and retuning kicks off, we grab the model from the registry, run our training and tuning pipeline, and log the new performance of AImodel version 1.0 to MLflow. Now we can start an automatic selection process: we compare the performance of the newly trained AImodel 1.0 with all existing AImodel 1.0 runs on an external dataset and deploy the best model (according to some metric) by overwriting the AImodel 1.0 currently sitting in the model registry at the “Production” stage with our selected model. We still have AImodel 1.0 in PROD, only with a (hopefully) better performance, since it was trained on newer (and more) data.

That is pretty straightforward and allows for constantly deploying the best AImodel 1.0 in production. What if we want to change something in the code of AImodel 1.0 or even introduce MLmodel 1.0 to compete with AImodel 1.0 for the production slot? We need to make use of our integration environment!

Deploying code into an environment is managed by a DevOps pipeline, which gets triggered by a pull request on the int or master Git branch respectively. That gets the INT or PROD version of our code into the respective environment. So let’s imagine we added distributed training to the code of AImodel 1.0. We deploy what’s on the INT Git branch through DevOps into our INT environment, run the newly configured training, again logging to MLflow (giving the model an INT tag), and register the model as AImodel 2.0 in the “Staging” stage.

Now the key concept: both environments automatically choose the model to use as their serving layer according to some criterion (as explained above, using a performance metric). Since we only want AImodel 2.0 to be used (and therefore tested) on INT and not yet on PROD, we introduce model version allowances per environment. This means PROD is only allowed to select from models ≤ version 1.0, whereas INT is allowed to select versions ≤ 2.0 (or, if we specifically want to test version 2.0: = 2.0). Upon completion of the test phase on INT, we can transition the model to “Production” in the registry and raise the version allowance on PROD to ≤ 2.0 (or = 2.0, or 1.0 ≤ PROD version ≤ 2.0, etc.). Now the new code (new model) is ready for the PROD environment.

So essentially, in every new training loop on INT we compare and potentially replace the current model in “Staging” with a subset of models from MLflow, selected according to the version allowance on INT (and e.g. the model name tag, etc.). The same happens on PROD, according to the version allowance on PROD, using the model sitting in “Production”. In case of major updates to the code (such as wanting to use a completely different model) we can always register the model under a new name and start over at version 1.0.
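
A hedged sketch of how such a version-allowance selection could look in code (the helper below is hypothetical and just one way to implement it; registry versions are plain integers, so the article’s 1.0/2.0 correspond to versions 1 and 2):

import mlflow
from mlflow.tracking import MlflowClient

def select_serving_model(model_name, max_version, metric="F1"):
    # hypothetical helper: among all registered versions of `model_name` that fall
    # within the environment's version allowance, pick the one whose run logged
    # the best value for `metric` and load it for serving
    client = MlflowClient()
    allowed = [mv for mv in client.search_model_versions(f"name='{model_name}'")
               if int(mv.version) <= max_version]
    if not allowed:
        return None
    best = max(allowed,
               key=lambda mv: client.get_run(mv.run_id).data.metrics.get(metric, float("-inf")))
    return mlflow.pyfunc.load_model(f"models:/{model_name}/{best.version}")

# PROD may only select from versions <= 1, INT is already allowed to test version 2
prod_model = select_serving_model("AImodel", max_version=1)
int_model = select_serving_model("AImodel", max_version=2)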

This setup allows us to automatically deploy the best performing model while still being able to introduce new code and models into the system smoothly. However, it is still a somewhat basic setup: in most cases tuning and training will run asynchronously, since the tuning job is very compute intensive and might, for example, only be executed every 10th training loop. Here we can think about using different MLflow backends, one for tuning and one for training, to keep things clear and manageable.
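
One simple way to keep the two apart (just one option, and the experiment names below are made up) is to log tuning runs and training runs into separate experiments:

# hypothetical split: one experiment for the expensive tuning runs ...
mlflow.set_experiment("/Users/<your user>/WineClassifier_tuning")
# ... log hyperopt trials here ...

# ... and one experiment for the periodic retraining runs
mlflow.set_experiment("/Users/<your user>/WineClassifier_training")
# ... log the regular training runs here ...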

Overall the tooling available in MLflow allows for a very flexible model management setup that can be tailored directly to the use case’s needs and fulfills production-grade requirements. We can also imagine building a dashboard on top of our MLflow backend, allowing power users to track the performance, version status as well as parameter and feature selections without having to open Databricks.

Try it out and take your machine learning application to the next level :)

References:

MLflow

MLflow Model Registry
