Automate ML model retraining and deployment with MLflow in Databricks

Efficiently manage and deploy production models with MLflow

Matt Collins
Towards Data Science



Getting a working machine learning model deployed for user consumption is a great achievement. Industry statistics repeatedly show that many machine learning models never make it into production, whether due to insufficient data, a lack of direction or other reasons.

Models which do make it into production still face many challenges, as they require consistent attention in the form of monitoring and retraining to ensure the insights they aim to provide stay up-to-date and accurate over time.

This blog aims to help streamline the model retraining process with MLflow while providing background on the recommended approach.

Why retrain a production model?

Model retraining is the process of giving our production model access to the latest data to run up-to-date predictions. Depending on the sophistication of our system, we might perform this retraining under various scenarios, such as:

  • At regular intervals: Such as on a weekly schedule (a minimal scheduling sketch follows this list).
  • Upon certain criteria: Data drift hitting a threshold condition may result in us retraining to accommodate a changing data landscape.
  • Upon registration of a new model: Our Data Scientists have found a more accurate model which has been approved to go live.

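For the first scenario, the sketch below shows one way the retraining notebook could be scheduled weekly through the Databricks Jobs API (version 2.1). This is a hedged illustration, not part of the article's notebooks: the workspace URL, access token, notebook path and cluster ID are all placeholders you would replace with your own values.

# Hedged sketch: schedule the retraining notebook weekly via the Databricks Jobs API (2.1).
# The workspace URL, token, notebook path and cluster ID below are placeholders.
import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

job_config = {
    "name": "weekly-model-retraining",
    "tasks": [
        {
            "task_key": "retrain",
            "notebook_task": {"notebook_path": "/Repos/ml/diabetes_retraining"},
            "existing_cluster_id": "<ml-cluster-id>",
        }
    ],
    # Run every Monday at 06:00 UTC
    "schedule": {
        "quartz_cron_expression": "0 0 6 ? * MON",
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
}

response = requests.post(
    f"{workspace_url}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_config,
)
print(response.json())  # returns the new job_id on success
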
MLflow

Model retraining falls under the Machine Learning Operations (MLOps) process, and MLflow is a great tool for simplifying it in an iterative fashion, allowing smoother delivery with reproducible executions.

If you are new to MLflow there are many detailed resources available online, but I recommend starting with their website to see their offerings; the core components are MLflow Tracking, Projects, Models and the Model Registry.

We will be making use of the MLflow Tracking component to log our retraining experiment runs and the Model Registry component to ensure deployment is seamless and mitigates the need for downtime in our production environment.

Pre-requisites

As we’re talking about retraining, we’ve made the assumption that you already have a model (and the data you wish to predict against) in production. If you do not, and you wish to use MLflow to achieve this, I’ve provided this notebook to get started. We’ll review this process and code shortly in the “Deploy an initial model to production” section.

For convenience, we'll be using a Databricks workspace with an ML compute cluster, since this provides a managed environment with all of the required packages installed, the MLflow interface embedded, and a Spark environment to assist with any big data queries through parallel processing, if required.

Databricks is available through most cloud providers — I’ll be using Microsoft Azure. If running MLflow locally, then ensure all of the relevant packages are installed and an MLflow tracking server is set up.
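
If you do go the local route, a minimal sketch of pointing the notebooks at your own tracking server is shown below; it assumes a server has already been started (for example with the mlflow server CLI) and is reachable on port 5000.

# Hedged sketch for a non-Databricks setup: point MLflow at a local tracking
# server and model registry (the URL assumes a server running on port 5000).
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_registry_uri("http://localhost:5000")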

I’ve broken down these requirements below:

  • A Databricks workspace running an ML compute cluster, simulating a production environment.
  • Source data: In production, I would expect this to be tables in our Data Warehouse or Lake. We’ll just use the Diabetes dataset in the Scikit-learn package for this example.
  • An existing Machine Learning model which we will save as a production model in the Model Registry (code sample below).

Deploy an initial model to production

The following code block shows an experiment run for a Ridge Regression model. For the full notebook, see this link in my GitHub Repo. This example is designed to provide you with a “production” model on which you can base the model retraining process.

# Imports required for this experiment
import mlflow
from math import sqrt
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Start MLflow run for this experiment

# End any existing runs
mlflow.end_run()

with mlflow.start_run() as run:
    # Turn autolog on to save model artifacts, requirements, etc.
    mlflow.autolog(log_models=True)

    diabetes_X = diabetes.data
    diabetes_y = diabetes.target

    # Split data into test and training sets, 3:1 ratio
    diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(diabetes_X, diabetes_y, test_size=0.25, random_state=42)

    alpha = 1
    solver = 'cholesky'
    regr = linear_model.Ridge(alpha=alpha, solver=solver)

    regr.fit(diabetes_X_train, diabetes_y_train)

    diabetes_y_pred = regr.predict(diabetes_X_test)

    # Log desired metrics
    mlflow.log_metric("mse", mean_squared_error(diabetes_y_test, diabetes_y_pred))
    mlflow.log_metric("rmse", sqrt(mean_squared_error(diabetes_y_test, diabetes_y_pred)))
    mlflow.log_metric("r2", r2_score(diabetes_y_test, diabetes_y_pred))

Use the MLflow API commands to push this to production in your Model Registry.

from mlflow.client import MlflowClient

client = MlflowClient()
name = 'DiabetesRegressionLab'  # registered model name, reused in the retraining notebook

# Artifact location of the run logged above (experiment and run ID placeholders left to fill in)
model_uri = "dbfs:/databricks/mlflow-tracking/<>/<>/artifacts/model"
desc = 'Initial model deployment'
new_run_id = run.info.run_id
client.create_model_version(name, model_uri, new_run_id, description=desc)
version = client.search_model_versions("run_id='{}'".format(new_run_id))[0].version
client.transition_model_version_stage(name, version, "Production")

Finding the right approach: Deploy Code vs Deploy Model

To understand what our retraining notebook might look like, we need to understand the approach taken. Microsoft discusses two deployment patterns which I’ve summarised below. More in-depth information can be found here.

Deploy Code

  • The code that produces ML artifacts, rather than the model itself, is promoted from development through to production.
  • Version control and testing can be implemented.
  • The deployment environment is reproduced in production, reducing the risk of production issues.
  • Production models are trained against the production data.
  • Additional deployment complexity.

Deploy Model

  • Standalone artifacts (the trained machine learning model) are deployed to production.
  • Flexibility to deploy to different types of environments or integrate with different services.
  • Simplicity in the deployment process.
  • Fast deployment time with easy versioning.
  • Changes and enhancements to feature engineering, monitoring, etc. need to be managed separately.

We’ll be taking the recommended approach of Deploying Code. This lends itself nicely to what we are trying to achieve: we can take our Production-ready, stakeholder-approved script and use this for the retraining process.

We’re not changing any parameter values in our model: we are simply retraining the model against the latest data.

Start by setting the experiment and model that we’re retraining.

# Import packages
import time
from math import sqrt

import mlflow
import pandas as pd
from mlflow.client import MlflowClient
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Set the experiment name to an experiment in the shared experiments folder
mlflow.set_experiment("/diabetes_regression_lab")

client = MlflowClient()

# Set model name
name = 'DiabetesRegressionLab'

Load the dataset. As mentioned, we’re simplifying this requirement by using the Diabetes dataset from the scikit-learn package. In reality, this might be a select statement against a table.

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
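
For comparison, a hedged sketch of what that production equivalent might look like in a Databricks notebook (where spark is predefined) is shown below; the table name and target column are hypothetical placeholders.

# Hedged sketch: in production the training data would more likely come from a governed table.
# "prod.ml.diabetes_observations" and the "target" column are hypothetical placeholders.
features_df = spark.table("prod.ml.diabetes_observations").toPandas()
diabetes_X = features_df.drop("target", axis=1).values
diabetes_y = features_df["target"].values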

In this case we’re simulating this model retraining occurring after some data change, possibly a few days later. To simulate this time delta, the current registered production model has been trained on a subset of the data (1.) and we’re using the full dataset to show further data being added over time (2.)

# 1. Mimic results from a week ago, used by our registered production model
diabetes_X = diabetes.data[:-20]
diabetes_y = diabetes.target[:-20]

# 2. Dataset as of point of retraining, used in our latest experiment run
diabetes_X = diabetes.data
diabetes_y = diabetes.target

Note that there might be a use case to remove some historical data from your training data set, should data drift be detected.
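
As a hedged sketch of that idea, training could be restricted to the most recent observations when drift is flagged; the drift_detected flag and window size below are hypothetical placeholders, and rows are assumed to be ordered chronologically.

# Hedged sketch: if drift is detected, keep only the most recent observations.
window_size = 300
drift_detected = False  # would come from your monitoring process

if drift_detected:
    diabetes_X = diabetes_X[-window_size:]
    diabetes_y = diabetes_y[-window_size:]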

Once loaded, kick off an MLflow run and start training. This will follow the usual process of splitting the data, training, predicting and comparing to the test dataset.

# Start MLflow run for this experiment: This is similar to your experimentation script
mlflow.end_run()

with mlflow.start_run() as run:
    # Turn autolog on to save model artifacts, requirements, etc.
    mlflow.autolog(log_models=True)

    # Split data into test and training sets, 3:1 ratio
    diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(diabetes_X, diabetes_y, test_size=0.25, random_state=42)

    alpha = 1
    solver = 'cholesky'
    regr = linear_model.Ridge(alpha=alpha, solver=solver)

    regr.fit(diabetes_X_train, diabetes_y_train)

    diabetes_y_pred = regr.predict(diabetes_X_test)

    # Log desired metrics
    mlflow.log_metric("mse", mean_squared_error(diabetes_y_test, diabetes_y_pred))
    mlflow.log_metric("rmse", sqrt(mean_squared_error(diabetes_y_test, diabetes_y_pred)))
    mlflow.log_metric("r2", r2_score(diabetes_y_test, diabetes_y_pred))

Validation Criteria

It is good practice to use validation criteria to determine if our new model performs at least as well as the existing one before replacing it in production. This process helps to ensure the new model is reliable and minimises the risk of performance degradation. In this case, I’m simply using the mse, rmse and r2 values as my validation metrics.

Now that we’ve logged a new run, we can compare this to the run which is currently in production.

# Collect latest run's metrics
new_run_id = run.info.run_id
new_run = client.get_run(new_run_id)
new_metrics = new_run.data.metrics

# Collect production run's metrics
prod_run_id = client.get_latest_versions(name, stages=["Production"])[0].run_id
prod_run = client.get_run(prod_run_id)
prod_metrics = prod_run.data.metrics

# Collate metrics into DataFrame for comparison
columns = ['mse','rmse','r2']
columns = ['version'] + [x for x in sorted(columns)]
new_vals = ['new'] + [new_metrics[m] for m in sorted(new_metrics) if m in columns]
prod_vals = ['prod'] + [prod_metrics[m] for m in sorted(prod_metrics) if m in columns]
data = [new_vals, prod_vals]

metrics_df = pd.DataFrame(data, columns=columns)
metrics_df
Metrics DataFrame: Image by author

This is a simple example where our model uses a Ridge regression algorithm. In reality, our model may be composite and use a hyperparameter search space, comparing multiple algorithms and automatically deciding the “best” one to use under complex validation criteria. The same concept can be applied, and the successful model published to the Model Registry for consumption.
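
As a hedged sketch of that pattern, the fixed alpha and solver above could be replaced with a small scikit-learn grid search inside the same MLflow run; with autologging enabled, the search is typically recorded as a parent run with child runs per candidate. The parameter grid and scoring choice below are illustrative, not from the original notebook.

# Hedged sketch: retrain with a small hyperparameter search instead of fixed values.
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": [0.1, 0.5, 1.0], "solver": ["cholesky", "lsqr"]}

mlflow.end_run()
with mlflow.start_run() as run:
    mlflow.autolog(log_models=True)

    search = GridSearchCV(
        linear_model.Ridge(),
        param_grid,
        scoring="neg_mean_squared_error",
        cv=5,
    )
    search.fit(diabetes_X_train, diabetes_y_train)

    # The best estimator is what would be compared against production and registered
    best_regr = search.best_estimator_
    diabetes_y_pred = best_regr.predict(diabetes_X_test)
    mlflow.log_metric("mse", mean_squared_error(diabetes_y_test, diabetes_y_pred))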

Promote in the Model Registry

We can then use the following code to automatically move this run to Production in the Model Registry, provided it meets the validation criteria specified:

# Retrieve validation variables from the metrics DataFrame
new_mse = metrics_df[metrics_df['version'] == 'new']['mse'].values[0]
new_rmse = metrics_df[metrics_df['version'] == 'new']['rmse'].values[0]
new_r2 = metrics_df[metrics_df['version'] == 'new']['r2'].values[0]

prod_mse = metrics_df[metrics_df['version'] == 'prod']['mse'].values[0]
prod_rmse = metrics_df[metrics_df['version'] == 'prod']['rmse'].values[0]
prod_r2 = metrics_df[metrics_df['version'] == 'prod']['r2'].values[0]

# Check new model meets our validation criteria before promoting to production
if (new_mse < prod_mse) and (new_rmse < prod_rmse) and (new_r2 > prod_r2):
    model_uri = "dbfs:/databricks/mlflow-tracking/<>/<>/artifacts/model"
    print('run_id is: ', new_run_id)

    desc = 'This model uses Ridge Regression to predict diabetes.'

    client.create_model_version(name, model_uri, new_run_id, description=desc)
    to_prod_version = client.search_model_versions("run_id='{}'".format(new_run_id))[0].version
    to_archive_version = client.search_model_versions("run_id='{}'".format(prod_run_id))[0].version

    # Transition new model to Production stage
    client.transition_model_version_stage(name, to_prod_version, "Production")

    # Wait for the transition to complete
    new_prod_version = client.get_model_version(name, to_prod_version)
    while new_prod_version.current_stage != "Production":
        new_prod_version = client.get_model_version(name, to_prod_version)
        print('Transitioning new model... Current model version is: ', new_prod_version.current_stage)
        time.sleep(1)

    # Transition old model to Archived stage
    client.transition_model_version_stage(name, to_archive_version, "Archived")

else:
    print('no improvement')

In our case, all validation criteria were met, so the model has been pushed to Production in the Model Registry. Any utilisation of the model, batch or real-time, will now be against this version.

You will hopefully start to notice the benefit of the Deploy Code approach at this point, as it gives us complete control over the scripts used to retrain a model and also accommodates changing validation criteria for automatic redeployment.

The full notebook can be accessed in my GitHub Repository if you wish to see the code in full.

Utilisation

We can continue using inference pipelines and REST API calls to access the model as before, taking care to account for any schema changes that have taken place.

This is something we can factor into our deployment code to the Model Registry as a reminder, so that schema changes do not become breaking changes with end-user impact.
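
For example, a batch consumer can always resolve whichever version currently sits in the Production stage through the stage-based model URI; a minimal sketch using the variables defined earlier:

# Load whichever model version is currently in the Production stage and score new data
import mlflow.pyfunc

prod_model = mlflow.pyfunc.load_model(f"models:/{name}/Production")
predictions = prod_model.predict(diabetes_X_test)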

Notes/Considerations

As of MLflow 2.0, Recipes is an experimental feature (at the time of writing) which provides a streamlined approach to some of this functionality, with particular reference to the validation criteria. I expect further development of Recipes to give users a well-structured and repeatable approach for the model deployment and retraining elements of the ML lifecycle.

It is also worth noting that MLOps is still relatively new and we certainly see that in the variety of approaches different businesses and users are taking in implementing ML solutions. Standardising the approach is still a work in progress, with additional components such as explainability and monitoring being more prevalent in the lifecycle.

Thanks for reading and let me know if you have any questions.
