Feature Factories pt 2: An Introduction to MLFlow

Ben Epstein
Towards Data Science
Sep 25, 2019


If you read my first article, you hopefully have a good understanding of what a feature factory is, why it's important, and a general idea of how to best cultivate one. If you haven't, I suggest you check it out first. In this follow-up article, I want to begin diving into MLFlow and introduce its major concepts with some code examples.

ML Lifecycle

To begin, let’s create a common understanding of a classic machine learning lifecycle:

A business problem is identified where the application of machine learning might be valuable

A team gathers a large dataset

Data engineers begin cleaning and standardizing the data, preparing it for analysis

Subject matter experts dig into the data to find useful new signals (features)

Data scientists research the best algorithms for the problem at hand, recording all of their individual runs

Once a model is decided upon, the dev-ops team takes that model and deploys it to a serving platform such as Sagemaker, Azure ML, or a bespoke implementation like a container wrapped in a web service.

There are a few common pain points in this process, primarily lifecycle/deployment management and analytics tracking.

Deploying a trained machine learning model is not yet as straightforward as one might expect, and if you've ever tried figuring out how to do it, you may have found that the documentation surrounding it is not as comprehensive as it is for other stages of the ML lifecycle. Additionally, after deployment, you still have to handle monitoring, retraining, and redeployment. Without a standardized process, this can take a lot of time away from your dev-ops team, and if your company or team were to switch cloud services (AWS to Azure, for example), you would have to start the learning process all over again.

Outdated data science organization

If you've ever seen a chart like this, you know the long and arduous process of keeping track of your data science projects. Create a new spreadsheet for each problem you're working on, save it on some cloud storage (hopefully) shared by everyone in your company, make sure your team can see all of your trials so they don't waste time trying the same setup; you get the idea.

I used to track my projects in Excel, and you may have had another tool you liked, but it was most likely a personal log that was hard to share and build on with others. It's hard to keep track of who tried what, what worked, what didn't, and which dataset was used. As you can see from the picture above, there is a lot missing from that sheet, and adding things to it could force you to redesign it entirely. What if someone wanted to try gradient boosting, or wanted to add one of the many other neural network hyperparameters used today? Clearly, this approach to tracking is not very sustainable.

Introducing MLFlow

MLFlow is an open-source solution for both the data scientist and the dev-ops engineer. In my last article, I explained at a high level why MLFlow is so great, so here I'd like to dive in, starting with governance. We'll walk through how MLFlow can be used to deliver governance and organize experiments. We'll also touch on what details we can track to help the model building process, and how to select and deploy models.

Governance

Governance is the idea of having a full bird's-eye view of your data science process, from raw data to ingestion to experimentation to modeling, and even deployment. It is very important to have governance over any process that can affect something of value, be it monetary, regulatory, or, especially, user facing (e.g., ML models making real-time decisions). Therefore, the data science process must be tracked and managed. We need to know where the models came from, what kind and depth of testing was done on them, and the exact data that went into them before we authorize their deployment into production.

MLFlow allows you to organize your data science artifacts into two main categories:

Experiments — An overarching problem you are trying to solve, ex: Point of sale fraud. Typically created around a common dataset

Runs — Each individual “attempt” at feature engineering and model training: Each row of the Excel sheet (with much more flexibility)

What's great about this organizational design is that it lets you decide what metadata to track between runs and does not force you to commit to a common set of parameters. You can store anything you want in a given run, and it does not need to match the last run exactly. For example, for a given run you can track any parameters you want (model type, hyperparameters, train/test split) and any metrics you'd like (fpr, f1), even if the run is for a different kind of problem (binary/multiclass classification, regression).
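
To make that concrete, here is a minimal sketch of what this looks like with vanilla MLFlow; the experiment name, parameters, and metric values below are made up purely for illustration:

```python
import mlflow

# One experiment per business problem; each run logs whatever it wants
mlflow.set_experiment("point_of_sale_fraud")  # created if it doesn't already exist

# Run 1: a tree-based model with its own parameters and metrics
with mlflow.start_run(run_name="gbt_baseline"):
    mlflow.log_param("model_type", "GBTClassifier")
    mlflow.log_param("train_test_split", 0.8)
    mlflow.log_metric("f1", 0.87)

# Run 2: a completely different shape of run under the same experiment
with mlflow.start_run(run_name="logreg_l2"):
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_param("reg_param", 0.01)
    mlflow.log_metric("auc", 0.91)
```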

Enough talk, let’s get to the demo:

For the purpose of this demo, I’m going to use Splice Machine’s MLManager because it’s what I typically use day-to-day (since it runs natively on Splice DB), but for the most part it will look the same as vanilla MLFlow, and I will point out any changes necessary for the example to work outside Splice Machine.

Groundwork

In this demo, we will be using a modified Kaggle dataset to predict order delays in a supply chain environment. As is typical on a data science team, we will have more than one data scientist collaborating on this project, each performing a number of transformations and trying different modeling techniques.

To start, we import the necessary libraries and create an MLManager object for tracking (note that with vanilla MLFlow there is no manager object; simply replace manager with mlflow). We then start our experiment with the create_experiment function and pass in a name. If that experiment already exists, it will automatically become the active experiment.
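
As a rough sketch, the vanilla-MLFlow version of this setup might look like the following; the experiment name and file path are placeholders, and with MLManager the equivalent call is manager.create_experiment:

```python
import mlflow
from pyspark.sql import SparkSession

# Set the active experiment; with vanilla MLFlow, set_experiment creates it if missing
mlflow.set_experiment("supply_chain_order_delays")

# Ingest the (modified Kaggle) dataset into a Spark dataframe -- the path is a placeholder
spark = SparkSession.builder.appName("order-delays").getOrCreate()
df = spark.read.csv("data/order_delays.csv", header=True, inferSchema=True)
```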

After creating our manager, we can ingest data into a dataframe and begin analyzing and transforming. It’s important to note the ways in which you can record (log) information to MLFlow:

log_param (or lp): Allows you to log any model/pipeline parameters you’d like. Think train/test split, hyperparameters, stages of a transformation pipeline, a description of a specific source dataset.

log_params: MLManager specific function that takes a list of parameters to log and handles them all automatically

log_metric (or lm): Allows you to log any model metrics. Think train time, accuracy, precision, AUC, etc.

log_metrics: MLManager specific function that takes a list of metrics and handles them all automatically

log_artifact: Allows you to log objects to a specific run. Think charts, serialized models, images, etc.

log_artifacts: MLManager specific function that takes a list of artifacts and handles them all

log_feature_transformation: MLManager specific function that takes an untrained Spark Pipeline and logs all of the transformation steps of every feature in your dataset

log_pipeline_stages: MLManager specific function that takes a Spark Pipeline (fit or unfit) and logs all of the stages in a readable format

log_model_params: MLManager specific function that takes a fitted Spark model or Pipeline model and logs all of the parameters and hyperparameters

log_evaluator_metrics: MLManager specific function that takes a Splice Evaluator and logs all of its metrics

log_spark_model: MLManager specific function that takes a trained Spark model and logs it for deployment. This was created because, with Splice Machine’s implementation, the model is stored directly in the Splice Machine database instead of S3 or in external storage.
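
As a small example of the artifact side of this list, here is how you might log a chart to a run with vanilla MLFlow, using the df dataframe from the setup sketch above; the plot itself, and the column it summarizes, are purely illustrative:

```python
import matplotlib.pyplot as plt
import mlflow

with mlflow.start_run(run_name="eda"):
    # Build a quick histogram of the (hypothetical) delay column and attach it to the run
    fig, ax = plt.subplots()
    df.select("ORDER_DELAY").toPandas()["ORDER_DELAY"].hist(ax=ax)
    ax.set_title("Order delay distribution")
    fig.savefig("order_delay_hist.png")
    mlflow.log_artifact("order_delay_hist.png")
    plt.close(fig)
```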

What’s great about MLFlow is that it is flexible enough to allow you to design your workflow in any way you want, getting as granular or high-level as your team needs and the desired governance process requires.

If this is a bit mysterious, the demo below should make it more understandable.

Starting a Run

We like to stay very organized, so we will start our first run right away and log all feature transformations and discoveries as we go. As we make our various new features and feature transformations, we can log each of those individually using the log_param function, or we can use the (Splice specific) log_feature_transformations function and pass in a Spark Pipeline object. Please note that this bulk logging function only works with Spark ML Pipelines.

‘dataSize’ is simply an arbitrary key:value MLFlow tag
Preprocessing our dataset by converting our string columns into numeric and creating a feature vector
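
A rough sketch of that preprocessing step with Spark ML and vanilla MLFlow might look like this; column names other than CITY_destination are placeholders, not the actual Kaggle schema:

```python
import mlflow
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

mlflow.start_run(run_name="feature_engineering")
mlflow.set_tag("dataSize", df.count())  # the arbitrary 'dataSize' tag mentioned above

# Index and one-hot encode the string columns, then assemble everything into featuresVec
string_cols = ["CITY_destination", "SHIP_MODE"]   # SHIP_MODE is a placeholder column
numeric_cols = ["QUANTITY", "DISCOUNT"]           # placeholder numeric columns
stages = []
for c in string_cols:
    stages.append(StringIndexer(inputCol=c, outputCol=c + "_idx"))
    stages.append(OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_ohe"))
stages.append(VectorAssembler(
    inputCols=[c + "_ohe" for c in string_cols] + numeric_cols,
    outputCol="featuresVec"))
pipeline = Pipeline(stages=stages)

# In the Splice demo, this untrained pipeline is what gets passed to
# manager.log_feature_transformations(pipeline)
```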

In the cell above, we call manager.log_feature_transformations and pass in a Spark Pipeline. As noted above, this is a Splice Machine specific function and will only work with MLManager. To reproduce the output of this function yourself, you would loop through the stages of your pipeline and the columns of your dataframe, "trace" the path of each column to see which Transformers and Estimators affect it, and log those findings accordingly.

Feature transformations for each column

The table above describes the preprocessing steps applied to each column. For example, the column CITY_destination was first indexed from a string value to an integer value using a StringIndexer, then one-hot encoded, and finally assembled into a feature vector in a final column called featuresVec.
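
With vanilla MLFlow you could approximate that tracing yourself. The sketch below walks the unfit pipeline from the earlier sketch (pipeline, string_cols, numeric_cols), follows each source column through the stages that touch it, and logs the resulting chain as a run parameter; this is an approximation, not Splice's actual implementation:

```python
import mlflow
from collections import defaultdict

def log_feature_trace(pipeline, source_columns):
    """Rough stand-in for manager.log_feature_transformations using vanilla MLFlow."""
    trace = defaultdict(list)                 # source column -> list of stage names
    origin = {c: c for c in source_columns}   # derived column -> original source column
    for stage in pipeline.getStages():
        if stage.hasParam("inputCol") and stage.isSet("inputCol"):
            inputs = [stage.getInputCol()]
        elif stage.hasParam("inputCols"):
            inputs = stage.getInputCols()
        else:
            inputs = []
        output = stage.getOutputCol() if stage.hasParam("outputCol") else None
        for col in inputs:
            source = origin.get(col, col)
            trace[source].append(type(stage).__name__)
            if output is not None:
                origin[output] = source
    for col, steps in trace.items():
        mlflow.log_param("transform_" + col, " -> ".join(steps))

log_feature_trace(pipeline, string_cols + numeric_cols)
```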

Once we have our dataset in its final form, we can use Splice Machine's Native Spark Datasource (more on this in another article) to create a SQL table and insert our transformed dataset into it, so we can keep track of the data used for this particular run. We can then log a parameter or a tag specifying the name of the table in the database to refer back to for later analysis. This is useful because if our model behaves strangely in the future, we can see the exact dataset it was trained on and potentially track down the cause.

Create a table in Splice database from the Spark Dataframe, and insert that dataframe into the DB table

Because storing dataframes as tables directly in a Splice database is a Splice-specific capability, with vanilla MLFlow you could instead log your dataframe as an artifact using mlflow.log_artifact. This will link your dataset to the run in question for future analysis. Another option is to call df.write.save and pass in an S3 bucket to save your dataframe to S3, then log a parameter in your run pointing to the S3 location.
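
Here is a brief sketch of those two vanilla options; the bucket and paths are placeholders, and transformed_df stands for the output of the fitted preprocessing pipeline (e.g. pipeline.fit(df).transform(df)):

```python
import mlflow

# Option 1: write the transformed dataframe to S3 and record where it lives
s3_path = "s3a://my-ml-bucket/order_delays/transformed.parquet"  # placeholder bucket/path
transformed_df.write.mode("overwrite").parquet(s3_path)
mlflow.log_param("training_data_location", s3_path)

# Option 2: attach a (sampled) copy of the data directly to the run as an artifact
transformed_df.limit(100000).toPandas().to_csv("transformed_sample.csv", index=False)
mlflow.log_artifact("transformed_sample.csv")
```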

Training models

Once our data is ready, we can start training some models. We want to try out a number of different models to see which one has the best baseline performance, and then narrow in on the hyperparameters of the chosen model. We can loop through a number of models, tracking the metrics of each one. For each model, we can start a new run and log the table our dataset is coming from (the one created above). During these runs, we can also log plots such as the Receiver Operating Characteristic (ROC) curve.
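
Sketching that loop with vanilla MLFlow and Spark ML might look like the following; the model list, label column, and source table name are illustrative, transformed_df is the fitted pipeline's output from the sketch above, and scikit-learn is used here only to draw the ROC curve:

```python
import matplotlib.pyplot as plt
import mlflow
from pyspark.ml.classification import GBTClassifier, LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from sklearn.metrics import roc_curve

mlflow.end_run()  # close the feature-engineering run; each model below gets its own run
train, test = transformed_df.randomSplit([0.8, 0.2], seed=42)
evaluator = BinaryClassificationEvaluator(labelCol="label")  # 'label' column is assumed

for algo in (LogisticRegression, RandomForestClassifier, GBTClassifier):
    with mlflow.start_run(run_name=algo.__name__):
        mlflow.log_param("model_type", algo.__name__)
        mlflow.log_param("source_table", "ORDER_DELAYS_TRANSFORMED")  # placeholder table name
        model = algo(featuresCol="featuresVec", labelCol="label").fit(train)
        preds = model.transform(test)
        mlflow.log_metric("areaUnderROC", evaluator.evaluate(preds))

        # Plot and log the ROC curve for this run
        pdf = preds.select("label", "probability").toPandas()
        fpr, tpr, _ = roc_curve(pdf["label"], pdf["probability"].apply(lambda v: float(v[1])))
        fig, ax = plt.subplots()
        ax.plot(fpr, tpr)
        ax.set_xlabel("False positive rate")
        ax.set_ylabel("True positive rate")
        fig.savefig("roc.png")
        mlflow.log_artifact("roc.png")
        plt.close(fig)
```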

Trying multiple models and logging all parameters and results

After we run some hyperparameter tuning and we're happy with the model we've chosen, we can save the model to MLFlow (where it is stored in serialized form in the Splice DB) and see it among the artifacts in the MLFlow UI.
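
With vanilla MLFlow, the counterpart to manager.log_spark_model is mlflow.spark.log_model, which stores the serialized model with the run's artifacts (on S3 or whatever artifact store you configured) rather than inside the Splice database. A minimal sketch, where best_model stands for the winning model from the tuning above:

```python
import mlflow
import mlflow.spark

with mlflow.start_run(run_name="final_model"):
    mlflow.spark.log_model(best_model, artifact_path="model")
```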

Output from manager.log_evaluator_metrics()

Output from manager.stop_and_log_timer()

Output from manager.log_model_params()

As you can see above, some of the Splice Machine built-in logging functions take care of some nice governance for you, but you can implement the same logging with vanilla MLFlow as well (a combined sketch follows these three points):

To log your model params, you can extract the parameter map using model.extractParamMap() and loop through your parameters, logging each one to MLFlow.

To log the time it took to train on your data and predict on your testing data, you can import the native time library, capture the start and end times of your training with time.time(), and log the difference between the two.

To log all of your metrics, you could use the Spark Evaluators to get evaluation metrics, and loop through the available metrics, logging each one using mlflow.log_metric.
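
Putting those three pieces together with vanilla MLFlow and Spark ML could look roughly like this; the GBT model and column names are illustrative, and train/test are the splits from the earlier sketch:

```python
import time
import mlflow
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

with mlflow.start_run(run_name="gbt_tuned"):
    gbt = GBTClassifier(featuresCol="featuresVec", labelCol="label", maxDepth=5)

    # 1. Time the training and log it as a metric
    start = time.time()
    model = gbt.fit(train)
    mlflow.log_metric("train_time_sec", time.time() - start)

    # 2. Log every parameter/hyperparameter of the fitted model
    for param, value in model.extractParamMap().items():
        mlflow.log_param(param.name, value)

    # 3. Log evaluation metrics from a Spark evaluator
    evaluator = BinaryClassificationEvaluator(labelCol="label")
    preds = model.transform(test)
    for metric in ("areaUnderROC", "areaUnderPR"):
        mlflow.log_metric(metric, evaluator.evaluate(preds, {evaluator.metricName: metric}))
```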

As other data scientists build runs to test their ideas, they can all be populated in the same experiment, allowing for cross-comparison of models. Anything and everything can be saved in these experiments and runs, allowing the team to develop their own standard of work. Below we show two comparisons: one chart comparing a specific model hyperparameter against a single metric (in this case r² vs. number of trees), and another chart showing a holistic view of each of the four models built, showing where each one outperforms the others and where it falls flat.

Comparing Number of Trees against r²
Comparing all metrics of each model against each other

Something I really appreciate about the second chart is that you can see a more holistic view of each model. For example, here you can see that the magenta model clearly outperforms all of the others, except in r². So, unless r² is a critical evaluation metric, that would be a clear choice.

Model Deployment

Model tracking screen before deployment

Now that everything is tracked and compared, it's time to deploy your model live. Without deployment, your model is nothing more than a nice Jupyter notebook, but MLFlow allows for easy code-based deployment to AzureML or AWS Sagemaker. In the Splice platform, we have a handy UI for deployment, so no matter your cloud service, deployment remains consistent and simple. If you're using vanilla MLFlow, you can use the built-in AzureML or Sagemaker APIs to deploy your model.
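
For example, with vanilla MLFlow (1.x) a Sagemaker deployment can be sketched roughly as below; the app name, run ID, region, and IAM role are all placeholders, and it's worth checking the current MLFlow docs since this API has evolved:

```python
import mlflow.sagemaker

mlflow.sagemaker.deploy(
    app_name="order-delay-model",         # placeholder endpoint name
    model_uri="runs:/<run_id>/model",     # the model logged earlier; <run_id> is a placeholder
    region_name="us-east-1",              # placeholder region
    execution_role_arn="arn:aws:iam::<account>:role/<sagemaker-role>",  # placeholder IAM role
    mode="create",
)
```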

Model deployment UI

When deploying to Azure ML or AWS Sagemaker, simply choose your region, a name, and a few other details, and your model is on its way. After it's been deployed, you can track all jobs and who deployed them in the UI.

Model tracking screen after deployment

Summary

And with that, we've transformed, trained, tested, and deployed our first machine learning model using MLFlow. At any point, anyone on your team can navigate to the MLFlow UI and see all of the steps taken to get from the raw data to the deployed model. As the data changes and the model needs to be monitored or adjusted, everyone knows exactly where to look.

I’ve really enjoyed using MLFlow and all of its incredible features. If you’re a fan of MLFlow yourself, feel free to comment on any other great features I may have missed, all suggestions are welcome!

To learn more about MLFlow, click here, and to learn more about Splice Machine’s MLManager implementation, click here, or request a demo here.


