
MLflow: a better way to track your models


In a perfect world, you would get clean data, feed it into a single machine learning model, and voilà, done. But fortunately, reality is much more interesting than a perfect world: a well-performing ML model is only ready after a bunch of trials. You will have to try different algorithms along with a range of hyperparameters, different variants of your training data, and maybe even different variants of your target variable.

This complex mix and match makes the challenge a fun one, but also a messy one to track. The solution? Not-so-spoiler alert: MLflow to the rescue.

MLflow is a platform that helps us manage the machine learning lifecycle. It has four different modules: Tracking, Projects, Models, and Model Registry.

In this article, we will get started with the tracking module which you can use to keep track of your different models.

The source code of this tutorial is pushed to GitHub:

GitHub – Deena-Gergis/mlflow_tracking: Illustrating basic functionality of MLflow tracking


Getting started in 3 minutes:

So how could you manage your different trials in 3 steps?

After installing MLflow on your system using `pip install mlflow` or `conda install -c conda-forge mlflow`, go ahead and follow these three simple steps:

1. Train your model

Our toy model will be a simple logistic regression, preceded by standard scaling and a PCA. We will play around with the variance cutoff of the PCA to see which performs best.
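Such a pipeline can be sketched with scikit-learn as follows. This is an illustrative sketch: the breast cancer dataset and the 0.95 cutoff are assumptions for the sake of a runnable example, not part of the original tutorial.

```python
# A toy pipeline: standard scaling -> PCA -> logistic regression.
# The dataset and the 0.95 variance cutoff are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pca_variance = 0.95  # the hyperparameter we will vary between runs

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    # A float n_components keeps enough components to explain
    # that fraction of the variance
    ("pca", PCA(n_components=pca_variance)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
```

Rerunning with a different `pca_variance` gives us a new candidate model to compare.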

2. Track your model:

Set up a new experiment for your project. For each experiment, you can track multiple models, each referred to as a run.

For each run, you can keep track of the following:

  • Parameters: What are the parameters that you have set for your model? For example, the depth of your tree, your learning rate, the number of estimators, etc. For our toy model, we will track the variance cutoff of the PCA as our parameter.
  • Metrics: How well is your model performing? You are free to choose any metric, and you can also keep track of multiple metrics at once. For our toy model, we will track the MSE as our metric.
  • Artifacts: Anything that you can export to a file. For example: pickle files for models, CSVs for data, or even JPEGs for plots. For our toy model, we will store the sklearn pipeline as a pickle file and track it as an artifact.

And now, change the PCA variance for each new run, and MLflow will take care of the rest.

3. Retrieve the best:

After training and tracking your different models, it is time to see what you have and retrieve them for further work. Simply get the ID of your experiment and query the runs.

Note how each run is assigned a unique identifier that you can save and reuse later. We can also automatically select the model with the lowest MSE and load it back.

And congratulations, you are now officially qualified to untangle the tracking mess of your models!


So what’s happening behind the scenes?

The tracking mechanism of MLflow is actually quite simple: YAML + Directory Structure.

You will find a new directory created called [mlruns](https://github.com/Deena-Gergis/mlflow_tracking/tree/master/mlruns), which contains a subdirectory for each of your experiments. Each experiment directory contains a YAML file with the configuration, plus one subdirectory per run. And in the run’s subdirectory you will find another YAML file, along with further subdirectories where the parameters, metrics, and artifacts are tracked.

Other backends:

Even though the local storage backend that we have been using is very simple to set up and use, it is limited in terms of multi-user support and functionality. That is why MLflow also offers multiple backend options, in case you decide to adopt the platform on a wider scale. MLflow also supports distributed architectures, where the tracking server, backend store, and artifact store reside on remote hosts. And here is a sneak peek of the fancy GUI that you could also get.


And, happy tracking!

