Improve Your Machine Learning Pipeline With MLflow

A tutorial on giving your machine learning pipeline more visibility

Pathairush Seeda
Towards Data Science


Photo by Alejandro Piñero Amerio on Unsplash

Introduction

A machine learning pipeline is an essential part of a data application. We build it to transform raw data into insightful predictions. The pipeline contains many steps, such as data ingestion, data preprocessing, feature engineering, model fitting, and performance evaluation.

When data scientists start developing an ML pipeline, they try to build the whole pipeline fast and then iterate on the process by changing some hyper-parameters to get the best result. There are many hyper-parameters to tweak in this process.

It would be best if we could track the variation of those hyper-parameters. We would gain a deeper understanding of our ML use cases. For example, we could see how a performance metric, like accuracy or AUC, goes up or down when we change a specific hyper-parameter.

Another benefit is that we can analyze the logged data and derive in-house best practices for developing models in a particular area.

For instance, we can derive a suitable range for random forest depth or the number of neural network layers to try during hyper-parameter tuning for a particular model, like a propensity or churn model. This saves a lot of time-consuming work in model development.

How can we do that?

That is the purpose of this article. There are several useful open-source libraries out there for supporting machine learning pipeline creation. Today we will focus on the MLflow library.

  • MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
  • MLflow has 8K stars on GitHub as of 17 Dec 2020. Many companies use and contribute to this project.
  • I had a chance to use the Databricks platform with its MLflow integration. It was a valuable experience, and it taught me how important a logging/tracking system is.

Let’s see how we use it to improve our ML pipeline.

You can start by installing MLflow with pip install mlflow. You can then start the tracking user interface that MLflow provides by typing the mlflow ui command in your terminal.

If you can see the figure below, then you are good to go. I provide an example of using it in Google Colab here.

MLflow tracking UI: Image by author

You can see the beautiful tracking system that MLflow provides to its users. We don't have any data here yet, so let's put something into it.

Firstly, you have to understand how to plug the MLflow elements into your code.
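
The core pattern is MLflow's start_run context manager plus its logging calls. Here is a minimal sketch of that pattern; the parameter and metric names and values are placeholders, not from a real model:

```python
import mlflow

# Wrap your existing training code in an MLflow run.
with mlflow.start_run():
    mlflow.log_param("max_depth", 5)      # an illustrative hyper-parameter
    mlflow.log_metric("accuracy", 0.87)   # an illustrative performance metric
```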

Whatever pattern your code follows, you can wrap everything with the above snippet to make MLflow track the results for you. The results are stored locally, by default in an mlruns directory under wherever you run your code.

In Databricks, there is a dedicated space for storing the metrics data.

With these simple three lines of code, you can track anything you want in MLflow. MLflow will track anything you run inside the with block and display it through the tracking UI shown above.

Without MLflow, you might need to build a logging system yourself. MLflow reduces the setup time for your logging system and helps you focus more on the machine learning code.

There are many things you can log with MLflow, like the confusion-matrix plot from your model, the fitted model itself, or a pickle file with the names of the important features. That's what I did when I used it with the Databricks platform.
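
Here is a hedged sketch of those three kinds of logging, using a toy scikit-learn model; the dataset, model, and file names are illustrative:

```python
import pickle

import matplotlib.pyplot as plt
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Toy data and model so the sketch is self-contained.
X, y = make_classification(n_samples=500, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

with mlflow.start_run():
    # 1. Log the confusion-matrix plot as an image artifact.
    cm = confusion_matrix(y, model.predict(X))
    plt.matshow(cm)
    plt.title("Confusion matrix")
    plt.savefig("confusion_matrix.png")
    mlflow.log_artifact("confusion_matrix.png")

    # 2. Log the fitted model itself.
    mlflow.sklearn.log_model(model, "model")

    # 3. Log a pickle file with the names of the most important features.
    top_features = [f"feature_{i}" for i in model.feature_importances_.argsort()[-5:]]
    with open("top_features.pkl", "wb") as f:
        pickle.dump(top_features, f)
    mlflow.log_artifact("top_features.pkl")
```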

With this ability, what are the good use cases of MLflow?

Typically, I always use MLflow when I'm tuning hyper-parameters with third-party tools such as hyperopt. Hyperopt is a hyper-parameter tuning library that uses Bayesian methods to find a better hyper-parameter set.

We can integrate MLflow to log every hyperopt trial and see a suitable range for each hyper-parameter. In the past, this saved me a lot of modeling time in reaching acceptable model performance.

Also, it gives you a better understanding of how each parameter of your model behaves. Next time, you can go directly to where it should be without wasting time grid- or random-searching over the entire hyper-parameter space.

Let me show you how to use MLflow with the hyperopt library. We can create a hyperopt run with the following snippet.
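
This is a minimal sketch of such an integration, assuming a toy dataset and a fit_model function that trains a random forest and returns a loss; the search-space keys and ranges are illustrative:

```python
import mlflow
import mlflow.sklearn
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data so the sketch is self-contained.
X, y = make_classification(n_samples=500, random_state=42)

def fit_model(params):
    # Illustrative stand-in: train a model and return a loss to minimize.
    model = RandomForestClassifier(
        max_depth=int(params["max_depth"]),
        n_estimators=int(params["n_estimators"]),
        random_state=42,
    ).fit(X, y)
    mlflow.sklearn.log_model(model, "model")  # keep each trial's model for later
    accuracy = cross_val_score(model, X, y, cv=3).mean()
    return 1.0 - accuracy  # hyperopt minimizes, so use 1 - accuracy as the loss

def objective(params):
    # Log each trial as a nested MLflow run.
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        loss = fit_model(params)
        mlflow.log_metric("loss", loss)
    return {"loss": loss, "status": STATUS_OK}

space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "n_estimators": hp.quniform("n_estimators", 50, 300, 50),
}

with mlflow.start_run():
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=100, trials=Trials())
print(best)
```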

Usually, hyperopt works by passing the hyper-parameter space to the objective function. Then it figures out the values that minimize the objective function within the given number of trials.

The above example minimizes the loss value returned from the fit_model function.

From the above code, hyperopt will run up to 100 trials (max_evals) over the provided space variable and return the best combination of hyper-parameters.

All the models, hyper-parameters, and scores used in each run are stored in the MLflow directory. You can access the best model after the hyperopt run finishes. This helps ensure the reproducibility of your machine learning model in the development process.
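
For instance, here is a hedged sketch of retrieving the best model afterwards, assuming each trial logged its fitted model under the artifact path "model" as in the snippet above:

```python
import mlflow
import mlflow.sklearn

# Find the trial with the lowest logged loss and reload its model.
runs = mlflow.search_runs(filter_string="metrics.loss < 1",
                          order_by=["metrics.loss ASC"], max_results=1)
best_run_id = runs.loc[0, "run_id"]
best_model = mlflow.sklearn.load_model(f"runs:/{best_run_id}/model")
```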

Once you realize the benefit of a tracking system in the model development process, you will never stop using it. ❤️

Another area where MLflow is useful

The second use case, where I use it all the time, is in my deployed machine learning pipeline. MLflow gives me the ability to track the performance of a newly refit model when it is fit on the latest data. We can log the feature importance and visualize how it changes over time.

All the fitted models can be stored in the MLflow artifact store for each run. We can refer back to the model from that point in time for reproducibility. This helps you ensure the validity of each model version.
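
As a hedged sketch of what one scheduled refit run might log, with a toy dataset and model standing in for the latest production data:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in for "the latest data" pulled by a scheduled pipeline.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="weekly_refit"):
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Track how the refit model performs on the newest data.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("auc", auc)

    # Keep the fitted model in the artifact store for reproducibility.
    mlflow.sklearn.log_model(model, "model")

    # Log feature importances so their drift can be compared across runs.
    importances = {f"feature_{i}": float(v)
                   for i, v in enumerate(model.feature_importances_)}
    mlflow.log_dict(importances, "feature_importance.json")
```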

I usually simulate the result of my prediction with test data. For example, I randomly sample a group from my population and compare it with the group holding the top scores from my prediction. Then I measure the conversion rate of the product sold within both groups. This is an A/B-testing simulation on historical data that helps you feel more comfortable using your prediction in the real world.

All the simulation results can be logged and visualized over time with MLflow. There is no need for manual work from data scientists to track their model results; you can automate and visualize it with MLflow.
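
A minimal sketch of logging such a simulation, with randomly generated scores and conversion flags standing in for real historical data:

```python
import numpy as np
import mlflow

rng = np.random.default_rng(0)
n = 10_000
scores = rng.random(n)                # toy propensity scores from a model
converted = rng.random(n) < 0.03      # toy historical conversion flags

# Group A: a random sample; Group B: the customers with the top model scores.
group_a = rng.choice(n, size=1_000, replace=False)
group_b = np.argsort(scores)[-1_000:]

with mlflow.start_run(run_name="ab_simulation"):
    mlflow.log_metric("conversion_rate_random", float(converted[group_a].mean()))
    mlflow.log_metric("conversion_rate_top_scored", float(converted[group_b].mean()))
```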

MLflow is useful not only for machine-learning-related work; it can also be used in the quality-checking and testing process. I once wrote a unit test that logs its result to MLflow. After the code runs, I can check the result instantly in the same place.
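
Here is a hedged sketch of that idea, with a hypothetical check on model scores; the test and its values are illustrative:

```python
import mlflow

def test_scores_are_probabilities():
    # Hypothetical check: model scores should be valid probabilities.
    scores = [0.12, 0.74, 0.41]  # stand-in for real model output
    passed = all(0.0 <= s <= 1.0 for s in scores)
    with mlflow.start_run(run_name="unit_tests"):
        mlflow.log_metric("scores_are_probabilities_passed", int(passed))
    assert passed

test_scores_are_probabilities()
```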

Making great habits with MLflow

Photo by Martin Shreder on Unsplash

I used to do quick machine learning model development for ad-hoc work. It was a pain to keep track of all the metrics and parameters in the development pipeline. Sometimes I reported a result that I couldn't reproduce after restarting the Jupyter notebook kernel. That isn't a delightful moment at all.

Luckily, those problems can be fixed with MLflow, as I mentioned above. If you make it a habit, I trust that your development time and the errors in your machine learning pipeline will be reduced. Also, when you automate it, it becomes simple to generalize to other contexts as well.

“The most successful men work smart, not hard”
Bangambiki Habyarimana, The Great Pearl of Wisdom

This is a skill I think is crucial for data science, but it isn't pointed out much in online courses or elsewhere on the internet. It's the kind of best practice that you gradually gather along your working journey. It becomes part of your toolbox for reaching a higher level and building a more productive, more successful data science life.

Also, there are many exciting features provided by MLflow that I didn't mention here, such as model deployment and automated logging. If you would like to learn more, you can find it here.

Follow Pathairush Seeda if you like this article and would like to see more like this.
