Manage your Machine Learning Lifecycle with MLflow — Part 1.

Reproducibility, good management, and experiment tracking are necessary to make it easy to test other people’s work and analysis. In this first part we will start learning, through simple examples, how to record and query experiments, and how to package Machine Learning models so they are reproducible and can run on any platform using MLflow.

Favio Vázquez
Towards Data Science

--

The Machine Learning Lifecycle Conundrum

Machine Learning (ML) is not easy, but creating a good workflow that you can reproduce, revisit and deploy to production is even harder. There have been many advances toward creating a good platform or management solution for ML. Note that this is not the Data Science (DS) lifecycle, which is more complex and has many parts.

The ML lifecycle exists inside the DS lifecycle.

You can check some of the projects for creating ML workflows here:

These packages are great, but not that easy to follow. Maybe the solution is a mix of these three, or something like that. But here I’ll present the latest solution, created by Databricks: MLflow.

Getting started with MLflow

MLflow is an open source platform for the complete machine learning lifecycle.

MLflow is designed to work with any ML library, algorithm, deployment tool or language. It is very easy to add MLflow to your existing ML code so you can benefit from it immediately, and to share code using any ML library that others in your organization can run. MLflow is also an open source project that users and library developers can extend.

Installing MLflow

Installing MLflow is very easy; you just have to run:

pip install mlflow

At least that’s what the creators say. But I faced several issues while installing it, so here are my recommendations (if you can run mlflow in your terminal after installing, skip this part):

From Databricks: MLflow cannot be installed on the macOS system installation of Python. We recommend installing Python 3 through the Homebrew package manager using brew install python. (In this case, installing MLflow is now pip3 install mlflow.)

That did not work for me and I got this error:

~ ❯ mlflow
Traceback (most recent call last):
  File "/usr/bin/mlflow", line 7, in <module>
    from mlflow.cli import cli
  File "/usr/lib/python3.6/site-packages/mlflow/__init__.py", line 8, in <module>
    import mlflow.projects as projects  # noqa
  File "/usr/lib/python3.6/site-packages/mlflow/projects.py", line 18, in <module>
    from mlflow.entities.param import Param
  File "/usr/lib/python3.6/site-packages/mlflow/entities/param.py", line 2, in <module>
    from mlflow.protos.service_pb2 import Param as ProtoParam
  File "/usr/lib/python3.6/site-packages/mlflow/protos/service_pb2.py", line 127, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __init__() got an unexpected keyword argument 'file'

Solving that was not very easy (I’m using macOS, by the way). I needed to update the protobuf library, so I installed Google’s protobuf library from source:

Download version 3.5.1 (I had 3.3.1 before) and follow the build instructions in the protobuf repository.

Or try using Homebrew.

If your installation works, run

mlflow

and you should see this:

Usage: mlflow [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  azureml      Serve models on Azure ML.
  download     Downloads the artifact at the specified DBFS...
  experiments  Tracking APIs.
  pyfunc       Serve Python models locally.
  run          Run an MLflow project from the given URI.
  sagemaker    Serve models on SageMaker.
  sklearn      Serve SciKit-Learn models.
  ui           Run the MLflow tracking UI.

Quickstart with MLflow

Now that you have MLflow installed let’s run a simple example.

import os
from mlflow import log_metric, log_param, log_artifact

if __name__ == "__main__":
    # Log a parameter (key-value pair)
    log_param("param1", 5)

    # Log a metric; metrics can be updated throughout the run
    log_metric("foo", 1)
    log_metric("foo", 2)
    log_metric("foo", 3)

    # Log an artifact (output file)
    with open("output.txt", "w") as f:
        f.write("Hello world!")
    log_artifact("output.txt")

Save that to train.py and then run with

python train.py

You will see the following:

Running train.py

And that’s it? Nope. MLflow also gives you a UI that you can access easily by running:

mlflow ui

And you will see (localhost:5000 by default):

So what have we done so far? If you look at the code you’ll see we used three functions: log_param, log_metric and log_artifact. The first logs the passed-in parameter under the current run, creating a run if necessary; the second logs the passed-in metric under the current run, creating a run if necessary; and the last logs a local file or directory as an artifact of the currently active run.

So with this simple example we learned how to log the parameters, metrics and files of our lifecycle.

If we click on the date of the run, we can see more about it.

Now if we click the metric, we can see how it got updated through the run:

And if we click the artifact we can see a preview of it:

MLflow Tracking

The MLflow Tracking component lets you log and query experiments using either REST or Python.
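
For instance, here is a minimal sketch of querying a finished run back from Python with the tracking client. The run ID placeholder is something you would copy from the UI, and attribute names may vary slightly across MLflow versions:

from mlflow.tracking import MlflowClient

# A minimal sketch of querying logged data back with the Python tracking API.
# Assumes a recent MLflow version; older versions expose slightly different names.
client = MlflowClient()  # talks to the local ./mlruns store by default

run = client.get_run("<run_id>")  # <run_id> as shown in the UI
print(run.data.params)            # parameters logged with log_param
print(run.data.metrics)           # latest value of each metric
print(client.get_metric_history("<run_id>", "foo"))  # full history of a metric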

Each run records the following information:

Code Version: Git commit used to execute the run, if it was executed from an MLflow Project.

Start & End Time: Start and end time of the run.

Source: Name of the file executed to launch the run, or the project name and entry point for the run if the run was executed from an MLflow Project.

Parameters: Key-value input parameters of your choice. Both keys and values are strings.

Metrics: Key-value metrics where the value is numeric. Each metric can be updated throughout the course of the run (for example, to track how your model’s loss function is converging), and MLflow will record and let you visualize the metric’s full history.

Artifacts: Output files in any format. For example, you can record images (for example, PNGs), models (for example, a pickled SciKit-Learn model) or even data files (for example, a Parquet file) as artifacts.
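
As a quick sketch of that last point, you can also log a whole directory of output files at once; the directory and file names below are made up for illustration:

import os
import mlflow

# A small sketch: logging a directory of output files as artifacts.
# The "outputs" directory and its contents are hypothetical.
os.makedirs("outputs", exist_ok=True)
with open("outputs/summary.txt", "w") as f:
    f.write("run summary")

with mlflow.start_run():
    mlflow.log_artifacts("outputs")  # logs every file under outputs/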

Runs can optionally be organized into experiments, which group together runs for a specific task. You can create an experiment via the mlflow experiments CLI, with mlflow.create_experiment(), or via the corresponding REST parameters.

# Prints "created an experiment with ID <id>"
mlflow experiments create face-detection
# Set the ID via environment variables
export MLFLOW_EXPERIMENT_ID=<id>

And then you just launch an experiment:

import mlflow

# Launch a run. The experiment ID is inferred from the
# MLFLOW_EXPERIMENT_ID environment variable
with mlflow.start_run():
    mlflow.log_param("a", 1)
    mlflow.log_metric("b", 2)
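
Alternatively, here is a small sketch of doing the same purely from Python, without the environment variable (the experiment name is just an example):

import mlflow

# A sketch: create the experiment from Python and pass its ID explicitly.
# Note that create_experiment raises an error if the name already exists.
exp_id = mlflow.create_experiment("face-detection")  # returns the new experiment's ID
with mlflow.start_run(experiment_id=exp_id):
    mlflow.log_param("a", 1)
    mlflow.log_metric("b", 2)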

Example of Tracking:

A simple example using the Wine Quality dataset: Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests.

First download this file:

https://raw.githubusercontent.com/databricks/mlflow/master/example/tutorial/wine-quality.csv

And then in the folder create the file train.py with the content:

import sys

import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

import mlflow
import mlflow.sklearn


# Helper used below to evaluate the model
def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


# Read the wine-quality csv file
data = pd.read_csv("wine-quality.csv")

# Split the data into training and test sets. (0.75, 0.25) split.
train, test = train_test_split(data)

# The predicted column is "quality" which is a scalar from [3, 9]
train_x = train.drop(["quality"], axis=1)
test_x = test.drop(["quality"], axis=1)
train_y = train[["quality"]]
test_y = test[["quality"]]

alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5
l1_ratio = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5

with mlflow.start_run():
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    lr.fit(train_x, train_y)

    predicted_qualities = lr.predict(test_x)

    (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

    print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
    print("  RMSE: %s" % rmse)
    print("  MAE: %s" % mae)
    print("  R2: %s" % r2)

    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)

    mlflow.sklearn.log_model(lr, "model")
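
Since the script reads alpha and l1_ratio from the command line, you can try other values with, for example, python train.py 0.1 0.7, and each run will be recorded separately so you can compare them in the UI.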

Here we also test the MLflow integration for SciKit-Learn. After running it you will see this in the terminal:

Elasticnet model (alpha=0.500000, l1_ratio=0.500000):
RMSE: 0.82224284976
MAE: 0.627876141016
R2: 0.126787219728

Then run mlflow ui from the same working directory (the one that contains the mlruns directory) and navigate your browser to http://localhost:5000. You will see:

And you will have this for each run, so you can track everything you do. Also, the model gets a pkl file and a YAML file for deployment, reproduction and sharing.
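
As a hedged sketch, that saved model can be loaded back into Python to make new predictions. The runs:/ URI form assumes a reasonably recent MLflow version, and the run ID is the one shown in the UI:

import mlflow.sklearn

# A minimal sketch of loading the logged SciKit-Learn model back.
# "runs:/<run_id>/model" assumes a recent MLflow version; <run_id> comes from the UI.
model = mlflow.sklearn.load_model("runs:/<run_id>/model")
predictions = model.predict(test_x)  # test_x as defined in train.py above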

Stay tuned for more

In the next post I’ll cover the Projects and Models APIs, where we will be able to run these models in production and create a full lifecycle.

Make sure to check the MLflow project for more:
