Welcome back to the second part of our journey into MLflow. Today we’ll extend the current SDK implementation with two functions for reporting historical metrics and custom metrics. Then, we’ll finally see the SDK working with a simple example. Next time, we’ll dig into MLflow plugins and create a "deployment" plugin for GCP AI Platform.
Here is my first article about the MLflow SDK creation:
Table of Contents
- What do we need today
- Report experiment’s runs metrics to the most recent run
- Report custom metrics to a run
- Update the experiment tracking interface
- Create your final MLflow SDK and install it
- SDK in action!
What do we need today
Firstly, let’s think of the design of the main SDK protocol. The aim today is to allow data scientists to:
- add to a given experiment’s run the historical metrics computed in previous runs
- add custom computed metrics to a specific run
Thus, we can think of implementing the two following functions:
- report_metrics_to_experiment: this function will collect all the metrics from previous experiment’s runs and group them in an interactive plot, so users can immediately spot issues and understand the overall trend.
- report_custom_metrics: this function posts a dictionary of data scientists’ metric annotations to a given experiment’s run. This may be useful if a data scientist would like to attach metrics computed on unseen data to a specific experiment.
Report experiment’s runs metrics to the most recent run
This function makes use of the MlflowClient, the client in MLflow Tracking that manages experiments and their runs. From MlflowClient we can retrieve all the runs for a given experiment and, from there, extract each run’s metrics. Once we have gathered all the metrics together, we can proceed with a second step, where we use plotly to produce an interactive html plot. In this way, users can analyse every single data point for all the runs in the MLflow server artefacts box.
Fig.1 shows the first part of the report_metrics_to_experiment function. Firstly, the MlflowClient is initialised with the given input tracking_uri. Then, the experiment’s information is retrieved with client.get_experiment_by_name and converted to a dictionary. From here, each experiment’s run is listed in runs_list. Each run has its own run_id, which is practical for storing the metrics information in a dictionary, models_metrics. Additionally, the metrics can be accessed via run.to_dictionary()['data']['metrics'], whose keys give the names of the metrics.
From each metric’s name, the metric’s data points can be recorded through client.get_metric_history(). This method returns the steps and values of the metric, so we can append them to lists and save them as models_metrics[single_run_id][metric] = [x_axis, y_axis].
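Since Fig.1 is an image, here is a minimal sketch of how this first part could look. The helper name collect_experiment_metrics is my own choice (in the SDK this logic sits inside report_metrics_to_experiment), and I’m assuming the runs are listed with client.search_runs:

```python
from mlflow.tracking import MlflowClient


def collect_experiment_metrics(tracking_uri, experiment_name):
    """First half of report_metrics_to_experiment: gather every run's metric history."""
    client = MlflowClient(tracking_uri=tracking_uri)
    # retrieve the experiment info by name; we only need its id here
    experiment = client.get_experiment_by_name(experiment_name)
    runs_list = client.search_runs(experiment_ids=[experiment.experiment_id])

    models_metrics = {}
    for run in runs_list:
        single_run_id = run.info.run_id
        models_metrics[single_run_id] = {}
        # the keys of this dictionary are the metric names
        for metric in run.to_dictionary()["data"]["metrics"]:
            x_axis, y_axis = [], []
            # get_metric_history returns every recorded (step, value) pair
            for point in client.get_metric_history(single_run_id, metric):
                x_axis.append(point.step)
                y_axis.append(point.value)
            models_metrics[single_run_id][metric] = [x_axis, y_axis]
    return models_metrics
```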
Fig.2 shows the second part of report_metrics_to_experiment. Firstly, a new plotly figure is initialised with fig = go.Figure(). The metrics are then read from models_metrics and added as scatter traces. The final plot is saved in html format, to have an interactive visualisation.
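A possible sketch of this second half is below; the plot_models_metrics name, the output file name and the trace labels are my own choices, and in the SDK both halves live inside report_metrics_to_experiment:

```python
import plotly.graph_objects as go


def plot_models_metrics(models_metrics, output_file="experiment_metrics.html"):
    """Second half of report_metrics_to_experiment: plot every run's metric history."""
    fig = go.Figure()
    for single_run_id, metrics in models_metrics.items():
        for metric, (x_axis, y_axis) in metrics.items():
            # one scatter trace per run/metric pair
            fig.add_trace(
                go.Scatter(
                    x=x_axis,
                    y=y_axis,
                    mode="lines+markers",
                    name=f"{single_run_id[:8]} / {metric}",
                )
            )
    fig.update_layout(xaxis_title="step", yaxis_title="metric value")
    # html keeps the plot interactive once logged as an MLflow artefact
    fig.write_html(output_file)
    return output_file
```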
Report custom metrics to a run
The final function we are going to implement today reports a custom input to a specific run. In this case, a data scientist may have some metrics obtained from a run’s model on unseen data. This function is shown in fig.3. Given an input dictionary custom_metrics (e.g. {"accuracy_on_datasetXYZ": 0.98}), the function uses MlflowClient to log_metric for a specific run_id.
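A minimal sketch of how report_custom_metrics could look, following the description above (the parameter order is my assumption):

```python
from mlflow.tracking import MlflowClient


def report_custom_metrics(tracking_uri, run_id, custom_metrics):
    """Log a dictionary of user-defined metrics against an existing run."""
    client = MlflowClient(tracking_uri=tracking_uri)
    for metric_name, metric_value in custom_metrics.items():
        # each key/value pair becomes a metric attached to the given run
        client.log_metric(run_id, metric_name, metric_value)
```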
Update the experiment tracking interface
Now that two new functions have been added to the main MLflow protocol, let’s encapsulate them in our experiment_tracking_training.py. In particular, end_training_job could call report_metrics_to_experiment so that, at the end of any training, we keep track of all the historical metrics for a given experiment, as shown in fig.4 and in the sketch below.
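Fig.4 is not reproduced here, so this is only a sketch of how the call could be wired in; the tracking_params keys and the rest of the function body are assumptions based on part one:

```python
# experiment_tracking_training.py (excerpt, sketch)
import mlflow

from mlflow_sdk import ExperimentTrackingInterface


def end_training_job(tracking_params):
    """Close the active MLflow run and report the experiment's historical metrics."""
    mlflow.end_run()
    # collect all previous runs' metrics and push the interactive html report
    ExperimentTrackingInterface.report_metrics_to_experiment(
        tracking_params["tracking_uri"],
        tracking_params["experiment_name"],
    )
```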
Additionally, to allow users to add their own metrics to specific runs, we can think of an add_metrics_to_run function, which receives as input the experiment tracking parameters, the run_id we want to work on and the custom dictionary custom_metrics (fig.5):
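As before, fig.5 is an image, so here is a hedged sketch of add_metrics_to_run (living in the same experiment_tracking_training.py module as above), reusing report_custom_metrics from the protocol:

```python
# experiment_tracking_training.py (continued, sketch)
def add_metrics_to_run(tracking_params, run_id, custom_metrics):
    """Allow users to attach their own custom metrics to a specific run."""
    ExperimentTrackingInterface.report_custom_metrics(
        tracking_params["tracking_uri"],
        run_id,
        custom_metrics,
    )
```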
Create your final MLflow SDK and install it
Patching all the pieces together, the SDK package should be structured in a similar way:

mlflow_sdk/
    mlflow_sdk/
        __init__.py
        ExperimentTrackingInterface.py
        experiment_tracking_training.py
    requirements.txt
    setup.py
    README.md
The requirements.txt contains all the packages we need to install our SDK; in particular, you’ll need numpy, mlflow, pandas, matplotlib, scikit_learn, seaborn and plotly as defaults.
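For reference, a minimal requirements.txt could simply list those packages; pinning specific versions is optional and up to you:

```
numpy
mlflow
pandas
matplotlib
scikit_learn
seaborn
plotly
```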
setup.py allows you to install your own MLflow SDK in a given Python environment, and the script should be structured in this way:
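Since the setup.py figure is not reproduced here, this is a minimal sketch of how it could look; the version and description strings are placeholders:

```python
# setup.py (sketch)
from setuptools import find_packages, setup

# reuse requirements.txt so the install requirements stay in one place
with open("requirements.txt") as f:
    requirements = f.read().splitlines()

setup(
    name="mlflow_sdk",
    version="0.1.0",
    description="A small SDK built on top of MLflow Tracking",
    packages=find_packages(),
    install_requires=requirements,
    python_requires=">=3.7",
)
```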
To install the SDK, just use your system Python or a virtualenv Python and run: python setup.py install
SDK in action!
It’s time to put our MLflow SDK into action. We’ll test it with a sklearn.ensemble.RandomForestClassifier and the iris dataset¹ ² ³ ⁴ ⁵ (source and license: Open Data Commons Public Domain Dedication and License). Fig.7 shows the full example script we are going to use (my script name is 1_iris_random_forest.py).
tracking_params contains all the relevant info for setting up the MLflow server, as well as the run and experiment names. After loading the dataset, we create a train/test split with sklearn.model_selection.train_test_split. To show different metrics and plots in the MLflow artefacts, I ran 1_iris_random_forest.py 5 times, varying the test_size with the following values: 0.3, 0.2, 0.1, 0.05, 0.01.
Once the data have been set up and the classifier defined, clf = RandomForestClassifier(n_estimators=2), we can call experiment_tracking_training.start_training_job. This module will interact with the MLflow context manager and will report to the MLflow server the script that is running the model, as well as the model’s info and artefacts.
At the end of the training we want to report all the experiment’s run metrics in a single plot and, just for testing, we are also going to save some "fake" metrics like false_metrics = {"test_metric1": 0.98, ... }.
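Putting the pieces above together, a sketch of 1_iris_random_forest.py could look as follows; the exact signatures of start_training_job and add_metrics_to_run come from our SDK (fig.7), so the arguments and the returned run_id below are assumptions:

```python
# 1_iris_random_forest.py (sketch of the script in fig.7)
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from mlflow_sdk import experiment_tracking_training

# info for the MLflow server plus the run and experiment names
tracking_params = {
    "tracking_uri": "http://localhost:5000",
    "experiment_name": "random_forest",
    "run_name": "iris_random_forest",
}

# load the iris dataset and create a train/test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

clf = RandomForestClassifier(n_estimators=2)

# the SDK opens the MLflow run, trains the model with autologging on and
# pushes the running script, the model info and the artefacts to the server
run_id = experiment_tracking_training.start_training_job(
    tracking_params, clf, X_train, X_test, y_train, y_test
)

# just for testing, attach some "fake" metrics to the run we just created
false_metrics = {"test_metric1": 0.98}
experiment_tracking_training.add_metrics_to_run(tracking_params, run_id, false_metrics)
```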
Before running 1_iris_random_forest.py, open up the connection with the MLflow server in a new terminal tab with mlflow ui and navigate to http://localhost:5000 or http://127.0.0.1:5000. Then, run the example above as python 1_iris_random_forest.py and repeat the run 5 times for the different values of test_size.

Fig.8 should be similar to what you see after running the example script. Under Experiments, the experiments’ names are listed. For each experiment there is a series of runs; in particular, under random_forest you’ll find your random forest runs from 1_iris_random_forest.py. For each run we can immediately see some parameters, which are automatically logged by mlflow.sklearn.autolog(), as well as our fake metrics (e.g. test_metric1). The autolog function also saves Tags, reporting the estimator class (e.g. sklearn.ensemble._forest.RandomForestClassifier) and method (RandomForestClassifier).
Clicking on a single run, more details are shown. At first you’ll see all the model parameters which, again, are automatically reported by the autolog function. Scrolling down the page, we can access the Metrics plots. In this case we have just a single data point, but you can have a full plot as a function of the number of steps for more complicated models.

The most important information will then be stored under the Artifacts box (fig.9). Here you can find the different folders which have been created by our mlflow_sdk:
- Firstly, code is a folder which stores the script used to run our model (this was done in experiment_tracking_training on line 24 with traceback and pushed to the MLflow artefacts on line 31 of the run_training function).
- Following, model stores the binary pickle files. MLflow automatically saves the model files as well as their requirements, to allow reproducibility of the results. This will be super helpful at deployment time.
- Finally, you’ll see all the interactive plots (*.html) generated at the end of the training, as well as additional metrics we have computed during the training, such as training_confusion_matrix.png.
As you can see, with minimal intervention we have added a full tracking routine to our ML models. Experimenting is crucial at development time and, in this way, data scientists can easily use the MLflow Tracking functionality without heavily modifying their existing code. From here you can explore different "shades" of reports, adding further information for each run, as well as running MLflow on a dedicated server to allow cross-team collaboration.
¹ Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of eugenics 7.2 (1936): 179–188.
² Deming, W. Edwards. "Contributions to Mathematical Statistics. RA. New York: Wiley; London: Chapman & Hall, 1950. 655 pp. " Science 113.2930 (1951): 216–217.
³ Duda, R. O., and P. E. Hart. "Pattern Classification and Scene Analysis.(Q327. D83) John Wiley & Sons." (1973): 218.
⁴ Dasarathy, Belur V. "Nosing around the neighborhood: A new system structure and classification rule for recognition in partially exposed environments." IEEE Transactions on Pattern Analysis and Machine Intelligence 1 (1980): 67–71.
⁵ Gates, Geoffrey. "The reduced nearest neighbor rule (corresp.)." IEEE transactions on information theory 18.3 (1972): 431–433.
That’s all for today! I hope you enjoyed these two articles about MLflow and its SDK development. Next time we’ll dig into the MLflow plugins world which, in theory, could lead your team to the deployment phase as well.
If you have any question or curiosity, just write me an email at [email protected]