Training and fine-tuning models is a routine task for every computer vision researcher. Even for simple problems, we run a hyper-parameter search to find the optimal way of training the model on our custom dataset: the data augmentation techniques (which already offer many options), the choice of optimizer, the learning rate, and the model itself. Is this the best architecture for my case? Should I add more layers or change the architecture altogether? Many more questions are waiting to be asked and explored.
While searching for an answer to all these questions, I used to save the training log files and output checkpoints in different folders on my local machine, change the output directory name every time I ran a training, and compare the final metrics manually, one by one. Tackling experiment tracking in such a manual way has many disadvantages: it's old school, time- and energy-consuming, and prone to errors.
In this blog post, I will show you how to use MLflow, one of the best tools for tracking your experiments. It lets you log whatever information you need, visualize and compare the different training runs you have accomplished, and decide which one is the optimal choice, all in a user- (and eye-) friendly environment!
Set Up MLflow
As with many other tools, I used conda to create an isolated environment and set up MLflow on my local machine with the following commands:
conda create --name mlflow python=3.8 #create the environment
conda activate mlflow #activate the conda environment
pip install mlflow #install mlflow
Run MLflow
Inside the conda environment we just set up, we run the following command to start the MLflow server on localhost, at port 9090. If this port is already in use, just change it to another one.
Also, create a folder on your local machine named MLflow and, whenever you start the server, do it from this folder. If you don't pick a fixed folder and run the MLflow server from random locations, the server will create an empty "mlruns" folder each time instead of bringing back the previous experiments you have run. So, don't forget to run the command below from your fixed location, which for me is /home/yca/MLflow:
cd MLflow
mlflow server --host 127.0.0.1 --port 9090
Go to http://127.0.0.1:9090 and meet the MLflow interface:

Train and Log
Now that the MLflow server is running, we need to arrange our training code in such a way that whenever we run training, it will automatically log all the information into our MLflow user interface.
Click here to reach the full code I have prepared for this blog post, a classification project. The repository contains the main program train.py; the dataset management script utils/data.py; the helper script utils/augmentation.py, which automates data augmentation according to config.yaml and makes logging to the MLflow interface easier; and utils/trainer.py, the trainer class that keeps the training code readable and portable.
Let's discuss some of the MLflow-specific functions I use in the train.py and trainer.py scripts to make my training process trackable by MLflow.
- Set MLflow server IP correctly
mlflow.set_tracking_uri(uri="http://127.0.0.1:9090")
- Set the experiment name. As long as it is the same project, the name should stay the same; that way, the different experiments accumulate under the same project and we are able to track and compare them with each other.
mlflow.set_experiment(config["logging"]["experiment_name"])
- If a previous, already finished run is somehow still marked as active, close it so you can start a new run under the same experiment name:
if mlflow.active_run():
    mlflow.end_run()
- Tell MLflow to also track system metrics automatically, which include GPU, CPU, and memory usage and the like. The default value is "false", so if you don't want these metrics, simply leave out the line below.
os.environ["MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"] = "true"
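If you prefer not to rely on an environment variable, newer MLflow versions (2.8+, as far as I know) also expose a helper function for the same thing; either way, note that system metrics logging needs psutil (and pynvml for GPU stats) to be installed:
import mlflow

# Alternative to the environment variable (assumes MLflow >= 2.8)
mlflow.enable_system_metrics_logging()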
- Start the MLflow engine; from now on, we will be logging every metric and parameter. I also like to print the run ID, experiment ID, and experiment name to check them.
with mlflow.start_run() as run:
    run_id = run.info.run_id
    experiment_id = run.info.experiment_id
    experiment_name = mlflow.get_experiment(experiment_id).name
    print(f"Experiment Name: {experiment_name}")
    print(f"Experiment ID: {experiment_id}")
    print(f"Run ID: {run_id}")
- Set the dataset name and version; in my case they are set automatically from the configuration file:
mlflow.set_tag("Dataset Name", dataset_name)
mlflow.set_tag("Dataset Version", dataset_version)
- Log everything in the configuration file (config.yaml) as an artifact.
mlflow.log_dict(config, "config.yaml")
When the training finishes, we will see it as below. It is also downloadable again as a .yaml file. So imagine that you run tens of trainings, changing the configuration file each time to modify the hyperparameters: you don't need to hunt down the correct parameters and manually restore your local config.yaml. Just download the chosen model's configuration file and that's it!

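As a side note, you do not even need the UI's download button: assuming MLflow 2.x, the logged config.yaml can also be pulled back programmatically through the artifacts API (the run ID below is a placeholder you would copy from the interface):
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:9090")

# Download the config.yaml logged for a chosen run; the function returns
# the local path of the downloaded file.
local_path = mlflow.artifacts.download_artifacts(
    run_id="<chosen_run_id>",        # placeholder: copy it from the MLflow UI
    artifact_path="config.yaml",
)
print(f"config.yaml downloaded to: {local_path}")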
- I also want to log all the parameters in this configuration file as individual parameters, not only the whole .yaml file as an artifact. That way, when the time comes to compare runs, I can use these parameters directly.
log_dict_as_params(config)  # a user-defined function implemented in helpers.py
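For reference, here is a minimal sketch of what such a helper might look like (the actual implementation lives in helpers.py in the repository): it flattens the nested configuration dictionary and logs every leaf value as a separate parameter, so a nested entry ends up with a name like "training.learning_rate".
import mlflow

def log_dict_as_params(config: dict, parent_key: str = "") -> None:
    """Recursively flatten a nested config dict and log each leaf as an MLflow param."""
    for key, value in config.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            log_dict_as_params(value, full_key)   # recurse into nested sections
        else:
            mlflow.log_param(full_key, value)     # each leaf becomes one parameter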
- We are ready to initialize our trainer class object and start training in two lines of code:
trainer = Trainer(config, class_names, dataloaders, dataset_sizes)
model_finetuned = trainer.train_model()
As you can see, trainer.py also contains some MLflow-specific calls. The train_model() function is a usual training loop: for each batch coming from the training dataset, we feed the network forward, calculate the loss and accuracy, and update the model parameters; in validation mode, we feed batches from the validation dataset and only calculate the loss and accuracy. Here we have four important metrics, train accuracy, validation accuracy, train loss, and validation loss, and we log them as follows:
if phase == "train":
    self.scheduler.step()
    self.train_loss.append(epoch_loss)
    self.train_accuracy.append(epoch_acc.cpu())
    mlflow.log_metric("train_loss", epoch_loss, step=epoch)
    mlflow.log_metric("train_accuracy", epoch_acc, step=epoch)
else:
    self.val_loss.append(epoch_loss)
    self.val_accuracy.append(epoch_acc.cpu())
    mlflow.log_metric("val_loss", epoch_loss, step=epoch)
    mlflow.log_metric("val_accuracy", epoch_acc, step=epoch)
Logging metrics always creates plots in the MLflow interface:

- Still, you can create your own plots and save them under the MLflow experiment using the mlflow.log_figure function.
# Plot and save
plt.figure(figsize=(5, 5), num=1)
plt.clf()
plt.plot(self.epochs, self.train_loss, label='Train')
plt.plot(self.epochs, self.val_loss, label='Test')
plt.legend()
plt.grid()
plt.title('Cross entropy loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
mlflow.log_figure(plt.gcf(), "loss.png")
plt.figure(figsize=(5, 5), num=2)
plt.clf()
plt.plot(self.epochs, self.train_accuracy, label='Train')
plt.plot(self.epochs, self.val_accuracy, label='Test')
plt.legend()
plt.grid()
plt.title('Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
mlflow.log_figure(plt.gcf(), "accuracy.png")
This gives me the following plots at the end of the training:

- I like to log an early-stopping parameter to check whether my training finished via early stopping or continued until the last epoch (if so, I may retrain with more epochs). I do this with the following piece of code in the train_model() function inside the trainer.py script.
# Early stopping check
if early_stopping_counter >= early_stopping_patience:
    print(f'Early stopping at epoch {epoch} due to no improvement in validation loss for {early_stopping_patience} consecutive epochs.')
    self.early_stopped = True   # flag checked after the training loop
    break
....
....
if self.early_stopped:
    mlflow.log_param("early_stopping", True)
    mlflow.log_param("stopped_epoch", epoch)
else:
    mlflow.log_param("early_stopping", False)
- I like to run the model on the test dataset right after training and log the test accuracy along with the confusion matrix.
trainer.test_model()
#inside of test_model() function
...
...
mlflow.log_metric("test_accuracy", test_accuracy)
# Create the confusion matrix, including all classes, and save it as an artifact
cf_matrix = confusion_matrix(y_true, y_pred, labels=list(range(len(self.class_names))))
panda_matrix = pd.DataFrame(
    cf_matrix,
    index=self.class_names,    # Rows: true labels
    columns=self.class_names,  # Columns: predicted labels
)
plt.figure(figsize=(12, 7))
sn.heatmap(panda_matrix, annot=True)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
mlflow.log_figure(plt.gcf(), "confusion_matrix.png")
This saves the confusion matrix plot as an artifact, as follows:

- I like to create the model signature and log it. It is useful for taking a quick look later to remember the expected model input and the example output shape.
from mlflow.models import infer_signature

# Create an input signature for MLflow
sample_input = torch.randn(1, 3, config["input"]["input_size"][0], config["input"]["input_size"][1])
model_finetuned = model_finetuned.to("cpu")
model_signature = infer_signature(sample_input.numpy(), model_finetuned(sample_input).detach().numpy())
- As the last step, we save the best model (the model having the highest validation accuracy during training) as an artifact.
mlflow.pytorch.log_model(model_finetuned, "models/best_model", signature=model_signature)
When I run the python train.py command three times, with small changes in the configuration file each time, I see these three experiments logged in the MLflow interface as follows:

We can customize the Metrics, Parameters, and Tags columns to decide what we want to see in this main table. Columns like "Created", "Duration", "Source", and "Models" are filled in automatically by MLflow.
By clicking the last experiment, beautiful-croc-936 (the names are randomly generated by MLflow, so yours will be different), we can start exploring the interface in more detail. The overview section includes all the parameters as well as the system and model metrics:

The model metrics section gives us the automatically generated plots for the metrics we have been logging:

The system metrics section shows the automatically generated plots for system metrics, while the artifacts section contains the manually logged plots, the best model we saved, and so on.
Compare Experiments via MLflow and Inference
In this last part, we will explore the options for comparing experiments and loading the optimal model for inference.
The first look at the general table already gives a quick comparison, but by selecting all the experiments and clicking the "Compare" button at the top, we can compare and visualize any metric or parameter we want.

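If you prefer to do this comparison in code rather than in the browser, the same information can be pulled with MLflow's search API. A small sketch (the experiment name and metric keys below follow my configuration, so adjust them to yours):
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:9090")

# Fetch all runs of the experiment into a pandas DataFrame and rank them by
# validation accuracy. Metric columns are named "metrics.<metric_name>".
runs = mlflow.search_runs(experiment_names=["image_classification"])  # placeholder experiment name
best_first = runs.sort_values("metrics.val_accuracy", ascending=False)
print(best_first[["run_id", "metrics.val_accuracy", "metrics.test_accuracy"]].head())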
Once we are done choosing the model for inference, using it is nothing more than loading it from the MLflow server:
model_uri = f"runs:/{args.run_id}/models/best_model"
print(f"Loading model from: {model_uri}")
model = mlflow.pytorch.load_model(model_uri)
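Once loaded, the model is a regular PyTorch module. A minimal usage sketch, assuming the same preprocessing as during training (the 299x299 input size and ImageNet normalization below are assumptions based on InceptionV3; check utils/data.py for the exact transforms):
import torch
from PIL import Image
from torchvision import transforms

# Preprocessing must match what utils/data.py applies during training;
# the values below are typical InceptionV3 / ImageNet defaults.
preprocess = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model.eval()
image = Image.open("sample.jpg").convert("RGB")   # placeholder image path
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))
    predicted_index = logits.argmax(dim=1).item()
print(f"Predicted class index: {predicted_index}")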
I also like to log the inference outputs (the input image with its predicted label, as a plot), so that for future model selections I can compare the inference results and make sure the new model to deploy still behaves the same on the previous classes and the previous inference set.
Since I want to see the inference results under the same experiment as the related training, I pass the run ID as a command-line argument and resume the corresponding MLflow run as follows:
with mlflow.start_run(run_id=args.run_id):
The rest is nothing special: load the inference dataset, get the predictions of the model, plot the results, and save them as artifacts under the inference_torch/ folder.

Check out inference.py for this part!
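For reference, the core of that logging loop could look roughly like the sketch below; it is simplified, and images, predictions, and class_names are assumed to come from the inference code in inference.py.
import matplotlib.pyplot as plt
import mlflow

# Resume the training run and attach each prediction plot to it
# under the inference_torch/ artifact folder.
with mlflow.start_run(run_id=args.run_id):
    for i, (image, pred) in enumerate(zip(images, predictions)):
        fig = plt.figure(figsize=(4, 4))
        plt.imshow(image)
        plt.title(f"Predicted: {class_names[pred]}")
        plt.axis("off")
        mlflow.log_figure(fig, f"inference_torch/prediction_{i}.png")
        plt.close(fig)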
You are now ready to examine and use the example GitHub repository created for MLflow experiment tracking on a classification task, and to modify it for any dataset you need. Add different model architectures if you want to check the performance of models other than InceptionV3.
To run the GitHub repo correctly, create another conda environment, install requirements.txt, and run the train.py program from this environment, modifying the configuration file according to your dataset.
Note that MLflow is not only for classification; adapting it to custom cases like detection, segmentation, and many more takes only a few lines of code changes!
Takeaways:
- Log parameters: Register any hyperparameter you use to configure your model and training.
- Log figures: Additionally save your metrics, confusion matrices, or any custom plot as figures.
- Log metrics: Register any value that changes over time during training.
- Always start the MLflow server from the same folder, so that your previous experiments are loaded instead of a new, empty store being created on your local machine.
Stay tuned for the rest of this MLOps tutorial, as we will be discovering the next steps for accomplishing computer vision projects with a full MLOps cycle!