
Tracking in Practice: Code, Data and ML Model

A guide to tracking in MLOps

Photo by Farzad on Unsplash

Tracking! We’ve all done it before, whether you’re a researcher or an engineer; whether you’re involved in machine learning, data science, software development, or even profiling (please don’t mind me, I’m into thriller books these days)! What I want to say is that tracking is important and inevitable. In MLOps, we track all of its components: code, data, and the machine learning model! In this article, we explain the importance of tracking through a practical example where we apply tracking across the different steps of a machine learning workflow. The entire codebase for this article is accessible in the associated repository.



Table of contents:

1. Introduction
2. Project setting
3. Code tracking
4. Data tracking
5. ML Model tracking
6. Conclusion


1. Introduction

MLOps principles

Tracking is defined as the process of recording and tracing the changes and status of the various system components in order to improve productivity, collaboration, and system maintenance. In MLOps, tracking is one of the essential principles: it involves tracing the historical evolution of the different steps of the machine learning workflow, including data, the Machine Learning model (ML model), and code. In a previous tutorial where I introduced MLOps principles, I associated tracking with monitoring once the model is deployed. In fact, these two are related but slightly distinct concepts: monitoring focuses on real-time observation and analysis of system behavior after deployment, whereas tracking is used over the entire project life cycle.

Why tracking? Tracking your code, data, and models improves reproducibility by recording the inputs, outputs, code execution, workflows, and models. It also improves testing by detecting anomalies in the model’s or system’s behavior and performance. In addition, the iterative nature of machine learning development makes tracking a necessity.

When to perform tracking? As I stated previously, tracking is applied over the entire project life cycle, and to code, data, and the ML model at the same time, since they are highly connected: data processing and ML model development rely on code, so tracking data and the ML model requires tracking code; and the ML model’s performance depends on data and code, so tracking the ML model requires tracking both of them.

Tracking use cases? Let’s consider a specific scenario in our handwritten digits classification project. Imagine the developed machine learning model is deployed. The model was trained on a public dataset and achieved a certain level of accuracy during the development and testing phases. However, once deployed to production, the model’s performance degrades over time. This degradation is initially detected thanks to tracking the model’s behavior. Additionally, by tracking the MLOps components (code, data, and ML model) individually, the cause of this behavior can be identified:

  • By tracking the code, we can check whether there is a bug and promptly identify the commit that introduced it. Consequently, we can temporarily roll back the deployment, fix the bug, and reintegrate it into the production project.
  • By tracking changes in the distribution and characteristics of the incoming data over time, data-related issues such as data drift can be detected (a minimal drift-check sketch follows this list).
  • In addition to enabling the detection of system performance degradation, tracking the ML model also enables rollbacks and updates without disruption.
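
For illustration, here is one simple way (not specific to this article’s project) to flag possible drift in a single feature: compare the distribution of incoming data against a reference sample with a two-sample Kolmogorov–Smirnov test. The arrays and the significance threshold below are assumptions.

import numpy as np
from scipy.stats import ks_2samp

# Reference sample (e.g., a feature from the training data) and incoming production sample (illustrative)
reference = np.random.normal(loc=0.0, scale=1.0, size=1000)
incoming = np.random.normal(loc=0.3, scale=1.0, size=1000)

# Two-sample Kolmogorov–Smirnov test: a small p-value suggests the two distributions differ
statistic, p_value = ks_2samp(reference, incoming)
if p_value < 0.05:  # the threshold is an assumption; tune it to your tolerance for false alarms
    print(f"Possible data drift detected (KS statistic={statistic:.3f}, p-value={p_value:.4f})")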

Although this article is dedicated to the tracking concept, it’s also part of my MLOps article series. By following my previous and upcoming tutorials, you’ll be able to create your own end-to-end MLOps project, from the workflow design to model deployment and tracking.

If you are interested in MLOps, check out my articles:

2. Project setting

In this article, we will use a handwritten digit classification project based on a Convolutional Neural Network (CNN) as an example. Specifically, given an input image of a handwritten digit ranging from 0 to 9, the model is required to identify the digit and output its corresponding label. The AI canvas is structured and filled in as follows:

AI canvas for handwritten digits classification.

This project was created as a step-by-step tutorial within the MLOps article series that I’m writing. Its structure follows an MLOps template that you can find as a cookiecutter project or as a GitHub template. You can also find more details about the project structure in my previous article. Naturally, we’re using Git for code version control and DVC for data version control. The entire codebase for this project is accessible in the repository.

In the rest of this article, we will improve this project by adding the tracking of code, data and the ML model while giving an example of how it can be achieved.

3. Code tracking

Code tracking is essential for maintaining machine learning projects. It consists of recording code versions, dependency changes, and all code-related updates. To track our code effectively, a set of practices needs to be followed:

  • Employ a version control system such as Git, and make use of all its features such as tags, descriptive commit messages and other operations to display history and switch between the different commits. If you want to learn more about version control, I invite you to check out my article: Version Controlling in Practice: Data, ML Model, and Code.
  • Adopt a Git workflow suitable for the project requirements in order to track the code changes and feature development. It also ensures that changes are isolated before merging into the main branch, which makes tracking easier. If you want to learn more about Git workflows, I invite you to check out my articles: Mastering Git: The 3 Essential Workflows for Efficient Version Controlling and/or Git Workflow for Machine Learning Projects: the Git Workflow I use in my Projects.
  • Manage dependencies and their versions using tools such as pip for Python. I usually create the requirements.txt file before pushing, sharing, or publishing the project and add it to the version control system to track the dependencies (a minimal example is shown right after this list).
  • Integrate the repository with the MLOps platform that orchestrates the end-to-end machine learning lifecycle to facilitate tracking.
  • Other practices, such as CI/CD, are performed after deployment and will be elaborated in upcoming tutorials.
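
As a minimal example of the dependency-tracking practice mentioned above (standard pip and Git commands; the commit message is illustrative):

$ pip freeze > requirements.txt   # pin the exact package versions of the current environment
$ git add requirements.txt
$ git commit -m "chore: pin project dependencies"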

Now, let’s go over some Git commands that are commonly used in code tracking:

  • To check the status of the repository, we use git status, which shows the current branch and whether it’s up to date with the remote branch, lists the changes in tracked files, and shows the untracked files.
$ git status
On branch feature/grad_cam
Your branch is up to date with 'origin/feature/grad_cam'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   src/models/cnn/train.py
        modified:   tests/grad_cam.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        mlartifacts/
        mlruns/
        tests/mlruns/

no changes added to commit (use "git add" and/or "git commit -a")
  • To list the branches, we use one of the following commands:
$ git branch # to list all local branches
  feature/data-test
* feature/grad_cam
  feature/integration_test
  feature/model-test
  feature/preprocessing_test
  master

$ git branch -r # to list all remote branches
  origin/HEAD -> origin/master
  origin/feature/data
  origin/feature/data-dvc
  origin/feature/data-test
  origin/feature/grad_cam
  origin/feature/integration_test
  # ...

$ git branch -a # to list all branches (remote and local branches)
  feature/data-test
* feature/grad_cam
  feature/integration_test
  feature/model-test
  feature/preprocessing_test
  master
  remotes/origin/HEAD -> origin/master
  remotes/origin/feature/data
  remotes/origin/feature/data-dvc
  remotes/origin/feature/data-test
  # ...

$ git branch -vv # to list all branches with detailed information
  feature/data-test  a976d83 [origin/feature/data-test] test : add features domain validation.
* feature/grad_cam   f959be7 [origin/feature/grad_cam] Merge branch 'feature/integration_test'
  # ...
  • To view the commit history, we use:
$ git log # that displays the detailed commit history
# ...

# Or we can use:
$ git log --pretty=format:"%h %s" # to display only the commit ID and the commit message
f959be7 Merge branch 'feature/integration_test'
eca40ba fix: predict using the latest run.
aa53e29 feat: system integration testing.
  • There are other commands that I usually use through my editor for simplicity and readability:
$ git diff # to view the changes between the working directory and the staging area.

$ git diff --staged # to view the changes between the staging area and the last commit.

$ git reset <file> # to unstage changes

$ git checkout -- <file_name> # to discard local changes

$ git revert <commit_hash> # to revert a commit

$ git reset --soft <commit_hash> # to move HEAD to a specific commit while keeping the changes in the staging area

4. Data tracking

Data tracking is another essential practice for maintaining machine learning projects. It consists of recording the data versions in their different forms, the metadata, the applied processing, and the data quality over time. To track our data effectively, a set of actions needs to be followed:

  • Data versioning, which ensures that changes can be tracked and reproduced.
  • Data lineage, which tracks the origin and the transformations of data as it moves through the data processing and ML pipelines.
  • Metadata logging, such as recording the data source, the preprocessing steps, and any transformations applied.

In a previous tutorial, we employed DVC to track data. Let’s go over some of its commands that are commonly used in data tracking (a minimal end-to-end example follows the list):

  • To display the status of the data files, whether they are up to date or need to be synchronized, use:
$ dvc status
  • To check out the correct version of the data files based on the current Git commit, use:
$ dvc checkout
  • When the data is stored in remote storage, use:
$ dvc pull # to retrieve the latest version of data files to the local workspace

$ dvc push # to upload the latest version of data files from the local workspace to remote storage

$ dvc fetch # to fetch data files from remote storage without checking them out to the workspace.
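
To put these commands in context, a typical (illustrative) sequence for versioning a new snapshot of the data with DVC and Git could look as follows; the data/raw path and the messages are assumptions about the project layout:

$ dvc add data/raw                 # track the data directory with DVC (creates data/raw.dvc)
$ git add data/raw.dvc .gitignore  # version the lightweight metadata file with Git
$ git commit -m "data: track new raw data snapshot"
$ dvc push                         # upload the actual data to remote storage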

5. ML Model tracking

If I had to order the list by importance, I would put ML model tracking first. Tracking the model’s performance enables early detection of system misbehavior and facilitates decision-making and rapid correction of the situation.

Tracking an ML model includes tracking its name, architecture, parameters, weights, and experiments. It also includes tracking the code and data versions with which the training was done. Hmm, there’s so much to keep track of, I agree! Before entering the world of MLOps, I always struggled to save all experiments effectively and efficiently. Back then, in order to save settings, I used a basic, traditional approach: file-based storage (pickle and CSV files). That approach lacks scalability, since it requires manual management, and it limits reproducibility and collaboration. Fortunately, this struggle led me to research more advanced approaches and learn new technologies and tools related to MLOps. Several platforms and tools exist today to meet the needs of the different stages of MLOps, but that is not the subject of this article.

In this article, we will use MLflow, which we already presented in a previous tutorial (Version Controlling in Practice: Data, ML Model, and Code) and employed to version control the ML model.

We first start a local MLflow Tracking Server:

mlflow server --host 127.0.0.1 --port 8080

and set an MLflow experiment to organize and manage our training runs:

import mlflow

# Set the tracking server URI for logging (the local server started above)
mlflow.set_tracking_uri(tracking_uri)
# Create/select an MLflow experiment to organize and manage the training runs
mlflow.set_experiment(experiment_name)
  • To track experiments and metadata, MLflow provides a powerful feature, mlflow.autolog(), that automatically logs metrics and parameters. mlflow.autolog() should be called before the training code:
with mlflow.start_run():
    mlflow.autolog()

    # Train:
    model.compile(loss=loss, optimizer='adam', metrics=[metric])
    history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1,
                        validation_data=(x_val, y_val))

MLflow’s auto-logging saves around 29 parameters, including the batch size, the number of epochs, and the optimizer name, as well as the training metrics, including the loss values. Another strong point of MLflow is that it provides a graphical interface in which we can view the logs and even display graphs:

Displaying the training accuracy in MLflow UI
Displaying the loss function in MLflow UI
  • Often, we also need to log other metrics and parameters. This can be achieved using mlflow.log_params() and mlflow.log_metrics(). Here, I used them to log the loss function and metric names as parameters, and the accuracy, precision, recall, and F1 score (among other values) as metrics:
with mlflow.start_run():
    mlflow.autolog()

    # Train:
    model.compile(loss=loss, optimizer='adam', metrics=[metric])
    history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1,
                        validation_data=(x_val, y_val))

    # Evaluation
    # ...

    # Log other params:
    mlflow.log_params({
        'loss': loss,
        'metric': metric,
    })

    # Log other metrics
    mlflow.log_metrics({
        'acc': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'training_loss': history.history['loss'][-1],
        'training_acc': history.history['accuracy'][-1],
        'val_loss': history.history['val_loss'][-1],
        'val_acc': history.history['val_accuracy'][-1],
        'test_loss': test_loss,
        'test_metric': test_metric
    })

As we can see, the additional metrics and parameters are correctly logged and can be nicely displayed. The different runs can also be compared with each other, and the best model can be selected (a small programmatic sketch of this comparison is shown below):
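
The comparison can also be done programmatically. As a minimal sketch (the experiment name and the metric used for ranking are assumptions), runs can be queried and sorted with mlflow.search_runs():

import mlflow

# Query all runs of the experiment, sorted by a logged metric (names are illustrative)
runs = mlflow.search_runs(experiment_names=["handwritten_digits_cnn"],
                          order_by=["metrics.f1 DESC"])
best_run = runs.iloc[0]  # search_runs returns a pandas DataFrame
print(best_run["run_id"], best_run["metrics.f1"])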

Furthermore, MLflow automatically stores other metadata such as the Git commit, the user, the training source file, the model summary, and the requirements file:

  • We can also register (by clicking the ‘Register Model’ button) and version our model easily and rapidly using the MLflow UI, and the model is then ready for deployment. The same registration can be done from code, as sketched below:
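
A minimal sketch of programmatic registration with mlflow.register_model(), assuming the model was logged under the default model artifact path and using an illustrative registry name:

import mlflow

# Register the model logged in a given run under a name in the MLflow Model Registry
run_id = "<run_id>"  # illustrative: the ID of the selected run, taken from the UI or from mlflow.search_runs()
result = mlflow.register_model(model_uri=f"runs:/{run_id}/model",
                               name="handwritten_digits_cnn")
print(result.name, result.version)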

I cannot end this section without emphasizing once more the importance of using tools like MLflow. Such tools are essential for managing the complexities of the ML lifecycle; they bring structure and efficiency to development, experimentation, and deployment. With MLflow, ML model tracking becomes more effective and beneficial. Additionally, the MLflow UI provides an interactive and visual way to manage, track, and compare ML models.

6. Conclusion

Here we come to the end of this article! In it, we introduced one of the most crucial MLOps principles: tracking. Tracking ensures the quality, reliability, and reproducibility of machine learning workflows. It also plays an important role in selecting the model to deploy, as we will explore further in upcoming articles.

My aim through my articles is to provide my readers with clear, well-organized, and easy-to-follow tutorials, offering a solid introduction to the diverse topics I cover and promoting good coding and reasoning skills. I am on a never-ending journey of self-improvement, and I share my findings with you through these articles. I, myself, frequently refer to my own articles as valuable resources when needed.

Thanks for reading this article. You can find all the examples of the different tutorials I provide in my GitHub profile. If you appreciate my tutorials, please support me by following me and subscribing to my mailing list. This way, you’ll receive notifications about my new articles. If you have any questions or suggestions, feel free to leave a comment.

Image credits

All images and figures in this article whose source is not mentioned in the caption are by the author.

