Branches Are All You Need: Our Opinionated ML Versioning Framework

A practical approach to versioning machine learning projects using Git branches that simplifies workflows and organizes data and models

Jonathan (Yonatan) Alexander
Towards Data Science

--

TL;DR

A simple approach to versioning machine learning projects using Git branches that simplifies workflows, organizes data and models, and couples related parts of the project together.

Introduction

When managing machine learning solutions, the various parts of the solution are often distributed across multiple platforms and locations: GitHub for code, HuggingFace for models, Weights and Biases for experiment tracking, S3 for a copy of everything, and so on.

On the data side, we have training data, processed data, experiment-tracking data, and model-monitoring data. We save models for inference, including older versions and experimental models for online tests. We also have code for preprocessing, training, inference, experimentation, data science analysis, and monitoring alerts.

This can easily get out of hand.

An ML example using only multiple GitHub repos, S3 buckets, and addresses
Image by author

Using various tools, environments, and asset addresses to track different parts of the ML lifecycle can result in scattered and uncoordinated states. This can lead to data loss, security breaches, and misconfigurations that need to be carefully managed.

In a previous project, we used SageMaker for daily training in an on-premises solution. This required customers to download a model daily and train it on various combinations of clients’ datasets.

We had to manage which binary model was trained with which training code on which client’s data, which model runs at which client with which inference code, and so on.

In this post, we will show how to use data versioning tools together with Git to address these issues.

Data versioning tools allow you to commit data and model files alongside code, regardless of their size. By versioning all files this way, you can avoid the inconvenience of managing data and model assets separately.
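
To make this concrete, here is a minimal sketch of committing data alongside code, assuming DVC as the data versioning tool; any tool that stores large files next to Git commits works similarly, and the paths are hypothetical:

```python
# A sketch: track a data directory with DVC so it is versioned alongside
# the code. Assumes DVC is installed and `dvc init` was already run.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

run("dvc", "add", "data/raw")                         # creates the data/raw.dvc pointer
run("git", "add", "data/raw.dvc", "data/.gitignore")  # version the pointer, not the blobs
run("git", "commit", "-m", "add raw data from client upload")
run("dvc", "push")                                    # push the blobs to remote storage
```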

In an ideal world, you would have only the relevant code, data, and models for the specific task at hand, whether you are developing or running ML training, tracking experiments, monitoring models in production, or conducting online experiments.

Instead of manually (or even automatically) connecting the right pieces — loading the right data in the right code, connecting them to the right model in HuggingFace in the right environment — imagine every time you checked out a branch, all the pieces were there.

In this article, I will showcase a framework that replaces the complexity of juggling tools with Git, a system that nearly all ML teams are already using.

The goal is to simplify and remove barriers to starting every stage of the ML workflow by having everything in a single place, managed by Git.

Our requirements

  1. A simple workflow that is easy to pause, pick up, and adjust to accommodate changing business and development needs. It should also support reproducibility and enable after-the-fact queries, such as “What data was my model trained on?”
  2. Efficient use of data and code, with a focus on cohesion, governance, and auditing: reuse data and code as much as possible, and leverage Git features like commits, issues, and tags.
  3. Consolidation of all the different aspects of the ML solution. Often, experiment tracking, model monitoring in production, and online and offline experiments are separated from the training and inference sides of the solution. Here, we aim to unify them under one umbrella, making it easy to transition between them.
  4. Following Git and ML best practices, such as early and shareable data splits, testing, and simple collaboration between ML engineers.

Key concepts

Every change is a Git commit: This includes data uploads, feature engineering, model overriding, merging experiment metrics, model monitoring, and, naturally, code changes.

Active branches: It is common practice to use different branches for development and production. Here, however, we take it a step further: when you check out a branch, all the necessary data, code, models, documents, READMEs, and model cards with metrics are in one place.

🤯 Your repository is your blob-store! Your branch functions like a bucket within your blob-store, enabling you to upload, download, and store data and models.

This allows you to use different branches for various development, experimentation, and production needs, rather than relying on different platforms and tools.

Merges as workflow: Merges are used to combine branches. Code is merged normally, a model typically overrides the existing model, and data is usually appended. To receive new data, a branch “pulls” it from another branch by copying.

Merging data can be as simple as copying files or appending to a JSON list. In more sophisticated cases, you can merge SQLite databases.
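
A minimal sketch of such a JSON-list merge; the file names are hypothetical:

```python
# Append one branch's run result onto the main branch's results list.
import json

with open("tracking/results.json") as f:   # list living on the main branch
    results = json.load(f)
with open("tracking/new_run.json") as f:   # single result pulled from a stable branch
    new_run = json.load(f)

results.append(new_run)                    # "merge" by appending
with open("tracking/results.json", "w") as f:
    json.dump(results, f, indent=2)
```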

Deduplication, a common feature of data versioning tools, prevents multiple copies of the same file from being stored, even when multiple branches contain the same files.

Types of Branches

Image by author

Main Branch

First, we use our project’s main branch to store the problem definition, documentation, data description, and project structure. This serves as a space for collaboration and discussions.

TIP: Begin by clearly defining the business problem, determining the desired outcome, identifying the target values or labels and how they are obtained, and establishing evaluation metrics and requirements. This ensures a successful start to the project and provides a place for onboarding and collaboration.

We can also use the main branch for tracking experiments, combining everyone’s experiment results there. For example, MLflow’s mlruns folder can be merged into it for this purpose. Any collaborator can check out this branch and run the UI.

Alternatively, the tracking can be done in another branch.

Starting this way is very simple, and as needs change over time, it is possible to upgrade to an MLflow server or a tracking platform like Weights and Biases with minimal changes.
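
Here is a sketch of merging MLflow runs into the main branch, assuming MLflow’s standard mlruns directory layout; incoming_mlruns is a hypothetical staging path for runs copied over from another branch:

```python
# Merge experiment folders from another branch's mlruns into ours.
# After committing, collaborators can run `mlflow ui` on this branch.
import shutil
from pathlib import Path

incoming = Path("incoming_mlruns")   # mlruns copied from the other branch
target = Path("mlruns")

for experiment in incoming.iterdir():
    if experiment.is_dir():
        # dirs_exist_ok merges new runs into an existing experiment folder
        shutil.copytree(experiment, target / experiment.name, dirs_exist_ok=True)
```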

Data Branches

These are branches of the project that mainly include data files, documentation, and transformation scripts, and they remain active. You can think of them like S3 buckets, but instead of uploading and downloading, you check out a branch and your files are there.

It is recommended to always commit (upload) to the raw branch first. This creates a source of truth, a place that is never edited or deleted, so we can always track where data is coming from and passing through. It also makes it easy to create new flows and enables auditing and governance.

💡 If you add a commit message of where the data is coming from, you can get even more granular observability over your data.

You can use another clean branch where only clean data exists. For example, broken images or empty text files that were uploaded to the raw branch do not appear in the clean branch.

A split branch, where the data is divided into training, validation, and test sets, ensures that all teams and collaborators work on a level playing field.

This approach helps prevent data leakage and enables more robust feature engineering and collaboration. Minimizing the chance of examples from the test set being included in the training stages reduces the risk of introducing bias. Additionally, having all collaborators on the same split ensures consistent and unbiased results in an experiment.
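
A minimal sketch of creating such a split branch, assuming scikit-learn and hypothetical file paths; the point is that the split is made once, with a fixed seed, and committed for everyone:

```python
# Create a shared, committed train/validation/test split with a fixed seed.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("data/clean.csv")  # data pulled from the clean branch

train, rest = train_test_split(data, test_size=0.3, random_state=42)
validation, test = train_test_split(rest, test_size=0.5, random_state=42)

train.to_csv("data/train.csv", index=False)             # commit these three
validation.to_csv("data/validation.csv", index=False)   # files on the
test.to_csv("data/test.csv", index=False)               # split branch
```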

In a former classification project, I was part of a team of individual contributors where each person ran the whole pipeline from scratch; each of us used different data-splitting percentages and seeds, which led to weaker models in production due to bugs and data biases.

Image by author

💡 ML tip: The three-phase model development best practice
We use the “train” and “validation” datasets to train the model and optimize its hyperparameters. We then use train plus validation as the training set for our tuned model and evaluate it with the test dataset only once. Lastly, we train the model on all the data and save it as our final model.
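
A sketch of the three phases, assuming scikit-learn, a numeric feature table with a “label” column, and the split files from the split branch (all hypothetical):

```python
# Phase 1: tune on train/validation. Phase 2: retrain on train+validation,
# evaluate on test exactly once. Phase 3: train on everything and ship.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train = pd.read_csv("data/train.csv")
validation = pd.read_csv("data/validation.csv")
test = pd.read_csv("data/test.csv")

def xy(df):
    return df.drop(columns=["label"]), df["label"]

best_c, best_score = None, -1.0
for c in (0.01, 0.1, 1.0, 10.0):  # hypothetical search space
    model = LogisticRegression(C=c, max_iter=1000).fit(*xy(train))
    score = accuracy_score(validation["label"], model.predict(xy(validation)[0]))
    if score > best_score:
        best_c, best_score = c, score

train_val = pd.concat([train, validation])
model = LogisticRegression(C=best_c, max_iter=1000).fit(*xy(train_val))
print("test accuracy:", accuracy_score(test["label"], model.predict(xy(test)[0])))

full = pd.concat([train, validation, test])
final_model = LogisticRegression(C=best_c, max_iter=1000).fit(*xy(full))
```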

Stable Branches

These are active branches for training and inference. Here, you can run your training; save your model, checkpoints, and model card; run tests; build and test the Docker image; commit everything at the end of a training cycle; and then tag. They should be capable of handling the retrieval of new data and re-training. This is where automation takes place.

⚠️ No code is written in these branches.

This ensures that a model is coupled with the data it was trained on, the code used to train and run it in production (including feature engineering), and the result metrics. All of these components are combined into a single unified “snapshot”. Whenever you check out a tag, all the necessary pieces for that model are present.

💡 Tip: By choosing the tag name ahead of time, you can add it to the tracking info during training as a parameter. This ensures you can always retrieve the model-data-code “snapshot” from the tracking data using any tracking tool.
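
For example, with MLflow as the tracking tool (a sketch; the tag name is hypothetical and chosen before training starts):

```python
# Log the predetermined tag as a run parameter, then tag the commit after
# training, so the tracking data always points back to the snapshot.
import subprocess
import mlflow

tag = "linear-regression-v3"  # hypothetical, chosen ahead of time

with mlflow.start_run():
    mlflow.log_param("git_tag", tag)
    # ... training, mlflow.log_metric(...), model saving ...

subprocess.run(["git", "add", "-A"], check=True)  # model + data + metrics
subprocess.run(["git", "commit", "-m", f"training run {tag}"], check=True)
subprocess.run(["git", "tag", tag], check=True)
```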

After training, only the tracking data is merged (copied) to your main branch for tracking.

In the simplest case, it can be a JSON text file that contains the hyperparameters and evaluation results. This file is then appended to a list in the main branch. In the case of MLflow, it involves copying the experiments from the mlruns folder to the main branch.

Coding Branches

These branches are for code development and data exploration, training on sampled or small data until you have a working program. While developing, you are welcome to use all Git best practices. However, only branch out to a stable branch when no further changes to the code are required, even if additional data is pulled in. These branches should include the inference code, the server, the Dockerfile, and tests.

There is always at least one development branch that remains active, where all new features, bug fixes, and other changes are merged.

💡 ML and MLOps engineers can collaborate on the training and inference sides.

For example, you can create a dev/model branch where you develop a baseline model. This can be the most popular class for classification or the mean/median for regression. The focus is on setting up the code while thoroughly understanding your data.

When it’s stable and tests pass, we branch out to stable/model, where we train and commit the model, code, and data together to the remote and tag that commit. That is fast and easy to share, and it enables the DevOps, backend, and frontend teams to initiate development and exchange feedback. It also facilitates validating newly discovered requirements in a real-world environment as early as possible.

Next, we develop the model on the dev/model branch into a simple model, like linear regression. When it’s ready and tests pass, we can merge it to stable/model, where we train, commit, and tag a release to prod.

This approach gives you the freedom to incrementally improve your model while preserving the full context of previous models in the stable branch.
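
A minimal sketch of this promote-train-tag cycle as an automation script; the branch names match the example above, the tag is hypothetical, and train() stands in for your own pipeline:

```python
# Promote reviewed code from dev/model to stable/model, retrain, and tag.
import subprocess

def git(*args):
    subprocess.run(["git", *args], check=True)

git("checkout", "stable/model")
git("merge", "dev/model")        # bring in the reviewed training code

# train()  # your training entry point; writes the model and metrics files

git("add", "-A")                 # model, data, metrics, model card
git("commit", "-m", "train linear regression on current split")
git("tag", "v0.2.0")             # the model-data-code snapshot
git("push", "origin", "stable/model", "--tags")
```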

Image by author

From this point, we have three options:

  • We can re-train when more data arrives by pulling data to the stable branch.
  • We can start experimentation using feature engineering on the dev/linear-regression branch.
  • We can create a new dev/new-approach branch for more sophisticated models.

Monitoring Branch

In model monitoring, we care about the data distribution, outliers, and prediction distributions.

In the monitoring branch, we save the queried data, the commit tag, and the model predictions from prod as files.

💡 You can use a separate monitoring branch for each environment — dev, stable, and prod.

We can set alerts on data commits to test for drift in feature distributions, outlier values, and calibration sanity, and we can save the alerting code in this branch as well. This enables more advanced solutions, like an outlier detection model, since we can save that model in this branch too.
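
A sketch of such a drift alert on new data commits, assuming numeric feature columns stored as CSV files in the monitoring branch (paths and threshold are hypothetical):

```python
# Compare the latest committed production features against a reference
# sample with a two-sample Kolmogorov-Smirnov test, per feature.
import pandas as pd
from scipy.stats import ks_2samp

reference = pd.read_csv("monitoring/reference_features.csv")
latest = pd.read_csv("monitoring/latest_features.csv")

for column in reference.columns:
    statistic, p_value = ks_2samp(reference[column], latest[column])
    if p_value < 0.01:  # hypothetical alert threshold
        print(f"ALERT: possible drift in '{column}' (p={p_value:.4f})")
```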

Image by author

Traditionally, this branch would belong to a separate project, decoupled from the code responsible for creating the monitoring logs, as well as from the data and model that generated them.

Analysis Branch

Data science and analysis are another aspect of the project that is often separated into a different project. This is where the data scientists’ analysis code and non-training data are gathered.

A data scientist can check out and pull data from the monitoring branch to run analysis, A/B tests, and other online and offline experiments. They can also use data from the raw branch for these purposes.

Online experiments are simpler, as each experiment group corresponds to a branch.

💡 Tip: Common online experiments:

Forward test — comparing the current model on 99% of traffic vs. a candidate model on 1%.

Backtest — after merging a new model, keep 1% of traffic on the former model to validate the expected effect in reverse.

Having the model tag as a parameter in the monitoring data helps you pinpoint the potential cause of every change in a metric.
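
For instance, a sketch of slicing a metric by model tag, assuming monitoring logs stored as JSON lines with hypothetical model_tag and click fields:

```python
# Group a monitored metric by the model tag logged with each prediction,
# so each experiment group (and each release) can be compared directly.
import pandas as pd

logs = pd.read_json("monitoring/predictions.jsonl", lines=True)
summary = logs.groupby("model_tag")["click"].agg(["mean", "count"])
print(summary)  # e.g. click-through rate per model version
```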

Summary

Image by author

This article introduces a framework for versioning machine learning projects using Git branches. The framework simplifies workflows, organizes data and models, and couples related parts of the project together. It emphasizes using branches as environments, where each branch contains the necessary data, code, models, and documentation for a specific task. The article also discusses key concepts such as using different categories of active branches. Overall, the framework aims to improve workflow efficiency, governance, and collaboration in machine learning projects.

If you want to chat or learn more, join us on our Discord or follow our blog.

Epilogue

Regarding my on-premises challenge, we maintained a “stable” branch for each relevant combination of training code and dataset. After completing the training, we would tag the commit with an appropriate tag (<client-id>-<incremental version>). Clients could pull the most recent tag, just like any other release.
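
A sketch of computing the next incremental tag for a client under this scheme; the client id is hypothetical:

```python
# Find the highest existing <client-id>-<version> tag and create the next one.
import subprocess

client = "client-42"  # hypothetical client id
out = subprocess.run(
    ["git", "tag", "--list", f"{client}-*"],
    check=True, capture_output=True, text=True,
).stdout.split()

versions = [int(tag.rsplit("-", 1)[1]) for tag in out] or [0]
subprocess.run(["git", "tag", f"{client}-{max(versions) + 1}"], check=True)
```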

When “debugging” a client, we would refer to the tag at a specific moment to review the code and corresponding data. We could also match the monitoring data using the same tag, which was added to the monitoring data. The analysis notebooks could be found on our ds/client-id branches.
