DIY machine learning training pipeline

Or how to automate your machine learning training pipeline using Git and a CI server.

Ioannis Foukarakis
Towards Data Science



A machine learning pipeline is more than some tools glued together. It is a process that can make the lives of data scientists and developers easier by removing mundane tasks and error-prone steps. Automating the training process requires the collaboration of many different team members. The term MLOps has been used to describe this combination of software engineering, machine learning and system operations needed to bring models to production. Platforms for managing machine learning pipelines already exist. In the next paragraphs, a solution using Git and a CI/CD server as the core of the pipeline is presented.

The reasons for writing this article are:

  • To highlight the basic parts and the most common problems encountered when building a machine learning pipeline.
  • To suggest a design for a training pipeline that can be used by smaller teams without major effort in deployment.

The training pipeline

Let’s start with a description of a typical machine learning training pipeline. These pipelines consist of three steps:

  • Dataset preparation
  • Model training
  • Deploying the model

Each of these steps has its own details.

Dataset preparation

Data may come from different sources in different formats. It needs to be merged, cleaned, transformed etc. The output is a dataset that can be used for training a model. It is really important to store these datasets in a place where everyone on the team can easily retrieve them. Datasets are also required for reproducing the results of past experiments or for testing different approaches to the problem. Datasets also need to be immutable. For example, a dataset may be generated from a database. A database reflects the current state of its data and changes with every update, so even running the same dataset generation script with the exact same parameters may result in a different dataset. A popular solution is to store the generated files in some object storage like Amazon S3 or Google Cloud Storage, usually following some conventions for the keys/paths to the objects.

An aspect that is often overlooked is making the code that generated a dataset discoverable from the dataset (and vice versa). This helps in cases like generating a newer version of the dataset with more data or creating a revised version after bug fixes.

Furthermore, it is important to know how the dataset was generated: which database connections were used, which parameters were passed as input and so on.
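
One lightweight way to keep this information around is to upload a small metadata file right next to the dataset, so whoever finds the dataset can also see how it was produced. Below is a minimal sketch; the bucket, path, file name and parameter values are placeholders, not something prescribed here.

#!/bin/bash

# Sketch: store generation metadata next to the dataset itself.
# Bucket, prefix and parameter values below are placeholders.
DATASET_PATH=s3://bucket/project/datasets/2019-q4
cat > data/metadata.json <<EOF
{
  "generated_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "db_url": "redshift://host:5439/warehouse",
  "date_from": "2019-01-01",
  "date_to": "2019-12-31"
}
EOF

aws s3 cp data/dataset.csv ${DATASET_PATH}/dataset.csv
aws s3 cp data/metadata.json ${DATASET_PATH}/metadata.json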

Model training

As soon as the dataset is ready, it may be used for experimentation. The training process may generate multiple artifacts (models, reports, scalers, vocabularies etc). Moreover, it is common to try different solutions or iterate on the best performing models. There are a lot of tools for tracking the performance of individual models. However, most of them lack a way to connect the results/reports with the generated artifacts.

Like dataset preparation, model training usually results in a lot of different branches of the code, each one for a different experiment. Keeping the connection among the source code of the model, the dataset used as input and the generated model/reports is really helpful.

Finally, any parameters or configuration that was required for running the model must be logged. If you run training on your laptop or an EC2 instance by manually typing the training commands, you are likely to lose or forget this information.

Deployment

We’ve made it. We have a model that looks promising and we want to deploy it to production. I’m not going to go into details, but there are a lot of different ways this may work, including

  • serving the model over a RESTful endpoint,
  • processing events from streams or queues or
  • batch processing input data.

But we need to be sure that the correct version of the model is deployed. A really bad practice is manually copying trained models. This can introduce a lot of errors. The model may be the wrong version, or a REST endpoint may have updated code that is incompatible with the model’s code.

Requirements

Having in mind all the above, we can create a list of requirements for building a machine learning training pipeline.

  • Datasets must be “easy” to access.
  • Datasets must be versioned.
  • It must be possible to determine the code that was used for creating a dataset.
  • Dataset generation must be automated.
  • It must be possible to generate different versions of the dataset.
  • Model training must be automated.
  • Models must be versioned.
  • Results must be reproducible.
  • It must be possible to train different models.
  • It must be possible to determine the version of the code and dataset that was used for training a model.
  • The data scientist/developer must be able to choose which model gets deployed.
  • When deploying a model, it must be possible to backtrack the whole training process.

Building the training process

The requirements can be grouped into three categories:

  • All artifacts (code, datasets and models) need to be versioned. The versions of each artifact need to somehow be connected with the version of all related artifacts. For each dataset, it must be possible to find the related code, and for each model the related code and dataset.
  • Time consuming tasks such as dataset generation and model training need to be automated.
  • The data scientist/developer needs to be in control of what and when to deploy.

Versioning things

It’s quite clear that versioning of code, datasets and models is one of the most important parts. It is required both for having reproducible experiments and for tracing the code that was used for generating datasets, extracting features etc. There are a lot of version control systems (VCS) for versioning code. Git is the most popular one and it’s quite safe to assume that most teams use it (I hope nobody still exchanges code via email). It’s also the one we’ll use for the purposes of this article. Git allows a team to work on different parts of the project in parallel, testing different solutions in branches. Having said that, here are a few assumptions:

  • All code lives on Git.
  • Code for generating the dataset, training the model and serving the model is in the same repository. This makes it easier to refer to code from other steps. You can definitely have other libraries (as long as you take care of versioning) or use different repos, but having all code in a single repo will make things easier down the road.

Like other VCSs, Git is great at versioning code. In addition, it allows you to keep branches of your code and select which branch to merge to the master branch. A first idea might be to add the other artifacts (like datasets and models) to Git too. This is possible, but it has a few problems:

  • Adding big or binary files to Git may cause the repository’s size to increase, resulting in really big transfers when you want to push changes or clone the repository.
  • Dataset generation and model training are processes that take time. We will share some ideas on how to automate them later on. These long-running processes will need to commit artifacts to the Git repo, potentially leading to problems such as merge conflicts.

Instead of storing the files inside Git, we will use some “pointers” from Git to external systems. Git identifies different versions of the code by a unique hash. Whenever we run some code and generate some artifacts (datasets, models, reports), we can use the hash to refer back to the version of the code. We can do the opposite as well: given some artifacts and the hash, we can find the version of the code that was used to generate them. What we need to do is to follow these conventions:

  • Datasets must be stored in the object storage under a path that contains the git hash.
  • Models must be stored in the object storage under a path that contains the git hash.
  • The git hash must also be stored along with the code whenever it is distributed — i.e. when deploying to a server. However, if you use some versioning scheme, you can also derive the git hash from the version.
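
As a minimal sketch of the last convention (the GIT_HASH file name, image name and label below are assumptions, not something the pipeline requires), the build step can write the hash into a file that ships with the code, or attach it to the Docker image:

#!/bin/bash

# Sketch only: record the git hash alongside the code that gets deployed.
GIT_HASH=$(git rev-parse --short HEAD)
echo "${GIT_HASH}" > GIT_HASH

# If the service ships as a Docker image, the hash can also become the
# image tag and a label, so it is visible from the running container.
docker build -t myservice:${GIT_HASH} --label "git-hash=${GIT_HASH}" .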

But wait, there are a couple of issues. What about the configuration parameters used for generating datasets or training models? They must be stored inside the git repo; when you actually run the code that generates the artifacts, you just need to read them. We can package the configuration files together with the code (i.e. inside a Docker image). Or we can have a command line interface for running each step of the process: configuration can then be passed as command line parameters, and we can write a small bash script that contains all the configuration in it.

What if we want to test multiple versions? We just create different branches. Then, we can merge to master the one that has the best results.

Another question is that if everything is based on the git hash, how do you relate datasets with training code? Do they need to share the same git hash? One way to do it is to require that the dataset and the training code come from the same commit. However, this will not allow us to train different models on the same dataset. An improved solution is to pass the git hash of the dataset as part of the configuration of the training script. This allows for a range of options:

  • The dataset may be coming from any branch.
  • Multiple models may reuse the same dataset (since it’s just a link).

Let’s have a look at an example bash script. It runs the imaginary train command, which is responsible for training the model. The script is stored inside the git repository, so any change to it is versioned. The intention is for it to be used as the entry point when running the automated model training task.

#!/bin/bash

# Dataset version to use. Note that this is the commit hash of
# the dataset generation's code, not the hash of the current commit!
DATASET_VERSION=da7a1

# Download dataset
aws s3 cp s3://bucket/project/datasets/${DATASET_VERSION}/dataset.csv data/dataset.csv

echo "Training model..."

# Imaginary command that trains a model using the given parameters
train data/dataset.csv --estimators 50 \
    --max-depth 10 \
    --output model.clf
...

Let’s have a look at an example focusing on the flow. Code is on the master branch. A new branch named dataset is created for writing the dataset generation code. After a couple of commits the code is ready (git hash da7a1). Some magic process (we’ll get to that later) runs the code from the branch and uploads the two files of the dataset to the dataset object store. Now the dataset generation code and artifacts are “connected”. You may add more commits to the branch and do the same again, or you may create new branches. You can even do this in parallel. For now, let’s say we’re happy with the result and da7a1 is merged back to the master branch.

The next step is experimenting with a couple of models. Two different branches are created, one for each model. Similar to dataset generation, model training will result in uploading the artifacts to the model object store, in a path using the git hash. One thing to note here is how to connect the dataset in the dataset object store with the model code. As described earlier, this can be done using a configuration file in the repo. The file may contain a link to the dataset or the hash of the branch.

Example of Git based flow

Automation

Now that we’ve established the conventions about where each artifact will “live”, it’s time to proceed with automating the individual parts. Dataset generation and model training are tasks that need to be executed at some point in the project. Some process must detect changes in the code, trigger the execution of the tasks and upload the results to the object store. Since the code is in a VCS, a continuous integration (CI) server is a perfect match. A CI server is able to monitor repositories for changes and run commands upon specific triggers. A really common use case is to run unit tests when a pull request is created. This way successful test execution is enforced before merging the code.

For our case, we can have the following steps:

  1. Run tests on code (to be sure we didn’t break anything with our changes).
  2. Trigger task that runs the code.
  3. Store result in proper path at target object store.

Step #1 is quite obvious. Step #2 depends on whether it’s dataset generation or model training. There are a lot of different ways to perform it, depending on available infrastructure, size of data etc. Here are some ideas:

  • Run inside the CI server (although this might not be possible or the best solution).
  • Submit to some always-on infrastructure (a server, a Kubernetes cluster as a Kubernetes job, a Spark job etc).
  • Submit to infrastructure allowing on-demand tasks (such as AWS Sagemaker).

Step #3 can also be done in different ways. For example, Spark might output a dataframe to S3. Sagemaker can be configured to upload results to a specific S3 path. Or your code may upload the results directly to S3, or use the AWS CLI to push them. However, you need to make sure that the CI server will pass the path containing the git hash.
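
If the CI server happens to be Travis, one option (sketched below; the bucket and prefix are placeholders) is to read the commit from the build environment instead of calling git inside the script, since Travis exposes it via the TRAVIS_COMMIT environment variable:

#!/bin/bash

# Sketch: build the target path from the commit the CI server is building.
# TRAVIS_COMMIT is provided by Travis; bucket/prefix are placeholders.
VERSION=$(echo "${TRAVIS_COMMIT}" | head -c7)
TARGET_PATH="s3://bucket/project/datasets/${VERSION}"
aws s3 cp data/dataset.csv "${TARGET_PATH}/dataset.csv"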

Here’s an example. Your code exposes a command line interface for generating the dataset. You can write a small bash script like this one:

#!/bin/bash

# Get current git hash (short version)
VERSION=$(git rev-parse --short HEAD | head -c7)

echo "Creating dataset..."

# Imaginary command that extracts data from a database and stores
# them to a CSV
dataset generate --db-url redshift://user:pass@host:5439/warehouse \
    --from 2019-01-01 \
    --to 2019-12-31 \
    --output data/dataset.csv

# Upload generated file to S3
aws s3 cp data/dataset.csv s3://bucket/project/datasets/${VERSION}/dataset.csv

When run, it will generate the dataset and upload the file to S3. Another benefit of this approach is that the parameters passed to the dataset generation script are also stored inside the git repo, as they are part of the bash script. You can definitely follow the same logic for model training.

As discussed previously, the CI server will be responsible for triggering the execution of the tasks. But when do we want to trigger them? The naive approach would be on every pull request opened against the main branch. But this would trigger the tasks for every follow-up commit to the branch (even a small commit to the project’s README file). It is time consuming, resource hungry and may produce duplicate artifacts. Instead, we can rely on the judgement of the data scientist or developer, who decides when to start the execution of the task. Something similar is done when instructing a CI server that a new release of the project needs to be prepared: a common practice is to create a tag. The CI server can trigger a task upon detecting a tag. Even better, we can establish some convention for the tags, in order to trigger different types of tasks per tag type. For example, tags prefixed with dataset/ can be used for triggering dataset generation, while experiment/ can be used for triggering training of different models or experiments. The following figure demonstrates exactly this.

Git tags for triggering tasks

The developer created a branch for dataset generation. As soon as a tag (dataset/1.1) is added to the commit with hash cafe1, the task is picked up by the CI server, resulting in data stored in the object store. After checking the result, a bug is fixed and a new commit is pushed. As soon as a new tag (dataset/1.2) is created, the CI server runs the new task, storing the output to a new location.
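
For reference, adding such a tag is just a couple of git commands; the CI server picks it up as soon as the tag is pushed (tag names as in the figure):

# Tag the current commit to trigger dataset generation...
git tag dataset/1.1
git push origin dataset/1.1

# ...or, on an experiment branch, to trigger model training
git tag experiment/model1
git push origin experiment/model1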

Similarly, two branches may exist in parallel for training models. Tags are applied and the CI server triggers the model training task. Note that in branch model2 the tag is added on the second commit. No task has been run on the first commit; tasks only run if a tag exists.

Getting a bit more technical, here’s an example of how to achieve this flow using Travis:

language: python
python:
  - '3.6'
git:
  depth: false

jobs:
  include:
    - stage: test
      script: python setup.py test
    - stage: extract dataset
      script: source scripts/generate-dataset.sh
      if: tag =~ /^(dataset\/)/
      before_script:
        - pip install -e .
    - stage: train model
      script: source scripts/train.sh
      if: tag =~ /^(experiment\/)/
      before_script:
        - pip install -e .

Travis uses stages to group jobs. Each stage is a part of a build that may or may not be triggered. In the above example, stage test always runs. Stage extract dataset runs the scripts/generate-dataset.sh bash script ONLY IF a tag with prefix dataset/ exists. Similarly, stage train model runs scripts/train.sh only if the tag starts with experiment/.

Deployment

Deployment is the final step of the pipeline. As discussed earlier, a model may be deployed in many different settings (streaming, REST endpoint, batch etc). The important part is to have the version of the model that matches the current state of the code. The developer/data scientist should also be in control of which model makes it to production. The simplest solution is that if something needs to be in production, it needs to be merged to the master branch. The CI server can then download the latest version of the model, package it together with the code, run tests and deploy.

But how will the CI server know which is the latest version? This can be done by simply getting the last tag of the master branch that starts with experiment. Here’s how to do it with git and a sprinkle of bash magic:

$ git rev-list --tags | xargs git describe | grep experiment | head -n 1
experiment/model2

Instead of the tag name, it’s possible to get the git hash:

$ git rev-list --tags | xargs git describe | grep experiment | head -n 1 | xargs git rev-list -n 1 | head -c 7
caca000
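
Putting the two together, a deployment job could look roughly like the sketch below. The bucket, the models/ prefix and the model file name are assumptions, following the same hash-based convention as the dataset paths earlier.

#!/bin/bash

# Sketch of a deployment step: resolve the latest experiment tag on master
# to a commit hash and fetch the matching model from the object store.
MODEL_HASH=$(git rev-list --tags | xargs git describe | grep experiment \
    | head -n 1 | xargs git rev-list -n 1 | head -c 7)
aws s3 cp s3://bucket/project/models/${MODEL_HASH}/model.clf model.clf

# ...package model.clf together with the serving code, run tests and deploy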

Last thoughts

In this article, I have described a flow that uses Git and a CI server to drive machine learning pipelines. I’d like to discuss a couple of topics before closing.

When developing, it’s considered good practice to write small self-contained commits that get reviewed and merged to the master branch fast. This contrasts with machine learning project setups, where different versions of a model need to be tested and fine-tuned before reaching the final result. A lot of branches (especially the ones used for different versions of a model) might never be merged to master. This is one of the reasons the suggested convention for model tags is experiment rather than model.

Moreover, there are a couple of possible issues:

  • Merging branches before tasks have completed. This may break the CI/CD process. For example, if a branch with an experiment tag is merged to master before training has completed, the CI server will not be able to find the trained model, leading to a broken build. This is actually a good thing, as it doesn’t make sense to push the code without the model :) A quick fix is to delete the newly merged tag (see the sketch after this list).
  • Squash-merging branches (or rewriting history in general). This collapses the individual commits into a new one (with a new git hash), thus breaking the link between code and artifacts.
  • Rebasing may also result in the same situation. But rebasing means adding your code after someone else’s code, so in most cases it might be a good idea to trigger the pipeline again anyway, just to validate that the expected performance of the models hasn’t changed.
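
For the first case, removing the offending tag both locally and on the remote (the tag name below is just the one from the earlier example) looks like this:

# Remove the tag locally...
git tag -d experiment/model2
# ...and on the remote, so the CI server stops acting on it
git push origin --delete experiment/model2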

The branching model described here also assumes that there is a master branch and feature branches. It is very close to GitHub Flow, but can definitely be adapted to other branching models such as Gitflow. Similarly, if your project needs to support multiple versions or has some other requirements, you might need to adapt.

There are also a couple of things not discussed here that can definitely be supported. The first one is recurring training: automating the whole process (from dataset generation to deployment) so that it uses new data. The second is how to log the results and hyper-parameters of each experiment. Once more, there are multiple solutions available, including sacred, mlflow or AWS Sagemaker’s training metrics. In any case, the git hash may be used as a logged parameter.
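
Recurring training, in particular, fits the same tag-based convention: a scheduled job (cron, a CI cron build etc.) only needs to push a fresh tag and the rest of the pipeline runs as usual. A minimal sketch, with a hypothetical date-based naming scheme:

#!/bin/bash

# Sketch of a scheduled job that re-triggers dataset generation with new data.
# The date-based tag name is just a naming suggestion.
TODAY=$(date +%Y%m%d)
git tag "dataset/nightly-${TODAY}"
git push origin "dataset/nightly-${TODAY}"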


