Creating Infrastructure for Training Machine Learning Models

Introducing an Automatic Pipeline for Training, Tracing, and Comparing Machine Learning Experiments

Sivan Biham
Towards Data Science


Photo by Firmbee.com on Unsplash

Let’s imagine the following scenario: you get a new project to work on. For this project, you need to develop a machine learning model, which will require running and training several experiments. Each experiment might take several hours or even days and needs to be tracked.

You have your own laptop for the development phase, but it's not realistic to use your own laptop for training and running all of the experiments. First, your computer might not have the required hardware, for example, a GPU. Second, it’s a waste of time. Why train on one computer sequentially if you can run and train all the experiments in parallel?

If you do train in parallel, on a separate machine for each experiment, how do you easily track and compare the experiments? How do you access the training data? Can you be notified of any failure in real time, so you can fix it and run again?

In this blog post, I want to describe one possible solution to all of the issues above.

The solution combines different services into a single pipeline for training machine learning models.

In the image below you can find the scheme of this pipeline.

Image by author

The pipeline consists of three steps: containerizing the code so we can run it later, executing the code on several separate machines, and tracking all the experiments.

(1) Containerizing

The best way to run training experiments at scale is to build a Docker image. One way to create such an image is Jenkins, an open-source automation server that enables building, testing, and deploying software. Another is GitHub Actions, a continuous integration and continuous delivery (CI/CD) platform that lets you automate your build, test, and deployment pipeline directly from your GitHub account.

Let me describe one possible scenario: each push to our git branch triggers a Jenkins pipeline. This pipeline builds the Docker image (among other things it can do), tags it, and stores it in a Docker registry.
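To make the flow concrete, here is a minimal sketch of the build, tag, and push steps such a CI job ultimately performs. In practice these commands live in a Jenkinsfile or a GitHub Actions workflow; the registry, image name, and tag below are hypothetical.

```python
# Minimal sketch of the build-tag-push steps a CI job runs on every push.
# The registry and image name are hypothetical placeholders.
import subprocess

REGISTRY = "eu.gcr.io/my-project"   # hypothetical Docker registry
IMAGE = "training"                  # hypothetical image name


def build_and_push(git_sha: str) -> str:
    """Build the training image, tag it with the commit SHA, and push it to the registry."""
    tag = f"{REGISTRY}/{IMAGE}:{git_sha}"
    # Build the image from the Dockerfile in the repository root.
    subprocess.run(["docker", "build", "-t", tag, "."], check=True)
    # Push the tagged image so the training machines can pull it later.
    subprocess.run(["docker", "push", tag], check=True)
    return tag


if __name__ == "__main__":
    build_and_push("abc1234")   # in CI, the real commit SHA is used
```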

Once we have this image, we can execute it and train our models at scale using virtual cloud machines.

(2) Executing

We have our code ready and we want to start training. Let’s say we have 10 experiments to run, and we want to run them in parallel. One option is to spin up 10 virtual machines, connect to each one over SSH, and manually execute our code on each of them. I guess you understand that this is not optimal.

Since we already containerized our application in the previous step, we can use the most popular container orchestration platform to run our training: Kubernetes. Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. This means that, by using it, we can take our Docker image and deploy it at scale. We can run 10 experiments on 10 separate machines, using a simple bash script.
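As an illustration, here is a rough Python equivalent of that bash script. It creates one Kubernetes Job per experiment, all running the same image we built earlier with different parameters; the image name, experiment grid, and argument names are hypothetical.

```python
# Launch one Kubernetes Job per experiment. Image, namespace defaults, and the
# experiment grid are hypothetical placeholders.
import json
import subprocess

IMAGE = "eu.gcr.io/my-project/training:abc1234"   # image built in the previous step
LEARNING_RATES = [0.1, 0.01, 0.001]               # hypothetical experiment grid


def job_manifest(name: str, args: list) -> dict:
    """Build a minimal Kubernetes Job manifest for one experiment."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{"name": "train", "image": IMAGE, "args": args}],
                    "restartPolicy": "Never",
                }
            }
        },
    }


for i, lr in enumerate(LEARNING_RATES):
    manifest = job_manifest(f"train-exp-{i}", ["--learning-rate", str(lr)])
    # kubectl accepts JSON manifests on stdin just like YAML ones.
    subprocess.run(
        ["kubectl", "apply", "-f", "-"],
        input=json.dumps(manifest).encode(),
        check=True,
    )
```

Each Job is scheduled by Kubernetes onto its own node, so the experiments run in parallel without any manual SSH work.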

You can read more about it here.

(3) Tracking

Training data acquisition

To execute the training, we first need access to our training data. The best way is to give the training machine direct access to the storage that holds the required training files (S3, GCS, etc.) and to the database with the required tabular information. To establish that, we may need help getting access to the relevant credentials and secret managers (secure, convenient storage for API keys, passwords, certificates, and other sensitive data). Once this is configured, we are good to go.
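For example, here is a minimal sketch of pulling a training file from GCS, assuming the machine’s service account (or a credential fetched from the secret manager) already grants read access; the bucket, object, and local paths are hypothetical.

```python
# Minimal sketch of downloading training data from object storage (GCS here; the
# same idea applies to S3 with boto3). Bucket and paths are hypothetical.
from google.cloud import storage


def download_training_file(bucket_name: str, blob_name: str, local_path: str) -> None:
    """Download a single training file from a GCS bucket to a local or shared disk."""
    client = storage.Client()            # picks up the machine's credentials
    bucket = client.bucket(bucket_name)
    bucket.blob(blob_name).download_to_filename(local_path)


download_training_file(
    "my-training-data",                  # hypothetical bucket
    "images/train.tfrecord",             # hypothetical object
    "/mnt/nfs/data/train.tfrecord",      # hypothetical path on the shared disk
)
```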

NFS disk — shared disk

All the machines used for training are connected to an NFS disk (in our case, we use GCP's managed solution, Filestore), an additional shared disk to which all machines have read and write access. We add this shared disk for three main reasons:

  1. When training on separate machines, we don’t want to download our training data over and over again for each one. Instead, we can connect all the machines to a shared file store, an additional disk that all the machines can access. This way we only download the data once.
  2. Training models can take time and, as such, can be costly. One option to make training cheaper is to use Spot instances, which are available at a much lower price. However, your cloud provider might stop (preempt) these instances if it needs to reclaim the compute capacity for other VMs. With Kubernetes, once the instance is stopped, it is automatically replaced and the same experiment is executed again. In that case, we want to continue training from where we stopped by loading the last saved checkpoint (see the sketch after this list). This is only possible with the shared disk; without it, we would not have access to the checkpoint saved on the previous machine. Other solutions, such as uploading all checkpoints to cloud storage and applying the same logic there, are possible but come with high overhead (and cost).
  3. TensorBoard: to compare experiments from different machines in TensorBoard, all the information must be written to the same parent directory. We therefore write everything to the same parent log directory on the shared disk, which the TensorBoard UI reads from.
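To illustrate the checkpoint-resume logic from the second point, here is a minimal sketch assuming PyTorch and a checkpoint directory on the shared NFS mount; the paths and file-naming scheme are hypothetical.

```python
# Checkpoint-resume sketch, assuming PyTorch and checkpoints saved on the shared
# NFS mount. Directory and file names are hypothetical.
import glob
import os
from typing import Optional

import torch

CKPT_DIR = "/mnt/nfs/experiments/exp-1/checkpoints"   # hypothetical shared-disk path


def latest_checkpoint() -> Optional[str]:
    """Return the most recently written checkpoint on the shared disk, if any."""
    paths = glob.glob(os.path.join(CKPT_DIR, "epoch-*.pt"))
    return max(paths, key=os.path.getmtime) if paths else None


def resume(model: torch.nn.Module, optimizer: torch.optim.Optimizer) -> int:
    """Load the last checkpoint (if one exists) and return the epoch to resume from."""
    path = latest_checkpoint()
    if path is None:
        return 0                                       # fresh start, nothing saved yet
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1                          # continue where the preempted VM stopped
```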

Experiment tracking

One of the most important things in training models is to be able to track the training progress and compare different experiments. For each experiment we want to save the parameters we used, the data we trained on, the loss values, the results, and more. We will use this data for tracking, tracing, and comparing.

There are many tracking tools. I use MLflow, and TensorBoard when I want more advanced visualizations. Both are deployed so that the whole team can access the same instance and see all of the team’s experiments.

MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides a simple API to integrate with and a nice UI for tracking and comparing results.

TensorBoard provides the visualization and tooling needed for machine learning experimentation. Its API logs our information to files on the machine’s disk, and the TensorBoard UI reads those files from a given input directory.
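Here is a minimal sketch of logging a single experiment to both tools, assuming PyTorch’s SummaryWriter for the TensorBoard side; the tracking URI, experiment name, and shared log directory are hypothetical.

```python
# Log one experiment to MLflow (shared team server) and TensorBoard (shared NFS
# log directory). Server address, names, and paths are hypothetical.
import mlflow
from torch.utils.tensorboard import SummaryWriter

mlflow.set_tracking_uri("http://mlflow.internal:5000")    # the team's shared MLflow server
mlflow.set_experiment("my-project")

writer = SummaryWriter(log_dir="/mnt/nfs/tb-logs/exp-1")  # under the shared parent directory

with mlflow.start_run(run_name="exp-1"):
    mlflow.log_param("learning_rate", 0.001)
    for epoch, loss in enumerate([0.9, 0.6, 0.4]):        # stand-in for the real training loop
        mlflow.log_metric("train_loss", loss, step=epoch)
        writer.add_scalar("train/loss", loss, global_step=epoch)

writer.close()
```

Pointing the TensorBoard UI at the shared parent directory (for example, `tensorboard --logdir /mnt/nfs/tb-logs`) then shows the experiments from all machines side by side.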

Error Alerts

As the training might take hours or even days, we would like to get real-time notifications if an error occurs. One way to do so is to integrate our code with Sentry and integrate Sentry with Slack. This way, every error raised in the code is sent as an event to Sentry and triggers a Slack notification.
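Here is a minimal sketch of that integration, with a hypothetical DSN and a placeholder training entry point; the Sentry-to-Slack alert rule is configured on the Sentry side, not in the code.

```python
# Minimal Sentry integration sketch. The DSN and train() are hypothetical placeholders;
# the Slack notification is set up via Sentry's Slack integration, not here.
import sentry_sdk

sentry_sdk.init(dsn="https://examplekey@sentry.example.com/1")  # hypothetical project DSN


def train() -> None:
    """Placeholder for the real training entry point."""
    raise RuntimeError("example failure during training")


try:
    train()
except Exception:
    sentry_sdk.capture_exception()  # sends the error event to Sentry, which triggers the Slack alert
    raise
```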

Now I get notifications about errors in real time. This not only reduced my need to monitor the runs (to see whether they crashed or are still running), but it also reduced the time between an error occurring and a new run being triggered with the fixed code.

Summary

When I first trained models, it was on my laptop. Soon enough, that was not scalable, so I moved on to virtual machines over SSH. This option has its pitfalls too. For each machine, I had to log in, pull the code, download the data, run the code manually, and hope that I did not mix up the parameters I passed. The data part was the easiest to solve: we added a shared disk to all my virtual machines. But I still got confused with the parameters and passed the same ones to several machines. That was the time to move on to the scalable, more automated infrastructure I described in this post. Together with the DevOps team, we defined and built the great pipeline we have today. Now I have a pipeline that is scalable, easy to execute, and even sends a Slack notification every time an error is raised. This pipeline saves time not only because the experiments run in parallel, but also because it minimizes parameter mistakes and the idle time caused by errors we would not know about without actively checking each machine.
