Managing Machine Learning projects

A journey of tooling decisions

Tobias Petri
Towards Data Science

--

When I started working full time as a Machine Learning researcher at slashwhy (formerly SALT AND PEPPER Software), I came from a background of developing in Unity with C# — and also in Python. Over the course of a few months, our team grew to a total of 5 people, and we soon had to face a few very essential questions about the way we work and organize ourselves. As a quick background, we almost always work on deep learning projects with a wide range of customers from the industrial sector, spanning everything from large volumes of machine data to optical recognition and quality assurance. As such, our work always involves tight interaction with our customers: to understand the data, the challenge, and the value that should be achieved through the use of Machine Learning.

The questions that we had to ask ourselves were the following:

  • How do we write our code and structure our projects?
  • How do we manage our ML experiments?
  • How do we communicate our work to our customers?
  • How do we assure the quality of our code and results?

While some of these questions might seem trivial at first, their answers are not straightforward. Back when we started out, we had limited computational resources as well as a limited budget — a situation that might sound familiar to anyone who has worked in a small, young team before. So, while trying to find answers to our questions, we avoided expensive out-of-the-box products and instead focused on freely available open-source tools. Let’s look at each of these questions in turn, along with our solutions and experiences.

How do we write our code and structure our projects?

To Jupyter or not to Jupyter?

Three out of five people in our team had a background in more traditional software development. As such, using git as version control and having clear coding guidelines and style guides for projects was a concept we wanted to apply to our ML work as well.

GitLab was already used by almost everyone in our company, so version control was not an issue. Code and project structure, on the other hand, was.
Discussing how we might properly structure our workflow quickly brought up the question of Jupyter notebooks — and how to handle them in our daily routine. Looking at how popular notebooks and Google Colab have become, you may ask: “Why wouldn’t I just use Jupyter notebooks for everything?”. In our experience, Jupyter notebooks are great for exploratory work such as initial data exploration and even for mocking up quick prototypes. They are easy to use and good at keeping track of what you are doing with your data. But once your project grows and things like Docker deployment, code reusability, and testing enter the stage, they quickly become an ugly mess. They also do not play well with version control, since the logic of your algorithms is entangled with the representation syntax — making reviews a mess (especially if someone forgets to clear the output before committing).

Thus, we decided to create a structural basis for all our upcoming projects in the form of a template. In its first version, it was basically a bash script that created a file and folder structure which looked something like this:

src/
├── train.py
├── inference.py
├── data_processing/
│   └── data_loader.py
├── utils/
│   └── ...
├── models/
│   └── model.py
└── visualization/
    └── explore.ipynb
.gitignore
README.md
requirements.txt
run.py

It was essential for us to keep our different projects as similar as possible. After all, we do customer work — new projects arrive at a steady pace, and it is initially not clear whether they will survive the proof-of-concept phase. We wanted to be able to look into a project that one of our team members had been working on and immediately understand what is going on, without having to read through all of their code first. Also, having clearly structured projects makes communication with our customers much easier.

Furthermore, we decided to use a single script called run.py as our central interface for all projects: It has to be identical in all repositories and manage things such as GPU usage and running either training or model inference on the specified data.
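To give an idea of what such a central interface can look like, here is a minimal sketch of a run.py built around argparse. The function names train and run_inference and the argument names are illustrative placeholders, not our exact implementation:

# run.py -- illustrative sketch of a central project entry point
import argparse
import os

from src.train import train                # hypothetical training entry point
from src.inference import run_inference    # hypothetical inference entry point


def main():
    parser = argparse.ArgumentParser(description="Central project interface")
    parser.add_argument("mode", choices=["train", "inference"],
                        help="Whether to train a model or run inference")
    parser.add_argument("--data", required=True, help="Path to the input data")
    parser.add_argument("--gpu", default="0", help="GPU id(s) to use, e.g. '0' or '0,1'")
    args = parser.parse_args()

    # Restrict the visible GPUs before any DL framework initializes CUDA
    os.environ["CUDA_VISIBLE_DEVICES"] = args.gpu

    if args.mode == "train":
        train(data_path=args.data)
    else:
        run_inference(data_path=args.data)


if __name__ == "__main__":
    main()

Keeping this entry point identical across repositories means that anyone on the team can start a training or inference run in any project without reading its internals first.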

While we decided against using Jupyter notebooks for our general coding routine, we still kept an exploratory notebook for the initial data exploration as part of our template.

Having our projects set up this way already helped a lot, but still didn’t solve a few essential issues:

  • We needed to somehow manage our training runs
  • We needed to document what we were doing within a project — properly!
  • Our method of tuning hyperparameters was rather intuitive up to that point and not very structured
  • We had no way of checking more complex data operations in an automated way

How do we manage our ML experiments?

At this point, we had a clear code structure within our projects and even a style guide where we agreed on what coding guidelines to use. What we were still lacking was any way of methodically organizing our actual core work — ML experiments.

Back then, we relied on TensorBoard for result tracking and visualization. This was mostly fine, but there was no central place to store and compare our results. Also, since we ran training on different machines, we quickly ended up with TensorBoard record files lying around here and there. In short, we were looking for a better solution.

For us and most of our customers, data security and privacy are very important topics. For many clients, using cloud-based services in combination with their data is simply not an option. It is not uncommon to send data via physical hard drives, as people tend to be very careful, especially with actual customer data. Whatever our solution to this problem looked like, it had to be on-premise and not a third-party cloud service.

Luckily, around that time a new open-source project came to life that would change our workflow quite substantially — Sacred. Sacred is a framework dedicated to managing ML experiments by logging them and saving them to a database.

Omniboard. Source: github.com/vivekratnavel/omniboard

We quickly decided to give it a try and turned one of our ML workstations into a MongoDB server. Having a modular project structure with many separate scripts turned out to be a benefit here, as Sacred integrates into this structure very well. Within a short time, we had our first projects running with Sacred support. Now we had a central server with a very structured and organized tracking system — and we decided to use Omniboard as a database viewer.
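For readers unfamiliar with Sacred, a training script wired up to a MongoDB observer looks roughly like this. The experiment name, server URL, and training loop are placeholders, and the exact observer construction can vary with the Sacred version:

# train.py -- minimal Sacred experiment logging to an on-premise MongoDB
from sacred import Experiment
from sacred.observers import MongoObserver

ex = Experiment("demo_training")
# Point the observer at the internal MongoDB server (URL is a placeholder)
ex.observers.append(MongoObserver(url="mongodb://our-ml-server:27017", db_name="sacred"))


@ex.config
def config():
    learning_rate = 1e-3  # captured by Sacred as part of the run's config
    epochs = 10


@ex.automain
def run(learning_rate, epochs, _run):
    for epoch in range(epochs):
        loss = 1.0 / (epoch + 1)  # stand-in for a real training step
        _run.log_scalar("train.loss", loss, epoch)
    return loss

Every run started this way ends up in the database with its config, metrics, and captured output, which is exactly what Omniboard then visualizes.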

The only issue back then was that Sacred still lacked a few features that seemed critical for us:

Hyperparameter config files

When using Sacred, it is common to have a dictionary of hyperparameters as part of your main training script which is used to override parameters in function calls. While we really liked this idea, it didn’t quite fit our structure of separate scripts for each part of the code, so we added a config.py script to our New Project Template. This config script contains a default config that can be accessed via a function call, as well as a config override function which can create a list of permuted configs from multiple values for the same hyperparameter. If, for example, you want to investigate the impact of different learning rates on training performance, simply entering a list of values in the config creates and returns a list of config dictionaries that can in turn be used for grid training.
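A stripped-down sketch of that idea could look like this; the function names and default values are illustrative, not our exact code:

# config.py -- default hyperparameters plus grid expansion
import itertools


def get_default_config():
    """Return the default hyperparameter dictionary for this project."""
    return {
        "learning_rate": 1e-3,
        "batch_size": 32,
        "epochs": 50,
    }


def expand_config(overrides):
    """Create one config dict per combination of the overridden values.

    Any value given as a list is treated as a set of candidates, e.g.
    {"learning_rate": [1e-3, 1e-4]} yields two configs for grid training.
    """
    base = get_default_config()
    base.update({k: v for k, v in overrides.items() if not isinstance(v, list)})

    grid_keys = [k for k, v in overrides.items() if isinstance(v, list)]
    grid_values = [overrides[k] for k in grid_keys]

    configs = []
    for combination in itertools.product(*grid_values):
        cfg = dict(base)
        cfg.update(dict(zip(grid_keys, combination)))
        configs.append(cfg)
    return configs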

Experiment queues and grid training

When we started using Sacred, it was very much focused on running single experiments. It was already possible to merely queue up a run instead of starting it immediately, but this meant typing in a different configuration for each such run. With our config.py script at hand, we therefore extended run.py to take a list of different configs and automatically create a queue in our database from it. A multithreaded training loop with dedicated GPU access per run made it possible to launch a large training grid and have it run automatically without any further need for human supervision.
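Conceptually, the grid-training part of run.py boils down to something like the following sketch. It omits the queueing into Sacred’s database, assumes the expand_config helper from above, and assumes that src/train.py accepts a --config argument — all of which are simplifications for illustration:

# Simplified sketch: run a grid of configs, one training process per free GPU
import json
import os
import queue
import subprocess
from concurrent.futures import ThreadPoolExecutor

from src.config import expand_config  # illustrative helper from config.py above

AVAILABLE_GPUS = ["0", "1"]            # ids of the GPUs reserved for this grid
free_gpus = queue.Queue()
for gpu in AVAILABLE_GPUS:
    free_gpus.put(gpu)


def train_single(cfg):
    gpu_id = free_gpus.get()           # block until a GPU is free
    try:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)
        # Each run lives in its own process so CUDA contexts stay isolated
        subprocess.run(
            ["python", "-m", "src.train", "--config", json.dumps(cfg)],
            env=env, check=True,
        )
    finally:
        free_gpus.put(gpu_id)          # hand the GPU to the next queued run


if __name__ == "__main__":
    configs = expand_config({"learning_rate": [1e-2, 1e-3, 1e-4]})
    with ThreadPoolExecutor(max_workers=len(AVAILABLE_GPUS)) as pool:
        list(pool.map(train_single, configs))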

Last but not least, we decided to write a few custom callbacks to expand Sacred’s logging capabilities (a minimal sketch of such a callback follows after the structure below). At this point, this is what our project structure looked like:

docker/
├── Dockerfile
├── .dockerignore
└── main.py
src/
├── train.py
├── inference.py
├── config.py
├── data_processing/
│   └── data_loader.py
├── utils/
│   └── ...
├── models/
│   └── model.py
└── visualization/
    └── explore.ipynb
.gitignore
README.md
requirements.txt
run.py

As you can see, a docker deployment folder has sneaked its way in there as well. Having all code neatly separated in individual scripts makes it easy to reuse it for tasks such as Docker deployment — or any deployment, in fact.
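As for the custom logging callbacks mentioned above, here is a minimal sketch of what one can look like, assuming a Keras-style training loop (the class itself is illustrative; Sacred’s _run.log_scalar is the actual metrics API):

# Illustrative Keras callback that forwards epoch metrics to a Sacred run
from tensorflow import keras


class SacredLogger(keras.callbacks.Callback):
    def __init__(self, _run):
        super().__init__()
        self._run = _run  # the Sacred run object injected into the main function

    def on_epoch_end(self, epoch, logs=None):
        for name, value in (logs or {}).items():
            # Metrics show up in Omniboard alongside the run's config
            self._run.log_scalar(name, float(value), epoch)

Inside a Sacred-decorated training function, such a callback is simply passed to model.fit(..., callbacks=[SacredLogger(_run)]).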

How do we assure the quality of our code and results?

With Sacred in place, we now had a clear project structure, code style guidelines, and an experiment management system. What we were still lacking though was something very fundamental — continuous testing.

Whenever I read articles about ML, usually written by people with a Python background, I notice that despite large amounts of information on data processing or neural network fine-tuning, code quality is rarely discussed. That is fine if you work in a purely scientific environment and mostly code for yourself. For us, writing code and hoping for the best by testing it manually was simply not enough.

Thus, we added yet another section to our new project template — tests. We decided to use unittest, the classic Python package for testing your code. And then, we started writing tests.

To make our lives a lot easier, we use GitLab Continuous Integration (CI) to automatically run our tests after each commit.

Testing ML projects is often somewhat more complex than in other types of software development, but it can still help quite a lot! What we found most important to test in our project pipeline was the data loading and (pre)processing section. Here, with just a few bits of data added to the repository, you can automatically make sure that whatever changes are made to the data pipeline, the intermediate results still keep their desired range, shape, and types.
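A unit test of this kind might look roughly like the following; the load_dataset function, the expected input shape, and the value range are assumptions for the sake of the example:

# tests/unit/test_data_loader.py -- illustrative checks on the data pipeline
import unittest

import numpy as np

from src.data_processing.data_loader import load_dataset  # hypothetical loader


class TestDataLoader(unittest.TestCase):
    def setUp(self):
        # A few small, committed sample files are enough to exercise the pipeline
        self.features, self.labels = load_dataset("tests/data/")

    def test_shapes_match(self):
        self.assertEqual(len(self.features), len(self.labels))
        self.assertEqual(self.features.shape[1:], (64, 64, 3))  # assumed input shape

    def test_value_range_and_dtype(self):
        self.assertEqual(self.features.dtype, np.float32)
        self.assertGreaterEqual(self.features.min(), 0.0)
        self.assertLessEqual(self.features.max(), 1.0)


if __name__ == "__main__":
    unittest.main()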

Automatic testing and linting checks for that crisp, polished code

Continuous testing becomes a challenge once you look at the training of a model, though. Most of the time this is a time-consuming process and not something you would want to run for every commit. It is also a highly stochastic process in a very sensitive floating-point regime, and we had to write a lot of asserts with an error margin for floating-point accuracy. In the end, we divided our testing efforts, following a common software engineering distinction, into integration and unit tests: we run the more complex and time-intensive tests that require building a full model only on merge requests, and run the faster unit tests that check our data manipulation on every commit.
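To illustrate the kind of tolerance-based assertion we mean, here is a simplified integration test. The build_model factory is a placeholder, and the model is assumed to expose a Keras-style fit/evaluate interface that returns a scalar loss:

# tests/integration/test_model.py -- sketch of a tolerance-based model check
import unittest

import numpy as np

from src.models.model import build_model  # hypothetical model factory


class TestModelTraining(unittest.TestCase):
    def test_short_training_reduces_loss(self):
        model = build_model()
        x = np.random.rand(32, 64, 64, 3).astype(np.float32)
        y = np.random.randint(0, 2, size=(32, 1))

        loss_before = model.evaluate(x, y, verbose=0)  # assumed scalar loss
        model.fit(x, y, epochs=2, verbose=0)
        loss_after = model.evaluate(x, y, verbose=0)

        # Training is stochastic, so only assert improvement within a margin
        self.assertLess(loss_after, loss_before + 1e-3)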

Since we also need a bit of computational power, we decoupled the GitLab runner from the company-wide pipelines and let it run on our dedicated machine learning servers.

How do we communicate our work to our customers?

Photo by Headway on Unsplash

Up to then, we basically had two strategies for sharing intermediate results with our customers: 1 — quickly build a few images and slides that explain our current state, or 2 — use the Jupyter notebook directly. Neither is optimal, since we want to focus our meetings on results and have a slicker way of interacting with the data (instead of altering code and rerunning cells). It is also quite cumbersome to transfer working algorithms from the notebook structure into the pure Python scripts that make up the rest of our code base. So we looked for different tooling.

Thankfully, we found Streamlit and were immediately convinced by its functionality. Streamlit lets you generate a small web application with a minimum of additional code. This was beneficial for us for the following reasons:

  • all you need to write is pure Python files → this integrates neatly into our project template structure
  • the presentation is decoupled from the code, so we can focus on the data and plots in our customer meetings
  • the created HTML pages are easy to present and allow for interactive elements like buttons, selections, and filters that let us move beyond static PowerPoint slides
  • from a development perspective, it is much more pleasant to use than notebooks and allows for a faster transfer of code into the training routines
source: github.com/streamlit/streamlit

During development it can become a bit slow, due to the way Streamlit manages hot code replacement (basically, it doesn’t), but so far we haven’t found a better tool for this aspect of our projects.
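A minimal sketch of such an app, assuming the hypothetical load_dataset helper from earlier, could look like this:

# src/visualization/explore.py -- minimal Streamlit app for result exploration
import streamlit as st

from src.data_processing.data_loader import load_dataset  # hypothetical loader

st.title("Model results explorer")

features, labels = load_dataset("data/")
sample_id = st.slider("Sample", 0, len(features) - 1, 0)

st.image(features[sample_id], caption=f"Label: {labels[sample_id]}", clamp=True)

if st.checkbox("Show raw values"):
    st.write(features[sample_id])

Launched with streamlit run src/visualization/explore.py, this gives the customer an interactive view of the data without anyone touching code during the meeting.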

Our final and current New Project Template (NPT) structure now looks like this:

docker/
├── Dockerfile
├── .dockerignore
└── main.py
src/
├── train.py
├── inference.py
├── config.py
├── data_processing/
│   └── data_loader.py
├── utils/
│   └── ...
├── models/
│   └── model.py
└── visualization/
    └── explore.py
tests/
├── data/
│   └── [project relevant data goes here for testing]
├── unit/
│   └── test_data_loader.py
└── integration/
    └── test_model.py
.gitignore
.gitlab-ci.yml
README.md
requirements.txt
run.py

How does it all come together?

We work in a field that is moving at an incredible speed. It is already hard enough to keep track of all the moving parts in a regular software project, and we found that a Machine Learning project is far less standardized as of now. Our approach to the ML projects we work on puts us somewhere between two popular approaches: the more data-scientific side of validating first experiments with computationally inexpensive algorithms, and the fully cloud-enabled data pipeline with a rich ecosystem of hardware and software resources.

The frameworks we settled on are currently the best fit for our needs, and although this will (most likely) change in the upcoming months and years, we are happy with our choices so far. They keep our work independent and allow for a very individual approach to each project, while also ensuring flexible data handling and easy reuse of proven algorithms and routines. We started small, but we always strive to make our daily work more efficient and to improve our communication with our customers. Maybe this post helped you answer some questions around setting up an end-to-end Machine Learning project, and we hope it was an interesting insight into our technology stack.
