Stress-Free Machine Learning

How to build models quickly without sacrificing code quality

Ilia Zaitsev
Towards Data Science


Building machine learning models isn’t easy. Heavy datasets and tricky data formats. A ton of hyper-parameters, models, and optimization algorithms. Not to mention generic programming adventures like debugging, handling exceptions, and logging.

This is especially true for R&D (Research and Development) style work, where models, approaches (and sometimes even the data itself!) can change very quickly. You don’t want to invest a large amount of time and effort in building something that could become irrelevant very soon. But you also don’t want to turn your project into a pile of Jupyter notebooks and ad hoc scripts, with a ton of files scattered across the file system. All these train_model.py, train_model_adam.ipynb, train_model_adam_v2.py, train_model_final.pth: I bet you know what I'm talking about. Is there a better way?

In this post, I would like to share an approach that served me well during several quickly evolving projects I’ve worked on recently. The post is structured in the following way.

  1. Environment — how to make sure that the environment can be easily redeployed if required, for example, when moving from a workstation with low computational power to a dedicated server?
  2. Data Access — how to organize data access, and how to switch between several datasets?
  3. Training — what to consider when organizing a training process?
  4. Tracking and Reporting — how to track the performance metrics and generate comprehensive reports to compare experiment runs to each other?

If these points sound interesting to you, let’s dive deeper! We’ll pick a few image datasets and try to find a model architecture that works best for them all.

TL;DR: Please access the repository with the code used in this post to run an end-to-end training process using all tips and tricks described here. All generated plots are available there.

1. Environment

This step is probably the simplest one. When you need an isolated programming environment, Python’s ecosystem gives you plenty of choices. Usually, for machine learning projects, I go with Anaconda. Here you can specify a Python interpreter version and a list of packages to install.

conda create -n my_env -c conda-forge -c pytorch \
python=3.7 \
pytorch=1.6 torchvision pytorch-lightning wandb

Note that conda has many distribution channels. If you’re trying to install a package and the installation command fails, try using the -c keys as shown above to include additional channels in the search. When a package isn’t available on any channel, you can always fall back to pip.

The conda manager supports exporting the environment into a YAML file that includes the exact versions and the distribution channels the packages are installed from. When the environment is ready, you can run the following command to get a snapshot of your setup.

conda env export -n my_env | egrep -v "^prefix: " > my_env.yml

The egrep command strips your local installation path from the environment definition. Use the following command to re-create your environment.

conda env create -f my_env.yml

Add this file to a repository, and now you can easily re-create your environment on any other machine!

Of course, there are other ways to set up a Python environment, like the commonly used virtualenv. I personally find the conda manager more convenient and robust when dealing with scientific packages. (Though that could be just a matter of habit and personal taste, of course.)

2. Data Access

I like to work with PyTorch and its ecosystem a lot, and one of the reasons is its approach to data access. As you may know, the library relies on a very straightforward, easy-to-implement Dataset interface that represents a randomly accessible array. The following snippet shows a trivial way to implement it.

from typing import List, Tuple
import torch

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, items: List, targets: List):
        self.xs, self.ys = items, targets

    def __getitem__(self, item: int) -> Tuple:
        return self.xs[item], self.ys[item]

    def __len__(self) -> int:
        return len(self.xs)
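
For illustration, here is a tiny, hypothetical usage sketch: an instance of this class can be handed straight to a DataLoader to get batched tensors.

items = list(range(100))                        # toy inputs
targets = [i % 2 for i in range(100)]           # toy labels
ds = MyDataset(items, targets)
loader = torch.utils.data.DataLoader(ds, batch_size=16, shuffle=True)
xs, ys = next(iter(loader))                     # two tensors of shape (16,)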

Though simple, this concept actually gives a developer great flexibility. The samples could be taken from a folder of files, or fetched from some structured storage, like an HDF5 file, a Redis database, or even an S3 bucket. Even more conveniently, in some cases you don’t need to implement this interface yourself: the majority of widely used image datasets are already provided as part of the torchvision library.

Note that torchvision datasets implement data downloading on your behalf. You don’t need to do it “manually”; just pass the download=True option to the class’s initializer.

And it is always a good idea to automate as many aspects of your setup as you can: creating the virtual environment, downloading the data, setting up folders and environment variables.

So it is a very convenient feature. Imagine that you want to deploy your code on a different machine. If this functionality weren’t implemented, you would need to download the data with wget or curl and put it into the proper location. All these little actions consume time and distract from the actual work on your models. And even if your dataset isn’t available out of the box, it is easy to implement one from scratch, as our trivial example above shows.

Let’s pick the following widely-known datasets:

  • CIFAR10
  • CIFAR100
  • FashionMNIST

Each of them is implemented as part of the torchvision package. But before we can feed these datasets into a training loop, we should wrap them with DataLoader objects that convert sets of samples into tensor batches. The following snippet shows a convenience function that takes a path to the data folder and the name of a dataset and makes all the required preparations to initialize the data loaders. If the data is not there, the dataset class downloads it.
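
The repository contains the full version; what follows is only a minimal sketch of the idea. The create_data_loaders name, its signature, and the dataset keys are illustrative; only the DATASET_FACTORY dictionary mentioned below and the torchvision classes come from the actual setup.

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

DATASET_FACTORY = {
    'cifar10': datasets.CIFAR10,
    'cifar100': datasets.CIFAR100,
    'fashion_mnist': datasets.FashionMNIST,
}

def create_data_loaders(root: str, name: str, batch_size: int = 256):
    """Downloads the named dataset (if needed) and wraps it into train/valid loaders."""
    tfms = [transforms.ToTensor()]
    if name == 'fashion_mnist':
        # FashionMNIST is grayscale; repeat the channel to match CIFAR's 3-channel format
        tfms.append(transforms.Lambda(lambda t: t.repeat(3, 1, 1)))
    tfms = transforms.Compose(tfms)
    factory = DATASET_FACTORY[name]
    train_ds = factory(root, train=True, download=True, transform=tfms)
    valid_ds = factory(root, train=False, download=True, transform=tfms)
    return (DataLoader(train_ds, batch_size=batch_size, shuffle=True),
            DataLoader(valid_ds, batch_size=batch_size))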

Here we use the DATASET_FACTORY dictionary, which stores a mapping from dataset names to factory classes. Each class exposes the same interface, so dispatching is easy to implement.

Note that the FashionMNIST dataset is slightly different from the first two: its samples are single-channel grayscale images instead of the three-channel color images found in CIFAR. Therefore, we should take care of this and adapt the samples accordingly. The simplest way is to include an additional transformation that repeats the channel three times, as the sketch above does.

That’s it! Now we’re ready to start experimenting.

3. Training

The datasets are ready; now we should set up a training loop. We use the PyTorch-Lightning library for this purpose. It ships many neat training features right out of the box, including model snapshots, early stopping, metrics logging, and more. It sounds like a great solution for getting rid of boilerplate code.

To use this library, we only need to inherit from its LightningModule base class and define some callback methods. I won’t describe these steps in detail here, since mastering PyTorch-Lightning is itself a topic for a series of posts. The only thing I would like to highlight is that, in my experience, it is a good idea to write a simple base class that implements the required hooks and use it as your base instead of LightningModule. Then you can implement your experiments in a very straightforward way, as the following example shows.
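
The actual experiment code lives in the repository; the following is only a rough sketch of the idea, assuming a Lightning version where self.log is available. The class contents, hyper-parameters, and the baseline network are illustrative.

import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class BaseExperiment(pl.LightningModule):
    """Implements the hooks shared by all experiments; subclasses only define the model."""
    def __init__(self, lr: float = 1e-3):
        super().__init__()
        self.lr = lr

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log('valid_loss', F.cross_entropy(self(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

class BaselineExperiment(BaseExperiment):
    """A very simplistic convolutional baseline."""
    def __init__(self, n_classes: int = 10, **kwargs):
        super().__init__(**kwargs)
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(3, 16, kernel_size=3, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(16, n_classes))

    def forward(self, x):
        return self.net(x)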

Here the BaseExperiment class inherits from LightningModule and implements basic logic common to all our experiments. (Check out the PyTorch Lightning Bolts repository to see how this idea is applied in more realistic scenarios.) The derived module creates a very simplistic model as a baseline for our experiments.

What is also important, the library can expose its parameters via the CLI, so we can easily tune our training process. Usually, I try to expose everything as a CLI option to simplify automation. A drawback of this approach is that the argument parser becomes very large. To tackle this, we can split it into smaller groups, as the snippet below shows.
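
Here is a condensed sketch of such a parser. The specific options are illustrative; create_default_parser is the function referenced below, and Trainer.add_argparse_args is the helper Lightning provided in the versions this post targets.

from argparse import ArgumentParser
import pytorch_lightning as pl

def create_default_parser() -> ArgumentParser:
    """Combines dataset, experiment, and Trainer options into one CLI parser."""
    parser = ArgumentParser()
    data = parser.add_argument_group('dataset')
    data.add_argument('--dataset', default='cifar10')
    data.add_argument('--dataset_path', default='/tmp/data')
    exp = parser.add_argument_group('experiment')
    exp.add_argument('--lr', type=float, default=1e-3)
    exp.add_argument('--batch_size', type=int, default=256)
    # The Trainer contributes its own options (epochs, GPUs, callbacks, etc.)
    return pl.Trainer.add_argparse_args(parser)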

See how the create_default_parser function combines several CLI groups, including the Trainer parameters. We can then run a training process and override any training parameter: scheduler, optimizer, number of training epochs, early stopping criterion, etc.

Now we can easily launch an experiment using a shell command similar to the one shown below. (See the repository for the full-featured end-to-end example that uses a bit more involved training script and one more experiment class using pre-trained ResNet models).

python train.py --dataset=cifar10 --dataset_path=/tmp/data
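
For example, switching the dataset and overriding a couple of Trainer options might look like this (the exact flag names depend on how the parser is defined, so treat them as illustrative):

python train.py --dataset=fashion_mnist --dataset_path=/tmp/data \
    --max_epochs=50 --gpus=1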

Ok, the training process is set up. We have our experimenting code and can easily add new architectures and datasets! It is time to take the final crucial step: set up logging that helps us track the model’s quality and keeps all of its configuration in a single place.

4. Tracking and Reporting

Everyone who works with data modeling knows how fragile things can be. One parameter changed here, another option or feature switched off there, and a promising model turns into something mediocre and unreliable.

Machine Learning pipelines can be very sensitive to even a minor change in the parameters. So it makes sense to log every small piece of information.

Tracking parameters in a notepad or an Excel sheet works in some cases, but this approach is fragile. (Probably as fragile as storing backups of your code in archives instead of using proper version control…) Storing the tracked metrics as CSV or JSON files works but requires additional effort to analyze them after training is done. So there should be a better way, right?

Fortunately, the field of Data Science and Machine Learning is mature enough to equip us with a set of convenient tools that make the situation much better. The Weights & Biases platform is one of them. It tracks the performance metrics computed during an experiment and stores them in the cloud. It also lets you store configurations, CLI arguments, and virtually any other information you provide.
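
Wiring W&B into a Lightning training loop takes only a few lines. Here is a minimal sketch; the project name and the logged hyper-parameters are made up for illustration.

import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

logger = WandbLogger(project='stress-free-ml')               # hypothetical project name
logger.log_hyperparams({'dataset': 'cifar10', 'lr': 1e-3})   # any config worth keeping
trainer = pl.Trainer(logger=logger, max_epochs=10)
# trainer.fit(experiment, train_loader, valid_loader)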

The following image is static, but the plots rendered on W&B are interactive and updated in real time as the training process goes on. Also, the platform supports several plotting primitives, not only line plots!

Metrics recorded in real-time while running the experiment

Again, in addition to the training metrics, W&B stores information about your experiment’s configuration, as the following picture shows. So you have all the information at hand. (Even such technical details as the OS version or the path to the interpreter’s executable.)

Experiment execution configuration

Moreover, you can easily compare the metrics generated during different experiment runs by showing them on the same canvas. This makes comparing experiments and models with each other almost effortless.

And to top it off, the platform provides basic reporting functionality, so we can easily convert our metrics plots into a dashboard describing the behavior of our models.

An interactive dashboard showing a brief summary of the experiments

Follow this link and try out the interactive plots and reports yourself!

Conclusion

Writing ML code fast doesn’t mean writing it badly. Making it clearer and better structured will save you time and make your experiments reliable, easy to modify, and easy to reproduce.

Sure enough, there are many other nice tools and practices that I didn’t mention in this post: (1) training loop libraries like ignite, catalyst, or fastai; (2) ML experiment tracking systems like mlflow; (3) good old TensorBoard logging, which allows tracking metrics and experiment parameters similarly to wandb. We also didn’t cover DevOps topics here. Nevertheless, I hope the post was helpful to you and gave you some new insights.

And what about your approach? Feel free to share your thoughts and best practices in the comments to this post!

Are you interested in Python programming? Can’t live without Machine Learning? Have you read everything else on the Internet? Check out my blog where I share various technical topics and thoughts!
