
How to make your deep learning experiments reproducible and your code extendible

Lessons learned from building an open-source deep learning for time series framework

Improving your deep learning code quality (Part I)


Photo by author (taken while hiking at the Cutler Coast Preserve in Machias ME)

_Note this is roughly based on a presentation I made back in February at the Boston Data Science Meetup Group. You can find the full slide deck here. I have also included some more recent experiences and insights as well as answers to common questions that I have encountered._

Background

When I first started my river forecasting research, I envisioned using just a notebook. However, it became clear to me that effectively tracking experiments and optimizing hyper-parameters would play a crucial role in the success of any river flow model, particularly as I wanted to forecast river flows for more than 9,000 rivers around the United States. This set me on the path to developing flow forecast, which is now a multi-purpose deep learning for time series framework.

Reproducibility

One of the biggest challenges in machine learning (particularly deep learning) is being able to reproduce experiment results. Several others have touched on this issue, so I will not spend too much time discussing it in detail. For a good overview of why reproducibility is important, see Joel Grus’s talk and slide deck. The TL;DR is that in order to build upon prior research we need to make sure that it worked in the first place. Similarly, to deploy models we have to be able to easily find the artifacts of the best "one."

One of my first recommendations to enable reproducible experiments is NOT to use Jupyter Notebooks or Colab (at least not in their entirety). Jupyter Notebooks and Colab are great for rapidly prototyping a model or leveraging an existing code-base (as I will discuss in a second) to run experiments, but not for building out your models or other features.

Write High Quality Code

  1. Write unit tests (preferably as you are writing the code)

I have found test-driven development very effective in the machine learning space. One of the first things people ask me is: how do I write tests for a model when I don’t know what its outputs will be? Fundamentally, your unit tests should fall into one of four categories:

(a) Test that your model’s returned representations are the proper size.
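For example, a shape test can be as small as the sketch below. The LSTM-plus-linear-head forecaster here is purely illustrative (not a model from flow forecast); the point is simply asserting that the output tensor has the (batch_size, forecast_length) shape you expect.

```python
import unittest
import torch


class TestOutputShape(unittest.TestCase):
    def test_forecast_has_expected_shape(self):
        # Illustrative forecaster: 3 input features, 5-step forecast horizon.
        encoder = torch.nn.LSTM(input_size=3, hidden_size=16, batch_first=True)
        head = torch.nn.Linear(16, 5)

        x = torch.rand(8, 20, 3)  # (batch_size, seq_len, n_features)
        hidden_states, _ = encoder(x)
        forecast = head(hidden_states[:, -1, :])  # use the last time step

        self.assertEqual(forecast.shape, (8, 5))


if __name__ == "__main__":
    unittest.main()
```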

(b) Test that your models initialize properly for the parameters you specify and that the right parameters are trainable.

A relatively simple unit test here is to make sure that the model initializes in the way you expect it to and that the proper parameters are trainable.
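Below is a sketch of what that can look like, again with a toy model rather than one from the repository: for a given configuration (here, freezing the first layer), assert that exactly the parameters you expect remain trainable.

```python
import unittest
import torch


class TestInitialization(unittest.TestCase):
    def test_correct_parameters_are_trainable(self):
        # Toy "encoder" standing in for a configurable model.
        model = torch.nn.Sequential(
            torch.nn.Linear(10, 32),
            torch.nn.ReLU(),
            torch.nn.Linear(32, 1),
        )
        # Simulate a config option that freezes the first layer.
        for param in model[0].parameters():
            param.requires_grad = False

        trainable = [p for p in model.parameters() if p.requires_grad]
        frozen = [p for p in model.parameters() if not p.requires_grad]

        # Only the final layer's weight and bias should be trainable.
        self.assertEqual(len(trainable), 2)
        self.assertEqual(len(frozen), 2)
```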

(c) Test the logic of custom loss functions and training loops:

People often ask me how to do this. I’ve found the best way is to create a dummy model with a known result to test the correctness of custom loss functions, metrics, and training loops. For instance, you could create a PyTorch model that only returns 0, then use it to write a unit test that checks whether the loss and the training loop are correct.
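Here is a minimal sketch of that pattern. The dummy model always predicts zero, so the expected loss can be computed by hand; the built-in MSE loss stands in for whatever custom loss you actually want to verify.

```python
import unittest
import torch


class ZeroModel(torch.nn.Module):
    """Dummy model that always predicts zero, so expected losses are known."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.zeros(x.shape[0], 1)


class TestLossLogic(unittest.TestCase):
    def test_loss_matches_hand_computed_value(self):
        model = ZeroModel()
        targets = torch.tensor([[2.0], [4.0]])
        predictions = model(torch.rand(2, 3))

        # With all-zero predictions, MSE = (2^2 + 4^2) / 2 = 10.
        loss = torch.nn.functional.mse_loss(predictions, targets)
        self.assertAlmostEqual(loss.item(), 10.0, places=5)
```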

(d) Test the logic of data pre-processing/loaders

Another major thing to cover is making sure your data loaders output data in the format you expect and handle problematic values. Data quality problems are a huge issue in machine learning, so it is important to make sure your data loaders are properly tested. For instance, you should write tests to check that NaN/Null values are filled or that the offending rows are dropped.
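Concretely, such a loader test might look roughly like the sketch below; the column names and the forward/backward-fill strategy are just placeholders for whatever your pre-processing code actually does.

```python
import unittest
import numpy as np
import pandas as pd


class TestPreprocessing(unittest.TestCase):
    def test_missing_values_are_filled(self):
        # Placeholder data frame with gaps like those in real gauge data.
        df = pd.DataFrame(
            {"flow": [1.0, np.nan, 3.0], "precip": [0.2, 0.1, np.nan]}
        )
        cleaned = df.ffill().bfill()  # stand-in for your real fill logic

        self.assertFalse(cleaned.isnull().values.any())  # no NaNs remain
        self.assertEqual(len(cleaned), len(df))  # no rows were silently dropped
```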

Finally, I recommend using tools like CodeCov and CodeFactor. They are useful for automatically determining your code’s test coverage.

Recommended tools: Travis-CI, CodeCov

2. Utilize integration tests for end-to-end code coverage

Having unit tests is good, but it is also important to make sure your code runs properly in an end-to-end fashion. For instance, I’ve sometimes found that a model’s unit tests pass, only to discover that the way I was passing the configuration file to the model didn’t work. As a result, I now add integration tests for every new model I add to the repository. Integration tests can also demonstrate how to use your models: for instance, I often leverage the configuration files from my integration tests as the backbone of my full parameter sweeps.
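As a rough illustration, an end-to-end smoke test can be as simple as the sketch below: build a tiny model on synthetic data, run a short training loop, and assert the loss goes down. In a real project the same test would be driven by one of the configuration files mentioned above; this self-contained version just shows the shape of the check.

```python
import unittest
import torch


class TestEndToEndTraining(unittest.TestCase):
    def test_short_training_run_reduces_loss(self):
        torch.manual_seed(0)
        # Tiny synthetic regression task so the test finishes in seconds.
        x = torch.rand(64, 10)
        y = x.sum(dim=1, keepdim=True)

        model = torch.nn.Linear(10, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = torch.nn.MSELoss()

        initial_loss = loss_fn(model(x), y).item()
        for _ in range(20):  # a short loop standing in for a full epoch
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

        self.assertLess(loss_fn(model(x), y).item(), initial_loss)
```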

3. Utilize both type hints and docstrings

Having both type hints and docstrings greatly increases readability, particularly when you are passing around tensors. When I’m coding, I frequently have to look back at the docstrings to remember what shape my tensors are. Without them, I have to manually print the shapes, which wastes a lot of time and potentially adds garbage print statements that you later forget to remove.
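Below is a small illustrative example of the pattern; the decoder itself is made up, but recording the tensor shapes in the docstring is the useful part.

```python
import torch


class SimpleDecoder(torch.nn.Module):
    """Illustrative decoder showing shape-aware type hints and docstrings."""

    def __init__(self, hidden_size: int, forecast_length: int) -> None:
        super().__init__()
        self.projection = torch.nn.Linear(hidden_size, forecast_length)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """Project the encoder's hidden states to a forecast.

        :param hidden_states: tensor of shape (batch_size, seq_len, hidden_size)
        :return: tensor of shape (batch_size, forecast_length)
        """
        last_step = hidden_states[:, -1, :]  # (batch_size, hidden_size)
        return self.projection(last_step)
```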

4. Create good documentation

I’ve found the best time to create documentation for machine learning projects is while I’m writing code, or even before. Laying out the architectural design of ML models and how they will interface with existing classes often saves me considerable time when implementing the code, and it forces me to think critically about the ML decisions I make. Additionally, once the implementation is complete you already have a good start on informing your peers/researchers on how to use your model.

Of course, you will still need to add some things, like the specifics of what parameters you pass and their types. For documentation, I generally record broader architectural decisions in Confluence (or GitHub Wiki pages if Confluence is unavailable), whereas I include specifics about the code/parameters in ReadTheDocs. As we will talk about in Part II, having good initial documentation also makes it pretty simple to add model results and explain why your design works.

Tools: ReadTheDocs, Confluence

5. Leverage peer reviews

Peer review is another critical step in making sure your code is correct before you run your experiments. Oftentimes, a second pair of eyes can help you avoid all sorts of problems. This is another good reason not to use Jupyter Notebooks, as reviewing notebook diffs is almost impossible.

As a peer reviewer it is also important to take time to go through the code line by line and comment where you don’t understand something. I often see reviewers just quickly approve all changes.

A recent example: while adding meta-data support to DA-RNN, I encountered a bug. This section of code did have an integration test, but unfortunately it lacked a comprehensive unit test. As a result, several of the experiments that I ran, and that I thought used meta-data, turned out not to. The problem was very simple: on line 23 I forgot to include the meta-data when calling the encoder.

This problem likely could have been averted by writing a unit test to check that the parameters of the meta-data model were actually updated during training, or by having a test to check that the results with and without the meta-data were not equal.
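Sketched out, those two checks might look something like this (the model is a made-up stand-in with an optional meta-data branch, not the actual DA-RNN code):

```python
import unittest
from typing import Optional

import torch


class MetaDataModel(torch.nn.Module):
    """Hypothetical encoder with an optional meta-data branch."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = torch.nn.Linear(4, 8)
        self.meta_layer = torch.nn.Linear(3, 8)

    def forward(self, x: torch.Tensor, meta: Optional[torch.Tensor] = None) -> torch.Tensor:
        out = self.encoder(x)
        if meta is not None:
            out = out + self.meta_layer(meta)
        return out.sum()


class TestMetaDataIsUsed(unittest.TestCase):
    def test_meta_parameters_receive_gradients(self):
        model = MetaDataModel()
        loss = model(torch.rand(2, 4), torch.rand(2, 3))
        loss.backward()
        # If the meta-data were silently dropped, these gradients would be None.
        for param in model.meta_layer.parameters():
            self.assertIsNotNone(param.grad)

    def test_output_differs_with_and_without_meta(self):
        model = MetaDataModel()
        x = torch.rand(2, 4)
        with_meta = model(x, torch.rand(2, 3))
        without_meta = model(x)
        self.assertNotEqual(with_meta.item(), without_meta.item())
```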

In Part II of this series, I will talk about how to actually make your experiments reproducible now that you have high-quality and (mostly) bug-free code. Specifically, I will look at things like data versioning, logging experiment results, and tracking parameters.

Relevant Articles and Resources:

Unit testing machine learning code

Joel Grus: Why I hate Notebooks

