How to organize Machine Learning code

In the last 6 months, my awesome team and I have been working on a challenging enterprise-level Machine Learning project: rebuilding from scratch the short-term power load forecasting models of a major energy provider.
It has been a difficult yet satisfying journey: we have learnt a lot, mostly by trying to recover from naive mistakes. It is time to try and share the main lessons learnt, hoping to help and – why not? – be helped to improve even more.
A single article would be far too long – and dull: I’ll try and post a bunch of them, each one focusing on a single topic. This one covers a very painful point: how to organize, structure, and manage the code for a Machine Learning project.
The beginning: a Jupyter-based project
I had used Jupyter in my last academic projects, and I enjoyed its power and flexibility. A fast prototyping tool that is also the ultimate result-sharing platform is priceless, isn't it? Thus, Jupyter was the choice for our first analysis.
It went no further than Data Exploration. At that point, our project repository was little more than a flat collection of notebooks.

You spotted the problem, right? It’s not scalable. Not at all.
We used to copy and paste parts of code from one notebook to another – and that's something no software engineer can stand. Moreover, we were practically prevented from collaborating on the same code base, as the JSON-based structure of notebooks makes versioning with git very painful.
The breakthrough: software engineering in ML
Before passing the point of no return, we looked for a more scalable structure. We found the following resources very useful:
- Mateusz Bednarski, Structure and automated workflow for a machine learning project, 2017
- Kaggle community, How to manage a Machine Learning project, 2013
- Dan Frank, Reproducible research: Stripe’s approach to Machine Learning, 2016
- DrivenData, Data science cookiecutter, 2019 (last update)
All of them were repeating the same message.
The code for a Machine Learning project is not different from any other and shall follow the best practices of Software Engineering.
Thus, we refactored the whole code base:
- We extracted all the reusable code from the notebooks into several utility modules, thus removing all the duplication in the notebooks
- We divided modules into 4 packages, corresponding to the 4 main steps of our workflow: data preparation, feature extraction, modelling and visualization
- We put the most critical functionalities under unit test, thus preventing dangerous regressions (see the sketch right after this list)
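To make this concrete, here is a minimal, purely illustrative sketch of what "extracting reusable code and putting it under test" looked like; the module path, function name and metric are hypothetical stand-ins, not our actual code. First, a scoring helper extracted into a utility module:

```python
# src/models/scoring.py (hypothetical utility module extracted from the notebooks)
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    """MAPE between actual and forecast values, shared by the whole team."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)
```

And the corresponding unit test, guarding against regressions:

```python
# tests/test_scoring.py (hypothetical pytest test for the helper above)
import pytest
from src.models.scoring import mean_absolute_percentage_error

def test_mape_on_a_known_case():
    # a constant 10% error on every point must yield a MAPE of 10
    assert mean_absolute_percentage_error([100, 200], [110, 220]) == pytest.approx(10.0)
```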
Once the refactoring was complete, the plain Python code – not the notebooks – became the single source of truth: by team convention, it held our state-of-the-art version of data preparation, feature extraction, modelling and scoring. Everyone could experiment, but the Python code was to be changed if and only if a demonstrably better model was available.
What about experimentation? And what about notebooks?
Two questions, the same answer.
Of course, notebooks were not eliminated from our development workflow. They are still an awesome prototyping platform and an invaluable result sharing tool, aren’t they?
We just started using them for the purposes they were created for in the first place.
Notebooks became personal, thus avoiding any git pain, and subject to a strict naming convention, <author>_<incremental number>_<title>.ipynb, to keep them easily searchable. They remained the starting point of every analysis: models were prototyped in notebooks, and if one happened to outperform our state of the art, it was integrated into the production Python code. The concept of outperforming was well defined here, as the scoring procedures were implemented in utility modules and shared by all members of the team. Notebooks also made up most of the documentation.
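Just to give the flavour (these file names are invented for illustration, not real notebooks from the project), the notebooks folder ended up looking something like this:

notebooks/
├── abc_01_data_exploration.ipynb
├── abc_02_temperature_features.ipynb
└── xyz_01_first_forecasting_prototype.ipynb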
It took only a few days for us as a team to complete the transformation. The difference was incredible. Almost overnight, we unlocked the power of collective code ownership, unit testing, code reusability, and everything else the last 20 years of Software Engineering have taught us. A great boost in productivity and responsiveness to new requests was the obvious consequence.
The most evident proof came when we realized we were missing units of measure and labels in all the goodness-of-fit charts. As they were all produced by a single function, fixing every one of them was quick and easy. What would have happened if the same charts had still been copied and pasted across lots of notebooks?
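For the sake of illustration, that single chart function looks roughly like the sketch below; the function name, signature and units are hypothetical, not lifted from our repository.

```python
# src/visualization/goodness_of_fit.py (hypothetical plotting helper)
import matplotlib.pyplot as plt

def plot_goodness_of_fit(y_true, y_pred, title="Goodness of fit"):
    """Predicted vs. actual scatter plot, reused by every report and notebook."""
    fig, ax = plt.subplots()
    ax.scatter(y_true, y_pred, alpha=0.5)
    ax.set_title(title)
    # Units and labels live in one place: fixing them here fixes every chart at once.
    ax.set_xlabel("Actual load [MW]")
    ax.set_ylabel("Forecast load [MW]")
    return fig
```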
At the end of the transformation, our repository looked like this.
├── LICENSE
├── README.md          <- The top-level README for developers
│
├── data
│   ├── interim        <- Intermediate data
│   ├── output         <- Model results and scoring
│   ├── processed      <- The final data sets for modeling
│   └── raw            <- The original, immutable data dump
│
├── models             <- Trained and serialized models
│
├── notebooks          <- Jupyter notebooks
│
├── references         <- Data explanatory materials
│
├── reports            <- Generated analysis as HTML, PDF etc.
│   └── figures        <- Generated charts and figures for reporting
│
├── requirements.yml   <- Requirements file for the conda environment
│
├── src                <- Source code for use in this project
│   ├── data           <- Source code to generate data
│   ├── features       <- Source code to extract and create features
│   ├── models         <- Source code to train and score models
│   └── visualization  <- Source code to create visualizations
│
└── tests              <- Automated tests to check the source code
A win-win situation.
The end: the value of frameworks
We were satisfied with the separation of Jupyter prototypes and Python production code, but we knew we were still missing something. Despite trying to apply all the principles of clean coding, our end-to-end scripts for training and scoring became a little bit messy as more and more steps were added.
Once again, we figured out there was some flaw in the way we were approaching the problem, and we looked for a better solution. Once again, valuable resources came to the rescue:
- Norm Niemer, 4 Reasons why your Machine Learning code is probably bad, 2019
- Lorenzo Peppoloni, Data pipelines, Luigi, AirFlow: everything you need to know, 2018
We studied Airflow, Luigi and d6tflow, and we finally opted for a Luigi/d6tflow pipeline, using d6tflow for the simpler tasks and plain Luigi for the more advanced use cases.
This time it took just a single day to implement the whole pipeline: we kept all our functions and classes encapsulating the logic for preprocessing, feature engineering, training and scoring, and we replaced the scripts with pipelines. The improvements in readability and flexibility were significant: when we had to change how the train and test sets were split, we only had to modify two tasks, preserving their input and output signatures, without worrying about anything else.
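To give an idea of the shape of such a pipeline, here is a heavily simplified sketch written with plain Luigi; the task names, file paths and split rule are invented for illustration, and the real pipeline also mixes in d6tflow tasks.

```python
# A minimal Luigi sketch: two dependent tasks, prepare the data and split it.
import luigi
import pandas as pd

class PrepareData(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/processed/dataset.csv")

    def run(self):
        raw = pd.read_csv("data/raw/load.csv")  # hypothetical raw data dump
        clean = raw.dropna()                    # stand-in for the real preprocessing
        with self.output().open("w") as f:
            clean.to_csv(f, index=False)

class SplitTrainTest(luigi.Task):
    test_fraction = luigi.FloatParameter(default=0.2)

    def requires(self):
        return PrepareData()

    def output(self):
        return {
            "train": luigi.LocalTarget("data/processed/train.csv"),
            "test": luigi.LocalTarget("data/processed/test.csv"),
        }

    def run(self):
        with self.input().open("r") as f:
            df = pd.read_csv(f)
        cutoff = int(len(df) * (1 - self.test_fraction))
        # Changing the split strategy means touching only this task,
        # as long as its input and output signatures stay the same.
        with self.output()["train"].open("w") as f:
            df.iloc[:cutoff].to_csv(f, index=False)
        with self.output()["test"].open("w") as f:
            df.iloc[cutoff:].to_csv(f, index=False)

if __name__ == "__main__":
    # Asking for the last task pulls in its dependencies, DAG-style.
    luigi.build([SplitTrainTest()], local_scheduler=True)
```

Luigi only runs the tasks whose output targets are missing, so re-running the pipeline after a change recomputes just what is needed.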
Wrapping up
Wrapping up, we got three important lessons about code in a Machine Learning project:
- A Machine Learning project is a software project: we shall take care of the quality of our code. Having to deal with statistics and math is no excuse for writing bad code
- Jupyter Notebooks are great prototyping and sharing tools, but are no replacement for a traditional code base, made of modules, packages and scripts
- The Directed Acyclic Graph (DAG) structure is great for Data Science and Machine Learning pipelines. There is no point in trying to create such a structure from scratch when there are very good frameworks to help
Thank you, dear reader, for getting this far!
This is my first article on Medium, and I am by no means a good writer. If you have any comment, suggestion, or criticism, please share it with me.
Also, if you have any question or doubt about the topic of this post, please feel free to get in touch!