
Structuring Jupyter Notebooks For Fast and Iterative Machine Learning Experiments

A cheat sheet for busy ML practitioners who need to run numerous modeling experiments quickly in a tidy Jupyter workspace.

"Modularising" your code is hard in machine learning projects

Unlike in the software world, the term "reusable component" can be hard to apply in the modeling world. Experiments are often one-off, and not much code gets reused. If you’re a clean code advocate who likes spending time refactoring every single line of code to follow the "Don’t Repeat Yourself" (DRY) principle, you could easily spend too much time doing so.

That said, I’m not suggesting we go to the opposite extreme of the "Don’t Repeat Yourself" principle; I’ve seen very messy and unorganized Jupyter Notebook directories. Rather, we should strive to understand which components are worth reusing. In this article, I will highlight the components that tend to be reused in a machine learning project, based on my experience of preprocessing and modeling data with Jupyter notebooks for more than two years.

First of all, why do we use Jupyter?

The main reason we use Jupyter in modeling projects is that we want to go fast. We want to run experiments quickly, fail fast, and learn fast. Data processing takes time, and machine learning training takes even more. Unlike the software world, where "hot reloading" is a thing, we don’t usually have an equivalent in the modeling world. Preparing an experiment, and the experiment itself, takes a lot of time. To be quick, we need Jupyter, which gives us the ability to test just a small part of our code instead of running the entire script.

This is an iterative process. The faster you can go round this loop, the faster you will make progress.

  • Andrew Ng, Machine Learning Yearning

Now, knowing that we should not write scripts at the beginning but use Jupyter notebooks instead, let’s see how we should structure our projects.

Overview

Here is an overview of what we’re going to cover in this post:

  1. Having "Small data" helps – why and how to have a small dataset when writing code.
  2. Using git – how to use git to version control your notebooks.
  3. Separation of concerns – how to structure your Jupyter files directory.
  4. The Preprocessing, Modelling & Reporting Notebooks – how to structure the three notebooks and what to include in each
  5. The MASTER Notebook – how to call other notebooks from a notebook and how to log the output

Having "Small data" helps

First of all, before we start to write code for our data processing and data models, we should prepare a set of "small data". The main intuition is to have a very small dataset that can be processed quickly. That way, when we run our code for the first time, we don’t have to wait a few hours only to find out there is a simple bug in it.

For example, if we hope to train a model on 10 million images, try sampling only 50 images per class while writing the code. Or, if we’re training on 100 million rows of sales data, we can sample 2,000 rows as our "small data".

How big the "small data" should be is dependent on how representative the sample is and the time to process it. Sample-wise, try to get at least 5 samples per class. Time-wise, a rule-of-thumb is that the time required for "small data" to go from data processing to finish training a model should run in less than 10 minutes.

You could use the following code at the beginning of your notebook to toggle SMALL_DATA_MODE on and off.

SMALL_DATA_MODE = True
if SMALL_DATA_MODE:
    DATA_FILE = "path/to/smallData.csv"
else:
    DATA_FILE = "path/to/originalData.csv"

Using git

As you run more and more experiments, you will likely delete old code and replace it with new code. Creating a new notebook for just a small change in code is bad practice: we might not even need it in the future, and it takes up space and clutters our workspace.

Using git helps us version control our notebooks while keeping our workspace clean. If required, you can always revert to an older version by going back to a previous git commit. Also, we don’t have to worry about losing code if we regularly push our work to a remote repository. You can push everything to the master branch if you’re working on this project alone, or push to different branches if you’re on a team.

To install git,

On Windows, go to https://git-scm.com/download/win

On macOS, run git --version in terminal. If you have not installed git, it will prompt you for installation.

On Linux Ubuntu, run sudo apt install git-all

After installing, run the following in your project directory

git init

Also, let’s specify the files we don’t want to track with git. Create a new file called .gitignore and put the following texts in that file. We’re going to ignore the Jupyter checkpoints, python cache, and the data directory.

.ipynb_checkpoints/
data/
__pycache__/

To commit, use the following (this should be familiar to you if you have worked on software projects before; if not, I recommend checking out a git tutorial).

git add .
git commit -m "Give a clear message here on what's changing compared to last time you commit"
# remember to set a remote named `origin` for this:
git push origin master

Separation of concerns

The figure shows the recommended structure for a machine learning project

A machine learning project will usually have multiple experiments using the same data and the same model. Instead of enclosing everything in one directory, a good way is to separate the data, preprocessing, modeling, and experiment output (exp/).

This is my favorite file structure:

  • data/ – the storage bucket for all kinds of data (raw, preprocessed, etc)
  • exp/ – the output of experiments go here (saved models + actual and predicted labels)
  • logs/ – just a place for our log files from preprocessing and modeling
  • a_MASTER0.ipynb – The "master" Jupyter notebook which can call the other "slave" notebooks (preprocessing, modeling, reporting). We will show how to call another notebook from a notebook in a later section.
  • a_MASTER1.ipynb – Just another "master" notebook for running another experiment in parallel. You can add as many master notebooks as you need.
  • b_preprocess.ipynb – The preprocessing notebook that takes in raw data from data/raw and outputs data to data/{dir}
  • c_model_svm.ipynb – This notebook takes in the output from preprocessing, does a slight modification to fit it into the SVM model, then outputs modeling results (like learned model parameters, predictions, etc) to exp/.
  • c_model_randomForest.ipynb – If you have another model, just name it like this.
  • d_reporting.ipynb – This will read from exp/ and plot tables or visuals for your report.
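To bootstrap this layout, you could run something like the following once (a minimal sketch; it just creates the empty directories from the structure above):

import os

# Create the project directories if they don't exist yet
for d in ["data/raw", "data/temp", "exp/temp", "logs"]:
    os.makedirs(d, exist_ok=True)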

The Preprocessing Notebook

Here I’ll show what we should do when we first write our code in notebooks: start every notebook with a cell that holds its parameters.

# PARAMETER
#-------------
# check if IS_MASTER exists, this variable will only exist if it's being called by MASTER notebook.
# if it does not exist, set it to False
try: IS_MASTER
except: IS_MASTER = False
# The code below will only run if it's NOT being called from MASTER notebook
if not IS_MASTER:
    DATA_DIR = './data/temp/'
    RAW_FILE = '/path/to/smallData.csv' # use "small data" here
    PROCESSED_FILE = f'{DATA_DIR}processed.pkl' # always use pickle for fast I/O!
    OTHER_PREPROCESS_PARAMETER = ... # e.g. batch size, sliding window size, etc

The code above sets the default parameters for the notebook when it is run on its own. We use a temporary directory under data/ (here ./data/temp/, set via DATA_DIR) to store our output. This ensures quick iteration. When we’re writing code, we should always use "small data".

You can then continue coding out the preprocessing part. Since you’re using "small data" for this "development" phase, it should be quick! When you’re done preprocessing, remember to write the output as a Pickle file using the pickle library:

import pickle
with open(PROCESSED_FILE, 'wb') as f:
    pickle.dump(python_object, f)

Alternatively, use pandas' shortcut:

df.to_pickle(PROCESSED_FILE)

We use Pickle instead of the CSV format for persistence and speedy reads and writes. This PROCESSED_FILE will be read by the Modelling Notebook in the next part.

The Modelling Notebook

Start off the Modelling Notebook with this:

# PARAMETER
#-------------
# check if IS_MASTER exists, this variable will only exist if it's being called by MASTER notebook.
# if it does not exist, set it to False
try: IS_MASTER
except: IS_MASTER = False
# The code below will only run if it's NOT being called from MASTER notebook
if not IS_MASTER:
    DATA_DIR = './data/temp/'
    EXP_DIR = './exp/temp/'
    PROCESSED_FILE = f'{DATA_DIR}processed.pkl'
    MODEL_FILE = f'{EXP_DIR}model.pkl'
    PREDICTION_FILE = f'{EXP_DIR}ypred.pkl'
    OTHER_MODEL_PARAMETERS = ... # like N_ESTIMATOR, GAMMA, etc

Note that DATA_DIR and PROCESSED_FILE are the same as, and connected to, the Preprocessing Notebook’s output.

In this Modelling Notebook, you should do three things (the details will differ for every model, but a minimal sketch follows the list):

  1. Read the processed data and do a slight modification to fit the data into the model.
  2. Train & evaluate the model
  3. Output the model’s learned parameters to MODEL_FILE and the predictions to PREDICTION_FILE in the EXP_DIR directory. For the prediction output, put both the actual labels and predicted labels in the same data frame (easier for reporting).
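For illustration only, a sketch of those three steps for the SVM notebook might look like this (assuming the processed pickle holds a DataFrame with a label column; the split and the accuracy metric are placeholder choices):

import pickle
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1. Read the processed data and reshape it for the model
df = pd.read_pickle(PROCESSED_FILE)
X, y = df.drop(columns=['label']), df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 2. Train & evaluate the model
model = SVC()
model.fit(X_train, y_train)
print('test accuracy:', model.score(X_test, y_test))

# 3. Save the learned model and the predictions (actual + predicted labels together)
with open(MODEL_FILE, 'wb') as f:
    pickle.dump(model, f)
pd.DataFrame({'y_true': y_test, 'y_pred': model.predict(X_test)}).to_pickle(PREDICTION_FILE)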

The Reporting Notebook

The Reporting Notebook is a quick one; it just needs to read from the exp/ directory. Its input is connected to the Modelling Notebook’s output via EXP_DIR, MODEL_FILE, and PREDICTION_FILE.

# PARAMETER
#-------------
# check if IS_MASTER exists, this variable will only exist if it's being called by MASTER notebook.
# if it does not exist, set it to False
try: IS_MASTER
except: IS_MASTER = False
# The code below will only run if it's NOT being called from MASTER notebook
if not IS_MASTER:
    EXP_DIR = './exp/temp/'
    MODEL_FILE = f'{EXP_DIR}model.pkl'
    PREDICTION_FILE = f'{EXP_DIR}ypred.pkl'

Here is where you take the predicted labels and score them against the actual labels, using metrics like Precision, Recall, or ROC AUC. You would also run the code for charts and plots here.
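For example, with scikit-learn the scoring could look like this (a sketch, assuming the prediction file stores the actual and predicted labels in y_true and y_pred columns as in the earlier sketch):

import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

pred_df = pd.read_pickle(PREDICTION_FILE)
# Per-class precision, recall, and F1
print(classification_report(pred_df['y_true'], pred_df['y_pred']))
# Where the model confuses one class for another
print(confusion_matrix(pred_df['y_true'], pred_df['y_pred']))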

The MASTER Notebook

Finally, the MASTER Notebook!

The Master notebook is the notebook that calls all the other notebooks. While you’re in this notebook, you will oversee the entire data pipeline (from raw data preprocessing to modeling and reporting).

The Master notebook is also where you call the other (well-tested) Preprocessing and Modelling notebooks to run on the actual "big data". I will also introduce a logging trick (because hey, if the "big data" run hits an error, we want to know why).

We will also be using the Jupyter magic command %run here.

Firstly, create a file called print_n_log.py and paste the code below in it:

"""
Returns a modified print() method that returns TEE to both stdout and a file
"""
import logging
def run(logger_name, log_file, stream_level='ERROR'):
    stream_level = {
        'DEBUG': logging.DEBUG,
        'INFO': logging.INFO,
        'WARNING': logging.WARNING,
        'ERROR': logging.ERROR,
        'CRITICAL': logging.CRITICAL,
    }[stream_level]

    # create logger with 'logger_name'
    logger = logging.getLogger(logger_name)
    logger.handlers.clear()  # avoid duplicate handlers if run() is called more than once
    logger.setLevel(logging.DEBUG)
    # create file handler which logs even debug messages
    fh = logging.FileHandler(log_file)
    fh.setLevel(logging.DEBUG)
    # create console handler with a higher log level
    ch = logging.StreamHandler()
    ch.setLevel(stream_level)
    # create formatter and add it to the handlers
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    fh.setFormatter(formatter)
    ch.setFormatter(formatter)
    # add the handlers to the logger
    logger.addHandler(fh)
    logger.addHandler(ch)
    def modified_print(*args):
        s = ' '.join([str(a) for a in args])
        logger.info(s)
    return modified_print

The code above creates a modified print() method that sends anything you print to BOTH the Master Notebook’s cell output AND a log file.

Next, import this module in your Master Notebook:

import print_n_log

In your next cell, let’s try to call the Preprocessing Notebook:

# Parameters for Preprocessing Notebook
#---------------------------------------------
IS_MASTER = True # Remember this? We need to set this to True in the MASTER Notebook so that the preprocessing notebook does not fall back to its default parameters.
DATA_DIR = './data/temp/' # define this here because PROCESSED_FILE below is built from it
RAW_FILE = '/path/to/smallData.csv' # use "small data" here
PROCESSED_FILE = f'{DATA_DIR}processed.pkl' # always use pickle for fast I/O!
OTHER_PREPROCESS_PARAMETER = ... # e.g. batch size, sliding window size, etc
# Let's save the original print method in ori_print
#---------------------------------------------------
ori_print = print
# Now we set the print method to be modified print
#--------------------------------------------------
print = print_n_log.run('preproc', './logs/preprocess.log', 'DEBUG')
# Now, we run the Preprocessing Notebook using the %run magic
#-------------------------------------------------------------
%run 'b_preprocess.ipynb'
# Finally, after running notebook, we set the print method back to the original print method.
#-----------------------------------------------------
print = ori_print

Note that we’re using the %run magic to run the Preprocessing Notebook. This is an IPython magic that lets us run other Python files and Jupyter notebooks from the current notebook. We use this command to run all the code in the Preprocessing Notebook.

By calling print_n_log.run('preproc', './logs/preprocess.log', 'DEBUG'), we replace Python’s built-in print() method with one that redirects output to both the screen and a log file. 'preproc' is just a name for our logger; you could use any other name. After running, you can open './logs/preprocess.log' to see the logged output from the Preprocessing Notebook run. The last parameter, 'DEBUG', just means "print every output to the screen". You could also use 'ERROR' if you only want to see errors on the screen (no regular output).

And yeah, that’s it! You can follow the same template for calling the Modelling Notebook and the Reporting Notebook (a sketch follows below). Just a tip: you might not want to log the output of the Reporting Notebook, so you could keep the original print() method for that one.
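For completeness, a cell that calls the Modelling Notebook would follow the same pattern (a sketch; the logger name and log-file path are arbitrary choices, and the parameters mirror the Modelling Notebook’s parameter cell):

# Parameters for the Modelling Notebook
#---------------------------------------
IS_MASTER = True
DATA_DIR = './data/temp/'
EXP_DIR = './exp/temp/'
PROCESSED_FILE = f'{DATA_DIR}processed.pkl'
MODEL_FILE = f'{EXP_DIR}model.pkl'
PREDICTION_FILE = f'{EXP_DIR}ypred.pkl'
OTHER_MODEL_PARAMETERS = ... # like N_ESTIMATOR, GAMMA, etc

# Swap in the logging print, run the notebook, then restore print
ori_print = print
print = print_n_log.run('model_svm', './logs/model_svm.log', 'DEBUG')
%run 'c_model_svm.ipynb'
print = ori_print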

Conclusion

The key idea here is to "modularize" the parts of a Jupyter notebook that do not have much entanglement with other parts, while keeping an overview of the entire data pipeline.

Unlike the software philosophy of encapsulation, we don’t actually want to hide our code away in Python files and never see it again (although we can do this for some utility methods). We want to keep most of the code in notebooks because, when designing an experiment, every part should be changeable. When a part is changed, we also want it to be testable. By staying in a Jupyter notebook, we know that our code is testable when we run it with "small data". By following these guidelines, we will have a Jupyter project that’s organized and modularized, yet every part is easily editable and testable.

To get notified for my posts, follow me on Medium, Twitter, or Facebook.

