Illustration photo by Ollia Danilevich from Pexels

Converting Scientific Kaggle Notebooks to a Friendly Python Package

This post shows how to easily convert a notebook to a standard Python package and add a simple command-line interface (CLI) to the training script for faster hyper-parameter iteration.

--

In the previous post, we shared how to screen the given dataset and showed how to wrap a file-like dataset in a PyTorch Dataset class, which is the core of data handling. Moreover, we wrote a basic multi-label image classification model in PyTorch Lightning based on a TorchVision model and trained it seamlessly on a GPU without extra code in the Jupyter notebook.

Notebooks are great for quick prototyping and data exploration but not very practical for development at scale. The cell-like structure of notebooks makes it challenging to search over different parameter configurations to optimize models. Notebook results are not necessarily reproducible, as you may edit cells or run them out of order. Notebooks are also not version-control friendly, which makes collaboration with colleagues or the broader open-source community difficult.

Convert Notebooks to Sharable Python Package

With PyTorch Lightning, converting a notebook to a basic Python package and sharing it as a GitHub repository is trivial. PyTorch Lightning enforces best practices and implies a natural code organization by functionality: data handling and model architecture.

A Python package is a structure that organizes functions, classes, etc., into Python modules from which they can be imported and used anywhere. Such a package is a compact unit suitable for distribution with pip or Conda. So in our case, we create two Python modules, data and models, and copy-paste the classes implemented earlier.
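The resulting layout could look roughly as follows (the package name kaggle_plantpatho matches the CLI call later in this post; the individual file names are illustrative):

```
kaggle_plantpatho/        # the package itself
    __init__.py
    data.py               # LightningDataModule + dataset classes
    models.py             # LightningModule + network definitions
tests/                    # minimal unit tests
setup.py                  # makes the package pip-installable
```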

Using setup.py, we can install our package and import it anywhere across our environment. Otherwise, we would need to add the package path to our Python system path. With the data and model implementations moved to our own package, we can use them in our notebook via imports and simplify the whole training to just these few lines:

Code snippet from the shared repository.

This enables you to share your repository/package on GitHub, making it easy for people to build on top of your work.

Schema of moving PL implementations to a package and then using them via imports.

Template/starter repository

As inspiration for your next project, I have created this starter/template repository, which sets up a basic project structure with a demo package challenge_xyz. It includes basic workflows for testing, code formatting, and GitHub Issue/PR templates. You can create your next repository from this template; see the step-by-step docs.

Small tips for package sustainability

Now that we have open-sourced our code, others can contribute or fix issues. That sounds nice, but how do we ensure that new changes which are supposed to fix a bug do not break something else?

The standard software engineering answer is TESTING (unit tests or integration tests). We can run all the tests in the repository before we merge/publish every new contribution and verify that everything works as expected. Personally, I use the pytest package for most of my projects; read the following to dive in.

Here, using a standard framework, in our case PyTorch Lightning, comes in handy again: it already has hundreds of tests to guarantee correct behavior. We only need to write a small set of test cases to cover our solution, since Lightning already takes care of the rest. For example, we may want to try to instantiate all our classes.

As presented below, we add parametrization, which calls the same test twice with a different network: the first run gets the network as an instance, the second as a string name:

Code snippet from the shared repository.

To test the training capacity, we can load and run the model on a few sample images by setting the Trainer flag fast_dev_run=True.

Code snippet from the shared repository.

The last tip for testing is to include a few sample images and annotation-like files that imitate the actual dataset. This allows you to test the real use case more realistically.

Convert Notebooks to CLI Configurable Python Scripts

One way to convert a notebook to a Python script is via JupyterLab: select “Download as -> Python”. The simplified notebook will be shorter and focus mainly on the Data/Model/Trainer configuration (if we skip all the data exploration parts).

Unfortunately, it generates a script with hardcoded parameters. To address this limitation, we would need to write our own argparser and map all CLI arguments to the Model/DataModule/Trainer. Luckily, Lightning recently introduced its own minimalistic LightningCLI interface, which handles the argument binding for you.

The LightningCLI empowers simple argument parsing for Lightning scripts with almost no extra code! It is designed to take the main components/actors (LightningModule, LightningDataModule and Trainer), parse the command-line arguments, and use them to instantiate these actors. In the end, the LightningCLI performs the training. The final script is then as simple as follows:

Code snippet from the shared repository.

With the CLI, you can run your Python script in any terminal.

python kaggle_plantpatho/cli_train.py \
--model.model 'resnet34' \
--data.base_path /home/jirka/datasets/kaggle_plant-pathology \
--trainer.max_epochs 5

As we plan to experiment with different model architectures, we may need to adjust the batch size for each model to use the maximum resources (with a smaller model, such as resnet18, we can use a much larger batch size than with resnet101). This can be accomplished with the Lightning Trainer's tuner, called in the before_fit(...) method, which finds the maximal batch_size that fits in GPU memory and uses it for training.

Code snippet from the shared repository.

In this post, we showed how to convert scientific notebooks to a standard Python package that is ready to share. We discussed the motivation to add minimal tests to facilitate future package enhancements without accidentally breaking anything. Last, we presented the simple transition from a plain Python script to a versatile script that exposes all parameters on the command line.

In the future, we will show how to use this simple script to run a hyper-parameter search across multiple machines in parallel on Grid.ai and observe training performance online.

Stay tuned and follow me to learn more!

About the Author

Jirka Borovec has been working in machine learning and data science for several years at a few different IT companies. In particular, he enjoys exploring interesting world problems and solving them with state-of-the-art techniques. In addition, he has developed several open-source Python packages and actively participates in other well-known projects. He works at Grid.ai as a Research Engineer and serves as a lead contributor to PyTorchLightning.ai.
