This post shows how to convert a notebook into a standard Python package and how to add a simple command-line interface (CLI) to the training script for faster hyper-parameter iteration.
In the previous post, we shared how to screen the given dataset and showed how to wrap a file-based dataset in a PyTorch Dataset class, which is the core of data handling. Moreover, we wrote a basic image multi-label classification model with PyTorch Lightning based on a TorchVision model and trained it seamlessly on a GPU, without any extra code, in a Jupyter notebook.
Notebooks are great for quick prototyping or data exploration but not very practical for development at scale. The cell-based structure of notebooks makes it challenging to search over different parameter configurations to optimize models. Notebook results are not necessarily reproducible, as you may edit cells or run them out of order. Finally, notebooks are not version-control friendly, which makes collaboration with colleagues or the broader open-source community difficult.
Convert Notebooks to a Shareable Python Package
With PyTorch Lightning, converting a notebook to a basic Python package and sharing it as a GitHub repository is trivial. PyTorch Lightning enforces best practices and implies a natural code organization by functionality – data handling and model architecture.
A Python package is a structure that organizes functions, classes, etc., into Python modules from which they can be imported and used anywhere. Such a package is a compact unit suitable for distribution with pip or Conda. So in our case, we create two Python modules – `data` and `models` – and copy-paste the classes we implemented earlier.
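For illustration, the resulting layout could look as follows (the package name `kaggle_plantpatho` matches the CLI call later in this post; the exact file split is just one reasonable choice):

```
kaggle_plantpatho/
├── __init__.py
├── data.py      # Dataset and LightningDataModule classes
└── models.py    # LightningModule wrapping a TorchVision model
setup.py         # makes the package pip-installable
```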
Using a `setup.py`, we can install our package and import it anywhere across our environment; otherwise, we would need to add the package path to the Python system path. With the data and model implementations moved to our own package, we can use them in the notebook via imports and simplify the whole training to just these few lines:
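The snippet below is a minimal sketch of what those few lines could look like; the class names `PlantPathologyDM` and `MultiPlantPathology` are hypothetical stand-ins for the data module and model we implemented earlier:

```python
from pytorch_lightning import Trainer

# hypothetical names for the classes moved into our package
from kaggle_plantpatho.data import PlantPathologyDM
from kaggle_plantpatho.models import MultiPlantPathology

dm = PlantPathologyDM(base_path="/path/to/kaggle_plant-pathology")
model = MultiPlantPathology(model="resnet34")

# Lightning handles device placement, so no extra GPU code is needed
trainer = Trainer(gpus=1, max_epochs=5)
trainer.fit(model, datamodule=dm)
```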
This enables you to share your repository/package on GitHub, making it easy for people to build on top of your work.

Template/starter repository
As inspiration for your next project, I have created a starter/template repository that sets up a basic project structure with a demo package `challenge_xyz`. It includes basic workflows for testing and code formatting, and GitHub Issue/PR templates. You can create your next repository from this template; see the step-by-step docs.
[GitHub – Borda/kaggle_sandbox: a starting point for Kaggle attempt](https://github.com/Borda/kaggle_sandbox)
Thoughts on Why Code Formatting is even more important for Open-Source
Small tips for package sustainability
Now that we have open-sourced our code, others can contribute or fix issues. That sounds nice, but how do we ensure that new changes which are supposed to fix a bug do not break something else?
The standard software-engineering answer is TESTING (unit tests or integration tests). We can run all the tests in the repository before we commit/publish every new contribution and verify that everything works as expected. Personally, I use the pytest package for most of my projects; read the following to dive in.
Here again, using a standard framework – in our case PyTorch Lightning – comes in handy, as it already has hundreds of tests to guarantee correct behavior. We only need to write a small set of test cases to cover our solution since Lightning already takes care of the rest. For example, we may want to try to instantiate all our classes.
As presented below, we add parametrization, which calls the same test twice with a different network – the first call gets the network as an instance, the second as a string name:
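A hedged sketch of such a parametrized test (the model class and its `model` argument follow the hypothetical names above):

```python
import pytest
from torchvision.models import resnet18

from kaggle_plantpatho.models import MultiPlantPathology


# run the same test twice: once with a network instance, once with its string name
@pytest.mark.parametrize("net", [resnet18(pretrained=False), "resnet18"])
def test_create_model(net):
    model = MultiPlantPathology(model=net)
    assert model is not None
```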

To test the training capability, we can load and run the model on a few sample images by setting the `Trainer` flag `fast_dev_run=True`.
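For example, a smoke test of the full training loop could look like this sketch, assuming a few sample images live under `tests/_data`:

```python
from pytorch_lightning import Trainer

from kaggle_plantpatho.data import PlantPathologyDM
from kaggle_plantpatho.models import MultiPlantPathology


def test_fast_dev_run(tmpdir):
    dm = PlantPathologyDM(base_path="tests/_data")  # tiny dataset shipped with the tests
    model = MultiPlantPathology(model="resnet18")
    # run a single batch through train/val to catch wiring errors quickly
    trainer = Trainer(default_root_dir=tmpdir, fast_dev_run=True)
    trainer.fit(model, datamodule=dm)
```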

The last tip for testing is to include several sample images and annotation-like files that imitate the actual dataset. This allows you to test the real use case more realistically.
Convert Notebooks to CLI Configurable Python Scripts
One way to convert a notebook to a Python script is via JupyterLab – select "Download as -> Python". The simplified notebook shall be shorter and focus mainly on the Data/Model/Trainer configuration (if we skip all the data-exploration parts).
Unfortunately, it generates a script with hardcoded parameters. To address this limitation, we would need to write our own `argparse` parser and map all CLI arguments to the Model/DataModule/Trainer. Luckily, Lightning recently introduced its own minimalistic `LightningCLI` interface, which handles argument binding for you.
Auto Structuring Deep Learning Projects with the Lightning CLI
The `LightningCLI` empowers simple argument parsing for Lightning scripts with almost no extra code! It is designed to take the main components/actors – `LightningModule`, `LightningDataModule`, and `Trainer` – parse the command-line arguments, and use them to instantiate these actors. In the end, the `LightningCLI` performs the training. The final script would then be just as follows:
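A minimal sketch of such a `cli_train.py` (with PyTorch Lightning 1.4, `LightningCLI` lives in `pytorch_lightning.utilities.cli`; the model and datamodule names are again our hypothetical package classes):

```python
from pytorch_lightning.utilities.cli import LightningCLI

from kaggle_plantpatho.data import PlantPathologyDM
from kaggle_plantpatho.models import MultiPlantPathology

if __name__ == "__main__":
    # parses CLI args, instantiates model/datamodule/trainer, and runs fit
    LightningCLI(MultiPlantPathology, PlantPathologyDM)
```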

With the CLI, you can run your Python script from any terminal:
```bash
python kaggle_plantpatho/cli_train.py \
    --model.model 'resnet34' \
    --data.base_path /home/jirka/datasets/kaggle_plant-pathology \
    --trainer.max_epochs 5
```
_As we plan to experiment with different model architectures, we may need to adjust the batch size for each model to use the maximal resources (with a smaller model such as `resnet18`, we can use a much larger batch size than with `resnet101`). This can be accomplished with the Lightning Trainer's [tune](https://pytorch-lightning.readthedocs.io/en/1.4.0/common/trainer.html?highlight=tune#auto-scale-batch-size) implemented in the `before_fit(...)` method, as it finds the maximal `batch_size` that fits in GPU memory and uses it for training._
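A sketch of how this could look, subclassing `LightningCLI` and overriding its `before_fit` hook to call the Trainer's tuner (this assumes the datamodule exposes a `batch_size` attribute that the tuner can adjust):

```python
from pytorch_lightning.utilities.cli import LightningCLI

from kaggle_plantpatho.data import PlantPathologyDM
from kaggle_plantpatho.models import MultiPlantPathology


class TuneCLI(LightningCLI):
    def before_fit(self):
        # find the largest batch_size that fits in GPU memory and keep it for training
        self.trainer.tuner.scale_batch_size(self.model, datamodule=self.datamodule)


if __name__ == "__main__":
    TuneCLI(MultiPlantPathology, PlantPathologyDM)
```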

In this post, we showed how to convert scientific notebooks to a standard Python package that is ready to share. We discussed the motivation for adding minimal tests to facilitate future package enhancements without accidentally breaking anything. Last, we presented the simple transition from a plain Python script to a versatile script that exposes all parameters on the command line.
In the future, we will show how to use this simple script to run a hyper-parameter search across multiple machines in parallel on Grid.ai and observe the training performance online.
Hyper-Parameter Optimization with Grid.ai and No Code Change
Stay tuned and follow me to learn more!
Best Practices to Rank on Kaggle Competition with PyTorch Lightning and Grid.ai Spot Instances
About the Author
Jirka Borovec has been working in machine learning and data science for several years in a few different IT companies. In particular, he enjoys exploring interesting world problems and solving them with state-of-the-art techniques. In addition, he has developed several open-source Python packages and actively participates in other well-known projects. He works at _Grid.ai_ as a Research Engineer and serves as a lead contributor of _PyTorchLightning.ai_.