Illustration photo by Mateusz Dach from Pexels.

How to Prepare Your Development Environment to Rank on Kaggle

This post walks through best practices for configuring a custom development environment for Kaggle image classification challenges.

--

In this story, I will present how to set up a development environment for participating in a Kaggle challenge:
1) select a recent interesting competition,
2) start a JupyterLab environment in the cloud,
3) download the full dataset, and
4) perform initial image preprocessing.

Initial Choice: Competition, Deep Learning Framework, and Cloud Platform

Sample images and counts for each concatenated class combination.

We will use the recent Kaggle competition Plant Pathology 2021 — FGVC8 to showcase the full data science cycle. One nice aspect of this competition is that it is quite simple, yet it represents the challenges of real-world image classification tasks beyond CIFAR and MNIST.

My framework of choice for these competitions is PyTorch Lightning, which allows me to focus on iterating on my data handling and model architecture without worrying about the heavy engineering of training loops and device management. For more on why, check out the PyTorch Lightning post below.
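To give a flavour of what this looks like, here is a minimal hypothetical sketch (not the model we will build for this competition): you define only the training step and the optimizer, while the Trainer owns the loop, checkpointing, and device placement.

import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torch import nn

class LitClassifier(pl.LightningModule):
    """A tiny placeholder classifier; the real architecture comes later."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.model(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# the Trainer handles the rest, e.g.:
# pl.Trainer(gpus=1, max_epochs=10).fit(LitClassifier(num_classes=...), train_dataloader)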

As a computing platform, I chose Grid.ai, which allows me to specify exactly the machine I want to use, with JupyterLab pre-installed. As a bonus, Grid enables me to sync all my Kaggle code with GitHub. Later, I will use it to run a hyper-parameter search to find the best configuration for my model.

Full disclosure — I currently work as a Sr. Research Engineer at Grid.ai.
Note that there are other alternatives you can use to apply these best practices, such as Kaggle kernels or Colab, but Grid is my platform of choice as it lets me easily scale model training in the cloud.

All the code and visualisations included in this tutorial are available in this repo, which you are free to use and contribute to.

Preparing your Kaggle Development Environment

I selected Grid.ai as my development environment since it offers three main building blocks:

  1. Datastores (which enable me to upload data once and use it anywhere; they ensure that all our prototypes and experiments use the same data and are reproducible),
  2. Sessions (which enable me to focus on the machine learning itself; I found better CPU/GPU/RAM performance with Grid instances than with Kaggle or Colab kernels), and
  3. Runs (which enable me to scale experimentation; I can closely monitor multiple experiments in real time, stop suboptimal configurations early, compare the results, and export the best model for Kaggle), all in one place.

Starting a Grid Session

Starting JupyterLab in a Grid Session.

First, log in to Grid.ai with a GitHub or Google account. Then you can create a Session: configure your instance type and set the required storage size (which comes in handy when you want to play with big datasets). The new interactive Session instance comes with a built-in JupyterLab.

For a full tour of the steps to create a Grid Session, check here.

I can easily launch JupyterLab and use the CLI tool glances to monitor resource usage in real time, which lets me know if there are any bottlenecks, such as saturated CPUs during preprocessing or an under-utilized GPU during training (a hint to increase the training batch size).
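glances is distributed on PyPI, so installing and starting it in the session terminal takes two commands:

pip install glances
# launch the interactive resource monitor in the current terminal
glances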

Setting up Kaggle CLI & Downloading the dataset

To get the Plant Pathology 2021 — FGVC8 dataset onto our session, we use the Kaggle CLI, provided directly by Kaggle and indexed on PyPI.

pip install kaggle

Before doing anything with the Kaggle CLI, you need to set up your credentials: log in to the Kaggle website and create your personal key, which will be automatically downloaded as kaggle.json.

Step-by-step how to download the credentials file from your personal Kaggle account: (1) select account, (2) scroll down and “Create New API Token”.
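The downloaded kaggle.json is a small JSON file holding your user name and the generated token; the values below are placeholders:

{"username": "your-kaggle-username", "key": "xxxxxxxxxxxxxxxxxxxxxxxx"}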

When we have our kaggle.json key, we can upload it to our session using JupyterLab (simple drag-and-drop) and move it to the destination the Kaggle CLI expects, as follows:

Step-by-step how to set up the Kaggle CLI: (1) install the kaggle package, (2) drag-and-drop the credentials file, and (3) move it to the expected system folder.
mkdir -p /home/jovyan/.kaggle
mv kaggle.json /home/jovyan/.kaggle/kaggle.json
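The Kaggle CLI also warns if the key is readable by other users, so it is worth tightening the file permissions:

chmod 600 /home/jovyan/.kaggle/kaggle.json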

Now, we are set to download the dataset to our session. The competition name to download is the same as the name in the competition URL, and the exact download command is provided in the Data section of each competition. Most datasets are distributed as a compressed archive, which we need to unzip to a destination folder (I usually name the folder after the competition). So for our competition, we call the following commands:

# download the dataset via Kaggle CLI
kaggle competitions download -c plant-pathology-2021-fgvc8
# unzip the dataset to a folder
unzip plant-pathology-2021-fgvc8.zip -d plant-pathology
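Before moving on, it is worth a quick sanity check that the extraction succeeded. A small sketch (it assumes the competition's train.csv with a labels column of space-separated class names, matching the class-combination counts in the figure earlier):

import pandas as pd

df = pd.read_csv("plant-pathology/train.csv")
print(f"found {len(df)} annotated training images")
# count the images per concatenated class combination
print(df["labels"].value_counts())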

Dataset pre-processing

Once extracted, we can pre-process the dataset. The original dataset is distributed in 4K resolution, which is way too large for most applications, so we downscale the images to about 640px. The reasoning behind this choice is described in the post on how to place on the leaderboards. For the conversion, we use the simple CLI tool ImageMagick.

apt install imagemagick
# resize all JPEG images in the given folder to 640px width (in place)
mogrify -resize 640 plant-pathology/train_images/*.jpg

Alternatively, you can write your own small script that uses Python multiprocessing to speed up this conversion…

Simple script for multiprocessing image scaling.
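For illustration, such a script could look like the following; this is a minimal sketch, assuming the Pillow package is installed and using the folder layout above, with resize logic mirroring the mogrify call:

import glob
import os
from multiprocessing import Pool

from PIL import Image

def resize_image(path: str, width: int = 640) -> None:
    """Downscale a single image to the given width, in place."""
    img = Image.open(path)
    scale = width / img.size[0]
    if scale < 1.0:  # only shrink, never enlarge
        new_size = (width, int(img.size[1] * scale))
        img.resize(new_size, Image.BILINEAR).save(path)

if __name__ == "__main__":
    images = glob.glob("plant-pathology/train_images/*.jpg")
    # spread the conversion over all available CPU cores
    with Pool(os.cpu_count()) as pool:
        pool.map(resize_image, images)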

There you have it: everything you need to know to prepare a Kaggle development environment with Grid.ai. To recap, we selected an interesting Kaggle competition, showed how to start an interactive session, configured the Kaggle credentials for any (local or cloud) environment, downloaded the dataset, and performed the initial image pre-processing.

I will continue exploring this downloaded dataset in future blog posts and prepare a baseline computer vision model to solve this task. I will also review some of the tricks that helped me rank on the leaderboards, so stay tuned. Later, we will also share how to easily make submissions from an offline kernel.

Stay tuned and follow me to learn more!

About the Author

Jirka Borovec has been working in machine learning and data science for several years at a few different IT companies. In particular, he enjoys exploring interesting world problems and solving them with state-of-the-art techniques. In addition, he has developed several open-source Python packages and actively participates in other well-known projects. He works at Grid.ai as a Research Engineer and serves as a lead contributor to PyTorchLightning.ai.
