Data Preparation Guide for Histopathologic Cancer Detection

A guide on how to prepare data for model training for Kaggle’s Histopathologic Cancer Detection challenge.

Abdul Qadir
Towards Data Science


Photo by National Cancer Institute on Unsplash

Kaggle serves as a wonderful host for data science and machine learning challenges. One of them is the Histopathologic Cancer Detection challenge. In this challenge, we are given a dataset of images and asked to create an algorithm that detects metastatic cancer (the brief says algorithm, not explicitly a machine learning model, so if you are a genius with an alternate way to detect metastatic cancer in images, go for it!).

This article serves as a guide on how to prepare Kaggle’s dataset, and it covers the following four things:

  • How to download the dataset into your notebook from Kaggle.
  • How to augment the dataset’s images.
  • How to balance target distributions, and split the data for training/test/validation.
  • How to structure data for model training in Keras.

Downloading the Dataset within a notebook

Download the kaggle package using the commands below. I ran them in Google Colab, but you should be able to run them just as well from your command line or a Jupyter notebook.
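
A minimal sketch of the install step, written as a notebook cell (drop the leading ‘!’ if you run it in a plain terminal):

```python
# Install the Kaggle API client (notebook cell; remove the "!" in a terminal).
!pip install -q kaggle
```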

In order to use Kaggle’s API to download the data with your account, you need to do the following two things.

  • Go to your account settings, scroll to the API section, click Expire API Token to remove previous tokens (in case you have any) then click on Create New API Token. This will download a ‘kaggle.json’ file.
  • Upload this file to your project directory.

Then run the code below, which uses the JSON file to authenticate with Kaggle and downloads the dataset.
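
A rough sketch of that step, assuming the token file is named ‘kaggle.json’ and sits in the working directory:

```python
# The Kaggle API looks for the token at ~/.kaggle/kaggle.json,
# so copy it there and restrict its permissions.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download the competition data (you must have accepted the competition rules on Kaggle).
!kaggle competitions download -c histopathologic-cancer-detection
```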

Extract the zip file. This takes a few minutes because the archive contains around 6–7 GB worth of data.
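
A small sketch of the extraction, assuming the downloaded archive is named after the competition:

```python
import zipfile

# Unpack the downloaded archive into the current directory.
with zipfile.ZipFile('histopathologic-cancer-detection.zip', 'r') as archive:
    archive.extractall('.')
```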

At this point, you should have two folders and two CSV files:

  • A ‘train’ folder containing the training set images
  • A ‘test’ folder containing the test set images
  • A ‘train_labels’ csv file which contains each image’s id and its corresponding label (0 for no cancer, 1 for cancer).
  • A ‘sample_submission’ CSV file, which shows the format in which to submit your results if you want to compete in the Kaggle competition.

Data Augmentation

What is data augmentation and why do we do it? In essence, data augmentation is a way to increase the size of our image dataset by introducing slight changes to our data. The goal is for our model to generalize to the extra examples introduced through augmentation. More importantly, augmentation exposes the model to varied versions of the same kinds of images, so it learns features that do not depend on superficial differences such as orientation or brightness.

Look at the image below. Suppose the terribly drawn cat on the left is from the training dataset, but the cat on the right is in the test dataset. They are, quite literally, the same image, except that the one on the right is flipped. A human can easily notice that, but a model that has never seen flipped images may fail to identify the cat in the test dataset. Ideally, you want your model to ignore such variations and still classify the image correctly, which is why you introduce augmented versions of images so the model can generalize better.

Example of an augmented picture

Here is code for a function that augments images by applying various transformations to them, such as varying the contrast, adjusting the brightness, and applying random rotations and shifts. The code was referenced from here.
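
A rough sketch of such a function, assuming the images are loaded as NumPy arrays (for example with skimage.io.imread); the specific transforms and their ranges are illustrative choices, not the exact ones from the referenced code:

```python
import random

import numpy as np
from skimage import exposure, transform


def augment_image(image):
    """Apply a random subset of simple transforms to a single image."""
    # Random rotation between -20 and +20 degrees.
    if random.random() < 0.5:
        image = transform.rotate(image, angle=random.uniform(-20, 20), mode='edge')
    # Random horizontal and vertical flips.
    if random.random() < 0.5:
        image = np.fliplr(image)
    if random.random() < 0.5:
        image = np.flipud(image)
    # Random shift of up to 10 pixels in each direction.
    if random.random() < 0.5:
        shift = transform.AffineTransform(
            translation=(random.randint(-10, 10), random.randint(-10, 10)))
        image = transform.warp(image, shift, mode='edge')
    # Random brightness/contrast tweak via gamma correction.
    if random.random() < 0.5:
        image = exposure.adjust_gamma(image, gamma=random.uniform(0.8, 1.2))
    return image
```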

We then use this function to augment our extracted images in the ‘train’ folder. The code below generates a chosen number of new images (controlled by a variable called ‘images_to_generate’ that you can change) by randomly picking images from our train set, applying the augmentation function above, and saving each result in the train folder under ‘augmented’ + its previous ID as the new name. Lastly, we also append the name and label of each new image to the train_labels data and save a new CSV file that contains the IDs and labels of the augmented images as well.
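
A hedged sketch of that loop; the folder and column names (‘train’, ‘id’, ‘label’) and the ‘.tif’ extension follow the dataset’s layout, while ‘images_to_generate’ and the output file name are assumptions:

```python
import os

import pandas as pd
from skimage import io, util

train_dir = 'train'
labels_df = pd.read_csv('train_labels.csv')
images_to_generate = 20000  # change this to control how many new images are created

new_rows = []
for _ in range(images_to_generate):
    # Pick a random image from the training set.
    row = labels_df.sample(n=1).iloc[0]
    image = io.imread(os.path.join(train_dir, row['id'] + '.tif'))

    # Apply the augmentation function defined above and save under a new name.
    augmented = augment_image(image)
    new_id = 'augmented' + row['id']
    io.imsave(os.path.join(train_dir, new_id + '.tif'), util.img_as_ubyte(augmented))

    # Remember the new image's id and label.
    new_rows.append({'id': new_id, 'label': row['label']})

# Append the new rows and save an updated labels file.
labels_df = pd.concat([labels_df, pd.DataFrame(new_rows)], ignore_index=True)
labels_df.to_csv('train_labels_augmented.csv', index=False)
```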

Balancing Target Distributions

In machine learning, we want to ensure that our model is able to work well on all types of data we throw at it! One problem that can arise if we blindly use all our available training data is the case of imbalanced class/target distributions. What this means is that your dataset contains more observations of one or more classes than the other classes. If we train our model on this complete dataset, our model will get really good at predicting the class with more data points, while performing poorly at classifying other classes. Let’s see what happens in the case of our cancer detection model!

If we observe our training dataset, we will realize that it is imbalanced. Running the code below shows that we have around 130,000 images without cancer and around 90,000 images with cancer. As mentioned before, training on an imbalanced dataset can cause our model to become awesome at recognizing one class while failing miserably on the other. The minority class, in this case, is the one with detected cancer cells, and we really do not want to fail at classifying images with cancer, since doing so results in people going about their lives with undiagnosed cancer when they should be getting treated!
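
A small sketch of that check, assuming the labels (including any augmented rows) sit in a CSV with a ‘label’ column:

```python
import pandas as pd

# Count how many images belong to each class (0 = no cancer, 1 = cancer).
labels_df = pd.read_csv('train_labels.csv')
print(labels_df['label'].value_counts())
# Roughly 130,000 images are labelled 0 and roughly 90,000 are labelled 1.
```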


To fix this, we can drop images from the no cancer (0) class, which is done below. I dropped more images than necessary just for faster training, but we could have simply kept it to a 95000:93099 ratio. The code below drops the images and keeps the distribution of classes equal.
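
A hedged sketch of the undersampling step; sampling exactly as many class-0 rows as there are class-1 rows is one way to equalize the classes (you could sample fewer of both for faster training), and the random seed is an arbitrary choice:

```python
import pandas as pd

labels_df = pd.read_csv('train_labels.csv')

# Keep all positive rows and an equally sized random subset of negative rows.
positives = labels_df[labels_df['label'] == 1]
negatives = labels_df[labels_df['label'] == 0].sample(n=len(positives), random_state=42)

# Combine and shuffle so the two classes are mixed together.
balanced_df = pd.concat([positives, negatives]).sample(frac=1, random_state=42)
balanced_df = balanced_df.reset_index(drop=True)
```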

One more step: we need to divide our data into a training, test, and validation set. A training set is used to train our model, a validation set is used to tune the model’s hyperparameters, and the test set is used to check the performance of the model. The code snippet below does the split in a 60:20:20 ratio for train:test:validation.
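
A minimal sketch of that split using scikit-learn, assuming the balanced DataFrame from the previous step; stratifying on the label keeps the class balance in each split:

```python
from sklearn.model_selection import train_test_split

# First take 60% for training, then split the remaining 40% in half.
train_df, rest_df = train_test_split(
    balanced_df, test_size=0.4, stratify=balanced_df['label'], random_state=42)
test_df, val_df = train_test_split(
    rest_df, test_size=0.5, stratify=rest_df['label'], random_state=42)

print(len(train_df), len(test_df), len(val_df))
```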

In the example above, we had the luxury of having a meaty dataset from which we could drop data points to balance class distributions; however, what if we don’t have the luxury of dropping data? What if we need as much data as possible?

There are a few ways to go about this:

  • You can oversample the minority class. In our case, this would simply be taking random samples of the images in our training set with detected cancer cells, and appending those samples to our training set. You may get a bit skeptical about this, but this technique at least ensures that your model does not rely heavily on the majority class for training, and also learns to classify the minority class.
  • For non image-based datasets, you can create fake data. This can be done using techniques like Kernel Density Estimation where you learn the distribution of the features in your dataset, and draw samples from those distributions to generate new data.

These are just two approaches, and you can find more here.
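
To illustrate the first bullet, here is a rough sketch of oversampling the minority class by re-sampling its rows (with replacement) until the classes match; the duplicated rows simply point at the same image files again:

```python
import pandas as pd

labels_df = pd.read_csv('train_labels.csv')
positives = labels_df[labels_df['label'] == 1]
negatives = labels_df[labels_df['label'] == 0]

# Draw extra positive rows with replacement until both classes are the same size.
extra = positives.sample(n=len(negatives) - len(positives), replace=True, random_state=42)
oversampled_df = pd.concat([labels_df, extra]).sample(frac=1, random_state=42)

print(oversampled_df['label'].value_counts())
```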

Directory Structuring for Keras generators

Machine learning models use data stored in the computer/server’s memory. Sometimes the dataset is small enough to fit in memory, but in most practical cases it is not. To overcome this, we can use generators (they can also perform image augmentation, but I performed that manually above instead), which take our images and pass them to the machine learning model in batches instead of all at once. To do this, we need our images stored in a specific directory structure. The directory for this example should be in the format below.

Assuming you are in the project base directory, it should be:

— Training data folder

— — — class_0 folder

— — — class_1 folder

— Testing data folder

— — — class_0 folder

— — — class_1 folder

— Validation data folder

— — — class_0 folder

— — — class_1 folder

The training data folder has two sub-folders, one holding the images of each class. The testing and validation data folders are structured the same way. The code below creates these directories.
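
A sketch of the directory creation; the folder names used here (‘base_dir’, ‘train_dir’, ‘test_dir’, ‘val_dir’, ‘class_0’, ‘class_1’) are illustrative, not necessarily the original notebook’s names:

```python
import os

base_dir = 'base_dir'

# Create train/test/validation folders, each with one sub-folder per class.
for split in ['train_dir', 'test_dir', 'val_dir']:
    for class_name in ['class_0', 'class_1']:
        os.makedirs(os.path.join(base_dir, split, class_name), exist_ok=True)
```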

Once we have created the directories, we have to transfer the respective images to these directories. The code below does that. It takes around 30 minutes on Colab since we have to transfer around 160,000 images.
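
A hedged sketch of the copy step, assuming the split DataFrames from above (train_df, test_df, val_df) and the folder names created in the previous snippet; shutil.move could be used instead of shutil.copy to avoid duplicating several gigabytes of images:

```python
import os
import shutil

splits = {'train_dir': train_df, 'test_dir': test_df, 'val_dir': val_df}

for split_name, df in splits.items():
    for _, row in df.iterrows():
        # Send each image to the folder that matches its split and class.
        file_name = row['id'] + '.tif'
        class_folder = 'class_1' if row['label'] == 1 else 'class_0'
        src = os.path.join('train', file_name)
        dst = os.path.join('base_dir', split_name, class_folder, file_name)
        shutil.copy(src, dst)
```

These folders can then be fed to Keras through ImageDataGenerator’s flow_from_directory method, which reads the images in batches during training.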

Ending Notes

Great! At this point, you should have your dataset set up to feed into your model. The complete A-Z repository can be found here!

Hopefully, you found this article and notebook useful in understanding preprocessing data for the challenge, and preprocessing in general.
