Histopathological Cancer Detection with Deep Neural Networks

Antonio de Perio
Towards Data Science
12 min read · Apr 20, 2019


(Note: The related Jupyter notebook and original post can be found here: https://www.humanunsupervised.com/post/histopathological-cancer-detection)

Being able to automate the detection of metastasised cancer in pathological scans with machine learning and deep neural networks is an area of medical imaging and diagnostics with promising potential for clinical usefulness.

Here we explore a particular dataset prepared for this type of analysis and diagnostics: the PatchCamelyon Dataset (PCam).

PCam is a binary classification image dataset containing approximately 300,000 labelled low-resolution images of lymph node sections extracted from digital histopathological scans. Each image is labelled by trained pathologists for the presence of metastasised cancer.

The goal of this work is to train a convolutional neural network on the PCam dataset and achieve near state-of-the-art results.

As we’ll see, with the Fastai library, we achieve 98.6% accuracy in predicting cancer in the PCam dataset.

We approach this by preparing and training a neural network with the following features:

  1. Transfer learning with a pre-trained Resnet50 ImageNet model as our backbone.
  2. Data augmentation: image resizing, random cropping, and horizontal and vertical axis image flipping.
  3. The fit one cycle method to optimise learning rate selection for our training.
  4. Discriminative learning rates for fine-tuning.

In addition, we apply the following “out-of-the-box” optimisations and regularisation techniques in our training:

  • Dropout
  • Weight decay
  • Batch normalisation
  • Average and max pooling
  • The Adam optimiser
  • ReLU activations

This notebook presents research and an analysis of this dataset using Fastai + PyTorch, and is provided as a reference, tutorial, and open-source resource for others to refer to. It is not intended to be a production-ready resource for serious clinical application. We work here instead with low-resolution versions of the original high-resolution clinical scans in the Camelyon16 dataset, for education and research. This provides useful ground to prototype and test the effectiveness of various deep learning algorithms.

Background and Data Source

Original Source: Camelyon16

PCam is actually a subset of the Camelyon16 dataset: a set of high-resolution whole-slide images (WSIs) of lymph node sections. This dataset is made available by the Diagnostic Image Analysis Group (DIAG) and Department of Pathology of the Radboud University Medical Center (Radboudumc) in Nijmegen, The Netherlands. The following is an excerpt from their website: https://camelyon16.grand-challenge.org/Data/

The data in this challenge contains a total of 400 whole-slide images (WSIs) of sentinel lymph node from two independent datasets collected in Radboud University Medical Center (Nijmegen, the Netherlands), and the University Medical Center Utrecht (Utrecht, the Netherlands).

The first training dataset consists of 170 WSIs of lymph node (100 Normal and 70 containing metastases) and the second 100 WSIs (including 60 normal slides and 40 slides containing metastases).

The test dataset consists of 130 WSIs which are collected from both Universities.

Examples of a metastatic region (from Camelyon16)

PatchCam (Kaggle)

PCam was prepared by Bas Veeling, a PhD student in machine learning for health from the Netherlands, specifically to help machine learning practitioners interested in working on this particular problem. It consists of 327,680 colour images of 96x96 pixels. An excellent overview of the dataset can be found here: http://basveeling.nl/posts/pcam/, and it is also available for download on GitHub, where there is further information on the data: https://github.com/basveeling/pcam

This particular dataset is downloaded directly from Kaggle through the Kaggle API, and is a version of the original PCam (PatchCamelyon) dataset with duplicates removed.

PCam is intended to be a good dataset for fundamental machine learning analysis. As the name suggests, it’s a smaller version of the significantly larger Camelyon16 dataset used to perform similar analysis (https://camelyon16.grand-challenge.org/Data/).

In the author’s words:

PCam packs the clinically-relevant task of metastasis detection into a straight-forward binary image classification task, akin to CIFAR-10 and MNIST. Models can easily be trained on a single GPU in a couple hours, and achieve competitive scores in the Camelyon16 tasks of tumor detection and whole-slide image diagnosis. Furthermore, the balance between task-difficulty and tractability makes it a prime suspect for fundamental machine learning research on topics as active learning, model uncertainty, and explainability.

With a bit of background on the data out of the way, let’s start setting up our project and working directories…

Getting the data

The data we are using lives on Kaggle. We use Kaggle’s SDK to download the dataset directly from there. To work with the Kaggle SDK and API you will need to create a Kaggle API token in your Kaggle account.

When logged into Kaggle, navigate to “My Account”, then scroll down to “Create New API Token”. This will download a JSON file to your computer with your username and token string. Copy these contents to your ~/.kaggle/kaggle.json token file.
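
As a minimal sketch, assuming the kaggle Python package is installed and the token file above is in place, the competition files can be downloaded and unpacked like so (the data directory name here is our own choice):

```python
import zipfile
from pathlib import Path

import kaggle  # reads ~/.kaggle/kaggle.json and authenticates on import

data_dir = Path('data')  # hypothetical working directory
data_dir.mkdir(exist_ok=True)

# Download the competition files (the de-duplicated PCam version hosted on Kaggle)
kaggle.api.competition_download_files(
    'histopathologic-cancer-detection', path=str(data_dir))

# The API delivers a single zip named after the competition; unpack it
with zipfile.ZipFile(data_dir / 'histopathologic-cancer-detection.zip') as z:
    z.extractall(data_dir)
```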

Preparing the data with ImageDataBunch

With our data now downloaded, we create an ImageDataBunch object to help us load the data into our model, set data augmentations, and split our data into training and validation sets.

ImageDataBunch wraps up a lot of functionality to help us prepare our data into a format we can train on. Let’s go through some of the key functions it performs below:

Data Augmentation

By default ImageDataBunch performs a number of modifications and augmentations to the dataset:

  1. Centre-crops the images.
  2. Introduces some randomness in where and how it crops, for the purposes of data augmentation.
  3. Resizes everything so that all the images are the same size, which is required for the model to be able to train on them.

Image Flipping

There are various other data augmentations we could also use, but one of the key ones we activate is image flipping on the vertical axis.

For pathology scans this is a reasonable augmentation to activate, as there is little significance to whether a scan is oriented along the vertical axis or the horizontal axis.

By default fastai flips only on the horizontal axis, so we need to turn on flipping on the vertical, as sketched below.
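
In fastai v1 this is a single flag; a minimal sketch:

```python
from fastai.vision import *

# do_flip=True (horizontal flipping) is already the default;
# flip_vert=True additionally allows flipping on the vertical axis.
tfms = get_transforms(flip_vert=True)
```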

Batch Size

We’ll be using the 1cycle policy (fit_one_cycle()) to train our network (more on this later). This is a hyperparameter optimisation that allows us to use higher learning rates.

Higher learning rates act as a form of regularisation in the 1cycle policy. Recall that a small batch size adds regularisation; with 1cycle training, the higher learning rates provide that regularisation instead, which allows us to use larger batch sizes.

The recommendation, then, is to use the largest batch size our GPU supports when training with the 1cycle policy.

Training, validation and test sets

  1. We specify the folder location of the data (where the train and test sub-folders live along with the labels CSV).
  2. Under the hood, ImageDataBunch splits the images in the train sub-folder into a training set and a validation set (defaulting to an 80/20 split). This gives 176,020 images in the training set and 44,005 in the validation set.
  3. We also specify the location of the test sub-folder, which contains unlabelled images. Since these carry no labels, accuracy and error rates are measured against the labelled validation set; the test images are used for inference later.
  4. The CSV file containing the data labels is also specified. A sketch of the full setup is shown below.
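
Putting this together, a minimal sketch of the ImageDataBunch setup, assuming the Kaggle file layout (train/, test/, train_labels.csv, .tif images). The size and normalisation settings are covered in the next two sections, and the batch size of 64 is an assumption to adjust to your GPU:

```python
from fastai.vision import *

path = Path('data')  # where train/, test/ and the labels CSV live
tfms = get_transforms(flip_vert=True)  # augmentations from the section above

data = ImageDataBunch.from_csv(
    path,
    folder='train',                 # labelled training images
    csv_labels='train_labels.csv',  # Kaggle's id,label file
    suffix='.tif',                  # Kaggle images are .tif files
    test='test',                    # unlabelled test images
    valid_pct=0.2,                  # the default 80/20 train/validation split
    ds_tfms=tfms,
    size=224,                       # see the image size section below
    bs=64,                          # assumed batch size; use the largest your GPU fits
)
```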

Image size on base architecture and target architecture

Images in the target PCam dataset are 96x96 squares. However, the pre-trained ImageNet model we bring into our network was trained on larger images, so we need to set the size accordingly to respect the image sizes in that dataset.

We choose 224 for size as a good default to start with.

Normalising the images

Once we have set up the ImageDataBunch object, we also normalise the images.

Normalising the images uses the mean and standard deviation of the images to transform the image values into a standardised distribution that is more efficient for a neural network to train on.
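
In fastai v1 this is a one-liner; since our backbone is pre-trained on ImageNet, we normalise with ImageNet’s statistics:

```python
# Normalise with ImageNet's mean/std to match the pre-trained backbone
data = data.normalize(imagenet_stats)
```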

Below we take a look at some random samples of the data to get an understanding of what we are feeding into our network. This is a binary classification problem, so there are only two classes:

Negative (0) / Metastasis (1)
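
A quick sketch for inspecting a batch and the classes (the labels come from the labels CSV):

```python
data.show_batch(rows=3, figsize=(7, 7))  # a random grid of training patches with labels
print(data.classes)                      # the two classes, e.g. [0, 1]
```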

Learner (CNN Resnet50)

Once we have correctly set up the ImageDataBunch object, we can pass it, along with a pre-trained ImageNet model, to a cnn_learner. We will be using Resnet50 as our backbone.

Fastai wraps up a lot of state-of-the-art computer vision learning in its cnn_learner. It is the top-level construct that manages our model training and integrates our data.
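
Creating the learner is a single call; the metrics listed here are our own choice of what to report:

```python
from fastai.vision import *

learn = cnn_learner(
    data,             # the ImageDataBunch prepared above
    models.resnet50,  # pre-trained ImageNet backbone
    metrics=[accuracy, error_rate],
)
```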

Transfer learning

Starting with a backbone network from a well-performing model that was already pre-trained on another dataset is a method called transfer learning.

Transfer learning works on the premise that instead of training your model from scratch, you can use the learning (i.e. the learned weights) from another machine learning model as a starting point.

This is an incredibly effective method of training, and underpins current state-of-the-art practices in training deep neural networks.

When using pre-trained models we leverage, in particular, the learned features that are common to both the pre-trained model’s source dataset and the target dataset (PCam).

So, for models pre-trained on ImageNet such as Resnet50, training will leverage features (such as lines, geometry, and patterns) that have already been learnt from the base dataset, particularly in the first few layers, to train on the target dataset.

For our model, we’ll be using Resnet50. Resnet50 is a 50-layer residual neural network trained on ImageNet data, and it will provide a good starting point for our network.

Training and fit one cycle

Fit one cycle

We will be training our network with a method called fit one cycle. This optimisation is a way of applying a variable learning rate across the total number of epochs in our training run for a particular layer group. This has proven to be an extremely effective way to tune the learning rate hyperparameter for training.

Fit one cycle varies the learning rate from a minimum value at the first epoch (by default lr_max/div_factor), up to a pre-determined maximum value (lr_max), before descending again to a minimum across the remaining epochs. This min-max-min learning rate variance is called a cycle.

An excellent overview can be found here in the fastai docs https://docs.fast.ai/callbacks.one_cycle.html along with a more detailed explanation in the original paper by Leslie Smith [7], where this method of hyperparameter tuning was proposed.

So how then do we determine the most suitable maximum learning rate to enable fit one cycle? We run fastai’s lr_find() method.

Running lr_find before unfreezing the network yields the graph below. We want to choose a learning rate just before the loss starts to increase sharply.

From a visual observation of the resulting learning rate plot, starting with a learning rate of 1e-02 seems to be a reasonable choice for an initial lr value.
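
As a sketch, the learning rate sweep and the initial (frozen) training run might look as follows; the epoch count is illustrative rather than the exact value used:

```python
learn.lr_find()        # short mock training run, sweeping the learning rate
learn.recorder.plot()  # plot loss vs. learning rate to pick a value

# Train the frozen network with the 1cycle policy at our chosen max rate
learn.fit_one_cycle(4, max_lr=1e-2)  # 4 epochs is an assumption
learn.save('stage-1')  # checkpoint before fine-tuning
```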

Freeze

By default we start with our network frozen. This means the layers of our pre-trained Resnet50 model are not trainable (their weights have requires_grad=False, in PyTorch terms), and training begins only on the newly added head layers. The learning rate we provide to fit_one_cycle() applies only to that layer group for this initial training run.

Analysing first results

Analysing the graph of the initial training run, we can see that the training loss and validation loss both steadily decrease and begin to converge while the training progresses.

Accuracy at the moment is 97.76%.

We can learn more about this training run by using Fastai’s confusion matrix and plotting our top losses.

The confusion matrix is a handy tool to help us obtain more detail on the effectiveness of the training so far. Specifically, we get some clarity on the amount of false positives and false negatives predicted by our neural net.

Plotting our top losses allows us to examine specific images in more detail. Fastai generates a heatmap for images that we predicted incorrectly, letting us examine the areas of each image that confused our network. It’s useful to do this to build better context around how our model behaves on each training run, and to find clues as to how to improve it.
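
Both tools come from fastai’s ClassificationInterpretation; a minimal sketch:

```python
interp = ClassificationInterpretation.from_learner(learn)

interp.plot_confusion_matrix()  # counts of false positives and false negatives
interp.plot_top_losses(9, figsize=(10, 10), heatmap=True)  # worst predictions, with heatmaps
```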

Fine-tuning, unfreezing, and discriminative learning rates

Initial results are already good on the first training run. But with some more fine-tuning, we can actually do a little better.

Transfer learning + Fine-tuning = Better Generalisation

Transfer learning alone brings us much further than training our network from scratch. But this method is prone to the optimisation difficulties that arise between fragile co-adapted layers when connecting a pre-trained network. We counter this by fine-tuning our model: making all the layers of our network, including the pre-trained Resnet50 layers, trainable. When we unfreeze, we train across all of our layers. (See [6])

This leads to better results and an improved ability to generalise to new examples.

Discriminative Learning Rates and 1cycle

With all of our layers in our network unfrozen and open for training, we can now also make use of discriminative learning rates in conjunction with fit_one_cycle to improve our optimisations even further.

Discriminative learning rates lets us apply specific learning rates to layer groups in our network, optimising for each group. Fit one cycle then operates on these values and uses them to vary learning rates according to the 1cycle policy. (https://docs.fast.ai/basic_train.html#Discriminative-layer-training)

How do we find the best range of learning rates to use with fit_one_cycle? We can use lr_find() to help us with that.

Analysing our lr plot above, we choose a range of learning rates just before the loss begins to increase sharply, and apply that as a slice to our fit_one_cycle method below.

From our plot above, it seems reasonable to select an upper bound rate of 1e-4, and as a recommended rule for our lower bound rate, we can select a value 10x smaller than our upper-bound, in this case 1e-5.

The lower bound rate will apply to the layers in our pre-trained Resnet50 layer group. The weights here are already well learned so we can proceed with a slower learning rate for this group of layers.

The upper bound rate is applied to the final layer group: the head layers trained on the target dataset in our last training run. The layers in this group will benefit from a faster learning rate.
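
A sketch of the fine-tuning run with these rates (the epoch count again is illustrative):

```python
learn.unfreeze()       # make every layer group trainable
learn.lr_find()        # re-run the sweep on the unfrozen network
learn.recorder.plot()

# slice(1e-5, 1e-4): the lower rate for the pre-trained Resnet50 groups,
# the upper rate for the head, with intermediate groups spread in between
learn.fit_one_cycle(4, max_lr=slice(1e-5, 1e-4))
learn.save('stage-2')
```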

Final analysis

In the final fine-tuning training run, we can see that our training loss and validation loss now begin to diverge mid-training: the training loss improves at a much faster rate than the validation loss, steadily decreasing until it stabilises within a narrow range of values in the final epochs of the run.

Any further increase in our validation loss, in the presence of a continually decreasing training loss, would indicate overfitting: the model would fail to generalise well to new examples.

Finalising training at this point yields a fine-tuned accuracy of 98.6%, an improvement over our stage 1 training run result.

References

[1] Practical Deep Learning for Coders, v3. Fastai. Jeremy Howard, Rachel Thomas. https://course.fast.ai/index.html

[2] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, M. Welling. “Rotation Equivariant CNNs for Digital Pathology”. arXiv:1806.03962

[3] Ehteshami Bejnordi et al. “Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer”. JAMA: The Journal of the American Medical Association, 318(22), 2199–2210. doi:10.1001/jama.2017.14585

[4] Camelyon16 Challenge https://camelyon16.grand-challenge.org

[5] Kaggle. Histopathologic Cancer Detection — Identify metastatic tissue in histopathologic scans of lymph node sections https://www.kaggle.com/c/histopathologic-cancer-detection

[6] Jason Yosinski, Jeff Clune, Yoshua Bengio, Hod Lipson. “How transferable are features in deep neural networks?”. arXiv:1411.1792 [cs.LG]

[7] Leslie N. Smith. “A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay”. arXiv:1803.09820v2 [cs.LG]

