Deep Learning for Diagnosis of Skin Images with fastai

Learn to identify skin cancer and other conditions from dermoscopic images

Aldo von Wangenheim
Towards Data Science

--

We show how to use fast.ai to solve the 2018 Skin Lesion Analysis Towards Melanoma Detection challenge and automatically identify seven kinds of skin pathologies.

Pixabay/Pexels free images

Posted by Aldo von Wangenheim — aldo.vw@ufsc.br

This is based upon the following material:

  1. TowardsDataScience::Classifying Skin Lesions with Convolutional Neural Networks — A guide and introduction to deep learning in medicine by Aryan Misra
  2. Tschandl, Philipp, 2018, “The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions”, https://doi.org/10.7910/DVN/DBW86T, Harvard Dataverse [arXiv preprint version: arXiv:1803.10417 [cs.CV]]
  3. Tools for workup of the HAM10000 dataset — GitHub. This repository gives access to the tools created and used for assembling the training dataset.

The HAM10000 dataset (Human Against Machine with 10000 training images) served as the training set for the ISIC 2018 challenge (Task 3). The official validation and test sets of this challenge are available, without ground-truth labels, through the challenge website https://challenge2018.isic-archive.com/. The ISIC Archive also provides a “Live challenge” submission site for continuous evaluation of automated classifiers on the official validation and test sets.

The images look like this:

Random images from the HAM10000 dataset with their ground truth labels

What is in this Dataset?

This dataset contains pigmented skin lesions acquired through standard dermoscopy. These are lesions in which the tissue produces melanin, the natural pigment of human skin, and which therefore appear dark. Not all lesions initially investigated and triaged through dermoscopy are pigmented, however. This means that, in a real-world situation, a GP or a nurse examining a patient through dermoscopy (or a patient performing self-examination) with the intent to submit the images to a dermatologist for an initial triage would possibly encounter lesions other than the ones depicted in this dataset (see Is there an ISIC 2019? below).

The lesion classes in the HAM10000 Dataset are:

  1. nv: Melanocytic nevi — benign neoplasms of melanocytes [6705 images];
  2. mel: Melanoma — a malignant neoplasm derived from melanocytes [1113 images];
  3. bkl: Benign keratosis — a generic class that includes seborrheic keratoses, solar lentigo and lichen planus-like keratoses [1099 images];
  4. bcc: Basal cell carcinoma — a common variant of epithelial skin cancer that rarely metastasizes but grows destructively if untreated (bccs do not necessarily produce pigmented lesions) [514 images];
  5. akiec: Actinic keratoses and intraepithelial carcinoma — common non-invasive variants of squamous cell carcinoma that can be treated locally without surgery [327 images];
  6. vasc: Vascular skin lesions ranging from cherry angiomas to angiokeratomas and pyogenic granulomas [142 images];
  7. df: Dermatofibroma — a benign skin lesion regarded as either a benign proliferation or an inflammatory reaction to minimal trauma [115 images].

For a more detailed description of each class, look at the Kaggle kernel: Skin Lesion Analyzer + Tensorflow.js Web App — Python notebook using data from Skin Cancer MNIST: HAM10000 and also the paper by Philipp Tschandl, above.

What are the diagnostic limitations of this data set?

Dermoscopic images alone do not provide enough data for a dermatological diagnosis or a reliable remote patient triage in a teledermatology setting: they lack context. In order to provide context, you need an image acquisition protocol that includes panoramic whole-body images of the patient and also approximation images of each lesion, taken with a ruler or another frame of reference visible in the image to convey the size of the lesion. Approximation images taken with a ruler are also important for a patient already in treatment, since they allow the accompanying physician to follow the evolution of the lesion. Both panoramic and approximation images, in order to be acquired correctly, must follow a protocol that guarantees that the images are in focus, taken from the correct distance and with the correct illumination. There are also details that cannot be reliably detected through the standard dermoscopy technique presently employed, and in several cases a confirmation biopsy will be necessary.

If you are interested in knowing more about Teledermatology examination acquisition protocols, look here:

How are the dermoscopic images acquired?

The contact dermoscope used nowadays is the result of an international effort for the standardization of this examination, carried out during the first half of the 1990s and led by a group of researchers at the University of Munich in Germany. This equipment uses a single lens with a 10x magnification and internal illumination by LEDs. The examination is performed with mineral oil, which is applied to the surface of the lesion before the dermoscope is placed on it and a photograph is taken.

Two different 10x contact dermoscopes: (a) analog pocket dermoscope and (b) dermoscopy adapter for Sony digital cameras employed by us at the Santa Catarina State Telemedicine and Telehealth Network — STT/SC

The choice of a monocular 10x magnification lens as the standard allowed for the development of very small devices that soon became very popular. Analog dermoscopes can be carried in a breast pocket and digital ones can be easily built as small USB devices or as adapters for digital cameras and smartphones.

The drawback of this standard is that the 10x magnification is not enough for the reliable detection of some pathologies, such as basal cell carcinoma, which is the most common form of skin cancer. This form of neoplasm is characterized by vascular alterations, called arboriform vascularizations, which cannot be reliably observed with a monocular lens at 10x magnification: a confirmation biopsy will always be necessary to provide the definitive diagnosis. Reliable detection needs higher magnification and binocular optics [1][2].

Non-pigmented basal cell carcinomas acquired (a) with a 10x contact planar dermoscope and (b) with a 50x binocular stereoscopic dermoscope, showing arboriform vascularizations (J.Kreusch, Univ. Lübeck, Sur Prise e.K. collection, POB 11 11 07, 23521 Lübeck, Germany) — only one image of the stereo pair is depicted here

Until the 1990s, other types of dermoscopes were being developed that could provide better image quality but were larger and not as easy to operate, such as the 50x binocular stereoscopic contact dermoscope. This equipment is better suited for the early visual detection of pathologies such as basal cell carcinoma [1][2]. The practicality, popularization and standardization of the 10x contact monocular dermoscope, however, stopped these other lines of research.

50x stereoscopic dermoscope produced by Kocher Feinmechanik, Germany in the late 1990s and still in clinical use (L.F.Kopke, Florianópolis [2])

Is there an ISIC 2019?

The ISIC (International Skin Imaging Collaboration) has addressed the lack of some dermatological diagnostic categories in the HAM10000 dataset by publishing a new dataset for the ISIC 2019 challenge: Skin Lesion Analysis Towards Melanoma Detection. The 2019 dataset, released on May 3, 2019, contains nine diagnostic categories and 25,331 images:

  1. Melanoma
  2. Melanocytic nevus
  3. Basal cell carcinoma
  4. Actinic keratosis
  5. Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis)
  6. Dermatofibroma
  7. Vascular lesion
  8. Squamous cell carcinoma
  9. None of the others
  • At the time of writing, test metadata for this new dataset are not yet available; their release is announced for August 9, 2019.
  • The ISIC has, up to now, sponsored four challenges in image analysis, ISIC 2016 through ISIC 2019, always with the theme “Skin Lesion Analysis Towards Melanoma Detection”. The four challenges can be found at the ISIC Archive.

Let’s start working on the dataset of the 2018 challenge.

Initializations

Run the cells in this section once each time you start this notebook.

%reload_ext autoreload
%autoreload 2
%matplotlib inline

Testing your virtual machine on Google Colab…

Just to be sure, check which CUDA driver and which GPU Colab has made available for you. The GPU will typically be either:

  • a K80 with 11 GB RAM or (if you’re really lucky)
  • a Tesla T4 with 14 GB RAM

If Google’s servers are crowded, you may end up with access to only part of a GPU. If your GPU is shared with another Colab notebook, you’ll see a smaller amount of memory made available to you.
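
If you prefer to check from inside Python, the same information can be queried through PyTorch’s CUDA interface; a minimal sketch (the printed values will of course vary):

import torch

# Report which GPU (if any) PyTorch sees and how much memory it has
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, round(props.total_memory / 1024**2), 'MiB')
else:
    print('No GPU available - check Runtime > Change runtime type in Colab')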

Tip: Avoid peak times of the US west coast. I live at GMT-3 and we are two hours ahead of the US east coast, so I always try to perform heavy processing in the morning hours.

!/opt/bin/nvidia-smi
!nvcc --version

When I started running the experiments described here, I was lucky: I had a whole T4 with 15079 MB RAM! My output looked like this:

Thu May  2 07:36:26 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 63C P8 17W / 70W | 0MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

Library imports

Here we import all the necessary packages. We are going to work with the fastai V1 library, which sits on top of PyTorch 1.0. The fastai library provides many useful functions that enable us to quickly and easily build neural networks and train our models.

from fastai.vision import *
from fastai.metrics import error_rate
from fastai.callbacks import SaveModelCallback
# Imports for diverse utilities
from shutil import copyfile
import matplotlib.pyplot as plt
import operator
from PIL import Image
from sys import intern # For the symbol definitions

Export and restoration functions

# Export the network for deployment and create a backup copy
def exportStageTo(learn, path):
    learn.export()
    # Make a separate backup of the deployment file
    copyfile(path/'export.pkl', path/'export-malaria.pkl')

# exportStageTo(learn, path)

# Restore a deployment model, for example in order to continue fine-tuning
def restoreStageFrom(path):
    # Restore from the backup
    copyfile(path/'export-malaria.pkl', path/'export.pkl')
    return load_learner(path)

# learn = restoreStageFrom(path)
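
Later on, the exported export.pkl can also be used for stand-alone inference. A minimal sketch (the image file name is only an illustration, use any dermoscopic image you have at hand):

# Load the deployed model and classify a single image (sketch)
learn_inf = load_learner(path)               # reads path/'export.pkl'
img = open_image(path/'some_lesion.jpg')     # replace with an actual image file
pred_class, pred_idx, probs = learn_inf.predict(img)
print(pred_class, probs)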

Download the Dermoscopic Images of Pigmented Lesions

We will download the Kaggle version of this dataset because Google Colab has the Kaggle API preinstalled and it is all organized in one .zip file. In order to download from Kaggle you need:

  • an account at Kaggle
  • to install your Kaggle credentials (a .json file) on Colab

To see how to do this, first look at this tutorial and the Kaggle API instructions and generate and upload your credentials to Colab:
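
One simple way to get the kaggle.json file into your Colab session is to upload it interactively with Colab’s files helper; a minimal sketch (you can also just drag the file into the Files pane):

# Upload the kaggle.json you generated on your Kaggle account page
from google.colab import files
files.upload()   # a file picker will open; select kaggle.json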

When you’ve created and copied your Kaggle credentials to Colab

Run the cell below. It will create a folder for your Kaggle API credentials and install them in Colab:

!mkdir .kaggle
!mv kaggle.json .kaggle
!chmod 600 /content/.kaggle/kaggle.json
!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
!kaggle config set -n path -v{/content}

For some arcane reason this script sometimes does not work. It seems to have to do with the way Colab names the home folder. If you get an error message, simply execute it again. The final output should look like this:

- path is now set to: {/content}

Perform the actual Downloading and Unzipping

Create a ‘data’ folder and download the dermatoscopy images into it

!mkdir data
!kaggle datasets download kmader/skin-cancer-mnist-ham10000 -p data

This will produce the following output:

Downloading skin-cancer-mnist-ham10000.zip to data
100% 2.61G/2.62G [00:52<00:00, 42.3MB/s]
100% 2.62G/2.62G [00:52<00:00, 53.4MB/s]

Unzip the whole zipfile into /content/data and then quietly (-q) unzip the image files (you don’t want to verbosely unzip more than 10k images!). We will use the overwrite option (-o) in order to silently overwrite files that may have been left over from some interrupted prior attempt.

# Unzip the whole zipfile into /content/data
!unzip -o data/skin-cancer-mnist-ham10000.zip -d data
# Quietly unzip the image files
!unzip -o -q data/HAM10000_images_part_1.zip -d data
!unzip -o -q data/HAM10000_images_part_2.zip -d data
# Tell me how many files were unzipped
!echo files in /content/data: `ls data | wc -l`

If you see a count of 10,023 files, you’ve done it right!

Archive:  data/skin-cancer-mnist-ham10000.zip
inflating: data/hmnist_28_28_RGB.csv
inflating: data/HAM10000_metadata.csv
inflating: data/HAM10000_images_part_1.zip
inflating: data/hmnist_28_28_L.csv
inflating: data/hmnist_8_8_L.csv
inflating: data/HAM10000_images_part_2.zip
inflating: data/hmnist_8_8_RGB.csv
files in /content/data: 10023

Prepare your data

Now we’ll prepare our data for processing with fast.ai.

In our previous medical image classification posting (Deep Learning and Medical Image Analysis for Malaria Detection with fastai) we have sorted the image categories into folders, one folder for each class.

Here we have all images stored in one folder and the metadata stored in a spreadsheet we will read with the fast.ai ImageDataBunch.from_csv() method from the fast.ai Data Block API.

What do we do differently here?

The HAM10000 dataset does not provide the images sorted into folders according to their classes. Instead, all images are in one folder and a spreadsheet with metadata for each image is provided. In this tutorial we will read the class of each image from this .csv spreadsheet, instead of organizing the image files into folders named after the class to which the images belong. fast.ai provides ready-to-use methods for interpreting spreadsheets and extracting the classification data for the images, and in this posting we will learn how to make use of them.

For this purpose we will be using the data block API from fast.ai. There’s a very good explanation how it works in the following posting:
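
As a preview, a DataBunch roughly equivalent to the one we build below could also be assembled with the lower-level data block calls; a sketch (not the code we actually run here, column indices follow the metadata CSV):

# Same DataBunch assembled step by step with the data block API (sketch)
src = (ImageList.from_csv(Path('data'), 'HAM10000_metadata.csv',
                          cols=1, suffix='.jpg')   # column 1 holds the image_id
       .split_by_rand_pct(0.2)                     # random 20% validation split
       .label_from_df(cols=2))                     # column 2 (dx) holds the class
data = (src.transform(get_transforms(flip_vert=True, max_lighting=0.1,
                                     max_rotate=None, max_warp=None, max_zoom=1.0),
                      size=448)
        .databunch(bs=64))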

Create your training and validation data bunches

In the original tutorial above, which employs Keras, there’s a routine to create training, validation and test folders from the data. With fast.ai this is not necessary: if you only have a ‘train’ folder, you can split it while creating the DataBunch by simply passing a few parameters…

With fast.ai we also can easily work with resolutions that are different from the original ImageNet resolutions employed to pre-train the networks we will be using. In the tutorial listed above the author reduced the dataset image resolution to 224x224, in order to use the Keras MobileNet model. We will be employing a 448x448 resolution:

bs = 64        # Batch size, 64 for medium images on a T4 GPU...
size = 448     # Image size, 448x448 is double the original ImageNet size
path = Path("./data") # The path to the 'train' folder you created...
# Limit your augmentations: it's medical data! You do not want to fantasize data...
# Warping, for example, will leave your images badly distorted, so don't do it!
# This dataset is big, so don't rotate the images either. Let's stick to flipping...
tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_rotate=None, max_warp=None, max_zoom=1.0)
# Create the DataBunch!
# Remember that you'll have images that are bigger than 448x448 and images that are smaller,
# so squish them all in order to occupy exactly 448x448 pixels...
data = ImageDataBunch.from_csv('data', csv_labels='HAM10000_metadata.csv', suffix='.jpg', fn_col=1, label_col=2,
                               ds_tfms=tfms, valid_pct=0.2, size=size, bs=bs)
print('Transforms = ', len(tfms))
# Save the DataBunch in case the training goes south... so you won't have to regenerate it..
# Remember: this DataBunch is tied to the batch size you selected.
data.save('imageDataBunch-bs-'+str(bs)+'-size-'+str(size)+'.pkl')
# Show the statistics of the Bunch...
print(data.classes)
data

This will produce the following output:

Transforms =  2
['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']
ImageDataBunch;Train: LabelList (8012 items)
x: ImageList
Image (3, 448, 448),Image (3, 448, 448),Image (3, 448, 448),Image (3, 448, 448),Image (3, 448, 448)
y: CategoryList
bkl,bkl,bkl,bkl,bkl
Path: data;
Valid: LabelList (2003 items)
x: ImageList
Image (3, 448, 448),Image (3, 448, 448),Image (3, 448, 448),Image (3, 448, 448),Image (3, 448, 448)
y: CategoryList
bkl,nv,nv,nv,nv
Path: data;
Test: None

Look at your DataBunch to see if the augmentations are acceptable…

data.show_batch(rows=5, figsize=(15,15))

First Training Experiment: ResNet34

If you do not know which model to use, starting with a Residual Network with 34 layers is a good choice: powerful, but neither too small nor too big…

In the tutorial listed above the author used a MobileNet implemented in Keras and the network’s original image resolution of 224x224. In fast.ai the ResNet easily adapts to work with the resolution of 448x448 of our DataBunch.

Now we will start training our model. We will use a convolutional neural network backbone and a fully connected head with a single hidden layer as a classifier. For the moment you only need to know that we are building a model which takes images as input and outputs the predicted probability for each of the categories (in this case, it will have seven outputs).

We will be using two different metrics to monitor our training success:

  • accuracy: validation accuracy
  • error_rate: validation error rate

If you want more information, look at https://docs.fast.ai/metrics.html.

learn = cnn_learner(data, models.resnet34, metrics=[accuracy, error_rate])
learn.model

Just pass the data variable, which contains the DataBunch instance, to the cnn_learner() function and fast.ai will automatically adapt the input layer of the new network to the higher image resolution. The model will look like this:

Sequential(
(0): Sequential(
(0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace)
(3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(2): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(5): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
...
...
... and so on...

Training strategy

We will employ the fit_one_cycle method, which implements the 1cycle policy developed by Leslie N. Smith — see below for details:

If you want to know more about the new 1cycle training API in the fast.ai library, look at this notebook prepared by Sylvain Gugger.

We will also save the network at every epoch, so that we can go back to the best one: https://docs.fast.ai/callbacks.html#SaveModelCallback

learn.fit_one_cycle(10, callbacks=[SaveModelCallback(learn, every='epoch', monitor='accuracy', name='derma-1')])
# Save the network (you will need to regenerate the DataBunch if you continue later)
learn.save('derma-stage-1')
# Deploy this network so that we can use it offline later for testing
exportStageTo(learn, path)
85% accuracy…

Results for ResNet34

Let’s see what results we have got.

We will first see which categories the model most often confused with one another. We will try to see if what the model predicted was reasonable or not. In this case the mistakes look reasonable (none of them seems obviously naive). This is an indicator that our classifier is working correctly.

Furthermore, when we plot the confusion matrix, we can see that the distribution is heavily skewed: the model makes the same mistakes over and over again but rarely confuses other categories. This suggests that it just finds it difficult to distinguish some specific categories from each other; this is normal behaviour.

Let’s generate a ClassificationInterpretation and look at some results, the confusion matrix and the loss curves.

interp = ClassificationInterpretation.from_learner(learn)
losses,idxs = interp.top_losses()
len(data.valid_ds)==len(losses)==len(idxs)

Look at your worst results, first without using a heatmap:

interp.plot_top_losses(9, figsize=(20,11), heatmap=False)

Now, do the same, but highlight the images with a heatmap in order to see which parts of each image induced the wrong classification:

interp.plot_top_losses(9, figsize=(20,11), heatmap=True)

Show the confusion Matrix

Here we have seven classes and it makes a lot of sense to look at the confusion matrix. Besides, it makes beautiful pictures…

interp.plot_confusion_matrix(figsize=(5,5), dpi=100)

What we can see here is that:

  • nevi are by far the most common occurrence. One could think of reducing the number of nevi in the training set in order not to skew the results (see the sketch after this list);
  • there are several benign keratoses (bkl) wrongly classified. This is probably because bkl in this dataset is a generic class that includes seborrheic keratoses, solar lentigo and lichen planus-like keratoses, which are dermatoses that, even if related, look really very different;
  • there are also several melanomas (mel) that were misclassified. This was a surprise. I expected the network to perform better here.

If you are confused by confusion matrices, look here:

Show your learning curve:

Plot your losses in order to see the learning curve:

learn.recorder.plot_losses()

This result is really good. The network oscillated a bit but learned steadily. Now let’s try to fine-tune the network.

Fine tune the ResNet34

First unfreeze the network and try to find a good range of learning rates for this particular network.

The method learn.lr_find() helps you find an optimal learning rate. It uses the technique developed in the 2015 paper Cyclical Learning Rates for Training Neural Networks (http://arxiv.org/abs/1506.01186), where we simply keep increasing the learning rate from a very small value until the loss stops decreasing and starts to grow.

If you want to know more about finding the best learning rates, look here:

Let’s do it:

# Unfreeze the network
learn.unfreeze()
# Find optimum learning rates
learn.lr_find()
# Include suggestion=True in order to obtain a suggestion on where to look...
learn.recorder.plot(suggestion=True)

Let’s fine tune the ResNet34. We’ll employ 30 epochs to be sure. The learning rate finder identified 1e-5 as a “secure” learning rate. So we will define a range of learning rates using the rule of thumb: ending at the “secure” rate 1e-5 and starting at a rate that is one order of magnitude higher: max_lr=slice(1e-4,1e-5).

# Unfreeze the network
learn.unfreeze()
learn.fit_one_cycle(30, max_lr=slice(1e-4,1e-5),
callbacks=[SaveModelCallback(learn, every='epoch', monitor='accuracy', name='derma')])
# Now save as stage 2...
learn.save('derma-stage-2')
# Final deployment
exportStageTo(learn, path)
93% accuracy!

So we achieved 93% accuracy with our first network. This is very good! The accuracies achieved in the tutorial and kernel above are 85% and 86%, respectively.

Now let’s look at our statistics:

interp = ClassificationInterpretation.from_learner(learn)
losses,idxs = interp.top_losses()
# Test to see if nothing is missing (must return True)
len(data.valid_ds)==len(losses)==len(idxs)

If this returns True, then plot the confusion matrix for the fine-tuned network:

interp.plot_confusion_matrix(figsize=(5,5), dpi=100)

The melanoma predictions look much better in this matrix! Let’s look at the training curves:

learn.recorder.plot_losses()

We can see that both the training and the validation curves oscillated. The training curve seems to be reaching a plateau and is separating from the validation curve, which is decreasing much more slowly. This indicates that we are probably moving towards overfitting the network. It would be advisable to stop here.

In the notebook we made available on Colab, we trained the network for another 30 epochs, just to be sure. It actually became worse, probably due to overfitting. So stopping here is a good choice for the ResNet34.
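
If you want to guard against this automatically instead of watching the curves, fastai v1 also provides an EarlyStoppingCallback; a sketch (the min_delta and patience values are arbitrary choices):

from fastai.callbacks import EarlyStoppingCallback

# Stop the fine-tuning run when validation accuracy stops improving (sketch)
learn.fit_one_cycle(30, max_lr=slice(1e-4,1e-5),
                    callbacks=[SaveModelCallback(learn, every='improvement',
                                                 monitor='accuracy', name='derma-best'),
                               EarlyStoppingCallback(learn, monitor='accuracy',
                                                     min_delta=0.005, patience=3)])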

Go bigger: ResNet50

With the ResNet34 we achieved 92.9% accuracy. Let’s see if we can perform better with a larger network. We will create the DataBunch again, this time with a smaller batch size in order not to overload the GPU memory…

bs = 28         # Batch size, 28 for medium images on a T4 GPU and ResNet50...
size = 448      # Image size, 448x448 is double the original
                # ImageNet size of the pre-trained ResNet we'll be using,
                # should be easy to train...
path = Path("./data") # The path to the 'train' folder you created...
# Limit your augmentations: it's medical data! You do not want to fantasize data...
# Warping, for example, will leave your images badly distorted, so don't do it!
# This dataset is big, so don't rotate the images either. Let's stick to flipping...
tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_rotate=None, max_warp=None, max_zoom=1.0)
# Create the DataBunch!
# Remember that you'll have images that are bigger than 448x448 and images that are smaller,
# so squish them all in order to occupy exactly 448x448 pixels...
data = ImageDataBunch.from_csv('data', csv_labels='HAM10000_metadata.csv', suffix='.jpg', fn_col=1, label_col=2,
                               ds_tfms=tfms, valid_pct=0.2, size=size, bs=bs)
print('Transforms = ', len(tfms))
# Save the DataBunch in case the training goes south... so you won't have to regenerate it..
# Remember: this DataBunch is tied to the batch size you selected.
data.save('imageDataBunch-bs-'+str(bs)+'-size-'+str(size)+'.pkl')
# Show the statistics of the Bunch...
print(data.classes)
data

The output now should look like this:

Transforms =  2
['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']
ImageDataBunch;Train: LabelList (8012 items)
x: ImageList
Image (3, 448, 448),Image (3, 448, 448),Image (3, 448, 448),Image (3, 448, 448),Image (3, 448, 448)
y: CategoryList
bkl,bkl,bkl,bkl,bkl
Path: data;
Valid: LabelList (2003 items)
x: ImageList
Image (3, 448, 448),Image (3, 448, 448),Image (3, 448, 448),Image (3, 448, 448),Image (3, 448, 448)
y: CategoryList
nv,nv,nv,mel,bkl
Path: data;
Test: None

Now create a ResNet50:

learn50 = cnn_learner(data, models.resnet50, metrics=[accuracy, error_rate])
learn50.model

Observe that we included accuracy in the metrics, so we will not need to manually perform the calculation based on the error rate. The model should look like:

Sequential(
(0): Sequential(
(0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace)
(3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(4): Sequential(
(0): Bottleneck(
(conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(downsample): Sequential(
(0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
)
(2): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
)
)
...
...
...
(1): Sequential(
(0): AdaptiveConcatPool2d(
(ap): AdaptiveAvgPool2d(output_size=1)
(mp): AdaptiveMaxPool2d(output_size=1)
)
(1): Flatten()
(2): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Dropout(p=0.25)
(4): Linear(in_features=4096, out_features=512, bias=True)
(5): ReLU(inplace)
(6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Dropout(p=0.5)
(8): Linear(in_features=512, out_features=7, bias=True)
)
)

Now train it:

learn50.fit_one_cycle(15, callbacks=[SaveModelCallback(learn50, every='epoch', monitor='accuracy', name='derma50-1')])
# Save weights
learn50.save('derma50-stage-1')
# Deploy the whole network (with the databunch)
exportStageTo(learn50, path)
87.6% accuracy….

Look at the results:

interp = ClassificationInterpretation.from_learner(learn50)
losses,idxs = interp.top_losses()
interp.plot_confusion_matrix(figsize=(5,5), dpi=100)

Look at the learning curves:

learn50.recorder.plot_losses()

For now, this did not impress: the network oscillated more than the ResNet34 during this first transfer-learning phase, and the results, even if numerically better (85% vs. 87% accuracy), actually look worse on a visual analysis of the confusion matrix. Let’s fine-tune the ResNet50 and see if this produces better results.

We will do this in two experiments.

ResNet50 Experiment #1: Fine tuning with blind acceptance of the learning rate suggestion

While training a deep neural network, selecting a good learning rate is essential for both fast convergence and a lower error. The first step in fine-tuning is to find an adequate range of learning rates:

# Unfreeze the network
learn50.unfreeze()
# Find optimum learning rates
learn50.lr_find()
# Include suggestion=True in order to obtain a suggestion on where to look...
learn50.recorder.plot(suggestion=True)

This will output:

LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
Min numerical gradient: 9.12E-07
Min loss divided by 10: 1.10E-07

We will round the lower bound, 9.12E-07, up to 1.0E-06. Now you can fine-tune using our rule of thumb: define a range of learning rates ending at the “secure” rate of 1e-06 and starting at a rate that is one order of magnitude higher:

# Unfreeze the network
learn50.unfreeze()
learn50.fit_one_cycle(35, max_lr=slice(1e-5,1e-6),
callbacks=[SaveModelCallback(learn50, every='epoch', monitor='accuracy', name='derma50')])
# Save the weights of stage 2 each "better" epoch:
learn50.save('derma50-stage-2')
# Do not overwrite the stage 1 .pkl with stage 2
# We will need it for the ResNet50 Experiment #2
# exportStageTo(learn50, path)
90% accuracy…

So, after 35 epochs and a lot of processing, we arrived at 90% accuracy. This looks unpromising… Let’s look at the confusion matrix and the learning curves:

interp = ClassificationInterpretation.from_learner(learn50)
losses,idxs = interp.top_losses()
interp.plot_confusion_matrix(figsize=(5,5), dpi=100)
learn50.recorder.plot_losses()

What we see here is a result that is much different from the ResNet34. With the ResNet50 the network learns the training set during the fine-tuning phase, but it is oscillating all the time: some batches make it perform better, others make it turn back and perform worse. This normally means that the data are of bad quality and contain too much noise. In this case, however, we have already seen that we can achieve 93% accuracy with a ResNet34, so bad data is not the issue here. Another possibility is that the network has too many parameters and is not generalizing but adapting to individual instances of the training set, thus learning individual examples and de-generalizing. This makes its performance worse for other parts of the data set, and so the network acts like a pendulum, going back and forth in the error space. The validation loss, which is much higher than the training loss during the whole fine-tuning process, corroborates this interpretation of the learning curve.

But we are stubborn… Let’s perform another experiment.

ResNet50 Experiment #2: Fine tuning with manual learning rate setting

In the former experiment we blindly accepted the suggestion of the analysis algorithm and took a very low learning rate. Maybe this was the reason for the poor learning?

If we look at the learning rate graph, we can see that a plateau is already forming at about 1.0E-4. Then the learning rate curve plunges into two valleys, one at 1.0E-5 and another at 1.0E-6. What if we take the more stable, flat plateau zone, from 1.0E-4 to 1.0E-5, and make this our learning rate range?

First, restore the network to the state you had when you finished the initial transfer learning:

# Will always load a path/'export.pkl' deployment file
learn50 = restoreStageFrom(path)

Fine-tune it again, now with max_lr=slice(1e-4,1e-5):

# Unfreeze the network
learn50.unfreeze()
learn50.fit_one_cycle(35, max_lr=slice(1e-4,1e-5),
callbacks=[SaveModelCallback(learn50, every='epoch', monitor='accuracy', name='derma50')])
learn50.save('derma50-stage-2')
exportStageTo(learn50, path)
92.6% accuracy….

Let’s look at the result graphics:
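
The graphics are produced with the same interpretation calls we used before (repeated here for convenience):

interp = ClassificationInterpretation.from_learner(learn50)
interp.plot_confusion_matrix(figsize=(5,5), dpi=100)
learn50.recorder.plot_losses()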

These results are better than before, even if the network oscillated a lot. The confusion matrix, however, shows us that the results for melanoma, which is the most important pathology here, still do not look good.

Let’s go in the other direction and experiment with a network that is smaller than ResNet34, instead of larger: ResNet18.

ResNet18

Why not, then, try a much smaller network and see how it behaves? Let’s try a ResNet18 with the dermatoscopic data!

For this purpose we will change our batch size and re-generate the DataBunch. ResNet18s are much smaller and we’ll have more memory available to us, so it makes sense to use a larger batch size:

bs = 48         # Batch size, 48 for medium images on a T4 GPU and ResNet18...
size = 448      # Image size, 448x448 is double the original
                # ImageNet size of the pre-trained ResNet we'll be using,
                # should be easy to train...
path = Path("./data") # The path to the 'train' folder you created...
# Limit your augmentations: it's medical data! You do not want to fantasize data...
# Warping, for example, will leave your images badly distorted, so don't do it!
# This dataset is big, so don't rotate the images either. Let's stick to flipping...
tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_rotate=None, max_warp=None, max_zoom=1.0)
# Create the DataBunch!
# Remember that you'll have images that are bigger than 448x448 and images that are smaller,
# so squish them all in order to occupy exactly 448x448 pixels...
data = ImageDataBunch.from_csv('data', csv_labels='HAM10000_metadata.csv', suffix='.jpg', fn_col=1, label_col=2,
                               ds_tfms=tfms, valid_pct=0.2, size=size, bs=bs)
print('Transforms = ', len(tfms))
# Save the DataBunch in case the training goes south... so you won't have to regenerate it..
# Remember: this DataBunch is tied to the batch size you selected.
data.save('imageDataBunch-bs-'+str(bs)+'-size-'+str(size)+'.pkl')
# Show the statistics of the Bunch...
print(data.classes)
data

Create the network:

learn = cnn_learner(data, models.resnet18, metrics=[accuracy, error_rate])
learn.model

The model will look like this (observe that the head of this smaller ResNet has only 1024 input features instead of the 4096 of the ResNet50):

Sequential(
(0): Sequential(
(0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace)
(3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
...
...
...
(1): Sequential(
(0): AdaptiveConcatPool2d(
(ap): AdaptiveAvgPool2d(output_size=1)
(mp): AdaptiveMaxPool2d(output_size=1)
)
(1): Flatten()
(2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Dropout(p=0.25)
(4): Linear(in_features=1024, out_features=512, bias=True)
(5): ReLU(inplace)
(6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Dropout(p=0.5)
(8): Linear(in_features=512, out_features=7, bias=True)
)
)

Let’s transfer-learn it for 15 epochs:

learn.fit_one_cycle(15, callbacks=[SaveModelCallback(learn, every='epoch', monitor='accuracy', name='derma-1')])
learn.save('derma-stage-1')
exportStageTo(learn, path)
85.5% accuracy…
interp = ClassificationInterpretation.from_learner(learn)
losses,idxs = interp.top_losses()
interp.plot_confusion_matrix(figsize=(5,5), dpi=100)
learn.recorder.plot_losses()

Numerically, this is 0.5% better than the ResNet34. Let’s look at how it behaves after fine-tuning:

# Unfreeze the network
learn.unfreeze()
# Find optimum learning rates
learn.lr_find()
# Include suggestion=True in order to obtain a suggestion on where to look...
learn.recorder.plot(suggestion=True)

Again, we have a curve that first reaches a plateau and then plunges into two holes. Only here the holes go deeper than the plateau. Let’s assume that, in this case, the holes are significant, accept the suggestion of the learning rate finder and set max_lr=slice(1e-5,1e-6):

LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
Min numerical gradient: 1.32E-06
Min loss divided by 10: 1.58E-07
# Unfreeze the network
learn.unfreeze()
learn.fit_one_cycle(35, max_lr=slice(1e-5,1e-6),
callbacks=[SaveModelCallback(learn, every='epoch', monitor='accuracy', name='derma18')])
learn.save('derma18-stage-2')
exportStageTo(learn, path)

88% accuracy! That is much worse than the ResNet34. Let’s plot the learning curves:

learn.recorder.plot_losses()

What we see here is that the ResNet18 oscillates even more than the ResNet50 did when we trained it with a learning rate that was too small. This is an indication that here, too, the learning rate was too small.

I repeated the fine-tuning experiment #2 we did before with the ResNet50, set the learning rate range to max_lr=slice(1e-4,1e-5) and trained again. In this case the ResNet18 achieved a validation accuracy of 0.904643. The result graphics look like this:

This is better than before but still worse than the ResNet34 and the ResNet50. The ResNet18 does not seem to be a good choice for this problem. The code is in our notebook.
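
For reference, that repeated run follows the same pattern as the ResNet50 experiment #2; a sketch (the exact version is in the notebook, and the final save name here is only illustrative):

# Go back to the ResNet18 state right after transfer learning and fine-tune again (sketch)
learn.load('derma-stage-1')
learn.unfreeze()
learn.fit_one_cycle(35, max_lr=slice(1e-4,1e-5),
                    callbacks=[SaveModelCallback(learn, every='epoch', monitor='accuracy', name='derma18')])
learn.save('derma18-stage-2b')   # save under a new, illustrative name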

What do I do if the training was interrupted halfway through?

What do you do if your training was interrupted? This can happen because you reached your 12 hours of continuous “free” operating time on a Google Colab notebook or because your computer stopped for some reason. I live in Brazil, where power outages are common…

The fit_one_cycle method works with varying, adaptive learning rates, following a curve where the rate is first increased and then decreased. If you interrupt a training run at epoch #10 of, say, 20 epochs and then simply start a new run for the remaining epochs, you will not get the same result as training uninterruptedly for 20 epochs. You have to be able to record where you stopped and then resume the training cycle from that point, with the correct hyperparameters for that part of the cycle.

A fit_one_cycle training session divided into three subsessions. Image by PPW@GitHub

The first thing you have to do is to save your network every epoch:

learn.fit_one_cycle(20, max_lr=slice(1e-5,1e-6), 
callbacks=[SaveModelCallback(learn, every='epoch',
monitor='accuracy', name='saved_net')])

This will save your network at every epoch, under the name you provided followed by _#epoch. So at epoch #3, the file saved_net_3.pth will be written. You can load this file after you have:

  • re-created the DataBunch and
  • re-instantiated the network with it.
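
If you need to reload one of these checkpoints manually (for inspection, or to continue in a fresh Colab session), a minimal sketch, assuming the DataBunch was saved with data.save() as above, looks like this:

# Re-create the DataBunch from the file saved earlier and re-instantiate the learner (sketch)
data = load_data(Path('./data'), 'imageDataBunch-bs-64-size-448.pkl')
learn = cnn_learner(data, models.resnet34, metrics=[accuracy, error_rate])
# Load the weights written by SaveModelCallback at, e.g., epoch #3
learn.load('saved_net_3')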

After reloading the .pth file, you can restart your training, only you’ll tell fit_one_cycle to consider 20 epochs, but to start training from epoch #4.

To learn how this is done, look here:

How do you do it?

The fit_one_cycle method in fast.ai has been developed to allow you to tell it from which part of the cycle to resume an interrupted training. The code for resuming a training will look like this:

# Create a new net if training was interrupted and you had to 
# restart your Colab session
learn = cnn_learner(data, models.<your_model_here>,
metrics=[accuracy, error_rate])
# If you're resuming, only indicating the epoch from which to
# resume, indicated by start_epoch=<epoch#> will load the last
# saved .pth, it is not necessary to explicitly reload the last
# epoch, you only should NOT change the name given in
# name=<callback_save_file>:
# when resuming fast.ai will try to reload
# <callback_save_file>_<previous_epoch>.pth
# Unfreeze the network
learn.unfreeze()
# Use start_epoch=<some_epoch> to resume training...
learn.fit_one_cycle(20, max_lr=slice(1e-5,1e-6),
start_epoch=<next_epoch#>,
callbacks=[SaveModelCallback(learn,
every='epoch', monitor='accuracy',
name=<callback_save_file>)])

fast.ai will tell you “Loaded <callback_save_file>_<previous_epoch#>” and resume training.

You can look at all parameters supported by the fit_one_cycle method here:

What have we achieved?

Employing an image resolution of 448x448 and fast.ai, we obtained a validation accuracy of roughly 93% with two of the three network models we employed in our experiment, ResNet34 and ResNet50. This is much better than the 85% of the tutorial above, which employed a MobileNet, an image resolution of 224x224 and Keras. The (presently) most upvoted kernel on Kaggle (which employs TensorFlow.js) obtained an accuracy of 86%.

In the final test of task #3 of the ISIC 2018 challenge, the highest-ranking competitor, Jordan Yap from MetaOptima Technology Inc., achieved an accuracy of 95.8% and a balanced multiclass accuracy (the evaluation criterion chosen by ISIC) of 88.5%. Jordan Yap employed a method based upon:

  • additional, external data [33,644 images];
  • images with a low resolution, equivalent to the original ImageNet-resolution;
  • an ensemble of 19 classification algorithms, of which one was not a neural network (histogram analysis);
  • an XGBoost Classifier, which was trained on top of the results of this ensemble.
Aleksey Nozdryn-Plotnicki, Jordan Yap, and William Yolland, MICCAI, 2018

The paper by Aleksey Nozdryn-Plotnicki, Jordan Yap and William Yolland, submitted to the ISIC Skin Image Analysis Workshop and Challenge @ MICCAI 2018, together with their results, is here.

Our results are not directly comparable to the original ISIC 2018 challenge because there ISIC provided a test set of 1512 images manually extracted from the HAM10000 dataset, and all competitors had to employ that test set and submit their results to ISIC. We validated our training with a set of 2003 images randomly extracted from the HAM10000 dataset and trained our networks with the remaining 8012 images.

Statistics of Jordan Yap from MetaOptima Technology Inc.

You can look at the ISIC 2018 results here:

It is, however, interesting to note that the accuracy of 92.9% we achieved with fast.ai and just one network, a ResNet34, and a total training time of 190 minutes on a NVIDIA T4 GPU (4.2 x 10 + 4.3 x 35 minutes), is only 2.9% worse than the accuracy the highest ranking competitor of the ISIC 2018 challenge obtained with a considerably more complex approach, employing a machine learning algorithm on top of 18 different neural networks.

This is most probably due to the fact that we employed twice the image resolution, allowing for much more details, but the HYPOs (hyper parameter optimizations) in fast.ai have probably also played a role.

What have we learned?

ResNet34 is a good choice to start: ResNets are really well-managed in fast.ai, with various easy-to-use HYPOs (hyper parameter optimizations). If you do not know which network to use, take the ResNet34, which will be small enough to train relatively fast, even if you do it with your GPU at home, and big enough to represent a large set of problems. How ResNet34 trains will provide hints if you should go up or go down with the size of your network.

Blindly accepting the learning rate suggestion is not always the best option: lr_find() for the ResNet50 produced a long plateau, and the method suggested a very low learning rate value in a small valley at the left end of the graphic. When we trained the network with this value, it oscillated and did not produce a good result (90% accuracy only). When we employed a visual analysis of the graph and took as the lower bound for the learning rate range a value one order of magnitude higher, which lay at the beginning of the flatter part of the plateau, the ResNet50 learned much better and achieved the same 93% accuracy as the ResNet34. So, use the suggestion=True mode, but give it a no-nonsense treatment before you accept it, actually looking at the graphic. This is the second rule of thumb for the learning rate range: look at the whole graphic and find the real plateau — your ideal lower bound for the learning rate range will lie towards the beginning of this plateau.

The method learn.lr_find() helps you find an optimal learning rate. It uses the technique developed in the 2015 paper Cyclical Learning Rates for Training Neural Networks (http://arxiv.org/abs/1506.01186), where we simply keep increasing the learning rate from a very small value until the loss stops decreasing and starts to grow. If you want to know more about finding the best learning rates, look here:

Bigger is not always better: in the end, the ResNet50 performed almost identically to the ResNet34, but took much more time to train and showed slightly worse results. It is a bad choice to begin your exploration of the training space with a large network model.

Image resolution plays a role: employing twice the resolution the highest-ranking competitor of the ISIC 2018 Challenge used, we obtained comparable results with a single, relatively simple ResNet34, while that competitor employed a machine learning method on top of 18 different networks, including a huge ResNet152.

Fast.ai is fast: Finally, compared to the other approaches, with fast.ai we were able to solve the same classification problem with much less code, while using high-level hyperparameter optimization strategies that allowed us to train much faster. At the same time, a set of high-level functions also allows us to easily inspect the results, both as tables and as graphs. This simplicity allowed us to experiment with three different network models and compare their results. It shows that fast.ai is a very promising alternative to more traditional CNN frameworks, especially if the task at hand is a “standard” deep learning task such as image classification, object detection or semantic segmentation that can be solved by fine-tuning off-the-shelf pre-trained network models.

Acknowledgements

This work was the result of a collaborative effort of a team of engaged researchers besides me:

We also wish to thank Jürgen Kreusch <juergen.kreusch@gmail.com> and Luis Fernando Kopke <luiskopke@uol.com.br> for the stereodermoscopic material.

References

[1] Kreusch, J. Incident light microscopy: reflections on microscopy of the living skin. Int J Dermatol. 1992 Sep;31(9):618–20.

[2] Kopke, L.F. Dermatoscopy in the early detection, control and surgical planning of basal cell carcinomas. Surg Cosmet Dermatol 2011;3(2):103–8.


Computer Sciences Professor at UFSC. I do research on image processing, artificial intelligence and telemedicine. https://www.inf.ufsc.br/~aldo.vw/