1st place solution for Kaggle’s skin cancer (Melanoma) Competition

5 months ago I participated in my first official Kaggle competition, SIIM-ISIC Melanoma Classification. Here is a brief overview of what the competition was about (from Kaggle):
Skin cancer is the most prevalent type of cancer. Melanoma, specifically, is responsible for 75% of skin cancer deaths, despite being the least common skin cancer. The American Cancer Society estimates over 100,000 new melanoma cases will be diagnosed in 2020. It’s also expected that almost 7,000 people will die from the disease. As with other cancers, early and accurate detection – potentially aided by data science – can make treatment more effective.
Personally, I feel very motivated when working on something I believe in, and that feeling stayed with me throughout this competition: skin cancer ruins a lot of people's lives. As someone who is passionate about machine learning, and especially its applications in healthcare, this was a great fit. Since this was my first competition, most of my time went into learning rather than actually coding, and I ended up in the top 72%.
This story will mainly be about my experience and the first place solution (from which there is a lot to learn).
One of the more difficult challenges of this competition was that the dataset was extremely unbalanced. The problem itself was binary image classification, but the class distribution was roughly 98% benign versus only about 2% malignant! For the sake of brevity, and to avoid duplicating content, I found this article quite helpful for tackling the imbalance issue.
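As a minimal sketch of one common remedy (my own illustration, not taken from that article or the winning solution), you can oversample the rare malignant class in PyTorch with a weighted sampler; the labels array below is a stand-in for the competition's target column:

import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical label array: 0 = benign (~98%), 1 = malignant (~2%)
labels = np.array([0] * 980 + [1] * 20)

# Weight each sample by the inverse frequency of its class,
# so malignant images are drawn far more often per epoch.
class_counts = np.bincount(labels)
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(labels),
    replacement=True,
)
# Pass this sampler to DataLoader(dataset, sampler=sampler, ...) instead of shuffle=True.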
The rest of this story will be about tackling other challenges in this competition.
1. Modelling:
In terms of the actual model, the CNN architectures discussed below were extremely popular across solutions, so it made a lot of sense that the top solution used an ensemble of them. I won't dive into the details of how each of them works, since there are tons of resources for that, but I will give an overview.
One of the main challenges in designing CNNs is scaling them up: deciding how wide and how deep to make them. These architectures tackle that challenge in different ways. If you just keep increasing the number of layers (the brute-force solution!), you run into problems such as vanishing gradients.
- ResNet & ResNeXt:
Residual networks are built around the idea of "residual blocks", which use identity skip connections. If you want to understand this further, check out this story:
https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035

ResNeXt is a variation of ResNet that adds a "split-transform-merge" step: the input is processed by several parallel paths whose outputs are merged by adding them together.
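As a rough sketch of the core idea (my own illustration, not the winners' code), a basic residual block in PyTorch simply adds the block's input back onto its output before the final activation:

import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity skip connection

Because the skip connection lets gradients flow straight through the addition, very deep stacks of these blocks can still be trained, which is how ResNets avoid the vanishing gradient issue mentioned above.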
- EfficientNet:
EfficientNet is, to me, the most impressive one, probably because it has only been around for about 2 years. Its main idea is an efficient technique for scaling up a CNN so that you get higher accuracy with far fewer parameters.

Explaining exactly how this works is a bit out of the scope of this story, but you can get a sense of it if you are familiar with the different ways of scaling a CNN: making it deeper, making it wider, or feeding it higher-resolution inputs. EfficientNet scales all three together with a single "compound" coefficient instead of growing only one of them.
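Here is a tiny sketch of that compound scaling rule, using the alpha, beta and gamma constants reported in the EfficientNet paper (the loop over phi values is my own illustration):

# Compound scaling from the EfficientNet paper:
# depth ~ alpha**phi, width ~ beta**phi, resolution ~ gamma**phi,
# with alpha * beta**2 * gamma**2 ~= 2, so FLOPs grow roughly 2**phi.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for a given phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

for phi in range(4):  # larger phi corresponds to the bigger EfficientNet variants
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")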

- Fully Connected Networks:
One of the lessons I learned from this competition is to use all of the data you are given. I was mostly focused on the images and did not pay much attention to the metadata that was provided for each patient. The first place solution concatenated the CNN outputs with the output of a fully connected network trained on that metadata. This fully connected network used an activation function called Swish (x * sigmoid(x)), which is an upgrade of the classic ReLU.
For this fully connected network they also used Batch Normalisation and Dropout, which are typical components in many networks.
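A minimal sketch of what such a metadata head might look like (my own simplified version with arbitrary layer sizes, not the winners' exact code):

import torch
import torch.nn as nn

class Swish(nn.Module):
    # Swish activation: x * sigmoid(x) (also known as SiLU)
    def forward(self, x):
        return x * torch.sigmoid(x)

class MetaHead(nn.Module):
    """Small fully connected network for patient metadata (sizes are illustrative)."""
    def __init__(self, n_meta_features, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_meta_features, 256),
            nn.BatchNorm1d(256),
            Swish(),
            nn.Dropout(0.3),
            nn.Linear(256, out_dim),
            nn.BatchNorm1d(out_dim),
            Swish(),
        )

    def forward(self, meta):
        return self.net(meta)

# The CNN's pooled image features and the metadata features are then concatenated
# before the final classification layer, e.g.:
# combined = torch.cat([cnn_features, meta_head(meta)], dim=1)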
This is the final model pipeline:

The CNN model block represents an ensemble of the 3 CNN models discussed above. Ensembling was another new trick for me, and it proved to be very effective. Not only does it give you better results, but because the final competition rankings are evaluated on a held-out subset of the test data, ensembling gives you a much better chance of surviving a "shake down": the situation where models that overfit the public leaderboard score much worse on the private test set, so you drop in the final ranking.
I also found it quite helpful to think about a machine learning solution as a pipeline and to visualise it that way. As a web developer I am used to data architecture diagrams, yet I had never applied the same approach to machine learning (which was a big mistake).
2. Data processing tricks and feature engineering:
One of the main reasons they managed to reach first place was that they used a much bigger ensemble of networks trained on a bigger combination of datasets. In their own words:
"To have stable validation, we used 2018+2019+2020's data for both train and validation. We track two cv scores, cv_all and cv_2020. The former is much more stable than the latter."
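For example, computing both scores from out-of-fold predictions could look roughly like this (a sketch of my own; the dataframe and column names such as "year" and "target" are assumptions, not their exact code):

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical out-of-fold predictions; "year" marks which dataset a row came from.
oof = pd.DataFrame({
    "target": np.random.randint(0, 2, size=1000),
    "pred": np.random.rand(1000),
    "year": np.random.choice([2018, 2019, 2020], size=1000),
})

cv_all = roc_auc_score(oof["target"], oof["pred"])
is_2020 = oof["year"] == 2020
cv_2020 = roc_auc_score(oof.loc[is_2020, "target"], oof.loc[is_2020, "pred"])
print(f"cv_all={cv_all:.4f}, cv_2020={cv_2020:.4f}")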
Another common trick to further increase the effective size of the dataset is data augmentation. They used a very good mixture of simple and complex augmentation techniques. The simple ones included vertical & horizontal flips, random brightness & contrast, and resizing; the complex ones included Gaussian blur / Gaussian noise, elastic transform and grid distortion. There are others you can pick up from their code:
import albumentations as A

# image_size is defined elsewhere in their training script
transforms_train = A.Compose([
    A.Transpose(p=0.5),
    A.VerticalFlip(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightness(limit=0.2, p=0.75),
    A.RandomContrast(limit=0.2, p=0.75),
    A.OneOf([
        A.MotionBlur(blur_limit=5),
        A.MedianBlur(blur_limit=5),
        A.GaussianBlur(blur_limit=5),
        A.GaussNoise(var_limit=(5.0, 30.0)),
    ], p=0.7),
    A.OneOf([
        A.OpticalDistortion(distort_limit=1.0),
        A.GridDistortion(num_steps=5, distort_limit=1.),
        A.ElasticTransform(alpha=3),
    ], p=0.7),
    A.CLAHE(clip_limit=4.0, p=0.7),
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=20, val_shift_limit=10, p=0.5),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1, rotate_limit=15, border_mode=0, p=0.85),
    A.Resize(image_size, image_size),
    A.Cutout(max_h_size=int(image_size * 0.375), max_w_size=int(image_size * 0.375), num_holes=1, p=0.7),
    A.Normalize()
])

transforms_val = A.Compose([
    A.Resize(image_size, image_size),
    A.Normalize()
])
Source: GitHub
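For context, these albumentations pipelines are applied per image inside the dataset class. A minimal sketch of how that usually looks (my own illustration, not their exact dataset code):

import cv2
import torch
from torch.utils.data import Dataset

class MelanomaDataset(Dataset):
    """Illustrative dataset: reads an image and applies the albumentations pipeline."""
    def __init__(self, image_paths, targets, transforms):
        self.image_paths = image_paths
        self.targets = targets
        self.transforms = transforms

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = cv2.imread(self.image_paths[idx])
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = self.transforms(image=image)["image"]  # albumentations call
        image = torch.tensor(image, dtype=torch.float32).permute(2, 0, 1)  # HWC -> CHW
        return image, torch.tensor(self.targets[idx], dtype=torch.float32)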
And finally, they ranked each model's final predictions before combining them, so that the different models' outputs were evenly distributed and contributed equally. One important point to note here is that data augmentation has become a very heavily used technique in almost all modern machine learning projects; I have seen it in a great number of competitions and solutions.
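A small sketch of that ranking trick (my own illustration): each model's test predictions are converted to rank percentiles before averaging, so models with differently scaled outputs contribute equally.

import numpy as np
from scipy.stats import rankdata

def rank_average(prediction_lists):
    """Average the rank percentiles of several models' predictions."""
    ranked = [rankdata(p) / len(p) for p in prediction_lists]
    return np.mean(ranked, axis=0)

# Example: three models with very different output scales.
model_a = np.array([0.01, 0.90, 0.05, 0.70])
model_b = np.array([0.20, 0.80, 0.30, 0.60])
model_c = np.array([-1.2, 2.5, -0.4, 1.1])   # e.g. raw logits
print(rank_average([model_a, model_b, model_c]))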
- Conclusion:
In terms of my own model, I simply used a ResNeXt model with K-fold cross-validation. From this competition I learned about ensembling, concatenating image features with metadata, the power of data augmentation and many other bits. I think this is one of the main benefits of Kaggle competitions: they expose you to an awesome community and its solutions. Many thanks as well to the "Notebooks" section of the competition, where everyone shares their solutions.
A very good tip for beginner Kagglers is to always take a look at the top solutions; this is where a huge amount of the learning happens. Compare them to your own solution and evaluate the differences. My main takeaway was not to put all of my effort into modelling, but to focus on data engineering and preprocessing as well.
Finally, don’t use only one type of CNN architecture; use several and ensemble them, as most of the time this will give you better results. I am not going to dive into the advantages of ensembling, since that has been covered thoroughly elsewhere; if you are interested, you can check out this article: