Using pretrained deep convolutional neural networks for binary classification of COVID-19 CT scans

Raunak Sood
Towards Data Science
18 min read · Jul 7, 2020


By: Raunak Sood¹, Stephen Humphries², PhD.

¹Senior, Acton-Boxborough Regional High School, MA; High School Intern, Quantitative Imaging Laboratory, National Jewish Health

²Assistant Professor of Radiology; Director, Quantitative Imaging Laboratory, National Jewish Health, Denver, CO.

GitHub

LinkedIn

Introduction

The highly virulent COVID-19 virus, which originated in Wuhan, China, has resulted in a global pandemic affecting the lives of many. As of this writing, over 11.6 million people have been diagnosed with the virus and more than 539 thousand people have died worldwide¹. Many groups are researching different ways to diagnose COVID-19: RT-PCR (reverse transcription polymerase chain reaction) tests, antibody tests and CT scans. Although RT-PCR is the CDC-recommended method of COVID-19 diagnosis, tests can take up to two days to complete². Additionally, antibody tests are only useful after a patient has built up immunity to the virus. Using CT scans for diagnosis is promising, although studies³ have shown that they are not reliable as a single diagnostic test.

Nevertheless, recent studies have used artificial intelligence to diagnose COVID-19 from lung images and augment radiologists in their analysis. In their articles, Trehan⁴ and Markevych⁵ used convolutional neural networks built from scratch with the TensorFlow library to classify X-ray and CT images respectively. In this article, however, I am going to use transfer learning with state-of-the-art models such as VGG and ResNet to classify COVID-19 positive and negative lung CT scans. Moreover, I will be working in PyTorch.

Project Workflow

Data

I used the open source dataset from the COVID-19 CT Grand Challenge⁶, which is a set of over 750 PNG images of lung CT scans, about half of which are COVID-19 positive.

Let’s view some example images.

Example images in data set

You may be able to tell which of these images is COVID-19 positive. The cloudy white region in the left lobe of one scan is an example of ground glass opacity (GGO), one of the key characteristics used to identify COVID-19 in lung CT, while the other scan shows no trace of this ground glass appearance. Note that lung CT images containing ground glass characteristics are not always COVID-19 positive; other conditions such as infectious diseases, interstitial lung diseases and acute alveolar diseases also display GGO in lung CT scans⁷. But, in this case, that should not be a concern, as this is a binary classification problem. Also, not all of the images are this easy to identify; for example, let's look at the following image.

What do you think? Turns out, it is COVID-19 positive because both lobes have GGO. Isn’t this a much harder example? Let’s see if a convolutional neural network can classify these images better.

Data Preprocessing

Before we get started training a model, we have a few preprocessing steps to complete first.

  1. Read in the images and label them
  2. Resize the images
  3. Split the data into training, validation and testing sets
  4. Convert the images to PyTorch tensors
  5. Add image augmentation

So let’s start. Note that all of the code in this article is available in my GitHub COVID-19 classification repository.

The directory structure:

C:./Images-processed-new
├── CT_COVID
└── CT_NonCOVID

We first import the libraries necessary for preprocessing. Then, for every image in each directory, we read in the image with RGB channels, resize it to (224, 224) and append the image with its associated label to the list training_data. We have to resize the images because the transfer learning models that we will use later only accept images of this shape. We then label the images: 1 for COVID positive and 0 for COVID negative. Note: the tqdm library is used to create a progress bar.
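
A minimal sketch of that loading loop is below. The folder names match the directory tree above, and variable names such as training_data follow the description; the exact code in the repository may differ.

```python
import os
import cv2
from tqdm import tqdm

DATA_DIR = "./Images-processed-new"
CATEGORIES = {"CT_NonCOVID": 0, "CT_COVID": 1}  # 1 = COVID positive, 0 = COVID negative
IMG_SIZE = 224

training_data = []
for folder, label in CATEGORIES.items():
    path = os.path.join(DATA_DIR, folder)
    for fname in tqdm(os.listdir(path)):          # tqdm shows a progress bar
        img = cv2.imread(os.path.join(path, fname))
        if img is None:                           # skip unreadable files
            continue
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # read with RGB channels
        img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))  # resize to 224 x 224
        training_data.append([img, label])
```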

Now we have to split the data into training, validation and testing sets and convert them to PyTorch tensors.
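
A sketch of that step, continuing from the training_data list built above (split proportions follow the description in the next paragraph; exact variable names are illustrative):

```python
import random
import numpy as np
import torch

random.shuffle(training_data)

VAL_SPLIT = int(0.1 * len(training_data))      # ~10% each for validation and testing
train_set = training_data[:-2 * VAL_SPLIT]     # remaining ~80% for training
val_set   = training_data[-2 * VAL_SPLIT:-VAL_SPLIT]
test_set  = training_data[-VAL_SPLIT:]

def to_tensors(dataset):
    """Turn a list of [image, label] pairs into normalized image and label tensors."""
    X = np.array([img for img, _ in dataset], dtype=np.float32)
    y = np.array([label for _, label in dataset], dtype=np.int64)
    # (N, H, W, C) -> (N, C, H, W), scaled from [0, 255] to [0, 1]
    X = torch.from_numpy(X).permute(0, 3, 1, 2) / 255.0
    return X, torch.from_numpy(y)

X_train, y_train = to_tensors(train_set)
X_val,   y_val   = to_tensors(val_set)
X_test,  y_test  = to_tensors(test_set)
```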

Let’s break this code into parts. I create a split size which is 10% of the entire data, so the training set is 80% of the data (~600 images), and the testing and validation sets are each 10% of the total data (~75 images). Then for each set, I break up the original data into the images (X) and the labels (y) using list comprehension. Then, I convert each array into PyTorch tensors and reshape them into the format (number of images, number of channels, image width, image height). This shape format is the way PyTorch models accept tensor data. Finally, I normalize the pixel intensities for the images. PNG images have pixel intensities between 0 and 255; however, they need to be normalized between 0 and 1 before the training process.

In PyTorch, data loaders are used to create batches of training images and to apply transforms to the images. So, we have to wrap our code into a Dataset class that we can feed into a DataLoader object along with any associated transforms.

The __init__ method is essentially the same as the above code, formatted to fit inside the Dataset class. The __len__ and __getitem__ methods have to be overridden to specify how our images can be accessed. I use the "data" argument to specify whether the set is training, validation or testing. You can read more about the Dataset class in the PyTorch documentation.
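
A sketch of such a class is below. In the post, __init__ performs the loading itself; to keep this sketch short it simply reuses the split tensors built above, while keeping the "data" argument that selects the split. The class name is illustrative.

```python
from torch.utils.data import Dataset

class CovidCTDataset(Dataset):
    """Wraps one split of the preprocessed tensors so a DataLoader can batch and transform it."""

    def __init__(self, data="train", transform=None):
        # "data" selects which split this instance serves.
        splits = {
            "train": (X_train, y_train),
            "val":   (X_val, y_val),
            "test":  (X_test, y_test),
        }
        self.X, self.y = splits[data]
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        image, label = self.X[idx], self.y[idx]
        if self.transform is not None:
            image = self.transform(image)   # apply MONAI/torch transforms per item
        return image, label
```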

Finally, we have to create the data loader object along with the transforms that can be applied to the images.

I used MONAI, a medical imaging library built on top of PyTorch, for the image transforms. The LoadPNG(), AddChannel() and ToTensor() transforms are essential for converting the raw PNG images into machine-learnable tensors. For the training set, we also use ScaleIntensity() to make sure the image intensity is normalized; RandRotate(), RandFlip() and RandZoom() provide geometric transformations, and RandGaussianNoise(), as the name suggests, adds noise to the image. These transformations mimic real-world CT images, which can be morphed in different ways and are likely to contain noise. Overall, these image augmentation techniques make the model more robust and generalizable to poor-quality images. For the validation and test sets, I only added the first three transforms because I don't want to modify those images.
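
A sketch of the two transform chains is below. The post's full pipeline starts with LoadPNG() and AddChannel(), which read the PNG from disk and insert a channel dimension when the chain is handed raw file paths; since the Dataset sketch above already holds loaded, channel-first tensors, this version starts at ScaleIntensity(). Argument names and values are illustrative and may differ slightly between MONAI releases.

```python
from monai.transforms import (
    Compose, ScaleIntensity, RandRotate, RandFlip, RandZoom,
    RandGaussianNoise, ToTensor,
)

# Training chain: normalize intensities to [0, 1], then random geometric
# augmentations and Gaussian noise.
train_transforms = Compose([
    ScaleIntensity(),
    RandRotate(range_x=0.26, prob=0.5, keep_size=True),  # up to ~15 degrees (radians)
    RandFlip(spatial_axis=1, prob=0.5),                   # random horizontal flip
    RandZoom(min_zoom=0.9, max_zoom=1.1, prob=0.5),
    RandGaussianNoise(prob=0.5),
    ToTensor(),
])

# Validation/test chain: no augmentation, only normalization and tensor conversion.
val_transforms = Compose([
    ScaleIntensity(),
    ToTensor(),
])
```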

Next, I instantiated the Dataset class and passed it to the DataLoader object. I set the batch size to 32 initially, but we will manipulate that value and see the results later in the article. We can view some of the images in the dataset to see what the images look like going into the model.
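
A sketch of that step, using the class and transform names from the snippets above:

```python
from torch.utils.data import DataLoader

train_dataset = CovidCTDataset(data="train", transform=train_transforms)
val_dataset   = CovidCTDataset(data="val",   transform=val_transforms)
test_dataset  = CovidCTDataset(data="test",  transform=val_transforms)

BATCH_SIZE = 32  # varied later in the article
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader   = DataLoader(val_dataset,   batch_size=BATCH_SIZE, shuffle=False)
test_loader  = DataLoader(test_dataset,  batch_size=BATCH_SIZE, shuffle=False)
```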

The images are slightly altered compared to the original images before preprocessing.

Great! We are done with the preprocessing step and ready to develop our models using transfer learning.

Model Development

I used transfer learning for model development. I tested the performance of three different models with varying depths: VGG-16⁹, VGG-19⁹ and ResNet-34¹⁰. These three models were all pretrained on the ImageNet dataset (over 14 million images; the standard classification task has 1,000 classes), achieving state-of-the-art accuracies.

The first step is to decide whether you want to train on a GPU or not. I have an Nvidia GeForce GTX local GPU, so I am going to allocate the model to the GPU.

Note: setting up the GPU can be a complicated process if you don't already have CUDA on your local device. I highly recommend reading this article for a simple setup process. It sets up a CUDA environment for TensorFlow, but it works the same way with PyTorch.
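
Assuming CUDA is installed, the standard PyTorch device-selection snippet looks like this:

```python
import torch

# Use the GPU if CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```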

Then, I have to "freeze" all of the weights in the model, meaning I make them untrainable so they are not updated via gradient descent. To do this, I loop through the model parameters and set requires_grad to False. If you want to learn more about transfer learning in PyTorch, refer to this article.

Here is the code for this:
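
The sketch below assumes a VGG-16 loaded from torchvision; the same pattern applies to VGG-19 and ResNet-34 (for ResNet, the classification layer is model.fc rather than model.classifier).

```python
import torchvision.models as models

# Load a pretrained VGG-16 and freeze its weights so gradient descent
# does not update them during fine-tuning.
model = models.vgg16(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

model = model.to(device)  # move the model to the GPU selected above
```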

Finally, I create a sequence of trainable layers to replace the final classification layer in the model. The original model has 1000 classes, but this is a binary classification problem so I need to end up with a two class output.

I used the Sequential container from the torch.nn library. A nice feature of Sequential is that it accepts an OrderedDict, so I can give each layer a name and keep track of the various layers. I chose to use 4 layers to allow sufficient trainable parameters to learn from the data; of course, the number of layers is a hyperparameter that can be tuned. The number of neurons in each layer is another tunable hyperparameter. You will often see powers of two used for the number of neurons, so I stuck to that. Finally, you may notice that I used dropout regularization between the layers. This helps prevent the model from overfitting the training data, which would lower accuracy on the validation and testing data.
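
A sketch of the replacement head for VGG-16 is below; the layer sizes and dropout rate are illustrative hyperparameters rather than the exact values used in the repository.

```python
from collections import OrderedDict
import torch.nn as nn

# Replace VGG-16's final 4096 -> 1000 classification layer with a small
# trainable head ending in two outputs.
classifier_head = nn.Sequential(OrderedDict([
    ("fc1", nn.Linear(4096, 1024)),
    ("relu1", nn.ReLU()),
    ("drop1", nn.Dropout(0.4)),
    ("fc2", nn.Linear(1024, 512)),
    ("relu2", nn.ReLU()),
    ("drop2", nn.Dropout(0.4)),
    ("fc3", nn.Linear(512, 128)),
    ("relu3", nn.ReLU()),
    ("drop3", nn.Dropout(0.4)),
    ("fc4", nn.Linear(128, 2)),    # two classes: COVID positive / negative
]))

model.classifier[6] = classifier_head  # index 6 is VGG's last Linear layer
```

For ResNet-34 the same idea applies, except the new head (with 512 input features) replaces model.fc.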

Transfer Learning Workflow

As I said earlier, I compared three state of the art models. The process of freezing weights and replacing the final classification layer is similar for all three models, so I am not going to show that here.

Now our model is ready for training.

Training

Unlike in the TensorFlow and Keras libraries, we have to write our own training loops in PyTorch. But this is actually an intuitive process, so let's begin.

I am going to define a method called train that takes the model, training loader, validation loader, optimizer, loss function, number of epochs and, optionally, a learning rate scheduler as arguments. The function will train the model and return four lists: training accuracy, training loss, validation accuracy and validation loss.

The method starts like this:
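
In sketch form (argument names follow the description above; the full version is in the repository):

```python
def train(model, train_loader, val_loader, optimizer, criterion, epochs, scheduler=None):
    """Train the model, returning per-epoch accuracy and loss for both splits."""
    train_accuracies, train_losses = [], []
    val_accuracies, val_losses = [], []
```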

We then loop through the number of epochs (number of times the model passes through the data), and initialize variables to keep track of the loss and accuracy metrics (this will be important later in the article).

Then, we loop through the images and target classes in the training data loader, find the model’s prediction, compare the model’s prediction with the ground truth (computing the loss function) and update the weights accordingly.

I also wrote a smaller loop to calculate the number of correct predictions and the total predictions, so we can calculate the accuracy.

Next, we set the model to evaluation mode and compute the validation loss and accuracy. The process is similar to that of the training step except that the model should not learn in this step. This means that we don't update the weights for this data. I created a helper function called validation to calculate the validation loss and accuracy. I do not show it here but you can see the full code on my GitHub.

We print out the loss and accuracy metrics after each epoch and append these metrics to the lists. You may be wondering why I stored all of the metrics in a list. This will come in handy when we view the loss and accuracy curves.
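
Continuing inside train(), the body of the epoch loop might look roughly like this (validation() is the helper mentioned above and is not reproduced here):

```python
    for epoch in range(epochs):
        model.train()
        running_loss, correct, total = 0.0, 0, 0

        for images, targets in train_loader:
            images, targets = images.to(device), targets.to(device)

            optimizer.zero_grad()
            outputs = model(images)              # forward pass
            loss = criterion(outputs, targets)   # compare predictions with ground truth
            loss.backward()                      # backpropagation
            optimizer.step()                     # update only the trainable (unfrozen) weights
            if scheduler is not None:
                scheduler.step()                 # cyclic schedulers step once per batch

            running_loss += loss.item()
            predictions = outputs.argmax(dim=1)
            correct += (predictions == targets).sum().item()  # count correct predictions
            total += targets.size(0)

        train_loss = running_loss / len(train_loader)
        train_acc = correct / total

        # Validation: same forward pass, but no weight updates.
        model.eval()
        with torch.no_grad():
            val_loss, val_acc = validation(model, val_loader, criterion)

        print(f"Epoch {epoch + 1}/{epochs}: "
              f"train loss {train_loss:.4f}, acc {train_acc:.4f} | "
              f"val loss {val_loss:.4f}, acc {val_acc:.4f}")

        train_accuracies.append(train_acc); train_losses.append(train_loss)
        val_accuracies.append(val_acc);     val_losses.append(val_loss)

    return train_accuracies, train_losses, val_accuracies, val_losses
```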

Now we can use the method to train our model.

For this trial, I used binary cross entropy loss and Adam optimization as these are standard for binary classification tasks. I also trained the model for 30 epochs because previous attempts showed that the loss stabilizes near this point.
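
A sketch of the training call is below. The post mentions binary cross-entropy; with the two-logit head used above, nn.CrossEntropyLoss is the equivalent drop-in, and the learning rate shown is illustrative.

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
# Only the unfrozen parameters (the new classification head) are optimized.
optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)

train_acc, train_loss, val_acc, val_loss = train(
    model, train_loader, val_loader, optimizer, criterion, epochs=30
)
```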

This was the output after 30 epochs:

Results

I trained VGG16, VGG19 and ResNet-34 while varying the number of epochs, batch size and learning rate schedulers. These are the best results obtained for each pretrained network.

Discussion

Before we get too far on the analysis of the models’ performance, let’s look a bit closer at cyclic learning rate scheduling. According to Smith (the author of the original cyclic learning rate paper), the idea behind this type of learning rate scheduling is to allow the learning rate to cyclically vary between a certain range, instead of systematically increasing or decreasing it¹¹.

Cyclic learning rate visualization¹²

I used cyclic learning rate scheduling because it was easier to tune and showed better results than other schedulers that I tried.
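
In PyTorch this scheduler is available as torch.optim.lr_scheduler.CyclicLR. The bounds below are illustrative rather than the exact values behind the results above; note that cycle_momentum must be disabled when the optimizer is Adam.

```python
from torch.optim.lr_scheduler import CyclicLR

# Oscillate the learning rate between base_lr and max_lr in a triangular pattern.
scheduler = CyclicLR(
    optimizer,
    base_lr=1e-5,
    max_lr=1e-3,
    step_size_up=len(train_loader) * 2,  # half-cycle of two epochs
    mode="triangular",
    cycle_momentum=False,                # required when using Adam
)
```

The scheduler would then be passed as the scheduler argument of train(), which steps it once per batch.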

Alright, back to the model analysis.

The VGG16 model had the highest validation and testing accuracy after 30 epochs while the VGG19 model had the highest training accuracy. The ResNet-34 model performed the worst on all the sets. The VGG16 model was the only model that did not overfit, and this is probably because the model is shallower, so it cannot fit such complex functions. So, the shallower networks generally performed better because the data set was very small. The ResNet performed poorly in general because it is deeper than the other networks, so it probably required a longer training time or a larger batch size.

One thing I noticed while training the various models is that the validation/testing accuracy was sometimes higher than the training accuracy. This is most likely due to the small amount of data and the validation split size. The total number of images is around 750, and I used a validation split size of 0.1 meaning there were only about 75 images for validation. So, if even a few more images were classified correctly in a certain epoch, the validation accuracy would increase more than that of the training accuracy. This is a limitation for this project because there isn’t enough data to truly compare the validation accuracy to the training accuracy. However, we can still make a distinction between model performances because the validation accuracy can be compared between models, as the number of images in the validation set stays the same.

Although the results show that the VGG16 model performed the best, the testing dataset was quite small (only 74 images), so it is not enough to say that the VGG16 model is better without additional analysis. So, we will look at further analysis in the evaluation section.

Evaluation on Testing Data

Remember how I said storing the accuracy and loss metrics would be important? Now we can plot those values against the number of epochs and visualize how they change as the training process progresses.
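
With the lists returned by train(), a few lines of matplotlib produce plots like the ones below:

```python
import matplotlib.pyplot as plt

epochs_range = range(1, len(train_loss) + 1)

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, train_loss, label="Training loss")
plt.plot(epochs_range, val_loss, label="Validation loss")
plt.xlabel("Epoch"); plt.ylabel("Loss"); plt.legend()

plt.subplot(1, 2, 2)
plt.plot(epochs_range, train_acc, label="Training accuracy")
plt.plot(epochs_range, val_acc, label="Validation accuracy")
plt.xlabel("Epoch"); plt.ylabel("Accuracy"); plt.legend()
plt.show()
```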

VGG16 loss and accuracy plots
VGG19 loss and accuracy plots
ResNet-34 loss and accuracy plots

A trend you may notice in the plots is that the validation curves are much noisier than the training curves. This is because the validation set is only 74 images while the training set is around 550 images, so a few extra mistakes on the validation set hurt its loss and accuracy much more than a few mistakes on the training set. But overall, the loss does decrease over time for each model. Also, if you look at the training curves for all three models, both the ResNet-34 and VGG19 look as if they are flattening towards the end, while the VGG16 model seems as if it will continue to improve. In future work, I plan on training the models for more epochs to see whether the VGG16 would continue to improve.

We can continue to evaluate our models based on the Receiver Operating Characteristic (ROC) curve, which is a plot of the true positive rate against the false positive rate.

VGG16 (left) , VGG19 (middle) and ResNet-34 (right) ROC Curves

The ROC curve indicates how well the model is able to separate the two classes. An area under the curve (AUC) score close to 1 indicates that there are very few false positives and false negatives. The diagonal line has an AUC of 0.5, corresponding to a binary classifier that guesses at random. Clearly, the VGG16 model distinguishes best between the COVID-positive and COVID-negative CT scans. As we expected from the testing accuracies, the VGG19 model has the second-best AUC score and the ResNet-34 has the worst.
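
To produce a curve like these, one can collect the model's class-1 probabilities over the test loader and hand them to scikit-learn (a sketch, with names following the earlier snippets):

```python
import torch
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

model.eval()
probs, labels = [], []
with torch.no_grad():
    for images, targets in test_loader:
        outputs = model(images.to(device))
        # Probability of the COVID-positive class for each image.
        probs.extend(torch.softmax(outputs, dim=1)[:, 1].cpu().tolist())
        labels.extend(targets.tolist())

fpr, tpr, _ = roc_curve(labels, probs)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"VGG16 (AUC = {roc_auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```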

Another evaluation of separability is the confusion matrix. The confusion matrix for binary classifiers displays the number of true positives, true negatives, false positives and false negatives in an easy to read matrix.

VGG16 (left), VGG19 (middle) and ResNet-34 (right) Confusion matrices

Again, the VGG16 model has the best results, with very few falsely labeled images. It had no false negatives on the test set, which is especially important because a COVID-19 case falsely diagnosed as negative can be a serious issue for the patient's survival. The other two models had some false negatives and even more false positives than the VGG16 model.
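
Reusing the labels and probabilities gathered for the ROC curve, the matrix itself is a single scikit-learn call (a sketch):

```python
from sklearn.metrics import confusion_matrix

# Threshold the class-1 probabilities at 0.5 to get hard predictions.
predictions = [1 if p >= 0.5 else 0 for p in probs]
cm = confusion_matrix(labels, predictions)
# Rows are ground truth (negative, positive); columns are predictions.
print(cm)
```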

To enhance our analysis of false positives and negatives, we can visualize some of these images. Essentially, we need to create a function that loops through the testing set and organizes the mislabeled data into the appropriate category. This will allow us to see what kind of mistakes the classifiers are making.

Here is the code for this method.
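
The sketch below follows the description in the next paragraph; the function name and layout parameters are illustrative rather than the exact ones in the repository.

```python
import torch
import matplotlib.pyplot as plt

def show_misclassified(model, test_loader, criterion, show="false_negatives", max_images=6):
    """Collect mislabeled test images and display them in a grid."""
    false_negatives, false_positives = [], []
    total_loss = 0.0

    model.eval()
    with torch.no_grad():
        for images, targets in test_loader:
            outputs = model(images.to(device))
            total_loss += criterion(outputs, targets.to(device)).item()
            predictions = outputs.argmax(dim=1).cpu()

            for image, target, pred in zip(images, targets, predictions):
                if target == 1 and pred == 0:      # COVID positive labeled negative
                    false_negatives.append(image)
                elif target == 0 and pred == 1:    # COVID negative labeled positive
                    false_positives.append(image)

    print(f"Average test loss: {total_loss / len(test_loader):.4f}")

    selected = false_negatives if show == "false_negatives" else false_positives
    plt.figure(figsize=(12, 6))
    for i, image in enumerate(selected[:max_images]):
        plt.subplot(2, 3, i + 1)
        plt.imshow(image.permute(1, 2, 0).numpy())  # (C, H, W) -> (H, W, C) for display
        plt.axis("off")
    plt.show()

# Example: show the false negatives made by the fine-tuned model.
show_misclassified(model, test_loader, criterion, show="false_negatives")
```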

The function takes in the model, the testing data, the loss function and a choice between displaying the false positives and the false negatives. We first create two lists to hold the false negatives and false positives respectively. Next, as we did in the training loop, we loop through the testing data and run the model on it. Then, an if/elif statement tests whether each misclassification is a false positive or a false negative and appends the mislabeled image to the appropriate list. Finally, depending on whether the function is asked to display false negatives or false positives, we use the figure and subplot functions in Matplotlib to display the mislabeled images in a grid.

Here are some examples of mislabeled images.

Examples of False Negatives
Examples of False Positives

As you can see in the false negative display, it is quite difficult to tell whether the images are COVID positive. Similarly, the false positives do seem to have some abnormalities; however, it may be that the particular abnormality is not due to COVID-19. To fix these misclassifications, we can possibly add more images that are similar to the false positives and negatives to make the model more generalizable to these images. Viewing misclassifications is an important tool to improve model development.

One final test we can perform on the models is the Hosmer-Lemeshow (HL) goodness-of-fit test. The HL test measures model calibration: it compares the observed (ground-truth) outcomes with the expected (model-predicted) counts in multiple subgroups of the data to determine how well the model fits the data.

Here is the equation for the HL test.
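
In one common form, summing over the $G$ subgroups and, within each subgroup, over the two outcome classes:

$$H = \sum_{g=1}^{G} \sum_{c \in \{0,1\}} \frac{\left(O_{cg} - E_{cg}\right)^{2}}{E_{cg}}$$

where $O_{cg}$ is the observed count and $E_{cg}$ the expected count of class $c$ in subgroup $g$.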

This may look complicated, but it’s actually quite straightforward. The test goes through each observed and expected value, takes the square of their difference and divides by the expected value. The first summation symbol is to sum over all the various subgroups created, and the second summation is to sum over each observed and expected value. If you are familiar with statistics, you may notice that the HL test is very similar to the chi-squared test.

The output of the HL test is a chi-squared value as well as a p-value between 0 and 1. Lower chi-squared values indicate closer agreement between the observed and expected values; in our case, a lower value means that our predictions match the ground truth more closely. Additionally, p-values less than 0.05 indicate a poorly fitted model, while p-values well above 0.05 indicate no evidence of lack of fit, i.e. the model appears to be well calibrated and to fit the data accurately. For a deeper understanding of the statistics behind this test, you can refer to this article.

Now that we understand the basics of how the test works, let’s see how we can implement it in code. Unfortunately, Python does not offer a convenient way to implement the HL test, so I used R instead.

First, I imported the ResourceSelection library which has a method that implements the Hosmer-Lemeshow test. I then created two vectors for the ground truth classifications and the binary output predictions. I got these values from the test method I created to get the testing accuracies (again, you can find this on my GitHub). Finally, I ran the hoslem.test function on the two sets of data. Note that the argument “g” indicates the number of subgroups that I split the data into.

Here is the output of this code.

We can run the same code for the other two models using their binary output probabilities. I organized the results of all three models in the following table.

The VGG16 classifier has the lowest chi-squared value and the p-value closest to one, meaning it is the best-fitted model. Accordingly, VGG19 is the second-best calibrated model, and the ResNet-34 is the least well fitted. Although the ResNet-34 performed the worst on the HL test, this does not mean the model is poorly calibrated; in fact, its high p-value suggests that it fits the data well, just not as well as the other two models. This is the benefit of the HL test over other metrics: it lets us tell whether a model's lack of fit is statistically significant. The high p-values and low chi-squared values indicate that all three models are relatively well fitted to the data.

Limitations

Although we were able to get good results training our various models, we did have a few limitations that are important to note. Firstly, we used PNG images as opposed to the standard DICOM format used in medical imaging. Since the COVID-19 pandemic is recent, original CT scans in DICOM format are difficult to find, and getting permission to use patient data is even harder. The benefit of DICOM images is that they are more standardized and of higher quality than PNG images. The data set we used contained images that were pulled from online articles, websites and blogs, so their quality isn't as good as that of original DICOM scans.

Another limitation of our work was the number of images in our data set. As I mentioned before, having less data can cause our validation accuracies to be slightly skewed. This is why we saw that sometimes the validation accuracy was higher than the training accuracy.

The final limitation is that we had to downsample the images before feeding them to our models. The state-of-the-art models that we used require input images of shape (224, 224, 3), while medical images often have much higher resolutions such as (512, 512) and (1024, 1024). This loss of information can leave out crucial details and hinder the accuracy of our model. One way to fix this is to create our own model that takes in a higher-resolution image; then we wouldn't have to downsample the images.

Conclusion

In this article, we saw how to preprocess the CT scans for classification using the Dataset class and DataLoader object. We then fine-tuned the VGG16, VGG19 and ResNet-34 pretrained models on the CT images using transfer learning, and evaluated each model further with ROC curves, confusion matrices and the Hosmer-Lemeshow goodness-of-fit test. Finally, we discussed some of the limitations of this project that could be improved upon in future work; for example, future work may involve using DICOM images when they become available open source.

Now that we have evaluated all three models on ROC curves, confusion matrices and the Hosmer-Lemeshow test, we can say that the VGG16 model performs the best on this set of COVID-19 CT scans.

  1. WHO. Coronavirus disease (COVID-19). 2020. https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200611-covid-19-sitrep-143.pdf?sfvrsn=2adbe568_4
  2. Gietema et al. CT in relation to RT-PCR in diagnosing COVID-19 in the Netherlands: a prospective study. 2020. https://www.medrxiv.org/content/10.1101/2020.04.22.20070441v1.full.pdf
  3. Mei et al. Artificial intelligence–enabled rapid diagnosis of patients with COVID-19. 2020. https://www.nature.com/articles/s41591-020-0931-3
  4. Trehan, Daksh. Detecting COVID-19 using Deep Learning. 2020. https://towardsdatascience.com/detecting-covid-19-using-deep-learning-262956b6f981
  5. Markevych, Ihor. Can a Convolutional Neural Network diagnose COVID-19 from lungs CT scans?. 2020. https://towardsdatascience.com/can-a-convolutional-neural-network-diagnose-covid-19-from-lungs-ct-scans-4294e29b72b
  6. COVID-19 CT Grand Challenge. 2020. https://covid-ct.grand-challenge.org/CT-diagnosis-of-COVID-19/
  7. Knipe et al. Ground-glass opacification. 2019. https://radiopaedia.org/articles/ground-glass-opacification-3?lang=us
  8. Kinsley, Harrison. “Intro and preprocessing — Using Convolutional Neural Network to Identify Dogs vs Cats p. 1” YouTube, uploaded by Sentdex, 22 February 2019, https://www.youtube.com/watch?v=gT4F3HGYXf4.
  9. Simonyan et al. Very Deep Convolutional Networks for Large-Scale Image Recognition. 2014. https://arxiv.org/abs/1409.1556
  10. He et al. Deep Residual Learning for Image Recognition. 2015. https://arxiv.org/abs/1512.03385
  11. Smith. Cyclical Learning Rates for Training Neural Networks. 2017. https://arxiv.org/pdf/1506.01186.pdf
  12. Rosebrock. Cyclical Learning Rates with Keras and Deep Learning. 2019. https://www.pyimagesearch.com/2019/07/29/cyclical-learning-rates-with-keras-and-deep-learning/

