In the past few years, we have seen a rapid increase in the use of deep learning for medical diagnosis in various forms, especially in the analysis of medical images. Here we will set up a pipeline to classify chest x-ray images of patients with and without pneumonia. The complete dataset is available on Kaggle under a Creative Commons license. Before we set up the pipeline, let’s see what you can expect to learn from this post –
- Binary Classification using Deep Neural Network (DNN).
- Using TensorFlow Dataset to Create a Faster Data Analysis Pipeline.
- Better techniques (e.g., standardization) for data pre-processing.
- Augmentation, rescaling, etc., as Lambda layers within a TensorFlow model.
- Class Imbalance and Building Custom Weighted Cross-Entropy Loss.
Without any delay, let’s begin! [All the code used here is available in the Kaggle Notebook.]
Getting Familiar with the Data Structure:
Since we will directly access the data from the Kaggle input directory, let’s first check the distribution of the labels ‘Normal’ and ‘Pneumonia’ across the Train, Validation and Test folders by counting the number of files in each one.
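A minimal sketch of such a count is shown below, assuming the standard layout of the Kaggle ‘chest_xray’ folders (NORMAL and PNEUMONIA sub-folders containing .jpeg files); the exact paths and list names are assumptions –
from glob import glob

base_path = '../input/chest-xray-pneumonia/chest_xray/'  # assumed Kaggle input path

train_im_n = glob(base_path + 'train/NORMAL/*.jpeg')
train_im_p = glob(base_path + 'train/PNEUMONIA/*.jpeg')
valid_im_n = glob(base_path + 'val/NORMAL/*.jpeg')
valid_im_p = glob(base_path + 'val/PNEUMONIA/*.jpeg')
test_im_n = glob(base_path + 'test/NORMAL/*.jpeg')
test_im_p = glob(base_path + 'test/PNEUMONIA/*.jpeg')

# print the number of files per folder
for name, files in [('train/NORMAL', train_im_n), ('train/PNEUMONIA', train_im_p),
                    ('val/NORMAL', valid_im_n), ('val/PNEUMONIA', valid_im_p),
                    ('test/NORMAL', test_im_n), ('test/PNEUMONIA', test_im_p)]:
    print(name, ':', len(files))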
First, we see that the training images are class-imbalanced, with far more images labelled ‘Pneumonia’ than ‘Normal’. Also, the validation folder contains very few examples (8 Normal and 8 Pneumonia images, to be exact).
tot_normal_train = len(train_im_n) + len(valid_im_n)
tot_pneumonia_train = len(train_im_p) + len(valid_im_p)
print ('total normal xray images: ', tot_normal_train)
print ('total pneumonia xray images: ', tot_pneumonia_train)
>>> total normal xray images: 1349
total pneumonia xray images: 3883
We can also visualize some example ‘Normal’ images as below –
Similarly, we can also view ‘Pneumonia’ images –
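If you want to reproduce such a plot, below is a minimal sketch using matplotlib and the file lists from the count check above; the grid size and the choice of files are arbitrary –
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.image import load_img

fig, axes = plt.subplots(2, 4, figsize=(12, 6))
for ax, path in zip(axes[0], train_im_n[:4]):   # first row: 'Normal' examples
    ax.imshow(load_img(path, color_mode='grayscale'), cmap='gray')
    ax.set_title('Normal')
    ax.axis('off')
for ax, path in zip(axes[1], train_im_p[:4]):   # second row: 'Pneumonia' examples
    ax.imshow(load_img(path, color_mode='grayscale'), cmap='gray')
    ax.set_title('Pneumonia')
    ax.axis('off')
plt.show()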
After reading some online resources like this one, I noticed that the usual way to find pneumonia is to look for opacity in chest x-rays. In the images above, the lungs generally look more opaque than in normal x-rays. But it is also important to remember that chest x-rays may not tell the whole story all the time, and sometimes the visual impression can be misleading.
Pre-Processing and Standardization:
We will adjust our image data so that the new mean of the data is zero and the standard deviation is 1. Later on, when we use the TensorFlow dataset, we will define a function in which each pixel value in the image is replaced with a new value, calculated by subtracting the mean and dividing by the standard deviation, (x − μ)/σ. Let’s see how standardization redistributes the pixel values in some random examples –
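To see this effect, here is a minimal sketch comparing the pixel histogram of one training image before and after standardization; the choice of image is arbitrary, and the file list comes from the count check above –
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.image import load_img, img_to_array

img = img_to_array(load_img(train_im_p[0], color_mode='grayscale'))  # one 'Pneumonia' example
img_std = (img - img.mean()) / img.std()                             # (x - mean) / std

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(img.ravel(), bins=50)
axes[0].set_title('Original pixel values')
axes[1].hist(img_std.ravel(), bins=50)
axes[1].set_title('Standardized pixel values')
plt.show()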
To include this type of standardization, we create a function that will later be used as a Lambda layer when building the model, so that the GPU is also used for this step –
#### define a function that will be added as a Lambda layer later
def standardize_layer(tensor):
    tensor_mean = tf.math.reduce_mean(tensor)
    tensor_std = tf.math.reduce_std(tensor)
    new_tensor = (tensor - tensor_mean) / tensor_std
    return new_tensor
Build Input Pipeline with TensorFlow Dataset:
In a previous post, I described in detail how building an input pipeline with the TensorFlow Dataset API, including augmentation, can accelerate DNN training. I will follow similar steps here. Since the image data are arranged in train, test and validation folders, we can get started with the [image_dataset_from_directory](https://keras.io/api/preprocessing/image/) function –
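Below is a minimal sketch of loading the three folders; the grayscale mode, 300 × 300 image size and batch size of 64 are inferred from the dataset shapes and batch counts printed later, while the input path is an assumption –
import tensorflow as tf

base_path = '../input/chest-xray-pneumonia/chest_xray/'  # assumed Kaggle input path

train_dir = tf.keras.preprocessing.image_dataset_from_directory(
    base_path + 'train', label_mode='binary', color_mode='grayscale',
    image_size=(300, 300), batch_size=64)
val_dir = tf.keras.preprocessing.image_dataset_from_directory(
    base_path + 'val', label_mode='binary', color_mode='grayscale',
    image_size=(300, 300), batch_size=64)
test_dir = tf.keras.preprocessing.image_dataset_from_directory(
    base_path + 'test', label_mode='binary', color_mode='grayscale',
    image_size=(300, 300), batch_size=64, shuffle=False)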
As we checked before, there are only 16 files in the validation directory, and using only 16 images for validation isn’t a good idea. So we will combine the training and validation datasets and then split them with more reasonable percentages. First, let’s check the number of batches in the ‘train’ and ‘valid’ datasets.
num_elements = tf.data.experimental.cardinality(train_dir).numpy()
print (num_elements)
num_elements_val = tf.data.experimental.cardinality(val_dir).numpy()
print (num_elements_val)
>>> 82
1
We see that there are 82 training batches and 1 validation batch. To increase the number of validation batches, let’s first concatenate the ‘train’ and ‘validation’ datasets. Then we assign 20% of the combined dataset to validation, using dataset.take and dataset.skip to create the new datasets, as in the code block below –
new_train_ds = train_dir.concatenate(val_dir)
print (new_train_ds, train_dir)
train_size = int(0.8 * 83)  # 83 batches in the combined dataset (82 train + 1 validation)
val_size = int(0.2 * 83)
train_ds = new_train_ds.take(train_size)
val_ds = new_train_ds.skip(train_size).take(val_size)
#### check the dataset size back again
num_elements_train = tf.data.experimental.cardinality(train_ds).numpy()
print (num_elements_train)
num_elements_val_ds = tf.data.experimental.cardinality(val_ds).numpy()
print (num_elements_val_ds)
>>> <ConcatenateDataset shapes: ((None, 300, 300, 1), (None, 1)), types: (tf.float32, tf.float32)> <BatchDataset shapes: ((None, 300, 300, 1), (None, 1)), types: (tf.float32, tf.float32)>
66
16
I have already described the prefetching technique and how much faster it is than ImageDataGenerator. Let’s add this –
autotune = tf.data.AUTOTUNE  ### lets tf.data tune the prefetch buffer size automatically to speed up training
train_data_batches = train_ds.cache().prefetch(buffer_size=autotune)
valid_data_batches = val_ds.cache().prefetch(buffer_size=autotune)
test_data_batches = test_dir.cache().prefetch(buffer_size=autotune)
I will also add a rescaling layer and some augmentations as layers; all of these will be included in the model as Lambda layers. Let’s define them as below –
from tensorflow.keras import layers

rescale_layer = tf.keras.Sequential([layers.experimental.preprocessing.Rescaling(1./255)])

data_augmentation = tf.keras.Sequential([
    layers.experimental.preprocessing.RandomFlip(),
    layers.experimental.preprocessing.RandomRotation(0.1),  # factor is a fraction of 2*pi (~36 degrees), not degrees
    layers.experimental.preprocessing.RandomZoom(0.1)])
Weighted Binary Cross-Entropy Loss:
The idea behind using a weighted BCE loss is that, since we have many more x-ray images labelled ‘Pneumonia’ than ‘Normal’, an unweighted loss would let the majority class dominate training. So we shift this bias and force the model to weigh normal and pneumonia images equally. We calculate frequency terms as the number of images in each class divided by the total number of images, and use them to weight the two terms of the loss. The code block below is an example of such a loss for this problem –
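Below is a minimal sketch of such a loss. It reuses the class counts computed earlier (tot_normal_train, tot_pneumonia_train) and follows the common convention of weighting each class by the frequency of the other class, so both contribute comparably; the exact weighting used in the notebook may differ –
from tensorflow.keras import backend as K

total = tot_normal_train + tot_pneumonia_train
freq_normal = tot_normal_train / total        # frequency of label 0
freq_pneumonia = tot_pneumonia_train / total  # frequency of label 1

weight_normal = freq_pneumonia                # rare class gets the larger weight
weight_pneumonia = freq_normal                # frequent class gets the smaller weight

def weighted_bce(y_true, y_pred):
    # clip predictions to avoid log(0)
    y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
    loss = -(weight_pneumonia * y_true * K.log(y_pred)
             + weight_normal * (1.0 - y_true) * K.log(1.0 - y_pred))
    return K.mean(loss)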
Building a DNN Model Including Augmentations:
After defining the appropriate custom loss function, we are left with building a model that includes the rescaling and augmentations as Lambda layers. For this work I used the pre-trained InceptionResNetV2 model; you can check the Keras module here. In Kaggle competitions we are not allowed to use the internet, so I needed to download the pre-trained weights, which explains the weights argument inside the InceptionResNetV2 function. Rescaling, standardization and augmentation are all added as Lambda layers before the image batches are fed to the model. Let’s see the code block below –
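Below is a minimal sketch of such a model under a few assumptions: the grayscale input is repeated to three channels for the pre-trained backbone, the classification head is a simple global-average-pooling plus sigmoid layer, and the weights path is a placeholder for the locally downloaded file; the exact architecture in the notebook may differ –
from tensorflow.keras import layers

inputs = layers.Input(shape=(300, 300, 1))
x = rescale_layer(inputs)                         # rescale pixel values to [0, 1]
x = layers.Lambda(standardize_layer)(x)           # standardization defined earlier
x = data_augmentation(x)                          # random flip / rotation / zoom (active only during training)
x = layers.Lambda(tf.image.grayscale_to_rgb)(x)   # assumption: 1 channel -> 3 channels for the pre-trained net

base_model = tf.keras.applications.InceptionResNetV2(
    include_top=False,
    weights='inception_resnet_v2_notop.h5',       # hypothetical path to the downloaded weights
    input_shape=(300, 300, 3))

x = base_model(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss=weighted_bce, metrics=['accuracy'])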
After compiling the model and adding some ‘Callbacks’, we are ready to train. Once training is done, we can evaluate performance on the test data by plotting the confusion matrix and the ROC curve.
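A minimal sketch of the training call is shown below; the callback choices and epoch count are assumptions, not the notebook’s exact settings –
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                     restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2)]

history = model.fit(train_data_batches,
                    validation_data=valid_data_batches,
                    epochs=30, callbacks=callbacks)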
y_pred = model.predict(test_data_batches)
true_categories = tf.concat([y for x, y in test_data_batches], axis=0)
Let’s set a threshold of 0.75: predictions above it are assigned label ‘1’ and anything below gets label ‘0’.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

y_pred_th = (y_pred > 0.75).astype(np.float32)
cm = confusion_matrix(true_categories, y_pred_th)  # note the (y_true, y_pred) argument order
class_names = train_dir.class_names
plt.figure(figsize=(8, 8))
plt.title('CM for threshold 0.75')
sns_hmp = sns.heatmap(cm, annot=True, fmt="d",
                      xticklabels=class_names, yticklabels=class_names)
fig = sns_hmp.get_figure()
The code blocks above resulted in the confusion matrix below –
Similarly, we can also plot the ROC curve; the results are shown below –
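A minimal sketch of the ROC curve with scikit-learn, using the raw predicted probabilities rather than the thresholded labels; the figure styling is arbitrary –
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(np.array(true_categories).ravel(), y_pred.ravel())
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, label='AUC = %0.3f' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--')   # chance level
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve on the test set')
plt.legend(loc='lower right')
plt.show()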
Final Notes:
All the code used above is available on my GitHub, and you can also check the Kaggle Notebook. The dataset is available on Kaggle under a Creative Commons license, so we are free to use and adapt it. To conclude, we went through a complete data analysis pipeline for a class-imbalanced dataset and learnt to use the TensorFlow Dataset API effectively.
Stay strong and Cheers!