Detecting Pulmonary Abnormalities in Chest X-Rays

Convolutional Neural Networks (CNNs) for Image Recognition

Harry Roper
Towards Data Science


Photo by CDC on Unsplash

Applied data science and artificial intelligence have contributed, and continue to contribute, to a host of different fields and industries, one example of which is healthcare and medical research.

Nowadays, technology capable of performing X-ray scans is fairly readily available at relatively low cost. The qualified radiological expertise required to make diagnoses from those scans is, however, less abundant, particularly in developing nations.

The implementation of AI solutions in this area could potentially facilitate a crucial step forward in development.

The motivation behind this project is to explore the possibility of building Convolutional Neural Network (CNN) models to detect pulmonary abnormalities in images of chest X-ray scans in the hopes of making the diagnosis process quicker and cheaper with a reduced reliance on the intervention of human experts.

The stages of completing the project will be as follows:

  1. Exploring and pre-processing the training data
  2. Building, evaluating, and tuning the CNN model
  3. Discussing the successes, limitations, and potential future improvements of the approach

Part I: Pre-Processing and Exploratory Analysis of the Training Data

The training data set for this project will comprise 800 images of chest X-rays: 662 of which have been supplied by the Shenzhen №3 People’s Hospital in China, with the remaining 138 coming from the Department of Health and Human Services of Montgomery County, USA.

Each image in the data set has been reviewed by a radiologist and assigned a label to indicate whether the patient exhibited a pulmonary abnormality resulting from tuberculosis.

The data set was originally published by the U.S. National Library of Medicine with the intention of providing sufficient public training data for research in the field of computer-aided diagnosis of pulmonary diseases. The data used in this project was obtained via Kaggle courtesy of user K Scott Mader. All relevant citations can be found in the References section at the end of this post.

Figure 1: Two examples of the training images, one which displays an abnormality (left), and one which does not (right)

Encoding the Images

Before image data can be analysed or passed into a machine learning model, it must first be converted into some kind of numerical format. To do this, we can encode the images using the OpenCV Python package.

The package’s imread function essentially creates a NumPy array for each image, with each item in the array representing the intensity of an individual pixel, across three separate layers for blue, green, and red (OpenCV reads channels in BGR order).

The dimensions of each array will be equal to the image’s height (in pixels), width (in pixels), and number of colour layers (three).

Although X-ray images don’t contain colours beyond the scale from white to black, the package we’ll later use to build the CNN model (Keras) requires each input sample to be a 3D array (height, width, channels), so we’re not able to feed the images in as 2D greyscale arrays.
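As a quick illustration of this encoded format (the filename here is hypothetical):

import cv2

# Hypothetical file within the dataset's directory structure
img = cv2.imread('xray_images/Montgomery/MontgomerySet/CXR_png/example_0.png')

# img is a NumPy array of shape (height, width, 3): one layer each
# for blue, green, and red (OpenCV reads channels in BGR order)
print(img.shape)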

Retrieving the Target Labels

The description in the source of the training data explains that the ground truth label for each image (as determined by a medical expert) is stored as a suffix in its filename, with a 1 denoting that the scan displays an abnormality and a 0 denoting that it doesn’t.

To retrieve the target labels, we can simply append the final character before the filetype (.png) of each image’s filename to a list that will act as the target variable.
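For example (the filename below is hypothetical, but follows the naming convention described above):

filename = 'example_xray_1.png'  # hypothetical; the suffix 1 marks an abnormality
print(int(filename[-5]))
>>> 1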

We can create the array of encoded feature variables and list of target labels for each image in each of the two directories with a nested for loop as defined below:

import os
import cv2
import numpy as np

def encode_images():
    X = []
    y = []
    directories = ['xray_images/ChinaSet_AllFiles/ChinaSet_AllFiles/CXR_png/',
                   'xray_images/Montgomery/MontgomerySet/CXR_png/']
    for directory in directories:
        for filename in os.listdir(directory):
            if filename.endswith('.png'):
                # Encode the image and extract its label from the filename
                X.append(cv2.imread(directory + filename))
                y.append(int(filename[-5]))
    return np.array(X), np.array(y)

Assessing and Resizing the Images

A CNN model requires each input entry to be equal in shape. In the context of our data, this would correspond to each image being of the same size and thus each feature array being of equal dimensions.

Let’s check whether this is the case by plotting the distribution of heights and widths of the images in the training data:

Figure 2: Distributions of image dimensions
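As a reference for reproducing this check, here is a minimal sketch using Matplotlib (assuming X is the array of encoded images from the previous step):

import matplotlib.pyplot as plt

# Collect the height and width (in pixels) of each encoded image
heights = [img.shape[0] for img in X]
widths = [img.shape[1] for img in X]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.hist(heights, bins=30)
ax1.set(title='Image heights', xlabel='Pixels', ylabel='Count')
ax2.hist(widths, bins=30)
ax2.set(title='Image widths', xlabel='Pixels', ylabel='Count')
plt.show()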

We can see that variation exists across the images’ sizes. To fix this, we can rewrite our encode function to implement the resize method from the Pillow package at the beginning of the for loop.

But what size should we pick for our input images?

In theory, keeping as much resolution as possible should allow for a more accurate model. We can see from the distributions above that the images are of fairly high resolution, with almost all of them having at least 1,000 pixels in height and width.

However, we also need to consider computational speed. Keeping images of over 1,000 × 1,000 pixels would likely make the model take a long while to fit on the training data, particularly since we’ll be running the code locally. As usual, we have something of a trade-off between accuracy and speed.

As a point of compromise, and keeping in mind that it’s generally good practice to choose image dimensions that are powers of two, let’s resize the images to squares of 256 × 256 pixels:

from PIL import Image

img = Image.open(directory + filename)
img = img.resize((256, 256))  # resize to 256 x 256 pixels
img.save(directory + filename)  # overwrite the original file
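Putting the two steps together, the rewritten encoding function might look something like this sketch (resizing each file in place before reading it back in):

import os
from PIL import Image
import cv2
import numpy as np

def encode_images(size=(256, 256)):
    X = []
    y = []
    directories = ['xray_images/ChinaSet_AllFiles/ChinaSet_AllFiles/CXR_png/',
                   'xray_images/Montgomery/MontgomerySet/CXR_png/']
    for directory in directories:
        for filename in os.listdir(directory):
            if filename.endswith('.png'):
                # Resize the image in place before encoding it
                img = Image.open(directory + filename)
                img = img.resize(size)
                img.save(directory + filename)
                X.append(cv2.imread(directory + filename))
                y.append(int(filename[-5]))
    return np.array(X), np.array(y)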

Normalising the Scale of the Images

The greyscale value of a pixel can range from 0 (black) to 255 (white), so at present our encoded feature array contains values of up to 255. Machine learning models, particularly those trained via gradient descent, as neural networks are, typically perform better on normalised data.

To account for this before proceeding to the modelling stage, let’s divide our feature array by the maximum (255) to normalise each entry such that it ranges from 0 to 1:

print((np.min(X), np.max(X)))
X = X.astype('float32') / 255
print((np.min(X), np.max(X)))
>>> (0, 255)
>>> (0.0, 1.0)

Analysing the Target Variable

It’s also a good idea to gain some visibility of the target variable before proceeding to build the model.

Let’s take a look at the split between positive and negative classifications in the training data:

Figure 3: Target variable split
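For reference, the split can be computed directly from the label array (a minimal sketch, run before the one-hot encoding step in Part II):

import numpy as np

counts = np.bincount(y)  # y contains 0s (normal) and 1s (abnormal)
print('Normal: {}; Abnormal: {}; Abnormal share: {:.1%}'.format(
    counts[0], counts[1], counts[1] / len(y)))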

We can see from the above that our training data is well balanced across classes, and that just over 49% of the images exhibit an abnormality. This should act as a benchmark when assessing the accuracy of the model later on.

Now that we’ve preprocessed the training data into a suitable format for a CNN model and gained a better understanding of it through some exploratory analysis, we can move onto the CNN modelling stage.

Part II: Building the CNN Model

Splitting the Training Data

To facilitate evaluation and tuning of the model once it’s been built and trained, we need to split the data into training and testing sets.

When working with any form of neural network, it’s advisable to split the data into three separate sets (training, validation, and testing) instead of the conventional two.

This is because a neural network model uses the training set as input to feed through the model, calculate the loss, and adjust the weights and bias during each epoch, and then uses a separate validation set to determine whether the new parameters are an improvement on those from the previous epoch.

Since the validation data has been “seen” by the model during training, using this same data to evaluate the final model would technically lead to data leakage, hence the need to create a third testing set at the initial split.

We can perform this split by using scikit-learn’s train_test_split method to first split the data into training and testing sets, and again to split the testing set into separate validation and testing sets:

from sklearn.model_selection import train_test_split

train_size = 0.6
val_size = 0.2
test_size = 0.2

# Split off the training set, then divide the remainder equally between validation and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(val_size + test_size), random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=(test_size / (val_size + test_size)), random_state=42)

print('Training: {}; Validation: {}; Testing: {}'.format((len(X_train), len(y_train)), (len(X_val), len(y_val)), (len(X_test), len(y_test))))
>>> Training: (480, 480); Validation: (160, 160); Testing: (160, 160)

Encoding the Target Variable

Since our model will use a softmax output layer with a categorical cross-entropy loss, Keras requires the target variable to be categorically (one-hot) encoded, regardless of the number of possible classes. We’ll need to carry out this transformation on each of our three target sets before we can begin building the model:

import keras

# One-hot encode each of the three target sets
y_train = keras.utils.to_categorical(y_train, len(set(y)))
y_val = keras.utils.to_categorical(y_val, len(set(y)))
y_test = keras.utils.to_categorical(y_test, len(set(y)))
print(y_train.shape, y_val.shape, y_test.shape)
>>> (480, 2) (160, 2) (160, 2)

Building a Baseline Model

The first step of building a CNN model in Keras is to define the model architecture. As with other neural networks, the architecture must consist of:

  1. An initial input layer, in which we specify the shape of our input features
  2. Zero or more “hidden” layers, which will attempt to uncover patterns in the data
  3. A final output layer, which will classify each instance based on its input

The code for the architecture of the baseline models can be written as follows:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

base_model = Sequential()
base_model.add(Conv2D(filters=16, kernel_size=2, padding='same', activation='relu', input_shape=X[0].shape))
base_model.add(MaxPooling2D(pool_size=2))
base_model.add(Flatten())
base_model.add(Dense(len(set(y)), activation='softmax'))

A few points to note about the baseline model:

  1. We need to specify the number of filters we want to use as well as the size of the square kernels within the input layer
  2. We also need to specify the padding in case a filter spills over the edge of an image, for which we’ll use ‘same’
  3. We’ll start with a ReLU activation function, which leaves positive values intact and sets negative values to 0. We can test out different functions when we tune the model
  4. Since we resized all of our feature inputs such that they don’t vary, we can use the first instance to specify the input shape
  5. A max pooling layer has been added to reduce the dimensionality of the data by taking the maximum value from each 2x2 square
  6. Before we reach the output layer, we need to flatten the data into a 2D array so that it can be fed into a fully connected layer
  7. Within the output layer, we need to specify how many classes the model should make predictions for (equal to the number of unique values in our target array, which is 2 in this case) and use a Softmax activation function to normalise the output as a probability distribution
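Before compiling, Keras’s summary method offers a quick sanity check of each layer’s output shape (the shapes below follow from a 256 × 256 × 3 input):

base_model.summary()
# Conv2D ('same' padding):  (None, 256, 256, 16)
# MaxPooling2D (pool 2):    (None, 128, 128, 16)
# Flatten:                  (None, 262144)
# Dense (softmax):          (None, 2)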

We also need to compile the model, which involves defining the metric(s) we’ll use to measure its performance. Since we discovered that the target variable was evenly balanced across classes when exploring the data, the pure accuracy should be a suitable metric to use:

base_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

You can see that the compile method is also used to select an appropriate loss function and optimiser for the model to use while training.

Once we’ve defined and compiled the model’s architecture, we can fit it to the training data:

base_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val), verbose=1, shuffle=True)

The parameters inside the fit method essentially tell the model to fit itself on the training set and use the validation set as unseen data on which to assess performance.

The batch size is used to split the data into subsets that are passed through the network one at a time. This has the advantage of greatly reducing the memory and computation required to complete each training step.

The number of epochs tells the model how many times it should push the full training set through the network and subsequently backpropagate the errors to update the weights and biases. We’ll discuss the optimal number of epochs to use shortly, but to start with we’ll use 10.

Evaluating the Baseline Model

Once we’ve fit the model over the specified number of epochs, we can evaluate its true performance on the previously unseen testing set:

round(base_model.evaluate(X_test, y_test, verbose=0)[1], 4)
>>> 0.7812

We can see that the baseline model accurately classifies 78% of the testing samples. Considering that roughly half of the data set comprises positively labelled instances, we can deduce that the simple baseline model already performs better than randomly guessing each image’s class.

This provides a good starting point, but let’s dig in to see if we can take steps to boost the model’s performance.

Selecting the Number of Epochs

To determine the optimal number of epochs to use when fitting the model, we can collect the accuracy scores from a number of test runs of the base model using a varying number of epochs in the fit method:

epochs = [5, 10, 20, 50, 100, 200]
scores = []

for e in epochs:
    # Rebuild the baseline model from scratch for each epoch count
    test_model = Sequential()
    test_model.add(Conv2D(filters=16, kernel_size=2, padding='same', activation='relu', input_shape=X[0].shape))
    test_model.add(MaxPooling2D(pool_size=2))
    test_model.add(Flatten())
    test_model.add(Dense(len(set(y)), activation='softmax'))
    test_model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    test_model.fit(X_train, y_train, epochs=e, batch_size=32, validation_data=(X_val, y_val), verbose=False, shuffle=True)
    scores.append(test_model.evaluate(X_test, y_test, verbose=False)[1])

and subsequently plot the results:

Figure 4: Accuracy score per number of epochs

We can see from the above that using 20 epochs instead of 5 or 10 significantly improves the performance of the model. Subsequent increases in the number of epochs, however, yield less drastic, if any, improvements.

Although running a neural network for a greater number of epochs can allow it to become more finely tuned, increasing the count beyond a certain level can eventually cause the model to overfit to the training data, thereby damaging performance on the testing set.

Considering both performance and computational efficiency, let’s use 20 epochs as we tune the baseline model.

Implementing Image Augmentation

A commonly used method for making CNN models more robust, and in turn boosting performance, is image augmentation. This refers to generating new training images that are translations of the existing ones, as a means of preparing the model for messy, real-world data in which the object to be recognised can appear anywhere within the image space, at a range of angles and sizes.

Image augmentation is relatively simple to perform in Keras. We first need to create and fit image generator objects for each of the training and validation sets:

from keras.preprocessing.image import ImageDataGenerator

datagen_train = ImageDataGenerator(width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)
datagen_val = ImageDataGenerator(width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True)

# fit computes any statistics the generator needs from the data
datagen_train.fit(X_train)
datagen_val.fit(X_val)

which can then be passed into the fit method of the CNN. To test out whether this technique improves our model’s performance, let’s recreate the baseline model under identical conditions, but this time pass in the data generator object when we fit it to the training data:

batch_size = 32

# checkpointer: a callback (e.g. keras.callbacks.ModelCheckpoint) defined in the full code
aug_base_model.fit(datagen_train.flow(X_train, y_train, batch_size=batch_size),
                   steps_per_epoch=X_train.shape[0] / batch_size, epochs=20,
                   verbose=1, callbacks=[checkpointer],
                   validation_data=datagen_val.flow(X_val, y_val, batch_size=batch_size),
                   validation_steps=X_val.shape[0] / batch_size)

Note that we need to specify within the function the number of steps to be made during each epoch. This can be defined as the number of input instances over the given batch size.
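Spelled out, with 480 training images and a batch size of 32, that gives 480 / 32 = 15 steps per epoch (and 160 / 32 = 5 validation steps). Using a ceiling, as in this small variation on the division above, is a safe option in case the batch size doesn’t divide the set evenly:

import math

batch_size = 32
steps_per_epoch = math.ceil(X_train.shape[0] / batch_size)  # 480 / 32 = 15
validation_steps = math.ceil(X_val.shape[0] / batch_size)   # 160 / 32 = 5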

We can use the same method as used previously to evaluate the performance of the augmented baseline model:

round(aug_base_model.evaluate(X_test, y_test, verbose=0)[1], 4)
>>> 0.6875

The above indicates that using image augmentation actually hampered our model’s performance instead of improving it.

To understand why this might be, let’s try to apply some domain knowledge. While image augmentation may increase model robustness in the context of typical, unstructured images, such as recognising a car in a picture of a road, the same rules may not apply in the context of our data.

X-ray scans are performed under controlled conditions, so we can expect an element of structural consistency across images in both the training and testing sets. Including translations of the real images in training might therefore only serve to confuse the model by creating unrealistic circumstances.

With this in mind, let’s make the decision to move onto the tuning process without the use of image augmentation.

Tuning the Model

CNN models offer countless opportunities for tuning and adjustment with a view to improving performance. Since this is a field in which the practice is often said to be ahead of the theory, the most effective way to discover the optimal conditions for a model is simply to dig in and experiment.

Tuning CNNs is an iterative, trial-and-error style process. Since the code for this project can be found in the repository linked at the end, I won’t include all of the tuning steps in the post, but below is an overview of the various conditions that were tweaked and tested during the tuning process:

  1. Implementing additional convolutional layers
  2. Adding dense layers before the final output layer
  3. Implementing dropout to randomly “turn off” some of the nodes in each layer
  4. Experimenting with different activation functions, such as the Sigmoid function
  5. Using larger strides within the convolutional layers

The Final Product

After multiple rounds of tuning, the model that produced the highest accuracy score (84%) consisted of:

  • Three convolutional layers with increasing numbers of filters
  • A ReLU activation function in each layer
  • A max pooling layer of size two after each convolutional layer
  • A dropout layer with probability 0.3 after the third convolutional layer

Figure 5: Architecture of final model
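Based on the description above, the final architecture looks something like the following sketch (the exact filter counts are assumptions on my part; the precise configuration lives in the linked repository):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

final_model = Sequential()

# Three convolutional layers with increasing filter counts (assumed 16/32/64),
# each followed by a 2x2 max pooling layer
final_model.add(Conv2D(filters=16, kernel_size=2, padding='same',
                       activation='relu', input_shape=(256, 256, 3)))
final_model.add(MaxPooling2D(pool_size=2))
final_model.add(Conv2D(filters=32, kernel_size=2, padding='same', activation='relu'))
final_model.add(MaxPooling2D(pool_size=2))
final_model.add(Conv2D(filters=64, kernel_size=2, padding='same', activation='relu'))
final_model.add(MaxPooling2D(pool_size=2))

# Dropout after the third convolutional block, then classify
final_model.add(Dropout(0.3))
final_model.add(Flatten())
final_model.add(Dense(2, activation='softmax'))

final_model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
                    metrics=['accuracy'])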

Part III: Discussing the Approach

Successes

Using a data set of X-ray images pre-labelled by expert radiologists, we were able to construct a simple CNN model to correctly identify pulmonary abnormalities 78% of the time.

We were then able to tune the model by implementing features such as additional convolutional layers and dropout to boost performance, achieving a final accuracy score of 84%.

Limitations

Although the final model may provide useful foundations for further research, an accuracy score of 84% would likely be deemed insufficient for application in a real-world medical context. A higher level of performance would need to be achieved before a model of this kind could be of genuine use.

The most obvious drawback in this particular approach is the size of the training data set. In the context of machine learning, a sample size of 800 would be considered very small, and we could expect to improve the performance of the model through the use of a larger training set.

However, this would also come with its own issues. CNNs are computationally complex algorithms, so increasing the size of the training data beyond a certain point would eventually become impractical to train on a local machine.

A further limitation of this approach is that the data set contains ground truth diagnoses for only one disease: tuberculosis. If a model of this kind were to be used in the diagnosis of real-world cases, it would need to be able to detect abnormalities resulting from other pulmonary conditions.

Solving this would again require the use of an extended training data set that included labels for multiple possible diseases, turning the approach into a multi-label classification problem as opposed to binary.

Developing the Model in Future Versions

As mentioned above, a first step towards building an improved model in a future version would be to collect a larger and more varied set of training data.

To combat the impracticalities of training a complex CNN on a local CPU, it would also be advisable to employ the services of a cloud-based, GPU-enabled server, such as AWS.

Another avenue to explore in the attempts to improve the model would be the use of transfer learning. This would involve taking a pre-trained neural network and replacing the final output layer with a fully connected layer bespoke to this problem.

This is particularly useful in the context of small training data sets, since it can derive the benefits of models trained on much larger data sets and apply the same patterns to the data in question.
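As an illustration, here is a sketch of this idea using the VGG16 weights that ship with Keras (the choice of base network and frozen layers is an assumption for demonstration, not necessarily the best fit for X-ray data):

from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import Flatten, Dense

# Load a network pre-trained on ImageNet, without its original classifier
base = VGG16(weights='imagenet', include_top=False, input_shape=(256, 256, 3))
base.trainable = False  # freeze the pre-trained convolutional layers

transfer_model = Sequential([
    base,
    Flatten(),
    Dense(2, activation='softmax')  # new output layer bespoke to our two classes
])
transfer_model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
                       metrics=['accuracy'])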

Closing Remarks

The intention behind this project was to explore the possibility of constructing a CNN model to detect pulmonary abnormalities in images of chest X-ray scans as a means of assisting diagnoses through the application of AI.

This was motivated by the scarcity of expert radiologists, particularly in developing nations, despite the availability of X-ray machines.

The journey from a data set of raw images to a final working model involved:

  1. Resizing, encoding, and normalising the image data
  2. Building and evaluating the architecture of a baseline CNN model to act as a starting point
  3. Experimenting with and tweaking a number of the model’s conditions, such as the number of epochs, the number of convolutional layers, and the use of dropout, to improve performance

Readers interested in exploring the code, including the various tuning stages, in more detail can find it in this repository on my GitHub. Feedback, questions, and suggestions on improving the model are always welcome.

References

K Scott Mader. Pulmonary Chest X-Ray Abnormalities. Kaggle. https://www.kaggle.com/kmader/pulmonary-chest-xray-abnormalities

Jaeger S, Candemir S, Antani S, Wáng YX, Lu PX, Thoma G. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quant Imaging Med Surg. 2014;4(6):475–477. doi:10.3978/j.issn.2223-4292.2014.11.20

Jaeger S, Karargyris A, Candemir S, Folio L, Siegelman J, Callaghan F, Xue Z, Palaniappan K, Singh RK, Antani S, Thoma G, Wang YX, Lu PX, McDonald CJ. Automatic tuberculosis screening using chest radiographs. IEEE Trans Med Imaging. 2014 Feb;33(2):233–45. doi: 10.1109/TMI.2013.2284099. PMID: 24108713

Candemir S, Jaeger S, Palaniappan K, Musco JP, Singh RK, Xue Z, Karargyris A, Antani S, Thoma G, McDonald CJ. Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE Trans Med Imaging. 2014 Feb;33(2):577–90. doi: 10.1109/TMI.2013.2290491. PMID: 24239990
