Some Key Challenges in Building an AI Model for Medical Diagnosis

Beginner’s Guide to Working with AI in Medicine

Ishtiak Mahmud
Towards Data Science


Image: Hal Gatewood

Artificial Intelligence is currently gaining a lot of popularity thanks to easy access to large amounts of data and to ever-faster processors. Among all the applied fields of AI, medicine is one of the most important, with enormous potential benefits for humanity. As AI continues to develop, its possibilities in the medical sector keep expanding.

One of the most important uses of AI in medicine is medical diagnosis. A lot of research is going on in this area, and some models show promising results, performing as accurately as a group of experts. At present, a deep learning model cannot replace a doctor, but it can indicate to users whether a particular problem should be checked by one. Unlike typical computer vision problems, training a model on medical datasets faces some distinctive challenges. Some of the key ones are:

  1. Class Imbalance

In typical deep learning problems, the dataset usually contains roughly equal amounts of data for each class. In medical datasets, however, the class distribution usually reflects the prevalence of the particular disease in the population, which in most cases is only 0–10%.

Because the dataset is imbalanced, the loss function is dominated by the class with more samples, so the model learns to favour it. As a result, the model produces a lot of false negatives, which is a disastrous outcome when diagnosing a rare disease.

Two equally naive models have different training losses due to class imbalance in the dataset (Image by Author)

In the figure, two equally naive models are presented: model 1 predicts every sample as positive, and model 2 predicts every sample as negative. Yet the loss of model 2 is about 4 times lower, simply because negatives dominate the batch. Ideally, two models that have learned nothing useful should incur the same cost.
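The original figure is not reproduced here, but a minimal numerical sketch consistent with it (assuming a batch of 2 positives and 8 negatives, and each naive model outputting a constant probability of 0.99 for its favoured class) shows where a roughly 4x gap comes from:

```python
import numpy as np

# Hypothetical batch consistent with the figure: 2 positives, 8 negatives.
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

def bce(y_true, y_pred):
    """Mean binary cross-entropy loss."""
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Model 1 predicts "positive" for everything, model 2 predicts "negative".
p = 0.99  # confidence of the constant prediction
loss_all_positive = bce(y_true, np.full(10, p))
loss_all_negative = bce(y_true, np.full(10, 1 - p))

print(loss_all_positive, loss_all_negative)   # ~3.69 vs ~0.93
print(loss_all_positive / loss_all_negative)  # ratio approaches 8/2 = 4
```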

  2. Small Dataset

Typical computer vision problems come with a lot of data. Since neural networks are data-hungry, this abundance of data helps them solve complex classification problems. Medical data, by contrast, is much harder to come by: collecting a huge amount of it is difficult because of privacy concerns.

Some possible solutions to these problems:

  1. The class imbalance problem can be solved by using a weighted loss function. When the loss of each sample is calculated, it is multiplied by a weight that is inversely proportional to how often its class occurs in the batch. Suppose we have a batch of 10 samples where 7 are negative and 3 are positive. The loss of each negative sample is then multiplied by 3/10, and the loss of each positive sample by 7/10, so that the two classes contribute equally to the total loss (a sketch of this weighted loss is given below).
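A minimal sketch of such a weighted binary cross-entropy, with the weights computed per batch as described above (the function is illustrative and not taken from any particular library):

```python
import numpy as np

def weighted_bce(y_true, y_pred):
    """Binary cross-entropy where each sample's loss is weighted by the
    frequency of the *other* class in the batch."""
    n = len(y_true)
    w_pos = np.sum(y_true == 0) / n  # weight for positive samples, e.g. 7/10
    w_neg = np.sum(y_true == 1) / n  # weight for negative samples, e.g. 3/10
    loss = -(w_pos * y_true * np.log(y_pred)
             + w_neg * (1 - y_true) * np.log(1 - y_pred))
    return np.mean(loss)

# Example: the 7-negative / 3-positive batch from the text.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.full(10, 0.5)  # a model that is maximally unsure
print(weighted_bce(y_true, y_pred))
```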

This process incentivizes the model to give equal priority to each class regardless of how frequently it occurs.

Another possible solution is to over-sample the minority class and under-sample the majority class so that, within a single batch, the number of samples from each class is roughly equal. With this method there is no need to tweak the loss function. A sketch of such balanced sampling follows below.
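A minimal PyTorch sketch of this idea, assuming binary 0/1 labels; the dataset, sizes, and batch size are purely illustrative:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy dataset: 90 negative and 10 positive samples with 8 features each.
features = torch.randn(100, 8)
labels = torch.cat([torch.zeros(90), torch.ones(10)]).long()
dataset = TensorDataset(features, labels)

# Give each sample a weight inversely proportional to its class frequency,
# so positives and negatives are drawn with (roughly) equal probability.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                replacement=True)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)
```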

  2. The problem of a small dataset can be addressed in two different ways:

(i) The lower layers of a deep learning model usually identify small, general features of an object (edges, curves, etc.). Many models are available online that have been pre-trained on millions of images. Their weights can be reused, and only the upper layers fine-tuned on our small dataset. This process is called transfer learning. A sketch is given below.
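A minimal sketch using torchvision as an example (the backbone, the pre-trained weights, and the number of output classes are illustrative choices, not a recommendation):

```python
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet to reuse its general low-level features.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pre-trained layers so their weights are not updated during training.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new head for our task (here, 2 classes)
# and fine-tune only this part on the small medical dataset.
model.fc = nn.Linear(model.fc.in_features, 2)
```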

(ii) Using data augmentation, the amount of training data can be multiplied by applying suitable transformations to the images. This discourages the model from memorizing irrelevant features, a common problem in models trained on small datasets. A sketch is given below.
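A minimal sketch of an augmentation pipeline built with torchvision transforms; the specific transforms and parameters are illustrative and should be chosen so that they do not destroy diagnostically relevant content:

```python
from torchvision import transforms

# Random transformations applied on the fly, so the model rarely sees
# exactly the same image twice during training.
train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
])
```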

Even after a model has been trained with the above-mentioned challenges in mind, it will face new ones during testing. When estimating the accuracy of a medical diagnosis model, the following points must be taken into consideration:

  1. A patient may go to the hospital for a test several times, so the database may contain several records for the same person. When these data are split into train, validation, and test sets, samples from the same patient can wind up in different sets. This causes data leakage and gives a falsely optimistic estimate of accuracy. The split should therefore ensure that no patient's samples appear in more than one of the train, validation, and test sets (see the grouped-split sketch after this list).
  2. Suppose a dataset contains 80% negative and 20% positive samples, and the model is trained so poorly that it outputs 0 (negative) for every input. If the dataset is split randomly, the test set will also have an uneven (80–20) distribution, and the model's test accuracy will be 80%, which is misleading. Care must therefore be taken that the samples of the test set (and validation set) are distributed evenly amongst the classes, so that the test accuracy reflects the actual performance of the model. This does create more imbalance in the training set, but that can be mitigated with a weighted loss function or over/under-sampling of the training data.
  3. In medical datasets, there is also the problem of selecting the ground truth. Since the data are labelled by human experts or by tests (both of which carry some level of inaccuracy), the model can only ever be as accurate as those labels. So, careful consideration should go into how the ground truth is established. It can be selected by:

(i) Having a group of experts discuss each case and taking their consensus decision as the ground truth; alternatively, collecting independent opinions from multiple experts and taking the majority vote as the ground truth.

(ii) Performing more definitive tests to increase the reliability of the ground truth. For instance, a diagnosis made from an X-ray can be further verified with a CT scan. But such tests are expensive and are not always available for the same sample, so option (i) is what is mostly used in practice.
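Regarding point 1 above, a minimal sketch of a patient-level split using scikit-learn's GroupShuffleSplit; the arrays are synthetic placeholders standing in for real image features, labels, and patient IDs:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Illustrative arrays: one row per image, with the patient ID of each image.
# Several rows can share a patient_id because a patient may be scanned many times.
X = np.random.rand(1000, 32)                       # image features (placeholder)
y = np.random.randint(0, 2, size=1000)             # labels
patient_ids = np.random.randint(0, 300, size=1000) # patient ID per image

# Split so that all images of a given patient land in exactly one side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# Sanity check: no patient appears in both sets.
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
```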

Many deep learning enthusiasts working on their first medical diagnosis project run into these problems. I hope this article helps them structure and solve their deep learning projects more effectively.
