Notes from Industry

Medical AI: Why Clinicians Swipe Left

The most common reason your medical AI will be rejected by clinicians and how to overcome it

Vaughn Spurrier
Towards Data Science
11 min read · Dec 1, 2021


A painting of a histopathology slide, showing details of cells visible under a microscope.
Image by Evin Felix.

So, you’ve found success training algorithms for industrial use, and now you want to train an algorithm to help doctors help patients. Developing algorithms for medical applications is a challenging pursuit but also a worthwhile one; artificial intelligence holds the potential to revolutionize the practice of medicine just like automation has revolutionized virtually every other industry. Public medical datasets are readily available. Competitions on medical applications are regularly held on Kaggle and other sites. And yet, very few algorithms are FDA approved for clinical use.

A full discussion about why this is the case is beyond the scope of this article (discussions on this topic abound online). However, this article will help you overcome one of the most common reasons why medical algorithms are rejected by clinicians. Data scientists often develop medical machine learning models and neural networks using datasets whose distributions of medical conditions differ sharply from those seen in real-world settings. This discrepancy can make a model that performs well on a balanced, curated dataset virtually useless in the clinic. To address it, researchers should validate models for medical applications using two datasets with different class distributions: one that is enriched with rare conditions, useful for generating statistical power, and a second dataset whose condition distribution matches the distribution the model will face in the clinic.

An Illustrative Example

Consider the most prevalent cancer in America: skin cancer (American Cancer Society 2021). Each year, more people in the United States are diagnosed with skin cancer than all other cancers combined. Overwhelmingly, the most life-threatening skin cancer is invasive melanoma because melanoma often metastasizes and grows uncontrolled in other organs like the lymph nodes, liver, and brain (Skin Cancer Foundation 2021). Tissue specialists called histopathologists diagnose most skin growths as benign (think moles, freckles, skin tags). Less than 1% of skin conditions in a typical clinic are diagnosed as invasive melanoma (Ianni et al. 2020).

Standard practice in diagnosing skin lesions is removing the growth through a biopsy and visualizing the tissue of the growth using a microscope. Over 5 million skin biopsies were performed in the United States in 2015. Most dermatopathology, or skin pathology, clinics can barely keep up with the deluge of biopsies. The amount of time between an individual biopsy and the patient receiving a diagnosis can stretch from days to weeks, which means that patients may be forced to wait to begin critical treatments. For many reasons, pathology is currently undergoing a transformation from manual microscope analysis of biopsies to the review of high-resolution scanned images of these biopsies. Computer vision models trained on these scans have huge potential to expedite diagnosis of aggressive disease by identifying likely aggressive specimens and flagging them for immediate diagnosis.

Let’s imagine a data scientist wants to train a model to detect invasive melanoma within scans of skin biopsies. The simplest path forward would be to collect a dataset of about half melanoma, half non-melanoma, while making sure to collect a diverse set of non-melanoma scans. Let’s imagine the model is trained to maximize the F1 score, and on a withheld test set the trained model achieves 90% accuracy with balanced error rates (nowhere near an easy feat). The researcher reports these results to clinical partners, and together they run a preliminary prospective study, meaning they measure model performance on all patients who visit the clinic during the next 6 months to determine the utility of the model.

The Problem

To the researcher’s shock and dismay, the clinician measures error rates nowhere near those seen during development: the model produces many more false positive detections than true positive detections. The fraction of a model's positive detections that are true positives (rather than false positives) is its precision (also known as positive predictive value). During development, the researcher measured a precision near 90%, but within the clinical study the clinician found the precision was closer to 10%, a staggering gap of 80 percentage points. How could there be such a huge gap between the performance during development and the performance during the study?

The obvious answer is that the class distribution of the dataset during development was not the same as the class distribution the model faced in the clinic. In a clinical setting, physicians diagnose disease as cases arrive in chronological order, so the class distribution follows disease prevalence in the general population. Many more patients in the general population have benign conditions than rare, life-threatening conditions (thankfully!). A dataset containing every case a clinic sees over some period of time (for example, all of the cases from a clinic within a year) therefore has a long tail: most of the observations are of benign conditions, and only a few capture the rarest, most life-threatening conditions.

In this article, we will refer to a dataset that follows the chronological clinical class distribution as a “sequential dataset.” During development, the researcher used an enriched dataset that contained more melanomas than a doctor would see in a clinical workload of cases. If a model detecting rare disease is validated on an enriched, balanced test dataset, the error rate in a sequential workflow will differ from the error rate measured during model development because, in the clinic, the model takes many more “shots” at classifying non-melanoma than melanoma specimens. Let’s discuss these two types of datasets in more detail.

An Interactive Illustration of the Problem

In the above example, the properties of the development dataset and the properties of the clinical condition distribution resulted in a large difference between development and clinical metrics. To illustrate how the properties of the datasets affect model performance in development and in the clinic, we include an interactive figure hosted on Google Colab that simulates model performance.

Five parameters control the simulation, and three plots are created. The first two parameters are the frequency of the positive class in the clinic and in development. The third parameter is the number of cases in the development dataset. The final two parameters are the false positive rate and the false negative rate measured on the test set during development. The parameters that control each plot are shown above the plot. We assume that the modeling test set is 15% of the total development dataset.

The first output plot indicates the uncertainty in the model true positive rate by showing the mean (sky blue) and 95% confidence interval (magenta) of a beta distribution parameterized by the numbers of true positive (alpha) and false negative (beta) detections in the test set. The beta distribution models the uncertainty in a "success rate" by measuring how likely the observed pattern of successes and failures is to have arisen from an underlying ground-truth success rate (for more information, see Beta Distribution — Intuition, Examples, and Derivation). With enough observations of the positive class in the development test set, the mean of this distribution converges to the model true positive rate (1 minus the false negative rate), because that rate is the most likely to have produced the proportion of true positives measured during development. However, the measured rate is not the only underlying success rate that could have produced the single observed pattern of successes and failures found during model testing. If the mean of this distribution sits far from its peak, or if the error bars on the distribution are too wide, the development-measured true positive rate is not sufficiently statistically powered, meaning there is a significant chance that the true positive rate in the clinic will be very different from the rate measured in development.
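
To make the mechanics concrete, here is a minimal sketch (in Python, using scipy.stats.beta) of the uncertainty estimate just described. It is not the code behind the Colab figure, and the true positive and false negative counts below are hypothetical.

```python
# A minimal sketch of the uncertainty estimate described above: model the
# true positive rate with a Beta distribution parameterized by the counts of
# true positives (alpha) and false negatives (beta) in the development test
# set. The counts here are hypothetical.
from scipy import stats

true_positives = 45    # positive test cases the model caught (hypothetical)
false_negatives = 5    # positive test cases the model missed (hypothetical)

tpr_posterior = stats.beta(true_positives, false_negatives)
mean_tpr = tpr_posterior.mean()                 # converges to 1 - false negative rate
ci_low, ci_high = tpr_posterior.interval(0.95)  # 95% confidence interval

print(f"Estimated true positive rate: {mean_tpr:.3f}")
print(f"95% interval: [{ci_low:.3f}, {ci_high:.3f}]")
```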

The second and third plots show the performance of the model on 10,000 clinical cases, using the parameterized error rates. The second plot shows the counts of each classification outcome on a log scale, and the third shows statistics computed from those counts: the accuracy, the recall, and the precision.
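
The simulation behind these plots can be approximated with a few lines of NumPy: draw ground-truth labels at the clinical prevalence, flip each prediction according to the class-conditional error rates measured in development, and summarize the resulting confusion matrix. This is a rough sketch rather than the exact Colab code, and all parameter values are hypothetical.

```python
# Rough sketch of the clinical simulation: apply development-measured error
# rates to 10,000 cases drawn at the clinical prevalence, then summarize the
# resulting confusion matrix. All parameter values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

n_cases = 10_000
clinical_prevalence = 0.01   # frequency of the positive class in the clinic
false_negative_rate = 0.10   # measured on the development test set
false_positive_rate = 0.10   # measured on the development test set

# Draw ground-truth labels at the clinical prevalence.
labels = rng.random(n_cases) < clinical_prevalence

# Flip each prediction according to the class-conditional error rates.
flip = np.where(labels,
                rng.random(n_cases) < false_negative_rate,
                rng.random(n_cases) < false_positive_rate)
predictions = labels ^ flip

tp = np.sum(predictions & labels)
fp = np.sum(predictions & ~labels)
fn = np.sum(~predictions & labels)
tn = np.sum(~predictions & ~labels)

accuracy = (tp + tn) / n_cases
recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```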

A screenshot of an interactive figure showing how dataset distribution in development and the clinic impacts model performance.
Figure 1: An interactive figure in Python Plotly Dash (screenshot shown here) hosted on Google Colab, simulating development and clinical performance of a model with balanced or imbalanced datasets. Image by author.

The Inadequacies of an Enriched Dataset

In the example above, the clinical precision of the model was so different from the development precision because melanoma makes up such a small percentage of a pathologist's caseload. As long as the percentage of the positive class in development is so far from the percentage in the clinic, an enriched dataset will provide a poor estimate of clinical precision (Figure 2).
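
The effect shown in Figure 2 can also be checked with a back-of-the-envelope calculation: with the class-conditional error rates held fixed, precision depends on prevalence through Bayes' rule, precision = TPR * p / (TPR * p + FPR * (1 - p)). The sketch below assumes the hypothetical 90% true positive rate and 10% false positive rate from the running example.

```python
# Precision as a function of prevalence, with class-conditional error rates
# held fixed. A back-of-the-envelope check of Figure 2; the 0.9/0.1 rates
# are hypothetical values from the running example.
def precision(prevalence, true_positive_rate=0.9, false_positive_rate=0.1):
    """P(disease | positive prediction) via Bayes' rule."""
    true_pos = true_positive_rate * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

print(f"Enriched 50/50 dataset: precision = {precision(0.50):.2f}")  # ~0.90
print(f"Clinical 1% prevalence: precision = {precision(0.01):.2f}")  # ~0.08
```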

A bar graph motion figure showing degradation of precision in the clinic as the clinical frequency of the positive class is decreased and error rates remain constant.
Figure 2: While development precision on a balanced dataset is fixed at 90%, clinical precision falls as frequency of the positive class in the clinic is reduced from 50% to 1%. Image by author.

The Inadequacies of a Sequential Dataset

Is a long-tailed distribution of conditions unique to skin cancer? Fortunately, no, it is not. Disease prevalence and disease aggressiveness are generally distributed with correlated long tails: rare diseases are often the most life-threatening. From an evolutionary perspective, aggressive common diseases are the precursors of extinction, so much so that English has a term for a common aggressive disease: pandemic. Thankfully, pandemics are rare, even though we happen to be experiencing one right now. However, this leaves researchers with two options for training models that detect life-threatening conditions: option A, train on an extremely large prospective dataset that contains plenty of observations of rare conditions; or option B, train on an enriched dataset. Option A is usually practically unattainable, so option B remains the inevitable choice.

In developing a supervised binary classification model to detect melanoma in diagnostic skin tissue images, it is crucial that the model performs well on the most life-threatening type of skin condition: invasive melanoma. In a huge dataset containing biopsies from the last 5,000 patients who visited a skin clinic, only around 50 of the specimens are expected to be invasive melanoma. Accounting for typical dataset partitioning for validation and testing, the researcher is left with only around 35 melanoma biopsies for training and roughly 7 each for validation and testing. The first issue this raises is that it is difficult to train a model with so few examples of the positive class; cross-validation can only take model development so far. The second, more important issue is that having so few test examples of the rare, life-threatening condition makes it extremely difficult to generate the statistical power needed to demonstrate good test performance on that condition (Figure 3). This is why public challenge datasets for disease detection modeling are relatively balanced: they have been enriched with examples of rare conditions to provide roughly as many positive as negative examples.
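
To see how little statistical power roughly 7 positive test specimens provide, compare the width of the 95% interval on sensitivity with 7 positives against an enriched test set with, say, 750 positives, using the same beta-distribution view as before. The counts and the assumed miss rate (about 1 in 7) are hypothetical.

```python
# Why ~7 positive test specimens cannot power a sensitivity claim: compare
# the 95% interval width with 7 positives (sequential dataset) versus 750
# positives (enriched dataset), assuming a hypothetical model that misses
# roughly 1 in 7 positive cases.
from scipy import stats

for n_positives in (7, 750):
    true_positives = round(n_positives * 6 / 7)   # ~86% of positives caught
    false_negatives = n_positives - true_positives
    low, high = stats.beta(true_positives, false_negatives).interval(0.95)
    print(f"{n_positives:4d} positive test cases: "
          f"95% interval on sensitivity = [{low:.2f}, {high:.2f}]")
```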

Figure 3: As the number of development cases increases, the bounds on the true positive rate become tighter, but even an extremely large sequential development dataset with a small percentage of the positive class does not have enough test observations of the positive class. Image by author.

The Solution

So, what can be done? If a researcher reports statistics solely on an enriched test dataset, the performance in the clinic on rare conditions will be radically different from the reported performance. If the researcher develops a model solely using a sequential dataset, measurements of model performance on rare disease subtypes will be statistically underpowered. How can these issues be overcome?

It is imperative that models developed for clinical use report test statistics on a sequential test dataset, so that the performance statistics the researcher measures match the statistics produced in the clinic. If a sequential dataset proves impossible to obtain, the distribution of condition prevalence in the general population can be simulated with bootstrapped random sampling of the enriched test dataset at the frequency observed in the general population.
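
If only an enriched test set is available, the bootstrapping mentioned above can be sketched as follows: repeatedly resample positives and negatives with replacement so that positives appear at the clinical frequency, and report the spread of the clinical metric of interest across resamples. The function below is a minimal illustration rather than a validated procedure; y_true and y_pred stand in for hypothetical arrays of ground-truth labels and model predictions on the enriched test set.

```python
# Minimal sketch of bootstrapped resampling of an enriched test set at the
# clinical prevalence, reporting the spread of clinical precision.
# `y_true` and `y_pred` are hypothetical boolean arrays.
import numpy as np

def bootstrap_clinical_precision(y_true, y_pred, clinical_prevalence=0.01,
                                 n_cases=10_000, n_resamples=1_000, seed=0):
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    pos_idx = np.flatnonzero(y_true)
    neg_idx = np.flatnonzero(~y_true)
    n_pos = int(round(n_cases * clinical_prevalence))

    precisions = []
    for _ in range(n_resamples):
        # Sample with replacement at the clinical class frequency.
        idx = np.concatenate([rng.choice(pos_idx, n_pos, replace=True),
                              rng.choice(neg_idx, n_cases - n_pos, replace=True)])
        tp = np.sum(y_pred[idx] & y_true[idx])
        fp = np.sum(y_pred[idx] & ~y_true[idx])
        precisions.append(tp / max(tp + fp, 1))
    return np.percentile(precisions, [2.5, 50, 97.5])

# Example usage (hypothetical enriched test set):
# low, median, high = bootstrap_clinical_precision(y_true, y_pred)
```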

To generate statistical power for detecting rare, life-threatening conditions, it is also imperative that models for medical applications are trained using enriched datasets. Gigantic sequential datasets with enough observations of rare classes are prohibitively difficult to collect, and too few observations of rare classes lead to poor model training and little confidence in measured model performance.

When performance in both the enriched and sequential regimes is reported, physicians can use a model with confidence: they are well informed of its performance on rare, aggressive diseases, and they have correct expectations of how it will perform within their day-to-day clinical workflow.

Often a model can be tuned to decrease one type of error at the expense of increasing others. To balance error rates in the clinic, a tunable model that has higher clinical recall than precision can sometimes be adjusted toward a lower false positive rate while keeping clinical accuracy roughly constant (through weighting a loss function, or post-processing model predictions, among other techniques). Tuning the errors during model development so that they are balanced within a sequential dataset alleviates the discrepancy between the expected performance measured during development and the actual performance of the model deployed in the clinic. Clinicians need to know both what to expect of a model's performance in the clinic and how the model performs on a well-powered dataset in order to decide whether the model is acceptable for use in their practice.
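
One simple version of the prediction post-processing mentioned above is to sweep the decision threshold on a sequential (or prevalence-matched bootstrapped) validation set and pick the threshold at which false positives and false negatives are roughly balanced, which also brings precision and recall together. The sketch below assumes hypothetical arrays y_true and scores holding ground-truth labels and model output probabilities for that validation set.

```python
# Sweep the decision threshold on a sequential validation set and pick the
# threshold that best balances false positives against false negatives.
# `y_true` and `scores` are hypothetical arrays.
import numpy as np

def balance_errors_threshold(y_true, scores,
                             thresholds=np.linspace(0.01, 0.99, 99)):
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    best_threshold, best_gap = 0.5, np.inf
    for t in thresholds:
        y_pred = scores >= t
        fp = np.sum(y_pred & ~y_true)
        fn = np.sum(~y_pred & y_true)
        gap = abs(fp - fn)            # distance from balanced errors
        if gap < best_gap:
            best_threshold, best_gap = t, gap
    return best_threshold

# Example usage (hypothetical sequential validation data):
# threshold = balance_errors_threshold(y_true_sequential, model_scores)
```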

Final Thoughts

There are many opportunities for algorithms to improve medical practice, and every modeling situation is unique. In this article, we have focused on the case in which the development stakeholders desire balanced errors in the clinic. There are certainly other valid objectives for training medical AI. For example, even a model with low precision might be useful to a particular clinic if the objective is to maximize recall: the clinic might find value in a collection of flagged cases that is virtually assured to contain all the positive cases, even if the precision is low, perhaps allowing them to prioritize cases effectively. Viewed from another angle, the same high-recall, low-precision model also produces a very pure collection of cases, those it does not flag, that is virtually assured not to contain the positive class. Medical AI is still very much a young discipline, so the most useful applications for medical AI are likely yet to be discovered.

However, to develop useful algorithms and to be able to explain their performance to doctors, researchers need a clear understanding of the differences between the enriched datasets useful for training models and the actual clinical distribution of conditions, which is likely long-tailed. If artificial intelligence researchers obtain a deeper understanding of the challenges of deploying algorithms in clinics, that knowledge will accelerate the artificial intelligence revolution in the practice of medicine, transforming patient care for the better.

References

American Cancer Society. 2021. Cancer Facts & Figures 2021. https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annual-cancer-facts-and-figures/2021/cancer-facts-and-figures-2021.pdf.

Ianni, Julianna D., Rajath E. Soans, Sivaramakrishnan Sankarapandian, Ramachandra Vikas Chamarthi, Devi Ayyagari, Thomas G. Olsen, Michael J. Bonham, et al. 2020. “Tailored for Real-World: A Whole Slide Image Classification System Validated on Uncurated Multi-Site Data Emulating the Prospective Pathology Workload.” Scientific Reports 10 (3217). https://doi.org/10.1038/s41598-020-59985-2.

Skin Cancer Foundation. 2021. “Melanoma Overview.” https://www.skincancer.org/skin-cancer-information/melanoma/.
