
The world’s first medical AI in production – eye diagnosis by DeepMind

Putting an AI system into the world’s oldest eye hospital, and the largest one in Europe and North America – Moorfields Eye Hospital

Photo by v2osk on Unsplash

If you have been following me for a while, you know that I am heavily invested in medical AI. Over the last few months and years I have read tons of amazing papers reporting tremendous performance, and I have written reviews of some of them. A few days ago, though, I found myself wondering: were those papers actually used, or did they just become history? That’s why I decided to review this paper – the one that actually made it from a research lab into production.

Not too long ago, DeepMind released an AI system that automates the diagnosis of specific eye diseases at Moorfields Eye Hospital in the UK. I think we all know by now that you can get 0.95+ AUC on almost any supervised image classification task using SOTA models like ResNet or EfficientNet, which is great. But is that it? Does that mean the model can be put into production? Well, no.

Recently, I was put in a unique position: I was asked to implement an AI system that diagnoses glaucoma (an eye disease) at an actual hospital. I thought this would be easy given a labeled dataset and a ResNet model. I soon found out I was only thinking that because I was considering nothing beyond optimizing a performance metric, which is typically all you do in projects like Kaggle competitions.
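For context, this is roughly the Kaggle-style baseline I had in mind – a minimal sketch with illustrative paths and class names, assuming a folder of labeled images:

```python
# A minimal, Kaggle-style baseline: fine-tune a pretrained ResNet on a
# labeled image folder. Dataset path and class names are illustrative.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Expects data/train/<class_name>/*.png, e.g. "glaucoma" and "healthy"
train_set = datasets.ImageFolder("data/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)  # binary: glaucoma vs. healthy

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```

This part really is the easy bit; everything that follows is about what a baseline like this doesn’t cover.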

I want to focus here on the real-world challenges that DeepMind had to overcome to put an AI system into production, especially in healthcare, where missing a disease (a false negative) simply can’t be tolerated. Because I think AI research is in a very good place, while production AI just isn’t.

We don’t just want this to be an academically interesting result – we want it to be used in real treatment. So our paper also takes on one of the key barriers for AI in clinical practice.

Source: DeepMind

DeepMind’s system is being used at Moorfields Eye Hospital to diagnose over 50 sight-threatening eye diseases. For each scan, it gives doctors a segmentation map (for interpretability) and a recommendation on whether the patient requires urgent care. I don’t want this to be just another article explaining what tricks they used in their models to reach good performance; I want it to highlight the unique issues they faced.

The system we have developed seeks to address this challenge. Not only can it automatically detect the features of eye diseases in seconds, but it can also prioritise patients most in need of urgent care by recommending whether they should be referred for treatment. This instant triaging process should drastically cut down the time elapsed between the scan and treatment, helping sufferers of diabetic eye disease and age-related macular degeneration avoid sight loss.

Source: DeepMind

The top challenges faced by DeepMind:

1. You have to be careful about the dataset, and I don’t mean its quality! – Generalisation beyond cross-validation

I am pretty sure this isn’t the first time someone has tried to put a medical AI system into production. I read an article a while ago about Google trying to deploy a diabetic retinopathy detection system in India, and it failed – you can read more about that here. The main gist was that even though they had a huge dataset with high-quality images and labels, that dataset wasn’t a good representation of real-world data, so the AI couldn’t generalize properly.

Typically, at the start of an AI project, you just look for a high-quality dataset, then you build your model and optimize. But you also have to think about whether that dataset will actually resemble the real-world one. You usually won’t be able to tell for sure until you try to push your system to production, but keeping this principle in mind helps.

What I mean is that their dataset didn’t have any images with poor quality, artifacts, or noise – and that’s just not realistic! A possible solution is a preprocessing step to remove those, as sketched below.
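As a rough illustration of the kind of cleanup step I mean (my own sketch, not DeepMind’s pipeline), OpenCV ships standard denoising filters:

```python
# Sketch of artifact/noise cleanup as a preprocessing step (my own
# illustration, not DeepMind's pipeline). Median filtering suppresses
# the speckle-like noise that is common in OCT scans.
import cv2

scan = cv2.imread("oct_scan.png", cv2.IMREAD_GRAYSCALE)  # illustrative path

# Median blur removes salt-and-pepper / speckle noise while keeping edges.
denoised = cv2.medianBlur(scan, 5)

# Non-local means is a stronger (but slower) alternative for textured noise.
denoised_nlm = cv2.fastNlMeansDenoising(scan, None, h=10,
                                        templateWindowSize=7,
                                        searchWindowSize=21)

cv2.imwrite("oct_scan_denoised.png", denoised)
```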

Okay, so what did DeepMind do differently to solve this issue? Originally, they made a similar mistake: they trained their models on only one type of OCT scan. When they then tested on a different type of scan (produced by a different brand of OCT scanner), their models failed. The good news is that the vast majority of OCT scans come from just two scanner types, so they trained their models on both. They also added another component to improve generalization, which I will talk about next.
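One simple way to check this kind of device-level generalization yourself – my own sketch, assuming each prediction is tagged with the scanner that produced it (the brand names and numbers below are purely illustrative) – is to slice your validation metrics per scanner type instead of reporting one pooled AUC:

```python
# Sketch: report AUC per OCT scanner type instead of one pooled number,
# assuming each prediction is tagged with the device that produced it.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # illustrative labels
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.1, 0.7, 0.6])   # model outputs
scanner = np.array(["topcon", "topcon", "topcon", "heidelberg",
                    "heidelberg", "heidelberg", "topcon", "heidelberg"])

for device in np.unique(scanner):
    mask = scanner == device
    auc = roc_auc_score(y_true[mask], y_score[mask])
    print(f"{device}: AUC = {auc:.3f}")  # a big gap here = poor generalization
```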

2. Users of the AI system don’t just need predictions, they also need to understand how the system reached this prediction

Photo by Moritz Kindler on Unsplash

This might sound obvious, but believe me, it’s not. When I presented a deep learning model I had built for a client, providing all the metrics to prove its performance wasn’t enough for them. They needed some sort of interpretation returned by the system. And I can understand why this is needed in a medical setting.

To do this, DeepMind added a preprocessing stage in front of their supervised network: an ensemble of five U-Nets that produce segmentation maps of the OCT scans before passing them on to the supervised networks. Doctors can look at those maps in the AI system as evidence that it really understands the different parts of the OCT scan.
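To make the two-stage idea concrete, here is a rough sketch with stand-in modules (the shapes, class counts, and tiny networks are illustrative, not DeepMind’s actual architecture):

```python
# Sketch of the two-stage design: an ensemble of segmentation networks
# produces (averaged) tissue maps, which a separate classifier consumes.
# The tiny conv stand-ins below are placeholders for real U-Nets.
import torch
import torch.nn as nn

N_TISSUE_CLASSES = 15   # illustrative; one channel per tissue type
N_DECISIONS = 4         # illustrative referral decisions

class TinySegNet(nn.Module):
    """Placeholder for a full U-Net; maps a scan to per-pixel tissue logits."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, N_TISSUE_CLASSES, kernel_size=3, padding=1)
    def forward(self, x):
        return self.conv(x)

ensemble = [TinySegNet() for _ in range(5)]   # 5 independently trained nets
classifier = nn.Sequential(                   # placeholder classifier
    nn.Conv2d(N_TISSUE_CLASSES, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, N_DECISIONS),
)

scan = torch.randn(1, 1, 256, 256)            # fake OCT slice

with torch.no_grad():
    # Average the 5 segmentation maps; disagreement between ensemble
    # members is itself a useful signal that a scan is out of distribution.
    seg_maps = torch.stack([net(scan).softmax(dim=1) for net in ensemble])
    avg_map = seg_maps.mean(dim=0)
    decision = classifier(avg_map).softmax(dim=1)

print(decision)  # the averaged map doubles as the interpretable output
```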

And actually, when their models failed on the unseen scan types, the segmentation maps showed it too: the different parts of the OCT scans weren’t correctly segmented.

I also want to stress the ensemble of five different networks, as I think this is a great way to boost the generalization of your AI system (one of the networks might fail, but it’s unlikely all of them will).

Another great way to provide interpretability for neural networks is saliency maps. A saliency map is a type of heat map showing how strongly each part of the input influenced the CNN’s prediction: it highlights the regions of the image that helped reach the classification and the regions that didn’t contribute.

Visualizing Your Convolutional Neural Network Predictions With Saliency Maps
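For the curious, here is a minimal vanilla-gradients saliency sketch (a standard technique, not tied to DeepMind’s system): take the gradient of the winning class score with respect to the input pixels.

```python
# Minimal vanilla-gradient saliency map: the gradient of the top class
# score w.r.t. the input highlights the pixels that drove the prediction.
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()

image = torch.randn(1, 3, 224, 224, requires_grad=True)  # stand-in input

scores = model(image)
top_class = scores.argmax(dim=1)
scores[0, top_class.item()].backward()  # backprop only the winning score

# Max over color channels gives one importance value per pixel.
saliency = image.grad.abs().max(dim=1).values.squeeze()
print(saliency.shape)  # (224, 224) heat map, ready to overlay on the scan
```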

3. Even if your model performs well, there are certain criteria that you must meet

I originally thought that ML is all about optimizing a certain metric, but it’s not; you have to consider a lot of other factors. In a clinical setting, a false negative is much worse than a false positive: if a patient actually has the disease and you fail to flag it, the consequences can be devastating, whereas the other way around mostly costs an unnecessary referral. To produce a successful medical AI model, you have to keep the false-negative rate low.
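One practical consequence (my own sketch, not from the paper): rather than classifying at the default 0.5 probability cutoff, you can pick the operating threshold that caps the miss rate at a clinical target.

```python
# Sketch: instead of the default 0.5 cutoff, choose the operating threshold
# that keeps the miss rate (false negatives) below a clinical target.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])   # illustrative labels
y_score = np.array([0.95, 0.7, 0.4, 0.35, 0.3, 0.9, 0.2, 0.6, 0.8, 0.1])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

TARGET_SENSITIVITY = 0.99           # i.e. miss at most 1% of diseased patients
ok = tpr >= TARGET_SENSITIVITY
threshold = thresholds[ok][0]       # highest threshold meeting the target
print(f"operate at {threshold:.2f}, FPR there = {fpr[ok][0]:.2f}")
```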

A significant part of the study goes into researching:

  1. When those errors (false positives and false negatives) occur
  2. Why they occur
  3. And how to reduce them

A very good example of this methodology is this paper, where the authors tackle a task very similar to the one here (an AI system to diagnose eye diseases). It has a complete section called "false-negative and false-positive findings"; I won’t go into the details since they are quite specific. The main takeaway I want to highlight is that even if you have very few false positives or false negatives, examining and evaluating those exact data points will heavily boost both the performance of your ML system and your understanding of it.
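Here is a small sketch of what that review loop can look like (illustrative arrays; in a real project each ID would map back to a scan you can pull up and re-examine):

```python
# Sketch: surface the exact false-positive and false-negative cases so a
# human can re-examine the underlying scans, not just read a summary metric.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # illustrative labels
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 1])   # illustrative predictions
scan_ids = np.array(["s01", "s02", "s03", "s04", "s05", "s06", "s07", "s08"])

false_positives = scan_ids[(y_pred == 1) & (y_true == 0)]
false_negatives = scan_ids[(y_pred == 0) & (y_true == 1)]

print("re-examine (healthy flagged as diseased):", false_positives)
print("re-examine (diseased marked as healthy): ", false_negatives)
```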

Essentially, what I think those other criteria are:

  • Proof of generalisation
  • Methods to interpret the model, such as segmentation maps, heat maps, activation maps, and saliency maps (check out DeepDream if interested)

And those two criteria require tons of work – in some cases maybe even more than optimizing performance. I also think that when pushing models into production, you will have to think about other ways of testing generalization (on top of cross-validation metrics), which can be quite challenging. The only method I can think of is to test the model in a production scenario, but I understand that this often isn’t possible. Let me know in the comments what you think.

Final thoughts and takeaway

To sum up, I just wanted to highlight some of the most significant lessons I took away while attempting to build a production medical AI system. And by all means, I am pretty sure that even when a paper doesn’t get turned into a product itself, it probably helps other papers get there!

