Machine Learning methods to aid in Coronavirus Response

Battling Coronavirus with Data Science and Artificial Intelligence

Published in

Towards Data Science

11 min read · Mar 18, 2020

With Coronavirus on everyone’s mind and forcing almost all of us indoors, many in the ML community are wondering how they might help. While there have been other articles on fighting coronavirus with AI, few have offered a truly comprehensive view. Therefore, I decided to bring together a list of datasets and use cases of machine learning applied to coronavirus. I understand the criticism that when you have a hammer every problem looks like a nail; in other words, to a machine learning practitioner/data scientist every problem seems to have an ML solution. Nevertheless, I believe that machine learning and data analytics can help accelerate solutions and minimize the impacts of the virus in conjunction with all the other great research and planning going on.

As we will see below, machine learning can help expedite the drug development process, provide insight into which current antivirals might provide benefits, forecast infection rates, and help screen patients faster. Additionally, although not currently researched, I think there are several other appropriate application areas. That said, there are many barriers related to limited training data, the ability to integrate complex structures into DL models, and, perhaps most importantly, access to the available data. I’m not going to detail the techniques below (not that I could, as my chemistry/drug-development knowledge is severely lacking), but instead aim to summarize the different resources. Also, I will be creating a central GitHub repository to list resources for using AI to combat coronavirus. Feel free to make a pull request if you find another helpful resource/dataset.

First a couple of quick notes and words of caution:

Almost all the papers and datasets mentioned here are very recent and thus have not been peer-reviewed or verified. Treat claims with skepticism.
In addition to the normal best practices we use in machine learning, I recommend consulting with at least one specialist in the relevant field (e.g., an epidemiologist, chemist, or biologist). This provides a sanity check, as machine learning researchers can sometimes misinterpret the problems that need to be solved. Also, it is important not to contribute to the already significant amounts of misinformation going around.

Datasets and Applications of Machine Learning to the Coronavirus

Most applications fall under one of four areas:

(A) Predict the structure of proteins and their interactions with chemical compounds to facilitate new antiviral drugs/vaccines or recommend current drugs.

Methods here rely on applying deep learning to molecules such as proteins. This is a somewhat niche area with a steep learning curve. However, breakthroughs here could potentially pave the way to vaccines or an effective antiviral. A majority of the techniques listed below use convolutional neural networks in some way to model molecules or molecular interactions.

1. Deep Learning Based Drug Screening for Novel Coronavirus 2019-nCov (Zhang, et al.). This article studies using deep learning to predict which current antivirals might help patients with coronavirus. The authors use a modified DenseNet (with the convolution replaced by a fully connected layer) to predict protein-ligand interactions. They can then use the model with the RNA sequences of coronavirus together with chemical compounds to predict which drugs work the best. The authors conclude that more research is necessary, but suggest that Adenosine, Vidarabine, and other compounds might potentially help.

2. Predicting commercially available antiviral drugs that may act on the novel coronavirus (2019-nCoV), Wuhan, China through a drug-target interaction deep learning model. This is similar to the paper discussed above; however, the authors look at commercially available drugs and take a completely different modeling approach. Here the authors use a network named the “Molecule transformer-drug target” or MT-DTI. Fascinatingly, as those familiar with BERT might have guessed, it is at its core the same architecture. However, in this case the network was trained on a large dataset of molecules written in SMILES, a notation that represents each molecule as a text string, to encode and decode each molecule. The model can therefore form effective representations of molecules in much the same way it does for textual data. The authors then fine-tuned this pre-trained model to predict “binding affinity values between commercially available antiviral drugs and target proteins.” They found that “the 2019-nCoV 3C-like proteinase was predicted to bind with atazanavir.” Atazanavir is an antiviral medication used to treat HIV/AIDS. This provides a good use case of how new deep learning architectures from different domains can be adapted (though obviously this should not be taken as medical advice).
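To make the "molecules as text" idea concrete, here is a minimal sketch of how a SMILES string could be split into tokens and mapped to integer ids before being passed to a transformer. The regular expression and the helper names `tokenize_smiles` and `build_vocab` are illustrative assumptions, not the tokenizer used by MT-DTI.

```python
import re

# Illustrative pattern (an assumption, not MT-DTI's tokenizer):
# bracket atoms first, then two-letter halogens, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom, bond, branch, and ring tokens."""
    return SMILES_TOKEN.findall(smiles)

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Map each distinct token to an integer id, reserving 0 for padding."""
    vocab = {"<pad>": 0}
    for smiles in corpus:
        for token in tokenize_smiles(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab

# Aspirin and chloroethane written as SMILES strings.
corpus = ["CC(=O)Oc1ccccc1C(=O)O", "CCCl"]
vocab = build_vocab(corpus)
encoded = [[vocab[t] for t in tokenize_smiles(s)] for s in corpus]
```

Once molecules are sequences of ids like `encoded`, the same masked-token pre-training used for sentences can in principle be applied to them, which is the analogy with BERT drawn above.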

3. DeepMind used available data from GISAID and their AlphaFold system to predict the protein structures of the Covid-19 virus. AlphaFold is DeepMind’s deep learning system for predicting protein structure. With these protein structures (if correct), researchers will gain insights into the molecular structure of the virus, which could potentially pave the way for finding vaccines or antivirals quicker.

(B) Forecast infection rates and spread/patient prognosis to enable hospitals/health officials to better plan resourcing and response.

There actually haven’t been too many models (at least publicly documented ones) that have explicitly attempted to model the coronavirus spread. However, a significant amount of prior research has studied forecasting the seasonal flu and other outbreaks. Interestingly, a large number of the methods currently being used to forecast disease spread and patient mortality are based on shallow methods. I think there is a lot of potential to use deep learning models with attention + transfer learning (on related flu outbreak data) to achieve better results in this area.

1. Prediction of criticality in patients with severe Covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in Wuhan. In this article, the authors describe using an XGBoost model to predict whether a patient infected with Covid-19 would survive the infection based on age and other risk factors. This is useful for forming recommendations about who most needs to isolate themselves from the disease.
2. Finding an Accurate Early Forecasting Model from Small Dataset: A Case of 2019-nCoV Novel Coronavirus Outbreak
3. Data-Based Analysis, Modelling and Forecasting of the COVID-19 outbreak
4. Using Kalman Filters to predict spread of Coronavirus
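For readers unfamiliar with Kalman filters, the following is a minimal sketch of how one could be applied to a case series. It is not the method from any of the articles listed above: the random-walk model for the growth rate, the variances `q` and `r`, and the case counts are all assumptions chosen for illustration.

```python
import math

def kalman_growth_rate(counts, q=1e-3, r=1e-2):
    """Track the daily log-growth rate of a case series with a scalar Kalman filter.

    State: growth rate x_t, modelled as a random walk with process variance q.
    Observation: z_t = log(c_t) - log(c_{t-1}), with measurement variance r.
    """
    x, p = 0.0, 1.0  # vague initial guess: zero growth, high uncertainty
    for prev, curr in zip(counts, counts[1:]):
        z = math.log(curr) - math.log(prev)
        p = p + q            # predict: random walk keeps x, inflates variance
        k = p / (p + r)      # Kalman gain
        x = x + k * (z - x)  # update the estimate toward the observation
        p = (1.0 - k) * p    # update the variance
    return x, p

def forecast(counts, days, q=1e-3, r=1e-2):
    """Project cumulative cases forward, assuming the filtered growth rate persists."""
    rate, _ = kalman_growth_rate(counts, q, r)
    return [counts[-1] * math.exp(rate * d) for d in range(1, days + 1)]

# Synthetic cumulative counts for illustration only, not real data.
cumulative = [100, 130, 170, 220, 290, 380]
rate, variance = kalman_growth_rate(cumulative)
projection = forecast(cumulative, days=3)
```

The appeal for small datasets is that the filter needs only two tunable variances and updates its estimate one observation at a time. The projection assumes the growth rate stays constant, which is unlikely to hold for more than a few days.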

(C) Help diagnose whether a medical image such as an X-ray or CT scan shows coronavirus.

Diagnosing a case of coronavirus-related pneumonia from a CT scan can, potentially, both shorten the time to diagnosis and enable better treatment. With record numbers of patients flooding ICUs, radiologists can quickly become overwhelmed. Deep learning on imaging can help lessen the burden. Moreover, learning how the disease manifests itself in CT scans can provide more insight into the disease itself.

Figure from Deep learning-based model for detecting 2019 novel coronavirus pneumonia

1. Deep learning-based model for detecting 2019 novel coronavirus pneumonia on high-resolution computed tomography: a prospective study (Chen et al.). Here the authors used a UNet++ to extract relevant features from the CT scans and classify them. The authors trained the model on 40k scans from 106 admitted patients and found that it performed well in terms of precision and recall. Specifically: “The model achieved a per-patient sensitivity of 100%, specificity of 93.55%, accuracy of 95.24%, PPV of 84.62%, and NPV of 100%; a per-image sensitivity of 94.34%, specificity of 99.16%, accuracy of 98.85%, PPV of 88.37%, and NPV of 99.61% in retrospective dataset. For 27 prospective patients, the model achieved a comparable performance to that of an expert radiologist. With the assistance of the model, the reading time of radiologists was greatly decreased by 65%.” So, in summary, the model was highly effective at streamlining the workflow of radiologists, who can then examine more CTs in the same amount of time. This is a good example of a positive impact of AI in real-world cases. It is important to note, however, that the exceptional scores seem to come from a test set of only 27 patients for prospective testing and 44 patients for retrospective testing. Moreover, all of these patients were from the same hospital. While this shows the model learned something, it is unclear how robust it would be on a larger and more diverse test population.
2. A deep learning algorithm using CT images to screen for CoronaVirus Disease (COVID-19)

Diagram from paper A deep learning algorithm using CT images to screen for CoronaVirus Disease (COVID-19)

3. Deep Learning System to Screen Coronavirus Disease 2019 Pneumonia. This article is similar to the two above, except the total number of patients in the dataset was a bit larger. The authors collected examples from around 509 patients (including 175 healthy ones) across three hospitals in China. Interestingly, their precision/recall scores were significantly lower than those of Chen et al. One reason for this could be that, despite having more patients, they used significantly fewer scans to train the model.
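Since the papers above report sensitivity, specificity, PPV, and NPV, it may help to see how these follow from a confusion matrix. The sketch below uses hypothetical counts that I back-calculated so the ratios reproduce the per-patient percentages quoted from Chen et al.; the counts themselves are an assumption and are not taken from the paper.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of infected patients flagged (recall)
        "specificity": tn / (tn + fp),  # share of non-infected patients cleared
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),          # share of positive calls that are correct (precision)
        "npv": tn / (tn + fn),          # share of negative calls that are correct
    }

# Hypothetical counts, back-calculated to match the quoted per-patient
# percentages; they are not taken from the paper.
metrics = diagnostic_metrics(tp=11, fp=2, tn=29, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With counts this small, each individual patient moves the numbers a lot. Under these assumed counts, a single additional false positive would lower the PPV from 11/13 (about 84.6%) to 11/14 (about 78.6%), which is why results on a few dozen patients should be read cautiously.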

(D) Mine social media data to better estimate spread/symptoms and general public perception

This area of work focuses on mining social media data for relevant information about the disease. Although social media is very noisy, in some ways it could contain more information about symptoms/spread in the general public. Once again, at this point no one (at least openly) has conducted research on mining social media data explicitly for Coronavirus tracking. However, there are a number of publications on it for related incidents.

A review of influenza detection and prediction through social networking sites

Forecasting Influenza Levels using realtime social media streams

Regional Influenza Prediction with Sampling Twitter Data

Datasets

Below is a list of datasets related to Coronavirus with the corresponding category listed above.

Coronavirus Genome Link (Requires a GISAID account, which unfortunately can be kind of a pain to get unless you have an academic affiliation). This is what DeepMind used for predicting protein structures and where Zhang et al. got their RNA sequences. (A)

Coronavirus Genome on Kaggle (A)

COVID-19 Open Research Dataset Challenge (CORD-19) (B, C, D) This is an interesting new dataset released on Kaggle today (3/16/20). It contains over 29,000 academic articles on COVID-19, SARS, MERS, and related viruses. The competition is split into 10 separate parts each with their own winner. I will not list all here but you can see them at this link. The competition is mainly aimed at extracting the most relevant information about this disease from this large dataset. You can officially join the competition on Kaggle.

ChemDiv Database: A database of different chemical compounds. (A)

GitHub Coronavirus: This GitHub repo mirrors the daily Johns Hopkins data, all in CSV format. (B)

COVID-19 Korea Dataset & Comprehensive Medical Dataset & visualizer: This is a comprehensive dataset of all the available data on the disease from Korea. (B)

COVID-19 Vulnerability Index (B) A set of open source models that can be used to estimate an individual’s risk if they catch coronavirus.

COVID-19 image data collection (C) This dataset contains X-Rays and CT scans of Coronavirus and other lung conditions.

Coronavirus Tweets Dataset (D)

Other resources/articles

Non-technical article from MIT Technology Review about how the CDC is forecasting coronavirus. Unfortunately, they don’t seem to provide any additional details about their model. They do, however, highlight the problem of small data.

CT Angel This seems to be a Chinese website linked to the paper that allows users to upload CT scans and see if the model detects coronavirus (though as stated previously this has not been peer-reviewed and should not be used as medical advice).

TWIML podcast This podcast details how BlueDot, a Toronto AI startup focusing on infectious disease surveillance, helped predict the outbreak and spread of coronavirus. Unfortunately, there don’t seem to be any links to their data or further descriptions of their methodology.

Infectivity prediction of Wuhan 2019 novel coronavirus using deep learning algorithm: This is more of a possible method to help prevent future outbreaks than a method to contain the current outbreak; the model could prove powerful in the future. Here the authors propose a model that would take a DNA strand from a virus in animals and predict whether it has the potential to infect humans.

Additional thoughts and recommendations.

I think this outbreak demonstrates the need for robust models that learn from relatively few examples and rapidly adapt to changes in distribution. For instance, say that you had trained a model to forecast patient length of stay and mortality risk on the MIMIC-III critical-care dataset. Now you anticipate an influx of coronavirus patients to your hospital ICU but only have very limited data from a couple of other hospitals (or even worse, no data at all) on coronavirus patient trajectories. This is a good example of where many models will completely break down due to the shift between the training and production distributions. It is also where recent techniques like unsupervised domain adaptation and meta-learning could prove very powerful. For instance, if we had instead trained a deep learning model to learn how to learn with a popular meta-learning strategy (by partitioning the dataset into task distributions via ICD code, for instance), the model might be able to adapt to forecast patient trajectories with only five or so patients. Additionally, some techniques such as continual domain adaptation could enable models to continually adapt to incoming distribution shifts (without any new data). Unfortunately, these methods have typically only been applied in very constrained image classification situations (although this paper did show the potential to segment lung X-rays of different patient populations with DA). Overall, their performance in the wild remains to be seen.
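As a rough sketch of what "partitioning the dataset into task distributions via ICD code" could look like in practice, the snippet below groups patient records into one task per diagnosis code and samples a few-shot episode from one of them. The record fields, the placeholder codes, and the helper names `build_tasks` and `sample_episode` are all hypothetical; this does not reflect the actual MIMIC-III schema or any specific meta-learning library.

```python
import random
from collections import defaultdict

def build_tasks(records, min_size=4):
    """Group patient records into one task per diagnosis code."""
    by_code = defaultdict(list)
    for record in records:
        by_code[record["icd_code"]].append(record)
    # Drop codes with too few patients to form a support and a query set.
    return {code: rows for code, rows in by_code.items() if len(rows) >= min_size}

def sample_episode(tasks, k_support=2, k_query=2, rng=random):
    """Draw one episode: a task, a small support set, and a disjoint query set."""
    code = rng.choice(sorted(tasks))
    rows = rng.sample(tasks[code], k_support + k_query)
    return code, rows[:k_support], rows[k_support:]

# Hypothetical toy records; field names, codes, and values are made up.
codes = ["CODE_A"] * 6 + ["CODE_B"] * 5 + ["CODE_C"] * 2
records = [
    {"patient_id": i, "icd_code": code, "length_of_stay": 3 + i % 5}
    for i, code in enumerate(codes)
]
tasks = build_tasks(records)
code, support, query = sample_episode(tasks, rng=random.Random(0))
```

In a meta-learning setup, the model would adapt on `support` and be evaluated on `query`, repeated over many episodes, so that at deployment a handful of patients with a new condition could play the role of the support set.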

Another underutilized area of machine learning in health care is hospital operations as a whole. This results, in part, from hospitals not wanting to share their data (more on this point in a second), but also, in my opinion, because it is much harder than simply passing a chest X-ray to a CNN. In this instance, the hospital could be thought of as a large-scale reinforcement learning problem where there are only limited numbers of beds, staff, and other resources. The reinforcement learning or optimization problem is essentially to learn to leverage the available resources most effectively to improve patient outcomes. This could prove particularly useful in crises such as the current one, where hospitals are pushed to the brink. Moreover, running these models on historical data (including data from similar hospitals) would hopefully help the hospital identify shortcomings and manage its resources better in the future, particularly when combined with the aforementioned disease outbreak models.

Finally, and perhaps most importantly: in order to make progress, I believe places need to start sharing their data. Data absolutely should be anonymized and stripped of personally identifiable information, but afterwards, hospitals/clinics should share it openly. Too much medical data continues to sit behind closed doors or is available only after submitting “detailed collaboration proposals.” Also, sites like GISAID that claim to be openly accessible need to actually be openly accessible, instead of just open to a select few with .edu email addresses and an academic department page. Protecting patient privacy is important, but more often than not research groups seem to use it as a convenient excuse for closing off their data. Moreover, authors need to ensure that they clearly explain how to access the raw data (if available). If the data is proprietary or requires registration, this should be noted clearly in the appendix. Merely appending a citation to a previous study that links to another study leads readers on a wild-goose chase to find the original data source. While some of these changes might be controversial, I believe they are necessary to enable AI to address health challenges of this scale.

In summary, there are many potentially impactful applications of machine learning to fighting Covid-19; however, most are still in their early stages. Moreover, a lack of data sharing continues to inhibit overall progress in a variety of medical research problems. Still, I believe that utilizing things like meta-learning, domain adaptation, and reinforcement learning, while loosening restrictions on healthcare data, can allow ML to play an important role in containing/responding to both Coronavirus and future pandemics. In the meantime, stay safe everyone!

Note from the editors: towardsdatascience.com is a Medium publication primarily based on the study of data science and machine learning. We aren’t health professionals or epidemiologists. To learn more about the coronavirus pandemic, you can click here.


Written by Isaac Godfried