Using data science to understand and attack the coronavirus and other epidemics.

Strategies of understanding, action and prevention for epidemiological crises.

Manuel Aragonés Mora
Towards Data Science

--

On the last day of 2019, the government of the People’s Republic of China informed international health authorities of a series of cases of an unidentified disease in the city of Wuhan, China. Seven days later, Chinese authorities identified the pathogen as a coronavirus, from the same family of viruses responsible for SARS and MERS. The international scientific community has named the disease caused by this strain COVID-19, and according to epidemiological standards the outbreak has gone from an epidemic to a pandemic given its worldwide spread to 75 countries so far.

As of March 4th, 2020, 3,062 people have died from the virus (162 on the last day), and the total number of reported cases exceeds 92,860, most of them in mainland China (Raw data).

The World Health Organization (WHO) describes some of the symptoms of this virus as follows:

“The clinical signs and symptoms reported are primarily fever and, in some cases, dyspnea and invasive pneumonic infiltrates in both lungs observable on chest X-rays.”

So far it is known that the lethality of the virus is not as high as in other similar epidemics (about 2% of reported infections end in death, whereas SARS and MERS had case fatality rates of roughly 10% and 35%, respectively). It is also known that the virus transmits relatively easily and can incubate asymptomatically for up to around two weeks. During this period, infected people often infect others without knowing they are sick. This can put the health system at risk, because there may not be enough hospital beds to handle a large number of infected patients.

Paradoxically, this low lethality is precisely what makes the virus so dangerous: the difficulty of containing it threatens to spread the contagion massively, and even with its “low lethality” it could claim more lives than any similar virus in recent decades.

However, the massive nature of this and other outbreaks of respiratory diseases in a globalized era calls for large-scale response strategies, in which coordination and information exchange become fundamental pillars for understanding, containing and eventually resolving epidemics.

This is where data science comes into play: large-scale computational data analysis, together with Artificial Intelligence (AI) and Machine Learning (ML) techniques, allows us to approach a problem of this magnitude with greater power of understanding.

In 2020, we cannot do without this arsenal of techniques: decision making and the planning of successful strategies at this level must be data-driven.

Data science offers an enormous number of methods whose effectiveness is backed by statistical rigor. Their implementation, however, can become more of an art than a science, given how quickly the field evolves and the constant creative processes that data scientists are part of: there is no perfect recipe.

In this article, deep_dive’s team seeks to propose some ideas and applications of data science that could have an impact on the correct handling of this crisis by following the workflow below:

  1. Understanding the phenomenon
  2. Action
  3. Prevention

In the first phase we will seek to understand the phenomenon by extracting as much information about the virus as possible and understanding it from data visualization using GIS (Geographic Information Systems) techniques and graph analysis. This information will serve as fundamental input for the training and deployment of models in the action phase, where we propose specific applications using spatial-temporal clustering, genomic data and some risk models. Finally, in the prevention phase we will explore data architecture issues that could lay a more solid foundation for addressing future problems of a similar nature.

Understanding the phenomenon

A critical issue in the understanding of an epidemiological phenomenon is the correct diagnosis of the phase in which the phenomenon is found.

Where are the outbreaks? How does the contagion occur? How many sick people are there? How fast is it spreading? How many people are at risk and who are the most exposed? What is the best method of diagnosis?

Let’s try to answer some of these questions with a typical workflow that starts with an exploratory analysis.

This phase consists of extracting whatever information the raw data can give us; we propose the following list as a starting point:

  • Data visualization
  • Geographic hotspots (GIS)
  • Network analysis

Data visualization

Data visualization is one of the most effective tools in extracting the underlying patterns of a phenomenon; in the initial phases of analysis, viewing the data correctly allows us to generate our first hypotheses.

When talking about an epidemic, what immediately comes to mind is the usefulness of graphs that help us understand the spatio-temporal evolution of the virus’s spread and lethality.

Evolution of COVID-19 cases as of 03.04.20

This time series allows us to visualize the global evolution of the reported cases of the virus. We can get a sense of the trend in the data from the fit of a quadratic model.

As we can see, a quadratic model does not seem to be the best way to fit the data; however, the animation of such a model allows us to better understand the dimension of the problem and its potential scale.

A model in which new daily infections grow linearly implies that cumulative infections grow quadratically.

Normally, the models used in epidemiology exhibit exponential growth, which can quickly overtake a quadratic model such as this one.
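
As an illustration of this difference, the sketch below fits both a quadratic and an exponential curve to a small, purely hypothetical cumulative case series (the numbers and the scipy-based fitting are assumptions for illustration, not the data behind the figure above):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical cumulative confirmed-case counts for 10 consecutive days
# (illustrative numbers only, not the real COVID-19 series).
days = np.arange(10)
cases = np.array([45, 62, 121, 198, 291, 440, 571, 830, 1287, 1975])

def quadratic(t, a, b, c):
    # Cumulative cases if new daily infections grow roughly linearly.
    return a * t**2 + b * t + c

def exponential(t, n0, r):
    # Cumulative cases if the count grows at a constant rate r per day.
    return n0 * np.exp(r * t)

q_params, _ = curve_fit(quadratic, days, cases)
e_params, _ = curve_fit(exponential, days, cases, p0=(cases[0], 0.3))

# Compare how far ahead each model projects the outbreak at day 20.
t_future = 20
print("quadratic projection :", quadratic(t_future, *q_params))
print("exponential projection:", exponential(t_future, *e_params))
```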

Geographical hotspots

It is well known that the first outbreaks of the disease have been in China, which is why our geographical analysis will be divided into two regions: China and the rest of the world. To begin with, let’s visualize the evolution of confirmed cases in each region.

It is clear that China is the focus of the epidemic, however, given the interconnectedness of the globalized world, the emergence of outbreaks of equal magnitude in other areas of the world is feasible. Moreover, it seems that in the aggregate the new cases confirmed on March 3rd in the rest of the world exceed those in China. Let’s look at each region in more detail:

The city of Wuhan is located in the Hubei region in central China, where the largest number of confirmed cases has been found so far.

Currently, China has 80,304 reported cases, so in the following map, the color corresponding to the number of cases is out of the proposed scale.

What is worrying about this map is the growth rate of cross-border infection and the emergence of new outbreaks in countries of all kinds. As of the 4th of March, cases have been confirmed in 75 countries.
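
As a rough sketch of how such a map can be produced, the snippet below draws a bubble map of confirmed cases with plotly; the locations, coordinates and case counts are placeholders chosen for illustration, not the actual figures discussed above:

```python
import pandas as pd
import plotly.express as px

# Hypothetical snapshot of confirmed cases by location; in practice this
# would come from a source such as the Johns Hopkins daily reports.
df = pd.DataFrame({
    "location":  ["Hubei", "South Korea", "Italy", "Iran", "Mexico City"],
    "lat":       [30.97,   36.5,          43.0,    32.4,   19.43],
    "lon":       [112.27,  127.9,         12.5,    53.7,  -99.13],
    "confirmed": [67_000,  5_000,         2_500,   2_300,  5],
})

# Bubble map: marker size encodes the number of confirmed cases, which
# makes geographic hotspots stand out immediately.
fig = px.scatter_geo(
    df,
    lat="lat", lon="lon",
    size="confirmed",
    hover_name="location",
    projection="natural earth",
    title="Confirmed cases by location (illustrative data)",
)
fig.show()
```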

Let’s now look at the reported cases of deaths in China and the rest of the world.

As of March 3rd, 3112 people have died from the virus.

Clearly, the greatest number of deaths from COVID-19 is concentrated near the source of the outbreak, again placing China outside the global comparison scale with 2,946 deaths out of 3,112.

It is interesting to explore the cases of Italy, Iran and South Korea, where lethality seems to be higher than in other countries: one hypothesis is that case detection there is less effective, so the true number of infections is currently being underestimated, which inflates the apparent fatality rate.

Network analysis

Given the discrete nature of the carriers of the virus (humans and some animals), it is convenient to study dispersion and contagion from a network perspective. In these networks, each node represents a person and each edge (a connection between nodes) represents a contagion event. Modeling this type of phenomenon becomes extremely complicated because contagion is a process that evolves over time.

To better understand the case, let’s move our analysis from global contagion to a local setting, for example inside the Mexico City (CDMX) subway.

Suppose a subway car measures 2 meters by 40 meters and each train has 7 cars. This gives 560 𝑚² of space where contagion could occur. If the train is full, assuming a density of 9 people per 𝑚², the number of interactions becomes immense.

The number of pairwise interactions per 𝑚² is the binomial coefficient C(9,2) = 36. Now suppose the person in the center is infected, and assume an infection rate given the density of people; we can then simulate the contagion.

For example, say that over 15 minutes of interaction, each person traveling next to an infected passenger has a 30% chance of being infected. Under these parameters, the 8 neighboring passengers would all be infected after approximately 2 hours, assuming the newly infected do not immediately infect others. It is important to note that this 30% is an illustrative figure; enormous epidemiological effort has to go into estimating these rates as accurately as possible (more information can be found in this link).

The following graph shows how this average converges to about 120 minutes after a few simulations.
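
A minimal Monte Carlo sketch of this scenario is shown below. The 30% infection probability per 15-minute interval, the 8 neighbors and the assumption that new infections do not spread further are the illustrative parameters from the example above:

```python
import random

P_INFECTION = 0.30   # illustrative chance of infection per 15-minute interval
INTERVAL_MIN = 15    # length of one interaction interval, in minutes
NEIGHBORS = 8        # passengers sharing a square meter with the infected person
N_RUNS = 10_000      # number of Monte Carlo simulations

def minutes_until_all_infected():
    """Simulate one trip and return how long it takes to infect all neighbors."""
    healthy = NEIGHBORS
    minutes = 0
    while healthy > 0:
        minutes += INTERVAL_MIN
        # Each still-healthy neighbor is independently exposed this interval;
        # newly infected passengers are assumed not to infect others themselves.
        healthy -= sum(1 for _ in range(healthy) if random.random() < P_INFECTION)
    return minutes

avg = sum(minutes_until_all_infected() for _ in range(N_RUNS)) / N_RUNS
print(f"Average time until all {NEIGHBORS} neighbors are infected: {avg:.0f} minutes")
# With these illustrative parameters the average converges to roughly 120 minutes.
```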

Clearly these parameters cannot be assumed lightly and each of them must be estimated in the best possible way. More complex models based on this simple reasoning can be consulted in more detail at the following link.

Action

Taking action on these phenomena is naturally the most complicated part. At this stage, a solution is articulated that involves a large-scale logistical and operational deployment, an enormous outlay of resources and extraordinary political capacity, not to mention adequate protocols, trained health professionals and a well thought-out plan. All of this would be impossible to execute correctly without knowing what we are facing and without ensuring that the proposed solution is designed around the relevant aspects found in the exploratory stage.

Scalability in the deployment of the models becomes crucial, since a system that responds to a massive phenomenon such as an epidemic must be able to handle a high level of concurrency without ever compromising the effectiveness or speed of the response.

This section outlines possible model deployments with medical-epidemiological applications along with the considerations that each must take into account.

Genome visualization

Thanks to advances in medical science, the genome sequence of the coronavirus was quickly determined and is now publicly available. The following figure shows the first 100 nitrogenous bases of the sequenced genome:

From this visual sequence we can confirm some basic biology, such as the fact that the genetic material of any living being is a sequence of only four nitrogenous bases (A, C, G, T).
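
A quick way to produce this kind of picture is to map each base to a color and draw the sequence as a strip. The snippet below does this for a short, made-up fragment; the real sequence would be downloaded from a public repository such as GenBank:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Illustrative genome fragment (not the actual coronavirus sequence).
sequence = "ATGGTTACCGGAATCCTTGACATGCGTACGTTAGCATCCGATTGCAATGGCTTAACGGTA"

# Map each base to an integer so the fragment can be drawn as a color strip.
base_to_int = {"A": 0, "C": 1, "G": 2, "T": 3}
encoded = np.array([[base_to_int[b] for b in sequence]])

cmap = ListedColormap(["#2ca02c", "#1f77b4", "#ff7f0e", "#d62728"])  # A, C, G, T
plt.figure(figsize=(12, 1))
plt.imshow(encoded, aspect="auto", cmap=cmap, vmin=0, vmax=3)
plt.yticks([])
plt.xlabel("Position in the sequence")
plt.title("Bases of a genome fragment, one color per base")
plt.show()
```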

The full image can be developed from the complete raw information along with other relevant genomic information that can be found here.

This figure may be aesthetically attractive; however, on its own it offers few insights that add value to our analysis of the phenomenon.

Extracting information from this huge sequence is not the focus of this article; however, techniques such as Principal Component Analysis (PCA) and other dimensionality reduction methods can give us a measure of similarity between various strains of the virus, which could guide the creation of an effective vaccine. Genetic trees, in turn, allow us to visualize how mutations evolve. The evolution of the coronavirus can be consulted here.
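
As a sketch of the PCA idea, one possible approach is to represent each strain by its k-mer frequency profile and project those profiles onto two components; the sequences below are short, made-up fragments standing in for real genomes:

```python
from collections import Counter
from itertools import product

import numpy as np
from sklearn.decomposition import PCA

def kmer_profile(sequence, k=3):
    """Represent a genome as a vector of k-mer frequencies."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = max(sum(counts.values()), 1)
    return np.array([counts[kmer] / total for kmer in kmers])

# Hypothetical fragments standing in for different sequenced strains;
# real sequences would come from repositories such as GISAID or GenBank.
strains = {
    "strain_A": "ATGGTACCGTTAGCATGGTACCGTTAGCATGGTACC",
    "strain_B": "ATGGTACCGTTAGCATGGTACCGTTAGCATGGTATC",
    "strain_C": "TTGCAAGGCCTTAAGGCCTTGCAAGGCCTTAAGGCC",
}

X = np.vstack([kmer_profile(seq) for seq in strains.values()])

# Project the 64-dimensional k-mer profiles onto 2 components so that
# similar strains end up close to each other in the plane.
coords = PCA(n_components=2).fit_transform(X)
for name, (x, y) in zip(strains, coords):
    print(f"{name}: ({x:.3f}, {y:.3f})")
```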

Spatial-temporal clustering

Beyond the identification of hotspots, another spatial-temporal detection method is the use of baselines. These allow us to define what we consider a normal range based on historical information. In this way, we can efficiently detect clusters of infections that deviate from typical or historical behavior.

As an example of application of this technique, we will use a public data set with the number of students vaccinated in Minnesota in 2018.

We detected the region (cluster) where the percentage of vaccinated kindergarteners (~85%) is significantly lower than in the rest of the state (95%). It is important to note that the cluster detected does not correspond to the area of highest population density in Minneapolis: because the algorithm uses a baseline that takes population into account, we can detect geographic patterns that are not otherwise obvious, which is essential for forming an effective strategy.

Public data set with the number of students vaccinated in Minnesota in the 2018–2019 school year.
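
One simple way to implement this kind of baseline comparison is to test each region’s observed vaccination count against the statewide rate. In the sketch below the counts are made up, and a plain binomial test stands in for whatever scan statistic the actual analysis may use:

```python
import pandas as pd
from scipy.stats import binom

# Hypothetical counts of vaccinated kindergarteners per region; the real
# analysis would use the Minnesota 2018-2019 school-year data set.
regions = pd.DataFrame({
    "region":     ["north", "metro", "south", "west"],
    "vaccinated": [4_750,   18_020,  5_230,   2_550],
    "enrolled":   [5_000,   19_000,  5_500,   3_000],
})

BASELINE_RATE = 0.95  # statewide vaccination rate used as the baseline

# For each region, the probability of observing this few vaccinated students
# if the region really followed the statewide baseline rate.
regions["rate"] = regions["vaccinated"] / regions["enrolled"]
regions["p_value"] = binom.cdf(regions["vaccinated"], regions["enrolled"], BASELINE_RATE)

# Flag regions whose coverage is significantly below the baseline.
print(regions[regions["p_value"] < 0.01])
```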

This method of analysis can be used for a variety of epidemiological applications. On the one hand, we can detect regions where the number of cases is significantly higher than in other areas. Similarly, if a COVID-19 coronavirus vaccine were available, we could use that data set to identify areas of higher risk.

From here, we could correlate coronavirus infections and vaccination rates to better allocate resources to the regions at highest risk before the infection begins to spread uncontrollably; failing to do so could be negligent.

Risk models

An epidemic risk model can predict the evolution of the infections that a certain disease will have and how this will affect the rest of the actors involved.

These models are very complex because they depend on many interactions particular to the context of each virus. The magnitude of the coronavirus makes the design of risk models an obligatory task, since only in this way can the costs of interventions be weighed: curfews in cities, closure of markets or crowded spaces, migration and airport measures, among others.

With sufficient information on flights, passenger flows from areas classified as at risk can be estimated and, based on this, a cost-benefit analysis can be made of mandatory preventive measures such as diagnosing each arriving passenger and activating health protocols upon detection, in order to prevent contagion through this channel.
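
As a back-of-the-envelope sketch of that reasoning, the figures below (daily passenger volume, prevalence in the origin region and the screening detection rate) are all assumptions chosen purely for illustration:

```python
# Back-of-the-envelope estimate of infected passengers arriving per day
# from a region classified as "at risk" (all numbers are illustrative).
daily_passengers = 3_000          # passengers arriving per day from the region
prevalence = 80 / 100_000         # assumed share of infected people in that region
detection_rate = 0.60             # share of infected passengers caught by screening

expected_infected = daily_passengers * prevalence
expected_missed = expected_infected * (1 - detection_rate)

print(f"Expected infected arrivals per day : {expected_infected:.2f}")
print(f"Expected to slip past screening    : {expected_missed:.2f}")
```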

Simply by combining the infection trend detected in some countries with the available medical care capacity, it is possible to estimate the distance to saturation.

Take, for example, the case of Italy, where the medical care capacity is currently 3.4 hospital beds per 1,000 inhabitants. Combining this capacity with the trend of coronavirus cases in the country, we obtain the following model.

It is important to mention that the models that best fit epidemiological trends are not polynomial but exponential, along with models that adapt to local characteristics in order to investigate regional developments.

Given the fitted trend of case detection and the number of beds available (assuming none are already occupied), it would take between 36 and 49 days before there is not a single bed left to attend COVID-19 patients in Italian hospitals. It is crucial to keep in mind that this model assumes all patients develop serious disease and therefore require a bed; more specialized models that include more variables would temper this trend somewhat. Obtaining a world or regional map to visualize the distance to saturation, along with other risk maps, is a pending task that should be carried out soon. In this link there are some examples of propagation maps.
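
The sketch below reproduces this kind of estimate under strong simplifying assumptions: every confirmed patient needs a bed, no beds are occupied beforehand, and cases grow at an assumed constant daily rate. The 10% growth rate, current case count and population figure are illustrative inputs, not the exact parameters of the model above:

```python
import math

# Illustrative saturation estimate for a country's hospital capacity,
# assuming every confirmed patient needs a bed and all beds start empty.
population = 60_400_000       # approximate population of Italy
beds_per_1000 = 3.4           # hospital beds per 1,000 inhabitants
current_cases = 2_500         # confirmed cases at the time of the estimate (illustrative)
daily_growth_rate = 0.10      # assumed exponential growth rate of confirmed cases

total_beds = population / 1000 * beds_per_1000

# Solve current_cases * (1 + r)^d = total_beds for d.
days_to_saturation = math.log(total_beds / current_cases) / math.log(1 + daily_growth_rate)
print(f"Total beds: {total_beds:,.0f}")
print(f"Days until every bed is occupied: {days_to_saturation:.0f}")
```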

If we were to use these models for a global strategy, we would have to consider deploying them at large scale. Only through the cloud, with services such as Google Cloud, Amazon Web Services or any other form of Internet hosting, could a task of this magnitude be achieved. Each model is different, and the architecture and services best suited to each case depend on its computational, memory and availability requirements.

Prevention

Even if we manage to contain this pandemic efficiently, we cannot let our guard down. It is important to stay up to date on best practices, because only then can we be ready for crises of this nature. This section discusses a very important part of data science that concerns data infrastructure and belongs more to data engineering than to data analysis.

Privacy in the health sector

Preserving the privacy of individuals involved in the use of ML models is a relevant ethical issue and even more so in cases where the data used might contain sensitive information, as in the case of the health sector.

We are grateful to Johns Hopkins University for providing daily updates on this development, which can be consulted in their github repository.

In addition, this topic may become increasingly relevant in the future for organizations that make use of personal data, as they will have to comply with new regulations protecting the circulation of such data, as in the case of the General Data Protection Regulation (GDPR).

Fortunately, tools have been developed (and continue to be developed) that offer a technical solution to a problem that is otherwise legal and political in nature. The most notable tools in this regard are federated learning, differential privacy and secure multi-party computation (SMPC). Together, these tools can provide a framework for developing predictive models with encryption of both the data and the models, preserving user privacy.

For example, one of the paradigm shifts in federated learning implies that learning is no longer centralized (left), but distributed (right). That is, the model moves to where the data are, not the other way around.

Centralized and distributed learning
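
To make the idea concrete, here is a minimal federated-averaging sketch on synthetic data: three hypothetical institutions each run a few steps of logistic-regression training on their own private data and share only the resulting weights, which a central server averages. This is a toy illustration, not a production federated learning framework:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=20):
    """A few steps of logistic-regression gradient descent on local data."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (preds - y) / len(y)
    return w

rng = np.random.default_rng(0)

# Hypothetical private data sets held by three different institutions.
clients = []
for _ in range(3):
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
    clients.append((X, y))

global_weights = np.zeros(4)
for round_ in range(10):
    # Each client improves the global model on its own data...
    local_weights = [local_update(global_weights, X, y) for X, y in clients]
    # ...and the server only averages the resulting weights; raw data never moves.
    global_weights = np.mean(local_weights, axis=0)

print("Federated model weights:", global_weights)
```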

Imagine that a tool is developed that can successfully combine biometric data, geolocation and search engine queries (such as Google’s) to make predictions about the dynamics of an epidemic. In that case, the data could be used to make sensitive observations about users, their daily habits and their health status, which could be used against them. Without privacy guarantees, users might not even be willing to provide this data in the first place, making it difficult to develop such a model. This creates a dilemma: there is a direct social benefit in being able to make predictions about the epidemic, but it would come at the cost of individual privacy. The tools mentioned here could make it possible to develop effective models that overcome this dilemma.

To learn more about this topic, we recommend researching the work of OpenMined and Dropout Labs.

Conclusions

Despite the fact that the reaction of the international scientific community has been more agile than ever, that information about the virus has been shared across research centres, and that international health authorities have activated the relevant protocols in time and with sufficient scope, the virus has not been contained and is not expected to be contained in the short term.

To think that a low fatality rate for this virus is a reason not to maintain the strictest measures would be a serious mistake. Looking at the patterns of coronavirus transmission, it is easy to imagine how in a matter of weeks the health systems of an entire country could be completely collapsed, further compromising people who are prone to die from this virus and thus increasing, day by day, the number of people who die from COVID-19.

The contributions of the scientific community active in the development of Artificial Intelligence models, of which deep_dive’s team is a part, provide valuable analytical and predictive approaches to mitigate the crises of this and other natures.

We thank all those who are part of this effort and those who collaborated in the creation of this article.

Manuel Aragonés, Jerónimo Aranda, Javier Cors, Camila Blanes, Arturo Márquez and Jerónimo Martínez.
