How Data Science is Revamping Disease Surveillance Systems

Published in Towards Data Science, Mar 24, 2020 (7 min read)

Most shops and all schools remained closed during the corona shutdown in Italy. (Picture: Getty Images)

Data Science technologies like Artificial Intelligence, or more specifically Machine Learning, have driven phenomenal shifts across a spectrum of industries. Of the many domains where ML is being explored, healthcare is witnessing revolutionary developments due to an explosion in the availability of patient data. While no algorithm can ever replace the warmth of human touch and the compassion that underscores a doctor-patient relationship, promising data science technologies can certainly supplement the efforts of medical and healthcare personnel by providing insights into diagnostic and treatment processes, thereby contributing to improved outcomes and enhanced patient care. From Microsoft’s InnerEye leveraging computer vision for medical image diagnostic tools to Alexa’s foray into chronic disease care for patients at home, we are bound to witness a definitive shift from a diagnostic-curative model to a predictive-preventive model, one that will reduce costs along with patient suffering. Thus, ReportLinker’s prediction of a surge in the ‘AI for healthcare’ market size from $2.1 billion in 2018 to $36.1 billion by 2025, at a staggering CAGR of 50.2%, hardly comes as a surprise.

Although Data Science is assisting practitioners, patients and policymakers in a host of ways, this post dives specifically into how our responses and defences against epidemics (and pandemics!) can be enhanced using data-backed tools for forecasting and nowcasting disease dynamics.

Monitoring, Modelling and Forecasting the Spread of Diseases

In our shrinking world, the threat of infectious diseases is greater now than ever before because of increasing global travel, urbanization and climate change. Infectious diseases kill over 17 million people a year. However, while diseases spread fast, knowledge spreads faster! The tremendous data generation capabilities of modern technology can be leveraged using Data Science tools to power real-time disease surveillance and, consequently, to forecast disease spread. Tracking and forecasting the dynamics of infectious disease outbreaks is immensely useful for allocating healthcare resources and for informing public policy on the selection and implementation of interventions that minimize morbidity and mortality.

METHODS

What we are essentially dealing with here is the trend of a particular disease over time, that is, ‘Time-series Data’. We model the epidemic over historical periods where it showed significant activity and monitor it for current or future periods of potential resurgence. Important metrics whose nowcast (an estimate of the number of cases occurring right now) or forecast aids public policy include:

1. Daily/Weekly case counts

2. Peak Timing

3. Peak Height (of the curve representing case counts)

4. Outbreak Duration and Magnitude
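
The metrics listed above can be read off an epidemic curve with very little code. The sketch below is only illustrative: the function name, the `threshold` baseline and the weekly counts are all made up, and "magnitude" is assumed here to mean the total number of cases reported while the count stays above that baseline.

```python
from datetime import date, timedelta

def outbreak_metrics(dates, counts, threshold):
    """Summarise an epidemic curve given parallel lists of dates and case counts.

    `threshold` is a hypothetical baseline: periods with counts above it
    are treated as part of the outbreak.
    """
    peak_index = max(range(len(counts)), key=counts.__getitem__)
    active = [i for i, c in enumerate(counts) if c > threshold]
    return {
        "peak_timing": dates[peak_index],
        "peak_height": counts[peak_index],
        # Number of reporting periods from the first to the last active one.
        "duration_periods": (active[-1] - active[0] + 1) if active else 0,
        # Assumed definition: total cases during the active periods.
        "magnitude": sum(counts[i] for i in active),
    }

# Made-up weekly case counts, one value per week starting on a Monday.
start = date(2020, 1, 6)
weeks = [start + timedelta(weeks=i) for i in range(8)]
cases = [3, 8, 21, 55, 90, 62, 30, 9]

metrics = outbreak_metrics(weeks, cases, threshold=10)
print(metrics)
```

In a forecasting setting the same function would be applied to the predicted curve rather than the observed one, so that peak timing and height become forecast targets.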

Further, the granularity of our target variable (hourly/daily/weekly/monthly) will depend on the granularity of the input data available. For example, if we have monthly data points, our target predictions cannot be at a finer granularity, such as weekly. A point to note, however, is that the finer the grain of the data, the more control we have over its analysis and interpretation, resulting in better insights into the epidemic’s dynamics.
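
The asymmetry is easy to see in code: fine-grained counts can always be summed into a coarser series, but the reverse is not possible without extra assumptions. A minimal sketch, assuming daily counts keyed by date (the helper name and the numbers are hypothetical):

```python
from collections import defaultdict
from datetime import date, timedelta

def to_weekly(daily_counts):
    """Sum daily counts into ISO weeks; keys are (iso_year, iso_week)."""
    weekly = defaultdict(int)
    for day, count in daily_counts.items():
        iso_year, iso_week, _ = day.isocalendar()
        weekly[(iso_year, iso_week)] += count
    return dict(sorted(weekly.items()))

# Made-up daily counts for 14 consecutive days, starting on a Monday.
start = date(2020, 3, 2)
daily = {start + timedelta(days=i): i + 1 for i in range(14)}

weekly = to_weekly(daily)
print(weekly)
```

Going the other way, from weekly totals back to daily values, would require inventing a within-week distribution, which is why the input grain bounds the target grain.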

Some models try to incorporate spatial data along with temporal data. This is mostly done by dividing the data according to geographic regions on scales ranging from cities and districts to nations or latitudinal ranges. This makes sense, especially in the case of pandemics, since the peculiarities of various regions and of the populations residing there might play an important role in the outbreak’s dynamics. In a spatio-temporal analysis, the above-mentioned target parameters are also estimated separately for each region, sometimes by fine-tuning the models with data from that particular region.

DATA

Traditional surveillance systems use both virologic and clinical data gathered from hundreds of healthcare providers throughout a nation to publish epidemic reports, typically weekly. Though reliable, this method is expensive and slow: the data in these reports lags by 1 to 2 weeks. To provide real-time monitoring of outbreaks, data is being captured from innovative surveillance systems that monitor indirect signals of influenza activity.

1. Web Search Queries and Social Networking Sites Data

About 80% of internet users search online for information about the medical problems they face, making web search queries a uniquely valuable source of information about health trends. Not surprisingly, then, health-seeking web-search behaviour has been found to be highly correlated with the percentage of physician visits in which patients present symptoms of the corresponding diseases, measured during the same periods. This is especially true of seasonal influenza-like illnesses (ILI).

Prediction of ILI cases by a model based on search query logs (Black) vs. Actual ILI case counts (Red)

Of course, certain search queries are more highly correlated than others. Also, the set of terms or phrases whose search frequency should be monitored and used in models that nowcast/forecast the disease activity may vary with time and across different regions. Hence, there are tools that employ Machine Learning and Statistical models to automatically discover the most indicative set of queries for a particular disease during a given period in a given region.
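
One very simple way to discover indicative queries is to rank candidates by how strongly their search frequency correlates with the reported ILI rate over the same weeks. The sketch below uses plain Pearson correlation as a stand-in for the more elaborate ML and statistical tools mentioned above; the query strings, frequencies and ILI rates are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rank_queries(query_freqs, ili_rates, top_k):
    """Return the top_k queries whose frequencies correlate best with ili_rates."""
    scores = {q: pearson(freqs, ili_rates) for q, freqs in query_freqs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Made-up weekly ILI rates (% of physician visits) and query frequencies.
ili = [1.0, 1.5, 2.8, 4.1, 3.0, 1.7]
queries = {
    "flu symptoms": [10, 16, 27, 42, 31, 18],
    "fever remedy": [12, 14, 20, 30, 26, 15],
    "football scores": [50, 48, 52, 47, 51, 49],
}

top = rank_queries(queries, ili, top_k=2)
print(top)
```

Because the best set of terms drifts over time and between regions, a ranking like this would need to be recomputed per region and per period rather than fixed once.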

Many studies have also demonstrated the success of using Social Networking Sites (SNS) for real-time analysis of the prevalence of epidemics. Many people use SNS to share their thoughts and even their health status. They therefore provide an efficient resource for disease surveillance, apart from being a good way to communicate and spread awareness for the prevention of epidemics and pandemics. SNS users can be treated as sensors whose data can be analyzed for early trend detection and prediction. Twitter is an especially mature resource: the frequency of posts enables minute-by-minute analysis, and its diverse user base, ranging from the young to the tech-savvy older population, yields data points spanning the entire gamut of age groups. Moreover, Twitter posts are more descriptive than search engine logs, and a deeper analysis of the poster’s demographic data through their user profile can offer enhanced insights.

2. Meteorological and Environmental Data

Climatic variations are a known factor impacting the dynamics of disease spread, especially for seasonal epidemics. For example, studies have found that rainfall has a significant effect on the interannual variability of epidemic malaria, which suggests including rainfall as an input variable in a model forecasting malarial outbreaks. Hence, meteorological data such as rainfall, temperature variations and humidity provide important data points that can be used in models which forecast the target parameters pertaining to a disease.

Other environmental factors such as vegetation index, population density (yes, people are a part of our environment) and air quality can also be incorporated into models, depending on their correlation with the spread of a particular disease.

3. Clinical Surveillance Data

Traditional surveillance data such as historical patient counts, historical illness duration and peak values all play a role when we are dealing with time-series. For example, the disease activity in the past 2 weeks may indicate its trend in the coming week. These data points are especially useful when building forecast models that need to learn patterns from historical trends or to capture lagged relationships.
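
To make the "past 2 weeks indicate the coming week" idea concrete, a time series is usually reshaped into pairs of lagged inputs and a target before a model is trained on it. A minimal sketch with invented weekly counts (the helper name is hypothetical):

```python
def make_lagged_samples(series, n_lags):
    """Pair each value with the n_lags values that immediately precede it."""
    return [
        (series[i - n_lags:i], series[i])
        for i in range(n_lags, len(series))
    ]

# Made-up weekly case counts.
weekly_cases = [12, 18, 30, 44, 39]

samples = make_lagged_samples(weekly_cases, n_lags=2)
for features, target in samples:
    print(features, "->", target)
```

Each pair can then be fed to any supervised model; the number of lags is a modelling choice, not something fixed by the data.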

Another type of clinical data that provides near real-time patient information is the Electronic Health Record (EHR). EHRs provide a host of information about a patient, such as their demographics and medical history, that may result in a more nuanced analysis of the situation. Moreover, data from EHRs can be consolidated to provide statistical counts of the number of patients showing symptoms vs. those actually having the illness.

4. Other Data

Other indirect data suggestive of the scale of an outbreak include over-the-counter drug sales, a signal used mostly in the case of ILIs. Some studies even use (pseudo-random) simulation of incidences based on historical evidence.

MODELS

Following are some of the most common models encountered in the literature to analyze epidemic time-series data:

1. ARIMA

ARIMA (Auto Regressive Integrated Moving Average) models and their variants are some of the most effective methods for modelling time-series data. Indeed, they are the most common method used to model infectious disease time-series data as well. Since ARIMA models assume that future values can be predicted from past observations, they work well with the clinical surveillance data points mentioned above, capturing the lagged relationships that usually exist in periodically collected data.

ARIMA models can, however, be limited when dealing with diseases such as ILI, which are not consistent from season to season, or when predicting pandemics, which occur off-season. Moreover, ARIMA does not work very well with unstructured data such as search query logs and data from SNS, which are playing increasingly important roles in the nowcasting of diseases.
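
To give a feel for what "predicting from past observations" means, the sketch below hand-rolls the simplest member of the family: an ARIMA(1,1,0) model without a constant, where the week-to-week change is assumed to be a fixed fraction `phi` of the previous change, estimated by least squares. This is a deliberately stripped-down illustration on invented counts, not a substitute for a dedicated time-series library, which would also handle moving-average terms, seasonality and diagnostics.

```python
def fit_ar1_on_differences(series):
    """Least-squares estimate of phi in d[t] = phi * d[t-1] + noise,
    where d is the first difference of the series."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    numerator = sum(prev * cur for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev * prev for prev in diffs[:-1])
    return numerator / denominator

def forecast(series, phi, steps):
    """Roll the fitted model forward, undoing the differencing at each step."""
    level = series[-1]
    diff = series[-1] - series[-2]
    predictions = []
    for _ in range(steps):
        diff = phi * diff      # predicted next change
        level = level + diff   # integrate back to case counts
        predictions.append(level)
    return predictions

# Made-up weekly case counts whose growth is slowing down.
weekly_cases = [10, 14, 20, 29, 35, 38, 39]

phi = fit_ar1_on_differences(weekly_cases)
next_two = forecast(weekly_cases, phi, steps=2)
print(round(phi, 3), [round(v, 1) for v in next_two])
```

The model only ever sees the series' own history, which is exactly why it struggles with off-season outbreaks and cannot directly absorb signals such as search logs.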

2. Regression Models

Multivariate linear regression is the most common form of regression analysis applied in outbreak forecasting studies. The models capture various data points, including autoregressive and seasonal parameters and (lagged) weather covariates. Generally, regression models are fine-tuned for different populations on a finer spatial grain, such as for each city or state within a nation.
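
As an illustration, the sketch below fits a linear model with one autoregressive term (last week's cases) and one lagged weather covariate (rainfall two weeks earlier) using ordinary least squares. Everything here is an assumption made for the example: the data is synthetic and generated from known coefficients so that the fit can be checked, the lag choices are arbitrary, and the small solver stands in for what a numerical library would normally provide.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    xs = [[1.0] + list(row) for row in rows]  # prepend an intercept column
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic data: cases[t] = 2 + 0.5 * cases[t-1] + 3 * rain[t-2] (no noise).
rain = [5, 0, 12, 3, 8, 1, 10, 4, 6, 2]
cases = [20.0, 18.0]
for t in range(2, len(rain)):
    cases.append(2 + 0.5 * cases[t - 1] + 3 * rain[t - 2])

# Features: last week's cases and rainfall from two weeks earlier.
rows = [(cases[t - 1], rain[t - 2]) for t in range(2, len(rain))]
targets = cases[2:]

intercept, ar_coef, rain_coef = fit_linear(rows, targets)
print(round(intercept, 3), round(ar_coef, 3), round(rain_coef, 3))
```

Fitting a separate set of coefficients per city or state, as described above, amounts to running the same procedure on each region's rows.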

3. Neural Networks

Given the variety of input data features available, Neural Nets are a good choice, provided a large quantity of data is available for training them. Requiring limited feature engineering, neural networks are a popular choice for analyzing complex multi-modal datasets, as in the case of outbreak forecasting from multiple types of data.

4. Other Methods

Apart from those mentioned above, various other methods and models specific to the data have been used. For example, Text Mining coupled with Topic Models or Graph Data Mining has been used to extract and analyze features from search query and SNS data. Besides, several of the feature data types can be combined, using a combination of different methods and models, to build robust systems for the purpose.

Multiple methods are cropping up in an attempt to improve and aid our response to disease outbreaks using a powerful tool we have at our disposal: data. The most successful of these have already been implemented on a larger scale and are helping governments and policymakers fortify their defences. Only when we begin to trust these (too early?) warnings and take the necessary measures, even if they seem an overreaction, can we truly reap their benefits.

Written by Aishwarya Jadhav