Predicting Dengue Cases in Singapore

Adeline Ong

Published in

Towards Data Science

6 min read · Nov 2, 2019

You’re enjoying a pleasant evening in the park, picnic mat laid out, a nice cold drink in hand.

“SMACK!”

You just killed an Aedes aegypti mosquito. You admire its distinctive white-and-black striped body on your hand. Just a harmless bite, you think.

The following week, you’re lying in bed with a high fever, muscle aches, and skin rashes. What on earth happened? You got dengue from an Aedes mosquito that hatched about two weeks ago.

An Early Warning System

Dengue is a terrible affliction. Without timely or effective interventions, sporadic cases of dengue can easily turn into an epidemic. Aedes mosquitos contract dengue from feeding on infected humans, or from their mothers, which can lay hundreds of eggs during their two-week lifespan.

If an epidemic occurs, an early warning system could help healthcare providers better manage an increase in dengue patients.

Potential Predictors of Dengue

Weather info: Female Aedes mosquitos love hot and rainy conditions. They lay their eggs in stagnant water, and warm temperatures encourage their offspring to grow faster.

Google Trends: Search terms containing ‘dengue’ have been used successfully to predict dengue cases in a number of countries. This is probably because those who are more susceptible to the virus are also more likely to be googling it.

Data Collection

Singapore-specific data from 2014 to 2018 on dengue cases, weather info and search interests were collected from the web.

Dengue and weather data were scraped from government websites using Selenium and BeautifulSoup:

Weekly number of dengue cases. Only data from 2014 to 2018 was available at the time of scraping. This limit set the time period for all other data sources.
Daily weather information. Only data from the Changi weather station was scraped; this station serves as the government’s main historical reference point.

Google Trends search interest data for search term ‘dengue’ from 2014 to 2018.

Weekly search interest data for the following terms were obtained from Google Trends:

ache OR pain
dengue
fever
headache
vomiting
nausea
rashes
eye pain

Data Cleaning & Merging

The datasets were relatively clean, except for duplicates and missing values, which were removed. Dengue haemorrhagic fever cases were also removed from the dengue dataset, as such cases were few and far between.

The main challenge was creating a common time feature on which to merge the datasets. The dengue dataset was dated by epidemiological week (e-week, e.g. Week 52 of 2018). The epiweeks package was used to convert the other datasets’ date features to e-weeks.
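To illustrate what such a date-to-e-week conversion computes, here is a dependency-free sketch. It is not the epiweeks package itself: it assumes the CDC/MMWR convention (weeks run Sunday to Saturday, and week 1 is the first week with at least four days in the new year), and the helper names `week1_start` and `to_eweek` are hypothetical. Whether the dengue dataset follows exactly this convention is also an assumption.

```python
from datetime import date, timedelta


def week1_start(year: int) -> date:
    """Return the Sunday that starts e-week 1 of `year` (MMWR convention)."""
    jan4 = date(year, 1, 4)  # 4 January always falls inside week 1
    days_since_sunday = (jan4.weekday() + 1) % 7  # weekday(): Monday=0 ... Sunday=6
    return jan4 - timedelta(days=days_since_sunday)


def to_eweek(d: date) -> tuple:
    """Map a calendar date to (e-week year, e-week number)."""
    # Late-December dates can belong to week 1 of the next year, and
    # early-January dates to the last week of the previous year.
    for year in (d.year + 1, d.year, d.year - 1):
        start = week1_start(year)
        if d >= start:
            return year, (d - start).days // 7 + 1
    raise ValueError("unreachable for valid dates")


print(to_eweek(date(2018, 12, 29)))  # (2018, 52)
print(to_eweek(date(2018, 12, 30)))  # (2019, 1)
```

The year boundary is the part that is easy to get wrong: the last few days of December can already belong to week 1 of the following year.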

For the weather dataset, groupby was used to get the average temperature and rainfall for each e-week.
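A minimal sketch of that aggregation and the subsequent merge, using made-up numbers. The column names (`eweek_year`, `eweek`, `mean_temp_c`, `rainfall_mm`, `dengue_cases`) are hypothetical, and the daily rows are assumed to be already tagged with their e-week.

```python
import pandas as pd

# Hypothetical daily weather records, already tagged with their e-week.
daily = pd.DataFrame({
    "eweek_year": [2018, 2018, 2018, 2018],
    "eweek": [51, 51, 52, 52],
    "mean_temp_c": [27.0, 28.0, 26.5, 27.5],
    "rainfall_mm": [0.0, 12.0, 3.0, 5.0],
})

# Average the daily readings within each e-week.
weekly = (
    daily.groupby(["eweek_year", "eweek"], as_index=False)
         .agg(mean_temp_c=("mean_temp_c", "mean"),
              rainfall_mm=("rainfall_mm", "mean"))
)

# Hypothetical weekly dengue counts keyed by the same e-week columns.
dengue = pd.DataFrame({
    "eweek_year": [2018, 2018],
    "eweek": [51, 52],
    "dengue_cases": [120, 98],
})

# Merge on the shared time feature.
merged = dengue.merge(weekly, on=["eweek_year", "eweek"], how="inner")
print(merged)
```

Grouping on both the e-week year and the e-week number keeps, say, week 1 of 2015 separate from week 1 of 2016.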

All datetime features were dropped before fitting the data to the model.

Time Lags

Time lags were used to account for the time taken for symptoms to develop in an infected person, and for Aedes mosquitos to hatch.

For each weekly dengue observation, its corresponding search interest data would be from one week earlier, and weather data would be from two weeks earlier. Here’s why:

Symptoms of dengue typically take 4 to 7 days to appear after infection. I assume potential patients are likely to Google dengue and its symptoms during this period.
Weather changes are assumed to affect mosquito breeding rates. The assumption is that higher temperatures encourage Aedes mosquitos to lay eggs; Aedes mosquitos take about a week to hatch and grow. Adding roughly another week for symptoms to appear in the people they bite gives the two-week lag.
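The lags described above can be sketched with pandas `shift`. The numbers and column names here are made up for illustration, and the sketch assumes one row per consecutive e-week with no gaps (otherwise a row shift would not correspond to a one-week lag).

```python
import pandas as pd

# Hypothetical weekly data, one row per consecutive e-week.
weekly = pd.DataFrame({
    "eweek": [1, 2, 3, 4, 5],
    "dengue_cases": [40, 55, 61, 58, 70],
    "search_dengue": [10, 14, 18, 17, 21],
    "mean_temp_c": [26.8, 27.1, 27.9, 28.3, 28.0],
}).sort_values("eweek")

lagged = (
    weekly.assign(
        # Search interest from one week earlier.
        search_dengue_lag1=weekly["search_dengue"].shift(1),
        # Weather from two weeks earlier.
        mean_temp_c_lag2=weekly["mean_temp_c"].shift(2),
    )
    .drop(columns=["search_dengue", "mean_temp_c"])
    .dropna()  # the first rows have no earlier data to look back on
)
print(lagged)
```

The rows lost to `dropna` are the missing values that shrink the dataset once lags are applied.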

Train-Test Split

The merged dataframe was split into 20% test data, and 80% training-validation data (for k-fold cross-validation).
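A minimal sketch of such a split with scikit-learn's `train_test_split`. The data is random filler sized to the 248 observations mentioned later in this post; the column names, the shuffled (rather than chronological) split and the `random_state` are assumptions, not details from the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in for the merged dataframe (248 weekly observations).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "log_search_dengue": rng.normal(size=248),
    "mean_temp_c": rng.normal(loc=27.5, size=248),
    "log_cases": rng.normal(size=248),
})

X = df.drop(columns="log_cases")
y = df["log_cases"]

# Hold out 20% for final testing; the rest is used for k-fold cross-validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_trainval), len(X_test))  # 198 50
```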

Feature Engineering & Transformation

A temperature range feature was created by taking the difference between the maximum and minimum temperatures. I expected a smaller temperature range to be predictive of dengue cases, especially during warmer weeks.

Log transformations were applied to the number of dengue cases (target) and search interest in dengue (feature) to correct for skew and improve their correlation.

All variables were standardised using StandardScaler before model fitting.
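The three steps in this section can be sketched as follows. The values and column names are invented, the natural log is an assumption (the post does not say which log was used), and only the features are scaled here for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical weekly rows.
df = pd.DataFrame({
    "max_temp_c": [31.2, 32.5, 30.1, 33.0],
    "min_temp_c": [25.0, 26.1, 24.8, 26.5],
    "search_dengue": [12, 30, 8, 55],
    "dengue_cases": [60, 150, 45, 310],
})

# Temperature range: difference between weekly maximum and minimum.
df["temp_range_c"] = df["max_temp_c"] - df["min_temp_c"]

# Log transforms to reduce right skew (np.log1p would be needed if zeros occur).
df["log_search_dengue"] = np.log(df["search_dengue"])
df["log_cases"] = np.log(df["dengue_cases"])

# Standardise features to zero mean and unit variance.
features = ["max_temp_c", "min_temp_c", "temp_range_c", "log_search_dengue"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features])
print(X_scaled.round(2))
```

In a real pipeline the scaler would typically be fitted on the training-validation data only and then applied to the test data.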

Feature Selection

Feature selection was conducted at various stages of the model building process.

Before building the models: To avoid multicollinearity, features that were highly correlated were removed.
After cross-validation established that a linear model was likely to be suitable for the data: Lasso regression was used for feature selection. An alpha value was obtained using LassoCV and applied to the model. Features that had zero-value coefficients were removed.
After selecting an appropriate linear model: Features were selected using the backward stepwise method. The feature with the highest p-value was removed, one at a time, until removing another feature reduced the model’s adjusted r².
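The first two stages above can be sketched on synthetic data as below. The 0.9 correlation threshold, the feature names and the data itself are assumptions made for illustration; the post does not state the cut-off that was used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

# Synthetic features on a roughly standardised scale.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "log_search_dengue": rng.normal(size=n),
    "mean_temp_c": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# A near-duplicate of mean_temp_c, to mimic multicollinearity.
X["max_temp_c"] = X["mean_temp_c"] + rng.normal(scale=0.05, size=n)
y = (2.0 * X["log_search_dengue"] + 0.5 * X["mean_temp_c"]
     + rng.normal(scale=0.5, size=n))

# Stage 1: drop one feature from each highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Stage 2: let LassoCV pick alpha, then keep features with non-zero coefficients.
lasso = LassoCV(cv=5, random_state=0).fit(X_reduced, y)
kept = [col for col, coef in zip(X_reduced.columns, lasso.coef_) if coef != 0]

print("Dropped for collinearity:", to_drop)
print("Chosen alpha:", lasso.alpha_)
print("Kept by Lasso:", kept)
```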

Here’s a summary of the variables and why they were removed:

Table of variables, and their status (removed/included in model)

K-fold Cross-validation

Five-fold cross-validation was used for model selection due to the small number of observations (248, after removing the missing values that arose from applying the time lags to the data).

Below is a summary of the linear models used, and their r² means and standard deviations across k-folds. For Ridge and Lasso, the alpha values were set to the default value of 1.

Simple linear model r²: 0.755 ± 0.041
Polynomial model r²: 0.441 ± 0.153
Ridge model r²: 0.755 ± 0.043
Lasso model r²: -0.010 ± 0.012
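A comparison like the one above can be sketched with scikit-learn's `cross_val_score`. The data here is synthetic, so the scores will not match the figures reported; the polynomial degree, the shuffled folds and the 198-row size (80% of 248) are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the standardised training-validation data.
rng = np.random.default_rng(0)
X = rng.normal(size=(198, 4))
y = X @ np.array([0.6, 0.3, 0.0, -0.2]) + rng.normal(scale=0.5, size=198)

models = {
    "Simple linear": LinearRegression(),
    "Polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "Ridge": Ridge(alpha=1.0),  # default alpha
    "Lasso": Lasso(alpha=1.0),  # default alpha
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name} model r²: {fold_scores.mean():.3f} ± {fold_scores.std():.3f}")
```

One plausible reading of the near-zero Lasso score: when both features and target are standardised, scikit-learn's Lasso with alpha = 1 shrinks every coefficient to zero, so the model predicts the training mean and r² lands at or slightly below zero.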

To select between simple and Ridge models, I re-ran the cross-validation after performing feature selection using Lasso regression, and applying an alpha value obtained from RidgeCV.

Simple linear model r²: 0.757 ± 0.043
Polynomial model r²: 0.665 ± 0.053
Ridge model r²: 0.757 ± 0.046
Lasso model r²: -0.010 ± 0.012

The r² means and standard deviations of the simple and Ridge models were similar, so the simpler of the two, a simple linear regression, was selected.

Testing & Evaluation

The model was re-trained using statsmodels on the entire training-validation set (80% of the data). Backward stepwise feature selection was performed to further simplify the model.

Here are the results from the final model:

statsmodels output for the training-validation set

When the model was evaluated on the test data, the resulting adjusted r² was 0.760. The model appears to generalise well.
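For reference, adjusted r² penalises r² for the number of features used: adjusted r² = 1 − (1 − r²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of features. A small sketch of that calculation follows; the helper name `adjusted_r2` and the toy numbers are hypothetical.

```python
import numpy as np


def adjusted_r2(y_true, y_pred, n_features: int) -> float:
    """Adjusted r² for predictions made with `n_features` predictors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


# Toy check: r² is 0.99 here, and the adjustment lowers it slightly.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
print(round(adjusted_r2(y_true, y_pred, n_features=1), 4))  # 0.9867
```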

Checking Assumptions

Left to Right: Scatterplot of y-predicted and y-residuals; Q-Q plot of y-residuals; Scatterplot of e-week and y-residuals.

There were no discernible patterns from the scatter plots, and y-residuals appear to be normally distributed, except for some deviations at the tails.
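Diagnostic plots like these can be produced along the following lines with matplotlib and SciPy. The residuals below are random placeholders rather than the model's actual residuals, and the variable names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Placeholder predictions and residuals, one per e-week.
rng = np.random.default_rng(0)
eweek = np.arange(1, 199)
y_pred = rng.normal(size=eweek.size)
residuals = rng.normal(scale=0.5, size=eweek.size)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Residuals vs. predictions: look for funnels or curves.
axes[0].scatter(y_pred, residuals, s=10)
axes[0].axhline(0, color="grey", linewidth=1)
axes[0].set(xlabel="y-predicted", ylabel="y-residuals")

# Q-Q plot: points near the line suggest normally distributed residuals.
stats.probplot(residuals, dist="norm", plot=axes[1])

# Residuals over time: look for trends or seasonality left unexplained.
axes[2].scatter(eweek, residuals, s=10)
axes[2].axhline(0, color="grey", linewidth=1)
axes[2].set(xlabel="e-week", ylabel="y-residuals")

fig.tight_layout()
```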

Judging from the statsmodels output and the plots above, a linear regression model appears suitable for the data.

Conclusion

The model could explain 77% of the variance in the log of weekly dengue cases in Singapore, and generalises well.

Search interest in ‘dengue’ appears to be the most important feature when predicting future dengue cases. It’s possible that this feature already incorporates some information about mosquito breeding and transmission, as Singapore news outlets tend to report on breeding clusters, dengue infections and related deaths.

Therefore, healthcare providers could use dengue search interest as a rough indicator instead of monitoring a wide range of variables.

Future Work

I was surprised by the significant negative coefficients for headache and ache/pain. Awareness of dengue in Singapore is high. I suspect that people who are aware of dengue could be directly searching for ‘dengue’ when these symptoms occur following a mosquito bite or when dengue cases are increasing. Hence, a better understanding of search patterns could help improve the model.

***GitHub***

***This was a METIS Data Science Bootcamp project. Linear regression and web scraping were project requirements***