Dengue Fever and How to Predict It

Brian Connor
Towards Data Science
6 min read · Oct 31, 2018

Introduction

As my final project for class, I decided to revisit machine learning competitions with a more developed data science skill set. I chose predicting cases of Dengue fever in San Juan, Puerto Rico and Iquitos, Peru partly for the global health aspect, and partly because I wanted to improve on the methodology of my West Nile in Chicago project.

So, why Dengue? In the scope of global health, Dengue is a significant yet sometimes neglected disease, affecting an estimated 50–100 million people per year and costing nearly $9 billion globally. Peru saw more than 27,000 reported cases in 2016, and Puerto Rico had over 26,000 reported cases in 2010.

Using a combination of neural networks and regression models, I was able to predict cases of Dengue with moderate accuracy and glean a little insight into what some of the key predictive factors might be.

The Data

The data for this project was downloaded from DrivenData.org as part of their “DengAI: Predicting Disease Spread” challenge. In addition to the number of cases at each location, the data includes information on temperature, precipitation, humidity, vegetation, and what time of the year the data was obtained (the full list of features can be found here). Of the data provided, the only feature I needed to take a closer look at was satellite vegetation, which I learned is essentially an index of “green-ness” for a given area, calculated from satellite measurements of light reflected by plants. For example:

https://en.wikipedia.org/wiki/Normalized_difference_vegetation_index#/media/File:NDVI_062003.png
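For the curious, the underlying index (NDVI) is just a normalized ratio of two satellite reflectance bands; this is the standard formula, not anything specific to the competition data:

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: values near +1 mean dense
    vegetation, values near or below 0 mean bare ground or water."""
    return (nir - red) / (nir + red)
```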

The data also contained a decent chunk of null values. Rather than dropping all of them, I dropped only the nulls from the worst offending column (ndvi_ne, the vegetation index pixel northeast of the city centroid), totaling about 200 out of 1450 rows, and filled in the rest of the missing data with the average of that column for the same week. Here I made the assumption that, for example, week 36 of 1995 would be similar to week 36 of 1998 in terms of weather features.
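As a rough sketch of that imputation step (assuming the column names from the DengAI data dictionary, like weekofyear and ndvi_ne; this isn't my exact notebook code):

```python
import pandas as pd

# Training features from DrivenData (file name assumed from the challenge's data page).
df = pd.read_csv("dengue_features_train.csv")

# Drop rows where the worst offender, ndvi_ne, is missing (~200 of ~1,450 rows).
df = df.dropna(subset=["ndvi_ne"])

# Fill every remaining numeric null with that column's mean for the same
# week of the year, assuming week 36 of one year looks like week 36 of another.
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = df.groupby("weekofyear")[numeric_cols].transform(
    lambda col: col.fillna(col.mean())
)
```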

I also experimented with splitting the data into two separate sets by city so that information didn’t bleed from San Juan to Iquitos, on the assumption that similar weather conditions in each city could lead to significantly different numbers of mosquitoes. Unfortunately, splitting the data actually made my predictions worse, likely because it halved the data available to each model. My final models therefore used a combined data set, although with stronger models I would still want to try separating the cities because I think the assumption is still a valid one.

Modeling and Results

My first instinct in modeling was to build a neural network, which almost seemed like a silver bullet at first. But before I could score my first model, I hit a snag:

Submitting is as easy as 1, 2, 3… 4… 5… 6…

The dreaded submission format! Luckily, thanks to my tremendous coding chops, I was able to write a function that would take in a model and spit out predictions in a nice, submission-ready CSV.
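Something along these lines (a simplified sketch; the city / year / weekofyear / total_cases layout follows the DengAI submission format, while the function and variable names are my own):

```python
import numpy as np
import pandas as pd

def make_submission(model, X_test, id_cols, out_path="submission.csv"):
    """Write model predictions in the DengAI submission format.

    id_cols is a DataFrame with the city / year / weekofyear columns the
    submission file expects; X_test is the matching feature matrix.
    """
    preds = np.asarray(model.predict(X_test)).ravel()
    submission = id_cols.copy()
    # Case counts must be non-negative integers.
    submission["total_cases"] = np.clip(np.round(preds), 0, None).astype(int)
    submission.to_csv(out_path, index=False)
    return submission
```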

With that settled, I could finally score my magical neural networks. Unfortunately, they didn’t quite live up to my own hype. My first model, made with Keras, performed just OK, yielding a mean absolute error (MAE from now on) of 29.6 (for reference, the best score on DrivenData was an MAE of 13). Fortunately, this particular project benefits from unlimited submissions! Well, slightly limited: three per day, to be exact. Because of this I had to be a little tactical in how I generated predictions. That’s when I came across a Python package called Talos, which was built specifically to optimize hyperparameters in Keras, essentially working like GridSearch for neural networks.
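For context, the kind of network I was iterating on looked roughly like this (a minimal sketch with illustrative layer sizes, not my tuned hyperparameters):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features):
    """A small feed-forward regressor scored on mean absolute error."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(1),  # predicted weekly case count
    ])
    model.compile(optimizer="adam", loss="mean_absolute_error")
    return model

# Early stopping keeps the network from simply memorizing the training set.
early_stop = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# model = build_model(X_train.shape[1])
# model.fit(X_train, y_train, validation_split=0.2, epochs=200, callbacks=[early_stop])
```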

Various neural nets and their scores

In terms of scores, all my neural networks boiled down to “about 30”. For a little clarification on the labels: “poly-weather” indicates the use of polynomial features engineered from the weather predictors, “cleaning” indicates whether ndvi_ne null values were filled in (1) or dropped (2), “tuned/early” indicates the use of Talos and early stopping respectively, and “adjusted” means I un-scientifically subtracted 1 from each prediction because I couldn’t seem to get my models to predict 0. On the whole, every model I created was overfit, with test MAE and holdout MAE usually differing by about 15 points.

After feeling that I had tried enough neural network variations to be confident that the next set of predictions would also score “about 30”, I decided to look into other methods. While I definitely wanted to improve my score, I also wanted to come up with something a little more interpretable; it’s one thing to be able to predict cases of Dengue, but the next step is to determine exactly which features are most responsible for incidences of the disease. To tackle this dual goal, I combined a pipeline with GridSearch to try multiple different models and parameters at once:

Thanks to my friend William “Mac” McCarthy!
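The idea looks roughly like this (a simplified scikit-learn sketch, not Mac’s exact code; the parameter grids here are illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Lasso

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor()),  # placeholder; swapped out by the grid
])

# Each dict swaps a different regressor into the pipeline with its own grid.
param_grid = [
    {"model": [RandomForestRegressor(random_state=42)],
     "model__n_estimators": [100, 300],
     "model__max_depth": [None, 5, 10]},
    {"model": [BaggingRegressor(random_state=42)],
     "model__n_estimators": [10, 50]},
    {"model": [KNeighborsRegressor()],
     "model__n_neighbors": [5, 15, 25]},
    {"model": [Lasso(max_iter=10_000)],
     "model__alpha": [0.01, 0.1, 1.0]},
]

grid = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=5)
# grid.fit(X_train, y_train)
# print(grid.best_params_, -grid.best_score_)
```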

Out of these new models, the random forest performed the best with an MAE of 26.1, a solid four-point improvement! A close second was the bagging regressor at 26.5, followed by K-nearest neighbors at 30.2. I didn’t end up submitting the AdaBoost model because its train/test split score was so poor, about 15 points higher than the rest. To look at something more interpretable, I decided to focus on the LASSO model (which scored 34.3).

LASSO coefficients

One thing to mention is that for these models I consolidated the individual vegetation indexes into a single (misspelled) predictive index using a neural network. In general, the features with larger coefficients, and therefore stronger predictive power, were not very surprising, with year, week of year, precipitation, and minimum air temperature having significant influence on the model. The vegetation index was also a robust predictor, so a future step might be to separate it out again and investigate how to interpret the NDVI itself. TDTR, which in this dataset refers to the diurnal temperature range (and not, as a quick search might suggest, Time-Domain Thermoreflectance), is something you might have to investigate for yourself.
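Reading the coefficients out of a fitted LASSO is straightforward; here’s a sketch of the general approach (assuming X_train is a DataFrame of the predictors and y_train is the weekly case counts):

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def lasso_coefficients(X_train, y_train, alpha=0.1):
    """Fit a LASSO on standardized features and rank its coefficients by size."""
    # Standardizing first puts the coefficients on a comparable scale, so a
    # larger magnitude roughly means a stronger influence on predicted cases.
    X_scaled = StandardScaler().fit_transform(X_train)
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X_scaled, y_train)
    return (pd.Series(lasso.coef_, index=X_train.columns)
              .sort_values(key=abs, ascending=False))

# lasso_coefficients(X_train, y_train).head(10)  # the strongest surviving predictors
```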

Final Thoughts

While I’m pretty happy with what I’ve created and learned over the course of this project, there’s still a lot to try in pursuit of more accurate predictions and a better score, and with 7 weeks left in the competition at the time of writing, there’s plenty of time too! I think the most important issue to tackle first is how heavily my models overfit, which I can potentially address by limiting the neural network features to the most significant LASSO predictors. I also think it would be worthwhile to try to come up with some totally new features based on population density, healthcare spending, or treatment method data, to see if they would be more effective than some of the weaker predictors.

Looking forward to revisiting this project in a couple weeks, but for now thanks for reading!
