Multimodal machine learning approach to understand region-based COVID-19 disease burden

AC295 Group: No Idea Yet (Yu Qian Ang, Haoyuan Li, Chenyue Lu)

Image by Authors.

By: Yu Qian Ang, Haoyuan Li, Chenyue Lu

This is the final project for Harvard University AC295 Fall 2020 term (Group No Idea Yet). Special thanks to Prof. Pavlos Protopapas and Teaching Fellows.

Introduction

The COVID-19 pandemic has exposed stark disparities in modern society. Case and mortality rates are higher in marginalized communities, such as Native Americans living on reservations and immigrant families living with more people in crowded spaces. When examining the social determinants of health, we ask how geographical and spatial features relate to COVID-19 transmission patterns. Importantly, as the pandemic drags on into the winter months, people are forced indoors, and cold, dry air makes the virus more transmissible. Unfortunately, we expect an uptick in case counts, placing a higher burden on our healthcare system. We hope to predict which regions will see greater increases in COVID-19 transmission so that resources can be allocated preemptively.

Naturally, urban-scale features significantly affect both the behavior of the pandemic and the analysis of its epidemiology, such as spread and distribution. However, apart from typical weather and physics-based parameters such as temperature and wind speed, data on other urban-scale features are tedious to collect, collate, and analyze. Collecting these data also requires significant computational resources, and information on these parameters is often not readily available, which forces models to rely on assumptions. Current physics-based modeling of urban-level scenarios involving multiple buildings and their environment is mainly done using a combination of spreadsheets and physics-based simulations. Such an approach is unsuitable for resolving the details of urban microclimate and neighborhood planning, and especially for studying different regions in greater detail.

In this context, it is interesting to build machine learning models to determine what they can learn from publicly available data, and how useful the learned (latent) features are in helping us understand the epidemiology of the COVID-19 pandemic. In this project, we seek to develop models that identify features in urban-scale datasets. By combining these machine-learned features with socio-economic and physics-based data, we aim to produce accurate prediction models that help us better understand the pandemic and, ultimately, make better policy and epidemiological decisions.

Taking a multi-modal big-data approach, we wanted to find the features that are important in predicting COVID-19 disease burden, specifically by geographical region. We chose Wisconsin for the following reasons. Compared to other states, Wisconsin publishes data at a much more granular level (census tracts, which typically have about 4,000 residents, ranging between 2,000 and 8,000). Wisconsin ranks #8 nationwide in COVID-19 cases and has fewer hospital beds than the national average, yet the state reopened early and imposed fewer public health restrictions. The colder climate also makes it difficult for people to gather outside, making Wisconsin particularly vulnerable. We use it as a case study to examine the important predictors of COVID-19 transmission. Specifically, we adopt a multi-modal methodology, described below.

Data

There are four main data sources for this project. First, we obtain the daily COVID-19 cases by census tract from the Wisconsin Department of Health Services. This allows us to create our target labels.

Histogram for COVID-19 cases in Wisconsin (Left: absolute. Right: normalized). Image by Authors.

The positive case counts are evidently positively skewed, with a long right tail. As data for the initial months are largely empty, we used the months of June to November for our project. For regions with populations of 200,000 or less, the CDC's primary classification criteria are 1) COVID-19 case counts (cumulative new cases over the past 28 days) and 2) new case trajectory (whether new cases over the past 28 days are increasing, decreasing, or stable). In particular, the CDC has the following region risk classification:

  1. Level 1 (Low): fewer than 10 cases over the past 28 days
  2. Level 2 (Moderate): 10–50 cases over the past 28 days
  3. Level 3 (High): 51–100 cases over the past 28 days
  4. Level 4 (Very High): more than 100 cases over the past 28 days

We follow the above CDC risk classification to develop our target labels, and ended up with 1,392 census tracts instead of 1,409 after pre-processing. No census tract falls in the Level 1 (Low) category. Thus, we simply relabel the remaining three levels as Low Risk (Class 2), Moderate Risk (Class 3), and High Risk (Class 4) to form our risk classification target labels.
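The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration rather than our actual pre-processing code: the function names are hypothetical, the thresholds are taken from the list above, only the case-count criterion is used (not the trajectory), and returning `None` for a Level 1 tract is an assumption made for the sketch.

```python
# Relabeling of CDC Levels 2-4 as the three target classes described above.
MERGED_LABELS = {2: "Low Risk", 3: "Moderate Risk", 4: "High Risk"}


def cdc_level(cases_28d):
    """Map a cumulative 28-day case count to a CDC level (1 to 4)."""
    if cases_28d < 0:
        raise ValueError("case count cannot be negative")
    if cases_28d < 10:
        return 1
    if cases_28d <= 50:
        return 2
    if cases_28d <= 100:
        return 3
    return 4


def risk_label(cases_28d):
    """Return the merged 3-class label, or None for a Level 1 tract (assumption)."""
    return MERGED_LABELS.get(cdc_level(cases_28d))


# Example: label a handful of hypothetical 28-day case counts.
labels = [risk_label(count) for count in (12, 75, 240)]
```

Keeping the CDC level and the merged label as separate steps makes it easy to change the merge if a different region does contain Level 1 tracts.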

The second set of data comprises socio-economic data obtained from the United States Census, which tabulates socio-economic information such as racial and ethnic makeup, poverty level, and housing type.

Histograms for socio-economic (census) data for Wisconsin. Image by Authors.

There is wide diversity across Wisconsin. More rural areas tend to be poorer. In the northern regions, householders are more likely to be older than 65 years of age. There also seems to be a difference in racial makeup, with the highest percentage of Hispanic populations living in Milwaukee County, for example.

Heatmaps of selected Wisconsin socio-economic (census) data. Left: Poverty (normalized). Middle: Percentage of Older Households (normalized). Right: Percentage of Black Individuals (normalized). Image by Authors.

Next, we scraped information from weather stations around Wisconsin to obtain data on the physical environment, such as wind speed, relative humidity, and temperature, and we also calculated other parameters such as hours of natural ventilation. The image below shows some of the weather stations from which we obtained data, as well as the calculated natural ventilation hours. For example, although the area around Milwaukee County is closer to the water body, it has a higher built-up area due to its urbanized spaces, and therefore less natural ventilation.

Left: Natural ventilation hours. Right: Histograms for average temperature and relative humidity. Image by Authors.

Our final set of data comprises satellite images scraped for the entire state of Wisconsin. We wrote a custom script to extract individual raster tiles at the highest resolution provided by the Mapbox API. For each census tract, our script stitched 9 x 13 tiles to form a super-tile characterizing that census tract. We automated the process to make over 200,000 API calls for all 1,409 census tracts in Wisconsin. The final satellite imagery dataset was stored in Google Drive.
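To make the stitching layout concrete, here is a rough sketch of how a 9 x 13 grid of tiles maps onto one super-tile. It does not call the Mapbox API or reproduce our script. The 512-pixel tile size is an assumption inferred from the 4608 x 6656 x 3 super-tile dimension reported in the modeling section (4608 / 9 = 512 and 6656 / 13 = 512), and the function names are hypothetical.

```python
TILE_PX = 512        # assumption: inferred from 4608 / 9 and 6656 / 13
GRID_SHAPE = (9, 13)  # tiles along the first and second image axes


def supertile_shape(grid_shape=GRID_SHAPE, tile_px=TILE_PX, channels=3):
    """Pixel dimensions of the stitched super-tile."""
    return (grid_shape[0] * tile_px, grid_shape[1] * tile_px, channels)


def tile_offsets(grid_shape=GRID_SHAPE, tile_px=TILE_PX):
    """Top-left pixel offset of every tile inside the super-tile."""
    return {
        (i, j): (i * tile_px, j * tile_px)
        for i in range(grid_shape[0])
        for j in range(grid_shape[1])
    }


shape = supertile_shape()
offsets = tile_offsets()
```

Each downloaded tile would then be pasted into an empty array of size `shape` at its entry in `offsets`.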

Approach for scraping and feature engineering satellite data. Image by Authors.

Model & Methodology

Given the widespread impact and severity of COVID-19, researchers in various domains (e.g. epidemiology, mechanical engineering) have attempted to analyze the spreading patterns of the pandemic. These patterns are currently analyzed in silos, restricted within domains. We approached this as a multi-disciplinary problem underpinned strongly by machine learning techniques. Specifically, we treat this project as a multi-level inference and prediction problem incorporating computer vision, feature extraction, and regression techniques. The figure below illustrates our proposed workflow.

Our multi-modal approach. Image by Authors.

First, we built two convolutional autoencoder (CAE) models: one from scratch and one using a ResNet model from TensorFlow Hub pretrained on a satellite land cover dataset. For each census tract super-tile, we sampled 100 segments in a grid and ran them through our CAE. The CAE converts the super-tile of dimension 4608 x 6656 x 3 to a latent representation of dimension (4, 4, 4), which we flattened into a 64-dimensional vector representing the urban latent features of each census tract. During training and experimentation, the CAE built from scratch performed better than the pretrained model, so we continued our modeling with the from-scratch CAE.
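As a rough illustration of the grid sampling and flattening steps, the sketch below picks 100 evenly spaced segment origins on a 4608 x 6656 super-tile and flattens a (4, 4, 4) latent block into 64 values. The 256-pixel segment size, the 10 x 10 grid layout, and the function names are assumptions for illustration only. The sketch also does not cover how the latent outputs of the individual segments are combined into one vector per tract.

```python
def grid_origins(height, width, segment, rows=10, cols=10):
    """Top-left (row, col) corners of rows x cols evenly spaced segments."""
    def positions(extent, count):
        last = extent - segment  # last origin that keeps the segment in bounds
        if count == 1:
            return [0]
        return [round(i * last / (count - 1)) for i in range(count)]

    return [(r, c) for r in positions(height, rows) for c in positions(width, cols)]


def flatten(latent):
    """Flatten a nested 4 x 4 x 4 list into a flat list of 64 values."""
    return [value for plane in latent for row in plane for value in row]


SEGMENT_PX = 256  # hypothetical segment size, not stated in the article
origins = grid_origins(4608, 6656, SEGMENT_PX)

# Placeholder latent block standing in for one encoder output.
latent = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
vector = flatten(latent)
```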

Baseline Model

In our baseline model, we fit only the socio-economic and physical-environment data (without the latent features from the satellite images). We compared an XGBoost classifier with a Random Forest classifier, and both obtained on average 65% to 67% accuracy in predicting the COVID-19 risk class on a 0.25 hold-out test set. We tuned the hyperparameters and found that a maximum depth between 5 and 10 with 100 estimators works better for both models.
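A baseline of this kind could be set up roughly as follows with scikit-learn. This is a sketch on synthetic data, not our pipeline: the feature matrix and 3-class target are randomly generated stand-ins, only the Random Forest half of the comparison is shown, and `max_depth=8` is simply one value inside the 5 to 10 range mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_tracts, n_features = 400, 6

# Synthetic stand-in for the socio-economic and physical-environment features.
X = rng.normal(size=(n_tracts, n_features))

# Synthetic 3-class target driven by the first two columns plus noise.
signal = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_tracts)
y = np.digitize(signal, bins=[-0.5, 0.5])  # classes 0, 1, 2

# 0.25 hold-out test set, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
importances = model.feature_importances_  # one value per feature, sums to 1
```

The `feature_importances_` array is what a feature importance bar chart for a Random Forest would typically be drawn from.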

Left: Feature Importance for Baseline XGBoost Model. Right: Feature Importance for Baseline Random Forest model. Image by Authors.

As shown above, feature importance differs between the XGBoost and Random Forest models. In the XGBoost model, physical environment parameters such as relative humidity and temperature range have higher importance, whereas in the Random Forest model, socio-economic parameters such as ethnic mix, poverty, and percentage of older households rank higher.

CAE

To improve our model in line with our methodology, we used the CAE to extract latent features from the super-tiles. Our model has the following encoder-decoder structure, and training for 5 epochs is sufficient for convergence.

Left: Encoder architecture. Right: Decoder architecture. Image by Authors.

One epoch took approximately 15 minutes to train on Colab Pro with a GPU enabled.

CAE training and validation loss. Image by Authors.

Even after a drastic dimensionality reduction from (4608 x 6656 x 3) to (4 x 4 x 4), key features are evidently captured (for example, whether a particular tile comprises more urbanized areas or more greenery). This forms part of our hypothesis that COVID-19 risk and growth rate differ between highly urbanized areas and sparse, low-density regions.

Samples of satellite reconstruction vs original high-res tiles. Image by Authors.

Enhanced Model (Classification)

In our enhanced model, we added the latent features from the CAE to complete the model. We adopted two approaches. In the first, we directly added all 64 dimensions of the latent satellite features to the XGBoost and Random Forest models. In the second, we performed Principal Component Analysis (PCA) on the 64 dimensions to obtain the 2 principal components that explain the most variance.
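The second approach can be sketched with a small NumPy implementation of PCA via the singular value decomposition. The latent matrix here is random data standing in for the CAE output, and the function name is hypothetical. In practice a library routine such as scikit-learn's `PCA` would be the more usual choice.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components.

    Returns the component scores and the fraction of variance each explains.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = X_centered @ Vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:n_components]


rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 64))  # stand-in for 64-dim CAE latent vectors

scores, ratio = pca_project(latent, n_components=2)
```

The two columns of `scores` would then be appended to the socio-economic and physical-environment features in place of the 64 raw latent dimensions.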

Left: Feature Importance for XGBoost model with 64 latent features. Right: Feature Importance for Random Forest model with 64 latent features. Image by Authors.

The enhanced models with all 64 latent feature dimensions outperform the baseline, averaging 68% to 71% accuracy. In terms of feature importance, the XGBoost and Random Forest models are now more similar, with socio-economic parameters such as racial mix, poverty, and older households taking on higher prominence.

Left: Feature Importance for XGBoost model with 2 principal components. Right: Feature Importance for Random Forest model with 2 principal components. Image by Authors.

The model with the principal components (PC1 and PC2) outperformed the model with all 64 dimensions directly added, averaging slightly over 72% accuracy. The feature importance shows that, apart from the usual socio-economic parameters, the first principal component (PC1) plays a significant role in both the XGBoost and Random Forest models. The second principal component (PC2) also falls within the top 10 most important features, albeit lower in importance.

SHAP Analysis

SHAP analysis and waterfall plots for 3 classes in the model (low risk, moderate risk, high risk regions). Image by Authors.

We ran a SHAP analysis on the enhanced model; the SHAP and waterfall plots above show the results. Both principal components feature prominently across all 3 classes. Other key variables include racial mix, poverty (normalized), and percentage of population above 65 years old, as well as the occasional physical environment parameter such as temperature range.

Enhanced Model (Regression)

In addition to the classification task, we wanted to see how well our model performs on regression, which is arguably more challenging. For the target values, we feature-engineered the original COVID-19 data and used the cumulative average growth rate for each census tract as the target, where growth is defined as:

growth rate = ((end value / start value) ^ (1 / days) - 1) x 100%
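The growth definition above translates directly into a small helper. This is a sketch with a hypothetical function name. Rejecting a non-positive start value is an assumption made here, since the formula is undefined when a tract starts with zero cases, and the article does not say how such tracts were handled.

```python
def daily_growth_rate_pct(start_value, end_value, days):
    """Average per-day growth rate, in percent, between two case counts."""
    if start_value <= 0 or days <= 0:
        raise ValueError("start_value and days must be positive")
    return ((end_value / start_value) ** (1.0 / days) - 1.0) * 100.0


# Cases growing from 100 to 121 over 2 days correspond to about 10% per day,
# because 100 * 1.1 * 1.1 = 121.
example = daily_growth_rate_pct(start_value=100, end_value=121, days=2)
```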

We tried three tree-based ensemble models (XGBoost, Random Forest, and Extra Trees regression), each incorporating all predictor data sets (socio-economic, physical environment, and latent features from satellite data after principal component analysis). The lowest RMSE across the three models was 1.7, and the best R² between predictions and true values on a 0.25 held-out test set was 0.63.
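For reference, the two regression metrics quoted above can be computed as in the sketch below. It assumes that the R² figure refers to the coefficient of determination (it could also be read as a squared correlation), and the toy values are made up purely to exercise the functions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up growth rates for four tracts, purely for illustration.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

toy_rmse = rmse(y_true, y_pred)
toy_r2 = r_squared(y_true, y_pred)
```

RMSE is expressed in the same units as the target (percentage points of daily growth), while R² is unitless, which is why the two are usually reported together.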

R² for prediction vs true value for best tree-based ensemble regression model. Image by Authors.
Left: XGBoost Regression. Middle: Random Forest Regression. Right: Extra Trees Regression. Image by Authors.

Across the three models, some parameters are consistently prominent. For example, the poverty (normalized) parameter consistently features as one of the top predictors, together with racial makeup (e.g. the percentage of Black or Hispanic residents). However, the XGBoost model seems to favor physical environment factors (e.g. relative humidity and wind speed), while the Random Forest and Extra Trees models favor socio-economic factors. The principal components from the latent satellite features are less important in the regression models than in the classification models, but still appear among the top 10 most important features.

Conclusion

In our project, we developed a multi-modal machine learning approach to study COVID-19 and applied it to the state of Wisconsin as a case study. We found several predictors to be important across all our models, and infer that socio-economic parameters seem to be more important than physical environment factors in determining the risk and spread of COVID-19.

This study has a few limitations. SARS-CoV-2 is an airborne virus, and its transmission is heavily influenced by social distancing, mask wearing, and other public policies and measures. Since we do not have good tabulated data on these factors, limiting our study to one state with more homogeneous demographics and regulations helps control for many of them. Our data are also quite static, with the exception of the physics-based data. COVID-19 cases change over time, and we could benefit from additional data sources that capture more of the temporal changes.

