Old Health Behaviors Are Shaping Our ‘New Normal’

Understand how population health history impacts the COVID-19 spread with CDC census data

Estella Zhang
Towards Data Science

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

* COVID death data is taken from the 100th day after each city’s first confirmed case.

Two hundred days into 2020, the ‘new normal’ is still delivering jump scares every day. With the curve still not flattening, we start to ask ourselves: What is the new normal? More importantly, what can we do to help ourselves and others survive it?

There are hundreds of rumors around COVID-19: Is it just the flu? Are young people immune to it? Is it only dangerous to people with underlying conditions? Short answer: no, no, and no. However, public health data might give us insights into who is more vulnerable to the virus, and what communities should do to protect you and your loved ones.

Data Overview

To understand how population health history impacts the COVID-19 spread, we combined three sources: the daily COVID activity data from data.world; the CDC’s 500 Cities: Local Data for Better Health, 2018 release; and city/county mapping data from simplemaps.

COVID Measures: Because the dates of the first confirmed case range from February to April, we use each area’s data from the 100th day after its first case. This reduces differences caused by how long the virus has had to spread.

  • Deaths1M: deaths per 1,000,000 people, by county

Dates of the first confirmed COVID case by city
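The 100th-day snapshot can be sketched in pandas. This is only an illustration under stated assumptions, not the project’s actual code: the column names (county, date, cases, deaths, population) and the helper deaths_per_million_on_day are hypothetical, cases and deaths are assumed to be cumulative counts, and the date of the first non-zero case count is treated as day 0.

```python
import pandas as pd


def deaths_per_million_on_day(daily: pd.DataFrame, day: int = 100) -> pd.DataFrame:
    """Return one row per county with Deaths1M measured `day` days after its first case.

    Assumes hypothetical columns: county, date, cases, deaths, population,
    where cases and deaths are cumulative counts.
    """
    daily = daily.copy()
    daily["date"] = pd.to_datetime(daily["date"])

    # Date of the first confirmed case in each county.
    first_case = (
        daily[daily["cases"] > 0]
        .groupby("county")["date"]
        .min()
        .rename("first_case_date")
        .reset_index()
    )

    merged = daily.merge(first_case, on="county", how="inner")
    merged["days_since_first"] = (merged["date"] - merged["first_case_date"]).dt.days

    # Keep only the row recorded exactly `day` days after the first case.
    snapshot = merged[merged["days_since_first"] == day].copy()
    snapshot["Deaths1M"] = snapshot["deaths"] / snapshot["population"] * 1_000_000
    return snapshot[["county", "first_case_date", "Deaths1M"]].reset_index(drop=True)
```

Aligning every county on days since its own first case, rather than on a calendar date, is what lets an area whose outbreak began in February be compared with one that began in April.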

Population Health Measures: The data includes 27 chronic disease measures covering unhealthy behaviors, health outcomes, and the use of preventive services in 500 U.S. cities. The values are shown as percentages. For more details, please visit the CDC project page.

Measures by category

City/County Mapping: City/county mapping data was used to combine the above two datasets. If a county includes multiple cities, we assume the COVID death rates are the same for all cities.
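The join described above might look like the following pandas sketch. The table layouts, the column names (city, state, county, Deaths1M) and the helper attach_county_deaths are assumptions made for illustration; the sketch also assumes each city maps to exactly one county, which real mapping data may not guarantee.

```python
import pandas as pd


def attach_county_deaths(
    city_health: pd.DataFrame,
    city_to_county: pd.DataFrame,
    county_covid: pd.DataFrame,
) -> pd.DataFrame:
    """Give every city the Deaths1M value of the county it belongs to."""
    # Step 1: look up the county of each city (one mapping row per city).
    cities = city_health.merge(
        city_to_county, on=["city", "state"], how="inner", validate="one_to_one"
    )
    # Step 2: many cities may share one county, so they inherit the same death rate.
    return cities.merge(
        county_covid, on=["county", "state"], how="inner", validate="many_to_one"
    )
```

The validate argument makes pandas raise an error if the mapping is not as clean as assumed, for example if a city appears twice in the mapping table.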

After combining and cleaning the data, we have 497 cities with COVID measures, 27 health measures and population density.

Data summary — histogram

What factors are impacting the COVID deaths in an area?

What do we know:

  • Areas with early first confirmed COVID-19 cases are more likely to be under-prepared at first and take heavier hits.
  • Without a vaccine, total confirmed cases might be less important in the longer term. Properly allocating resources to reduce the death rate will be critical for the ‘New Normal’.

Assuming 100 days is sufficient for most areas to stabilize the situation and announce safety regulations, we will start by analyzing the correlation between our health measures and Deaths1M on the 100th day since the first case.

Keep the doctor a̶w̶a̶y̶ close

This chart can be overwhelming with all the measures we have, but take a quick look at the part below 0: based on the Pearson correlation coefficient, 5 of the 6 measures that are negatively correlated with COVID deaths are prevention services*. This could indicate better health awareness, as well as a better experience with the healthcare system, in those areas.
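A ranking like the one in the chart can be computed with a few lines of pandas. This is a minimal sketch, assuming a combined table with one row per city, one numeric column per health measure and a Deaths1M column; the helper name correlations_with_deaths is hypothetical.

```python
import pandas as pd


def correlations_with_deaths(df: pd.DataFrame, target: str = "Deaths1M") -> pd.Series:
    """Pearson correlation of every numeric measure with the target, lowest first."""
    measures = df.drop(columns=[target]).select_dtypes(include="number")
    # corrwith uses the Pearson coefficient by default.
    return measures.corrwith(df[target]).sort_values()
```

Sorting the result puts the negatively correlated measures at the top, which corresponds to the below-0 part of the chart.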

Which groups are more vulnerable to COVID-19?

To better understand this question, we used an OLS regression model to investigate which measures have a stronger impact on the COVID-19 death rate in a given area.

From the fitted model, we got an adjusted R-squared of 0.508 on the training set, which means the health measures in the model explain about 50.8% of the variance in COVID death rates across areas. When we use the model to predict the test data, the R-squared drops to 0.236. That gap is not surprising, since the damage done by the virus also depends on many other factors, such as which safety regulations were put in place, how well people practice social distancing, and a hundred other things we don’t yet understand about the virus itself.
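To make the two scores concrete, here is a small NumPy sketch of fitting OLS on a training split and scoring it on held-out data. It runs on synthetic data, not the article’s dataset, and the function names, split sizes and noise level are assumptions chosen only for illustration.

```python
import numpy as np


def fit_ols(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares coefficients for y ~ intercept + X (intercept comes first)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def r_squared(coef: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """Share of the variance in y explained by the fitted coefficients."""
    design = np.column_stack([np.ones(len(X)), X])
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2: float, n_obs: int, n_features: int) -> float:
    """Penalize R-squared for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


# Synthetic stand-in data: 200 "cities", 5 "measures", noisy linear outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=4.0, size=200)

# Hold out the last 50 rows as a test set.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

coef = fit_ols(X_train, y_train)
train_r2 = r_squared(coef, X_train, y_train)
train_adj_r2 = adjusted_r_squared(train_r2, len(y_train), X_train.shape[1])
test_r2 = r_squared(coef, X_test, y_test)
print(f"train adjusted R2: {train_adj_r2:.3f}, test R2: {test_r2:.3f}")
```

The test score is computed with coefficients that never saw the held-out rows, which is why it is the more honest measure of how well the model predicts new areas.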

Now take a look at the measures that are statistically significant in the model (p ≤ 0.05):

Among the 27 measures, 6 health conditions, 3 prevention services, and one unhealthy behavior have a statistically significant effect on COVID deaths per million people. Arthritis, asthma, high cholesterol, and lack of leisure-time physical activity have strong positive correlations with COVID deaths.
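For readers curious where a cutoff like p ≤ 0.05 comes from, the sketch below computes a p-value for each OLS coefficient with NumPy. It is an approximation, not the article’s method: the helper name ols_p_values is hypothetical, and it uses a normal approximation to the t distribution, which is only reasonable when there are many more observations than predictors.

```python
import math

import numpy as np


def ols_p_values(X: np.ndarray, y: np.ndarray):
    """Return (coefficients, two-sided p-values), intercept first.

    Uses a large-sample normal approximation instead of the exact t distribution.
    """
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)

    # Residual variance with n - k - 1 degrees of freedom.
    residuals = y - design @ coef
    sigma2 = float(residuals @ residuals) / (n - k - 1)

    # Standard errors from the diagonal of sigma^2 * (X'X)^-1.
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    t_stats = coef / std_err

    # P(|Z| > |t|) for a standard normal Z equals erfc(|t| / sqrt(2)).
    p_values = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])
    return coef, p_values
```

A measure counts as significant at the 0.05 level when its p-value is at or below 0.05. A small p-value says the coefficient is unlikely to be zero under the model’s assumptions; it says nothing about whether the relationship is causal.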

Wait, something looks fishy

Looking at the result, is it telling me that cancer and diabetes decrease the COVID death risk, while getting dental and routine checkups makes us more vulnerable?

There are a lot of potential reasons behind these ‘weird results’. For example, the rate of cancer among adults ranges from 4% to 7% in our data, which is a lot lower than most other measures. In this case, even if there is no causal link between cancer and COVID deaths in a population, a few areas that happen to combine a high cancer rate with low COVID deaths could be enough to produce a strong negative coefficient in the model.

Another example is routine checkups. If we plot the death rate and routine checkups, we can see:

  • Routine checkup rates are very different between the East and West Coasts
  • Most areas that have both high checkup rates and high COVID deaths are from a small group of states (MA, NJ, NY)

% Routine checkups and COVID deaths per 1M people

All of these states were hit badly at the start of the COVID outbreak in the U.S. Could it be a coincidence? Could the high death numbers be a result of hospitals being overwhelmed during the first surge? With the limited data we have, it’s very hard to draw any further conclusions. But we expect to have a clearer picture as the spread of the virus continues to stabilize.

Conclusion

  1. High rates of arthritis, asthma, high cholesterol, and lack of leisure-time physical activity are strongly correlated with high COVID-19 death rates.
  2. Population health measures can explain about 50% of the variation in COVID-19 death rates, but for now that is not good enough to predict the spread of the virus, since there are too many uncertainties.
  3. The health measures will be more valuable in the longer term, helping communities take action to protect the people who are more vulnerable to the virus.

How are you adapting to the ‘New Normal’? Do you have any ideas to make this study better? Leave a comment below and let me know your thoughts! All data and models can be found in the project git repo.

Resources:

Git Repo: https://github.com/estella-zzz/data-science-projects/tree/master/Covid19_vs_Population_Health

Visualize it in Tableau: https://public.tableau.com/profile/estella.zhang#!/vizhome/Covid19vsPopulationHealth/CovidPopulationHealth

*List of all health measures:
