
Strong Correlation of Wastewater COVID-19 Signal to Hospitalization and Death

The Biobot dataset of USA wastewater strongly predicts COVID-19 disease outcomes with high correlation values. Full Python/pandas code.

Image by fernando zhiminaicela from Pixabay

Summary

The USA wastewater dataset from Biobot was joined with COVID-19 disease data. Outcome measures were adjusted in time from the water sample date to match disease progression – 14 days for hospitalization and 28 days for death. The data pairs of water sample to outcome were analyzed for correlation using the standard Spearman rank method. The result was strong statistical correlation (~0.8 out of 1.0) between wastewater RNA levels and hospital admissions, hospital beds, and deaths. This adds more evidence to the predictive power of wastewater monitoring for COVID-19 disease threat.

Background

The strong connection between SARS-CoV-2 in wastewater (sewage) and COVID-19 cases about a week later is now well known. Many dashboards, such as Lawrence KS and Louisville KY, make it obvious.

But "case counts" of COVID-19 have always been a suspect metric. Many people get sick without ever being formally tested. Many people who are sick take a home test that is not reported to any public health agency. And many people who are infected with COVID-19 and contagious never know they have the disease at all. Therefore, establishing a statistical link between wastewater virus levels and case counts is interesting and certainly true, but is not what really matters. A more reliable measure of whether wastewater is predictive of disease is to look at correlation with hospitalization and mortality.

My previous article did this using the detailed wastewater dataset from the US Centers for Disease Control NWSS. That dataset has one row per water test – at one sewage treatment plant on one day. I joined this data with COVID-19 disease outcomes in the county approximately overlapping that water treatment area. I found a consistently positive Spearman correlation of about 0.3 to 0.5. This result showed that wastewater SARS-CoV-2 levels from individual water tests are predictive of COVID-19 hospitalization and death in that locale, but there was not an especially strong correlation.

I posit that the earlier result understated the true correlation because the many individual data points differed by small, random amounts, which essentially added noise to the data. To test this hypothesis, I examined wastewater data and COVID-19 disease over a larger area (the entire US) and at a coarser time grain (weekly rather than per water test).

Biobot Dataset

Biobot.io provides a variety of data that aggregates their United States wastewater tests. For this analysis I chose their "regional" data that rolls up individual wastewater sites into four sections of the US and rolls up the four regions into a whole-country aggregate.

Here is their explanation of the process:

… we take a mean of all samples every week within the country and weight those means by the population represented in the wastewater samples. We then take a centered 3-sample centered rolling average that gives higher weight to this week’s measurement, which produces the weekly values shown in the visualization. We use this data to further average across all of our sampling locations for a nationwide average…

All wastewater values in my analysis used this data – a rolling average that is weighted for population near the test site.
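To make the aggregation in the quote concrete, here is a minimal sketch of a population-weighted weekly mean followed by a centered 3-sample rolling average. It is not Biobot's code. The `samples` DataFrame, its column names, and the `WEIGHTS` values are assumptions for illustration; the quote says only that the current week gets higher weight, not what the weights are.

```python
import numpy as np
import pandas as pd

# Hypothetical site-level samples: two sites, four weeks (made-up values).
samples = pd.DataFrame({
    "week": pd.to_datetime(["2022-01-05"] * 2 + ["2022-01-12"] * 2
                           + ["2022-01-19"] * 2 + ["2022-01-26"] * 2),
    "copies_per_ml": [100.0, 300.0, 200.0, 400.0, 600.0, 200.0, 300.0, 300.0],
    "population": [10_000, 30_000] * 4,
})

# Population-weighted mean per week: sum(value * pop) / sum(pop).
weighted = samples["copies_per_ml"] * samples["population"]
weekly = (weighted.groupby(samples["week"]).sum()
          / samples["population"].groupby(samples["week"]).sum())

# Centered 3-week average that favors the middle (current) week.
# ASSUMPTION: these weights are illustrative, not Biobot's actual weights.
WEIGHTS = np.array([0.25, 0.5, 0.25])
smoothed = weekly.rolling(3, center=True).apply(
    lambda window: float(np.dot(window, WEIGHTS)), raw=True
)
```

The first and last weeks come out as NaN here because a centered 3-sample window is incomplete at the edges; a production pipeline would need a rule for those weeks.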

Outcome Data

COVID-19 disease outcomes were taken from CovidActNow.org. I smoothed those numbers into 10-day rolling averages for hospital admissions, hospital beds in use, and ICU beds in use. The 10-day window was necessary because hospital data is only reported weekly. I smoothed daily deaths over five days because there is a lot of random variation in that statistic.

This pandas code snippet shows how, with CovidDF being the raw DataFrame from CovidActNow. (The full Python/pandas source is on my GitHub.)

# 10-day centered rolling mean of weekly COVID-19 hospital admissions
CovidDF["admits_rolling10"] = CovidDF["actuals.hospitalBeds.weeklyCovidAdmissions"].rolling(10, min_periods=1, center=True, closed='both').mean()
# 5-day centered rolling mean of daily new deaths
CovidDF["deaths_rolling5"] = CovidDF["actuals.newDeaths"].rolling(5, min_periods=1, center=True, closed='both').mean()

All hospitalization and mortality data in this analysis used these smoothed values.

Look-Ahead for Outcomes

A key part of the data engineering for this analysis was aligning water test results (copies of SARS-CoV-2 RNA) with COVID-19 outcomes (hospitalization and death) that occur later. A high virus load in wastewater probably does not predict hospital admissions on the same day, but it might predict them 10 days later. We want to explore the correlation between wastewater tests and COVID-19 outcomes at future dates.

I experimented with various look-ahead offsets and found that two weeks for hospitalization and four weeks for mortality produced the strongest correlation with wastewater virus levels. The code snippet below shows how to do this. The important DataFrames are:

  • UsaDF initially contains just Biobot wastewater test results for the whole US with one row per week. UsaDF becomes the master DataFrame that holds the overall results for analysis.
  • HospDF contains hospitalization due to COVID-19 for the US per week.
  • DeathsDF contains mortality due to COVID-19 for the US per week.

import pandas as pd

# Number of days to look ahead for COVID-19 outcomes
HOSP_AHEAD = 14
DEATHS_AHEAD = 28
# Create date columns in master DF with future dates
UsaDF["hosp_date"] = UsaDF["week"] + pd.offsets.Day(HOSP_AHEAD)
UsaDF["deaths_date"] = UsaDF["week"] + pd.offsets.Day(DEATHS_AHEAD)
# Join wastewater data with hospitalization in the future
UsaDF = UsaDF.merge(HospDF, how='inner', left_on="hosp_date", right_on="covid_facts_date")
# Join wastewater data with deaths in the future
UsaDF = UsaDF.merge(DeathsDF, how='inner', left_on="deaths_date", right_on="covid_facts_date")

Water tests are merged with disease outcomes using an inner join. The reason is that recent water tests (such as yesterday’s) don’t have any known outcomes yet. We cannot join recent water facts with outcomes that have not happened yet, so the inner join drops recent water tests from the overall result set.
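The look-ahead search and the row-dropping effect of the inner join can be sketched on synthetic data. Everything in this block is an assumption for illustration: the `water` and `outcome` frames, the `spearman_at_offset` helper, and the sine-wave values. The outcome is deliberately built to trail the signal by 14 days, so the scan recovers that offset by construction; it says nothing about the real data.

```python
import numpy as np
import pandas as pd

N_WEEKS = 30
weeks = pd.date_range("2021-01-06", periods=N_WEEKS, freq="7D")
t = np.arange(N_WEEKS)

# Synthetic weekly wastewater signal (made-up values).
water = pd.DataFrame({"week": weeks,
                      "copies_per_ml": 200 + 150 * np.sin(t / 3.0)})

# Synthetic outcome on the same dates, trailing the signal by 2 weeks.
outcome = pd.DataFrame({"covid_facts_date": weeks,
                        "admits": 50 * (200 + 150 * np.sin((t - 2) / 3.0))})

def spearman_at_offset(days):
    """Inner-join water rows to outcomes `days` later; return (rows, rho)."""
    shifted = water.assign(outcome_date=water["week"] + pd.Timedelta(days=days))
    joined = shifted.merge(outcome, how="inner",
                           left_on="outcome_date", right_on="covid_facts_date")
    # Spearman correlation computed as the Pearson correlation of ranks.
    rho = joined["copies_per_ml"].rank().corr(joined["admits"].rank())
    return len(joined), rho

results = {days: spearman_at_offset(days) for days in (0, 7, 14, 21, 28)}
best_offset = max(results, key=lambda days: results[days][1])

for days, (rows, rho) in results.items():
    print(f"offset={days:2d} days  rows kept={rows}  spearman={rho:.3f}")
```

Each extra week of look-ahead removes one more recent water sample from the joined result, because no outcome row exists yet for it. On real data, a scan like this also risks picking an offset that fits noise, so the chosen offsets are best checked against what is known about disease progression.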

Visualizing the Data

The key items we are interested in from the results dataset are weekly numbers for:

  • Virus RNA levels in wastewater, as copies per milliliter
  • Hospital admissions later for COVID-19
  • Hospital and ICU beds later for COVID-19 patients
  • Deaths later due primarily to COVID-19

A problem with displaying this data on one chart is the wide scale of numbers for the various measures. Virus copies are often around 200, while deaths can be 3000, and hospital beds occupied over 100,000. A logarithmic y-axis solves this issue by allowing all five measures to fit on one chart, shown below.

Because of the date adjustments for disease outcomes, the peaks for hospitalization and death line up with the peaks for wastewater virus level, even though the disease outcomes occurred weeks later.
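A chart of this kind can be drawn with a logarithmic y-axis in matplotlib. The sketch below is not the article's plotting code; the `demo` DataFrame, its column names, and its values are invented at roughly the magnitudes mentioned above, and only three of the five measures are shown to keep it short.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values at roughly the magnitudes described in the text.
demo = pd.DataFrame({
    "week": pd.date_range("2022-01-05", periods=4, freq="7D"),
    "wastewater_copies_per_ml": [150, 200, 260, 180],
    "deaths": [2400, 3000, 3300, 2700],
    "hospital_beds": [90_000, 110_000, 125_000, 100_000],
})

fig, ax = plt.subplots(figsize=(8, 4))
for column in ["wastewater_copies_per_ml", "deaths", "hospital_beds"]:
    ax.plot(demo["week"], demo[column], label=column)

# A log scale lets series that differ by orders of magnitude share one axis.
ax.set_yscale("log")
ax.set_xlabel("Week of water sample")
ax.set_ylabel("Value (log scale)")
ax.legend()
fig.tight_layout()
```

On a log axis, equal vertical distances represent equal ratios, so the chart compares the shape and timing of the curves rather than their absolute sizes.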

Correlation

The Spearman rank correlation between SARS-CoV-2 wastewater levels and COVID-19 disease outcomes is strong. From April 2020 through May 2022, the correlations are:

  • Wastewater to hospital admissions = 0.801
  • Wastewater to regular hospital beds in use = 0.8
  • Wastewater to ICU beds in use = 0.745
  • Wastewater to deaths = 0.79

I am not asserting any causal relationship. Virus in wastewater does not cause hospitalization and death – the virus is in the water because people are already sick. But this strong correlation shows that wastewater is a good predictor of later COVID-19 disease outcomes.
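For readers unfamiliar with the rank method, the following sketch shows what Spearman correlation measures and how it differs from the more common Pearson correlation. The two short series are invented for illustration and are not taken from the Biobot or CovidActNow data. Spearman is computed here as the Pearson correlation of the ranks, which is the standard definition.

```python
import pandas as pd

# Made-up example: the outcome rises whenever the signal rises,
# but it flattens out instead of following a straight line.
water = pd.Series([50.0, 120.0, 200.0, 450.0, 900.0])
admits = pd.Series([1_000.0, 1_100.0, 1_150.0, 1_160.0, 1_165.0])

# Pearson measures how close the points are to a straight line.
pearson = water.corr(admits)

# Spearman replaces each value by its rank, then applies Pearson,
# so it measures whether the two series move in the same order.
spearman = water.rank().corr(admits.rank())

print(f"pearson={pearson:.3f}  spearman={spearman:.3f}")
```

Because both series increase together at every step, the rank correlation is perfect even though the relationship is far from linear. That property suits wastewater data, where RNA concentration and hospital counts are on very different scales and need not be proportional.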

Future Work

This analysis looked only at wastewater SARS-CoV-2 signal and its predictive power for COVID-19 outcomes. It is plausible that other factors combined with wastewater data have an even greater correlation to hospitalization and death.

  • Does the level of vaccination in a community, combined with wastewater virus count, give an even more accurate prediction of disease outcomes?
  • What about combining wastewater signal with the level of previous natural infections in a community, or its Social Vulnerability Index, or availability of COVID-19 therapeutic drugs?

I am currently working on a project to address the first of these questions.

For More Information

https://en.wikipedia.org/wiki/Correlation (Spearman and Pearson correlation)

https://biobot.io/science/ (technical articles from Biobot)

https://data.cdc.gov/browse (master page for all CDC datasets)

