Clustering U.S. counties by their COVID-19 curves

How can we use unsupervised learning to describe COVID-19 trends in the United States

Rory Michelen
Towards Data Science

--

Analytics has become a part of everyone’s daily routine as a result of the pandemic. Every day we look at curves of new cases, positivity rates, and a range of other metrics that give us insight into our current situation. One interesting metric used by the CDC, along with many news networks and publications, is hotspot classification. A hotspot is a county or state where cases are currently increasing at a relatively high rate [3]. This metric is simple and easy to understand. But it leaves out a lot of interesting details.

Let’s take a look at 6 U.S. states and their daily count of new cases, below:

Daily COVID-19 case trends for 6 US states. Each of these curves tells a different story. Unsupervised learning can help us classify these trends.

Some of these states certainly jump out as hotspots. But can we think of any more useful labels to describe the trend? How about “Second wave, worse than first”, “late first wave”, or “starting to improve”? These labels tell a more nuanced story than “hotspot” and paint a more detailed picture of our current situation.

It’s easy for us to look at these curves and come up with creative names for them. But this is not scalable and leaves a lot of room for bias. Instead, we need an objective, automated method of grouping counties and states into categories based on their COVID-19 trends.

Unsupervised learning algorithms provide exactly that. Instead of assigning U.S. counties to predetermined labels, we can use clustering to discover the most common trends in COVID-19 cases.

Below, I’ll detail my attempts to cluster U.S. counties into groups based on the shape of their COVID curve.

Methodology

There were three major steps to this analysis:

  1. Clean and pre-process the data.
  2. Describe the shape of each county’s COVID curve using a set of coefficients or features.
  3. Feed these coefficients into an unsupervised classification model

Let’s dig into each of these steps separately.

The Data

Data for this project was taken from USAFacts.org and includes a daily count of COVID-19 cases in the 3,195 United States counties specified by the Federal Information Processing Standards (FIPS). The data goes back to January 1st, 2020 and was updated daily as of the writing of this article on November 25th, 2020.
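The original analysis was done in R (see reference [1]), but the reshaping step can be sketched in Python/pandas. USAFacts publishes cumulative totals with one row per county and one column per date, so the counts need to be melted to long format and differenced to get daily new cases. The tiny DataFrame below is a toy stand-in for that export:

```python
import pandas as pd

# Toy stand-in for the USAFacts export: one row per county (keyed by FIPS),
# one column of *cumulative* case counts per date.
cases = pd.DataFrame({
    "countyFIPS": [53011, 17031],
    "2020-11-23": [100, 500],
    "2020-11-24": [110, 530],
    "2020-11-25": [125, 580],
})

# Reshape from wide (one column per date) to long (one row per county-day).
date_cols = [c for c in cases.columns if c != "countyFIPS"]
long = cases.melt(id_vars="countyFIPS", value_vars=date_cols,
                  var_name="date", value_name="cumulative_cases")

# USAFacts reports cumulative totals, so difference within each county to
# get daily new cases; the first day's "new" count is its cumulative total.
long = long.sort_values(["countyFIPS", "date"])
long["new_cases"] = (long.groupby("countyFIPS")["cumulative_cases"]
                         .diff().fillna(long["cumulative_cases"]))
```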

Let’s take a look at 8 random counties in the raw data:

Daily counts of COVID-19 cases in 8 random counties in the United States.

These curves look somewhat similar to the statewide trends we looked at earlier. However, since the population sizes are much smaller, we see a lot more volatility in the data. This could pose an issue for modeling down the line, so we’re going to remove the smallest counties from the dataset. However, we don’t want to remove too many counties, since we want our clusters to describe a majority of COVID-19 cases in the United States.

To balance the tradeoff between generalization and clean, stable data, let’s look at the cumulative distribution of U.S. COVID-19 cases. In other words, how many counties can we exclude and still keep at least 80% of the COVID-19 cases in the original dataset?

This chart shows the cumulative distribution of total COVID-19 cases, starting with the largest counties. The 1,000 counties with the most COVID cases represent 80% of total U.S. cases.

The chart above indicates that by focusing on the 1,000 counties with the most COVID cases, we are able to capture 80% of all cases in the United States. Therefore, we can exclude the long-tail of small counties while still analyzing a majority of COVID cases in the United States.

There are a couple of other issues with the data that we will want to take care of. Firstly, we are interested in the shape of each curve, not the absolute count of cases. We need to normalize our data so that two counties with different populations are treated similarly if their curves have a similar shape. There are a few ways to normalize the data, but I have opted to convert daily case counts into a percentage of the total cases in that county (e.g. Clark County, WA experienced 4% of their total cases on November 25th).
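In pandas, this normalization is a grouped transform: divide each county's daily count by that county's total. A minimal sketch with two made-up counties:

```python
import pandas as pd

# Toy daily new-case counts for two hypothetical counties.
daily = pd.DataFrame({
    "county": ["A", "A", "A", "B", "B", "B"],
    "new_cases": [10, 50, 40, 200, 100, 100],
})

# Each day's count as a share of that county's total cases, so counties
# of very different sizes end up on the same scale.
daily["share"] = (daily["new_cases"]
                  / daily.groupby("county")["new_cases"].transform("sum"))
```

After this step, every county's curve sums to 1, so only its shape matters.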

The second issue with the data is outlier spikes. Some counties have abnormally large single-day spikes in cases. It is possible that these spikes are due to lags in test results. If two weeks worth of positive test results were published on a single day, then we will see a large single-day spike. However, this spike is not representative of the true trend in new cases. In addition to their lack of interpretability, these spikes will also cause issues later on, so let’s remove them.

There are definitely more elegant ways to do this, but I have opted to replace large spikes with the previous day’s new case count.
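A sketch of that replacement rule in Python: flag a day as a spike when it exceeds some multiple of the trailing median, then substitute the previous day's count. The article doesn't specify its exact spike threshold, so the `factor=5` rule below is an assumption for illustration:

```python
import numpy as np

def smooth_spikes(daily, window=7, factor=5.0):
    """Replace suspiciously large single-day spikes with the previous
    day's count. The spike rule (factor x trailing median) is a guess;
    the original article does not state its exact threshold."""
    daily = np.asarray(daily, dtype=float).copy()
    for i in range(window, len(daily)):
        baseline = np.median(daily[i - window:i])  # trailing 7-day median
        if baseline > 0 and daily[i] > factor * baseline:
            daily[i] = daily[i - 1]
    return daily
```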

Now that we’ve processed the data, let’s take a look at the same 8 counties as earlier.

Not bad at all! All of these counties are on the same scale and large suspicious spikes are gone.

Extracting Features from Each Curve

We’re not quite ready to pop this data into a clustering algorithm. That’s because we are working with time series data. In the current format, each county is described using 308 days of data. We can think of each of these days as independent features. Clustering algorithms will not know that there is a relationship between these days.

In other words the clustering algorithm will interpret the data like this:

“There were 10 cases on Friday, 50 cases on Saturday and 200 cases on Sunday”

However, we want the model to interpret it like this:

“Between Friday and Sunday, cases increased from 10 to 50 and then again from 50 to 200.”

We need to transform our data in such a way that we are describing the trend in cases over a period of time, not just independent daily samples. We need to tell our clustering algorithm whether cases are going up, going down, or staying flat. Luckily for us, we have B-Splines.

What are B-Splines?

B-Splines are a way to approximate non-linear functions by using a piece-wise combination of polynomials. B-Splines aren’t typically used for prediction tasks because they can overfit data pretty easily. However, at this stage in our analysis we are just trying to describe the shape of each county’s curve. These models will do a terrible job of predicting future COVID cases. They will, however, do an excellent job of describing the historical trends. Better yet, they can do so with very few data points.

In a nutshell, you can fit a B-Spline curve to a dataset in the following way:

  1. Specify the number of local regions in your data set. These regions can be evenly spaced, or different sizes if that improves the fit of the model.
  2. Fit a different polynomial in each local region. Each of these polynomials can be described using a single coefficient. So, if you have 10 regions in your dataset, the entire model can be described using 10 coefficients.

Check out my previous article for a more detailed explanation of how B-Splines work.
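The author's spline fitting was done in R, but the same idea can be sketched with SciPy's `LSQUnivariateSpline`. One bookkeeping note: with cubic splines (`k=3`), a fit with n interior knots has n + 4 coefficients, so 6 evenly spaced interior knots give the 10 features per county, slightly different accounting than the article's simplified "one coefficient per region" description. The synthetic curve below is a stand-in for a normalized county series:

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

# Toy "county curve": 308 days of normalized daily case shares, shaped
# like a late-year wave plus reporting noise.
rng = np.random.default_rng(0)
days = np.arange(308)
curve = np.exp(-0.5 * ((days - 250) / 30) ** 2)
curve += 0.05 * rng.standard_normal(308)

# Six evenly spaced interior knots; with cubic splines (k=3) this yields
# len(knots) + k + 1 = 10 coefficients, i.e. 10 features per county.
knots = np.linspace(days[0], days[-1], 8)[1:-1]
spline = LSQUnivariateSpline(days, curve, t=knots, k=3)

features = spline.get_coeffs()  # the 10-number description of this curve
```

These coefficients, one vector per county, become the rows of the clustering input matrix.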

Let’s see how B-Splines perform using 10 local regions:

Let’s see how this model performed for 4 random counties:

This is exactly what we are looking for: a curve that describes the general trend in each county without capturing too much noise. Better yet, each of these blue curves can be represented using only 10 data points. Now we can feed these 10 features into an unsupervised classification model to identify clusters of counties that have similar shaped curves.

Unsupervised Learning with K-Means

All of our work until now has left us with 1,000 U.S. counties, each with 10 features. These 10 features describe the shape of their COVID curve over the past 308 days. Hopefully, if we run K-Means on this dataset, we will be left with interesting and meaningful groups of counties. Let’s see what happens.

There are tons of great articles on how K-Means works, so I’m going to assume you know the basics.

First I’m going to run K-means with values of K between 1 and 30.
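The elbow sweep can be sketched with scikit-learn: fit K-Means for each K and record the inertia (within-cluster sum of squares), which is what gets plotted against K. The random matrix below stands in for the real 1,000 × 10 table of spline coefficients:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in feature matrix: 1,000 counties x 10 spline coefficients.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))

# Fit K-Means for K = 1..30 and record inertia; plotting inertias
# against K gives the elbow chart.
inertias = []
for k in range(1, 31):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
```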

I’m not 100% satisfied with this elbow plot. The elbow isn’t as clearly defined as I would have liked, but for the purposes of this analysis let’s keep chugging along. K=5 seems to be a pretty good elbow, so we will use that.

After running K-Means using K=5 and assigning each of the 1,000 U.S. counties to a cluster, I looked at a bunch of random samples of counties. The results for two of the five clusters weren’t particularly interesting. But the other three demonstrated clear and distinct themes.
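That inspection step, fitting the final K=5 model and sampling a few member counties from each cluster to eyeball, looks roughly like this (again on a stand-in feature matrix):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the real 1,000 x 10 matrix of spline coefficients.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 10))

# Final model: assign every county to one of 5 clusters.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Draw a few random members of each cluster for visual inspection.
samples = {}
for c in range(5):
    members = np.where(labels == c)[0]
    samples[c] = rng.choice(members, size=min(3, len(members)),
                            replace=False)
```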

Here are some counties in Cluster 1

This appears to be the “Second Wave” cluster that I expected to see in the results.

Here is cluster 2:

If I had to name this cluster, I would call it the “late first wave” group. I colored it yellow since most of the counties belong to the Midwestern United States where cornfields are plentiful.

And cluster 3:

The third cluster is similar to the second, but it’s a bit bumpier in the middle. These appear to be counties that experienced a small first wave and are now experiencing something much more significant.

Summary

Using B-Splines and K-Means, we were able to identify groups of U.S. counties with similar trends in COVID cases. Better yet, these trends were more nuanced than the popular “hotspot” classification used in mainstream news publications.

By using unsupervised learning for this task, we didn’t have to supply predetermined labels to our model. This means that when conditions change and new patterns emerge, we can easily rerun our model to identify new trends without the bias that comes along with manual labeling.

Next Steps

Tweaking the Model

If I had more time to spend on this project, the first thing I would do is tweak my B-Spline model. I fit each county using 10 equally-spaced local regions. But what if we divided the data into 20 regions? What if they weren’t equally spaced?

Tweaking these hyperparameters could potentially improve the fit of our B-Spline model. By improving the fit, we’re doing a better job of explaining the trends to our clustering algorithm, which can then do a better job of classification.

Adding More Features

This clustering model only looked at the shape of each county’s curve. It ignores other useful bits of information such as the rate of cases, death rates, and political demographics. By adding in more features, we can add more nuance to our clusters and tell an even more detailed story.

Using the Clusters

A clustering analysis such as this one typically isn’t the end of the story. It is just the beginning. Clustering is a tool that helps us describe our data but it doesn’t tell us what to do with it.

Next, we will need to explore COVID cases through the lens of these clusters. Which clusters are good? Which clusters are problematic? How many counties are problematic? What can we do about it?

References and Links

[1] Github with full code in R https://github.com/RoryMichelen/Medium-Articles/blob/master/Clustering_w_B_Splines.rmd#L42-L56

[2] Dataset: https://archive.ics.uci.edu/ml/datasets/Sales_Transactions_Dataset_Weekly

[3] CDC definition of Hotspot: https://www.cdc.gov/mmwr/volumes/69/wr/mm6933e2.htm#:~:text=Hotspot%20counties%20were%20identified%20among,in%20number%20of%20cases.
