Every time I read the news these days, there seems to be another story about how cases of COVID-19 are surging throughout the United States. There is a web of opinions regarding the pandemic: the severity of the virus, the necessity of wearing masks, how the virus compares to other pathogens, the best way to re-open the economy, and so on. More important than this web of opinions is the voluminous amount of data available about the pandemic. By analyzing that data, we can expand our intellectual horizons instead of being caught in the muck of opinion. The focus of this article is to create visualizations to better understand the pandemic in the United States.

The data I’m using comes from the New York Times. I used R and the Tidyverse (along with some complementary packages) to create the visualizations. So as not to inundate this article with mountains of code, I’ve uploaded the entire script here in case you want to reproduce any of the graphs. Now, let’s get to the analysis!
Comparing the Number of COVID-19 Cases Between States
Which states have the largest number of cases? Let’s answer this question with a visualization. Now, plotting all of the states together on one graph would be messy. We wouldn’t really be able to infer anything from it. Instead, I used the facet_wrap() function in ggplot to create the following time-series graph:
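For readers who haven’t used facet_wrap() before, here is a minimal sketch of the idea. It is not the script linked above: the data frame toy_cases and its numbers are invented for illustration, and the column names (date, state, cases) are only assumed to resemble the state-level file.

```r
library(ggplot2)

# Toy data: three states, four days each. The counts are made up.
toy_cases <- data.frame(
  date  = rep(as.Date("2020-03-01") + 0:3, times = 3),
  state = rep(c("New York", "Texas", "Florida"), each = 4),
  cases = c(10, 50, 200, 600,  5, 20, 90, 400,  2, 15, 60, 250)
)

# One small panel per state; the line colour follows the case count,
# so lines with higher counts shade towards dark orange.
p <- ggplot(toy_cases, aes(x = date, y = cases, colour = cases, group = state)) +
  geom_line() +
  scale_colour_gradient(low = "grey70", high = "darkorange") +
  facet_wrap(~ state) +
  labs(x = "Date", y = "Cumulative cases")
```

Printing p draws the panels. Because every panel shares the same y-axis by default, states with few cases look flat next to the large ones, which is part of what makes the big states stand out.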

Using a gradient fill, the lines for the states with the highest number of cases will start to turn dark orange as they increase in number relative to the other states. Immediately, I notice California, Florida, Illinois, New York, and Texas. Let’s isolate those states and plot their values on a single graph:
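Isolating a handful of states comes down to a membership filter. A rough base R sketch, with an invented toy_cases data frame standing in for the real data:

```r
# Toy data; the counts are made up for illustration.
toy_cases <- data.frame(
  state = c("New York", "Ohio", "Texas", "Maine", "Florida"),
  cases = c(600, 120, 400, 10, 250)
)

# The five states called out in the text.
top_states <- c("California", "Florida", "Illinois", "New York", "Texas")

# Keep only the rows whose state appears in that set.
top_cases <- toy_cases[toy_cases$state %in% top_states, ]
```

In a Tidyverse pipeline the same step would typically be written with dplyr’s filter() and the same %in% test.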

At the beginning of the pandemic, New York was the epicenter of the outbreak in the US. The number of cases in New York began to flatten in the summer, while the numbers in Texas, California, and Florida climbed quickly and surpassed it. Interestingly, Illinois has only recently seen a sharp surge in cases, whereas before its case count rose quite gradually. Something to keep in mind is that New York, California, and Illinois contain the three largest urban areas in the US (New York City, Los Angeles, and Chicago). Texas is home to three very large urban areas (Houston, San Antonio, and Dallas). Florida’s urban areas are not as large as those in the other states (measured by population). However, Florida is the third most populous state. This might indicate the virus is fairly evenly distributed throughout the state rather than concentrated in a few cities. That could be a topic for further investigation in the future.
Now that we’ve seen some time-series graphs, let’s switch to visualizing the pandemic spatially using a choropleth:

Everything looks as expected given what we already know from analyzing the time-series graphs above. California and Texas account for the most cases. Illinois, Florida, and New York account for a large number of cases as well. However, after reviewing the map, it looks like a lot of states in the upper Midwest and the Southeast are seeing a spike in cases. If you go back up and review the facet_wrap plot, you can see states like Georgia and Wisconsin are trending upwards.
Up to this point, we’ve only reviewed cumulative case counts. That is, we haven’t taken into account the number of cases relative to the population. All we’ve really determined is that the most populous states have the highest number of cases. Let’s now pivot towards analyzing the pandemic through per-capita values as opposed to raw totals.
Comparing Per-Capita Values Between States
Let’s recreate the facet_wrap graph from above using per-capita numbers:
- Note: the incidence rates are cases per 100,000 people.
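The per-100,000 rate is a simple rescaling: cases divided by population, times 100,000. Here is a small base R sketch of that arithmetic. The states, counts, and populations are invented, and the column names Pop, cases, and cases_per_capita are borrowed from the correlation snippet later in this article.

```r
# Toy inputs; both tables are made up for illustration.
state_cases <- data.frame(
  state = c("A", "B", "C"),
  cases = c(50000, 8000, 120000)
)
state_pop <- data.frame(
  state = c("A", "B", "C"),
  Pop   = c(1000000, 100000, 6000000)
)

# Join the two tables on state, then scale to cases per 100,000 residents.
percapita <- merge(state_cases, state_pop, by = "state")
percapita$cases_per_capita <- percapita$cases / percapita$Pop * 100000
```

Note how state B, with by far the fewest cases, ends up with the highest rate (8,000 per 100,000) while state C, with the most cases, ends up with the lowest (2,000). That reversal is exactly what the per-capita graphs are meant to expose.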

This graph looks quite different from the original facet_wrap graph above. Multiple states have seen a sharp increase in their per-capita numbers since October. Specifically, my attention was drawn to the Dakotas, Iowa, Wisconsin, Nebraska, Utah, and Wyoming. Let’s take a closer look at those states:

What in the Sturgis is going on in the Dakotas? As an aside, some economists estimated that the Sturgis motorcycle rally caused roughly 250,000 COVID cases. I figured we’d have seen these sharper spikes in September and October, given that the incubation period of COVID is anywhere from 2 to 14 days and the rally ended in mid-August. But hey, I’m no epidemiologist. Anyway, all of these states are in roughly the same geographic region, so something is afoot (college football, perhaps?).
Let’s get a better idea of how all of the states compare using a choropleth:

It’s almost as if this choropleth is the inverse of the one we saw earlier. The more populous states seem to be doing better than the less densely populated states in the middle of the country. The upper Midwest has a very high incidence rate, while Oregon, Maine, New Hampshire, and Vermont seem to be doing pretty well for themselves compared to the rest of the country.
So, from what we’ve seen above, it seems as if there’s a high correlation between state population and the number of cases. However, there doesn’t seem to be much correlation between state population and per-capita cases. Let’s compute the correlation coefficients between the variables and see what they tell us.
# what is the correlation between # of cases and population?
cor(percapita$Pop, percapita$cases)
[1] 0.6076123
# what about pop and cases per capita?
cor(percapita$Pop, percapita$cases_per_capita)
[1] 0.01356935
As we inferred from the graphs, the correlation between state population and the number of cases is moderately strong at around 0.61. The correlation between state population and per-capita cases is essentially zero at around 0.01.
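If you are wondering what cor() actually computes, its default is Pearson’s correlation coefficient: the sum of the products of deviations from the means, divided by the square root of the product of the summed squared deviations. A tiny sketch with made-up vectors x and y shows the manual calculation agreeing with the built-in function:

```r
# Two short made-up vectors, purely for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson's r computed by hand from deviations around the means.
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# cor() uses method = "pearson" unless told otherwise.
r_builtin <- cor(x, y)
```

Both values come out to 0.8 here. A coefficient near 1 or -1 means the points fall close to a straight line, and a value near 0 means there is no linear relationship.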
Now, let’s create a graph where cases per-capita is plotted against the state population:

As of 28 November 2020, the average cases per-capita in the contiguous US was approximately 4,355 per 100,000. In the graph above, any state above the average is labeled in red. States below the average are labeled in blue. This graph further reinforces the idea that the correlation between the state population and the cases per-capita is weak. California and Texas are both below the average while cases are rising in the Dakotas at an alarming rate.
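The red and blue labelling boils down to comparing each state’s rate with the average. A rough base R sketch follows. The data frame percapita_toy and its rates are invented, and it assumes the average is a plain unweighted mean of the state rates, which may not be how the figure above was calculated.

```r
# Toy rates per 100,000; the values are made up for illustration.
percapita_toy <- data.frame(
  state            = c("A", "B", "C", "D"),
  cases_per_capita = c(9000, 3000, 2500, 5500)
)

# Assumption: an unweighted mean across states.
avg_rate <- mean(percapita_toy$cases_per_capita)

# Red for states above the average, blue for the rest.
percapita_toy$label_colour <- ifelse(
  percapita_toy$cases_per_capita > avg_rate, "red", "blue"
)
```

A population-weighted average would instead divide total cases by total population, which gives large states more influence on where the dividing line falls.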
Conclusion
While New York City was once the epicenter of the pandemic in the US, it seems like the battleground has moved to the upper Midwest. The conditions in that region seem particularly conducive to the spread of the virus. Overall, there’s an alarming upward trend in incidence rates in most of the country. During a pandemic, reducing contact between people is key to slowing the spread. The less interaction, the lower the probability of infection. However, with the holidays approaching, that’s easier said than done.
If you liked this blog, you might be interested in another article I wrote about the efficacy of COVID-19 antibody testing. As always, questions, comments, concerns, and criticism are welcome!