Visualizing the COVID-19 Surge

Cumulative US Cases by Region

Avonlea Fisher

Published in Towards Data Science

4 min read · Dec 11, 2020

Introduction

As countries around the world continue their efforts to combat the COVID-19 pandemic, data on the virus is tirelessly reported every day. The US is currently facing a surge of new cases in nearly all of its states. While the volume and significance of COVID-19 data can be overwhelming, it’s important to stay informed about recent developments as we all do our part to help fight the spread of the virus. This short article presents some visualizations of case count data by region, and explains the steps behind their creation.

About the Data

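The data presented in this article comes from two sources: state-level COVID-19 case counts compiled by the New York Times, and 2019 state population estimates.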

The state population data was included so that case counts can be presented as a proportion of the population, rather than as raw totals. We would naturally expect states with larger populations to experience more cases of the virus, so raw totals do not offer a useful sense of how the virus is being handled in a given state.

It’s important to note that state populations are changing every day, and examining total cases as a proportion of the 2019 population fails to capture how the population has changed in 2020. Indeed, COVID-19 has contributed to population change not only by increasing the death rate, but also by driving inter-state migration. According to the Pew Research Center, a majority of young adults are living with their parents this year, and millions have been pushed by the pandemic to do so. The 2019 data doesn’t account for these changes, but is used here because it is one of the most recent, complete, and easily accessible sources of state-level population counts.

Finally, it should be noted that the number of reported cases of the virus differs from the number of actual cases. Not everyone who contracts the virus gets tested or reports their symptoms, and not every test result is accurate. Reported cases are simply the best proxy currently available for tracking total cases. The description of the New York Times data states that it includes “both laboratory confirmed and probable cases using criteria that were developed by states and the federal government.”

Preprocessing

Using file paths to the CSV files available through the links above, I created a separate pandas data frame for each dataset:
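A minimal sketch of this loading step is shown here. The column names (`date`, `state`, `cases`, `NAME`, `POPESTIMATE2019`) and the sample rows are assumptions made for illustration, and the in-memory buffers stand in for the real file paths so that the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the two CSV files (made-up sample rows).
# With the downloaded data, pass each file path to pd.read_csv instead.
cases_csv = io.StringIO(
    "date,state,cases\n"
    "2020-12-01,Arizona,350000\n"
    "2020-12-01,New Mexico,100000\n"
)
pop_csv = io.StringIO(
    "NAME,POPESTIMATE2019\n"
    "Arizona,7000000\n"
    "New Mexico,2000000\n"
)

# Case counts, with the reporting date as the index
state = pd.read_csv(cases_csv, index_col="date")

# Population estimates, one row per state
pop = pd.read_csv(pop_csv)
```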

I reset the state data frame’s index, which contained date values, in order to create a date column. Then I converted each value in this column to a datetime object:
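One way this step might look, assuming the index is named `date` and holds ISO-formatted date strings (the sample frame is made up for illustration):

```python
import pandas as pd

# Made-up sample: case counts indexed by date strings
state = pd.DataFrame(
    {"state": ["Arizona", "Arizona"], "cases": [100, 150]},
    index=pd.Index(["2020-12-01", "2020-12-02"], name="date"),
)

# Move the date index into an ordinary column
state = state.reset_index()

# Parse the date strings into datetime values
state["date"] = pd.to_datetime(state["date"])
```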

Before the datasets could be merged on their state columns, it was necessary to ensure that they contained matching values. For the most part, values in the ‘NAME’ column of the ‘pop’ data frame had a matching value in the ‘state’ column of the ‘state’ data frame. I removed or updated the remaining values so that the two columns contained only matching values.
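The exact mismatches depend on the files, so the following is only a sketch of the general approach. The sample rows (a national total in the population data, a territory in the case data, a stray trailing space) are hypothetical examples of values that would need updating or removing.

```python
import pandas as pd

# Made-up sample rows containing typical mismatches
pop = pd.DataFrame({
    "NAME": ["United States", "Arizona", "New Mexico "],
    "POPESTIMATE2019": [9000000, 7000000, 2000000],
})
state = pd.DataFrame({
    "state": ["Arizona", "New Mexico", "Guam"],
    "cases": [350000, 100000, 7000],
})

# Update: strip stray whitespace so near-matches line up
pop["NAME"] = pop["NAME"].str.strip()
state["state"] = state["state"].str.strip()

# Remove: keep only the names that appear in both data frames
common = set(pop["NAME"]) & set(state["state"])
pop = pop[pop["NAME"].isin(common)].reset_index(drop=True)
state = state[state["state"].isin(common)].reset_index(drop=True)
```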

After ensuring that the two columns contained only matching values, the data frames could be easily merged by creating a matching column name and using the pandas.merge() function:
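A sketch of the merge, assuming the population column is named `POPESTIMATE2019` and using made-up sample values:

```python
import pandas as pd

# Made-up sample data with matching state names
pop = pd.DataFrame({
    "NAME": ["Arizona", "New Mexico"],
    "POPESTIMATE2019": [7000000, 2000000],
})
state = pd.DataFrame({
    "date": pd.to_datetime(["2020-12-01", "2020-12-02", "2020-12-01"]),
    "state": ["Arizona", "Arizona", "New Mexico"],
    "cases": [350000, 355000, 100000],
})

# Give the population frame a column name that matches the cases frame
pop = pop.rename(columns={"NAME": "state"})

# Attach each state's population estimate to every one of its rows
state = pd.merge(state, pop, on="state", how="left")
```

A left merge keeps every row of the case data, so any state without a population match would show up as a missing value rather than silently disappearing.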

With the population estimates now in the state data frame, I created a new column containing the total number of COVID-19 cases per 1 million residents of each row’s state:
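The calculation is a simple rate: cumulative cases divided by the state's population, scaled to 1 million. In this sketch the column name `cases_per_million` is a hypothetical choice and the sample values are made up.

```python
import pandas as pd

# Made-up sample rows after the merge
state = pd.DataFrame({
    "state": ["Arizona", "New Mexico"],
    "cases": [350000, 100000],
    "POPESTIMATE2019": [7000000, 2000000],
})

# Cumulative cases per 1 million residents
state["cases_per_million"] = (
    state["cases"] * 1_000_000 / state["POPESTIMATE2019"]
)
```

With these sample values, both states come out to 50,000 cases per 1 million residents even though their raw totals differ by a factor of 3.5.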

The image below offers a preview of the final data frame.

Visualizing Case Counts

To prepare the cases data to be plotted by region, I created a list of string values for the states in each US region, and subsetted the data based on that list. For example, I created one subset containing the states in the Southwest region.
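A sketch of the subsetting step follows. Regional definitions vary, so the list of Southwest states here is an assumption, as are the sample rows.

```python
import pandas as pd

# Made-up sample rows from several regions
state = pd.DataFrame({
    "state": ["Arizona", "Texas", "Ohio", "New Mexico", "Maine"],
    "cases_per_million": [50000.0, 45000.0, 40000.0, 50000.0, 10000.0],
})

# One common definition of the Southwest (an assumption)
southwest_states = ["Arizona", "New Mexico", "Oklahoma", "Texas"]

# Keep only the rows whose state is in the region's list
southwest = state[state["state"].isin(southwest_states)]
```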

After creating subsetted data frames for each region, I wrote a plotting function, which requires the Plotly Express module. It takes in the subset and the region name (as a string), and returns a line plot showing total cases for each state in the region. The date is plotted on the x-axis and total cases per 1 million residents is plotted on the y-axis.

Applying the above function to each regional subset rendered the following line plots. By hovering over any point on a line, you can see the state name, reporting date, and total cases per 1M residents corresponding to your hover position.

Plot by Author

These plots may differ from other types of COVID-19 plots that you’re accustomed to seeing. One notable distinction is that these plots show cumulative totals rather than daily new cases. Imagine, for instance, that on Tuesday, there were 100 cumulative COVID-19 cases in a given area, and on Wednesday, there were 50 new cases. A plot presenting cumulative totals for that period would show a value of 150 for Wednesday; a plot presenting new cases only would show a value of 50 for the same day. As we continue to process the wealth of pandemic-related data reports in the midst of this surge, it’s important to have a sense of the various ways that different metrics are commonly reported.
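The relationship between the two metrics can be shown with the numbers from the example above: daily new cases are the day-over-day difference in the cumulative total.

```python
import pandas as pd

# Cumulative totals from the example: 100 on Tuesday, 150 on Wednesday
cumulative = pd.Series(
    [100, 150], index=["Tuesday", "Wednesday"], name="cumulative_cases"
)

# New cases per day are the change from the previous day's total
new_cases = cumulative.diff()
```

Here `new_cases["Wednesday"]` is 50, while the value for Tuesday is missing because no earlier total is available to subtract.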

For Further Information

The resources below comprise a small selection of up-to-date, accurate information on the COVID-19 pandemic. Stay informed, and stay safe!

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice.