
Data exploration with the COVID-tracking Project

How to easily do exploratory data analysis (EDA) with one of the most comprehensive US databases on COVID-19.

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

What is the COVID Tracking Project?

As per their website, "The COVID Tracking Project collects and publishes the most complete testing data available for US states and territories."

Image source: Screenshot from Covid-tracking-project website

Understanding the evolving dynamics and the precise location of regional outbreaks requires a complete testing picture – how many people have actually been tested in each state/territory, when the tests were done, and what the results were.

COVID-19 science: Why testing is so important

Indeed, the project has been cited in and used by major media companies and agencies throughout the nation.

Image source: Screenshot from Covid-tracking-project website

How to verify the quality and veracity of the data?

The website further adds "…our data team uses website-scrapers and trackers to alert us to changes, but the actual updates to our dataset are done manually by careful humans who double-check each change and extensively annotate areas of ambiguity."

Some of the visualizations in popular news outlets (e.g. NY Times, Politico, The Wall Street Journal, etc.) have been created from these data.

Image source: Screenshot from Covid-tracking-project website

In this article, we will see how simple Python scripting enables you to read this dataset and create meaningful visualizations of your own for tracking and understanding the spread of Covid-19 across the U.S.


The code and the demo

The code can be found in this Jupyter notebook in my COVID-19 analysis GitHub repo.

Pulling the data and loading in a DataFrame

The first step is to pull the data from the website and load it into a Pandas DataFrame for easy analysis.

Fortunately, they provide an easy API endpoint for this purpose in a CSV format.

This means that to analyze the latest data, you just have to rerun the script, which pulls the current data from the website. There is no dependency on an old downloaded data file!
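A minimal sketch of this step, assuming pandas is installed. The endpoint URL below is an assumption (check the project's API page for the current CSV address), and `load_states_daily` is a hypothetical helper name, not necessarily what the notebook uses. The two sample rows are made up so the snippet runs offline.

```python
import io

import pandas as pd

# Assumed endpoint; confirm the current CSV URL on the project's API page.
STATES_DAILY_CSV = "https://covidtracking.com/api/v1/states/daily.csv"


def load_states_daily(source=STATES_DAILY_CSV):
    """Read the state-level daily CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)


# Offline illustration with two made-up rows (not real figures).
sample_csv = io.StringIO(
    "date,state,positive,negative\n"
    "20200402,NY,100,300\n"
    "20200401,NY,80,250\n"
)
df_sample = load_states_daily(sample_csv)
print(df_sample.shape)
```

Calling `load_states_daily()` with no argument would fetch the live file instead, which is what makes each run reflect the latest published data.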

Some housekeeping

We may want to convert the date field to a specific format, drop unnecessary columns, and make sure that the state field has string values.
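These housekeeping steps could look roughly like the sketch below. It assumes that the `date` column arrives as an integer such as 20200402 and that columns named `hash` and `dateChecked` are the ones to drop; both are assumptions about the CSV layout, and `tidy` is a hypothetical helper name.

```python
import pandas as pd


def tidy(df, drop_cols=("hash", "dateChecked")):
    """Parse dates, drop bookkeeping columns, and force state codes to strings."""
    out = df.copy()
    # Assumption: dates are stored as integers like 20200402.
    out["date"] = pd.to_datetime(out["date"].astype(str), format="%Y%m%d")
    # Drop only the columns that are actually present.
    out = out.drop(columns=[c for c in drop_cols if c in out.columns])
    out["state"] = out["state"].astype(str)
    return out


# Made-up rows for illustration only.
raw = pd.DataFrame({
    "date": [20200402, 20200401],
    "state": ["NY", "NY"],
    "positive": [100, 80],
    "hash": ["abc", "def"],
})
clean = tidy(raw)
print(clean.dtypes)
```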

What kind of data are there?

Here is a snapshot of the dataset columns.

Note that not all fields are reported with the same completeness or frequency. Records of positive/negative cases and test results are regularly maintained, whereas hospitalization data (ventilators, ICU, etc.) are somewhat sparse.

We expect many NaN values or blanks in the DataFrame, so we can simply replace them with a sentinel value such as -1 and check for it in our validation code later.
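A small sketch of the sentinel idea, using made-up rows. The column name `inIcuCurrently` is an assumption about the dataset, and `reported` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

SENTINEL = -1  # marks "not reported" after filling

# Made-up rows for illustration only.
frame = pd.DataFrame({
    "state": ["NY", "MI"],
    "positive": [100.0, 50.0],
    "inIcuCurrently": [np.nan, 7.0],
})
filled = frame.fillna(SENTINEL)


def reported(df, column):
    """Return only the rows where `column` holds a reported value."""
    return df[df[column] != SENTINEL]


print(reported(filled, "inIcuCurrently"))
```

One consequence of this choice is that the sentinel has to be filtered out before any sum, ratio, or plot; otherwise the -1 values would silently distort the result.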

Bar chart of a given variable/state

Bar charts are the most common choice for this type of visual analytics. We write a custom function to plot any variable present in the dataset, for any given state, over time.

We will not clutter the article with the full code, but here are some examples. Note that not all the series have the same number of data points, and not all of them start from the same point in time. The code handles this and plots whatever data is available for the state chosen by the user.
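For readers who want a starting point, such a plotting function might look like the sketch below. It is not the notebook's code: `plot_var` is a hypothetical name, the column `hospitalizedCurrently` is an assumed field, and the rows are made up. Missing values (NaN or the -1 sentinel) are skipped, so each state shows only the data it actually reported.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_var(df, var, state, ax=None):
    """Bar chart of `var` over time for one state, using only reported values."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    sub = df[df["state"] == state].sort_values("date")
    # Skip missing entries: NaN, or the -1 sentinel if it was filled in earlier.
    sub = sub[sub[var].notna() & (sub[var] != -1)]
    labels = sub["date"].dt.strftime("%b %d").tolist()
    ax.bar(labels, sub[var].tolist(), color="steelblue")
    ax.set_title(f"{var} over time in {state}")
    ax.set_xlabel("date")
    ax.set_ylabel(var)
    ax.tick_params(axis="x", labelrotation=45)
    return ax


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-01", "2020-04-02", "2020-04-03", "2020-04-03"]),
    "state": ["NY", "NY", "NY", "MI"],
    "hospitalizedCurrently": [-1, 120, 150, 40],
})
ax_demo = plot_var(demo, "hospitalizedCurrently", "NY")
```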

Scatter plots to check correlation

One of the basic checks for correlation is done by creating bi-variate scatter plots. Therefore, we write a custom function to create scatter plots for any pair of variables.
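A possible shape for such a function is sketched below. `plot_xy` is a hypothetical name, the column names are assumptions, and the rows are made up. The sketch also reports the Pearson correlation coefficient in the title, which is an addition for convenience rather than something the article's function is known to do.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_xy(df, varx, vary, state, ax=None):
    """Scatter plot of two variables for one state; returns the axes and Pearson r."""
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 5))
    sub = df.loc[df["state"] == state, [varx, vary]].dropna()
    # Keep only rows where both values were actually reported.
    sub = sub[(sub[varx] != -1) & (sub[vary] != -1)]
    ax.scatter(sub[varx], sub[vary], edgecolor="k", alpha=0.7)
    r = sub[varx].corr(sub[vary])  # Pearson correlation by default
    ax.set_title(f"{state}: {vary} vs {varx} (r = {r:.2f})")
    ax.set_xlabel(varx)
    ax.set_ylabel(vary)
    return ax, r


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "state": ["NY"] * 5,
    "totalTestResults": [100, 200, 300, 400, 500],
    "positive": [10, 20, 30, 40, -1],
})
ax_xy, r_demo = plot_xy(demo, "totalTestResults", "positive", "NY")
```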

Is there a positive correlation between total tests and positive cases?

Does the death count increase monotonically with the rate of hospitalization?

Tracking the progress of testing

As we mentioned in the beginning, the COVID Tracking Project is primarily useful for its testing-related data. Therefore, it is not surprising that we want to create a visualization to track the progress of various states in the testing effort.

We can even compare multiple states on the same plot using line charts. The function also computes the average testing/day metric for us and puts that in the legend.
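A sketch of a multi-state comparison follows. It assumes a cumulative column named `totalTestResults` and defines the average as the change in that cumulative count divided by the number of elapsed days, which is one reasonable reading of "average testing/day" but may differ from the notebook. Both function names are hypothetical and the rows are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd


def avg_tests_per_day(df, state, col="totalTestResults"):
    """Change in the cumulative test count divided by the elapsed days."""
    sub = df[(df["state"] == state) & df[col].notna()].sort_values("date")
    if len(sub) < 2:
        return float("nan")
    days = (sub["date"].iloc[-1] - sub["date"].iloc[0]).days
    if days <= 0:
        return float("nan")
    return (sub[col].iloc[-1] - sub[col].iloc[0]) / days


def plot_testing(df, states, col="totalTestResults", ax=None):
    """Line chart of cumulative tests per state, with the daily average in the legend."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    for state in states:
        sub = df[(df["state"] == state) & df[col].notna()].sort_values("date")
        avg = avg_tests_per_day(df, state, col)
        ax.plot(sub["date"], sub[col], marker="o",
                label=f"{state} (avg {avg:,.0f} tests/day)")
    ax.set_xlabel("date")
    ax.set_ylabel(col)
    ax.legend()
    return ax


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-01", "2020-04-02", "2020-04-03",
                            "2020-04-01", "2020-04-02"]),
    "state": ["NY", "NY", "NY", "MI", "MI"],
    "totalTestResults": [100, 300, 700, 50, 150],
})
ax_t = plot_testing(demo, ["NY", "MI"])
```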

Functions to compute various ratios

We code a set of functions to compute various useful ratios, such as:

  • Fatality ratio: the ratio of total deaths to total positive cases
  • Hospitalization ratio: the ratio of total hospitalized to total positive cases
  • Positive case ratio: the ratio of total positive cases to the total number of tests

Once these functions are coded, they can be used in a standard visualization script to plot bar charts comparing various states on their ratios. This will answer questions like: Which state has the highest fatality ratio? Which state shows the lowest rate of hospitalization?
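The three ratios defined above can be sketched as follows. The column names (`death`, `positive`, `hospitalized`, `totalTestResults`) are assumptions about the dataset, all function names are hypothetical, and the numbers are made up. The ratios are computed on each state's most recent row, and any ratio whose inputs are missing (NaN or the -1 sentinel) comes out as NaN instead of a misleading number.

```python
import pandas as pd


def latest_by_state(df):
    """Most recent row for each state, indexed by state code."""
    return df.sort_values("date").groupby("state").tail(1).set_index("state")


def safe_ratio(num, den):
    """Element-wise num / den; NaN where an input is missing, negative, or den is 0."""
    return num.where(num >= 0) / den.where(den > 0)


def fatality_ratio(latest):
    return safe_ratio(latest["death"], latest["positive"])


def hospitalization_ratio(latest):
    return safe_ratio(latest["hospitalized"], latest["positive"])


def positive_case_ratio(latest):
    return safe_ratio(latest["positive"], latest["totalTestResults"])


# Made-up rows for illustration only; -1 marks a value that was not reported.
demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-01", "2020-04-02", "2020-04-02"]),
    "state": ["NY", "NY", "MI"],
    "positive": [80, 100, 50],
    "death": [5, 10, 6],
    "hospitalized": [15, 20, -1],
    "totalTestResults": [300, 400, 250],
})
latest = latest_by_state(demo)
ratios = pd.DataFrame({
    "fatality": fatality_ratio(latest),
    "hospitalization": hospitalization_ratio(latest),
    "positive_case": positive_case_ratio(latest),
})
print(ratios)
```

Returning NaN for unreported inputs is also what lets a comparison chart simply omit the bar for a state that does not report a given metric.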

Why are these ratios important to compute and track?

This is because, later on, public health officials and social scientists can look at these numbers and hypothesize about which local factors (healthcare system readiness, testing/tracking capacity, political expediency, GDP per capita, etc.) influenced fatality or hospitalization in each state.

Here are some examples of comparative charts. Note that not all of the charts have the same number of bars – not all states report the same metrics.

Because these are ratios, it does not matter that NY or CA is much bigger than MI in terms of population or COVID-19 case count. In these charts, MI has the highest fatality ratio among COVID-19 positive cases, whereas NY has the highest fraction of positive results out of all the tests performed.

Each of these assertions, being visualized by simple bar charts, is a story to follow and analyze to understand the nature and dynamics of this disease.


Bubble charts for comparing all states together

The same ratios can be plotted in a bubble chart to compare all the states together.
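One way to draw such a chart is sketched below. Which ratio goes on which axis, and which one sets the bubble size, is an assumption here, since the article does not specify it. `bubble_chart` is a hypothetical name and the ratio values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd


def bubble_chart(ratios, x, y, size, ax=None, max_area=2000):
    """Scatter of x vs y per state, with bubble area proportional to `size`."""
    if ax is None:
        _, ax = plt.subplots(figsize=(7, 5))
    # States missing any of the three ratios are left out.
    data = ratios[[x, y, size]].dropna()
    peak = data[size].max()
    areas = max_area * data[size] / peak if peak > 0 else max_area
    ax.scatter(data[x], data[y], s=areas, alpha=0.5, edgecolor="k")
    # Label each bubble with its state code (the DataFrame index).
    for state, row in data.iterrows():
        ax.annotate(state, (row[x], row[y]), ha="center", va="center")
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    return ax


# Made-up ratios for illustration only; None marks an unreported metric.
demo_ratios = pd.DataFrame(
    {
        "positive_case": [0.25, 0.20, 0.10],
        "fatality": [0.10, 0.12, 0.03],
        "hospitalization": [0.20, None, 0.15],
    },
    index=["NY", "MI", "CA"],
)
ax_b = bubble_chart(demo_ratios, "positive_case", "fatality", "hospitalization")
```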

Which states have 14 days of decreasing case counts?

We need to write a couple of custom functions to extract this data or trend from the dataset and visualize it.

Basically, we can look at the successive differences of new COVID-19 cases over the last 14 days. If all the differences are negative, the new cases are going down monotonically. Even if that is not the case, the more negative numbers there are, the better, as that shows a decreasing trend overall.
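The successive-difference check can be sketched as below. It assumes a daily new-cases column named `positiveIncrease` with no gaps in the recent rows; the function names are hypothetical and the rows are made up. Note that the last N daily values yield N-1 day-over-day differences, so 14 days of data give 13 differences.

```python
import pandas as pd


def successive_diffs(df, state, days=14, col="positiveIncrease"):
    """Day-over-day change in new cases across the last `days` daily values."""
    sub = df[df["state"] == state].sort_values("date")
    recent = sub[col].tail(days)
    return recent.diff().dropna()


def trend_summary(diffs):
    """Count the negative changes and flag a strictly decreasing run."""
    negatives = int((diffs < 0).sum())
    total = int(len(diffs))
    return {
        "negative_days": negatives,
        "total_days": total,
        "monotonic_decrease": total > 0 and negatives == total,
    }


# Made-up rows for illustration only, with a short 4-day window.
demo = pd.DataFrame({
    "date": pd.date_range("2020-04-01", periods=5, freq="D"),
    "state": ["NY"] * 5,
    "positiveIncrease": [50, 40, 45, 30, 20],
})
diffs_demo = successive_diffs(demo, "NY", days=4)
summary_demo = trend_summary(diffs_demo)
print(summary_demo)
```

Plotting `diffs_demo` as a bar chart per state gives the kind of comparison described here: bars below zero are days on which new cases fell.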

The code is in the Notebook, but the results are shown here when we compare four states – CA, GA, LA, and MI. No state, in fact, has shown a complete set of negative numbers.

Summary and other articles

We showed how to pull data from one of the most respected COVID-19 databases in the US and create meaningful visualizations with simple Python coding.

Again, the code can be found in this Jupyter notebook in my COVID-19 analysis GitHub repo. There are other useful Notebooks in this repo, which you can fork and explore.

This is NOT a predictive model and never will be. The goal of this script is to do visual analytics only. Without solid knowledge of epidemiology, or without collaborating with someone who has it, one should not build a predictive model from the time-series data alone.

The greatest global crisis since World War II and the largest global pandemic since the 1918–19 Spanish Flu is upon us today. Everybody is looking at the daily rise of the death toll and the rapid, exponential spread of this novel strain of the virus.

Data scientists, like so many people from all other walks of life, may also be feeling anxious. It may be somewhat reassuring to know that the familiar tools of Data Science and statistical modeling are very much relevant for analyzing the critical testing and disease-related data.

Here are a few of my other articles related to COVID-19:

Analyze NY Times Covid-19 Dataset

False positives/negatives and Bayes rule for COVID-19 testing

Simple modeling of "flattening the curve" and "lifting lockdown"

Stay safe, everybody!


Note from the author: I am a semiconductor technologist, interested in applying data science and machine learning to various problems related to my field. I have no expertise or deep knowledge about medicine, molecular biology, epidemiology, or anything of that sort related to COVID-19. Please do not send me an email with that kind of query.


Also, you can check the author’s GitHub repositories for code, ideas, and resources in machine learning and data science. If you are, like me, passionate about AI/machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.

Tirthajyoti Sarkar – Sr. Principal Engineer – Semiconductor, AI, Machine Learning – ON…

