See the Coronavirus for Yourself

Explore COVID-19 through Data Analysis & Visualization

Published in Towards Data Science

7 min read · Mar 11, 2020

There’s been a lot of panic around the coronavirus — but how much of it is warranted? We take in most of our information from the news, and thus our entire view of the coronavirus is tainted by their biases.

Data has no agenda, no bias, and no motive; data is not political. Relying only on what the Internet's content machine cranks out will mislead you. In this article, I'll show how anyone can go straight to the raw data and come to their own conclusions with simple data analytics and visualization methods in Python.

See coronavirus for yourself, not through tinted glasses.

Get to Know the Data

We will be using the routinely updated coronavirus dataset on Kaggle. This code loads the .csv file that holds the data from the directory:

import pandas as pd

data = pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/COVID19_open_line_list.csv')

Viewing the data:

data.drop(['Unnamed: '+str(x) for x in range(33,45)],axis=1,inplace=True) #deletes unnecessary columns
data.head()

This is a very data-rich dataset. With len(data) one can find that there are 14126 rows in the dataset. Calling data.columns will print out the columns in the DataFrame.

The Effect of Age on Outcome

Questions:

Which ages have a higher chance of dying?
Which ages have a higher chance of being discharged from the hospital or recovering?
Which ages get the coronavirus?

One of the columns is outcome. The unique values of this categorical variable can be listed with data['outcome'].unique().

Be warned: this column contains a very large number of NaN values. In a more statistically rigorous analysis, we would avoid working with this column because of the sheer amount of missing data. However, since the findings from this data match more professional studies to a reasonable degree, I think it is safe to say that the known values are representative enough of the overall data for our purposes.
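To put a number on how much is missing, a quick check like the one below can help. It is only a sketch: the small `toy` DataFrame is invented for illustration and stands in for the real table, so the printed shares say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; these rows are invented for illustration.
toy = pd.DataFrame({
    "age": ["61", "34-36", "45", np.nan, "70"],
    "outcome": ["died", np.nan, "discharged", np.nan, np.nan],
})

# Fraction of missing values per column: isna() marks NaN, mean() averages the booleans.
missing_share = toy.isna().mean()

# How often each outcome label appears, including the NaN bucket.
outcome_counts = toy["outcome"].value_counts(dropna=False)

print(missing_share)
print(outcome_counts)
```

Running the same two calls on `data` instead of `toy` shows how sparse each column of the real table is.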

It seems that this column is very messy — the same meaning is spelled differently (‘discharged’ and ‘discharge’, ‘died’ and ‘death’, etc.). Let’s outline a function clean() that cleans this up a bit.

import numpy as np

def clean(x):
    # Map the different spellings of each outcome onto a single label
    if x in ('death', 'died', 'Death'):
        return 'death'
    elif x in ('discharged', 'discharge'):
        return 'discharge'
    elif x in ('recovered', 'stable'):
        return 'recovered'
    else:
        return np.nan  # unknown or missing outcomes
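As an alternative to a chain of comparisons, the same cleanup can be sketched with a lookup table and pandas string methods. The names `OUTCOME_MAP` and `clean_series` are hypothetical, and this version is slightly more permissive than clean(): because it lowercases first, capitalized variants such as 'Discharged' are also matched.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table mirroring the branches of clean().
OUTCOME_MAP = {
    "death": "death", "died": "death",
    "discharged": "discharge", "discharge": "discharge",
    "recovered": "recovered", "stable": "recovered",
}

def clean_series(outcomes: pd.Series) -> pd.Series:
    """Normalize outcome labels; anything not in OUTCOME_MAP becomes NaN."""
    # Stripping and lowercasing catches variants such as 'Death' or ' died'.
    return outcomes.str.strip().str.lower().map(OUTCOME_MAP)

# Invented labels for illustration.
raw = pd.Series(["Death", "died", "discharge", "stable", "critical", np.nan])
cleaned = clean_series(raw)
print(cleaned.tolist())
```

Adding a newly discovered spelling then only means adding one entry to the dictionary.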

We need to apply another function to the age column. All values are strings, not integers, and some are ranges (e.g. ‘34–36’). A function apply_int can convert strings to integers and return nan for all ranges.

def apply_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        # ranges such as '34-36', missing values, and other non-numeric entries
        return np.nan
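pandas also ships a built-in conversion that behaves similarly. The sketch below uses invented values; note one difference from apply_int: a decimal string such as '0.5' is kept as 0.5 here, whereas apply_int would turn it into NaN.

```python
import pandas as pd

# Invented age entries: clean numbers, a range, a missing value, and junk text.
ages = pd.Series(["34", "34-36", "61", None, "abc"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric_ages = pd.to_numeric(ages, errors="coerce")
print(numeric_ages.tolist())
```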

Now, we can plot the age distributions of those who died, were discharged, or recovered.

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(14,5))
# Note: distplot is deprecated since seaborn 0.11; kdeplot plus rugplot is the suggested replacement
sns.distplot(data[data['outcome'].apply(clean)=='death']['age'].apply(apply_int),hist=False,rug=True,label='Deaths')
sns.distplot(data[data['outcome'].apply(clean)=='discharge']['age'].apply(apply_int),hist=False,rug=True,label='Discharged')
sns.distplot(data[data['outcome'].apply(clean)=='recovered']['age'].apply(apply_int),hist=False,rug=True,label='Recovered')
plt.legend()
plt.show()

This plot is very telling of the impact of coronavirus on different age ranges: the distribution for deaths peaks at around 75 years of age, whereas the distribution for those who recovered peaks at about 45, and the distribution for those discharged from the hospital peaks at about 30.

Note that because a distribution plot is a smoothed, continuous estimate, the curves for discharged and recovered begin at around -20 years of age, which is of course impossible.

This means that a distribution plot, while helpful for viewing the overall shape, can deceive the viewer, for example by suggesting that there are people under 40 who died (which is not true in the dataset).
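Why does a smoothed curve spill into impossible ages? A kernel density estimate places a small bell curve on every data point and averages them, so some mass always leaks past the smallest and largest observations. The sketch below is a hand-rolled Gaussian KDE in NumPy, not seaborn's internal code; the ages and the 10-year bandwidth are invented for illustration.

```python
import numpy as np

def gaussian_kde_at(points, samples, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` at `points`."""
    points = np.asarray(points, dtype=float)[:, None]
    samples = np.asarray(samples, dtype=float)[None, :]
    z = (points - samples) / bandwidth
    # One bell curve per sample, averaged over all samples.
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Invented ages, all non-negative; the bandwidth of 10 years is an arbitrary choice.
ages = [2, 25, 30, 31, 38, 45, 52]
density = gaussian_kde_at([-10, 0, 30], ages, bandwidth=10.0)
print(density)
```

The estimated density at age -10 is small but not zero, even though no sample is negative. In seaborn, the `cut` and `clip` parameters of kdeplot can be used to limit how far the curve extends past the data.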

A boxplot, while not showing the full distribution, can give us an idea of where its key landmarks (minimum, 1st & 3rd quartiles, median, maximum) lie:

df1 = pd.DataFrame(data[data['outcome'].apply(clean)=='death']['age'].apply(apply_int)).assign(outcome='death')
df3 = pd.DataFrame(data[data['outcome'].apply(clean)=='discharge']['age'].apply(apply_int)).assign(outcome='discharge')
df2 = pd.DataFrame(data[data['outcome'].apply(clean)=='recovered']['age'].apply(apply_int)).assign(outcome='recovered')
cdf = pd.concat([df1, df2, df3])
plt.figure(figsize=(5,5))
sns.boxplot(x="outcome", y="age", data=cdf)  # one box per outcome
plt.show()

Note that diamonds represent statistical outliers. It's clear that the typical age drops from one outcome to the next, with deaths concentrated at the oldest ages. Most people who die are 60 years of age or above.
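The numbers behind a boxplot can also be computed directly. The sketch below uses invented ages for a single outcome group and assumes the default whisker rule of 1.5 times the interquartile range, which is what marks a point as an outlier.

```python
import pandas as pd

# Invented ages for a single outcome group.
ages = pd.Series([22, 35, 38, 40, 41, 44, 47, 50, 52, 89])

q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# With the default settings, whiskers reach at most 1.5 * IQR past the quartiles;
# points beyond these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```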

Let’s find the average age for the three outcomes.

The code for the average age for deaths…

data[data['outcome'].apply(clean)=='death']['age'].apply(apply_int).mean()

…returns 65.125. The average age for recovery is 45 years of age, and the average age for discharge is 39 years of age.

To adapt the code for other outcomes, simply substitute ‘death’ with other outcomes.
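Instead of repeating the filter once per outcome, all three means can be computed in one pass with a groupby. This is a sketch on invented, already-cleaned rows, so the resulting numbers are not the ones reported for the real data.

```python
import numpy as np
import pandas as pd

# Invented rows that are already cleaned (outcome labels unified, ages numeric).
toy = pd.DataFrame({
    "outcome": ["death", "death", "recovered", "discharge", "discharge", np.nan],
    "age": [70, 60, 45, 30, np.nan, 50],
})

# groupby drops rows whose outcome is NaN, and mean() skips missing ages.
mean_age = toy.groupby("outcome")["age"].mean()
print(mean_age)
```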

Spread of the Coronavirus Across Countries

Questions:

How many cases are there across countries?
How does the spread differ among continents?

fig = plt.figure(figsize=(18,5))
sns.set_style('whitegrid')
sns.countplot(x=data['country'],order=data['country'].value_counts().index)  # x= keyword is required by newer seaborn versions
plt.xticks(rotation=90)
plt.show()

In terms of continents, three of Asia's largest countries have the highest counts. Otherwise, countries that rank high in confirmed coronavirus cases seem to have more flights to and from China. South Korea and Japan are geographically close to China, so they should be expected to have a significant number of cases. Iran, a political ally of China, has cases (probably) due in part to flights between the two countries.

It seems that the vast majority of cases occurred in China. Everyone knows this, but I am sure many are shocked by the visual magnitude. China has at least 10 times more cases than South Korea and Italy combined, but the news coverage is vastly disproportionate.
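A comparison like this can be checked numerically rather than by eye. The sketch below uses an invented country column, so the ratio it prints is illustrative only; on the real table the same calls would be applied to data['country'].

```python
import pandas as pd

# Invented country labels; the real column has thousands of rows.
countries = pd.Series(["China"] * 8 + ["South Korea"] * 1 + ["Italy"] * 1)

counts = countries.value_counts()                # cases per country, largest first
shares = countries.value_counts(normalize=True)  # same, as a fraction of all rows
ratio = counts["China"] / (counts["South Korea"] + counts["Italy"])

print(shares)
print(ratio)
```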

By taking China out of the picture, we can view how the situation is for other countries:

fig = plt.figure(figsize=(18,7))
sns.set_style('whitegrid')
sns.countplot(x=data[data['country'] != 'China']['country'],order=data[data['country'] != 'China']['country'].value_counts().index)
plt.xticks(rotation=90)
plt.show()

When the United States’ count is put next to other countries’, it looks much more trivial, especially given the United States’ population and land size. In fact, when you plot out the percent of the population infected in South Korea, Japan, Italy, and the United States:

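plt.figure(figsize=(7,6))
countries = ['South Korea','Italy','Japan','United States']
# confirmed cases / population, multiplied by 100 to express a percentage
percent_infected = [938/50_800_000*100, 588/60_430_000*100, 731/124_800_000*100, 17/327_170_000*100]
sns.barplot(x=countries, y=percent_infected)
plt.ylabel("% of population infected")
plt.xlabel("country")
plt.show()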

The percent for the United States isn’t even visible.
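Since the smallest bar cannot be read off the chart, it helps to print the values. The sketch below reuses the case counts and populations from the bar plot above and needs nothing beyond plain Python.

```python
# Case counts and populations are the figures used in the bar plot above.
cases = {"South Korea": 938, "Italy": 588, "Japan": 731, "United States": 17}
population = {"South Korea": 50_800_000, "Italy": 60_430_000,
              "Japan": 124_800_000, "United States": 327_170_000}

# Multiply by 100 so the values are percentages rather than raw fractions.
percent_infected = {c: 100 * cases[c] / population[c] for c in cases}

# Print from the highest to the lowest share.
for country, pct in sorted(percent_infected.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {pct:.6f}%")
```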

Spread of Coronavirus across Time

Questions:

What is the shape of a curve representing the new confirmed cases over time, and what does it tell us about the future?
What do the starting dates of infections in different countries tell us about how fast viruses can transmit?

Opening a new table within the dataset:

data = pd.read_csv('/kaggle/input/novel-corona-virus-2019-dataset/time_series_covid_19_confirmed.csv')

This table gives us the cumulative number of people diagnosed with the coronavirus as of each date, with one row per province or country and one column per date.

Plotting out the confirmed cases in Mainland China by date:

# Keep only the date columns, then add up the rows for all provinces
plot_data = data[data['Country/Region']=='Mainland China'].drop(columns=['Province/State','Country/Region','Lat','Long']).sum()
plt.figure(figsize=(15,5))
sns.barplot(x=plot_data.index,y=plot_data.values,palette='Blues')
plt.xticks(rotation=90)
plt.xlabel("Date")
plt.ylabel("Confirmed Cases")
plt.title("Confirmed Cases in Mainland China over Time")
plt.show()

The curve seems to be slowly tapering off at the end for China; the coronavirus may have already wreaked its havoc there and may now be receding.

We can substitute other regions for 'Mainland China'. For instance, one that would be interesting to see is the number of cases aboard the Diamond Princess, the famous quarantined cruise ship.

# Keep only the date columns for the cruise ship's row
plot_data = data[data['Province/State']=='Diamond Princess cruise ship'].drop(columns=['Province/State','Country/Region','Lat','Long']).sum()
plt.figure(figsize=(15,5))
sns.barplot(x=plot_data.index,y=plot_data.values,palette='Blues')
plt.xticks(rotation=90)
plt.xlabel("Date")
plt.ylabel("Confirmed Cases")
plt.title("Confirmed Cases on the Diamond Princess over Time")
plt.show()

The number of cases starts at only a few (barely visible), suddenly jumps to about 50 on 2/7, and then keeps climbing steeply. In hindsight, would it have been better to take the infected people off the ship earlier instead of quarantining it?

Plotting the newly confirmed cases per day in the US:

…and for South Korea:

Pay attention to the starting times — 2 days after the first confirmed cases in China, the first cases are reported in South Korea and in the United States.
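Both the per-day counts and the starting dates can be derived from a single row of the time series. The sketch below assumes the table stores running totals, so that daily new cases are the day-over-day difference; the counts are invented for illustration.

```python
import pandas as pd

# Invented cumulative counts indexed by date, shaped like one row of the table.
cumulative = pd.Series(
    [0, 0, 1, 1, 4, 9],
    index=["1/22/20", "1/23/20", "1/24/20", "1/25/20", "1/26/20", "1/27/20"],
)

# diff() gives the day-over-day change; the first day has no predecessor,
# so it is filled with that day's own total.
new_cases = cumulative.diff().fillna(cumulative.iloc[0]).astype(int)

# First date on which any case was recorded.
first_date = cumulative[cumulative > 0].index[0]

print(new_cases.tolist())
print(first_date)
```

Comparing `first_date` across countries gives the gap between starting dates without reading it off a chart.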

The coronavirus is an important issue. To navigate the hailstorm of news-driven panic, we need data. Websites like Kaggle ensure that everyone has access to data and a notebook to analyze it. I hope this article has given you some starting points to conduct your own analysis.

Analysis does not need to be complicated — it can be just as simple as taking the mean or finding the range. What is more important is that you are taking the raw data and making your own conclusions — your view on major events like the coronavirus should not be purely based on conclusions you are told to believe.

Written by Andre Ye