Creating a Racing Bar Chart using the Covid-19 Dataset

https://youtu.be/5UKdoEoFaKA

Make your bar chart come to life

In my previous two articles on data analytics using the Covid-19 dataset, I first

The Covid-19 dataset is a good candidate for exploring and understanding data analytics and visualisation. In this article, I will show you how to create a dynamic chart in matplotlib. In particular, I will create a racing bar chart to dynamically display the number of confirmed cases in each country as the days go by. At the end of this article, you will be able to see a chart like this:

Wrangling the Data

For a start, let’s use Jupyter Notebook to clean and filter all the data so that you have a clean dataset. Once the data is prepared, you will then be able to focus on creating the bar chart.

Importing the Packages

The first step is to import all the packages that you will be using for this project:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

Loading the Dataset

For this article, I will be using the latest dataset from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset. Be sure to download it and save it locally on your machine. In my case, I have saved it in a folder named "Dataset - 20 July 2020", relative to the location of my Jupyter Notebook:

# load the dataset
dataset_path = "./Dataset - 20 July 2020/"
df_conf = pd.read_csv(dataset_path + 
              "time_series_covid_19_confirmed.csv")
df_conf

The df_conf DataFrame looks like this:

Sorting the DataFrame

With the DataFrame loaded, let’s sort the values by the 'Province/State' and 'Country/Region' columns:

# sort the df
df_conf = df_conf.sort_values(by=['Province/State','Country/Region'])
df_conf = df_conf.reset_index(drop=True)
df_conf

After the sorting, the df_conf will look like this:

Unpivoting the DataFrame

The next step is to unpivot the DataFrame so that the per-date columns are replaced by two new columns, Date and Confirmed:

# extract the dates columns
dates_conf = df_conf.columns[4:]
# perform unpivoting
df_conf_melted = df_conf.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates_conf,
    var_name='Date',
    value_name='Confirmed')
df_conf_melted

The unpivoted DataFrame (df_conf_melted) should now look like this:
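To see what melt() does on a small scale, here is a minimal sketch using a made-up two-row table. The country names "A" and "B" and all the numbers are placeholders that I chose for illustration; they are not taken from the Covid-19 dataset:

```python
import pandas as pd

# Made-up wide table: one row per region, one column per date
wide = pd.DataFrame({
    "Province/State": ["P1", None],
    "Country/Region": ["A", "B"],
    "Lat": [1.0, 2.0],
    "Long": [3.0, 4.0],
    "1/22/20": [1, 10],
    "1/23/20": [2, 20],
})

# the first four columns identify the region; the rest are dates
date_cols = wide.columns[4:]

# unpivot: each (region, date) pair becomes its own row
tall = wide.melt(
    id_vars=["Province/State", "Country/Region", "Lat", "Long"],
    value_vars=date_cols,
    var_name="Date",
    value_name="Confirmed",
)
print(tall)
```

The two date columns turn into four rows, with the former column headers stored in Date and the cell values stored in Confirmed.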

Converting the Date Column to Date Format

The next step is to convert the date strings stored in the Date column into date objects:

# convert the date column to date format
df_conf_melted["Date"] = df_conf_melted["Date"].apply(
    lambda x: datetime.datetime.strptime(x, '%m/%d/%y').date())
df_conf_melted

The updated df_conf_melted dataframe now looks like this:
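As a quick check of the '%m/%d/%y' format string, the sketch below parses two sample strings of the same shape as the dataset's column headers. It also shows pd.to_datetime() as a vectorised alternative; that variant is my own suggestion and is not used elsewhere in this article:

```python
import datetime
import pandas as pd

# sample date strings in month/day/two-digit-year form
s = pd.Series(["1/22/20", "7/4/20"])

# row-by-row parsing, as done in the article
parsed = s.apply(
    lambda x: datetime.datetime.strptime(x, "%m/%d/%y").date())

# vectorised alternative (an assumption on my part, same result)
vectorised = pd.to_datetime(s, format="%m/%d/%y").dt.date

print(parsed.tolist())
print(vectorised.tolist())
```

Both approaches produce Python date objects, so either can be used before grouping by Date.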

Grouping By Date and Country/Region and Summing up the Confirmed Cases for Each Country

It is now time to sum up all the confirmed cases for each country (regardless of province/state). You can do it using the following statements:

# group by date and country, then sum the numeric columns per country
# (numeric_only=True skips the text column Province/State)
df_daily = df_conf_melted.groupby(["Date", "Country/Region"]).sum(numeric_only=True)
df_daily

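The df_daily DataFrame now shows you the number of confirmed cases for each country on each day (the Lat and Long columns are summed as well, so their values are no longer meaningful):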
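The effect of the grouping is easier to follow on a tiny example. This is a minimal sketch with made-up rows, where a hypothetical country "A" has two provinces and country "B" has none:

```python
import pandas as pd

# made-up long-format rows: country "A" is split into two provinces
melted = pd.DataFrame({
    "Province/State": ["P1", "P2", None],
    "Country/Region": ["A", "A", "B"],
    "Date": ["d1", "d1", "d1"],
    "Confirmed": [10, 5, 3],
})

# one row per (Date, Country/Region); provinces are added together
daily = melted.groupby(["Date", "Country/Region"]).sum(numeric_only=True)
print(daily)
```

The two province rows for "A" collapse into a single row, and the result is indexed by a two-level index of Date and Country/Region.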

Sorting the DataFrame by Date and Confirmed

We want to sort the df_daily DataFrame based on two keys:

  • Date (ascending order)
  • Confirmed (descending order)

The following statement accomplishes that:

df_daily_sorted = df_daily.sort_values(['Date','Confirmed'], 
                      ascending=[True, False])
df_daily_sorted

The df_daily_sorted DataFrame now looks like this:
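Note that Date is an index level here while Confirmed is a column, and sort_values() accepts both in the same call. A minimal sketch with made-up values shows the resulting order:

```python
import pandas as pd

# made-up grouped data with a (Date, Country/Region) index
idx = pd.MultiIndex.from_tuples(
    [("d2", "A"), ("d1", "B"), ("d1", "A"), ("d2", "B")],
    names=["Date", "Country/Region"])
daily = pd.DataFrame({"Confirmed": [7, 3, 10, 9]}, index=idx)

# dates ascending; within each date, largest count first
ranked = daily.sort_values(["Date", "Confirmed"],
                           ascending=[True, False])
print(ranked)
```

Within each date, the country with the most cases now comes first, which is what makes head() usable for picking the top n later.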

One final thing to do is to extract the countries as a list:

# get the list of all countries
all_countries = list(df_conf['Country/Region'])

This will be used in the next section.

Iterating through Each Day

Now that our df_daily_sorted DataFrame is sorted and formatted the way we want, we can start to dive into it.

First, define the top n countries that we want to view:

top_n = 20    # view the top 20 countries

Let’s group the df_daily_sorted dataframe by its first level index (level 0, which is Date) and then iterate through it:

for date, daily_df in df_daily_sorted.groupby(level=0):
    print(date)
    print(daily_df)    # a dataframe

You should see the following output:

2020-01-22
                                       Lat         Long  Confirmed
Date       Country/Region                                         
2020-01-22 China               1085.292300  3688.937700        548
           Japan                 36.204824   138.252924          2
           Thailand              15.870032   100.992541          2
           Korea, South          35.907757   127.766922          1
           Taiwan*               23.700000   121.000000          1
...                                    ...          ...        ...
           West Bank and Gaza    31.952200    35.233200          0
           Western Sahara        24.215500   -12.885800          0
           Yemen                 15.552727    48.516388          0
           Zambia               -13.133897    27.849332          0
           Zimbabwe             -19.015438    29.154857          0

[188 rows x 3 columns]
2020-01-23
                                       Lat         Long  Confirmed
Date       Country/Region                                         
2020-01-23 China               1085.292300  3688.937700        643
           Thailand              15.870032   100.992541          3
           Japan                 36.204824   138.252924          2
           Vietnam               14.058324   108.277199          2
           Korea, South          35.907757   127.766922          1
...                                    ...          ...        ...

The output above shows, for each date, the total number of confirmed cases for each country, sorted in descending order.

For each day, extract the top n countries and their number of confirmed cases:

for date, daily_df in df_daily_sorted.groupby(level=0):       
    print(date)
    # print(daily_df)    # a dataframe
    topn_df = daily_df.head(top_n)
    # get all the countries from the multi-index of the dataframe
    countries = list(map(lambda x:(x[1]),topn_df.index))[::-1]    
    confirmed = list(topn_df.Confirmed)[::-1]    
    print(countries)
    print(confirmed)

Notice that I have reversed the order of the countries and confirmed lists using [::-1], so that the countries with the most cases are listed last. This matters because the plt.barh() method draws the horizontal bars from the bottom up, so when you later plot the chart, the countries with the most cases will appear at the top.

You will see something like this:

2020-01-22
['Bangladesh', 'Bahrain', 'Bahamas', 'Azerbaijan', 'Austria', 'Australia', 'Armenia', 'Argentina', 'Antigua and Barbuda', 'Angola', 'Andorra', 'Algeria', 'Albania', 'Afghanistan', 'US', 'Taiwan*', 'Korea, South', 'Thailand', 'Japan', 'China']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 548]
2020-01-23
['Bahamas', 'Azerbaijan', 'Austria', 'Australia', 'Armenia', 'Argentina', 'Antigua and Barbuda', 'Angola', 'Andorra', 'Algeria', 'Albania', 'Afghanistan', 'US', 'Taiwan*', 'Singapore', 'Korea, South', 'Vietnam', 'Japan', 'Thailand', 'China']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 643]
2020-01-24
['Azerbaijan', 'Austria', 'Australia', 'Armenia', 'Argentina', 'Antigua and Barbuda', 'Angola', 'Andorra', 'Algeria', 'Albania', 'Afghanistan', 'Vietnam', 'US', 'Korea, South', 'Japan', 'France', 'Taiwan*', 'Singapore', 'Thailand', 'China']
...

At this point, you have all the data required to plot a bar chart displaying the number of confirmed cases for each country on each day.
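The per-day extraction can be tried in isolation. The following is a minimal sketch on made-up, already-sorted data; the frames dictionary is a hypothetical helper that I added only to keep the result of each iteration, and I read the country names with get_level_values() instead of map():

```python
import pandas as pd

# made-up data, already sorted by date (asc) and count (desc)
idx = pd.MultiIndex.from_tuples(
    [("d1", "A"), ("d1", "B"), ("d1", "C"),
     ("d2", "C"), ("d2", "A"), ("d2", "B")],
    names=["Date", "Country/Region"])
ranked = pd.DataFrame({"Confirmed": [10, 3, 1, 12, 11, 4]}, index=idx)

top_n = 2
frames = {}    # hypothetical helper: date -> (countries, confirmed)
for date, daily_df in ranked.groupby(level=0):
    topn_df = daily_df.head(top_n)
    # reverse both lists so the largest value comes last
    countries = list(
        topn_df.index.get_level_values("Country/Region"))[::-1]
    confirmed = list(topn_df["Confirmed"])[::-1]
    frames[date] = (countries, confirmed)

print(frames)
```

Each entry holds the top two countries for that date, ordered from smallest to largest, ready to be passed to plt.barh().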

Plotting the Bar Charts

You are now ready to display a racing bar chart showing the number of confirmed cases for each country. You can’t display a dynamic chart in Jupyter Notebook, so write the following code and save it as a file named RacingBarChart.py (in the same directory where you launched Jupyter Notebook). The first part repeats the data preparation steps above; the code for plotting the racing bar chart starts at the "# plotting using the seaborn-darkgrid style" comment:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
# load the dataset
dataset_path = "./Dataset - 20 July 2020/"
df_conf = pd.read_csv(dataset_path + 
              "time_series_covid_19_confirmed.csv")
# sort the dfs
df_conf = df_conf.sort_values(by=
    ['Province/State','Country/Region'])
df_conf = df_conf.reset_index(drop=True)
# extract the dates columns
dates_conf = df_conf.columns[4:]
# perform unpivoting
df_conf_melted = df_conf.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates_conf,
    var_name='Date',
    value_name='Confirmed')
# convert the date column to date format
df_conf_melted["Date"] = df_conf_melted["Date"].apply(
    lambda x: datetime.datetime.strptime(x, '%m/%d/%y').date())
# group by date and country, then sum the numeric columns per country
# (numeric_only=True skips the text column Province/State)
df_daily = df_conf_melted.groupby(["Date", "Country/Region"]).sum(numeric_only=True)
df_daily_sorted = df_daily.sort_values(['Date','Confirmed'],
                      ascending=[True, False])
# get the list of all countries
all_countries = list(df_conf['Country/Region'])
# plotting using the seaborn-darkgrid style
# (matplotlib 3.6 and later name this style 'seaborn-v0_8-darkgrid')
plt.style.use('seaborn-darkgrid')
# set the size of the chart
fig, ax = plt.subplots(1, 1, figsize=(14,10))
# hide the y-axis labels
ax.yaxis.set_visible(False)
# assign a color to each country
NUM_COLORS = len(all_countries)
cm = plt.get_cmap('Set3')
colors = np.array([cm(1.*i/NUM_COLORS) for i in range(NUM_COLORS)])
top_n = 20
for date, daily_df in df_daily_sorted.groupby(level=0):
    # print(date)
    # print(daily_df)    # a dataframe
    topn_df = daily_df.head(top_n)

    # get all the countries from the multi-index of the dataframe
    countries = list(map (lambda x:(x[1]),topn_df.index))[::-1]
    confirmed = list(topn_df.Confirmed)[::-1]
    # clear the axes so that countries no longer in the top n will not
    # be displayed
    ax.clear()
    # plot the horizontal bars
    plt.barh(
        countries,
        confirmed,
        color = colors[[all_countries.index(n) for n in countries]],
        edgecolor = "black",
        label = "Total Number of Confirmed Cases")
    # display the labels on the bars
    for index, rect in enumerate(ax.patches):
        x_value = rect.get_width()
        y_value = rect.get_y() + rect.get_height() / 2
        # display the country
        ax.text(x_value, y_value, f'{countries[index]} ',
            ha="right", va="bottom",
            color="black", fontweight='bold')
        # display the number
        ax.text(x_value, y_value, f'{confirmed[index]:,} ',
            ha="right", va="top",
            color="black")
    # display the title
    plt.title(f"Top {top_n} Countries with Covid-19 ({date})",
        fontweight="bold",
        fontname="Impact",
        fontsize=25)
    # display the x-axis label
    plt.xlabel("Number of people")
    # draw the data and run the GUI event loop
    plt.pause(0.5)
# keep the matplotlib window
plt.show(block=True)
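One detail of the script worth isolating is the colour lookup. Because all_countries is built from the province-level table, it contains repeated country names, and list.index() returns the position of the first match, so each country still gets the same colour on every frame. Here is a minimal sketch with a made-up list of names; the Agg backend is selected only so that the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")    # headless backend, for this sketch only
import matplotlib.pyplot as plt
import numpy as np

# made-up list with a repeated name, like the province-level table
all_countries = ["A", "B", "B", "C"]

# one RGBA colour per list position, sampled from the Set3 colormap
cm = plt.get_cmap("Set3")
num_colors = len(all_countries)
colors = np.array([cm(i / num_colors) for i in range(num_colors)])

# look up the colours for the countries shown in one frame
countries = ["C", "A"]
picked = colors[[all_countries.index(n) for n in countries]]
print(picked)
```

Since the lookup depends only on the position in all_countries, a country keeps its colour even when its rank changes from one day to the next.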

To run the application, type the following in your Terminal/Anaconda Prompt:

$ python RacingBarChart.py

You should now see the Racing Bar Chart:

If you do not want to display so many countries, adjust the value of the top_n variable:

top_n = 10

And change the size of the figure:

fig, ax = plt.subplots(1, 1, figsize=(10,6))

If you want to speed up the race, reduce the duration passed to the pause() function:

# draw the data and run the GUI event loop
plt.pause(0.2)

Summary

I hope you get the chance to try out the racing bar chart. It presents information dynamically and is a powerful way to get your points across. If you make any enhancements, let me know; I would love to hear from you. See you next time!

