Make your bar chart come to life
In my previous two articles on data analytics using the Covid-19 dataset, I first
- discussed how to perform data analytics using NumPy and Pandas (https://levelup.gitconnected.com/performing-data-analytics-on-the-covid-19-dataset-using-python-numpy-and-pandas-bdfc352c61e9), followed by,
- how to perform data visualization using matplotlib (https://levelup.gitconnected.com/performing-data-visualization-using-the-covid-19-dataset-47c441747c43).
The Covid-19 dataset is a good candidate for exploring and understanding data analytics and visualisation. In this article, I will show you how to create a dynamic chart in matplotlib. In particular, I will create a racing bar chart to dynamically display the number of confirmed cases in each country as the days go by. At the end of this article, you will be able to see a chart like this:

Wrangling the Data
For a start, let’s use Jupyter Notebook to clean and filter all the data so that you have a clean dataset. Once the data is prepared, you will then be able to focus on creating the bar chart.
Importing the Packages
The first step is to import all the packages that you will be using for this project:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
Loading the Dataset
For this article, I will be using the latest dataset from https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset, so be sure to download it and save it locally on your machine. In my case, I have saved it in a folder named "Dataset - 20 July 2020", relative to the location of my Jupyter Notebook:
# load the dataset
dataset_path = "./Dataset - 20 July 2020/"
df_conf = pd.read_csv(dataset_path +
"time_series_covid_19_confirmed.csv")
df_conf
The df_conf DataFrame looks like this:

Sorting the DataFrame
With the DataFrame loaded, let’s sort the values by the 'Province/State' and 'Country/Region' columns:
# sort the df
df_conf = df_conf.sort_values(by=['Province/State','Country/Region'])
df_conf = df_conf.reset_index(drop=True)
df_conf
After the sorting, the df_conf DataFrame will look like this:

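As a minimal sketch of what the two-key sort does, the snippet below uses a tiny made-up DataFrame (the name toy and its values are assumptions for illustration, not rows from the real dataset). It shows that rows with a missing province are placed last by default, and that reset_index(drop=True) renumbers the rows afterwards.

```python
import pandas as pd

# A tiny stand-in for df_conf; the values are made up for illustration.
toy = pd.DataFrame({
    "Province/State": ["Hubei", None, "Beijing", None],
    "Country/Region": ["China", "Japan", "China", "Albania"],
    "1/22/20": [444, 2, 14, 0],
})

# Sort by province first, then by country. Rows with a missing
# province go last (the default na_position is "last"), and ties
# on the first key are broken by the second key.
toy_sorted = toy.sort_values(by=["Province/State", "Country/Region"])

# Renumber the rows 0..n-1 and discard the old index.
toy_sorted = toy_sorted.reset_index(drop=True)
print(toy_sorted)
```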
Unpivoting the DataFrame
The next step is to unpivot the DataFrame so that you have two new columns, Date and Confirmed:
# extract the dates columns
dates_conf = df_conf.columns[4:]
# perform unpivoting
df_conf_melted = df_conf.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates_conf,
    var_name='Date',
    value_name='Confirmed')
df_conf_melted
The unpivoted DataFrame (df_conf_melted) should now look like this:

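To make the unpivoting step concrete, here is a small sketch on a made-up two-country table (the names wide and long_df and the numbers are assumptions for illustration). Each date column becomes a set of rows, with the column name moved into Date and the cell value into Confirmed.

```python
import pandas as pd

# A wide table: one row per country, one column per date.
wide = pd.DataFrame({
    "Country/Region": ["China", "Japan"],
    "1/22/20": [548, 2],
    "1/23/20": [643, 2],
})

# Everything after the first column is a date column.
date_cols = wide.columns[1:]

# Unpivot: one row per (country, date) pair.
long_df = wide.melt(
    id_vars=["Country/Region"],
    value_vars=date_cols,
    var_name="Date",
    value_name="Confirmed")
print(long_df)
```

The result has four rows (two countries times two dates) and three columns.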
Converting the Date Column to Date Format
The next step is to convert the date strings stored in the Date column into date objects:
# convert the date column to date format
df_conf_melted["Date"] = df_conf_melted["Date"].apply(
lambda x: datetime.datetime.strptime(x, '%m/%d/%y').date())
df_conf_melted
The updated df_conf_melted DataFrame now looks like this:

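The format string matters here. The short sketch below parses a single sample string in the month/day/two-digit-year shape that '%m/%d/%y' expects (the sample value is an assumption for illustration) and shows that the four-digit-year directive %Y would reject it.

```python
import datetime

raw = "1/22/20"  # sample string: month/day/two-digit year

# %y parses a two-digit year; month and day need not be zero-padded.
parsed = datetime.datetime.strptime(raw, "%m/%d/%y").date()
print(parsed)

# %Y expects a four-digit year, so the same string is rejected.
try:
    datetime.datetime.strptime(raw, "%m/%d/%Y")
    four_digit_ok = True
except ValueError:
    four_digit_ok = False
print(four_digit_ok)
```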
Grouping By Date and Country/Region and Summing up the Confirmed Cases for Each Country
It is now time to sum up all the confirmed cases for each country (regardless of province/state). You can do it using the following statements:
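# group by date and country and then sum up based on country
# (numeric_only=True skips the text column Province/State, which
# recent versions of pandas would otherwise try to sum as well)
df_daily = df_conf_melted.groupby(["Date", "Country/Region"]).sum(numeric_only=True)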
df_daily
The df_daily DataFrame now shows you the daily confirmed cases for each country:

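Here is a small sketch of the grouping step on made-up rows (the name toy_long and all values are assumptions for illustration). Note that every numeric column is summed, including Lat, so the summed coordinates of a country with several provinces are not meaningful; only Confirmed is used later.

```python
import pandas as pd

# Three made-up rows: two provinces of one country, plus one
# country with no province breakdown.
toy_long = pd.DataFrame({
    "Date": ["1/22/20", "1/22/20", "1/22/20"],
    "Country/Region": ["China", "China", "Japan"],
    "Province/State": ["Hubei", "Beijing", None],
    "Lat": [30.0, 40.0, 36.0],
    "Confirmed": [444, 14, 2],
})

# Sum the numeric columns per (date, country) pair; the text
# column Province/State is left out by numeric_only=True.
daily = toy_long.groupby(["Date", "Country/Region"]).sum(numeric_only=True)
print(daily)
```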
Sorting the DataFrame by Date and Confirmed
We want to sort the df_daily DataFrame based on two keys:
- Date (ascending order)
- Confirmed (descending order)
The following statements accomplish that:
df_daily_sorted = df_daily.sort_values(['Date','Confirmed'],
ascending=[True, False])
df_daily_sorted
The df_daily_sorted DataFrame now looks like this:

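The following sketch shows the mixed-direction sort on a made-up frame (the name toy_daily and its values are assumptions for illustration). Date is an index level here, just as in df_daily, and sort_values() accepts index level names alongside column names.

```python
import pandas as pd

# Deliberately scrambled rows with a (Date, Country/Region) index.
index = pd.MultiIndex.from_tuples(
    [("2020-01-23", "Thailand"), ("2020-01-22", "Japan"),
     ("2020-01-23", "China"), ("2020-01-22", "China")],
    names=["Date", "Country/Region"])
toy_daily = pd.DataFrame({"Confirmed": [3, 2, 643, 548]}, index=index)

# Date ascending, then Confirmed descending within each date.
toy_daily_sorted = toy_daily.sort_values(
    ["Date", "Confirmed"], ascending=[True, False])
print(toy_daily_sorted)
```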
One final thing to do: let’s extract the countries as a list:
# get the list of all countries
all_countries = list(df_conf['Country/Region'])
This will be used in the next section.
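Because the original table has one row per province, a list built this way repeats a country once for every province it has. That is harmless if the list is only searched with index(), which returns the first match. If you would rather work with unique names, one possible approach (a sketch on a made-up list, not part of the original code) is to deduplicate while keeping the order:

```python
# A made-up list with repeats, standing in for all_countries.
names = ["Australia", "Australia", "China", "China", "Japan"]

# dict keys are unique and keep insertion order, so this removes
# repeats without reordering the remaining names.
unique_names = list(dict.fromkeys(names))
print(unique_names)

# index() returns the first match either way.
first_china = names.index("China")
print(first_china)
```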
Iterating through Each Day
Now that our df_daily_sorted DataFrame is sorted and formatted in the way we want, we can start to dive into it.
First, define the top n countries that we want to view:
top_n = 20 # view the top 20 countries
Let’s group the df_daily_sorted DataFrame by its first-level index (level 0, which is Date) and then iterate through it:
for date, daily_df in df_daily_sorted.groupby(level=0):
    print(date)
    print(daily_df) # a dataframe
You should see the following output:
2020-01-22
                                       Lat         Long  Confirmed
Date       Country/Region
2020-01-22 China               1085.292300  3688.937700        548
           Japan                 36.204824   138.252924          2
           Thailand              15.870032   100.992541          2
           Korea, South          35.907757   127.766922          1
           Taiwan*               23.700000   121.000000          1
           ...                         ...          ...        ...
           West Bank and Gaza    31.952200    35.233200          0
           Western Sahara        24.215500   -12.885800          0
           Yemen                 15.552727    48.516388          0
           Zambia               -13.133897    27.849332          0
           Zimbabwe             -19.015438    29.154857          0
[188 rows x 3 columns]
2020-01-23
                                       Lat         Long  Confirmed
Date       Country/Region
2020-01-23 China               1085.292300  3688.937700        643
           Thailand              15.870032   100.992541          3
           Japan                 36.204824   138.252924          2
           Vietnam               14.058324   108.277199          2
           Korea, South          35.907757   127.766922          1
           ...                         ...          ...        ...
The output above shows, for each date, the total number of confirmed cases for each country, sorted in descending order.
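As a compact illustration of iterating by index level, the sketch below groups a made-up two-day frame by level 0 and collects the country names for each date (toy_sorted and countries_by_date are assumed names). It reads the names with get_level_values(), which is one alternative to unpacking the index tuples by hand.

```python
import pandas as pd

# A made-up frame already sorted by date, then by cases.
index = pd.MultiIndex.from_tuples(
    [("2020-01-22", "China"), ("2020-01-22", "Japan"),
     ("2020-01-23", "China"), ("2020-01-23", "Thailand")],
    names=["Date", "Country/Region"])
toy_sorted = pd.DataFrame({"Confirmed": [548, 2, 643, 3]}, index=index)

# Each iteration yields the date and the sub-frame for that date;
# the rows keep the order they had in the sorted frame.
countries_by_date = {}
for date, daily_df in toy_sorted.groupby(level=0):
    names = daily_df.index.get_level_values("Country/Region")
    countries_by_date[date] = list(names)
print(countries_by_date)
```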
For each day, extract the top n countries and their number of confirmed cases:
for date, daily_df in df_daily_sorted.groupby(level=0):
    print(date)
    # print(daily_df) # a dataframe
    topn_df = daily_df.head(top_n)
    # get all the countries from the multi-index of the dataframe
    countries = list(map(lambda x:(x[1]),topn_df.index))[::-1]
    confirmed = list(topn_df.Confirmed)[::-1]
    print(countries)
    print(confirmed)
Notice that I have reversed the order of the countries and confirmed lists using [::-1]. This ensures that the countries with the most cases are listed last, so that later on, when you plot a horizontal bar chart using the plt.barh() method, the countries with the most cases appear at the top. This is because the plt.barh() method plots the horizontal bars from the bottom up.
You will see something like this:
2020-01-22
['Bangladesh', 'Bahrain', 'Bahamas', 'Azerbaijan', 'Austria', 'Australia', 'Armenia', 'Argentina', 'Antigua and Barbuda', 'Angola', 'Andorra', 'Algeria', 'Albania', 'Afghanistan', 'US', 'Taiwan*', 'Korea, South', 'Thailand', 'Japan', 'China']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 548]
2020-01-23
['Bahamas', 'Azerbaijan', 'Austria', 'Australia', 'Armenia', 'Argentina', 'Antigua and Barbuda', 'Angola', 'Andorra', 'Algeria', 'Albania', 'Afghanistan', 'US', 'Taiwan*', 'Singapore', 'Korea, South', 'Vietnam', 'Japan', 'Thailand', 'China']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 643]
2020-01-24
['Azerbaijan', 'Austria', 'Australia', 'Armenia', 'Argentina', 'Antigua and Barbuda', 'Angola', 'Andorra', 'Algeria', 'Albania', 'Afghanistan', 'Vietnam', 'US', 'Korea, South', 'Japan', 'France', 'Taiwan*', 'Singapore', 'Thailand', 'China']
...
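The effect of the reversal can be checked without plotting anything. In this sketch (made-up values, assumed variable names), the position of each name in the reversed list stands in for the vertical slot that a horizontal bar chart would give it, with slot 0 at the bottom.

```python
# Most cases first, as produced by the descending sort.
countries = ["China", "Japan", "Thailand"]
confirmed = [548, 2, 2]

# A step of -1 walks the list backwards; a step of 1 would
# simply copy it in the same order.
countries_rev = countries[::-1]
confirmed_rev = confirmed[::-1]
same_order = countries[::1]

# Horizontal bars are stacked from the bottom up, so the last
# item ends up in the highest slot.
slots = {name: y for y, name in enumerate(countries_rev)}
print(countries_rev, confirmed_rev, slots)
```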
At this juncture, you have all the data required to plot a bar chart displaying the number of confirmed cases for each country on each day.
Plotting the Bar Charts
You are now ready to display a racing bar chart to show the number of confirmed cases for each country. Jupyter Notebook's default inline backend renders charts as static images, so let’s write the following code and save it as a file named RacingBarChart.py (save it in the same directory where you launched Jupyter Notebook). The first part repeats the data preparation steps from the notebook; the part from plt.style.use() onwards plots the racing bar chart:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
# load the dataset
dataset_path = "./Dataset - 20 July 2020/"
df_conf = pd.read_csv(dataset_path +
"time_series_covid_19_confirmed.csv")
# sort the dfs
df_conf = df_conf.sort_values(by=
['Province/State','Country/Region'])
df_conf = df_conf.reset_index(drop=True)
# extract the dates columns
dates_conf = df_conf.columns[4:]
# perform unpivoting
df_conf_melted = df_conf.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates_conf,
    var_name='Date',
    value_name='Confirmed')
# convert the date column to date format
df_conf_melted["Date"] = df_conf_melted["Date"].apply(
lambda x: datetime.datetime.strptime(x, '%m/%d/%y').date())
# group by date and country and then sum up based on country
df_daily = df_conf_melted.groupby(["Date", "Country/Region"]).sum(numeric_only=True)
df_daily_sorted = df_daily.sort_values(['Date','Confirmed'],
ascending=[True, False])
# get the list of all countries
all_countries = list(df_conf['Country/Region'])
# plotting using the seaborn-darkgrid style
# (Matplotlib 3.6 and later name this style 'seaborn-v0_8-darkgrid')
plt.style.use('seaborn-darkgrid')
# set the size of the chart
fig, ax = plt.subplots(1, 1, figsize=(14,10))
# hide the y-axis labels
ax.yaxis.set_visible(False)
# assign a color to each country
NUM_COLORS = len(all_countries)
cm = plt.get_cmap('Set3')
colors = np.array([cm(1.*i/NUM_COLORS) for i in range(NUM_COLORS)])
top_n = 20
for date, daily_df in df_daily_sorted.groupby(level=0):
    # print(date)
    # print(daily_df) # a dataframe
    topn_df = daily_df.head(top_n)
    # get all the countries from the multi-index of the dataframe
    countries = list(map(lambda x:(x[1]),topn_df.index))[::-1]
    confirmed = list(topn_df.Confirmed)[::-1]
    # clear the axes so that countries no longer in the top n will not
    # be displayed
    ax.clear()
    # plot the horizontal bars
    plt.barh(
        countries,
        confirmed,
        color = colors[[all_countries.index(n) for n in countries]],
        edgecolor = "black",
        label = "Total Number of Confirmed Cases")
    # display the labels on the bars
    for index, rect in enumerate(ax.patches):
        x_value = rect.get_width()
        y_value = rect.get_y() + rect.get_height() / 2
        # display the country
        ax.text(x_value, y_value, f'{countries[index]} ',
                ha="right", va="bottom",
                color="black", fontweight='bold')
        # display the number
        ax.text(x_value, y_value, f'{confirmed[index]:,} ',
                ha="right", va="top",
                color="black")
    # display the title
    plt.title(f"Top {top_n} Countries with Covid-19 ({date})",
              fontweight="bold",
              fontname="Impact",
              fontsize=25)
    # display the x-axis label
    plt.xlabel("Number of people")
    # draw the data and run the GUI event loop
    plt.pause(0.5)
# keep the matplotlib window
plt.show(block=True)
To run the application, type the following in your Terminal/Anaconda Prompt:
$ python RacingBarChart.py
You should now see the Racing Bar Chart:

If you do not want to display so many countries, adjust the value of the top_n variable:
top_n = 10
And change the size of the figure:
fig, ax = plt.subplots(1, 1, figsize=(10,6))

If you want to speed up the race, shorten the duration passed to the pause() function:
# draw the data and run the GUI event loop
plt.pause(0.2)
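The script above drives the animation with a plain loop and plt.pause(). A different technique, not used in this article, is Matplotlib's FuncAnimation class, which calls a drawing function once per frame and can also write the result to a file. The following is only a rough sketch under stated assumptions: the frames list holds made-up data for two days, draw_frame is a hypothetical helper, and the Agg backend is selected so that the sketch runs without a display (remove that line and call plt.show() to see a window).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove to show a window
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Made-up frames: (date, countries, confirmed), least cases first
# so that the largest bar is drawn at the top.
frames = [
    ("2020-01-22", ["Japan", "China"], [2, 548]),
    ("2020-01-23", ["Thailand", "China"], [3, 643]),
]

fig, ax = plt.subplots(figsize=(10, 6))

def draw_frame(i):
    """Redraw the bars for frame number i (hypothetical helper)."""
    date, countries, confirmed = frames[i]
    ax.clear()
    bars = ax.barh(countries, confirmed, edgecolor="black")
    ax.set_title(f"Confirmed cases ({date})")
    return bars

# interval is the delay between frames in milliseconds.
anim = FuncAnimation(fig, draw_frame, frames=len(frames),
                     interval=500, repeat=False)

# Saving to a GIF should work if Pillow is installed:
# anim.save("race.gif", writer="pillow")
```

Changing interval plays the same role here as changing the argument of plt.pause() in the loop-based version.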
Summary
I hope you get the chance to try out the racing bar chart. A racing bar chart presents information dynamically and is a powerful way to get your points across. Let me know if you make any enhancements; I would love to hear from you! See you next time!