A Single Python Function Generating A Gorgeous Bar Chart Race Video

A step-by-step tutorial from a raw dataset to a bar chart race video

Photo by Shutterbug75 on Pixabay

Recently, a very fancy type of data visualisation has taken off on platforms such as YouTube. Whenever there are some "entities" with a single measure that changes over time, this type of visualisation is a great way to illustrate how the ranking of those entities shifts.

That is the Bar Chart Race, which is usually presented as a video. I generated the video below using a single Python function. Have a look at it, and you will know what I’m talking about.

Would you believe the video above was generated using a single Python function and a few lines of code? I’ll show you how.

TL;DR

The article is not too short, because I want to provide something that can be followed step by step. I would say that 95% of the time here is spent on data preprocessing. If you really want to skip straight to the point, please go to the section "Generate Bar Chart Race Video" for the code that generates the video 🙂

Data Preparation

Photo by TiBine on Pixabay

To make sure this tutorial is reproducible, I will provide not just the code for generating the bar chart race video, but also how to get the dataset and pre-process it, so that the prepared dataset can be fed straight into the bar chart race library in Python.

Get the Raw Dataset

The dataset I used is published by the European Centre for Disease Prevention and Control and is publicly available. Anyone can download the COVID-19 dataset for free. I also used this dataset in one of my previous articles.

3 Lines of Python Code to Create An Interactive Playable COVID-19 Bubble Map

The official website is here:

https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide

The webpage above has a link for downloading the dataset in CSV format. In fact, we don’t have to download the dataset to our local machine, because pandas can read a remote CSV file directly, as follows.

import pandas as pd

df = pd.read_csv('https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/data.csv')

Then, we can have a look at which columns the dataset has and what the values look like by calling df.head().

If you have watched the video above, you might be aware that I used the number of confirmed cases as the measure. So, let’s start cleansing and transforming the data.
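For trying the steps in this article offline, a small stand-in for the downloaded file can be built in memory. The sketch below is only an illustration: the rows and numbers are made up, the name sample_df is hypothetical, and only the three columns used later (dateRep, countriesAndTerritories, cases) are included, whereas the real file has more.

```python
import io

import pandas as pd

# Made-up sample rows (NOT real figures), mimicking the three columns
# that this article uses from the ECDC file.
csv_text = """dateRep,countriesAndTerritories,cases
02/03/2020,Country_A,5
01/03/2020,Country_A,2
02/03/2020,Country_B,3
01/03/2020,Country_B,1
"""

# read_csv accepts a file-like object as well as a URL or a path.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.head())   # first rows of the sample
print(sample_df.dtypes)   # dateRep is read as text, not as dates
```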

Data Transformation

In the data transformation, we need to reshape the data into something we can use. To generate the bar chart race video, we are going to use the Python library called "bar-chart-race" (introduced later, in the "Generate Bar Chart Race Video" section). It expects a data frame with datetime objects as the index and the entity names as the column names. So, let’s transform the data together.

1. Remove unneeded columns

Since we are only interested in the date, the country and the number of cases, we can filter the data frame down to these three columns as follows.
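To make the target shape concrete, here is a minimal sketch of the "wide" layout described above: one row per date, one column per entity. The values and the name target are made up purely for illustration.

```python
import pandas as pd

# Hypothetical values, only to show the layout we are aiming for:
# dates as the index, one column per country, cumulative totals as values.
target = pd.DataFrame(
    {'Country A': [2, 7], 'Country B': [1, 4]},
    index=pd.to_datetime(['2020-03-01', '2020-03-02']),
)

print(target)
```

Every step in the rest of this section moves the raw, one-row-per-country-per-day data towards this layout.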

# .copy() avoids a SettingWithCopyWarning when we add new columns later
df = df[['dateRep', 'countriesAndTerritories', 'cases']].copy()

2. Convert date string to datetime object

If we check the data types of the columns, we will find that dateRep is of type object, which means its values are treated as strings rather than datetime objects. This causes trouble because sorting a column of date strings gives alphabetical order, not chronological order.

Therefore, we need to convert it into datetime type.

df['date'] = pd.to_datetime(df['dateRep'], dayfirst=True)

After this, let’s check the dtypes again.
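To see why the conversion matters, the short sketch below sorts the same three dates twice: once as day-first strings and once as parsed datetimes. The dates and variable names are made up for the illustration.

```python
import pandas as pd

# Day-first date strings, in the same DD/MM/YYYY style as dateRep
date_strings = pd.Series(['01/03/2020', '15/01/2020', '02/02/2020'])

# Sorting the raw strings compares them character by character
string_order = date_strings.sort_values().tolist()

# Parsing first gives real dates, which sort chronologically
parsed = pd.to_datetime(date_strings, dayfirst=True)
date_order = parsed.sort_values().dt.strftime('%d/%m/%Y').tolist()

print(parsed.dtype)    # a datetime64 dtype instead of text
print(string_order)    # starts with 01/03/2020 (1 March), which is wrong
print(date_order)      # starts with 15/01/2020 (15 January)
```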

3. Add cumulative cases

You may have already noticed that the cases column holds the number of new cases reported on each date. However, we need the cumulative total for each day so that each day’s row can be rendered as a frame in the video.

Therefore, let’s generate a column called total_cases, which holds the total number of cases up to and including that day.

# Select the cases column explicitly; the result aligns back to df by index
df['total_cases'] = df.sort_values('date').groupby('countriesAndTerritories')['cases'].cumsum()
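As a small worked example of the running total, the sketch below applies a per-country cumulative sum to four made-up rows that are deliberately out of date order. It selects the cases column explicitly before calling cumsum, which is one way to make sure only that column is summed; the frame name sample is hypothetical.

```python
import pandas as pd

# Made-up rows, deliberately not in chronological order
sample = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-02', '2020-03-01', '2020-03-01', '2020-03-02']),
    'countriesAndTerritories': ['A', 'A', 'B', 'B'],
    'cases': [5, 2, 1, 3],
})

# Sort by date so the running total accumulates chronologically,
# then sum within each country. The result keeps the original row
# labels, so assigning it back lines each total up with its own row.
sample['total_cases'] = (
    sample.sort_values('date')
          .groupby('countriesAndTerritories')['cases']
          .cumsum()
)

print(sample)
```

Country A goes from 2 to 7 and country B from 1 to 4, regardless of the order in which the rows were stored.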

4. Remove the unneeded columns again and rename the columns

Since we have generated the new date and total_cases columns, we don’t need the dateRep and cases columns. Also, let’s rename the column of country name to keep it simple.

df = df[['date', 'countriesAndTerritories', 'total_cases']]
df = df.rename(columns={'countriesAndTerritories': 'country'})

5. Pivot the table

Remember that we need to use the country names as the column names? So, we need to pivot the table. This is very easy to do with a pandas data frame.

df = pd.pivot_table(df, index=['date'], columns=['country'], values=['total_cases'])

We use date as the index, country as the columns and total_cases as the values. Then, we will get the pivot table as follows.

The multi-level column index generated by the pivot_table function is ugly and can potentially cause problems. So, let’s fix it.

df.index.name = None
df.columns = [col[1] for col in df.columns]
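The pivot and the flattening step can be sketched on a tiny made-up frame as follows. The sketch also shows an alternative that is not used in this article: passing values as a plain string instead of a list, in which case pivot_table returns single-level columns and no flattening is needed. The names long_df, wide and wide_alt are hypothetical.

```python
import pandas as pd

# Made-up long-format data: one row per country per day
long_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-03-02', '2020-03-02']),
    'country': ['A', 'B', 'A', 'B'],
    'total_cases': [2, 1, 7, 4],
})

# Same call as in the article: values given as a list -> two column levels
wide = pd.pivot_table(long_df, index=['date'], columns=['country'], values=['total_cases'])
levels_before = wide.columns.nlevels

# Flatten: keep only the country part of each (value, country) tuple
wide.index.name = None
wide.columns = [col[1] for col in wide.columns]

# Alternative: values given as a string -> columns are single-level already
wide_alt = long_df.pivot_table(index='date', columns='country', values='total_cases')

print(levels_before, list(wide.columns), wide_alt.columns.nlevels)
```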

We finished transforming the raw dataset. However, we still need to cleanse the data.

Data Cleansing

Photo by webandi on Pixabay

Have you seen that we still have a lot of NaN values in the data frame? A NaN appears wherever a country has no record for a given date, usually because no cases had been reported there yet, which makes sense. However, these values may cause problems later on.

1. Fill the NaN values with Zero

Since the NaN values mean there are no confirmed cases for the country, it is safe to fill all the NaN values with zeros.

df = df.fillna(0).astype(int)

Please be aware that we also change the data type to integer here, because it doesn’t make sense to have floating-point numbers for case counts.
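One caveat worth checking on your own copy of the data: if a country has a missing date in the middle of its series, filling with zero makes its cumulative total drop to zero for that day. The sketch below, on made-up numbers, contrasts the zero fill with a variant that first carries the last known total forward. The forward fill is an alternative offered here as an assumption about what you may want, not the method used in this article.

```python
import numpy as np
import pandas as pd

# Made-up cumulative totals: country A has a gap on the second day,
# country B has not reported anything before the second day.
wide = pd.DataFrame(
    {'A': [1, np.nan, 3], 'B': [np.nan, 2, 2]},
    index=pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03']),
)

# The article's approach: every NaN becomes 0
zero_filled = wide.fillna(0).astype(int)

# Alternative: carry the previous total forward, then zero the leading NaNs
carried = wide.ffill().fillna(0).astype(int)

print(zero_filled)
print(carried)
```

In the zero-filled version country A dips from 1 to 0 and back to 3, while the carried-forward version holds it at 1 until the next report.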

2. Drop an exceptional column

Among all the countries that are now columns in our data frame, there is a column called Cases_on_an_international_conveyance_Japan, which is not a country. It refers to the Diamond Princess cruise ship. Because it is not a country and its passengers came from various countries, I would like to remove it from our statistics in this case.

df = df.drop(columns=['Cases_on_an_international_conveyance_Japan'])

3. Remove the underscores in the country names

The country names (our column names) have underscores between the words. I would like to replace these underscores with spaces so that they look prettier as the bar chart labels.

df.columns = [col.replace('_', ' ') for col in df.columns]

4. Remove countries that never ranked in the top 10

This step is optional. However, it will improve performance when we generate the video. We have 200+ countries in total in the dataset, but not every one of them has ever ranked in the top 10. Since we are going to create a top 10 bar chart race, those countries will never be displayed.

We can remove these countries as follows.

country_reserved = set()
for index, row in df.iterrows():
    country_reserved |= set(row[row > 0].sort_values(ascending=False).head(10).index)
df = df[list(country_reserved)]

Here, we first create a set called country_reserved. We use a set because it simply ignores duplicated values. Then, we iterate over each row of the data frame and add the country names to the set if their total cases rank in the top 10 on that day.

Finally, we have the set of countries that have ever ranked in the top 10. Converting the set into a list and using it to select columns produces a much smaller data frame, which we will use later.

Now we only have 30 countries left.
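The same filter can also be written without an explicit loop. The sketch below is an alternative, not the code used in this article: it ranks every row at once with DataFrame.rank and keeps the columns that were ever ranked in the top n with a value above zero. The helper name ever_in_top_n is hypothetical, and ties are broken by column order here, which may differ from the loop when several countries share the same total.

```python
import pandas as pd

def ever_in_top_n(wide, n=10):
    """Keep only the columns that ranked in the top n on at least one row."""
    # Rank within each row, largest value first; ties go to the earlier column
    ranks = wide.rank(axis=1, ascending=False, method='first')
    # A country counts only if it is in the top n AND has at least one case
    in_top = (ranks <= n) & (wide > 0)
    keep = in_top.any(axis=0)
    return wide.loc[:, keep]

# Made-up totals; with n=2, country D never makes the top 2
demo = pd.DataFrame({
    'A': [0, 5, 9],
    'B': [1, 4, 2],
    'C': [0, 1, 3],
    'D': [0, 0, 1],
})
filtered = ever_in_top_n(demo, n=2)
print(list(filtered.columns))
```

Unlike the set-based version, this keeps the surviving columns in their original order.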

Generate Bar Chart Race Video

Photo by StartupStockPhotos on Pixabay

OK. We have spent a lot of time wrangling the dataset. In fact, generating the video is extremely easy, as the title of this article promises.

We just need to install the library called bar-chart-race. pip will do that in a minute.

pip install bar-chart-race

Import the library.

import bar_chart_race as bcr

Simply call the bar_chart_race function as follows.

bcr.bar_chart_race(
    df=df,
    filename='/content/drive/My Drive/covid-19.mp4',
    n_bars=10,
    period_fmt='%B %d, %Y',
    title='COVID-19 Confirmed Cases by Country'
)
  • df is the data frame we have prepared
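  • filename is the output file name of the video. The path in the example points to a Google Drive folder mounted in Google Colab, so change it to suit your own environment.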
  • n_bars is the number of bars we want to show in the bar chart, which can also be considered as the "top n" we want to show.
  • period_fmt is the date format we want to show on the bar chart.
  • title is just the title of the bar chart.

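After a while, the video will be generated at the path and with the name that we have specified. Note that the library relies on ffmpeg to write MP4 files, so ffmpeg needs to be installed on your machine.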
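Because rendering can take a while, it may be worth running a quick sanity check on the data frame first. The sketch below is a hypothetical helper, not part of the bar-chart-race library: it checks for the shape this article has been building (datetime index in ascending order, single-level columns, no NaN values) and separately looks for an ffmpeg executable on the PATH.

```python
import shutil

import pandas as pd

def find_problems(wide):
    """Return a list of readable issues; an empty list means none were found."""
    problems = []
    if not isinstance(wide.index, pd.DatetimeIndex):
        problems.append('index is not a DatetimeIndex')
    elif not wide.index.is_monotonic_increasing:
        problems.append('dates are not in ascending order')
    if isinstance(wide.columns, pd.MultiIndex):
        problems.append('columns are still multi-level')
    if wide.isna().any().any():
        problems.append('data frame still contains NaN values')
    return problems

def ffmpeg_available():
    """True if an ffmpeg executable can be found on the PATH."""
    return shutil.which('ffmpeg') is not None

# Made-up frames: one in the expected shape, one with two issues
good = pd.DataFrame(
    {'Country A': [2, 7]},
    index=pd.to_datetime(['2020-03-01', '2020-03-02']),
)
bad = pd.DataFrame({'Country A': [2.0, None]}, index=['01/03/2020', '02/03/2020'])

print(find_problems(good))
print(find_problems(bad))
print(ffmpeg_available())
```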

Summary

Photo by Bokskapet on Pixabay

Hooray! Another amazing Python library!

In this article, I have introduced yet another amazing Python library "Bar Chart Race". With a single function and a few lines of code, we can generate a gorgeous bar chart race video in MP4 format.

It turns out there are more configurable options available in the bar chart race library. I highly recommend trying them out yourself. The link to the official documentation is given in the references!

References

European Centre for Disease Prevention and Control

Download the daily number of new reported cases of COVID-19 by country worldwide

Bar Chart Race Documentation

Bar Chart Race

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of Data Science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

