Hands-on Tutorial
Data pre-processing
As discussed in the previous story, plotnine can be used as an alternative data visualization package in Python. In this tutorial, we continue the data visualization journey by focusing on one specific plot, namely the bar plot. First, download the National Olympic Committee (NOC) data from my previous story.
Introduction to Plotnine as the Alternative of Data Visualization Package in Python
# Dataframe manipulation
import pandas as pd
# Linear algebra
import numpy as np
# Data visualization with plotnine
from plotnine import *
import plotnine
# Set the figure size of plotnine
plotnine.options.figure_size = (6.4,4.8)
The athlete events dataset has 271,116 records (rows) and 15 columns (attributes).
# Import the athlete data
athlete_data = pd.read_csv('datasets/athlete_events.csv')
# Print the dimension of the data
print('Dimension of athlete data:\n{}'.format(len(athlete_data)),
      'rows and {}'.format(len(athlete_data.columns)),'columns')
athlete_data.head()

The NOC regions dataset has 230 records (rows) and 3 columns (attributes). A NaN in the notes column means that no note was recorded.
# Import the region data
regions_data = pd.read_csv('datasets/noc_regions.csv')
# Print the dimension of the data
print('Dimension of region data:\n{}'.format(len(regions_data)),
      'rows and {}'.format(len(regions_data.columns)),'columns')
regions_data.head()

We perform a left join with the athlete events data as the left table and the NOC column as the key to join on. It produces the full data with 271,116 rows and 17 columns.
# Conduct left join between athlete and region data
full_data = athlete_data.merge(regions_data,on='NOC',how='left')
# Print the dimension of the data
print('Dimension of full data:\n{}'.format(len(full_data)),
      'rows and {}'.format(len(full_data.columns)),'columns')
full_data.head()
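To see why the row count stays the same after the join, here is a minimal sketch on toy data. The tables, names and values below are made up for illustration and are not the real Olympic files; the indicator and validate arguments are optional extras of merge() that help check the result.

```python
import pandas as pd

# Toy stand-ins for the athlete and region tables (made-up values)
athletes = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'NOC': ['USA', 'URS', 'XXX'],   # 'XXX' has no match in the region table
})
regions = pd.DataFrame({
    'NOC': ['USA', 'URS'],
    'region': ['USA', 'Russia'],
    'notes': [None, None],
})

# A left join keeps every row of the left table; unmatched keys get NaN.
# validate='many_to_one' raises an error if a NOC is duplicated in regions.
joined = athletes.merge(regions, on='NOC', how='left',
                        indicator=True, validate='many_to_one')

# 2 + 3 columns, minus the shared key, plus the '_merge' indicator
print(joined.shape)
print(joined['_merge'].value_counts())
```

Rows marked left_only in the _merge column are the athletes whose NOC code is missing from the region table.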

# Create cross tabulation
medal_noc = pd.crosstab([full_data['Year'], full_data['NOC']], full_data['Medal'], margins = True).reset_index()
# Remove index name
medal_noc.columns.name = None
# Remove last row for total column attribute
medal_noc = medal_noc.drop([medal_noc.shape[0] - 1], axis = 0)
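The cross tabulation above can be tried on a small table first. This is only a sketch on made-up rows, but it shows where the All column and the final total row come from, and why the last row is removed.

```python
import pandas as pd

# Made-up medal rows: one row per medal won
toy = pd.DataFrame({
    'Year':  [1896, 1896, 1896, 1900, 1900],
    'NOC':   ['GRE', 'GRE', 'USA', 'FRA', 'USA'],
    'Medal': ['Gold', 'Silver', 'Gold', 'Gold', 'Bronze'],
})

# margins=True adds an 'All' column (row totals) and a final 'All' row (grand totals)
table = pd.crosstab([toy['Year'], toy['NOC']], toy['Medal'],
                    margins=True).reset_index()
table.columns.name = None

# The last row holds the grand totals, not a (Year, NOC) pair, so drop it
per_country = table.iloc[:-1]
print(per_country)
```

Here iloc[:-1] is used instead of drop(); both remove the last row once the index has been reset.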

Count the medals acquired by each country and find the country that was the overall winner in each year.
# Data aggregation - group by
medal_noc_year = medal_noc.loc[medal_noc.groupby('Year')['All'].idxmax()].sort_values('Year')
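The groupby and idxmax combination is compact, so a small sketch may help. The medal totals below are invented for illustration; idxmax() returns, for each year, the row label of the largest All value, and loc then picks those rows.

```python
import pandas as pd

# Invented totals per (Year, NOC)
medals = pd.DataFrame({
    'Year': [1896, 1896, 1900, 1900],
    'NOC':  ['GRE', 'USA', 'FRA', 'USA'],
    'All':  [48, 20, 235, 64],
})

# Row label of the maximum 'All' within each year
winner_rows = medals.groupby('Year')['All'].idxmax()

# Select those rows: one winner per year
winners = medals.loc[winner_rows].sort_values('Year')
print(winners)
```

If two countries tie in a year, idxmax() keeps only the first one it meets.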

Finally, we can count how many times each country has been the overall winner since 1896. This table is the input for the data viz in this tutorial, the bar plot.
# Count how many times each country was the overall winner
medal_noc_count = pd.DataFrame(medal_noc_year['NOC'].value_counts()).reset_index()
medal_noc_count.columns = ['NOC','Count']
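An alternative way to build the same two-column table, sketched on a made-up list of winners, is to name the columns while resetting the index. The advantage is that it does not depend on what pandas calls the columns by default.

```python
import pandas as pd

# Made-up list of yearly winners
winner_codes = pd.Series(['USA', 'URS', 'USA', 'GER', 'USA', 'URS'], name='NOC')

# value_counts() sorts by frequency in decreasing order
counts = (
    winner_codes.value_counts()
    .rename_axis('NOC')            # name of the category column
    .reset_index(name='Count')     # name of the frequency column
)
print(counts)
```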

Hands-on with the Plotnine
The bar plot is one of the most common types of data visualization. It represents categorical data as bars. On another occasion, you can also try to build visualizations with other types of plots.
1 Basic bar plot
This will be the baseline of our bar plot. We create a basic bar plot without any modification, theme, legend description, color manipulation, etc. We only need the ggplot() function to create the first (background) layer and geom_bar() to draw our data as bars. The horizontal axis (X) represents the categories, in this case the NOC, and the vertical axis (Y) represents their frequencies.
For geom_bar(), the default behaviour is to count the rows for each X value. It doesn’t expect a Y value, since it is going to count the rows itself; in fact, it will complain if we give it one, since it thinks we’re confused. How the aggregation is performed is specified by the stat argument of geom_bar(), whose default value is stat='count'. Because our data is already aggregated in the Count column, we pass stat='identity' so that the bar heights are taken directly from Y.
# Create a basic bar plot
(
ggplot(data=medal_noc_count)+
geom_bar(aes(x='NOC',
y='Count'),
stat='identity')
)

2 Add plot and axis title
In the previous section, our plot was created successfully, but it does not give the audience enough information. If we present it at a conference or in a journal, it may be misleading. What does the plot tell us? Information such as a plot title must be included. The labs() function has an argument title that adds a title to our plot. Then, we can modify both the horizontal and vertical axis labels using xlab() and ylab().
Furthermore, to manipulate the font style, just add another function, theme(), to our main script. Font settings such as font size and font style are controlled through the following arguments, each of which takes an element_text().
text – a setting for all text elements
axis_title – a setting for both the horizontal and vertical axis titles
axis_text – a setting for the axis labels
plot_title – a setting for the plot title
Note: the theme() function has many more arguments. To read more about them, see the plotnine documentation for theme().
# Add plot and axis title
(
ggplot(data=medal_noc_count)+
geom_bar(aes(x='NOC',
y='Count'),
stat='identity')+
labs(title='Top ten countries that won the most Olympic medals')+
xlab('Country')+
ylab('Frequency')+
theme(text=element_text(family='DejaVu Sans',
size=10),
axis_title=element_text(face='bold'),
axis_text=element_text(face='italic'),
plot_title=element_text(face='bold',
size=12))
)

3 Order the categories decreasingly
The main objective of data viz is to summarize and represent the data. Data visualization has both an art and a science aspect, so we must also consider aesthetics. In artwork, there are many aspects to consider, such as balance. The same applies to data viz, where it helps people understand the plot at a quick look.
To reorder the horizontal labels, which are categorical data, we can use the scale_x_discrete() function. Its limits argument takes a list that determines the order of the labels.
# Input
medal_noc_count['NOC'].tolist()
# Output
# ['USA','URS','CAN','GER','GRE','SWE','EUN','FIN','GBR','FRA']
Now, let’s create the bar plot with ordered labels! It looks more beautiful and is easier to understand, right?
# Order the categories decreasingly
(
ggplot(data=medal_noc_count)+
geom_bar(aes(x='NOC',
y='Count'),
stat='identity')+
labs(title='Top ten countries that won the most Olympic medals')+
xlab('Country')+
ylab('Frequency')+
scale_x_discrete(limits=medal_noc_count['NOC'].tolist())+
theme(text=element_text(family='DejaVu Sans',
size=10),
axis_title=element_text(face='bold'),
axis_text=element_text(face='italic'),
plot_title=element_text(face='bold',
size=12))
)
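The list passed to limits is already in decreasing order here because value_counts() sorts its result. If the table were not sorted, the list could be built explicitly, as in this small sketch on made-up counts.

```python
import pandas as pd

# Made-up counts in arbitrary order
counts = pd.DataFrame({'NOC': ['GER', 'USA', 'URS'], 'Count': [1, 3, 2]})

# Order the category labels by their counts, largest first
limits_desc = counts.sort_values('Count', ascending=False)['NOC'].tolist()
print(limits_desc)
```

The resulting list can then be passed as scale_x_discrete(limits=limits_desc).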

4 Fill the bar with a certain color
Since data viz is partly art and also aims to help people understand the data, color is an important aspect to consider.
Plotnine provides an argument fill to modify the plot’s color. Colors and fills can be specified in the following ways:
- A name, for instance: 'red', 'blue', 'black', 'white', 'green', etc.
- An RGB specification, for instance: '#c22d6d'
Note: to work with color in plotnine, there are three arguments to learn: colour, fill, and alpha.
# Fill the bar with a certain colour
(
ggplot(data=medal_noc_count)+
geom_bar(aes(x='NOC',
y='Count'),
fill='#c22d6d',
stat='identity')+
labs(title='Top ten countries that won the most Olympic medals')+
xlab('Country')+
ylab('Frequency')+
scale_x_discrete(limits=medal_noc_count['NOC'].tolist())+
theme(text=element_text(family='DejaVu Sans',
size=10),
axis_title=element_text(face='bold'),
axis_text=element_text(face='italic'),
plot_title=element_text(face='bold',
size=12))
)

5 Highlight the interesting category
As mentioned in the previous section, color is key to data viz. It helps us attract the audience’s attention and guide our data story. Color is often used to highlight the important part of a plot. So, we will recolor the plot so that only the highlighted category stands out, in this case URS, which is our focus right now.
The fill argument will be split into two parts, URS and the others. URS will be '#c22d6d' and the others will be '#80797c'. The np.where() function assigns the colors based on this criterion.
# Highlight the interesting category
(
ggplot(data=medal_noc_count)+
geom_bar(aes(x='NOC',
y='Count'),
fill=np.where(medal_noc_count['NOC'] == 'URS','#c22d6d','#80797c'),
stat='identity')+
labs(title='Top ten countries that won the most Olympic medals')+
xlab('Country')+
ylab('Frequency')+
scale_x_discrete(limits=medal_noc_count['NOC'].tolist())+
theme(text=element_text(family='DejaVu Sans',
size=10),
axis_title=element_text(face='bold'),
axis_text=element_text(face='italic'),
plot_title=element_text(face='bold',
size=12))
)
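To see what np.where() hands to the fill argument, here is a minimal sketch on three made-up country codes. It returns one color per row, in the same order as the rows of the data.

```python
import numpy as np
import pandas as pd

# Made-up country codes in the order they appear in the data
noc = pd.Series(['USA', 'URS', 'CAN'])

# One color per row: the highlight color for URS, grey for the rest
colors = np.where(noc == 'URS', '#c22d6d', '#80797c')
print(colors)
```

Because the colors are matched to the rows by position, they should be computed from the same data frame, in the same order, that is passed to ggplot().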

6 Change the plot theme
Plotnine provides us with a lot of themes. Which theme to use for our data viz is up to us; each theme has both advantages and disadvantages. Switching from one to another is very simple: we only need to write one line of code.
The grammar of graphics themes:
theme_gray
theme_bw
theme_linedraw
theme_light
theme_dark
theme_minimal
theme_classic
theme_void


Let’s create a data viz with the theme_minimal() function!
# Change the plot theme according to the needs
(
ggplot(data=medal_noc_count)+
geom_bar(aes(x='NOC',
y='Count'),
fill=np.where(medal_noc_count['NOC'] == 'URS','#c22d6d','#80797c'),
stat='identity')+
labs(title='Top ten countries that won the most Olympic medals')+
xlab('Country')+
ylab('Frequency')+
scale_x_discrete(limits=medal_noc_count['NOC'].tolist())+
theme_minimal()+
theme(text=element_text(family='DejaVu Sans',
size=10),
axis_title=element_text(face='bold'),
axis_text=element_text(face='italic'),
plot_title=element_text(face='bold',
size=12))
)

7 Annotate each category
To annotate the categories, there are several options to try, such as geom_text() or geom_label(). We will demonstrate both in this tutorial and try to find out the differences. geom_text() adds text directly to the plot, while geom_label() draws a rectangle underneath the text, making it easier to read. Both geom_text() and geom_label() accept the following arguments.
x: the position of the labels on the X-axis
y: the position of the labels on the Y-axis
label: the text used to annotate the categories
size: the font size of the labels
nudge_x: horizontal adjustment to nudge the labels
nudge_y: vertical adjustment to nudge the labels
# Annotate each category using geom_text
(
ggplot(data=medal_noc_count)+
geom_bar(aes(x='NOC',
y='Count'),
fill=np.where(medal_noc_count['NOC'] == 'URS','#c22d6d','#80797c'),
stat='identity')+
geom_text(aes(x='NOC',
y='Count',
label='Count'),
size=10,
nudge_y=0.5)+
labs(title='Top ten countries that won the most Olympic medals')+
xlab('Country')+
ylab('Frequency')+
scale_x_discrete(limits=medal_noc_count['NOC'].tolist())+
theme_minimal()+
theme(text=element_text(family='DejaVu Sans',
size=10),
axis_title=element_text(face='bold'),
axis_text=element_text(face='italic'),
plot_title=element_text(face='bold',
size=12))
)

Now, let’s try the other option to annotate the categories, namely geom_label()!
# Annotate each category using geom_label
(
ggplot(data=medal_noc_count)+
geom_bar(aes(x='NOC',
y='Count'),
fill=np.where(medal_noc_count['NOC'] == 'URS','#c22d6d','#80797c'),
stat='identity')+
geom_label(aes(x='NOC',
y='Count',
label='Count'),
size=10,
nudge_y=0.75)+
labs(title='Top ten countries that won the most Olympic medals')+
xlab('Country')+
ylab('Frequency')+
scale_x_discrete(limits=medal_noc_count['NOC'].tolist())+
theme_minimal()+
theme(text=element_text(family='DejaVu Sans',
size=10),
axis_title=element_text(face='bold'),
axis_text=element_text(face='italic'),
plot_title=element_text(face='bold',
size=12))
)

Note: plotnine offers more techniques for annotating text; see its documentation for geom_text() and geom_label().
8 Flip the coordinate
Sometimes, it is useful to rotate the horizontal axis to vertical and vice versa, in other words, to flip the coordinates. Plotnine provides the coord_flip() function to do that easily.
In this section, we also talk about how to get the audience’s attention by showing them a highlighted part, for instance, by giving a text annotation only to URS. To do so, we filter the data to only the URS rows and pass it to the geom_text() function.
Note: because we have already assigned medal_noc_count in the ggplot() function, we need to pass the filtered URS data to geom_text() through the data argument and the aesthetics through the mapping argument.
# Flip the coordinate
(
ggplot(data=medal_noc_count)+
geom_bar(aes(x='NOC',
y='Count'),
fill=np.where(medal_noc_count['NOC'] == 'URS','#c22d6d','#80797c'),
position = position_dodge(0.9),
stat='identity')+
geom_text(data=medal_noc_count[medal_noc_count['NOC'] == 'URS'],
mapping=aes(x='NOC',
y='Count',
label='Count'),
size=10,
nudge_y=0.25)+
labs(title='Top ten countries that won the most Olympic medals')+
xlab('Country')+
ylab('Frequency')+
scale_x_discrete(limits=medal_noc_count['NOC'].tolist())+
theme_minimal()+
theme(text=element_text(family='DejaVu Sans',
size=10),
axis_title=element_text(face='bold'),
axis_text=element_text(face='italic'),
plot_title=element_text(face='bold',
size=12))+
coord_flip()
)

Did you notice something wrong? Yes, it’s about the order. After flipping, the first category is drawn at the bottom, so we must arrange the categories by their counts in increasing order. It sounds tricky but is quite simple to do: we just need to sort the counts in increasing instead of decreasing order.
# Reverse order the categories for flipping the coordinate
medal_noc_count.sort_values('Count',ascending=True,inplace=True)

When the data is already sorted increasingly, we can visualize it!

9 Add a description for the highlighted part
Since URS is the highlighted part in this tutorial, it is recommended to annotate it with some text, such as findings or insights. To do so, we use the geom_text() function again. The x is set to 8 to place the text near the URS position on the vertical axis (counted from the bottom) and the y is set to 8 because that is the URS count. The statement y=8 centres the annotation paragraph at 8 (before nudge_y is applied).
Note: \n is an escape character that inserts a newline in the text
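The effect of the \n escape can be checked in plain Python before it is used in a plot label. This small sketch uses the annotation text from this section; the raw string is shown only for contrast.

```python
# '\n' inside a normal string is a single newline character
label = 'The URS has won the Olympics \nevent for 8 times'
lines = label.split('\n')
print(lines)

# In a raw string, backslash and n stay as two separate characters
raw_label = r'The URS has won the Olympics \nevent for 8 times'
print('\n' in raw_label)
```

The first string is rendered on two lines, while the raw string would show the backslash and the n literally.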
# Add the description about the highlighted bar
(
ggplot(data=medal_noc_count)+
geom_bar(aes(x='NOC',
y='Count'),
fill=np.where(medal_noc_count['NOC'] == 'URS','#c22d6d','#80797c'),
position=position_dodge(0.9),
stat='identity')+
geom_text(data=medal_noc_count[medal_noc_count['NOC'] == 'URS'],
mapping=aes(x='NOC',
y='Count',
label='Count'),
size=10,
nudge_y=0.25)+
geom_text(x=8,
y=8,
label='The URS has won the Olympics \nevent for 8 times',
size=10,
nudge_y=0.25)+
labs(title='Top ten countries that won the most Olympic medals')+
xlab('Country')+
ylab('Frequency')+
scale_x_discrete(limits=medal_noc_count['NOC'].tolist())+
theme_minimal()+
theme(text=element_text(family='DejaVu Sans',
size=10),
axis_title=element_text(face='bold'),
axis_text=element_text(face='italic'),
plot_title=element_text(face='bold',
size=12))+
coord_flip()
)

10 Add the data description in an empty space
The last step in creating a scientific data viz is a description of the data. It helps the audience understand the data source, its period, sampling methods, etc. To keep it tidy and clear, we can use the geom_label() function. Positioning the label is a little tricky because it takes trial and error to make sure the description lands in the right place.
# Add the description about the data in an empty space
(
ggplot(data=medal_noc_count)+
geom_bar(aes(x='NOC',
y='Count'),
fill=np.where(medal_noc_count['NOC'] == 'URS','#c22d6d','#80797c'),
position = position_dodge(0.9),
stat='identity')+
geom_text(data=medal_noc_count[medal_noc_count['NOC'] == 'URS'],
mapping=aes(x='NOC',
y='Count',
label='Count'),
size=10,
nudge_y=0.25)+
geom_text(x=8,
y=8,
label='The URS has won the Olympics \nevent for 8 times',
size=10,
nudge_y=0.25)+
geom_label(x=1.5,
y=12,
label='This is a historical dataset\n on the modern Olympic Games,\n including all the Games from\n Athens 1896 to Rio 2016',
nudge_y=3,
family='DejaVu Sans',
size=10,
fontstyle='italic',
boxstyle = 'round')+
labs(title='Top ten countries that won the most Olympic medals')+
xlab('Country')+
ylab('Frequency')+
scale_x_discrete(limits=medal_noc_count['NOC'].tolist())+
theme_minimal()+
theme(text=element_text(family='DejaVu Sans',
size=10),
axis_title=element_text(face='bold'),
axis_text=element_text(face='italic'),
plot_title=element_text(face='bold',
size=12))+
coord_flip()
)

Conclusion
Finally, the data visualization in this last section is better than the one in the first section. People will understand the data more easily: we get their attention through the highlighted part, use a text annotation to tell the point of interest, and add a short data description to explain the background of our data. This is because data viz is part science and part art. Entertain people with impressive but simple data viz without losing the analytics aspect.

References
[1] H. Wickham, W. Chang, L. Henry, T. Pedersen, K. Takahashi, C. Wilke, K. Woo, H. Yutani, D. Dunnington. Modify components of a theme, 2020. https://ggplot2.tidyverse.org/.
[2] J. Janssens. Plotnine: Grammar of Graphics for Python A translation of the visualisation chapters from "R for Data Science" to Python using Plotnine and Pandas, 2019. https://datascienceworkshops.com/.