How to draw a bar graph for your scientific paper with Python

Yefeng Xia
Towards Data Science
6 min read · Sep 17, 2020

photo by Trey Hurley

A bar graph 📊 (also known as a bar chart or bar diagram) is a visual tool that lets readers compare data across categories, with each value shown as a bar. In this story, I introduce how to draw a clear bar plot with Python.

As a student or researcher, you have to publish your efforts and results in scientific papers, where the research data should be easy to read and access. Sharing research data is something we are increasingly encouraged, or even mandated, to do. It is also something I am passionate about 😘.

First, let us get reacquainted with this friend: who (or what) is the bar graph?

A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes called a column chart. (Source: Wikipedia)

First example

I made a simple example that shows the scores of a student X in each subject as a vertical bar chart.

a bar chart of all scores

It is produced by a few lines of Matplotlib code:

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(7, 5))
ax = fig.add_axes([0, 0, 1, 1])
subjects = ['Math', 'Physics', 'English', 'Chemistry', 'History']
scores = [90, 80, 85, 72, 66]
ax.bar(subjects, scores)
ax.set_ylabel('scores', fontsize=12)
ax.set_xlabel('subjects', fontsize=12)
ax.set_title('all scores of student X')
# write each score just above its bar
for i, v in enumerate(scores):
    ax.text(i, v + 1, str(v), color='black', fontweight='bold')
plt.savefig('barplot_1.png', dpi=200, format='png', bbox_inches='tight')
plt.show()
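As a possible alternative to the manual ax.text loop, Matplotlib 3.4 and later offer Axes.bar_label, which centers a label on each bar. The sketch below assumes such a version is installed. The Agg backend line is my own addition so that the script also runs without a display; it is not part of the example above.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop this line for an interactive window
import matplotlib.pyplot as plt

subjects = ['Math', 'Physics', 'English', 'Chemistry', 'History']
scores = [90, 80, 85, 72, 66]

fig, ax = plt.subplots(figsize=(7, 5))
bars = ax.bar(subjects, scores)  # ax.bar returns a BarContainer

# bar_label writes one label per bar, horizontally centered, 3 points above the top
labels = ax.bar_label(bars, padding=3, fontweight='bold')

ax.set_xlabel('subjects', fontsize=12)
ax.set_ylabel('scores', fontsize=12)
ax.set_title('all scores of student X')
```

Calling fig.savefig('barplot_1.png', dpi=200, bbox_inches='tight') afterwards saves the figure in the same way as the example above.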

A real case in data analysis

For a small amount of result data, we can store everything in a list, just like in our previous example. But when there is a lot of data, it’s inefficient to type every value into a list again if the same data is already saved in an Excel file.

It is more convenient to start drawing the bar chart by importing the Excel file (or a file in another format). For that we need a DataFrame: a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

In this real case, taken from a paper I wrote, I need to compare the performance of different models by their accuracy. Accuracy (ACC) is the closeness of measurements to a specific value, and it is one metric for evaluating classification models. Formally, accuracy has the following definition:

Accuracy = Number of correct predictions / Total number of predictions

Furthermore, top-N accuracy, which is commonly used for multi-class classification models, counts a prediction as “correct” if the correct class is among the N classes with the highest predicted probabilities. In the paper, I used top-1, top-3, and top-5 accuracy to evaluate predictions. I entered all top-N ACCs in an Excel file named “model_accs.xlsx”.
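To make the definition concrete, here is a minimal sketch of how top-N accuracy could be computed with NumPy. The function name top_n_accuracy and the toy probabilities are my own illustration and are not taken from the paper. With N = 1 the result is the plain accuracy defined above.

```python
import numpy as np

def top_n_accuracy(probs, labels, n):
    """Fraction of samples whose true class is among the n most probable classes."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    # argsort is ascending, so the last n columns hold the n largest probabilities
    top_n = np.argsort(probs, axis=1)[:, -n:]
    hits = (top_n == labels[:, None]).any(axis=1)
    return hits.mean()

# Toy data (made up): 4 samples, 3 classes
probs = [[0.60, 0.30, 0.10],
         [0.20, 0.50, 0.30],
         [0.10, 0.20, 0.70],
         [0.40, 0.35, 0.25]]
labels = [0, 2, 2, 2]

for n in (1, 2, 3):
    print(f"top-{n} acc: {top_n_accuracy(probs, labels, n):.2f}")
```

Top-N accuracy can only stay the same or grow as N grows, which is why the top-5 value is never below the top-1 value.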

data saved in an Excel table (b) and in a pandas DataFrame (c)

With pandas, we can read data from an Excel file in Python. pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

import pandas as pd
df= pd.read_excel("model_accs.xlsx",header=0) # read excel file

With the function pd.read_excel(), we can read our result Excel file. If the data is saved in a .csv file instead, we can use pd.read_csv(), which works similarly to pd.read_excel().
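As a hedged illustration of the CSV route, the sketch below reads a small in-memory CSV with pd.read_csv(). The column order (model names first, then top-5, top-3, top-1) is inferred from the positional indices used in the plotting code. The column name 'model', the model names, and all numbers are placeholders, not the results from the paper.

```python
import io
import pandas as pd

# Hypothetical CSV content; the values are placeholders for illustration only
csv_text = """model,top-5 acc,top-3 acc,top-1 acc
model A,95,90,80
model B,92,85,70
model C,97,93,85
model D,90,82,65
model E,94,88,75
"""

# io.StringIO lets read_csv treat the string like a file
df_csv = pd.read_csv(io.StringIO(csv_text), header=0)

# Columns can be selected by name instead of by position (df.iloc[:, 3])
top_1 = df_csv['top-1 acc']
xlabels = df_csv['model']
print(df_csv)
```

Selecting columns by name keeps the code working even if the column order in the file changes.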

import numpy as np
import matplotlib.pyplot as plt
# columns by position: 0 = model names, 1 = top-5, 2 = top-3, 3 = top-1
top_1 = df.iloc[:, 3]
top_3 = df.iloc[:, 2]
top_5 = df.iloc[:, 1]
xlabels = df.iloc[:, 0]
N = 5
ind = np.arange(N)  # the x locations for the groups
width = 0.2         # the width of the bars
fig, ax = plt.subplots(figsize=(12, 8))
# three bars per model, each shifted by one bar width
rects1 = ax.bar(ind, top_1, width, color='b')
rects2 = ax.bar(ind + width, top_3, width, color='orange')
rects3 = ax.bar(ind + 2*width, top_5, width, color='g')
ax.set_xticks(ind + width)
ax.set_xticklabels(xlabels, fontsize=10)
ax.set_xlabel("models", fontsize=12)
ax.set_ylabel("Top-N ACC/%", fontsize=12)
ax.set_title('Top-N ACC for 5 different models')
ax.legend((rects1[0], rects2[0], rects3[0]), ('top-1 acc', 'top-3 acc', 'top-5 acc'), bbox_to_anchor=(1.13, 1.01))
# write the integer part of each value above its bar
def labelvalue(rects):
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.01*height, '%d' % int(height), ha='center', va='bottom')
labelvalue(rects1)
labelvalue(rects2)
labelvalue(rects3)
plt.savefig('barplot_2.png', dpi=200, format='png', bbox_inches='tight')
plt.show()

With the code above, we have saved a nice bar graph of the top-N ACC of our 5 models, which looks like this:

bar graph for the 5 models’ top-N ACC

Progress

However, we are not fully satisfied with this bar plot, since there are too many bars. Can we show the data in a simpler way 🤔? Perhaps with fewer bars: for example, we can combine top-1, top-3, and top-5 acc into one bar, so that only 5 bars are needed to show the top-N acc of the 5 models. The information in the plot is thereby compressed. We can do this by preprocessing the input data: we subtract top-3 acc from top-5 acc to get the difference, and repeat the same operation for top-3 acc and top-1 acc.
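The preprocessing step described above could be sketched as follows. This is my own reconstruction under stated assumptions, not necessarily the exact code used for the paper: the variable names raw and stacked are hypothetical, and the numbers are placeholders. Only the column names 'top-1 acc', 'top-3 acc', and 'top-5 acc' come from the plotting code.

```python
import pandas as pd

# Placeholder values for illustration only (not the paper's results)
raw = pd.DataFrame({
    'model': ['model A', 'model B', 'model C'],
    'top-5 acc': [95, 92, 97],
    'top-3 acc': [90, 85, 93],
    'top-1 acc': [80, 70, 85],
})

# Segment heights for a stacked bar: the bottom segment is top-1,
# each further segment is the gain over the previous level
stacked = pd.DataFrame({
    'top-1 acc': raw['top-1 acc'],
    'top-3 acc': raw['top-3 acc'] - raw['top-1 acc'],
    'top-5 acc': raw['top-5 acc'] - raw['top-3 acc'],
})
stacked.index = raw['model']

print(stacked)
# Summing the segments from left to right recovers the original accuracies
print(stacked.cumsum(axis=1))
```

With these differences, the total height of each stacked bar equals the top-5 accuracy, and the segment boundaries sit at the top-1 and top-3 values.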

And we draw a new bar graph with the following code:

# stacked=True puts the three columns of each model on top of each other in one bar
ax = df[['top-1 acc', 'top-3 acc', 'top-5 acc']].plot(kind='bar', title="Top-N ACC for 5 different models", figsize=(12, 8), legend=True, fontsize=12, stacked=True)
top_1 = df.iloc[:, 3]
top_3 = df.iloc[:, 2]
top_5 = df.iloc[:, 1]
xlabels = df.iloc[:, 0]
ax.set_xticklabels(xlabels, fontsize=10)
ax.set_xlabel("models", fontsize=12)
ax.set_ylabel("Top-N ACC/%", fontsize=12)
# write the values into the plot, slightly below or above the given height
for i, v in enumerate(top_1):
    ax.text(i, v - 2, str(v), color='black')
for i, v in enumerate(top_3):
    ax.text(i, v - 3, str(v), color='black')
for i, v in enumerate(top_5):
    ax.text(i, v + 1, str(v), color='black')
plt.savefig('barplot_2.png', dpi=200, format='png', bbox_inches='tight')
plt.show()

After running the code, we get a simpler bar chart that integrates the top-1, top-3, and top-5 acc into a single bar per model.

a more intuitive bar graph for the 5 models’ top-N ACC

The last bar graph is intuitive and easy to read, yet it contains the same information as the previous bar graph with 15 bars. In addition, its code is shorter. Therefore, I chose the new bar graph for my paper 📜.
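To make the stacking explicit, here is a minimal sketch that builds the same kind of chart directly with Matplotlib’s bottom parameter instead of the pandas stacked=True option. It assumes Matplotlib 3.4 or later for Axes.bar_label. The model names and values are placeholders, and the Agg backend line is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop this line for an interactive window
import matplotlib.pyplot as plt
import numpy as np

# Placeholder values for illustration only (not the paper's results)
models = ['model A', 'model B', 'model C']
top_1 = np.array([80, 70, 85])
top_3 = np.array([90, 85, 93])
top_5 = np.array([95, 92, 97])

fig, ax = plt.subplots(figsize=(8, 5))
bottom = np.zeros(len(models))  # height already covered by lower segments
for name, values in [('top-1 acc', top_1), ('top-3 acc', top_3), ('top-5 acc', top_5)]:
    # each segment starts at the previous accuracy level and ends at the current one
    seg = ax.bar(models, values - bottom, bottom=bottom, label=name)
    # label the segment with the cumulative accuracy it reaches
    ax.bar_label(seg, labels=[str(v) for v in values], label_type='center')
    bottom = values.astype(float)

ax.set_xlabel("models", fontsize=12)
ax.set_ylabel("Top-N ACC/%", fontsize=12)
ax.legend()
```

Here each label sits in the middle of its segment and shows the accuracy reached at the top of that segment, so the three top-N values of a model can be read from one bar.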

Conclusion

In this story, I introduced my own way of drawing a bar graph with Python, based on my experience. For both students and researchers, it’s important to master the skill of illustrating research data intuitively. The bar plot is one of the best and most frequently used illustrations in scientific papers. I hope my story has helped you with it ⛳️.

Link to GitHub Repository: https://github.com/Kopfgeldjaeger/Bar_graph_with_Matplotlib

References

Here are two helpful documents for pandas and Matplotlib.
