
Data Visualization using Matplotlib & Seaborn

DATA VISUALIZATION

A look at customizing options towards enhancing visualizations, and checking out some lesser-known plot functions along the way

Data visualization provides a visual context through maps or graphs. In doing so, it translates the data to a more natural form for the human mind to comprehend and pick out patterns or points of interest. A good visualization facilitates the conveyance of information or calls to action as part of the data presentation (storytelling).

Image by Author | Good for laughs, use at own risk for professional presentations.

The idea for the underlying project came about while I was maintaining records of iaito orders some years back. I chose this data because I felt it could bridge data science (visualization and analytics) with a traditional art (Iaido), and serve as visualization practice. The dataset covers iaito order details for the years 2013–2021, including the iaito specifications and cost (JPY). For privacy reasons, sensitive particulars such as names and delivery addresses were not captured in the data from the outset.

An iaito is an imitation katana used for practice. Koshirae refers to the fittings on the iaito (fuchi, kashira, kojiri). Tsuba refers to the sword guard. Lengths are recorded in traditional Japanese units (shaku, sun, and bu) and subsequently converted to metric units for easier interpretation. The model grades, from beginner to advanced, are Shoden, Chuden, and Okuden.
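As a rough illustration of the unit conversion mentioned above, the sketch below converts a shaku-sun-bu length to centimetres. It assumes the standard definition of 1 shaku = 10/33 m (about 30.3 cm), with 1 sun = 1/10 shaku and 1 bu = 1/10 sun. The function name and the sample length are hypothetical and not taken from the project code.

```python
# Assumption: 1 shaku = 10/33 m; 1 sun = 0.1 shaku; 1 bu = 0.01 shaku.
SHAKU_IN_CM = 1000 / 33  # about 30.303 cm


def shaku_sun_bu_to_cm(shaku, sun=0, bu=0):
    """Convert a length given in shaku, sun and bu to centimetres."""
    total_shaku = shaku + sun / 10 + bu / 100
    return total_shaku * SHAKU_IN_CM


# Hypothetical blade length of 2 shaku 4 sun 5 bu
length_cm = shaku_sun_bu_to_cm(2, 4, 5)
print(f"{length_cm:.1f} cm")  # 74.2 cm
```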


Content

  • Libraries
  • Matplotlib & Seaborn Approaches
  • Visualization Examples with Code
  • Final Thoughts

Libraries

Matplotlib and Seaborn are among the common workhorses for visualizing data in Python. In this article, a few visualizations are generated using both libraries, and the plotting functions are briefly covered.


Matplotlib & Seaborn Approaches

When using matplotlib, there are two approaches: the functional interface and the object-oriented interface. As beginners, we generally see more instances of the former. As one gains familiarity with plotting using this library, the latter approach comes into play, for scalability and more intricate customizations.

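# Example
import matplotlib.pyplot as plt
import numpy as np

# placeholder data; substitute your own x and y
x = np.arange(10)
y = x ** 2
# plt (pyplot) interface
plt.figure(figsize=(9, 7))
plt.plot(x, y)
# object-oriented interface
fig, ax = plt.subplots(figsize=(9, 7))
ax.plot(x, y)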
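To illustrate why the object-oriented interface scales better, here is a minimal sketch with two panels on one figure, each customized through its own Axes object. The data is synthetic, and the Agg backend is chosen only so the snippet also runs without a display; neither is part of the original project.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

# synthetic data, for illustration only
x = np.linspace(0, 10, 50)

# one figure, two Axes; each Axes is addressed explicitly
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, figsize=(9, 7), sharex=True)

ax_top.plot(x, np.sin(x), color="#004c6d")
ax_top.set_title("sin(x)", fontsize=15)
ax_top.set_ylabel("value", fontsize=14)

ax_bottom.plot(x, np.cos(x), color="#6d91ad")
ax_bottom.set_title("cos(x)", fontsize=15)
ax_bottom.set_xlabel("x", fontsize=14)
ax_bottom.set_ylabel("value", fontsize=14)

fig.tight_layout()
```

In an interactive session, drop the backend line and finish with plt.show(). With the pyplot interface the same figure would rely on switching the "current" Axes back and forth, which gets harder to follow as the number of panels grows.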

Seaborn also has two approaches: axes-level functions and figure-level functions. Axes-level functions accept an explicit ax argument and return a matplotlib Axes object. Figure-level functions need overall control of the figure, so they create their own figure (returning a grid object such as a FacetGrid) and do not accept an ax argument. Figure-wide properties are therefore set through "figure-level" arguments, for example the figure size via the height and aspect values.

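# Typical object-oriented style (seaborn axes-level functions)
import seaborn as sns

f, (ax1, ax2) = plt.subplots(2)
# pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.regplot(x=x, y=y, ax=ax1)
sns.kdeplot(x=x, ax=ax2)
# Figure-level: sized via height (inches per facet) and aspect (width = height * aspect)
sns.catplot(data=df, x='x_variable', y='y_variable', hue='hue', height=8, aspect=1.2)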

Not realizing whether a particular seaborn plot is an axes-level or a figure-level function has tripped up many (I’m certainly one). Check the plot type if stuck.

Examples of the plots for respective functions:

  • "Axes-level" functions: regplot, boxplot, kdeplot.
  • "Figure-level" functions: relplot, catplot, displot, pairplot, jointplot.
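A quick way to check which level a function belongs to is to look at what it returns. The sketch below uses a tiny made-up DataFrame (not the iaito data) and the Agg backend so it runs without a display: the axes-level call hands back the Axes it was given, while the figure-level call builds its own figure and returns a FacetGrid.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# tiny made-up dataset, for illustration only
df_demo = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "value": [1.0, 2.0, 2.5, 3.5, 4.0, 5.0],
})

# axes-level: draws onto the Axes we pass in and returns that same Axes
fig, ax = plt.subplots()
returned = sns.boxplot(data=df_demo, x="group", y="value", ax=ax)

# figure-level: creates its own figure and returns a grid object
grid = sns.catplot(data=df_demo, x="group", y="value", kind="box",
                   height=4, aspect=1.5)

print(type(returned).__name__, type(grid).__name__)
```

If the returned object is a grid, customize it through the grid's own methods (set_xlabels, set_titles and so on) or through the figure it holds, rather than passing ax.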

Visualization Examples with Code

The examples are arranged to show the various approaches for creating plots along with specific customizations. Initially, I adopted the mindset of "if it works, it’s good". I then refactored the code a few times for maintainability and consistency while experimenting with new concepts. For example, a consistent code structure for the title, axes, tick labels, and legend is used across all plots. I also considered the potential audiences and ways of making the visualizations pop, such as colour schemes and shapes. On colours, there are three main types of colour palettes available:

  • Qualitative
  • Sequential
  • Diverging
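As a minimal sketch of the three palette types, the snippet below samples one built-in matplotlib colormap of each kind and converts the colours to hex codes. The colormap names (tab10, Blues, RdBu) ship with matplotlib; the helper function is hypothetical and only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import to_hex


def sample_palette(cmap_name, n):
    """Return n hex colours sampled evenly from a named matplotlib colormap."""
    cmap = plt.get_cmap(cmap_name)
    return [to_hex(cmap(v)) for v in np.linspace(0, 1, n)]


# qualitative: distinct hues for unordered categories (take the first five entries)
qualitative = [to_hex(c) for c in plt.get_cmap("tab10").colors[:5]]
# sequential: light to dark, for ordered or numeric values
sequential = sample_palette("Blues", 5)
# diverging: two hues meeting at a neutral midpoint
diverging = sample_palette("RdBu", 5)

for name, colours in [("qualitative", qualitative),
                      ("sequential", sequential),
                      ("diverging", diverging)]:
    print(name, colours)
```

The iaito model palette used in this article is a sequential scheme, which suits grades that run from beginner to advanced.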

To facilitate the generation of colour schemes, I used the Data Color Picker tool. A set of font-size variables was also defined for re-use.

# create placeholder for fontsizing
# x & y label fontsize
xls = 14
yls = 14
# x & y tick fontsize
xts = 13
yts = 13
# title fontsize
ts = 15
# legend fontsize
ls = 13
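An alternative to passing these variables to every call is to set matplotlib's defaults through rcParams. The sketch below uses plt.rc_context so that the sizes apply only inside the with block. The rcParams keys are standard matplotlib settings; the sample plot is made up, and the Agg backend is used only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt

font_sizes = {
    "axes.labelsize": 14,    # x & y label fontsize (xls, yls)
    "xtick.labelsize": 13,   # x tick fontsize (xts)
    "ytick.labelsize": 13,   # y tick fontsize (yts)
    "axes.titlesize": 15,    # title fontsize (ts)
    "legend.fontsize": 13,   # legend fontsize (ls)
}

with plt.rc_context(font_sizes):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 8], label="demo")
    ax.set_title("Sizes picked up from rcParams")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
```

Explicit variables, as used in this article, keep each call self-documenting; rcParams is less repetitive when every plot shares the same sizes.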

Plotting with the pyplot interface is relatively straightforward. In the following line plot example, fig.autofmt_xdate() provides an alternative for customizing the x-axis tick labels. Common reasons to rotate tick labels are long unit names (which could be mitigated by choosing an abbreviated unit) or dates. As the function name implies, it automatically rotates and aligns date tick labels on the x-axis. It has no y-axis counterpart; to rotate y tick labels, pass rotation to plt.yticks() instead.

# Establish the size of the figure.
fig = plt.figure(figsize=(16, 6))
plt.plot(df_fedex['postage'], linewidth=2,color='blue')
# Customize
plt.title('Postage over time',fontsize=ts)
plt.ylabel('Price (JPY)', fontsize=yls)
plt.yticks(size=yts)
plt.xticks(size=12)
# Rotate and align the tick labels so they look better.
fig.autofmt_xdate()
plt.show()
Image by Author

Pie charts are generally overused and could distort the information presented (i.e. there’s no scale).

Image by Author | Yes, scale is added for context info for these charts. | Tool: imgflip.com

Nevertheless, pie charts have their place in visualizations: presenting a handful of categories and their counts or percentages. Used sparingly, and with annotations, colour schemes, and shapes, they can still be impactful. For the following pie chart, I adopted a reusable colour scheme and added the percentages along with the absolute order quantities as chart annotations. Adding a white circle at the centre converts the pie into a doughnut chart for visual effect.

# Create figure
fig, ax = plt.subplots(figsize=(6,6), subplot_kw=dict(aspect="equal"))
# Standardizing color scheme for the Iaito models
palette = ['#6d91ad','#004c6d','#416e8c','#99b6ce','#c6ddf1']
def make_autopct(count):
    # return a formatter that labels each wedge with its percentage and absolute count
    def my_autopct(pct):
        total = sum(count)
        val = int(round(pct*total/100.0))
        return f'{pct:.1f}% ({val:d})'
    return my_autopct
# draw on the Axes created above; the counts come from df_pie['count']
df_pie['count'].plot(kind='pie', ax=ax, fontsize=ls, autopct=make_autopct(df_pie['count']),
                     pctdistance=0.6, colors=palette)
ax.set_ylabel("")
ax.set_title("Distribution of Iaito models",fontsize=ts)
# Add a white circle to center of pie chart
centre_circle = plt.Circle((0,0),0.80,fc='white')
ax.add_artist(centre_circle)
plt.show()
Image by Author
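To see what the autopct closure produces, the sketch below calls the returned formatter directly, outside of matplotlib. The counts are made up for illustration; when drawing the pie, matplotlib passes each wedge's percentage as pct.

```python
def make_autopct(count):
    def my_autopct(pct):
        total = sum(count)
        # recover the absolute count from the percentage of the total
        val = int(round(pct * total / 100.0))
        return f'{pct:.1f}% ({val:d})'
    return my_autopct


# made-up order counts for three models
demo_counts = [30, 10, 10]
formatter = make_autopct(demo_counts)

# matplotlib would call the formatter once per wedge with its percentage
labels = [formatter(100 * c / sum(demo_counts)) for c in demo_counts]
print(labels)  # ['60.0% (30)', '20.0% (10)', '20.0% (10)']
```

The closure is needed because autopct only receives the percentage, so the counts have to be captured from the enclosing scope.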

Continuing with the object-oriented approach, the following box plots incorporate it together with the aforementioned colour scheme concept. Box plots are vertical by default but can be made horizontal by swapping the x and y inputs.

# boxplot, models are organised in ascending order
palette_box = ['#c6ddf1','#99b6ce','#6d91ad','#416e8c','#004c6d']
fig, (ax1,ax2) = plt.subplots(2,1,figsize=(14,8), sharey=True)
sns.boxplot(data=df2, y='model', x='model_price',ax=ax1, palette=palette_box)
sns.boxplot(data=df2, y='model', x='iaito_total_price',ax=ax2, palette=palette_box)
ax1.set_ylabel('')
ax2.set_ylabel('')
ax1.set_xlabel('')
ax2.set_xlabel('price (JPY)',fontsize = yls)
ax1.set_title('Base price by model', fontsize = ts)
ax2.set_title('Overall price by model', fontsize = ts)
ax1.tick_params(axis='x', labelsize='large')
ax1.tick_params(axis='y', labelsize='large')
ax2.tick_params(axis='x', labelsize='large')
ax2.tick_params(axis='y', labelsize='large')
plt.tight_layout()  # call the function; without parentheses it has no effect
plt.show()
Image by Author

The colour scheme can also replace the default hue colours in seaborn plots. In the following example, the colour scheme of the lmplot (a figure-level function), which comes from the combination of hue for categorical classes and plt.style.use('ggplot'), can be overridden by setting the colour palette for seaborn like so:

# Set your custom color palette
palette = ['#6d91ad','#004c6d','#416e8c','#99b6ce','#c6ddf1']
sns.set_palette(sns.color_palette(palette))
# Consistent grey background
# (from matplotlib 3.6 this style is named 'seaborn-v0_8-darkgrid')
plt.style.use('seaborn-darkgrid')
# lmplot is a figure-level function; it returns a FacetGrid, named g here
g = sns.lmplot(x='model_price', y='iaito_total_price', data=df,
               fit_reg=False, #no regression line
               hue='model', height=7, aspect=1.2, legend=False)
plt.title('Iaito model price vs total price', fontsize=ts)
plt.ylim(30000,None)
plt.yticks(np.arange(35000, 210000, step=10000), fontsize=yts)
plt.ylabel('Iaito total price (JPY)',fontsize=yls)
plt.xticks(rotation=45, fontsize=xts)
plt.xlabel('Iaito model base price (JPY)',fontsize=xls)
# legend=False above suppresses seaborn's legend, so draw one with matplotlib
plt.legend(fontsize=ls)
plt.show()
Images by Author
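One way to keep a colour tied to the same model in every plot is to map model names to colours explicitly. The sketch below pairs the model labels and hex codes used in this article's grouped bar chart; that these labels match the values in the model column of the dataset is an assumption, and the helper function is hypothetical. Seaborn's palette argument also accepts such a dictionary keyed by hue level.

```python
# model labels in ascending order (beginner to advanced), as used in the bar chart
models = ['Sho', 'Chu_M', 'Chu_S', 'Oku_Nosyu', 'Oku_Shin']
palette_box = ['#c6ddf1', '#99b6ce', '#6d91ad', '#416e8c', '#004c6d']

# one fixed colour per model, independent of the order the data arrives in
model_colours = dict(zip(models, palette_box))


def colours_for(labels, mapping=model_colours, default='#999999'):
    """Look up a colour for each label; unknown labels get a neutral grey."""
    return [mapping.get(label, default) for label in labels]


# colours for a subset of models in arbitrary order
print(colours_for(['Oku_Shin', 'Sho']))  # ['#004c6d', '#c6ddf1']
```

With a plain list, the colour a model receives depends on the order in which the categories appear, which is why the pie chart and box plot above need differently ordered lists.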

Violin plots can be used to compare the distribution of a variable across different categories (in this case, kojiri vs. no kojiri) for each of the iaito models.

# catplot is figure-level and returns a FacetGrid, named g here
g = sns.catplot(x="model", y="custom_design(cost)", data=df_ko, hue="kojiri",
                kind="violin", split=True, inner="stick", palette="mako", linewidth=1,
                height=4, aspect=2.5)
g.fig.suptitle('Kojiri & associated additional costs across models', fontsize=ts)
g.fig.subplots_adjust(left=0.1, top=0.9)
g.set_xlabels('')
g.set_ylabels('price (JPY)', fontsize=yls)
g.set_xticklabels(fontsize=xts)
g.set_yticklabels(fontsize=yts)
g._legend.set_title('Kojiri')
plt.show()
Image by Author

Matplotlib and seaborn complement each other. Grouped bar charts can be plotted with matplotlib (plt.bar() or ax.bar()) or with seaborn (e.g. catplot). With matplotlib, this is done by offsetting each group's bars along the axis index by a multiple of the bar width.

# seaborn catplot (horizontal grouped bar chart)
# catplot is figure-level and returns a FacetGrid, named g here
g = sns.catplot(y="kojiri", data=df_ko,
                kind="count", hue="model",
                palette=palette_box, edgecolor=".6",
                height=5, aspect=2, legend=False)
g.fig.suptitle('Iaito with & without kojiri (by model count)', fontsize=ts)
g.fig.subplots_adjust(left=0.15, top=0.9)
g.set_xlabels('')
g.set_xticklabels(fontsize=xts)
g.set_ylabels('Kojiri', fontsize=yls)
g.set_yticklabels(fontsize=yts)
g.add_legend(fontsize=ls)
plt.show()
Image by Author
# Setting the data
years = [2013,2014,2015,2016,2017,2018,2019,2020,2021]
Sho = [0,0,0,0,0,0,1,1,1]
Chu_M = [0,0,0,0,1,2,2,2,2]
Chu_S = [1,2,4,7,8,18,18,20,22]
Oku_Nosyu = [0,0,0,1,1,3,4,4,4]
Oku_Shin = [0,0,0,0,3,4,8,10,10]
# set bar width
width=0.15
# axis index
years_index = np.arange(len(years))
# clear reset plt style
plt.style.use('default')
plt.style.use(['ggplot'])
plt.figure(figsize=(8,5))
# plt interface approach for bar charts
plt.bar(years_index-2*width, Sho, color='#c6ddf1', label='Sho', width=width, linewidth=0.4,edgecolor='darkgrey')
plt.bar(years_index-width, Chu_M, color='#99b6ce', label='Chu_M', width=width)
plt.bar(years_index, Chu_S, color='#6d91ad', label='Chu_S', width=width)
plt.bar(years_index+width, Oku_Nosyu, color='#416e8c', label='Oku_Nosyu', width=width)
plt.bar(years_index+2*width, Oku_Shin, color='#004c6d', label='Oku_Shin', width=width)
plt.legend()
plt.title('Iaito model ownership',fontsize=ts)
plt.xlabel('Year',fontsize=xls)
plt.xticks(ticks=years_index, labels=years)
plt.ylabel('Count',fontsize=yls)
plt.yticks(list(np.arange(df_cumsum['count'].max()+1)))
plt.show()
Image by Author
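The five hand-written offsets above can be generalized. The sketch below computes the offsets for any number of series so that each group stays centred on its tick, then draws the bars in a loop with the object-oriented interface. It reuses the counts and colours from the snippet above; the dictionary layout and the Agg backend are choices made for this illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

years = [2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
# label -> (counts per year, colour), copied from the snippet above
series = {
    'Sho':       ([0, 0, 0, 0, 0, 0, 1, 1, 1], '#c6ddf1'),
    'Chu_M':     ([0, 0, 0, 0, 1, 2, 2, 2, 2], '#99b6ce'),
    'Chu_S':     ([1, 2, 4, 7, 8, 18, 18, 20, 22], '#6d91ad'),
    'Oku_Nosyu': ([0, 0, 0, 1, 1, 3, 4, 4, 4], '#416e8c'),
    'Oku_Shin':  ([0, 0, 0, 0, 3, 4, 8, 10, 10], '#004c6d'),
}

width = 0.15
years_index = np.arange(len(years))
n = len(series)
# centre each group on its tick: offsets run from -(n-1)/2 to +(n-1)/2 bar widths
offsets = (np.arange(n) - (n - 1) / 2) * width

fig, ax = plt.subplots(figsize=(8, 5))
for offset, (label, (counts, colour)) in zip(offsets, series.items()):
    ax.bar(years_index + offset, counts, width=width, color=colour, label=label)

ax.set_xticks(years_index)
ax.set_xticklabels(years)
ax.set_xlabel('Year')
ax.set_ylabel('Count')
ax.legend()
```

For five series and a width of 0.15 this reproduces the offsets -2*width, -width, 0, +width and +2*width used above, and it keeps working if a model is added or removed.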

Depending on the message to convey, the characteristics of the data, and the target audience, certain plots might be better suited than others. For example, the following catplot conveys both the number of saya designs across the five models and their additional costs. But as the number of data points increases, this does not scale well. It would be better to split the message, for example into the popularity of saya designs by quantity and the distribution of saya design prices. I took the latter direction with a radial plot. Its code snippet follows the catplot.

# catplot is figure-level and returns a FacetGrid, named g here
g = sns.catplot(x="model", y="saya(cost)", data=df_saya1,
                hue="saya", palette='tab20b', kind="swarm", s=10,
                height=4.5, aspect=2.5)
g.fig.suptitle('Saya across models',fontsize=ts)
g.fig.subplots_adjust(left=0.1, top=0.9)
g.set_xlabels('')
g.set_ylabels('price (JPY)', fontsize=yls)
g.set_xticklabels(fontsize=xts)
g.set_yticklabels(fontsize=yts)
g._legend.set_title('Saya')
plt.show()
Image by Author | Scalability is an issue as data points increase
# replicating saya color is impractical from experience. Use a sequential color scheme
palette_saya = ['#7eedff','#6dd7ed','#5dc2dc','#4dadc9','#3e99b7','#3085a5','#217192','#125e7f','#004c6d',
               'slateblue','rebeccapurple','purple','indigo']
# initialize the figure
plt.figure(figsize=(10,10))
ax = plt.subplot(111, polar=True)
plt.axis('off')
plt.title('Saya ranked by price (JPY)',y=.9,fontsize=ts)
# set coordinate limits
upperlimit = 100
lowerlimit = 30
# compute max and min of dataset
max_ = df_saya2['saya(cost)'].max()
min_ = df_saya2['saya(cost)'].min()
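# compute heights (conversion of saya(cost) into new coordinates)
# 0 is converted to the lower limit (30)
# max_ stays at max_, because the slope is (max_ - lowerlimit)/max_;
# to cap the bars at upperlimit instead, use (upperlimit - lowerlimit)/max_
# and reduce labelPadding to match the smaller scale
slope = (max_ - lowerlimit)/max_
heights = slope * df_saya2['saya(cost)'] + lowerlimit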
# width of each bar
width = 2*np.pi / len(df_saya2.index)
# compute angle each bar is centered on
indexes = list(range(1, len(df_saya2.index)+1))
angles = [element * width for element in indexes]
# draw
bars = ax.bar(x=angles, height=heights, width=width, bottom=lowerlimit,
              linewidth=1,edgecolor="white",color=palette_saya)
# padding between bar and label
labelPadding = 1000
# label
for bar, angle, height, label in zip(bars,angles,heights,df_saya2['saya_']):
    # specify rotation in degrees
    rotation = np.rad2deg(angle)

    #flip some labels upside down for readability
    alignment = ""
    if angle >= np.pi/2 and angle < 3*np.pi/2:
        alignment = "right"
        rotation += 180
    else:
        alignment = "left"

    # add label
    ax.text(x=angle, y=lowerlimit + bar.get_height() + labelPadding,
            s=label, ha=alignment, va='center', rotation=rotation, rotation_mode="anchor",size=12)
plt.show()
Image by Author | Unconventional but works in this case
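The two bits of arithmetic in the radial plot, rescaling values into a fixed range and orienting the labels, can be checked in isolation. The sketch below is a standalone version using only the standard library. Note that the rescaling here maps the maximum onto the upper limit, which is the variant described in the code comments rather than what the snippet above computes; the function names and costs are made up.

```python
import math


def rescale(values, lower, upper):
    """Linearly map values from [0, max(values)] onto [lower, upper]."""
    top = max(values)
    return [lower + (upper - lower) * v / top for v in values]


def label_orientation(angle):
    """Return (rotation in degrees, horizontal alignment) for a label at angle (radians)."""
    rotation = math.degrees(angle)
    if math.pi / 2 <= angle < 3 * math.pi / 2:
        # left half of the circle: flip the text so it is not upside down
        return rotation + 180, "right"
    return rotation, "left"


# made-up costs, for illustration only
heights = rescale([0, 5000, 10000], lower=30, upper=100)
print(heights)                      # [30.0, 65.0, 100.0]
print(label_orientation(0.0))       # (0.0, 'left')
print(label_orientation(math.pi))   # (360.0, 'right')
```

Keeping the heights in a small fixed range means the label padding can be a small constant, regardless of the currency scale of the data.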

Last but not least, xkcd styles. xkcd plots, with their balance of humour, scientific jargon, and perspectives, hold a special place in the minds of many. This plotting style can be invoked in matplotlib via plt.xkcd(), as shown next. The output is the first image of this article. It came as a surprise that a simple-looking plot in xkcd’s style requires a fair amount of planning and thought, particularly for annotation locations and arrow positioning.

# create a dataframe of anchor points (2019-11-01 and 2017-01-01)
# for positioning the annotations and the like later
data = {'date': ['2019-11-01','2017-01-01'], 
        'value': [22600,17500]}
df_dt = pd.DataFrame(data, columns = ['date', 'value'])
df_dt['date'] = pd.to_datetime(df_dt['date'])
# Create the postage chart in xkcd style
plt.style.use('default')
plt.xkcd()
# create yticks for labeling
yticks = np.arange(5000,25000,2500)
# Establish the size of the figure.
fig, ax = plt.subplots(figsize=(12, 6))
# Scatter plot
ax.scatter(df_fedex['date'],df_fedex['postage'], s=40,color='blue')
# vertical line
ax.vlines(df_dt['date'][0],5000,df_dt['value'][0],linestyle='-.',linewidth=1,color='r')
# annotate
ax.text(df_dt['date'][0],7500,'covid emerges',fontsize=18)
ax.annotate('4.5x increase!', fontsize=18,
            xy=(df_fedex['date'][df_fedex.index[-1]],22516), 
            xytext=(df_dt['date'][1],17500),
            arrowprops=dict(arrowstyle='-|>',
                           connectionstyle='angle3,angleA=0,angleB=-90')
           )
# Title & labels
ax.set_title('One more reason to mask up',fontsize=18)
ax.set_ylabel('Postage (JPY)', fontsize=yls)
ax.set_yticks(ticks=yticks)
ax.set_yticklabels(labels=yticks,size=yts)
plt.xticks(size=14)
# Rotate and align the tick labels so they look better.
fig.autofmt_xdate()
plt.tight_layout()

plt.show()
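Calling plt.xkcd() on its own changes the style for every plot that follows, which is why the snippets above reset it with plt.style.use('default'). A minimal sketch of an alternative is to use plt.xkcd() as a context manager, so the sketch style is limited to one block. The data here is made up, and the Agg backend is used only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# made-up values, for illustration only
x = [1, 2, 3, 4, 5]
y = [5, 6, 6, 12, 22]

with plt.xkcd():
    # everything created inside the block gets the sketch style
    sketch_inside = plt.rcParams['path.sketch']
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(x, y, color='blue')
    ax.set_title('Sketch style, scoped to this block')

# back outside, the previous rcParams are restored
sketch_outside = plt.rcParams['path.sketch']
print(sketch_inside, sketch_outside)
```

Figures created after the block fall back to the regular style without an explicit reset.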

Final Thoughts

Something I discovered through this project is that the structuring of plots is highly situational: aspects such as the target audience, the nature of the data, and colouring all come into play. The organization of plots during exploratory data analysis and for blogs may follow very different trains of thought as well. All in all, the entire project was an iterative process of data discovery, processing, data cleaning, analysis, code refactoring, and experimentation. The process also uncovered several issues upstream in the data capture. More Japanese language translation and tagging of item codes (e.g. tsuba codes) were needed to improve data quality; this has since been completed (see afternote). Further project extensions could include analysis of tsuba-related topics. Though the examples and approaches are by no means exhaustive, hopefully you have gained some insights to mitigate potential blockers in customizing pyplot and seaborn visualizations.

Thanks for reading!

_Afternote: 22 May 2021 – Details on tsuba furnished. The dataset and code can be accessed here_

