Customizing Plots with Python Matplotlib

Better insights through beautiful visualizations

Published in Towards Data Science · 6 min read · Jul 9, 2018

A central part of Data Science and Data Analysis is how you visualize the data. How you use visualization tools plays an important role in how you communicate insights.

My language of choice to explore and visualize data is Python.

In this article, I want to walk you through my framework for going from visualizing raw data to having a beautiful plot that is not just eye-catching but emphasizes the core insights you want to convey.

In this example, I'm going to use a dataset of workout sessions from a previous article. It looks like this:

Workout Dataset, where day category = 0/1 corresponds to weekday/weekend
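If you don't have workout_log.csv at hand, a stand-in DataFrame with the same five columns is enough to follow along. The sketch below is not the real log: every value is randomly generated, the helper name make_fake_workouts is hypothetical, and reading delta_last_workout as "days since the previous workout" is an assumption.

```python
import numpy as np
import pandas as pd

def make_fake_workouts(n=30, seed=0):
    """Builds a synthetic stand-in for workout_log.csv (all values are made up)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range('2018-01-01', periods=n, freq='D')
    distance = rng.uniform(2, 10, size=n).round(2)
    # assumption: roughly 5 to 7 minutes per km
    duration = (distance * rng.uniform(5, 7, size=n)).round(1)
    return pd.DataFrame({
        'date': dates,
        'distance_km': distance,
        'duration_min': duration,
        # assumption: days since the previous workout (dates are consecutive here)
        'delta_last_workout': 1,
        # 0 = weekday, 1 = weekend (Saturday and Sunday have dayofweek 5 and 6)
        'day_category': (dates.dayofweek >= 5).astype(int),
    })

df = make_fake_workouts()
print(df.head())
```

With this in place, the pd.read_csv and df.columns lines in the listings can be swapped for df = make_fake_workouts().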

A bare-bones scatter plot would look like this

You can replicate it with the following code:

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y)
    plt.show()

scatterplot(df, 'distance_km', 'duration_min')

The usual next step for me is to label the axes and add a title so each plot is appropriately labeled.

The code change is minimal, but definitely makes a difference.

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y)

    # adds a title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    plt.show()

scatterplot(df, 'distance_km', 'duration_min')

What about removing that box?

In order to change the default box around the plot, we have to actually remove some of the plot's borders.

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y)

    # adds a title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    # removing top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    plt.show()

scatterplot(df, 'distance_km', 'duration_min')
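The borders (spines) are stored on the axes keyed by side, so hiding several of them can be written as a small loop. A minimal sketch, with a few placeholder points standing in for the workout log; the helper name hide_spines is hypothetical.

```python
import matplotlib.pyplot as plt

def hide_spines(ax, sides=('top', 'right')):
    """Hides the given borders of an Axes (helper name is illustrative)."""
    for side in sides:
        ax.spines[side].set_visible(False)

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter([1, 2, 3], [10, 20, 30])  # placeholder points, not the workout data
hide_spines(ax)
# call plt.show() to display the figure in an interactive session
```

Passing sides=('top', 'right', 'left') would leave only the bottom border, which some people prefer for bar charts.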

Major Gridlines

Something I usually like to add to my plots is major gridlines. They help with readability by reducing the amount of white background. You can play around with their width (linewidth) and transparency (alpha).

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y)

    # adds a title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    # removing top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # adds major gridlines
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

    plt.show()

scatterplot(df, 'distance_km', 'duration_min')
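Two optional variations on the gridlines, shown here as a sketch with placeholder points rather than the workout data: restricting the grid to one axis, and asking Matplotlib to draw the gridlines underneath the dots with set_axisbelow. Whether you need the second one depends on your Matplotlib defaults, so treat it as something to try rather than a requirement.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter([1, 2, 3, 4], [12, 19, 33, 41], alpha=0.70)  # placeholder points

# only horizontal major gridlines, same styling as in the listing above
ax.grid(which='major', axis='y', color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

# keeps gridlines (and ticks) underneath the plotted dots
ax.set_axisbelow(True)
# call plt.show() to display the figure in an interactive session
```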

Aesthetics

You can see that some of the dots in the plot overlap. To improve readability even more, we can adjust the dots' transparency with the alpha parameter.

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))

    # customizes alpha for each dot in the scatter plot
    ax.scatter(x, y, alpha=0.70)

    # adds a title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    # removing top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # adds major gridlines
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

    plt.show()

scatterplot(df, 'distance_km', 'duration_min')

There is still a bit of overlap, but at least the transparency improved the readability of the majority of the dots.
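If the remaining overlap bothers you, two more scatter parameters are worth trying alongside alpha: s for the marker size and edgecolors for the marker outline. The values below are arbitrary starting points on placeholder data, not tuned for the workout log.

```python
import matplotlib.pyplot as plt

# placeholder points with deliberate near-duplicates
x = [5.0, 5.1, 5.05, 8.0, 8.1]
y = [30, 31, 30.5, 48, 47]

fig, ax = plt.subplots(figsize=(10, 5))
# smaller, more transparent markers without an outline
points = ax.scatter(x, y, alpha=0.5, s=20, edgecolors='none')
# call plt.show() to display the figure in an interactive session
```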

Colors

Since we have the day category, we can also color each dot according to whether the workout happened on a weekday or a weekend.

For that you can choose from two different approaches:

Pick the colors yourself using tools like Adobe Kuler’s color wheel
Use Matplotlib's color maps

#1 Defining your own color palette

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))

    # defining an array of colors (index 0 = weekday, index 1 = weekend)
    colors = ['#2300A8', '#00A658']

    # assigns a color to each data point based on its day category;
    # the color list passed to scatter needs one entry per point
    point_colors = [colors[int(c)] for c in df['day_category']]
    ax.scatter(x, y, alpha=0.70, color=point_colors)

    # adds a title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    # removing top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # adds major gridlines
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

    plt.show()

scatterplot(df, 'distance_km', 'duration_min')
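A hand-picked palette can also be wrapped in a color map, so that scatter does the category-to-color lookup for you. The sketch below assumes ListedColormap from matplotlib.colors and uses placeholder values in place of the workout log; vmin and vmax are pinned so the mapping still holds if only one category is present.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# same two colors as above: first entry for weekday (0), second for weekend (1)
palette = ListedColormap(['#2300A8', '#00A658'])

# placeholder values, not the real workout data
distance = [3.0, 5.5, 8.0, 4.2]
duration = [18, 33, 50, 25]
day_category = [0, 0, 1, 1]

fig, ax = plt.subplots(figsize=(10, 5))
points = ax.scatter(distance, duration, alpha=0.70,
                    c=day_category, cmap=palette, vmin=0, vmax=1)
# call plt.show() to display the figure in an interactive session
```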

#2 Using Python Color Maps

To paint each dot according to its day category I need to introduce a few new components in the code

Import the color map library
Take the day category as a parameter, so the corresponding color can be mapped
Use parameter c from the scatter method to assign the color sequence
Use parameter cmap to assign the color map to be used. I'm going to use the brg color map

import pandas as pd
import matplotlib.cm as cm
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim, category):
    x = df[x_dim]
    y = df[y_dim]

    fig, ax = plt.subplots(figsize=(10, 5))

    # applies the custom color map along with the color sequence
    ax.scatter(x, y, alpha=0.70, c=df[category], cmap=cm.brg)

    # adds a title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    # removing top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # adds major gridlines
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

    plt.show()

scatterplot(df, 'distance_km', 'duration_min', 'day_category')
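It can help to check which colors the map actually hands out. By default, scatter rescales the values passed to c onto the 0 to 1 range before looking them up, so with only two categories just the two ends of brg are used. A quick check, assuming plt.get_cmap is available in your Matplotlib version:

```python
import matplotlib.pyplot as plt

brg = plt.get_cmap('brg')

# RGBA tuples at the two ends of the color map
weekday_rgba = brg(0.0)  # low end, used for day_category = 0
weekend_rgba = brg(1.0)  # high end, used for day_category = 1

print(weekday_rgba, weekend_rgba)
```

The low end comes out blue and the high end green, which is why the red in the middle of brg never shows up in this plot.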

Legends

So far, we've been using the native scatter method to plot each data point. In order to add a legend, we'll have to change the code a little bit.

We'll have to

Take the day category as a parameter, so we have our labels
Convert the numerical (0,1) labels into categorical labels (weekday, weekend)
Iterate through the dataset in order to assign a label to each data point

import pandas as pd
import matplotlib.pyplot as plt

# loading dataset
df = pd.read_csv('workout_log.csv')
df.columns = ['date', 'distance_km', 'duration_min', 'delta_last_workout', 'day_category']

def scatterplot(df, x_dim, y_dim, category):
    x = df[x_dim]
    y = df[y_dim]

    # converting original (numerical) labels into categorical labels
    categories = df[category].apply(lambda v: 'weekday' if v == 0 else 'weekend')

    fig, ax = plt.subplots(figsize=(10, 5))

    # one color per category, so color and label always agree
    colors = {'weekday': '#2300A8', 'weekend': '#00A658'}

    # iterates through the dataset plotting each data point with its color and label
    # (.iloc replaces .ix, which was removed from pandas)
    for i in range(len(df)):
        label = categories.iloc[i]
        ax.scatter(x.iloc[i], y.iloc[i], alpha=0.70, color=colors[label], label=label)

    # adds title and axes labels
    ax.set_title('Distance vs Workout Duration')
    ax.set_xlabel('Distance (Km)')
    ax.set_ylabel('Workout Duration (min)')

    # removing top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

    # adds major gridlines
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

    # adds legend with one entry per category instead of one per point
    handles, labels = ax.get_legend_handles_labels()
    unique = dict(zip(labels, handles))
    ax.legend(unique.values(), unique.keys())
    plt.show()

scatterplot(df, 'distance_km', 'duration_min', 'day_category')
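Plotting point by point works, but it gets slow on larger datasets. An alternative is to call scatter once per category, which also gives the legend exactly one entry per label. A sketch of that variant, assuming pandas groupby and a few placeholder rows in place of the workout log:

```python
import pandas as pd
import matplotlib.pyplot as plt

# placeholder rows standing in for workout_log.csv
df = pd.DataFrame({
    'distance_km': [3.0, 5.5, 8.0, 4.2],
    'duration_min': [18, 33, 50, 25],
    'day_category': [0, 0, 1, 1],
})

names = {0: 'weekday', 1: 'weekend'}
colors = {'weekday': '#2300A8', 'weekend': '#00A658'}

fig, ax = plt.subplots(figsize=(10, 5))

# one scatter call per day category, each with its own color and label
for value, group in df.groupby('day_category'):
    label = names[value]
    ax.scatter(group['distance_km'], group['duration_min'],
               alpha=0.70, color=colors[label], label=label)

ax.legend()
# call plt.show() to display the figure in an interactive session
```

The trade-off is that the per-point loop is easier to adapt if you later want to style individual points, while the grouped version keeps the number of scatter calls equal to the number of categories.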

And there you have it! A customized scatter plot from which it's now easier to understand the data and draw some insights.

Thanks for reading!