
Pandas is the go-to Python library for data analysis and manipulation. It provides numerous functions and methods that expedite the data analysis process.
When it comes to data visualization, pandas is not the prominent choice because there exist great visualization libraries such as matplotlib, seaborn, and plotly.
That said, we cannot just ignore the plotting tools of pandas. They help to discover relations within dataframes or series, and the syntax is pretty simple. Very informative plots can be created with just one line of code.
In this post, we will cover 6 plotting tools of pandas which definitely add value to the exploratory data analysis process.
The first step to create a great machine learning model is to explore and understand the structure and relations within the data.
These 6 plotting tools will help you understand the data better:
- Scatter matrix plot
- Density plot
- Andrews curves
- Parallel coordinates
- Lag plots
- Autocorrelation plot
I will use a diabetes dataset available on Kaggle. Let’s first read the dataset into a pandas dataframe.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("/content/diabetes.csv")
print(df.shape)
df.head()

The dataset contains 8 numerical features and a target variable indicating if the person has diabetes.
1. Scatter matrix plot
Scatter plots are typically used to explore the correlation between two variables (or features). The values of the data points are shown using Cartesian coordinates.
A scatter matrix plot produces a grid of scatter plots, one for each pair of variables, with just one line of code.
from pandas.plotting import scatter_matrix
subset = df[['Glucose','BloodPressure','Insulin','Age']]
scatter_matrix(subset, figsize=(10,10), diagonal='hist')

I’ve selected a subset of the dataframe with 4 features for demonstration purposes. The diagonal shows the histogram of each variable, but we can change it to show a kde plot by setting the diagonal parameter to 'kde'.
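To make the output of the scatter matrix more concrete, here is a minimal sketch on synthetic stand-in data (the `demo` dataframe and its column names are assumptions for illustration, not part of the diabetes dataset). It shows that `scatter_matrix` returns a grid of matplotlib axes, and that `DataFrame.corr()` gives a numeric summary of the same pairwise relations.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in data (an assumption; not the diabetes dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.5, size=200),  # strongly related to "a"
    "c": rng.normal(size=200),                      # unrelated to "a"
})

# scatter_matrix returns a 2-D NumPy array of matplotlib Axes,
# one per (row, column) pair of features.
axes = scatter_matrix(demo, figsize=(8, 8), diagonal="hist")
print(axes.shape)

# Pairwise Pearson correlations summarize the same grid numerically.
corr = demo.corr()
print(corr.round(2))
```

In the plot, the "a" versus "b" panel should show a tight diagonal band, while the panels involving "c" should look like shapeless clouds.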
2. Density plot
We can produce density plots by calling the plot.kde() method on a series or dataframe.
subset = df[['Glucose','BloodPressure','BMI']]
subset.plot.kde(figsize=(12,6), alpha=1)

We can see the distribution of each feature with one line of code. The alpha parameter controls the opacity of the lines (1 is fully opaque).
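For readers curious about what a density plot computes, below is a rough sketch of a Gaussian kernel density estimate written with NumPy only. The function name `gaussian_kde_1d`, the synthetic samples, and the choice of Scott's rule for the bandwidth are assumptions made for this illustration; pandas relies on SciPy for the real computation, and its results may differ in detail.

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = samples.size
    # Scott's rule of thumb for the bandwidth (an assumption of this sketch).
    bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per sample, averaged over all samples.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=120, scale=30, size=500)  # synthetic values
grid = np.linspace(0, 240, 481)
density = gaussian_kde_1d(samples, grid)

# A density curve should enclose an area of roughly 1.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Each sample contributes a small bell curve, and the density is their average. A smaller bandwidth gives a wigglier curve, a larger one a smoother curve.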
3. Andrews curves
Andrews curves, named after the statistician David F. Andrews, are a tool for plotting multivariate data as a collection of curves. Each curve is created by using the attributes (features) of a sample as the coefficients of a Fourier series.
We get an overview of clustering of different classes by coloring the curves that belong to each class differently.
from pandas.plotting import andrews_curves
plt.figure(figsize=(12,8))
subset = df[['Glucose','BloodPressure','BMI', 'Outcome']]
andrews_curves(subset, 'Outcome', colormap='Paired')

We need to pass a dataframe and the name of the variable that holds the class information. The colormap parameter is optional. There seems to be a clear distinction (with some exceptions) between the 2 classes based on the features in the subset.
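The Fourier series idea can be sketched in a few lines. The function below follows the commonly cited form of the Andrews curve, f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ..., evaluated for t between -pi and pi. The helper name `andrews_curve` and the feature values are hypothetical and used only for illustration; this is not the pandas implementation itself.

```python
import numpy as np

def andrews_curve(row, t):
    """Andrews curve of one sample: x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ..."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first feature sets the constant offset of the curve.
    curve = np.full_like(t, row[0] / np.sqrt(2))
    # Remaining features alternate between sine and cosine terms.
    for i, coeff in enumerate(row[1:]):
        harmonic = i // 2 + 1  # 1, 1, 2, 2, 3, 3, ...
        trig = np.sin if i % 2 == 0 else np.cos
        curve += coeff * trig(harmonic * t)
    return curve

t = np.linspace(-np.pi, np.pi, 200)
# Hypothetical (Glucose, BloodPressure, BMI) values for two samples.
sample_a = andrews_curve([148, 72, 33.6], t)
sample_b = andrews_curve([85, 66, 26.6], t)
print(sample_a[:3], sample_b[:3])
```

Samples with similar feature values produce curves that stay close together, which is why classes that differ in their features tend to form visibly separate bundles.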
4. Parallel coordinates
Parallel coordinates is another tool for plotting multivariate data. Let’s first create the plot and then talk about what it tells us.
from pandas.plotting import parallel_coordinates
cols = ['Glucose','BloodPressure','BMI', 'Age']
plt.figure(figsize=(12,8))
parallel_coordinates(df,'Outcome',color=['Blue','Gray'],cols=cols)
We first import parallel_coordinates from the pandas plotting tools, create a list of columns to use, and create a matplotlib figure. The last line creates the parallel coordinates plot. We pass a dataframe and the name of the class variable. The color parameter is optional and determines the color for each class. Finally, the cols parameter selects the columns to be used in the plot. If it is not specified, all columns are used.

Each column is represented by a vertical line. The lines running horizontally across the plot represent data points (rows in the dataframe). We get an overview of how the classes are separated according to the features. The "Glucose" variable seems to be a good predictor for separating these two classes. On the other hand, lines of different classes overlap on "BloodPressure", which indicates that it does not perform well in separating the classes.
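Because all columns share one vertical axis in a parallel coordinates plot, a feature with a much larger range can squash the others. One possible remedy, shown here as a sketch on synthetic data (the `demo` dataframe and its column names are made up for illustration), is to min-max scale the features to the range [0, 1] before plotting.

```python
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

# Synthetic data with two features on very different scales.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "small_scale": rng.uniform(0, 1, size=50),
    "large_scale": rng.uniform(0, 500, size=50),
    "label": rng.integers(0, 2, size=50),
})

features = ["small_scale", "large_scale"]
scaled = demo.copy()
# Min-max scaling: map each feature to the range [0, 1].
scaled[features] = (demo[features] - demo[features].min()) / (
    demo[features].max() - demo[features].min()
)

# Plot the scaled features, colored by class.
ax = parallel_coordinates(scaled, "label", cols=features, color=["blue", "gray"])
```

Scaling changes only how the lines are drawn, not the order of the values within a column, so the class separation along each vertical line is preserved.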
5. Lag plot
Lag plots are used to check the randomness in a data set or time series. If a structure is displayed in lag plot, we can conclude that the data is not random.
from pandas.plotting import lag_plot
plt.figure(figsize=(10,6))
# lag_plot expects a single series, so we pass one column of the dataframe
lag_plot(df['Glucose'])

The lag plot shows no structure, which suggests that the data is random.
Let’s see an example of non-random data. I will use the synthetic sample from the pandas documentation.
spacing = np.linspace(-99 * np.pi, 99 * np.pi, num=1000)
data = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(spacing))
plt.figure(figsize=(10,6))
lag_plot(data)

We can clearly see a structure in the lag plot, so the data is not random.
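The visual impression from a lag plot can be backed by a number. The sketch below rebuilds the pairs that a lag plot with lag 1 draws, y(t) against y(t + 1), and uses `Series.autocorr` to measure their correlation. The seeded generator and the variable names are choices made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
spacing = np.linspace(-99 * np.pi, 99 * np.pi, num=1000)
sine = pd.Series(0.1 * rng.random(1000) + 0.9 * np.sin(spacing))
noise = pd.Series(rng.standard_normal(1000))

# A lag plot with lag=1 is a scatter of y(t) against y(t + 1).
pairs = pd.DataFrame({
    "y_t": sine.iloc[:-1].to_numpy(),
    "y_t_plus_1": sine.iloc[1:].to_numpy(),
})

# Series.autocorr returns the Pearson correlation between
# the series and a copy of itself shifted by `lag` steps.
sine_r = sine.autocorr(lag=1)
noise_r = noise.autocorr(lag=1)
print(round(sine_r, 2), round(noise_r, 2))
```

The structured series should give a lag-1 correlation well away from zero, while the noise series should stay close to zero.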
6. Autocorrelation plot
Autocorrelation plots are used to check the randomness in time series. They are produced by calculating the autocorrelations for data values at varying time lags.
The lag is the time difference between the two values being compared. If the autocorrelations are very close to zero for all time lags, the time series is random.
If we observe one or more significantly non-zero autocorrelations, we can conclude that the time series is not random.
Let’s first create a random time series and see the autocorrelation plot.
noise = pd.Series(np.random.randn(250)*100)
noise.plot(figsize=(12,6))

This time series is clearly random. The autocorrelation plot of this time series:
from pandas.plotting import autocorrelation_plot
plt.figure(figsize=(12,6))
autocorrelation_plot(noise)

As expected, all autocorrelation values are very close to zero.
Let’s look at an example of a non-random time series. The plot below shows a very simple upward trend.
upward = pd.Series(np.arange(100))
upward.plot(figsize=(10,6))
plt.grid()

The autocorrelation plot for this time series:
plt.figure(figsize=(12,6))
autocorrelation_plot(upward)

This autocorrelation clearly indicates a non-random time series as there are many significantly non-zero values.
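To show what "significantly non-zero" can mean in practice, here is a sketch that computes the sample autocorrelation by hand and compares it to an approximate 95% band of 1.96 / sqrt(n) for white noise. The helper `autocorrelation` and the choice of 20 lags are assumptions for this illustration; the plotting function may differ in details.

```python
import numpy as np

def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given lag."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    # Variance of the series (lag-0 autocovariance).
    c0 = np.sum((values - mean) ** 2) / n
    # Autocovariance between the series and its lagged copy.
    ch = np.sum((values[: n - lag] - mean) * (values[lag:] - mean)) / n
    return ch / c0

rng = np.random.default_rng(0)
noise = rng.standard_normal(250) * 100
upward = np.arange(100)

# Approximate 95% band for a purely random series of this length.
band_noise = 1.96 / np.sqrt(noise.size)

noise_r = [autocorrelation(noise, k) for k in range(1, 21)]
upward_r = [autocorrelation(upward, k) for k in range(1, 21)]

# Count how many noise autocorrelations fall outside the band.
outside = sum(abs(r) > band_noise for r in noise_r)
print(round(band_noise, 3), outside, round(upward_r[0], 2))
```

For the random series, only an occasional lag should fall outside the band by chance, whereas the upward trend gives a lag-1 value close to 1.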
It is very easy to visually check the non-randomness of simple upward and downward trends. However, in real-life data sets, we are likely to see highly complex time series. We may not be able to see the trends or seasonality in those series. In such cases, autocorrelation plots are very helpful for time series analysis.
Pandas provides two more plotting tools: the bootstrap plot and RadViz. They can also be used in the exploratory data analysis process.
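As a quick starting point for these two tools, here is a minimal sketch on synthetic data. The `demo` dataframe, its column names, and the `size` and `samples` values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from pandas.plotting import bootstrap_plot, radviz

# Synthetic data with three features and a class label.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((60, 3)), columns=["f1", "f2", "f3"])
demo["label"] = np.where(demo["f1"] > 0.5, "high", "low")

# RadViz places each sample inside a circle, pulled toward
# the features on which it has relatively high values.
ax = radviz(demo, "label")

# The bootstrap plot resamples a series repeatedly and shows the
# spread of the mean, median, and mid-range across the resamples.
fig = bootstrap_plot(demo["f1"], size=30, samples=100)
```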
Thank you for reading. Please let me know if you have any feedback.