
5 Powerful Visualisations with Pandas for Data Preprocessing

"One look is worth a thousand words." Exploratory data analysis (EDA) to analyse data points is the right steps to formulate hypotheses

before data modelling for algorithms.

Plotted by the author using Python code mentioned in the article

One of the most common pitfalls I observe repeatedly among relatively junior data science and machine learning professionals is spending hours finding the best algorithm for their project and not spending enough time understanding the data first.

A structured way to approach data science and machine learning projects starts with the project objective. The same set of data points can yield meaningful information about several things. Based on what we are looking for, we need to focus on different aspects of the data. Once we are clear on the objective, we should start thinking about the data points we require. This will enable us to focus on the most pertinent sets of information and ignore the data sets that may not be important.

In real life, most of the time, data collected from several sources has blank values, typos and other anomalies. It is vital to clean the data before any data analysis.
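As an illustration, a minimal sanity-check sketch in pandas might look like the following; the frame and column names here are hypothetical and purely for demonstration:

import pandas as pd

# Hypothetical raw data with a blank value and an inconsistent label
raw = pd.DataFrame({"price": [10.5, None, 12.0],
                    "category": ["toy", "Toy", "tool"]})
print(raw.isna().sum())                        # count blank (NaN) values per column
raw["category"] = raw["category"].str.lower()  # normalise inconsistent labels
clean = raw.dropna()                           # drop rows with missing values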

In this article, I will discuss five powerful data visualisation options which instantly provide a sense of the data's characteristics. Performing an EDA conveys a lot about the data and the relationships among the features even before formal modelling or hypothesis testing.

In the next article of this series, I discuss Advanced Visualisation for Exploratory Data Analysis (EDA).

Step 1: We will import the packages pandas, matplotlib, seaborn and NumPy, which we are going to use for our analysis.

We require scatter_matrix, autocorrelation_plot, lag_plot and parallel_coordinates from pandas.plotting for our charts.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix
from pandas.plotting import autocorrelation_plot
from pandas.plotting import parallel_coordinates
from pandas.plotting import lag_plot

Step 2: The Seaborn package ships with a few small built-in datasets. We will use the "mpg", "tips" and "attention" datasets for our visualisations. The datasets are loaded using the load_dataset method in seaborn.

"""Download the datasets used in the program """
CarDatabase= sns.load_dataset("mpg")
MealDatabase= sns.load_dataset("tips")
AttentionDatabase= sns.load_dataset("attention")

Hexbin plots

We often use a scatter plot to get a quick grasp of the relationship between variables. It is really helpful for gaining insight, as long as the plot is not overcrowded with densely populated data points.

In the below code, we plot a scatter plot of the "acceleration" and "horsepower" data points in the "mpg" dataset.

plt.scatter(CarDatabase.acceleration, CarDatabase.horsepower, marker="^")
plt.show()

Points are densely populated in the scatter plot, and it is a bit difficult to get meaningful information from it.

Plotted by the author using Python code mentioned in the article

Hexbin plots are a very good alternative for addressing overcrowded scatter plots. In a hexbin plot, points are not plotted individually; instead, the plotting area is divided into hexagonal bins, and the colour of each hexagon reflects the number of points falling within it.

In the below code, we plot a hexbin for the same dataset between "acceleration" and "horsepower".

CarDatabase.plot.hexbin(x='acceleration', y='horsepower', gridsize=10, cmap="YlGnBu")
plt.show()

We can clearly deduce the concentration of the acceleration and horsepower value ranges in the hexbin plot, along with a negative linear relationship between the variables. The size of the hexagons depends on the "gridsize" parameter.

Self-exploration: I would encourage you to alter the gridsize parameter and observe the changes in the hexbin plot, for example as in the sketch below.

Plotted with the code mentioned in the article
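A minimal sketch of that exploration, reusing the CarDatabase frame loaded above, could loop over a few gridsize values and compare the resulting plots:

# Compare hexbin plots for a few gridsize values
for size in (5, 10, 25):
    CarDatabase.plot.hexbin(x='acceleration', y='horsepower',
                            gridsize=size, cmap="YlGnBu")
    plt.title(f"gridsize={size}")
    plt.show()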

Heatmaps

Heatmaps are my personal favourite for viewing the correlations among different variables. Those of you who follow me on Medium may have observed that I use them often.

In the below code, we calculate the pairwise correlation among all the numeric variables in the seaborn "mpg" dataset and plot it as a heatmap.

sns.heatmap(CarDatabase.corr(numeric_only=True), annot=True, cmap="YlGnBu")
plt.show()

We can see that "cylinders" and "horsepower" are closely positively related (as expected in a car) and that weight is inversely related to acceleration. We can understand the indicative relationships among all the different variables quickly with just a couple of lines of code.

Plotted with the code mentioned in the article
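If you also want to read the strongest relationships off programmatically rather than visually, a small sketch along these lines ranks the pairwise correlations from the same matrix (note that drop_duplicates relies on mirrored pairs sharing identical values):

# Rank variable pairs by absolute correlation strength
corr = CarDatabase.corr(numeric_only=True)
pairs = corr.abs().unstack().sort_values(ascending=False)
pairs = pairs[pairs < 1.0]             # remove the diagonal (self-correlations)
print(pairs.drop_duplicates().head())  # each pair appears twice; keep one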

Autocorrelation plot

Autocorrelation plots are a quick litmus test to ascertain whether the data points are random. If the data points follow a certain trend, one or more of the autocorrelations will be significantly non-zero. The dashed lines in the plot mark the 99% confidence band.

In the code below, we check whether the total_bill amounts in the "tips" dataset are random.

autocorrelation_plot(MealDatabase.total_bill)
plt.show()

We can see that the autocorrelation plot stays very close to zero for all time lags, suggesting that the total_bill data points are random.

Plotted with the code mentioned in the article

When we draw the autocorrelation plot for data points that follow a particular order, we can see that the plot is significantly non-zero.

data = pd.Series(np.arange(12, 7000, 16.3))
autocorrelation_plot(data)
plt.show()
Plotted with the code mentioned in the article

Lag Plots

Lag plots are also helpful to verify if the dataset is a random set of values or follows a certain trend.

When we draw the lag plot of the total_bill values from the "tips" dataset, just as in the autocorrelation plot, it suggests random data, with values all over the place.

lag_plot(MealDatabase.total_bill)
plt.show()
Plotted with the code mentioned in the article

When we draw the lag plot of a non-random data series, as shown in the code below, we get a nice smooth line.

data = pd.Series(np.arange(-12*np.pi, 300*np.pi, 10))
lag_plot(data)
plt.show()
Plotted with the code mentioned in the article

Parallel coordinates

It is always a challenge to wrap our heads around data with more than three dimensions. Parallel coordinates are very useful for plotting higher-dimensional datasets. Each dimension is represented by a vertical line.

In parallel coordinates, "N" equally spaced vertical lines represent the "N" dimensions of the dataset. The position of the vertex on the n-th axis corresponds to the n-th coordinate of the point.

Confusing!

Let us consider small sample data with five features for small- and large-sized widgets.

A vertical line represents each feature of the widget, and a continuous series of line segments represents each "small" and "large" widget's feature values (reproduced in the sketch below).

Plotted with the code mentioned in the article
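The widget chart above can be reproduced with a sketch along the following lines; the feature values are made up purely for illustration:

# Hypothetical widget data: five features plus a size label
widgets = pd.DataFrame({
    "height": [1.0, 1.2, 3.8, 4.1],
    "width":  [0.9, 1.1, 3.5, 4.0],
    "depth":  [1.1, 1.0, 3.9, 4.2],
    "weight": [0.5, 0.6, 2.8, 3.0],
    "price":  [2.0, 2.2, 7.5, 8.0],
    "size":   ["small", "small", "large", "large"],
})
parallel_coordinates(widgets, "size", color=('#556270', '#C7F464'))
plt.show()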

The code below plots the parallel coordinates for the "attention" dataset in seaborn. Please note that points that cluster appear closer together.

parallel_coordinates(AttentionDatabase, "attention", color=('#556270', '#C7F464'))
plt.show()
Plotted with the code mentioned in the article

I hope you will start using these out-of-the-box plots to perform exploratory data analysis, if you are not already using them. I would love to hear about your favourite visualisation plots for EDA.

Learn about the interactive visualisation of composite estimators and pipelines introduced by Scikit-Learn in May 2020.

Read my article on Advanced Visualisation for Exploratory Data Analysis (EDA) to learn more on this topic.

If you would like to learn about TensorBoard and visualisation in deep learning models, then read the article Accuracy Visualisation In Deep Learning.

In case you would like to learn a structured approach to identifying the appropriate independent variables to make accurate predictions, read my article "How to identify the right independent variables for Machine Learning Supervised".

"""Full code"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot
import seaborn as sns
from pandas.plotting import scatter_matrix
from pandas.plotting import autocorrelation_plot
from pandas.plotting import parallel_coordinates
from pandas.plotting import lag_plot
CarDatabase= sns.load_dataset("mpg")
MealDatabase= sns.load_dataset("tips")
AttentionDatabase= sns.load_dataset("attention")
plt.scatter(CarDatabase.acceleration ,CarDatabase.horsepower, marker="^")
plt.show()
CarDatabase.plot.hexbin(x='acceleration', y='horsepower', gridsize=10,cmap="YlGnBu")
plt.show()
sns.heatmap(CarDatabase.corr(), annot=True, cmap="YlGnBu")
plt.show()
autocorrelation_plot(MealDatabase.total_bill)
plt.show()
data = pd.Series(np.arange(12,7000,16.3))
autocorrelation_plot(data)
plt.show()
lag_plot(MealDatabase.total_bill)
plt.show()
data = pd.Series(np.arange(-12*np.pi,300*np.pi,10))
lag_plot(data)
plt.show()
parallel_coordinates(AttentionDatabase,"attention",color=('#556270', '#C7F464'))
plt.show()
