Crash Course on Exploratory Data Analysis in Python

Exploratory data analysis (EDA) is the process of exploring data and investigating its structure to discover patterns and to spot anomalies that deviate from those patterns.
EDA then involves summarizing the data with statistics and visualization methods to surface patterns that are not apparent from the raw numbers.
Ideally, EDA should bring out insights and realizations about the data that cannot be obtained through formal modeling and hypothesis testing.
When done properly, EDA can dramatically simplify or advance your data science problem and may even solve it!
THE GOALS OF THE EDA PROCESS
A proper EDA hopes to accomplish several goals:
- To question the data and determine if there are problems inherent in the dataset;
- To determine if the data on hand is sufficient to answer a particular research question or whether additional feature engineering is required;
- To develop a framework for answering the research question;
- And to refine the questions and/or research problem based on what you have learned about the data.
As one can see, EDA is an iterative process itself.
Unlike what most people think, EDA is not a ‘formal process’. In fact, anything with the word "exploratory" in it is, by definition, not bound by strict rules.
Rather, EDA is free-form. It is at this stage that you start to chase the ideas you have and see which ones lead somewhere. Some of these ideas may not work out, and you can then pursue or branch off from other, more viable ideas.
EDA therefore gets you thinking critically about how ‘answerable’ your problem is and whether it matches your expectations.
EDA CHEAT SHEET
Exploratory Data Analysis is generally classified along two axes: graphical versus non-graphical, and univariate versus multivariate.

Some references argue that one should do the non-graphical methods before the graphical ones. The argument is that the non-graphical methods familiarize you with the dataset (for example, which variables are quantitative and which are qualitative), which in turn makes visualization easier.
PRE-EDA
Some sources consider this part of the EDA process, while others treat it as data preprocessing and a prerequisite for an effective EDA. Let’s go with the latter.
Before implementing the four (4) categories, data scientists should do the following:
- Identify the dataset dimensions – how many observations (rows) there are versus the number of features (columns). Some algorithms, as in certain optimization problems, fail to produce the desired outputs when the number of features exceeds the number of observations.
- Identify data types – particularly in Python, EDA methods, especially the graphical ones, work best with certain data types. Even when the data looks similar (for example, numeric columns), some methods will not work if a column's dtype is "object" rather than "float" or "int".
- Identify the target or output variable – in some of the graphical methods, it is important to identify the target or output variable so that the individual variables can be visualized against it.
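These checks can be sketched in a few lines; the frame and column names below are hypothetical, and pd.to_numeric is one common way to coerce a mis-typed numeric column:

```python
import pandas as pd

# Hypothetical frame: 'amount' arrives as strings, so its dtype is "object"
df = pd.DataFrame({'amount': ['10.5', '20.0', '7.25'],
                   'group': ['a', 'b', 'a']})

print(df.shape)    # (observations, features) -> (3, 2)
print(df.dtypes)   # 'amount' shows up as object, not float

# Coerce to numeric so numeric EDA methods work;
# errors='coerce' turns unparseable entries into NaN instead of raising
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
print(df.dtypes)   # 'amount' is now float64
```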
A good way to accomplish this is to use the .info() method of pandas:
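As a minimal sketch (using a small hand-made frame, since .info() prints the same summary for any DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'total_bill': [16.99, 10.34, None],
                   'day': ['Sun', 'Sun', 'Mon']})

# Prints the dimensions, each column's dtype, and its non-null count
df.info()
```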

It likewise provides the "non-null" counts, which are part of the univariate non-graphical method.
UNIVARIATE NON-GRAPHICAL METHOD
For univariate non-graphical EDA, we want to know the range of values and frequency of each value.
For quantitative variables, we want to look at the location (mean, median), spread (IQR, standard deviation, range), modality (mode), shape (skewness, kurtosis), and outliers.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Aesthetics
sns.set_style("darkgrid")
%matplotlib inline

# Loading a built-in dataset:
tips = sns.load_dataset('tips')

# Quantitative
tips.total_bill.describe()

Pandas has a range of methods to provide the other information we want to know such as mode, skewness, and kurtosis.
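For instance, on an illustrative series:

```python
import pandas as pd

s = pd.Series([10.0, 12.5, 12.5, 15.0, 40.0])

print(s.mode())       # most frequent value(s): 12.5
print(s.skew())       # positive here, since the long tail is on the right
print(s.kurtosis())   # excess kurtosis (a normal distribution gives ~0)
```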

For IQR, we need to do some calculations:
# IQR
# Calculate the interquartile range
q3, q1 = np.percentile(tips.total_bill, [75, 25])
iqr = q3 - q1

# Display the interquartile range
iqr
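The same quantity can also be computed directly in pandas with the quantile method:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Q3 - Q1, using pandas' default (linear) interpolation
iqr = s.quantile(0.75) - s.quantile(0.25)
print(iqr)  # 4.0 here: Q3 = 7.0 and Q1 = 3.0
```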

The describe method of pandas is most commonly used for numeric columns, but it adapts to the column's dtype: called on a categorical column, it reports the count, the number of unique values, the top value, and its frequency. On a whole DataFrame, the parameter include=['O'] (or include=['category'] for categorical dtypes) restricts the summary to the non-numeric columns.
tips.sex.describe()
UNIVARIATE GRAPHICAL METHOD
One of my favorite graphs to use is the histogram, as it gives us an idea of the distribution of the data.
On that note, approximating the data distribution this way requires the variable to be quantitative.
iris = sns.load_dataset('iris')
sns.histplot(data=iris, x="sepal_length")

A density plot is a smoother version of a histogram:
sns.kdeplot(data=iris, x="sepal_length")

For categorical variables, bar charts are useful:
sns.barplot(x="day", y='tip', data=tips)

By default, Seaborn's barplot puts the mean of the y variable on the y-axis. Some sources recommend representing the "count" (the number of observations) instead, but that is no different from a histogram where the x-axis is the categorical variable you want to count:
sns.histplot(data=tips, x="day")

One of the most useful visualizations there is for numeric data is the boxplot.
sns.boxplot(y=tips["total_bill"])

Boxplots (a.k.a. box-and-whisker plots) are excellent at representing information on central tendency, symmetry, and skew, as well as outliers. About the only thing they lack is information on multimodality.

The lower and upper whiskers are drawn out to the most extreme data points that are less than 1.5 IQRs beyond the corresponding hinges. So the whiskers represent the maximum and minimum values of data excluding outliers. The dots after the whiskers represent these outlier values.
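The whisker rule translates directly into code. A sketch of flagging outliers with the 1.5 x IQR fences (illustrative values):

```python
import pandas as pd

s = pd.Series([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 20.0])

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the dots a boxplot draws individually
outliers = s[(s < lower_fence) | (s > upper_fence)]
print(outliers)  # only 20.0 lies beyond the upper fence
```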
Symmetry can be assessed from the location of the median. If it cuts the box such that the lengths of the whiskers are equal, the distribution is symmetric. In a skewed dataset, the median is pushed towards the shorter whisker, while the longer whisker points in the direction of the high (or extreme) values. A longer upper whisker means the dataset is positively skewed (right-skewed), while a longer lower whisker means it is negatively skewed (left-skewed).
For kurtosis, the presence of many outliers may indicate one of two things:
- presence of fat tails (or positive kurtosis) or
- data entry errors.
If the dataset is huge, the presence of short-whiskers may indicate negative kurtosis.
Last, but not least, the Normal-Quantile plot can be used to detect left or right skew, negative or positive kurtosis, and bimodality.
For this purpose, the statsmodels package can help us:
import statsmodels.api as sm
sm.qqplot(iris.sepal_length)
![The Normal-Quantile plot compares the quantiles of two distributions. If the two distributions are similar, the points fall along the y = x reference line. In the example above, the distribution of iris['sepal_length'] is compared against the normal distribution.](https://towardsdatascience.com/wp-content/uploads/2021/02/1gcvs_Kyqi731HgbMQWCMzA.png)
As for the other univariate visualizations, they can be viewed as special cases of the multivariate ones.
MULTIVARIATE NON-GRAPHICAL
For one categorical and one quantitative variable, ideally, we want to present the "standard univariate nongraphical statistics for the quantitative variables separately for each level of the categorical variable". Below is an example of this:
# Separate the categorical variables into one list and the quantitative variables into another:
category = ['sex', 'smoker', 'day', 'time']
quantitative = ['total_bill', 'tip', 'size']

for i in category:
    for j in quantitative:
        print(f'--------------{i} vs {j} ---------------')
        display(tips.groupby(i)[j].describe().reset_index())


For quantitative variables, the following methods can be used to generate the correlation, covariance, and descriptive statistics:
tips.cov(numeric_only=True)   # numeric_only=True is required on newer pandas versions

tips.corr(numeric_only=True)  # likewise restricts the computation to numeric columns

tips.describe()

For missing data detection, we can use the .info() method as described earlier.
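A common complement to .info() is counting the missing values per column directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': ['x', 'y', None]})

# Number of missing entries per column
print(df.isna().sum())

# Or as a fraction of the rows, which scales better across datasets
print(df.isna().mean())
```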
MULTIVARIATE GRAPHICAL
For two categorical variables, the most commonly used EDA plot is the grouped bar plot. As with the regular barplot, the target/output/dependent variable should be placed on the y-axis:
sns.catplot(
    data=tips, kind="bar",
    x="day", y="tip", hue="sex",
    ci="sd", palette="dark", alpha=.6, height=6
)

If the boxplot is one of the most useful EDA plots for univariate analysis, the same can be said of its multivariate counterpart. The side-by-side boxplot is used to analyze one quantitative variable against one categorical variable:
sns.boxplot(x="day", y="total_bill", data=tips)

For two quantitative variables, a scatter plot is ideal. In some cases, it quickly reveals a pattern, such as whether a linear, polynomial, or even exponential relationship exists between the two variables.
sns.scatterplot(data=tips, x="total_bill", y="tip")

The pairplot method of seaborn draws a scatterplot for every pairwise combination of quantitative variables. As a bonus, it adds histograms of the individual quantitative variables on the main diagonal:
sns.pairplot(tips)

You might want to go a little further by adding categories in the pairplot:
sns.pairplot(tips, hue='sex')

For missing data, it may be difficult to get a full picture of the missingness across multiple variables. Luckily, a Python package can help us visualize this easily.
pip install missingno
import missingno as msno #A simple library to view completeness of data
Trying out this dataset I got from the Family Income and Expenditure Survey for the Philippines:
msno.matrix(df, labels=True)  # labels=True is important if the number of columns exceeds 50

Correlation, while primarily a non-graphical summary, can likewise be visualized. This helps when there are so many features that it becomes difficult to keep track of the pairwise correlations.
For these cases, we can make use of a heatmap:
fig = plt.figure(figsize=(18,14))
corr = tips.corr(numeric_only=True)  # .corr gets the pairwise correlations; numeric_only=True is required on newer pandas
c = plt.pcolor(corr)
fig.colorbar(c) #displays the range used in the colorbar
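Seaborn's heatmap is a common alternative that can annotate each cell with the coefficient; a sketch on a small hand-made numeric frame:

```python
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [2.1, 3.9, 6.2, 8.0],
                   'z': [9.0, 7.5, 5.0, 3.1]})

corr = df.corr()

# annot=True writes each coefficient in its cell;
# vmin/vmax pin the color scale to the full [-1, 1] range
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap='coolwarm')
```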

FINAL REMARKS
While we have tried to cover a representative set of tools and methodologies, this list is in no way complete.
For example, geospatial and temporal features require different EDA principles and processes and should be explored when necessary.
New EDA techniques are regularly being developed. Data scientists are encouraged to learn them and to assess whether the insights they yield go beyond what the techniques covered here can provide.
Regardless, data scientists should not be afraid to add and explore more, as the purpose of EDA is to gain familiarity with your data and advance the research problem at hand.
The code can be found on my GitHub page.
REFERENCES
https://r4ds.had.co.nz/exploratory-data-analysis.html
https://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf
Ozdemir, S., 2016, Principles of Data Science, Birmingham: Packt Publishing.