
Exploratory Data Analysis, also known as EDA, has become an increasingly hot topic in data science. As the name suggests, it is the process of trial and error in an uncertain space, with the goal of finding insights. It usually happens at an early stage of the data science lifecycle. Although there is no clear-cut boundary between data exploration, data cleaning, and feature engineering, EDA generally sits right after the data cleaning phase and before feature engineering or model building. EDA helps set the overall direction of model selection and checks whether the data meets the model assumptions. As a result, carrying out this preliminary analysis may save you a large amount of time in the following steps.
In this article, I have created a semi-automated EDA process that can be broken down into the following steps:
- Know Your Data
- Data Manipulation and Feature Engineering
- Univariate Analysis
- Multivariate Analysis
Feel free to jump to the part that you are interested in, or grab the full code from my website if you find it helpful.
1. Know Your Data
Firstly, we need to load the Python libraries and the dataset. For this exercise, I am using several public datasets from the Kaggle community; feel free to explore these amazing datasets using the link below:
Restaurant Business Rankings 2020
Import Libraries
I will be using four main libraries: NumPy – to work with arrays; Pandas – to manipulate data in a spreadsheet format that we are familiar with; Seaborn and Matplotlib – to create data visualizations.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from pandas.api.types import is_string_dtype, is_numeric_dtype
Import Data
Create a data frame from the imported dataset by copying the path of the dataset, and use df.head(5) to take a peek at the first 5 rows of the data.
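A minimal sketch of this step, assuming the Kaggle files have been downloaded locally (the file name below is a placeholder – use the path of the dataset you downloaded):
df = pd.read_csv('reddit_wsb.csv')   # hypothetical path to the downloaded CSV
df.head(5)                           # peek at the first 5 rows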


Before zooming into each field, let’s first take a bird’s eye view of the overall dataset characteristics.
info()
It gives the count of non-null values for each column and its data type.
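For example:
df.info()   # non-null counts and data types for each column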


describe()
This function provides basic statistics for each column. By passing the parameter include='all', it outputs the value count, unique count, and top-frequency value of the categorical variables, as well as the count, mean, standard deviation, min, max, and percentiles of the numeric variables.
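For example:
df.describe(include='all')   # summary statistics for both categorical and numeric columns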


If we leave it empty, it only shows numeric variables. As you can see, only the columns identified as "int64" in the info() output are shown below.

Missing Value
Handling missing values is a rabbit hole that cannot be covered in one or two sentences. If you would love to know how to address missing values in the model lifecycle and understand different types of missing data, here are some articles that may help:
In this article, we will focus on identifying the number of missing values. isnull().sum() returns the number of missing values for each column.
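For example:
df.isnull().sum()   # count of missing values per column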

We can also do some simple manipulations to make the output more insightful. Firstly, calculate the percentage of missing values.

Then, visualize the percentage of missing values based on the data frame "missing_df". The for loop is basically a handy way to add labels to the bars. As we can see from the chart, nearly half of the "body" values from the "reddit_wsb" dataset are missing, which leads us to the next step, "feature engineering".
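A sketch of these two steps, assuming the percentages are collected in a data frame named missing_df (the column names here are my own placeholders):
missing_df = df.isnull().sum().reset_index()
missing_df.columns = ['column', 'count']
missing_df['percentage'] = missing_df['count'] / len(df) * 100   # percentage of missing values
missing_df = missing_df.sort_values('percentage', ascending=False)

fig, ax = plt.subplots()
ax.bar(missing_df['column'], missing_df['percentage'])
# the for loop adds a percentage label on top of each bar
for i, pct in enumerate(missing_df['percentage']):
    ax.text(i, pct, f'{pct:.1f}%', ha='center', va='bottom')
plt.xticks(rotation=45)
plt.show()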

2. Feature Engineering
This is the only part that requires some human judgment, and thus cannot be easily automated. Don't be afraid of this terminology. I think of feature engineering as a fancy way of saying transforming the data at hand to make it more insightful. There are several common techniques, e.g. changing date of birth into age, decomposing a date into year, month, and day, and binning numeric values. But the general rule is that this process should be tailored to both the data at hand and the objectives to achieve. If you would like to know more about these techniques, I found that the article "Fundamental Techniques of Feature Engineering for Machine Learning" brings a holistic view of feature engineering in practice. And if you would like to know more about feature selection and feature engineering techniques, you may find these helpful:
For the "reddit_wsb" dataset, I simply did three manipulations on the existing data.
1. title → title_length;
df['title_length'] = df['title'].apply(len)
As a result, the high-cardinality column "title" has been transformed into a numeric variable which can be further used in the correlation analysis.
2. body → with_body
df['with_body'] = np.where(df['body'].isnull(), 'No', 'Yes')
Since a large portion of the values is missing, the "body" field is transformed into either with_body = "Yes" or with_body = "No", so it can be easily analyzed as a categorical variable.
3. timestamp→ month
df['month'] = pd.to_datetime(df['timestamp']).dt.month.apply(str)
Since most data are gathered from the year "2021", there is no point in comparing years. Therefore, I kept the month portion of the date, which also helps to group data into larger subsets.
In order to streamline the further analysis, I drop the columns that won't contribute to the EDA.
df = df.drop(['id', 'url', 'timestamp', 'title', 'body'], axis=1)
For the "restaurant" dataset, the data is already clean enough, therefore I simply trimmed out the columns with high cardinality.
df = df.drop(['Restaurant', 'City'], axis=1)
Furthermore, the remaining variables are categorized into numerical and categorical, since univariate analysis and multivariate analysis require different approaches for different data types. is_string_dtype and is_numeric_dtype are handy functions to identify the data type of each field, as shown in the sketch below.
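A minimal sketch of this categorization, assuming the two lists are named numerical and categorical (names are my own):
numerical = [col for col in df.columns if is_numeric_dtype(df[col])]      # numeric columns
categorical = [col for col in df.columns if is_string_dtype(df[col])]     # string/categorical columns
print(numerical)
print(categorical)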


After finalizing the numerical and categorical variables lists, the univariate and multivariate analysis can be automated.
3. Univariate Analysis
The describe() function mentioned in the first section has already provided a univariate analysis in a non-graphical way. In this section, we will generate more insights by visualizing the data and spotting hidden patterns through graphical analysis.
Have a read of my article on "How to Choose the Most Appropriate Chart" if you are interested in knowing which chart types are most suitable for which data type.
Categorical Variables → Bar chart
The easiest yet most intuitive way to visualize the property of a categorical variable is to use a bar chart to plot the frequency of each categorical value.
Numerical Variables → histogram
To graph out a numeric variable's distribution, we can use a histogram, which is very similar to a bar chart. It splits continuous numbers into equal-size bins and plots the frequency of records falling within each interval.

I use a for loop to iterate through the columns in the data frame and create a plot for each column, using a histogram if the variable is numerical and a bar chart if it is categorical.
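A sketch of this loop, reusing the numerical and categorical lists assumed earlier:
for col in df.columns:
    fig, ax = plt.subplots()
    if col in numerical:
        df[col].plot(kind='hist', ax=ax)                  # histogram for numeric variables
    else:
        df[col].value_counts().plot(kind='bar', ax=ax)    # bar chart for categorical variables
    ax.set_title(col)
    plt.show()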


4. Multivariate Analysis
Multivariate analysis is broken down into these three scenarios to address various combinations of numerical and categorical variables.
1. Numerical vs. Numerical → heat map or pairplot
Firstly, let's use the correlation matrix to find the correlation of all numeric columns. Then use a heat map to visualize the result. The annotation inside each cell indicates the correlation coefficient of the relationship.
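A minimal sketch of the heat map, reusing the numerical list assumed earlier:
corr = df[numerical].corr()                       # correlation matrix of the numeric columns
sns.heatmap(corr, annot=True, cmap='coolwarm')    # annot shows the coefficient in each cell
plt.show()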


Secondly, since the correlation matrix only indicates the strength of a linear relationship, it is better to also plot the numerical variables using the Seaborn function sns.pairplot(). Notice that both sns.heatmap() and sns.pairplot() ignore non-numeric data types.
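For example:
sns.pairplot(df)   # scatterplots for each pair of numeric columns, distributions on the diagonal
plt.show()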


A pair plot or scatterplot is a good complement to the correlation matrix, especially when nonlinear relationships (e.g. exponential or inverse relationships) might exist. For example, the inverse relationship between "Rank" and "Sales" observed in the restaurant dataset may be mistaken for a strong linear relationship if we simply look at the number "-0.92" in the correlation matrix.
2. Categorical vs. Categorical → countplot with hue

The relationship between two categorical variables can be visualized using grouped bar charts. The frequency of the primary categorical variable is broken down by the secondary category. This can be achieved using sns.countplot().
I use a nested for loop, where the outer loop iterates through all categorical variables and assigns them as the primary category, then the inner loop iterates through the list again to pair the primary category with a different secondary category, as sketched below.
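A sketch of this nested loop, reusing the categorical list assumed earlier:
for primary in categorical:
    for secondary in categorical:
        if primary == secondary:
            continue                              # skip pairing a variable with itself
        fig, ax = plt.subplots()
        sns.countplot(x=primary, hue=secondary, data=df, ax=ax)   # grouped bar chart
        plt.show()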

Within one grouped bar chart, if the frequency distribution follows the same pattern across the different groups, it suggests that there is no dependency between the primary and secondary categories. However, if the distribution differs, it is likely that there is a dependency between the two variables.


Since there is only one categorical variable in the "restaurant" dataset, no graph is generated.
3. Categorical vs. Numerical → boxplot or pairplot with hue

A box plot is usually used when we need to compare how numerical data varies across groups. It is an intuitive way to graphically depict whether the variation in categorical features contributes to the difference in values, which can additionally be quantified using ANOVA analysis. In this process, I pair each column in the categorical list with all columns in the numerical list and plot the box plots accordingly.
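A sketch of this pairing, again reusing the categorical and numerical lists:
for cat in categorical:
    for num in numerical:
        fig, ax = plt.subplots()
        sns.boxplot(x=cat, y=num, data=df, ax=ax)   # distribution of each numeric column per group
        plt.show()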

In the "reddit_wsb" dataset, no significant difference is observed across different categories.



On the other hand, the "restaurant" dataset gives us some interesting output. Some states (e.g. "Mich.") seem to jump all around the plots. Is it just because of the relatively smaller sample size for these states? This might be worth further investigation.




Another approach builds upon the pair plot that we performed earlier for numerical vs. numerical analysis. To introduce the categorical variable, we can represent it using different hues, just like what we did for the countplot. To do this, we simply loop through the categorical list and add each element as the hue of the pairplot.
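For example:
for cat in categorical:
    sns.pairplot(df, hue=cat)   # color the scatterplots by each categorical variable in turn
    plt.show()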

Consequently, it is easy to visualize whether each group forms clusters in the scatterplot.



Hope you enjoy my article :). If you would like to read more of my articles on Medium, please contribute by signing up for a Medium membership using this affiliate link (https://destingong.medium.com/membership).
Take-Home Message
This article covers several steps to perform EDA:
- Know Your Data: have a bird’s view of the characteristics of the dataset.
- Feature Engineering: transform variables into something more insightful.
- Univariate Analysis: 1) histogram to visualize numerical data; 2) bar chart to visualize categorical data.
- Multivariate Analysis: 1) Numerical vs. Numerical: correlation matrix, scatterplot (pairplot); 2) Categorical vs. Categorical: Grouped bar chart; 3) Numerical vs. Categorical: pairplot with hue, box plot.
Feel free to grab the code from my website. As mentioned earlier, other than the feature engineering part, the rest of the analysis can be automated. However, it is always better when the automation process is accompanied by some human touch, for example, experimenting with the bin size to optimize the histogram distribution. As always, I hope you find this article helpful and I encourage you to give it a go with your own dataset 🙂
More Related Articles
Level Up 7 Data Science Skills Through YouTube
Originally published at https://www.visual-design.net on February 28, 2021.