
Data Science has no recipe. Don’t think that there’s a template you can follow for each and every dataset. There is not.
What I am going to present in this series of posts is just a suggestion, a place to start. From there, your data will naturally pull you toward other checks that fit the needs of your project. Don't treat it as a model to follow or a set of rules; it is simply something to get you moving and help you extract the first insights from your data.
Summary
Exploratory Data Analysis (EDA) is the art of understanding your dataset and extracting the first insights from it, so you can prepare it for the next phases of the Data Science flow – e.g. data cleaning, data formatting, feature engineering and modeling.
These are the topics included in this series:
- Libraries import and loading the data
- Checking data types
- Looking for null or missing values
- Descriptive statistics profiling
- Univariate Analysis
- Correlations
- Multivariate Analysis
Dataset and Libraries
I like to use the tips dataset from the seaborn package when creating examples because I find it really educational. It has different data types – float, integer, string, category – which helps when demonstrating concepts. So let's load it.
I also usually import a few other basic modules to help me work with the data.
# Basics
import pandas as pd
import numpy as np
# Visualizations
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
# Loading the dataset
df = sns.load_dataset('tips')
Now, loading a dataset can be, by itself, part of the analysis. I have a nice set of code snippets in this post that helps you go to the next level when loading your data into a Python session.
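For instance, here is a minimal, hypothetical sketch of loading a CSV with a few helpful options (the file name and the order_date column are made up, just to illustrate the parameters):
# Hypothetical CSV load with a few options worth knowing
df_csv = pd.read_csv(
    'my_data.csv',                # hypothetical file
    sep=',',                      # explicit column separator
    na_values=['?', 'NA', ''],    # treat these strings as missing at import time
    parse_dates=['order_date']    # hypothetical date column, parsed on load
)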
The first thing to do is to look at the dataset. It is important to actually see it: check if the import went well, see whether the dataset is wide or long, and get a sense of what is in each variable.
This is done with df.head(), which brings the first 5 rows of the dataset.
df.head()

So we see that the data looks good: it was imported correctly and there are no weird values, at least in the first few rows. You could also use df.tail() if you'd like to look at the last 5 rows.
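As a complement, df.sample() is also handy for a quick peek (a standard pandas method, shown here just for reference):
# Other quick peeks at the data
df.tail()      # last 5 rows
df.sample(5)   # 5 random rows, useful on larger datasets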
Checking Data Types
Knowing the data types is crucial. Why?
Data typing in software acts as a signal to the software on how to process the data. (Bruce, Peter and Bruce, Andrew – Practical Statistics for Data Scientists, 2017)
Well, when you look at the dataset head, your mind automatically detects that total_bill, tip and size are numbers and the other variables are text, some of them probably binary (only two possible values). Therefore, you know that calculations can be performed on the numeric columns. But if we don't check that the computer is interpreting them correctly, you will certainly see errors when you try to compute statistics for the numbers, for example.
There are some ways to do that task.
# Info will show the shape and data types.
df.info()

See that it brings some useful information, like the number of rows and columns, the data types, and whether there are null values (of which we see none here!).
# Another way is to look at it separately
# Shape
df.shape
(244, 7)  # (number of rows, number of columns)
# Types
df.dtypes
total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
dtype: object
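If a column had been read with the wrong type, a hedged fix could look like the sketch below (illustrative only; in the tips dataset the types above are already correct):
# Coerce a numeric column that came in as text: invalid entries become NaN
df['total_bill'] = pd.to_numeric(df['total_bill'], errors='coerce')
# Store a text column with few distinct values as category
df['sex'] = df['sex'].astype('category')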
Looking for Null or Missing Values
Haven't we just looked at that?
Yes, and we saw that there are none. But there could have been, so I will show here some good ways to detect those values.
The first way is the most basic. We run a command that returns boolean values (True or False) and, knowing that True = 1 and False = 0 in Python, we can sum the values. Wherever we see a value higher than zero, it means data is missing.
# Checking for missing values
df.isnull().sum()
total_bill 0
tip 0
sex 0
smoker 0
day 0
time 0
size 0
dtype: int64
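If you prefer to see this as a share of the dataset, the same check can be expressed as a percentage of missing values per column (just an alternative view of the output above):
# Percentage of missing values per column, highest first
(df.isnull().mean() * 100).round(2).sort_values(ascending=False)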
A visual approach is interesting too. We can use seaborn's heatmap to see where data is missing. I have inserted a row with NAs just so you can see how it appears in the visual.
# Heatmap of null values
sns.heatmap(df.isnull(), cbar=False)

Neat! I personally love that heatmap visual. It makes it so easy to look at the whole dataset and see where the null or missing values are.
Now, I also added some '?' values on purpose, and those are not detected when you run df.isnull(). Null values will not always appear as NaN; sometimes they will come as '?' or something else.

One good way to check for those is using regular expressions with the str.contains method from pandas.
# Checking possible null values - [?.,:*] only on text columns
# (df2 is a copy of df with some '?' values inserted for this example)
for col in df2.select_dtypes(exclude='number').columns:
    print(f"{col}: {df2[col].str.contains('[?.,:*]').sum()} possible null(?.,:*) values")
[OUT]:
sex: 0 possible null(?.,:*) values
smoker: 0 possible null(?.,:*) values
day: 5 possible null(?.,:*) values
time: 0 possible null(?.,:*) values
You can replace the '?' with NaN using the replace method.
# If you want to replace the '?' values (replace returns a new DataFrame, so assign it back)
df2 = df2.replace({'?': np.nan})
Once you know where the missing values are, you can use many techniques to handle them. That is a decision you will have to make based on your project. A good rule of thumb says that if they amount to less than 5% of your dataset, you're good to drop the NAs. But you can also fill them with the mean, the median, or the most frequent value, or use missingpy to predict them. For filling, you can use df.fillna(value); the easiest option, though, is just to drop them.
# Find the most common value for variables
most_common_smoker = df.smoker.value_counts().index[0]
most_common_day = df.day.value_counts().index[0]
# Fill NA values with the most common value (assign back; fillna is not in place by default)
df2 = df2.fillna(value={'smoker': most_common_smoker, 'day': most_common_day})
# Drop the NA values
df2.dropna(inplace=True)
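If you prefer to fill instead of drop, a minimal sketch of median imputation for the numeric columns could look like this (one of the options mentioned above; choosing the median over the mean here is just an example):
# Fill numeric NAs with each column's median (assign back; fillna is not in place)
numeric_medians = df2.median(numeric_only=True)
df2 = df2.fillna(value=numeric_medians)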
Descriptive Statistics Profiling
Following the flow, once your data is clean of missing values, it's time to start checking the descriptive statistics. Some people like to check the stats before cleaning missing values, because if you drop some values, your distribution will change. However, if you are dropping just a couple of rows, I don't think it will change much. And with clean data, the stats already show how the data is distributed after the changes.
This section is essential: it shows you how the data is distributed and lets you start extracting the first insights. With a simple command, you can print the stats on screen. Use the parameter include='all' if you want to see the categorical data as well.
# Since the fake NAs were all deleted, I will come back to using df.
df.describe(include='all')

And there is a lot we can read from here. For example (a few quick checks follow the list):
- Most observations are on Saturdays and at dinner time.
- The dataset has more male customers and non-smokers.
- The standard deviations are fairly high, showing the data is spread and there are probably outliers.
- Average tip is $2.99.
- Average people per table is between 2 and 3.
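If you want to double-check a couple of those readings directly, here is a quick sketch using standard pandas calls:
# Quick checks behind some of the insights above
df['day'].value_counts()    # Sat should appear as the most frequent day
df['time'].value_counts()   # Dinner should dominate
df['tip'].mean()            # around 2.99
df['size'].mean()           # between 2 and 3 people per table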
I believe you can see how powerful those stats are for understanding your data.
Before You Go
OK, that wraps up part 1. We covered half of the content, and now you have an idea of how to load a dataset, check data types (and why that matters), find missing values, and a couple of ways to clean them from the dataset.
Full code of this post.
In the next post we will continue the process.
If this is interesting content to you, follow my blog for more.