
Exploratory Data Analysis with Python – Part 1

A template to follow to get you started analyzing data with Python and Pandas.

Photo by Markus Winkler on Unsplash

Data Science has no recipe. Don’t think that there’s a template you can follow for each and every dataset. There is not.

What I am going to present in this series of posts is just a suggestion, a place to start. From there, your data will naturally pull you toward other checks that fit the needs of your project. It should not be understood as a model to follow or a set of rules, but simply as something to get you moving and to help you extract the first insights from your data.

Summary

Exploratory Data Analysis (EDA) is the art of understanding your dataset and extracting the first insights from it, so you can prepare it for the next phases of the Data Science flow – e.g. data cleaning, data formatting, feature engineering and modeling.

These are the topics included in this series:

  • Libraries import and loading the data
  • Checking data types
  • Looking for null or missing values
  • Descriptive statistics profiling
  • Univariate Analysis
  • Correlations
  • Multivariate Analysis

Dataset and Libraries

I like to use the tips dataset from the seaborn package when I am creating examples because I find it really educational. It has different data types – float, integer, string, category – which helps when demonstrating concepts. So let's load it.

I also usually import a few other basic modules to help me work with the data.

# Basics
import pandas as pd
import numpy as np
# Visualizations
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
# Loading the dataset
df = sns.load_dataset('tips')

Now, loading a dataset can be, by itself, part of the analysis. I have a nice set of code snippets in the post below that helps you go to the next level when loading your data into a Python session:

The Next Level of Pandas read_csv( )

The first thing to do is to look at the dataset. It is important to actually see it: check whether the import went well, get a sense of how large the dataset is, and see what's in each variable.

This is done with df.head(), which brings the first 5 rows of the dataset.

df.head()
df.head() of the Tips dataset. Image by the author.

So we see that the data looks good: it was imported correctly and there are no weird values, at least in the first few rows. You could also use df.tail() if you'd like to look at the last 5 rows.
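Mirroring head, the last rows can be checked the same way:

# Last 5 rows of the dataset
df.tail()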

Checking Data Types

Knowing the data types is crucial. Why?

Data typing in software acts as a signal to the software on how to process the data. (Bruce, Peter and Bruce, Andrew – Practical Statistics for Data Scientists, 2017)

Well, when you look at the dataset head, your mind automatically detects that total_bill, tip and size are numbers and the other variables are text, some of them probably binary (only two possible values). Therefore, you know the numeric columns can be used in calculations. But if we don't check whether the computer understands them the same way, we will certainly see errors when we try to compute statistics for the numbers, for example.

There are a few ways to do that.

# Info will show the shape and data types.
df.info()
df.info() output. Image by the author.

See that it brings some good information, like the number of rows and columns, the data types, and whether there are null values (and here we see none!).

# Another way is to look at it separately
#Shape
df.shape
(244, 7) #(number of rows, number of columns)
#Types
df.dtypes
total_bill     float64 
tip            float64 
sex           category 
smoker        category 
day           category 
time          category 
size             int64 
dtype: object
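If a column that should be numeric comes in as text – a common issue with messy CSV files – you can convert it before computing any statistics. In the tips dataset the types are already correct, so the sketch below is only illustrative:

# Illustrative only: coerce a column to numeric; values that cannot be parsed become NaN
df['total_bill'] = pd.to_numeric(df['total_bill'], errors='coerce')
# Text columns with few distinct values can be stored as category
df['sex'] = df['sex'].astype('category')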

Looking for null or Missing Values

Haven’t we just looked at it?

Yes, we saw that there are none. But there could have been. So I will show here some good ways to detect those values.

The first way is the most basic. We run a command that returns boolean values (True or False) and, knowing that True = 1 and False = 0 in Python, we can sum the values. Wherever we see a value higher than zero, it means data is missing.

# Checking for missing values
df.isnull().sum()
total_bill    0 
tip           0 
sex           0 
smoker        0 
day           0 
time          0 
size          0 
dtype: int64
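It can also help to look at missing values as a percentage of the rows, which is what the rule of thumb mentioned later in this post is based on. A quick sketch:

# Percentage of missing values per column
(df.isnull().sum() / len(df) * 100).round(2)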

A visual way is interesting too. We can use seaborn's heatmap to see where data is missing. I have inserted a few NAs there just so you can see how they appear in the visual.
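The next snippets work on a copy called df2, which holds the fake problem values. How exactly they were injected doesn't matter, but here is a minimal sketch of how such a copy could be built (the sampled rows and random seeds are just illustrative assumptions):

# Illustrative only: copy the data and inject some fake problem values
df2 = df.copy()
df2.loc[df2.sample(3, random_state=42).index, 'smoker'] = np.nan   # fake NaNs
df2['day'] = df2['day'].astype('object')   # allow a value outside the original categories
df2.loc[df2.sample(5, random_state=1).index, 'day'] = '?'          # fake '?' markers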

# Heatmap of null values
sns.heatmap(df2.isnull(), cbar=False)
NAs inserted for example purposes: see the 'smoker' variable. Image by the author.

Neat! I personally love that heatmap visual. It makes it so easy to look at the whole dataset and see where the null or missing values are.

Now, I also added some '?' values on purpose, and those are not detected when you run df.isnull(). Null values will not always appear as NaN. Sometimes they will come as '?' or something else.

'?' null value added for this example. Image by the author.

One good way to check that is using regular expressions and the string contains method from Pandas.

# Checking possible null values - [?.,:*] - only on text columns
for col in df2.select_dtypes(exclude='number').columns:
    print(f"{col}: {df2[col].str.contains('[?.,:*]').sum()} possible null(?.,:*) values")
[OUT]:
sex: 0 possible null(?.,:*) values 
smoker: 0 possible null(?.,:*) values 
day: 5 possible null(?.,:*) values 
time: 0 possible null(?.,:*) values

You can replace the ‘?’ with NaN using the replace method.

# If you want to replace the '?' values (replace returns a new DataFrame, so assign it back)
df2 = df2.replace({'?': np.nan})

Once you know where the missing values are, there are many techniques you can use to handle them. That will be a decision you have to make based on your project. A good rule of thumb says that if they amount to less than 5% of your dataset, you're good to drop the NAs. But you can also fill them with the mean, the median, or the most frequent value, or use missingpy to predict them. For the filling task, you can use df.fillna(value).

The easiest option is simply to drop them. Below I show both approaches: filling the categorical columns with their most common value, and dropping the NAs.

# Find the most common value for the categorical variables
most_common_smoker = df.smoker.value_counts().index[0]
most_common_day = df.day.value_counts().index[0]
# Option 1: fill the NA values with the most common value
df2 = df2.fillna(value={'smoker': most_common_smoker, 'day': most_common_day})
# Option 2: drop any remaining NA values
df2.dropna(inplace=True)
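For numeric columns, the mean or median fill mentioned above would look like the sketch below (in this dataset the numeric columns have no NAs, so it is only illustrative):

# Illustrative only: fill numeric NAs with the column median (or mean)
df2['total_bill'] = df2['total_bill'].fillna(df2['total_bill'].median())
df2['tip'] = df2['tip'].fillna(df2['tip'].mean())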

Descriptive Statistics Profiling

Following the flow, once your data is clean of missing values, it's time to start checking the descriptive statistics. Some people like to check the stats before cleaning missing values, because if you drop some values, your distribution will change. However, if you are dropping just a couple of rows, I don't think it will change much. Also, with clean data, you already get good stats that show how the data is distributed after the changes.

This step is essential: it shows how the data is distributed and lets you start getting the first insights. With a simple command, you can print the stats on screen. Use the parameter include='all' if you also want to see the categorical data.

# Since the fake NAs were all deleted, I will come back to using df.
df.describe(include='all')
df.describe(include='all') results. Image by the author.

And there is a lot we can read from here. For example:

  • Most of the observations are on Saturday and Dinner time.
  • The dataset has more Males and non-smokers.
  • The standard deviations are fairly high, showing the data is spread out and there are probably outliers.
  • Average tip is $2.99.
  • Average people per table is between 2 and 3.
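If you want to double-check a few of the numbers above directly, a quick sketch:

# Confirm a few of the insights from describe()
df['day'].value_counts()      # Saturday has the most observations
df['smoker'].value_counts()   # more non-smokers than smokers
df['tip'].mean()              # around 2.99
df['size'].mean()             # between 2 and 3 people per table on average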

I believe you can see how powerful those stats are for understanding your data.

Before You Go

Ok, that wraps up part 1. We covered half of the content, and now you have an idea of how to load a dataset, check data types, understand why that is important, find missing values, and a couple of ways to clean them from the dataset.

Full code of this post.

In the next post we will continue the process.

If this content is interesting to you, follow my blog for more.

gustavorsantos

