
A Practical Guide for Exploratory Data Analysis – Churn Dataset

Using Pandas, Matplotlib, and Seaborn.

Photo by Nils Nedel on Unsplash

Exploratory Data Analysis (EDA) is an essential part of any data science or machine learning pipeline. In order to create a robust and valuable product from data, you need to explore the data, understand the relations among variables, and uncover the underlying structure of the data.

In this post, we will explore a customer churn dataset using the Pandas, Matplotlib, and Seaborn libraries. The dataset is available on Kaggle.

The first step is to read the dataset into a pandas dataframe.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("/content/Churn_Modelling.csv")
df.shape
(10000, 14)
df.head()

The dataset contains 10000 customers (i.e. rows) and 14 features about the customers and their products at a bank. The goal is to predict whether a customer will churn (i.e. Exited = 1) using the provided features. Thus, in terms of machine learning, we aim to build a supervised learning model to perform a classification task.

We should never just dump the raw data into a machine learning model. Garbage in, garbage out! That is the reason why a thorough EDA process is highly important.


Cleaning the data

Let’s first check if there are any missing values.

df.isna().sum()

This dataset does not have any missing values, which is unusual for real-life datasets. Handling missing values is an important part of the EDA process. If there are very few missing values compared to the size of the dataset, we may choose to drop the rows that contain them. Otherwise, it is better to replace them with appropriate values, which the fillna function of pandas can handle.
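As a minimal sketch, assuming the "Balance" column had missing values (it does not in this dataset):

# Hypothetical: fill missing numeric values with the column mean
df['Balance'] = df['Balance'].fillna(df['Balance'].mean())
# Hypothetical: fill missing categorical values with the most frequent value
df['Geography'] = df['Geography'].fillna(df['Geography'].mode()[0])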

Important note: If you choose to impute missing values based on the non-missing values in a column (e.g. fill missing values with the mean value of a column), you should do it after splitting your dataset into train and test subsets. Otherwise, you leak data to the machine learning model from the test set which is supposed to be new, previously unseen data.
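Here is a sketch of the leakage-safe order of operations, using scikit-learn's train_test_split (an assumption on my part, since this post does not otherwise use scikit-learn):

from sklearn.model_selection import train_test_split

# Split first, then impute using statistics computed on the training set only
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_mean = train_df['Balance'].mean()
train_df = train_df.fillna({'Balance': train_mean})
# The test set is filled with the training-set mean, never its own
test_df = test_df.fillna({'Balance': train_mean})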

Tasks like churn prediction and email spam detection are likely to have an imbalanced class distribution. The number of customers who churned (i.e. left) is usually much smaller than the number of customers who did not. We can check the distribution of values with the value_counts function.

df.Exited.value_counts()
0    7963 
1    2037 
Name: Exited, dtype: int64

There is an imbalance in the target variable ("Exited"). It is important to address this imbalance; otherwise, the machine learning model is likely to be biased towards the dominant class. There are different techniques to handle class imbalance, such as undersampling and oversampling.
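As a minimal sketch, the minority class can be randomly oversampled with plain pandas (dedicated libraries such as imbalanced-learn offer more advanced methods like SMOTE):

# Randomly duplicate minority-class rows until the classes are balanced
majority = df[df['Exited'] == 0]
minority = df[df['Exited'] == 1]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
df_balanced = pd.concat([majority, minority_upsampled])
df_balanced['Exited'].value_counts()

Like imputation, resampling should be applied only to the training set after the train-test split; otherwise duplicated rows may leak into the test set.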

We should also make sure the data is stored with appropriate data types. For instance, numerical values should not be stored as "object". The dtypes attribute returns the data type of each column.

df.dtypes

The data types are appropriate. The next step is to get rid of redundant features. The "RowNumber" column is just an index, and the "CustomerId" and "Surname" columns are obviously useless for a machine learning model. Thus, we should drop them.

df.drop(['RowNumber','CustomerId','Surname'], axis=1, inplace=True)

We just pass the list of labels to be dropped. The axis parameter tells the drop function whether we are dropping rows (0) or columns (1). The inplace parameter is set to True to apply the changes to the dataframe itself.


The more exciting part!

It is time to dive deep into the dataset now.

Let’s check how "Gender" and "Geography" are related to customer churn. One way is to use the groupby function of pandas.

df[['Geography','Gender','Exited']].groupby(['Geography','Gender']).agg(['mean','count'])

Finding: In general, females are more likely to "exit" than males. The exit (churn) rate in Germany is higher than in France and Spain.

Another common practice in the EDA process is to check the distribution of variables. Distribution plots, histograms, and boxplots give us an idea about the distribution of variables (i.e. features).

fig, axs = plt.subplots(ncols=2, figsize=(12,6))
fig.suptitle("Distribution of Balance and Estimated Salary", fontsize=15)
# distplot is deprecated in recent seaborn releases; kdeplot produces the
# same smooth density curves as distplot with hist=False
sns.kdeplot(df.Balance, ax=axs[0])
sns.kdeplot(df.EstimatedSalary, ax=axs[1])

A significant portion of the customers have zero balance. For the remaining customers, the "Balance" column follows a roughly normal distribution. The "EstimatedSalary" column seems to have a uniform distribution.

Since there are lots of customers with zero balance, we may create a new binary feature indicating whether a customer has zero balance. The where function of pandas will do the job.

df['Balance_binary'] = df['Balance'].where(df['Balance'] == 0, 1)
df['Balance_binary'].value_counts()
1.0    6383 
0.0    3617 
Name: Balance_binary, dtype: int64

Approximately one-third of customers have zero balance. Let’s see the effect of having zero balance on churning.

df[['Balance_binary','Exited']].groupby('Balance_binary').mean()

Finding: Customers with zero balance are less likely to churn.

Another important statistic to check is the correlation among variables.

Correlation is a normalization of covariance by the standard deviations of the two variables. Covariance is a quantitative measure that represents how much the variations of two variables match each other. More specifically, covariance compares two variables in terms of the deviations from their mean (or expected) values.

By checking the correlation, we are trying to find how similarly two random variables deviate from their mean.
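As a quick sanity check of this definition, the correlation of two columns can be recomputed from their covariance and standard deviations:

# corr(X, Y) = cov(X, Y) / (std(X) * std(Y))
cov = df['Age'].cov(df['Exited'])
print(cov / (df['Age'].std() * df['Exited'].std()))
print(df['Age'].corr(df['Exited']))  # should print the same value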

The corr function of pandas returns a correlation matrix indicating the correlations between numerical variables. We can then plot this matrix as a heatmap.

It is better if we convert the values in the "Gender" column to numeric ones, which can be done with the replace function of pandas.

df['Gender'] = df['Gender'].replace({'Male': 0, 'Female': 1})
# numeric_only is needed in recent pandas versions because the dataframe
# still contains the non-numeric "Geography" column
corr = df.corr(numeric_only=True)
plt.figure(figsize=(12,8))
sns.heatmap(corr, cmap='Blues_r', annot=True)
The correlation matrix

Finding: The "Age", "Balance", and "Gender" columns are positively correlated with customer churn ("Exited"). There is a negative correlation between being an active member ("IsActiveMember") and customer churn.

If you compare "Balance" and "Balance_binary", you will notice a very strong positive correlation since we created one based on the other.

Since "Age" turns out to have the highest correlation values, let’s dig in a little deeper.

df[['Exited','Age']].groupby('Exited').mean()

The average age of churned customers is higher. We should also check the distribution of the "Age" column.

plt.figure(figsize=(6,6))
plt.title("Boxplot of the Age Column", fontsize=15)
sns.boxplot(y=df['Age'])

The dots above the upper whisker indicate outliers, so there are many outliers on the upper side. Another way to check for outliers is to compare the mean and the median.

print(df['Age'].mean())
38.9218
print(df['Age'].median())
37.0

The mean is higher than the median, which is consistent with the boxplot. There are many different ways to handle outliers; it could be the topic of an entire post.
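One common example is the interquartile range (IQR) rule, which is also what the boxplot's whiskers are based on. A sketch of flagging outliers this way (not applied to the dataframe here):

# Flag values beyond 1.5 IQRs of the quartiles as outliers
q1, q3 = df['Age'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Age'] < q1 - 1.5 * iqr) | (df['Age'] > q3 + 1.5 * iqr)]
print(len(outliers))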

Let’s do a simple one here. We will remove the data points that are in the top 5 percent.

# The 95th percentile of the "Age" column separates the top 5 percent
q95 = np.quantile(df['Age'], 0.95)
df = df[df['Age'] < q95]
df.shape
(9474, 12)

The first line finds the value that separates the top 5 percent. In the second line, we use this value to filter the dataframe. The original dataframe has 10000 rows, so we deleted 526 rows.

Please note that this is not acceptable in many cases. We cannot just get rid of rows, because data is a valuable asset: the more data we have, the better the models we can build. Here, we are just trying to see whether the outliers have an effect on the correlation between age and customer churn.

Let’s compare the new mean and median.

print(df['Age'].mean())
37.383681655055945
print(df['Age'].median())
37.0

They are pretty close. It is time to check the difference in average age between customers who churned and those who did not.

df[['Exited','Age']].groupby('Exited').mean()

Our finding still holds true. The average age of churned customers is higher.


There is no limit to exploratory data analysis. Depending on the task or goal, we can approach the data from different perspectives and dig deep to explore. However, the tools used in the process are usually similar, and it is very important to practice a lot in order to get good at it.

Thank you for reading. Please let me know if you have any feedback.

