Whether we are creating a dashboard, doing predictive analytics, or working on any other machine learning task, we first need to explore the data at hand. We should obtain a thorough understanding of the data and the relationships among variables.
There are many tools and packages that can be used to analyze data. What they all have in common is that the best way to learn them is through practice.
In this practical article, we will explore a dataset that contains information about the customers of a bank. The ultimate task is to predict whether a customer will leave the credit card services of the bank.
We will be using Pandas for data analysis and manipulation and Seaborn to create visualizations.
The first step is to import the libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
Let’s create a dataframe by reading the provided CSV file.
churn = pd.read_csv("/content/BankChurners.csv", usecols=list(range(21)))
I have excluded the first column and the last two columns by providing a list of indices of the columns to be included in the dataframe. The usecols parameter is used to select only certain columns. We can pass either the names or the indices of the columns to be included.
The first column is client number which does not add any value to the analysis. The last two columns were not relevant as indicated by the dataset provider.
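As a side note, usecols also accepts column names instead of indices. A minimal sketch on a toy inline CSV (the data here is made up purely for illustration):

```python
import io

import pandas as pd

# Toy CSV standing in for the real file; usecols keeps only the named columns
csv = io.StringIO("a,b,c\n1,2,3\n4,5,6")
df = pd.read_csv(csv, usecols=['a', 'c'])

print(df.columns.tolist())  # ['a', 'c']
```

Selecting by name is usually safer than by position if the file layout might change.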
The shape attribute returns the size of the dataframe in terms of the number of rows and columns.
print(churn.shape)
(10127, 20)

There are 20 columns. We can view the entire list of columns by using the columns attribute.
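The columns attribute holds an Index of the column labels; a minimal sketch on a toy dataframe (the two columns here are just a stand-in for the full set of 20):

```python
import pandas as pd

# The columns attribute is an Index of the column labels
toy = pd.DataFrame({'Attrition_Flag': [0, 1], 'Customer_Age': [45, 52]})
print(toy.columns.tolist())  # ['Attrition_Flag', 'Customer_Age']

# On our data, churn.columns.tolist() would list all 20 column names
```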
Before starting the analysis, we should check if there are any missing values in the columns. The isna function of Pandas returns True if a value is missing. We can apply the sum function to count the number of missing values in each column or in the entire dataframe.
churn.isna().sum().sum()
0
There are no missing values in the dataset.
The first column (Attrition_Flag) is the target variable which indicates if a customer is attrited (i.e. churned or left the company).
churn['Attrition_Flag'].value_counts(normalize=True)
Existing Customer 0.83934
Attrited Customer 0.16066
Only 16 percent of the customers churned, which I still think is high. This is why the bank is investigating the issue and trying to understand why these customers churned.
We will be analyzing the other features (i.e. columns) to find the patterns that indicate or help to understand customer churn.
It makes the analysis easier to have numbers instead of strings as values of the target variable. Thus, we will replace "Attrited Customer" values with 1 and "Existing Customer" values with 0.
churn['Attrition_Flag'] = churn['Attrition_Flag'].replace(
    {'Existing Customer': 0, 'Attrited Customer': 1})
churn['Attrition_Flag'].value_counts(normalize=True)
0 0.83934
1 0.16066
One of the most widely used functions in exploratory data analysis is groupby. It gives us an overview of how a numerical value changes based on the groups in a categorical variable.
For instance, we can check the churn rate based on the categories in the gender and marital status columns.
churn[['Attrition_Flag','Gender','Marital_Status']].groupby(
    ['Gender','Marital_Status']).mean().round(2)

The churned customers are indicated by 1 in the attrition flag column. Thus, the higher the average value, the more likely the customer churn is.
I have used the round function to round the floats to two decimals because the differences after the second decimal are too small to matter in our case.
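To make explicit why taking the mean yields a churn rate here: with the target encoded as 0/1, the per-group mean is simply the share of ones. A toy example (data made up for illustration):

```python
import pandas as pd

# With a 0/1 flag, the per-group mean equals the share of churned customers
toy = pd.DataFrame({
    'Attrition_Flag': [1, 0, 0, 1, 1, 0],
    'Gender':         ['F', 'F', 'F', 'M', 'M', 'M'],
})
rates = toy.groupby('Gender')['Attrition_Flag'].mean().round(2)
print(rates)  # F: 0.33 (1 of 3 churned), M: 0.67 (2 of 3 churned)
```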
- Finding 1: Females are slightly more likely to churn than males.
The target variable is discrete as it only takes two different values. Thus, we can use groupby to compare the churned customers with the others based on numerical features.
For instance, we can check the average age and number of dependents for groups in the attrition flag column.
churn[['Attrition_Flag','Customer_Age','Dependent_count']].groupby(
    ['Attrition_Flag']).mean().round(2)

The values are quite close so there seems to be no significant difference between churned and not-churned customers based on these features.
However, in some cases, it might be misleading to just check the average values. There might be outliers that significantly change the mean value. Thus, it is better to also check the distribution of variables.
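To illustrate how a single outlier can distort the mean while leaving the median almost untouched (the numbers below are made up):

```python
import pandas as pd

# One extreme value drags the mean up; the median barely moves
ages = pd.Series([30, 32, 35, 36, 38, 300])
print(ages.mean())    # 78.5 -- inflated by the outlier
print(ages.median())  # 35.5 -- robust to it
```

This is why it pays to look at the distribution, not just the average.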
We can use a histogram to check the distribution of a variable.
sns.displot(data=churn, kind='hist',
x='Customer_Age', hue='Attrition_Flag',
height=7, aspect=1.2)

The distribution of customer age is similar between churned and not-churned customers. We can conclude that customer age does not seem to be an important factor in customer churn.
We can also use boxplots to check the distribution of variables. Let’s create a boxplot of the credit limit feature and separate it based on the attrition flag.
plt.figure(figsize=(10,6))
sns.boxplot(data=churn, y='Credit_Limit', x='Attrition_Flag',
width=0.5)

The distributions of churned and not-churned customers seem pretty similar.
The number of inactive months for a customer might be substantial in churn analysis. We can apply multiple aggregations with groupby. Let’s check both the average churn rate and the number of customers for each category in the inactive months column.
churn[['Attrition_Flag','Months_Inactive_12_mon']].groupby(
    ['Months_Inactive_12_mon']).agg(['mean','count']).round(2)

There is a positive correlation between the number of inactive months and customer churn. We clearly see that the churn rate increases as the number of inactive months increases (excluding the categories with very few customers).
- Finding 2: Churn rate increases as the number of inactive months increases.
Another common technique used in exploratory data analysis is to check the correlations among variables. The corr function of Pandas creates a dataframe of correlation coefficients between variables.
We can check the correlations on the dataframe or visualize them using a heatmap. I prefer the latter because it provides a more structured and informative overview.
corr = churn.corr(numeric_only=True).round(2)
plt.figure(figsize=(12,8))
sns.heatmap(corr, annot=True, cmap="YlGnBu")

The first row of the heatmap is what we are mostly interested in. It shows the correlation coefficients between the target variable (Attrition_Flag) and other variables.

I have sorted the correlation coefficients based on the absolute value because we are interested in all correlations, not only the positive ones.
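The sorting step itself is not shown above; a minimal sketch of the idea (the helper name sort_by_abs is my own, not part of Pandas):

```python
import pandas as pd

def sort_by_abs(corr_with_target: pd.Series) -> pd.Series:
    """Order correlation coefficients by absolute value, descending."""
    return corr_with_target.reindex(
        corr_with_target.abs().sort_values(ascending=False).index
    )

# On our data: sort_by_abs(corr['Attrition_Flag'].drop('Attrition_Flag'))
print(sort_by_abs(pd.Series({'a': 0.2, 'b': -0.8, 'c': 0.5})))
```

Dropping the target's self-correlation (always 1.0) keeps it from dominating the ranking.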
As the sorted values show, total transaction count and the change in transaction count are the two features most strongly correlated with the target variable.
- Finding 3: Total transaction count and change in the transaction count are the first two features with regards to the correlation with target variable.
In the heatmap, we also see that total transaction amount and total transaction count are highly correlated. We can visualize these two variables on a scatter plot for further analysis.
sns.relplot(data=churn, kind='scatter', x='Total_Trans_Amt',
y='Total_Trans_Ct', hue='Attrition_Flag', height=7)

We can see the positive correlation on the scatter plot as well. Another interesting finding is that there is no churned customer above a certain level of transaction amount or count.
- Finding 4: Customers who have done more than 100 transactions do not churn.
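A quick way to sanity-check this finding on the dataframe; the helper below is my own sketch (column names as in the dataset):

```python
import pandas as pd

def churn_rate_above(df: pd.DataFrame, col: str, threshold: float) -> float:
    """Churn rate among customers whose `col` value exceeds `threshold`."""
    subset = df.loc[df[col] > threshold, 'Attrition_Flag']
    return float(subset.mean()) if len(subset) else float('nan')

# On our data, churn_rate_above(churn, 'Total_Trans_Ct', 100) returning 0.0
# would confirm that no customer with more than 100 transactions churned.
```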
Let’s also create a boxplot of the total transaction amount.
plt.figure(figsize=(10,6))
sns.boxplot(data=churn, y='Total_Trans_Amt', x='Attrition_Flag',
width=0.5)

We clearly see that the total transaction amount is higher for not-churned customers.
Conclusion
We have covered some commonly used functions and techniques in exploratory data analysis. It is important to have an idea of how to approach the data.
We are, of course, not done with this task yet. We should further investigate the data and the variables using similar techniques.
Thank you for reading. Please let me know if you have any feedback.