The world’s leading publication for data science, AI, and ML professionals.

Working with Pandas in Python

Data selection, aggregation and statistics with Bank Churn Modeling data

Photo by Burst on Pexels
Photo by Burst on Pexels

Pandas is a python library that is used for wrangling data, generating statistics, aggregating data and much more. In this post we will discuss how to perform data selection, aggregation and statistical analysis using the Pandas library.

Let’s get started!

For our purposes we will be working with the Bank Churn Modeling data set. The data can be found here.

To start, let’s import the Pandas library, relax display limits and print the first five rows of data:

import pandas as pd
df = pd.read_csv("Bank_churn_modelling.csv")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print(df.head())

DATA SELECTION

Here we will consider data selection using a Pandas dataframe. We can use ‘[]’ to select columns of data. For example, if we’d like to select ‘CreditScore’, Gender’, ‘Age’ and ‘Exited’ we can do the following:

df_select = df[['CreditScore', 'Gender', 'Age',  'Exited']]
print(df_select.head())

We can also filter the original dataframe by column value. Let’s filter the original data to only include customers above the age of 40:

df_age_gt_40 = df[df['Age'] > 40]
print(df_age_gt_40.head())

We can also filter for customers 40 years old or younger:

df_age_lte_40 = df[df['Age'] <=40]
print(df_age_lte_40.head())

or for only 40 year old customers:

df_age_e_40 = df[df['Age'] ==40]
print(df_age_e_40.head())

We can also filter by category. For example, we can select data where the ‘Geography’ value is France:

df_france = df[df['Geography'] == 'France']
print(df_france.head())

We can also use the ‘.loc[]’ operator to achieve the same tasks:

df_france_loc = df.loc[df.Geography == 'France']
print(df_france_loc.head())

We can also filter on multiple categorical values. Let’s only include customers from Germany and Spain:

geography_list = ['Germany', 'Spain']
df_germany_spain = df[df['Geography'].isin(geography_list)]
print(df_germany_spain.head())

STATISTICS

We can also generate some basic statistics using Pandas. For example, if we’d like to calculate the mean and standard deviation in credit score we can do the following:

mean_credit_score = df['CreditScore'].mean()
print('Mean credit Score: ', mean_credit_score)
std_credit_score = df['CreditScore'].std()
print('Standard Deviation in Credit Score: ', std_credit_score)

We can also look at the minimum and maximum:

min_credit_score = df['CreditScore'].min()
print('Min credit Score: ', min_credit_score)
max_credit_score = df['CreditScore'].max()
print('Standard Credit Score: ', max_credit_score)

We can also calculate the correlation of features and plot a heat map. Let’s look at the correlation for ‘Age’, ‘CreditScore’, ‘EstimatedSalary’, and ‘Tenure’:

corr = df[['Age', 'CreditScore', 'EstimatedSalary', 'Tenure']].corr()
print(corr)

And we can plot the heat map of correlation values using seaborn:

import seaborn as sns 
sns.heatmap(corr)

DATA AGGREGATION

We can also aggregate data using Pandas. Specifically, we can use the ‘groupby’ method to generate category level statistics. For example, we can generate a new dataframe from our original data that contains the average ‘CreditScore’ per ‘Geography’ value. This will allow us to analyze and compare average credit scores of France, Spain, and Germany:

df_groupby_mean = df.groupby('Geography')['CreditScore'].mean()
print(df_groupby_mean.head())

We can also look at standard deviation in credit score:

df_groupby_std = df.groupby('Geography')['CreditScore'].std()
print(df_groupby_std.head())

Let’s take a look at another numerical column. Let’s get the mean age and standard deviation in age for each country:

df_groupby_age_mean = df.groupby('Geography')['Age'].mean()
print(df_groupby_age_mean.head())
df_groupby_age_std = df.groupby('Geography')['Age'].std()
print(df_groupby_age_std.head())

We can also group by multiple columns:

df_groupby_multiple_category = df.groupby(['Geography', 'Gender'])['Age'].mean()
print(df_groupby_multiple_category.head())

I’ll stop here but I encourage you to play around with the data and code yourself.

CONCLUSIONS

To summarize, in this post we discussed how to work with the Pandas library to perform tasks such as data selection, data aggregation and statistical analysis. If you are interested learning about the basics of python Programming, data manipulation with Pandas, and machine learning in python check out _Python for Data Science and Machine Learning: Python Programming, Pandas and Scikit-learn Tutorials for Beginners._ I hope you found this post useful/interesting. The code in this post is available on GitHub. Thank you for reading!


Related Articles