
Bank Institution Term Deposit Predictive Model

A step-by-step approach

Photo by Carlos Muza on Unsplash

Courtesy of the 10 Academy training program, I’ve been introduced to many Data Science concepts by working on different projects, each challenging in its own way.

Bank Institution Term Deposit Predictive Model is a project I found interesting. Its main objective is to build a model that predicts which customers will or will not subscribe to a bank term deposit, and this article shares my step-by-step approach to building that model.

Contents

  • The Data
  • Exploratory Data Analysis
  • Data Preprocessing
  • Machine Learning Model
  • Comparing Results
  • Prediction
  • Conclusion
  • Further Study

The Data

The dataset (Bank-additional-full.csv) used in this project contains bank customers’ data. The dataset, together with its documentation, can be found here. The first step in any data analysis is to import the necessary libraries and the dataset to get you going.

# importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
#importing the dataset
dataset = pd.read_csv('bank-additional-full.csv', sep=';')
dataset.name = 'dataset'
dataset.head()
Image by author

Exploratory Data Analysis (EDA)

EDA is an essential part of machine learning model development because it helps us understand our data and extract useful insights that inform feature engineering. The EDA performed in this project includes, but is not limited to, the following:

  • Shape and size of dataset
# function to check the shape of a dataset
def data_shape(data):
    print(data.name,'shape:',data.shape)
# function to check the size of a dataset
def data_size(data):
    print(data.name,'size:',data.size)
# Getting the shape of the dataset
data_shape(dataset)
# Getting the size of the dataset
data_size(dataset)

dataset shape: (41188, 21)
dataset size: 864948

.shape returns the number of rows and columns of the dataset.

.size returns the number of elements in the data, i.e., the number of rows times the number of columns.

  • Information and Statistical summary
# function to check the information of a dataset
def data_info(data):
    print(data.name,'information:')
    print('---------------------------------------------')
    print(data.info())
    print('---------------------------------------------')
# Getting the information of the dataset
data_info(dataset)
Image by author

.info() is used to get a concise summary of the dataset.

# Getting the statistical summary
dataset.describe().T
Image by author

.describe() is used to view basic statistical details such as percentiles, mean, and standard deviation for the numerical columns in the dataset.

  • Unique and missing values
# function to get all unique values in the categorical variables
def unique_val(data):
    cols = data.columns
    for i in cols:
        if data[i].dtype == 'O':
            print('Unique values in',i,'are',data[i].unique())
            print('----------------------------------------------')
# Getting the unique values in the categorical columns
unique_val(dataset)
Image by author

.unique() returns the unique values in a categorical column of the dataset.

# function to check for missing values
def missing_val(data):
    print('Sum of missing values in', data.name)
    print('------------------------------')
    print(data.isnull().sum())
    print('------------------------------')
# Getting the missing values in the dataset
missing_val(dataset)
Image by author

.isnull().sum() returns the number of missing values in each column of the dataset. Luckily for us, our dataset does not have missing values.

  • Categorical and numerical variables
# Categorical variables
cat_data = dataset.select_dtypes(exclude='number')
cat_data.head()
# Numerical variables
num_data = dataset.select_dtypes(include='number')
num_data.head()
Categorical variables
Numerical variables

.select_dtypes(exclude='number') returns all the columns that do not have a numerical data type.

.select_dtypes(include='number') returns all the columns that have a numerical data type.

  • Univariate and Bivariate Analysis

I made use of Tableau (a data visualization tool) for the univariate and bivariate analysis, and the Tableau story can be found here.

  • Correlation
# using heatmap to visualize correlation between the columns
# (fig_size and fig_att are plotting helper functions from the project's
#  GitHub repository; they set the figure size, title and axis labels.
#  dataset.corr() uses only the numerical columns; on newer pandas you
#  may need to pass numeric_only=True.)
fig_size(20,10)
ax = sns.heatmap(dataset.corr(), annot=True, fmt='.1g',
                 vmin=-1, vmax=1, center=0)
# setting the parameters
fig_att(ax, "Heatmap correlation between Data Features",
        "Features", "Features", 35, 25, "bold")
plt.show()
Image by author

Correlation shows the strength of the linear relationship between the numerical variables in the dataset.

  • Outliers

A seaborn boxplot is one way of checking a dataset for outliers.

# Using boxplot to identify outliers
# (save is a helper function from the project's GitHub repository
#  that saves the current figure to disk)
for col in num_data:
    ax = sns.boxplot(num_data[col])
    save(f"{col}")
    plt.show()

The code above visualizes the numerical columns in the dataset and outliers detected were treated using the Interquartile Range (IQR) method. The code can be found in this GitHub repository.
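The exact outlier treatment lives in the repository, but for readers who want a feel for the technique, a typical IQR-based capping approach looks roughly like this (a sketch under my own assumptions, not the project's exact code):

# a minimal sketch of IQR-based outlier capping (assumed approach; the exact
# treatment used in the project is in the GitHub repository)
def treat_outliers_iqr(data, cols):
    for col in cols:
        q1 = data[col].quantile(0.25)
        q3 = data[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        # cap values outside the IQR fences instead of dropping rows
        data[col] = data[col].clip(lower=lower, upper=upper)
    return data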

In the course of the EDA, I found that our target variable 'y' (has the client subscribed to a term deposit? binary: 'yes', 'no') is highly imbalanced, which can affect our prediction model. This will be taken care of shortly, and this article does justice to several techniques for dealing with class imbalance.
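A quick way to confirm the imbalance (not part of the original snippets, just standard pandas) is to look at the class counts and proportions:

# checking the class distribution of the target variable
print(dataset['y'].value_counts())
print(dataset['y'].value_counts(normalize=True))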

Data Preprocessing

When building a machine learning model, it is important to preprocess the data so that the model can learn from it efficiently.

# create list containing categorical columns
cat_cols = ['job', 'marital', 'education', 'default', 'housing',
            'loan', 'contact', 'month', 'day_of_week', 'poutcome']
# create list containing numerical columns
num_cols = ['duration', 'campaign', 'emp.var.rate', 'pdays', 'age',
            'cons.price.idx', 'cons.conf.idx', 'euribor3m',
            'nr.employed', 'previous']

The following preprocessing was done in this stage:

  • Encoding Categorical columns

Machine learning algorithms work with numerical values only, which is why we need to convert our categorical values to numerical ones. I made use of the pandas get_dummies method and type casting to one-hot encode the columns.

# function to encode categorical columns
def encode(data):
    # one-hot encode the categorical columns with pandas get_dummies
    cat_var_enc = pd.get_dummies(data[cat_cols], drop_first=False)
    return cat_var_enc
# defining output variable for classification
# (dataset_new is the preprocessed working copy of the dataset)
dataset_new['subscribed'] = (dataset_new.y == 'yes').astype('int')
Image by author
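The snippet above assumes a working copy of the data called dataset_new; one minimal way it could have been assembled (my assumption here, the exact code is in the GitHub repository) is to concatenate the numerical columns, the target, and the encoded categorical columns:

# assumed assembly of dataset_new (see the repository for the exact code)
dataset_new = pd.concat([dataset[num_cols + ['y']], encode(dataset)], axis=1)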
  • Rescaling Numerical columns

Another data preprocessing step is to rescale our numerical columns so that they are on a comparable scale. Scikit-learn's StandardScaler was used here; it standardizes each column to zero mean and unit variance.

# import library for rescaling
from sklearn.preprocessing import StandardScaler
# function to rescale numerical columns
def rescale(data):
    # creating an instance of the scaler object
    scaler = StandardScaler()
    data[num_cols] = scaler.fit_transform(data[num_cols])
    return data
Image by author
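Applying the function is a one-liner; as a quick sanity check (my addition, not in the original code), the scaled columns should end up with a mean close to 0 and a standard deviation close to 1:

# rescale the numerical columns in place
dataset_new = rescale(dataset_new)
# sanity check: standardized columns should have mean ~0 and std ~1
print(dataset_new[num_cols].describe().loc[['mean', 'std']].round(2))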
  • Specifying Dependent and Independent Variables

To proceed in building our prediction model, we have to specify our dependent and independent variables.

Independent variables – the inputs to the process being analyzed.

Dependent variable – the output of the process.

# data is the preprocessed dataset (dataset_new from the steps above)
X = data.drop(columns=['subscribed', 'duration'])
y = data['subscribed']

The column ‘duration’ was dropped because it strongly determines the target (e.g., if duration=0 then y=’no’) and is only known after the call has been made, so keeping it would leak information into the model.

  • Splitting the Dataset

It is good practice to split the dataset into training and test sets when building a machine learning model, because it lets us evaluate the model's performance on unseen data.

# import library for splitting dataset
from sklearn.model_selection import train_test_split
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
  • Dimensionality Reduction

When we have a large number of variables, it is advisable to reduce them by keeping only the most informative ones. There are various techniques for doing this, such as PCA, t-SNE, and autoencoders. For this project, we will be using PCA.

# import PCA
from sklearn.decomposition import PCA
# create an instance of pca
pca = PCA(n_components=20)      
# fit pca to our data
pca.fit(X_train)
pca_train = pca.transform(X_train)
X_train_reduced = pd.DataFrame(pca_train)
Image by author
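The choice of 20 components is a project decision; one common way to sanity-check it (my addition, not from the original code) is to look at the cumulative explained variance of the fitted PCA:

# cumulative variance explained by the 20 retained components
explained = np.cumsum(pca.explained_variance_ratio_)
print('Variance explained by 20 components: %.2f%%' % (explained[-1] * 100))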
  • Class Imbalance

As earlier stated, we have a highly imbalanced class, and this can affect our prediction if not treated.

Image by author

In this project, I made use of SMOTE (Synthetic Minority Oversampling Technique) for dealing with class imbalance.

# importing the necessary function
from imblearn.over_sampling import SMOTE
# creating an instance
sm = SMOTE(random_state=27)
# applying it to the training set (fit_resample in current imblearn
# versions; older releases called this method fit_sample)
X_train_smote, y_train_smote = sm.fit_resample(X_train_reduced, y_train)

Note: SMOTE should be applied to the training data only, never to the test data, so that the evaluation reflects the real class distribution.
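To confirm that the resampling worked (a quick check of my own, not from the original snippets), compare the class counts before and after SMOTE:

# class distribution before and after SMOTE
print('Before:', y_train.value_counts().to_dict())
print('After: ', pd.Series(y_train_smote).value_counts().to_dict())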

Machine Learning Model

Whew! We finally made it to building the model; data preprocessing can be quite a handful when building a machine learning model. Let’s not waste any time and dive right in.

The machine learning algorithms considered in this project include:

  • Logistic Regression
  • XGBoost
  • Multi Layer Perceptron

and the cross-validation methods used (essential in our case, where we have an imbalanced class) include:

  • K-Fold: K-Fold splits a given data set into a K number of sections/folds where each fold is used as a testing set at some point.
  • Stratified K-Fold: This is a variation of K-Fold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
# import machine learning model libraries
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
# import libraries for cross validation
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
metrics = ['accuracy', 'roc_auc', 'f1', 'precision', 'recall']
# function to build machine learning models
def model(model, cv_method, metrics, X_train, X_test, y_train):
    if (model == 'LR'):
        # creating an instance of the regression
        model_inst = LogisticRegression()
        print('Logistic Regression\n----------------------')
    elif (model == 'XGB'):
        # creating an instance of the classifier
        model_inst = XGBClassifier()
        print('XGBoost\n----------------------')
    elif (model == 'MLP'):
        # creating an instance of the classifier
        model_inst = MLPClassifier()
        print('Multi Layer Perceptron\n----------------------')

    # cross validation (shuffle=True is required when setting random_state)
    if (cv_method == 'KFold'):
        print('Cross validation: KFold\n--------------------------')
        cv = KFold(n_splits=10, shuffle=True, random_state=100)
    elif (cv_method == 'StratifiedKFold'):
        print('Cross validation: StratifiedKFold\n-----------------')
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
    else:
        print('Cross validation method not found!')
        return None
    try:
        cv_scores = cross_validate(model_inst, X_train, y_train,
                                   cv=cv, scoring=metrics)
        # displaying evaluation metric scores
        cv_metric = cv_scores.keys()
        for metric in cv_metric:
            mean_score = cv_scores[metric].mean()*100
            print(metric+':', '%.2f%%' % mean_score)
            print('')

    except Exception:
        # if scoring with roc_auc fails, fall back to the remaining metrics
        metrics = ['accuracy', 'f1', 'precision', 'recall']
        cv_scores = cross_validate(model_inst, X_train, y_train,
                                   cv=cv, scoring=metrics)
        # displaying evaluation metric scores
        cv_metric = cv_scores.keys()
        for metric in cv_metric:
            mean_score = cv_scores[metric].mean()*100
            print(metric+':', '%.2f%%' % mean_score)
            print('')
    return model_inst
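The function is then called once per algorithm and cross-validation method. The exact invocations are in the repository; they would look roughly like this (the variable names here are my assumption):

# example calls (assumed; the exact invocations are in the GitHub repository)
log_reg = model('LR', 'StratifiedKFold', metrics, X_train_smote, X_test, y_train_smote)
xgb = model('XGB', 'StratifiedKFold', metrics, X_train_smote, X_test, y_train_smote)
mlp = model('MLP', 'StratifiedKFold', metrics, X_train_smote, X_test, y_train_smote)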

Evaluation Metrics

  • Accuracy: The fraction of data points predicted correctly. This can be a misleading metric for an imbalanced dataset, so it is advisable to consider other evaluation metrics as well.
  • AUC (Area Under the ROC Curve): An aggregate measure of performance across all possible classification thresholds.
  • Precision: The number of correctly predicted positive examples divided by the total number of examples predicted as positive.
  • Recall: The percentage of actual positive examples that the model correctly identifies.
  • F1 score: The harmonic mean of precision and recall.
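For reference, the same metrics can be computed directly on held-out predictions with scikit-learn (a small illustrative sketch, not part of the original code):

# computing the evaluation metrics for a set of predictions
from sklearn.metrics import (accuracy_score, roc_auc_score, precision_score,
                             recall_score, f1_score)

def report_metrics(y_true, y_hat, y_proba):
    # y_proba holds the predicted probabilities for the positive class
    print('accuracy :', accuracy_score(y_true, y_hat))
    print('roc_auc  :', roc_auc_score(y_true, y_proba))
    print('precision:', precision_score(y_true, y_hat))
    print('recall   :', recall_score(y_true, y_hat))
    print('f1       :', f1_score(y_true, y_hat))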
K-Fold Cross Validation Evaluation Metrics
Stratified K-Fold Evaluation Metrics

Comparing Results

  • K-Fold vs Stratified K-Fold

As can be seen from the tables above, Stratified K-Fold produced much better results than K-Fold cross-validation. The K-Fold cross-validation failed to provide the AUC score for the Logistic Regression and XGBoost models. Therefore, for further comparison, the Stratified K-Fold results will be used.

  • Machine Learning Models

From the results obtained, XGBoost proves to be a better prediction model than Logistic Regression and MLP, as it has the highest scores in four of the five evaluation metrics.

Prediction

XGBoost, being the best-performing model, is used for the final prediction.

# applying the fitted PCA to the test set
X_test_pca = pd.DataFrame(pca.transform(X_test))
# fitting the model to the resampled training data
# (xgb is the XGBClassifier instance returned by the model function above)
model_xgb = xgb.fit(X_train_smote, y_train_smote)
# making predictions on the test set
y_pred = model_xgb.predict(X_test_pca)
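The untouched test set then gives an honest estimate of performance. A quick way to inspect it (my addition; the article's own evaluation relies on the cross-validation tables above) is a confusion matrix and classification report:

# evaluating the predictions on the test set
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))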

Conclusion

The main objective of this project is to build a model that predicts customers that would subscribe to a bank term deposit, and we were able to achieve that by considering three different models and using the best one for the prediction. We also went through rigorous steps of preparing our data for the model and choosing various evaluation metrics to measure the performance of our models.

From the results obtained, we observed that XGBoost was the best model, with the highest scores in four of the five evaluation metrics.

Further Study

In this project, I used only three machine learning algorithms. However, algorithms such as SVM, Random Forest, and Decision Trees can also be explored.

A detailed code for this project can be found in this GitHub repository.

I know this was a very long ride, but thank you for sticking with me to the end. I also appreciate 10 Academy once again, and my fellow learners for the wonderful opportunity to partake in this project.
