
Developing a Multi-Layer Perceptron for a Bank Marketing Dataset

Predicting Whether a Client Will Subscribe to a Term Deposit by Analyzing a Bank Marketing Dataset

Photo by Chris Liverani on Unsplash

In this article, I’ll be explaining step by step how to develop a Multi-Layer Perceptron model. To do that, I will be taking a dataset related to direct marketing campaigns (phone calls) of a Portuguese banking institution (which is accessible from the UCI Machine Learning Repository) and will be predicting whether a client will subscribe to a term deposit or not.

Prerequisites for this Tutorial

In this tutorial, I assume that you are aware of what a neural network is, what a neural network is made of, and how the learning process happens in a neural network (NN): by repeatedly calculating the error value for a training round and updating the model's weights accordingly.

Therefore, in this article, I go directly to explaining what the Multilayer Perceptron (MLP) model is and how to develop an MLP to solve a real-world problem.

In this tutorial, I also assume that you know how to create a basic neural network model using Python from Google Colab or any other tool.

A link to the code described in this article is given at the end.


This article is organized into the following sections. If you are already familiar with a particular section, you can jump directly to the one you need.

  1. Background Knowledge Needed for the Tutorial
  2. Overview of the Dataset
  3. Developing the MLP Neural Network Model

1. Background Knowledge Needed for the Tutorial

Before stepping into understanding how to develop the MLP model, let us first be sure that we are on the same page regarding these concepts.

What is a Multi-Layer Perceptron Model in Artificial Neural Networks?

A multilayer perceptron (MLP) is a neural network that connects multiple layers in a directed graph, which means that the paths connecting nodes between layers only go one way. Each node, apart from the input nodes, has a nonlinear activation function. An MLP uses backpropagation as a supervised learning technique [1], so that the error can be reduced more effectively at each step, taking into account what the model has already learned.

Domain Knowledge on Bank Marketing & Term Deposits

A term deposit, sometimes referred to as a fixed deposit, is a form of investment at a bank: a lump sum deposited at an agreed rate of interest for a fixed period of time. Customers tend to place term/fixed deposits to make sure their money gains more value in the years to come and to save more than they would with a regular savings account.

On the other hand, it is beneficial for banks to offer term deposits to customers, as it is a way for banks to bring more money into their cycle. In return, the bank offers customers better interest rates than on savings accounts.

Banks carry out promotional activities, such as telephone, email, and advertising campaigns, to attract customers to open term deposits with them.

2. Overview of the Dataset

This dataset, as mentioned above, is obtained from the UCI Machine Learning Repository and has the following features.

Screenshot by Author – A glimpse of the dataset

There are 17 variables, including 16 features and the class variable. A description of all the features in the dataset is given in the data source, which can be found in the UCI Machine Learning Repository.

When observing the dataset, we can see that it contains categorical nominal values (such as job, marital status, and housing loan availability), categorical ordinal values (such as education, where an order can be identified from lowest to highest), and finally, discrete and continuous values.

Next, we check the class value counts to get an idea of how many classes there are and how many records belong to each class label in the dataset.
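A minimal sketch of this check, assuming the data has been loaded into a pandas DataFrame named df and that the class label column is named y (as in the UCI file; the file name below is an assumption):

```python
import pandas as pd

# The UCI bank marketing file is semicolon-separated
df = pd.read_csv('bank-full.csv', sep=';')

# How many records belong to each class label?
print(df['y'].value_counts())
```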

Screenshot by Author – Checking value counts of class label

Considering the counts of the class variable, we can see that the dataset is imbalanced, since the number of 'no' records is almost 8 times that of the 'yes' class. In this tutorial I will not be handling the class imbalance, but it can be handled with techniques such as SMOTE.

3. Developing the MLP Neural Network Model

When developing a neural network model, there are three main stages we have to focus on:

a. Preprocessing the dataset as specified in the data mining process

b. Perform feature engineering if necessary

c. Divide the dataset into training and testing sets. Train the model, then test it and predict the desired outcome.

Now we shall look into how to carry out each stage mentioned above.


a. Preprocessing the Dataset as Specified in the Data Mining Process

Preprocessing the data includes handling missing values and outliers, applying feature coding techniques if needed, and scaling and standardizing features.

Checking for Missing Values

When using Pandas, we can find the standard missing values (missing values that Pandas can detect) using isnull() and get a summary of the missing values using isnull().sum(). [2] [3]
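For example, assuming the DataFrame is named df:

```python
# Per-cell True/False mask of standard missing values
print(df.isnull().head())

# Count of standard missing values in each column
print(df.isnull().sum())
```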

Screenshot by Author – Summary of the Missing Values

The other type of missing value is the non-standard missing value, which Pandas cannot find on its own and which needs our assistance [2]. When manually scanning through the dataset, we can see that there are 4 fields that contain the value 'unknown'.

Screenshot by Author – Summary of the Missing Values 'Unknown'
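One way to summarize these non-standard missing values is to count the occurrences of 'unknown' in each column; a sketch, assuming the raw data is in df:

```python
# Count 'unknown' entries per column and show only the affected columns
unknown_counts = (df == 'unknown').sum()
print(unknown_counts[unknown_counts > 0])
```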

Handling Outliers

Outliers are data points that deviate a lot from the rest of the dataset. Having outliers in the dataset when training and building a model affects the ultimate accuracy. Therefore, we have to find and remove such outliers.

We can only check for outliers in numerical features. Therefore, I have gone through each numerical feature one by one, drawing boxplots to identify outliers, and have removed them.

The numerical features I have checked for outliers are 'age', 'balance', 'day', 'duration', 'campaign', 'pdays', and 'previous'. After successfully removing 119 outliers, the new number of data points in the dataset is 45092 (previously it was 45211).

The box plots are shown one by one in the notebook/Colab file.
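The exact outlier criterion used in the notebook is not reproduced here, but a typical boxplot-based (1.5 × IQR) check for a single numerical feature looks roughly like this sketch (the 'balance' column is just one example):

```python
import matplotlib.pyplot as plt

# Visualize the spread of one numerical feature
df.boxplot(column='balance')
plt.show()

# Drop points outside the 1.5 * IQR whiskers of the boxplot
q1, q3 = df['balance'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[(df['balance'] >= lower) & (df['balance'] <= upper)]
```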

Feature Encoding

In this process, the categorical data are encoded into numerical data.

  1. The LabelEncoder is used to encode the class values.
  2. OrdinalEncoder was used to encode the 'education' feature, because, considering the values present in this field, an order can be seen (secondary, tertiary, etc.).
  3. OneHotEncoding was used to encode features that had other categorical values. (features – ‘job’, ‘marital’, ‘contact’, ‘month’, ‘poutcome’)
  4. Manual binary encoding (using a dictionary) was used to encode the rest of the features that had values as yes/no. (‘default’, ‘housing’, ‘loan’)
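A condensed sketch of those four encoding steps; the column names follow the dataset, the category ordering for 'education' is an assumption, and pandas' get_dummies is used here as a shorthand for one-hot encoding:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# 1. Encode the class label ('no'/'yes' -> 0/1)
df['y'] = LabelEncoder().fit_transform(df['y'])

# 2. Ordinal-encode 'education', preserving its natural order
edu_order = [['unknown', 'primary', 'secondary', 'tertiary']]
df['education'] = OrdinalEncoder(categories=edu_order).fit_transform(df[['education']]).ravel()

# 3. One-hot encode the remaining nominal features
df = pd.get_dummies(df, columns=['job', 'marital', 'contact', 'month', 'poutcome'])

# 4. Manually map the remaining yes/no features to 1/0
binary_map = {'yes': 1, 'no': 0}
for col in ['default', 'housing', 'loan']:
    df[col] = df[col].map(binary_map)
```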

After this point, I have encoded all the values in the dataset into numerical values.

Splitting the Data

I have split the data with a test size of 20%.
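A sketch of the split, assuming the encoded feature matrix is X and the class label is y:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the records for testing
X = df.drop('y', axis=1)
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```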

Feature Scaling

After encoding the categorical data, the dataset consists of features with different value ranges. These values are standardized: numerical features are scaled by removing the mean and scaling to unit variance (StandardScaler), as follows.
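A sketch of that scaling step; note that the scaler is fit on the training set only and then applied to both sets:

```python
from sklearn.preprocessing import StandardScaler

# Remove the mean and scale to unit variance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```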


b. Feature Engineering

Feature selection is one of the core concepts in machine learning and hugely impacts the performance of your model. The features you use to train your models have a major influence on the performance you can achieve; irrelevant or partially relevant features can negatively impact it.

Drawing the Correlation Matrix

Therefore, I will compute correlation coefficients to check the relationship between the different features and the output.

Correlation exists on a spectrum: a slightly or moderately positive correlation between features might be something like 0.5 or 0.7, while a strong or perfect positive correlation is represented by a correlation value of 0.9 or 1. [4]
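A minimal sketch of computing and plotting such a correlation matrix with pandas and seaborn, assuming the encoded features are kept in a DataFrame named df:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Pearson correlations between all encoded features
corr = df.corr()

plt.figure(figsize=(20, 16))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.show()
```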

Screenshot by Author – Correlation Matrix

The correlation matrix I generated has a lot of elements because, after the OneHotEncoding, the number of columns in the dataset was increased.

After generating the correlation matrix, we can see that, towards the right side of the matrix, there are features that have a very high correlation. We usually remove such highly correlated features because they are somewhat linearly dependent on other features; they contribute very little to predicting the output but increase the computational cost. [5]

Correlated features essentially carry the same information, so it is logical to remove one of them. [6]

In order to find the exact columns that have high correlation values, I perform the code below. I check only the upper triangle of the correlation matrix, because the upper and lower triangles are mirror images of each other, divided by the diagonal. Here I am looking for columns that have correlation values greater than 0.95, with the intention of removing them.
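A sketch of that check, reusing the corr matrix computed above:

```python
import numpy as np

# Keep only the upper triangle (excluding the diagonal) to avoid duplicate pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Columns that have any correlation greater than 0.95 with another column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # an empty list means there is nothing to remove
```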

However, after running this check, we can see that there are no columns with a correlation above 0.95, and therefore there are no columns to be removed.

Applying PCA

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set. [7]

Here, I have not manually fixed the number of components for the PCA model. Since we want the explained variance to be between 95–99%, I have set PCA's n_components to 0.95. [8]
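A sketch of this step, with n_components given as a variance fraction rather than a fixed integer:

```python
from sklearn.decomposition import PCA

# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

print(pca.n_components_)  # number of components actually retained
```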


c. Developing the MultiLayer Perceptron Model

I have built the MLP model using the MLPClassifier available in the scikit-learn package. [9]
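A minimal sketch of such a model; the hidden-layer sizes, activation, solver, and iteration count below are illustrative assumptions, not the exact settings used in the notebook:

```python
from sklearn.neural_network import MLPClassifier

# Illustrative hyperparameters; the notebook's exact settings may differ
mlp = MLPClassifier(hidden_layer_sizes=(100, 50),
                    activation='relu',
                    solver='adam',
                    max_iter=500,
                    random_state=42)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
```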

Generating the Confusion Matrix

Using a confusion matrix, we can find how many true positives, false positives, false negatives, and true negatives there are.
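For example, with scikit-learn:

```python
from sklearn.metrics import confusion_matrix

# Rows are the actual classes, columns are the predicted classes
print(confusion_matrix(y_test, y_pred))
```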

Screenshot by Author – Confusion matrix for the model I've built

The above confusion matrix shows that there are 340 true positives and 7400 true negatives, which is still good for an imbalanced dataset. The number of false positives is 570 and the number of false negatives is 750.

At the end, I have listed the Mean Squared Error value, the training set score, the test set score, etc., as follows.
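A sketch of how those figures can be produced:

```python
from sklearn.metrics import mean_squared_error

print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
print('Training set score:', mlp.score(X_train, y_train))
print('Test set score:', mlp.score(X_test, y_test))
```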

Screenshot by Author – Output of training and testing scores

We can see that the model I have created has a training accuracy of 97.369%, while the testing accuracy is 85.386%.

The code written for this task can be found here.

References

[1] Multilayer Perceptron (MLP) – by Techopedia

[2] Data Cleaning with Python and Pandas: Detecting Missing Values

[3] Working with Missing Data in Pandas

[4] Why Feature Correlation Matters…. A Lot!

[5] How to drop out highly correlated features in Python? – DeZyre

[6] In supervised learning, why is it bad to have correlated features?

[7] A Step-by-Step Explanation of Principal Component Analysis (PCA)

[8] PCA – how to choose the number of components?

[9] Scikit Learn – MLPClassifier

