Credit Risk Management: EDA & Feature Engineering

This first part covers how to clean and pre-process data for a classification problem using EDA and Feature Engineering techniques, in particular with and without touching the “target” variable.

Andrew Nguyen
Towards Data Science


Context

What are the common use cases in the financial industry that Data Science can be of great help to?

Credit Score Cards are a common risk control method in the financial industry: they use personal information and transactional records to identify and evaluate the creditworthiness of existing and potential customers. A number of use cases leverage this measure, such as loan management, credit card approval, and credit limit extension.

That said, this project’s applicability varies depending on the problem a financial institution is facing. The core of the project is the processing and transformation of the input data so that the resulting model predicts the outcome well from existing or new inputs and best addresses those problems.

Table of Contents

This end-to-end project is divided into 3 parts:

  1. Exploratory Data Analysis (EDA) & Feature Engineering
  2. Feature Scaling and Selection (Bonus: Imbalanced Data Handling)
  3. Machine Learning Modelling (Classification)

Note: Since the project aims to boost my capability in Data Science (in short, it is for self-study and self-improvement), instead of only applying the best-performing technique, it divides the dataset into 2 smaller sub-sets to test which one produces a better result.

So let’s kick off with the 1st part of the project: EDA & Feature Engineering

A. Exploratory Data Analysis (EDA)

Let’s import the necessary libraries and the two datasets:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
application = pd.read_csv("application_record.csv")
credit = pd.read_csv("credit_record.csv")
(Previews of the Application and Credit datasets)

As shown above, the Application dataset provides the data points from the personal information submitted by existing banking customers (e.g. id, gender, income, etc.), while the Credit dataset maps each corresponding id to his/her monthly loan repayment status (X stands for no loan that month, C for paid off, and a value of 0 or greater indicates the number of payment-overdue months).

For better usability of the credit information, I cleaned up the dataset by converting the “Status” column to numeric and grouping it by customer id and the latest month:

credit.status = credit.status.replace({'X':-2, 'C': -1})
credit.status = credit.status.astype('int')
credit.status = credit.status.apply(lambda x:x+1)
credit_month = credit.groupby('id').months_balance.max().reset_index()
record = pd.merge(credit_month, credit, how="inner", on=["id", "months_balance"])
record.head()

When all had been set, I combined the newly processed dataset with the Application dataset using an inner merge. On top of this, if you refer back to the original dataset, “Birth_date” and “Employment” are numbers of days counted backwards from today, which is a bit difficult to interpret at first. As such, I converted these variables into positive numbers of years instead.

#Inner-merge the cleaned credit records with the Application data (as described above)
df = pd.merge(application, record, how="inner", on="id")
#Convert the negative day counts into positive years
df['age'] = df.birth_date.apply(lambda x: round(x/-365,0))
df['year_of_employment'] = df.employment.apply(lambda x: round(x/-365,0) if x<0 else 0)
df = df.drop(columns=["birth_date","employment"])

Moving on, two steps of every EDA that I suggest you never ignore are (1) checking null values and (2) handling outliers. The former ensures that we have a 100% clean dataset before processing and plugging it into modelling, while the latter keeps the dataset from being overly skewed by extreme outliers.

df.isnull().sum()
df.occupation_type = df.occupation_type.fillna("Others")

“Occupation Type” is the only variable that has null values (NaN), so I went ahead and filled those values with “Others”.

Using df.describe and sns.boxplot, I found that “Annual Income” and “Fam Members” are the two variables with outliers in the dataset:
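The box plots are not reproduced here, but a minimal sketch of that check (re-using the column names from the code below) could look like this:

#Summary statistics plus box plots for the two suspect columns
print(df[['annual_income', 'fam_members']].describe())
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(y=df.annual_income, ax=axes[0]).set_title('Annual Income')
sns.boxplot(y=df.fam_members, ax=axes[1]).set_title('Fam Members')
plt.show()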

To remove outliers, I wrote a function which can be easily applied across variables with similar issues at once:

def remove_outlier(col):
    q25 = col.quantile(0.25)
    q75 = col.quantile(0.75)
    iqr = q75 - q25
    cutoff = iqr * 1.5
    lower = q25 - cutoff
    upper = q75 + cutoff
    return lower, upper

#Remove outliers for Annual Income
lower_1, upper_1 = remove_outlier(df.annual_income)
df = df.loc[(df.annual_income > lower_1) & (df.annual_income < upper_1)]

#Remove outliers for Fam Members
lower_2, upper_2 = remove_outlier(df.fam_members)
df = df.loc[(df.fam_members > lower_2) & (df.fam_members < upper_2)]

We’re almost done with EDA. The last step is to define the “target” variable, which you have probably come across in most classification lessons.

Going back to the topic of this project, Credit Risk Management, we need to determine how to handle the loan repayment status of the customers. With this dataset, I defined “target = 0” for those who didn’t have any loan or had paid it off that month, while the remaining records, i.e. any overdue loan, were mapped to “target = 1”.

df['target'] = None
df.loc[df.status < 1,'target']=0
df.loc[df.status >= 1,'target']=1
df.target = pd.to_numeric(df.target)

B. Feature Engineering

What is Feature Engineering and what does it do to help pre-process the data prior to modelling?

According to Wikipedia,

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms.

In fact, Feature Engineering requires not only domain knowledge but also an understanding of the dataset and the goal to be achieved. In our dataset in particular, there are quite a number of features (the independent variables) that correlate with the repayment status (the target variable, 0 or 1). As such, in order to build an insightful and actionable model, we need to “engineer” those features by transforming existing data and/or adding supporting data, which is what makes Feature Engineering different from EDA.

As I have mentioned from the beginning, we will never know which is the better approach until we test it. With that in mind, I decided to test out 2 scenarios, WITH and WITHOUT touching the target variable, and see whether there is any significant difference in the results produced later on.

Before diving into the implementation, we should be aware that there is no “one-size-fits-all” technique for Feature Engineering, as it depends on how you process your features. In this project, I leveraged “Category Encoding”, since the majority of the data is categorical and should be converted to numerical form for easier processing in most machine learning models.

df_a = df.copy()  #for encoding without target
df_b = df.copy()  #for encoding with target
x_a = df_a.iloc[:, 1:-1]
y_a = df_a.iloc[:, -1]
from sklearn.model_selection import train_test_split
x_a_train, x_a_test, y_a_train, y_a_test = train_test_split(x_a, y_a, test_size=0.3, random_state=1)

Two separate datasets were created for the two scenarios, so that we are able to manipulate each one without fear of them getting mixed up.

Also, an important point to note before processing: it is highly recommended to split the dataset into train and test sets first to avoid data leakage. Essentially, if we split after processing, the test set has already been exposed to information from the full dataset and is therefore no longer an objective benchmark for the train set in the modelling phase.

1. Category Encoding WITHOUT Target

Depending on the variable type, we will apply the suitable technique to each.

If you refer back to the dataset, there are 3 types of variable: (1) binary, (2) nominal and (3) continuous. While binary and continuous variables are pretty much self-explanatory, nominal variable refers to a group of different categories that has no intrinsic order to each other.
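If you want to double-check this classification programmatically, a rough heuristic (the logic below is purely illustrative, not part of the original workflow) is to look at each column’s dtype and number of unique values:

#Rough split of feature types by dtype and number of unique values (illustrative only)
for col in x_a_train.columns:
    n_unique = x_a_train[col].nunique()
    if n_unique == 2:
        kind = 'binary'
    elif pd.api.types.is_numeric_dtype(x_a_train[col]):
        kind = 'continuous'
    else:
        kind = 'nominal'
    print(f"{col}: {kind} ({n_unique} unique values)")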

1.1. Binary Variables

For binary variables in our dataset (e.g. gender, car, property), we can choose either Label Encoder or Label Binarizer from sklearn library, which will map the original data to 0 or 1:

#Option 1: Label Encoder (applied to >2 categories per variable)
from sklearn.preprocessing import LabelEncoder, LabelBinarizer
le = LabelEncoder()
gender_le = le.fit_transform(x_a_train.gender)
#Option 2: Label Binarizer (applied to 2 categories per variable)
bn = LabelBinarizer()
gender_bn = np.array(x_a_train.gender).reshape(-1,1)
gender_bn = bn.fit_transform(gender_bn)

1.2. Nominal Variables

Nominal variables (e.g. income type, education, family status, housing type, occupation type) are categorical and need converting to numerical form prior to modelling. Two common methods of encoding a category are (1) Dummy Encoding and (2) One Hot Encoder, both of which basically create one column per unique category within the variable and assign 0 or 1 depending on the absence/presence of that category in each row.

(Illustration of dummy encoding; image credit: https://deepai.org/machine-learning-glossary-and-terms/dummy-variable)

The difference between these methods is that Dummy Encoding converts the variable into n-1 sub-variables while One Hot Encoder converts it into n sub-variables.

#Option 1: Dummy Encoding: n-1 variables (drop_first removes the redundant column)
income_type_dummy = pd.get_dummies(x_a_train.income_type, drop_first=True)

#Option 2: One Hot Encoder: n variables (n-1 here since drop='first' is set, see below)
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder(sparse=False, drop='first', handle_unknown='error')
income_type_onehot = onehot.fit_transform(x_a_train.income_type.to_numpy().reshape(-1,1))
income_type_onehot = pd.DataFrame(income_type_onehot, columns=onehot.get_feature_names(['income_type']))

Dummy Encoding can easily be done via pd.get_dummies() as it’s already part of the pandas library. For One Hot Encoder, we need to import it from the sklearn library and transform each variable, either individually or all at once.

One Hot Encoder was designed to keep the set of categories consistent across train and test sets (e.g. handling categories that appear in only one of them), so it is recommended over Dummy Encoding thanks to the easier control offered by handle_unknown="error".
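As a sketch of that consistency (re-using the onehot encoder fitted on the training data above), the same fitted encoder can be applied to the test set; with handle_unknown="error" it will raise an error if the test set contains a category that was never seen during fitting:

#Apply the encoder fitted on the train set to the test set: same columns, same order
income_type_test_onehot = pd.DataFrame(
    onehot.transform(x_a_test.income_type.to_numpy().reshape(-1, 1)),
    columns=onehot.get_feature_names(['income_type'])
)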

However, one of the drawbacks of One Hot Encoder is multicollinearity, which occurs when variables or sub-variables are highly linearly related to one another and hence reduces the accuracy of our models. This can be rectified or avoided by setting the parameter drop='first', which removes one of the sub-variables after encoding.

1.3. Continuous Variables

Continuous variables are numeric variables with an infinite number of possible values between any two values. Essentially, it would take forever to count them! Let’s look at the distribution of each continuous variable in our dataset:
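The original distribution plots are not reproduced here; a minimal sketch of how they could be drawn (the plot style is my own choice, not necessarily the one used originally):

#Distributions of the two continuous variables: age and annual income
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(x_a_train.age, bins=30, ax=axes[0]).set_title('Age')
sns.histplot(x_a_train.annual_income, bins=30, ax=axes[1]).set_title('Annual Income')
plt.show()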

The left figure illustrates the range of age of the customers while the right shows the distribution of different income buckets. Two of the common methods to handle this type of variable are (1) Fixed-width Binning and (2) Adaptive Binning. In particular, the former creates sub-categories from pre-defined bins (e.g. age: 10–20, 20–30, 30–40, etc.) while the latter relies on the distribution of the data.

The trade-off of Fixed-width Binning is that it is easy and simple to encode the variable, yet relatively subjective since it doesn’t take the data itself into consideration. As such, I suggest opting for Adaptive Binning, which follows the data distribution closely. Based on what I observed, instead of converting into 2-bin categories, I decided to go for quantiles (quartiles in this case) since the original distribution is widely spread, after which Label Encoding was applied.
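For contrast, a fixed-width version might look like the snippet below (the bin edges are purely illustrative and are not used further in this project):

#Fixed-width binning with pre-defined, illustrative age edges (not used further)
age_edges = [0, 20, 30, 40, 50, 60, 100]
age_fixed_binned = pd.cut(x_a_train.age, bins=age_edges)

The quantile-based (adaptive) version actually used in this project follows: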

#Convert each variable into 4 equal-sized bins (quartiles)
x_a_train['age_binned'] = pd.qcut(x_a_train.age, q=[0, .25, .50, .75, 1])
x_a_train['annual_income_binned'] = pd.qcut(x_a_train.annual_income, q=[0, .25, .50, .75, 1])

#Apply Label Encoder to assign a numeric label to each bin without bias
x_a_train['age'] = le.fit_transform(x_a_train['age_binned'])
x_a_train['annual_income'] = le.fit_transform(x_a_train['annual_income_binned'])

Tada! We have engineered all the necessary variables without ever touching the target variable! Now, let’s move on to the other scenario before fitting each dataset in the modelling phase.

2. Category Encoding WITH Target

As this method leverages the correlation with the target variable, all independent variables should be encoded in the same way to keep the comparison objective.

Again, the same prerequisite has to be followed: train_test_split before any processing.

x_b = df_b.iloc[:, 1:-1]
y_b = df_b.iloc[:, -1]
from sklearn.model_selection import train_test_split
x_b_train, x_b_test, y_b_train, y_b_test = train_test_split(x_b, y_b, test_size=0.3, random_state=1)

Among the techniques for category encoding with the involvement of the target, I found 3 common options which are widely used: (1) Weight-of-Evidence Encoder (WOE), (2) Target Encoder, and (3) Leave-One-Out Encoder (LOO).

In brief,

  • WOE Encoder: Weight-of-Evidence encoding is a widely used technique in credit risk modelling which maximises the separation between the unique categories of each variable with respect to the target. It is easily understood from its mathematical definition: the natural log of % Good (in this case, target = 0) divided by % Bad (target = 1), as sketched right after this list.
  • Target Encoder & LOO Encoder: while the former replaces a categorical value with the mean of the target variable across all rows in the dataset, the latter does the same but excludes the row itself. The reason is to avoid direct target leakage from using too much information before modelling.
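To make the two bullets above concrete, here is a small, hand-rolled illustration of both calculations (assuming target = 0 is “good” and target = 1 is “bad”, as defined earlier; the function names and the eps guard against division by zero are my own):

import numpy as np
import pandas as pd

def manual_woe(col, target, eps=1e-6):
    #Share of "good" (target = 0) and "bad" (target = 1) customers per category
    counts = pd.crosstab(col, target)
    pct_good = (counts[0] + eps) / counts[0].sum()
    pct_bad = (counts[1] + eps) / counts[1].sum()
    #WOE = natural log of (% good / % bad) for each category
    return np.log(pct_good / pct_bad)

def manual_target_mean(col, target):
    #Target encoding by hand: each category is replaced by the mean of the target
    return target.groupby(col).mean()

#Illustrative usage on one training column:
#x_b_train['income_type_woe'] = x_b_train.income_type.map(manual_woe(x_b_train.income_type, y_b_train))

In practice, the category_encoders library handles this for every categorical column at once, as shown below.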
#Option 1: WOE Encoder
import category_encoders as ce

#WOEEncoder transforms every categorical column against the target in a single call
woe = ce.WOEEncoder()
df_woe_train = woe.fit_transform(x_b_train, y_b_train)

#Option 2: Target Encoder
from category_encoders import TargetEncoder

te = TargetEncoder()
df_te_train = te.fit_transform(x_b_train, y_b_train)

After testing both methods, it seems that the two newly encoded datasets are not much different from each other. Hence, I opted for the WOE-encoded dataset for the following steps of this project. However, please test this out on other datasets, as they might produce a different result, possibly due to a different data distribution.

(Previews of the WOE-encoded and Target-encoded training sets)

Voila! That’s a wrap for the 1st part of this project!

As shared above, the project was created as a repository of learning notes and highlights along the journey of improving my Data Science skillset. Hence, I have tested out multiple methods in each section in order to find the best-performing technique to use moving forward.

Do look out for the next 2 parts, which cover Feature Scaling/Selection and Machine Learning Modelling! In the meantime, let’s connect:

Github: https://github.com/andrewnguyen07
LinkedIn: www.linkedin.com/in/andrewnguyen07

Thanks!
