
In this article I will describe end-to-end scorecard development for the banking industry using the machine learning library PyCaret. My first encounter with scorecard development was almost twelve years ago, when I built a propensity scorecard whose objective was to predict which customers were more likely to take up a particular banking product. I ran a logistic regression model in SAS and SAS Enterprise Miner, and the whole process took almost three weeks!
With the advent of sophisticated machine learning algorithms, I started using different R/Python packages and wrote lengthy code to get the best model for scorecard development. However, the challenge lay in preparing the data differently for each kind of algorithm.
When it comes to developing a machine-learning-driven scorecard with model interpretability, PyCaret is a savior. This low-code library can be used to perform complex machine learning tasks; I recently built a scorecard that took only an hour to complete.
Practical use case: developing a scorecard where a lower score implies a higher likelihood of credit card default by the customer.
To develop the solution, the dataset used is from Kaggle: here. (Although this dataset contains only 25 columns, in real use cases more than 2,000 features are considered; keep this in mind when looking at the approach below.)
Step 1: Install the required packages for the exercise:
pip install llvmlite -U --ignore-installed
pip install -U setuptools
pip install -U pip
pip install pycaret==2.3.1
pip install pandasql
pip install matplotlib
pip install shap
pip install seaborn
pip install sweetviz
from sklearn.metrics import roc_auc_score,balanced_accuracy_score, f1_score, accuracy_score
from itertools import combinations, chain
from pandas._libs.lib import is_integer
from pycaret.classification import *
import matplotlib.patches as patches
import matplotlib.ticker as mtick
import matplotlib.pyplot as plt
import seaborn as sns
import sweetviz as sv
import pandasql as ps
import pandas as pd
import numpy as np
import shap
import math
Step 2: Import the data (here from a Google Cloud Storage bucket) and run EDA
path='gs://pt-test/UCI_Credit_Card.csv'
raw = pd.read_csv(path, encoding = 'cp1252')
print(raw.shape)
##output
(30000, 25)
##Let's drop the variable gender (SEX) as we don't want to discriminate between male and female
dataset_raw = raw.drop(columns =['SEX'])
print(dataset_raw.shape)
##output
(30000, 24)
Run the EDA with one line of code and generate the EDA report using sweetviz:
feature_config = sv.FeatureConfig(skip=['ID']) # skip any feature that you don't want to include in the EDA
my_report = sv.analyze(dataset_raw, "default.payment.next.month", feature_config)
my_report.show_html()

Step 3: Data preprocessing and setting up PyCaret
- identify the numeric and categorical features
- impute numeric missing by mean
- impute categorical missing by mode
- remove outliers with a 5% threshold
- take 80% for training data and 20% for test data
- remove multicollinearity (see the note after the setup call below)
cat_feat = list(dataset_raw.select_dtypes(include=['object']).columns)
int_feat = list(dataset_raw.select_dtypes(include=['int64','float64','float32']).columns)
int_feat.remove('default.payment.next.month')
print(cat_feat)
##output
[] # there are no categorical features in this dataset
#setting up the environment:
clf = setup(dataset_raw
,target = 'default.payment.next.month'
,ignore_features = ['ID'] #ignored from model training
,numeric_imputation = 'mean'
,categorical_imputation = 'mode'
,categorical_features = cat_feat
,numeric_features = int_feat
,remove_outliers = True
,train_size = 0.80
,session_id = 1988
)
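The checklist above also calls for removing multicollinearity, which is not switched on in the setup() call as written. In PyCaret 2.x this can be enabled directly in setup(); below is a minimal sketch of the extra arguments (the 0.9 threshold is an illustrative choice, not the value used in this run):
#optional: same setup with multicollinearity removal enabled
clf = setup(dataset_raw
,target = 'default.payment.next.month'
,ignore_features = ['ID']
,numeric_imputation = 'mean'
,categorical_imputation = 'mode'
,categorical_features = cat_feat
,numeric_features = int_feat
,remove_outliers = True
,remove_multicollinearity = True #drop one of each pair of highly correlated features
,multicollinearity_threshold = 0.9 #illustrative threshold
,train_size = 0.80
,session_id = 1988
)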

Step 4: Run compare_models and select the top n features
This step is run primarily to determine the most important features. In practice we start with more than 2,000 features (customer demographics, banking behaviour, competitor information, etc.); we run compare_models to get the best model and then use that model for feature selection. Feature selection can also be done in the setup() step (a sketch of those arguments follows the compare_models call below), but it takes significant time to get results without a GPU.
base_model = compare_models(fold = 5,sort = 'AUC', n_select = 1)
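As mentioned above, feature selection can also be pushed into setup() itself instead of ranking features with the best model's importances. A minimal sketch of those arguments, assuming PyCaret 2.x (the threshold shown is the library default, not a tuned value):
#optional: let PyCaret select features during setup (slower without a GPU)
clf = setup(dataset_raw
,target = 'default.payment.next.month'
,ignore_features = ['ID']
,numeric_imputation = 'mean'
,remove_outliers = True
,feature_selection = True #PyCaret keeps a subset of important features
,feature_selection_threshold = 0.8 #library default; higher keeps more features
,train_size = 0.80
,session_id = 1988
)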

Given that the best model on AUC is the Gradient Boosting Classifier, we will use it to get the top n features. From a scorecard perspective the final count generally ranges from 25 to 30 features (having started with ~2,000). On this small dataset we will select the top 10 features.
n = 10
X_train = get_config('X_train')
#pair every training feature with its importance from the best model
var_imp = pd.DataFrame({'var': X_train.columns, 'imp': base_model.feature_importances_})
var_imp = var_imp.sort_values(['imp'], ascending=False)
var_imp_fin = var_imp['var'].head(n).values.tolist()
print(var_imp_fin)
##output
['PAY_0', 'PAY_2', 'PAY_3', 'BILL_AMT1', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_6', 'PAY_4', 'PAY_AMT3', 'PAY_AMT2']
Step 5: Re-run setup and fine-tune the model
- subset the data to the selected significant variables
- run compare_models
- tune the model and, if needed, run a custom grid search
- score the model and evaluate
dataset_raw = raw[var_imp_fin + ['ID','default.payment.next.month']]
cat_feat = list(dataset_raw.select_dtypes(include=['object']).columns)
int_feat = list(dataset_raw.select_dtypes(include=['int64','float64','float32']).columns)
int_feat.remove('default.payment.next.month')
clf = setup(dataset_raw
,target = 'default.payment.next.month'
,ignore_features = ['ID']
,numeric_imputation = 'mean'
,categorical_imputation = 'mode'
,categorical_features = cat_feat
,numeric_features = int_feat
,remove_outliers = True
,train_size = 0.80
,session_id = 1988
)
base_model2 = compare_models(fold = 5,sort = 'AUC', n_select = 1)

The Gradient Boosting Classifier again tops the table on AUC; note, however, that the AUC has come down from 0.7788 to 0.7687 because the number of features was reduced from 25 to 10. More broadly, this trade-off also helps you decide how many features to keep in the final model, since you don't want to give up too much AUC; one way to visualise it is sketched below.
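If you want to see how much AUC you lose as features are dropped, one quick (and admittedly heavy) way is to loop over different cut-offs of the ranked importance list and re-fit the same model. This is a minimal sketch, not part of the original workflow; it reuses var_imp from Step 4 and assumes PyCaret 2.x (silent=True, pull()):
#AUC vs number of top features kept (illustrative sweep)
auc_by_n = {}
for k in [5, 10, 15, 20]:
    top_k = var_imp['var'].head(k).tolist()
    clf_k = setup(raw[top_k + ['ID','default.payment.next.month']]
                  ,target = 'default.payment.next.month'
                  ,ignore_features = ['ID']
                  ,numeric_imputation = 'mean'
                  ,remove_outliers = True
                  ,train_size = 0.80
                  ,session_id = 1988
                  ,silent = True)  #silent=True skips the interactive dtype confirmation
    gbc_k = create_model('gbc', fold = 5)
    auc_by_n[k] = pull().loc['Mean','AUC']  #pull() returns the last cross-validation score grid
print(auc_by_n)
#note: each setup() call replaces the active experiment, so re-run the Step 5 setup before continuing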
Auto-tune the hyperparameters of the model, since compare_models runs each model with its default hyperparameters:
model_tune_gbc = tune_model(base_model2, n_iter=5, optimize='AUC')
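Before scoring, the tuned model can be inspected with PyCaret's built-in evaluation plots; for instance (standard PyCaret 2.x calls, not shown in the original run):
plot_model(model_tune_gbc, plot = 'auc')      #ROC curve on the hold-out set
plot_model(model_tune_gbc, plot = 'feature')  #feature importance of the tuned model
evaluate_model(model_tune_gbc)                #interactive widget with all available plots (notebook only)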

Step 6: Score the train, test and whole dataset and compare the Gini
def score(main_data,model_name):
    #row indices of the train/test split created by setup()
    train = get_config('X_train').index
    test = get_config('X_test').index
    #raw_score=True returns the class probabilities as Score_0 / Score_1
    predict = predict_model(model_name,main_data,raw_score=True)
    #odds of non-default vs default, so that a lower score means a higher default likelihood
    predict['odds'] = predict['Score_0']/predict['Score_1']
    #28.8539 = 20/ln(2): every 20 points double the odds (PDO = 20); a score of 200 corresponds to odds of 1:1
    predict['score'] = 200 + 28.8539*np.log(predict['odds'])
    predict['score'] = predict.score.round(0).astype(int)
    predict_train = predict.loc[train]
    predict_test = predict.loc[test]
    auc_train = roc_auc_score(predict_train['default.payment.next.month'], predict_train['Score_1'])
    print('Gini_train: %.3f' % (2*auc_train-1))
    auc_test = roc_auc_score(predict_test['default.payment.next.month'], predict_test['Score_1'])
    print('Gini_test: %.3f' % (2*auc_test-1))
    return predict,predict_train,predict_test
#call the function
scr_all1,scr_train,scr_test = score(dataset_raw,base_model2)
#output
Gini_train: 0.636
Gini_test: 0.565
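The scaling constants follow the usual "points to double the odds" (PDO) convention: score = offset + factor × ln(odds), with factor = PDO / ln(2). Here factor = 28.8539 implies PDO = 20, and the offset of 200 is the score assigned at odds of 1:1. A minimal sketch of how the constants can be derived (variable names are illustrative):
PDO = 20           #points added when the odds double
base_score = 200   #score assigned at the base odds
base_odds = 1.0    #odds of non-default vs default at the base score
factor = PDO/np.log(2)                           #= 28.8539
offset = base_score - factor*np.log(base_odds)   #= 200 when base_odds = 1
print(round(factor,4), offset, round(offset + factor*np.log(2)))   #doubling the odds adds 20 points -> 220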
Looking at the result above, the difference between the train and test Gini is more than 10%, so it makes sense to run a custom grid search to bring the gap below 10%.
a) First, print the final model's hyperparameters:
print(model_tune_gbc)
##output
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=6,
                           max_features='log2', max_leaf_nodes=None,
                           min_impurity_decrease=0.002, min_impurity_split=None,
                           min_samples_leaf=4, min_samples_split=5,
                           min_weight_fraction_leaf=0.0, n_estimators=70,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=1988, subsample=0.35, tol=0.0001,
                           validation_fraction=0.1, verbose=0, warm_start=False)
After looking at the hyperparameter values, I am going to play around with n_estimators and learning_rate
params = {'n_estimators':[30,40,50,60],
'learning_rate':[0.05,0.2]
}
gbc_custom = tune_model(model_tune_gbc,custom_grid=params)

Re-run the scoring and check the Gini difference:
scr_all,scr_train,scr_test = score(dataset_raw,gbc_custom)
##output
Gini_train: 0.593
Gini_test: 0.576
As you can see, the difference between the train and test Gini is now less than 3%.
Step 7: Save all the relevant datasets and the model object
#final model
scr_all1,scr_train,scr_test = score(dataset_raw,gbc_custom)
scr_all1.to_csv('scr_all_gbc_custom.csv')
scr_train.to_csv('scr_train_gbc_custom.csv')
scr_test.to_csv('scr_test_gbc_custom.csv')
save_model(gbc_custom,'gbc_custom')
The model is saved as a pipeline, and the output looks like the one below:

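To reuse the saved pipeline later, for example when scoring a fresh batch of customers, it can be loaded back with load_model and applied with predict_model. A minimal sketch, where new_customers.csv is a placeholder for incoming data with the same columns used in training:
#load the saved pipeline (preprocessing steps + tuned GBC) from gbc_custom.pkl
saved_gbc = load_model('gbc_custom')
new_data = pd.read_csv('new_customers.csv')   #placeholder file name
scored = predict_model(saved_gbc, data = new_data, raw_score = True)
#convert probabilities to a score, reusing the same scaling as in score()
scored['score'] = (200 + 28.8539*np.log(scored['Score_0']/scored['Score_1'])).round(0).astype(int)
print(scored[['score','Score_1']].head())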
In the next part, I will go into the details of model scoring and model evaluation, and the metrics produced for model documentation, such as stability, Gini, the gains matrix, and rank ordering across the train and test datasets.