
Scorecard Development for Finance Industry Using PyCaret – Part 2

Evaluating a classification model with a low-code workflow

In Part 1, I explained how model development and training can be streamlined using the PyCaret framework. In this part, we will focus on how to evaluate the developed model by comparing it across the Train, Test, and Out-of-Time Validation datasets. In the PyCaret library, one line of code gets you all the necessary model evaluation metrics, which helps in finalizing the model.

Photo by Ameen Fahmy (Unsplash)

Model evaluation can be broadly categorized into Stability and Predictive Power, each with several components under it. The different metrics of model evaluation are explained below.

Key Metrics for Model Evaluation

Population Stability Index – The Population Stability Index (PSI) measures the difference in distribution between the Development and Validation/Test datasets. It always takes a positive value, and the higher the index, the larger the difference between the distributions.

PSI = Σ (development% − validation%) × ln(development% / validation%) × 100

Rule of Thumb: PSI < 10 is Green (hardly any difference between the distributions). Values between 10 and 25 are Amber (some shift has been observed in the distribution and requires investigation). PSI > 25 is Red, indicating the distributions differ materially between the Development and Validation datasets.
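As a quick worked example: if a score band holds 20% of the development sample but 22% of the validation sample, its contribution is (0.20 − 0.22) × ln(0.20 / 0.22) × 100 ≈ 0.19, comfortably in the Green zone; the overall PSI is the sum of such contributions across all bands.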

From the previous part, I have saved the scr_train and scr_test files containing the customer list, the score variable, the predicted and actual values of the event, and the features that came out significant in the model. Now, to compute the PSI, we will use the following code:

from sklearn.metrics import roc_auc_score,balanced_accuracy_score, f1_score, accuracy_score
from itertools import combinations, chain
from pandas._libs.lib import is_integer
from pycaret.classification import *
import matplotlib.patches as patches
import matplotlib.ticker as mtick
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
import sweetviz as sv
import pandasql as ps
import pandas as pd
import numpy as np
# import shap
import math
def psi(X,Y):

    X['pentile'] = pd.qcut(X['score'], 5, labels=False) + 1

    ##Finding the boundary conditions for each pentile
    X_tile = X.groupby("pentile").agg({"score": ["min", "max"]}).reset_index()
    X_tile.columns = ['pentile','min','max']

    ##Fixing lowest and highest value for min and max respectively
    X_tile.loc[0, 'min'] = -10000
    X_tile.loc[4, 'max'] = 10000

    ##joining based on pentile conditions
    sqlcode2 = '''
    select c.pentile, c.cnt as X_count, c.X_tot, d.cnt as Y_cnt, d.Y_tot
    from
        (select a.*, b.*
        from
            (select b.pentile, count(*) as cnt
            from X a
            left join X_tile b
            on a.score>=b.min and a.score<=b.max
            group by b.pentile) a
        cross join
            (select count(*) as X_tot from X) b ) c
    left join
        (select a.*, b.*
        from
            (select b.pentile, count(*) as cnt
            from Y a
            left join X_tile b
            on a.score>=b.min and a.score<=b.max
            group by b.pentile) a
        cross join
                (select count(*) as Y_tot from Y) b ) d
    on c.pentile=d.pentile
    '''

    psi_stg0 = ps.sqldf(sqlcode2,locals()) 
    psi_stg0['X_perc'] = psi_stg0['X_count']/psi_stg0['X_tot']
    psi_stg0['Y_perc'] = psi_stg0['Y_cnt']/psi_stg0['Y_tot']
    psi_stg1 = psi_stg0.drop(['X_count', 'X_tot', 'Y_cnt','Y_tot'], axis=1)

    ##Final PSI calculation
    psi_stg1['psi'] = (psi_stg1['X_perc'] - psi_stg1['Y_perc'])*np.log((psi_stg1['X_perc']/psi_stg1['Y_perc']))*100
    psi_stg2 = pd.merge(psi_stg1, X_tile,  how='left', left_on=['pentile'], right_on = ['pentile'])
    psi_stg2.loc[0, 'min'] = 'low'
    psi_stg2.loc[4, 'max'] = 'high'
    psi_stg2['score_band'] = psi_stg2['min'].astype(str) + "-" + psi_stg2['max'].astype(str)
    psi = pd.DataFrame(psi_stg2[['score_band','X_perc','Y_perc','psi']])

    return psi
psi_train_test = psi(scr_train, scr_test)
psi_train_test = psi_train_test.rename(columns={'X_perc': 'scr_train_perc', 'Y_perc': 'scr_test_perc'})
psi_train_test['scr_train_%']=round(psi_train_test['scr_train_perc']*100,2)
psi_train_test['scr_test_%']=round(psi_train_test['scr_test_perc']*100,2)
psi_train_test['psi']=round(psi_train_test['psi'],2)
psi_train_test1=psi_train_test[['score_band','scr_train_%','scr_test_%','psi']]
print(psi_train_test1)
print('PSI - scr_train vs scr_test: ' + str(round(sum(psi_train_test['psi']),2)))
Result of PSI – the distributions of scr_train and scr_test are similar
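To tie the overall number back to the rule of thumb above, a small helper (a sketch using the thresholds stated earlier, not part of the original pipeline) can flag the result:

# Map the overall PSI to the traffic-light thresholds (<10 Green, 10-25 Amber, >25 Red)
total_psi = round(psi_train_test['psi'].sum(), 2)
status = 'Green' if total_psi < 10 else ('Amber' if total_psi <= 25 else 'Red')
print(f'PSI = {total_psi} -> {status}')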
# To plot the PSI graph
from matplotlib.ticker import PercentFormatter
psi_table = psi_train_test[['score_band','scr_train_perc','scr_test_perc']]
psi_table = psi_table.melt('score_band', var_name='cols', value_name='% population')
# factorplot was renamed catplot in newer seaborn; kind="point" keeps the old default
g = sns.catplot(x="score_band", y="% population", hue='cols', data=psi_table, kind="point")
g.set(ylim=(0, .50))
g.ax.set_title('Population Stability', size=18)
g.ax.yaxis.set_major_formatter(PercentFormatter(1))
g.savefig('PSI.png')
Graphical representation of Score Distribution

Characteristic Stability Index – This measures the difference between the Development and Validation/Test data at a characteristic/feature/variable level. If the PSI is Amber or Red, it is important to check the distributional difference at the characteristic level to identify the variables driving the change, as sketched below.
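The same psi() helper can be reused per feature by treating each numeric characteristic as the 'score' column. A minimal sketch, assuming scr_train and scr_test still carry the model features; feature_list and the names inside it are hypothetical placeholders:

# CSI sketch: apply psi() feature by feature (feature names are placeholders)
# Note: the +/-10000 boundary caps inside psi() assume score-like ranges;
# widen them (and pass duplicates='drop' to qcut) for large or skewed features.
feature_list = ['LIMIT_BAL', 'AGE']
csi_summary = {}
for feat in feature_list:
    X = scr_train[[feat]].rename(columns={feat: 'score'}).copy()
    Y = scr_test[[feat]].rename(columns={feat: 'score'}).copy()
    csi_summary[feat] = round(psi(X, Y)['psi'].sum(), 2)
print(pd.Series(csi_summary, name='CSI'))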

Once we have looked into the stability of the model, the next step is to evaluate its strength in terms of predictive power. Using PyCaret, one line of code gives us the different evaluation metrics. Let's look at the examples below.

# we had saved our model as gbc_custom (refer to the previous part)
evaluate_model(gbc_custom, use_train_data=True)  # results on the train data
evaluate_model(gbc_custom)  # results on the test data
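evaluate_model is interactive; for static artifacts (e.g., a model document), PyCaret's plot_model can render the same views one at a time and save them as PNGs. A short sketch using PyCaret's standard plot names:

plot_model(gbc_custom, plot='auc', save=True)              # ROC / AUC curve
plot_model(gbc_custom, plot='confusion_matrix', save=True) # confusion matrix
plot_model(gbc_custom, plot='feature', save=True)          # feature importance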

Area Under the Curve (AUC) – The higher the value, the better the model discriminates between event and non-event.
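The same AUC can be reproduced directly with sklearn on the saved scoring files; here 'prob_event' is a hypothetical name for the predicted event-probability column saved in Part 1 (substitute whatever name your file carries):

# Direct AUC check on the test scoring file ('prob_event' is a placeholder column)
auc_test = roc_auc_score(scr_test['default.payment.next.month'], scr_test['prob_event'])
print(f'Test AUC: {auc_test:.3f}')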

AUC for train. At the top there are different tabs with different evaluation metrics.
AUC Test
Precision/Recall/F1/Support across classes of event variable
Confusion matrix across train and test data
Learning curve and lift chart
Feature Importance Plot and hyperparameters of the model

Once we have studied all these evaluation metrics, the next step is to create a Gains Matrix to decide on the score cut-off for customer targeting.

def weighted_qcut(values, weights, q, **kwargs):
    'Return weighted quantile cuts from a given series, values.'
    if is_integer(q):
        quantiles = np.linspace(0, 1, q + 1)
    else:
        quantiles = q
    order = weights.iloc[values.argsort()].cumsum()
    bins = pd.cut(order / order.iloc[-1], quantiles, **kwargs)
    return bins.sort_index()
def gains_matrix(input_gm,target):
    if 'SamplingWeight' not in input_gm.columns:
        input_gm['SamplingWeight'] = 1
    input_gm['mevent']=input_gm[target]
    input_gm['deciles'] = weighted_qcut(input_gm['score'], input_gm['SamplingWeight'], 20, labels=False)
    sqlcode3 = '''
    select deciles, mevent, sum(samplingweight) as count
    from input_gm
    group by deciles, mevent
    '''
    gainsfreq = ps.sqldf(sqlcode3,locals())

    transpose = gainsfreq.pivot_table(index=['deciles'], columns='mevent', values='count', aggfunc='sum', fill_value=0).reset_index()
    transpose.columns = ['deciles','count_0','count_1']

    transpose.sort_values(by=['deciles'], ascending=False, inplace=True)
    transpose['cum_0'] = transpose['count_0'].cumsum()
    transpose['cum_1'] = transpose['count_1'].cumsum()

    transpose['percent_cum_0'] = (transpose['cum_0']/np.sum(transpose.count_0))*100
    transpose['percent_cum_1'] = (transpose['cum_1']/np.sum(transpose.count_1))*100

    transpose['event_rate'] = (transpose['count_1']/(transpose['count_0']+transpose['count_1']))*100
    transpose['cum_event_rate'] = (transpose['cum_1']/(transpose['cum_0']+transpose['cum_1']))*100

    transpose['cum_separation'] = transpose['percent_cum_1']-transpose['percent_cum_0']

    sqlcode4 = '''
    select deciles, min(score) as score
    from input_gm
    group by deciles
    '''
    score = ps.sqldf(sqlcode4,locals())

    result = pd.DataFrame(pd.merge(score, transpose , on='deciles', how='outer'))
    resultn = result.sort_values('deciles', ascending=False)
    resultn['score_band'] = resultn["deciles"].tolist()[::-1]
    resultn['score_band'] = resultn["score_band"]+1
    resultn= resultn.drop(columns=['deciles']) 
    return resultn
train_gain=gains_matrix(scr_train,'default.payment.next.month')
test_gain=gains_matrix(scr_test,'default.payment.next.month')
Gains matrix for train – with 20 score bands
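With the gains matrix in hand, a score cut-off can be read off wherever the cumulative event rate breaches your risk appetite. A hypothetical illustration (the 20% tolerance is an assumption for the sketch, not from the article):

# Lowest score whose band still keeps the cumulative default rate under 20%
eligible = train_gain[train_gain['cum_event_rate'] < 20]
cutoff_score = eligible['score'].min() if not eligible.empty else None
print('Suggested score cut-off:', cutoff_score)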

Rank Ordering – Across score bands, the event rate should ideally be monotonically increasing or decreasing: the higher the score, the higher the non-default rate; the lower the score, the higher the default rate. A quick numeric check of monotonicity is sketched below, followed by the comparison plot.
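As a quick sanity check (a sketch, not from the original article), Spearman's rank correlation between score band and event rate should sit close to +1 or −1 when the ordering holds:

# Spearman rank correlation: |rho| near 1 indicates clean rank ordering
rho, _ = stats.spearmanr(train_gain['score_band'], train_gain['event_rate'])
print(f'Spearman rho (train): {rho:.2f}')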

rank_train = train_gain[['score_band','event_rate']].rename(columns={'event_rate': 'train'})
rank_test = test_gain[['score_band','event_rate']].rename(columns={'event_rate': 'test'})
rank_table = pd.merge(rank_train, rank_test, on=['score_band'], how='outer')
rank_table = rank_table.melt('score_band', var_name='cols', value_name='event_rate')
# catplot with kind="point" replaces the deprecated factorplot
g = sns.catplot(x="score_band", y="event_rate", hue='cols', data=rank_table, kind="point")
g.ax.set_title("Rank Ordering Comparison")
from matplotlib.ticker import FuncFormatter
def to_percent(y, position):
    return str(y) + '%'
formatter = FuncFormatter(to_percent)
g.ax.yaxis.set_major_formatter(formatter)
plt.show()
g.savefig('Rank Ordering.png')
Event rate across score bands in train & test

Rule of Thumb: In terms of model implementation, the following thresholds can be considered (they may vary across industries and use cases):

Model Implementation Criteria – Thumb Rule

This brings us to the end of Part 2. In the final part, we will cover Bias and Model Interpretability.

