In Part 1, I explained how to develop and train a model using the PyCaret framework. In this part, we will focus on evaluating the developed model by comparing its performance on the Train/Test and Out-of-Time Validation datasets. In PyCaret, one line of code gets you all the necessary model-evaluation metrics, which helps in finalizing the model.

Model evaluation can be broadly categorized into Stability and Predictive Power, each with several components under it. The different evaluation metrics are explained below.

Population Stability Index (PSI) – The Population Stability Index measures the difference in score distribution between the Development and Validation/Test datasets. It is always non-negative, and the higher the index, the larger the distributional difference.

PSI = Σ over score bands of (development% − validation%) × ln(development% / validation%) × 100

Rule of Thumb: PSI < 10 is green (hardly any difference between the distributions). Values between 10 and 25 are amber, meaning some shift has been observed and requires investigation. PSI > 25 indicates the distributions differ materially between the Development and Validation datasets. As a worked example, a band holding 20% of the development population but 22% of the validation population contributes (0.20 − 0.22) × ln(0.20/0.22) × 100 ≈ 0.19 to the total.
In the previous part, I saved the scr_train and scr_test files containing the customer list, the score variable, the predicted event probability, the actual event flag, and the features that came out significant in the model. Now, to compute the PSI, we will use the following code:
from sklearn.metrics import roc_auc_score,balanced_accuracy_score, f1_score, accuracy_score
from itertools import combinations, chain
from pandas._libs.lib import is_integer
from pycaret.classification import *
import matplotlib.patches as patches
import matplotlib.ticker as mtick
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
import sweetviz as sv
import pandasql as ps
import pandas as pd
import numpy as np
# import shap
import math
def psi(X, Y):
    ## Split the development scores into pentiles (5 equal-population bands)
    X['pentile'] = pd.qcut(X['score'], 5, labels=False) + 1
    ## Finding the boundary conditions for each pentile
    X_tile = pd.DataFrame(X.groupby("pentile").agg({"score": ["min", "max"]})).reset_index()
    X_tile.columns = ['pentile', 'min', 'max']
    ## Fixing lowest and highest value for min and max respectively,
    ## so out-of-range validation scores still fall into a band
    X_tile.loc[0, 'min'] = -10000
    X_tile.loc[4, 'max'] = 10000
    ## Joining based on pentile conditions
    sqlcode2 = '''
    select c.pentile, c.cnt as X_count, c.X_tot, d.cnt as Y_cnt, d.Y_tot
    from
        (select a.*, b.*
         from
            (select b.pentile, count(*) as cnt
             from X a
             left join X_tile b
             on a.score >= b.min and a.score <= b.max
             group by b.pentile) a
         cross join
            (select count(*) as X_tot from X) b) c
    left join
        (select a.*, b.*
         from
            (select b.pentile, count(*) as cnt
             from Y a
             left join X_tile b
             on a.score >= b.min and a.score <= b.max
             group by b.pentile) a
         cross join
            (select count(*) as Y_tot from Y) b) d
    on c.pentile = d.pentile
    '''
    psi_stg0 = ps.sqldf(sqlcode2, locals())
    psi_stg0['X_perc'] = psi_stg0['X_count']/psi_stg0['X_tot']
    psi_stg0['Y_perc'] = psi_stg0['Y_cnt']/psi_stg0['Y_tot']
    psi_stg1 = psi_stg0.drop(['X_count', 'X_tot', 'Y_cnt', 'Y_tot'], axis=1)
    ## Final PSI calculation per band
    psi_stg1['psi'] = (psi_stg1['X_perc'] - psi_stg1['Y_perc'])*np.log(psi_stg1['X_perc']/psi_stg1['Y_perc'])*100
    psi_stg2 = pd.merge(psi_stg1, X_tile, how='left', on=['pentile'])
    psi_stg2.loc[0, 'min'] = 'low'
    psi_stg2.loc[4, 'max'] = 'high'
    psi_stg2['score_band'] = psi_stg2['min'].astype(str) + "-" + psi_stg2['max'].astype(str)
    psi = pd.DataFrame(psi_stg2[['score_band', 'X_perc', 'Y_perc', 'psi']])
    return psi
psi_train_test = psi(scr_train, scr_test)
psi_train_test = psi_train_test.rename(columns={'X_perc': 'scr_train_perc', 'Y_perc': 'scr_test_perc'})
psi_train_test['scr_train_%']=round(psi_train_test['scr_train_perc']*100,2)
psi_train_test['scr_test_%']=round(psi_train_test['scr_test_perc']*100,2)
psi_train_test['psi']=round(psi_train_test['psi'],2)
psi_train_test1=psi_train_test[['score_band','scr_train_%','scr_test_%','psi']]
print(psi_train_test1)
print('PSI - scr_train vs scr_test: ' + str(round(sum(psi_train_test['psi']),2)))

# To plot the PSI graph-
from matplotlib.ticker import PercentFormatter
psi_table=psi_train_test[['score_band','scr_train_perc','scr_test_perc']]
psi_table = psi_table.melt('score_band', var_name='cols', value_name='% population')
g = sns.factorplot(x="score_band", y="% population", hue='cols', data=psi_table)
g.set(ylim=(0, .50))
g.ax.set_title('Population Stability', size = 18 )
g.ax.yaxis.set_major_formatter(PercentFormatter(1))
g.savefig('PSI.png')

Characteristic Stability Index (CSI) – This measures the difference between the Development and Validation/Test data at a characteristic (feature/variable) level. If the PSI is amber or red, it is important to check the distributional difference at the characteristic level to identify which variables are driving the shift.
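As a minimal sketch, the psi function above can be reused feature by feature to approximate the CSI, assuming the scored files also carry the model features. The feature names below are hypothetical placeholders; substitute the significant variables from your own model.

## CSI sketch: rename each feature to 'score', the column psi() expects.
## 'feature_1' and 'feature_2' are placeholders for your own significant features.
## Caveats: psi() caps its outer bands at +/-10000, so rescale features
## outside that range first, and pd.qcut can fail on highly discrete
## features with duplicate bin edges.
csi = {}
for feat in ['feature_1', 'feature_2']:
    X = scr_train[[feat]].rename(columns={feat: 'score'})
    Y = scr_test[[feat]].rename(columns={feat: 'score'})
    csi[feat] = round(psi(X, Y)['psi'].sum(), 2)
print(csi)  # investigate any feature whose CSI lands in amber or red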
Once we have checked the stability of the model, the next step is to evaluate its strength in terms of predictive power. Using PyCaret, one line of code gives us the different evaluation metrics. Let's look at the examples below.
# we saved our model as gbc_custom (refer to the previous part)
evaluate_model(gbc_custom, use_train_data=True)  # results on the train data
evaluate_model(gbc_custom)  # results on the test data
Area Under the Curve (AUC) – The higher the value, the better the model discriminates between events and non-events.
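As a quick cross-check outside PyCaret, the AUC can also be computed directly with roc_auc_score (already imported above), assuming the scored files carry the score and the actual event flag under the names used in Part 1:

## Sanity-check AUC on the scored files. This assumes 'score' rises with
## event probability; if your score is scaled the other way (higher score =
## safer customer), pass -scr_train['score'] instead.
print('Train AUC:', round(roc_auc_score(scr_train['default.payment.next.month'], scr_train['score']), 4))
print('Test AUC:', round(roc_auc_score(scr_test['default.payment.next.month'], scr_test['score']), 4))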
Once we have studied all these evaluation metrics, the next step is to create a Gains Matrix to decide on the score cut-off for customer targeting.
def weighted_qcut(values, weights, q, **kwargs):
    'Return weighted quantile cuts from a given series, values.'
    if is_integer(q):
        quantiles = np.linspace(0, 1, q + 1)
    else:
        quantiles = q
    order = weights.iloc[values.argsort()].cumsum()
    bins = pd.cut(order / order.iloc[-1], quantiles, **kwargs)
    return bins.sort_index()
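To see what weighted_qcut does, here is a toy check on made-up data: with equal weights it behaves like pd.qcut, while a heavy weight shifts the boundaries so the weighted population, not the row count, is split evenly.

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])  # toy scores
w = pd.Series([1, 1, 1, 1, 1, 1, 1, 5])  # last observation carries 5x weight
print(weighted_qcut(s, w, 4, labels=False))
## the heavy observation fills the top quartile on its own: 0 0 0 1 1 1 2 3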
def gains_matrix(input_gm, target):
    if 'SamplingWeight' not in input_gm.columns:
        input_gm['SamplingWeight'] = 1
    input_gm['mevent'] = input_gm[target]
    ## 20 equal-weight score bands (5% each)
    input_gm['deciles'] = weighted_qcut(input_gm['score'], input_gm['SamplingWeight'], 20, labels=False)
    sqlcode3 = '''
    select deciles, mevent, sum(samplingweight) as count
    from input_gm
    group by deciles, mevent
    '''
    gainsfreq = ps.sqldf(sqlcode3, locals())
    transpose = pd.DataFrame(gainsfreq.pivot_table(index=['deciles'], columns='mevent', aggfunc='sum', fill_value=0).reset_index())
    transpose.columns = ['deciles', 'count_0', 'count_1']
    transpose.sort_values(by=['deciles'], ascending=False, inplace=True)
    ## Cumulate from the highest band downwards
    transpose['cum_0'] = transpose['count_0'].cumsum()
    transpose['cum_1'] = transpose['count_1'].cumsum()
    transpose['percent_cum_0'] = (transpose['cum_0']/np.sum(transpose.count_0))*100
    transpose['percent_cum_1'] = (transpose['cum_1']/np.sum(transpose.count_1))*100
    transpose['event_rate'] = (transpose['count_1']/(transpose['count_0']+transpose['count_1']))*100
    transpose['cum_event_rate'] = (transpose['cum_1']/(transpose['cum_0']+transpose['cum_1']))*100
    transpose['cum_separation'] = transpose['percent_cum_1']-transpose['percent_cum_0']
    ## Minimum score per band, i.e. the band's lower cut-off
    sqlcode4 = '''
    select deciles, min(score) as score
    from input_gm
    group by deciles
    '''
    score = ps.sqldf(sqlcode4, locals())
    result = pd.DataFrame(pd.merge(score, transpose, on='deciles', how='outer'))
    resultn = result.sort_values('deciles', ascending=False)
    ## Relabel the bands 1-20, with band 1 being the highest-score band
    resultn['score_band'] = resultn["deciles"].tolist()[::-1]
    resultn['score_band'] = resultn["score_band"] + 1
    resultn = resultn.drop(columns=['deciles'])
    return resultn
train_gain=gains_matrix(scr_train,'default.payment.next.month')
test_gain=gains_matrix(scr_test,'default.payment.next.month')
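As a hedged illustration of how the gains matrix feeds the cut-off decision, the sketch below walks down the score bands until the cumulative event capture crosses a threshold; the 70% figure is an arbitrary placeholder, not a recommendation:

## Pick the first (shallowest) band where cumulative event capture
## crosses 70% - substitute the threshold your business case requires.
hit = test_gain[test_gain['percent_cum_1'] >= 70].iloc[0]
print('Score cut-off:', hit['score'],
      '| cumulative event capture:', round(hit['percent_cum_1'], 1), '%')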

Rank Ordering – Across score bands, the event rate should ideally be monotonically increasing or decreasing: the higher the score, the higher the non-default rate, and the lower the score, the higher the default rate.
rank_train=train_gain[['score_band','event_rate']]
rank_test=test_gain[['score_band','event_rate']]
rank_train=rank_train.rename(columns={'event_rate': 'train'})
rank_test=rank_test.rename(columns={'event_rate': 'test'})
rank_table = pd.merge(rank_train, rank_test, on=['score_band'], how='outer')
rank_table = rank_table.melt('score_band', var_name='cols', value_name='event_rate')
g = sns.catplot(x="score_band", y="event_rate", hue='cols', data=rank_table, kind='point')
g.ax.set_title("Rank Ordering Comparison")
from matplotlib.ticker import FuncFormatter
def to_percent(y, position):
    return str(y) + '%'
g.ax.yaxis.set_major_formatter(FuncFormatter(to_percent))
plt.show()
g.savefig('Rank Ordering.png')

Rule of Thumb: In terms of model implementation, the following thresholds can be considered (though they may vary across industries and use cases):

This brings us to the end of Part 2. In the final part, we will cover Bias and Model Interpretability.