Notes from Industry

As Data Scientists, developing machine learning models is part of our daily job and the reason we are employed in the first place. However, the machine learning model we develop is not just for show; it is an actual tool for solving a business problem. This is why we need to evaluate our model: to measure how it impacts the business.
Many data scientists measure models with technical metrics such as Accuracy, Precision, F1 Score, ROC-AUC, and many more. These metrics are necessary, but they do not always reflect how the model would perform business-wise. Business people still need to know how much difference our model makes, in business terms, compared to random targeting. This is why we use Lift and Gain analysis: to measure how much better we do with the prediction model than without it.
Without further ado, let’s get into it.
Lift and Gain Analysis
Lift and Gain analysis evaluates a model’s predictions in terms of their benefit to the business. It is often used in marketing target analysis, but it is not restricted to that field.
In a typical Lift and Gain analysis, the result is presented as a chart like the one below.

Gain and lift charts are visual aids for evaluating the performance of classification models. Unlike the confusion matrix that evaluates the overall population, the Gain and Lift chart evaluates model performance in a portion of the population. This means we evaluate the model in terms of the benefit we could get using the model in a portion of the population.
The benefit of Gain and Lift analysis comes from a common business pattern: often, around 80% of revenue comes from 20% of the customers. This idea underlies the decile analysis used in the Gain and Lift chart calculation. The decile analysis is presented in the chart below.

How is decile analysis applicable to Gain and Lift analysis? Let’s take a few steps back and explain how Gain and Lift are calculated, from the beginning.
As I mentioned previously, the Gain and Lift chart is used to evaluate a classification model. For the sake of example, let’s create a prediction model. In this article, I will use the churn data from Kaggle.
import pandas as pd
churn = pd.read_csv('churn.csv')

This dataset has 21 columns, and the target is customer churn. This means we will develop a classification model to predict customer churn. For simplicity, I will clean the data for modelling purposes.
#Drop Customer ID
churn = churn.drop('customerID', axis = 1)
#Change binary Yes/No data to numerical (any value other than 'Yes' becomes 0)
for i in ['Partner', 'Dependents', 'PhoneService', 'OnlineSecurity',
          'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'Churn']:
    churn[i] = churn[i].apply(lambda x: 1 if x == 'Yes' else 0)
#OHE categorical data
churn = pd.get_dummies(churn, columns = ['gender', 'MultipleLines', 'InternetService', 'Contract', 'PaymentMethod'], drop_first = True)
#Change object data into numerical
churn['TotalCharges'] = churn['TotalCharges'].apply(lambda x: 0 if x == ' ' else float(x))
After cleaning the data, we develop the prediction model. For this article, I will use the Logistic Regression model.
#Import the model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(churn.drop('Churn', axis =1), churn['Churn'], test_size = 0.3,stratify = churn['Churn'], random_state = 101)
model = LogisticRegression()
model.fit(X_train, y_train)
With our model fitted, we can start the Gain and Lift analysis to evaluate it. For comparison, we first evaluate the model using the usual metrics.
from sklearn.metrics import classification_report
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))

As we can see from the image above, our model’s ability to predict churned customers (class 1) is lower than for class 0. Would our model still be beneficial if we applied it in the business? Let’s find out using Gain and Lift analysis.
The first step in the Gain and Lift analysis is to get the model’s predicted probability of class 1 on the test data and sort it in descending order.
#Getting the prediction probability of class 1 and order it by descending order
X_test['Prob'] = model.predict_proba(X_test)[:,1]
X_test = X_test.sort_values(by = 'Prob', ascending = False)
X_test['Churn'] = y_test

Once we have obtained the probabilities and sorted them in descending order, we divide the data into deciles. This is similar to the decile analysis shown in the image above; we divide the data into 10 sets and label them, so that decile 1 holds the highest predicted probabilities and decile 10 the lowest.
#Divide the data into decile
X_test['Decile'] = pd.qcut(X_test['Prob'], 10, labels=[i for i in range (10, 0, -1)])
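If the reversed labels look confusing, a small sketch may help. The toy probabilities below are made up for illustration (they are not from the churn data). The point is only that `pd.qcut` assigns its labels from the lowest bin to the highest, so passing the labels 10 down to 1 puts the highest probabilities in decile 1.

```python
import pandas as pd

# Hypothetical scores for illustration: 20 evenly spaced probabilities
demo = pd.DataFrame({"Prob": [i / 20 for i in range(1, 21)]})

# Same call as in the article: labels run 10, 9, ..., 1 from lowest to highest bin
demo["Decile"] = pd.qcut(demo["Prob"], 10, labels=list(range(10, 0, -1)))

# The two highest probabilities land in decile 1, the two lowest in decile 10
print(demo.sort_values("Prob", ascending=False).head(2))
print(demo.sort_values("Prob", ascending=False).tail(2))
```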

After dividing the data into deciles, we need to count the actual churn (actual class 1, not predicted) in each decile. I call this count the Number of Responses.
#Calculate the actual churn in each decile
res = pd.crosstab(X_test['Decile'], X_test['Churn'])[1].reset_index().rename(columns = {1: 'Number of Responses'})
#Count the cases in each decile (naming the columns explicitly so it works across pandas versions)
lg = X_test['Decile'].value_counts(sort = False).rename_axis('Decile').reset_index(name = 'Number of Cases')
lg = pd.merge(lg, res, on = 'Decile').sort_values(by = 'Decile', ascending = False).reset_index(drop = True)

In the image above, we obtain the Number of Cases (the number of observations in each decile) and the Number of Responses (the number of actual positives in each decile). With these numbers, we are able to calculate the Gain.
For reference, Gain is the cumulative Number of Responses (actual positives) up to each decile divided by the total number of positive observations in the data. Let’s calculate it on our data.
import numpy as np
#Calculate the cumulative number of responses
lg['Cumulative Responses'] = lg['Number of Responses'].cumsum()
#Calculate the percentage of positives in each decile relative to all positives
lg['% of Events'] = np.round(((lg['Number of Responses']/lg['Number of Responses'].sum())*100),2)
#Calculate the Gain in each decile
lg['Gain'] = lg['% of Events'].cumsum()

We can see from the Gain image above that the Gain increases with each decile, but the increase contributed by each decile tends to shrink as we move to the higher deciles. What is the interpretation of the Gain? Gain is the percentage of targets (actual positives) covered at a given decile level. For example, in decile 2, we have a Gain of 50.44. It means 50.44% of the targets are covered in the top 20% of the data as ranked by the model. In the churn model, we can say that we can identify and target about 50% of the customers who are likely to churn by targeting just 20% of all customers. Business-wise, it means that with fewer resources, we could potentially prevent about 50% of churn events.
Next, we need to calculate the Lift. The Lift measures how much better we can expect to do with the predictive model than without it.
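To make the Gain arithmetic concrete, here is a minimal sketch in plain Python. The function name `cumulative_gain` and the counts are hypothetical, chosen only for illustration; they are not the results from the churn model. Unlike the pandas code above, it rounds the cumulative percentage instead of summing rounded per-decile percentages, so the last digit can differ slightly.

```python
def cumulative_gain(responses_per_decile):
    """Return the cumulative gain (%) for each decile.

    `responses_per_decile` lists the actual positives counted in each decile,
    ordered from decile 1 (highest predicted probabilities) downward.
    """
    total = sum(responses_per_decile)
    gains, running = [], 0
    for count in responses_per_decile:
        running += count  # cumulative number of responses so far
        gains.append(round(100 * running / total, 2))
    return gains


# Made-up counts for illustration only (100 positives in total)
toy_responses = [40, 25, 15, 8, 5, 3, 2, 1, 1, 0]
print(cumulative_gain(toy_responses))
# [40.0, 65.0, 80.0, 88.0, 93.0, 96.0, 98.0, 99.0, 100.0, 100.0]
```

In this toy case, the top two deciles already cover 65% of all positives.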
lg['Decile'] = lg['Decile'].astype('int')
lg['lift'] = np.round((lg['Gain']/(lg['Decile']*10)),2)

The Lift can be interpreted as the ratio of the Gain percentage to the random-selection percentage at a given decile level. In layman’s terms, the Lift of 2.52 in decile 2 means that when selecting 20% of the data based on the model, we find 2.52 times more targets (actual positives) than when selecting 20% of the data at random without a model.
Let’s visualize the Gain and Lift charts compared to random picking.
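The decile 2 figure can be checked by hand. The sketch below assumes a hypothetical helper named `lift_at_decile`; it simply divides the cumulative Gain by the percentage of the population targeted, which is the share a random pick would be expected to cover.

```python
def lift_at_decile(gain_pct, decile):
    """Lift = cumulative gain (%) / percentage of the population targeted."""
    population_pct = decile * 10  # random picking is expected to cover this share
    return round(gain_pct / population_pct, 2)


# Gain of 50.44 at decile 2, as quoted in the text
print(lift_at_decile(50.44, 2))    # 2.52
# At decile 10 everything is selected, so the model has no edge over random
print(lift_at_decile(100.0, 10))   # 1.0
```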
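The plotting code is not shown here, so the following is only a minimal sketch of how such a chart could be drawn, assuming matplotlib is installed. The function name `plot_gain_lift` is hypothetical. The random baseline is the diagonal for the Gain chart (targeting x% of customers at random covers about x% of the positives) and a flat line at 1.0 for the Lift chart.

```python
import matplotlib.pyplot as plt


def plot_gain_lift(deciles, gain, lift):
    """Plot cumulative gain and lift per decile against a random baseline."""
    pct_population = [d * 10 for d in deciles]
    fig, (ax_gain, ax_lift) = plt.subplots(1, 2, figsize=(12, 4))

    # Gain chart: the model curve starts at the origin, like the random diagonal
    ax_gain.plot([0] + pct_population, [0] + list(gain), marker="o", label="Model")
    ax_gain.plot([0, 100], [0, 100], linestyle="--", label="Random")
    ax_gain.set_xlabel("% of population targeted")
    ax_gain.set_ylabel("% of positives covered (Gain)")
    ax_gain.set_title("Gain chart")
    ax_gain.legend()

    # Lift chart: random picking has a lift of 1 at every decile
    ax_lift.plot(pct_population, list(lift), marker="o", label="Model")
    ax_lift.axhline(1.0, linestyle="--", color="gray", label="Random")
    ax_lift.set_xlabel("% of population targeted")
    ax_lift.set_ylabel("Lift")
    ax_lift.set_title("Lift chart")
    ax_lift.legend()

    return fig, (ax_gain, ax_lift)


# Usage with the table built in this article (assumes `lg` exists):
# plot_gain_lift(lg['Decile'], lg['Gain'], lg['lift'])
# plt.show()
```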

In the image above, we can evaluate the model by comparing its Gain and Lift with those of random picking: the greater the area between the model curve and the random line, the better the model. We can see that our model predicts well, because the chart shows a greater Gain and Lift than random picking. Business-wise, churn is found more efficiently using the model, which means fewer resources to spend.
Conclusion
Model evaluation is an important part of a data scientist’s work; however, we need to evaluate the model based on its usefulness to the business. It is not enough to rely on technical metrics; we also need to know how our model impacts the business. In this article, I have explained how Lift and Gain analysis helps evaluate a model in business terms by using:
- Gain number (percentage of targets covered up to a given decile)
- Lift number (ratio of the gain percentage to the random percentage at a given decile level)