
Cumulative gains and lift curves are two closely related visual aids for measuring the effectiveness of a predictive classification model. They have a couple of big benefits over other ways of assessing model performance when you need to explain a model to business stakeholders and show how using the model can impact business decisions and strategies.
To see how you can benefit, let’s look at what is usually one of the first model assessment tools people encounter. A popular and effective way to evaluate a binary classification model is the Receiver Operating Characteristic (ROC) curve. The ROC curve plots a model’s sensitivity, also referred to as the true positive rate, on the vertical axis against 1 minus specificity, or the false positive rate, on the x-axis. Performance is better for curves that are closer to the top-left corner. The comparison curve is a diagonal line representing how you would fare without a model, ignoring all predictor information and guessing class membership at random. The hope is that using predictor information will produce a model that outperforms this benchmark. The metric that usually gets pulled out to summarize an ROC curve is the area under the curve (AUC), which provides an aggregate measure of model performance across all possible classification thresholds.
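For reference, here’s a tiny, self-contained sketch of how those quantities are typically computed with scikit-learn; the labels and scores below are made-up toy values just to show the calls:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy example: hypothetical true labels and model scores for eight validation records.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10])

# fpr is 1 - specificity (false positive rate), tpr is sensitivity (true positive rate).
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Plotting tpr against fpr gives the ROC curve; roc_auc_score summarizes it as the AUC.
print("AUC:", roc_auc_score(y_true, y_score))
```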

Understand Your Goal
If your only goal is to predict class membership for each new record, and you want to compare how well different models do that, ROC and AUC are going to be a big help. But there are many times this isn’t the case. Business problems often involve examining a new set of records to detect which ones are most likely to belong to a particular class you’re interested in, so wouldn’t it be great to assess how effectively a classification model can rank new records? A lot of business problems boil down to figuring out how well a model helps us select a relatively small number of records that hold a relatively large proportion of what’s important in the context of the problem. Maybe that’s figuring out which customers are most likely to churn, respond to a marketing offer, commit fraud, or light their rental car on fire. How much better will your model help stakeholders locate these cases than the magic eight ball that sits on the coffee table in the second-floor break lounge? That’s the shift in mindset: from how accurately the next new observation is classified, to how well the model can rank observations.
If you walk into a meeting with just an ROC AUC number to explain your work, you’re probably going to have to make everyone sit through a half hour of slides explaining sensitivity, specificity, what AUC represents, how to interpret it, and all the caveats ROC curves bring. And at the end of that you’re still handing them a single number that isn’t in any way intuitive or actionable. The AUC isn’t going to help anyone who’s trying to figure out whether to use your model, or how and where to apply it. This is where cumulative gains and lift come in.
Let’s imagine you work at BIG Co., and the cruel overlords who oversee marketing have stifled their campaign outreach by setting a budget that will only allow them to target 5,000 customers out of the 25,000 they have fresh leads on. They ask you to build a model that will predict which customers are most likely to respond to a marketing campaign offer. They don’t want to waste precious resources targeting leads that have little chance of working out. After you’re finished, they want you to attend a meeting where you will explain why your model is better than just selecting random leads with the battle-tested magic eight ball from accounting, and it would be really great if you could also estimate how many responsive leads will be reached out of the 5,000 contacts that the budget constraints allow.
You decide to build a classification model that outputs the propensity of a lead to respond to a marketing promotion using, say, logistic regression. Cumulative gains and lift aren’t tied to any one model; they can be used with any model that can output the predicted probability that a record will belong to a class of interest. The caveat is that they don’t work well with multiclass classifiers unless you define one class as the most important class to identify and lump everything else into a second, combined class. Binary classification is what you’re looking for, and in a case like this, where you’re only interested in responders and non-responders, you’re good to go. So you train the model parameters on a subset of records, test the performance on a validation set, examine performance measures like the ROC curve and the AUC value to help with model optimization, and check whether it’s over- or underfitting until you get a model that you’re satisfied with.
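As an aside on that multiclass caveat, collapsing a richer outcome into a binary class of interest is usually a one-liner. The column and label names below are purely hypothetical:

```python
import pandas as pd

# Hypothetical lead outcomes with more than two levels.
leads = pd.DataFrame({"outcome": ["responded", "ignored", "unsubscribed", "responded"]})

# Define the class of interest (responders) and lump everything else together as 0.
leads["responder"] = (leads["outcome"] == "responded").astype(int)
print(leads)
```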
Nuts and Bolts
The cumulative gains and lift charts are both constructed from the same inputs. You’ll need the predicted probability of belonging to the target class for each record, as output by the model, along with the actual class that each record belongs to in the validation dataset. If you were going to plot this without using a function from some package, you would order all of the observations by the model output, in descending order. Then for each row you would add the cumulative number of actual class members of interest up to, and including, the current one. This resulting cumulative column is what gets plotted against the number of records. It would look something like this.
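As a rough sketch of that bookkeeping, with made-up probabilities and outcomes standing in for real model output:

```python
import pandas as pd

# Toy validation records: the model's predicted probability of the class of interest
# and the actual class (1 = responder, 0 = non-responder).
df = pd.DataFrame({
    "predicted_prob": [0.92, 0.85, 0.70, 0.64, 0.51, 0.33, 0.20, 0.08],
    "actual":         [1,    1,    0,    1,    0,    1,    0,    0],
})

# Sort by the model output, most likely first, then accumulate the actual responders.
df = df.sort_values("predicted_prob", ascending=False).reset_index(drop=True)
df["cumulative_responders"] = df["actual"].cumsum()
df["records_considered"] = range(1, len(df) + 1)

# Plotting cumulative_responders against records_considered gives the gains curve.
print(df)
```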

The left-hand side of the x-axis for both the cumulative gains and lift charts starts with the observations that have the highest probability of belonging to the class of interest according to the model, and those probabilities decrease moving rightward along the axis. The x-axis tells us what percentage of the observations are being considered. The y-axis indicates the cumulative number of class-of-interest observations captured so far, as a percentage of all class-of-interest observations in the data. The question you can now ask and answer with this plot is: when the model is applied to the data and the most likely X percent of the records are selected, what percentage of the records you’re actually interested in can you expect to find?
How Do We Know When We’re Awesome?
To get an idea of how good the model is, it needs to be compared to what we already have, which is nothing. A good baseline for comparison is what would be produced by randomly assigning propensities to each observation. With that method, the cumulative column would increase, on average, by the total count of the class of interest divided by the total number of observations with each additional row. For example, if your data set has 26 observations and 13 of them actually belong to the class you’re interested in, then the probability of randomly selecting a single record from that class is 13/26 = 0.5. If 10 records were randomly selected, the expectation is that 10 × 0.5 = 5 of them would belong to that class. The plot of this baseline cumulative total is a diagonal line that goes from (0, 0) to (total observations, total count of the class of interest); in this example, from (0, 0) to (26, 13).
The scikitplot module makes it relatively easy to construct a cumulative gains curve without having to manually sort, calculate cumulative totals, and then work up a plot. You can use the function named plot_cumulative_gain, which has two arguments: the first is an array with the true values of the target, and the second holds the model’s predictions for the observations, which should include output for both classes. The following code constructs a quick logistic regression model, without any preprocessing, validation, or optimization, using dummy data to try to identify the leads most likely to respond to the marketing campaign. I’m not going to do any validation or tuning, just enough so we can start to interpret the resulting plots. The output from predict_log_proba returns estimates for both classes paired in a NumPy array, and that array can be passed into the plotting function without any further manipulation.
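Here’s a minimal sketch of what that might look like. The data comes from scikit-learn’s make_classification as a stand-in for the real lead data, and the variable names and parameters (25,000 records, roughly 5% responders, random_state=42) are illustrative choices rather than anything prescribed:

```python
import matplotlib.pyplot as plt
import scikitplot as skplt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Dummy stand-in for the lead data: 25,000 records, roughly 5% responders (class 1).
X, y = make_classification(n_samples=25_000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)

# Quick-and-dirty model: no preprocessing, validation, or tuning.
model = LogisticRegression(max_iter=1000).fit(X, y)

# Log-probability estimates for both classes, one column per class. Only the ranking
# matters for gains and lift, so predict_proba would produce the same curves.
y_probas = model.predict_log_proba(X)

skplt.metrics.plot_cumulative_gain(y, y_probas)
plt.show()
```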

Meaningful statements that give intuitive insight into why and how this model is useful are now available to share with business stakeholders. The business constraint is that only 5,000 new leads can be targeted with the new campaign, which is 20% of the 25,000 available. We can see from the cumulative gains curve for the class of interest, which is labeled Class 1, that selecting the top 20% of the leads ranked as most likely to respond to a marketing offer will capture just about 45% of the actual responders. You could use something like df["target"].value_counts() to see that there are 1187 actual responders in the training data, and can therefore expect 1187 × 0.45 = 534.15 responses after marketing to the 5,000 leads the model ranked as most likely to respond. The expected baseline number without using the model would only be 20% of the 1187 actual responders, or 1187 × 0.2 = 237.4 responses for marketing to those 5,000 leads. Using the model will potentially produce 534.15 / 237.4 = 2.25 times better results than random selection when it is applied to new data. This works in reverse as well: if the business goal was to reach, say, 60% of the likely responders, you can locate that amount on the y-axis and determine that roughly 30% of the highest-ranked leads would need to be targeted to achieve it.
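If you’d rather pull those numbers out programmatically than eyeball them on the plot, a small follow-up to the sketch above (reusing its y and y_probas arrays, so the exact figures will differ from the 1187-responder example) might look like this:

```python
import numpy as np

# Rank records by the model's score for class 1, most likely first,
# and take the top 20% -- the 5,000-lead budget in the example.
order = np.argsort(y_probas[:, 1])[::-1]
top_20 = order[: int(0.2 * len(order))]

total_responders = y.sum()
captured = y[top_20].sum()

print("Gain at 20%:", captured / total_responders)        # fraction of all responders found
print("Responders reached with the model:", captured)
print("Responders reached at random:", 0.2 * total_responders)
print("Lift at 20%:", (captured / total_responders) / 0.2)
```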
The lift chart provides an easy way to visualize how many times better applying the model is than random selection for any percentage of the ranked records. It just automates, across all possible values of the x-axis, the calculation we made earlier to conclude that using the model produced 2.25 times better results when selecting the top 20%. The scikitplot package also provides an implementation of a lift curve, and the same true-label and predicted-probability arrays passed to plot_cumulative_gain can be recycled as inputs to the plot_lift_curve function.
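Continuing with the same arrays from the earlier sketch, the call might look like this:

```python
import matplotlib.pyplot as plt
import scikitplot as skplt

# y and y_probas are the same arrays passed to plot_cumulative_gain above.
skplt.metrics.plot_lift_curve(y, y_probas)
plt.show()
```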

Let’s check our earlier work. The value on the y-axis of the lift curve for Class 1 at 20% appears to be roughly 2.25, which agrees with the previous calculation of the model’s effectiveness over no model at all. The baseline for comparison is the horizontal dashed line at 1.0. The curve gradually decreases because there are fewer and fewer records of the class of interest left to add, so there is less and less opportunity for the model to provide an advantage.
A classifier that works really well will produce a high "lift" for a relatively small percentage of the ranked data. Now you have an elegant metric to share when you present your model, one that doesn’t require anyone to go through three different Coursera courses before they have an intuitive understanding of it. You can offer input like this: this model should perform over twice as well as randomly selecting leads, and we estimate that about 535 of the 5,000 leads we market to will respond to the offer. That’s an actionable statement. How much lift is enough to take action, modify decision making, or deploy more or fewer resources toward a business goal doesn’t have a single right answer. But presenting a model as an alternative that will perform some multiple better than random decision making will undoubtedly increase the chances that it’s used and appreciated. Lift charts can be even more powerful and informative when the costs and benefits of classification successes and errors can be estimated. I’ll cover this in my next post!