Performance Curve: More Intuitive than ROC/PRC and Less Assumptive than Threshold Metrics

An evaluation method for binary classifiers that combines the best of both worlds

Tam D Tran-The
Towards Data Science

Two families of evaluation metrics are most commonly used in the context of binary classification:

  • Threshold metrics (e.g. accuracy, precision, recall, F-measure)
  • Ranking metrics (e.g. receiver operating characteristic (ROC) curve, precision-recall (PR) curve)

Problems with threshold metrics

Although threshold metrics are intuitive and easy to explain to a non-technical audience, they require a specific threshold cutoff to separate the positive class from the negative one. To determine this cutoff, we need to know:

(1) Class distribution of the outcome (a.k.a prevalence):

A lot of the time, 0.5 is used as the default cutoff value, but this is not justifiable when the prevalence of the outcome is not exactly 50%.

(2) Relative cost of the two kinds of misclassification (a.k.a. type I and type II errors)

  • Type I error, or false positive, occurs when the model predicts a positive outcome when it’s actually negative. E.g.: a patient is diagnosed with cancer but they do not have the disease.
  • Type II error, or false negative, occurs when the model predicts a negative outcome when it’s actually positive. E.g.: a bank transaction is considered normal but it is actually fraudulent.

There is always a trade-off between these two types of errors. When the harm of a type I error is greater than that of a type II error (e.g. from a hiring employer's perspective, the cost of a bad hire is greater than that of passing over the right candidate), we'd want to raise the cutoff threshold to be extremely precise in predicting the positive outcome. On the other hand, when the harm of a type II error is greater than that of a type I error (e.g. the cost of missing a fraudulent transaction is greater than the extra hours spent reviewing a false alert), we'd want to lower the cutoff threshold to cast a wider net for positive cases.

To choose the correct threshold, we need to be able to quantify these two types of risks.

Although we might have a general idea of which error type is more expensive in a certain use case, these harms are rarely quantifiable in practice. By relying on a single cutoff threshold, we incorrectly imply that these two risks can be, and have been, quantified.

Problems with ranking metrics

Unlike threshold metrics, existing ranking metrics such as ROC and PR curves make no assumption about the relative cost of misclassification errors: they evaluate classifiers over a range of thresholds. However, they are less intuitive to non-technical audiences and do not provide actionable insights. What is the expected utility of a predictive model achieving an AUROC of 90%? Unclear.

Despite their respective pros and cons, threshold and ranking metrics share an attractive property: both provide a straightforward means to compare multiple models to each other, as well as to random and perfect classifiers.

Image by Author. Pros and Cons of Threshold and Ranking Metrics.

An improved method for comparing the performance of binary classifiers that represents the best of both worlds

Is there an evaluation method that represents the best of both worlds, i.e. one that does not assume knowledge of misclassification costs and still provides actionable insights? The answer is YES!

The main idea is to plot the value of any threshold metric of interest against the percentage of total cases predicted as positive.

From here onwards in this article, we will refer to this kind of chart as a performance curve.

How to read/plot a performance curve?

For each performance curve, the X-axis displays, at each point, the percentage of total cases predicted as positive when all cases are sorted in descending order of the scores predicted by our model. This percentage is directly related to the cutoff threshold that separates the positive and negative classes: a lower threshold means a larger percentage predicted as positive, whereas a higher threshold means a smaller one. Although coming up with an exact, appropriate threshold value is a challenging and unintuitive task, the best percentage of cases predicted as positive (i.e. a point on the X-axis) at which to compare models can be determined by the operational capacity of the model users. This idea is further explained in the Business Use Case section.

The Y-axis, on the other hand, represents any threshold metric of your interest: precision, recall, accuracy, F1-score, or a custom metric that depends only on the ordering of cases and not on the actual predicted values (thus requiring no threshold cutoff). In this article, we use precision and recall to demonstrate the concept, because they are the two most commonly used threshold metrics.
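
To make this concrete, here is a minimal sketch of how such a curve could be computed with NumPy. The names y_true, y_score, and performance_curve are placeholders of my own choosing, not part of any library API; the logic simply sorts cases by predicted score and accumulates true positives.

```python
# A minimal sketch, assuming y_true holds 0/1 ground-truth labels and
# y_score holds model-predicted scores (both are hypothetical inputs).
import numpy as np

def performance_curve(y_true, y_score, metric="recall"):
    """Return (pct_predicted_positive, metric_value) over all possible cutoffs."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score))   # sort cases by descending score
    sorted_true = y_true[order]

    n = len(sorted_true)
    cum_tp = np.cumsum(sorted_true)            # true positives within the top-k cases
    k = np.arange(1, n + 1)                    # number of cases predicted positive

    pct_positive = k / n                       # X-axis: % of cases predicted positive
    if metric == "recall":
        values = cum_tp / sorted_true.sum()    # share of all positives captured so far
    elif metric == "precision":
        values = cum_tp / k                    # share of flagged cases that are truly positive
    else:
        raise ValueError(f"Unsupported metric: {metric}")
    return pct_positive, values
```

Plotting values against pct_positive (e.g. with matplotlib) yields the performance curve for the chosen metric.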

What do performance curves for random and perfect models look like?

For recall:

  • A perfect model assigns the highest scores to all true positive cases. Therefore, it captures 100% of positive cases (i.e. reaches recall = 1) once it has evaluated the first x% of cases, where x% equals the prevalence of the positive class.
  • A random model only reaches the 100% mark (i.e. recall = 1) once it has evaluated all cases.

Image by Author. Recall Performance Curves in Different Prevalence Scenarios.

For precision:

  • A perfect model places all true positives at the top of the ranking list, giving us a precision of 1 until we run out of positive cases, after which precision declines toward the prevalence.
  • A random model's precision stays at roughly the prevalence of the positive class in the sample population.

Image by Author. Precision Performance Curves in Different Prevalence Scenarios.
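
The shapes of these baseline curves follow directly from the prevalence, so they can be sketched without any model. Below is an illustrative helper (my own naming, under the same assumptions as the earlier sketch) that draws the perfect and random curves for a given prevalence.

```python
# A sketch of the perfect and random baselines for a dataset with a given
# prevalence of the positive class; plot_baselines is a hypothetical helper.
import numpy as np
import matplotlib.pyplot as plt

def plot_baselines(prevalence, metric="recall"):
    x = np.linspace(0.001, 1.0, 500)               # % of cases predicted positive
    if metric == "recall":
        perfect = np.minimum(x / prevalence, 1.0)  # reaches 1 at x = prevalence
        random = x                                 # rises linearly, reaching 1 at x = 1
    else:  # precision
        perfect = np.minimum(prevalence / x, 1.0)  # stays at 1 until positives run out
        random = np.full_like(x, prevalence)       # flat at the prevalence level
    plt.plot(x, perfect, label="perfect model")
    plt.plot(x, random, label="random model")
    plt.xlabel("% of cases predicted positive")
    plt.ylabel(metric)
    plt.legend()
```

Overlaying your model's curve from performance_curve on the same axes shows at a glance how close it sits to the perfect curve and how far it is from the random one.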

Let’s walk through a business use case!

An insurance company regularly investigates policies to detect fraud. The prevalence of fraud they have observed so far is roughly 0.1. Every week, their investigator randomly picks 10 out of 25 new cases to audit. The company, however, quickly realizes this practice is suboptimal and inefficient. They ask you to build an ML model to assist their investigation team.

Question 1: Assume that the investigator can audit at most 10 cases per week due to their work capacity. Before, they picked those 10 cases randomly; now, they pick the top 10 cases ranked by your model. How can you determine whether your model performs better than the previous practice, and if so, by how much?

Question 2: Assume that the company cares most about catching all the fraudulent cases as quickly as possible. How can you prove that your model achieves this more efficiently than the previous practice, and quantify the improvement?

Both of these questions are very practical and likely to be asked by many ML model users. However, they cannot be answered by a ROC chart, a PR chart, or any single threshold metric alone. But they can be answered by performance curves! Let's use the following two plots as examples.

Image by Author. Recall Performance Curves that Represent the Example Use Case.
Image by Author. Precision Performance Curves that Represent the Example Use Case.

Answer to Q1: Given the current capacity of 40% (= 10/25) every week, with the assistance of your model, the investigation team could capture ~95% of true positive cases (i.e. the recall rate), compared with ~41% before. In addition, your model achieves this without sacrificing much precision: at the 40% work capacity, it reaches a precision of ~12%, compared with the 10% generated by the random practice.

Answer to Q2: To capture 100% of true positive cases every week, the investigation team only has to audit the top 70% of cases ranked by your model. With the previous practice, they had to audit 100% of cases to achieve the same rate. This means your model saves them 30% of the work.
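
If you prefer to read these operating points off programmatically rather than eyeballing the chart, a small helper like the one below could be used. This is again a hypothetical sketch reusing performance_curve from earlier; y_true and y_score stand in for the insurer's labels and model scores.

```python
# Hypothetical helper: metric value when only the top `capacity` fraction
# of cases (ranked by predicted score) is audited each week.
import numpy as np

def metric_at_capacity(y_true, y_score, capacity, metric="recall"):
    pct, values = performance_curve(y_true, y_score, metric=metric)
    idx = np.searchsorted(pct, capacity)       # first cutoff at or beyond the capacity
    return values[min(idx, len(values) - 1)]

# e.g. recall and precision at the 40% weekly audit capacity (10 of 25 cases):
# recall_at_40 = metric_at_capacity(y_true, y_score, 0.40, "recall")
# precision_at_40 = metric_at_capacity(y_true, y_score, 0.40, "precision")
```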

Even if the previous practice is less naive than a random classifier, as long as you can build a model that reflects that practice, you can conduct the same analysis as above and compare your model with the benchmark practice using performance curves.

Summary

So far, we have demonstrated the following key properties of a performance curve:

  • Is a visual aid that gives an overview of the range of performance a model could achieve as the percentage of predictions taken into account varies. Performance could be measured with traditional threshold metrics (e.g. precision, recall, F1-score) or with any custom metric that depends on the ordering of cases.
  • Provides a means to compare models to each other, as well as to random and perfect classifiers, over a range of operating conditions on a single dataset.
  • Provides actionable insights to business stakeholders.

Some other benefits a performance curve offers:

  • When precision and recall performance curves are examined alongside the traditional precision-recall (PR) curve, we can determine whether an increase (or decrease) in predictive performance is driven by precision, by recall, or by both.

For example, after multiple iterations of trying to improve our model, we notice that the most recent iteration increases the model's AUPRC. We then examine both the recall and precision performance curves across all iterations. Whereas the recall performance curves look relatively similar across iterations, the precision performance curve of the most recent run looks significantly better than before (i.e. closer to the perfect curve and further from the random curve). We thus know the increase in AUPRC is driven mostly by precision.
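
A rough sketch of this diagnostic, reusing the performance_curve helper from earlier (scores_v1 and scores_v2 are placeholders for the scores of two model iterations):

```python
# Hypothetical comparison of two model iterations: overall AUPRC plus
# overlaid precision and recall performance curves.
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score

def compare_iterations(y_true, scores_v1, scores_v2):
    print("AUPRC v1:", average_precision_score(y_true, scores_v1))
    print("AUPRC v2:", average_precision_score(y_true, scores_v2))

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for metric, ax in zip(["recall", "precision"], axes):
        for name, scores in [("v1", scores_v1), ("v2", scores_v2)]:
            pct, values = performance_curve(y_true, scores, metric=metric)
            ax.plot(pct, values, label=name)   # the curve that moved explains the AUPRC change
        ax.set_xlabel("% of cases predicted positive")
        ax.set_ylabel(metric)
        ax.legend()
    return fig
```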
