An Intuition for AUC and Harrell’s C

A graphical approach

Elena Jolkver
Towards Data Science

Photo by author

Everyone venturing into the realm of machine learning or predictive modeling comes across the concept of model performance testing. Textbooks usually differ only in what the reader learns first: regression with its MSE (mean squared error) or classification with a plethora of performance indicators, like accuracy, sensitivity, or precision, to name a few. While the latter can be calculated as simple fractions of correct and incorrect predictions and are hence very intuitive, the ROC AUC can be daunting at first. Nevertheless, it is also a frequently used parameter to assess predictor quality. Let’s unpack its mechanics first to understand the nitty-gritty details.

Get your head around the AUC first

Let’s assume we have built a binary classifier predicting the probability of a sample belonging to a certain class. Our test dataset with known classes yielded the following results, which can be summarized in a confusion matrix and reported in more detail in a table in which the samples are sorted by their predicted probability of belonging to class P (positive):

Confusion matrix and detailed prediction table with individual samples’ probabilities. Image by author.

The ROC AUC is defined as the area under the ROC (receiver operating characteristic) curve. The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) [Wikipedia]. The TPR (aka sensitivity) is the ratio of correctly identified positive cases to all actual positive cases. In our case, the TPR is 4/5 (four out of five positive cases have been classified correctly as positive). The FPR is the ratio of negative cases wrongly categorized as positive (false positives) to the total number of actual negative cases. In our case, the FPR is 2/6 (two out of six negative cases were misclassified as positive, if we set the “positivity” threshold at a probability of 0.5).
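To make this concrete, here is a minimal Python sketch that reproduces these two numbers. The exact probabilities from the table are not spelled out in the text, so the scores below are hypothetical values chosen to be consistent with the counts above (five positives, six negatives, TPR = 4/5 and FPR = 2/6 at a threshold of 0.5):

```python
import numpy as np

# Hypothetical data: 5 positives and 6 negatives, with illustrative
# probabilities chosen to match the counts quoted in the text.
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.81, 0.75, 0.65, 0.60, 0.55, 0.52, 0.45, 0.35, 0.30, 0.20, 0.10])

threshold = 0.5
y_pred = y_score >= threshold            # predicted class at this threshold

tp = np.sum(y_pred & (y_true == 1))      # true positives
fp = np.sum(y_pred & (y_true == 0))      # false positives
fn = np.sum(~y_pred & (y_true == 1))     # false negatives
tn = np.sum(~y_pred & (y_true == 0))     # true negatives

tpr = tp / (tp + fn)                     # sensitivity: 4/5 = 0.80
fpr = fp / (fp + tn)                     # 2/6 ≈ 0.33
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```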

We can plot the ROC curve from TPR and FPR values and calculate the AUC (Area Under Curve):

ROC Curve based on prediction probabilities. Image by author.

Where do the individual TPR/FPR values for the ROC curve come from? We take our probability table and calculate TPR and FPR once per sample, using that sample’s predicted probability as the classification threshold. Even after we cross the usual level of 0.5, below which samples are usually declared “negative”, we keep lowering the threshold and continue to assign everything above it to the positive class. Let’s follow this procedure in our example:

Image by author

One out of five positive samples has been classified correctly as positive at a threshold of 0.81, and no negative sample has been classified as positive yet. We continue until we encounter the first negative example:

Image by author

Here, our TPR stalls at the previous value (3 out of 5 positive samples have been predicted correctly), but the FPR increments: we have erroneously assigned one out of six negative samples to the positive class. We continue until the very end:

Image by author

Et voilà: we arrive at the complete table which is used to create the ROC curve.
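In code, the same table can be built by sweeping over each sample’s predicted probability and using it as the threshold, again with the hypothetical scores from the earlier snippet. Summing the trapezoids under the resulting step curve gives the AUC:

```python
import numpy as np

# Same hypothetical scores as before (not the author's exact table values).
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.81, 0.75, 0.65, 0.60, 0.55, 0.52, 0.45, 0.35, 0.30, 0.20, 0.10])

n_pos = np.sum(y_true == 1)
n_neg = np.sum(y_true == 0)

# Start above the highest score so the curve begins at (FPR, TPR) = (0, 0),
# then use each predicted probability as the threshold, from highest to lowest.
thresholds = np.concatenate(([y_score.max() + 1], np.sort(y_score)[::-1]))

tpr_list, fpr_list = [], []
for t in thresholds:
    y_pred = y_score >= t
    tpr_list.append(np.sum(y_pred & (y_true == 1)) / n_pos)
    fpr_list.append(np.sum(y_pred & (y_true == 0)) / n_neg)
    print(f"threshold {t:.2f}: TPR = {tpr_list[-1]:.2f}, FPR = {fpr_list[-1]:.2f}")

# Area under the step curve via the trapezoidal rule.
auc = 0.0
for i in range(1, len(fpr_list)):
    auc += (fpr_list[i] - fpr_list[i - 1]) * (tpr_list[i] + tpr_list[i - 1]) / 2
print(f"AUC = {auc:.2f}")                 # 0.90 for these illustrative scores
```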

Why Harrell’s C is nothing but the AUC

But what about Harrell’s C (also known as the concordance index, or C-index)? Consider the task of predicting death after the occurrence of a particular disease, say cancer. Eventually, all patients will die, irrespective of the cancer, so a simple binary classifier won’t be of much help. Survival models take into account the duration until the outcome (death): the sooner the event occurs, the higher the individual’s risk of encountering the outcome. If you were to assess the quality of a survival model, you would look at the C-index (aka concordance, aka Harrell’s C).

In order to understand the calculation of the C-index, we need to introduce two new concepts: permissible and concordant pairs. Permissible pairs are pairs of samples (say: patients) with different outcomes during the observation period, i.e., while the experiment was running, one patient of the pair experienced the outcome while the other was censored (i.e., had not reached the outcome yet). Each permissible pair is then checked: did the individual with the higher risk score experience the event while the censored one did not? Such pairs are called concordant pairs.

Simplifying a bit, the C-index is calculated as the ratio of the number of concordant pairs to the number of permissible pairs (I omit the case of risk ties for simplicity). Let’s walk through our example, assuming that we used a survival model which produced a risk score rather than a probability. The following table contains permissible pairs only. The column “Concordance” is set to 1 if the patient with the higher risk score experienced the event (was one of our “positive” group). The id is simply the row number from the previous table. Pay special attention to the comparison of individual 4 with 5 or 7.

Image by author

This leaves us with 27 concordant pairs out of 30 permissible ones. The ratio (the simplified Harrell’s C) is C = 27/30 = 0.9, which suspiciously reminds us of the previously calculated AUC.
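The pair counting itself is easy to sketch in Python. Using the same hypothetical scores as before, now read as risk scores, and treating the positives as patients who experienced the event and the negatives as censored patients, every (positive, negative) pair is permissible here:

```python
import numpy as np

# Hypothetical risk scores, consistent with the numbers quoted in the text.
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0])   # 1 = event observed, 0 = censored
risk = np.array([0.81, 0.75, 0.65, 0.60, 0.55, 0.52, 0.45, 0.35, 0.30, 0.20, 0.10])

event_scores = risk[y_true == 1]      # patients who experienced the event
censored_scores = risk[y_true == 0]   # censored patients

permissible = 0
concordant = 0
for e in event_scores:
    for c in censored_scores:
        permissible += 1
        if e > c:                     # the event case outranks the censored case
            concordant += 1

print(f"{concordant} concordant out of {permissible} permissible pairs")
print(f"simplified Harrell's C = {concordant / permissible:.2f}")   # 0.90
```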

We can construct a concordance matrix that visualizes how the C statistic is computed, as suggested by Carrington et al. The plot shows the risk scores of the actual positives vs. the risk scores of the actual negatives and displays the proportion of correctly ranked pairs (green) out of all pairs (green + red), if we interpret each grid square as the comparison of one positive sample with one negative sample, i.e., as one permissible pair:

Concordance matrix for the calculation of Harrell’s C. Image by author

The concordance matrix shows the correctly ranked pairs toward the bottom right, the incorrectly ranked pairs toward the top left, and a border between the two regions that corresponds exactly to the ROC curve we have seen before.

Unpacking the process of building up the ROC curve and the concordance matrix, we recognize a similarity: in both cases we ranked our samples according to their probability or risk score and checked whether the ranking corresponded to the ground truth. The lower we set the probability threshold for classification, the more false positives we get. Likewise, the lower the risk scores of the actual positive cases, the more likely an actual negative case is to outrank them. Plotting our ranked data accordingly yields a curve with the same shape and the same area, which we call AUC or Harrell’s C, depending on the context.
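As a quick numerical sanity check (again on the hypothetical scores used throughout), scikit-learn’s roc_auc_score returns exactly the fraction of correctly ranked positive/negative pairs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.81, 0.75, 0.65, 0.60, 0.55, 0.52, 0.45, 0.35, 0.30, 0.20, 0.10])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise_c = np.mean(pos[:, None] > neg[None, :])   # fraction of concordant pairs

print(roc_auc_score(y_true, y_score), pairwise_c)   # both 0.9
```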

I hope this example helped to develop an intuition for both the AUC and Harrell’s C.

Acknowledgment

The idea to compare these two parameters arose from a fruitful discussion during the Advanced Machine Learning Study Group meetup, kudos to Torsten!

Reference: Carrington, A.M., Fieguth, P.W., Qazi, H. et al. A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms. BMC Med Inform Decis Mak 20, 4 (2020). https://doi.org/10.1186/s12911-019-1014-6
