Multiclass classification evaluation with ROC Curves and ROC AUC

Adapting the most widely used classification evaluation metric to the multiclass classification problem with the OvR and OvO strategies

Vinícius Trevisan
Towards Data Science



When evaluating multiclass classification models, we sometimes need to adapt the metrics used in binary classification to work in this setting. We can do that by using OvR and OvO strategies.

In this article I will show how to adapt ROC Curve and ROC AUC metrics for multiclass classification.

The ROC Curve and the ROC AUC score are important tools for evaluating binary classification models. In summary, they show us how separable the classes are across all possible thresholds, or in other words, how well the model distinguishes each class.

As I already explained in another article, we can compare the ROC Curves (top image) with their respective histograms (bottom image). The more separate the histograms are, the better the ROC Curves are as well.

ROC Curves comparison. Image by author.
Class separation histograms comparison. Image by author.
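
As a quick recap of the binary case, both the curve and the score can be computed directly from the predicted probabilities of the positive class. A minimal sketch with scikit-learn (the variable names y_true and y_prob are placeholders, not from the original code):

from sklearn.metrics import roc_curve, roc_auc_score

# y_true holds the binary labels (0/1) and y_prob the predicted probability of the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
print(f"Binary ROC AUC: {auc:.4f}")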

But this concept is not immediately applicable to multiclass classifiers. In order to use ROC Curves and ROC AUC in this scenario, we need another way to compare classes: OvR and OvO.

In the following sections I will explain these strategies in more detail, and you can also check the full code on my GitHub:

OvR or OvO?

OvR — One vs Rest

OvR stands for “One vs Rest”, and as the name suggests it is a method to evaluate multiclass models by comparing each class against all the others at the same time. In this scenario we take one class and consider it our “positive” class, while all the others (the rest) are considered the “negative” class.

By doing this, we reduce the multiclass classification output to a binary one, so all the usual binary classification metrics can be used to evaluate this scenario.

We must repeat this for each class present in the data, so for a 3-class dataset we get 3 different OvR scores. In the end, we can average them (with a simple or weighted average) to obtain a final OvR model score.
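
As a tiny illustration of the relabeling step (the labels below are hypothetical, not the article's data), the “one vs rest” target is just a comparison against the chosen class:

import numpy as np

y = np.array(["apple", "banana", "orange", "apple"])
# "apple vs rest": the chosen class becomes 1, every other class becomes 0
y_apple_vs_rest = (y == "apple").astype(int)   # array([1, 0, 0, 1])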

OvR combinations for a three-class setting. Image by author.

OvO — One vs One

Now, as you might imagine, OvO stands for “One vs One”, and it is really similar to OvR, but instead of comparing each class with the rest, we compare all possible two-class combinations of the dataset.

Let’s say we have a 3-class scenario and we choose the combination “Class1 vs Class2” as the first one. The first step is to get a copy of the dataset containing only these two classes, discarding all the others. Then we define the observations with real class = “Class1” as our positive class and the ones with real class = “Class2” as our negative class. Now that the problem is binary, we can use the same metrics we use for binary classification.

Note that “Class1 vs Class2” is different from “Class2 vs Class1”, so both cases should be accounted for. Because of that, a 3-class dataset yields 6 OvO scores, and a 4-class dataset yields 12 OvO scores.

As in OvR, we can average all the OvO scores to get a final OvO model score.
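
For a single pair, the procedure can be sketched as below. This assumes the true classes and the model's predicted probability for “apple” already live in a dataframe df with columns “class” and “prob_apple”; these names are only illustrative:

from sklearn.metrics import roc_auc_score

# Keep only the instances whose real class is "apple" or "banana"
pair_df = df[df["class"].isin(["apple", "banana"])].copy()
# "apple vs banana": apple is the positive class, banana the negative one
pair_df["label"] = (pair_df["class"] == "apple").astype(int)
auc_apple_vs_banana = roc_auc_score(pair_df["label"], pair_df["prob_apple"])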

OvO combinations for a three-class setting. Image by author.

OvR ROC Curves and ROC AUC

I will use the functions from my Binary Classification ROC article to plot the curves, with only a few adaptations, which are available here. You can also use the scikit-learn version, if you want.

In this example I will use a synthetic dataset with three classes: “apple”, “banana” and “orange”. Every pair of classes overlaps a little, to make it hard for the classifier to get every instance right. The dataset has only two features, “x” and “y”, and looks like this:

Multiclass scatterplot. Image by author.

For the model, I trained a default instance of scikit-learn’s RandomForestClassifier.
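
The original dataset and the code that generates it are not reproduced here, but a comparable setup can be built with scikit-learn. The make_blobs parameters below are illustrative guesses, chosen only to produce three overlapping classes:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Three overlapping blobs with two features, playing the role of "x" and "y"
X, y_num = make_blobs(n_samples=1500, centers=3, n_features=2, cluster_std=3.0, random_state=42)
classes = np.array(["apple", "banana", "orange"])
y = classes[y_num]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier()          # default hyperparameters, as in the article
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)     # one probability column per class, ordered as model.classes_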

In the code below we:

  • Iterate over all classes
  • Prepare an auxiliary dataframe using one class as “1” and all the others as “0”
  • Plot the histograms of the class distributions
  • Plot the ROC Curve for each case
  • Calculate the AUC for that specific class
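
The original code is embedded from a gist; a condensed sketch of those steps, reusing model, X_test, y_test and y_proba from the setup above and plain matplotlib instead of the plotting helpers from the binary ROC article, could look like this:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

roc_auc_ovr = {}
for i, c in enumerate(model.classes_):
    # One class against all the others: current class = 1, "the rest" = 0
    y_true_bin = (y_test == c).astype(int)
    y_score = y_proba[:, i]                          # predicted probability of class c

    fig, (ax_hist, ax_roc) = plt.subplots(1, 2, figsize=(10, 4))

    # Histogram of the predicted probability for each group
    ax_hist.hist(y_score[y_true_bin == 1], bins=20, alpha=0.5, label=c)
    ax_hist.hist(y_score[y_true_bin == 0], bins=20, alpha=0.5, label="rest")
    ax_hist.set_title(f"{c} vs Rest")
    ax_hist.legend()

    # ROC Curve and AUC for this class vs the rest
    fpr, tpr, _ = roc_curve(y_true_bin, y_score)
    ax_roc.plot(fpr, tpr)
    ax_roc.plot([0, 1], [0, 1], linestyle="--")
    ax_roc.set_xlabel("False Positive Rate")
    ax_roc.set_ylabel("True Positive Rate")

    roc_auc_ovr[c] = roc_auc_score(y_true_bin, y_score)

plt.show()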

The code above outputs the histograms and the ROC Curves for each class vs rest:

ROC Curves and histograms OvR. Image by author.

As we can see, the scores for the “orange” class were a little lower than for the other two classes, but in all cases the classifier did a good job of predicting every class. We can also note in the histograms that the overlap we see in the real data also exists in the predictions.

To display each OvR AUC score we can simply print them. We can also take the average score of the classifier:

# Displays the ROC AUC for each class
avg_roc_auc = 0
i = 0
for k in roc_auc_ovr:
    avg_roc_auc += roc_auc_ovr[k]
    i += 1
    print(f"{k} ROC AUC OvR: {roc_auc_ovr[k]:.4f}")
print(f"average ROC AUC OvR: {avg_roc_auc/i:.4f}")

And the output is:

apple ROC AUC OvR: 0.9425
banana ROC AUC OvR: 0.9525
orange ROC AUC OvR: 0.9281
average ROC AUC OvR: 0.9410

The average ROC AUC OvR in this case is 0.9410, a really good score that reflects how well the classifier predicted each class.
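
If the per-class curves are not needed, scikit-learn can also compute an averaged OvR score directly from the probability matrix (its macro or weighted averaging may differ slightly from the simple average printed above):

from sklearn.metrics import roc_auc_score

# y_proba has one probability column per class, in the order given by model.classes_
print(roc_auc_score(y_test, y_proba, multi_class="ovr", average="macro", labels=model.classes_))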

OvO ROC Curves and ROC AUC

With the same setup as the previous experiment, the first thing to do is build a list with all possible pairs of classes:

classes_combinations = []
class_list = list(classes)
for i in range(len(class_list)):
    for j in range(i+1, len(class_list)):
        classes_combinations.append([class_list[i], class_list[j]])
        classes_combinations.append([class_list[j], class_list[i]])

The classes_combinations list will have all combinations:

[['apple', 'banana'],
['banana', 'apple'],
['apple', 'orange'],
['orange', 'apple'],
['banana', 'orange'],
['orange', 'banana']]

Then we iterate over all combinations and, similarly to the OvR case, we:

  • Prepare an auxiliary dataframe containing only the instances of the two classes
  • Define the instances of Class 1 as “1” and the instances of Class 2 as “0”
  • Plot the histograms of the class distributions
  • Plot the ROC Curve for each case
  • Calculate the AUC for that specific combination
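
Again, the original code comes from a gist; a condensed sketch of the loop, continuing with the same imports and variables as the OvR sketch above, could be:

roc_auc_ovo = {}
for c1, c2 in classes_combinations:
    title = f"{c1} vs {c2}"
    idx_c1 = list(model.classes_).index(c1)          # probability column of the positive class c1

    # Keep only the instances whose real class is c1 or c2
    mask = (y_test == c1) | (y_test == c2)
    y_true_bin = (y_test[mask] == c1).astype(int)    # c1 = 1 (positive), c2 = 0 (negative)
    y_score = y_proba[mask, idx_c1]

    # ROC Curve and AUC for this combination (histograms can be added as in the OvR sketch)
    fpr, tpr, _ = roc_curve(y_true_bin, y_score)
    plt.plot(fpr, tpr, label=title)
    roc_auc_ovo[title] = roc_auc_score(y_true_bin, y_score)

plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()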

The code above plots all histograms and ROC Curves:

ROC Curves and histograms OvO. Image by author.

Notice that, as expected, the “apple vs banana” plots are different from the “banana vs apple” ones. As in the previous case, we can evaluate each combination individually, and check for model inconsistencies.

We can also display the AUCs and calculate the average OvO AUC:

# Displays the ROC AUC for each class combination
avg_roc_auc = 0
i = 0
for k in roc_auc_ovo:
    avg_roc_auc += roc_auc_ovo[k]
    i += 1
    print(f"{k} ROC AUC OvO: {roc_auc_ovo[k]:.4f}")
print(f"average ROC AUC OvO: {avg_roc_auc/i:.4f}")

And the output is:

apple vs banana ROC AUC OvO: 0.9561
banana vs apple ROC AUC OvO: 0.9547
apple vs orange ROC AUC OvO: 0.9279
orange vs apple ROC AUC OvO: 0.9231
banana vs orange ROC AUC OvO: 0.9498
orange vs banana ROC AUC OvO: 0.9336
average ROC AUC OvO: 0.9409

The average ROC AUC in this case is 0.9409, which is close to the score obtained in the OvR scenario (0.9410).
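
Here as well, scikit-learn offers a built-in OvO average. It averages over class pairs internally, so the result may differ slightly from the simple average of the six directed scores above:

from sklearn.metrics import roc_auc_score

print(roc_auc_score(y_test, y_proba, multi_class="ovo", average="macro", labels=model.classes_))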

Conclusion

OvR and OvO strategies can (and should) be used to adapt any binary classification metric to the multiclass classification task.

Evaluating OvR and OvO results can also help you understand which classes the model is struggling to describe, and which features you can add or remove to improve its results.
