
Classifiers are easy, if you think Bayes!

A lot of classification terminology has a direct correspondence to Bayesian concepts: evaluating classifiers becomes much easier

Image by Thanasis Papazacharias from Pixabay

Evaluating classifiers can be tricky. First of all, the terminology is quite confusing and hard to remember. A number of different evaluation metrics are available, and the reason to use one over the other can be quite obscure.

But these concepts need not be confusing. In fact, they become quite intuitive once you frame the problem of classification in the right way. And I have recently convinced myself that the right way to frame the problem is… the Bayesian way!

In this post I will explain how a lot of the classifier-terminology has a 1:1 mapping onto Bayesian concepts. This relation is well understood, so I am not discovering anything new here. Still, it is rarely brought up in tutorials and books. Which is unfortunate, because understanding this 1:1 mapping is very powerful for two reasons:

  1. Bayesian thinking is very intuitive, which means the convoluted terms of classifier-terminology become much clearer and easier to remember;
  2. Evaluating a classifier also becomes simpler, as the pros and cons of each evaluation metric make intuitive sense through a Bayesian lens.

I will start with a quick review/intro on Bayes. I’ll do that while spelling out the 1:1 mapping between classifier and Bayesian-terminology. Finally, I will explain how this mapping can be very useful for your day-to-day work with classifiers, especially their evaluation.

Let’s start!

(Re)defining the classification problem

To start with we need some notation. Imagine you are given a data point represented by a feature vector X, and you know that X can belong to one of two possible classes: class 0 (y=0) or class 1 (y=1). Unfortunately you don’t know which class X belongs to, but you are given a black box (the classifier) that does an interesting trick: it takes the feature vector X, it makes some calculation and finally outputs a guess about what class X belongs to (g = 1 indicates that the classifier is guessing class 1, g = 0 class 0).

A classifier, according to Bayes (Photo by Hulki Okan Tabak on Unsplash)

The guess is not always reliable: you try the classifier on a number of test feature vectors and you end up with a mixture of right and wrong predictions. So the natural question is: given this collection of tests, how can I quantify the classifier’s reliability?

Enter Bayes

To formalise the process of evaluating the classifier, we are going to use the odds form of the Bayes rule (or theorem), which is the following:

P(θ|O) / P(¬θ|O) = [ P(O|θ) / P(O|¬θ) ] × [ P(θ) / P(¬θ) ]

where P(…) indicates a probability, P(…|…) indicates a conditional probability, θ and ¬θ are two mutually exclusive theories, and O is an observation.

In a nutshell, this rule shows how, after making an observation, we should turn our previous belief about a theory (the rightmost term, P(θ)/P(¬θ)) into a new, updated belief (the leftmost term, P(θ|O)/P(¬θ|O)) which takes the observation into account. The proper way to do that, Bayes tells us, is to multiply the previous belief by a multiplicative factor (the term in the middle, P(O|θ)/P(O|¬θ)) which we can calculate with a bit of maths.
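
In code, the update is literally one multiplication. Here is a minimal sketch in plain Python, with hypothetical numbers chosen only for illustration:

```python
def bayes_update(prior_odds, likelihood_ratio):
    """Odds form of the Bayes rule: posterior odds = likelihood ratio * prior odds."""
    return likelihood_ratio * prior_odds

# Hypothetical example: prior odds of 3/2 in favour of theta and a
# likelihood ratio of 4 give posterior odds of 6 in favour of theta.
print(bayes_update(prior_odds=1.5, likelihood_ratio=4.0))  # 6.0
```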

This is the general idea, but to see how this formula can help understand classifiers we need to dive into each component of the equation in a bit more detail. We’ll see that each term corresponds to an equivalent concept in our classification problem.

Theories

θ and ¬θ are two alternative statements or theories about the world;

In our classification problem these are simply our alternative theories about the origin of X, namely: θ = "X belongs to class 1" (y=1), ¬θ= "X belongs to class 0" (y=0). We don’t know which of the two theories is true, and we hope the classifier will shed some light.

Prior odds

P(θ)/P(¬θ) is the degree of confidence (expressed as odds) that we have about theory θ being true, before making any observation about the world (prior odds);

In our classification problem this is simply P(y=1)/P(y=0). What value should we assign to this ratio? Well, before interrogating the classifier we know nothing about X, so it is reasonable to set P(y=1)/P(y=0) equal to the relative proportion of the two classes in the population (e.g. if the proportion of white/black socks in your closet is 3/2, then the odds you assign to a random sock being white are 3/2). Hence, the prior odds correspond to what, in the classification setting, we call the prevalence of class 1.
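
As a minimal sketch of how you might estimate the prior odds from labelled data (a toy NumPy example with made-up labels):

```python
import numpy as np

# Hypothetical labels for a small test set: 1 = class 1, 0 = class 0
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

# Prior odds estimated from the relative proportion of the two classes
prior_odds = (y == 1).mean() / (y == 0).mean()
print(prior_odds)  # 4/6 ≈ 0.67 for this toy sample
```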

Beware of small odds… (Photo by dylan nolte on Unsplash)

Note that a heavily imbalanced dataset would result in the prior odds being very high (because the numerator P(y=1) is close to 1 and the denominator P(y=0) is close to 0) or very low (P(y=1) close to 0 and P(y=0) close to 1).

Observation

O is an observation, a fact about the world that we happen to observe and that has a connection with theories θ and ¬θ.

In our classification problem the observation "O" is the output of our classifier when we feed the feature vector X to it. In other words, it is the classifier’s guess about X‘s class. This is the signal that we want to use to change our prior belief in θ.

We represented this guess with the binary variable "g". For the sake of simplicity, let’s assume that the classifier is guessing class 1, i.e. g = 1.

An anthropomorphic classifier is trying to guess (Photo by Lyman Hansel Gerona on Unsplash)

Likelihood ratio

P(O|θ) is the probability that theory θ assigns to the fact "O" happening. It is called the "likelihood function" of θ. Similarly, P(O|¬θ) is the probability that ¬θ assigns to "O" happening. Their ratio P(O|θ)/P(O|¬θ) is called the "likelihood ratio", and it quantifies to what extent theory θ assigned a different (higher or lower) probability to the observation "O" than ¬θ did: if the difference is big (likelihood ratio very high or very low), then multiplying this term by the prior odds will result in a strong update of our beliefs, which makes intuitive sense.

In our classification problem, P(O|θ) = P(g=1|y=1): this is the probability that the classifier will guess that X belongs to class 1 (g=1), in the scenario that X really belongs to class 1 (y=1). If you are familiar with the (confusing) classifier-terminology, you’ll know that this is nothing but the classifier’s True Positive Rate (TPR)!

And what about P(O|¬θ)? This is P(g=1|y=0), namely the probability that the classifier will guess that X belongs to class 1 when actually X belongs to class 0. Do you recognise it? This term is nothing but the False Positive Rate (FPR)!

Both terms can be easily estimated from your collection of experiments on test feature vectors.
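
For instance, with NumPy and the same hypothetical test labels as above plus made-up guesses, both conditional probabilities can be estimated as empirical frequencies:

```python
import numpy as np

# Hypothetical test results: true labels y and the classifier's guesses g
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
g = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0])

tpr = (g[y == 1] == 1).mean()  # P(g=1 | y=1), true positive rate
fpr = (g[y == 0] == 1).mean()  # P(g=1 | y=0), false positive rate

print(tpr, fpr, tpr / fpr)  # 0.75, ≈0.17, 4.5 (the likelihood ratio)
```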

Now, these two terms appear as a ratio in the Bayes rule, P(g=1|y=1)/P(g=1|y=0). We have just shown that this can be re-written as TPR/FPR.

Does it ring a bell? Do you know any classification metric that compares TPR and FPR? Specifically, a metric that is higher the higher the ratio TPR/FPR is? Yes, the AUC! This very common classifier metric is simply describing¹ the likelihood ratio in the Bayes theorem!
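
If your classifier outputs continuous scores rather than hard guesses, scikit-learn computes the AUC directly. A quick check with made-up scores for the same toy labels:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and continuous classifier scores
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.4, 0.8, 0.2, 0.7, 0.1, 0.3, 0.2, 0.6, 0.05])

print(roc_auc_score(y, scores))  # ≈ 0.92 for this toy example
```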

Example of ROC curves, from which AUC is calculated (MartinThoma, CC0, via Wikimedia Commons)

Posterior odds

P(θ|O)/P(¬θ|O) is the degree of confidence (again expressed as odds) that we have about the theory θ after having observed the fact O (posterior odds). According to the Bayes rule, it can be calculated by multiplying the prior odds by the likelihood ratio.

In our classification problem, the posterior odds are P(y=1|g=1)/P(y=0|g=1).

If we think about what this expression means, we immediately realise that the numerator P(y=1|g=1) (probability of X really belonging to class 1 when the classifier predicts so) is the classifier’s precision! This is another popular evaluation metric for classifiers, often described as an alternative to the AUC². We can see here how the two are deeply linked through the Bayes rule.

By the way, the denominator of the posterior odds, P(y=0|g=1), is also a standard classification metric – although less popular than precision among data scientists – called the false discovery rate.
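
Continuing the toy example from above (same hypothetical labels and guesses), the posterior odds are simply the precision divided by the false discovery rate:

```python
import numpy as np

# Same hypothetical test results as before
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
g = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0])

precision = (y[g == 1] == 1).mean()  # P(y=1 | g=1)
fdr       = (y[g == 1] == 0).mean()  # P(y=0 | g=1), false discovery rate

posterior_odds = precision / fdr
print(precision, fdr, posterior_odds)  # 0.75, 0.25, 3.0
```

Reassuringly, this agrees with the Bayes rule: for these toy numbers the prior odds are 4/6 ≈ 0.67 and the likelihood ratio is 4.5, and 0.67 × 4.5 ≈ 3.0.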

Classifiers, revisited

Let’s now appreciate the new perspective that this Bayesian framing gives us about classifiers. To this end, it is useful to rewrite the Bayes rule here, along with the Bayesian terminology (top of the formula) and the corresponding classifier-concepts (bottom):

Bayes’ rule, with correspondence between classifier (bottom) and Bayesian terminology (top) (Image by Author)

We found that the AUC (and the ROC curve it is derived from) is nothing but a way to summarise the likelihood ratio. If we look again at the Bayes rule, we recognise that the likelihood ratio is the multiplicative factor transforming a prior belief into a posterior belief. So what the AUC really quantifies is the degree to which using the classifier changes our prior odds. If the multiplicative factor is either very high or very low, then using the classifier significantly changes our belief about which class X belongs to, which is good (and the AUC will capture this fact with a value close to 1)³; if the multiplicative factor is instead close to 1, then its effect on the prior is negligible, our prior beliefs won’t change much, and the classifier is useless (and the AUC will be close to 0.5).

This is what makes the AUC (and the ROC curve in general) a good default option to evaluate a classifier. It has the nice characteristic of being a metric about the classifier and nothing else – in particular, it’s not about the dataset (whose properties, such as proportion between classes, are instead captured by the prior odds).

However, what the AUC and the likelihood ratio do not capture is the absolute degree of confidence we will have about X after using the classifier. After all, the likelihood ratio only tells us to what extent the prior belief changes thanks to the use of the classifier. But there are scenarios where even a very strong Bayesian update doesn’t change much, in practice, your confidence about theory θ.

For example, if before using the classifier you thought that X was very, very unlikely to belong to class 1 (let’s say, prior odds for θ of 1/1,000,000), then even a strong Bayesian update (let’s say a likelihood ratio of 100) won’t change the situation much: the posterior odds will be 100 × 1/1,000,000 = 1/10,000, which are still minuscule odds. As we’ve seen, these very small prior odds occur when the target sample is heavily imbalanced.
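
The same arithmetic in a couple of lines, converting odds to a probability with p = odds / (1 + odds):

```python
prior_odds = 1 / 1_000_000       # class 1 is extremely rare
likelihood_ratio = 100           # a strong Bayesian update

posterior_odds = likelihood_ratio * prior_odds   # 1/10,000
posterior_prob = posterior_odds / (1 + posterior_odds)
print(posterior_odds, posterior_prob)  # 0.0001, ≈ 0.01% probability of class 1
```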

That’s where the precision comes into play. Precision is a better choice when the target is imbalanced because in this scenario you want to look at your belief after using the classifier – that is, you want to look at the posterior odds. And the precision is exactly that, just expressed as a probability rather than as odds⁴.

Excellent AUC? You may still want to check your posterior odds before going all-in (Photo by cottonbro from Pexels)

The disadvantage of posterior odds (aka precision) is that, because they take into account both the likelihood ratio and the prior odds, they are not a metric about the classifier alone. Rather, they measure the interaction (effectively, the product) between the classifier’s reliability (likelihood ratio) and the actual dataset you are applying the classifier to (prior odds). As a result, despite being useful, precision doesn’t tell us much about how intrinsically good the classifier is, and it will change when the dataset changes (in particular, it will change if the class prevalence you get in production differs from that of the training set).

No need to pick one metric (if you think Bayes)

So, which metric is the best? As usual, there is no clear-cut answer, and it all depends on your particular problem and context. The standard, handbook answer is that you should normally pick ROC/AUC unless the target is heavily imbalanced, in which case you want to use precision.

This works fine as a rule of thumb, but I hope that by framing the problem in Bayesian terms you are now able to see the reason behind this heuristic. ROC/AUC is in general a good default choice because the likelihood ratio evaluates the classifier’s intrinsic performance at separating the two classes, independently of the question of how rare class 1 is compared to class 0 (which is encoded in the prior odds). Precision is a better choice when the target is imbalanced because in this scenario even good classifiers can leave us with very little confidence, in practice, about the classification question – and you only realise that by looking at the posterior odds.

Ideally you want to look at both to have a complete picture. You want to know both how good your classifier is in general and how well it interacts in practice with your dataset. If you think Bayes you’ll be able to keep these two aspects in mind and make better decisions.


[1] Technically, the AUC is summarising the ratio TPR/FPR over the full range of the classifier’s threshold, while in this Bayesian framing we are keeping the threshold fixed for the sake of simplicity.

[2] Typically in its form averaged over the full range of the classifier’s threshold, aka average precision.

[3] A very small likelihood ratio would actually translate into an AUC close to 0 rather than 1, but we can consider this scenario as a good classifier being used in the wrong way…

[4] Through the Bayesian framing it becomes clear that the whole "don’t use AUC if the target is imbalanced" advice is simply an invitation to avoid the base rate fallacy.

