
6 Metrics to Evaluate your Classification Algorithms

Learn the most common metrics you can use to evaluate your classification models – in this article we will explore 6 metrics including…

[Disclaimer: This post contains some affiliate links to my Udemy Course]

Photo by mcverry @Unsplash.com

Building a Classification algorithm is always a fun project to do when you are getting into Data Science and Machine Learning. Along with Regression, Classification problems are the most common ones that businesses jump right into when they start experimenting with predictive modelling.

But evaluating a classification algorithm can get confusing really fast. As soon as you develop a logistic regression or a classification decision tree and output the first probability ever spit out by a model, you immediately think: how should I use this outcome?

First and foremost, when it comes to evaluating your classification algorithms, there is a big choice you must make: do you want to use a metric that is tied to a threshold on your "probability" outcome? Or a metric that is agnostic to that threshold?

Most well-known metrics are tied to a threshold, and all of them have pros and cons. There is also a metric that is agnostic to the threshold, which may be better for comparing models whose outcomes have different probability distributions (for example, regressions vs. tree-based models). We’ll explore both types in this post.

Also, for the examples in this blog post, we will use binary classification (0 or 1 cases). Some common examples from "real life":

  • Model to predict if a transaction is fraudulent or not;
  • Model to predict if a customer will churn (not buy anymore) or not;
  • Model to predict if a website visitor will convert into a paid user;

We’ve now set the stage to learn about classification metrics – let’s start!


Confusion Matrix

I’m cheating a bit by starting with the confusion matrix, but hang in there! The confusion matrix is not a metric by itself, but it is the base for multiple metrics used to evaluate classification algorithms. Confusion matrices are simply matrices that map the outcome of your algorithm against the true label of a specific target.

One thing you might be asking yourself is how to assign a label to the outcome of your algorithm – in other words, if you are assigning examples to 0’s or 1’s (binary classification), how do you translate a probability into an actual classification? The common trick is to define a threshold (say 50%) and consider everything above that value a 1 and everything below it a 0.

This is one of the main issues with metrics derived from the Confusion Matrix – they are dependent on a threshold that you have to define beforehand.

The confusion matrix is pretty simple: you map the real outcomes against the predictions coming from the model. Here is a simple toy example:

Confusion Matrix Example

On the left, you have the predicted values coming from your model – the p(y) that you might see in other resources. On top, you have the real values of the target – the y.

In the example above, you can see that there were 10 examples where the algorithm predicted 1 and the real target is also 1. These are called True Positives (TP). Building on this example, there are 3 other basic values you can obtain from the confusion matrix, other than TP:

  • True Negatives (TN): Where the real target is 0 and your algorithm also predicted 0;
  • False Negatives (FN): Where the real target is 1 and your algorithm predicted 0. When this happens, your algorithm failed to classify the data point correctly;
  • False Positives (FP): Where the real target is 0 and your algorithm predicted 1. When this happens your algorithm also failed to classify the data point correctly;

Many metrics spring from the four values above. These numbers are essential to understanding the relationship between the outcome of your model and the real values once you have passed the model’s output through a threshold.
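As a minimal sketch of this step, here is how you could threshold a model’s probabilities and extract TP, TN, FP and FN with scikit-learn – the probabilities and labels below are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up model probabilities and true labels for 10 examples
y_proba = np.array([0.91, 0.12, 0.67, 0.45, 0.88, 0.30, 0.05, 0.76, 0.52, 0.19])
y_true = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0])

# Turn probabilities into class labels with a 50% threshold
threshold = 0.5
y_pred = (y_proba >= threshold).astype(int)

# Note: scikit-learn places true labels on the rows and predictions on the columns
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```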

Let’s understand some of the metrics that are derived from the values above.

Accuracy

The first metric that comes to mind when evaluating classification algorithms is accuracy.

Accuracy is simple to understand and builds directly on the confusion matrix data. To compute accuracy, you simply apply the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

On the numerator, we have all the examples that we got right with our algorithm. On the denominator we have our entire sample. That’s it, simple as that!

If we pick up the values from the confusion matrix above and plug them into this formula, we get the accuracy of our toy example.

The higher this value, the better your algorithm is at classifying examples. An accuracy of 100% means that your algorithm perfectly separates the classes.

But there is a big caveat with accuracy: it breaks down when you have a huge class imbalance, meaning one of your classes is extremely overweight in terms of number of examples.

One real life example is fraud detection – of the millions of transactions done each day, only a few are really fraudulent. Sometimes that percentage is less than 1%.

If you build a model with the goal of having a good accuracy, well… you can have an accuracy of 99% if you just assign every transaction as "non-fraudulent". Does this mean that your model is good? No!

Your accuracy would be high but your model would be worthless, because it fails to catch those 1% fraudulent transactions that are extremely damaging to financial systems.

Accuracy is valuable when the target is reasonably balanced but, luckily, there are other metrics better suited for imbalanced samples.
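To make the caveat concrete, here is a small sketch with a made-up, fraud-like label distribution, showing how a useless "predict everything as 0" model still scores 99.5% accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up imbalanced target: 5 positives (fraud) out of 1,000 transactions
y_true = np.array([1] * 5 + [0] * 995)

# A "model" that blindly flags every transaction as non-fraudulent
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.995 – high accuracy, yet zero frauds caught
```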

Precision

Precision is one of the most widely used metrics to understand how well your classification of 1’s is behaving. Basically, it helps you understand how precise (no pun intended) the "positive" predictions produced by the algorithm are.

Precision is computed with the following formula:

Precision = TP / (TP + FP)
Just to recap, the formula above is read as: True Positives over True Positives plus False Positives. Picking up the values from our confusion matrix example gives the precision of our toy model.

You can think that precision mostly answers the following question:

Of all the examples I classified as positive, how many of them are really positive?

This will help you understand how "broad" your algorithm is when classifying new examples. One of the metaphors that really helped me memorize precision is the fisherman metaphor.

If a fisherman fishes with a giant net and catches 100 fish and 1 boot, what is their precision? High! In our metaphor, precision answers the following question: "Of all the objects in the fisherman’s net, what percentage of them were fish?"
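Putting the metaphor into numbers – just a quick back-of-the-envelope sketch:

```python
# Everything in the net counts as a "positive" prediction
tp = 100  # fish: predicted positive and actually positive
fp = 1    # the boot: predicted positive but actually negative

precision = tp / (tp + fp)
print(round(precision, 3))  # 0.99 – a very precise catch
```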

A metric similar to Precision is Recall – let’s check it.

Recall

Recall helps us understand what percentage of the examples whose target equals 1 we are flagging correctly.

Similar to precision, this is a really good metric to use when the target is a rare event (also called imbalanced). The formula is:

Recall = TP / (TP + FN)

Checking the formula with the values from our example gives the recall of our toy model.

You may be asking yourself: "Why is this a good metric when the target is rare?"

Remember when we said that accuracy wasn’t a good metric to use when the target is rare? Imagine that you have 1,000 examples, and only 5 of them have a target equal to 1. If you classify everything as 0, your accuracy will be 995/1000, or 99.5% – again, it seems like a great model! But if you compute recall on this example:

Recall = 0 / (0 + 5) = 0

As the True Positives are 0, the recall is also 0. The higher your recall, the more of the true positive labels you are correctly classifying, and hence the more valuable your algorithm is.

Notice that if you classify too many examples as 1’s, your precision will tend to be lower. Luckily, there is a metric that enables us to combine both precision and recall: the F-score (we will check it in more detail below).

Returning to the fisherman metaphor, a high recall means that you were able to catch most of the fish available in the area, regardless of how many boots ended up in the net.
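Continuing the imbalanced example from above, here is a sketch of how both metrics expose the useless all-zeros model, again with made-up data and scikit-learn:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Same made-up imbalanced target: 5 positives out of 1,000
y_true = np.array([1] * 5 + [0] * 995)
y_pred = np.zeros_like(y_true)  # classify everything as 0

print(recall_score(y_true, y_pred))                      # 0.0 – none of the 5 positives is caught
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 – no positive predictions at all
```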

Specificity

Both recall and precision have the True Positives in the numerator. There is another metric that looks at the true negative rate, called specificity. In most use cases, specificity is not used as a standalone metric, but it is relevant for computing the ROC curve (and its AUC), as we will see below.

The formula for specificity is the following:

Specificity = TN / (TN + FP)

Plugging in the values from our confusion matrix example gives the specificity of our toy model.

Specificity is a similar concept to recall but focuses on the examples where the real target is 0. With specificity, we try to understand the proportion of negatives that our algorithm is classifying correctly.
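Scikit-learn has no dedicated specificity function, so a common approach – sketched here with the same made-up arrays as before – is to derive it from the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions, as in the earlier sketches
y_true = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)  # proportion of real negatives classified as negative
print(specificity)
```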

F-Score

We’ve seen how precision and recall handle the "rare target" case, giving more meaningful feedback about the model than accuracy. But they still convey different information. Wouldn’t it be nice if we had a metric that balanced both precision and recall?

Luckily, we have the F-score (sometimes called F-measure or F1-Score)!

The F-score is the harmonic mean of precision and recall, with the following formula:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

In our example, since precision and recall are the same, the F-score equals that common value (the harmonic mean of two equal numbers is the number itself).

This metric helps us to understand the balance between precision and recall. If it is near 1, then both precision and recall are near 1.

A high F-score means that our model is doing a good job of separating the 1’s and 0’s in our problem – which is the ultimate goal of any classification algorithm.
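A quick sketch of the harmonic mean in code, reusing the made-up arrays from before (scikit-learn also provides f1_score directly):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels and predictions from the earlier sketches
y_true = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 0])

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

print(2 * p * r / (p + r))       # harmonic mean, computed by hand
print(f1_score(y_true, y_pred))  # same value, straight from scikit-learn
```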

Until now, we have only seen metrics derived from the confusion matrix and, as we’ve seen, building it requires defining a threshold to classify our examples. Next, we are going to look at a metric that does not depend on any a priori threshold, only on the outcome of the model.

AUC

The AUC springs from the ROC curve. You can check the details of the ROC curve in this article.

The ROC curve is a simple concept: it plots recall against 1 − specificity for every threshold available in the classification. An example of a typical ROC curve is the following:

Image by Author

On the x-axis we have recall. On the y-axis we have 1 − specificity (also called the false positive rate, or FPR). Each point on this curve corresponds to a specific pair of recall and FPR values. Let’s zoom in on the orange point:

  • The orange point has a value of recall of around 40%.
  • The value of FPR is around 10%.
  • The threshold chosen for this point is 78%. This means that you will consider an example’s target equal to 1 if the probability output by your algorithm is higher than 0.78, or 78%.

If you consider a higher threshold, we are looking at a point similar to the lighter orange one below:

Image by Author

Notice that if you raise the threshold needed to consider an example from your algorithm a 1, you will naturally "catch" fewer examples. Ideally, when you do this, you would like your recall to stay the same while your FPR drops – this would mean that by raising the threshold you are catching fewer false positives.

Notice that when you get to the end of the curve on the right side, you are considering all examples as positives – you catch all the fish but all the boots as well!

So… what exactly is the AUC? AUC, or Area Under the Curve, is simply the total area under the ROC curve. The higher this value, the larger the area and the better your model is at separating the classes of 1’s and 0’s across all thresholds.

Image by Author

The AUC in this case (the entire area below the curve, marked with the arrows) would be something near 0.7 – a near-perfect model would have a curve similar to this:

Image by Author

In this case, even with a threshold as high as 95%, you would catch most of the examples where the real target equals 1, while minimizing false positives. It’s very rare to see a ROC curve like this, as it means the classes you are trying to split are almost perfectly separable – something uncommon in real-life examples.

The AUC ranges from 0 to 1. A value of 0.5 means your model is no better than a random model (anything below that is a bad sign!), while a value of 1 means a perfect model (a utopia). Why is this metric agnostic to a threshold? Because you compute the curve – and the area under it – across every available threshold, which makes your evaluation independent of any single one.
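As a small sketch – again with made-up labels and probabilities – scikit-learn builds the whole curve and the area for you:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and model probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_proba = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.05, 0.60, 0.75, 0.30])

# One (FPR, recall) pair per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_proba)

# Area under that curve – no threshold choice needed
print(roc_auc_score(y_true, y_proba))
```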


And we’re done! These are some of the most common metrics used to evaluate Data Science and Machine Learning classification models. There are many others you can use, but these are probably the ones you will stumble upon most often across ML projects.

Is there any other metric that you commonly use to evaluate your classification models? Write it down in the comments below – I would love to hear your opinion!

I’ve set up a Udemy course on learning Data Science where I’ve included a deep dive into these metrics – the course is suitable for beginners and I would love to have you around.

