Visual Explanations & Connections to Probability
In this article, we’ll visually review the most popular supervised learning metrics for
- Classification – Accuracy, Precision, Recall, Fᵦ & AUC; and
- Regression – MAE, MSE and R².
For classification tasks, the more advanced classification metrics allow you to calibrate the importance of Type I and II errors for your use case, while helping you deal with imbalanced datasets.
We’ll also visually explore some connections between classification metrics and probability.
In this and this article, I apply these supervised learning metrics to real-world ML problems in Jupyter Notebooks using scikit-learn.
1. Metrics for Classification
Confusion Matrix
from sklearn.metrics import confusion_matrix
Each prediction in a binary classification task results in one of four outcomes, which can be summarised in a confusion matrix. Classification metrics are then computed from the values in this matrix.
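As a quick illustration (a toy example, not a dataset from this article), here's how scikit-learn lays out those four outcomes. Its convention puts actual labels on the rows and predicted labels on the columns:

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only: 1 = positive class, 0 = negative class
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]

# scikit-learn orders the matrix as [[TN, FP], [FN, TP]] for labels [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 2 1 1 4
```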
Accuracy
from sklearn.metrics import accuracy_score
The most basic metric is accuracy, which gives the fraction of all predictions that our model got right.
Although accuracy is easy to calculate and straightforward to understand, it fails for imbalanced datasets, as we’ll see below.
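A minimal sketch on the same toy labels as above:

```python
from sklearn.metrics import accuracy_score

# Accuracy = (TP + TN) / (TP + TN + FP + FN): the fraction of all predictions that are correct
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]
print(accuracy_score(y_true, y_pred))   # 0.75, i.e. (4 + 2) / 8
```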
Precision
from sklearn.metrics import precision_score
Precision gives the fraction of positive predictions we got right. This metric prioritises minimising Type I errors. A classic example problem that demands a high-precision model is classifying spam emails – more on this below. Here, a false positive means an important email is accidentally thrown out with the junk.
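A minimal sketch on the same toy labels:

```python
from sklearn.metrics import precision_score

# Precision = TP / (TP + FP): of everything we flagged as positive, how much really was positive?
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]
print(precision_score(y_true, y_pred))   # 0.8, i.e. 4 / (4 + 1)
```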
Recall
from sklearn.metrics import recall_score
Recall gives the fraction of actual positives we predicted right. This metric prioritises minimising Type II errors. Example problems that demand high-recall models include diagnostic tests for COVID-19, classifying fraudulent credit card transactions and predicting defective aircraft parts. In these problems, false negatives are either financially costly or even deadly.
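And the corresponding sketch for recall, again on the same toy labels:

```python
from sklearn.metrics import recall_score

# Recall = TP / (TP + FN): of all actual positives, how many did we catch?
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]
print(recall_score(y_true, y_pred))   # 0.8, i.e. 4 / (4 + 1)
```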
Examples using Accuracy, Precision & Recall
Take the following example on classifying fraudulent credit card transactions.
The confusion matrix told us that this ‘dumb’ model classified every single transaction as legitimate. The accuracy is a stellar 992/1000 = 99.2% as a result. But…we missed every single fraudulent transaction! Damn.
To capture the importance of these Type II errors, we can use recall instead.
Here, recall = 0/8 = 0%, highlighting that no fraudulent transactions were detected. Whoops!
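Reproducing those numbers in code (a sketch built from the counts above: 992 legitimate and 8 fraudulent transactions, all predicted legitimate):

```python
from sklearn.metrics import accuracy_score, recall_score

# 1,000 transactions: 992 legitimate (0) and 8 fraudulent (1)
y_true = [0] * 992 + [1] * 8
# The 'dumb' model labels every single transaction as legitimate
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))   # 0.992, looks stellar
print(recall_score(y_true, y_pred))     # 0.0, every fraudulent transaction missed
```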
Check out the next example of filtering out spam email. Here, making Type I errors is much worse than making Type II errors. This is because accidentally throwing away an important email is more problematic than accidentally letting a few junk emails get through. A good metric here is precision. In other words, we’re looking to train a high precision model.
We have precision = 100/130 = 77%. This means 77% of emails labelled as spam were actually spam – out of the 130 emails the system filtered into the spam folder, 100 were real spam messages. Not terrible, but I’d say it’s certainly not good enough for a production email server!
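Checking that arithmetic straight from the counts in the example:

```python
# Spam example: 130 emails sent to the spam folder, of which 100 were real spam
tp, fp = 100, 30
precision = tp / (tp + fp)
print(round(precision, 2))   # 0.77
```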
Finally, check out this example on diagnosing COVID-19. Here, committing Type II errors (false negative test) means missing people who are actually sick. This could be fatal to an individual and disastrous for public health.
Meanwhile, a Type I error (false positive test) would at most inconvenience the individual. Better be safe than sorry! Thus, we’re looking to train a high recall model.
We have recall = 100/120 = 83%. This means the COVID-19 test correctly identified 83% of cases – flagging 100 of the 120 COVID-19-positive patients, while sending home 20 people actually infected with the SARS-CoV-2 virus. Unfortunately, this test wouldn’t be good enough for general use.
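Again, checking the arithmetic from the counts in the example:

```python
# COVID-19 example: 120 infected people, of whom 100 were correctly flagged
tp, fn = 100, 20
recall = tp / (tp + fn)
print(round(recall, 2))   # 0.83
```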
F₁ and Fᵦ scores
from sklearn.metrics import f1_score, fbeta_score
Taking into account both precision and recall, the F₁-score is an advanced metric that lets you have the best of both worlds and is robust against imbalanced datasets. The metric is defined to be the harmonic mean of precision and recall:
Alternatively, you can calculate the F₁-score straight from the confusion matrix:
For our COVID-19 model, the F₁-score = 100/(100+0.5(20+80)) = 67%.
More generally, the Fᵦ-score allows you to calibrate the importance of Type I and II errors more precisely. It does this by letting you tell the metric that you view recall as β times more important than precision.
Setting β = 1 gives you the F₁-score.
Here’s the formula for calculating the Fᵦ-score straight from the confusion matrix:
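As a sanity check, here's a sketch that reproduces the F₁-score of 67% for the COVID-19 model (the counts TP = 100, FN = 20, FP = 80 come from the worked calculation above; the TN count of 800 is a placeholder, since true negatives don't enter the F-scores), plus an Fᵦ-score with β = 2 to weight recall more heavily:

```python
from sklearn.metrics import f1_score, fbeta_score

# Counts from the COVID-19 example: TP = 100, FN = 20, FP = 80
# TN = 800 is a placeholder; the true-negative count does not affect F-scores
tp, fn, fp, tn = 100, 20, 80, 800

# Rebuild label vectors so we can use scikit-learn's scorers
y_true = [1] * (tp + fn) + [0] * (fp + tn)
y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn

print(round(f1_score(y_true, y_pred), 2))              # 0.67, matches 100 / (100 + 0.5*(20 + 80))
print(round(fbeta_score(y_true, y_pred, beta=2), 2))   # 0.76, recall treated as 2x more important
```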
Area Under Curve (AUC)
from sklearn.metrics import roc_curve, roc_auc_score
Specifically, we’re looking at the area under the Receiver Operating Characteristic (ROC) curve.
This area, ranging from 0 to 1, measures the ability of your classifier to separate the two classes and sift signal from noise.
A full area of 1 represents the perfect model, while a model with an AUC of 0.5 is no better than random guessing. As a rule of thumb, a score above 0.9 is considered outstanding, above 0.8 excellent and above 0.7 acceptable.
To draw the ROC curve, we plot the True Positive Rate (TPR, a.k.a. recall) against the False Positive Rate (FPR).
Specifically, we vary the probability threshold for a True prediction, attain a confusion matrix, and plot the point (FPR, TPR). We do this for all possible thresholds, each time giving us a different confusion matrix and (FPR, TPR) point. The ROC curve thus summarises information about the set of possible confusion matrices.
A great visual explanation about the ROC is on YouTube.
What should the ROC curve look like? A curve that sits far above the diagonal means more thresholds – and hence more confusion matrices – where the model separates true positives from false positives well. That is, more points where the TPR dominates the FPR.
This information can be summarised as the Area under the ROC curve. The higher the AUC, the "higher the ROC curve", the better the model.
A model with a much higher AUC means we can vary the probability threshold widely and the model still retains its ability to separate signal from noise.
In this article, I use scikit-learn to generate a ROC curve and compute the AUC in Python.
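For a self-contained sketch (toy labels and predicted probabilities, not data from this article), here's how roc_curve and roc_auc_score fit together:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy example: true labels and the model's predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold on y_score yields a different confusion matrix, i.e. one (FPR, TPR) point
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr)   # [0.  0.  0.5 0.5 1. ]
print(tpr)   # [0.  0.5 0.5 1.  1. ]

# The ROC curve itself is just plt.plot(fpr, tpr) with matplotlib
print(roc_auc_score(y_true, y_score))   # 0.75
```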
In the next section, we’ll examine TPR, FPR, TNR and FNR in more detail.
Confusion Matrix, Probabilities and Trees
To gain a new perspective, we can look at the confusion matrix through the lens of a tree. A benefit of this is we can make some connections between the confusion matrix and probability.
Notice something very interesting about the 2nd layer of the probability tree. Its branches are four rates:
- True Negative Rate a.k.a. specificity.
- False Positive Rate a.k.a. Type 1 error rate.
- False Negative Rate a.k.a. Type 2 error rate.
- True Positive Rate a.k.a. sensitivity or recall.
These rates – being branches of a probability tree – are also conditional probabilities.
Here is a summary of these four rates in detail.
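A minimal sketch of these four conditional probabilities, computed from the same toy confusion matrix used earlier:

```python
from sklearn.metrics import confusion_matrix

# Toy labels; each rate is a probability conditioned on the ACTUAL class
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)   # P(predict positive | actually positive): sensitivity / recall
fnr = fn / (tp + fn)   # P(predict negative | actually positive): Type II error rate
tnr = tn / (tn + fp)   # P(predict negative | actually negative): specificity
fpr = fp / (tn + fp)   # P(predict positive | actually negative): Type I error rate
print(tpr, fnr, tnr, fpr)   # 0.8 0.2 0.667 0.333 (approx.)
```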
2. Metrics for Regression Tasks
MAE (Mean Absolute Error)
from sklearn.metrics import mean_absolute_error
The MAE metric computes the average of the absolute deviations between your model’s predictions and the data.
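A minimal sketch with toy values:

```python
from sklearn.metrics import mean_absolute_error

# MAE = average of |y_true - y_pred|
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]
print(mean_absolute_error(y_true, y_pred))   # 0.5
```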
MSE (Mean Square Error)
from sklearn.metrics import mean_squared_error
The MSE squares each of these deviations and computes their average, so large deviations are penalised more heavily.
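The same toy values, now with the deviations squared before averaging:

```python
from sklearn.metrics import mean_squared_error

# MSE = average of (y_true - y_pred)^2
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]
print(mean_squared_error(y_true, y_pred))   # 0.375
```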
R² (coefficient of determination)
from sklearn.metrics import r2_score
The R²-score effectively compares the MSE of your model with the MSE of the simple mean model, which always predicts the average of the data. For any model that does at least as well as the mean model, R² lies between 0 and 1, with 1 being the best.
The idea is that if your model is excellent, its MSE is far smaller than the MSE of the simple mean model, so the ratio of the two is close to 0 and the R²-score is close to 1.
However, when your model is bad, its MSE won’t be much different from the MSE of the simple mean model, so their ratio is close to 1 and the R²-score comes out close to 0.
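Here's a sketch that makes that comparison explicit on the toy values from above: compute the model's MSE, the mean model's MSE, and take one minus their ratio, which agrees with scikit-learn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Toy values from the MAE/MSE examples above
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

# R^2 = 1 - MSE(model) / MSE(mean model), where the mean model always predicts the average of y_true
mse_model = mean_squared_error(y_true, y_pred)
mse_mean = mean_squared_error(y_true, np.full_like(y_true, y_true.mean()))

print(1 - mse_model / mse_mean)   # ~0.9486
print(r2_score(y_true, y_pred))   # same value
```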
3. Summary
When evaluating the performance of classification algorithms, we start with the confusion matrix.
Here’s the matrix for a general binary classification task, from which we compute the scores for various metrics.
Accuracy is the simplest metric, but it doesn’t take into account the relative importance of Type I and II errors, nor is it robust against imbalanced datasets. More advanced metrics like precision, recall and F₁ remedy these issues. The ROC curve and AUC provide a graphical view of how well your classifier can distinguish between classes and separate signal from noise.
As we’ve seen, the four outcomes of the matrix can also be interpreted in a probability tree, where the TPR, TNR, FPR and FNR correspond to certain conditional probabilities.
For regression algorithms, the three most popular metrics are MAE, MSE and the R²-score. The latter is calculated from two MSEs – one of your model and one of the simple mean model.
In a sequel article, we’ll look at loss functions and how they’re used to optimise common supervised learning algorithms and neural network models. Cheers!
The diagrams in this article were created using Adobe Illustrator, LaTeXiT and Microsoft Excel.