Understanding Confusion Matrix, Precision-Recall, and F1-Score

Why accuracy shouldn’t be the only performance metric you care about while evaluating a Machine Learning model

Pratheesh Shivaprasad
Towards Data Science



As I was going over several notebooks on Kaggle over the past few days, I couldn’t help but notice notebooks titled “Achieving 100% accuracy on dataset_name” or “Perfect 100% accuracy using algorithm_name”, along with various other guides on how to achieve 100% accuracy on every dataset you come across. While some of these notebooks did a great job of building a generalized model for the dataset and delivered pretty good results, a majority of them were just overfitting on the data.

Overfitting is the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably — Wikipedia

And the saddest part about all this? They didn’t even realize they were overfitting on the dataset while trying to achieve that golden number. Most of these notebooks were written for beginner-friendly datasets like the “Iris dataset” or the “Titanic dataset”, and it makes sense, right? Most of us, when starting out on the Machine Learning trajectory, were taught only one thing: “Accuracy matters”. And while this is true, it matters only up to a certain extent. This is why I’ll be discussing some other performance metrics, like the Confusion Matrix, Precision-Recall, and F1-Score, that you should consider using along with Accuracy while evaluating a Machine Learning model. Let’s get started.

Confusion Matrix

In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one. — Wikipedia

Confusion Matrix for a two-class classification problem (Image Source: Author)

To understand the confusion matrix, let us consider a two-class classification problem with the two outcomes being “Positive” and “Negative”. Given a data point to predict, the model’s outcome will be one of these two classes.

If we plot the predicted values against the ground truth (actual) values, we get a matrix with the following representative elements:

True Positives (TP): These are the data points whose actual outcomes were positive and which the algorithm correctly identified as positive.

True Negatives (TN): These are the data points whose actual outcomes were negative and which the algorithm correctly identified as negative.

False Positives (FP): These are the data points whose actual outcomes were negative but which the algorithm incorrectly identified as positive.

False Negatives (FN): These are the data points whose actual outcomes were positive but which the algorithm incorrectly identified as negative.

As you can guess, the goal of evaluating a model using the confusion matrix is to maximize the values of TP and TN and minimize the values of FP and FN.
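As a quick illustration, here is a minimal Python sketch (with made-up labels, where 1 stands for “Positive” and 0 for “Negative”) that counts these four values from a list of predictions:

# Toy ground-truth labels and model predictions (1 = Positive, 0 = Negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # actual 1, predicted 1
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # actual 0, predicted 0
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # actual 0, predicted 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # actual 1, predicted 0

print(tp, tn, fp, fn)  # 3 3 1 1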

To understand this concept better, let us take a real-life example: predicting heart disease. In this case, the outcome is whether or not a patient has heart disease. The confusion matrix would look something like this:

Confusion Matrix for heart disease prediction (Image Source: Author)

Here, TP means that the patient actually has heart disease and the algorithm predicted it correctly. TN means the patient doesn’t have heart disease and the algorithm predicted it correctly. So the goal should be to keep these values as high as possible.

How does the confusion matrix help detect overfitting?

To understand this, let’s consider the heart disease example. Most of the time when we’re dealing with medical use cases, there is a high chance of having a skewed dataset, i.e., one of the target classes has far more data points than the other. In this case, most of the people undergoing the test will not be diagnosed with any heart disease, so there is an imbalance (skewness) in the dataset.

Heart Disease Example depicting a skewed dataset (Image Source: Author)

If we train a model on the dataset shown in the image above, since there are far more data points for patients without heart disease than for patients with heart disease, the model will be biased towards the class with more data points (especially if we’re evaluating the model solely based on its accuracy).

Now, after training, if we evaluate the model on the test set (which has 8000 data points for patients without heart disease and only 2000 for patients with heart disease), even if the model predicts that none of the 10000 data points have heart disease, the accuracy still comes out to a whopping 80% (8000 correct predictions out of 10000). This can be misleading, especially in a field where the risk of False Negatives should be negligible (a model that classifies a patient who has heart disease as not having it could prove fatal).

This is where the confusion matrix comes into play. For the above scenario, the confusion matrix would look something like this:

Confusion Matrix for the scenario explained above (Image Source: Author)

Now, if we look at the confusion matrix along with the accuracy the model got, we can clearly see that the model is overfitting on the training dataset, as it predicts every unseen data point as a patient not having heart disease. If it weren’t for the confusion matrix, we would never have known about the underlying issue.

The scikit-learn package includes an implementation of the confusion matrix (sklearn.metrics.confusion_matrix). You can look up the official documentation for the details.
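For instance, here is a minimal sketch of the skewed test set described above (8000 negatives, 2000 positives) and a lazy model that predicts “no heart disease” for everyone; the labels are made up purely for illustration:

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# 8000 patients without heart disease (0) and 2000 with heart disease (1)
y_true = np.array([0] * 8000 + [1] * 2000)

# A "lazy" model that predicts "no heart disease" for every patient
y_pred = np.zeros(10000, dtype=int)

print(accuracy_score(y_true, y_pred))    # 0.8 -> a misleading 80% accuracy
print(confusion_matrix(y_true, y_pred))  # [[8000    0]
                                         #  [2000    0]]

The confusion matrix immediately exposes what the 80% accuracy hides: the second row shows 2000 false negatives and zero true positives.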

Precision-Recall

Now that you have understood what the Confusion Matrix does, it’ll be easier to understand Precision-Recall.

We have already seen how accuracy can be misleading in some cases. Precision and Recall help us further understand how well the reported accuracy holds up for a particular problem.

Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while Recall (also known as sensitivity) is the fraction of the total amount of relevant instances that were actually retrieved. Both precision and recall are therefore based on an understanding and measure of relevance. — Wikipedia

In simple terms, precision is the percentage of positive predictions that were actually correct.

Precision Formula (Image Source: Author)
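Written out in terms of the confusion matrix entries defined earlier, the precision formula is:

Precision = TP / (TP + FP)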

In our example of heart disease, it’d look something like this:

Precision Example (Image Source: Author)

It could be translated into simple language as, of all patients classified as having heart disease, how many of them actually had heart disease?

Recall, in simple terms, is the percentage of actual positives that were correctly classified by the classifier.

Recall Formula (Image Source: Author)
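Written out in terms of the confusion matrix entries, the recall formula is:

Recall = TP / (TP + FN)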

In our example, it’d look something like this:

Recall Example (Image Source: Author)

It basically asks, of all the patients that have heart disease, how many were classified as having heart disease?

The two formulas may seem almost identical at first, but once you get the gist, it’ll be harder to get confused between the two.

Precision and Recall are also available in the scikit-learn package (sklearn.metrics.precision_score and sklearn.metrics.recall_score). You can look up the official documentation for the details.
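As a quick sketch, reusing the toy labels from the confusion matrix example above (TP = 3, FP = 1, FN = 1):

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # 0.75 -> TP / (TP + FP) = 3 / 4
print(recall_score(y_true, y_pred))     # 0.75 -> TP / (TP + FN) = 3 / 4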

Precision-Recall Trade-Off

Suppose we train a Logistic Regression classifier to identify whether a patient has heart disease or not. It will predict that the patient has heart disease if the predicted probability is greater than or equal to a threshold of 0.5, and that the patient doesn’t have heart disease if the probability is less than 0.5.

Now, if we want to build the model in such a way that it predicts that a patient has heart disease only when it is very confident in that prediction, we might have to increase the threshold to 0.7 or 0.8.

In this scenario, we end up with a classifier having high precision and low recall. Higher precision, because the classifier now flags a patient only when it is very confident, so a larger share of its positive predictions are correct. Lower recall, because with the threshold set so high, fewer patients are classified as having heart disease and more actual cases are missed.

The alternative is to build the model in such a way that it misses as few possible cases of heart disease as it can (to avoid false negatives). If a patient with heart disease goes unnoticed by the model, it could prove fatal. In this case, we decrease the threshold to 0.2 or 0.3, so that even if there is a slight chance that the patient may have heart disease, the model raises an alarm and further diagnosis can be done to confirm it.

What we have here is an example of high recall and low precision. Higher recall, because we now classify a larger share of the patients who actually have heart disease as positive. Lower precision, because out of the large number of patients predicted as having heart disease, some of them won’t actually have it upon further diagnosis.

Generally speaking, the precision and recall values keep changing as you increase or decrease the threshold. Whether you build a model with higher precision or higher recall depends on the problem statement you’re dealing with and its requirements.
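To make the trade-off concrete, here is a minimal sketch that sweeps the threshold of a logistic regression classifier; it uses a synthetic, imbalanced dataset generated with scikit-learn as a stand-in for the heart disease data, purely for illustration:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the heart disease data (roughly 80/20 split)
X, y = make_classification(n_samples=10000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class

for threshold in (0.3, 0.5, 0.7):
    y_pred = (probs >= threshold).astype(int)
    print(threshold, precision_score(y_test, y_pred), recall_score(y_test, y_pred))

# As the threshold increases, precision tends to go up and recall tends to go down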

F1-Score

Precision and Recall values can be very useful for understanding the performance of a specific algorithm, and they also help in tuning the results to the requirements. But when it comes to comparing several algorithms trained on the same data, it becomes difficult to tell which algorithm suits the data better based solely on the Precision-Recall values.

Which algorithm is better solely based on Precision-Recall values? (Image Source: Author)

Hence there is a need for a metric that takes the precision and recall values and combines them into a single, standardized number.

In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test’s accuracy. It is calculated from the precision and recall of the test, where the precision is the number of correctly identified positive results divided by the number of all positive results, including those not identified correctly, and the recall is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive. — Wikipedia

The F1 score can also be described as the harmonic mean of precision and recall.

F1 Score Formula (Image Source: Author)
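Written out, the formula is:

F1 = 2 * (Precision * Recall) / (Precision + Recall)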

Having a precision or recall value of 0 is not desirable, and it gives us an F1 score of 0 (the lowest possible). On the other hand, if both precision and recall are 1, we get an F1 score of 1, indicating perfect precision and recall. Every other combination gives an F1 score between 0 and 1.
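To see why the harmonic mean matters, consider a hypothetical classifier with a precision of 1.0 but a recall of only 0.2: a simple average would give 0.6, but F1 = 2 * (1.0 * 0.2) / (1.0 + 0.2) ≈ 0.33, so a single weak value drags the whole score down.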

The F1 score is also available in the scikit-learn package (sklearn.metrics.f1_score). You can look up the official documentation for the details.
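Continuing the toy example from earlier, where precision and recall both came out to 0.75:

from sklearn.metrics import f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))  # 0.75, the harmonic mean of 0.75 and 0.75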

Conclusion

I hope this article helped you understand the terms Confusion Matrix, Precision-Recall, and F1 Score. Using these metrics will definitely help you get a better idea of your model’s performance. Once you have completely understood these concepts, you could also look into some other evaluation metrics like Log loss, the ROC-AUC curve, Categorical Crossentropy, and more.
