
Guide to Confusion Matrices & Classification Performance Metrics

Accuracy, Precision, Recall, & F1 Score

Image by Afif Kusuma on Unsplash

In this article, we will explore confusion matrices and how they can be used to determine performance metrics in machine learning classification problems.

When running a classification model, our resulting outcome is usually a binary 0 or 1 result, with 0 meaning False and 1 meaning True. We can compare our resulting classification outcomes with our actual values of the given observation to judge the performance of the classification model. The matrix used to reflect these outcomes is known as a Confusion Matrix, and can be seen below:

Image by Author

There are four potential outcomes here: True Positive (TP) indicates the model predicted an outcome of true, and the actual observation was true. False Positive (FP) indicates the model predicted a true outcome, but the actual observation was false. False Negative (FN) indicates the model predicted a false outcome, while the actual observation was true. Lastly, we have the True Negative (TN), which indicates the model predicted an outcome of false, while the actual outcome was also false.
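To make these four outcomes concrete, here is a minimal sketch (using small, made-up label and prediction lists, not data from this article) of how each count can be tallied in plain Python:

## Hypothetical actual labels and model predictions (1 = True, 0 = False)
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
## Tally each of the four outcomes by comparing each prediction to its actual value
TP = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
FP = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
FN = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
TN = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
print(TP, FP, FN, TN)  ## 3 1 1 3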

Confusion matrices can be used to calculate performance metrics for classification models. Of the many performance metrics used, the most common are accuracy, precision, recall, and F1 score.

Accuracy: The formula for calculating accuracy, based on the chart above, is (TP+TN)/(TP+FP+FN+TN), or all true positive and true negative cases divided by the total number of cases.
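Continuing the small sketch above, accuracy is a one-line calculation from those four hypothetical counts:

## Accuracy: correct predictions (TP + TN) divided by all predictions
accuracy = (TP + TN) / (TP + FP + FN + TN)
print(accuracy)  ## (3 + 3) / 8 = 0.75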

Accuracy is commonly used to judge model performance; however, there are a few drawbacks to consider before using it liberally. One of these drawbacks concerns imbalanced datasets, where one class (either true or false) is far more common than the other, causing the model to classify observations based on that imbalance. For example, if 90% of cases are false and only 10% are true, there’s a very high possibility our model will have an accuracy score of around 90%. Naively, it may seem like we have a high rate of accuracy, but in actuality we are just 90% likely to predict the ‘false’ class, so we don’t actually have a good metric. Normally, I wouldn’t use accuracy as a performance metric on its own; I’d rather use precision, recall, or the F1 score.
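As a quick, hypothetical illustration of this drawback, a “model” that always predicts the majority class on a 90/10 dataset still scores about 90% accuracy:

## 90% of cases are false, 10% are true
actual_labels = [0] * 90 + [1] * 10
## A naive "model" that always predicts the false class
naive_predictions = [0] * 100
accuracy = sum(a == p for a, p in zip(actual_labels, naive_predictions)) / len(actual_labels)
print(accuracy)  ## 0.9, despite never identifying a single true case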

Precision: Precision is the measure of true positives over the total number of positives predicted by your model. The formula for precision can be written as TP/(TP+FP). This metric tells you the rate at which your positive predictions are actually positive.
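Using the same hypothetical counts from the earlier sketch, precision is computed as:

## Precision: true positives divided by all predicted positives
precision = TP / (TP + FP)
print(precision)  ## 3 / (3 + 1) = 0.75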

Recall: Recall (a.k.a. sensitivity) is the measure of true positives over the count of actual positive outcomes. The formula for recall can be expressed as TP/(TP+FN). Using this formula, we can assess how well our model identifies the actual true results.
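And recall, again with the same hypothetical counts:

## Recall: true positives divided by all actual positives
recall = TP / (TP + FN)
print(recall)  ## 3 / (3 + 1) = 0.75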

F1 Score: The F1 score is the harmonic mean of precision and recall. The formula for the F1 score can be expressed as 2(p*r)/(p+r), where ‘p’ is precision and ‘r’ is recall. This score can be used as an overall metric that incorporates both precision and recall. The reason we use the harmonic mean rather than the regular (arithmetic) mean is that the harmonic mean punishes values that are further apart.

An example of this can be seen with p = 0.4 and r = 0.8. Using our formula, we get 2(0.4*0.8)/(0.4+0.8), which simplifies to 0.64/1.2 = 0.533, while the regular mean would be (0.4+0.8)/2 = 0.6.
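The same arithmetic is easy to verify in code (p and r here are just the example values above, not model outputs):

p, r = 0.4, 0.8
## Harmonic mean (F1) versus the regular arithmetic mean
f1 = 2 * (p * r) / (p + r)
regular_mean = (p + r) / 2
print(round(f1, 3), regular_mean)  ## 0.533 0.6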

Which Performance Metric Should We Use? A very common question is which metric should be used, and when? The simple answer is: it depends. Unfortunately, there is no one-size-fits-all choice; each metric is important in its own way, providing a different piece of information about your classification model’s performance.

As stated earlier, accuracy is generally not a great metric to use for overall model performance, but it can be used to compare model outcomes while tuning your training data and finding optimal hyperparameter values.
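For example, a minimal sketch of using accuracy to compare candidate hyperparameter values might look like the following (the synthetic dataset and depth values here are purely illustrative, not part of this article’s example):

## Compare candidate max_depth values by cross-validated accuracy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
X, y = make_classification(n_samples = 500, random_state = 18)
for depth in [3, 5, 7]:
    model = RandomForestClassifier(max_depth = depth, random_state = 18)
    scores = cross_val_score(model, X, y, cv = 5, scoring = 'accuracy')
    print(depth, scores.mean())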

Precision, which measures the proportion of predicted positives that are actually positive, is a good metric to use when we want to focus on limiting false positives. An example of when precision would be a good metric is disaster relief efforts with limited resources. If you’re working in a relief effort and know you can only make 100 rescues while more than 100 may be needed to save everyone, you want to make sure the rescues you do make are true positives rather than wasting precious time chasing false positives.

Recall, which measures the true positive rate, is a good metric to use when we are focused on limiting false negatives. An example of this would be medical diagnostic tests such as those for COVID-19. If these diagnostic tests don’t limit false negatives, people who actually have COVID-19 risk spreading it to others while believing they are negative due to false-negative test results. False positives in this case aren’t as big an issue, because someone who doesn’t have COVID-19 but tests positive would isolate and test again, likely receiving a negative result and continuing with their life without majorly affecting others.

Maximizing the F1 score aims to limit both false positives and false negatives as much as possible. I personally like to use the F1 score as my general performance metric unless the specific problem warrants using either precision or recall.

Real Example: We will now learn how to generate a confusion matrix using the sklearn library, hand-calculate the resulting performance metrics, and show how to get the same results using sklearn. For this demonstration, I’ll refer to the base random forest model created in my earlier article (which can be located [here](https://www.kaggle.com/hesh97/titanicdataset-traincsv)). The data for this article is licensed CC0 – Public Domain and was initially published to Kaggle by Syed Hamza Ali (link to data here).
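If you don’t have the setup from the earlier article handy, a hypothetical preparation of x_train, x_test, y_train, and y_test could look like the sketch below. The feature columns chosen here are illustrative, not the exact preprocessing from that article, so the counts you get may differ from those shown further down.

## A hypothetical setup, assuming the Titanic train.csv from the Kaggle link above
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('train.csv')
x = df[['Pclass', 'SibSp', 'Parch', 'Fare']]  ## a few numeric features, for illustration only
y = df['Survived']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 18)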

## Import the classifier and run the training model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 200, max_depth = 7, max_features = 'sqrt', random_state = 18, criterion = 'gini').fit(x_train, y_train)
## Predict your test set on the trained model
prediction = rf.predict(x_test)
## Import the confusion_matrix function from the sklearn library
from sklearn.metrics import confusion_matrix
## Calculate the confusion matrix, listing the positive class (1) first
confusion_matrix(y_test, prediction, labels = [1, 0])
Image by Author

Using the confusion_matrix() function is as simple as the steps above once we’ve successfully trained our model and predicted on our holdout data. Because we passed labels = [1, 0], the rows (actual) and columns (predicted) are ordered with the positive class first, so the matrix reads [[TP, FN], [FP, TN]]. In this confusion matrix we see TP = 66, FP = 5, FN = 21, and TN = 131.

  • We can calculate accuracy as (66+131)/(66+5+21+131)=0.8834
  • Next we can calculate precision as 66/(66+5)=0.9296
  • Now we can calculate recall as 66/(66+21)=0.7586
  • Finally, we can calculate the F1 score as 2(0.9296*0.7586)/(0.9296+0.7586)=0.8354

Now that we’ve demonstrated calculating these metrics by hand, we will confirm our results and show how we can calculate these metrics using sklearn.

## Import the library and functions you need
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
## Accuracy
accuracy_score(y_test,prediction)
## Precision
precision_score(y_test,prediction)
## Recall
recall_score(y_test,prediction)
## F1 Score
f1_score(y_test,prediction)

Running this code will yield us the following results, which confirm our calculations.

Image by Author

Conclusion: There are many metrics one can use to determine the performance of a classification model. In this article we described confusion matrices and calculated, both by hand and with code, four common performance metrics: accuracy, precision, recall, and F1 score.

Thank you for taking the time to read this article! I hope you enjoyed reading and have learned more about performance metrics. If you enjoyed what you read, please follow my profile to be among the first to see future articles!

