
The model assessment phase starts when we create a holdout set, which consists of examples the learning algorithm didn’t see during training. If our model performs well on the holdout set, we can say that it generalizes well and is of good quality.
The most common way to assess whether a model is good or not is to compute a performance metric on the holdout data.
This article will focus on performance metrics for binary classification models. This is worth specifying because regression tasks have completely different performance metrics to track.
Performance metrics for binary classification
The metrics we’re going to cover are:
- Accuracy
- Precision and recall
- F1 score
- Log loss
- ROC-AUC
- Matthews Correlation Coefficient (MCC)
Accuracy
When we want to analyze the performance of a binary classifier, the most common and accessible metric is certainly accuracy. It tells us how many items in our dataset the model has classified correctly, relative to the total.
In fact, the formula for accuracy is simply the number of correct predictions divided by the total number of predictions:
Accuracy = (number of correct predictions) / (total number of predictions)
Accuracy measures the model’s performance in the narrowest sense: on its own, it does not allow us to understand the context in which we are operating.
Taken out of context, accuracy is a very delicate metric to interpret.
For example, it is not recommended to use accuracy as an evaluation metric when working with an unbalanced dataset, where classes are unevenly distributed. If accuracy is nothing more than the ratio of correct answers to the total, then you will understand that if a class makes up 90% of our dataset and our model (wrongly) classifies each example in the dataset with that specific class, then its accuracy will be 90%.
If we are not careful, we might think that our model is performing very well, when in reality it is far from it.
However, accuracy is a sensible metric to use if we are sure that our dataset is balanced and that the data is of high quality.
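To make the 90% example concrete, here is a minimal sketch using Sklearn’s accuracy_score on an imaginary, heavily unbalanced set of labels (both the labels and the “always predict the majority class” model are made up for illustration):

```python
from sklearn.metrics import accuracy_score

# Hypothetical unbalanced holdout set: 90% of the labels are 0, 10% are 1
y_true = [0] * 90 + [1] * 10

# A model that ignores its input and always predicts the majority class
y_pred = [0] * 100

# The model never recognizes the minority class, yet accuracy looks great
print(accuracy_score(y_true, y_pred))  # 0.9
```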
Precision and recall
To better understand the concepts of precision and recall we will use a confusion matrix to introduce the topic. It is called this because it communicates to the analyst the degree of error (and therefore confusion) of the model.
To create a confusion matrix, all we have to do is list the actual classes present in the dataset on the rows and the classes predicted by the model in the columns. Each cell counts the examples that fall into that combination of actual and predicted class.
Let’s see an example using Sklearn as a source
[Figure: confusion matrix plotted with Sklearn]
This is an example of a confusion matrix for a classifier applied to the famous Iris dataset (which has three classes, but the same reading applies to the binary case).
The values on the diagonal indicate the points where the prediction of the classification model corresponds to the real class present in the dataset. The more elements lie on the diagonal, the more of our predictions are correct (careful, I did not say that the model performs well! Remember the accuracy issue above).
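Since the Iris example is multi-class, here is a minimal binary sketch instead; the breast cancer dataset and the logistic regression model are my own choices for illustration, not something the article prescribes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Binary dataset bundled with Sklearn (classes: malignant / benign)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# For a binary problem the four cells can be unpacked directly
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```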
Let’s abstractly represent the confusion matrix in order to better understand the values that fill the cells.
|  | Predicted negative | Predicted positive |
| --- | --- | --- |
| Actual negative | True Negative (TN) | False Positive (FP) |
| Actual positive | False Negative (FN) | True Positive (TP) |
We see a number of labels: true negatives, true positives, false negatives, and false positives. Let’s see them one by one.
- True Negatives (TN): examples that have been correctly classified as negative.
- True Positives (TP): examples that have been correctly classified as positive.
- False Negatives (FN): examples that have been incorrectly classified as negative, and are therefore actually positive.
- False Positives (FP): examples that have been incorrectly classified as positive, and are therefore actually negative.
By using a confusion matrix we are therefore able to better understand the behavior of our classifier and how to improve it further.
To continue, let’s see how to derive the accuracy formula from the confusion matrix:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
This is nothing more than the number of correct answers divided by the total.
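As a quick worked example with hypothetical counts: if TP = 40, TN = 45, FP = 5 and FN = 10, then accuracy = (40 + 45) / (40 + 45 + 5 + 10) = 85 / 100 = 0.85.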
Now let’s define precision with the formula
Precision = TP / (TP + FP)
How is it interpreted? Precision is essentially the accuracy computed only on the examples the model labels as positive. It is also known as the positive predictive value, since it defines how trustworthy the instrument is when it claims to have recognized the signal. In fact, the metric tells us how often we are correct when we classify an example as positive.
Let’s take another example: we installed an alarm system in our house with a facial recognition algorithm. It is connected to cameras and a control unit that sends a notification to an app on our phone if someone it does not recognize as a friend or family member enters the house.
A high-precision model will alert us just a few, sporadic times, but when it does we can be pretty sure it really is an intruder! In other words, when the model raises the alarm, it has almost certainly distinguished a real intruder from a family member.
Recall, on the other hand, sits on the other side of the scale. If we are interested in recognizing as many positive examples as possible, then our model will have to have a high recall score.
Its formula is
Recall = TP / (TP + FN)
In practice, this means that here we take into account false negatives instead of false positives. Recall is also called sensitivity (or true positive rate), since it measures how many of the actual positives the model manages to catch. As we push recall up, the model tends to label more and more examples as positive, and so it also lets in more false positives.
Let’s look at an example that involves recall: we are radiologists and we have trained a model that uses computer vision to detect the presence of lung tumors.
In this case we want our model to have high recall: every example the model flags as positive will be subjected to human inspection, so what matters most is not missing any actual tumors. We don’t want a malignant tumor to go unnoticed, and we will gladly accept some false positives in exchange.
To summarize, let’s see this analogy:
A high precision model is conservative: it doesn’t always recognize the class correctly, but when it does, we can be assured that its answer is correct. A high recall model is liberal: it recognizes a class much more often, but in doing so it tends to include a lot of noise as well (false positives).
The attentive and curious reader will have noticed that it is hard to build a model with both very high precision and very high recall. In fact, the two metrics pull in opposite directions: pushing one up typically pushes the other down. This is called the precision / recall trade-off.
Our goal as analysts is to contextualize and understand which metric offers us the most value.
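As a minimal sketch with made-up labels and predictions (not from a real model), Sklearn exposes both metrics directly:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical holdout labels and predictions, just to illustrate the two metrics
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Precision: of the examples predicted positive, how many really are positive?
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) ≈ 0.67

# Recall: of the actual positives, how many did the model catch?
print(recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.50
```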
F1 score
At this point it is clear that using precision or recall alone as an evaluation metric is tricky, because improving one tends to come at the expense of the other. The F1 score addresses this problem.
In fact, the F1 score combines precision and recall into one metric.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
This is the harmonic mean of precision and recall, and is probably the most used metric for evaluating binary classification models.
If our F1 score increases, it means that our model has improved in precision, recall, or both.
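As a minimal sketch (reusing the same made-up labels as above), Sklearn’s f1_score returns exactly this harmonic mean:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Same hypothetical labels as in the precision / recall sketch
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # ≈ 0.67
recall = recall_score(y_true, y_pred)        # 0.50

# Harmonic mean of precision and recall
manual_f1 = 2 * precision * recall / (precision + recall)
print(manual_f1, f1_score(y_true, y_pred))   # both ≈ 0.57
```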
Log loss
Log loss is a common evaluation metric, especially on Kaggle. Also known as cross-entropy in the context of deep learning, this metric measures the difference between the probabilities predicted by the model and the observed reality. It applies to models whose output is an estimate of the probability that an example belongs to the positive class.
This metric is mathematically more complex than the previous ones and there is no need to go into depth to understand its usefulness in evaluating a binary classification system.
Here is the formula for completeness
Log loss = -(1/n) * Σ_i [ y_i * log(ŷ_i) + (1 - y_i) * log(1 - ŷ_i) ]
Here n stands for the number of examples in the dataset, y_i for the observed label of example i, and ŷ_i for the probability predicted by the model for that example.
I will not continue with the explanation of this formula because it would take us off track. Google is your best friend in case you want to learn more 🙂
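As a minimal sketch with made-up labels and probabilities, Sklearn’s log_loss implements exactly this formula and punishes predictions that are confident but wrong:

```python
from sklearn.metrics import log_loss

# Hypothetical observed labels and predicted probabilities of the positive class
y_true = [1, 1, 0, 0]

# Confident and correct predictions -> low log loss
print(log_loss(y_true, [0.9, 0.8, 0.2, 0.1]))  # ≈ 0.16

# Confident but wrong on one example -> the penalty grows quickly
print(log_loss(y_true, [0.9, 0.1, 0.2, 0.1]))  # ≈ 0.68
```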
ROC-AUC
The ROC-AUC metric is based on a graphical representation of the receiver operating characteristic curve. I won’t try to explain it in my own words, because this time Wikipedia does a really good job:
ROC curves […] are graphical schemes for a binary classification model. Sensitivity and (1-specificity) can be represented along the two axes, respectively represented by True Positive Rate (TPR, fraction of true positives) and False Positive Rate (FPR, fraction of false positives). **In other words, the ROC curve shows the relationship between true alarms (hit rate) and false alarms.**
The final sentence in bold (emphasis mine) is the one that makes the description of the ROC curve we just read intuitive. Obviously we want the relationship between true and false alarms to be in favor of the true ones, because better performing models will do exactly that.
Let’s see what this graph looks like
[Figure: ROC curves for several models, with the dashed diagonal of a random classifier]
AUC stands for Area Under the Curve. If we focus on the blue line, we see that the area below it is in fact larger than the area below the green and orange lines. The dashed line corresponds to a ROC-AUC of 0.5.
Consequently, a good model will have a large ROC-AUC, while a poor model will sit near the dashed line, which is nothing more than a random model.
The ROC-AUC metric is also very useful for comparing different models against each other.
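As a minimal sketch with made-up labels and scores, roc_auc_score takes the true labels and the model’s scores (or probabilities), which makes it convenient for comparing models:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities from two different models
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores_model_a = [0.1, 0.3, 0.2, 0.4, 0.8, 0.7, 0.9, 0.6]  # separates the classes well
scores_model_b = [0.4, 0.6, 0.5, 0.3, 0.6, 0.4, 0.7, 0.5]  # much noisier

print(roc_auc_score(y_true, scores_model_a))  # 1.0
print(roc_auc_score(y_true, scores_model_b))  # ≈ 0.72, noticeably worse

# roc_curve returns the FPR / TPR pairs used to draw the curve itself
fpr, tpr, thresholds = roc_curve(y_true, scores_model_a)
```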
Matthews Correlation Coefficient (MCC)
Here we see the last evaluation metric for binary classification models, one that is designed to stay informative even for models trained on unbalanced datasets.
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
The formula seems like a tongue twister, but it actually behaves like a correlation coefficient. It therefore ranges between -1 and +1.
A value close to +1 indicates a strong correlation between the actual observed values and the predictions made by our model, and the metric remains a reliable measure of quality even in contexts with unbalanced classes in the dataset.
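As a minimal sketch, Sklearn exposes the metric as matthews_corrcoef; reusing the made-up unbalanced labels from the accuracy example, note how MCC exposes the “always predict the majority class” model that accuracy rewarded:

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Hypothetical unbalanced holdout set: 90% negatives, 10% positives
y_true = [0] * 90 + [1] * 10

# A model that always predicts the majority class
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))     # 0.9, looks great
print(matthews_corrcoef(y_true, y_pred))  # 0.0, no correlation with reality at all
```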
Conclusion
As with the evaluation metrics for regression models, Sklearn provides several methods to quickly calculate these metrics. Here is a link to the documentation.
As a final note, these are metrics for evaluating a binary classification model. For multi-class classification, for example, you can compute one of these metrics for each class and then aggregate the results over all the examples, for instance by averaging (micro / macro averaging). But that’s a talk for another article 🙂
If you want to support my content creation activity, feel free to follow my referral link below and join Medium’s membership program. I will receive a portion of your investment, and you’ll be able to access Medium’s plethora of articles on Data Science and more in a seamless way.
See you in the next article 👋