
Model evaluation is an essential part of machine learning. For classification problems, the sheer number of available metrics can be confusing. This article first discusses and compares the commonly used metrics for binary classification problems, not only by their definitions but also by the situations in which some metrics are preferred over others. It then discusses how to adjust a model in favor of a particular metric.
Common metrics for binary classification problems
Binary classification is a typical supervised machine learning problem with two target values, usually referred to as the positive class and the negative class. When evaluating the performance of a model, the most common and straightforward metric is accuracy, the number of correct predictions divided by the total number of predictions:

Accuracy = (number of correct predictions) / (total number of predictions)

In at least two situations, accuracy is not enough:
First, when the data is imbalanced. If the training data is skewed toward the negative class, say 99% of the data belongs to it, then any model that always predicts "negative" obtains 99% accuracy. Does that make it an exceptional model to deploy? Obviously not. Thus, besides techniques for handling imbalanced datasets, we need metrics that can distinguish how accurate the model is on the positive class versus the negative class.
Second, when we want to know how a model performs on one class in particular. For example, if I care more about the model getting the positive class right than the negative class, accuracy by itself does not give me the information I need.
Therefore, we need other metrics to find the best model for different scenarios. Let's discuss the "accuracy" in the positive and negative classes separately. In this article, I will connect machine learning model evaluation with traditional statistical hypothesis testing to illustrate how certain concepts relate. In hypothesis testing, we usually define the null hypothesis as no effect (negative), while the alternative hypothesis represents a positive effect. In a binary machine learning model, we test whether a data point is in the positive or the negative class. Thus, predicting that a data point is in the negative class is analogous to failing to reject the null hypothesis (accepting H0) in hypothesis testing. Here is a table that connects hypothesis testing and machine learning:
                              Predict negative (accept H0)    Predict positive (reject H0)
True negative (H0 is true)    True Negative (TN)              False Positive (FP)
True positive (H0 is false)   False Negative (FN)             True Positive (TP)
Columns 2 and 3 show the model's predictions, whether we are using hypothesis testing or a machine learning model, while rows 2 and 3 represent the true class of the data. We define a True Negative (TN) as the model predicting negative when the data is in the negative class, and likewise a True Positive (TP) as the model predicting positive when the data is in the positive class. We further define a False Positive (FP) as the model predicting positive while the data is in the negative class, and a False Negative (FN) as the model predicting negative while the data is in the positive class. The true predictions are TN+TP, while the false predictions are FP+FN. In statistical analysis, an FP is a Type I error, because you reject H0 (reject negative) when H0 is true (truly negative). An FN is a Type II error, because you accept H0 (accept negative) while H0 is false (truly positive). One minus the Type I error rate is the confidence level, while one minus the Type II error rate is the statistical power.
Going back to binary model metrics, we can redefine accuracy as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
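As a quick illustration, here is a minimal sketch (using scikit-learn and made-up labels, so the numbers are purely hypothetical) that reads TN, FP, FN, and TP off the confusion matrix and recovers accuracy from them:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]  # hypothetical true classes (0 = negative, 1 = positive)
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # hypothetical model predictions

# scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp)                            # 5 1 1 3
print(accuracy, accuracy_score(y_true, y_pred))  # both 0.8
```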
To measure the "accuracy" in each class, I will introduce two pairs of metrics: recall and precision, and sensitivity and specificity.
Recall and Precision: what and when
If we want to make sure that when the model predicts positive, the data is very likely truly positive, then we can check precision:

Precision = TP / (TP + FP)

which is also called the Positive Predictive Value.
If we want to make sure the model catches as many truly positive points as possible, i.e., it rarely labels a positive point as negative, then we can check recall:

Recall = TP / (TP + FN)

which is the same formula as sensitivity.
Comparing precision and recall, we can see that the two formulas are the same except that precision has FP in the denominator while recall has FN. Thus, to increase precision, the model needs to produce as few false positives as possible, while false negatives do not matter. In contrast, to increase recall, the model needs to keep false negatives small, without regard to false positives.
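To make the contrast concrete, here is a short sketch (same hypothetical labels as above) computing both metrics with scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]  # hypothetical true classes
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # hypothetical predictions

# precision = TP / (TP + FP): falls when the model raises false alarms
# recall    = TP / (TP + FN): falls when the model misses positive points
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
```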
Let's consider a scenario in which we are trying to find the right customers to target with advertisements. Define the positive class as the customers who would make a purchase after seeing an advertisement. The goal is to show advertisements to positive customers so they make a purchase, which makes the best use of the advertising budget and increases the company's profit. How should we evaluate a model to find the best one? We can only answer this question based on the business setting.
Scenario one: the cost of advertising is high. The cost can be the actual cost of sending advertisements, measured in employee time and effort, or the cost of losing potential customers by showing them advertisements too frequently. When the advertising cost is high, the model needs to be very precise in its positive predictions. Thus, precision is the best metric, since we want most of the positive predictions to be correct.
Scenario two: the cost of advertising is low, so it is acceptable to show advertisements to customers who are actually in the negative class. Here the goal is to make sure every customer in the positive class receives an advertisement; we should not mistakenly leave any positive customer in the negative class. Thus, the model needs to reduce false negatives, which increases recall.
Sensitivity and Specificity: what and when
The other pair of metrics, sensitivity and specificity, is commonly used in medical settings, for example when testing whether a patient has a certain disease:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)
Sensitivity measures how many of all truly positive points the model predicts correctly, while specificity measures how many of all truly negative points the model predicts correctly. In the example of testing whether a person is healthy (negative) or sick (positive), sensitivity measures how many of all sick patients are successfully identified; a high sensitivity means the test correctly identifies patients who have the disease. Specificity measures how many of all healthy people test negative; the specificity of a test reflects how well it identifies patients who do not have the disease.
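scikit-learn has no dedicated specificity function, but both metrics fall straight out of the confusion matrix; the sketch below uses hypothetical labels where 0 means healthy and 1 means sick:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]  # 0 = healthy, 1 = sick (hypothetical)
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # hypothetical test results

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # share of sick patients the test catches
specificity = tn / (tn + fp)  # share of healthy people the test clears
print(sensitivity, specificity)  # 0.75 and roughly 0.83
```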
Combinations of metrics
In some scenarios we only care about lowering FP or FN, but in other cases we care about both. Besides accuracy, there are metrics that account for both FP and FN, so we can adjust the model to lower them at the same time. The widely used ones are AUC (area under the ROC curve), F scores, and so on. The ROC (receiver operating characteristic) curve shows the performance of a classification model at all classification thresholds by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR):

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)
A higher AUC is always more desirable. In the perfect situation, when there is a threshold at which FN and FP are both zero, TPR is one while FPR is zero, which makes the AUC equal to one. Thus, AUC accounts for both FN and FP.
In the same way, we can use the Precision-Recall curve instead of the ROC curve and calculate the area under it. The difference in the graph is that precision and recall are plotted on the axes rather than TPR and FPR.
Both curves help measure FN and FP at the same time. However, ROC curves are most informative when the classes are roughly balanced, while Precision-Recall curves are preferable when there is a moderate to large class imbalance.
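Here is a minimal sketch comparing the two curve-based summaries; the predicted probabilities (y_score) are made up for illustration, and note that both need scores rather than hard class labels:

```python
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]                      # hypothetical classes
y_score = [0.1, 0.3, 0.7, 0.2, 0.9, 0.4, 0.8, 0.2, 0.6, 0.1]  # hypothetical P(positive)

roc_auc = roc_auc_score(y_true, y_score)          # area under the ROC curve
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)                   # area under the Precision-Recall curve
print(roc_auc, pr_auc)
```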
The F score combines recall and precision in one equation, so it measures and helps lower FN and FP at the same time. The most common variant, the F1 score, is the harmonic mean of the two:

F1 = 2 * Precision * Recall / (Precision + Recall)
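A minimal sketch of the F score in code, on the same hypothetical labels as before; the F-beta variant (fbeta_score) lets you weight recall more heavily than precision when beta is greater than one:

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]  # hypothetical true classes
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # hypothetical predictions

print(f1_score(y_true, y_pred))             # 2 * P * R / (P + R)
print(fbeta_score(y_true, y_pred, beta=2))  # beta > 1 favors recall over precision
```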
How to adjust the model in favor of specific metrics?
The last question one may ask is: if I know I want to increase recall and precision is not important to me, how should I adjust the model? For example, in a tumor-detection study, we expect high recall because we do not want to leave any tumor untreated. When using a machine learning model to solve this binary classification problem, we want the number of false negatives to be as small as possible; in other words, we want to be very cautious about predicting the negative class. If we are using logistic regression, for example, the default threshold for predicting the positive or negative class is 0.5. If we want to increase recall, we can lower this threshold below 0.5. The model then makes fewer negative predictions, and the number of false negatives drops as well.
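Here is a minimal sketch of that idea, assuming a scikit-learn logistic regression on a synthetic, imbalanced dataset; lowering the cut-off from the default 0.5 should raise recall, usually at the expense of precision:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data just for illustration (about 80% negative class)
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

for threshold in (0.5, 0.3):               # 0.5 is the default decision threshold
    y_pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"recall={recall_score(y_test, y_pred):.2f}, "
          f"precision={precision_score(y_test, y_pred):.2f}")
```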
This concludes the guide to binary classification model metrics. If you are interested in more details about how to select machine learning models, the blog post below discusses effective model selection methods using resampling and probabilistic approaches, popular model evaluation methods, and machine learning model trade-offs:
The Ultimate Guide to Evaluation and Selection of Models in Machine Learning – neptune.ai
Thank you for reading! Here is the list of all my blog posts. Check them out if you are interested!