
The simplest way to assess the quality of a classification model is to pair the values we expected with the values the model predicted and count all the cases in which we were right or wrong; that is – construct a confusion matrix.
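To make this concrete, here is a minimal sketch in Python, with made-up labels, that pairs the actual and predicted values and tallies the four possible outcomes (a library helper such as scikit-learn's `confusion_matrix` produces the same counts):

```python
# A minimal sketch: build the four confusion-matrix counts by pairing
# actual (expected) labels with predicted labels. The labels are invented
# purely for illustration.
from collections import Counter

actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

counts = Counter(zip(actual, predicted))

tp = counts[(1, 1)]  # actual positive, predicted positive
fn = counts[(1, 0)]  # actual positive, predicted negative (a miss)
fp = counts[(0, 1)]  # actual negative, predicted positive (a false alarm)
tn = counts[(0, 0)]  # actual negative, predicted negative

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=4 FN=1 FP=1 TN=4
```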
For anyone who has come across classification problems in machine learning, a confusion matrix is a fairly familiar concept. It plays a vital role in helping us evaluate classification models and provides clues on how we can improve their performance.
Although classification tasks can produce discrete outputs, these models tend to have some degree of uncertainty.
Most model outputs can be expressed as probabilities of belonging to a class. Typically, a decision threshold, which allows the model to map an output probability to a discrete class, is set at the prediction step. Most frequently, this threshold is set to 0.5.
However, depending on the use-case and on how well the model is able to capture the right information, this threshold can be adjusted. We can analyze how the model performs at various thresholds to achieve the desired results.
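As a small illustration of that mapping, the sketch below applies a few thresholds to some invented probabilities; the values and the `to_classes` helper are made up for the example:

```python
# Sketch: mapping predicted probabilities of the positive class to discrete
# classes with a decision threshold. The probabilities are invented.
probs = [0.10, 0.35, 0.48, 0.52, 0.70, 0.91]

def to_classes(probabilities, threshold=0.5):
    """Label a case positive (1) when its probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

print(to_classes(probs))                 # default 0.5 -> [0, 0, 0, 1, 1, 1]
print(to_classes(probs, threshold=0.3))  # looser threshold -> more positives
print(to_classes(probs, threshold=0.7))  # tighter threshold -> fewer positives
```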

Although we can inspect the confusion matrix at each threshold, it would be much easier to analyze aggregated information. Moreover, since rates are easier to compare than raw counts, there is motivation to express the actual and predicted outcomes relative to one another. For example, we can define the true positive rate as the percentage of values we expected to be positive that were predicted to be positive by our model.
Thus, a data scientist has access to several methods to analyze the results. Two of the most common approaches used today are "Specificity and Sensitivity" and "Precision and Recall". Specificity and sensitivity, first introduced in 1947 [[source](https://doi.org/10.1016%2Fs0911-6044%2803%2900059-9)], are used mostly in clinical settings to assess the performance of medical tests. Precision and recall emerged much later but quickly became powerful metrics for assessing machine learning results [source]. Other measures like "Informedness and Markedness" are less well known but still valuable in many instances.
Although many ways of analyzing confusion matrix results exist today, I will summarize the two aforementioned ones and drive home the point I hope to make – the method you choose should depend on your individual use case and the consequences of your model’s predictions.
In the following sections, I will discuss the metrics in reference to a 2×2 confusion matrix for a binary classification task whose outcomes are denoted as positive and negative.
Sensitivity and Specificity
Sensitivity is another term for a true positive rate (TPR). It is the count of correctly predicted positive values divided by the total count of actual positive values.
Sensitivity = True Positives / Actual Positives
Sensitivity measures the model’s ability to correctly identify positive cases (true positives) among all actual positives. If we want to make a model more sensitive to positive values, we will attempt to increase the number of correctly predicted positive values, which may increase the overall number of results predicted as positive.
On the other hand, specificity is the true negative rate (TNR), which is the count of correctly predicted negative values out of all actual negative values.
Specificity = True Negatives / Actual Negatives
Specificity = 1 – False Positives / Actual Negatives
It may be helpful to think of specificity as 1 minus the false positive rate. Since false positives add to the total number of predicted positive values, if our model is more specific, then fewer negative cases will be classified as positive ones and our output may contain fewer overall positive predictions.
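To make the two formulas concrete, here is a small sketch with arbitrary counts; the helper functions are not from any particular library:

```python
# Sensitivity and specificity computed from the four confusion-matrix counts.
def sensitivity(tp, fn):
    # True positive rate: correct positive predictions / all actual positives
    return tp / (tp + fn)

def specificity(tn, fp):
    # True negative rate: correct negative predictions / all actual negatives
    return tn / (tn + fp)

# Counts chosen arbitrarily for illustration.
tp, fn, fp, tn = 80, 20, 10, 90

print(f"sensitivity = {sensitivity(tp, fn):.2f}")   # 0.80
print(f"specificity = {specificity(tn, fp):.2f}")   # 0.90
print(f"1 - FPR     = {1 - fp / (fp + tn):.2f}")    # 0.90, same as specificity
```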
Ideally, we want a model or a test to have both high sensitivity and high specificity. However, in real life, we may be dealing with a tradeoff. If there is some uncertainty in our classification task, the threshold value will impact how our metrics behave. Lowering the decision threshold allows more cases to be classified as positive, increasing sensitivity but potentially raising the false positive rate and reducing specificity. Setting a tighter decision threshold will cause some actual positives to be misclassified as negative; the true positive rate will therefore decrease, and we will assess that our model is less sensitive to positive values.
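The sketch below illustrates that tradeoff on a tiny set of invented scores: as the threshold drops, sensitivity rises and specificity falls.

```python
# Threshold tradeoff on synthetic scores (all values invented for illustration).
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.60, 0.45, 0.30, 0.20, 0.05]

for threshold in (0.3, 0.5, 0.7):
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    print(f"threshold={threshold}: "
          f"sensitivity={tp / (tp + fn):.2f}, specificity={tn / (tn + fp):.2f}")

# Output: lowering the threshold raises sensitivity and lowers specificity:
#   threshold=0.3: sensitivity=1.00, specificity=0.40
#   threshold=0.5: sensitivity=0.80, specificity=0.80
#   threshold=0.7: sensitivity=0.40, specificity=1.00
```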
Intuitively, we can illustrate these metrics with an example. Suppose I ask my toddler to fetch some markers for me. If I am not very specific in my request, he may return a mix of markers and pencils, but if I am too specific, he may not be able to find that many markers. If I have greater sensitivity, I may only want to draw with markers; but if my sensitivity is lower, I may be okay drawing with either markers or pencils.
In some instances like disease detection, missing positives can have severe consequences so we should aim for a model with higher sensitivity. Conversely, in some cases like fraud detection, having too many predicted positives can lead to large costs so higher specificity is preferred.
Precision and Recall
Recall and sensitivity are mathematically equivalent metrics and represent a rate of true positive cases.
Recall = True Positives / Actual Positives
The motivation behind this metric is to quantify how well a model can recall actual positive values. A model with greater recall is better at identifying the actual positive cases in the dataset, even if it predicts fewer total positive cases.
Precision measures how many of the positive cases our model predicted are actually positive.
Precision = True Positives / Predicted Positives
If we move our decision threshold to include a greater range of probabilities for positive values, our total number of predicted positive cases will increase. As the threshold is lowered to predict more positives, the total number of false positives is likely to increase as well, reducing precision (since the ratio of true positives to predicted positives decreases).
Conversely, we can increase precision if we limit the number of false positives, either by improving our model or by imposing a tighter bound on the range of probabilities mapped to positive values.
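Here is the same kind of sketch for precision and recall, reusing the invented scores from the earlier example; tightening the threshold raises precision but lowers recall:

```python
# Precision/recall tradeoff on the same synthetic scores as before.
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.60, 0.45, 0.30, 0.20, 0.05]

for threshold in (0.3, 0.5, 0.7):
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # guard: no predicted positives
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")

# Output:
#   threshold=0.3: precision=0.62, recall=1.00
#   threshold=0.5: precision=0.80, recall=0.80
#   threshold=0.7: precision=1.00, recall=0.40
```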
One way to understand the intuition behind precision and recall and their tradeoff is through this fishing analogy. Suppose we are attempting to catch fish which are swimming in a small cluster among some seaweed. We can try to be less precise and cast a wider net: this should result in more fish, but we will also get more seaweed in our net. If we try to be more precise and cast a smaller net, we may get most of the fish and very little seaweed; but we can also miss our fish cluster and catch very few fish.

Again, there is usually a tradeoff between precision and recall, and achieving the right balance depends on the use case. For example, while high recall can benefit disease detection, where we are interested in correctly identifying all ill patients, high precision can be more helpful in identifying fraud, because we want to make sure that we don’t overwhelm fraud investigators with too many false positives.
Which pair of metrics should be used?
Although, by convention, sensitivity and specificity are used in clinical settings while precision and recall are favored in machine learning, the choice of metric pair depends on the use case.
Since sensitivity and recall are mathematically equivalent, the key difference lies in specificity and precision as well as the decisions the model intends to make.
Mathematically, the denominator of specificity is the number of actual negative values, while the denominator of precision is the number of predicted positive values. Yet both metrics are related to false positives. Specificity depends on the actual number of negative cases; precision is independent of the actual class counts and concerns itself only with our predictions.
Therefore, sensitivity and specificity are better measures in use cases where we care about correctly capturing negative cases – especially when a negative value has a risk or a bounty associated with it. The reason sensitivity and specificity are frequently used in clinical settings is the cost associated with misclassifying actual negative cases: having low specificity may send healthy patients through unnecessary treatment.
Machine learning has analogous cases as well. An example of a model that carries a penalty associated with the rate of false positives is one that attempts to predict whether a stock price will be higher or lower a year from now. Having low specificity could cause our client to lose money on poor investments our model thought would increase in value.
On the other hand, use cases that care most about accurately identifying positive cases and ensuring that our positive predictions are spot on can benefit from precision and recall analysis. The reason we see these metrics most often in machine learning is that we frequently care more about finding all the positive cases and ensuring that our positive predictions are, indeed, positive. But even some medical tests can benefit from this approach if the cost associated with incorrectly predicting negative values doesn’t carry too much risk.

Finally, let’s take a look at an example of a balanced dataset where our sensitivity (and, equivalently, recall) is high but our false positive rate, and therefore our true negative rate, varies. In this example, specificity responds far more strongly to the change in the false positive rate. Most dramatically, when only about half of our positive predictions are correct and almost all actual negative values are misclassified, specificity will be practically zero while precision remains comparatively high, at around one half.

Therefore, as long as the true positive rate remains high, precision won’t respond to a change in the false positive rate as much as specificity will. And specificity will vary independently of how many actual positives there are.
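A quick numeric sketch of this claim, with all counts invented: a balanced dataset of 1,000 actual positives and 1,000 actual negatives, a fixed true positive rate of 0.95, and a false positive rate that varies.

```python
# Balanced dataset, high fixed TPR, varying FPR (all numbers invented).
actual_pos, actual_neg = 1000, 1000
tp = 950  # fixed true positive rate of 0.95 -> 950 of the 1,000 actual positives

for fpr in (0.10, 0.50, 0.90, 0.99):
    fp = round(fpr * actual_neg)
    tn = actual_neg - fp
    specificity = tn / actual_neg
    precision = tp / (tp + fp)
    print(f"FPR={fpr:.2f}: specificity={specificity:.2f}, precision={precision:.2f}")

# Output: specificity falls to nearly zero while precision levels off near 0.5:
#   FPR=0.10: specificity=0.90, precision=0.90
#   FPR=0.50: specificity=0.50, precision=0.66
#   FPR=0.90: specificity=0.10, precision=0.51
#   FPR=0.99: specificity=0.01, precision=0.49
```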
So, our evaluation metric selection depends on our case and more precisely – on the consequences of our predictions.
Not all conventions fit all uses. Building impactful models requires actually thinking about the ends, not just the means. Instead of focusing only on accuracy, we should spend some time pondering how our results will be used.
I loved this relevant paragraph from Spencer Antonio Marlen-Starr:
The decision maker does not (or at least, he need not) care about how frequently the predictions on which he bases his decisions are accurate, but about the impact or magnitude of the payoffs, and the cumulative effect of the interaction between the overall accuracy rate of the predictions made during the process of making many decisions over time multiplied by the magnitude of the payoffs received or penalties incurred as their consequences.