A lot of effort in classification tasks is placed on feature engineering and parameter optimization, and rightfully so. These steps are essential for building models with robust performance. However, all these efforts can be wasted if you choose to assess these models with the wrong evaluation metrics.
There is currently a plethora of ways to quantify the performance of a model. However, amid the maelstrom of Machine Learning jargon, it can be tempting to gloss over the distinctions between these metrics and select one arbitrarily.
Unfortunately, doing so can leave you with a model that fails to meet the objectives in question.
For instance, one of the functions of a model is to provide insight into which features should be subject to research or experimentation in the subsequent analysis (e.g. A/B testing). If you relied on a model trained with the wrong evaluation metric, you would run the risk of investing your time, energy, and money in exploring the wrong variables.
Thus, it would be wise to take the time to understand what different evaluation metrics mean, how they are computed, and when they are most useful.
Key Terminology
Before discussing any metrics, it is important to introduce a few key terms used when comparing model predictions to the actual values.
A classification made by a model results in one of four outcomes.
A true positive reflects an outcome where a machine correctly predicts a positive case.
A true negative reflects an outcome where a machine correctly predicts a negative case.
A false positive reflects an outcome where a machine assigns a positive prediction to a negative case.
A false negative reflects an outcome where a machine assigns a negative prediction to a positive case.
To better contextualize these terms, consider a model that detects credit card fraud.
A true positive outcome entails correctly identifying a transaction as fraudulent.
A true negative outcome entails correctly identifying a transaction as legitimate.
A false positive outcome entails identifying a legitimate transaction as fraud.
A false negative outcome entails identifying a fraudulent transaction as legitimate.
These prediction outcomes serve as components used to derive many evaluation metrics for classification models.
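As a quick illustration of how these four counts can be obtained in practice, here is a minimal sketch using scikit-learn and a tiny, made-up set of fraud labels (the specific values are purely for demonstration):

```python
from sklearn.metrics import confusion_matrix

# Toy ground-truth labels and model predictions (1 = fraud, 0 = legitimate).
# These values are made up purely for illustration.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=4, FP=2, FN=1
```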
Let’s go over a few of them.
Evaluation Metrics
1. Accuracy
This is by far the simplest metric.
Accuracy can be computed with the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN denote the counts of true positives, true negatives, false positives, and false negatives, respectively.
On its own, this metric is rarely sufficient in real-life scenarios. It does not distinguish between false positives and false negatives, and because a model that simply predicts the majority class can still score highly, it is a terrible choice for imbalanced datasets.
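Here is a minimal sketch of computing accuracy both from the raw counts and with scikit-learn, reusing the made-up fraud labels from above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # made-up labels for illustration
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

# scikit-learn computes the same quantity directly from the labels.
print(manual_accuracy, accuracy_score(y_true, y_pred))  # both 0.7
```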
2. Precision
The precision metric can be thought of as the accuracy of the model's positive predictions. It penalizes false positives without considering false negatives.
Its value can be computed with the following formula:

Precision = TP / (TP + FP)
It is a valid metric to use if you want to avoid false positives or are more tolerant of false negatives.
Medical professionals, for example, have plenty of incentive to avoid false positives. Diagnosing a patient with a disease they don't actually have can lead to treatment or surgery that was never necessary in the first place, which in turn can cause further harm to the patient. Such false predictions can also lead to reputational damage and a barrage of lawsuits.
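A brief sketch of precision on the same made-up labels, computed by hand and via scikit-learn:

```python
from sklearn.metrics import confusion_matrix, precision_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # made-up labels for illustration
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision = TP / (TP + FP): of everything flagged positive, how much really was positive?
manual_precision = tp / (tp + fp)

print(manual_precision, precision_score(y_true, y_pred))  # both 0.6
```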
3. Recall
The recall metric could be considered as the polar opposite of the precision metric. Instead of penalizing false positives and ignoring false negatives, this metric penalizes false negatives and ignores false positives.
The recall metric can be computed with the following formula:

Recall = TP / (TP + FN)
Consider the previous case of disease diagnostics. Although false positives are unappealing to medical professionals, the same can be said for false negatives. Failing to diagnose a patient with a disease or condition that they actually have deprives them of needed treatment.
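And the corresponding sketch for recall, again on the made-up labels:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # made-up labels for illustration
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall = TP / (TP + FN): of all the truly positive cases, how many were caught?
manual_recall = tp / (tp + fn)

print(manual_recall, recall_score(y_true, y_pred))  # both 0.75
```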
4. F1 Score
In many real-life cases, false positives and false negatives are both unappealing. As a result, both precision and recall may be inadequate for evaluating a model’s performance.
Fortunately, instead of tracking both metrics separately, one can use the F1 score, which combines precision and recall into a single number by taking their harmonic mean.
The F1 score is calculated with the following formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
This is my personal go-to metric for classification projects, given its broad applicability.
In the many fields and disciplines where both false negatives and false positives are undesirable, the F1 score stands out as an appealing choice.
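A short sketch showing that the formula above matches scikit-learn's f1_score on the same made-up labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # made-up labels for illustration
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

precision = precision_score(y_true, y_pred)  # 0.60
recall = recall_score(y_true, y_pred)        # 0.75

# F1 = 2 * (Precision * Recall) / (Precision + Recall), the harmonic mean of the two.
manual_f1 = 2 * precision * recall / (precision + recall)

print(manual_f1, f1_score(y_true, y_pred))  # both ~0.667
```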
5. F-beta score
The one caveat of using the F1 score is that it weighs precision and recall equally. However, there are cases where precision might be more important than recall, or vice versa.
The F-beta score would be an apt fit for such a situation. It can be derived with the following formula:

F-beta = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
If you wish to prioritize the precision metric over the recall metric (i.e. limit false positives), select a beta value below 1.
If you wish to prioritize the recall metric over the precision metric (i.e. limit false negatives), select a beta value above 1.
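As a brief sketch of how the beta value shifts the score (again using the made-up labels from earlier; scikit-learn's fbeta_score is one readily available implementation):

```python
from sklearn.metrics import fbeta_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]  # made-up labels for illustration
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

# beta < 1 weights precision more heavily; beta > 1 weights recall more heavily.
print(fbeta_score(y_true, y_pred, beta=0.5))  # precision-leaning, ~0.625
print(fbeta_score(y_true, y_pred, beta=2.0))  # recall-leaning,    ~0.714
```

With beta = 1 the F-beta score reduces to the ordinary F1 score.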
Conclusion

Choosing the evaluation metric(s) for your classification models may not be as time-consuming as procedures like data wrangling and feature engineering.
However, it still has a considerable influence on the final outcome of your project or study. Taking the time to consider the metric most suited for the model can ensure a satisfactory end result.
I wish you the best of luck in your machine learning endeavors!