Knowing which metrics to use is not always straightforward
With the boom in deep learning, more and more people are learning how to train their first classifier. But once training is done, the analysis that follows is crucial, and knowing which metrics to use is not always straightforward. In this article, I’ll discuss accuracy, precision, recall, F1 score, and the confusion matrix for measuring the performance of your classifier.
True vs False and Positive vs Negative
Before starting with the metrics, let us look at a few terms that will be used to define them [1].
We will use tumor detection as a running example and make the following definitions:
- Tumor – Positive class
- No Tumor – Negative class
So positive and negative are classes defined by us. True means that the prediction and the actual class match. False means that the prediction and the actual class do not match.
Now we can look at the following four terms:
- True positive (TP): The prediction was the positive class and the actual class was also positive.
- False positive (FP): The prediction was the positive class but the actual class was negative.
- True negative (TN): The prediction was the negative class and the actual class was also negative.
- False negative (FN): The prediction was the negative class but the actual class was positive.
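As a quick illustration, here is a minimal Python sketch (with made-up label lists, where 1 stands for the tumor/positive class and 0 for the no-tumor/negative class) showing how the four counts can be tallied:

```python
# Hypothetical labels: 1 = tumor (positive class), 0 = no tumor (negative class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # predicted positive, actually positive
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # predicted positive, actually negative
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # predicted negative, actually negative
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # predicted negative, actually positive

print(tp, fp, tn, fn)  # 2 1 3 2
```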
Accuracy
This is a very popular metric that tells you what percentage of all predictions were correct.
Accuracy = Total correct predictions / Total predictions [2]
or, equivalently,
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Note: Total predictions = TP + FP + TN + FN
and Total correct predictions = TP + TN.
This is a pretty simple measure of performance. However, it can be misleading for datasets with a high class imbalance. Suppose we have 100 samples, of which 10 are tumors and 90 are not. The classifier classifies 89 of the non-tumor samples correctly but only one of the actual tumors.
For this example: TP: 1, FP: 1, TN: 89, FN: 9
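A minimal sketch of the computation from these counts:

```python
# Counts from the tumor example: TP = 1, FP = 1, TN = 89, FN = 9
tp, fp, tn, fn = 1, 1, 89, 9

accuracy = (tp + tn) / (tp + fp + tn + fn)
print(accuracy)  # 0.9, even though 9 of the 10 tumors were missed
```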
We get an accuracy of 90%. It might look like the model is performing very well. However, in reality, it performs very poorly on the tumor class. A false negative in a medical application can cost a life, yet accuracy gives a false sense of great performance. The next metrics help in these situations and give a better understanding of the classifier's performance.
Precision
Precision is defined as the fraction of relevant instances among the retrieved instances [3]. It is also known as the positive predictive value. It is the ratio of true positives to predicted positives, i.e. the sum of true positives and false positives.
Precision = TP / (TP + FP)
Precision attempts to answer the question: What proportion of positive identifications was actually correct? [4]
If we continue the earlier example,
Precision = 1 / (1 + 1) = 0.5
The precision is 0.5, which means that only half of the samples flagged as tumors are actually tumors.
Note: A model with a precision value of 1 will have no false positives.
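A minimal sketch of this calculation with the counts from the running example:

```python
# Counts from the tumor example
tp, fp = 1, 1

precision = tp / (tp + fp)
print(precision)  # 0.5 -> only half of the predicted tumors are actual tumors
```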
Recall
Recall is defined as the fraction of relevant instances that were retrieved [3]. It is also known as sensitivity. It is the ratio of true positives to the total positives in the set, i.e. the sum of true positives and false negatives.
Recall = TP / (TP + FN)
Recall attempts to answer the question: What proportion of actual positives was identified correctly? [4]
If we continue the earlier example,
Recall = 1 / (1 + 9) = 0.1
The recall is 0.1, which means the classifier retrieves only 10% of the tumor instances; the remaining 90% are wrongly classified as non-tumor. This is very poor performance given the requirements of a tumor detection application.
Note: A model with a recall value of 1 will have no false negatives.
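And the corresponding sketch for recall, again with the counts from the running example:

```python
# Counts from the tumor example
tp, fn = 1, 9

recall = tp / (tp + fn)
print(recall)  # 0.1 -> only 10% of the actual tumors are detected
```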
F1-score
This is a metric that combines precision and recall into a single performance measure. The F1 score is defined as the harmonic mean of precision and recall [5].
F1-score = 2 * (precision * recall) / (precision + recall)
It gives equal weight to precision and recall. The F1 score is always lower than or equal to the arithmetic and geometric means of precision and recall, and it tends towards the lower of the two values, so a single high value cannot mask a poor one.
For the above example,
F1-score = 2 * 0.5 * 0.1 / (0.5 + 0.1) = 0.16667
The F1-score ends up closer to the recall, the lower of the two metrics. Conversely, a high F1-score indicates that the classifier has both high precision and high recall.
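As a minimal sketch, all three metrics can also be computed with scikit-learn; the label arrays below are made up to reproduce the counts from the running example (TP = 1, FP = 1, TN = 89, FN = 9):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels reproducing the example: 10 tumors (1), 90 non-tumors (0)
y_true = [1] * 10 + [0] * 90
# 1 tumor caught, 9 missed, 1 false alarm, 89 correct rejections
y_pred = [1] + [0] * 9 + [1] + [0] * 89

print(precision_score(y_true, y_pred))  # 0.5
print(recall_score(y_true, y_pred))     # 0.1
print(f1_score(y_true, y_pred))         # ~0.1667
```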
Confusion Matrix
In machine learning, for classification problems, a confusion matrix (also called an error matrix) is a table that allows visualization of classification performance [6]. Each row represents the instances of an actual class and each column represents the instances of a predicted class, or vice versa, depending on the convention.
Note: Read the documentation of the library you are using carefully so you know which convention it follows. For scikit-learn, the rows represent the true labels and the columns represent the predicted labels [7].
For binary classification, the confusion matrix can be neatly represented using the TP, FP, TN and FN values.
So for our example, the confusion matrix (rows: actual class, columns: predicted class) is:

|                  | Predicted: No Tumor | Predicted: Tumor |
|------------------|---------------------|------------------|
| Actual: No Tumor | 89 (TN)             | 1 (FP)           |
| Actual: Tumor    | 9 (FN)              | 1 (TP)           |
Looking at this confusion matrix, we can clearly see how the classifier performs. An ideal confusion matrix has only diagonal entries and no off-diagonal entries, i.e. FP and FN are zero; the precision and recall values are then 1. A good way to visualize the matrix is as a heatmap, which makes the cells with larger counts bright and the others dim. If we observe a bright diagonal, we quickly know that most samples are being classified correctly. This is especially useful for multi-class confusion matrices.
Looking at the matrix allows one to quickly find out how many samples were correctly classified for every class and how the misclassified samples are distributed. You can then check whether something stands out. For example, if a particular class is being heavily misclassified as another class, you can inspect those samples manually: maybe something is wrong with the data, or you learn the conditions under which the classifier fails.
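As a sketch of how this is typically done (reusing the hypothetical label arrays from the earlier snippet), scikit-learn can compute the matrix and matplotlib can render it as a heatmap:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Hypothetical labels reproducing the example: TP = 1, FP = 1, TN = 89, FN = 9
y_true = [1] * 10 + [0] * 90
y_pred = [1] + [0] * 9 + [1] + [0] * 89

cm = confusion_matrix(y_true, y_pred)  # rows = true labels, columns = predicted labels
print(cm)
# [[89  1]
#  [ 9  1]]

# A bright diagonal in the heatmap indicates mostly correct classifications.
ConfusionMatrixDisplay(cm, display_labels=["No Tumor", "Tumor"]).plot(cmap="Blues")
plt.show()
```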
Conclusion
In this article, we looked at a few classification metrics that are useful for analyzing the performance of a classifier. We started with the concepts of true vs false and positive vs negative. We then looked at accuracy as a metric and observed that it is misleading for imbalanced datasets. Next, we looked at precision, recall and F1-score as metrics that work better with imbalanced datasets. Finally, we looked at the confusion matrix, which is a very useful tabular representation for visualizing the performance of a classifier. I hope this article helped you. If you use any other tools or metrics, please do share them with me! Thanks for reading. Follow for more interesting and insightful articles. P.S. If you want to read about more metrics, you can read https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226 by Rahul Agarwal.
References
[1] https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative
[2] https://developers.google.com/machine-learning/crash-course/classification/accuracy
[3] https://en.wikipedia.org/wiki/Precision_and_recall
[4] https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
[5] https://en.wikipedia.org/wiki/F-score
[6] https://en.wikipedia.org/wiki/Confusion_matrix
[7] https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html