The accuracy metric can be misleading when dealing with imbalanced data. In this blog, we will learn about the confusion matrix and its associated terms, which look confusing but are trivial. The confusion matrix, precision, recall, and F1 score give better intuition about prediction results than accuracy does. To understand the concepts, we will limit this article to binary classification only.
What is a confusion matrix?
It is a matrix of size 2×2 for binary classification, with actual values on one axis and predicted values on the other.
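One common layout (actual classes on the rows, predicted classes on the columns; the exact orientation varies between textbooks and libraries) looks like this:

                     Predicted Positive      Predicted Negative
Actual Positive      True Positive (TP)      False Negative (FN)
Actual Negative      False Positive (FP)     True Negative (TN)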

Let’s understand the confusing terms in the confusion matrix: true positive, true negative, false negative, and false positive with an example.
EXAMPLE
A machine learning model is trained to predict tumors in patients. The test dataset consists of 100 people.

True Positive (TP) – the model correctly predicts the positive class (both the prediction and the actual value are positive). In the above example, 10 people who have tumors are predicted as positive by the model.
True Negative (TN) – the model correctly predicts the negative class (both the prediction and the actual value are negative). In the above example, 60 people who don't have tumors are predicted as negative by the model.
False Positive (FP) – the model incorrectly predicts the positive class (predicted positive, actually negative). In the above example, 22 people are predicted as positive for having a tumor, although they don't have one. FP is also called a TYPE I error.
False Negative (FN) – the model incorrectly predicts the negative class (predicted negative, actually positive). In the above example, 8 people who have tumors are predicted as negative. FN is also called a TYPE II error.
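As a quick check, here is a minimal sketch (assuming scikit-learn and NumPy are installed) that builds label arrays matching these counts and computes the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# 1 = has a tumor (positive class), 0 = no tumor (negative class)
y_true = np.array([1] * 18 + [0] * 82)              # 18 actual positives, 82 actual negatives
y_pred = np.array([1] * 10 + [0] * 8 +              # 10 TP, 8 FN
                  [1] * 22 + [0] * 60)              # 22 FP, 60 TN

# For binary labels {0, 1}, ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 60 22 8 10
```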
With the help of these four values, we can calculate the True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), and False Negative Rate (FNR).
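In terms of the four counts:

TPR = TP / (TP + FN), TNR = TN / (TN + FP), FPR = FP / (FP + TN), FNR = FN / (FN + TP)

For the tumor example: TPR = 10 / 18 ≈ 0.56, TNR = 60 / 82 ≈ 0.73, FPR = 22 / 82 ≈ 0.27, and FNR = 8 / 18 ≈ 0.44.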

Even if the data is imbalanced, these values tell us whether our model is working well or not. For that, the values of TPR and TNR should be high, and FPR and FNR should be as low as possible.
With the help of TP, TN, FN, and FP, other performance metrics can be calculated.
Precision, Recall
Both precision and recall are crucial in information retrieval, where the positive class matters much more than the negative class. Why?
When searching for something on the web, we do not care about the irrelevant documents that were not retrieved (the true negative case). Therefore, only TP, FP, and FN are used in precision and recall.
Precision
Out of all the predicted positives, what percentage is truly positive?
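In terms of the confusion-matrix counts: Precision = TP / (TP + FP). For the tumor example, Precision = 10 / (10 + 22) ≈ 0.31.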

The precision value lies between 0 and 1.
Recall
Out of all the actual positives, what percentage is predicted positive? It is the same as TPR (true positive rate).
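In terms of the confusion-matrix counts: Recall = TP / (TP + FN). For the tumor example, Recall = 10 / (10 + 8) ≈ 0.56.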

How are precision and recall useful? Let’s see through examples.
EXAMPLE 1 – Credit card fraud detection

We do not want to miss any fraudulent transactions. Therefore, we want the number of false negatives to be as low as possible. In these situations, we can compromise on low precision, but recall should be high. Similarly, in medical applications, we don't want to miss any patient who has the disease. Therefore, we focus on having a high recall.
So far, we have discussed when recall is more important than precision. But when is precision more important than recall?
EXAMPLE 2 – Spam detection

In spam detection, it is okay if some spam mail remains undetected (a false negative), but what if we lose a critical mail because it is wrongly classified as spam (a false positive)? In this situation, the number of false positives should be as low as possible. Here, precision is more vital than recall.
When comparing different models, it can be difficult to decide which is better (high precision and low recall, or vice versa). Therefore, there should be a metric that combines both of these. One such metric is the F1 score.
F1 Score
It is the harmonic mean of precision and recall. It takes both false positives and false negatives into account. Therefore, it performs well on an imbalanced dataset.
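F1 = 2 × (Precision × Recall) / (Precision + Recall). For the tumor example, F1 = 2 × (0.31 × 0.56) / (0.31 + 0.56) ≈ 0.40.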

The F1 score gives the same weightage to recall and precision.
There is also a weighted F1 score (often called the F-beta score) in which we can give different weightage to recall and precision, since, as discussed in the previous section, different problems call for different weightage on recall and precision.
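F-beta = (1 + Beta²) × (Precision × Recall) / (Beta² × Precision + Recall)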

Beta represents how many times recall is more important than precision. If the recall is twice as important as precision, the value of Beta is 2.
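For completeness, here is a minimal sketch (assuming scikit-learn; it recreates the label arrays from the confusion-matrix example above) that computes all of these metrics directly:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

# Same labels as the tumor example: 10 TP, 8 FN, 22 FP, 60 TN
y_true = np.array([1] * 18 + [0] * 82)
y_pred = np.array([1] * 10 + [0] * 8 + [1] * 22 + [0] * 60)

print(precision_score(y_true, y_pred))      # ~0.31
print(recall_score(y_true, y_pred))         # ~0.56 (same as TPR)
print(f1_score(y_true, y_pred))             # ~0.40
print(fbeta_score(y_true, y_pred, beta=2))  # recall weighted twice as much as precision
```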
Conclusion
The confusion matrix, precision, recall, and F1 score provide better insights into predictions than the accuracy metric alone. Precision, recall, and the F1 score are used in information retrieval, word segmentation, named entity recognition, and many other applications.