Have you ever evaluated your model in this way?

Evaluation of Classification Model with McNemar Test

Sergen Cansiz
Towards Data Science



Model evaluation is an essential part of the model development process. It is the phase in which we decide whether the model performs well enough. Therefore, it is critical to examine the model's outcomes with every applicable evaluation method, since different methods can provide different perspectives.

There are different metrics (or methods) such as accuracy, recall, precision, and the F1 score. These are the most widely used and best-known metrics for evaluating classification models, and each of them evaluates the model in a different way. For example, while accuracy shows how often the model predicts correctly overall, recall shows how sensitive the model is to the positive class. In this article, I am going to explain another method, the McNemar test, which can provide yet another perspective on your model.

Model Evaluation for Classification Models

Classification models predict certain classes from given input variables. Because their outputs consist of nominal (discrete) data, confusion matrices are the preferred tool for examining their results. A confusion matrix allows a direct, cell-by-cell comparison of the actual and the predicted classes. The following figure shows an example 2x2 confusion matrix with the predicted classes against the actual classes.

Example confusion matrix (Created by author)

As you can see from the table above, there are two classes, positives and negatives. They could just as well be labeled 0 and 1, or A and B; the naming makes no difference. The cells marked as True contain the number of correctly predicted instances; the cells marked as False contain the exact opposite, the misclassified ones. To evaluate your model, it helps to ask this confusion matrix a few questions that reflect what you actually expect from the model. Each question is answered by a different evaluation metric. Let's look at some of these questions:

1 — How accurate is the model when predicting all classes?
2 — How sensitive is the model when predicting positive classes?
3 — How precise is the model when predicting positive classes?
4 — How sensitive is the model when predicting negative classes?

The first question is answered by the most commonly used metric, "Accuracy". The second is answered by "Recall", the third by "Precision", and the fourth by "Specificity". There are other questions and corresponding metrics as well, but I would like to focus on the question for which the McNemar test is required.
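The four questions above can be answered directly from the cells of a 2x2 confusion matrix. Here is a minimal sketch; the cell counts are illustrative numbers, not taken from the article's figures:

```python
# Hypothetical 2x2 confusion matrix (rows = actual, columns = predicted).
# Cell counts below are made up for illustration.
tp, fn = 90, 10   # actual positives: correctly and incorrectly predicted
fp, tn = 20, 80   # actual negatives: incorrectly and correctly predicted

accuracy = (tp + tn) / (tp + fn + fp + tn)  # question 1: overall correctness
recall = tp / (tp + fn)                      # question 2: sensitivity to positives
precision = tp / (tp + fp)                   # question 3: reliability of positive calls
specificity = tn / (tn + fp)                 # question 4: sensitivity to negatives

print(accuracy, recall, precision, specificity)
```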

Model Evaluation with McNemar

One of the mistakes made while evaluating a classification model is considering only the true cases, that is, looking only at how correctly the model reproduces the actual classes. When the results are unsatisfactory, people then try different methods or variations until they get a result that satisfies them, without investigating the underlying reason for the poor result. It shouldn't be forgotten that accuracy depends on the false predictions as much as on the true ones. The false predictions, which we want to keep as low as possible, therefore also have to be examined before rendering a verdict. Recall and precision partly describe the performance on the positive (or negative) class by taking the false cases into account, but my point is that the false positives and the false negatives should be compared with each other, just as the true cases are. This is where the McNemar test should be used, to obtain the probability of a difference between the false negative and false positive counts.

As with the other metrics, there is a question the McNemar test can answer: "Does your model perform equally on the two kinds of wrong (false) predictions?" If there is a statistically significant difference, you might want to focus on the class that contributes more errors. You can then reexamine your training process or your data set to reduce or balance the number of false predictions.

Note: In some cases, unbalanced false negative and false positive counts occur because the model was trained on unbalanced classes.

McNemar’s Test

McNemar’s test is applied to 2x2 contingency tables to determine whether the row and column marginal frequencies are equal for paired samples [1]. For a confusion matrix, the relevant marginal frequencies are the numbers of false predictions for the positive and negative classes. The test statistic follows a chi-square distribution, which is used to obtain the probability of a difference. Let’s have a look at the following tables and try to understand how the test brings out the difference.

Contingency table (Created by author)

In the table above, the "b" and "c" cells represent the marginal frequencies. The null hypothesis of McNemar’s test (with p denoting a proportion) is pa + pb = pa + pc and pc + pd = pb + pd. If we do the math here, both equations reduce to pb = pc. So, the null hypothesis is that the proportion of cell "b" equals the proportion of cell "c", and the alternative hypothesis is that it does not. The following formula gives the McNemar statistic.

McNemar’s Statistic (Created by author)
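The statistic is simply (b − c)² / (b + c), so it fits in a tiny helper. The values passed in below are illustrative:

```python
def mcnemar_statistic(b, c):
    """McNemar's chi-square statistic for the discordant cells b and c:
    (b - c)^2 / (b + c)."""
    return (b - c) ** 2 / (b + c)

print(mcnemar_statistic(45, 35))  # 1.25
```

A continuity-corrected variant, (|b − c| − 1)² / (b + c), is also used in practice for small counts, though the article works with the uncorrected form.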

Applying McNemar’s Test to Confusion Matrix

The confusion matrix obtained from a binary classification model is itself a 2x2 table. So, the difference between the marginal frequencies can be tested with McNemar’s test. Let’s look at the following confusion matrix and try to find out whether there is a difference between the false predictions.

Confusion matrix to apply McNemar’s Test (Created by author)

As you can see from the table above, the difference between the false negatives and the false positives is 10 (45 − 35). It shouldn’t be forgotten that this result comes from the sample test set or training set with which you developed the model; once the model goes into production, it may receive quite different inputs. Therefore, we need to find out whether this difference is statistically significant. Only if it is significant should we expect a similar imbalance in production. This is why we need a probability value.

I am going to use Python to apply McNemar’s test. The statistic itself is simple to calculate from the previous formula. Additionally, we need to obtain a p-value from the chi-square distribution using the McNemar statistic. It can be found with either chi2.cdf or distributions.chi2.sf from the "scipy.stats" module. To use these methods we also need the degrees of freedom, which is the product (rows − 1) × (columns − 1) = (2 − 1) × (2 − 1) = 1. For McNemar’s test, the null hypothesis is that there is no difference between the marginal frequencies, and the alternative hypothesis is that there is a significant difference. So, if the p-value is greater than 0.05, we conclude that there isn’t a significant difference between the false negatives and the false positives; if the p-value is less than or equal to 0.05, there is.
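The steps above can be sketched as follows, using the false-prediction counts (45 and 35) from the confusion matrix:

```python
from scipy.stats import chi2

b, c = 45, 35                       # false negatives and false positives
statistic = (b - c) ** 2 / (b + c)  # McNemar's statistic
df = (2 - 1) * (2 - 1)              # degrees of freedom for a 2x2 table
p_value = chi2.sf(statistic, df)    # equivalently: 1 - chi2.cdf(statistic, df)

print(round(statistic, 2), round(p_value, 2))  # 1.25 0.26
```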

The p-value turns out to be 0.26. Because it is greater than 0.05, the null hypothesis cannot be rejected: at the 95% confidence level, there isn’t a significant difference between the frequencies of false positives and false negatives. How should the 0.26 itself be read? It is the probability of observing an imbalance at least as large as the one in our table if the two error types were in fact equally likely. Since that probability is fairly high, the observed difference of 10 is well within what chance alone would produce.

Conclusions

In this article, we tested the difference between the false cases. The difference between true positives and true negatives can also be tested with the McNemar test, which provides yet another perspective on your model. The most important thing before applying the McNemar test is to make sure that the numbers of the two classes are approximately the same; otherwise, it can lead to unreliable results. Also note that McNemar’s test applies only to 2x2 tables. For a multi-class model, McNemar can be used after the confusion matrix is divided into 2x2 sub-tables.
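One hedged way to form those 2x2 sub-tables (the article does not prescribe a specific scheme) is a one-vs-rest collapse of the multi-class confusion matrix, producing one 2x2 table per class:

```python
def one_vs_rest_2x2(cm, k):
    """Collapse a square confusion matrix cm (rows = actual, cols = predicted)
    into a 2x2 table [[TP, FN], [FP, TN]] for class index k vs. the rest."""
    tp = cm[k][k]                          # class k predicted as class k
    fn = sum(cm[k]) - tp                   # class k predicted as something else
    fp = sum(row[k] for row in cm) - tp    # other classes predicted as class k
    tn = sum(sum(row) for row in cm) - tp - fn - fp
    return [[tp, fn], [fp, tn]]

# Hypothetical 3-class confusion matrix, for illustration only.
cm = [[50, 3, 2],
      [4, 45, 6],
      [1, 5, 60]]
print(one_vs_rest_2x2(cm, 0))  # [[50, 5], [5, 116]]
```

Each resulting 2x2 table can then be fed to McNemar's test exactly as in the binary case.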

References

[1] McNemar’s Test, Wikipedia, https://en.wikipedia.org/wiki/McNemar%27s_test
