
McNemar’s Test to evaluate Machine Learning Classifiers with Python

Learn how to compare ML classifiers with a significance test using Python

Photo by Isaac Smith on Unsplash

Introduction

In my last article, I talked about the importance of properly comparing different models using statistical tools, so that you can choose the best one during the model selection phase.

In this article I want to focus on one statistical test in particular that, as a data scientist or Machine Learning engineer, you need to know. You can use this test to determine whether there is a statistically significant difference in performance between two classifiers, so that you can confidently keep only the better one.

McNemar’s test

McNemar’s test can be used to compare the performance of two classifiers when we have matched pairs, i.e. both classifiers are evaluated on exactly the same test items. The test works well when the two classifiers A and B disagree on many predictions, which in practice requires a reasonably large amount of data.

With this test we can compare the performance of two classifiers on N items using a single test set, as opposed to the repeated train/test splits you would need for a paired t-test.

Assumptions

  1. Random Sample
  2. Independence
  3. Mutually exclusive groups

Contingency Table

McNemar’s test is designed to focus on the differences between the two classifiers, that is, on the cases they predict differently. So the first thing we need to do is compute the following values.

Contingency Table (Image by Author)
  • n00: number of items misclassified by both A and B
  • n01: number of items misclassified by A but not by B
  • n10: number of items misclassified by B but not by A
  • n11: number of items classified correctly by both A and B

Null hypothesis: n01 = n10, i.e. A and B have the same error rate.
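To make these counts concrete, here is a minimal sketch, assuming you already have the ground-truth labels and the predictions of two classifiers A and B as NumPy arrays (all values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical labels and predictions on the same test items (matched pairs)
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
pred_a = np.array([0, 1, 0, 0, 1, 1, 1, 1, 0, 0])
pred_b = np.array([0, 1, 1, 0, 0, 1, 1, 1, 0, 1])

correct_a = pred_a == y_true
correct_b = pred_b == y_true

n00 = np.sum(~correct_a & ~correct_b)  # misclassified by both A and B
n01 = np.sum(~correct_a & correct_b)   # misclassified by A but not by B
n10 = np.sum(correct_a & ~correct_b)   # misclassified by B but not by A
n11 = np.sum(correct_a & correct_b)    # classified correctly by both A and B
```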

McNemar’s test is basically a form of paired chi-square test, so next we need to compute the X² value using the following formula.

Chi-Square (Image by Author)
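In its most common form (with the continuity correction that most implementations apply by default), the statistic is X² = (|n01 − n10| − 1)² / (n01 + n10), with 1 degree of freedom; note that only the discordant counts n01 and n10 enter the formula. A minimal sketch of the computation, using made-up counts:

```python
def mcnemar_statistic(n01: int, n10: int, corrected: bool = True) -> float:
    """Chi-square statistic of McNemar's test from the two discordant counts."""
    diff = abs(n01 - n10) - 1 if corrected else abs(n01 - n10)
    return diff ** 2 / (n01 + n10)

# Hypothetical counts, purely for illustration
print(mcnemar_statistic(n01=18, n10=6))  # ~5.04
```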

Example

Contingency Table (Image by Author)

X² = 7.55

Let’s say we want a significance level of 0.05. We can reject the null hypothesis if X² > X²(0.05) (in a two-tailed test).

So let’s use a chi-square table with degrees of freedom = 1.

X² Table (source: https://en.wikibooks.org/wiki/Engineering_Tables/Chi-Squared_Distibution)

Since 7.55 > 3.84, we can reject the null hypothesis and state that there is a significant difference between classifiers A and B!
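If you prefer not to read the critical value off a printed table, you can obtain it (and the corresponding p-value) programmatically. A small sketch using SciPy's chi-square distribution:

```python
from scipy.stats import chi2

alpha = 0.05
statistic = 7.55  # the X² value from the example above

critical_value = chi2.ppf(1 - alpha, df=1)  # ~3.84 for alpha = 0.05, 1 degree of freedom
p_value = chi2.sf(statistic, df=1)          # P(X² > 7.55) under the null hypothesis

print(f"critical value = {critical_value:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if statistic > critical_value else "Fail to reject H0")
```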

Let’s code
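The snippet below is a minimal, self-contained sketch of the whole procedure. The synthetic dataset and the two classifiers (a logistic regression and a decision tree) are only illustrative assumptions; the test itself is run with the mcnemar function from statsmodels:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from statsmodels.stats.contingency_tables import mcnemar

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Two classifiers, A and B, trained on the same training set
clf_a = LogisticRegression(max_iter=1000).fit(X_train, y_train)
clf_b = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Per-item correctness on the shared test set (matched pairs)
correct_a = clf_a.predict(X_test) == y_test
correct_b = clf_b.predict(X_test) == y_test

# 2x2 contingency table; only the off-diagonal (discordant) cells matter for the test
n11 = np.sum(correct_a & correct_b)    # both correct
n10 = np.sum(correct_a & ~correct_b)   # A correct, B wrong
n01 = np.sum(~correct_a & correct_b)   # A wrong, B correct
n00 = np.sum(~correct_a & ~correct_b)  # both wrong
table = [[n11, n10],
         [n01, n00]]

# McNemar's test using the chi-square approximation with continuity correction
result = mcnemar(table, exact=False, correction=True)
print(f"X² = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")

alpha = 0.05
if result.pvalue < alpha:
    print("Reject H0: the two classifiers have significantly different error rates.")
else:
    print("Fail to reject H0: no significant difference between the classifiers.")
```

With exact=True, statsmodels instead computes the exact binomial version of the test, which is the safer choice when the number of discordant items (n01 + n10) is small.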

Final Thoughts

Hypothesis testing provides a reliable framework for making data-driven decisions about the population of interest. It helps the researcher extrapolate conclusions from the sample to the larger population.

Comparing the results you get from a single run of different models to choose the best one is never a good method. Statistical tests allow us to state objectively whether one model performs better than another. Whenever researchers run experiments, statistical tests are used to show that the new approach is better than what existed previously; just go and read any machine learning paper!

The End

Marcello Politi

Linkedin, Twitter, CV

