
The Confusion Matrix in Hypothesis Testing

Identifying and understanding type I error, type II error, and power using the confusion matrix.

Photo by Robert Katzki on Unsplash

The confusion matrix is a useful tool in both inferential analysis, such as A/B testing, and predictive analysis, such as classification, for understanding and evaluating a statistical test or predictive model. Both involve making a binary decision: in hypothesis testing, we reject or fail to reject the null hypothesis, and in binary classification, the classifier predicts an observation to be positive or negative.

Both tasks allow us to structure the result in a 2×2 confusion matrix, showing true and false positives, and true and false negatives. If you look at the two matrices below, you can see that both are structured the same way with the entries on the main diagonal of the matrices corresponding to correct decisions.

Image by Author

While the matrices are structured the same way and, in essence, contain the same type of measurements, the information in them is named and used differently depending on the context. In hypothesis testing, the confusion matrix contains the probabilities of the different outcomes, which helps us understand properties of the test such as the type I and type II errors and the power. In binary prediction or classification, the counts in each square (TP, FP, TN, and FN) are used to calculate metrics that evaluate a predictive model.
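To make this concrete, here is a minimal sketch of the hypothesis-testing version of the matrix, filled with probabilities rather than counts. The values alpha = 0.05 and beta = 0.20 are hypothetical choices used only for illustration:

```python
import pandas as pd

alpha = 0.05  # significance level: the type I error rate (hypothetical choice)
beta = 0.20   # type II error rate (hypothetical choice)

# Rows are the unknown state of the world; columns are our decision.
matrix = pd.DataFrame(
    {
        "Reject null": [alpha, 1 - beta],          # false positive / true positive (power)
        "Fail to reject null": [1 - alpha, beta],  # true negative (confidence level) / false negative
    },
    index=["Null is true", "Alternative is true"],
)
print(matrix)
```

Note that each row sums to 1, because each row conditions on one state of the world.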

In this article, we will look at the confusion matrix in the context of hypothesis testing. You can read about the confusion matrix for classification problems here.

The Confusion Matrix In Classification


Visualizing a hypothesis test

Hypothesis testing is a method for determining whether the variation between two sample distributions is caused by an actual difference between the two groups or can be explained by random chance.

To understand the confusion matrix in the context of inferential analysis, it may help to look at the visual representation of a one-sided hypothesis test below, which shows how the four possible outcomes in the matrix correspond to the possible events of a hypothesis test.

Image by Author

Type I error (alpha)

The probability of a type I error is also called the size or significance level of a test. It is denoted by alpha and is usually set at 0.05. A type I error is a false positive of a hypothesis test: we falsely reject the null hypothesis given that the null is true.

  • Type I error (alpha) = P(Rejecting the null | Null is true)

If we are comparing the means of two groups, this means that we falsely conclude that there is a difference between the groups when their means are in fact the same.

Type I error corresponds to the red area in the graph. Even though the null hypothesis is true, the test statistic will land in the rejection region in 5% of cases, and we will falsely reject the null hypothesis. This confuses many people: if the test statistic falls in the rejection region, how can it be a false positive?

Well, consider the behavior of p-values. (Any test statistic, such as z, t, chi-square, or F, can be transformed into a p-value ranging from 0 to 1.) Under the null hypothesis, with all assumptions of the test met, p-values follow a uniform distribution between 0 and 1. With repeated sampling, we therefore expect some p-values to fall below 0.05 purely through natural variation in the data and random sampling. While it is less likely to draw a sample with extreme values than one with less extreme values, it does happen by chance. That is why, when the null is true, we expect on average 5% of tests to falsely reject it.
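To see this in action, here is a minimal simulation sketch. It assumes independent, normally distributed samples and uses a two-sided two-sample t-test; the sample size, number of tests, and seed are arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n, alpha = 10_000, 30, 0.05

# Both samples come from the same distribution, so the null is true.
p_values = np.array([
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue
    for _ in range(n_tests)
])

# Under the null, p-values are roughly uniform on [0, 1],
# so about alpha of them fall below the significance level.
print(f"False positive rate: {np.mean(p_values < alpha):.3f}")  # close to 0.05
```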

Can we not just lower the alpha value, you might ask? We can, and some studies do use the stricter significance level of 0.01. But we must be aware that lowering alpha also reduces the power of the test, which is not optimal.

Type II error (beta)

Type II error is what the matrix shows as false negatives and is denoted by beta. It occurs when we fail to reject the null hypothesis given that the alternative hypothesis is true.

  • Type II error (beta) = P(Fail to reject the null | Alternative is true)

If we are comparing the means of two groups, this means that we falsely conclude that there is no difference in means when the means of the two groups are in fact different. In other words, we fail to detect a difference when in truth there is one.

This will occur more frequently if the test does not have adequate power.
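A small simulation sketch can make this concrete. Here, the true difference between the group means is 0.5 standard deviations and each group has 30 observations, both hypothetical choices; beta is estimated as the share of tests that miss the effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, n, alpha = 10_000, 30, 0.05
effect = 0.5  # hypothetical true difference in means (in standard deviations)

# The alternative is true: the second group's mean is shifted by `effect`.
p_values = np.array([
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(effect, 1, n)).pvalue
    for _ in range(n_tests)
])

beta = np.mean(p_values >= alpha)  # proportion of missed effects (type II errors)
print(f"Estimated type II error (beta): {beta:.3f}")
print(f"Estimated power (1 - beta):     {1 - beta:.3f}")
```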

Power

Power refers to the probability of correctly rejecting the null hypothesis when the null is false, the true positives. Power is the complement of type II error (beta): 1-beta.

  • Power = P(Rejecting null | Alternative is true)

Power is the probability of detecting a true difference in means, so we want the power of a test to be as high as possible. A commonly accepted threshold is a power of at least 80%.

Accepting a higher type I error/significance level (alpha) will lead to a smaller beta, which increases the power (power = 1 - beta) to detect a true effect. Conversely, if we lower the probability of committing a type I error, we also lower the power. It might not be obvious from the confusion matrix, but if you look at the graph, you can see that this is the case: an increase in alpha leads to a decrease in beta, and vice versa.
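To illustrate the tradeoff numerically, here is a short sketch using the power calculator for the independent two-sample t-test in statsmodels. The effect size of 0.5 and group size of 50 are hypothetical values:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Hypothetical scenario: effect size d = 0.5, 50 observations per group.
for alpha in (0.10, 0.05, 0.01):
    power = analysis.power(effect_size=0.5, nobs1=50, alpha=alpha, alternative="two-sided")
    print(f"alpha = {alpha:.2f} -> power = {power:.3f}")
```

As alpha shrinks, the computed power shrinks with it, which is exactly the tradeoff described above.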

However, increasing alpha to values above 0.05 is generally not advised. There are a few other methods we can use to increase power which will be presented in a separate article.

(Power is the equivalent of the recall metric in classification, which also expresses the percentage of positives detected.)

Confidence level

The complement of alpha, 1 - alpha, is the probability of correctly failing to reject the null hypothesis, the true negatives. This probability is called the confidence level of a test and is used when constructing confidence intervals, which are commonly set at 0.95 (1 - alpha).

  • Confidence level (1-alpha) = P(fail to reject the null | Null is true)

In our example of comparing means, this would be the correct conclusion that there is no difference between the groups.
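As a minimal sketch of the confidence level in action, assuming normally distributed data and t-based confidence intervals (the sample size, true mean, and seed are arbitrary), we can check that roughly 95% of intervals cover the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n, true_mean = 10_000, 30, 0.0

covered = 0
for _ in range(n_sims):
    sample = rng.normal(true_mean, 1, n)
    lo, hi = stats.t.interval(0.95, df=n - 1, loc=sample.mean(), scale=stats.sem(sample))
    covered += lo <= true_mean <= hi

print(f"Coverage: {covered / n_sims:.3f}")  # close to 0.95 = 1 - alpha
```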


It should be emphasized that the confusion matrix in inferential analysis contains probabilities, not counts from one individual test. A single hypothesis test has only one outcome. The probabilities describe how often each outcome is expected over repeated testing, given that all assumptions hold and conditioning on whether the null or the alternative hypothesis is true. To evaluate these outcomes, you can run a simulation in which you control the parameters.

If we evaluate the type I error by running a simulation 1,000,000 times with a significance level of 0.05, we expect to reject the null 50,000 times (5% of the tests). This is not the same thing as rejecting the null in an individual test. Say the null is true: if you reject the null hypothesis, the chance of being wrong is 100%, not 5%. It is over repeated testing that you can expect to get false positives in, on average, about 5% of cases.


If you are interested in how type I error and power can be evaluated, you are welcome to read this in-depth article on how you can evaluate the performance of the t-test using simulation.

A Closer Look at the Performance of the T-test


