
Understanding Alpha, Beta, and Statistical Power

How to minimize errors and maximize results in your hypothesis tests

Photo by Thomas Kelley on Unsplash

Introduction

Knowing how to set up and conduct a hypothesis test is a critical skill for any aspiring data scientist. Making sense of alpha, beta, power, and type I and type II errors can feel confusing at first. My goal in this article is to help you build intuition and provide some visual references.

First, let’s envision setting up a standard A/B experiment where the A group is the control and B is the experimental group. Our null hypothesis is that the two groups are equal and the change applied to group B did not have a significant effect (A = B). Our alternative hypothesis is that the two groups are not the same and that the change applied to group B did, in fact, cause a significant difference (A ≠ B). We can visualize the sampling distributions of the two groups like this:

Image created by the author
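
To make the idea of a sampling distribution concrete, here is a minimal simulation sketch in Python; the conversion rates, group size, and number of repetitions are made-up values chosen only for illustration. Each simulated experiment yields one observed conversion rate per group, and repeating the experiment many times traces out the two overlapping curves pictured above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical ground-truth conversion rates for the two groups (illustration only)
p_a, p_b = 0.10, 0.12
n = 1_000               # visitors per group in each simulated experiment
n_experiments = 10_000  # number of simulated A/B experiments

# Each entry is the conversion rate observed in one simulated experiment,
# so the two arrays approximate the sampling distributions of A and B.
sample_means_a = rng.binomial(n, p_a, n_experiments) / n
sample_means_b = rng.binomial(n, p_b, n_experiments) / n

print(f"A: mean={sample_means_a.mean():.4f}, std={sample_means_a.std():.4f}")
print(f"B: mean={sample_means_b.mean():.4f}, std={sample_means_b.std():.4f}")
```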

Confidence Level and Alpha

The confidence level (CL) refers to how sure we want to be before we reject the null hypothesis, that is, how sure we want to be before we say that the experiment had a significant impact and implement the change from group B. It is chosen beforehand and expressed as a probability percentage. Do we want to be 95% sure in order to reject the null? Maybe we need to be 99% sure. The confidence level will depend on your test and how serious the consequences would be if you were wrong. The standard starting point is a confidence level of 95% (.95).

The alpha value is expressed as 1 - CL. If the confidence level is .95, then the alpha value is .05, or 5%. Alpha represents the probability that we are willing to reject the null hypothesis when it is actually true. In other words, with an alpha of 5% we are willing to live with a 5% chance of concluding that there is a difference when there really isn’t. Making an error like this is called a false positive, or type I error. Let’s look at our picture again to get a visual intuition.

Image created by the author

The confidence level/alpha value creates a decision boundary. Values above the boundary will be considered part of distribution B and support the alternative hypothesis, while those below will be considered part of distribution A and support the null hypothesis. You can now see, with the shaded portion of the diagram, how the alpha value represents the percentage of values that we are willing to wrongly categorize as being part of distribution B. We have to set this decision boundary and accept these wrong answers because the overlap between the two distributions creates ambiguity. The shaded portion represents those values that in ground truth support the null hypothesis (are part of distribution A) but that we mistake as supporting the alternative hypothesis (we misinterpret them as being part of distribution B). That is why we call them false positives: they falsely support a positive test. To really drive this point home, a confidence level of 95% and an alpha of 5% mean that the shaded region in the image is 5% of the area under the curve of distribution A.
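
If it helps to see the decision boundary as a number rather than a picture, here is a small sketch using SciPy. The mean and standard error assumed for distribution A are hypothetical placeholders; the point is simply that the critical value is the point on A’s curve that leaves exactly alpha in the upper tail.

```python
from scipy import stats

confidence_level = 0.95
alpha = 1 - confidence_level          # 0.05

# Assumed mean and standard error of the null (A) sampling distribution, for illustration
mu_a, sigma_a = 0.10, 0.0095

# Critical value: the decision boundary that leaves alpha in A's upper tail.
# Observations above it are treated as evidence against the null.
critical_value = stats.norm.ppf(1 - alpha, loc=mu_a, scale=sigma_a)

# Sanity check: the shaded area of A beyond the boundary equals alpha
tail_area = stats.norm.sf(critical_value, loc=mu_a, scale=sigma_a)
print(f"critical value = {critical_value:.4f}, tail area = {tail_area:.3f}")
```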

Statistical Power and Beta

The power of a hypothesis test is the probability that the test will correctly support the alternative hypothesis. Another way of saying this is that power is the probability that values belonging to distribution B will be correctly identified. Power is calculated as 1 - beta. So what is beta? Beta is the probability that we fail to reject the null hypothesis even though the alternative hypothesis is actually true. In our case, it is the probability that we misidentify a value as being part of distribution A when it is really part of distribution B. A standard target for power is .8 or 80%, which makes beta .2 or 20%. Again, you will want to consider your own test when deciding on the beta level appropriate for you. Let’s visualize beta.

Image created by the author

The shaded region represents beta. These are values that truly belong to distribution B but are thought to be part of distribution A (supporting the null hypothesis) and thus produce a negative test result. That is why they are considered false negatives. This type of testing error is called a type II error.
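
Continuing the same hypothetical setup from the alpha sketch (the distribution parameters below are assumed values for illustration only): once the boundary is fixed by alpha on distribution A, beta is simply the share of distribution B that falls on the wrong side of it, and power is everything that remains. The last two lines also preview the trade-off discussed next: tightening alpha pushes the boundary toward B and inflates beta.

```python
from scipy import stats

alpha = 0.05

# Assumed sampling-distribution parameters, for illustration only
mu_a, sigma_a = 0.10, 0.0095   # null (A)
mu_b, sigma_b = 0.12, 0.0100   # alternative (B)

# Decision boundary set by alpha on distribution A (one-sided test)
critical_value = stats.norm.ppf(1 - alpha, loc=mu_a, scale=sigma_a)

# Beta: probability a value truly from B falls below the boundary
# and is mistaken for A (a false negative / type II error)
beta = stats.norm.cdf(critical_value, loc=mu_b, scale=sigma_b)
power = 1 - beta
print(f"beta = {beta:.3f}, power = {power:.3f}")

# Tightening alpha (say, to 0.01) pushes the boundary toward B and raises beta
stricter_boundary = stats.norm.ppf(1 - 0.01, loc=mu_a, scale=sigma_a)
stricter_beta = stats.norm.cdf(stricter_boundary, loc=mu_b, scale=sigma_b)
print(f"beta at alpha = 0.01: {stricter_beta:.3f}")
```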

Putting It Together

Photo by Greg Rosenke on Unsplash

Now that you understand alpha and beta, you can see that hypothesis testing is often a balance between the two. If we want to avoid false positives, or type I errors, we can raise our confidence level. But the more stringent we are about avoiding false positives, the more we increase the probability of getting false negatives, or type II errors. There are a few things you can consider when trying to reconcile this problem:

  1. You can consider your particular test and which type of error would be worse. Imagine a Covid-19 test as an example. If you were to test someone for Covid-19, a false negative is worse than a false positive, because someone who has the virus but is told they don’t (false negative) may end up infecting many others since they think they don’t have it. It is better to wrongly identify people as having the virus (false positive) because the worst that will happen is that they stay home and isolate even when it wasn’t necessary. So Covid-19 testing should prioritize a low beta value. But a company testing a website might instead want to minimize type I errors, or false positives, because making a change can be an expensive endeavor full of risks that wouldn’t be worth it unless they were very sure it would have the intended positive impact. They may miss out on a few opportunities, but they avoid expensive mistakes. By considering your particular problem, you can minimize the error that matters most to you.
  2. You can consider a minimum amount of difference that you care about for your problem. Say you are testing whether a change to a website increases the conversion rate. If the change in the conversion rate is tiny, the two distributions will overlap more, making the difference much harder to detect and an error much more likely. By setting a threshold for how much the metric has to improve before the change is worth making, you create a situation with less overlap between the two distributions and a smaller region of ambiguity and error.
  3. You can increase the sizes of your samples. Larger sample sizes make it easier for your experiment to pick up differences between the two distributions, including smaller differences, if those are still important to you. You can also work out the sample size needed for a desired alpha, power, and minimum effect size using calculators on the internet, or with a power-analysis library, as in the sketch after this list.
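
On that last point, a power-analysis library can stand in for the online calculators. The sketch below uses statsmodels with a hypothetical baseline conversion rate and minimum lift; it solves for the per-group sample size implied by a chosen alpha, power, and minimum effect size.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # current conversion rate (assumed, for illustration)
minimum_lift = 0.02    # smallest improvement worth acting on (assumed)

# Cohen's h effect size for comparing two proportions
effect_size = proportion_effectsize(baseline_rate + minimum_lift, baseline_rate)

# Solve for the sample size per group given alpha, power, and effect size
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    alternative="larger",
)
print(f"~{n_per_group:.0f} samples needed in each group")
```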

Conclusion

I hope this has been a helpful explanation of these different factors and how they affect each other. Your homework from here is to play around with an interactive tool online that will help you build intuition. If these concepts feel confusing right now, know that they feel confusing to everyone at first. It just takes a little practice.

