A General Guide to Hypothesis Testing

A gentle explanation of P-Value, Type I Error, Type II Error, and Statistical Power

Shuyu Luo
Towards Data Science

--

Hypothesis testing is an important statistical technique, applied widely in A/B testing for all kinds of business cases, yet it remains confusing to many people. This article summarizes a few key elements of hypothesis testing and how they affect test results.

The story starts with a hypothesis. When we want to know some characteristic of a population, such as the form of its distribution or a parameter of interest (mean, variance, etc.), we make an assumption about it, called a hypothesis about the population. We then draw samples from the population and test whether the sample results make sense given that assumption.

For example, suppose your manager believes that the mean click-through rate (CTR) per user across the company’s website user base is 0.06 (the population mean CTR), while you doubt that figure and believe the CTR should be higher. How do you test who is right? Hypothesis testing first states a hypothesis, here that the mean CTR is μ = 6%, then randomly draws sample data from a good number of users and calculates the mean CTR from them (the sample mean). Based on what you observe in the sample, you decide whether or not to reject the assumption. This is where the Null Hypothesis and the Alternative Hypothesis come in.

Null Hypothesis: μ = 6%

Alternative Hypothesis: μ > 6%

Say the sample size is large enough for the test to be valid, and you find a sample mean CTR of 7.5%. Can you just tell your boss that the real CTR is 7.5%? Is the 1.5% difference observed in the sample significant enough to support that decision? Probability can always help with uncertainty. Ideally, you want to provide an answer like this: if the null hypothesis is true, that is, the mean CTR equals 6%, then the probability of seeing a sample mean CTR of 7.5% or greater is 1%. A probability of 1% is too rare, so it is reasonable to reject the null hypothesis. This is exactly what the P-Value expresses.

“The P-Value is the probability, under the null hypothesis, of obtaining a result as extreme as or more extreme than the one observed.”

P-Value = P(sample mean x̄ ≥ 7.5% | Null Hypothesis is true)
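To make this concrete, here is a minimal sketch of the calculation in Python. Only the two means appear above, so the sample size and standard deviation below are hypothetical numbers chosen for illustration; with them, a one-sided z-test lands close to the 1% p-value just discussed.

```python
import numpy as np
from scipy import stats

mu_0 = 0.06    # null hypothesis: population mean CTR
x_bar = 0.075  # observed sample mean CTR
s = 0.15       # assumed standard deviation of per-user CTR (hypothetical)
n = 500        # assumed sample size (hypothetical)

# Standard error of the sample mean, then the z statistic
se = s / np.sqrt(n)
z = (x_bar - mu_0) / se

# One-sided p-value: P(sample mean >= 7.5% | null hypothesis is true)
p_value = stats.norm.sf(z)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")  # ~0.013 with these numbers
```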

Let’s look at the visualization of the P-Value for two situations below.

The black line represents the distribution under the Null Hypothesis, and the blue line represents the distribution of sample means. The pink area is the P-Value as stated above. In the first situation, the two distributions are far apart and the P-Value is low, at 0.01: given that the null hypothesis is true (the population mean equals 6%), the probability of seeing a sample mean of 7.5% or larger is 1%. That outcome is very rare, yet it is what we observed, so we conclude that the assumption is not true. In the second situation, the P-Value equals 30%, meaning there is a 30% probability of seeing a sample mean of 7.5% or larger; that is a fairly high probability, so we cannot reject the null hypothesis. This is why, when the P-Value is small, we reject the null hypothesis.
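If you want to reproduce the picture yourself, here is a minimal matplotlib sketch, assuming normal curves and the same hypothetical standard error as in the sketch above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

se = 0.0067  # assumed standard error of the sample mean (hypothetical)
x = np.linspace(0.03, 0.11, 500)

null_dist = stats.norm(loc=0.060, scale=se)  # distribution under the null (black)
samp_dist = stats.norm(loc=0.075, scale=se)  # distribution of sample means (blue)

plt.plot(x, null_dist.pdf(x), color="black", label="Null hypothesis (mean = 6%)")
plt.plot(x, samp_dist.pdf(x), color="blue", label="Sample means (mean = 7.5%)")

# Pink area: P(sample mean >= 7.5% | null is true), i.e. the P-Value
tail = x >= 0.075
plt.fill_between(x[tail], null_dist.pdf(x[tail]), color="pink")

plt.xlabel("Mean CTR")
plt.legend()
plt.show()
```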

On the other hand, whether the null hypothesis is actually true and whether you reject it are two separate questions, and each combination of the two can be expressed as a probability. This is the beauty of hypothesis testing. Look at the visualization with more details:

Let’s say that in this case the P-Value (the area enclosed by the approximate pink triangle) is small enough, equal to 0.05. Now, what do α, β, and 1-β mean?

α is the Type I Error: the probability of incorrectly rejecting the null hypothesis, given that it is true. It can also be regarded as the maximum P-Value you are willing to tolerate. The value of α is also called the significance level of the test; for example, when α = 5%, the corresponding confidence level is 95%.

α = P(Reject Null Hypothesis | Null Hypothesis is true)

β is the Type II Error: the probability of failing to reject the null hypothesis, given that it is false.

β = P(Fail to reject Null Hypothesis | Null Hypothesis is false)

1-β is the Statistical Power: the probability of correctly rejecting the null hypothesis, given that it is false.

1-β = P(Reject Null Hypothesis | Null Hypothesis is false)
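Continuing the CTR example, a short sketch shows how α, β, and power fall out of the two distributions, using the same hypothetical true mean of 7.5% and standard error as before.

```python
from scipy import stats

mu_0, mu_true = 0.060, 0.075  # null mean and assumed true mean (hypothetical)
se = 0.0067                   # assumed standard error of the sample mean
alpha = 0.05                  # Type I error we are willing to accept

# Critical value: reject the null when the sample mean exceeds this threshold
crit = stats.norm.ppf(1 - alpha, loc=mu_0, scale=se)

# Type II error: the sample mean falls below the threshold even though
# the true mean is 7.5%, so we fail to reject a false null hypothesis
beta = stats.norm.cdf(crit, loc=mu_true, scale=se)
power = 1 - beta

print(f"critical value = {crit:.4f}, beta = {beta:.3f}, power = {power:.3f}")
```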

There are four situations, summarized in the table below:

                            | Null Hypothesis is true | Null Hypothesis is false
Reject Null Hypothesis      | Type I Error (α)        | Correct decision (Power, 1-β)
Fail to reject Null Hypoth. | Correct decision (1-α)  | Type II Error (β)

This is exactly reminiscent of the confusion matrix we usually build for classification problems: α corresponds to the False Positive rate, β to the False Negative rate, 1-β to the True Positive rate, and 1-α to the True Negative rate. See below.

Obviously, we would like to minimize FP and FN and maximize TP, but as the distribution graph above shows, when α decreases, β increases and statistical power decreases. So there is always a trade-off between them, and we cannot optimize for one single metric alone. It is exactly the same logic as balancing recall and precision with a confusion matrix. Several factors affect α and β, such as the sample size, the spread of the distributions, and how far the observation lies from the assumption (the effect size). A standard choice is 80% for statistical power and 5% for α.
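As a closing illustration, the sketch below solves for the sample size needed to detect a lift from 6% to 7.5% CTR at 80% power and α = 5%. It assumes a one-sided, two-sample z-test on proportions, which is one common A/B testing setup rather than anything prescribed above.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Effect size (Cohen's h) for detecting a lift from 6% to 7.5% CTR
effect = proportion_effectsize(0.075, 0.06)

# Solve for the per-group sample size of a one-sided two-sample z-test
n = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,
    power=0.80,
    alternative="larger",
)
print(f"required sample size per group: {n:.0f}")
```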

This article provided a conceptual, top-level summary of hypothesis testing along with a few important metrics. There are many more details around the calculations, the distributional requirements, and the choice of the metric to be measured. Keep diving in!
