
P-value, Hypothesis Testing and Statistical Significance

How to understand if the difference really matters.

You have probably heard of backgammon. Although it is one of the oldest board games, it has resisted the digital age and is still very common in many eastern cultures. Backgammon is a two-player game where each player has 15 checkers. The goal is to move the checkers to your corner of the board and collect them. Players roll a pair of dice and move the checkers accordingly. So it is a game where strategy and luck must work together, which I think is the main reason it has been around for so long. For this post, we are interested in the "luck" part of backgammon.

Backgammon (Figure source)

A die has six outcomes, from 1 to 6. When you roll two dice, the number of possible outcomes increases to 36 (6×6). For the sake of simplicity, I assume the game is played with a single die. The number of moves you can make depends on the outcome of the die, so if you roll higher numbers, you can move faster and increase your chance of winning. If you keep rolling sixes, your opponent will start calling you "lucky" after a few rolls. For example, it is highly unlikely that you roll a 6 three consecutive times. Now it is time to introduce the p-value.

The p-value is a measure of how unlikely an event is. This definition may make the p-value sound like the probability of an event. It is related to the probability of an event, but they are not the same thing.

The p-value is the probability of getting our observed value, or values that have the same or less chance of being observed.

Consider the dice-rolling example. Let’s define an event A as "rolling a 6 three consecutive times". Then, the probability of event A is:

P(A) = (1/6)³ = 1/216 ≈ 0.0046

This small number is the probability of event A. What is the p-value for event A? The p-value has three components:

  1. The probability of event A
  2. The probability of events that have the same chance of occurring as event A
  3. The probability of events that have less chance of occurring than event A

We already know the probability of event A. Let’s calculate the other parts. The events that have the same chance of occurring as event A are rolling a different number three consecutive times, for example, rolling a 1 three times. Since there are 6 numbers on a die, the probability of these events (excluding 6, because it is already counted) is:

5 × (1/6)³ = 5/216 ≈ 0.023

Note: Order is ignored for simplicity. If order were taken into consideration, 5–6–6 and 6–5–6 would each have the same probability as 6–6–6. Here, we treat 5–6–6 and 6–5–6 as the same event (one 5 and two 6s).

There is no event that has less chance of occurring in our case, so the probability of less likely events is zero. Therefore, the p-value for rolling a 6 three consecutive times is:

p-value = 1/216 + 5/216 + 0 = 6/216 = 1/36 ≈ 0.028

The p-value is low so we can say this is an unlikely event to occur.
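
As a quick sanity check, here is a minimal Python sketch (my own illustration, not part of the original calculation) that enumerates all 216 ordered outcomes of three rolls, groups them into unordered events as in the note above, and recomputes the same p-value. The variable names are mine.

```python
from collections import Counter
from itertools import product

# All ordered outcomes of rolling one die three times: 6**3 = 216 equally likely triples.
outcomes = list(product(range(1, 7), repeat=3))
p_each = 1 / len(outcomes)

# Probability of the observed event A: three consecutive sixes.
p_A = sum(p_each for o in outcomes if o == (6, 6, 6))

# Group outcomes by their unordered composition, as in the note above,
# so 5-6-6 and 6-5-6 count as the same event.
event_prob = Counter()
for o in outcomes:
    event_prob[tuple(sorted(o))] += p_each

# p-value: total probability of events as likely as, or less likely than, event A.
p_value = sum(p for p in event_prob.values() if p <= event_prob[(6, 6, 6)])

print(f"P(A)    = {p_A:.4f}")      # ~0.0046 (1/216)
print(f"p-value = {p_value:.4f}")  # ~0.0278 (6/216 = 1/36)
```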

P-values are mostly used in hypothesis testing. Suppose we have a website and plan to make some changes to its design to increase the traffic. We want to test whether the "new design" attracts more attention and thus increases the traffic to the website. We define two hypotheses:

  • Null hypothesis: New design does not increase the traffic
  • Alternative hypothesis: New design increases the traffic

We measure the traffic by click-through rate (CTR). When it is not possible or feasible to compare two populations, we take samples and compare the samples on behalf of the populations. In our case, the population is all the traffic to our website during its existence, which is impossible to know until the website is retired. So we take samples. We show the current design to half of our audience and the "new design" to the other half. Then we measure the click-through rate for 50 days (i.e., collect 50 samples). We calculate the CTR for all samples and find out that the average CTR of the "new design" is higher than that of the current design. Do we change the design permanently just by comparing the means? Absolutely not.

We may have received a higher CTR purely by random chance. We need to check the p-value. Before going into that step, I will mention a fundamental theorem known as the central limit theorem:

According to the central limit theorem, if we take repeated samples from a population and compute their means, the distribution of those sample means will tend towards a normal distribution regardless of the shape of the population distribution. The approximation gets better as the sample size grows and is usually considered reliable once the sample size is larger than 30.
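
To make this concrete, here is a small simulation sketch (my own illustration, not from the article) that draws samples from a heavily skewed population and shows that the sample means still behave like a normal distribution. The exponential population and the sample size of 50 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# A clearly non-normal population: an exponential (right-skewed) distribution.
population = rng.exponential(scale=10.0, size=100_000)

# Draw many samples of size n and record each sample mean.
n, n_samples = 50, 2_000
sample_means = [rng.choice(population, size=n).mean() for _ in range(n_samples)]

# Despite the skewed population, the sample means cluster symmetrically
# around the population mean, as the central limit theorem predicts.
print(f"population mean:          {population.mean():.2f}")
print(f"mean of the sample means: {np.mean(sample_means):.2f}")
print(f"std of the sample means:  {np.std(sample_means):.2f}  (~ sigma / sqrt(n))")
```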

So the distribution of the sample means for the current version is as follows:

This is the probability distribution of the sample means. As seen in the graph, the most likely CTR for the current design is around 10. We can say it is likely to observe values between 7.5 and 12.5. However, as the values keep increasing or decreasing, the probability drops significantly. The values in the tails are very extreme and highly unlikely to be observed. If the average CTR obtained from the new design is somewhere around 12.5, we can conclude that this result might be due to random chance, because it could also have been obtained with the current design. Recall that the p-value is the probability of obtaining our result, or equally likely or more extreme results. Thus, the p-value in this case is the green area in the graph below.

P-value of getting 12.5 from the new version

Let’s assume the average CTR of the new design is 15.0, which is an extreme value to obtain with the current design. The p-value of observing 15.0 with the current design can be seen in the graph below. Since it is highly unlikely to get 15.0 with the current design, we can conclude that the difference between the results of the new and current designs is not due to random chance. Thus, the new design actually increases the click-through rate of the website.

P-value of getting 15.0 from the new version
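
Here is a rough sketch of how these right-tail p-values could be computed, assuming the sampling distribution of the current design’s mean CTR is normal with mean 10. The standard deviation of 1.25 is my own guess, chosen so that roughly 95% of the mass falls between 7.5 and 12.5, since the article does not state it.

```python
from scipy import stats

# Assumed sampling distribution of the mean CTR under the current design.
# Mean 10 comes from the graph above; the standard deviation of 1.25 is a guess.
null_mean, null_sd = 10.0, 1.25

for observed_ctr in (12.5, 15.0):
    # One-sided p-value: probability of seeing a mean CTR at least this high
    # if the current design were still in place (the right-tail area).
    p_value = stats.norm.sf(observed_ctr, loc=null_mean, scale=null_sd)
    print(f"CTR = {observed_ctr:>4}: one-sided p-value = {p_value:.5f}")
```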

The p-value tells us whether the difference actually matters.

Note: We are testing whether the result of the new design is higher than that of the current design. If we were testing whether the result of the new design is different from the current design, we would also need to include the values on the left tail that have the same or lower probability than our result. In this case, the p-value becomes:

P-value if the new version is different than current version with a result of 15.0
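
Under the same assumed null distribution as in the previous sketch, the two-sided p-value simply doubles the right-tail area to include the equally extreme values on the left tail; again, this is an illustrative sketch, not the article’s own calculation.

```python
from scipy import stats

null_mean, null_sd = 10.0, 1.25   # same assumed null distribution as above
observed_ctr = 15.0

# Two-sided p-value: include the equally extreme values on the left tail as well.
z = (observed_ctr - null_mean) / null_sd
p_two_sided = 2 * stats.norm.sf(abs(z))
print(f"two-sided p-value = {p_two_sided:.6f}")   # twice the one-sided value
```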

Lower p-values provide stronger evidence that the difference is not due to random chance.

If the p-value is 0.05, there is a 5% chance that a result at least this extreme would occur purely by random chance. Informally, this is often described as being 95% confident in the result. This process of comparing two samples on behalf of the populations is known as a statistical significance test.

Statistical significance test measures whether test results from a sample are likely to apply to the entire population.

Before performing a statistical significance test, we set a confidence level, which indicates how sure we want to be of the results. If we set the confidence level at 95%, the significance level (also called alpha) is 0.05. For a test to be statistically significant, the p-value must be lower than the significance level. Assume we set the confidence level for our test at 95% and find a p-value of 0.02. Then:

  • There is only a 2% chance of seeing a CTR difference this large if the new design actually had no effect.
  • In other words, it is very unlikely that the observed increase is due to random chance alone.
  • Since 0.02 is lower than our significance level of 0.05, the results are statistically significant.
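
In practice, this whole comparison is usually run as a two-sample test. The sketch below uses made-up daily CTR values for the 50 days of the experiment and a one-sided two-sample t-test (one reasonable choice, not necessarily what was used here) to check whether the new design’s mean CTR is significantly higher.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical daily CTR measurements for 50 days per variant;
# the means and spread are invented purely for illustration.
ctr_current = rng.normal(loc=10.0, scale=2.5, size=50)
ctr_new = rng.normal(loc=11.5, scale=2.5, size=50)

# One-sided two-sample t-test: is the new design's mean CTR higher?
t_stat, p_value = stats.ttest_ind(ctr_new, ctr_current, alternative="greater")

alpha = 0.05  # significance level for a 95% confidence level
print(f"p-value = {p_value:.4f}")
if p_value < alpha:
    print("Statistically significant: reject the null hypothesis.")
else:
    print("Not statistically significant: fail to reject the null hypothesis.")
```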

We can now generalize the sample result to the entire population. In terms of hypothesis testing, this result suggests rejecting the null hypothesis and acting on the alternative hypothesis. Recall the two hypotheses:

  • Null hypothesis: New design does not increase the traffic
  • Alternative hypothesis: New design increases the traffic

We can go with the new design.

The confidence level depends on the task. 95% is a commonly used value. For sensitive tasks like chemical reactions, the confidence level might be set as high as 99.9%. In the case of 99.9% confidence level, we look for a p-value of 0.001 or less.


Thank you for reading. Please let me know if you have any questions.

