What is the p-value?

A detailed explanation of p-value

Chia-Yun Chiang
Towards Data Science


If you google “what is p-value”, the first result shown on the page is the definition from Wikipedia:

“ The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.” — Wikipedia

Hmm… the good thing is that this definition is correct; the bad thing is that it is worded so precisely that it is hard to understand.

So… today’s game is to break this sentence down!

Since it’s the holiday season (hooray!), let’s invite the Gingerbread Man to join us for the fun!

Gingerbread Man: Are you sure? p-value…for fun??? (Photo by Jill Wellington on Pexels)

Santa Claus’s Cookie Shop

Santa Claus’s cookie shop sells its famous product: gingerbread cookies! Santa is very proud of his cookies. He believes his product is the most delicious in the world. Santa also says that the average weight (μ) of each product (a bag of gingerbread cookies) is 500g.

It is the 21st century now, and Santa has his own factories and automated machines to help him make cookies. As you know, no machine or production process is perfect, so there is some variation between different bags of cookies. Assume that we know the weight of a bag of cookies is normally distributed with a standard deviation (σ) of 30g.

So, if Santa’s claim is true (the average weight of one bag of cookies = 500g), we could expect the distribution of the weight of one bag of cookies to look like the picture below:
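We can sketch this claimed population ourselves. Below is a minimal Python simulation assuming the Normal(500, 30) population from Santa’s claim; the numbers are simulated for illustration, not real shop data:

```python
import random
import statistics

random.seed(42)

# Simulate 100,000 bags of cookies under Santa's claim:
# weights normally distributed with mean 500g and sd 30g.
weights = [random.gauss(500, 30) for _ in range(100_000)]

print(round(statistics.mean(weights), 1))   # close to 500
print(round(statistics.stdev(weights), 1))  # close to 30
```

With this many simulated bags, the empirical mean and standard deviation land very close to the claimed μ = 500g and σ = 30g.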

Can We Believe in Santa’s Words?

But, as a curious customer who really loves gingerbread cookies, I wonder… does the average weight of a bag of cookies really equal 500g? What if Santa deceives customers and gives us less than 500g of cookies? How do we validate Santa’s claim?

Here is where “hypothesis testing” comes in.

To implement hypothesis testing, we first set up our null hypothesis (H0) and alternative hypothesis (H1). As reasonable people, we should not suspect others without any evidence, so we assume Santa is honest about his business (H0). If we want to check whether his bags of cookies weigh less than 500g, we need to collect data and gather enough evidence to support our guess (H1). So… we set up the hypothesis statements as follows:

H0: Average weight of one bag of cookies (μ) = 500g
H1: Average weight of one bag of cookies (μ) < 500g

Since we are unsure what our population distribution looks like, I use a dashed line to represent the possible distributions. If Santa’s claim is true, we could expect the weight of one bag of cookies to follow a distribution with a mean of 500g (left picture). However, if Santa’s claim is not true and the mean weight is less than 500g, the population distribution would look different (any of the right pictures).

Cool! The problem statement is set. So now, the next question is: how do we test our hypothesis statements?

Maybe we could just weigh all the bags of cookies so that we would know the exact population distribution? Well… obviously, it is IMPOSSIBLE for us to collect ALL the cookies (the population) produced by Santa Claus’s cookie shop!!! So… what should we do?

Here, “inferential statistics” comes in handy!

Core Concept of Inferential Statistics

In inferential statistics, what we are interested in is the population parameters (attributes of the population). However, it is almost impossible to collect the data of the whole population to calculate these parameters. As a result, we sample from the population to get sample data. Then, we calculate statistics (attributes of the sample) from the sample data and use them as estimators to infer the unknown population parameters. (As the picture below shows.)

Examples of parameters and statistics:
- parameters: population mean (μ), population standard deviation (σ) …
- statistics: sample mean (x̄), sample standard deviation (s) …
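As a small illustration of the parameter/statistic distinction, here is a Python sketch in which we pretend to know the population parameters, draw one sample, and compute the corresponding sample statistics (the variable names are my own, not from the article):

```python
import random
import statistics

random.seed(0)

# Population parameters: unknown in practice, assumed here for illustration
mu, sigma = 500, 30

# Draw one sample of 25 bags and compute sample statistics from it
sample = [random.gauss(mu, sigma) for _ in range(25)]
x_bar = statistics.mean(sample)   # sample mean, estimates mu
s = statistics.stdev(sample)      # sample standard deviation, estimates sigma

print(round(x_bar, 1))  # near 500, but not exactly 500
print(round(s, 1))      # near 30, but not exactly 30
```

The sample statistics wobble around the true parameters from sample to sample, which is exactly why we need the sampling distribution discussed below.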

Testing our hypothesis statements is also an inferential statistics task. The process is the same as above, but now we are not interested in a single unknown parameter; instead, we are interested in the question “can we reject the null hypothesis?”

How do we answer this question? The same method is used: we calculate a statistic from our sample data to infer the answer. The statistic used here is called the test statistic.

Great! So now, we know we should collect sample data and calculate the test statistic in order to test the hypothesis statement.

But…let’s pause for a second. Before jumping into the testing part, let me quickly review the concept of sampling distribution to make sure we are on the same page.

Sampling Distribution Review

A sampling distribution is the distribution of a sample statistic.

Let’s take one of the statistics, the sample mean (x̄), as an example. If we sample from the population many times, we get many sample datasets (sample 1 to sample m). Then, if we calculate the sample mean (x̄) from each sample dataset, we get m data points of the sample mean (x̄). Using these data points, we can draw a distribution of the sample mean (x̄). Since this distribution comes from a sample statistic, we call it the sampling distribution of the sample mean (x̄).

The same idea applies to other statistics. For example, if we calculate the test statistic from each sample dataset, we could get the sampling distribution of the test statistic.

A sampling distribution is like any other distribution: it shows how likely each value of the statistic is to appear if we sample from the population many times.

I’ll use a brown curve to represent the sampling distribution in the following sections.

Nice! Now, it’s time to jump into the testing part!

Testing Hypothesis Statements

The first thing we need is a sample dataset. So, I go to Santa Claus’s cookie shop and randomly pick 25 bags of cookies (n = 25) as our sample data. The mean weight (x̄) of this sample turns out to be 485g.

The first part of testing is to compare our sample statistic to the null hypothesis so that we know how far our sample statistic is from the expected value.

To do so, we first assume the null hypothesis is true. What does this mean? In our case, it means we assume the population mean weight of one bag of cookies really equals 500g. If this statement is true, then according to the Central Limit Theorem, the sampling distribution of the sample mean (x̄) would look like the picture below (the mean of the sample means = 500g) if we sampled from this population many times.
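A quick simulation (a sketch under the Normal(500, 30) assumption) shows the Central Limit Theorem at work here: the means of many size-25 samples center on 500g, with a spread (the standard error) of σ/√n = 30/√25 = 6g:

```python
import math
import random
import statistics

random.seed(1)

mu, sigma, n = 500, 30, 25

# Under H0, repeatedly draw samples of 25 bags and record each sample mean
sample_means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(20_000)
]

print(round(statistics.mean(sample_means), 1))   # close to 500
print(round(statistics.stdev(sample_means), 2))  # close to 6
print(sigma / math.sqrt(n))                      # 6.0, the standard error
```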

p-value definition:
“ The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.” — Wikipedia

So now, if the null hypothesis is true, we can easily see that our sample mean is 15g below (485 − 500 = −15) the expected mean value (500g).

Hmm… but “15g” is only a number, which is not very helpful on its own. Also, if we want to calculate the probability under the curve, it is inefficient to do so case by case (imagine there are numerous distributions, each with its own mean and standard deviation… you really don’t want to calculate the probability over and over again…)

So, what should we do? We standardize our values so that the mean of the distribution always equals zero. The benefit of standardization is that statisticians have already generated a table of the area under the curve for each standardized value, so we don’t need to calculate the area case by case. All we need to do is standardize our data.

How do we standardize? In our case, we use the z-score to transform our data. The z-score is the test statistic in our case.

The picture below shows the sampling distribution of the test statistic (z-score). We can see that if our sample mean exactly matched the null hypothesis (population mean = 500g, sample mean = 500g), the test statistic would equal 0. In our case, our sample mean equals 485g, which gives us a test statistic of −2.5 (−2.5 is the test result we observed from our sample data). This indicates that our sample mean is 2.5 standard errors below the expected value.
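The z-score calculation can be written out in a few lines of plain Python, plugging in the numbers from our case:

```python
import math

mu = 500        # mean under the null hypothesis
sigma = 30      # known population standard deviation
n = 25          # sample size
x_bar = 485     # observed sample mean

standard_error = sigma / math.sqrt(n)   # 30 / 5 = 6
z = (x_bar - mu) / standard_error       # (485 - 500) / 6
print(z)  # -2.5
```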

p-value definition:
“ The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.” — Wikipedia

Here, I want to mention that the test statistic is chosen based on the situation. You might have heard of different kinds of statistical tests, such as the z-test, t-test, chi-square test… Why do we need different kinds of tests?

Because we might need to test different types of data (categorical? quantitative?), we might have different purposes for testing (testing a mean? a proportion?), our data might follow a different distribution, or we might only know limited attributes of our data… Hence, choosing a suitable testing method is another crucial task.

In this case, since we are interested in testing the mean value, and I assume our population data is normally distributed with a known population standard deviation (σ), we choose the z-test.
** Please refer to the assumptions of the different kinds of statistical tests if you are interested in when to use each of them. **

Okay, so now we have our test statistic. We know how far our test statistic is from the expected value when the null hypothesis is true. Then, what we really want to know is: how likely is it (what is the probability) that we get this sample data if the null hypothesis is true?

To answer this question, we need to calculate a probability. As you know, the probability between one point and another is the area under the sampling distribution curve between those two points.

So here, we do not calculate the probability of a specific point; instead, we calculate the probability from our test statistic out to infinity. This is the cumulative probability of all the points that are even farther from the expected test statistic than ours.

This cumulative probability is our p-value.

You might wonder why we don’t calculate the probability of the specific test statistic (one point). Here are two possible explanations I found in this post:

(1) Mathematically, the probability of a specific point on a continuous probability curve is zero. To calculate a probability, we need to calculate the area under the curve.

(2) To decide whether we should reject the null hypothesis, we compare the p-value to the significance level. Since the significance level is a cumulative probability, we need the p-value in the same form to compare the two. Hence, the p-value should also be a cumulative probability. (I’ll introduce the significance level later.)

p-value definition:
“ The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.” — Wikipedia

Wonderful! We just explained all the parts of the p-value definition!

Let’s calculate the p-value in our case. As I mentioned before, the benefit of using a standard test statistic is that statisticians already understand the attributes of its sampling distribution. So we can just look up a z-table, or use any statistical software, to get the p-value.

In our case, the p-value equals 0.0062 (0.62%).
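If you would rather compute this than look it up in a z-table, the standard normal CDF can be built from the error function in Python’s standard library (a sketch; statistical packages provide the same value directly):

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z), built from the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = -2.5
p_value = normal_cdf(z)   # left-tailed: P(Z <= -2.5)
print(round(p_value, 4))  # 0.0062
```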

Please note that we are doing a one-tailed test in our case. That is, we only consider one direction of the distribution when calculating the “extreme” probability. Since our alternative hypothesis (H1) states that the mean value is less than 500g, we only care about values less than our test statistic (the left-hand side).

We can identify which tail to focus on based on the alternative hypothesis. If the alternative hypothesis states:
(1) the attribute of interest is less than (<) the expected value: focus on the left tail
(2) the attribute of interest is greater than (>) the expected value: focus on the right tail
(3) the attribute of interest is not equal to (≠) the expected value: focus on both tails

Notes:
attribute of interest: mean, proportion… (in our case, the mean)
expected value: a specific number… (in our case, 500g)
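The three cases above can be sketched as one small helper function (my own illustration; `p_value` and its `tail` argument are hypothetical names, not a standard API):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def p_value(z, tail):
    """p-value for a z test statistic; tail is 'left', 'right', or 'two'."""
    if tail == "left":
        return normal_cdf(z)                 # P(Z <= z)
    if tail == "right":
        return 1 - normal_cdf(z)             # P(Z >= z)
    if tail == "two":
        return 2 * (1 - normal_cdf(abs(z)))  # P(|Z| >= |z|)
    raise ValueError("tail must be 'left', 'right', or 'two'")

print(round(p_value(-2.5, "left"), 4))  # 0.0062, our left-tailed case
print(round(p_value(-2.5, "two"), 4))   # 0.0124, if H1 were "not equal"
```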

Now, we have p-value = 0.0062. Hmm… it is a small number…but what does this mean?

This means that, under the condition that our null hypothesis is true (the population mean really equals 500g), if we sampled from this population 1000 times, we would expect to get this sample (sample mean = 485g), or another sample with a sample mean below 485g, only about 6.2 times.
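We can check this interpretation with a Monte Carlo sketch: repeatedly sample 25 bags from the null-hypothesis population and count how often the sample mean comes out at 485g or below (simulated data under the Normal(500, 30) assumption):

```python
import random
import statistics

random.seed(7)

mu, sigma, n = 500, 30, 25
trials = 50_000

# Under H0, count how often a sample of 25 bags has a mean <= 485g
extreme = sum(
    statistics.mean(random.gauss(mu, sigma) for _ in range(n)) <= 485
    for _ in range(trials)
)

print(extreme / trials)  # roughly 0.0062, i.e. about 6.2 per 1000 samples
```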

In other words, if we get sample data with a sample mean of 485g, there are two possible explanations:

1. The population mean really equals 500g (H0 is correct), and we were very “lucky” to get this rare sample! (About 6.2 times out of 1000 samplings.)

Or…

2. The assumption that “the null hypothesis is true” is incorrect. This sample data (sample mean = 485g) actually comes from another population distribution in which a sample mean of 485g is much more likely to occur.

Cool! So now we know that if our p-value is very small, either we got a very rare sample or our assumption (that the null hypothesis is true) is incorrect.

Then, the next question is: we have the p-value now, but how do we use it to decide when to reject the null hypothesis? In other words, how small does the p-value need to be before we are willing to say that this sample comes from another population?

Here, let’s introduce the judgment standard: the significance level (α). The significance level is a pre-defined value that needs to be set before implementing the hypothesis test. You can see the significance level as a threshold, which gives us a criterion for when to reject the null hypothesis.

This criterion is set as below:

if p-value ≤ significance level (α), we reject the null hypothesis (H0).
if p-value > significance level (α), we fail to reject the null hypothesis (H0).

Say I set my significance level to 0.05.

In the picture below, the red area is the significance level (in our case, 0.05). We use the significance level as our criterion: if the p-value falls within the red area (less than or equal to α), we reject H0; if the p-value exceeds the red area (greater than α), we fail to reject H0.

Here, I want to mention that the significance level (α) also indicates the maximum risk of a type I error that we are willing to accept (a type I error means rejecting H0 when H0 is actually true).

It is easy to see why in the picture below. The distribution curve shown assumes the null hypothesis (H0) is true, and the red area is the probability that we decide to reject the null hypothesis even though it is true. If the p-value equals the significance level (0.05 in our case), that is the maximum probability of incorrectly rejecting H0 when H0 is true.

In our case, the p-value = 0.0062, which is smaller than 0.05; as a result, we can reject our null hypothesis. In other words, based on our test, we are sad to say that we have enough evidence to support our alternative hypothesis (the mean weight of one bag of cookies is less than 500g). And that means… we have enough evidence to say that Santa is cheating us…

Well… what would happen if we changed the significance level to 0.005?

The result would be different. Since 0.0062 > 0.005, we would fail to reject H0. So here is the tricky part: because the significance level is subjective, we need to determine it before testing. Otherwise, we would be very likely to cheat ourselves after seeing the p-value.
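The decision rule, and its sensitivity to the pre-defined threshold, fits in a few lines (a sketch; `decide` is a hypothetical helper name):

```python
def decide(p_value, alpha):
    """Compare the p-value to a pre-defined significance level."""
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"

p = 0.0062
print(decide(p, alpha=0.05))   # reject H0
print(decide(p, alpha=0.005))  # fail to reject H0
```

Same data, same p-value, opposite conclusions, which is exactly why α must be fixed before looking at the result.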

We have enough evidence to support that you cheat on us!!! (Photo by Andrea Piacquadio on Pexels)

Recap

Thank you for reading to this point. Let’s have a quick recap to close today’s game!

What is the p-value?

Part 1: To test whether our sample data supports the alternative hypothesis, we first assume the null hypothesis is true, so that we can measure how far our sample data is from the expected value given by the null hypothesis.

p-value definition:
“ The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.” — Wikipedia

Part 2: Based on the distribution, data types, purpose, and known attributes of our data, choose an appropriate test statistic, and calculate it from our sample data. (The test statistic shows how far our sample data is from the expected value.)

p-value definition:
“ The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.” — Wikipedia

Part 3: Calculate the probability (the area under the sampling distribution curve) from the test statistic out to infinity (i.e., more extreme values) in the direction indicated by your alternative hypothesis (left-tailed, right-tailed, or two-tailed).

p-value definition:
“ The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct.” — Wikipedia

This cumulative probability is the p-value.

What is the meaning of a small p-value?

If we have a very small p-value, it might indicate two possible meanings:
(1) we were very “lucky” to get this rare sample data!
(2) this sample data does not come from our null-hypothesis distribution; instead, it comes from another population distribution. (So we consider rejecting the null hypothesis.)

How to use p-value?

To determine whether we can reject the null hypothesis, we compare the p-value to the pre-defined significance level (threshold).

if p-value ≤ significance level (α), we reject the null hypothesis (H0).
if p-value > significance level (α), we fail to reject the null hypothesis (H0).

** Thank you for reading! Any feedback, suggestions, and comments are welcome. I would be grateful if you could let me know what you think and point out any errors in my article. **

Icon Attribution

Gingerbread Man icon made by iconixar from Flaticon

Paper Bag icon made by catkuro from Flaticon
