The defendant sits speechless on the stand. The jury thinks he is guilty. He can feel it. The last piece of evidence was devastating: a picture of him with blood on his shirt and a butcher knife in one hand could be enough to convict him. How could he explain that? How can he convince the jury that it was just a coincidence? He is not a murderer; he just happened to be in the wrong place, at the wrong time, with the wrong appearance. The defence calls the last witness to the stand. He is a mathematician. His first statement is: "I have calculated the P-value in this situation and it is much lower than 5%". The jury seems confused. The mathematician explains his calculation. The jury takes a closer look at the evidence. At the end of the trial, they agree: the verdict is not guilty.
What are P-values and what do they mean?
What is this all about? What are P-values and what do they have to do with murder trials? And what is the connection between all this and linear regression? First, we need to understand what P-values actually are. A good starting point is the concept of a statistical hypothesis. A hypothesis is a supposition, an idea in which we propose an explanation for something. It could be anything: a hypothesis is a theory that explains why something is happening. Collecting evidence can help us to confirm whether the hypothesis is true or false. One of my weirdest hypotheses is that cats somehow know what is in the fridge without opening it! I don’t have a lot of evidence to support this but it is still a hypothesis. If I want to accept or reject it, I should find a way of testing it and then gather enough evidence to confidently say that the hypothesis is true or false. With this in mind, what is a statistical hypothesis then? It is like any other hypothesis; the difference is that this time we make an assumption about the distribution of a variable. I could say that more than 75% of the golden retrievers in the world are yellow. That hypothesis tells me something about the distribution of golden retrievers in the world. A P-value is a way of testing one of these statistical hypotheses, a way of knowing whether one could disregard the hypothesis or take it into account.
The concept of P-value is strongly linked to the concept of the Null Hypothesis. Generally, if you are using P-values or thinking about using P-values, you are trying to determine if something has a real effect on something else. The hypothesis, in that case, would be something like this: the speed at which a cat moves has a relation to its colour. Maybe this is just your perception because every time you see a black cat it is jumping around whereas when you look at a siamese, it is lying around waiting for the next bowl of shrimp pate! A null hypothesis is what you are trying to disprove. In this case, the null hypothesis would be that "the colour of a cat does not affect the speed at which cats move". A good way of imagining how to structure a null hypothesis is to add "prove me wrong!" at the end. In that case, you could say things like:
- Cats sleep more than dogs. Prove me wrong!
- The average weight of hair shed by Huskies is higher than for any other dog breed. Prove me wrong!
- Cats wake up earlier than dogs. Prove me wrong!
- On average, people drink tea faster than they drink coffee. Prove me wrong!
- Everyone who is photographed with a blood-stained shirt and a knife is a murderer. Prove me wrong!
The best way of proving someone wrong is to present him or her with enough evidence to demonstrate that the hypothesis is not true. If I go back to the previous example and say "the colour of a cat does not affect the speed at which cats move (prove me wrong!)", you could find 10 non-black cats that are faster than my black cat. Would this be enough to disregard my hypothesis? Let’s say that you measure the top speed of 1000 non-black cats and the average speed is higher than the average speed of 1000 black cats. How about that? It looks like solid proof! However, you could have picked the slowest 1000 black cats in the world and the fastest 1000 non-black cats. How could I know? Maybe you were lucky… A P-value can help us to determine this. A P-value is the probability of obtaining results as extreme as the ones we are observing, assuming that the null hypothesis is true. The second part of the last sentence is the key here! A P-value is not the probability of finding a group of non-black cats that is faster than a group of black cats. A P-value is the probability of finding a group of non-black cats that is faster, in a world where a cat’s colour makes no difference to its speed. So, a small P-value tells us that, if the null hypothesis were true, our observation would be very improbable. This means that we can reject the null hypothesis. In this case, we can say "it is not true that the colour of a cat does not affect the speed at which cats move". If the null hypothesis were true, the fact that your group of non-black cats was as fast as or faster than the group of black cats would have been pure luck.
Going back to the trial, a low P-value indicates that, in a world where the defendant is guilty, there would be little chance of finding him with a blood-stained shirt and a butcher knife. In this case, even if he looks guilty, there is statistical evidence that he might not be. Let’s analyze this with a better example.
Classrooms, grades and multiverses
Prof. Matthew Coolguy is a university professor teaching History of Early Medieval Europe. He has been teaching the same course for more than 8 semesters and he is worried about the poor performance of his students. He thinks that maybe lecturing at 7:00 am is too early for most of the students and is affecting their performance. He also thinks that the course material might have something to do with this. He uses really old slides with a lot of text and black and white pictures. Maybe if he updates the slides, reduces the amount of text and includes colourful pictures the students will be more interested in his course. Finally, one of Prof. Coolguy’s colleagues told him that he also had a similar problem and he solved it by changing classrooms. There are newer and brighter classrooms in a different building that Prof. Coolguy could use. He could move his lectures to that building and see how it goes. With this in mind, Prof. Coolguy decides to run an experiment: he will choose one of these options and start implementing it in the following semesters. Then he will compare the grades of the new semesters with the ones obtained in the old semesters and he will decide if his hypotheses were true.

What would be a good way of doing this experiment? We could simply implement a change in the following semester and, if the average grade is higher than the previous average grade, we would say that this change had a positive impact on the students’ performance. Is it that simple? Let’s say that after 8 semesters the average grade is 5.9 out of 10. For the next semester, Prof. Coolguy switches the start time of his lectures from 7:00 am to 10:00 am. For the 25 students that are attending class, this is the only thing that changed; the classroom is the same, as well as the material and the slides. Let’s now say that at the end of this semester the average grade of these students was 6.9 out of 10. Is this enough to conclude that the change in the start time was beneficial for the students? Maybe Matthew just happened to find 25 excellent students who would score higher grades no matter the time at which the lectures start. On the other hand, imagine the case in which the 25 students from the next semester are really bad students with no interest whatsoever in the History of Early Medieval Europe. They would have scored bad grades regardless of the start time. What would be a better way of analyzing the results?
More than 100 years ago, a chemist working at the Guinness Brewery in Ireland devised a solution to a similar problem. Writing under the pseudonym "Student", William Sealy Gosset invented the t-test, which is useful for handling small samples in quality control, among other things. The mathematics behind the Student’s t-distribution is better covered in other sources. However, we are more interested in the application of this methodology to our problem and its relation to P-values. With the t-test we can determine if the changes we made were significant, taking into account the sample mean, its standard deviation and the number of points in the sample. This is how it works:
Step 1. Calculate the t-student value (the t-statistic)

Note how this value increases with the number of points in the sample and decreases as the standard deviation grows. The t-student value also depends on how different the population and sample means are. We will elaborate on this in the next sections.
Step 2. Calculate the probability of getting a result as extreme as the one you are getting using the Student’s t-distribution

For this step, we need to use the Student’s t-distribution curve. This curve is widely available in commercial software and in tables. The t-distribution curve looks like a normal distribution curve but with heavier tails. In addition, the shape of the t-distribution curve depends on the degrees of freedom, generally calculated as n-1, where n is the number of points in the sample.
If we are wondering about finding a group of values whose average is higher than the population average, then we look at the area under the curve that covers the values equal to or higher than the t-student value. On the contrary, if we are interested in finding a group of values whose average is lower than the population average, we have to look at the area under the curve that covers the values equal to or lower than the t-student value. There is a third option in which we calculate the area under the curve for both tails. We will come back to this when we analyze this calculation in relation to linear regression.
The area under the curve represents the P-value. So every time we see that the P-value is less than 0.05 what we are actually saying is that the area under the curve is less than 0.05. In this case, the area under the curve represents the probability of getting a value as extreme as the one we are getting. Remember that "as extreme" means "equal to or greater than" or "equal to or less than" depending on each case.
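These three ways of reading the area map directly onto scipy's t-distribution helpers. A minimal sketch (the values of `t_value` and `df` are just illustrative numbers, not results from the article):

```python
from scipy.stats import t

df = 29        # degrees of freedom: n - 1 for a sample of 30
t_value = 2.0  # an illustrative t-student value

right_tail = t.sf(t_value, df)           # P(T >= t): "equal to or higher"
left_tail = t.cdf(-t_value, df)          # P(T <= -t): "equal to or lower"
two_tailed = 2 * t.sf(abs(t_value), df)  # both tails combined

print(right_tail, left_tail, two_tailed)
```

Whichever tail (or pair of tails) matches the question being asked, the resulting area is the P-value.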
Going back to our example, let’s say that Prof. Coolguy decides to move his classes to 10:00 am. Remember, he has been teaching at 7:00 am for 8 semesters and now, for the first time, his course will be at 10:00 am, hoping that the students improve their overall scores. In this first semester after the change, he has 30 students and the average of their grades is 6.9/10, which is greater than the previous one (5.9/10). Figure 4 shows the original distribution of grades as well as a black dot that indicates the new average. The dataset used in this example was randomly generated for this exercise only; it doesn’t represent the real grades of a group of students and it was not copied from any database.

For the new group of students, the standard deviation is 2.75, which means that the t-student value is approximately 2. This is the number that we have to use in the t-distribution curve or table. In our case, for a t-distribution with 30–1=29 degrees of freedom, the cumulative probability of obtaining an average that is equal to or greater than 6.9 is 0.027. Figure 5 shows a t-distribution curve with 29 degrees of freedom as well as a black dot that represents the t-student value of 2. The region coloured in yellow represents an area of 0.05. This means that any t-student value that falls inside that area will represent a P-value of 0.05 or less. 0.05 is a common threshold used to decide if the null hypothesis can be rejected or not.
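The numbers above can be verified in a few lines of Python. This is a sketch using scipy; the grades are the ones from the example, not real data:

```python
from math import sqrt
from scipy.stats import t

pop_mean = 5.9     # average grade over the previous 8 semesters
sample_mean = 6.9  # average grade after moving the lectures to 10:00 am
sample_std = 2.75  # standard deviation of the new grades
n = 30             # number of students in the new semester

# Step 1: the t-student value
t_value = (sample_mean - pop_mean) / (sample_std / sqrt(n))  # ~2

# Step 2: area under the right tail of a t-distribution with n-1 df
p_value = t.sf(t_value, df=n - 1)  # below the 0.05 threshold

print(t_value, p_value)
```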

At this point, it is important to remember that the null hypothesis is what we are trying to disprove. In this case, the null hypothesis states that "the start time of the class does not affect the students’ performance". So, the P-value represents the probability of obtaining the results we are seeing if we assume that the null hypothesis is true. In this case, there is a 2.7% probability of finding a group of 30 students whose average grade is 6.9 or higher, assuming that the change in the start time didn’t have any effect on their performance. In other words, there is a very small probability of having a group of students with that average grade if the start time didn’t have any impact. This is the reason why a small P-value (generally less than 0.05) is taken as an indication of the relevance of that parameter or that change in the results.
Let’s analyze how a t-student value might end up inside the yellow zone that represents a P-value < 0.05. Figure 2 shows the equation used to calculate the t-student value. Note how, if the difference between the population average and the sample average increases, the t-student value will also increase, moving it closer to the yellow zone. A variation that doesn’t lead to important changes in the average has a lower probability of ending up inside the yellow zone. However, the t-student value also increases with the sample size. A small change observed in a few samples will generate a smaller t-student value than the same change observed in many samples. In the case of our example, an increment in the grades of a few students means less than an increment in the grades of many students. If Prof. Coolguy had run the experiment with 5 students instead of 30, the t-student value would probably have ended up outside the yellow zone.
There is still one thing we have to consider, which is how dispersed our sample is, that is, how different each of the points is from the average. This is measured by the standard deviation (Figure 6). A low standard deviation indicates that the values are close to the mean, while a high standard deviation indicates that the values are spread out. When we calculate the t-student value, a low standard deviation will increase it. In our case, a low standard deviation means that all the students obtained more or less similar grades, which means that the changes we made in the experiment affected most students. On the contrary, if the average grade had increased only because a few students got much better grades, the standard deviation would reflect that, since not all grades would be close to the average. This would lower the t-student value and would take us to the left of the yellow zone in Figure 5.
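A small helper function makes the effect of sample size and dispersion easy to see. A sketch, reusing the numbers from the example:

```python
from math import sqrt

def t_student(sample_mean, pop_mean, std, n):
    """t-student value for comparing a sample mean against the population mean."""
    return (sample_mean - pop_mean) / (std / sqrt(n))

# The same 1-point increase in the average grade:
print(t_student(6.9, 5.9, 2.75, 30))  # ~1.99: inside the yellow zone
print(t_student(6.9, 5.9, 2.75, 5))   # ~0.81: too few students
print(t_student(6.9, 5.9, 5.0, 30))   # ~1.10: too much dispersion
```

With only 5 students, or with a standard deviation of 5 instead of 2.75, the same improvement in the average no longer clears the 0.05 region.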

The multiverses
Going back to our example, we could ask ourselves what would happen if Prof. Coolguy implements the proposed changes and, more importantly, how good P-values would be at determining the impact of these changes. We will consider three different scenarios in which Prof. Coolguy implements each one of the three changes: start time, class material and type of classroom. In reality, it would be impossible to test these three scenarios at the same time, so we will generate three multiverses in which we will test what happens with each change. As creators of the multiverses, we know what response each variation will have. This means that we know beforehand the distribution of grades that each change will generate. What we really want to test with this is how well the P-values determine if these changes are significant or not. The following results are contained in a Jupyter Notebook that you can access through this GitHub link.
The distribution of the current grades of 150 students is shown in Figure 7. This distribution of grades has a mean of 5.7 and a standard deviation of 2.6. Figure 7 also contains a histogram of the grades for the cases in which each change is implemented.

- Scenario 1: change in the start time. This change has a positive result in most cases. These 150 students represent the world in which Prof. Coolguy changed the start time of his classes.
- Scenario 2: change in the classroom. This change has a negative impact in most cases. These 150 students represent the world in which Prof. Coolguy changed the classroom.
- Scenario 3: change in the class material. This change did not have an impact in most cases. These 150 students represent the world in which Prof. Coolguy changed the material he uses in his lectures.
In the code attached, we will define which scenario we would like to analyze and then we will randomly select a number of students from that scenario. With this sample, we will calculate the t-student value and determine if the P-value is less than 0.05, which usually means that we can reject the null hypothesis. In this case, it would mean that the modification has an impact on the students’ grades. After this, we can draw multiple samples from the scenario and see if each group of students is performing better than the original group of students.
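The sampling step can be sketched as follows. This is a minimal version, not the notebook's exact code; the scenario population below is randomly generated here purely for illustration:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(seed=1)
pop_mean = 5.7  # mean of the original 150 grades

# Hypothetical scenario population: grades after one of the changes
# (illustrative parameters, not the notebook's actual data)
scenario_grades = np.clip(rng.normal(loc=6.5, scale=2.6, size=150), 0, 10)

# Draw one semester of 30 students and test against the original mean
sample = rng.choice(scenario_grades, size=30, replace=False)
t_value, p_value = ttest_1samp(sample, popmean=pop_mean, alternative='greater')

reject_null = p_value < 0.05  # if True, the change appears to have an impact
```

Repeating the last three lines gives the multiple samples ("semesters") discussed in the scenarios below.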
Scenario 1. Change in the start time.
As we saw in Figure 7, this change has a positive impact on most of the students’ grades. Let’s say that we implemented this change for a new semester with 30 students. Figure 8.b shows a t-student distribution and the location of the original sample of 30 students we used to calculate the P-value (big black dot). In this case, the P-value is located inside the yellow area, which means that it is lower than 0.05.
Figure 8.a shows a normal distribution that represents the original students’ grades and the position of the 30 students from the new scenario with respect to that distribution (big black dot). We can see that in the new scenario the average is higher than the original one. The question now is: how accurately does this group of 30 students estimate the performance of the rest of the students in this scenario? In other words, can we trust this group of 30 students to say that any student will perform better when attending classes at 10:00 am instead of 7:00 am? After all, a P-value of less than 0.05 means that there is less than a 5% probability of getting a result such as this one if we assume that the change in the start time has no impact on the students’ grades.

Because we are the masters of this multiverse, we can see what will happen with the rest of the students if Prof. Coolguy keeps his lectures at 10:00 am. Figure 9 is similar to Figure 8, but this time it contains some additional points, which correspond to five additional iterations in which we selected 30 students from the new population. This is similar to going forward in time and assuming that Prof. Coolguy teaches 5 additional semesters with a group of 30 students each. The y-axis location of each of these points doesn’t mean anything; the points are spread out vertically only to avoid having many points together, which would make it difficult to see which one is which.

Note that in Figure 9.a all groups have an average grade that is higher than the original average. This means that for all these groups the change in the start time had a positive effect. It is also relevant to note that there is a group of students whose calculated P-value is bigger than 0.05 (Figure 9.b). If we had chosen this group of students to determine whether the change in the start time had any effect on the students’ performance, we would have concluded that it didn’t have any effect. Even though the average grade for this group of students is higher than the previous average, the dispersion in the grades as well as the number of students reduce the t-student value and place it outside the 0.05 region.
Scenario 2. Change in the classroom.
Let’s focus now on the second scenario, which is a change in the classroom. Remember that we know this change will have a negative effect on the average grade of students. If we randomly select a group of 30 students from this scenario and calculate their average grade, we obtain 4.7, which is lower than the original average of 5.7 (Figure 10.a). This time the P-value is 0.0204, which represents the area under the curve to the left of -2.1 (Figure 10.b). This means that the probability of obtaining an average grade that is equal to or less than 4.7, in a world where a change in the classroom does not impact the grades, is 2.04%.

Now, let’s draw 5 random samples from the population that corresponds to this scenario. Note how for two of these groups the calculated P-value is bigger than 0.05 (Figure 11.b). If we had chosen these groups in the first place, our conclusion could have been that the classroom change did not impact the average grade. On the other hand, there is one group of students whose average grade does not seem to be affected by the change (Figure 11.a); for this group of students, changing classrooms actually improved their grades slightly. This means that even in a case where the sample we took leads to a P-value of less than 0.05, we could have a group of individuals that were not affected by the change. And this is okay! A small P-value is just a small probability; it is not a guarantee that something will or will not happen.

Scenario 3. Change in the class material.
The third and final scenario corresponds to the world in which Prof. Coolguy changes the material he uses in his lectures. Remember that this change does not have an important effect on the grades. If we take a random sample of 30 students and calculate its P-value, we will most likely get a P-value that is bigger than 0.05 (Figure 12). However, there might be a combination of students in this scenario whose calculated P-value is less than 0.05. In that case, this group of students would suggest that the change has an impact on the grades, even though most of the sample averages remain similar to the population average.

As you can see in the previous scenarios, we can recognize four possible outcomes (Figure 13).
- The calculated P-value indicates that we can reject the null hypothesis and the change does indeed have an effect on the population: true positive
- The calculated P-value indicates that we can reject the null hypothesis but the change didn’t have an effect on the population: false positive
- The calculated P-value indicates that we cannot reject the null hypothesis and the change didn’t have an effect on the population: true negative
- The calculated P-value indicates that we cannot reject the null hypothesis but the change did have an effect on the population: false negative
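We can even measure how often a false positive shows up. The sketch below repeatedly samples from a population where the null hypothesis is true (no real effect, as in Scenario 3) and counts how often the P-value still drops below 0.05; the population parameters are illustrative:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(seed=0)
pop_mean = 5.7  # the change has no real effect: samples share this mean

false_positives = 0
n_trials = 1000
for _ in range(n_trials):
    sample = rng.normal(loc=pop_mean, scale=2.6, size=30)
    _, p_value = ttest_1samp(sample, popmean=pop_mean, alternative='greater')
    if p_value < 0.05:
        false_positives += 1

# With a true null, roughly 5% of samples still cross the threshold
print(false_positives / n_trials)
```

This is exactly what the 0.05 threshold means: even when nothing is going on, about 1 sample in 20 will look significant.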

A word of caution about P-values
Before going into linear regressions, it is very important to think about the validity and the correct interpretation of P-values. When a P-value is less than 0.05, you could be tempted to think that there is less than a 5 percent chance that the results you are getting are due to random chance, or that the probability of finding a false positive is 5 percent. However, this is not what P-values are telling us. Actually, we could end up with a false-positive rate higher than 5 percent in cases where the P-value was less than 0.05. Remember, a P-value of less than 0.05 means that there is less than a 5 percent chance of finding the results you found (or more extreme ones) in a world where the null hypothesis is true. This does not tell you anything about the number of false positives you may end up getting [1].
Regarding the 0.05 threshold, it is important to bear in mind that 0.05 is just an arbitrary limit. Why 0.05 and not 0.025? Or 0.005? If the P-value you found in your experiment is 0.06, how much worse is this experiment in comparison to one where you got 0.05? The best answer here is that you should always be cautious, whether the P-value is small or even smaller. According to one of the most-read articles in Nature, P-values should not be treated categorically; this means that we should never rush to state that something is or is not statistically significant based on its P-value alone [2]. The discussion about the validity of P-values is more alive than ever, so we should be careful about reaching conclusions quickly after calculating P-values. This also applies to linear regression, which is the topic of the next section.
What does this have to do with linear regression?
We now have a clear idea of what a P-value is. So what is the relation between P-values and linear regression? And why do P-values keep appearing in the linear regression report? The answer has to do with everything that was previously explained, but it still needs some additional thinking. At this point, you are probably familiar with a regression report like the one that comes out of statsmodels. A linear regression report like this is similar to the one obtained in MATLAB, MiniTab, etc. It contains each of the regression coefficients, a general measure of how good your fit is (R²), and the P-values, among other things. Reviewing what each of the numbers in the regression report actually means is a good exercise, but it is not the main topic of this article. If you want more information about how to read and interpret a regression report, please see reference [3]. Our interest at this point is to learn how the P-values we see in the regression report are calculated and how to interpret them.
As we said before, to understand what a P-value is telling us, we first need to understand the null hypothesis that the P-value refers to. In the case of linear regressions, the null hypothesis says: "Independent variable X has no effect on dependent variable Y". This means that the P-value expresses the probability that the regression coefficient takes a value at least as extreme as the one presented in the report, assuming that that particular independent variable has no influence over the dependent variable. Let’s imagine this: you are trying to find a regression between dependent variable Y and independent variables X1, X2 and X3. Someone tells you: "There is no way that X2 has something to do with Y, prove me wrong!". After running the linear regression, you find that the regression coefficient for X2 is 23 and its P-value is 0.001. What this means is that if you don’t consider the effect that X2 has over Y, then there is only a 0.1% probability of getting a fit as good as the one you got. In this case, we would say that we can reject the null hypothesis: it is not true that X2 does not have an impact on Y. In other words, a small P-value means there is only a small chance of seeing results like these if the variable truly had no effect [4]. Let’s consider this now: after running the linear regression, you find that the regression coefficient for X2 is 23 but its P-value is 0.75. We cannot reject the null hypothesis now, because if it’s true that X2 has nothing to do with Y, then there is still a 75% chance of getting a fit as good as the one we got. This time we could say that X2 does not have an impact on Y.
Let’s analyze this with another example. The following picture (Figure 14) shows the average mass of American women aged 30–39 as a function of their height [5]. This dataset is available on the Simple linear regression page of Wikipedia. Using linear regression, we can fit this data to a function of the form: Weight = a0 + a1*Height. The blue line in Figure 14 represents this function. According to the results of the linear regression, the coefficients that best fit the data are a0=-39.062 and a1=61.2722.
![Figure 14. Linear regression for the Weight - Height dataset. (Plot made by the author. Dataset was taken from [5]).](https://towardsdatascience.com/wp-content/uploads/2022/05/1liebxIQuPmTkoIua50np4Q.png)
We can use statsmodels to calculate the P-value for each of the coefficients (Figure 15). In this case, we get values that are considerably smaller than 0.05. In the linear regression world, this means that both coefficients are important and we cannot disregard either of them. What this really means is that if we assume that a0 or a1 don’t have any effect on the weight, then there would be very little chance of getting a fit such as the one we got. This is similar to saying that we have statistically significant evidence that these two coefficients are not zero.

Let’s dig a bit more into this. Consider a relationship between height and weight like the one shown in Figure 16. In this case, the slope of the line is almost zero, which means that there is not a strong relationship between height and weight, since regardless of the height the weight seems to be always approximately 65. Without doing any other calculation, we could reason that, in this case, height is not really important if we want to fit the data points to a curve.

If we calculate the P-values for this scenario, we find that the P-value for the coefficient that multiplies the height is 0.7302, which is higher than the usual threshold of 0.05 (Figure 17). What does this mean? In the linear regression world, this means that we don’t need this coefficient to fit the data points, which is what we had found before by visually inspecting the plot. Using what we already know about P-values, this number means that the probability of getting a fit as good as the one we got without considering the height is 73%. In other words, we have good chances of fitting the data even if we don’t take the height into account, which is what we already knew after looking at the data.

How to calculate the P-values in a linear regression?
At this point, the only thing we need to know is how to calculate the P-values in a linear regression. If you have read this far, then you know that the P-values represent an area under the curve of a t-student distribution. Depending on the case, we might calculate the area under the left tail, the right tail or both. So, to obtain the P-value we first need to calculate the t-student value. For linear regressions, this is calculated as:

Note that b0 represents the value that coefficient b would take if the null hypothesis were true, which is zero. So, the only thing we need to calculate the t-student value is the standard error of the coefficient. The standard error is a measure of how precise each coefficient is; we can think of it as the standard deviation of a coefficient. However, since each coefficient could have a different order of magnitude, the standard error on its own is not very helpful to indicate how precise each coefficient is. This is the reason why we use P-values in linear regressions! We take each coefficient, divide it by its standard error and then locate this value in the t-distribution curve.
To calculate the standard error we first need to calculate the standard deviation of the distribution which is generally represented by σ (sigma):

Once we have the standard deviation of the distribution, the standard error is calculated using the (X’X)^-1 matrix (Figure 20). If you have worked with linear regression before you will be familiar with this. Note that the value that goes in (X’X)^-1 (written in red) is different for each of the parameters. Once you have the SE, you can calculate the t-student by dividing the value of the coefficient by the SE.

Since we are interested in the absolute value of each coefficient, the P-value is calculated considering both tails. This means that once we have the t-student value, we calculate the area under the curve to the right of t-student and to the left of -t-student. The P-value then represents the probability of getting a fit such as the one we got with a coefficient that is greater than or equal to bj, or less than or equal to -bj.
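Putting the pieces together, the P-values in a regression report can be reproduced by hand. The sketch below does it for the height-weight data from [5], using only the steps described above (least-squares coefficients, σ² from the residuals, standard errors from the diagonal of σ²(X'X)⁻¹, then the two-tailed area):

```python
import numpy as np
from scipy.stats import t

# Height (m) and mass (kg) of American women aged 30-39 [5]
height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93,
                   61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

X = np.column_stack([np.ones_like(height), height])  # design matrix [1, height]
b = np.linalg.lstsq(X, weight, rcond=None)[0]        # coefficients [a0, a1]

n, k = X.shape
residuals = weight - X @ b
sigma2 = residuals @ residuals / (n - k)             # estimated variance sigma^2

# Standard error of each coefficient: sqrt of the diagonal of sigma^2 * (X'X)^-1
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

t_values = b / se                                    # b0 = 0 under the null
p_values = 2 * t.sf(np.abs(t_values), df=n - k)      # two-tailed area

print(t_values, p_values)                            # both P-values far below 0.05
```

These values match the ones printed in the statsmodels report for the same data.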

Conclusion
P-values give us some information about the impact of a variation in a population. In linear regressions, they help us get an idea of how important a variable is if we want to use it to fit some observed data. It is true that they can sometimes be misleading and that we could find better ways of performing this analysis. However, if we interpret them correctly, there is valuable information we can extract from them. It is always important to remember what they mean, how they are calculated and why the 0.05 convention is usually applied. Unlike in the example of Prof. Coolguy, we cannot create many multiverses and analyze what will happen in each one of them (not for now, at least!). But we could be like the mysterious mathematician in the trial described at the beginning of this article. We need to embrace the fact that most things in life are uncertain. Statistical analysis, which includes P-values, is our best guide on this uncertain path.
References
- Resnick, Brian (2017). What a nerdy debate about p-values shows about science – and how to fix it. Vox
- Amrhein, V., Greenland, S., McShane, B. (2019). Scientists rise up against statistical significance. Nature 567, 305–307
- McAller, Tim. (2020). Interpreting Linear Regression Through statsmodels .summary(). Medium
- Princeton University Library. Interpreting Regression Output.
- Wikipedia contributors. Simple linear regression. Wikipedia, The Free Encyclopedia. May 3, 2022, 00:13 UTC. Available at: https://en.wikipedia.org/w/index.php?title=Simple_linear_regression&oldid=1085888208. Accessed May 13, 2022.