Data Science Concepts
What Exactly is a P-value Anyways?

One of the first things you must learn as a Data Scientist, and explain in many interviews, is the P-value. If you don’t have a statistical background and are looking to enter the Data Science field, you will come across the concept early, and you will likely be asked to explain it during a technical interview. The importance of the P-value is hard to overstate because it shows up in many, if not most, data science projects that involve statistical testing.
Here we will explain the concept of the P-value and how to interpret it. Even if you do not have a background in statistics or mathematics, you should understand what the P-value is and what it is for by the end of this article.
What is the P-Value?

Let’s begin by explaining what the P-value is exactly. The "P" in P-value stands for probability, so the value is a number ranging from 0 to 1. Those are the bare basics of the P-value.
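In one common kind of test, the z-test, the P-value is calculated from something called a z-score. The z-score, in turn, is computed using the standard deviation of the data. Other tests rely on different statistics, such as the t-score, but the idea is the same: a test statistic is computed from the data and then converted into a P-value.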
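As a rough illustration of that last step, here is a minimal sketch of turning a z-score into a P-value. It assumes a z-test, where the z-score follows a standard normal distribution when only chance is at work. The function name p_value_from_z and the two_sided parameter are hypothetical names chosen for this example.

```python
from statistics import NormalDist


def p_value_from_z(z: float, two_sided: bool = True) -> float:
    """Convert a z-score into a p-value using the standard normal distribution."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if two_sided:
        # Probability of a z-score at least this far from 0 in either direction.
        return 2.0 * (1.0 - std_normal.cdf(abs(z)))
    # One-sided: probability of a z-score at least this large.
    return 1.0 - std_normal.cdf(z)


print(round(p_value_from_z(1.96), 3))   # close to 0.05
print(p_value_from_z(0.0, two_sided=False))
```

A z-score near 0 gives a large P-value, and the further the z-score is from 0, the smaller the P-value becomes.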
What is the P-Value used for?

When conducting a statistical hypothesis test, the P-value is used to decide whether a result is statistically significant. What does this mean? It means the result would be unlikely to occur through random chance alone, which suggests that some other factor is at work. That factor could be the exact factor you were looking for when you first set up the test.
The Numbers in P-Value

When we conduct these statistical hypothesis tests, we are generally looking for a P-value smaller than a chosen threshold, called the significance level. That threshold is usually .05, or 5%. Why this number exactly? It is largely a convention: it is low enough to keep false alarms fairly rare, and it is where most statisticians set their significance level.
The number represents how likely it would be to get a result at least as extreme as the one observed if only randomness or chance were at work. If the value is less than .05, then we can say that the result would be unlikely under chance alone. It may not be enough to draw a conclusion on its own, but it is a start.
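One way to get a feel for the .05 threshold is to simulate many experiments in which nothing but chance is at work and count how often the P-value still falls below .05. The sketch below does that with a simple z-test on made-up random data. The number of experiments, the sample size, and the seed are arbitrary assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

random.seed(42)  # fixed seed so the run is repeatable

ALPHA = 0.05          # significance threshold
N_EXPERIMENTS = 2000  # hypothetical number of repeated experiments
SAMPLE_SIZE = 30      # hypothetical observations per experiment

std_normal = NormalDist()
false_alarms = 0

for _ in range(N_EXPERIMENTS):
    # Draw data where only chance is at work: mean 0, standard deviation 1.
    sample = [random.gauss(0.0, 1.0) for _ in range(SAMPLE_SIZE)]
    sample_mean = sum(sample) / SAMPLE_SIZE
    # z-score of the sample mean, using the known standard deviation of 1.
    z = sample_mean * SAMPLE_SIZE ** 0.5
    # Two-sided p-value: chance of a z-score at least this far from 0.
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    if p_value < ALPHA:
        false_alarms += 1

false_alarm_rate = false_alarms / N_EXPERIMENTS
print(f"Share of experiments with p < {ALPHA}: {false_alarm_rate:.3f}")
```

The share should land near 5%, which is what the threshold means in practice: even when chance is the only explanation, roughly one experiment in twenty will still look significant.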
Simple Example of the P-Value
Let’s explain the P-Value with some more context by providing an example:

Let’s say you are a baker who just got a new shipment of raw honey. But here’s the catch: you’ve never even had or heard of honey before. People are telling you that it will make your cookies sweet just like sugar does, but you don’t believe them and decide to try it for yourself. So you set out to do your own experiment with the honey.
First, you set aside two different cookie recipes: one with the honey and one without. Then, you mix and bake both of them. After taking them out of the oven, you taste each one to see if there is a difference in sweetness. The honey cookie tastes sweeter than the one without honey. However, because you are a stubborn and traditional baker who never strays from sugar, you decide that this test might have been a fluke. So you do another test, but with more cookies.

With this next test, you are baking 200 cookies: 100 with and 100 without the honey in their recipe. That is a lot of cookies, but you want to make sure that honey is actually a sweetener. So you make a simple statement of your general belief at the beginning of this test, which statisticians call the null hypothesis:
"Honey does not make these cookies sweeter"
As a result, there is also the opposing statement, which statisticians call the alternative hypothesis:
"Honey does make these cookies sweeter"
After mixing, baking, then finally tasting and giving each cookie a sweetness score, you begin doing some calculations in order to find the P-Value:
- You find the mean and Standard Deviation of the sweetness scores for each batch of cookies.
- You find the Z-Score for the difference between the two batches.
- You find the P-Value from the Z-Score.
After all that, you come up with a P-value of something very small, like .001. Since that P-value is well below the statistical significance threshold of .05, you have no choice but to reject your general statement, the null hypothesis declared in the beginning. You have come to the conclusion that honey does, with statistical significance, make your cookies sweeter.
And from now on, you might consider using honey for one of your recipes every now and then.
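The baker’s three calculation steps can be sketched in code. This is only an illustration under stated assumptions: the sweetness scores are simulated rather than real measurements, the function name honey_z_test is hypothetical, and the test is a large-sample z-test for the difference between two batch means. With smaller batches, a t-test would usually be the safer choice.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def honey_z_test(honey_scores, plain_scores):
    """Return (z, p) for the one-sided test that honey cookies score higher."""
    # Step 1: mean and standard deviation of each batch.
    mean_h, mean_p = mean(honey_scores), mean(plain_scores)
    sd_h, sd_p = stdev(honey_scores), stdev(plain_scores)
    # Step 2: z-score = difference in means divided by its standard error.
    std_error = sqrt(sd_h ** 2 / len(honey_scores) + sd_p ** 2 / len(plain_scores))
    z = (mean_h - mean_p) / std_error
    # Step 3: p-value = chance of a z-score at least this large if honey did nothing.
    p = 1.0 - NormalDist().cdf(z)
    return z, p


# Simulated sweetness scores (made up for illustration, not real measurements).
random.seed(7)
honey = [random.gauss(6.2, 1.4) for _ in range(100)]
plain = [random.gauss(5.0, 1.4) for _ in range(100)]

z, p = honey_z_test(honey, plain)
print(f"z = {z:.2f}, p = {p:.3g}")
print("Reject the null hypothesis" if p < 0.05 else "Fail to reject the null hypothesis")
```

The test is one-sided because the alternative hypothesis only claims that honey makes the cookies sweeter, not merely different.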
Conclusion
The above is one simple example of how someone might conduct a hypothesis test and interpret the P-value at the end of the experiment. Using the P-value from the test, the baker was able to conclude, with statistical significance, that cookies are sweeter with honey in them.
Hopefully, you have a clearer understanding of P-values and how to interpret them. It is important to understand them because they are everywhere in data science and will eventually come up during interviews. Feel free to use this article to prepare. Hope you enjoyed learning about math and statistics!