Know These Problems with P Values
And Why It Doesn't Matter if Your P Value Is Less Than 0.05

Introduction
The p value is an important concept in frequentist statistics, and it is usually taught in introductory statistics courses.
Unfortunately, many of these courses either do a poor job of explaining what the p value can (and cannot) do or blatantly promote false propaganda relating to the role of the p value in causal inference.
This has led many undergraduate students, and even academics who should know better, to make incorrect claims in their research, all because they found a p value of less than 0.05.
The goal of this article is to clear up the myths surrounding the p value and, hopefully, to encourage data scientists to look beyond the p value in their own projects.
What is the P Value?
Any accurate explanation of the p value must first discuss Null Hypothesis Significance Testing (NHST).
NHST is a statistical procedure by which the researcher states two hypotheses: a null hypothesis and an alternative hypothesis.
The null hypothesis states that a given treatment has no effect on the target variable. The alternative hypothesis states the opposite; that is, the given treatment does affect the target variable.
For example, say that we want to determine if minimum wage laws result in a higher unemployment rate.
The null hypothesis and the alternative hypothesis might be as follows:
Null hypothesis: minimum wage laws do not affect the unemployment rate.
Alternative hypothesis: minimum wage laws increase the unemployment rate.
Usually, researchers want to reject the null hypothesis. (Rejecting the null hypothesis lends credibility to the alternative hypothesis).
However, how should we reject the null hypothesis?
This is where the p value comes in.
The p value is a number between 0 and 1 that tells researchers the probability of observing a result at least as extreme as the one in the data, given that the null hypothesis is true.
Suppose that we want to test our minimum wage hypothesis. We collect a random sample of American cities – some with minimum wage laws and some without – and find that the unemployment rate in cities with minimum wage laws is half of a percentage point higher, on average, than in cities without minimum wage laws.
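To see the mechanics, here is a rough sketch in Python of how such a comparison might be run. The city data below is simulated – the numbers are invented purely for illustration, not real unemployment figures.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented unemployment rates (in %) for two hypothetical groups of cities.
with_min_wage = rng.normal(loc=6.0, scale=2.0, size=30)     # cities with minimum wage laws
without_min_wage = rng.normal(loc=5.5, scale=2.0, size=30)  # cities without them

# Two-sample t test of the null hypothesis "both groups have the same mean rate".
res = stats.ttest_ind(with_min_wage, without_min_wage)

print(f"observed difference: {with_min_wage.mean() - without_min_wage.mean():.2f} percentage points")
print(f"p value: {res.pvalue:.3f}")
```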
Is this a victory for free market capitalism? Should we strike down minimum wage laws across the world?
Well, not quite.
In standard academic research, p values greater than 0.05 are generally not considered statistically significant – a fancy phrase to mask an arbitrary cutoff point.
Suppose that we get a p value of 0.78. That is, if we assume that the null hypothesis is true, then there is a 78% probability that we would see a difference at least as large as the half a percentage point gap between American cities with minimum wage laws and American cities without such laws.
If we tried to submit these results to an academic journal, then the editors of the journal would likely laugh us back to Medium (joke's on them, of course, because more people read Medium than the American Journal of Political Science – but I digress).
So, the p value is 0.78. That sucks. The research we worked so hard on has proved to be pointless.

But, wait!
Perhaps one advantage of the arbitrary cutoff point is that it prevents false research from entering prestigious academic journals.
Or does it? We’ll come back to this.
The (Many) Problems of P Values
1) P Hacking
We’ve just gotten rejected from our favorite academic journal because our p value is too high.
However, we are certain of our results. We were raised on the teachings of Friedman, Greenspan, and Bernanke. We know that minimum wage laws unleash havoc on the labor market.
So, we return to our research to see if we can lower our p value. First, we include more cities in our study: cities from Canada, China, Brazil, Europe, and Russia. Because a larger sample shrinks the standard error, the same observed difference produces a smaller p value.
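A quick simulation (with invented numbers) makes the point: hold the true difference fixed at half a percentage point and only let the sample grow – the p value shrinks on its own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Same (small) true difference of 0.5 percentage points; only the sample size changes.
for n in [30, 300, 3000, 30000]:
    with_law = rng.normal(loc=6.0, scale=2.0, size=n)     # cities with minimum wage laws
    without_law = rng.normal(loc=5.5, scale=2.0, size=n)  # cities without them
    p = stats.ttest_ind(with_law, without_law).pvalue
    print(f"n = {n:>6} cities per group -> p = {p:.4f}")
```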
Second, we decide to change our model. We control for features that could arguably have an effect on the city's unemployment rate – population size, location, tax rate, etc. – and features whose relationship with the unemployment rate is harder to justify – the number of Starbucks in the city, for example. Adding covariates can also reduce the p value, because the model absorbs more of the residual variance in the sample.
Now we have a p value of 0.1. Better, but still unpromising.
Finally, we change our test statistic. Perhaps we used the Student’s t test but now we decide to use a Chi-squared test and…
p value = 0.049.
Perfect. Our results are now statistically significant. The academic journal decides to publish our research.
The above procedure – affectionately labelled ‘p hacking’ by statisticians – manipulates the data analysis process until it obscures more information than it reveals.
As a researcher, you do not:
- Report that your test only becomes ‘statistically significant’ when you include non-American cities
- Report the other models you tried (which were not statistically significant)
- Report that your results are only statistically significant with a Chi-squared test; the Student’s t test yields non-statistically significant results
Unfortunately, p hacking is widespread in academic research, and it generally leads to conclusions that are difficult to reproduce. (If a research article reports a p value of 0.049, you can be reasonably sure that the researcher did some p hacking).
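The danger is easy to quantify with a toy simulation. Below, the null hypothesis is true by construction, yet a researcher who tries twenty different analysis choices and reports only the best one will "find" a significant result most of the time. (The numbers of projects and specifications are arbitrary.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_projects = 2000   # simulated research projects
n_specs = 20        # analysis choices tried per project (samples, covariates, tests, ...)
lucky_projects = 0

for _ in range(n_projects):
    # The null hypothesis is exactly true: both groups come from the same distribution.
    p_values = [
        stats.ttest_ind(rng.normal(0, 1, 50), rng.normal(0, 1, 50)).pvalue
        for _ in range(n_specs)
    ]
    if min(p_values) < 0.05:  # the p hacker reports only the 'best' specification
        lucky_projects += 1

print(f"true-null projects that still report p < 0.05: {lucky_projects / n_projects:.0%}")
# With 20 independent tries, roughly 1 - 0.95**20 (about 64%) of projects get 'lucky'.
```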
2) P Values Do Not Show Truth
Anyway, your p value is now 0.049.
Huzzah!
If the null hypothesis is true, then there is a 4.9% chance that we would observe a difference in the unemployment rate at least as large as the half a percentage point gap between cities with minimum wage laws and cities without minimum wage laws.
By the conventions of applied research, this means that we can now safely reject the null hypothesis. That is, we have supposedly disproved that minimum wage laws have zero effect on the unemployment rate.
And this is where many applied researchers go wrong.
A low p value, by itself, says nothing about the truthfulness or falsehood of the null (or alternative) hypothesis.
Don’t believe me?
Here is what the American Statistical Association (ASA) had to say on the topic:
P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
The 4.9% probability that the observation would occur if the null hypothesis were true does not translate to a 4.9% probability that the null hypothesis is true.
This conclusion should be obvious from the definition of the p value:
The p value is a number between 0 and 1 that tells researchers the probability of observing a result at least as extreme as the one in the data, given that the null hypothesis is true.
We’re measuring the probability of observing the data (or something more extreme), not the probability that a hypothesis is true.
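A toy simulation makes the distinction concrete. Suppose half of the hypotheses we ever test are truly null and the other half have a modest real effect (both shares are invented for illustration). Among the results that clear p < 0.05, far more than 4.9% turn out to be true nulls:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n_experiments = 5000
n_per_group = 30
true_effect = 0.4   # standardized effect size when there really is an effect (invented)

significant = 0
significant_but_null = 0

for _ in range(n_experiments):
    null_is_true = rng.random() < 0.5   # half of all tested hypotheses are truly null
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(0.0 if null_is_true else true_effect, 1.0, n_per_group)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        significant += 1
        significant_but_null += null_is_true

print(f"'significant' results where the null was actually true: "
      f"{significant_but_null / significant:.0%}")
```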
We’ve already seen how easy it is to manipulate the p value:
- Increase the sample size
- Complicate our model
- Use a different test statistic
However, the validity of a p value also rests on assumptions that may not hold in a given piece of research.
For example, our test assumes that we are comparing similar groups – that the American cities are comparable to the Chinese cities. This assumption is hard to defend, even with our overly complex model. For one, cities in China operate under a drastically different political system than cities in America. Our model cannot quantify this difference.
As well, the p value assumes that the treatment (minimum wage) is applied randomly.
Of course, this is not true. The fact that one American city has a minimum wage and another does not has less to do with randomness and more to do with political and economic differences.
3) A Low P Value Does Not Imply a Strong Effect
Suppose we fiddled with our model again. Now, we get a p value of 0.004. If we assume the null hypothesis is true, then there is only a 0.4% probability that we would observe a difference at least as large as half a percentage point between cities with minimum wage laws and cities without minimum wage laws.
For a moment, let’s ignore all the other problems with p values, and let’s assume that our results represent the ground truth.
Who cares?
Should we prevent millions of people from getting a living wage because that might increase the unemployment rate by half a percentage point?
Oftentimes, a low p value masks the fact that the predicted causal impact is trivial. The ASA is correct when they write:
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold
Unfortunately, many applied researchers ignore causal impact and use small p values to make policy recommendations.
Returning to our minimum wage example, it would be incorrect to decide on a policy only by considering the p value. We have to take a more holistic approach.
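One concrete way to be more holistic is to report an effect size and a confidence interval alongside the p value. In the sketch below (simulated data again, with an invented half-point true difference), the p value is tiny, yet the estimated effect – a standardized effect of roughly 0.25 – is small enough that reasonable people can disagree about whether it justifies any policy change.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n = 5000  # a large, hypothetical sample of cities per group
with_law = rng.normal(loc=6.5, scale=2.0, size=n)
without_law = rng.normal(loc=6.0, scale=2.0, size=n)

res = stats.ttest_ind(with_law, without_law)

diff = with_law.mean() - without_law.mean()
pooled_sd = np.sqrt((with_law.var(ddof=1) + without_law.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd                      # standardized effect size
se = pooled_sd * np.sqrt(2 / n)                  # standard error of the difference
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"p value: {res.pvalue:.1e}")                                # 'significant'...
print(f"difference: {diff:.2f} pp (Cohen's d = {cohens_d:.2f})")   # ...but small
print(f"95% CI for the difference: ({ci_low:.2f}, {ci_high:.2f}) pp")
```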
So, the p value sucks. What should we do?
We’ve shown how the p value can be misleading. This would not be a big problem if not for its near-universal acceptance as a sufficient indicator of causality. After all, many statistics, by themselves, can be misleading.
This raises the question: should we stop reporting the p value?
In 2015, the journal Basic and Applied Social Psychology (BASP) banned papers from reporting p values – a drastic step, in my opinion.
I believe the solution is to educate people on the use (and misuse) of p values.
The p value was never intended to be the raison d’être of statistical research. Rather, when Ronald Fisher – the godfather of frequentist statistics – popularized the p value in the 1920s, he intended it to be used as one step, among many, in causal analysis.
Despite being widely misinterpreted, the p value still provides some interesting information – but it has to be used in combination with other statistics, domain knowledge, and (a little) common sense.
Notes
- The alternative hypothesis for the minimum wage law example could also have been written as "minimum wage laws affect the unemployment rate"; however, labor economists are usually interested in the alternative hypothesis as I have written it in the article.
Bibliography
[1] Nuzzo, Regina (2014). "Statistical Errors". Nature, Vol. 506.
[2] Head, M. L., Holman, L., Lanfear, R., Kahn, A. T. & Jennions, M. D. (2015). "The Extent and Consequences of P-Hacking in Science". PLoS Biology, 13(3): e1002106. https://doi.org/10.1371/journal.pbio.1002106
[3] Cumming, Geof (2015). "A Primer on p Hacking". MethodSpace. https://www.methodspace.com/primer-p-hacking/
[4] Wasserstein, Ronald L. & Lazar, Nicole A. (2016). "The ASA Statement on p-Values: Context, Process, and Purpose". The American Statistician, 70:2, 129–133. DOI: 10.1080/00031305.2016.1154108
[5] Karpen, Samuel C. (2017). "P Value Problems". American Journal of Pharmaceutical Education, 81:9, 93.