People are talking about “reproducibility crises” and “p-values” and I feel like I should understand but my eyes are already glazing over… can you help?

Matt Brems (he/him)
Towards Data Science
9 min read · Aug 22, 2017


I don’t know how much of a market there is for answering questions like this, but if anyone knows of one, maybe I’ll become the “Dear Abby” of data science. (“Dear Statsy” has some potential.)


Lame jokes aside, let’s talk about this “reproducibility crisis” or “replication crisis.” What is this, and why should I care?

In many scientific fields, results from studies are usually validated through something called the “peer review process.”

Let’s say that I’ve developed a medication for arthritis that I believe to be superior to the current medication used by doctors. I get a group of people together, give some of them “Matt’s Meds” and some of them “the current meds,” then follow up and see how arthritis symptoms have changed between the “Matt’s Meds” group and the “current meds” group. If the “Matt’s Meds” group seems to have less arthritis pain, I might try to publish my results so that everyone knows that my medication is better for people.

But wait. Maybe you suffer from arthritis, or your partner or parent or friend suffers from arthritis. You have no reason to trust my findings. In fact, you’re likely suspicious. You only want the best for your family and friends, but I might be looking to get rich off of selling my drug… so how do you know that what I’m doing is in the best interest of you and your family?

This is where the “peer review process” comes into play. If I want to publish my results in a medical journal, there are literally referees who read my work to make sure that I followed best practices.

Who knew that academics could be so… sporty?

These referees might ask questions like:

  • Does the author have any conflicts of interest here that may affect how he/she wrote the article?
  • Does the design of the experiment make sense, or does it appear that the experiment was designed in a sub-par way?
  • Is the analysis fair and balanced, or were results cherry-picked to favor a particular outcome?

After an article has been “refereed” (and possibly revised) enough times, it may be published so that everyone can read it. Readers can continue to “unofficially referee” these articles, so that if a less-than-perfect article makes it through the refereeing process, the author or the journal that published it may revise the article or retract it entirely. (Depending on the journal, though, the number of individuals who actually read a given article is usually quite low.)

In theory, if I published findings that conclude “Matt’s Meds” are significantly better than the existing medication, you should be able to run an identical study on a similar set of patients and get similar results. (In statistical terms, we mean patients that come from the same “sampled population.”)

The term “reproducibility crisis” refers to the fact that, when we attempt to reproduce the experiment, we frequently do not observe the same results.

This is a problem.

If I run two experiments with the same exact setup and reach two different conclusions… this is incredibly problematic. Basically, one of two things happened:

  • Our original experiment’s conclusions were wrong.
  • Our replicated experiment’s conclusions were wrong.

Since there isn’t anything to differentiate the two experiments from one another aside from the selection of participants, it’s impossible to determine which of the above two things actually happened.

If an effect can’t be reproduced, we have to question whether our original conclusion was legitimate. And if we can’t trust that original conclusion, then that conclusion and any research built on it are brought into question.

Wait. So…

Yep. Unfortunately, peer-reviewed journals contain far more results that can’t be reproduced than we’d like.

So what can we do?

Well, we want to find some way to stop this crisis. There are lots of proposed methods for preventing this from happening, but one very public proposal involves… p-values.

WOO!

UGH. MATT.

I know, I know. Stick with me.

So, like, what actually is a p-value?

Super formally… if we run an experiment, then the p-value is the probability that, if the null hypothesis is true and we re-run our experiment, we get a test statistic as extreme as or more extreme than the one we observed in our original experiment.

Um. Can you make that a little less formal?

Suppose I argue that, on average, I eat four Chipotle burritos in a week. You think that’s absurd and ridiculous and all sorts of things but you follow me around for ten weeks in a row and track what I eat. Let’s say that, over these ten weeks, I eat an average of 3.5 burritos per week.

The p-value is a way for us to say, assuming I was telling the truth when I said I ate four Chipotle burritos a week on average, how extreme is it that you observed me over ten weeks and saw that I only ate an average of 3.5 burritos? P-values allow us to quantify how different our observed results (3.5 burritos on average) are from my initial claim (4 burritos on average).

Me eating an average of four Chipotle burritos per week is, sadly, not unrealistic.
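
If seeing this in code helps, here’s a minimal sketch of the burrito example in Python. The ten weekly counts below are made up for illustration (all you actually measured was the 3.5-burrito average), and treating this as a one-sample t-test is my assumption, not part of the original claim:

```python
# A toy sketch of the burrito example. The weekly counts are invented so that
# they average 3.5; only the 10-week average comes from the story above.
import numpy as np
from scipy import stats

weekly_burritos = np.array([4, 3, 4, 3, 5, 3, 4, 2, 4, 3])  # mean = 3.5

# Null hypothesis: I really do eat an average of 4 burritos per week.
# The one-sample t-test asks: if that were true, how unusual would a
# 10-week average this far (or farther) from 4 be?
t_stat, p_value = stats.ttest_1samp(weekly_burritos, popmean=4)

print(f"observed mean: {weekly_burritos.mean():.2f}")
print(f"p-value: {p_value:.3f}")
# Large p-value  -> the data look consistent with "4 burritos a week."
# Small p-value  -> the data look surprising if that claim were true.
```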

What about informally?

p-values simply measure how extreme the results of an experiment are.

  • Large p-values mean that our experimental results are in line with what we’d expect if our initial claim (the null hypothesis) were true.
  • Small p-values mean our experimental results are pretty different from what we’d expect under that claim.

Small p-values are used to say that the results of an experiment are statistically significant, which is this magical term we’ve used to say that we have enough evidence to make some conclusion about the thing we’re trying to study.

What do you mean when you say “small p-values” or “large p-values?”

Well, society needed some way to decide “is there enough evidence here to conclude what we believe to be true is actually true?” Humans are pretty good at following explicit rules, but are way less good at just making decisions by “feeling.”

Because of this, we’ve historically had some threshold to dictate “Yep, our p-value is small enough to make conclusions” or “Nope, our p-value is too big to make these conclusions.” Historically, we’ve used 5% as this threshold. Some fields use different values, and we could go down a rabbit hole discussing p-hacking and multiple testing, but that’s beyond the scope of this article.

The p-value was popularized by Ronald Fisher.

Ronald Fisher, “Father of Modern Statistics” (and my academic great-great-great-grandfather if my Master’s thesis advisor counts as my academic mother), popularized the p-value and the 5% threshold for “significance.”

  • You run an experiment and get a p-value of 4%. Your results are good and the conclusion you wanted is correct! You’re gonna get published!
  • You run an experiment and get a p-value of 7%. Your results aren’t significant and the conclusion you wanted must be wrong. (The tiny snippet below spells out this bright-line rule.)
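
Just to make that bright-line rule explicit, here’s what the decision boils down to in code. (The is_significant name and the 5% default are mine, purely for illustration.)

```python
# The historical bright-line rule: compare the p-value to a fixed threshold.
def is_significant(p_value, alpha=0.05):
    """Return True if the p-value clears the (historical) 5% threshold."""
    return p_value < alpha

print(is_significant(0.04))  # True  -> "significant," off to the publisher
print(is_significant(0.07))  # False -> "not significant"
```

This is exactly the kind of mechanical yes/no rule the rest of this article pushes back on.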

So, like, this sounds good. What’s the problem?

Cristina Yang of Grey’s Anatomy fame is always right.

Well, two things:

  • First, because of how p-values behave, if we use 5% as this threshold, I can expect to detect a “significant” result about 1 out of every 20 times even when there is no real effect to find. So if there are 5,000 surgical studies each year that should have non-significant findings and all use this 5% threshold, I’d expect about 250 of them to be incorrectly interpreted as significant (the short simulation after this list sketches that arithmetic). If I’m going into surgery tomorrow, as the patient or the doctor, those don’t sound like amazing odds.
  • Second, the p-value doesn’t measure the right thing.
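
Here’s a small simulation of that first bullet (my own sketch, with made-up group sizes, not data from any real surgical studies): 5,000 “studies” where nothing real is going on, each tested at the 5% threshold.

```python
# Simulate 5,000 studies in which the null hypothesis is actually true,
# then count how many still come out "significant" at the 5% threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_patients, alpha = 5_000, 50, 0.05

false_positives = 0
for _ in range(n_studies):
    # Both groups are drawn from the same distribution: there is no real effect.
    control = rng.normal(loc=0, scale=1, size=n_patients)
    treatment = rng.normal(loc=0, scale=1, size=n_patients)
    _, p = stats.ttest_ind(control, treatment)
    if p < alpha:
        false_positives += 1

print(false_positives)          # roughly 250, i.e. about 5% of 5,000
print(int(alpha * n_studies))   # the expected count: 250
```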

Wait. What?! I had to memorize some ridiculous definition for some stats class. I’ve stuck with you through this article. You said Fisher was the “Father of Modern Statistics” and he’s your weird academic relative and he used p-values. What do you mean the p-value doesn’t measure the right thing?

Going back to my Chipotle example from earlier, I told you that I ate an average of 4 burritos each week. You then kept a food journal for me over ten weeks and observed that I ate an average of 3.5 burritos each week. The p-value quantifies how likely it is that you happened to see me eat 3.5 burritos assuming that I eat an average of 4 burritos each week.

That is, “Given that I actually eat an average of 4 burritos each week, what is the probability that you see me eat an average of 3.5 burritos over ten weeks?”

Instead, wouldn’t it be great to know this: “Given that you see me eat an average of 3.5 burritos over ten weeks, what is the probability that I actually eat an average of 4 burritos each week?”

Mind blown.

See, we treat the p-value as really similar to the probability of our original conclusion being correct given that we observed some real-world data. But the p-value is actually related to the probability of observing this real-world data, assuming that our original hypothesis is right.

There’s a lot of probability I’m concealing right now, but the TL;DR version is this: the p-value gives us the probability of A (seeing data like ours) assuming that B (our hypothesis) is true, when we really want to evaluate the probability of B assuming that A is true.
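
If the A-versus-B phrasing feels slippery, here’s a toy numeric example (mine, not from the article) with a rare disease and a screening test. The numbers are invented, but they show how different “the probability of a positive test given that you’re sick” is from “the probability that you’re sick given a positive test”:

```python
# Toy illustration of why P(A given B) and P(B given A) can be very different.
p_sick = 0.01               # 1% of people have the condition
p_pos_given_sick = 0.99     # the test catches 99% of sick people
p_pos_given_healthy = 0.05  # but it also flags 5% of healthy people

# Total probability of seeing a positive test (law of total probability)
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)

# Bayes' theorem flips the conditioning around
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos

print(f"P(positive | sick) = {p_pos_given_sick:.2f}")   # 0.99
print(f"P(sick | positive) = {p_sick_given_pos:.2f}")   # about 0.17
```

Flipping the conditioning around requires Bayes’ theorem and a prior (how common the condition is to begin with), which is exactly the ingredient a p-value doesn’t come with.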

p-values don’t measure what we think they do. The thing p-values measure and the thing we want to measure are related to one another, but they’re not the same.

Yeah, I know, there’s only one Lindsay Lohan, but Parent Trap >> Full House.

Okay. So that seems to make some sense. So how do we change things?

There are a lot of recommendations on how to do this. Wikipedia has a solid, in-depth description of five different methods for changing things.

But a recently popular recommendation is to move that standard threshold of statistical significance from 5% to 0.5%. If we do that, we would drastically cut down on the number of false positives. (Using our surgical example from above, we’d expect to see only about 25 false positives out of 5,000 studies, as opposed to the 250 false positives we previously expected.)
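
As a quick sanity check on that arithmetic (again assuming every one of the 5,000 studies has no real effect to find):

```python
n_studies = 5_000

print(n_studies * 0.05)    # 250.0 expected false positives at the old 5% threshold
print(n_studies * 0.005)   # 25.0 expected false positives at the proposed 0.5% threshold
```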

But, wait, I thought you said p-values don’t measure what we want them to measure.

Yep. Exactly. This doesn’t change that. This addresses the symptom as opposed to directly solving the problem.

There is a large group of really, really smart people who promote this idea. These authors also recognize that many researchers, including many of the authors themselves, want to use methods that aren’t based on p-values (like Bayes factors)! They simply argue that, for those who do rely on p-values, adjusting this threshold will have a substantial, positive effect on this “reproducibility crisis.”

So there’ll be less “fake news” and “alternative facts” around!

NO. “Fake news” and “alternative facts” are these ridiculous political communications terms used when people don’t want to face real facts. We need to get them out of our vocabulary as soon as possible.

I really dislike Kellyanne Conway. This is not an alternative fact.

But addressing the reproducibility crisis does protect us from relying on conclusions that are actually factually incorrect.

You can check out my other blog posts here. Thanks for reading!
