Statistical Thinking

Why are p-values like needles? It’s dangerous to share them!

There’s a war on p-values… and both sides are wrong

Cassie Kozyrkov
Towards Data Science
9 min read · Jun 17, 2019



For a concept taught in almost every STAT101 class, the amount of debate around p-values is staggering. As a statistician with both Frequentist and Bayesian sympathies, let me try to cut through the noise for you. I’m going to be cheerfully irreverent to both sides.

If you’re new to p-values, take a moment to check out my simple explanation with puppies to get oriented. In a nutshell, a p-value is a decision-making tool that tells you whether you should feel ridiculous about your null hypothesis.

The case against p-values

When people (often Bayesians) criticize p-values, it usually boils down to one of two arguments:

  1. Something involving definitions or formulas. Often includes “posterior probabilities are better” in there somewhere.
  2. Something involving anxiety about potential for misuse.

Allow me to translate these to what they sound like to my ear:

  1. I don’t like the way you’re setting up your statistical decision-making.
  2. Lazy people are lazy.

Argument 1 (against)

If you’ve been making Argument 1… well, it makes you look bad. That’s you forgetting that statistics is the science of changing your mind and it’s up to you to frame your decision-making however you want to, then pick the right tool for your frame. (Also, back off if you’re more skilled at math than at making decisions!)

The right approach to choose depends on how the decision-maker wants to make the decision.

If you want a reasonable way of selecting actions and you’re thinking of it in terms of minimizing the risk of picking the wrong action, the Frequentist approach is not so very evil. If you prefer to think about evolving your personal opinion with data, then the Bayesian approach makes more sense.

The right approach to choose depends on how the decision-maker wants to make the decision, so there’s no one-size-fits-all here. This stuff has many right answers… preference and philosophical stance have everything to do with it. Why are some people so puffed up about getting things Right when those things are inherently subjective and have no right answer? It baffles me.

Argument 2 (against)

Argument 2 (potential for misuse) is fair, but it’s not the p-value’s fault. It turns out that making decisions carefully using statistics takes effort, but people keep looking for the miraculous no-effort magic that gets them a crystal ball. The mysterious p-value is tempting — most of its users don’t understand how to use it and the resulting broken telephone has reached ridiculous levels. I’m with you.

That’s why I’m a huge advocate of chilling out. In other words, I’m a fan of making data-inspired decisions where you commit to not taking yourself seriously if you’re not willing to put in the effort. The best solution for those who are feeling lazy: do descriptive analytics and stay humble.

If you’re not willing to put in the effort, opt for descriptive analytics and stay humble.

Statistical inference only makes sense if you go about it rigorously in a way that fully honors the intentional way that you set up your decision frame and assumptions. This isn’t a p-value problem. It’s a snake oil problem: Statistics is often sold as a magical cure-all that purports to deliver guarantees that are crazy if you stop to think about it. There’s no magic that makes certainty out of uncertainty… but somehow there are many charlatans implying the opposite.

The case for p-values

You should also be suspicious of anyone who professes rabid love for p-values. They’re only useful in very specific circumstances. But when p-values are useful, they’re very useful.

They are a useful tool for making decisions a particular way.

It’s pretty hard to challenge that one. For decision-makers wishing to do their best in an uncertain world and make decisions in a specific way, the p-value is perfect. Don’t rain on their parade just because you’d prefer to make the decision a different way — when it’s your turn to be the decision-maker, you can do it however you please.

The other case for p-values

If you’re interested in analytics (and not statistics), p-values can be a useful way to summarize your data and iterate on your search. Please don’t interpret them as a statistician would. They don’t mean anything except “there’s a pattern in these data.” Statisticians and analysts may come to blows if they don’t realize that analytics is about what’s in the data (only!) while statistics is about what’s beyond the data.

Don’t use the word hypothesis when you’re doing analytics, or you’ll sound like an idiot.

To an analyst, a p-value is just another statistic, with no interpretation except “this is the number I get when I shake my dataset in a particular way, when it’s small it means my dataset has a certain kind of pattern” — think of it as a way to visualize complicated and large datasets efficiently. Don’t use the word hypothesis when you’re exploring data with analytics*, or you’ll sound like an idiot. You work with facts: these data have this pattern. Period.

To learn more about the difference between the subfields of data science, see bit.ly/quaesita_datasci.

Enough with analytics — there’s no battle there (just as there are no rules beyond “Don’t make conclusions beyond the data!”). Back to statistics, where the argument is heated!

The case for confidence intervals instead of p-values

You’re in the wrong room, buddy. Go back to analytics where confidence intervals are a more efficient way of visualizing and summarizing data. In statistical decision-making, no one cares. Why? The decisions you get using confidence intervals and p-values are identical. If you’re doing real statistical inference, you should be indifferent for any reason that isn’t aesthetic.
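That claim about identical decisions can be made concrete. Here’s a minimal sketch, assuming a two-sided z-test with known standard error (the numbers and the function name `z_test_decision` are hypothetical, chosen just for illustration): rejecting when the p-value falls below alpha is exactly the same as rejecting when the null value falls outside the (1 − alpha) confidence interval.

```python
from statistics import NormalDist

# Hypothetical two-sided z-test (known standard error), sketched to show
# that the p-value route and the confidence-interval route always land on
# the same decision.
_nd = NormalDist()

def z_test_decision(sample_mean, null_mean, se, alpha=0.05):
    """Return (reject_via_p, reject_via_ci) for a two-sided z-test."""
    z = (sample_mean - null_mean) / se
    p = 2 * (1 - _nd.cdf(abs(z)))        # two-sided p-value
    z_crit = _nd.inv_cdf(1 - alpha / 2)  # e.g. ~1.96 for alpha = 0.05
    lo, hi = sample_mean - z_crit * se, sample_mean + z_crit * se
    # Reject iff p < alpha; equivalently, iff the null value sits outside the CI.
    return p < alpha, not (lo <= null_mean <= hi)

# Both routes agree for every input:
for m in (0.0, 0.1, 0.3, 0.5):
    via_p, via_ci = z_test_decision(sample_mean=m, null_mean=0.0, se=0.15)
    assert via_p == via_ci
```

Both conditions are just restatements of |z| > z_crit, which is why preferring one over the other is a matter of presentation, not decision quality.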

(It’s true that it’s a kindness to future data explorers — analysts — if you report your results with confidence intervals, but that’s got nothing to do with the quality of your decision-making.)

Back to basics

Let’s revisit the situation where p-values make statistical sense. First, you’re setting up your decision around the notion of a default action and you’re giving the data a chance to talk you out of it. You’re not trying to form mathematically describable opinions (go Bayesian for that). You’re willing to make a decision in a way that follows the logic in this blog post. If not, p-values are not for you. There’s nothing to argue about. They’re a good tool for some jobs, but if that’s not the job you need done, then go get a better tool. Since when do we expect that one tool should fit every job?!

Now that you’ve decided to test hypotheses the classical way, let’s see how you calculate a p-value.

Create the null world

Once you have your null hypothesis stated formally (after you’ve done this), the bulk of the work will be visualizing the null hypothesis world and figuring out how things work there so we can make a toy model of it.

That’s the point of those arcane scribbles you might remember from stats class — they boil down to making a mathematical model of a universe whose rules are governed by the null hypothesis. You build that universe out of equations (or by simulation!) so you can examine it in the next step.

The math is all about building a toy model of the null hypothesis universe. That’s how you get the p-value.

The math is all about making and examining toy universes (how cool is that, fellow megalomaniacs!? So cool!) to see how likely they are to spawn datasets like yours. If your toy model of the null hypothesis universe is unlikely to give you data like the data you got from the real world, your p-value will be low and you’ll end up rejecting the null hypothesis… change your mind!
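The “build a toy universe and see what it spawns” idea can be sketched by simulation. This is a hypothetical example (the coin, the counts, and the helper name `heads_in_null_world` are all made up for illustration): the default action assumes a fair coin, we observed 60 heads in 100 flips, and we ask how often the null universe produces data at least as extreme.

```python
import random

random.seed(0)  # make the toy universes reproducible

n_flips, observed_heads = 100, 60
n_universes = 20_000

def heads_in_null_world(n):
    """Flip a fair coin n times inside the toy null-hypothesis universe."""
    return sum(random.random() < 0.5 for _ in range(n))

# Two-sided: count universes whose head count lands at least as far
# from the expected 50 as our observed 60 did.
observed_gap = abs(observed_heads - n_flips / 2)
extreme = sum(
    abs(heads_in_null_world(n_flips) - n_flips / 2) >= observed_gap
    for _ in range(n_universes)
)
p_value = extreme / n_universes
print(f"Simulated p-value: {p_value:.3f}")
```

The simulated value hovers near the exact binomial answer (about 0.057 for this setup): the null universe spawns data like ours only rarely, so you’d start to feel a bit ridiculous about it.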

Assumptions, assumptions, assumptions

Naturally, you’ll have to make some simplifying assumptions, otherwise you’ll get overwhelmed quickly. No one has the time to make a universe as rich and complex as the one we actually live in, which is why statistics doesn’t give you Truth-with-a-capital-T, but rather a method for making reasonable decisions under uncertainty… subject to some corners you’re willing to cut. (It’s also why statistical pedantry looks so silly.)

In STAT101, those assumptions tend to be spoon-fed to you as “The data are normally distributed… blah blah blah.” In real life, you have to come up with the assumptions yourself, which can feel scary since suddenly there are no right answers.

In real life, there are no right answers. The best we can do is make decisions in a way that feels reasonable.

If a p-value was calculated for someone else, it’s probably useless to you. It should only be shared among people who choose to make the same simplifying assumptions and frame their decision-making in the same way.

It’s dangerous to use other people’s p-values… they’re like needles: if you’re going to use them, get your own!

Statistical decision-making is always subjective, whether it’s Bayesian or Frequentist, because you always have to make simplifying assumptions. The conclusions are only valid insofar as you buy into those assumptions, which is why it’s weird to expect someone to agree with your punchline if they haven’t seen the assumptions it’s based on. Why do we do that? No idea. I don’t.

If I’m not willing to think about how I’d like to make the decision and whether the stated assumptions are palatable to me (before I see the data or p-value), then all I’ll ever see in a p-value is what an analyst sees: after some settings were twiddled, you saw a pattern. That’s cute. Sometimes I see animals when I look at clouds too. If I’m tempted to take it seriously, I’ll follow the “insight” up in other data. Otherwise, I’ll treat it as vague inspiration… and at that quality, who the hell cares how good it is anyway?

Does this evidence surprise you?

Now that you have imagined the world that describes your null hypothesis, you’re going to ask whether the evidence you got — your data — is surprising in that world. The p-value is simply the probability that your null world coughs up data at least as damning as yours. When it’s low, that means your data look weird in such a world, which makes you feel ridiculous about acting as if you live in that world. When it’s low enough for your tastes — below a threshold you pick called a significance level — that means you’re surprised enough to change your mind and switch your action away from your default. Otherwise, you keep doing what you were going to do anyway.
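The decision rule above is tiny when written out. Here’s a sketch (the function name and action strings are hypothetical placeholders for whatever your default and alternative actions actually are):

```python
def decide(p_value, significance_level=0.05,
           default_action="keep doing what you planned",
           alternative_action="switch away from the default"):
    """Classical decision rule: change your mind only if the null world
    would be surprised enough by your data."""
    if p_value <= significance_level:
        return alternative_action  # surprising enough: reject the null
    return default_action          # not surprising: stick with the default

print(decide(0.03))  # below the 0.05 threshold, so you switch
print(decide(0.20))  # not surprising, so you keep the default
```

Note that the threshold belongs to the decision-maker: two people with different significance levels can see the same p-value and legitimately make different decisions.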

Interpret a low p-value as: “Someone was surprised by something.”

Who defines what “ridiculous” means? The decision-maker (whoever chose the assumptions and significance level). If you didn’t set the analysis up, the only valid interpretation of a low p-value is: “Someone was surprised by something.” Let’s all meditate on how little that tells you if you don’t know much about the someone or the something in question.

And that’s why p-values are a bit like medical needles: They’re intended for personal use and it’s dangerous to share them.


Appendix: Technical Objections

*Technical objections to your technical objections (uses jargon, sorry):

  • To those who are about to protest that “The hypothesis is used in the calculation of the confidence interval and that’s why we’d be using the word and is Cassie calling everyone (everyone!) an idiot?!” …while it’s true that the computation uses the hypothesis, let me remind you that the game in analytics is speed. Why are you rolling your own test inversion for exploration? There are plenty of packages ready to go.
  • To those who are about to protest that “If we have all the data to test a hypothesis with certainty, then we’d utter the ‘hypothesis’ just before incinerating it in the blazing glory of our truth…” Folks, the other name for this is “Looking Up The Answer” and yes, you can use analytics for fact-based decision-making, but seriously: why are we talking about p-values (0) or confidence interv- er, confidence points in that context? You’ve got facts, so you don’t need statistics. When you have all the facts, feel free to ignore any damned lies in articles bearing the tag #statistics, including this one.


Chief Decision Scientist, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own. twitter.com/quaesita