Statistical Thinking

The saddest equation in data science

Why statistics will never give you Truth

Cassie Kozyrkov
Towards Data Science
7 min read · Jun 26, 2020

Prepare a box of tissues! I’m about to drop a truth bomb about statistics and data science that’ll bring tears to your eyes.

Meme template: SOURCE.

INFERENCE = DATA + ASSUMPTIONS. In other words, statistics does not give you truth.

Common myths

Here are some standard misconceptions:

  • “If I find the right equations, I can know the unknown.”
  • “If I math at my data hard enough, I can reduce my uncertainty.”
  • “Statistics can transform data into truth!”

They sound like fairytales, don’t they? That’s because they are!

Painful truths

There is no magic in the world that lets you make something out of nothing, so abandon that hope now. That’s not what statistics is about. Take it from a statistician. (As a bonus, this article might save you from wasting a decade of your life studying the dark arts of statistics to chase that elusive dream.)

Unfortunately, there are plenty of charlatans out there who may try to convince you otherwise. They’ll pull a classic bullying move on you: “You don’t understand the equations I’m clobbering you with, so bow before my superiority and do what I say!”

Resist those posers.

Don’t land with a splat, Icarus!

Think of statistical inference (“statistics” for short) as an Icarus-like leap from what we know (our sample data) to what we don’t (our population parameter).

In statistics, what you know is not what you wish you knew.

Perhaps you want tomorrow’s facts, but you only have the past to inform you. (It’s so annoying when we can’t remember the future, right?) Perhaps you want to know what all your potential users think of your product, but you can only ask a hundred of them. Then you’re dealing with uncertainty!

It’s not magic, it’s assumptions

How can you possibly leap from what you know to what you don’t? You need a bridge to cross that chasm… and that bridge is assumptions. Which brings me back to the most painful equation in all of data science: DATA + ASSUMPTIONS = PREDICTION.

DATA + ASSUMPTIONS = PREDICTION

(Feel free to replace the word “prediction” with “inference” or “forecast” if you like — they’re all the same thing here: a statement about something you can’t know for sure.)
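
To make the equation concrete, here’s a minimal Python sketch of the hundred-user survey from earlier. The numbers are invented for illustration (62 of 100 hypothetical users say they like the product), and the function name `estimate_population_share` is mine, not part of any library. Notice that the data alone only tells you about those hundred people. The statement about everyone else only shows up once the assumptions are bolted on.

```python
import math

def estimate_population_share(likes, sample_size, z=1.96):
    """Normal-approximation interval for a population proportion.

    ASSUMPTIONS (not facts!):
      - the sample was drawn at random from the population of interest
      - responses are independent and honest
      - the sample is large enough for the normal approximation
      - z=1.96 corresponds to a roughly 95% confidence level
    """
    p_hat = likes / sample_size
    standard_error = math.sqrt(p_hat * (1 - p_hat) / sample_size)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin

# DATA: hypothetical survey results, made up for this example.
likes, sample_size = 62, 100

low, high = estimate_population_share(likes, sample_size)
print(f"Inferred share of all users who like the product: {low:.3f} to {high:.3f}")
```

Drop any one of those assumptions (say, the hundred users were your biggest fans rather than a random draw) and the interval stops meaning what you’d like it to mean, even though the arithmetic is unchanged.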

Assumptions bridge the gap between what you know and what you don’t. Image: SOURCE.

What’s an assumption?

If we knew all the facts (and we knew that our facts were actually true facts), we wouldn’t need assumptions (or statisticians). Assumptions are the ugly patches you use to bridge the gap between what you know and what you wish you knew. They’re hacks you have to use to make the math work out when you’re missing the facts.

Assumptions are ugly band-aids you put over the parts where information is missing.

Should I put it more bluntly? An assumption is not a fact, it’s some nonsense you make up precisely because you’ve got gaping holes in your knowledge. If you’re in the habit of bullying people with your overconfidence intervals, take a moment to remind yourself that it’s a stretch to refer to anything based on assumptions as truth. It’s best to start treating the whole thing as a personal decision-making tool that is imperfect but better than nothing (in specific situations).

Statistics is your attempt to do your best in an uncertain world.

There are always assumptions.

Assumptions are part of decision-making

Show me an “assumption-free” real-world decision and I’ll rattle off a host of implicit assumptions you’re not even aware you’re making.

Examples: When you read a newspaper, did you assume all the facts were checked? When you made your plans for 2020, did you assume there would be no global pandemic? If you analyzed data, did you assume the information was captured without errors? Did you assume that your random number generator is random? (They usually aren’t.) When you chose to make an online purchase, did you assume the right amount would be withdrawn from your bank account? What about the last snack you had, did you assume it wouldn’t poison you? When you took medicine, did you *know* anything about its long-term safety and efficacy… or did you assume?
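
If the random number generator question sounds paranoid, here’s a quick Python sketch. The standard library’s `random` module is a pseudorandom generator: give it the same seed and it replays the same “random” numbers. The seed value 42 is an arbitrary choice for illustration.

```python
import random

# Two separate generators started from the same (arbitrary) seed.
first = random.Random(42)
second = random.Random(42)

draws_a = [first.random() for _ in range(5)]
draws_b = [second.random() for _ in range(5)]

# A deterministic algorithm produces identical sequences from identical seeds.
print(draws_a == draws_b)  # True
```

That reproducibility is handy for debugging, but it means “random” is an assumption you’re choosing to live with, not a fact.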

Like it or not, assumptions are always part of decision-making. A proper foray into real-world data should contain a host of written-down assumptions where the data scientist comes clean about corners they had to cut.

Even if you choose to steer clear of statistics, you’re probably using assumptions to guide your actions. To stay safe, it’s crucial that you keep track of the assumptions that your decisions are based on.

How the statistical “magic” happens

The field of statistics gives you a whole arsenal of tools for formalizing your assumptions and combining them with evidence to make reasonable decisions. (Catch my 8-minute intro to stats here.)

It’s preposterous to expect an analysis involving uncertainty and probability to be a source of truth-with-a-capital-T.

Yep, that’s how the statistical “magic” happens. You choose which assumptions you’re willing to live with, then you combine them with data to take reasonable actions on the basis of that unholy union. That’s all statistics is.

Icarus illustration: SOURCE.

That’s why an analysis involving uncertainty and probability could never be a source of truth-with-a-capital-T. There is no secret dark art that can do that for you.

It’s also why two people can come to completely different valid conclusions from the same data! All it takes is using different assumptions. Statistics gives you a tool for making decisions more thoughtfully, but there’s no single right way to use it. It’s a personal decision-making tool.
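
Here’s a small Python sketch of how that can happen. Everything in it is hypothetical: the survey numbers (62 of 100 users like the product), the two analysts, their Beta priors, and the decision rule (launch if the estimated share exceeds 0.6). The only math used is the standard Beta-Binomial update, where a Beta(a, b) prior plus the data gives a Beta(a + likes, b + dislikes) posterior.

```python
def posterior_mean(likes, dislikes, prior_a, prior_b):
    """Mean of the Beta posterior after a Beta(prior_a, prior_b) prior sees the data."""
    a = prior_a + likes
    b = prior_b + dislikes
    return a / (a + b)

# DATA: identical for both analysts (made up for this example).
likes, dislikes = 62, 38

# ASSUMPTIONS: each analyst encodes a different prior belief.
analysts = {
    "open-minded": (1, 1),   # Beta(1, 1): every share equally plausible beforehand
    "skeptical": (5, 20),    # Beta(5, 20): expects roughly 20% to like it
}

LAUNCH_THRESHOLD = 0.6  # hypothetical decision rule

decisions = {}
for name, (prior_a, prior_b) in analysts.items():
    estimate = posterior_mean(likes, dislikes, prior_a, prior_b)
    decisions[name] = estimate > LAUNCH_THRESHOLD
    print(f"{name}: estimate {estimate:.3f}, launch? {decisions[name]}")
```

Same data, same arithmetic, opposite decisions. Neither analyst made a math error; they simply started from different assumptions.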

A study is only as good as the assumptions you’ll make about it.

What about science?

What does it mean when a scientist uses statistics to come to a conclusion? Simply that they’ve formed an opinion and have made the decision to share it with the world. That’s not a bad thing — it’s a scientist’s job to form opinions reluctantly, which makes me feel better about assuming that they’re worth listening to.

I’m a huge fan of taking advice from those who have more expertise and information than I do, but I never let myself confuse their opinions with facts. While many scientists are well-versed in working with probability, I’ve seen others make enough statistical mess to last several lifetimes. Opinions cannot (and should not) convince anyone who is unwilling to assume that those opinions were arrived at competently from a blend of evidence and mutually palatable untested assumptions.

If you’d like to hear more of my musings on science and scientists, read this.

In summary

It’s best to think of statistics as the science of changing your mind under uncertainty. It’s a framework to help you make thoughtful decisions when you lack information… and there’s no single right way to use it.

And no, it doesn’t give you the facts you need; it gives you what you need to cope with not having those facts in the first place. The entire point is to help you do your best in an uncertain world.

To do that, you’ll have to start making assumptions.

Next up

In follow-up articles, I’ll write about where assumptions come from, how to pick “good” assumptions, and what it means to test an assumption. If these topics intrigue you, your retweets are my favorite motivation for writing.

Chief Decision Scientist, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own. twitter.com/quaesita