Data-Driven Leadership and Careers

The most powerful idea in data science

A quick fix for separating red herrings from useful patterns

Cassie Kozyrkov
Towards Data Science
8 min read · Aug 9, 2019


If you take an introductory statistics course, you’ll learn that a datapoint can be used to generate inspiration or to test a theory, but never both. Why not?

Humans are a bit too good at finding patterns in everything. Real patterns, fake patterns, you name it. We’re the sort of creatures that find Elvis’s face in a potato chip. If you’re tempted to equate patterns with insights, remember that there are three kinds of data patterns:

  • Patterns/facts that exist in your dataset and beyond it.
  • Patterns/facts that exist only in your dataset.
  • Patterns/facts that exist only in your imagination (apophanies).
A data pattern can exist (1) in the entire population of interest, (2) only in the sample, or (3) only in your head.

Which ones are useful to you? It depends on your goals.

Inspiration

If you’re after pure inspiration, they’re all fabulous. Even the odd apopheny — from the term apophenia (the human tendency to mistakenly perceive connections and meaning between unrelated things) — can get your creative juices flowing. Creativity has no right answers, so all you need to do is take a look at your data and have fun with it. As an added bonus, try not to waste too much time (yours or your stakeholder’s) along the way.

Facts

When your government wants to collect taxes from you, it couldn’t care less about patterns outside your financial data for the year. There’s a fact-based decision to be made about what you owe and the way to make it is to analyze last year’s data. In other words, look at the data and apply a formula. What’s called for is pure descriptive analytics that sticks to the data at hand. Either of the first two kinds of patterns is good for that.

Descriptive analytics that sticks to the data at hand.

(I’ve never mislaid my financial records, but I imagine the United States government wouldn’t be delighted with me if my response to losing them was to use the data imputation techniques I learned in grad school to pay my taxes statistically.)

Decisions under uncertainty

Occasionally, the facts you have aren’t the same as the facts you wish you had. When you don’t possess all the information required for the decision you’d love to make, you’ll need to navigate uncertainty as you try to pick a reasonable course of action.

This is what statistics — the science of changing your mind under uncertainty — is all about. The game is to make an Icarus-like leap beyond what you know… without ending in a splat.

That’s the big challenge at the heart of data science: how not to wind up *less* informed as a result of looking at data.

Before you sail off that cliff, you’d better hope that the patterns you found in your partial glimpse of reality actually do work beyond it. In other words, the patterns must generalize to be useful to you.

Source: xkcd

Of the three varietals, only the first (generalizable) kind of pattern is safe if you’re making decisions under uncertainty. Unfortunately, you’ll find the other kinds of patterns in your data too — that’s the big challenge at the heart of data science: how not to wind up less informed as a result of looking at data.

Generalization

If you think pulling useless patterns out of data is a purely human privilege, guess again! Machines can automate the same silliness for you if you’re not careful.

The entire point of ML/AI is to generalize correctly to new situations.

Machine learning is an approach to making many similar decisions that involves algorithmically finding patterns in your data and using these to react correctly to brand new data. In ML/AI jargon, generalization refers to your model’s ability to work well on data it hasn’t seen before. What good is a pattern-based recipe that only succeeds on old stuff? You can just use a lookup table for that. The entire point of ML/AI is to generalize correctly to new situations.

That’s why the first kind of pattern in our list is the only kind that’s good for machine learning. It’s the part that’s signal; the rest is just noise (red herrings that exist only in your old data and distract you from coming up with a generalizable model).

Signal: Patterns that exist in your dataset and beyond it.

Noise: Patterns that exist only in your dataset.

In fact, getting a solution that handles old noise instead of new data is what the term overfitting means in machine learning. (We utter that word with the same tone you’d apply to your favorite expletive.) Almost everything we do in machine learning is in service of avoiding overfitting.
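To make overfitting concrete, here’s a toy sketch of my own (not from the article): one “model” memorizes every training point in a lookup table, another fits a simple linear trend. The memorizer is flawless on the old data and worse on fresh data drawn from the same population.

```python
import random

random.seed(0)

# Toy population: y is roughly 2*x plus noise.
def draw_sample(n):
    return [(x, 2 * x + random.gauss(0, 1)) for x in range(n)]

train = draw_sample(50)   # the data we get to see
fresh = draw_sample(50)   # new data from the same population

# Overfit "model": a lookup table that memorizes every training point.
memorized = dict(train)
def lookup(x):
    return memorized[x]

# Simple model: least-squares slope through the origin.
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
def linear(x):
    return slope * x

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(mse(lookup, train), mse(linear, train))  # lookup scores a perfect 0 on old data
print(mse(lookup, fresh), mse(linear, fresh))  # but the simple model wins on new data
```

The lookup table “handles old noise instead of new data” exactly as described above: it reproduces the training noise perfectly and pays for it the moment the noise changes.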

So, which kind of pattern is *this* one?

Assuming that the pattern you (or your machine) pulled out of your data exists outside your imagination, which kind is it? Is it a real phenomenon that exists in your population of interest (“signal”) or an idiosyncrasy of your current dataset (“noise”)? How can you tell which kind of pattern you found during your foray into a dataset?

If you’ve looked at all your available data, you can’t. You’re stuck and there’s no way to tell whether your pattern exists elsewhere. The whole rhetoric of statistical hypothesis testing hinges on surprise, and it’s in bad taste to pretend to be surprised by a pattern you already know is in your data. (That’s p-hacking, essentially.)

It’s a bit like seeing a rabbit shape in the clouds and then testing whether all clouds look like rabbits… using the same cloud. I hope you appreciate that you’re going to need some new clouds to test your theory.

Any datapoint you use to inspire a theory or question can’t be used to test that same theory.

What could you have done if you knew you only had access to one image of a cloud? Meditate in a broom closet, that’s what. Come up with your question before you look at the data.

Mathematics is never a counterspell to basic common sense.

We’re being led to a most unhappy conclusion here. If you use up your dataset in your quest for inspiration, you can’t use it again to rigorously test the theory it inspired (no matter how much mathemagical jiu-jitsu you whip out, since math is never a counterspell to basic common sense).

Tough choices

This means you must choose! If you have only one dataset, you’re forced to ask yourself: “Do I meditate in a closet, set up all my statistical testing assumptions, and then carefully take a rigorous approach so I can take myself seriously? Or do I just mine the data for inspiration, but agree that I might be kidding myself and remember to use phrases like ‘I feel’ or ‘this inspires’ or ‘I’m not sure’?” Tough choice!

Or is there a way to have your cake and eat it too? Well, the problem here is that you have only one dataset and you need more than one dataset. If you have lots of data, I have a hack for you that will. Blow. Your. Mind.

One weird trick

To win at data science, simply turn one dataset into (at least) two by splitting your data. Then use one for inspiration and the other for rigorous testing. If the pattern that inspired you in the first place also exists in the data that didn’t have a chance to influence your opinions, that’s a more promising vote in favor of the pattern being a general thing in the cat litter box you scooped your data from.

If the same phenomenon exists in both datasets, maybe it’s a general phenomenon that also exists wherever those datasets came from.

SYDD!

If an unexamined life is not worth living, then here are the four words to live by: Split Your Damned Data.

The world would be better if everyone split their data. We’d have better answers (from statistics) to better questions (from analytics). The only reason that people don’t treat data splitting as a mandatory habit is that in the previous century it was a luxury very few could afford; datasets were so small that if you tried to split them then there might be nothing left. (Learn more about the history of data science here.)

Split your data into an exploratory dataset that everyone can dredge for inspiration and a test dataset that will later be used by experts for rigorous confirmation of any “insights” found during the exploratory phase.
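In its simplest form, that split is just a shuffled cut. Here’s a plain-Python sketch (the function name, the 80/20 fraction, and the seed are my own illustrative choices, not from the article):

```python
import random

def split_your_damned_data(rows, test_fraction=0.2, seed=42):
    """Shuffle once, then lock away a held-out test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed: the split is reproducible
    cut = int(len(rows) * (1 - test_fraction))
    # Explore the first chunk freely; touch the second only for final confirmation.
    return rows[:cut], rows[cut:]

explore, test = split_your_damned_data(range(100))
print(len(explore), len(test))  # 80 20
```

Libraries offer the same idea ready-made (e.g. scikit-learn’s train_test_split), but the discipline matters more than the tooling: whoever dredges the exploratory set must never peek at the test set.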

Some projects still have that problem today, especially in medical research (I used to be in neuroscience, so I have a lot of respect for how hard it is to work with small datasets), but many of you have so much data that you need to hire engineers just to move it all around… what’s your excuse?! Don’t be miserly, split your data.

If you’re not in the habit of splitting your data, you might be stuck in the 20th century.

If you’ve got data in droves but you’re seeing unsplit datasets, your neck of the woods is suffering from an old-fashioned perspective. Everyone got comfy with archaic thinking and forgot to move on with the times.

Machine learning is the offspring of data splitting

At the end of the day, the idea here is so simple. Use one dataset to form a theory, call your shots, and then perform the magic trick of proving you know what you’re talking about in a brand new dataset.

Data splitting is the easiest quick fix for a healthier data culture.

That’s how you stay safe in statistics and it’s also how you avoid being eaten alive by overfitting in ML/AI. In fact, the history of machine learning is a history of data splitting. (I explain why in Machine Learning is Automated Inspiration.)

How to use the best idea in data science

To take advantage of the best idea in data science, all you have to do is make sure you keep some test data out of reach of prying eyes, then let your analysts go wild on the rest.

To win at data science, simply turn one dataset into (at least) two by splitting your data.

When you think they’ve brought you an actionable “insight” that reaches beyond the information they explored, use your secret stash of test data to check their conclusions. It’s as easy as that!

Thanks for reading! How about an AI course?

If you had fun here and you’re looking for an applied AI course designed to be fun for beginners and experts alike, here’s one I made for your amusement:

Enjoy the entire course playlist here: bit.ly/machinefriend

Liked the author? Connect with Cassie Kozyrkov

Let’s be friends! You can find me on Twitter, YouTube, Substack, and LinkedIn. Interested in having me speak at your event? Use this form to get in touch.


Chief Decision Scientist, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own. twitter.com/quaesita