Making Data Useful

Can analysts and statisticians get along?

Inside the subtle war between the data science professions

Cassie Kozyrkov
Towards Data Science
9 min read · Jan 24, 2020



In a previous article, I explained that typical training programs in statistics and analytics endow graduates with different skillsets.

When you’re dealing with uncertainty, analysts help you ask better questions, while statisticians provide more rigorous answers. Seems like the makings of a collaboration dream, yet somehow these professions end up at one another’s throats. Let’s see if we can make sense of the strange war between analytics and statistics (and suggest a peace treaty).

Analysts and statisticians: incompatible species in the terrarium?

Definitions

Since data science job titles can be an inaccurate reflection of what people actually do, let me define my terms:

  • Those who are concerned with looking at data to summarize it and extract inspiration are what I call analysts.
  • Those who are concerned with rigorously testing hypotheses for data-driven decision-making are what I call statisticians.
  • Those who know how to do both… are both. This article leaves analyst-statistician hybrids out, but you can catch my thoughts on them here.
  • Those who go through the motions of both while misunderstanding at least one are data charlatans. Head over to this article to learn more.
  • Those who know how to do both and also have ML/AI expertise I call data scientists. This kind of all-rounder is rare indeed. You can read about them in my other writing: [1], [2], [3], [4], [5]. Note that different organizations have different standards for how they define the data science role, so it’s best to check everyone is talking about the same thing before assuming.

Analytics helps you form hypotheses, while statistics lets you test them.

While analysts specialize in quickly exploring your tangled mess of a dataset, statisticians focus more on inferring what’s beyond it.

The burden of (data) poverty

Bottlenecked by the effort of collecting data and the cost of storing it on tiny 20th century hard drives, last century’s datasets tended to be small. It was hard to scrape together enough data for even a single respectable dataset, which meant that data-splitting was rarely an option. This forced professionals into a choice between two dramatically different mindsets.

Antagonism between the data professions is one of the lingering effects of data famine.


(To understand some of the nuances in this article, you’ll need to appreciate that a datapoint can be used to generate inspiration or to test a theory, but never both. With data-splitting, you can have your cake and eat it too. If you want a deeper foray into why this is true, read this.)

Whichever camp you’re in, you might think that the other camp is trying to do your job… and that they’re surprisingly bad at it.

If you received your data science training during the dark ages of data famine, you might be harboring a nasty stereotype that stems from a failure to understand that analysts and statisticians perform different roles. Whichever camp you’re in, you might think that the other camp is trying to do your job… and that they’re surprisingly bad at it.

Nasty stereotypes (and why you have them)


How an analyst appears to a statistician

In a word: sloppy. Unlike statisticians, most analysts aren’t trained in thinking rigorously about which conclusions are valid under uncertainty, but that’s okay… as long as they don’t try to draw conclusions beyond their data. Instead, an analyst’s highest virtue is speed—finding out what’s in their dataset as quickly as possible.

The idea of undisciplined prancing around in data rubs many statisticians the wrong way. Recently, I was privy to a conversation in which a statistician (not me!) argued against the development of faster analytics tools because “it would invite misuse.” Yup. Way to stomp on the validity of the whole analytics career with one big muddy boot.


Here’s the thing: he was right that such tools would be bad for statisticians. The jobs are different, though. Unfortunately most people — including him — don’t understand that difference.

If you’re not able to split your data and you look at all of it before figuring out which questions to ask, then you’re doing analytics, not statistics. That’s not necessarily a bad thing; analytics is important and useful — it’s how we generate inspiration to figure out which directions to pursue. The trouble begins when analysts try to sell inspiration as something more rigorous.

Follow the one golden rule: call your shots before you take them or stick to describing what’s in front of you.

Real statisticians turn their noses up at your so-called “insights” if you failed to follow the one golden rule: call your shots before you take them. Otherwise, stick to describing your dataset and don’t reach beyond it. Please don’t take yourself too seriously and don’t ask anyone else to either.

A manual on how to respond to analytics with unsplit data.

In fact, we’d all be safest in our data reasoning if we treated everyone as doing descriptive analytics until proven otherwise.

“Insights” from unsplit data? That’s just, like, your opinion, man.

Until you show me that your theory lets you call your shots before you take them, I’ll assume that what you’re showing me only exists where you found it. People find patterns in all kinds of things — especially when they’re motivated to think as wishfully as possible — so you won’t impress me until you predict the presence of patterns before you’ve seen them. Unless you can guarantee (and prove — data access logs, anyone?) that your hypothesis preceded your data, anything you tell me should be treated as “that’s just, like, your opinion, man.”
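This is the multiple-comparisons trap, and it’s easy to demonstrate. Here’s a minimal Python sketch (pure stdlib; the helper function and the specific numbers are my own illustrative choices) showing how dredging enough features will always turn up an impressive-looking “pattern” in data that is, by construction, pure noise:

```python
import random

def pearson_r(x, y):
    """Plain Pearson correlation, no external libraries."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

random.seed(0)
n = 100
target = [random.gauss(0, 1) for _ in range(n)]  # the metric we "care about"

# Dredge 200 candidate features that are, by construction, pure noise.
best_r = max(
    abs(pearson_r([random.gauss(0, 1) for _ in range(n)], target))
    for _ in range(200)
)
print(f"strongest 'insight' found in pure noise: |r| = {best_r:.2f}")
```

The best of 200 noise correlations typically lands around |r| ≈ 0.3, which looks like a finding but means nothing, and that is exactly why a pattern only counts if you predicted it before you looked.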

Equations are not enough and they can’t turn a broken process into a trustworthy generalization.

For a foray into data to be something more than descriptive analytics, you’d have to follow a specific process. Just because your software spits out a p-value does not mean that real statistical inference took place. You have to go about framing the context and collecting data in a way that unlocks the philosophical validity of what you’re doing. Equations are not enough and they can’t turn a broken process into a trustworthy generalization. Let’s use our language carefully, calling everything “inspiration” or “analytics” until proven otherwise.

Enough analyst-bashing. Let’s whomp some statisticians!

How a statistician appears to an analyst

In a word: pedantic. Unlike analysts, most statisticians aren’t trained in doing the broad-and-shallow sweep that helps you know which rabbit-holes are worth going down. To an analyst, your garden variety statistician can seem like the royal time-waster, especially if they get involved at the wrong stage of the project.

Many statisticians love to do things Properly, even when those things aren’t always worth doing in the first place. It puts one in mind of a stern five-year-old treating a sandcastle as if it’s sacred and yelling at the four-year-old who wants to join in the building fun. It’s unsurprising that analysts see those statisticians as a sort of superglue that gloms onto the first thing that floats by. (And that holier-than-thou attitude doesn’t help either.)

The last thing most decisions need is statistical tyranny.

Many of life’s decisions are simply not worth much effort, and if we took a careful statistical approach to everything, we wouldn’t get much done. If you’re going all-in on the first thing that catches your attention, are you sure you didn’t miss out on a much more valuable use of your time? (Sure, it’s not careful mathematics, but c’mon, I’m just ordering dinner here.)

When statisticians make a loud show of disapproval during unrigorous forays into exploratory data, they look ridiculous to business-minded folk.

I’ve often wondered whether the rigor-for-rigor’s-sake phenomenon is the product of taking mathy classes where examples are trivial nonsense with increasingly ornate calculations. Kindergarten’s “If Sally has twenty rabbits in a field…” persists all the way through to grad school, where it requires a triple integral to get the gold star.

Who could blame a statistician for taking everything too seriously after so many rabbits? Those classes practically condition you to provide sophisticated answers to stupid questions, so what are you expecting from a workforce brought up on a decade of them? Hiring math/stat worshippers in droves will save you from some problems, but it exposes you to others, including bullies who make life hard for those who aren’t building every sandcastle with 110% care.

Setting my casual causal inferences aside, if you have a teammate who is going to pour their heart and soul into doing rigorous work, then hopefully that rigor is worth chasing. If your teammate lacks the skills to know which rabbit-hole to go down, they’re going to need someone to point them in the right direction.

With analysts helping them, statisticians no longer need to grope their way through the dark, building a universe in their heads to figure out how to ask their questions. Instead, they can let analysts inspire their hypotheses and assumptions.

So why aren’t statisticians delighted to have analysts help them identify what’s worth doing and why aren’t analysts delighted to hand over the checking-our-conclusions-aren’t-nonsense bit to statisticians? Why the antagonism and lack of respect?

Unlocking collaboration

In the bad old days, datasets were too small to split, so each one could be used for analytics or for statistics, but not both. That meant the two groups had to fight over every dataset.

In organizations with a modern approach to data science, strong collaboration between analysts (inspiration / exploration) and statisticians (rigor / testing) is a part of the culture.

Thanks to improvements in hardware and lower storage costs, today many endeavors are breaking through the one-dataset ceiling, ushering in an era of data abundance.*

Split your data into an exploratory dataset that everyone can dredge for inspiration and a test dataset that will later be used by experts for rigorous confirmation of any “insights” found during the exploratory phase.
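In code, the wall between the two pieces can be as simple as one shuffle and one cut. Here’s a minimal Python sketch (the function name and the 50/50 split are my own illustrative choices, not a prescription):

```python
import random

def split_for_exploration_and_testing(rows, explore_frac=0.5, seed=42):
    """Shuffle once, then wall off the test set before anyone peeks at it."""
    rng = random.Random(seed)  # fixed seed: the split must be reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * explore_frac)
    return shuffled[:cut], shuffled[cut:]

explore, test = split_for_exploration_and_testing(range(1000))
# Analysts dredge `explore` freely for inspiration.
# `test` stays untouched until there's a pre-registered hypothesis to check.
```

(In practice, a library helper like scikit-learn’s `train_test_split` does the same job; the point is that the split happens once, up front, before anyone looks.)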

Now analysts and statisticians can receive their own piece of the original dataset, allowing exploration specialists to work in harmony with testing specialists, each group contributing what it does best… assuming they can let go of their habit of battling one another on sight.

The price of effective collaboration between generation and testing of hypotheses is data quantity.

Analysts can use their piece as a guided meditation to figure out what’s worth pursuing, and once they’ve narrowed down what the business cares most about, the leftover piece gives statisticians a shot at rigorously checking whether the analysts’ hunch is worth acting on.

Organizations can have a symbiosis between the data disciplines… and they should! Welcome to the modern era of data abundance!*


*Exuberance damper

Although today’s typical datasets are much bigger (and more easily shared/accessed) than last century’s data, there are use cases which are trapped in the one-dataset era because initial data collection is very effortful or expensive. An example from my career is fMRI data — even today, it is very expensive to scan a single human brain, so neuroscientific datasets featuring a few dozen scans are still considered impressive. That’s one reason it’s naive to assume that all data will be big data. Information is simply scarce in some topics, and those who work on those topics face a one-dataset reality.

If that sounds like your environment, try to be thoughtful about which camp rules the roost and respectful of folks from the other camp — they provide a fundamentally different service from yours, and you would do well to remember that they are experts in their own right, even if your business has chosen your services over theirs.

Thanks for reading! How about an AI course?

If you had fun here and you’re looking for an applied AI course designed to be fun for beginners and experts alike, here’s one I made for your amusement:

Enjoy the entire course playlist here: bit.ly/machinefriend


Chief Decision Scientist, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own. twitter.com/quaesita