Irreverent Demystifiers

There is no classification — here’s why

Cassie Kozyrkov
Towards Data Science
6 min read · Dec 11, 2020


If you’re the kind of person who likes to keep a tidy mind, here’s why your lip might curl in disgust at my answer to the title question of my recent article on the difference between classification, regression, and prediction: There is no classification.

There is no classification.

Let me explain.

There is no classification… and regression is something else entirely. Meme template from The Matrix.

Discrete versus continuous

Back when dinosaurs roamed the earth, it was fashionable to kick a statistics textbook off with a first chapter on the basics of data. To make sure that students had something to memorize for their first test, opening chapters usually featured this jargon:

  • Continuous data (measured, not counted), e.g. 173.5 cm (my height), 12% (free space on my phone), 3.141592… (pi), -40.00 (where Celsius meets Fahrenheit), etc.
  • Discrete data (counted, not measured), e.g. 1 short story, 6 words, 2 baby shoes, 0 times worn, etc.
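If it helps to see those two flavors the way your software sees them, here’s a minimal sketch (assuming pandas is installed; the numbers are borrowed from the lists above and are purely illustrative):

```python
import pandas as pd

# Continuous: measured values, naturally stored as floats.
# Discrete: counted values, naturally stored as integers.
df = pd.DataFrame({
    "height_cm": [173.5, 160.2],   # measured, not counted
    "words": [6, 2],               # counted, not measured
})
print(df.dtypes)   # height_cm -> float64, words -> int64
```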

What’s the difference between counting and measuring? How many other words are there for describing different kinds of data and what do they all mean? Take a small detour to my article on data types. (Don’t worry, I’ll still be here when you get back.)

Stepping onto the infinite scale

Alas, if you’re interested in applying what you’re learning to real world data, you’ll quickly discover that what’s possible in theory isn’t always available in practice. That’s true of continuous numbers, a concept that makes perfect sense in mathematics but not in data — not really.

There’s no such thing as truly continuous data. Even if you could keep records to a billion decimal places, your measurements would still be too coarse.
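You don’t even need subatomic particles to see it; your computer agrees. Here’s a tiny sketch (assuming NumPy) of the fact that even a “continuous” record in your dataset sits on a scale with notches:

```python
import numpy as np

height_cm = 173.5                               # my height from earlier
next_notch = np.nextafter(height_cm, np.inf)    # the next float the machine can represent
print(next_notch - height_cm)                   # a tiny but nonzero gap between notches
```

Finer than any ruler, sure, but still a notch.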

Is a scale with 1,000 notches on it all that different from a scale with 7 or even 2 notches, especially if you compare either to a scale with infinitely many notches? Ever seen a scale like that? Me neither. On a truly continuous scale, my true height is not 5'9'', it’s… wait, hang on while I measure the individual subatomic particles…

Sometimes it’s convenient and sometimes it isn’t.

The best argument for treating scales with 7 and 1000 notches differently is that our poor brains can cope with keeping the former in mind better than the latter. That’s not the most flattering argument to rest the case on — a bit like saying that 11 is a big number because that’s where you run out of fingers. I propose seeing the notion of continuity in applied data science from the perspective of convenience: sometimes it’s convenient and sometimes it isn’t.

Ever seen a scale with infinite notches? Me neither. Continuity is convention. It doesn’t exist in real data.

Theoretical distinctions for homework

After introducing you to the terminology of discrete and continuous, Baby’s First Stats Textbook tends to break its chapters up into separate ones for continuous data and discrete/categorical data. For example, you might learn about probability density functions (PDFs) for continuous data separately from probability mass functions (PMFs) for discrete data.

The black line is a probability density function (PDF) on top of a probability mass function (PMF). Curious about the intentionally-hidden axis labels? Find relief here.

As you move up towards grown-up stats classes, you shrug the distinction off — your profs are calling everything a PDF and no one needs to keep reminding you which one gets the sum (discrete) and which one gets the integral (continuous).
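In case you’d like that reminder one last time, here’s a minimal sketch (assuming SciPy; the distributions are just examples, not anything special):

```python
from scipy import stats, integrate

# Discrete: a PMF puts probability on each countable outcome, so it *sums* to 1.
coin_flips = stats.binom(n=6, p=0.5)
print(sum(coin_flips.pmf(k) for k in range(7)))       # -> 1.0

# Continuous: a PDF puts density on a scale, so it *integrates* to 1.
heights = stats.norm(loc=173.5, scale=7.0)            # heights in cm, made-up spread
total, _ = integrate.quad(heights.pdf, 100, 250)      # effectively the whole scale
print(round(total, 6))                                # -> 1.0
```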

Similarly, you might have started out by learning simple linear regression as a tool for making predictions about continuous quantities and logistic regression as a tool for predicting categories, but when you eventually drag yourself to GLM (generalized linear models) courses, you will see the whole thing as one big gooey mess whose distinctions were patently artificial.
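To give you a taste of the gooey mess, here’s a minimal sketch (assuming statsmodels and NumPy, with made-up data) of “regression” and “classification” as the same GLM machinery with a different family bolted on:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 1)))        # intercept + one feature

# Continuous target: Gaussian family = good old linear regression.
y_continuous = X @ [1.0, 2.0] + rng.normal(size=100)
linear_model = sm.GLM(y_continuous, X, family=sm.families.Gaussian()).fit()

# Categorical (0/1) target: Binomial family = good old logistic regression.
probs = 1 / (1 + np.exp(-(X @ [1.0, 2.0])))
y_binary = (rng.random(100) < probs).astype(int)
logistic_model = sm.GLM(y_binary, X, family=sm.families.Binomial()).fit()

# Same machinery either way; only the assumed output distribution changes.
print(linear_model.params)
print(logistic_model.params)
```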

What used to look like {cat, cat, dog, cat} to you now looks like {1, 1, 0, 1} and what used to look like {90%, 95%, 15%, 82%} or {0.9, 0.95, 0.15, 0.82} also looks like {1, 1, 0, 1} when you need it to.
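Or, in code (a throwaway sketch; the labels and probabilities are the ones from the sets above):

```python
labels = ["cat", "cat", "dog", "cat"]
as_numbers = [1 if animal == "cat" else 0 for animal in labels]    # [1, 1, 0, 1]

probabilities = [0.90, 0.95, 0.15, 0.82]                           # probability of "cat", say
as_categories = [1 if p >= 0.5 else 0 for p in probabilities]      # [1, 1, 0, 1]
```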

And then you start taking courses on GLM…

Why are these distinctions taught in the first place?

While I can’t tell you what’s in the dark hearts of all statistics professors, here are two excellent reasons to draw beginner boundaries between discrete and continuous data:

  1. Pedagogical training wheels. There’s plenty to be said for not opening the nuance floodgates on unsuspecting toddlers. To help students focus on the course objectives, it makes sense to reduce how much they need to keep their wits about them. Wise professors won’t expect beginners to show the data veteran’s common sense, as in “of course we can’t think about an average taken on categorical data (where it indicates the proportion of cases coded with a 1 instead of a 0) the same way as one taken on continuous data (where it’s a measure of central tendency), duh”, so they teach distinctions as rules of thumb that reduce reliance on heightened “common” sense. (There’s a tiny sketch of that average business right after this list.)
  2. Convenience and software specs. Given that different types of data lend themselves to different manipulations, not every bit of math makes sense on every bit of data. For example, try calculating this: “What’s last Saturday’s attempt at staying hydrated minus Sunday’s?” I can tell you my number (-250 milliliters) because I track the number of glasses of water I drink each day. But if you’re trying to conjure up an answer from memory, you might produce only categorical data (“overzealous / good / meh / parched”) instead of specific volumes. In that case, attempting mindless subtraction will produce brainless results. If you’re building your own tools, you’ll tailor your approach to the nature of the information you’ve got. But what if you’re borrowing someone else’s tools? Well, then it helps to have a sense of the kind of input their tool is designed to crunch. Having words for broad-strokes data characteristics is mighty useful when you’re shopping in the software marketplace.
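Here’s that promised sketch of why the same arithmetic means different things on different data (plain Python, made-up numbers):

```python
zero_one = [1, 1, 0, 1]                    # categorical data coded as 0/1
heights_cm = [160.0, 173.5, 181.2, 168.4]  # continuous data

# The very same calculation...
print(sum(zero_one) / len(zero_one))       # 0.75 -> the proportion of cases coded 1
print(sum(heights_cm) / len(heights_cm))   # 170.775 -> a measure of central tendency

# And item 2's warning: subtraction makes sense on volumes, not on labels.
# "good" minus "meh" has no sensible answer; -250 milliliters does.
```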

Classification and all that jazz

Okay, but what has this got to do with classification, prediction, and regression?

There is no spoon, that’s what.

The distinctions are there to amuse/torture machine learning beginners. Essentially, classification is what you call it when your desired output is categorical. But when you learn to see the matrix — the actual data matrix — you can put aside these distinctions. There’s a grey area between categorical and continuous (what if the output is a probability that you convert to a category as a last step, as above?) and you start to see that the grey area quickly engulfs the whole question.

When it comes to applied disciplines, it’s not unusual to see artificial boundaries and pedestrian concepts rebranded along the lines of what’s hot, newly-feasible, or methodologically-convenient. Data science is no exception.

Don’t assume that the distinctions your textbook makes tell you something fundamental about the universe. They might be there as a matter of convenience. Or, worse, former convenience.

But what kind of machine learning textbook/course would it be if it didn’t follow the tradition of treating categorical and continuous differently? I, too, must bend my knee to tradition and teach you these distinctions, along with a subversive message in a bottle: I hope you’ll see through them one day.

In the meantime, our lesson continues discreetly at “Classification, regression, and prediction — what’s the difference?”

Thanks for reading! How about an AI course?

If you had fun here and you’re looking for an applied AI course designed to be fun for beginners and experts alike, here’s one I made for your amusement:

Enjoy the entire course playlist here: bit.ly/machinefriend

Connect with Cassie Kozyrkov

Let’s be friends! You can find me on Twitter, YouTube, Substack, and LinkedIn. Interested in having me speak at your event? Use this form to get in touch.

