Irreverent Demystifiers

A field guide to the most popular parameters

Take a “moment” to explore some fundamentals

Cassie Kozyrkov
Towards Data Science
8 min read · Mar 5, 2021


This article takes you on a tour of the most popular parameters in statistics! If you’re not sure what a statistical parameter is or you’re foggy on how probability distributions work, I recommend scooting over to my beginner-friendly intro here in Part 1 before continuing here.

Get your distribution basics in Part 1 if you’re new to this space. Image by the author.

Note: If a concept is new to you, follow the link for my explanation. If the early stuff feels too technical, feel free to skip to the cuddly critter memes lower down.

Ready for the list of favorites? Let’s dive right in!

Mean

This word is pronounced “average.”

Expected value

An expected value, written as E(X) or E[X], is the theoretical probability-weighted mean of the random variable X.

You find it by weighting (multiplying) each potential value x that X can take by its corresponding probability P(X = x) and then combining them (with an integral for continuous variables like height or a sum for discrete variables like height-rounded-to-the-nearest-inch): E(X) = ∑ x P(X=x)

Image: SOURCE.

If we’re dealing with a fair six-sided die, X can take each value in {1, 2, 3, 4, 5, 6} with equal probability 1/6, so:

E(X) = (1)(1/6) + (2)(1/6) + (3)(1/6) + (4)(1/6) + (5)(1/6) + (6)(1/6) = 3.5

In other words, 3.5 is the probability-weighted average for X and nobody cares that 3.5 isn’t even an allowable outcome of the die roll.
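If you’d rather make the computer do the weighting, here’s a minimal Python sketch of the die calculation (exact fractions, purely for tidiness):

```python
from fractions import Fraction

# Expected value of a fair six-sided die:
# weight each face by its probability, then sum.
faces = range(1, 7)
p = Fraction(1, 6)

expected_value = sum(x * p for x in faces)
print(expected_value)         # 7/2
print(float(expected_value))  # 3.5
```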

Variance

For reasons that I’ll explain in a moment, replacing X with (X − E(X))² in the E(X) formula above gives you something very useful. But first, let me empower you to calculate it whenever the urge strikes you:

V(X) = E[(X − E(X))²] = ∑ [x − E(X)]² P(X=x) = E[X²] − [E(X)]²

That last bit is a handy reformulation whose proof I’ll leave as homework for the interested reader. (Don’t be surprised if it looks like steps were left out. They were.) Hmm, not a fan of the proof in that link? Try this one.

Let’s take the formula for a spin to get the variance for a fair die:

V(X) = ∑ [x − E(X)]² P(X=x) = ∑ (x − 3.5)² P(X=x)
= (1 − 3.5)²(1/6) + (2 − 3.5)²(1/6) + (3 − 3.5)²(1/6) + (4 − 3.5)²(1/6) + (5 − 3.5)²(1/6) + (6 − 3.5)²(1/6)
= 2.916666…
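And here’s the die’s variance in Python, confirming that the definition and the handy reformulation agree (a sketch, not a library one-liner):

```python
from fractions import Fraction

# Variance of a fair die, computed two ways to match the formulas above.
faces = range(1, 7)
p = Fraction(1, 6)
ex = sum(x * p for x in faces)  # E(X) = 7/2

# Definition: probability-weighted average squared distance from the mean.
variance = sum((x - ex) ** 2 * p for x in faces)

# Handy reformulation: E[X^2] - (E[X])^2.
ex2 = sum(x**2 * p for x in faces)
shortcut = ex2 - ex**2

print(variance, shortcut, float(variance))  # 35/12 35/12 2.9166...
```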

Moment

Ha, “in a moment” — a tiny pun in the previous section (almost surely amusing no one but myself). Ahem.

Moments are special kinds of expected values. There’s a pattern to them:

1st raw moment: E[X] …… 1st central moment: E[(X − 𝜇)]
2nd raw moment: E[X²] …… 2nd central moment: E[(X − 𝜇)²]
3rd raw moment: E[X³] …… 3rd central moment: E[(X − 𝜇)³]
…and so on.

Moments are worth knowing because they tell you about the shape of a distribution. Well, up to a point. Scaling the 3rd central moment gives you the distribution’s skewness, while scaling the 4th central moment gives you the distribution’s kurtosis (“tailedness”).
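To make the pattern concrete, here’s a sketch computing a few moments of our fair die (the helper function names are mine, not anything standard):

```python
from fractions import Fraction

# Raw and central moments of a fair die, following the pattern above.
faces = range(1, 7)
p = Fraction(1, 6)
mu = sum(x * p for x in faces)  # 1st raw moment (the mean): 7/2

def raw_moment(k):
    """E[X^k]"""
    return sum(x**k * p for x in faces)

def central_moment(k):
    """E[(X - mu)^k]"""
    return sum((x - mu) ** k * p for x in faces)

print(raw_moment(1))      # 7/2: the mean
print(central_moment(2))  # 35/12: the variance again
print(central_moment(3))  # 0: the die is symmetric, so no skew
```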

Higher moments

As for higher moments… well, the only reason I’d bring up the 5th moment is that it’s the name of the statistics cover band my friends formed in grad school. (Yes, we’re nerds. I know.)

The Fifth Moment performing a song about linear regression.

Tails

Hang on, “tailedness”?! Yup, that’s a word that comes up aplenty in statistics. I remember a friend emailing me this after opening his first stats textbook: “Tails? WTF, are these numbers or are these lemurs?”

Image: SOURCE.

On behalf of statistics, I apologize profusely. Whoever named them “tails” was probably smoking something, er, medicinal at the time and thought that the distribution was shaped like a critter with a tail. Or two tails. Nope, nothing zoologically dubious to see here…

Kurtosis

Kurtosis is a way to describe the chubbiness of the tail. Chances are that you won’t be referring to kurtosis frequently, so resist the urge to memorize anything about it. Instead, look it up when you need it (like I just did).

Here, I made you a guide to kurtosis (tailedness). Okay, this might be a better guide.
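When that day comes, scipy can do the looking up for you. A tiny sketch (the distributions are my choice, picked to contrast a thin tail with a chubby one; note that scipy reports excess kurtosis, with the Gaussian as the zero baseline):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Gaussian: the reference point for tail chubbiness.
print(stats.kurtosis(rng.normal(size=100_000)))            # ~0: the baseline

# Student's t with 5 degrees of freedom: noticeably chubbier tails.
print(stats.kurtosis(rng.standard_t(df=5, size=100_000)))  # positive: fat tails
```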

Skewness

It’s hard to remember which is which when it comes to left-skewed, right-skewed, positive-skewed, negative-skewed distributions… until you realize that the answer is in the direction that the tail is pointing. So, let’s try it with the dinosaur below:

Wherever that “tail” is pointing, that’s your answer. Image: SOURCE.

This would be a left-skewed (or negative-skewed) distribution, since that’s where the tail is pointing. And if you squint your brain hard enough, you’ll start seeing tails whenever you look at distributions too.

What on earth is this?! A Gaussian mixture? When it’s hard to decide which side the tail is pointing, there’s not much skew. Image: SOURCE.
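If you’d rather squint numerically, the sign of the skewness tells you which way the tail points. A quick sketch (the exponential distribution is just a convenient tailed example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A right-skewed sample (tail points right) and its mirror image (tail points left).
right_skewed = rng.exponential(scale=1.0, size=100_000)
left_skewed = -right_skewed

print(stats.skew(right_skewed))              # positive: tail points right
print(stats.skew(left_skewed))               # negative: tail points left
print(stats.skew(rng.normal(size=100_000)))  # ~0: no clear tail direction
```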

Moment Generating Functions

What if you can’t get enough of moments? Then moment generating functions (MGFs) are for you. The cool thing about MGFs is that they uniquely determine the distribution (so you could use them in place of CDFs and PDFs) and they give you a quick way to calculate all the moments your heart could desire.
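Here’s a sympy sketch of that quick way, using the fair die’s MGF: differentiate n times, plug in t = 0, and out pops the nth raw moment.

```python
import sympy as sp

t = sp.symbols('t')

# MGF of a fair die: M(t) = E[e^(tX)] = (1/6)(e^t + e^(2t) + ... + e^(6t)).
M = sp.Rational(1, 6) * sum(sp.exp(k * t) for k in range(1, 7))

# The nth raw moment is the nth derivative of M, evaluated at t = 0.
first_raw = sp.diff(M, t, 1).subs(t, 0)   # E[X]   = 7/2
second_raw = sp.diff(M, t, 2).subs(t, 0)  # E[X^2] = 91/6

print(first_raw, second_raw)
print(second_raw - first_raw**2)  # the variance yet again: 35/12
```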

Or maybe you won’t need them, because you’ll mostly be working with the most popular moments:

First raw moment = 𝜇 = E[X]
Second central moment = 𝜎² = E[(X − 𝜇)²]

Look familiar? That’s right. Hello again, mean and variance!

Variance, again

The variance tells you how much a random variable varies from its mean. How “all-over-the-place” is it?

Consider how little an average tells you on its own. For example, if you looked at the average hours of sleep I get, you’d guess I’m a champion sleeper… until you saw the variance (oh dear). Sleeping 4 hours one night and 12 hours the next is not the same as sleeping 8 and 8, despite the identical mean.


Variance is a very important concept when it comes to probability, since the lower the variance, the easier it is to make predictions. When there’s no variance, you have your answer with certainty.

Alas, variance is not the most presentable way to convey this kind of information. That’s why it’s polite to take the square root before spreading your results in polite company. That square root has a special name…

Standard deviation

Standard deviation is the square root of variance: it’s 𝜎 instead of 𝜎², so it measures the same thing. Reporting results in terms of standard deviation instead of variance is friendlier since our buddy s.d. is on a scale that makes sense as a half-decent measure of distance.

Think of things that call themselves “deviations” as measuring the distance between diners and the central line of a long Hogwarts-style banquet table. Image: SOURCE.

In general, you can think of things that call themselves “deviations” as summarizing the distance between diners and the central line of a long Hogwarts-style banquet table.

In school, you’re taught that the standard deviation tells you the average spread of values around the mean… sort of.

Actually, the truest measure of that would be something called mean absolute deviation.

MAD

The mean absolute deviation is the one you intuitively imagine when I say “the average spread of your values around the average.”

Unfortunately, its formula can be an absolute pest to work with — the absolute value function (the one that kills minus signs) comes with the kind of sharp corner that makes some optimization techniques MAD, so we often prefer to work with standard deviation instead. Close enough.
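To see the difference in numbers, here’s a sketch with a hypothetical sleep log (made-up hours with a mean of 8, in the spirit of the example above):

```python
import numpy as np

# A hypothetical sleep log (hours per night): mean of 8, but plenty of spread.
nights = np.array([4.0, 12.0, 8.0, 8.0, 6.0, 10.0])

mean = nights.mean()                # 8.0
sd = nights.std()                   # ~2.58: the square root of the variance
mad = np.abs(nights - mean).mean()  # 2.0: the "intuitive" average spread

print(mean, sd, mad)
```

The standard deviation always comes out at least as large as the MAD, because squaring punishes big deviations extra hard, so the two tell similar, but not identical, stories.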

This section has spread out more than enough, so let’s center ourselves.

Median

Median is pronounced “the middle thing.” The median is usually the quantity you want to be thinking about when you say “the average income.” If a group of people has the following salaries in $000s: {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 100000000000}, then the median salary ($1K) on its own gives a more useful summary of what’s going on with the group (at least half the group earns the median or less, and at least half earns it or more) than the average salary (about $7,692,307,693K). Those of you who are paying excessive attention will notice that a distribution with this shape would be described as one-tailed and right-skewed.
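To watch the outlier do its dirty work in code (same numbers as above):

```python
import numpy as np

# The salaries from the example above, in $000s.
salaries = np.array([1] * 12 + [100_000_000_000])

print(np.median(salaries))  # 1: a sane summary of the group
print(np.mean(salaries))    # ~7,692,307,693: dragged skyward by one outlier
```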

This is for the three of you who share my taste in movies.

Mode

Mode is pronounced “the most common value.” The mode is where a distribution/histogram has its peak. When you hear that a distribution is multimodal, it means there’s more than one peak. When a distribution is symmetric and unimodal, like the pretty little bell curve, the mode also happens to be the mean. If you want to be technically correct, you’d stop saying “the average Joe” when you actually mean “the modal Joe.”
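For discrete data, finding the mode is just counting. A tiny sketch with toy numbers:

```python
from collections import Counter

# "The most common value": count occurrences and take the biggest.
values = [1, 2, 2, 3, 3, 3, 4]
mode, count = Counter(values).most_common(1)[0]
print(mode, count)  # 3 appears 3 times
```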

Why mean and variance?

If people think more intuitively about medians and MADs, why do students learn about means and variances instead? The short answer is that the mean and variance functions are more convenient for performing various mathematical tricks.

The Central Limit Theorem (CLT) says that if you’re working with lots of data, you can safely assume that the distribution of the sample average is normal (bell-shaped), just like this plushy. In a future article, I’ll tell you all about the CLT and the terrible price of convenient calculations, but I’ll stop here for now to let you digest the barrage of information you so patiently absorbed. Image: SOURCE.
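If you’d like a sneak preview you can run yourself, here’s a quick simulation sketch of the CLT’s claim (the exponential distribution is an arbitrary skewed choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Take 10,000 samples of size 100 from a decidedly non-normal (exponential)
# distribution and average each one: the averages pile up into a bell shape.
sample_means = rng.exponential(scale=1.0, size=(10_000, 100)).mean(axis=1)

print(sample_means.mean())  # ~1.0: the underlying mean
print(sample_means.std())   # ~0.1: shrinks like 1/sqrt(sample size)
```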

You’ll be amazed how often the “standard” ways of doing things involve answering the easy questions instead of the right questions. Don’t assume that techniques you learned in class are the ones you should be using for important work. The best statistical skill you can cultivate is the ability to think for yourself.


Thanks for reading! How about an AI course?

If you had fun here and you’re looking for an applied AI course designed to be fun for beginners and experts alike, here’s one I made for your amusement:

Enjoy the entire course playlist here: bit.ly/machinefriend

Connect with Cassie Kozyrkov

Let’s be friends! You can find me on Twitter, YouTube, Substack, and LinkedIn. Interested in having me speak at your event? Use this form to get in touch.
