Irreverent Demystifiers

Getting to know probability distributions

Back-to-basics on data science fundamentals

Cassie Kozyrkov
Towards Data Science
6 min read · Mar 4, 2021

Test yourself! How many of these core statistical concepts are you able to explain?

CLT, CDF, Distribution, Estimate, Expected Value, Histogram, Kurtosis, MAD, Mean, Median, MGF, Mode, Moment, Parameter, Probability, PDF, Random Variable, Random Variate, Skewness, Standard Deviation, Tails, Variance

Got some gaps in your knowledge? Read on!

Note: If you see an unfamiliar term below, follow the link for an explanation.

Random variable

A random variable (R.V.) is a mathematical function that turns reality into numbers. Think of it as a rule to decide what number you should record in your dataset after a real-world event happens.

A random variable is a rule for simplifying reality.

For example, if we’re interested in the roll of a six-sided die, we might define X to be the random variable that maps your gooey sensory experience of a real-world die roll to one of these numbers: {1,2,3,4,5,6}. Or maybe we’ll only record {0, 1} for odd/even. It all depends on how we choose to define our R.V.

Image: SOURCE.

(If that’s too technical, just think of a random variable as a way to indicate an outcome: if X is about die rolls, X=4 is a way to say that we rolled a 4. If it’s not technical enough, you’ll almost surely love taking a measure theory class.)

Random Variate

Many students confuse random variables with random variates. If you’re a casual reader, skip this, but enthusiasts take note: random variates are outcome values like {1, 2, 3, 4, 5, 6} while random variables are functions that map reality onto numbers. Little x versus big X in your textbook’s formulas.
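If the big X versus little x distinction still feels slippery, a few lines of Python may help. This is only a sketch: the function names, the spelled-out outcome strings, and the choice of 1 for even and 0 for odd are all made up for illustration.

```python
# Hypothetical names throughout; "outcome" stands in for the real-world event.

def face_value(outcome: str) -> int:
    """Random variable X: map a described die roll to the number showing."""
    faces = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}
    return faces[outcome]

def is_even(outcome: str) -> int:
    """A different random variable on the same reality: 1 if even, 0 if odd (assumed coding)."""
    return 1 if face_value(outcome) % 2 == 0 else 0

# A random variate (little x) is the value the function returns for one event.
x = face_value("four")   # the variate 4, i.e. the event X = 4
y = is_even("four")      # the variate 1 under the odd/even definition
print(x, y)              # 4 1
```

The functions play the role of the random variables; the numbers they hand back are the variates.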

Probability

P(X=4) would be read in English as “The probability that my die lands with the 4 facing up.” If I’ve got a fair six-sided die, P(X=4)=1/6. But… but… but… what is probability and where does that 1/6 come from? Glad you asked! I’ve covered some probability basics for you here, with combinatorics thrown in as a bonus.
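For a fair die, that 1/6 is just counting: one favorable face out of six equally likely ones. Here is a minimal sketch of that counting argument, assuming a fair die; the helper name prob is invented for this example.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]  # assumed: a fair die, so every face is equally likely

def prob(event) -> Fraction:
    """P(event) = favorable outcomes / total outcomes, valid only for equally likely faces."""
    favorable = [f for f in faces if event(f)]
    return Fraction(len(favorable), len(faces))

print(prob(lambda f: f == 4))      # 1/6
print(prob(lambda f: f % 2 == 0))  # 1/2
```

The counting shortcut stops working the moment the faces are not equally likely, which is where distributions come in.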

Distribution

A distribution is a way to express the probabilities of the entire set of values that X can take.

A distribution gives you popularity contest results in graphical form.

Probability Density Function (PDF)

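The best way to summon a distribution is to utter its true name: its probability density function. What does such a function signify? If we put X on the x-axis (yup), then the height on the y-axis shows how likely each outcome is. For a discrete R.V. like a die roll, that height is the probability itself; for a continuous R.V., the height is a density and probabilities are areas under the curve.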

A probability density function gives you popularity contest results for your whole population. It’s basically the population histogram. Horizontal axis: population data values. Vertical axis: relative popularity. To learn more about this graph and the details that I omitted, head over here.

As I’ve explained in detail here, a distribution is essentially an imaginary idealized bar chart (for discrete R.V.s) or histogram (for continuous R.V.s).* In other words, the distribution is taller for more likely values of X. The distribution for a fair die has equal height for all outcomes (“discrete uniform”); not so for a weighted die.

Like distributions, you can think of bar charts and histograms as popularity contests. Or tip jars. That works too.
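To see "equal height for all outcomes" next to a lopsided alternative, here is a small sketch that stores each distribution as a dictionary and draws a text bar chart. The weighted die's probabilities are invented purely for illustration.

```python
from fractions import Fraction

# Fair die: every face gets the same height ("discrete uniform").
fair = {face: Fraction(1, 6) for face in range(1, 7)}

# Hypothetical weighted die: the 6 comes up half the time, the rest share the remainder.
weighted = {face: Fraction(1, 10) for face in range(1, 6)}
weighted[6] = Fraction(1, 2)

for name, pmf in [("fair", fair), ("weighted", weighted)]:
    assert sum(pmf.values()) == 1  # the heights must add up to 1
    print(name)
    for face, p in pmf.items():
        # Bar length is proportional to probability: taller means more likely.
        print(f"  {face} {'#' * int(p * 30)} {p}")
```

Both dictionaries are valid distributions; only the first one has a famous name.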

Cumulative Distribution Function (CDF)

This is the integral** of the probability density function. In English? Instead of showing how likely each value of X is, the function shows the cumulative probability of that value and everything below it. If you’re thinking of percentiles, awesome. The value (the percentile) is what’s on the x-axis and the cumulative percentage is what’s on the y-axis.

Probability: Getting a 3 on a six-sided die? 1/6
Cumulative: Getting a 3 or lower? 3/6
The 50th percentile is a 3. The 3 goes on the x-axis, 50% goes on the y-axis.
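The die example above can be sketched as a running sum. This assumes a fair die; the percentile helper is a made-up name that returns the smallest face whose cumulative probability reaches the requested level.

```python
from fractions import Fraction
from itertools import accumulate

faces = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6      # assumed fair die
cdf = list(accumulate(pmf))     # running sum of the heights (discrete case: sum, not integral)

def percentile(q: Fraction) -> int:
    """Smallest face whose cumulative probability is at least q."""
    return next(face for face, c in zip(faces, cdf) if c >= q)

print(cdf[2])                      # P(X <= 3) = 1/2
print(percentile(Fraction(1, 2)))  # 3
```

Reading the CDF left to right gives "value to percentage"; the percentile helper reads it the other way around.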

Choosing Your Distribution

How do you know what distribution is right for your X? Statisticians have two favorite approaches. They either (1) estimate empirical distributions from their data — using, you guessed it, histograms! — or they (2) make theoretical assumptions about which member of a popular distribution catalog looks most similar to how they believe their data source behaves. (If you have data, it’s a great idea to check those distribution assumptions with a hypothesis test.)

The standard approach to choosing a distribution involves plotting a histogram and comparing its shape with the shapes of theoretical distributions in a catalog, such as the list of distributions on Wikipedia, in your textbook, or on the sales page for the distribution plushies above. (And now you get to wonder just how much I’m kidding.) Image of the author’s personal plushie collection.

When we look at our catalog, we notice that the various distributions have names like “Normal” or “Chi-squared” or “Cauchy”… which gives students the mistaken impression that these are the only options. They’re not. They’re just the famous ones. Just like people, distributions might be famous for all the wrong reasons.

On the plus side, named distributions come with neat PDFs and a bunch of calculations pre-done for you.

On the minus side, your application might not fit anything in a catalog. Thank goodness for the empirical option.
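Here is roughly what the empirical option looks like in practice: count how often each value shows up and compare the relative frequencies with a theoretical candidate. The data below are simulated stand-ins (a seeded pseudo-random die), not real measurements, and eyeballing the two columns is not a substitute for a proper hypothesis test.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rolls = [rng.randint(1, 6) for _ in range(6000)]  # simulated stand-in for real data

counts = Counter(rolls)
# Empirical distribution: relative frequency of each face in the sample.
empirical = {face: counts[face] / len(rolls) for face in range(1, 7)}

for face, freq in empirical.items():
    print(f"{face}: empirical {freq:.3f} vs theoretical {1/6:.3f}")
```

With a few thousand rolls the empirical heights should land close to 1/6 each, but they will not match exactly; that gap is sampling noise.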

Parameters

Here’s the probability density function for a very popular distribution, the normal distribution (a.k.a. Gaussian or bell-shaped curve):
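In its standard textbook form, with x as the value of interest, the normal density is usually written as:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
```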

Let’s be honest — the insights aren’t exactly leaping off the page. That’s why we tend to prefer asking questions about specific parameters of interest to us. In statistics, parameters summarize populations or distributions. For example, if you’re asking whether the distribution peaks at zero, you’re asking about the location of its mode (a parameter). If you’re asking how fat the distribution is, you’re asking about its variance (another parameter). In a moment, I’ll take you on a tour of a few of my favorite parameters.

But before we do that, let me answer this question: instead of computing summary measures, why don’t we just plot this function and ogle it? We’re not ready yet.

If you look at the function above, you’ll notice that there are some Greek letters in there: μ and 𝜎.*** These are special parameters for this distribution; until we replace them with numbers, we’re not ready to plot anything. Without them, all we can do is get a vague sense of the abstract shape of the distribution, like so:

Image: SOURCE.

Want axes? Put numbers where the Greek letters are. For example, here’s what you get with μ = 0 vs 5 vs 10 and 𝜎 = 1:

Pink μ = 0, Blue μ = 5, Green μ = 10
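If you would rather plug in numbers yourself, Python's standard library has a NormalDist class that takes exactly those two quantities. A minimal sketch, using the same μ and 𝜎 values as the picture; the three evaluation points per curve are an arbitrary choice for illustration.

```python
from statistics import NormalDist

sigma = 1
for mu in (0, 5, 10):
    dist = NormalDist(mu=mu, sigma=sigma)
    # Evaluate the density one unit left of the peak, at the peak, and one unit right.
    heights = [round(dist.pdf(x), 3) for x in (mu - 1, mu, mu + 1)]
    print(f"mu={mu}: peak at x={dist.mode}, heights around the peak {heights}")
```

Changing μ slides the same shape along the x-axis: the heights around each peak are identical, only the location moves.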

There’s plenty more Greek to enjoy, since other distributions use other characters for their special quantities. Eventually, you’ll get sick of it and start using θ₁, θ₂, θ₃, etc. for all of them.

It’s also worth remembering that distributions and their parameters are theoretical objects involving assumptions about a population you haven’t got all the info on, whereas a histogram is a more practical object — a summary of sample data that you do have. You’ll avoid plenty of confusion if you keep concepts to do with samples and populations separate, so it might be worth brushing up on them here.

You can find my explanations here.

And now we’re ready for a tour of my favorite parameters, to be continued in Part 2.

Footnotes

*Technically, a discrete R.V.’s function is called a probability mass function instead of a probability density function, but I haven’t met anyone who cares if you call a PMF a PDF.

**If you have a discrete R.V., then it’s the sum instead of the integral.

***Nothing special about that π. It’s just the regular one we celebrate on March 14th.
