A graduate student’s perspective on statistics

Why is statistics such a universally confusing field?

Olivia Angiuli
Towards Data Science

--

When I’m meeting someone new and the inevitable question is asked, “So, what do you do?”, I respond that I’m a Ph.D. student in statistics. More often than not, the response is some variation of this:

“Statistics?! I had one required statistics class in undergrad, and it was so confusing.”

The sheer frequency of this event has got me thinking about why statistics is so universally confusing. For me, the answer is complex, rooted in the translation of deterministic events (as we see them) into realizations of random processes (as statistics sees them). I attempt to unpack these complexities below.

Jumping from points to clouds

Up until Calculus, where the standard trajectory of mathematical education often ends, we live in a world of points, lines, curves, etc. — shapes where quantities are intertwined via deterministic relationships. I know that speed = distance/time, so if I need to get somewhere 60 kilometers away in 45 minutes, then I know I must drive at 80 km/hr (60 km ÷ 0.75 hr). These functional relations are deterministic.

But upon entering a Statistics class, we step into a world of randomness. We are no longer working within a set of known relations, but rather are making inferences about the unknown mechanisms that produce the observations we see.

For instance, we wait for a bus and, given its past reliability, we try to predict whether it will come on time today. There are many sources of randomness — the synchronization of traffic lights, the courteousness of other drivers on the road, the weather conditions — and these factors lead to different arrival times.

As a result, statistics implicitly deals with probability clouds rather than points. A tight probability cloud (“the bus comes every 10±1 minutes”) signals high certainty that the bus almost always arrives close to its scheduled time. A dispersed cloud (“the bus comes every 10±10 minutes”) signals wide variability in arrival times. This may occur either because we simply haven’t observed enough to be certain about the bus’s long-run patterns (maybe we’ve only waited for it twice), or because the observations themselves are widely scattered (sometimes it is early by 10 minutes, but other times it is late by 20).
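To make the contrast between tight and dispersed clouds concrete, here is a quick simulation. The two bus routes, their schedules, and their noise levels are all made up for illustration:

```python
import random
import statistics

random.seed(0)

# Simulate 1,000 arrivals for two hypothetical bus routes.
# Route A: a tight cloud -- scheduled every 10 minutes, about +/-1 minute of noise.
# Route B: a dispersed cloud -- same schedule, about +/-10 minutes of noise.
tight = [random.gauss(10, 1) for _ in range(1000)]
dispersed = [random.gauss(10, 10) for _ in range(1000)]

# Both clouds are centred on the same 10-minute schedule...
print(round(statistics.mean(tight), 1), round(statistics.mean(dispersed), 1))

# ...but their spreads differ by an order of magnitude.
print(round(statistics.stdev(tight), 1), round(statistics.stdev(dispersed), 1))
```

Both routes have the same average arrival time, so a summary that reports only the mean cannot distinguish them; the cloud’s spread is where the difference lives.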

I was never explicitly told that I was leaving a world of lines for a world of clouds, but I believe this transition to be a fundamental difficulty in jumping into the world of statistics. One major consequence is that we no longer deal with equivalences (e.g., y = x + 3), but rather with ranges and probabilities (e.g., −5 < x < 5 with 95% probability).

Roses are red, violets are blue… but do I have a rose or a violet?

Often, statistics problem sets will pose a question that begins with:

“Assume X is distributed…”

In these cases, the theoretical behavior of the random variable X is told to us, and we come to conclusions about what we’re likely to observe about its realizations, which are denoted by a lowercase x. Given different sets of assumptions about X, we can conclude different properties about its behavior, as portrayed below:
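As a toy illustration of this direction of reasoning (assume a distribution for the capital X, then check what its lowercase realizations x look like), suppose X is normal with mean 10 and standard deviation 2. Theory says about 68% of realizations should fall within one standard deviation of the mean, and a simulation agrees:

```python
import math
import random

random.seed(0)

# Assume (capital) X ~ Normal(mean=10, sd=2). Theory says roughly 68%
# of its probability mass lies within one standard deviation of the mean.
mean, sd = 10, 2
theoretical = math.erf(1 / math.sqrt(2))  # P(|X - mean| < sd), about 0.683

# Draw 10,000 (lowercase) realizations x and check the empirical share.
xs = [random.gauss(mean, sd) for _ in range(10_000)]
empirical = sum(abs(x - mean) < sd for x in xs) / len(xs)

print(round(theoretical, 3), round(empirical, 3))
```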

With this in mind, you may assume that a crucial tool in statistics is a “sorting hat” which, given a set of empirical data, tests whether they’re closest to being normal, Poisson, etc. This way, we would know which set of properties to expect from our data.

Yet such tools are quite scarce! There is the Kolmogorov-Smirnov test, which measures the distance between an empirical distribution and a presumed theoretical one (or between two empirical distributions), but it was only mentioned in passing in one of my undergraduate statistics classes. The QQ plot acts as an improved form of two overlaid histograms, visually comparing whether the quantiles of an empirical distribution and its presumed theoretical distribution match up. And it wasn’t until graduate school that I learned about methods like kernel density estimation, which can be used to estimate non-standard density functions.
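The Kolmogorov-Smirnov statistic itself is simple enough to compute by hand. Here is a minimal sketch of the one-sample version, measuring the largest gap between an empirical CDF and a presumed normal distribution (both datasets are simulated, not real):

```python
import math
import random

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDF of `sample` and a presumed theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    # At each sorted point, compare the CDF against the empirical CDF
    # just before and just after its jump.
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )

random.seed(0)
normal_data = [random.gauss(0, 1) for _ in range(2000)]
uniform_data = [random.uniform(-2, 2) for _ in range(2000)]

# The statistic is small when the data really are standard normal,
# and noticeably larger when they are not.
print(round(ks_statistic(normal_data, normal_cdf), 3))
print(round(ks_statistic(uniform_data, normal_cdf), 3))
```

In practice one would compare the statistic against its null distribution to get a p-value (as `scipy.stats.kstest` does), but the core idea is just this maximum vertical gap.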

The field of statistical inference, which dedicates itself to linking an observed dataset with its underlying properties, tends to assume a distribution on the data and simply provide methods for estimating its parameters. Do the exact distributions themselves have as little effect on the resulting properties as the scarcity of tools for checking them seems to imply?

One attempt at a self-rebuttal lies in the central limit theorem. We are often interested in the behavior of means (on average, will the bus come on time?), and the central limit theorem says that the sum of many independent, identically distributed random variables with finite variance, suitably normalized, tends toward a normal distribution. Since the mean is a scaled sum, derivations about normal distributions can be applied to the means of realizations of independent random variables.
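A quick simulation makes this concrete. Below, individual bus waits are drawn from a uniform distribution, which looks nothing like a bell curve, yet the sample means behave just as the theorem predicts (the waiting-time model is, of course, invented):

```python
import math
import random
import statistics

random.seed(0)

# Each bus wait is uniform on [0, 20] minutes -- decidedly non-normal.
# The CLT says the *mean* of n independent waits is approximately
# Normal(mu, sigma / sqrt(n)), here with mu = 10 and sigma = 20 / sqrt(12).
n = 50
means = [statistics.mean(random.uniform(0, 20) for _ in range(n))
         for _ in range(5000)]

mu, sigma = 10, 20 / math.sqrt(12)
print(round(statistics.mean(means), 2))   # close to mu = 10
print(round(statistics.stdev(means), 2))  # close to sigma / sqrt(n), about 0.82
```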

Another reply would point to the increasing attention given to semi- and non-parametric methods, in which fewer (or no) distributional assumptions are made. As should reasonably be expected, however, the increased flexibility and applicability of these methods may come at the cost of decreased power.

And how about data scientists?

Worrying about the accuracy of our models of randomness may seem moot when we realize that data scientists must ultimately make deterministic decisions. Because of this, simply comparing the magnitudes of two numbers, or plotting side-by-side histograms, may be enough to determine the winning decision. Statistics operates in a world of uncertainty, but data scientists operate in a world where decisions must be made in a binary fashion. Yes or no.

In practice, randomness — usually in the form of a p-value — simply tells us how sure to be about our decision, but will never change the direction of our decision. With this in mind, does precise quantification of the degree of uncertainty actually matter? If we had chosen to model our data as one distribution versus another, would that have actually changed anything?
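As a sketch of this point, consider a hypothetical A/B test of two website designs, analyzed with a simple permutation test: the sign of the observed difference picks the winner, and the p-value only quantifies how confident we should be in that pick (all numbers are made up):

```python
import random
import statistics

random.seed(0)

# Hypothetical daily click counts for two website designs, A and B.
a = [12, 15, 11, 14, 13, 16, 12, 15]
b = [14, 17, 15, 18, 16, 19, 15, 17]

observed = statistics.mean(b) - statistics.mean(a)

# Permutation test: shuffle the group labels many times and count how
# often a difference at least this large arises by chance alone.
pooled = a + b
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[len(a):]) - statistics.mean(pooled[:len(a)])
    if abs(diff) >= abs(observed):
        count += 1
p_value = count / trials

# The *direction* of the decision comes from the sign of `observed`;
# the p-value only says how sure to be about it.
print(round(observed, 2), round(p_value, 4))
```

Whether we model the clicks as Poisson, normal, or nothing at all, design B wins here; the distributional choice mostly moves the uncertainty estimate, not the decision.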

My reservations about Statistics

You often hear people describe the pursuit of a Ph.D. as knowing “more and more about less and less”. But that’s not quite what I’d say about my Ph.D. work so far — I am certainly still learning the foundations of Statistics (like Stein’s paradox!).

Given the above discussion, my concerns center more on the following:

(1) How to carry over our understandings from theoretical statistics to the real world. Doing so seems to require an ability to verify the degree to which assumptions are held, an area of statistics that seems underdeveloped.

(2) The usefulness of such careful treatment of randomness in a deterministic world. If Statistics is used in order to make concrete choices about public policies, medication, website design, etc., in what way does randomness play an important role?

My thoughts are always evolving, and I heartily welcome comments, disagreements, or feedback below.

Acknowledgements

Many thanks to Zi Chong Kao, Philipp Moritz, Avi Feller, Theresa Gebert, and Zoe Vernon for reading and refining these ideas with me.

--

Ph.D. student in Statistics at UC-Berkeley. Interests: Climbing, data privacy, Michael Pollan. Previously: Quora, Harvard, Google, Akamai, Represent.Us