Back to the future of data analysis: A tidy implementation of Tukey’s vacuum

Ron Sielinski
Towards Data Science
16 min read · Aug 16, 2020

--

One of John Tukey’s landmark papers, “The Future of Data Analysis”, contains a set of analytical techniques that have gone largely unnoticed, as if they’re hiding in plain sight.

Multiple sources identify Tukey’s paper as a seminal moment in the history of data science. Both Forbes (“A Very Short History of Data Science”) and Stanford (“50 years of Data Science”) have published histories that use the paper as their starting point. Springer included it in their collection Breakthroughs in Statistics. And I’ve quoted Tukey myself in articles about data science at Microsoft (“Using Azure to understand Azure”).

Independent of the paper, Tukey’s impact on data science has been immense: He was author of Exploratory Data Analysis. He developed the Fast Fourier Transform (FFT) algorithm, the box plot, and multiple statistical techniques that bear his name. He even coined the term “bit.”

But it wasn’t until I actually read “The Future of Data Analysis” that I discovered Tukey’s forgotten techniques. Of course, I already knew the paper was important. But I also knew that if I wanted to understand why — to understand the breakthrough in Tukey’s thinking — I had to read it myself.

Tukey does not disappoint. He opens with a powerful declaration: “For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt” (p 2). Like the opening of Beethoven’s Fifth, the statement is immediate and bold. “All in all,” he says, “I have come to feel that my central interest is in data analysis…” (p 2).

Despite Tukey’s use of first person, his opening statement is not about himself. He’s putting his personal and professional interests aside to make the much bolder assertion that statistics and data analysis are separate disciplines. He acknowledges that the two are related: “Statistics has contributed much to data analysis. In the future it can, and in my view should, contribute much more” (p 2).

Moreover, Tukey states that statistics is “pure mathematics.” And, in his words, “…mathematics is not a science, since its ultimate standard of validity is an agreed-upon sort of logical consistency and provability” (p 6). Data analysis, however, is a science, distinguished by its “reliance upon the test of experience as the ultimate standard of validity” (p 5).

Nothing on CRAN

Not far into the paper, however, I stumbled. About a third of the way in (p 22), Tukey introduces FUNOP, a technique for automating the interpretation of plots. I paged ahead and spotted a number of equations. I worried that — before I could understand the equations — I might need an intuitive understanding of FUNOP. I paged further ahead and spotted a second technique, FUNOR-FUNOM. I soon realized that this pair of techniques, combined with a third that I didn’t yet realize was waiting for me, makes up nearly half the paper.

To understand “The Future of Data Analysis,” I would definitely need to learn more about FUNOP and FUNOR-FUNOM. I took that realization in stride, though, because I learned long ago that data science is — and will always be — full of terms and techniques that I don’t yet know. I’d do my research and come back to Tukey’s paper.

But when I searched online for FUNOP, I found almost nothing. More surprising, there was nothing in CRAN. Given the thousands of packages in CRAN and the widespread adoption of Tukey’s techniques, I expected there to be multiple implementations of the techniques from such an important paper. Instead, nothing. (Until now….)

FUNOP

Fortunately, Tukey describes in detail how FUNOP and FUNOR-FUNOM work. And, fortunately, he provides examples of how they work. Unfortunately, he provides only written descriptions of these procedures and their effect on example data. So, to understand the procedures, I implemented each of them in R. (See my repository on GitHub.) And to further clarify what they do, I generated a series of charts that make it easier to visualize what’s going on.

Here’s Tukey’s definition of FUNOP (FUll NOrmal Plot):

  • (b1) Let aᵢ₍ₙ₎ be a typical value for the ith ordered observation in a sample of n from a unit normal distribution.
  • (b2) Let y₁ ≤ y₂ ≤ … ≤ yₙ be the ordered values to be examined. Let y̍ be their median (or let ӯ, read “y trimmed”, be the mean of the yᵢ with ⅓n < i ≤ ⅓(2n)).
  • (b3) For i ≤ ⅓n or > ⅓(2n) only, let zᵢ = (yᵢ - y̍)/aᵢ₍ₙ₎ (or let zᵢ = (yᵢ - ӯ)/aᵢ₍ₙ₎).
  • (b4) Let z̍ be the median of the z’s thus obtained (about ⅓(2n) in number).
  • (b5) Give special attention to z’s for which both |yᵢ - y̍| ≥ A · z̍ and zᵢ ≥ B · z̍ where A and B are prechosen.
  • (b5*) Particularly for small n, zⱼ’s with j more extreme than an i for which (b5) selects zᵢ also deserve special attention… (p 23).

The basic idea is very similar to a Q-Q plot.

Tukey gives us an example of 14 data points. On a normal Q-Q plot, if data are normally distributed, they form a straight line. But in the chart below, based upon the example data, we can clearly see that a couple of the points are relatively distant from the straight line. They’re outliers.

The goal of FUNOP is to eliminate the need for visual inspection by automating interpretation.

The first variable in the FUNOP procedure (aᵢ₍ₙ₎) simply gives us the theoretical distribution, where i is the ordinal value in the range 1..n and Gau⁻¹ is the quantile function of the normal distribution (i.e., the “Q” in Q-Q plot):
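In R, qnorm() plays the role of Gau⁻¹. Here’s a minimal sketch of that step; the (3i - 1)/(3n + 1) plotting position is the standard normal-scores approximation, which I’m assuming here as a stand-in for Tukey’s typical values (typical_values() is my own helper name):

    # Typical values a_i(n) for the ith ordered observation in a sample of n
    # drawn from a unit normal distribution. The (3i - 1)/(3n + 1) plotting
    # position is an assumption: the usual normal-scores approximation.
    typical_values <- function(n) {
      i <- 1:n
      qnorm((3 * i - 1) / (3 * n + 1))
    }

    typical_values(14)   # one typical value per ordered observation, n = 14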

The key innovation of FUNOP is to calculate the slope of each point, relative to the median.

If y̍ is the median of the sample, and we presume that it’s located at the midpoint of the distribution (where a(y) = 0), then the slope of each point can be calculated as:

zᵢ = (yᵢ - y̍) / aᵢ₍ₙ₎

The chart above illustrates how the slope of one point (1.2, 454) is calculated, relative to the median (0, 33.5).

Any point that has a slope significantly steeper than the rest of the population is necessarily farther from the straight line. To do this, FUNOP simply compares each slope (zᵢ) to the median of all calculated slopes (z̍).

Note, however, that FUNOP calculates slopes for the top and bottom thirds of the sorted population only, in part because zᵢ won’t vary much over the inner third of that population, but also because the value of aᵢ₍ₙ₎ for the inner third will be close to 0 and dividing by ≈0 when calculating zᵢ might lead to instability.

Significance — what Tukey calls “special attention” — is partially determined by B, one of two predetermined values (or hyperparameters). For his example, Tukey recommends a value between 1.5 and 2.0, which means that FUNOP simply checks whether the slope of any point, relative to the midpoint, is more than 1.5 (or 2.0) times the median slope.

The other predetermined value is A, which is roughly equivalent to the number of standard deviations of yᵢ from y̍ and serves as a second criterion for significance.

The following chart shows how FUNOP works.

Our original values are plotted along the x-axis. The points in green make up the inner third of our sample, and we use them to calculate y̍, the median of just those points, indicated by the green vertical line.

The points not in green make up the outer thirds (i.e., the top and bottom thirds) of our sample, and we use them to calculate z̍, the median slope of just those points, indicated by the black horizontal line.

Our first selection criterion is zᵢ ≥ B · z̍. In his example, Tukey sets B = 1.5, so our threshold of interest is 1.5 · z̍, indicated by the blue horizontal line. We’ll consider any point above that line (the shaded blue region) as deserving “special attention”. We have only one such point, colored red.

Our second criterion is |yᵢ - y̍| ≥ A · z̍. In his example, Tukey sets A = 0, so our threshold of interest is |yᵢ - y̍| ≥ 0 or (more simply) yᵢ ≠ y̍. Basically, any point not on the green line. Our one point in the shaded blue region isn’t on the green line, so we still have our one point.

Our final criterion is any zⱼ’s with j more extreme than any i selected so far. Basically, that’s any value more extreme than the ones already identified. In this case, we have one value that’s larger (further to the right on the x-axis) than our red dot. That point is colored orange, and we add it to our list.

The two points identified by FUNOP are the same ones that we identified visually in Chart 1.

Technology

FUNOP represents a turning point in the paper.

In the first section, Tukey explores a variety of topics from a much more philosophical perspective: The role of judgment in data analysis, problems in teaching analysis, the importance of practicing the discipline, facing uncertainty, and more.

In the second section, Tukey turns his attention to “spotty data” and its challenges. The subsections get increasingly technical, and the first of many equations appears. Just before he introduces FUNOP, Tukey explores “automated examination”, where he discusses the role of technology.

Even though Tukey wrote his paper nearly 60 years ago, he anticipates the dual role that technology continues to play to this day: It will democratize analysis, making it more accessible for casual users, but it will also enable the field’s advances:

“(1) Most data analysis is going to be done by people who are not sophisticated data analysts…. Properly automated tools are the easiest to use for [someone] with a computer.

“(2) …[Sophisticated data analysts] must have both the time and the stimulation to try out new procedures of analysis; hence the known procedures must be made [as] easy for them to apply as possible. Again automation is called for.

“(3) If we are to study and intercompare procedures, it will be much easier if the procedures have been fully specified, as must happen [in] the process of being made routine and automatizable” (p 22).

The juxtaposition of “automated examination” and “FUNOP” made me wonder about Tukey’s reason for including the technique in his paper. Did he develop FUNOP simply to prove his point about technology? It effectively identifies outliers, but it’s complicated enough to benefit from automation.

Feel free to skip ahead if you’re not interested in the code:
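Here’s a minimal sketch of steps (b1) through (b5*), reusing the typical_values() helper from above. The full version is in the GitHub repository mentioned earlier; the function and column names here are my own, and the data in the last two lines are made up, just to show the call:

    # A sketch of FUNOP, following steps (b1)-(b5*). A and B are the two
    # prechosen hyperparameters.
    funop <- function(y, A = 0, B = 1.5) {
      n <- length(y)
      ord <- order(y)
      ys <- y[ord]                                # ordered values (b2)
      i <- 1:n

      a <- typical_values(n)                      # typical values (b1)
      y_med <- median(ys)                         # median of the y's (b2)

      outer_third <- i <= n / 3 | i > 2 * n / 3   # outer thirds only (b3)
      z <- ifelse(outer_third, (ys - y_med) / a, NA_real_)
      z_med <- median(z, na.rm = TRUE)            # median of the z's (b4)

      # (b5): special attention when both criteria hold
      special <- !is.na(z) & abs(ys - y_med) >= A * z_med & z >= B * z_med

      # (b5*): anything more extreme than a selected value is also selected
      low  <- which(special & i <= n / 3)
      high <- which(special & i > 2 * n / 3)
      if (length(low))  special[1:max(low)]  <- TRUE
      if (length(high)) special[min(high):n] <- TRUE

      data.frame(i = i, idx = ord, y = ys, z = z, special = special,
                 y_med = y_med, z_med = z_med)
    }

    # Made-up data: a dozen well-behaved values plus two obvious outliers
    set.seed(42)
    obs <- c(rnorm(12, mean = 30, sd = 5), 250, 454)
    subset(funop(obs), special)                   # rows flagged for attention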

FUNOR-FUNOM

One common reason for identifying outliers is to do something about them, often by trimming or Winsorizing the dataset. The former simply removes an equal number of values from upper and lower ends of a sorted dataset. Winsorizing is similar but doesn’t remove values. Instead, it replaces them with the closest original value not affected by the process.
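For example, treating a single value at each end of a small, sorted dataset (a hypothetical k = 1), the two approaches differ like this:

    v <- sort(c(2, 5, 7, 8, 9, 11, 12, 40))
    k <- 1   # number of values to treat at each end (hypothetical choice)

    # Trimming drops k values from each end of the sorted data...
    trimmed <- v[(k + 1):(length(v) - k)]

    # ...while Winsorizing keeps the count the same, replacing each extreme
    # with the nearest value left untouched.
    winsorized <- c(rep(v[k + 1], k), trimmed, rep(v[length(v) - k], k))

    trimmed      # 5 7 8 9 11 12
    winsorized   # 5 5 7 8 9 11 12 12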

Tukey’s FUNOR-FUNOM (FUll NOrmal Rejection-FUll NOrmal Modification) offers an alternate approach. The procedure’s name reflects its purpose: FUNOR-FUNOM uses FUNOP to identify outliers, and then uses separate rejection and modification procedures to treat them.

The technique offers a number of innovations. First, unlike trimming and Winsorizing, which affect all the values at the top and bottom ends of a sorted dataset, FUNOR-FUNOM uses FUNOP to identify individual outliers to treat. Second, FUNOR-FUNOM leverages statistical properties of the dataset to determine individual modifications for those outliers.

FUNOR-FUNOM is specifically designed to operate on two-way (or contingency) tables. Similar to other techniques that operate on contingency tables, it uses the table’s grand mean (x..) and the row and column means (xⱼ. and x.ₖ, respectively) to calculate expected values for entries in the table.

The equation below shows how these effects are combined. Because it’s unlikely for expected values to match the table’s actual values exactly, the equation includes a residual term (yⱼₖ) to account for any deviation:

xⱼₖ = x.. + (xⱼ. - x..) + (x.ₖ - x..) + yⱼₖ

FUNOR-FUNOM is primarily interested in the values that deviate most from their expected values, the ones with the largest residuals. So, to calculate residuals, simply swap the above equation around:

yⱼₖ = xⱼₖ - xⱼ. - x.ₖ + x..
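In R, assuming this standard additive decomposition, the residuals take only a few lines (two_way_residuals() is my own helper name):

    # Residuals of a two-way table x: y_jk = x_jk - x_j. - x_.k + x..
    two_way_residuals <- function(x) {
      grand <- mean(x)
      row_eff <- rowMeans(x) - grand   # x_j. - x..
      col_eff <- colMeans(x) - grand   # x_.k - x..
      x - grand - outer(row_eff, col_eff, `+`)
    }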

FUNOR-FUNOM starts by repeatedly applying FUNOP, looking for outlier residuals. When it finds them, it modifies the outlier with the greatest deviation by applying the following modification:

Δxⱼₖ = zⱼₖ · aⱼₖ · (nᵣ nₖ) / ((nᵣ - 1)(nₖ - 1))

where zⱼₖ is the slope that FUNOP assigns to the residual yⱼₖ, aⱼₖ is the corresponding typical value, and nᵣ and nₖ are the numbers of rows and columns in the table.

Recalling the definition of slope (from FUNOP), zⱼₖ = (yⱼₖ - y̍)/aⱼₖ, the first portion of the Δxⱼₖ equation reduces to just yⱼₖ - y̍, the difference of the residual from the median. The second portion of the equation is a factor, based solely upon table size, meant to compensate for the effect of an outlier on the table’s grand, row, and column means.

When Δxⱼₖ is applied to the original value, the yⱼₖ terms cancel out, effectively setting the outlier to its expected value (based upon the combined effects of the contingency table) plus a factor of the median residual (~ xⱼ. + x.ₖ - x.. + y̍).

FUNOR-FUNOM repeats this same process until it no longer finds values that “deserve special attention.”

In the final phase, the FUNOM phase, the procedure uses a lower threshold of interest — FUNOP with a lower A — to identify a final set of outliers for treatment. The adjustment becomes

Δxⱼₖ = (zⱼₖ - B · z̍) · aⱼₖ

There are a couple of changes here. First, the inclusion of (–B · z̍) effectively sets the residual of outliers to FUNOP’s threshold of interest, much like the way that Winsorizing sets affected values to the same cut-off threshold. FUNOM, though, sets only the residual of affected values to that threshold: The greater part of the value is determined by the combined effects of the grand, row, and column means.

Second, because we’ve already taken care of the largest outliers (whose adjustment would have a more significant effect on the table’s means), we no longer need the compensating factor.

The chart below shows the result of applying FUNOR-FUNOM to the data in Table 2 of Tukey’s paper.

The black dots represent the original values affected by the procedure. The color of their treated values is based upon whether they were determined by the FUNOR or FUNOM portion of the procedure. The grey dots represent values unaffected by the procedure.

FUNOR handles the largest adjustments, which Tukey accomplishes by setting Aᵣ = 10 and Bᵣ = 1.5 for that portion of the process, and FUNOM handles the finer adjustments by setting Aₘ = 0 and Bₘ = 1.5.

Again, because the procedure leverages the statistical properties of the data, each of the resulting adjustments is unique.

Here is the code:
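The gist isn’t reproduced here, so below is a rough sketch of the procedure as described above, reusing funop() and two_way_residuals(). The adjustment formulas follow the discussion in the text; the GitHub repository has the complete implementation, and names such as funor_funom() are my own:

    # A sketch of FUNOR-FUNOM for a two-way table x (a matrix).
    funor_funom <- function(x, A_r = 10, B_r = 1.5, A_m = 0, B_m = 1.5) {
      nr <- nrow(x); nc <- ncol(x)
      size_factor <- (nr * nc) / ((nr - 1) * (nc - 1))  # compensates for the
                                                        # outlier's pull on the means
      # FUNOR phase: repeatedly treat the single worst outlier residual
      for (pass in seq_len(nr * nc)) {
        res  <- two_way_residuals(x)
        hits <- subset(funop(as.vector(res), A = A_r, B = B_r), special)
        if (nrow(hits) == 0) break
        worst <- hits[which.max(abs(hits$y - hits$y_med)), ]
        delta <- (worst$y - worst$y_med) * size_factor
        x[worst$idx] <- x[worst$idx] - delta            # matrices index linearly
      }

      # FUNOM phase: one pass with a lower threshold and no compensating
      # factor, pulling each flagged residual back to the threshold of interest
      res  <- two_way_residuals(x)
      hits <- subset(funop(as.vector(res), A = A_m, B = B_m), special)
      if (nrow(hits) > 0) {
        a_hit <- typical_values(nr * nc)[hits$i]
        delta <- (hits$z - B_m * hits$z_med) * a_hit
        x[hits$idx] <- x[hits$idx] - delta
      }
      x
    }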

“Foolish not to use it!”

After describing FUNOR-FUNOM, Tukey asserts that it serves a real need — one not previously addressed — and he invites people to begin using it, to explore its properties, even to develop competitors. In the meantime, he says, people would “…be foolish not to use it” (p 32).

Throughout his paper, Tukey uses italics to emphasize important points. Here he’s echoing an earlier point about arguments against the adoption of new techniques. He’d had colleagues suggest that no new technique should be published — much less used — before its power function was given. Tukey recognized the irony, because much of applied statistics depended upon Student’s t. In his paper, he points out,

“Surely the suggested amount of knowledge is not enough for anyone to guarantee either

“(c1) that the chance of error, when the procedure is applied to real data, corresponds precisely to the nominal levels of significance or confidence, or

“(c2) that the procedure, when applied to real data, will be optimal in any one specific sense.

“BUT WE HAVE NEVER BEEN ABLE TO MAKE EITHER OF THESE STATEMENTS ABOUT Student’s t” (p 20).

This is Tukey’s only sentence in all caps. Clearly, he wanted to land the point.

And, clearly, FUNOR-FUNOM was not meant as an example of theoretically possible techniques. Tukey intended for it to be used.

Vacuum cleaner

FUNOR-FUNOM treats the outliers of a contingency table by identifying and minimizing outsized residuals, based upon the grand, row, and column means.

Tukey takes these concepts further with his vacuum cleaner, whose output is a set of residuals, which can be used to better understand sources of variance in the data and enable more informed analysis.

To isolate residuals, Tukey’s vacuum cleaner uses regression to break down the values from the contingency table into their constituent components (p 51):

The idea is very similar to the one based upon the grand, row, and column means. In fact, the first stage of the vacuum cleaner produces the same result as subtracting the combined effect of the means from the original values.

To do this, the vacuum cleaner needs to calculate regression coefficients for each row and column based upon the values in our table (yᵣₖ) and a carrier — or regressor — for both rows (aᵣ) and columns (bₖ). [Apologies for using “k” for columns, but Medium has its limitations.]

Below is the equation used to calculate regression coefficients for columns:

(Σᵣ aᵣ · yᵣₖ) / (Σᵣ aᵣ²)

Conveniently, the equation will give us the mean of a column when we set aᵣ ≡ 1:

(Σᵣ yᵣₖ) / nᵣ = y.ₖ

where nᵣ is the number of rows. Effectively, the equation iterates through every row (Σᵣ), summing up the individual values in the same column (k) and dividing by the number of rows, the same as calculating the mean (y.ₖ).

Note, however, that aᵣ is a vector. So to set aᵣ ≡ 1, we need our vector to satisfy this equation:

Σᵣ aᵣ² = 1

For a vector of length nᵣ we can simply assign every member the same value:

aᵣ = √(1/nᵣ)

Our initial regressors end up being two sets of vectors, one for rows and one for columns, containing √(1/nᵣ) for the rows and √(1/nₖ) for the columns.
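A quick sanity check of that normalization in R (unit_carrier() is just an illustrative helper):

    # A constant carrier of length n with elements sqrt(1/n) has a unit
    # sum of squares, which is what the normalization above requires.
    unit_carrier <- function(n) rep(sqrt(1 / n), n)

    a_r <- unit_carrier(6)   # e.g., a table with six rows
    sum(a_r^2)               # 1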

Finally, in the same way that the mean of all row means or the mean of all column means can be used to calculate the grand mean, either the row coefficients or column coefficients can be used to calculate a dual-regression (or “grand”) coefficient:

The reason for calculating all of these coefficients, rather than simply subtracting the grand, row, and column means from our table’s original values, is that Tukey’s vacuum cleaner reuses the coefficients from this stage in the procedure as regressors in the next. (To ensure aᵣ ≡ 1 and aₖ ≡ 1 for the next stage, we normalize both sets of new regressors.)

The second phase is the real innovation here. It takes an earlier idea of Tukey’s, one degree of freedom for non-additivity, and applies it separately to each row and column. This, Tukey tells us, “…extracts row-by-row regression upon ‘column mean minus grand mean’ and column-by-column regression on ‘row mean minus grand mean’” (p 53).

The result is a set of residuals, vacuum cleaned of systematic effects.

Here’s the code for the entire procedure:
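As with the earlier gists, the code itself isn’t embedded here; the snippet below is only a compact sketch of the two phases, working directly with “column mean minus grand mean” and “row mean minus grand mean” rather than with the normalized carriers and coefficients of the full procedure (again, the repository has the complete version):

    # A sketch of the vacuum cleaner. Phase 1 removes the grand, row, and
    # column effects; phase 2 applies one degree of freedom for non-additivity
    # row by row and column by column.
    vacuum <- function(x) {
      grand   <- mean(x)
      row_eff <- rowMeans(x) - grand   # "row mean minus grand mean"
      col_eff <- colMeans(x) - grand   # "column mean minus grand mean"

      # Phase 1: same result as subtracting the combined effect of the means
      res <- x - grand - outer(row_eff, col_eff, `+`)

      # Phase 2: row-by-row regression on the column deviations...
      row_coef <- as.vector(res %*% col_eff) / sum(col_eff^2)
      res <- res - outer(row_coef, col_eff)

      # ...and column-by-column regression on the row deviations
      col_coef <- as.vector(t(res) %*% row_eff) / sum(row_eff^2)
      res <- res - outer(row_eff, col_coef)

      res   # residuals, vacuum cleaned of systematic effects
    }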

Takeaways

When I started this exercise, I honestly expected it to be something of an archaeological endeavor: I thought that I’d be digging through an artifact from 1962. Instead, I discovered some surprisingly innovative techniques.

However, none of the three procedures has survived in its original form. Not even Tukey mentions them in Exploratory Data Analysis, which he published 15 years later. That said, the book’s chapters on two-way tables contain the obvious successors to both FUNOR-FUNOM and the vacuum cleaner.

Perhaps one reason they’ve faded from use is that the base procedure, FUNOP, requires the definition of two parameters, A and B. Tukey himself recognized “the choice of Bₐ is going to be a matter of judgement” (p 47). When I tried applying FUNOR-FUNOM to other data sets, it was clear that use of the technique requires tuning.

Another possibility is that these procedures have a blind spot, which the paper itself demonstrates. One of Tukey’s goals was to avoid “…errors as serious as ascribing the wrong sign to the resulting correlation or regression coefficients” (p 58). So it’s perhaps ironic that one of the values in Tukey’s example table of coefficients (Table 8, p 54) has an inverted sign.

I tested each of Tukey’s procedures, and none of them would have caught the typo: Both the error (-0.100) and the corrected value (0.100) are too close to the relevant medians and means to be noticeable. I found it only because the printed row and column means did not match the ones that I calculated.

The flaw isn’t fatal. And, ultimately, the utility of these procedures is beside the point. My real goal with this article is simply to encourage people to read Tukey’s paper and to make that task a little easier by providing the intuitive explanations that I myself had wanted.

To be clear, no one should mistake my explanations nor my implementations of Tukey’s techniques as a substitute for reading his paper. “The Future of Data Analysis” contains much more than I’ve covered here, and many of Tukey’s ideas remain just as fresh — and just as relevant — today, including his famous maxim: “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question” (pp 14–15).
