
On Average, You’re Using the Wrong Average – Part II

Real & Simulated Data + Summary Statistic Dynamics using R & BigQuery


Depiction of harmonic, geometric & arithmetic means of a simple geometric series, from Part I.

Preface

Part I of this series built a practical & conceptual framework for understanding & using lesser known Pythagorean Means in data analysis. I went to lengths (~5k words) to build intuitions for a general audience, requiring little more than primary school maths.

This post will get a bit deeper & more technical, exploring the dynamics of these means under richer simulated datasets derived from several probability distributions. We’ll then juxtapose these with some comparable ‘real world’ big datasets. I’ll also aim for more concision & economy, assuming some comfort with somewhat higher-level maths & probability theory.


Pythagorean Mean recap

Recall that there are 3 Pythagorean means, which conform to the inequality:

harmonic mean ≤ geometric mean ≤ arithmetic mean

They are only mutually equivalent in the limiting case where every value in the dataset is identical.

  • The arithmetic mean is produced via simple addition & division.
  • The geometric mean is produced via multiplication & roots.
  • The harmonic mean is produced via reciprocals, addition & division.

Their equations are:

$$\text{AM} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \text{GM} = \left(\prod_{i=1}^{n} x_i\right)^{1/n} \qquad \text{HM} = \frac{n}{\sum_{i=1}^{n} \tfrac{1}{x_i}}$$

(formulas via Wikipedia)

As such, each can be derived or expressed as reconfigurations of each other. For example:

  • The geometric mean is simply the antilog of the arithmetic mean of the log-transformed values of the dataset. It can also sometimes preserve the ordering of arithmetic means of ratios scaled to a common denominator, or at least produce similarly credible summary values.
  • The harmonic mean is just the reciprocal of the arithmetic mean of the reciprocals of the dataset. It can also be reproduced via an appropriately weighted arithmetic mean of the dataset.
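
To make those identities concrete, here’s a quick numerical check in R on a simple geometric series (my own toy example, not from the original post):

x <- c(1, 2, 4, 8, 16)         # a simple geometric series

mean(x)                         # arithmetic mean: 6.2
prod(x)^(1 / length(x))         # geometric mean: 4
exp(mean(log(x)))               # same geometric mean, via the antilog of the mean of the logs
1 / mean(1 / x)                 # harmonic mean: ~2.58, the reciprocal of the mean of reciprocals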

Rules of thumb:

  • The arithmetic mean is best suited to additive, linear, ‘symmetrical’, ‘normal’ / Gaussian datasets.
  • The geometric mean is best suited to multiplicative, ‘geometric‘, exponential, lognormal, skewed datasets, as well as ratios on different scales & compound growth rates.
  • The harmonic mean is the rarest of the three, but is well suited to averaging ratios in terms of the numerator, such as rates of travel, some financial indexes & various specialized applications from physics to sabermetrics & the evaluation of machine learning models.

Limitations:

  • Due to their relative rarity, geometric & harmonic means can be difficult to interpret & misleading to audiences expecting more conventional meanings of ‘average’
  • The geometric mean is practically ‘unitless’, as scale & interpretable units are lost in the multiplicative operations (with factors on different scales)
  • Geometric & harmonic means cannot accommodate inputs ≤ 0

See Part I for a more detailed discussion. We’ll now see some of this in action.


Simulated Datasets

In Part I we observed the dynamics of Pythagorean means under trivial additive & multiplicative datasets. Here we’ll extend this exercise by simulating richer datasets from various probability distributions in R.

For our additive or linear dataset, we’ll draw 10,000 samples from a random normal distribution, with a mean of 100 & standard deviation of 20 so that draws ≤ 0 are vanishingly unlikely:

hist( 
  rnorm( 10000, 100, 20 ) 
  )

Next we’ll simulate three types of multiplicative datasets (which are often difficult to distinguish, despite meaningful differences): a lognormal, exponential & power law distribution.

There are many ways to produce a lognormal distribution – the product of many independent, positive random variables tends toward one (a multiplicative analogue of the central limit theorem) – which is why it appears so often in the real world, particularly among human endeavors. For simplicity & interpretability, we’ll simply raise Euler’s number to a random exponent drawn from a normal distribution, then add 100 to put it in the range of our normal distribution:

hist(
  exp(1)^rnorm(10000,3,.9) + 100,
  breaks = 39
 )

Strictly speaking this gives us a lognormal rather than an exponential distribution, but we’ll simulate a true exponential variety via R’s rexp function, where we simply specify the number of samples & the rate of decay (again, adding 100 to the result):

hist(
  rexp(10000, 1/10) +100
 )

Finally, we’ll produce a power law distribution by raising samples from a normal distribution to a constant exponent (Euler’s number again, keeping it natural), then adding 100:

(note that this is the inverse of our lognormal method, where Euler’s number was the base & the distribution was the exponent)

hist(
  rnorm(10000, 3, 1)^exp(1) + 100
 )

Alright, now we’ll use the ggridges package to better plot our distributions. We’ll also load the tidyverse package, like civilized useRs:

library(tidyverse)
library(ggridges)

Let’s put our distributions into a tidy tibble with ordered factors:
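
The original gist isn’t reproduced in this copy, so here’s a minimal sketch of that step; the dist1–dist4 & dists names are my own, chosen so that dist2$x matches the lognormal references further down:

# Sketch of the tidying step; object names are my assumptions, not the article's
dist1 <- tibble(x = rnorm(10000, 100, 20),            distribution = "normal")
dist2 <- tibble(x = exp(1)^rnorm(10000, 3, .9) + 100, distribution = "lognormal")
dist3 <- tibble(x = rexp(10000, 1/10) + 100,          distribution = "exponential")
dist4 <- tibble(x = rnorm(10000, 3, 1)^exp(1) + 100,  distribution = "power law")

dists <- bind_rows(dist1, dist2, dist3, dist4) %>%
  # negative normal draws raised to a non-integer power come back as NaN,
  # so keep only finite values
  filter(is.finite(x)) %>%
  mutate(distribution = factor(distribution,
                               levels = c("normal", "lognormal",
                                          "exponential", "power law")))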

Now let’s take a more artful look at our distributions:
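
A minimal ridgeline sketch, assuming the dists tibble above (p_sim is my own name, saved for the patchwork comparison later):

# Ridgeline densities of the four simulated distributions
p_sim <- ggplot(dists, aes(x = x, y = fct_rev(distribution), fill = distribution)) +
  geom_density_ridges(show.legend = FALSE, alpha = .8) +
  labs(x = NULL, y = NULL)
p_sim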

A shapely lot. Let’s compute some summary statistics.

Since R has no native geometric or harmonic mean functions, we’ll have to be a bit crafty:
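
A sketch of what that could look like (not the original gist): exp(mean(log(x))) gives the geometric mean & 1/mean(1/x) the harmonic mean:

# Geometric mean via the antilog of the mean of logs;
# harmonic mean via the reciprocal of the mean of reciprocals
summary_stats <- dists %>%
  group_by(distribution) %>%
  summarise(median = median(x),
            am     = mean(x),
            gm     = exp(mean(log(x))),
            hm     = 1 / mean(1 / x))
summary_stats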

output:

# A tibble: 4 x 5
  distribution median    am    gm    hm
  <fct>         <dbl> <dbl> <dbl> <dbl>
1 normal         99.6  99.9  97.7  95.4
2 lognormal     120   129   127   125  
3 exponential   107   110   110   109  
4 power law     120   125   124   122

…and add them to our plots:
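
The article’s plot overlays the statistics on the ridgelines; here’s a simpler faceted sketch of the same idea, assuming the dists & summary_stats objects above:

# Reshape the summary stats long & drop a vertical line per statistic
stat_lines <- summary_stats %>%
  pivot_longer(median:hm, names_to = "stat", values_to = "value")

ggplot(dists, aes(x = x)) +
  geom_density(fill = "grey85") +
  geom_vline(data = stat_lines, aes(xintercept = value, colour = stat)) +
  facet_wrap(~ distribution, ncol = 1)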

We immediately notice the impact of skewed densities as well as long & fat tails on the spread of the means and their relationship to the median:

  • In our normal distribution, being mostly symmetrical, the median & arithmetic mean are nearly identical. (99.6 & 99.9 respectively, per the output table above).
  • The rightward skew of our other distributions pulls all the means to the right of the median, which tends toward the denser humps of the dataset.

It’s all a bit crowded here though, so let’s take a closer look by adjusting the x-axis limits, aka xlim():

...xlim(90, 150)...

  • Again notice that in the normal, symmetrical dataset, geometric & harmonic means meaningfully understate the ‘midpoint’ of the data, but are roughly equally spaced (each about 2 units apart).
  • In the lognormal distribution, the long (moderately thin) tail pulls the means far from the ordinal midpoint represented by the median, & even skews the distribution of means so that the arithmetic is farther from the geometric than the harmonic is.
  • In our exponential distribution, values are so densely packed & the short thin tail of exponential decay vanishes so quickly that our means also crowd close together – though the heavy skew still displaces them from the median.
  • Our power law distribution has the slowest decay, & thus the fattest tail. Still, it is nearly normal in the ‘body’, with the mildest skew of the asymmetrical distributions. Its means are roughly equidistant, but still removed from the median.

In Part I & above I mentioned the logarithmic relationship between the geometric & arithmetic means:

The geometric mean is simply the antilog of the arithmetic mean of the log-transformed values of the dataset.

To demonstrate this, let’s take another look at our summary statistic table:

# A tibble: 4 x 5
  distribution median    am    gm    hm
  <fct>         <dbl> <dbl> <dbl> <dbl>
1 normal         99.6  99.9  97.7  95.4
2 lognormal     120   129   127   125  
3 exponential   107   110   110   109  
4 power law     120   125   124   122

Note our lognormal geometric mean: 127.

Now we’ll compute the arithmetic mean of the log-transformed values:

dist2$x %>% log() %>% mean()
# output:
[1] 4.84606

…& then take the anti-log of that number:

exp(1)^4.84606
# output
[1] 127.2381

Et voilà.

Now, to drive home the obvious, let’s see why this is so (& why the lognormal distribution earns its name):

# subtract the 100 we originally added, then take the log
(dist2$x - 100) %>% log() %>% hist()

True to its name, a log-transformation of the lognormal distribution yields a normal distribution. Thus, the additive basis of the arithmetic mean yields the same result in a normal distribution that the multiplicative nature of the geometric mean yields in a lognormal distribution.

We shouldn’t be too impressed with the impeccable normal distribution resulting from the log-transformed lognormal dataset, since we specified the exact data generating process which produced the lognormal values, & literally just reversed that process to reproduce the underlying normal distribution.

Things are seldom so tidy in the real world, where generating processes are typically more complex & unknown or unknowable. Thus the confusion & controversy over how to model & describe empirically derived datasets.

Let’s take a look at some such datasets & see what the fuss is about.


Real World Data

While usually less docile than our simulated distributions, real world datasets often resemble at least one of the four classes above.

Normal distributions – the ballyhooed ‘bell curve’ – arise most often in natural & biological measurements & in scenarios with few interacting factors. Height & weight are canonical examples. As such, my first instinct is to reach for the trusty [iris](https://rpubs.com/moeransm/intro-iris) dataset. It meets the requirement, but there’s just something underwhelming about n = 50 (the number of observations of a single species of flower in the dataset). I’m thinking bigger.

So let’s load the [bigrquery](https://github.com/r-dbi/bigrquery) package, & do bigRqueries.

library(bigrquery)

Google’s BigQuery makes available numerous public datasets of real data, some quite large, from genomics to patents to wikipedia article stats.

For our initial purposes, the natality dataset appears sufficiently biological:

project <- "YOUR-PROJECT-ID"
sql <- "SELECT COUNT(*) 
        FROM `bigquery-public-data.samples.natality`"
query_exec(sql, project = project, use_legacy_sql = F)

(protip: there’s lots of data here, & you get charged by the amount of data accessed, but your first TB per month is free. Further, while SELECT * is highly discouraged for obvious reasons, SELECT COUNT(*) is actually a free operation, & a good idea to scope things out)

Output:

0 bytes processed
  f0_
1 137826763

That’s more like it, 137 million baby records. We don’t quite need all of those, so let’s randomly sample 1% of the baby weights, take the first million results & see what we get:

sql <- "SELECT weight_pounds
        FROM `bigquery-public-data.samples.natality`
        WHERE RAND() < .01"
natal <- query_exec(sql, project = project, use_legacy_sql = F,
                    max_pages = 100)

Output:

Running job /: 7s:1.0 gigabytes processed
Warning message: 
Only first 100 pages of size 10000 retrieved. Use max_pages = Inf to retrieve all.

hist(natal$weight_pounds) yields:

Normal af imo.

Now to find some multiplicative data with some skew, let’s move beyond the biological, to the sociological.

We’ll look at the New York dataset, which contains all sorts of urban info, including trips in yellow & green cabs.

sql <- "SELECT COUNT(*) 
        FROM `nyc-tlc.green.trips_2015`"
query_exec(sql, project = project, use_legacy_sql = F)

Output:

0 bytes processed
  f0_                                                               
1 9896012

Under 10 million, so let’s grab trip distances for the whole thing:

(this can take a while)

sql <- "SELECT trip_distance FROM `nyc-tlc.green.trips_2015`"
trips <- query_exec(sql, project = project, use_legacy_sql = F)
hist(trips$trip_distance)

-_-

Looks like there are some extreme outliers pulling our x-axis out to 800 miles. Hell of a cab ride. Let’s trim this down to trips under 20 miles & take another look:

trips$trip_distance %>% subset(. <= 20) %>% hist()

There we are, & with a telltale sign of a lognormal distribution: the long tail. Let’s check our data for lognormalcy by plotting the histogram of the logs:

trips$trip_distance %>% subset(. <= 20) %>% log() %>% hist()

Normalish for sure, but we overshot the mark a bit & now observe a bit of left-skew. Alas, the real world strikes again. But safe to say that a lognormal distribution wouldn’t be a completely preposterous model to employ here.

Moving on. Let’s find some more heavy-tailed data, this time in a domain closer to my own professional affinities of digital web analytics: the Github dataset:

sql <- "SELECT COUNT(*)
        FROM `bigquery-public-data.samples.github_nested`"
query_exec(sql, project = project, use_legacy_sql = F)

Output:

0 bytes processed
  f0_
1 2541639

2.5 million rows. Since I’m starting to worry about my local machine’s memory, we’ll employ our random sampler & million-row cap to get a count of "watchers" of each github repository:

sql <- "SELECT repository.watchers 
        FROM `bigquery-public-data.samples.github_nested`
        WHERE RAND() < .5"
github <- query_exec(sql, project = project, use_legacy_sql = F,
                     max_pages = 100)
github$watchers %>% hist()

Again, extreme long tail dynamics, so let’s do some trimming:

github$watchers %>% subset(5 < . & . < 3000) %>% hist()

Exponential af.

But is it lognormal again?

github$watchers %>% subset(5 < . & . < 3000) %>% log() %>% hist()

Nay.

But behold the rare beast: the (approximately) LOGUNIFORM DISTRIBUTION!

Let’s take one more draw from the big data urn, this time looking at scores of Hacker News posts:

sql <- "SELECT COUNT(*)
        FROM `bigquery-public-data.hacker_news.full`"
query_exec(sql, project = project, use_legacy_sql = F)

Output:

0 bytes processed
  f0_
1 16489224

sql <- "SELECT score
        FROM `bigquery-public-data.hacker_news.full`
        WHERE RAND() < .1"
hn <- query_exec(sql, project = project, use_legacy_sql = F,
                 max_pages = 100)
hn$score %>% hist()

hn$score %>% subset(10 < . & . <= 300) %>% hist()

A slower decay (once trimmed). And as for the log-transform?

hn$score %>% subset(10 < . & . <= 300) %>% log() %>% hist()

Again roughly loguniform with a rightward decay.

My cursory search for a power law distribution in the wild proved fruitless, which is perhaps not surprising as they are most often posited in network science (&, even there, appear to be scarcer than initially claimed).

In any event, let’s organize our real world datasets as we did with our simulated distributions & plot them comparatively. We’ll normalize them & again center them around 100:
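
The tidying gist isn’t reproduced here; below is a hedged sketch of the step described, using my own dataset labels & the trimming thresholds from above:

# Raw real-world data in one tidy tibble (labels & the real_raw / real names
# are my own assumptions, not the article's)
real_raw <- bind_rows(
  tibble(distribution = "birth weights",      x = natal$weight_pounds),
  tibble(distribution = "cab trips",          x = subset(trips$trip_distance, trips$trip_distance <= 20)),
  tibble(distribution = "github watchers",    x = subset(github$watchers, github$watchers > 5 & github$watchers < 3000)),
  tibble(distribution = "hacker news scores", x = subset(hn$score, hn$score > 10 & hn$score <= 300))
) %>%
  drop_na() %>%
  mutate(distribution = factor(distribution,
                               levels = c("birth weights", "cab trips",
                                          "github watchers", "hacker news scores")))

# Normalized copy: scale each dataset, then re-center around 100
# (the spread multiplier of 20 echoes the simulated normal's sd & is my choice)
real <- real_raw %>%
  group_by(distribution) %>%
  mutate(x = as.numeric(scale(x)) * 20 + 100) %>%
  ungroup()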

Now to plot:
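
Same ridgeline treatment as before, assuming the real tibble above (p_real is my own name):

# Ridgeline densities of the normalized real-world datasets
p_real <- ggplot(real, aes(x = x, y = fct_rev(distribution), fill = distribution)) +
  geom_density_ridges(show.legend = FALSE, alpha = .8) +
  labs(x = NULL, y = NULL)
p_real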

True to the real world they come from, a bit rougher around the edges than our simulated distributions. Still, remarkably similar. Let’s plot them side by side with the new (& excellent) patchwork package from the indispensable Thomas Lin Pedersen:
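
A minimal sketch of the side-by-side layout, assuming the p_sim & p_real plots saved above (my own names):

library(patchwork)

# patchwork overloads `+` to place ggplots side by side
p_sim + p_real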

At a glance, our simulated distributions would provide reasonable models for their adjacent real world datasets, with the exception of `power law -> hacker news scores`, where the concavity of the decay & the weight of the tails are substantially different.

There are of course many more rigorous tests of model fit etc, but let’s just take another look by plotting the summary statistics of the distributions as we did before.

Unfortunately, something about the normalized real world data is screwing with my summary statistic calculations, which are all coming back more or less indistinguishable from each other. I suspect this has something to do with computers’ struggles with floating point arithmetic (but might just have to do with my own struggles with numeracy). In any event, we’ll have to use our unnormalized real world data, which means we’ll have to plot them individually & attempt to line up the ggridge patchwork manually.

First to compile our unnormalized distributions & compute summary statistics:
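
The compilation gist isn’t shown in this copy either; a hedged sketch, reusing the real_raw tibble from above (zeros are dropped so the log-based geometric mean stays defined):

# Summary statistics on the raw (unnormalized) real data; same formulas as before
real_stats <- real_raw %>%
  filter(x > 0) %>%   # GM & HM can't accommodate inputs <= 0
  group_by(distribution) %>%
  summarise(median = median(x),
            am     = mean(x),
            gm     = exp(mean(log(x))),
            hm     = 1 / mean(1 / x))
real_stats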

Now to plot (this is ugly since we have to create individual plots for each real distribution, but patchwork allows us to elegantly define the layout):
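
A hedged sketch of one way to build & stack the individual plots; the helper & object names below are mine, not the article’s, with patchwork handling the layout:

# One density plot per real dataset, with a vertical line for each statistic
plot_real <- function(name) {
  stats <- real_stats %>%
    filter(distribution == name) %>%
    pivot_longer(median:hm, names_to = "stat", values_to = "value")
  ggplot(filter(real_raw, distribution == name), aes(x = x)) +
    geom_density(fill = "grey85") +
    geom_vline(data = stats, aes(xintercept = value, colour = stat)) +
    ggtitle(name)
    # an xlim(...) line can be added per plot for the zoomed views described below
}

real_plots <- lapply(levels(real_raw$distribution), plot_real)
wrap_plots(real_plots, ncol = 1)   # patchwork stacks the plots in a single column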

Interesting that our real world dataset summary stats seem to have substantially more spread than our simulated distributions. Let’s zoom in for a closer look.

I won’t repost all the code, but basically just modified pm1 to have xlim(90, 150) & uncommented the xlim() lines in the rest of the plots:

The contrast is even starker here: except for the normalish distributions, summary statistic dynamics diverge widely, which should give us pause when facilely ‘eyeballing’ the fit of an idealized model to real world data.

This concludes our exploration of Pythagorean Means under simulated & real world distributions.

If you haven’t already, check out Part I for a lower level, more explicit & intuitive introduction. And see below for references & further reading.

Also, follow me for more like this!

— Follow on twitter: @dnlmc LinkedIn: linkedin.com/in/dnlmc Github: https://github.com/dnlmc


Select References & Further Reading

Part I

Part II

