
Unraveling the Law of Large Numbers

The LLN is interesting as much for what it does not say as for what it does

Towards Data Science · 15 min read · Jul 12, 2023

On August 24, 1966, a talented playwright by the name of Tom Stoppard staged a play in Edinburgh, Scotland. The play had a curious title, “Rosencrantz and Guildenstern Are Dead.” Its central characters, Rosencrantz and Guildenstern, are childhood friends of Hamlet (of Shakespearean fame). The play opens with Guildenstern repeatedly tossing coins which keep coming up Heads. Each outcome makes Guildenstern’s money-bag lighter and Rosencrantz’s, heavier. As the drumbeat of Heads continues with a pitiless persistence, Guildenstern grows worried. He wonders whether he is secretly willing each coin to come up Heads as a self-inflicted punishment for some long-forgotten sin. Or whether time stopped after the first flip, and he and Rosencrantz are experiencing the same outcome over and over again.

Stoppard does a brilliant job of showing how the laws of probability are woven into our view of the world, into our sense of expectation, into the very fabric of human thought. When the 92nd flip also comes up as Heads, Guildenstern asks if he and Rosencrantz are within the control of an unnatural reality where the laws of probability no longer operate.

Guildenstern’s fears are of course unfounded. Granted, the likelihood of getting 92 Heads in a row is unimaginably small. In fact, it is a decimal point followed by 27 zeroes followed by a 2. Guildenstern is more likely to be hit on the head by a meteorite.
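If you’d like to check that figure for yourself, a quick back-of-the-envelope computation in Python (the language I’ll use for all the sketches in this article) does the job:

```python
# Probability of a fair coin coming up Heads 92 times in a row
p = 0.5 ** 92
print(p)            # ~2.02e-28
print(f"{p:.30f}")  # 0.000000000000000000000000000202
```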

Guildenstern only has to come back the next day and flip another sequence of 92 coin tosses, and the result will almost certainly be vastly different. If he were to follow this routine every day, he’d discover that on most days the number of Heads more or less matches the number of Tails.

With 92 Heads in a row, what Guildenstern got was an exceptionally biased sample. And what he witnessed were two fascinating behaviors of our universe — convergence in probability, and the Law of Large Numbers.

The Law of Large Numbers explained in simple English

The LLN, as it is called, comes in two flavors: the weak and the strong. The weak LLN can be more intuitive and easier to relate to. But it is also easy to misinterpret. I’ll cover the weak version in this article and leave the discussion on the strong version for a later article.

The weak Law of Large Numbers concerns itself with the relationship between the sample mean and the population mean. I’ll explain what it says in plain text:

Suppose you draw a random sample of a certain size, say 100, from the population. Make a mental note of the term sample size. The size of the sample is the ringmaster, the grand pooh-bah of this law. Now calculate the mean of this sample and set it aside. Next, repeat this process many many times. What you’ll get is a set of imperfect sample means. The sample means are imperfect because there will always be an error between the sample means and the true population mean. Let’s assume you’ll tolerate a certain error. If you select a sample mean at random from this set of means, there is a chance that the absolute difference between the sample mean and the population mean will exceed your tolerance of error.

The weak Law of Large Numbers says that the probability of the error between the sample mean and the population mean falling within your selected level of tolerance will tend to a perfect 1.0, a certainty, as the sample size grows to either infinity or to the size of the population.

No matter how much you shrink the tolerance level, as you draw sets of samples of ever-increasing size, it’ll become increasingly unlikely that the mean of a randomly chosen sample will deviate from the population mean by more than this tolerance.
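To make this procedure concrete before we turn to real data, here is a minimal simulation sketch. The exponential ‘population’, the tolerance of 0.1, and the sample sizes are all arbitrary choices made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# A synthetic 'population' with a known mean
population = rng.exponential(scale=2.0, size=1_000_000)
mu = population.mean()

tolerance = 0.1       # the error we are willing to tolerate
num_samples = 1000    # how many sample means we compute per sample size

for n in (10, 100, 1000, 10000):
    # Draw 1000 random samples of size n and compute their means
    means = np.array([rng.choice(population, size=n).mean()
                      for _ in range(num_samples)])
    # Fraction of sample means that miss mu by at least the tolerance
    p_exceed = np.mean(np.abs(means - mu) >= tolerance)
    print(f"n={n:5d}  P(|X_bar - mu| >= {tolerance}) ~ {p_exceed:.3f}")
```

As the sample size grows, the printed probability shrinks, which is exactly the behavior the law describes.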

A real-world illustration of how the weak LLN works

To see how the weak LLN works we’ll run it through an example. And for that, allow me if you will to take you to the cold, brooding expanse of the Northeastern North Atlantic Ocean.

Every day, the Government of Ireland publishes a dataset of water temperature measurements taken from the surface of the North East North Atlantic. This dataset contains hundreds of thousands of measurements of surface water temperature indexed by latitude and longitude. For instance, the data for June 21, 2023 looked like this:

Dataset of water surface temperatures of the North East North Atlantic Ocean (CC BY 4.0)

It’s hard to imagine what eight hundred thousand surface temperature values look like. So let’s create a scatter plot to visualize this data. I’ve shown this plot below. The vacant white areas in the plot represent Ireland and the United Kingdom.

A color-coded scatter plot of sea surface temperatures of the Northeastern North Atlantic (Image by Author) (Data source: Dataset)

Now as a practitioner, you’ll never have access to the ‘population’. So you’ll be correct in severely chiding me if I declare this set of 800,000 temperature measurements as the ‘population’. But bear with me for a little while. You’ll soon see why it helps us to consider this data as the ‘population’.

So let’s assume that this data is…uh…the population. The average surface water temperature across the 810219 locations in this population of values is 17.25840 degrees Celsius. We’ll designate this value as the population mean, μ. Remember this value. You’ll need to refer to it often.

Now suppose this population of 810219 values is not accessible to you. Instead, all you have access to is a meager little sample of 20 random locations drawn from this population. Here’s one such random sample:

A random sample of size 20 (Image by Author)

The mean temperature of the sample is 16.9452414 degrees C. This is our sample mean X_bar which is computed as follows:

X_bar = (X1 + X2 + X3 + … + X20) / 20
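Here is a minimal sketch of these two steps in Python with pandas. The file name "sea_surface_temperatures.csv" and the column name "sea_surface_temperature" are placeholders, not the dataset’s actual names; substitute whatever the CSV you download from DATA.GOV.IE uses:

```python
import pandas as pd

# Load the sea surface temperature dataset. The file name and the
# temperature column name are placeholder assumptions; adjust them to
# match the CSV actually downloaded from DATA.GOV.IE.
df = pd.read_csv("sea_surface_temperatures.csv")
population = df["sea_surface_temperature"]

# The 'population' mean over all ~810219 locations
mu = population.mean()
print(f"Population mean mu = {mu:.5f} deg C")

# Draw one random sample of 20 locations and compute its mean X_bar
sample = population.sample(n=20)
x_bar = sample.mean()
print(f"Sample mean X_bar = {x_bar:.5f} deg C")
```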

You can just as easily draw a second, a third, indeed any number of such random samples of size 20 from the same population. Here are a few random samples for illustration:

Random samples of size 20 each drawn from the population (Image by Author)

A quick aside on what a random sample really is

Before moving ahead, let’s take a quick detour and get some perspective on what a random sample really is. And to get that perspective, we’ll look at a casino slot machine:

A casino slot machine (Image source: Pixabay)

The slot machine shown above contains three slots. Every time you crank down the arm of the machine, the machine fills each slot with a picture that the machine has selected randomly from an internally maintained population of pictures such as a list of fruit pictures. Now imagine a slot machine with 20 slots named X1 through X20. Assume that the machine is designed to select values from a population of 810219 temperature measurements. When you pull down the arm, each one of the 20 slots — X1 through X20 — fills with a randomly selected value from the population of 810219 values. Therefore, X1 through X20 are random variables that can each hold any value from the population. Taken together they form a random sample. Put another way, each element of a random sample is itself a random variable.

In the slot machine example, X1 through X20 have a few interesting properties:

  • The value that X1 acquires is independent of the values that X2 through X20 acquire. The same applies to X2, X3, …, X20. Thus X1 through X20 are independent random variables.
  • Because X1, X2, …, X20 can each hold any value from the population, the mean of each of them is the population mean, μ. Using the notation E() for expectation, we write this result as follows:
    E(X1) = E(X2) = … = E(X20) = μ.
  • X1 through X20 have identical probability distributions.

Thus, X1, X2,…,X20 are independent, identically distributed (i.i.d.) random variables. The mean of each one of these variables is the population mean μ. And for any particular observed sample, the mean of the observed values of X1, X2,…,X20 is the sample mean X_bar.

…and now we get back to showing how the weak LLN works

Let’s compute the mean (denoted by X_bar) of this 20-element sample and set it aside. If you crank down the machine’s arm again, out will pop another 20-element random sample. Go right ahead and compute its mean too and set it aside. If you repeat this process one thousand times, you will have computed one thousand sample means.

Here’s a table of 1000 sample means computed this way. We’ll designate them as X_bar_1 to X_bar_1000:

A table of 1000 sample means. Each mean is computed from a random sample of size 20

Now consider the following statement carefully:

Since the sample mean is calculated from a random sample, the sample mean is itself a random variable.

At this point, if you are sagely nodding your head and stroking your chin, it is very much the right thing to do. The realization that the sample mean is a random variable is one of the most penetrating realizations one can have in statistics.

Notice also how each sample mean in the table above is some distance away from the population mean, μ. Let’s plot a histogram of these sample means to see how they are distributed around μ:

A histogram of sample means (Image by Author)

Most of the sample means seem to lie close to the population mean of 17.25840 degrees Celsius. And there are some that are a considerable distance from μ. Suppose your tolerance for this distance is 0.25 degrees Celsius. If you were to plunge your hand into this bucket of 1000 sample means, grab whichever mean falls within your grasp and pull it out, what will be the probability that the mean you’ve pulled out is outside this tolerance threshold? To estimate this probability, you must count the number of sample means that are at least 0.25 degrees away from μ in either direction, and divide this number by 1000. In the table above, there are 422 such sample means. Thus, the probability of pulling out one such mean at random is:

P(|X_bar - μ| ≥ 0.25) = 422/1000 = 0.422
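Here’s a sketch of this entire experiment in code: draw 1000 random samples of size 20, compute their means, and estimate the probability that a randomly chosen mean misses μ by at least 0.25 degrees C. The file and column names are the same placeholders as in the earlier sketch:

```python
import numpy as np
import pandas as pd

# Same placeholder file and column names as before
population = pd.read_csv("sea_surface_temperatures.csv")["sea_surface_temperature"]
mu = population.mean()

num_samples = 1000
sample_size = 20
tolerance = 0.25  # degrees Celsius

# Draw 1000 random samples of size 20 and record each sample's mean
sample_means = np.array([population.sample(n=sample_size).mean()
                         for _ in range(num_samples)])

# Count the means that deviate from mu by at least the tolerance
num_exceeds_tolerance = np.sum(np.abs(sample_means - mu) >= tolerance)
p_exceed = num_exceeds_tolerance / num_samples
print(f"P(|X_bar - mu| >= {tolerance}) ~ {p_exceed:.3f}")
```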

Let’s park this probability for a minute.

Now repeat all of the above steps, but this time use a sample size of 100 instead of 20. So here’s what you will do: draw 1000 random samples each of size 100, take the mean of each sample, store away all those means, count the ones that are at least 0.25 degrees C away from μ in either direction, and divide this count by 1000. If that sounded like the labors of Hercules, you were not mistaken. So take a moment to catch your breath. And once you are all caught up, look below for what you have got as the prize for your labors.

The table below contains the means from the 1000 random samples, each of size 100:

A table of 1000 sample means. Each mean is computed from a random sample of size 100

Out of these one thousand means, fifty-six happen to deviate by at least 0.25 degrees C from μ. So this time, the probability that you’ll run into such a mean is only 56/1000 = 0.056. This probability is decidedly smaller than the 0.422 we computed earlier when the sample size was only 20.

If you repeat this experiment with different sample sizes that increase incrementally, you’ll get yourself a table full of probabilities. I’ve done this exercise for you by dialing up the sample size from 10 through 490 in steps of 10. Here’s the outcome:

A table of probabilities showing P(|X_bar - μ| ≥ 0.25) as the sample size is dialed up from 10 to 490 (Image by Author)

Each row in this table corresponds to 1000 different samples that I drew at random from the population of 810219 temperature measurements. The sample_size column mentions the size of each of these 1000 samples. Once drawn, I took the mean of each sample and counted the ones that were at least 0.25 degrees C apart from μ in either direction. The num_exceeds_tolerance column mentions this count. The probability column is the fraction num_exceeds_tolerance/1000.

Notice how this count attenuates rapidly as the sample size increases. And so does the corresponding probability P(|X_bar - μ| ≥ 0.25). By the time the sample size reaches 320, the probability has decayed practically to zero. It blips up to 0.001 occasionally, but that’s because I drew a finite number of samples. If I drew 10000 samples each time instead of 1000, not only would the occasional blips flatten out but the attenuation of probabilities would also become smoother.

The following graph plots P(|X_bar - μ| ≥ 0.25) against sample size. It puts in sharp relief how the probability plunges to zero as the sample size grows.

P(|X_bar - μ| ≥ 0.25) against sample size (Image by Author)
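Automating the sweep is straightforward. The sketch below loops over the sample sizes, estimates P(|X_bar - μ| ≥ 0.25) for each one, and plots the result; it reuses the same placeholder file and column names as the earlier sketches:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same placeholder file and column names as before
population = pd.read_csv("sea_surface_temperatures.csv")["sea_surface_temperature"]
mu = population.mean()

num_samples = 1000
tolerance = 0.25  # degrees Celsius

sample_sizes = list(range(10, 500, 10))   # 10, 20, ..., 490
probabilities = []

for n in sample_sizes:
    # 1000 sample means for this sample size
    means = np.array([population.sample(n=n).mean() for _ in range(num_samples)])
    num_exceeds_tolerance = np.sum(np.abs(means - mu) >= tolerance)
    probabilities.append(num_exceeds_tolerance / num_samples)

plt.plot(sample_sizes, probabilities)
plt.xlabel("Sample size")
plt.ylabel(f"P(|X_bar - mu| >= {tolerance})")
plt.title("The weak LLN in action")
plt.show()
```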

In place of 0.25 degrees C, what if you chose a different tolerance — either a lower or a higher value? Will the probability decay irrespective of your selected level of tolerance? The following family of plots illustrates the answer to this question.

The probability P(|X_bar - μ| ≥ ε) decays to zero as the sample size increases. This is seen for all values of ε (Image by Author)

No matter how frugal, how tiny, your choice of the tolerance ε, the probability P(|X_bar - μ| ≥ ε) will always converge to zero as the sample size grows. This is the weak Law of Large Numbers in action.

The weak Law of Large Numbers, stated formally

The behavior of the weak LLN can be formally stated as follows:

Suppose X1, X2, …, Xn are i.i.d. random variables that together form a random sample of size n. Suppose X_bar_n is the mean of this sample. Suppose also that E(X1) = E(X2) = … = E(Xn) = μ. Then for any positive real number ε, the probability of X_bar_n being at least ε away from μ tends to zero as the size of the sample tends to infinity. The following exquisite equation captures this behavior:

The weak Law of Large Numbers: P(|X_bar_n - μ| ≥ ε) → 0 as n → ∞ (Image by Author)

Over the three centuries of history of this law, mathematicians have been able to progressively relax the requirement that X1 through Xn be independent and identically distributed while still preserving the spirit of the law.

The principle of “convergence in probability”, the “plim” notation, and the art of saying really important things in really few words

The particular style of converging to some value using probability as the means of transport is called convergence in probability. In general, the principle of convergence in probability can be stated as follows:

Convergence in Probability: P(|X_n - X| ≥ ε) → 0 as n → ∞ (Image by Author)

In the above equation, X_n and X are random variables, and ε is a positive real number. The equation says that as n tends to infinity, X_n converges in probability to X. Notice that if you replace X_n with the sample mean X_bar_n and X with μ, you get the equation of the WLLN.

Throughout the immense expanse of statistics, you’ll keep running into a quietly unassuming notation called plim. It’s pronounced ‘p lim’, or ‘plim’ (like the word ‘plum’ but with an ‘i’), or probability limit. plim is a shorthand way of saying that a measure such as the mean converges in probability to a specific value. Using plim, the weak Law of Large Numbers can be stated pithily as follows:

The weak Law of Large Numbers expressed using very little ink: plim X_bar_n = μ as n → ∞ (Image by Author)

Or simply as:

(Image by Author)

The brevity of notation is not in the least surprising. Mathematicians are drawn to brevity like bees to nectar. When it comes to conveying profound truths, mathematics could well be the most ink-efficient field. And within this efficiency-obsessed field, plim occupies a podium position. You will struggle to unearth as profound a concept as plim expressed in a smaller amount of ink, or electrons.

But struggle no more. If the laconic beauty of plim left you wanting more, here’s another, possibly even more efficient notation that conveys the same meaning as plim:

The weak Law of Large Numbers expressed using even less ink (Image by Author)

Busting some myths about the weak LLN

At the top of this article, I mentioned that the weak Law of Large Numbers is noteworthy for what it does not say as much as for what it does say. Let me explain what I mean by that. The weak LLN is often misinterpreted to mean that as the sample size increases, its mean approaches the population mean or various generalizations of that idea. As we saw, such ideas about the weak LLN harbor no attachment to reality.

In fact, let’s bust a couple of myths about the weak LLN.

MYTH #1: As the sample size grows, the sample mean tends to the population mean.

This is quite possibly the most frequent misinterpretation of the weak LLN. The WLLN assuredly makes no such assertion. To see why, consider the following situation: you have managed to get your arms around a really large sample. While you gleefully admire your achievement, you should also ask yourself the following questions: Just because your sample is large, must it also be well-balanced? What’s preventing nature from sucker punching you with a giant sample that contains an equally giant amount of bias? The answer is absolutely nothing. In fact, isn’t that what happened to Guildenstern with his sequence of 92 Heads? His was, after all, a completely random sample. If the large sample just so happens to carry a large bias (not very likely, but not impossible), then despite the large sample size, that bias will blast the sample mean away to a point far from the true population value. Conversely, a small sample can prove to be exquisitely well-balanced (again, not very likely, but not impossible). The point is, as the sample size increases, the sample mean isn’t guaranteed to dutifully advance toward the population mean. Nature just doesn’t provide such an absolute guarantee. The guarantee nature does provide is that as the sample size increases, the probability of a sample mean lying within any selected error threshold of the population mean progressively increases. In other words, the WLLN.

MYTH #2: As the sample size increases, pretty much everything about the sample — its median, its variance, its standard deviation — converges to the population values of the same.

This sentence is two myths bundled into one easy-to-carry package. Firstly, the weak LLN postulates a convergence in probability, not in value. Secondly, the weak LLN applies to the convergence in probability of only the sample mean, not any other statistic. The weak LLN does not address the convergence of other measures such as the median, variance, or standard deviation.

How do we know that the weak Law of Large Numbers actually works?

It’s one thing to state the WLLN, and even demonstrate how it works using real-world data. But how can you be sure that it always works? Are there circumstances in which it will play spoilsport — situations in which the sample mean simply does not converge in probability to the population value? To know that, we must prove the WLLN and, in doing so, precisely define the conditions in which it will apply.

It so happens that the WLLN has a delectable proof that uses, as one of its ingredients, the endlessly tantalizing Chebyshev’s Inequality. If that whets your appetite, stay tuned for my next article on the proof of the weak Law of Large Numbers.

Revisiting Guildenstern

It would be impolite to take leave without assuaging our friend Guildenstern’s worries. Let’s develop an appreciation for just how unlikely the result he experienced really was. We’ll simulate the act of tossing 92 unbiased coins using a pseudo-random generator. Heads will be encoded as 1 and Tails as 0. We’ll record the mean of the 92 outcomes. The mean is the fraction of times that the coin came up Heads. For example, if the coin came up Heads 40 times and Tails the remaining 52 times, the mean is 40/92 = 0.43478. We’ll repeat this act of 92 coin tosses ten thousand times to obtain ten thousand means, and we’ll plot their frequency distribution. The X-axis of this distribution represents the mean value and the Y-axis is a count of how many times the value was observed. After completing this exercise, we get the following histogram:

A histogram of sample means of 10000 samples (Image by Author)

We see that most of the sample means are grouped around the population mean of 0.5. This is the result you are most likely to observe. Guildenstern’s result of getting 92 Heads in a row corresponds to a mean of 92/92 = 1.0. It’s an exceptionally unlikely outcome. In the plot, you can see that its frequency of occurrence across the 10000 experiments is essentially zero. But contrary to Guildenstern’s fears, there is nothing unnatural about this outcome, and the laws of probability continue to operate with their usual assertiveness. Guildenstern’s outcome of a perfect 1.0 is simply lurking inside the remote regions of the right tail of the plot, waiting with infinite patience to pounce upon some luckless coin-flipper whose only mistake was to be unimaginably unlucky.
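If you’d like to reproduce this experiment, here’s a minimal sketch of the simulation described above, using NumPy’s pseudo-random generator:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1966)

num_experiments = 10_000
num_tosses = 92

# Each row is one experiment of 92 fair coin tosses (Heads=1, Tails=0)
tosses = rng.integers(0, 2, size=(num_experiments, num_tosses))

# The mean of each row is the fraction of Heads in that experiment
means = tosses.mean(axis=1)

# Plot the frequency distribution of the 10000 sample means
plt.hist(means, bins=50)
plt.xlabel("Fraction of Heads in 92 tosses")
plt.ylabel("Frequency")
plt.title("Sample means of 10000 sequences of 92 fair coin tosses")
plt.show()

# Guildenstern's outcome: all 92 tosses coming up Heads (a mean of 1.0)
print("Experiments with all Heads:", np.sum(means == 1.0))
```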

References and Copyrights

Data set

The North East Atlantic Real Time Sea Surface Temperature data set downloaded from DATA.GOV.IE under CC BY 4.0

Images

All images in this article are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

Thanks for reading! If you liked this article, please follow me to receive tips, how-tos and programming advice on regression and time series analysis.
