Intuition behind Central Limit Theorem

Gaurav Dembla
Towards Data Science
Mar 8, 2021



The Central Limit Theorem (CLT) is one of the most fundamental concepts in the field of statistics. Without it, we would be wandering around the real world with more problems than solutions. Its applications are bountiful — from parameter estimation to hypothesis testing, from the pharmaceutical industry to eCommerce businesses. Hardly any empirical research in any industry gets by without leveraging the principles of CLT. The following quote from over a century ago sums up its importance.

“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the [Central Limit Theorem]. The law would have been personified by the Greeks and deified, if they had known of it.”

— Sir Francis Galton, 1889, Natural Inheritance.

Before we dive deep into the mysteriously elusive world of the Central Limit Theorem (CLT), we need to understand a few terms to comprehend and speak the underlying language. In this blog, we will kick off with an understanding of population vs. sample, parameter vs. statistic, the relevance of inferential statistics, and the sampling distribution. We will also extract a few samples out of a population and compare them, setting up the grounds for CLT. We will eventually conclude the article with a formal definition of CLT, its limitations and an example problem to see the principles of CLT in action.

Population vs. Sample

What is a population and how is it different from a sample?

Population refers to every member of a group/set of data, e.g., all the citizens of the USA, all the residents of a city, all the students of a school. The process of conducting a survey to collect data for the entire population is called a census.

Sample refers to a subset of members drawn from the population that time and resources allow one to measure. The process of conducting a survey to collect data for a sample is called a sample survey.

Sampling Process

How is sampling useful?

Let us consider a corporate chain, Cake N Bake, that has a noticeable presence in the capital region of India. The firm wants to launch a new logo to better connect with its target audience — adults up to the age of 30 years. The firm must choose the best among the three relevant designs that have been shortlisted by a panel of internal stakeholders. How does the firm go about it? It would be wise to collect feedback from the target audience and consider its opinion before determining which design would resonate best with the group.

However, should the firm call for a census on all the customers (existing or potential) to find out the most suitable design? Wouldn’t it turn out to be a prohibitively expensive exercise from both time and cost perspectives?

Here comes the magic of “sampling”. All the firm needs to do is select a limited set of members from the population and conduct the survey on this sample set. Later, using the results of the sample survey, the firm can draw (reasonable) conclusions about the preferences (characteristics) of the population and select the best logo design to move forward with. Keep in mind that the sample needs to be unbiased and representative of the population, and needs to be extracted randomly out of the relevant population. Only then will the sample serve as an appropriate proxy for the population to solve the problem at hand.

The sample needs to be unbiased and representative of the population, and needs to be extracted randomly out of the relevant population.

Parameter vs. Statistic

What is a parameter and how does it differ from a statistic?

A parameter is a number/quantity that describes (a characteristic of) the entire population, whereas a statistic is a number/quantity that describes a sample. The number used to describe a characteristic of a population/sample could be a measure of central tendency (mean, median or mode) or a measure of dispersion (variance, standard deviation, range or inter-quartile range).

Here are the notations and formulae of a few important and widely used measures in the context of population and sample. In the table below, x denotes an individual value in the population or sample set, N the population size and n the sample size.

Measure              Population                    Sample
Mean                 μ = Σx / N                    x̄ = Σx / n
Variance             σ² = Σ(x − μ)² / N            s² = Σ(x − x̄)² / (n − 1)
Standard deviation   σ = √(σ²)                     s = √(s²)
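To make the notation concrete, here is a minimal Python sketch (with a made-up population) that computes the population measures alongside their sample counterparts. Note that numpy’s ddof argument switches the variance denominator between N (population) and n − 1 (sample).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of 10,000 values and a random sample of 100.
population = rng.gamma(shape=2.0, scale=9.0, size=10_000)
sample = rng.choice(population, size=100, replace=True)

# Population parameters (variance uses denominator N).
mu = population.mean()
sigma_sq = population.var(ddof=0)

# Sample statistics (sample variance uses denominator n - 1).
x_bar = sample.mean()
s_sq = sample.var(ddof=1)

print(f"Population: mean = {mu:.2f}, variance = {sigma_sq:.2f}")
print(f"Sample:     mean = {x_bar:.2f}, variance = {s_sq:.2f}")
```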

Relevance of Inferential Statistics

What is the use of a sample statistic?

With the help of the science of inferential statistics, we can use a statistic (of a sample) to make an educated guess and draw/infer a conclusion about the parameter (of the population). At the same time, we can also quantify the confidence (or uncertainty) in the inferred conclusion. The uncertainty associated with the estimate of a population parameter arises because we can collect and study only a limited dataset rather than the entire population.

Confidence Interval is an interval estimate (range between two values) used to express the uncertainty associated with an estimate of a population parameter. Given a sample statistic, we can estimate, with some confidence, an interval for the true population parameter (such as mean). In other words, the interval simply indicates that with a certain level of confidence, the true population parameter lies within the estimated range. For instance, I am 90% confident that the true population mean lies somewhere between 9.05 and 11.25.
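As a quick, hedged illustration (the sample here is simulated, and the normal approximation is assumed to hold), a 90% confidence interval for a population mean could be computed from a single sample like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=4, size=100)  # hypothetical sample

x_bar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))  # standard error of the mean

# 90% confidence interval using the normal critical value (z ≈ 1.645).
z = stats.norm.ppf(0.95)
low, high = x_bar - z * se, x_bar + z * se
print(f"90% CI for the population mean: ({low:.2f}, {high:.2f})")
```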

If you are more interested in the concept of Confidence Interval and its calculation, refer to the article.

Uniqueness and Limitation of Sample

Why wouldn’t a statistic value provide us with the exact value of the parameter rather than an estimate?

Let’s say that John, Mike and Kevin have each been separately tasked with inferring the mean age of the residents of the hypothetical town of Hawkins, Indiana. Let us assume that all three have been allocated separate resources for the problem at hand, and that they do not collaborate with each other during the entire exercise. Each person samples exactly 100 random ages from the population.

Do you think the mean age (statistic) of any sample would be the same as the mean age of the city’s population (parameter)? Well, most likely not, because a sample has access to limited information about the population due to its restricted size (n), and hence cannot discern the exact parameter value. However, any of the three sample statistics could still turn out to be equal to the parameter value by chance, even though the probability of such an event is quite slim.

Do you think the statistic value (mean age) of Mike’s sample would match exactly with the statistic value of Kevin’s sample? Again, most likely not, because each sample would probably end up with a different set of values from the population, thereby yielding a unique statistic value. However, two different samples could still end up having the same set of values by chance, even though the probability of such an event is, again, very low.

Let’s extract three different samples from the population to witness the “uniqueness” and “limitation” of a sample in action.

Here’s the (simulated) distribution of ages of all 150k residents of the small town of Hawkins. The distribution’s shape is gamma, and the mean age of the population (parameter) is 18 years.

Simulated Population of Hawkins, Indiana

Let’s prepare separate samples for each of John, Mike and Kevin, by randomly and independently selecting 100 ages out of the population. “Independently” ensures that any selection does not depend on the previous selections; as a result, we could very well pick up the same age in a sample again. This condition is guaranteed by replacing each value back in the population before the next selection happens.
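Here is a minimal sketch of how such a population and the three samples could be simulated. The gamma parameters below are assumptions chosen to give a right-skewed distribution with a mean of about 18; they are not necessarily the ones behind the article’s figures.

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed gamma parameters: shape * scale = 18, giving a right-skewed population.
population = rng.gamma(shape=2.0, scale=9.0, size=150_000)
print(f"Population mean (parameter): {population.mean():.2f}")

# Three independent samples of size 100, drawn with replacement.
for name in ["John", "Mike", "Kevin"]:
    sample = rng.choice(population, size=100, replace=True)
    print(f"{name}'s sample mean (statistic): {sample.mean():.2f}")
```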

The following figures show distributions of the three samples in the form of rug plots. The member values of each sample are denoted by ticks touching the upper side of the x-axis.

Did you happen to notice the uniqueness in the set of ages, and consequently, in the statistic (mean age) of each sample? All the statistic values (17.84, 16.21 and 19.12) differ from each other, and also from the population parameter (18.0). However, the statistic values seem to hover around the parameter.

The variance in statistic values (or the sampling error) arises because each sample is peeking into (and accessing) the population through its own unique and limited window — uniqueness arises due to randomization in sampling and limitation due to the restricted size of the sample.

What would happen if we were to draw thousands of samples from the population? Could the power of many (statistic values), somehow, reveal the true parameter?

Sampling Distribution

If we draw thousands of similarly constructed samples (randomly and independently, each of size 100) out of the population and calculate the mean age (statistic) of each sample, we can plot a histogram, known as the sampling distribution, to showcase the frequency of the thousands of mean statistic values gathered. Remember that John’s, Mike’s and Kevin’s would be just three among the thousands of samples that form the sampling distribution below.
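A sampling distribution like this can be simulated directly. The sketch below (again using assumed gamma parameters, so the numbers will not exactly match the article’s figures) draws 50,000 samples of size 100 and collects their means; a histogram of those means is the simulated sampling distribution.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
population = rng.gamma(shape=2.0, scale=9.0, size=150_000)  # assumed parameters

# Draw 50,000 random samples of size 100 (with replacement) and record the
# mean of each one; the collection of means is the sampling distribution.
sample_means = rng.choice(population, size=(50_000, 100), replace=True).mean(axis=1)

plt.hist(sample_means, bins=60)
plt.xlabel("Sample mean age (years)")
plt.ylabel("Frequency")
plt.title("Simulated sampling distribution of the mean (n = 100)")
plt.show()
```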

Sampling distribution is a probability distribution consisting of all possible values of a sample statistic. It shows the frequency of every possible value a statistic could take in every possible sample (of size n) drawn from the same population.

Remember that the statistic could be mean, median, variance or any other measure that quantitatively describes (a characteristic of) a sample. However, for the current problem at hand, we have deliberately chosen mean as the statistic, since the CLT principles are applicable to the mean statistic/parameter only.

Here are a few key observations from the sampling distribution above.

  • With the sample size (n) of 100, the shape of the sampling distribution turns out to be approximately normal (aka Gaussian) and symmetrical, even though the original population distribution was skewed to the right. As we will see later, the more skewed or unsymmetrical the population distribution, the higher n must be for the sampling distribution to approximate a normal shape.
  • The average of the statistic values/sample means, aka the expected value of the statistic (mean age), is 18.01 years, which is so close to the parameter (mean age of the population = 18.0 years) that it can be considered equal to the parameter. The expected value of a statistic, denoted by E(x̄) for the sample mean, is the long-run average value that the statistic assumes over repeated sampling, and thus our best guess of its value on any particular trial.
  • The standard error of the statistic, which is simply the standard deviation of the sample means, is 1.83 years. It denotes how spread out the sampling distribution is, or how much variance is observed in the mean ages of samples. As we will see later, the higher the n, the lower the standard error becomes.
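The two summary numbers in the bullets above, the expected value and the standard error, can be recovered directly from the simulated sample means. Since the gamma parameters below are assumed, the exact figures will differ from the article’s 18.01 and 1.83, but the relationships hold.

```python
import numpy as np

rng = np.random.default_rng(7)
population = rng.gamma(shape=2.0, scale=9.0, size=150_000)  # assumed parameters

n = 100
sample_means = rng.choice(population, size=(50_000, n), replace=True).mean(axis=1)

expected_value = sample_means.mean()       # E(x_bar): should land near the parameter
standard_error = sample_means.std(ddof=0)  # spread of the sampling distribution

print(f"Population mean (parameter):  {population.mean():.2f}")
print(f"Expected value of statistic:  {expected_value:.2f}")
print(f"Standard error (simulated):   {standard_error:.2f}")
print(f"sigma/sqrt(n) (theoretical):  {population.std(ddof=0) / np.sqrt(n):.2f}")
```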

The entire process — sampling, calculating statistic (mean) and finding expected value of the statistic — is the underpinning of CLT and can be summarized pictorially as follows.

Impact of Sample Size on Sampling Distribution

Does the sample size have an effect on the sampling distribution? If yes, why?

As it turns out, the sample size does have a profound impact on the sampling distribution’s shape and width. Let’s see how the sampling distribution changes when the sample size (n) is altered.

Below are the sampling distributions generated out of the population (ages) of Hawkins, but with different sample sizes (notice the legend on the top right of each figure).

Upon varying the sample size n, we observe two important things.

  • The sampling distribution approaches a normal shape as the sample size increases. The sampling distribution with n=15 is not symmetrical — it is slightly skewed to the right (the width of the left segment is roughly 9.75 years, while that of the right segment is over 14 years), mimicking the original population distribution. Increasing the sample size to at least 100 makes the distribution assume symmetry, and hence, a normal shape.

The higher the sample size is, the better the shape approximates to a normal curve.

  • As the sample size (n) increases, the standard error of the statistic (width of the distribution) decreases.

Think about this for a moment. If a sample had the ability to pick up (nearly) all the members of the population, wouldn’t it be able to tell us the mean of the population with utmost confidence? Why would there be any (noticeable) variance in the means of such samples? Notice in the last sampling distribution with n=150k (equal to the size of the entire population) that the standard error of mean ages is minimal, at 0.03 years.

By the same line of reasoning, if each sample had the ability to pick up far fewer members of the population during the sampling process, and hence had access to much less information about the population, the means of such samples would show a higher variance. This is evident from the sampling distributions above with n=15, 100, and 500. As we increase n, the variance in sample means decreases. Do you remember the means of John’s, Mike’s and Kevin’s samples, each of which was limited to a size of 100? Do you recall that they were not equal, and hence showed some variance?

As the sample size increases, the standard error of the statistic decreases.
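This shrinking standard error is easy to verify empirically. Below is a minimal sketch (reusing the same assumed gamma population as earlier, so the exact numbers will differ from the figures above) that recomputes the spread of the sample means for several values of n.

```python
import numpy as np

rng = np.random.default_rng(7)
population = rng.gamma(shape=2.0, scale=9.0, size=150_000)  # assumed parameters
sigma = population.std(ddof=0)

for n in [15, 100, 500]:
    means = rng.choice(population, size=(20_000, n), replace=True).mean(axis=1)
    print(f"n = {n:>3}: standard error = {means.std(ddof=0):.3f} "
          f"(theory: sigma/sqrt(n) = {sigma / np.sqrt(n):.3f})")
```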

Central Limit Theorem

Here is a more formal definition of CLT; it is essentially what we have discussed thus far.

If we were to draw samples, each of size n, and calculate the mean of each sample, we would expect to obtain a distribution of values, known as the sampling distribution of the means. CLT states that for large random samples, the sampling distribution of means is approximately normal, regardless of how the population/process being sampled is distributed. The larger the sample size n, the better the sampling distribution approximates the exact normal distribution.

Notations for terms used in CLT

The notation for representing the sampling distribution (of sample means) is as follows, where ‘~’ is read as “distributed as” and ‘N’ as “Normal”.

x̄ ~ N(μ_x̄, σ_x̄²)

CLT establishes the relationship of the features of the sampling distribution with those of the associated population distribution as follows.

μ_x̄ = E(x̄) = μ

σ_x̄ = σ / √n

Notice that as n increases, the standard error of the mean statistic decreases. It is exactly what we observed in the previous section as well.

Given the formulae above, we can alternatively say that the sample means are distributed as normal (shape), with their mean (aka expected value) equal to the mean of the population and their variance equal to the variance of the population divided by the size of the samples forming the sampling distribution. Therefore, the essence of CLT can be captured in the following formula.

x̄ ~ N(μ, σ²/n)

According to the CLT principles, (approximate) normality of a sampling distribution is achieved when the sample size is large. But how large should the sample size be in order to achieve a normal shape of the sampling distribution?

Though many books/articles recommend n>30 as a rule of thumb, the required sample size depends on the shape of the population. If the underlying (population) distribution is approximately symmetrical, then the sampling distribution will already be pretty normal; in such cases, we could get away with n as low as 10. However, if the underlying distribution is heavily skewed, then we need a larger sample size to get a normal shape of the sampling distribution. Recall that in the case of the ages of the residents of Hawkins, which were skewed to the right, we needed a sample size of 100 to get a normal shape of the sampling distribution.

The picture below shows sampling distributions for four different types of population distributions. To achieve a decent normal shape in the corresponding sampling distribution, minimum sample sizes of 10, 25, 100 and 250 would be required for the given normal, uniform, gamma and bimodal population distributions, respectively.
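One way to see the shape converging toward normality (rather than just the spread shrinking) is to track the skewness of the simulated sample means as n grows. Here is a minimal sketch using the same assumed gamma population as before; a skewness near 0 indicates an approximately symmetrical, normal-looking sampling distribution.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)
population = rng.gamma(shape=2.0, scale=9.0, size=150_000)  # assumed parameters
print(f"Skewness of the population itself: {skew(population):.2f}")

for n in [5, 15, 100, 250]:
    means = rng.choice(population, size=(20_000, n), replace=True).mean(axis=1)
    print(f"n = {n:>3}: skewness of the sample means = {skew(means):.3f}")
```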

Limitations of CLT

CLT principles can be applied to a problem only when the following conditions are met.

a) The population must have a finite mean and variance. For instance, the Cauchy distribution does not have a defined mean and variance, and hence is not eligible for CLT principles; the sketch after this list demonstrates the consequence. That said, most real-world cases do involve a finite mean and variance.

b) The relationship established by CLT between the features of the population distribution and those of the sampling distribution applies to the mean statistic only, and cannot be applied to the variance, standard deviation, mode, or median of samples.

c) For normal, uniform or other symmetrical population distributions, the sample size could be as low as 10. However, for an unsymmetrical population, depending on how skewed the distribution is, the sample size needs to be increased to something large (perhaps in the hundreds). Hence, unless we are sure that the population is symmetrically distributed, we should apply CLT principles only when the sample size is large.

d) We need to be sure that the sampling is conducted in a random and independent fashion. In a sample, if the selection of any member depends on the selection of any other member, then the sampling is not independent. In such a case, we cannot apply CLT principles to discern the features of the sampling distribution from those of the population distribution, or vice versa.
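To illustrate limitation (a), here is a minimal sketch showing that sample means drawn from a standard Cauchy distribution refuse to settle down as n grows, unlike the well-behaved gamma population earlier.

```python
import numpy as np

rng = np.random.default_rng(1)

# The Cauchy distribution has no finite mean or variance, so the CLT does not
# apply: sample means stay wildly erratic no matter how large n gets.
for n in [100, 10_000, 1_000_000]:
    means = [rng.standard_cauchy(n).mean() for _ in range(5)]
    print(f"n = {n:>9,}: five sample means = {np.round(means, 2)}")
```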

Example Problem

Here is a problem with a solution that showcases one of the applications of CLT principles.

Problem: Google Waymo is on a mission to upend the automobile industry by introducing the safest self-driving electric vehicles in the market. The firm cares not only about society and the environment but also about its employees. The HR department is planning a picnic for its employees on a Friday this summer. 100 individuals have responded that they will attend the 4-hour outing. The organizer of the event has arranged for 53 gallons of water for the attendees. During a four-hour outing in summer, an adult needs half a gallon of water on average; the variance in the water requirement is 0.01 squared gallons. The organizer needs to know if the probability of the team running out of available water before the picnic ends is greater or less than her threshold of 5%.

Solution:

Since the requirement of water varies from person to person, the amount of water required by an adult during a 4-hour outing in summer can be considered a distribution with a mean of 0.5 gallons and a standard deviation of 0.1 gallons, the square root of the variance of 0.01. We can regard this distribution as the population of concern, with an unknown shape.

If we were to pick a random sample of 100 water requirements from the population distribution, we would end up with a unique (statistic) value of the average amount of water required per person. If we were to draw fifty thousand similarly constructed samples, we would end up with a sampling distribution of the average amount of water required by a person.

Based on the CLT principles, irrespective of the shape of the original population distribution, for large random samples, the sampling distribution should approximate a normal shape, with its expected value equal to the population mean and its standard error equal to the standard deviation of the population divided by the square root of the sample size. Of course, we assume that the population distribution is not absurdly far from a symmetrical shape, and hence, 100 qualifies as a “large” enough sample size for the CLT principles to be applicable to the problem at hand.

The sampling distribution can be represented as follows.

x̄ ~ N(0.5, 0.01²)

With 53 gallons of water for 100 employees, we have an average of 0.53 gallons available per individual. All we need to do is figure out the probability of a sample requiring more than 0.53 gallons of water per person. In layman’s terms, how many samples out of the 50k samples (in the sampling distribution) would require more than 0.53 gallons per person?

Given the standard deviation of the sampling distribution as 0.01, it is easy to see that the sample statistic (0.53) lies 3 standard deviations to the right of the mean value of 0.5. This is known as the z-score of the sample statistic, and it tells us how far the sample statistic is from the mean in terms of the distribution’s standard deviation. (You can read more about z-scores here.)

As shown below, in a normal distribution, roughly 99.7% of the data falls within 3 standard deviations of the mean. Since 0.53 sits 3 standard deviations to the right of the mean, the probability of a sample mean exceeding it is roughly 0.15% — refer to the tiny area on the right-most side of the normal distribution below.
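The calculation is easy to check in a couple of lines. Note that the exact normal tail probability beyond z = 3 is about 0.135%; the 0.15% above comes from the coarser 99.7% empirical rule.

```python
import numpy as np
from scipy import stats

mu, sigma, n = 0.5, 0.1, 100     # population mean, std dev, sample size
se = sigma / np.sqrt(n)          # standard error of the mean = 0.01
x_bar = 53 / 100                 # available water per person = 0.53 gallons

z = (x_bar - mu) / se            # z-score = 3.0
p = stats.norm.sf(z)             # P(sample mean > 0.53), upper-tail probability
print(f"z = {z:.1f}, P(running out of water) = {p:.3%}")  # about 0.135%
```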

Answer: The organizer can rest assured that the probability of the team running out of 53 gallons of water is way less than her threshold level of 5%.

Summary: Here is the gist of steps performed to solve the problem at hand.

  • The parameters — mean and standard deviation of the population — were provided to us,
  • Using CLT principles, the sampling distribution’s features — expected value and standard error of the statistic — were computed,
  • The position of the given statistic (0.53 gallons per person) in the sampling distribution was determined, and
  • The probability of occurrence of that statistic based on the known features of a normal distribution was estimated.

I hope that the article helped you build a good understanding of the context and the intuition behind Central Limit Theorem, along with its principles and applications. If it feels overwhelming at first, give it another read after a day or two. It is easy to get confused with some technical terms such as sample, sample mean, and sampling distribution, but if you patiently go about comprehending one at a time, in the same order as described in the article, then everything will start making sense.

Should you have any question or feedback, feel free to leave a comment. You can also reach me through my LinkedIn profile (www.linkedin.com/in/gauravdembla/).
