
Despite being one of the most fundamental concepts in Data Science, the Central Limit Theorem (CLT) is still widely misunderstood.
Questions about such fundamental statistical concepts do pop up in data science interviews. Yet you’d be surprised how often aspiring data scientists invest their learning time in the latest trends and algorithms, never revisit the basics, and get stumped at interviews.
This post will help you understand the CLT at an intuitive level. It will also help you appreciate its importance and the key assumptions behind its use.
Plain English Explanation
In somewhat formal language,
the Central Limit Theorem states that the sampling distribution of the sample mean of any distribution will be approximately normal, provided that the sample size is large enough.
Let’s unpack this definition in simpler words using a concrete example.
Imagine a hypothetical country of 2 million households divided into two key regions: Tom and Jerry. For simplicity, let’s assume that 1 million households live in the Tom region, and 1 million live in the Jerry region.
A popular fast-food chain has recruited you to help them decide whether they should invest in opening a branch in the country and, if they do, whether it should be in the Tom or the Jerry region.
Let’s assume that a useful metric for assessing the existing eating habits in each region is the weekly number of fast-food restaurant visits per household. Your task is to estimate this metric for Tom, Jerry, and the country overall.
In our hypothetical country, the mean number of weekly household visits for Tom is 1.5, with the distribution given in Figure 1.

And the mean number of weekly household visits for Jerry is 3.5 with the distribution given in Figure 2.

Collectively, the country-wide distribution of weekly visits is given in Figure 3, with a mean of 2.5.

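To make this setup concrete, here is a minimal R sketch of how such a population could be simulated. The region means match the example above; the standard deviation of 0.5 and the use of a normal distribution per region are assumptions chosen purely for illustration.

```r
set.seed(42)  # for reproducibility

# Hypothetical population: 1 million households per region.
# Means match the example (Tom = 1.5, Jerry = 3.5); the standard
# deviation of 0.5 is an assumed value for illustration only
# (a tiny fraction of draws may be negative; harmless here).
tom     <- rnorm(1e6, mean = 1.5, sd = 0.5)
jerry   <- rnorm(1e6, mean = 3.5, sd = 0.5)
country <- c(tom, jerry)  # 2 million households overall

mean(tom)      # ~1.5
mean(jerry)    # ~3.5
mean(country)  # ~2.5
```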
In theory, we can ask about the eating habits of every household in the country and then compute the mean weekly visit rate. This is, however, not feasible in real-world projects.
What we do instead is “sample” from the total population. By “sample”, we mean that we go and ask only a small group of households (often chosen randomly, for good reasons) from the total population.
Let’s assume that we go out and sample 100 randomly chosen households from the Tom region and then calculate the mean. This is one experiment.
If we repeat the same experiment, we will get a different mean. If we repeat it 100 times, we will get 100 different sample means.
If we then plot the distribution of these sample means, it will look like a normal distribution. The mean of this sampling distribution will be very close to the true population mean.
Figure 4 shows the distribution of 10,000 mean values from the Tom region (simulated in R). Each mean value is computed by sampling 100 randomly chosen households.
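If you want to reproduce this experiment yourself, here is one way to do it in R, assuming the tom vector from the sketch above:

```r
# One experiment: sample 100 households and compute their mean.
# Repeat the experiment 10,000 times.
tom_means <- replicate(10000, mean(sample(tom, size = 100)))

mean(tom_means)  # very close to the population mean of 1.5
hist(tom_means, breaks = 50,
     main = "Sampling distribution of the mean (Tom region)")
```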

Figure 5 shows the distribution of 10,000 mean values from the Jerry region. Again, each mean value is computed by sampling 100 randomly chosen households.

Both distributions in Figures 4 and 5 are normal. At this point, you may think that these sampling distributions are normal because the population distributions they were derived from are normal.
However, what comes next may surprise beginners.
It doesn’t matter what the population (original) distribution is. If we sample, and the sample is large enough, the distribution of the sample mean will be approximately normal. Further, the mean of this sampling distribution will be approximately equal to the population mean.
What you just read above is the CLT in plain English.
Let’s demonstrate the CLT using our earlier example. Let’s randomly choose 100 households from the entire country (the Tom and Jerry regions combined), compute the mean, and repeat the same experiment 10,000 times. Figure 6 shows the distribution of these 10,000 mean values.

Clearly, the country’s population distribution is not normal. Even so, the sampling distribution is approximately normal, with its mean very close to the population mean.
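Here is the same experiment run on the combined, bimodal population, again assuming the vectors defined in the earlier sketch:

```r
# The country-wide population has two peaks, yet the
# distribution of sample means is approximately normal.
country_means <- replicate(10000, mean(sample(country, size = 100)))

mean(country_means)  # very close to the population mean of 2.5
hist(country_means, breaks = 50,
     main = "Sampling distribution of the mean (whole country)")
```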
This is the beauty of the CLT. We don’t need to know the underlying distribution of a random variable. We can still estimate the population mean by sampling, safely treating the sampling distribution of the mean as approximately normal.
What Makes the CLT Useful?
In most real-world projects, we can’t go out and collect data from the entire population due to time and resource constraints. However, the CLT empowers us to confidently collect data from a subset of the population and then use statistics to draw conclusions about the whole population.
The CLT is foundational to hypothesis testing, the branch of inferential statistics that helps us draw conclusions about a population from only a representative subset of data.
Final Thoughts
In the hypothetical example, the population distributions for Tom and Jerry were normal, while the distribution for the entire country was not (it had two peaks). However, the sampling distributions were normal in all three cases. This is the Central Limit Theorem at work: regardless of the population distribution, the sampling distribution of the mean is approximately normal, provided that the sample is large enough. In most practical applications, a sample size greater than 30 is usually considered sufficient.
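To get a feel for why sample size matters, here is a small sketch using a strongly skewed exponential population (not part of our Tom-and-Jerry example): with a sample size of 5 the sampling distribution is still visibly skewed, while with 50 it already looks close to normal.

```r
# A strongly skewed population; used here only to illustrate
# the effect of sample size, not part of the earlier example.
pop <- rexp(1e6, rate = 1)

means_n5  <- replicate(10000, mean(sample(pop, size = 5)))
means_n50 <- replicate(10000, mean(sample(pop, size = 50)))

par(mfrow = c(1, 2))
hist(means_n5,  breaks = 50, main = "n = 5 (still skewed)")
hist(means_n50, breaks = 50, main = "n = 50 (close to normal)")
```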
The CLT is only valid if the mean and variance of the underlying distribution are finite. Consequently, it does not apply to the Cauchy distribution, which has neither. If you want to dig further, check this simulation walk-through of two examples, one where the CLT applies and one where it does not.
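As a quick taste of that failure case, here is a minimal sketch. Because the Cauchy distribution has no finite mean, the mean of n Cauchy draws is itself Cauchy-distributed, so averaging never tames the spread:

```r
set.seed(1)
# Sample means of 100 draws, repeated 10,000 times.
normal_means <- replicate(10000, mean(rnorm(100)))   # finite mean and variance
cauchy_means <- replicate(10000, mean(rcauchy(100))) # neither is finite

sd(normal_means)     # small: the means concentrate around 0
sd(cauchy_means)     # huge: the means are as spread out as single draws
range(cauchy_means)  # extreme outliers never wash out
```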
All the figures in this post were generated in R. The code with comments is available on my GitHub.
Happy Learning!