
The Central Limit Theorem – What Exactly Is It?

Discussing one of the most important concepts in statistics


The Central Limit Theorem – if you are studying statistics or data science, then this is definitely a term you have heard before. But given its importance, it can be a bit confusing to understand when you are first learning it (I know it was for me!). In this blog post, I’m going to explain the central limit theorem in a short, concise way that will hopefully stick with you and help you become a better statistician or data scientist!

By definition, the central limit theorem states that when many independent random variables are added together (or averaged), their sum tends toward a normal distribution as the number of variables increases, regardless of the shape of the original distribution. This is useful and important, especially when you want to use sample statistics to estimate the parameters of whatever population you may be studying. A direct consequence is that when the means (averages) of many samples are calculated, those sample means will also form a normal distribution centered on the population mean. Knowing these two ideas lets you set up boundaries, in a sense, around your estimates for the population. It also lets you estimate the probability that a particular sample has an outlier or extreme mean that differs significantly from the mean of your population.
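To make this concrete, here is a minimal simulation sketch (the exponential population, sample size, and number of samples are all made up purely for illustration). It draws many samples from a clearly non-normal population and shows that the distribution of their means still comes out looking approximately normal.

```python
import numpy as np
import matplotlib.pyplot as plt

# A skewed, decidedly non-normal population (exponential, chosen arbitrarily).
rng = np.random.default_rng(seed=42)
population = rng.exponential(scale=2.0, size=100_000)

# Compute the mean of each of 5,000 random samples of size 50.
sample_means = [
    rng.choice(population, size=50, replace=False).mean()
    for _ in range(5_000)
]

# Even though the population is skewed, the histogram of the sample means
# looks approximately normal -- the central limit theorem at work.
plt.hist(sample_means, bins=50)
plt.title("Distribution of sample means (n = 50)")
plt.xlabel("Sample mean")
plt.ylabel("Count")
plt.show()
```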

If that paragraph was confusing, I don’t blame you, so let me use an example to explain it better. Let’s assume we know the mean and standard deviation of obesity rates in the USA. Now we take a random city – say, Portland, Oregon – measure the mean of a sample there, and discover that it is much higher than the mean for the USA as a whole. As a scientist, this should raise some questions. For me, the first question would be, "What is the probability that this measurement is just the result of random chance?" If we calculate that probability and it turns out to be high, we can reasonably conclude that the actual obesity rate in Portland is probably lower than what our sample mean suggested, because random sampling variation alone could explain the high value.

So, let’s relate the example above back to the central limit theorem. As a reminder, the theorem tells us that the means of many samples will form a normal distribution centered on the population mean. We know the mean of our population, which is the average obesity rate for the United States. We can compare the mean of our specific Portland sample against this distribution of sample means and see how many standard deviations away from the population mean it lies. If the sample mean turns out to be 2 or 3 standard deviations away, that is an indication the sample is worth further exploration. This is because such values are rare in a normal distribution – only about 2.5% of sample means fall more than 2 standard deviations above the mean, and only about 0.15% fall more than 3 standard deviations above it.
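If you want to put numbers on this, a minimal sketch of the calculation might look like the following. The national mean, standard deviation, Portland sample mean, and sample size below are all invented for illustration; the key idea is that, under the central limit theorem, the standard deviation of the sample means (the standard error) is the population standard deviation divided by the square root of the sample size.

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical numbers, purely for illustration.
population_mean = 0.30   # assumed national obesity rate
population_sd = 0.10     # assumed population standard deviation
sample_mean = 0.32       # observed mean in our Portland sample
sample_size = 100        # number of people in the sample

# Sample means are roughly normal with standard deviation (standard error)
# equal to population_sd / sqrt(sample_size).
standard_error = population_sd / sqrt(sample_size)

# How many standard errors above the population mean is our sample mean?
z = (sample_mean - population_mean) / standard_error

# Probability of seeing a sample mean at least this extreme by chance alone
# (upper tail of the standard normal distribution).
p_value = norm.sf(z)

print(f"z-score: {z:.2f}, upper-tail probability: {p_value:.4f}")
```

With these made-up numbers the Portland sample mean sits 2 standard errors above the national mean, and the upper-tail probability comes out around 2.3% – rare enough, per the reasoning above, to make the sample worth a closer look.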

Although I have come to understand the central limit theorem better in practice over time, I still think back to my very early days studying Data Science and being completely bamboozled trying to understand it. If that sounds like you, this article was meant for you. If it helped – sweet! If not, read on, young learner! I have faith that you will find the resource that gives you your "Aha!" breakthrough moment.

Thank you for reading!


