Central Limit Theorem Explained with Python Code

Sujeewa Kumaratunga PhD
Towards Data Science
6 min read · Jan 30, 2020



A simulation to explain the Central Limit Theorem: even when a sample is not normally distributed, if you draw multiple samples and take each of their averages, these averages will follow a normal distribution.

All roads lead to Rome! Wait, no! All roads lead to Shibuya! Wait, no! All sample means lead to the population mean.

The Central Limit Theorem says that if you randomly draw a sample of your customers, say 1000 customers, this sample itself might not be normally distributed. But if you now repeat the experiment, say, 100 times, then the 100 means of those 100 samples (of 1000 customers each) will form an approximately normal distribution.

This line is important for us: ‘this sample itself might not be normally distributed’. Why? Because most things in life are not normally distributed; not grades, not wealth, not food, and certainly not how much our customers pay in our shop. Many real-world quantities are instead described by skewed distributions, the Poisson being a classic example, so let’s work with a skewed distribution here.

But in an e-commerce shop, most of our customers are non-buying customers, so the distribution of order values actually looks exponential. And since the exponential is closely related to the Poisson (it models the waiting times between events in a Poisson process), let’s make some exponential distributions to reflect our customers’ purchases.
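Here is a minimal sketch of what such a skewed population looks like (this snippet is mine, separate from the article’s main code at the end; it assumes NumPy is installed, and the $170 mean anticipates the example below):

import numpy as np

# Draw one million hypothetical order values from an exponential
# distribution with a mean of $170 (NumPy's scale parameter is the mean).
rng = np.random.default_rng(seed=42)
order_values = rng.exponential(scale=170, size=1_000_000)

print(round(np.mean(order_values), 2))    # close to 170
print(round(np.median(order_values), 2))  # around 170 * ln(2), about 118: heavily skewed
print(round(np.std(order_values), 2))     # close to 170 (mean equals std for an exponential)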

Let’s Try an Example

Let us assume our customer base has an average order value of $170, so we will create exponential distributions with this average. We will then attempt to recover this value by looking at some sample averages.

Draw only four samples

Here, I draw a sample of 1000 customers, then repeat this 4 times. I get the following four distributions (to reproduce similar graphs, run the code at the end with repeat_sample_draws_exponential(4, 1000, 170, True)):

4 random samples of 1000 customers drawn. Each sample has an exponential distribution with mean $170.

And here is each of those 4 averages plotted:

Mean of each of the four samples plotted.

The mean of the above 4 sample means is $165.48
The standard deviation of the above 4 sample means is $5.36

So if you stopped your experiments here, at four samples, you would report that your customer base has an average order value of $166 ± $5, where one standard deviation covers about 68% of a normal distribution, so the sample means typically fall between $161 and $171. And that would be a wrong conclusion, because we built all these simulations from the original assumption that the population mean is $170!
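You can reproduce numbers in this ballpark with a few lines. Below is a minimal sketch, separate from the full plotting code at the end; the helper name summarize_sample_means is my own, and your exact numbers will differ from the article’s since the draws are random:

import numpy as np

def summarize_sample_means(n_samples, samp_size, mu, seed=0):
    # Draw n_samples exponential samples of samp_size customers each,
    # then report the mean and standard deviation of the sample means.
    rng = np.random.default_rng(seed)
    means = rng.exponential(scale=mu, size=(n_samples, samp_size)).mean(axis=1)
    print('mean of sample means: {:.2f}'.format(means.mean()))
    print('std of sample means:  {:.2f}'.format(means.std()))

summarize_sample_means(4, 1000, 170)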

Draw 100 samples, but keep each sample size the same as above at 1000 customers

What happens if we repeat this experiment 100 times, i.e., draw 100 samples of 1000 customers each?

To reproduce similar graphs, run the code at the end with:
repeat_sample_draws_exponential(100, 1000, 170, True)

We will get 100 exponential distributions like so:

100 random samples of 1000 customers drawn. Each sample has an exponential distribution with mean $170.

And the distribution of the 100 averages now starts looking like a normal distribution:

Mean of each of the 100 samples plotted.

The mean of the above 100 sample means is $169.55
The standard deviation of the above 100 sample means is $5.18

So now you would conclude that your customer base has an average order value of $170 (yay! this was our very first assumption!), with the sample means falling between $165 and $175 about 68% of the time.
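Running the same hypothetical helper from above with 100 samples shows the estimate settling near the true mean:

summarize_sample_means(100, 1000, 170)
# mean of sample means: close to 170
# std of sample means:  close to 170 / sqrt(1000), i.e. about 5.38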

Side note on Standard Deviation

Notice how the standard deviation did not change: it was roughly $5 when we drew 4 samples and roughly $5 when we drew 100 samples. You could run the code below with 10000 samples and see that the standard deviation still would not move from about $5. Why is this?

This brings us to the often-overlooked second part of the Central Limit Theorem: the standard deviation of the sample means is equal to the standard deviation of the population divided by the square root of the sample size.

Our population is an exponential distribution with mean $170. The exponential has the special property that its standard deviation equals its mean, so the standard deviation of the population is $170. In this case, then, we would expect the standard deviation of the sample means to be 170/sqrt(1000) ≈ 5.38.

And this is exactly what we get: a standard deviation of the sample means of around $5. No matter how many times we repeat the sampling, this standard deviation will not change, since it depends only on the population standard deviation (a constant for a given population) and the sample size.
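A quick way to convince yourself, again using the hypothetical summarize_sample_means helper from above, is to vary the number of samples while keeping the sample size fixed:

for n_samples in (4, 100, 10000):
    summarize_sample_means(n_samples, 1000, 170, seed=1)
# The std of the sample means hovers around 170 / sqrt(1000), about 5.38,
# in every case; more samples only stabilize the estimate of the mean itself.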

Conclusion

So that is the Central Limit Theorem. The more samples you draw, the better you get at estimating the population mean. We cannot possibly test or survey all our customers (all of our population), so the next best thing we can do is to draw several samples of a smaller size and take the average of those averages. The key here is to draw as many samples as possible. When we do that, we have:

mean of the sample means ≈ population mean, with the approximation improving as the number of samples grows

And the standard deviation of the sample-mean distribution will shrink if you increase your sample size (the sample size is not the number of samples; in our example the sample size is 1000 and the number of samples is 100):

standard deviation of the sample means = standard deviation of the population / sqrt(sample size)
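To see that 1/sqrt(sample size) scaling in action (same hypothetical helper as above; quadrupling the sample size should roughly halve the standard deviation of the sample means):

summarize_sample_means(100, 1000, 170, seed=2)  # std around 170/sqrt(1000), about 5.38
summarize_sample_means(100, 4000, 170, seed=2)  # std around 170/sqrt(4000), about 2.69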

Code to reproduce this:

import math

import numpy as np
import matplotlib.pyplot as plt


def repeat_sample_draws_exponential(n, samp_size, mu, show_all=False):
    """Draw n exponential samples of samp_size customers each (mean mu),
    optionally plot every sample, then plot and summarize the sample means."""
    means = []
    samples = []
    for ii in range(n):
        # One sample of samp_size order values; the scale parameter is the mean.
        samples.append(np.random.exponential(mu, samp_size))
        means.append(np.mean(samples[ii]))

    if show_all:
        # Arrange the n sample histograms on a square grid.
        pltdim = math.ceil(math.sqrt(n))
        fig, axs = plt.subplots(pltdim, pltdim, figsize=(8, 8),
                                gridspec_kw={'hspace': 0.2},
                                sharex=True, sharey=True)
        fig.suptitle('Individual Samples\' Order Value Distribution')
        fig.text(0.5, 0.04, 'Order Values ($)', ha='center')
        fig.text(0.04, 0.5, 'Number of Customers', ha='center',
                 rotation='vertical')
        axs = axs.flatten()
        for ii in range(n):
            plt.sca(axs[ii])
            plt.gca().hist(samples[ii], bins=50, histtype='step',
                           label='$mean = {0:.2f}$'.format(np.mean(samples[ii])),
                           range=[0, 2 * mu])
            if n < 10:
                plt.gca().set_title('Sample #{0} : average={1:.2f}'
                                    .format(ii, np.mean(samples[ii])))
            for item in ([axs[ii].title, axs[ii].xaxis.label,
                          axs[ii].yaxis.label] +
                         axs[ii].get_xticklabels() + axs[ii].get_yticklabels()):
                item.set_fontsize(8)
        plt.savefig('expdist_{0}_mu_{1}_sample_{2}_sampsize'.format(mu, n, samp_size))

    # Histogram of the n sample means: this is the distribution the
    # Central Limit Theorem predicts will look normal as n grows.
    plt.clf()
    plt.hist(means, bins=10, histtype='step')
    plt.title('Overall Average of {} Samples\' Average Order Value'.format(n))
    plt.xlabel('Average of Individual Sample\'s Order Value ($)')
    plt.savefig('average_of_expdist_{0}_mu_{1}_sample_{2}_sampsize'.format(mu, n, samp_size))
    print('mean of the sample means is {0:.2f}'.format(np.mean(means)))
    print('standard deviation of the sample means is {0:.2f}'.format(np.std(means)))


repeat_sample_draws_exponential(100, 1000, 170, True)


I am a physicist turned data scientist, currently working at a giant e-commerce company in Japan, after some years doing AI and data science in Canada.