Sampling Distributions with Python Implementation

Sampling distributions, the Central Limit Theorem, and Bootstrapping explained with Python Examples

Nazanin Zinouri
Towards Data Science


A sampling distribution is the distribution of a statistic (any statistic). Why are sampling distributions important? They are the key to inferential statistics, where the goal is to draw conclusions about a population based on data collected from a sample of individuals from that population.

In this article, the population of interest is 10,000 Australian Shepherd puppies. I have always been curious to know what proportion of Aussie dogs have blue eyes and what proportion have hazel eyes (assuming these are the only two eye colors they can have, which is not true in reality!). So the parameter we want to estimate is the proportion of blue-eyed puppies in our population. Let's assume I was able to find 20 puppies to participate in an experiment as our sample. The blue-eyed proportion (a proportion is the mean of 0/1 values) in those 20 puppies is our statistic. Below you can see how this sample is simulated.
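The original code gist is not reproduced here; below is a minimal sketch of how such a sample can be set up. The array of 15 blue-eyed and 5 hazel-eyed puppies is an assumption chosen so the statistics match the numbers discussed next (1 = blue, 0 = hazel).

    import numpy as np

    # 1 = blue-eyed, 0 = hazel-eyed. 15 of the 20 puppies are blue-eyed,
    # which reproduces the statistics discussed below.
    sample = np.array([1] * 15 + [0] * 5)

    print(sample.mean())  # 0.75
    print(sample.std())   # ~0.433
    print(sample.var())   # ~0.187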

Sampling distributions have the following two properties:

  1. The sampling distribution is centered on the original parameter value.
  2. The variance of the sampling distribution decreases as the sample size becomes larger.

We see from above that the mean of our original sample is 0.75, and the standard deviation and variance are 0.433 and 0.187, respectively. Let's explore the above properties in our example:
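The article's plotting gist is omitted; a sketch along the following lines (10,000 resamples per sample size, with matplotlib assumed for plotting) produces the picture described next:

    import numpy as np
    import matplotlib.pyplot as plt

    np.random.seed(42)

    def sample_proportions(sample, size, n_iter=10_000):
        """Repeatedly draw samples of the given size (with replacement)
        and return the blue-eyed proportion of each draw."""
        return np.array([np.random.choice(sample, size).mean()
                         for _ in range(n_iter)])

    props_5 = sample_proportions(sample, 5)
    props_20 = sample_proportions(sample, 20)

    plt.hist(props_5, alpha=0.5, label='sample size 5')
    plt.hist(props_20, alpha=0.5, color='orange', label='sample size 20')
    plt.axvline(sample.mean(), color='red', linestyle='--', label='original mean')
    plt.legend()
    plt.show()

    # Property 2: the variance shrinks as the sample size grows.
    print(props_5.var())   # ~0.0375  (0.187 / 5)
    print(props_20.var())  # ~0.0094  (0.187 / 20)

Note that np.random.choice samples with replacement by default, which is what lets us draw samples of any size from our 20 puppies.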

We see that the sample proportions are centered around the original mean (0.75). This is easier to see for the larger sample size (20, shown in orange on the plot). We can also see that the variance is smaller for samples of size 20 than for samples of size 5. In fact, the variance of the sampling distribution is the variance of the original data divided by the sample size.

Two important mathematical theorems are often discussed when working with sampling distributions:

  1. The law of large numbers
  2. The central limit theorem

The Law of Large Numbers

The law of large numbers states that as a sample size increases, the sample mean will get closer to the population mean. Let’s check this with our example.
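The original check is not shown here; a minimal sketch (resampling from our 20-puppy sample, so the exact numbers will vary from run to run) might look like this:

    import numpy as np

    np.random.seed(42)

    # As the sample size grows, a single draw's mean tends to land
    # closer to the original mean of 0.75.
    for size in (5, 20, 100):
        draw = np.random.choice(sample, size)
        print(f"sample size {size:>3}: mean = {draw.mean():.3f}")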

We can see the sample mean is much closer to the original mean (0.75) for a sample of size 100.

The Central Limit Theorem

This theorem states that with a large enough sample size, the sampling distribution of the mean will be normally distributed. This theorem does not apply to all statistics. It does apply to the following:

  1. Sample means
  2. The difference between sample means
  3. Sample proportions
  4. The difference between sample proportions

To check this, let's change our example to a sample drawn from a non-normal distribution.
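The exact distribution used in the article is not reproduced here; an exponential population is one common right-skewed choice for this kind of sketch:

    import numpy as np
    import matplotlib.pyplot as plt

    np.random.seed(42)

    # A right-skewed (non-normal) population.
    population = np.random.exponential(scale=1.0, size=10_000)

    # Sampling distribution of the mean for three sample sizes.
    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, size in zip(axes, (3, 5, 30)):
        means = [np.random.choice(population, size).mean()
                 for _ in range(10_000)]
        ax.hist(means, bins=30)
        ax.set_title(f'sample size {size}')
    plt.show()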

We see that the sampling distribution is skewed to the right for samples of sizes 3 and 5, and closer to normal for samples of size 30.

Bootstrapping

Relying on mathematical theorems such as the Central Limit Theorem can leave gaps. We may not always be able to obtain a large enough sample (gathering 5 puppies is hard enough, and as we saw in the Central Limit Theorem section, 5 is not a large enough sample), or we may be working with sample statistics to which these theorems do not apply.

In these situations, we can use bootstrapping. Bootstrapping in statistics means sampling with replacement, and it allows us to simulate the creation of a sampling distribution. np.random.choice(sample, size, replace=True) can be used for bootstrapping. In fact, we have been bootstrapping from our sample of 20 puppies in all the previous sections. By bootstrapping and then calculating the statistic on each resample, we can gain an understanding of the sampling distribution of that statistic.
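As an illustration, here is a minimal bootstrap sketch for our blue-eyed proportion; the percentile-based 95% interval is one common way to summarize the result, not necessarily the one used in the article:

    import numpy as np

    np.random.seed(42)

    # Resample the original 20 puppies with replacement, 10,000 times,
    # recording the blue-eyed proportion of each bootstrap sample.
    boot_props = np.array([
        np.random.choice(sample, sample.size, replace=True).mean()
        for _ in range(10_000)
    ])

    # The percentiles of the bootstrap distribution give an
    # approximate 95% confidence interval for the proportion.
    lower, upper = np.percentile(boot_props, [2.5, 97.5])
    print(f"bootstrap 95% CI: ({lower:.2f}, {upper:.2f})")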
