Sampling Techniques in Statistics

A light introduction to different sampling techniques in statistics

Published in Towards Data Science · 7 min read · Apr 14, 2021

Whenever we come across any statistical study, we hear a lot of different statistical terms. 😳 One of the most common terms we hear is sampling. In this article, we will try to understand what sampling is and then get into the details of different sampling techniques.

Sampling

Sampling, in simple terms, means selecting a group (a sample) from a population, from which we will collect data for our research. Sampling is an important aspect of a research study because the results depend heavily on the sampling technique used. So, to get results that estimate the population well, the sampling technique should be chosen wisely.

Now let us first understand what exactly a sample and a population are in statistics.

Population is a pool or collection of elements or individuals from which we draw a statistical sample for a study. It is the entire group about which we want to draw a conclusion. The number of elements or individuals in a population is called the population size.

Note: In a statistical study, the population does not always consist of people. It can be anything: the population of sheep in India, the population of all elementary school students in the US, or the population of all blogging websites on the internet.

A sample, on the other hand, is a subset of the population. It is the specific group from which you collect data. The number of elements or individuals in a sample is called the sample size. The process of selecting a sample is called sampling. For example: a sample of sheep from Rajasthan, India; a sample of elementary school students from New York, US; a sample of data science blogging websites on the internet.

Note: The size of the sample is never more than the size of the population, and in practice it is usually much smaller.

But, why do we need a sample? 🤔 That is a great question. 👏 Let us understand this first.

Why do we need a sample?

The answer is simple and pretty straightforward. It is nearly impossible to collect data from (or about) each and every individual (or element) of the population. Sampling lets us learn about the entire population by studying only a part of it. The results from a sample can't be perfectly accurate, but with a good sample they are a close approximation of the population. For that, it is important that the selected group is representative of the population and not biased in any manner.

This is a simple illustration of the population and a sample drawn from it.

There are a lot of sampling techniques out there but we will just talk about a few common sampling techniques in statistics. Please note that we will not go much into the comparison between these techniques.

Sampling techniques

Simple Random Sampling (SRS):

Suppose we have a population of 20 people and we need to get a sample of 7 people from this population. For the sake of understanding, let us number these people from 1 to 20. Now, we will randomly choose 7 numbers between 1 and 20, and the people with those numbers will be a part of our sample. If the person with the chosen number is already in our sample, we will just skip that number and choose another number.

Suppose we chose 4, then 7, then 11, then 20, then 1, then 12, then 20. Since 20 has already been chosen, we will choose another number, and let’s say it's 19 this time. For the sake of understanding, we will cross out the selected people.

Note:

We are skipping the repeated number because we do not survey or interview the same person twice.
There are different ways to generate random numbers. You can do it programmatically or by placing all these numbers in a bag and selecting one each time.

This type of sampling is called simple random sampling. It is most appropriate when the population is homogeneous. Notice that every member of the population has an equal chance (probability) of selection: on the first draw each person is picked with probability 1/20, and overall each person has a 7/20 chance of ending up in the sample of 7.
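The procedure above can be sketched in a few lines of Python. This is only an illustration: the people are assumed to be labelled 1 to 20, and the helper name `simple_random_sample` and the seeds are made up for this example. The helper draws one number at a time and skips repeats, exactly as described; the standard library's `random.sample` does the same job in one call.

```python
import random

def simple_random_sample(population, k, rng):
    """Draw k distinct members by repeated random picks, skipping repeats."""
    if k > len(population):
        raise ValueError("sample size cannot exceed population size")
    chosen = []
    while len(chosen) < k:
        pick = rng.choice(population)
        if pick not in chosen:      # already in the sample: skip and draw again
            chosen.append(pick)
    return chosen

population = list(range(1, 21))     # people numbered 1..20 (assumed labels)
rng = random.Random(42)             # fixed seed only to make the run repeatable

sample = simple_random_sample(population, 7, rng)
print(sample)

# Equivalent one-liner from the standard library (sampling without replacement)
print(random.Random(0).sample(population, 7))
```

Each run with a different seed gives a different set of 7 people, and no person can appear twice.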

Stratified Sampling:

Let us take the same population as above. Let us say we want a sample of size 9 this time. Let us arrange these people in different groups based on the color of the clothes they are wearing.

Based on color, we will get 4 groups from these 20 people. Each one of these smaller groups is called a stratum, and each stratum is defined by a characteristic, which in this case is the color of the clothes. Thus, strata are created based on prior information about the members of the population. The members within a stratum are homogeneous, while the members of one stratum differ from the members of another stratum. Stratified sampling is therefore used when the population as a whole is heterogeneous but can be divided into homogeneous strata.

Now, the members are selected from each of these strata, that is, a sample is taken from each stratum. When we sample a population with many different strata, we generally require that the proportion of each stratum in the sample should be the same as in the population.

A toy example just for understanding the concept.
Number of black to select
= (Number of black/Total number)*Sample size = (9/20)*9 = 4.05
Number of red to select = (4/20) * 9 = 1.8
Number of blue to select = (4/20) * 9 = 1.8
Number of green to select = (3/20) * 9 = 1.35
If we round these numbers, we can select 4 black, 2 red, 2 blue and 1 green (9 people in total) to represent the population well.

Note: Random sampling or any other sampling technique can be used to sample members from each stratum.
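Here is a small Python sketch of proportional allocation for the toy example, using the stratum sizes from the text (9 black, 4 red, 4 blue, 3 green) and a sample size of 9. The member labels such as `black_1` and the seed are made up for illustration. Simple random sampling is used inside each stratum, which is one of the options mentioned in the note above.

```python
import random

# Stratum sizes from the toy example; the member labels are hypothetical.
strata = {
    "black": [f"black_{i}" for i in range(1, 10)],   # 9 people
    "red":   [f"red_{i}" for i in range(1, 5)],      # 4 people
    "blue":  [f"blue_{i}" for i in range(1, 5)],     # 4 people
    "green": [f"green_{i}" for i in range(1, 4)],    # 3 people
}
sample_size = 9
total = sum(len(members) for members in strata.values())   # 20

# Proportional allocation: stratum share of the population times the sample size,
# rounded to the nearest whole person.
allocation = {
    color: round(len(members) / total * sample_size)
    for color, members in strata.items()
}
print(allocation)   # {'black': 4, 'red': 2, 'blue': 2, 'green': 1}

# Simple random sampling within each stratum.
rng = random.Random(7)          # fixed seed only to make the run repeatable
sample = []
for color, members in strata.items():
    sample.extend(rng.sample(members, allocation[color]))
print(sample)
```

In this example the rounded numbers happen to add up to 9. With other stratum sizes, plain rounding can give a total that is one more or one less than the target sample size, so the allocation may need a small manual adjustment.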

Cluster Sampling:

Cluster sampling is often confused with stratified sampling, but the two techniques are different. The main difference is that in cluster sampling the population is already divided into natural groups (clusters), such as city blocks or school districts, and we sample whole clusters rather than drawing members from every group.

Let us consider our population again and suppose people in the first row live on 36th street and people in the second row live on 11th street. Each one of these is a cluster.

Now, we can choose one of these two clusters (which can be done by simple random sampling). Suppose we choose 11th street; we will then survey every person living on 11th street.

Note: We can choose as many clusters as we want.

Cluster sampling can be done in two ways:

Single-stage cluster sampling, where we randomly choose cluster(s) and survey each and every member of the chosen cluster(s).
Two-stage cluster sampling, where we first randomly choose cluster(s) and then randomly choose members from within those clusters.
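The two variants can be sketched in Python as follows. This is a rough illustration, not a definitive implementation: the split of 10 people per street is an assumption (the text only says "first row" and "second row"), and the function names `single_stage` and `two_stage` are made up for this example.

```python
import random

# Assumed split: 10 people per street, numbered 1..20.
clusters = {
    "36th street": list(range(1, 11)),
    "11th street": list(range(11, 21)),
}

def single_stage(clusters, n_clusters, rng):
    """Pick whole clusters at random and survey everyone in them."""
    picked = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in picked}

def two_stage(clusters, n_clusters, n_per_cluster, rng):
    """Pick clusters at random, then randomly sample members within each."""
    picked = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in picked}

rng = random.Random(3)          # fixed seed only to make the run repeatable
print(single_stage(clusters, 1, rng))     # one street, all 10 residents
print(two_stage(clusters, 1, 4, rng))     # one street, 4 random residents
```

Note that the randomness in single-stage sampling lies only in which clusters are picked; in two-stage sampling there is a second random draw inside each picked cluster.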

Systematic Sampling:

In this sampling technique, we select members systematically. In particular, we first put all the members of the population in a sequence and then choose members at regular intervals along it.

Let us consider our population of 20 people. Suppose we want to choose 5 people and our system is that we will start with the third person and then choose every fourth person after that, that is, persons 3, 7, 11, 15 and 19. We keep doing this until we get 5 people for our sample. (The tick marks denote the selected people.)

Note:

For every member to have an equal chance of selection, it is advised to choose the first (start) member at random.
Systematic sampling can lead to bias if the sequence has a repeating pattern that lines up with the sampling interval.
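A minimal Python sketch of the example above, assuming the people are numbered 1 to 20 in sequence. The function name `systematic_sample` and the seed are made up for illustration. The first call reproduces the "start at the third person, take every fourth" rule; the second picks the start at random from the first interval, as the note advises.

```python
import random

def systematic_sample(population, start, step, k):
    """Take every `step`-th member beginning at 1-based position `start`, up to k members."""
    return population[start - 1::step][:k]

population = list(range(1, 21))     # people numbered 1..20 (assumed labels)

# Fixed start from the text: third person, then every fourth person.
print(systematic_sample(population, start=3, step=4, k=5))   # [3, 7, 11, 15, 19]

# Random start: interval = population size // sample size,
# start drawn from the first interval.
k = 5
step = len(population) // k                  # 20 // 5 = 4
start = random.Random(11).randint(1, step)   # seed fixed only for repeatability
print(systematic_sample(population, start, step, k))
```

With 20 people and an interval of 4, every possible random start (1 to 4) yields exactly 5 people, and each person belongs to exactly one of the four possible samples.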

Convenience Sampling:

It is one of the easiest sampling techniques, but also one of the most dangerous, because the sample is selected based on availability. For example, surveying every person in your office, or studying every cat in your locality. Such samples are usually not representative of the population.

Note: Randomization should be used so that our sample represents the population well and leads us to results that are close to the true values for the population.

Simple random, cluster, stratified and systematic sampling are all probability sampling techniques and involve randomization. Convenience sampling, however, is a non-probability (or non-random) sampling technique, because the sample is picked by the researcher based on ease of access rather than by a random mechanism. Non-probability sampling techniques can lead to biased samples and results.

There are other sampling techniques too. For example, purposive, quota and referral/snowball sampling are all non-probability sampling techniques. Multistage sampling is a probability sampling technique. However, it is beyond the scope of this article to cover all the sampling techniques. 😐

Note: I have used people as the example here. For the sake of simplicity, I have colored different people with different colors but these colors do not represent anything.

I hope this article helped you in understanding the basic concept behind these sampling techniques. 😀

References:

This article is inspired by some of the great videos by Steve Mays on YouTube. 🙏 Also, check out this amazing channel named StatQuest.

Thank you, everyone, for reading this. Do share your valuable feedback or suggestions. Happy reading! 📗 🖌