The world’s leading publication for data science, AI, and ML professionals.

Probability Sampling Methods Explained with Python

Why do we need Sampling?

PYTHON FOR PROBABILITY AND STATISTICS

Four Types of Sampling Methods all Data Scientists Must Know

Photo by Churrasqueira Martins on Kindpng
Photo by Churrasqueira Martins on Kindpng

Sampling is used when we try to draw a conclusion without knowing the population. Population refers to the complete collection of observations we want to study, and a sample is a subset of the target population. Here’s an example. A Gallup poll¹, conducted between July 15 to 31 last year, found that 42% of Americans approve of the way Donald Trump is handling his job as president. The results were based on telephone interviews of a random sample of ~4500 calls (assuming one adult per call. ~4500 adults), aged 18 and older, living in the U.S. The poll was conducted during a period of controversy over Trump’s social media comments. For this survey, the population is ALL the U.S citizens aged 18 and older, and the sample is 4500 adults.

If sampling is done wrong, it will lead to biases that affect the accuracy of your research/survey results. To avoid selection biases, we have to carefully choose a subset of a population **** that can be representative of the group as a whole.

Types of Probability Sampling

Simple Random Sampling

Image by Author using Powerpoint
Image by Author using Powerpoint

Simple random sampling means we randomly select samples from the population where every unit has the same probability of being selected.

Pros: there’s no need to divide the population into subgroups or take any other additional steps before selecting members of the population at random.

Cons: the samples might not be representative, and it could be time-consuming for large populations.

Use Case: it’s used when we don’t know too much about the population.

#let's create a dataframe first!
import numpy as np
import Pandas as pd
from numpy.random import randn
# Define total number of customers
number_of_customers = 10
# Create data dictionary
data = {'customer_id':np.arange(1, number_of_customers+1).tolist(),
       'customer_life_time_value':randn(10)}
# Transform dictionary into a data frame
df = pd.DataFrame(data)
# View data frame
df
Created a data frame (Image by Author)
Created a data frame (Image by Author)
#only using random(), we can generate 4 samples from this dataset
# Obtain simple random sample
simple_random_sample = df.sample(n=4).sort_values(by='customer_id')
simple_random_sample
Randomly selected customers (Image by Author)
Randomly selected customers (Image by Author)

Stratified Sampling

Image by Author using Powerpoint
Image by Author using Powerpoint

For stratified Sampling the population is divided into subgroups (called strata), then randomly select samples from each stratum.

Pros: it captures key population characteristics, so the sample is more representative of the population.

Cons: it’s ineffective if subgroups cannot be formed.

Use Case: it’s commonly used in geographic sampling where strata can be states, countries, or ecoregions.

#Let's add subgroup labels to the dataset
df['strata']=[0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for x, y in sss.split(df, df['strata']):
    stratified_random_sample = df.iloc[y].sort_values(by='customer_id')
stratified_random_sample
Randomly selected customers from each stratum (Image by Author)
Randomly selected customers from each stratum (Image by Author)

Cluster Sampling

Image by Author using Powerpoint
Image by Author using Powerpoint

For clustering sampling, the population is divided into different clusters. Then a fixed number of clusters are randomly sampled and all units within each of the selected clusters are included in the sample.

Pros: it reduces variability, and it’s easy to conduct.

Cons: it is possible to introduce bias during sampling.

Use Case: it’s used when all individuals in each cluster can be representative of the populations.

#create 4 different clusters based on customers' lift time values
df['cluster'] = pd.cut(df['customer_life_time_value'], bins=4, labels=False) +1
Added "cluster" column (Image by Author)
Added "cluster" column (Image by Author)
# predefine which clusters/groups we want to select samples from
n=[2,4]
def clustering_sampling(df,n):
    df_list=[]
    for i in range(len(n)):
        df1=df[df['cluster']==n[i]]
        df_list.append(df1)
    final_df=pd.concat(df_list, ignore_index=True)

    return final_df
clustering_sampling(df,n)
Selected all customers from cluster 2 and 4 in the sample (Image by Author)
Selected all customers from cluster 2 and 4 in the sample (Image by Author)

Systematic Sampling

Select every other person from the population (Image by Author using Powerpoint)
Select every other person from the population (Image by Author using Powerpoint)

A systematic sample is drawn by selecting units systematically from a sample frame. (i.e every other unit is included in the sample)

Pros: it can eliminate clustered selection, and it’s simple to execute.

Cons: we need to predetermine the estimated population size. It doesn’t work well if the population has a type of standardized pattern.

Use Case: it’s used when the relevant data does not exhibit patterns.

def systematic_sampling(df, step):

    indexes = np.arange(0,len(df),step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample

systematic_sampling(df, 1)
Every other customer_id is included in the sample (Image by Author)
Every other customer_id is included in the sample (Image by Author)

Reference

  1. https://news.gallup.com/poll/262694/latest-trump-job-approval-rating.aspx
  2. https://www.investopedia.com/

Related Articles