Types of Samplings in PySpark 3

An explanation of the sampling techniques in Spark, with case-by-case implementation steps in PySpark

Pınar Ersoy
Towards Data Science


Sampling is the process of selecting a representative subgroup from a dataset for a specified case study. Sampling underpins crucial research and business decisions. For this reason, it is essential to use the most appropriate sampling method that the available technology provides. This article is aimed mainly at data scientists and data engineers who want to use the newest enhancements of Apache Spark in the sub-area of sampling.

Spark supports two sampling methods, sample() and sampleBy(), as detailed in the upcoming sections.

1. sample()

If sample() is used, simple random sampling is applied, and each element in the dataset has an equal chance of being selected. Rows are drawn from the dataset at random at the specified fraction rate, without any grouping or clustering based on a variable.

This method works with three parameters. The withReplacement parameter is set to False by default, so each element can be selected into the sample only once. If the value is changed to True, the same element may be selected again within the same sample. Because elements can then be drawn more than once, the row counts returned with withReplacement = True and withReplacement = False may differ slightly.

Another parameter, the fraction field, is required to be filled and, as stated in Spark’s official documentation, the result is not guaranteed to contain exactly the specified fraction of the rows.

If a number is assigned to the seed field, it can be thought of as a special id for that sampling: the same sample is selected every time the script is run. If this value is left as None, a different sample group is created on each run.

Below, I add an example that I coded in my local Jupyter Notebook with a Kaggle dataset.
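
To follow along, a minimal setup sketch is shown below. It only assumes a SparkSession and a CSV file on disk; the file name "sales.csv" is a placeholder rather than the exact Kaggle dataset used in the figures.

# Minimal setup sketch; "sales.csv" is a placeholder path for your own copy of the dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sampling-examples").getOrCreate()

# Read the dataset into a DataFrame; adjust the path and options to your file.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.printSchema()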

In the following example, the withReplacement value is set to True, the fraction parameter is set to 0.5, and the seed parameter is set to 1234, an id that the user can set to any number.

Figure 1. sample() method with “withReplacement = True” (Image by the author)
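
A sketch of roughly the call shown in Figure 1, assuming the DataFrame df from the setup above:

# Simple random sampling with replacement: a row may appear more than once.
sample_with_replacement = df.sample(withReplacement=True, fraction=0.5, seed=1234)
sample_with_replacement.show(5)
print(sample_with_replacement.count())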

In the following example, the withReplacement value is set to False, while the fraction parameter remains 0.5 and the seed parameter remains 1234.

Figure 2. sample() method with “withReplacement = False” (Image by the author)
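
And a sketch of the Figure 2 variant, again assuming the same df; comparing the two counts illustrates the slight difference mentioned above:

# Simple random sampling without replacement: each row appears at most once.
sample_without_replacement = df.sample(withReplacement=False, fraction=0.5, seed=1234)
sample_without_replacement.show(5)
print(sample_without_replacement.count())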

Below is a detailed explanation of the sample() method.

sample(withReplacement=None, fraction=None, seed=None)

This method returns a sampled subset of a DataFrame.

Parameters:

withReplacement — Whether to sample with replacement (default value is False). (Optional)

— withReplacement=True: The same element may appear more than once in the final result set of the sample.

— withReplacement=False: Every element of the data will be sampled at most once.

fraction — The fraction of rows to generate, in the range [0.0, 1.0]. (Required)

seed — The seed for sampling (default: a random seed). (Optional)

NOTE: The sample is not guaranteed to contain exactly the specified fraction of the total count of the given DataFrame.
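
A quick way to see this, assuming the df defined earlier: the sampled count is close to, but rarely exactly, half of the total.

# The returned row count only approximates fraction * total rows.
total_rows = df.count()
sampled_rows = df.sample(fraction=0.5).count()
print(total_rows, sampled_rows)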

2. sampleBy()

The other technique that can be used for sampling is sampleBy(). The methodology applied is stratified sampling: before sampling, the elements of the dataset are divided into homogeneous subgroups (strata), and the sample is drawn from these subgroups according to the percentages specified in the fractions parameter.

The first parameter, the col field, determines which variable defines the subgroups used in the sampling process.

For example, if location is written in this field, sampling is done on the basis of the location values. The percentage at which the values under location are included in the sample is set in the fractions field, another parameter. It is not mandatory to specify a fraction for every value; any value without a specified fraction is treated as 0 and is not included in the sample.

Figure 3. Distribution of the location feature in the dataset (Image by the author)
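
The distribution in Figure 3 can be reproduced with a plain aggregation, assuming the column is named location as in the figures:

# Inspect how the strata column is distributed before sampling.
df.groupBy("location").count().orderBy("count", ascending=False).show()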

In the example below, 50% of the elements whose location is CA, 30% of the elements with TX, and 20% of the elements with WI are selected. The seed field is assigned the id 1234, so the sample selected with that id is returned every time the script is run. If the seed value is left as None, a different sample is selected on each execution.

Figure 4. All values of the “location” variable are specified in the “fractions” parameter (Image by the author)
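
A sketch of the Figure 4 call, assuming a location column with the values CA, TX, and WI:

# Stratified sampling with a fraction specified for every stratum.
fractions = {"CA": 0.5, "TX": 0.3, "WI": 0.2}
stratified_sample = df.sampleBy("location", fractions=fractions, seed=1234)
stratified_sample.groupBy("location").count().show()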

In another example below, 60% of the elements whose location is CA and 20% of the elements with TX are selected; since no percentages are specified for the remaining values, they are not included in the final sample. Again, the seed field is assigned the id 1234, so the same sample is selected every time the script is run. If the seed value is left as None, a different sample is selected on each execution.

Figure 5. Only two of the values of the “location” variable are specified in the “fractions” (Image by the author)
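
A sketch of the Figure 5 call; strata left out of the dictionary (here WI) default to a fraction of 0 and are excluded from the result:

# Stratified sampling with fractions for only two strata.
partial_fractions = {"CA": 0.6, "TX": 0.2}
partial_sample = df.sampleBy("location", fractions=partial_fractions, seed=1234)
partial_sample.groupBy("location").count().show()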

sampleBy(col, fractions, seed=None)

This method returns a stratified sample without replacement, based on the fraction given for each stratum.

Parameters:

col — The column that defines the strata.

fractions — The sampling fraction for every stratum. If a stratum is not specified, its fraction is treated as zero.

seed — The random seed id. (Optional)

Questions and comments are highly appreciated!
