Sampling from a population of unknown size

Introduction to uniform sampling algorithms for streaming items

Andrea C.
Towards Data Science


Lottery with numbers
Photo by dylan nolte on Unsplash

A sampling task

Bob works in quality control at an electronic components factory. Among his tasks, every day he must inspect k different products in order to identify possible defects and to assess the overall smoothness of the production process. The products must be chosen randomly, following a uniform probability of extraction (i.e., uniform sampling).

However, the number N of products leaving the factory changes every day and cannot be set in advance. Moreover, Bob cannot wait for all the N products to be out: he must check them as they go out, so the sampling process must be online.

We need to sample streaming items — image by author

Clearly, Bob needs a strategy to effectively accomplish his task. We are talking about the problem of uniform sampling from a population of unknown size.

Definition of the problem

To solve the sampling problem, we need an algorithm able to randomly sample k objects without replacement, making a single pass over a population whose size n is unknown a priori.

As additional requirements, we will also assume that:

  • n is a much larger number than k (n ≫ k) and therefore it is impossible to store all n items in memory and perform the extraction on them;
  • the algorithm cannot extract items that have already passed.

In other words, the algorithm processes one item at a time, sequentially, and decides whether to extract it or not.

The algorithm

Such an algorithm exists and it consists of the following steps:

  1. Initially, Bob extracts all of the first k items (remember that n ≫ k);
  2. then, for every n > k, he extracts the n-th product with a probability equal to k/n;
  3. when a new item is added to the sample, another item, randomly chosen from the current sample, is discarded to make room for it.

Following this process, at each iteration the extracted sample is uniformly distributed over the subsets of size k of the items seen so far (i.e., of the integers between 1 and n).
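A quick consequence worth checking before the formal proof: every single item ends up in the final sample with the same probability k/n. An item i > k must be extracted when it arrives (probability k/i) and must then survive every subsequent step j+1, where it is replaced with probability (k/(j+1))·(1/k) = 1/(j+1). Therefore

P(\text{item } i \in \text{final sample}) = \frac{k}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{n-1}{n} = \frac{k}{n},

and the same value k/n is obtained for the first k items, which enter the sample for free and must only survive the steps from k+1 to n.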

Proof of the algorithm

Writing on a black sheet
Photo by Aaron Burden on Unsplash

Interestingly, we can think of the algorithm as a Markov chain. Following this approach, a state at step n is the sample containing the items collected so far. When a new item is processed, a state transition occurs, whether or not the new item is added to the sample.

Sample at step n is a subset of natural numbers between 1 and n — image by author

Based on how our algorithm works, the transition probabilities for n > k are given by:

Transition probabilities after step k — image by author
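In formulas, these transitions read as follows: for a current sample S_n and an incoming item n+1,

P(S_{n+1} = S_n) = 1 - \frac{k}{n+1}, \qquad P\left(S_{n+1} = (S_n \setminus \{j\}) \cup \{n+1\}\right) = \frac{k}{n+1} \cdot \frac{1}{k} = \frac{1}{n+1} \quad \text{for each } j \in S_n,

i.e. either the sample stays the same, or one of its k elements is replaced by the new item.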

Now we need to prove that at each step n ≥ k all states have the same probability of being the sample. We can prove it by induction.

Thus, let’s suppose that at step n ≥ k all the possible samples S_n (i.e., all the subsets of size k of {1, …, n}) have equal chances of being the sample. The probability of each state S of being the sample at step n is therefore:

Probability of a subset to be the sample at a given step — image by author
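In formulas, the induction hypothesis states that every subset S of size k of \{1, \dots, n\} satisfies

P(S_n = S) = \binom{n}{k}^{-1},

since there are \binom{n}{k} such subsets and they are all equally likely.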

At step n+1, the new sample S can be the same as before, or it can include the new item. In the first case, we have a probability of:

Probability that new sample is the same as before — image by author
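Written out, combining the induction hypothesis with the probability that the new item is not extracted, this is

P(S_{n+1} = S) = \binom{n}{k}^{-1} \left(1 - \frac{k}{n+1}\right) = \frac{n+1-k}{(n+1)\binom{n}{k}} = \binom{n+1}{k}^{-1}.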

In the second case, the probability of obtaining a given new sample that includes the new item is given by:

Probability that new sample includes a new item — image by author

The factor 1/(n+1) is obtained as the product of k/(n+1) (the probability that the new item is extracted) and 1/k (the probability that a given sample element is the one replaced by the new item).

The factor n-k+1 accounts for all the possible previous samples from which the new sample can be derived: each of them contains the k-1 items that are kept, plus one of the n-(k-1) = n-k+1 remaining items, which is the element replaced by the new item n+1.
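Putting the two factors together with the induction hypothesis, the probability of a new sample S' containing the item n+1 is

P(S_{n+1} = S') = (n-k+1) \cdot \binom{n}{k}^{-1} \cdot \frac{1}{n+1} = \frac{n-k+1}{(n+1)\binom{n}{k}} = \binom{n+1}{k}^{-1},

which is exactly the same value found in the first case.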

Each new sample can be derived from multiple previous samples — image by author

So, at each step, every possible new sample, whether it coincides with the previous one or includes the newest item, has the same probability of being chosen: uniformity is preserved and the induction step is proved.

As the induction basis we take the sample at n=k which, according to our probability function, has probability 1 of being chosen (and this is correct, because there is only one possible subset of size k at this step). Q.E.D.

A basic implementation: algorithm R

Gears
Photo by Laura Ockel on Unsplash

With this definition in mind, the algorithm can be coded very easily, as in the following Python snippet.

import numpy as np

# Sampling parameters.
n_items = 1000
sample_size = 10

sample = np.zeros(sample_size, dtype=int)
steps = np.arange(n_items)  # item indices

# Initial sample: the first k items enter the reservoir.
for nn in range(sample_size):
    sample[nn] = nn

# Process the remaining items one at a time.
for nn in range(sample_size, n_items):
    # Draw a random index in [0, nn]: item nn is kept with probability k/(nn+1).
    ii = np.random.randint(0, nn + 1)
    if ii < sample_size:
        # The new item replaces the element at position ii.
        sample[ii] = nn

The picture shows the results obtained for n=1000 and k=10. We can see the elements of the sample randomly migrating from one index to another as the items are processed, although some of the earliest ones can persist over time.

Running the algorithm — image by author

The implementation discussed above is commonly known as Algorithm R, initially proposed by A. Waterman and discussed by J. Vitter in [3]. Since each item requires a computation, whether it is kept or not, the complexity of this algorithm is O(n).

An optimized variant: algorithm L

Algorithm R belongs to the family of reservoir sampling algorithms, which are designed to choose a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items.

An optimized variant of algorithm R is given by the so-called algorithm L. Discussed in [4], this variant computes how many items are discarded before the next item enters the reservoir: this is possible because the number of discarded items follows a geometric distribution and can therefore be drawn ahead of time. As a result, the complexity of this algorithm decreases to O(k(log(n/k)+1)).
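As an illustration only (a minimal sketch, not the exact procedure from [4]), the jump logic can be written in Python roughly as follows; the population is assumed to arrive as an iterable, and the function name algorithm_l and its interface are made up for the example.

import numpy as np

def algorithm_l(stream, k, rng=None):
    # Reservoir sampling with geometric jumps (a sketch of algorithm L).
    rng = rng if rng is not None else np.random.default_rng()
    stream = iter(stream)
    # The first k items fill the reservoir.
    reservoir = [next(stream) for _ in range(k)]
    w = np.exp(np.log(rng.random()) / k)
    while True:
        # Number of items to skip before the next replacement.
        n_skip = int(np.floor(np.log(rng.random()) / np.log(1.0 - w)))
        try:
            for _ in range(n_skip):
                next(stream)         # skipped items are simply discarded
            item = next(stream)      # this item enters the reservoir
        except StopIteration:
            break
        reservoir[rng.integers(k)] = item
        w *= np.exp(np.log(rng.random()) / k)
    return reservoir

Only the items that actually enter the reservoir require random draws; everything in between is passed over in a single jump, which is where the reduced complexity comes from.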

Other variants: random sampling with weights

Antique scale
Photo by Piret Ilver on Unsplash

There exist other variants that deal with weighted items. In this case, a weight is associated with each item. For example, one might need to sample posts on Twitter whose weights are based on the number of views (or interactions) received.

We can use the weights to define two different selection probabilities at each step:

  1. the probability of an unselected item being selected is proportional to the ratio between its weight and the sum of the weights of all unselected items;
  2. the probability of any item being selected is proportional to the ratio between its weight and the sum of all weights.

Depending on which definition is used, several algorithms have been developed, with different degrees of efficiency. Just to mention some of them: algorithms A-Res and A-ExpJ follow definition no. 1, while A-Chao and A-Chao with Jumps follow definition no. 2.
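As an example of definition no. 1, here is a minimal sketch of A-Res: each item receives a random key u^(1/w), where u is uniform in (0, 1) and w is the item weight, and the k items with the largest keys form the sample. The (item, weight) stream format and the example data below are illustrative assumptions.

import heapq
import random

def a_res(stream, k):
    # Weighted reservoir sampling (A-Res): keep the k items with the
    # largest keys u ** (1 / w), where u is uniform in (0, 1).
    heap = []  # min-heap of (key, item) pairs
    for item, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Hypothetical example: posts weighted by their number of views.
posts = [("post_a", 120), ("post_b", 15), ("post_c", 530), ("post_d", 42)]
print(a_res(posts, k=2))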

For an extensive review on this matter, especially from a programmatic point of view, I recommend this blog by Startin.

As always, the code used for this article is available here: https://github.com/andrea-ci/misc-stuff.
