Sampling from a population of unknown size
Introduction to uniform sampling algorithms for streaming items
A sampling task
Bob works in quality control at an electronic components factory. Among his tasks, every day he must inspect k different products in order to identify possible defects and to assess the overall smoothness of the production process. The products must be chosen randomly and following a uniform probability of extraction (i.e., uniform sampling).
However, the number N of products leaving the factory changes every day and cannot be set in advance. Moreover, Bob cannot wait for all the N products to be out: he must check them as they go out, so that the sampling process must be online.
Clearly, Bob needs a strategy to effectively accomplish his task. We are talking about the problem of uniform sampling from a population of unknown size.
Definition of the problem
To solve the sampling problem we need an algorithm able to randomly sample k objects without replacement, by making a single pass over a population of size n unknown a priori.
As additional requirements, we will also assume that:
- n is a much larger number than k (n ≫ k) and therefore it is impossible to store all n items in memory and perform the extraction on them;
- the algorithm cannot extract items that have already passed.
In other words, the algorithm processes one item at a time, sequentially, and decides whether to extract it or not.
The algorithm
Such an algorithm exists and it consists of the following steps:
- Initially, Bob extracts all of the first k items (remember that n ≫ k);
- then, for i > k, he extracts the i-th product with a probability equal to k/i;
- when a new item is added to the sample, another item, randomly chosen from the current sample, is discarded to make room for it.
Following this process, the extracted sample is, at each iteration, a uniform random subset of the items seen so far: after n items have passed, every one of them has the same probability k/n of being in the sample.
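This uniformity claim can be checked empirically. The sketch below (a minimal, self-contained version of the procedure just described; the function name is mine) estimates how often each item ends up in the sample over many runs:

```python
import random

def reservoir_sample(stream, k, rng):
    """Keep the first k items; then keep the i-th item (1-based) with probability k/i."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            sample.append(item)
        else:
            j = rng.randrange(i)  # uniform over 0..i-1, so P(j < k) = k/i
            if j < k:
                sample[j] = item  # replace a randomly chosen slot
    return sample

# Each of the n items should be included with probability k/n.
n, k, trials = 20, 5, 20000
rng = random.Random(0)
counts = [0] * n
for _ in range(trials):
    for item in reservoir_sample(range(n), k, rng):
        counts[item] += 1
freqs = [c / trials for c in counts]  # each value should be close to k/n = 0.25
```

With 20,000 runs, every estimated inclusion frequency lands close to 0.25, regardless of an item's position in the stream.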
Proof of the algorithm
Interestingly, we can think of the algorithm as a Markov chain. In this view, the state at step n is the sample containing the items collected so far; when a new item is processed, a state transition occurs if the item is added to the sample, while the chain stays in the same state otherwise.
Based on how our algorithm works, the transition probabilities for n ≥ k are given by: the (n+1)-th item enters the sample with probability k/(n+1), replacing an element chosen uniformly at random among the k currently stored; with the remaining probability (n+1-k)/(n+1), the sample is left unchanged.
Now we need to prove that at each step n ≥ k all states have the same probability of being the sample. We can prove it by induction.
Thus, let's suppose that at step n ≥ k all k-subsets of the first n items have equal chances of being the sample. Since there are C(n, k) = n!/(k!(n-k)!) such subsets, the probability of each state S being the sample at step n is therefore:
P(S) = 1/C(n, k)
At step n+1, the new sample S can be the same as before, or it can include the new item. In the first case, S remains the sample only if the new item is not extracted, so we have a probability of:
P(S) = 1/C(n, k) × (n+1-k)/(n+1) = 1/C(n+1, k)
In the second case, the probability of having a new sample S that includes item n+1 is given by:
P(S) = (n-k+1) × 1/C(n, k) × 1/(n+1) = 1/C(n+1, k)
The factor 1/(n+1) is obtained as the product of k/(n+1) (the probability that the new item is extracted) and 1/k (the probability that a given sample element is replaced by the new item).
The factor n-k+1 accounts for all the possible predecessor samples: at step n, the sample must contain the k-1 old items of S plus one of the n-(k-1) = n-k+1 remaining items, which is then replaced by the new item n+1.
So in both cases the resulting probability is the same, 1/C(n+1, k): at step n+1 every possible sample again has the same probability of being the one chosen, and the induction step holds.
As induction basis we take the step n = k where, according to our probability function, the sample has probability 1/C(k, k) = 1 of being chosen (and this is right, because the only possible sample is the set of the first k items). Q.E.D.
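The conclusion of the proof can also be verified numerically: with a small population we can enumerate every possible sample and check that each one occurs with frequency close to 1/C(n, k). The snippet below is a quick sketch of such a check (the helper reimplements the algorithm described above):

```python
import random
from collections import Counter
from math import comb

def reservoir_sample(stream, k, rng):
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            sample.append(item)
        else:
            j = rng.randrange(i)  # keep the i-th item with probability k/i
            if j < k:
                sample[j] = item
    return sample

# Each of the C(6, 2) = 15 possible 2-subsets should occur with frequency ~1/15.
n, k, trials = 6, 2, 30000
rng = random.Random(42)
counts = Counter(frozenset(reservoir_sample(range(n), k, rng)) for _ in range(trials))
```

All 15 subsets show up, each with a frequency close to 1/15 ≈ 0.067, exactly as the induction argument predicts.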
A basic implementation: algorithm R
With its definition in mind, a working algorithm can be coded very easily as in the following Python snippet.
import numpy as np

# Sampling parameters.
n_items = 1000
sample_size = 10

sample = np.zeros(sample_size, dtype=int)
steps = np.arange(n_items)

# Initial sample: extract the first k items.
for nn in range(sample_size):
    sample[nn] = nn

# Process the remaining items one at a time.
for nn in range(sample_size, n_items):
    # Keep the current item with probability k/(nn + 1).
    ii = np.random.randint(0, nn + 1)
    if ii < sample_size:
        sample[ii] = nn
The picture shows the results obtained for n = 1000 and k = 10. We can see the elements of the sample randomly migrating from one index to another while the items are processed, although some of the earliest ones can persist over time.
The implementation discussed above is commonly known as Algorithm R, initially proposed by A. Waterman and discussed by J. Vitter in [3]. Since each item requires a computation, whether it is kept or not, the complexity of this algorithm is O(n).
An optimized variant: algorithm L
Algorithm R belongs to the family of reservoir sampling, that includes algorithms designed to choose a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items.
An optimized variant of algorithm R is given by the so-called algorithm L. Discussed in [4], this variant computes how many items are discarded before the next item enters the reservoir: this is made possible because the number of discarded items is distributed according to a geometric distribution and can therefore be computed ahead-of-time. As a result, the complexity of this algorithm decreases to O(k(log(n/k)+1)).
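As a rough illustration of the idea (a sketch under my own naming and structure, not code from [4]), a streaming version of algorithm L could look like this:

```python
import math
import random

def reservoir_sample_l(stream, k, rng):
    """Algorithm L: uniform sample of k items from an iterable of unknown length.
    Instead of testing every item, it draws geometric skip lengths between acceptances."""
    it = iter(stream)
    reservoir = []
    for item in it:
        reservoir.append(item)
        if len(reservoir) == k:
            break
    if len(reservoir) < k:
        return reservoir  # the stream had fewer than k items in total
    w = math.exp(math.log(rng.random()) / k)
    skip = math.floor(math.log(rng.random()) / math.log(1.0 - w))
    for item in it:
        if skip > 0:
            skip -= 1  # discard this item without drawing any random number for it
            continue
        reservoir[rng.randrange(k)] = item  # accepted item replaces a random slot
        w *= math.exp(math.log(rng.random()) / k)
        skip = math.floor(math.log(rng.random()) / math.log(1.0 - w))
    return reservoir

# Example: draw 10 items uniformly from a stream of 1000.
rng = random.Random(1)
sample = reservoir_sample_l(range(1000), 10, rng)
```

Random numbers are generated only when an item is accepted, which is where the O(k(log(n/k)+1)) expected complexity comes from.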
Other variants: random sampling with weights
There exist other variants that deal with weighted items. In this case, a weight is associated with each item. For example, one might need to sample posts on Twitter whose weights are based on the number of views (or interactions) received.
The weights can be used to define the selection probability of each step in two different ways:
- the probability of an unselected item to be selected is proportional to the ratio between its weight and the sum of the weights of all unselected items;
- the probability of any item to be selected is proportional to the ratio between its weight and the sum of all weights.
Depending on which definition is used, several algorithms have been developed, with different degrees of efficiency. Just to mention some of them: algorithms A-Res and A-ExpJ for definition no. 1, and A-Chao and A-Chao with Jumps for definition no. 2.
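As an example of definition no. 1, A-Res (due to Efraimidis and Spirakis) can be sketched in a few lines: each item receives the key u^(1/w), with u uniform in (0, 1), and the k items with the largest keys form the sample. The function name below is mine; this is a sketch, not a reference implementation:

```python
import heapq
import random

def a_res(weighted_stream, k, rng):
    """A-Res: weighted reservoir sampling without replacement.
    `weighted_stream` yields (item, weight) pairs with weight > 0."""
    heap = []  # min-heap of (key, item): the smallest key is evicted first
    for item, weight in weighted_stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Example: 10 items drawn from 100 equally weighted items reduces to uniform sampling.
rng = random.Random(7)
sample = a_res([(i, 1.0) for i in range(100)], 10, rng)
```

With all weights equal this behaves like plain reservoir sampling; a heavier item gets a key closer to 1 and is therefore more likely to survive in the heap.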
For an extensive review on this matter, especially from a programmatic point of view, I recommend the blog post by Richard Startin [5].
As always, the code used for this article is available here: https://github.com/andrea-ci/misc-stuff.
References
[1] Wikipedia, Reservoir Sampling
[2] StackExchange, Algorithm for sampling fixed number of samples from a finite population
[3] Vitter, Jeffrey S., “Random sampling with a reservoir”, 1985
[4] Li, Kim-Hung, “Reservoir-Sampling Algorithms of Time Complexity O(n(1+log(N/n)))”, 1994
[5] Richard Startin, Reservoir Sampling