
Mathematics is full of fascinating puzzles and paradoxes that challenge our intuition. The coupon collector problem is one such puzzle, an age-old problem that has intrigued mathematicians and statisticians for generations. The coupon collector problem is as follows:
Consider a contest involving collecting n unique coupons. Every cereal box contains exactly one coupon, and every coupon is equally likely to be present in a given box. How many boxes do you need to open before you have at least one coupon of each type? This seemingly simple question has a surprisingly complex answer, with applications in fields ranging from computer science to finance. In this article, we will delve into the Coupon Collector’s Problem, explore its framework, and discuss its many fascinating implications.
Digression: Geometric Random Variables
Before solving this problem, let’s revisit some properties of the geometric random variable. This will give us an efficient mechanism for computing the various quantities that describe the solution to the problem. Definition: A random variable X is said to follow a geometric distribution with parameter p if its probability mass function is given as follows:
P(X = k) = (1 − p)^(k−1) · p,  for k = 1, 2, 3, …
Intuitively, a geometric random variable gives the distribution of the number of independent trials required to achieve success, under the assumption that the probability of success remains constant i.e., p. So, finding P(X = k) is equivalent to finding the probability that k trials are required to achieve a success i.e., there are k – 1 failures before the first success. The required probability is thus:
P(X = k) = P(k − 1 failures, then one success) = (1 − p)^(k−1) · p
Geometric random variables have many applications in real-world problems, such as modeling the number of attempts needed to win a game or the number of calls required to reach a customer at a call center. The Coupon Collector’s Problem can also be modeled effectively using geometric random variables (specifically, a sum of independent, non-identically distributed geometric variables, as we’ll see later). A geometric random variable has the following properties:
E(X) = 1/p,  Var(X) = (1 − p)/p²
Optional Proof: First, we will determine the expected value of a geometric random variable. Recall that the expected value of a random variable is like a weighted average of its values, with the weights given by its probability mass function. It can be shown that the expected value of a geometric random variable is the reciprocal of its parameter p i.e., the probability of success. The proof is as follows:
E(X) = Σ_{i=1}^{∞} i · P(X = i) = Σ_{i=1}^{∞} i · (1 − p)^(i−1) · p
To evaluate the above expression, we use a small trick. It is known that p is a continuous parameter with values bounded between 0 and 1 (as it’s a measure of probability). Finding the sum of an infinite series can become very easy if we can model it as a geometric series with a constant common ratio r. Then,
Σ_{i=0}^{∞} r^i = 1 / (1 − r),  for |r| < 1
We need to find a way to convert the given infinite series sum into a geometric series that can be easily evaluated using the above expression. To do so, we recall that the derivative of x^i is i*x^(i−1), something very similar to the form above, provided we replace x with 1 – p. Thus, we have:
E(X) = p · Σ_{i=1}^{∞} i · x^(i−1) |_{x = 1−p} = p · Σ_{i=1}^{∞} d/dx (x^i) |_{x = 1−p}
We can take the derivative outside the sum as the sum of derivatives is equal to the derivative of the sum. Thus,
E(X) = p · d/dx ( Σ_{i=1}^{∞} x^i ) |_{x = 1−p}
We may now use the infinite geometric series sum to obtain:
E(X) = p · d/dx ( x / (1 − x) ) |_{x = 1−p} = p · 1 / (1 − x)² |_{x = 1−p} = p / p² = 1/p
The above result is standard in most probability textbooks. For this article, understanding how to use the result is more important than understanding how it is derived. Thus, we’ve shown that the expected value of a geometric random variable is the reciprocal of its parameter p. Now, we proceed to find the variance of the geometric random variable. The math is slightly more involved; you may skip it if you’re only interested in solving the problem at hand.
Calculating the variance of the geometric distribution: We use the following formula for variance:
Var(X) = E(X²) − [E(X)]² = E(X(X − 1)) + E(X) − [E(X)]²
We now proceed to calculate E(X(X − 1)):
E(X(X − 1)) = Σ_{i=1}^{∞} i (i − 1) · (1 − p)^(i−1) · p
Just as before, we observe that the second derivative of x^i with respect to x is i(i − 1)x^(i−2). Thus:
E(X(X − 1)) = p (1 − p) · Σ_{i=2}^{∞} i (i − 1) x^(i−2) |_{x = 1−p} = p (1 − p) · d²/dx² ( Σ_{i=0}^{∞} x^i ) |_{x = 1−p} = p (1 − p) · 2 / (1 − x)³ |_{x = 1−p} = 2 (1 − p) / p²
Thus,
Var(X) = E(X(X − 1)) + E(X) − [E(X)]² = 2(1 − p)/p² + 1/p − 1/p² = (1 − p)/p²
Therefore, we have obtained the variance and expectation of the geometric distribution.
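Both results are easy to sanity-check numerically. The sketch below (an illustration of mine, not part of the original derivation; the helper name `sample_geometric` is a placeholder) samples a geometric random variable by counting Bernoulli trials until the first success, then compares the empirical mean and variance with 1/p and (1 − p)/p²:

```python
import random

def sample_geometric(p, rng):
    """Count Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:
        trials += 1
    return trials

rng = random.Random(42)
p = 0.25
samples = [sample_geometric(p, rng) for _ in range(200_000)]

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

print(mean)  # should be close to 1/p = 4
print(var)   # should be close to (1 - p)/p^2 = 12
```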
Back to the Problem: Modeling
To solve the problem at hand, we need to first model it, which involves identifying its parameters, assumptions, and relevant variables.
- PARAMETERS: The problem provides us with one parameter, which is n, the total number of unique coupons.
- ASSUMPTIONS: We assume that each coupon has an equal probability of being present in a given box. Another subtle assumption is that we have infinite boxes to open i.e., there is no restriction on how many boxes we can open.
- VARIABLES: To find the solution, we need to determine the number of boxes that must be opened before we have at least one coupon of each type. This requires defining variables that capture the elements necessary for the solution. The nature of these variables depends on the field from which the problem derives its essence; since the Coupon Collector’s Problem is rooted in probability and statistics, the variables will be stochastic, i.e., random. Thus, we define the random variable T as the minimum number of boxes that need to be opened to collect all n unique coupons. We use a random variable because T is not guaranteed to take any particular value; in practice, T could be anything from n to infinity. The only questions we can answer are:
  - What is the expected (or average) value of T?
  - What is the variance (or average spread in the value) of T?
  - What is the probability that more than c boxes need to be opened? In other words, what is the probability that T > c?
After modeling the various elements of the problem, we have a clear understanding of its settings, scope, and expectation. We now proceed to work on its solution.
Coming up with a Solution
Let’s start by analyzing the distribution of T. However, one will soon observe that the distribution of T is rather complex, and cannot be ascribed to any of the well-known probability distributions. This is because even though, by assumption, each coupon has an equal probability of being present in a given box, the probability of finding a new coupon decays as we collect more and more coupons. Consider the scenario where we are required to collect 100 unique coupons. When we open the first box, we are guaranteed to find a new coupon since we don’t have any coupons yet i.e.,
P(new coupon in the 1st box) = 100/100 = 1
However, when we open the second box, the probability of finding a new coupon is slightly reduced as there is a chance (1/100) that we might find the same coupon that we found in the first box.
P(new coupon in the 2nd box) = 99/100
As we continue opening boxes, the probability of finding a new coupon keeps decreasing (weakly: it drops each time a new coupon is collected), since fewer unique coupons remain to collect. This non-constant success probability makes it rather difficult to assign a common probability distribution to T. Thus, we resort to the method of divide and conquer, a very popular strategy for probability problems: break the problem into smaller chunks of reduced complexity and approach each chunk independently. Since T measures the time, i.e., the number of boxes that need to be opened to collect n unique coupons, we break it down by proposing Tᵢ to measure the number of boxes that need to be opened to collect the ith unique coupon (after i − 1 unique coupons have already been collected). We can express T as the sum of all the Tᵢ:
T = T₁ + T₂ + … + Tₙ
What’s the benefit? Well, the distribution of Tᵢ (unlike the distribution of T) is parameterized by a constant probability value. Let’s understand what this means.
We can think of each box opening as a trial in a sequence of Bernoulli trials, where success means finding a new, previously unseen coupon. (Recall that Bernoulli trials are a type of random experiment with only two possible outcomes: success and failure. For example, flipping a coin is a Bernoulli trial, where success might be defined as getting heads, and failure as getting tails.) At the start, we have not collected any coupons, so the probability of success on the first trial is n/n, or simply 1.0:
p₁ = n/n = 1
On the second trial, there are n-1 unique coupons left to collect, out of a total of n coupons, so the probability of success is (n-1)/n. This is because there is only one coupon that would result in failure (i.e., a repeat of the coupon collected in the first trial), and n-1 coupons that would result in success (i.e., a new coupon that has not yet been collected).
p₂ = (n − 1)/n
On the third trial, there are n-2 unique coupons left to collect, out of a total of n coupons, so the probability of success is (n-2)/n.
p₃ = (n − 2)/n
Similarly, on the fourth trial, the probability of success is (n-3)/n, and so on. We can extrapolate the formula to find the probability of collecting the ith unique coupon (i.e., the probability of finding a unique coupon on opening a box given that we have already collected i – 1 unique coupons):
pᵢ = (n − (i − 1))/n = (n − i + 1)/n
Recall that Tᵢ measures the number of independent trials to collect the ith unique coupon. This sounds familiar; yes it’s the geometric random variable! Each of these trials corresponds to a Bernoulli trial where success means finding a new, previously unseen coupon given that i – 1 unique coupons have been collected. Thus,
Tᵢ ~ Geometric(pᵢ),  with pᵢ = (n − i + 1)/n
Therefore, we can see T as a sum of non-identical geometric distributions:
T = Σ_{i=1}^{n} Tᵢ,  Tᵢ ~ Geometric((n − i + 1)/n)
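To make the decomposition concrete, here is a small simulation sketch (the helper name `sample_T` is mine, not from the article) that draws one realization of T as a sum of n geometric random variables with success probabilities (n − i + 1)/n:

```python
import random

def sample_T(n, rng):
    """Draw T = T_1 + ... + T_n, with T_i ~ Geometric((n - i + 1)/n)."""
    total = 0
    for i in range(1, n + 1):
        p_i = (n - i + 1) / n
        trials = 1                  # count trials up to the first success
        while rng.random() >= p_i:
            trials += 1
        total += trials
    return total

rng = random.Random(0)
print(sample_T(100, rng))  # one realization of T; always at least n = 100
```

Repeated calls give different values of T, reflecting the randomness of the collection process.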
We can now proceed to answer the questions proposed earlier.
What is the expected value of T?
To find the expected value of T, we use the property that the expected value of a sum of random variables is the sum of their expected values:
E(T) = E(T₁ + T₂ + … + Tₙ) = Σ_{i=1}^{n} E(Tᵢ)
Since Tᵢ is a geometric random variable:
E(Tᵢ) = 1/pᵢ = n / (n − i + 1)
Thus,
E(T) = Σ_{i=1}^{n} n / (n − i + 1) = n · (1/1 + 1/2 + … + 1/n) = n · H(n)
where H(n) refers to the nth harmonic number:
H(n) = Σ_{k=1}^{n} 1/k = 1 + 1/2 + 1/3 + … + 1/n
It can be asymptotically approximated (for large values of n) as:
H(n) ≈ ln(n) + γ + 1/(2n),  so  E(T) = n · H(n) ≈ n ln(n) + γn + 1/2
where γ ≈ 0.577215665 is the Euler–Mascheroni constant. We can think of it as an estimate of the average number of boxes one would need to open to collect all n coupons if one repeated the coupon collection process many times. For example, suppose you wanted to collect all 100 unique coupons from a set of promotional cereal boxes, and you plan to buy one box at a time until you have collected them all. The formula n * H(n) would estimate the average number of boxes you would need to buy to complete your collection, assuming that each box has an equal chance of containing any of the 100 coupons. The following Python code calculates this value, both exactly and via the asymptotic approximation:
```python
import math

def coupon_collectors(n):
    """Exact expected number of boxes: n * H(n)."""
    res = 0
    for i in range(1, n + 1):
        res += 1 / i
    return res * n

def coupon_collectors_approx(n):
    """Asymptotic approximation: n*ln(n) + gamma*n + 1/2."""
    return n * math.log(n) + 0.5 + n * 0.577215665

print(coupon_collectors(100))
# 518.737751763962
print(coupon_collectors_approx(100))
# 518.7385850988092
```
Of course, in reality, the actual number of boxes you would need to buy to collect all 100 coupons would vary from trial to trial, depending on the specific coupons you get in each box. However, the formula n * H(n) gives us an idea of what to expect on average, and it can be a useful tool for planning and budgeting.
For larger values of n, the formula predicts that more boxes will need to be opened to collect all n unique coupons. This is because the probability of finding a new coupon decreases as the number of coupons already collected increases. The harmonic number H(n) grows logarithmically with n, so the expected number of boxes grows roughly proportional to n ln n. This implies that collecting a large number of unique coupons can be a challenging and time-consuming task, which matches our intuition.
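As a sanity check, we can also simulate the collection process directly and compare the empirical average with n · H(n). This is an illustrative sketch (the helper name `collect_all` is mine, not from the article):

```python
import random

def collect_all(n, rng):
    """Open boxes, each holding a uniformly random coupon, until all n types appear."""
    seen = set()
    boxes = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        boxes += 1
    return boxes

rng = random.Random(1)
n, trials = 100, 5000
avg = sum(collect_all(n, rng) for _ in range(trials)) / trials
print(avg)  # should land near n * H(n) ≈ 518.74
```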
![Line plot of E(T) against n [Image by Author]](https://towardsdatascience.com/wp-content/uploads/2023/03/11kyTxSLU459oIkZ-HvHjKw.png)
What is the variance of T?
Next, we calculate the variance of T to get an idea of how the number of boxes we need to open varies from trial to trial. Since all trials are independent (by the assumption that each coupon is equally likely to be present in a given box), the random variables Tᵢ and Tⱼ (i ≠ j) are also independent, so the variance of their sum is the sum of their variances. Thus,
Var(T) = Var(Σ_{i=1}^{n} Tᵢ) = Σ_{i=1}^{n} Var(Tᵢ)
Since Tᵢ is a geometric random variable:
Var(Tᵢ) = (1 − pᵢ)/pᵢ² = (i − 1) n / (n − i + 1)²
Thus,
Var(T) = Σ_{i=1}^{n} (i − 1) n / (n − i + 1)² = n² Σ_{k=1}^{n} 1/k² − n H(n) ≈ n² · π²/6  (for large n)
where we have used Euler’s solution to the Basel problem:
Σ_{k=1}^{∞} 1/k² = π²/6
For example, suppose you wanted to collect all 100 unique coupons, and you repeated the process many times. The variance would give you an estimate of how much the number of boxes required varies from trial to trial. If the variance is small, then you can expect to need a similar number of boxes in each trial. If the variance is large, then the number of boxes required could vary widely from trial to trial.
```python
def coupon_var(n):
    """Asymptotic variance of T: n^2 * pi^2 / 6 (for large n)."""
    return n**2 * (math.pi**2) / 6

print(coupon_var(100))
# 16449.340668482262
```
The variance is proportional to n^2, so for larger values of n, the variance grows much faster than the expected number of boxes required. This implies that as n gets larger, the number of boxes required to collect all n coupons becomes more and more unpredictable, and it becomes increasingly difficult to estimate how many boxes you will need to buy on average.
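We can check this against simulation as well. The sketch below (illustrative, with an assumed helper name `collect_all`) estimates the variance empirically; note that for n = 100 the exact variance, n²·Σ_{k=1}^{n} 1/k² − n·H(n), sits a few percent below the asymptotic value n²π²/6:

```python
import math
import random

def collect_all(n, rng):
    """Open uniformly random boxes until all n coupon types have been seen."""
    seen = set()
    boxes = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        boxes += 1
    return boxes

rng = random.Random(7)
n, trials = 100, 10_000
samples = [collect_all(n, rng) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((t - mean) ** 2 for t in samples) / trials
print(var)                      # empirical variance, close to the exact (finite-n) value
print(n**2 * math.pi**2 / 6)    # asymptotic value, a few percent above for n = 100
```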
![Line plot of Var(T) against n: Comparison of orders of growth of E(T) and Var(T) [Image by Author]](https://towardsdatascience.com/wp-content/uploads/2023/03/1hKeq8_so9EOy6ikvFRCLvg.png)
What is the probability that more than c boxes need to be opened?
Finally, we try to calculate (or at least bound) the probability that the number of boxes that need to be opened to collect all n unique coupons exceeds (or equals) c, i.e., the probability that T takes a value greater than or equal to c. Since the distribution of T is rather complex, it’s difficult to find the exact value of such a probability. However, statistics provides us with many useful inequalities that can help us bound it. Specifically, we shall use the Markov inequality to upper bound the probability that T takes a value greater than or equal to c.
Markov’s inequality is a powerful tool in probability theory that allows us to make general statements about the probability of a random variable exceeding a certain value. The inequality is named after the Russian mathematician Andrey Markov, who first introduced it in the late 19th century. The Markov inequality states that for any non-negative random variable X and any positive number a, the probability that X is greater than or equal to a is less than or equal to the expected value of X divided by a. In mathematical notation, this can be written as:
P(X ≥ a) ≤ E(X) / a
Intuitively, the Markov inequality says that if we want to know how likely it is for a random variable to take on a value greater than or equal to a, we can bound this probability by dividing the expected value of the random variable by a. This bound is often a very loose one, but it can be useful in situations where we have limited information about the distribution of the random variable.
Since T is non-negative (the number of boxes cannot be negative), we can use the Markov inequality:
P(T ≥ c) ≤ E(T) / c = n H(n) / c ≈ (n ln(n) + γn + 1/2) / c
The above bound (using the asymptotic approximation for E(T)) can be useful for estimating the probability when the value of n is large. For example, suppose we want to estimate the probability that we need to open 1000 or more boxes to collect all 100 unique coupons. We can use the inequality to obtain an upper bound on this probability:
```python
def coupon_collectors_approx(n):
    """Asymptotic approximation of E(T)."""
    return n * math.log(n) + 0.5 + n * 0.577215665

def coupon_prob(n, c):
    """Markov upper bound on P(T >= c)."""
    ev = coupon_collectors_approx(n)
    return ev / c

print(coupon_prob(100, 1000))
# 0.5187385850988091
```
Therefore, if we know the values of n and c, we can use this bound to estimate the probability that we need to open more than c boxes to collect all n unique coupons.
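It's worth seeing how loose this bound is in practice. The sketch below (illustrative, with an assumed helper name `collect_all`) estimates P(T ≥ 1000) for n = 100 by simulation; the empirical frequency comes out far below the Markov bound of roughly 0.52:

```python
import random

def collect_all(n, rng):
    """Open uniformly random boxes until all n coupon types have been seen."""
    seen = set()
    boxes = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        boxes += 1
    return boxes

rng = random.Random(3)
n, c, trials = 100, 1000, 10_000
freq = sum(collect_all(n, rng) >= c for _ in range(trials)) / trials
print(freq)  # empirical P(T >= 1000): well under 0.05, versus the Markov bound of ~0.52
```

This illustrates the caveat from earlier: Markov's inequality is valid but often very loose, so it is best used when nothing beyond the expected value is known.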
Conclusion
In conclusion, the coupon collector’s problem is a classic problem in probability theory that has a wide range of practical applications. The problem involves collecting a set of n unique coupons, where each coupon is equally likely to be found in a given box. We have discussed various aspects of this problem, including the expected value and variance of the time it takes to collect all n unique coupons, as well as the probability that more than a given number of boxes are needed to collect them all.
The coupon collector’s problem is a fascinating problem with many interesting properties and applications. It is an important tool for understanding probability and statistics, and its solutions have a wide range of real-world applications, from designing surveys and collecting data to analyzing customer behavior and predicting market trends. By understanding the coupon collector’s problem, we can gain valuable insights into the behavior of complex systems and make more informed decisions in a variety of contexts.
Thanks for reading! Hope you enjoyed this article!