
Dissecting the Birthday Paradox

Statistics isn't always very intuitive

Photo by Adi Goldstein on Unsplash

Introduction

Try to answer this question without actually working it out.

How many people do you need in a room for there to be at least a 50% chance that two of them share a birthday?

What would you answer purely based on intuition? What would a reasonable answer look like? 50 people? 100? 183? 365?

When I came across this problem, I guessed the answer to be about 70–80 people. I did the math after that and the answer surprised me.

Apparently, only 23 people are required!

Let’s prove this.


Table of Contents

* The Math -- proving it mathematically
* Exploring Real World Data -- checking our theory
* Why this isn't so intuitive
* Code and References

The Math

Assumptions

We’ll make the following assumptions to simplify the problem:

  • no leap years
  • each day is equally likely to be a potential birthday

The first assumption lets us assume that there are exactly 365 possible birth dates.

The second assumption means that we’re solving for the worst case, since any imbalance would make it more likely for two birthdays to collide (that is, fall on the same date).

The Trivial Case

Let’s say we have exactly two people. What’s the probability of them having the same birthday?

The first person can have their birthday on any day of the year, but the second person needs to have their birthday on exactly the same day as the first person.

So probability can be worked out like this:

P_2 = (number of possible days for the 1st person * number of possibilities for the 2nd person) / total number of possibilities

This equates to:

P_2 = (365 * 1) / (365 * 365) = 1/365

This is just about 0.3%.

For a larger number of people

When we have more people to consider, say, 5, it’s easier to first compute the probability of there being no birthday matches, and then take the complement of this probability.

For 5 people:

  • the first person can have any of the 365 birthdays – 365 ways
  • the second person can have anything but the first person’s birthday – 364 ways.
  • Similarly, the third, fourth and the fifth person have 363, 362 and 361 possible birthdays
  • We multiply these and then divide by the total number of possible birthday combinations for 5 people: 365 multiplied by itself 5 times, or 365 ^ 5
  • We subtract this result from 1 to get the probability of there being a match, since what we just computed was there being no matches

Our result here is thus:

P(A) = 1 - (365 * 364 * 363 * 362 * 361) / 365 ^ 5
     ≈ 0.027
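
The five-person calculation above can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic; the variable names are illustrative and not taken from the accompanying notebook.

```python
from math import prod

DAYS = 365  # assumption from the article: no leap years
n = 5

# Ways to give n people distinct birthdays: 365 * 364 * ... * (365 - n + 1)
distinct_ways = prod(DAYS - i for i in range(n))
# All possible birthday assignments for n people
total_ways = DAYS ** n

# Complement of "no two people share a birthday"
p_match = 1 - distinct_ways / total_ways
print(round(p_match, 3))  # 0.027
```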

General Solution

Let’s work this out for a group of n people.

Following the same logic as above, we’ll compute the complement first and then subtract it from 1 to get the answer we want. The first person can have a birthday on any given day – 365 possibilities. The second person can have a birthday on any day except that of the first person – 364 possibilities, and so on.

P_n(A)' = (365 * 364 * 363 * ... * (365 - (n - 1))) / 365 ^ n

From the pigeonhole principle (and even through an intuitive approach), for n > 365 a match is guaranteed, so P_n(A) = 1.

This can be simplified further like so.

computing the complement probability - Wikipedia
P_n(A)' = (365 p n) / 365 ^ n

where (365 p n) represents the number of permutations of n items chosen from a total of 365, that is, 365! / (365 - n)!.

So, to calculate what we actually need, we have:

P_n(A) = 1 - P_n(A)'

For n = 23, we get just over 50% (50.7% to be precise), while n = 22 still falls short of 50%. This confirms the answer stated in the introduction: 23 people are enough for a 50% chance of a birthday match.
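
As a rough check of the general formula, the sketch below evaluates P_n(A) for any n and searches for the smallest group size that reaches 50%. The function name `p_match` is made up for this illustration, and it relies on the same two assumptions as above (365 days, all equally likely).

```python
from math import perm

def p_match(n, days=365):
    """P(at least two of n people share a birthday), assuming uniform birthdays."""
    if n > days:
        return 1.0  # pigeonhole principle: a match is guaranteed
    # perm(days, n) counts the ways to assign n distinct birthdays
    return 1 - perm(days, n) / days ** n

# Smallest group size with at least a 50% chance of a match
smallest = next(n for n in range(1, 367) if p_match(n) >= 0.5)

print(smallest)                # 23
print(round(p_match(23), 3))   # 0.507
```

Using exact integers for the numerator and denominator avoids any overflow or rounding trouble before the final division.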

Real-World Data

Let’s now look at some real data, to see if our assumptions are good enough to model a real-world scenario.

We’ll use the birth dates dataset from FiveThirtyEight. It’s linked in the references below. It’s under the Creative Commons Attribution 4.0 International license, as mentioned on their website.

The dataset contains daily birth counts for the years 2000 to 2014, and it looks something like this.

A few rows from the dataset - image by author

A more extensive EDA will be added to the accompanying notebook, linked at the end.

Preprocessing

It’s rare to find a dataset this clean. All we have to do is remove a few columns and create a DateTime column.

# drop an irrelevant column
df = df.drop('day_of_week', axis=1)
# rename date_of_month to day -- helps with the datetime conversion
df = df.rename({"date_of_month": "day"}, axis=1)
# convert year, month and day into datetime
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
# drop the now redundant year, month and day cols
df = df.drop(['year', 'month', 'day'], axis=1)

Our data frame df now looks something like this.

The processed data frame - image by author

Analyzing the data

We’ll use the calmap library to plot a heatmap of the births for the year 2014. The library accepts data indexed with a DatetimeIndex, so let’s create that first.

# accessing the rows belonging to 2014
df_14 = df[df['date'].dt.year == 2014]
# duplicating the "births" column into a separate series
df_dates_14 = pd.Series(df_14['births'])
# setting the index to df_14's date column
df_dates_14.index = df_14['date']

Let’s now create the calendar plot.

plt.figure(figsize=(30, 10))
_ = calmap.yearplot(df_dates_14)
The calendar heatmap for 2014 - image by author

It’s clear that the uniformity assumption doesn’t hold, just as we expected. You can see that there are significantly fewer births on the weekends and some public holidays. Let’s look at this a little more closely.

Challenging the uniformity assumption

While our first assumption of having no leap days is safe enough to not affect our answer significantly, the second assumption we made (that each date is equally likely) seems to be a little unrealistic. Well, it is.

Data shows that birthdates have two potential patterns – the American pattern and the European pattern. The American pattern of births shows a significant peak in September, while the European pattern has a large peak in spring followed by a smaller peak in September.

Another abnormality is seen on holidays – since more hospitals are closed, births are less likely to occur on days like public holidays.

Similarly, there are slightly fewer births on weekends than on weekdays.

All this points to the fact that real-world data doesn’t go by the assumption of uniformity. We’ll see how much this variation affects our answer shortly.

Finding the probabilities for different sample sizes

Let’s take a year, say, 2014. We’ll randomly sample dates for different sample sizes and estimate the probability of at least two people in a sample sharing a birthday.

We’ll perform two kinds of experiments – one with the uniformity assumption and one without.

The utility function

Let’s first create a utility function that accepts:

  • a data frame
  • a sample size
  • number of such samples to test on
  • a flag to trigger uniformity

At a high level, this function simply creates multiple samples of a given sample size from our data frame and checks if that sample has a birthday match or not. It then returns the fraction of these samples that had a match.

It’ll look something like this:

The utility function - image by author
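
Since the original function is only available as an image, here is a minimal sketch of what such a utility could look like. It is not the author's implementation: the name `match_fraction` and its parameters are hypothetical, and it takes a plain list of daily birth counts (for example `df_14['births'].tolist()`) instead of a data frame so that it runs with the standard library alone.

```python
import random

def match_fraction(births, sample_size, n_samples=10_000, uniform=False, seed=None):
    """Estimate P(at least one shared birthday) by repeated sampling.

    births: daily birth counts, one entry per day of the year (assumed input shape)
    uniform: if True, ignore the counts and treat every day as equally likely
    """
    rng = random.Random(seed)
    days = range(len(births))
    weights = None if uniform else births
    hits = 0
    for _ in range(n_samples):
        # draw one birthday per person, with replacement
        sample = rng.choices(days, weights=weights, k=sample_size)
        # fewer distinct days than people means at least two people share one
        if len(set(sample)) < sample_size:
            hits += 1
    return hits / n_samples

# Toy usage with flat placeholder counts (not the real dataset)
flat_counts = [1] * 365
print(match_fraction(flat_counts, 23, seed=0))
```

With flat counts the estimate should land close to the theoretical 0.507, up to sampling noise.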

Now, let’s calculate this fraction of samples with at least one matching birthday for different sample sizes.

We won’t impose the uniformity assumption for now. Remember that we now expect a higher probability at a sample size of 23 than the calculated 0.507, since the non-uniformity of birthdates makes it easier to get a birthday match.

Calculating the probabilities for different sample sizes - image by author

Let’s plot the results.

Plot of probabilities of a match occurring for different sample sizes (non uniform) - image by author

As expected, for a sample size of 23 we got a value slightly higher than 0.5.

Let’s repeat this experiment. But this time, we’ll weight the dates uniformly. For a sample size of 23, we expect a probability a little greater than 0.5 but less than what we got in the previous case.

Plot of probabilities of a match occurring for different sample sizes (uniform) - image by author

It’s very similar to the values we got when we didn’t have the uniformity assumption. The key point here is that for n = 23, we got a slightly lower value with the uniformity assumption than without it.

This is consistent with the intuition that two people are more likely to share a birthday if some dates are more likely than others. Note that this was only for a single year out of the 15 in our dataset; averaging over all the years would probably give more consistent results (we won’t do that here, as the article is getting pretty long).

How significant is the uniformity assumption?

Keeping these findings in mind, how significant is this assumption? Would our calculated answer of 23 be too high by 1? Or 3? Maybe more?

A paper by Mario Cortina Borja and John Haigh explains this in depth. They use a recursive formula to compute how much the answer would vary. They arrived at a multiplier of 0.99917, which means that the answer still remains 23 for the most part, even if we relax the uniformity assumption.

This paper is linked in the references section.
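
To get a feel for how little a mild imbalance changes the result, the match probability can also be computed exactly for unequal day weights. The sketch below is one way to do that and is not necessarily the recursion used in the paper: it uses the fact that the probability of n distinct birthdays equals n! times the n-th elementary symmetric polynomial of the daily probabilities. The function name and the skewed weights are made up for illustration.

```python
from math import factorial

def p_match_weighted(weights, n):
    """Exact P(at least one shared birthday) for n people and arbitrary day weights."""
    total = sum(weights)
    probs = [w / total for w in weights]
    # e[k] = k-th elementary symmetric polynomial of the probabilities seen so far
    e = [1.0] + [0.0] * n
    for p in probs:
        # go from high k to low k so each day is used at most once per term
        for k in range(n, 0, -1):
            e[k] += p * e[k - 1]
    # n! * e[n] is the probability that all n birthdays are distinct
    return 1 - factorial(n) * e[n]

uniform = [1.0] * 365
skewed = [1.2] * 100 + [1.0] * 265  # hypothetical: 100 days are 20% more likely

print(round(p_match_weighted(uniform, 23), 3))  # 0.507
print(p_match_weighted(skewed, 23) > p_match_weighted(uniform, 23))  # True
```

The skewed case comes out only slightly above the uniform one, which fits the point that the answer of 23 is fairly robust.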

If we check this trend over the fifteen years in the dataset (2000–2014), we get a plot that reinforces this.

Trends in the probability of a match with sample size 23 - image by author

For almost all the years, the "uniform" line is lower than the "non-uniform" one. The very light blue and orange horizontal lines behind the trend lines represent the means. Both are close, and the uniform mean is lower, as expected.


Why it’s not so intuitive

As you can probably imagine, we give a lot of weight to the fact that the probability of two specific people sharing a birthday is only about 0.3%. While this probability is indeed very small, we tend to overlook that with 23 people, we are talking about 253 pairs of individuals.

Thinking about it this way, the chance of any two people sharing a birthday doesn’t seem that small, especially since there are only 365 unique birthdays.
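
The pair-counting argument can be made concrete with a quick back-of-the-envelope calculation. Treating the 253 pairs as if they were independent is an approximation (they are not truly independent), so the result lands near, but not exactly on, the exact 50.7%.

```python
from math import comb

n = 23
pairs = comb(n, 2)            # number of distinct pairs among 23 people
p_pair_differs = 364 / 365    # chance that one specific pair does NOT match

# Approximation: assume all pairs are independent of each other
approx_match = 1 - p_pair_differs ** pairs

print(pairs)                  # 253
print(round(approx_match, 2)) # 0.5
```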


Sample Code and References

Dataset

I’ve used the births dataset (Creative Commons License) from here.

Our Data

Paper talking about the uniformity assumption

https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2007.00246.x

Code

BlogCode/BirthdayParadox at main · Polaris000/BlogCode


Updates

26.4.22

add missing result


In Conclusion

I wrote this article to share a statistical problem where the answer isn’t immediately obvious without actually working it out. It shows that statistics isn’t very intuitive in some scenarios.

I hope this was an interesting read and you were able to take something away from it. Thanks for reading.

