A famous statistical phenomenon proven programmatically

Tomorrow is the 18th of May, which is my birthday 🥰🎈. This inspired me to write an article about a phenomenon in probability theory called the birthday paradox.
The problem concerns the probability that, in a random group of n people, at least two share a birthday. In particular, in a group of 23 people this probability is slightly higher than 50%, while for 70 people it rises to 99.9%. When the number of people reaches 366 (or 367, if leap years are counted), a shared birthday is certain. The values for the first two cases seem unexpectedly high and counterintuitive, perhaps because in everyday life we rarely meet people with the same birthday as ourselves. Hence it’s called a paradox, even though it’s mathematically sound.
To calculate the probability of a shared birthday in a group of n randomly selected people, we can use the following formula:

$$p(n) = 1 - \frac{P(365, n)}{365^n}$$

where P(365, n) is a permutation, i.e. an ordered arrangement of n birthdays sampled without replacement from 365 days, so that P(365, n)/365^n is the probability that all n birthdays are different. For this formula to be valid, we make the following assumptions:
- we don’t consider leap years,
- all 365 days are equally probable, without considering seasonal variations and historical birthday data,
- there are no twins in the group.
An exponential approximation of this formula:

$$p(n) \approx 1 - e^{-n(n-1)/(2 \cdot 365)}$$
The easiest way to calculate the probability of a shared birthday for any number of people is to use an available birthday paradox calculator (you can also find many similar ones).
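If you would rather compute the values yourself than rely on a calculator, the exact formula and a common exponential approximation fit in a few lines of Python. This is only a minimal sketch under the assumptions listed above (365 equally likely days); the function names p_shared_exact and p_shared_approx are illustrative, and math.perm requires Python 3.8 or later.

```python
import math

DAYS = 365  # assumption: no leap years, all days equally likely


def p_shared_exact(n: int) -> float:
    """Exact probability that at least two of n people share a birthday."""
    if n > DAYS:
        return 1.0  # pigeonhole principle: more people than days
    # perm(365, n) / 365**n is the probability that all n birthdays differ
    return 1 - math.perm(DAYS, n) / DAYS**n


def p_shared_approx(n: int) -> float:
    """Exponential approximation: 1 - exp(-n(n-1) / (2 * 365))."""
    return 1 - math.exp(-n * (n - 1) / (2 * DAYS))


# Compare both versions for the group sizes discussed in the article
for n in (23, 70):
    print(n, round(p_shared_exact(n), 4), round(p_shared_approx(n), 4))
```

For 23 people the two versions differ by less than one percentage point, which is why the approximation is often good enough for quick estimates.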
There are some related birthday problems:
- What is the probability for a person to be born on a given day of the year (for example, my birthday)? The answer here is 1/365 × 100% ≈ 0.27%.
- How many people are born on a given day of the year (for example, my birthday)?
- Given a selected probability, what is the greatest number of people for which the probability is smaller than the given value (or, just the opposite, the smallest number of people for which the probability is greater than the given value)? The last problem is also referred to as a reverse birthday paradox.
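The reverse birthday problem from the last item can be sketched as a simple search: increase the group size until the exact probability exceeds the chosen threshold. The code below assumes the same 365 equally likely days as before; the helper names p_shared and smallest_group are hypothetical and not part of any library.

```python
import math


def p_shared(n: int, days: int = 365) -> float:
    """Exact probability of at least one shared birthday among n people."""
    if n > days:
        return 1.0
    return 1 - math.perm(days, n) / days**n


def smallest_group(target: float, days: int = 365) -> int:
    """Smallest n for which the shared-birthday probability exceeds target."""
    if not 0 <= target < 1:
        raise ValueError("target must be in [0, 1)")
    n = 1
    while p_shared(n, days) <= target:
        n += 1
    return n


print(smallest_group(0.5))   # smallest group with probability above 50%
print(smallest_group(0.99))  # smallest group with probability above 99%
```

The opposite question (the greatest number of people for which the probability stays at or below the threshold) is answered by subtracting one from the result.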
Let’s now confirm the birthday paradox programmatically using a lesser-known but very useful and versatile Python library called Faker (installation: pip install Faker). If you aren’t familiar with this amazing tool yet, it’s high time to discover it. With Faker, we can create a wide range of fake data including names, surnames, contact details, geographical information, job positions, company names, colors, etc. For example:
from faker import Faker
fake = Faker()
fake.name()
Output:
'Katherine Walker'
An identical syntax can be used for creating any other kind of fake data. All we need to do is substitute name with a suitable self-explanatory method: address, email, phone_number, city, country, latitude, longitude, day_of_week, month_name, color, job, company, currency, language_name, word, boolean, file_extension, etc. We can even create fake passwords (using the password method), bank data (iban, swift, credit_card_number, credit_card_expire, credit_card_security_code), and entire fake files (csv, json, zip), optionally adjusting some additional parameters if available. Moreover, some methods have more granular versions. For example, instead of creating just a random name (i.e. first name + surname) with the name method, we can use name_female, name_male, or name_nonbinary, and similarly with first_name, last_name, prefix, and suffix. Additionally, it’s possible to select the language of the output, make the output reproducible or unique, etc. For more detail, please refer to the Faker documentation.
However, let’s return to our birthdays and their paradox. First, we need to collect fake birthdays to work with. To create a fake date, the library offers the following methods:
- date – a date string in '%Y-%m-%d' format between 01.01.1970 and now (actually, for our purposes, the year doesn’t matter). We can change the output format using the pattern parameter.
- date_between – a random date between two given dates. By default, from 30 years ago (start_date='-30y') till now (end_date='today').
- date_between_dates – similar to the one above, but here we have to specify the date_start and date_end parameters.
- date_of_birth – a random date of birth that we can optionally constrain by minimum_age (0 by default) and maximum_age (115 by default).
- date_this_century – any date from the current century. The parameters before_today and after_today can be added; by default, only dates before today are considered. Similarly, we can create a random date of this decade, year, or month.
- future_date – a random date between 1 day from now and a given date. By default, dates up to 30 days ahead are considered (end_date='+30d').
Almost all of these methods return a date object (datetime.date), while date returns a string:
fake.date()
Output:
'1979-09-04'
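Since only the month and day matter for the birthday problem, the year has to be dropped from whatever a date method returns. The sketch below shows both cases, using hard-coded values in place of Faker output so that it runs without the library: slicing works for a 'YYYY-MM-DD' string such as the one returned by date, while strftime does the same job for date objects. The helper name month_day is hypothetical.

```python
import datetime


def month_day(value) -> str:
    """Return 'MM-DD' from a 'YYYY-MM-DD' string or a date/datetime object."""
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.strftime('%m-%d')
    return value[-5:]  # last five characters of 'YYYY-MM-DD'


print(month_day('1979-09-04'))               # 09-04
print(month_day(datetime.date(1979, 9, 4)))  # 09-04
```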
Let’s use the date method to test the birthday paradox. We’re going to create a function that:
- Creates a list of n random birthdays, extracting only the month and day from each date.
- Runs this operation 1000 times and, for each list of birthdays, checks whether all the birthdays in that list are unique (for larger groups, we expect that they often aren’t). Adds the result (True or False) to a list (shared_bday_test).
- Runs the whole cycle 100 times and calculates the probability of having a list with a shared birthday in each shared_bday_test (in practice, this means calculating the percentage of False values).
- Creates a list of all probabilities, one for each of the 100 cycles.
- Finds the average of the probability list and prints out the result.
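def test_bday_paradox(n):
    # Relies on the `fake` instance created earlier (fake = Faker())
    probabilities = []
    for _ in range(100):
        shared_bday_test = []
        for _ in range(1000):
            bdays = []
            for _ in range(n):
                # Keep only 'MM-DD' from the 'YYYY-MM-DD' string
                bdays.append(fake.date()[-5:])
            # True if all birthdays in this group are unique
            shared_bday_test.append(len(bdays) == len(set(bdays)))
        # Percentage of groups with at least one shared birthday
        p = (1000 - sum(shared_bday_test)) / 1000 * 100
        probabilities.append(p)
    p_mean = round(sum(probabilities) / len(probabilities), 1)
    print(f'The probability of a shared birthday among a group of {n} random people:'
          f'\n{p_mean}\n')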
Now, we’re ready to check the cases mentioned at the beginning of this article: the probability of a shared birthday for groups of 23, 70, and 366 people. We expect the following values: ~50%, 99.9%, and 100%, respectively.
test_bday_paradox(23)
test_bday_paradox(70)
test_bday_paradox(366)
Output:
The probability of a shared birthday among a group of 23 random people:
50.5
The probability of a shared birthday among a group of 70 random people:
99.9
The probability of a shared birthday among a group of 366 random people:
100.0
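As a cross-check that doesn’t depend on Faker, the same experiment can be sketched with the standard library alone. Unlike fake.date(), which can also return February 29 for leap years, this version draws from exactly 365 equally likely days, matching the assumptions of the formula. The function name simulate_shared_bday, the default of 10,000 trials, and the fixed seed are illustrative choices, not part of the original experiment.

```python
import random


def simulate_shared_bday(n: int, trials: int = 10_000, seed: int = 0) -> float:
    """Estimate, in percent, the chance of a shared birthday among n people."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    shared = 0
    for _ in range(trials):
        # Each birthday is a day index from 0 to 364, all equally likely
        bdays = [rng.randrange(365) for _ in range(n)]
        if len(set(bdays)) < n:  # at least one duplicate in the group
            shared += 1
    return shared / trials * 100


for n in (23, 70, 366):
    print(n, round(simulate_shared_bday(n), 1))
```

With 366 people and only 365 possible days, every simulated group must contain a duplicate, so the last estimate is exactly 100 regardless of the seed.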
Conclusion
To sum up, we discussed the curious statistical phenomenon of the birthday paradox, the ways of calculating the probability of a shared birthday, the assumptions under which the formulas are valid, and some interesting but seemingly counterintuitive results. We also got familiar with a rarely used but very helpful library for creating fake data in Python, explored some of its numerous applications, and, finally, applied Faker to confirm the birthday paradox.
Thanks for reading, and happy birthday to me! 🥳 And with the probability of 0.27%, also to you! 😀
If you liked this article, you may also find the following ones interesting:
Creating a Waterfall Chart in Python