The birthday paradox is the surprising result that if you have 23 people in a room, there is a 50% chance that two of them share a birthday. I want to explore this topic and verify that this is true.

Instead of delving into the maths with formulas, we will be using Monte Carlo methods with random simulations.

When answering questions about a group of people having shared birthdays, we need to assume one of two things.

  1. The birthdays are distributed uniformly, and each day is equally likely to be someone’s birthday.
  2. Some birthdays are more likely than others, and people tend to have kids more at certain times of the year.

Uniformly-distributed birthdays

First of all, I will assume that any day of the year is equally likely to be a birthday. In reality, some days are most likely more common than others. But this would only make the probability of same birthdays higher, so it’s not a bad assumption.

In [2]:
def make_room(N):
    days = range(365)
    return random.choices(days, k=N)

make_room(5)
Out [2]:
[21, 174, 343, 350, 119]

Let’s make a small function that checks if a room has a shared birthday.

In [3]:
def check_shared_birthday(room):
    bdays = set()
    
    for b in room:
        if b in bdays:
            return True
        bdays.add(b)
    return False

check_shared_birthday([1, 2, 3, 4])
check_shared_birthday([4, 2, 3, 4])
Out [3]:
False
Out [3]:
True

Now we just do a bunch of trials to figure out the probability to some accuracy.

In [4]:
def birthday_probability(N, runs):
    results = np.zeros(runs)
    
    for i in range(runs):
        room = make_room(N)
        shared = check_shared_birthday(room)
        results[i] = shared
    
    return np.mean(results)
In [5]:
birthday_probability(23, 5_000_000)
Out [5]:
0.5071562

Looks reasonable. Let’s plot this over a range of Ns.

In [6]:
uniform_probs = [birthday_probability(x, 100_000) for x in range(75)]

_ = plt.plot(uniform_probs)

# Plot 0.5 probability
_ = plt.axhline(0.5, color='orange', alpha=0.45)
_ = plt.axvline(23, color='orange', alpha=0.45)
Out:
<Figure size 900x600 with 1 Axes>

This worked well, but it’s not convenient to run a bunch of simulations every time you want to use this graph. Plus it only work for integers. Let’s fix those by approximating the function with a regression.

In [7]:
p = approx_poly(list(range(70)), uniform_probs, 7)

p.weights
Out [7]:
[0.003, -0.294, 8.822, -8.145, -12.648, 22.073, -8.815]
In [8]:
_ = plt.plot(uniform_probs)
plot_fn(p, 0, 75, 0.01)
Out:
<Figure size 900x600 with 1 Axes>

Much smoother, and much faster to execute.

Inverse of the function

Now we can use this approximation to calculate how many people are needed in a room for exactly fifty percent chance of a birthday collision.

In [9]:
f = lambda x: abs(p(x) - 0.5)

x = 0
for i in np.arange(22.5, 23, 0.000001):
    if f(i) <= f(x):
        x = i

x
p(x)
Out [9]:
22.80230900031076
Out [9]:
0.49999998707117743

Having 22.8 people in a room means the probability of having a birthday collision is 50%. Let’s plot this as well.

In [10]:
def birthday_reverse(x):
    f = lambda z: abs(p(z) - x)
    
    return min(list(np.arange(0, 75, 0.01)), key=f)

plot_fn(birthday_reverse, 0, 1, 0.01)
Out:
<Figure size 900x600 with 1 Axes>

Birthday distribution based on real data

Earlier we mentioned that birthdays are not perfectly uniform, and people tend to give birth more at specific parts of the year. Because of this, our previous calculations based on the assumption of uniform birthdays will be slightly off.

In order to make this more accurate, we need a dataset of birthdays. Rob Murphy 1 provides just that.

  1. https://www.panix.com/~murphy/ 

In [11]:
URL = "https://www.panix.com/~murphy/bdata.txt"
bdata = requests.get(URL).text.split("\r\n")[1:-1]
In [12]:
days = []

for row in bdata:
    day, count = row.split()
    count = int(count)
    for i in range(count):
        days.append(day)

Now all we need to do is to change make_room to use this dataset instead of uniform random numbers.

In [13]:
def make_room(N):
    return random.choices(days, k=N)
In [14]:
birthday_probability(23, 5_000_000)
Out [14]:
0.5075588

Looks like in a room with 23 people, the probability of finding a shared birthday slightly went up. Let’s see how the graph looks.

In [15]:
bdata_probs = [birthday_probability(x, 100_000) for x in range(75)]
In [16]:
_ = plt.plot(uniform_probs)
_ = plt.plot(bdata_probs)
Out:
<Figure size 900x600 with 1 Axes>

Doesn’t look any different. Probably not worth bothering with this, assuming a uniform distribution will get you pretty much identical results.

https://www.panix.com/~murphy/bday.html

https://www.panix.com/~murphy/bdata.txt