The Poisson Distribution

Published in Towards Data Science, Nov 19, 2018

The other day on my regular commute, I listened to another brilliant episode of Linear Digressions, “Better Know a Distribution: The Poisson Distribution”, and I thought it would be a nice topic to explain with some code (in R) as a blog post. So here goes.

As per Wikipedia, the Poisson distribution, named after French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event.

Let's understand what exactly that means.

Environment Setup

Clean up

Load libraries

Data

For this exercise, I was looking for FIFA match data and, using the new resource from our friends over at Google (Google Dataset Search), I found this amazing dataset: International football results from 1872 to 2018. It covers all the soccer matches from 1872 to 2018, all 39,669 of them!
Reading it in.
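
A minimal sketch of the read-in step, using only base R. The column layout (date, home_team, away_team, home_score, away_score, tournament, city, country, neutral) is an assumption about how the downloaded CSV is organised, and `read_results` is a hypothetical helper name. To keep the sketch self-contained, it writes the two earliest international matches to a temporary file instead of pointing at the real download.

```r
# Hypothetical helper: read the results CSV and parse the date column.
read_results <- function(path) {
  matches <- read.csv(path, stringsAsFactors = FALSE)
  matches$date <- as.Date(matches$date)  # expects YYYY-MM-DD strings
  matches
}

# Two-row stand-in for the real file (assumed column layout).
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "date,home_team,away_team,home_score,away_score,tournament,city,country,neutral",
  "1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,FALSE",
  "1873-03-08,England,Scotland,4,2,Friendly,London,England,FALSE"
), sample_csv)

matches <- read_results(sample_csv)
str(matches)
```

With the real download, the call would simply be `read_results("path/to/your/file.csv")`.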

Explore

Looks like the data is complete and tidy. A few interesting observations:

We have data from Nov 30th 1872 to July 10th 2018. Whoo!
A max home_score value of 31 and max away_score of 21?! Some matches to look into!
About 25% of the matches are played in neutral territory. Are these all World Cup matches?

Let's generate some more interesting features.
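
As a rough sketch, the two features the rest of this post leans on are the year of each match and the total number of goals in it. The names `add_features`, `year` and `total_goals` are my own choices for illustration, and the tiny data frame stands in for the full dataset.

```r
# Hypothetical helper: derive the year and the total goals per match.
add_features <- function(matches) {
  matches$year <- as.integer(format(matches$date, "%Y"))
  matches$total_goals <- matches$home_score + matches$away_score
  matches
}

# Stand-in rows: the two earliest international matches.
toy <- data.frame(
  date = as.Date(c("1872-11-30", "1873-03-08")),
  home_score = c(0, 4),
  away_score = c(0, 2)
)

featured <- add_features(toy)
print(featured)
```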

When is the Poisson Distribution appropriate?

For a random variable k to be Poisson, it needs to satisfy the following 4 conditions (Wikipedia):

1. k is the number of times an event occurs in an interval and k can take values 0, 1, 2, …, i.e. k needs to be an integer (a major distinction from the more popular Gaussian distribution, where the variable is continuous).
2. The occurrence of one event does not affect the probability that a second event will occur. That is, events occur independently.
3. The rate at which events occur is constant. The rate cannot be higher in some intervals and lower in other intervals.
4. Two events cannot occur at exactly the same instant; instead, at each very small sub-interval exactly one event either occurs or does not occur.

Alternatively, k is approximately Poisson when the actual probability distribution is binomial and the number of trials is much larger than the number of successes one is asking about. In the checks that follow, I treat this requirement as part of condition 4.
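
To get a feel for that binomial connection, here is a small sketch of my own (not part of the Wikipedia conditions). It splits an interval into n tiny sub-intervals, each with success probability λ/n, and measures how far the binomial probabilities are from the Poisson ones as n grows. The λ value is borrowed from the goals-per-match mean that appears later in this post.

```r
lambda <- 2.935642  # mean goals per match, taken from later in the post
k <- 0:8            # event counts to compare

# Largest absolute gap between Binomial(n, lambda / n) and Poisson(lambda)
max_gap <- function(n) {
  max(abs(dbinom(k, size = n, prob = lambda / n) - dpois(k, lambda)))
}

n_values <- c(10, 100, 1000, 10000)
gaps <- sapply(n_values, max_gap)
print(data.frame(n = n_values, max_gap = gaps))
```

The gap shrinks as n grows, which is why "many trials, few successes" is the regime where a Poisson model is a reasonable stand-in for a binomial one.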

Now, let's first identify our k and the interval and see whether they satisfy the above 4 conditions. Let's explore the following 3 options:

1. k is the total number of goals and the interval is 1 year.
2. k is the total number of goals and the interval is 1 day.
3. k is the total number of goals and the interval is 1 match.

We have chosen our 3 options so that conditions 1 and 2 always hold, i.e. the number of goals is always an integer and one goal is independent of another (for the most part). We still need to explore conditions 3 and 4 for each of these options.

1. `k` is total number of goals and `interval` is 1 year

As the two plots for this option show, the mean number of goals remains more or less constant over the years, but the total number of goals per year increases. This violates condition 3. Condition 4 requires the number of trials to be much larger than the number of successes, and that is violated too: we have 147 trials (the number of years in the data set) and successes on the order of ~1000 or more (the total number of goals per year).
Logically too, more matches in a year mean more total goals in that year, so the rate per year is not constant, which violates condition 3.

Based on the above, we can expect option 2 (total number of goals in 1 day) to be closer to a Poisson distribution than option 1, but it still won't be one: more matches in a day mean more goals, which violates condition 3 that the rate at which events occur must be constant. Let's visualize this for option 2.
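
A sketch of the aggregation behind this argument, in base R. `goals_by_interval` is a hypothetical helper, and the toy rows are made up purely to show the effect: every match has 3 goals, but the later year has more matches.

```r
# Hypothetical helper: matches, total goals and mean goals per interval.
goals_by_interval <- function(matches, interval_col) {
  total <- matches$home_score + matches$away_score
  groups <- matches[[interval_col]]
  out <- data.frame(
    interval = sort(unique(groups)),
    matches = as.integer(tapply(total, groups, length)),
    total_goals = as.numeric(tapply(total, groups, sum))
  )
  out$mean_goals <- out$total_goals / out$matches
  out
}

# Made-up toy data, not taken from the real dataset.
toy <- data.frame(
  year = c(1900, 1900, 1950, 1950, 1950, 1950),
  home_score = c(2, 1, 3, 0, 1, 2),
  away_score = c(1, 2, 0, 3, 2, 1)
)

per_year <- goals_by_interval(toy, "year")
print(per_year)
```

In the toy data the mean stays at 3 goals per match while the yearly total doubles, simply because more matches were played. Passing a per-day column instead of `year` gives the same kind of table for a daily interval.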

2. `k` is total number of goals and `interval` is 1 day

So, even though the number of successes is fairly low compared to the number of trials (condition 4 satisfied), the rate at which events occur is not constant and depends on the number of matches played. Therefore, we reject option 2 as a Poisson distribution as well.
Let's finally explore option 3.

3. `k` is total number of goals and `interval` is 1 match

Eureka! We have a constant rate of goals per match, with a peak at around 3 goals and a mean of 2.935642 goals per match. The number of goals scored (‘the event’ being a goal being scored) is an integer, one goal is independent of another, and the number of matches (i.e. trials) is far higher than the number of goals (i.e. successes) per match. Therefore, we have found our Poisson distribution!
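
One way to sanity-check a claim like this is to put the observed share of matches with k goals next to what a Poisson distribution with the same mean would predict. The sketch below does that in base R. `observed_vs_poisson` is a hypothetical helper, and the short `goals` vector is made-up toy input rather than the real per-match totals.

```r
# Hypothetical helper: observed share of each goal count vs. the Poisson
# probability with lambda set to the sample mean.
observed_vs_poisson <- function(goals) {
  lambda <- mean(goals)
  k <- 0:max(goals)
  observed <- as.numeric(table(factor(goals, levels = k))) / length(goals)
  data.frame(k = k, observed = observed, expected = dpois(k, lambda))
}

# Made-up toy input; with the real data this would be total goals per match.
goals <- c(0, 1, 1, 2, 2, 2, 3, 3, 4)
comparison <- observed_vs_poisson(goals)
print(comparison)
```

If the two columns track each other closely on the real data, that supports treating goals per match as Poisson. Large gaps would be a reason to revisit the conditions above.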

Probability of events for a Poisson Distribution

Now that we have our Poisson distribution, we can calculate the probability of k events happening in an interval using the following:

P(k events in an interval) = e^{-λ} * λ^{k} / k!, where
λ = mean number of events per interval, i.e. mean number of goals per match,
k = number of events for probability estimation, i.e. number of goals,
e = Euler's number, and
k! = the factorial of k.

As per our exploration above, the mean number of goals is λ = 2.935642. We can plug this value into the formula above to calculate the probability of any number of goals being scored in a match.
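
The formula translates almost word for word into R. This is a sketch of my own: `poisson_pmf` is a hypothetical name, and the example value of λ is the mean goals per match found above. As a sanity check, the probabilities over a wide range of k should add up to (almost exactly) 1.

```r
# Direct transcription of the formula: e^(-lambda) * lambda^k / k!
poisson_pmf <- function(k, lambda) {
  exp(-lambda) * lambda^k / factorial(k)
}

lambda <- 2.935642
total_probability <- sum(poisson_pmf(0:50, lambda))
print(total_probability)
```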

For example,

P(5 goals scored in a match) = e^{-2.935642} * 2.935642^{5} / 5!

P(5 goals scored in a match) = 0.09647195841

Let's use R to calculate the above.
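
A minimal sketch of that calculation, assuming base R's `dpois` (the Poisson probability mass function) and the λ value quoted above. Because λ is rounded to six decimals here, the last printed digits may differ slightly from a run that uses the unrounded mean.

```r
lambda <- 2.935642  # mean goals per match (rounded)

# Probability of exactly 5 goals in a match
p_five <- dpois(x = 5, lambda = lambda)
print(p_five)
```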

## [1] 0.09647199

And we see the same value as calculated above (up to rounding in the last digits).
We can also see how the probability varies as we increase the number of events, i.e. the number of goals, from 0 to 8.
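
A sketch of how that can be done, again with `dpois` and the rounded λ from above. `dpois` is vectorised, so passing 0:8 returns all nine probabilities at once. The bar chart is just one simple way to look at them.

```r
lambda <- 2.935642
goals <- 0:8

# Probability of exactly k goals in a match, for k = 0, ..., 8
probs <- data.frame(goals = goals, probability = dpois(goals, lambda))
print(probs)

barplot(
  probs$probability,
  names.arg = probs$goals,
  xlab = "Goals in a match",
  ylab = "Probability"
)
```

The probabilities rise up to 2 goals, sit almost as high at 3, and then tail off. They do not quite sum to 1 because matches with more than 8 goals are left out.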

Summary

The Poisson distribution’s probability formula can be a nifty little trick to have under your belt for evaluating the probability of a given number of events happening in an interval. It is also widely used in industry, with applications such as estimating the probability of k customers arriving at a store in order to optimize resources, or the probability that a web page has seen k updates in order to optimize the rate at which a search engine crawls it.