Probability Distributions in Public Transport Delays
A method for calculating delay probabilities on a given connection using Python
1. Introduction
Delays are a familiar part of daily life for anyone who relies on public transportation to get from one place to another.
Let’s now ask ourselves a question. Given a connection, how likely is a delay? Wouldn’t it be beneficial to be able to select the route with the least delay so you can arrive at the destination on time?
Today I would like to talk to you about probability distributions in public transport delays. By that I mean understanding how delays are distributed and how to compute delay probabilities with a few lines of Python.
Daily life application
As an example of how delay probabilities could be used, here is a screenshot from a train timetable app I have built for a project at my university [2].
As we can see from the tables, the app allows us to choose the route that best meets our needs: we could take the shorter way that will enable us to depart 2 minutes later but only has a 66% chance of arriving on time, or we could take the longer route, and have a 98% chance of arriving on time.
This simple example demonstrates the potential and importance of probability distributions in our daily lives.
2. Dataset
Thanks to a university project, I have come across a dataset from the Swiss public transport system. As you can see below, this dataset contains historical data regarding station arrival times.
In addition to station information and metadata, it includes the columns expected_arrival and actual_arrival. Subtracting the expected arrival time from the actual arrival time gives the delay, in seconds, for a given day at a given time.
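As a minimal sketch of that subtraction, the snippet below builds a tiny stand-in DataFrame and derives a delay column in seconds. The rows and the timestamp format are invented for illustration; only the column names come from the dataset described above, and the real data may need different parsing.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up for illustration).
df = pd.DataFrame({
    "departure": ["Zürich Flughafen", "Zürich Flughafen", "Bern"],
    "arrival": ["Zürich HB", "Zürich HB", "Zürich HB"],
    "expected_arrival": ["2021-05-03 08:12:00", "2021-05-03 09:12:00", "2021-05-03 10:02:00"],
    "actual_arrival": ["2021-05-03 08:13:30", "2021-05-03 09:11:40", "2021-05-03 10:02:00"],
})

# Parse both columns as timestamps (assumes a format pandas can infer).
expected = pd.to_datetime(df["expected_arrival"])
actual = pd.to_datetime(df["actual_arrival"])

# Delay in seconds: positive means late, negative means early.
df["delay"] = (actual - expected).dt.total_seconds()
print(df[["departure", "arrival", "delay"]])
```

Keeping early arrivals as negative delays, rather than clipping them to zero, preserves the full shape of the distribution for the later steps.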
For the sake of simplicity, let’s consider only the connection departing from “Zürich Flughafen” and arriving at “Zürich HB”.
delay_connection = df[(df["departure"] == "Zürich Flughafen") & (df["arrival"]=="Zürich HB")]
3. Distributions
Now we can show the distribution of delays for this connection.
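One way to look at the distribution without any plotting library is to bin the delays with NumPy. The sketch below uses synthetic delays (drawn from a normal distribution with made-up parameters) in place of the real delay column, so the shape it prints is illustrative only.

```python
import numpy as np

# Synthetic delays in seconds, standing in for delay_connection["delay"].
rng = np.random.default_rng(seed=0)
delays = rng.normal(loc=10.0, scale=55.0, size=1_000)

# Split the observed range into 20 equal-width bins and count the delays in each.
counts, edges = np.histogram(delays, bins=20)

# Print a crude text histogram, one row per bin (one '#' per 5 observations).
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:7.1f} to {right:7.1f} s | {'#' * (int(count) // 5)}")
```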
As we can see, the delays appear to follow roughly a normal or log-normal distribution. Many studies have analyzed the distribution of delays; for more detail, I suggest looking at [1]. In this post, we will assume that the distribution is normal.
4. Calculating distribution and probabilities
Now, we need to model our distribution to determine the probability that a delay will occur. We start by getting the mean and standard deviation of delays on a given connection.
mean_delay = delay_connection.delay.mean()
std_delay = delay_connection.delay.std()
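One detail worth knowing here: pandas computes the sample standard deviation (ddof=1) by default, while a maximum-likelihood fit of a normal distribution uses the population version (ddof=0). The sketch below shows both on a handful of made-up delays and compares them with SciPy's norm.fit. On a large dataset the two values are nearly identical.

```python
import pandas as pd
from scipy.stats import norm

# Made-up delays in seconds, standing in for delay_connection["delay"].
delays = pd.Series([90.0, -20.0, 0.0, 30.0, 45.0, 10.0, -5.0, 120.0])

mean_delay = delays.mean()
std_sample = delays.std()            # pandas default: ddof=1 (divides by n - 1)
std_population = delays.std(ddof=0)  # divides by n

# norm.fit returns maximum-likelihood estimates (loc, scale);
# for the normal distribution, scale is the ddof=0 standard deviation.
loc_hat, scale_hat = norm.fit(delays.to_numpy())

print(f"mean={mean_delay:.2f}, std (ddof=1)={std_sample:.2f}, std (ddof=0)={std_population:.2f}")
print(f"norm.fit -> loc={loc_hat:.2f}, scale={scale_hat:.2f}")
```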
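We can now fit a probability distribution. The normal distribution is defined by its mean μ and variance σ²:
$$X \sim \mathcal{N}(\mu, \sigma^2)$$
and its probability density function is defined as
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$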
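To make the density concrete, the sketch below writes the normal PDF out by hand and compares it with SciPy's norm.pdf. The function name normal_pdf and the values of mu and sigma are illustrative assumptions, not figures from the dataset.

```python
import math
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma, evaluated at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

mu, sigma = 10.0, 55.0  # illustrative values, in seconds

manual = normal_pdf(0.0, mu, sigma)
library = norm.pdf(0.0, loc=mu, scale=sigma)
print(manual, library)  # the two values should agree up to floating-point error
```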
Now we will use the SciPy stats module to evaluate the normal cumulative distribution function (CDF), which we can display below.
In SciPy, the mean of the normal distribution is passed as the parameter loc, and the standard deviation as scale.
from scipy.stats import norm
# norm.cdf(x, loc=mean_delay, scale=std_delay)
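As a small sketch of what the CDF returns, the snippet below evaluates it on a grid of delay thresholds. The values of mean_delay and std_delay are placeholders chosen for illustration, not the ones computed from the dataset.

```python
import numpy as np
from scipy.stats import norm

mean_delay, std_delay = 10.0, 55.0  # placeholder values, in seconds

# Thresholds from -180 s to 180 s in steps of 60 s.
x = np.linspace(-180, 180, 7)
cdf_values = norm.cdf(x, loc=mean_delay, scale=std_delay)

for xi, p in zip(x, cdf_values):
    print(f"P(delay <= {xi:6.0f} s) = {p:.3f}")

# By symmetry, the CDF evaluated at the mean is 0.5.
cdf_at_mean = norm.cdf(mean_delay, loc=mean_delay, scale=std_delay)
```

The CDF rises from 0 to 1 as the threshold increases, so each value reads as "the share of trips with at most this much delay" under the normal assumption.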
Now, what is the probability of a delay on this connection? We evaluate the CDF at our delay threshold x; here we are interested in any delay greater than 0 seconds (the orange area below).
Moreover, since we are looking for P(X > 0) and the CDF gives us P(X ≤ x), we have to compute 1 - P(X ≤ 0).
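Written out under the normal assumption, with F the CDF of the delay, Φ the standard normal CDF, and μ and σ the mean and standard deviation of the delay, the complement rule gives:

```latex
\begin{aligned}
P(X > 0) &= 1 - P(X \le 0) \\
         &= 1 - F(0) \\
         &= 1 - \Phi\!\left(\frac{0 - \mu}{\sigma}\right) \\
         &= \Phi\!\left(\frac{\mu}{\sigma}\right)
\end{aligned}
```

The last step uses the symmetry of the standard normal distribution, 1 - Φ(-z) = Φ(z). So the probability of any delay depends only on the ratio of the mean delay to its standard deviation.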
from scipy.stats import norm
proba = norm.cdf(0, loc=mean_delay, scale=std_delay)
final_proba = 1 - proba
final_proba: 0.5641598241281299
As we can see, the probability of a delay on this connection is about 56%. Of course, this counts any delay greater than 0 seconds. To refine the analysis, we could ask for the probability of a delay of more than 1 minute.
from scipy.stats import norm
proba = norm.cdf(60, loc=mean_delay, scale=std_delay)
final_proba = 1 - proba
final_proba: 0.1774974181545348
As we can see, the probability is now only about 18%.
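The two calculations above differ only in the threshold, so they can be wrapped in a small helper. The sketch below uses SciPy's survival function norm.sf, which returns 1 - CDF directly. The helper name delay_probability and the values of mean_delay and std_delay are illustrative assumptions, so the printed probabilities will not match the ones above.

```python
from scipy.stats import norm

def delay_probability(threshold_s, mean_delay, std_delay):
    """P(delay > threshold_s) under a normal model of the delay (all in seconds)."""
    # norm.sf is the survival function, i.e. 1 - norm.cdf.
    return norm.sf(threshold_s, loc=mean_delay, scale=std_delay)

mean_delay, std_delay = 10.0, 55.0  # placeholder values, in seconds

for threshold in (0, 60, 120, 300):
    p = delay_probability(threshold, mean_delay, std_delay)
    print(f"P(delay > {threshold:3d} s) = {p:.3f}")
```

Using norm.sf rather than 1 - norm.cdf also tends to be more accurate for very small tail probabilities, which matters for large thresholds.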
5. Conclusion
With this post, I wanted to show you how you can apply data science to model a straightforward problem such as delay probability on a real dataset. We started by gathering the delays on a given connection, computed their mean and standard deviation, and finally plugged these into the cumulative distribution function of a normal distribution to find the probability of a delay.
The idea of writing about probability distributions in delays comes from a project I did in a course at EPFL in which we had to build a stochastic route planner for the Zürich area (Switzerland). You can see the app we built at this link and our GitHub repository here. If you have any questions or comments, feel free to connect with me on LinkedIn.
References:
[1] B. Büchel, F. Corman, Modelling Probability Distributions of Public Transport Travel Time Components (2018), Institute for Transport Planning and Systems
[2] Original app at: https://share.streamlit.io/michaelroust/stochastic-journey-planner/website/routing/streamlit_site.py