Probability Distributions in Public Transport Delays

A method for calculating delay probabilities on a given connection using Python

Gioele Monopoli
Towards Data Science


Photo by Uwe Conrad on Unsplash

1. Introduction

Delays are a familiar part of daily life for anyone who relies on public transportation to get from one point to another.

Let’s now ask ourselves a question. Given a connection, how likely is a delay? Wouldn’t it be beneficial to be able to select the route with the least delay so you can arrive at the destination on time?

In this post, I look at probability distributions in public transport delays: how the delays on a connection are distributed, and how to turn that distribution into delay probabilities with a few lines of Python.

Daily life application

As an example of how delay probabilities could be used, here is a screenshot from a train timetable app I have built for a project at my university [2].

Probability in timetable schedules taken from a university project

As we can see from the tables, the app allows us to choose the route that best meets our needs: we could take the shorter way that will enable us to depart 2 minutes later but only has a 66% chance of arriving on time, or we could take the longer route, and have a 98% chance of arriving on time.

This simple example demonstrates the potential and importance of probability distributions in our daily lives.

2. Dataset

Thanks to a university project, I have come across a dataset from the Swiss public transport system. As you can see below, this dataset contains historical data regarding station arrival times.

Departure and arrivals from Swiss trains

In addition to station information and metadata, it includes columns for expected_arrival time and actual_arrival time. Using these, we can get the delay for a given day at a given time.
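As a minimal sketch of that step, the snippet below derives a delay column in seconds from the two arrival columns. The sample rows and the timestamp format are assumptions made for illustration; only the column names expected_arrival and actual_arrival come from the dataset described above.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset's timestamp format may differ.
df = pd.DataFrame(
    {
        "departure": ["Zürich Flughafen"] * 3,
        "arrival": ["Zürich HB"] * 3,
        "expected_arrival": [
            "2022-01-10 08:12:00",
            "2022-01-10 08:42:00",
            "2022-01-10 09:12:00",
        ],
        "actual_arrival": [
            "2022-01-10 08:12:45",
            "2022-01-10 08:41:30",
            "2022-01-10 09:14:00",
        ],
    }
)

# Parse both columns as timestamps so they can be subtracted.
expected = pd.to_datetime(df["expected_arrival"])
actual = pd.to_datetime(df["actual_arrival"])

# Delay in seconds: positive means late, negative means early.
df["delay"] = (actual - expected).dt.total_seconds()

print(df["delay"].tolist())  # [45.0, -30.0, 120.0]
```

Keeping early arrivals as negative delays, rather than clipping them to zero, preserves the full shape of the distribution for the fit later on.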

For the sake of simplicity, let’s consider only the connection departing from “Zürich Flughafen” and arriving at “Zürich HB”.

# Keep only the rows for the Zürich Flughafen -> Zürich HB connection
delay_connection = df[(df["departure"] == "Zürich Flughafen") & (df["arrival"] == "Zürich HB")]

3. Distributions

Now we can show the distribution of delays for this connection.

Distribution of delays for the selected connection

The distribution looks roughly normal, possibly log-normal. The distribution of delays has been studied extensively; for more detail, I suggest looking at [1]. In this post, we will assume that the distribution is normal.
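One quick way to sanity-check the normal assumption is to look at the sample skewness, which is close to 0 for symmetric data and positive when there is a long right tail, as in a log-normal shape. The sketch below uses synthetic samples rather than the Swiss dataset; all parameters are hypothetical.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic stand-ins for a delay column, in seconds (not real data).
symmetric_delays = rng.normal(loc=10.0, scale=55.0, size=5000)
# Right-skewed alternative: log-normal lateness shifted by a hypothetical offset.
skewed_delays = rng.lognormal(mean=4.0, sigma=0.6, size=5000) - 50.0

print(f"normal-like sample skewness: {skew(symmetric_delays):.2f}")
print(f"log-normal-like sample skewness: {skew(skewed_delays):.2f}")
```

On real data, a skewness far from 0 would suggest that the normal model underestimates the probability of long delays.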

4. Calculating distribution and probabilities

Now, we need to model our distribution to determine the probability that a delay will occur. We start by getting the mean and standard deviation of delays on a given connection.

mean_delay = delay_connection.delay.mean()
std_delay = delay_connection.delay.std()

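We can now fit a probability distribution. The normal distribution is defined by its mean $\mu$ and variance $\sigma^2$,

$$X \sim \mathcal{N}(\mu, \sigma^2),$$

and the probability density function is defined as

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$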
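To make the density concrete, the following sketch writes the normal PDF out by hand and compares it with SciPy's norm.pdf. The values of mu and sigma are hypothetical and stand in for mean_delay and std_delay.

```python
import math
from scipy.stats import norm

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Hypothetical parameters in seconds, chosen only for illustration.
mu, sigma = 10.0, 55.0

manual = normal_pdf(30.0, mu, sigma)
library = norm.pdf(30.0, loc=mu, scale=sigma)

print(math.isclose(manual, library, rel_tol=1e-9))  # True
```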

Now we will use the SciPy stats library to create the normal cumulative distribution function (CDF), shown below alongside the PDF.

CDF and PDF for the current connection

In SciPy, the mean of the normal distribution is passed as the loc argument and the standard deviation as scale.

from scipy.stats import norm
# norm.cdf(x, loc=mean_delay, scale=std_delay)
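As a rough illustration of how loc and scale are used, the sketch below evaluates the CDF at several delay thresholds in one call. The values assigned to mean_delay and std_delay here are hypothetical placeholders, not the ones computed from the dataset.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical values in seconds; in the article these come from the data.
mean_delay, std_delay = 10.0, 55.0

# norm.cdf accepts arrays, so several thresholds can be evaluated at once.
thresholds = np.array([-60.0, 0.0, 60.0, 120.0])
cdf_values = norm.cdf(thresholds, loc=mean_delay, scale=std_delay)

for t, p in zip(thresholds, cdf_values):
    print(f"P(delay <= {t:>6.0f} s) = {p:.3f}")
```

The CDF rises from 0 to 1 as the threshold grows and equals 0.5 exactly at the mean.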

Now, what is the probability of having a delay for this connection? We replace x with our delay threshold: any delay greater than 0 seconds counts as late (the orange area below).

Distribution of delays for the selected connection

Since we are looking for P(X > 0) and the CDF gives us P(X ≤ x), we have to compute 1 - P(X ≤ 0).

from scipy.stats import norm

proba = norm.cdf(0, loc=mean_delay, scale=std_delay)
final_proba = 1 - proba
# final_proba: 0.5641598241281299
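The complement of the CDF is also available directly in SciPy as the survival function, norm.sf, which avoids the manual subtraction. The sketch below checks that both routes agree; the parameter values are hypothetical.

```python
from scipy.stats import norm

# Hypothetical values in seconds, standing in for the fitted parameters.
mean_delay, std_delay = 10.0, 55.0

via_cdf = 1 - norm.cdf(0, loc=mean_delay, scale=std_delay)
via_sf = norm.sf(0, loc=mean_delay, scale=std_delay)  # P(X > 0) directly

print(abs(via_cdf - via_sf) < 1e-12)  # True
```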

As we can see, there is a 56% probability of a delay on this connection. Of course, this counts any delay greater than 0 seconds. To refine the analysis, we can ask for the probability of a delay of more than 1 minute.

Delay bigger than 1 minute
from scipy.stats import norm

proba = norm.cdf(60, loc=mean_delay, scale=std_delay)
final_proba = 1 - proba
# final_proba: 0.1774974181545348

As we can see, the probability is now only about 18%.
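The same calculation can be wrapped in a small helper that takes any threshold. The sketch below is one possible way to do it, not the code used in the project: the function name delay_probability is hypothetical, and the delays are synthetic. It also compares the model with the empirical share of delays above each threshold, which is a simple check on the normal assumption.

```python
import numpy as np
from scipy.stats import norm

def delay_probability(delays, threshold_seconds):
    """P(delay > threshold) under a normal fit to the observed delays."""
    delays = np.asarray(delays, dtype=float)
    mu = delays.mean()
    sigma = delays.std(ddof=1)  # sample standard deviation, like pandas .std()
    return norm.sf(threshold_seconds, loc=mu, scale=sigma)

# Synthetic delays in seconds (not real data), for illustration only.
rng = np.random.default_rng(42)
delays = rng.normal(loc=10.0, scale=55.0, size=10_000)

for threshold in (0, 60, 120):
    model = delay_probability(delays, threshold)
    empirical = np.mean(delays > threshold)
    print(f"> {threshold:>3d} s: model {model:.3f}, empirical {empirical:.3f}")
```

On real data, a large gap between the two columns would indicate that another distribution, such as the log-normal, describes the delays better.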

5. Conclusion

With this post, I wanted to show you how you can apply data science to model a straightforward problem such as delay probability on a real dataset. We started by gathering the delays on a given connection, computed their mean and standard deviation, and finally used the cumulative distribution function of the fitted normal distribution to find the probability of a delay.

The idea of writing about probability distributions in delays comes from a project I did in a course at EPFL in which we had to build a stochastic route planner for the Zürich area (Switzerland). You can see the app we built at this link and our GitHub repository here. If you have any questions or comments, feel free to connect with me on LinkedIn.
