
Public transport and open data are a combination with tremendous potential. Timetables, disruptions, routes: it is all there in the public domain, ready to be used for all kinds of applications. This time we will look at the available real-time information in the Netherlands. The real-time data is available in GTFS Realtime format at ovapi.nl (for the Netherlands).
Warning: it will take some work to get everything up and running and to implement a first use case.
General Transit Feed Specification
GTFS is a standard for sharing public transportation schedules, including the associated geographical information. It consists of two parts: the static specification for scheduled information on transport services (GTFS Static) and the real-time status information (GTFS Realtime).
The standard originates from 2005, when efforts were made to integrate public transport services into Google Maps. At that time there was no standardized format for sharing timetable information. Originally, the G of GTFS stood for Google, but to increase adoption it was changed to General.
All details of the GTFS standard can be found on Google Transit pages and on gtfs.org.
GTFS Static
The latest version of the static information is available from OVapi here and is always named gtfs-nl.zip. It changes, on average, every three or four days. There is also an archive of previous versions of the GTFS file in the folder ‘archive‘.
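Fetching the file can also be scripted. A minimal sketch with requests (the exact download URL is an assumption based on the ovapi.nl host used for the realtime feeds later in this article):

import requests

# Hypothetical URL; check OVapi for the actual location of gtfs-nl.zip
url = 'https://gtfs.ovapi.nl/gtfs-nl.zip'
response = requests.get(url, allow_redirects=True)
with open('gtfs-nl.zip', 'wb') as f:
    f.write(response.content)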
The zipped file contains the following data files:
- agency.txt – The list of agencies for which transit data is provided
- routes.txt – All transit routes. A route is a group of trips and is seen as a single service by the customer. Examples are a bus line (line 5 in Amsterdam to Westergasfabriek) or a train service (series 3300 is the local train between Hoorn Kersenboogerd and Den Haag Centraal).
- trips.txt – All transit trips. A trip is one bus/train on a route, connecting two or more stop locations. Stop locations can differ per trip (e.g. skipping specific stations). A trip belongs to one route.
- calendar_dates.txt – A table linking dates to services. For each date an entry is present with the service IDs of all services running that day. The GTFS standard uses this file as an exception file for the service patterns (e.g. a weekly pattern) specified in the optional file calendar.txt. This GTFS provider only uses calendar_dates to map services to dates. A service is one or more trips and is defined by the service ID in the trips specification.
- stops.txt – All stop locations. This can be a bus stop or a train station. Stops are defined at the platform level and combined into locations in the shape of stop areas. A train station has one stop for each platform and one stop area (the station).
- feed_info.txt – Generic feed info like source, versioning and validity period
- shapes.txt – For each route, a list of geographical locations (lat, lon) to draw the transit service on a map. A trip is associated with a shape, as individual trips on a route can have different paths.
- stop_times.txt – For each stop on each trip, the arrival and departure time. This is the largest file of the dataset (1 GB of data).
- transfers.txt – List of all possible transfers between two stop locations, e.g. one platform to another platform at the same station.
The image below shows the used parts of the GTFS Static standard and their relations:

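As a quick illustration, the text files in the zip can be read directly with pandas. A minimal sketch (file names from the list above, no dtype tuning):

import zipfile
import pandas as pd

# Read two of the static files straight from the downloaded archive
with zipfile.ZipFile('gtfs-nl.zip') as z:
    with z.open('routes.txt') as f:
        routes = pd.read_csv(f)
    with z.open('trips.txt') as f:
        trips = pd.read_csv(f)
print(len(routes), len(trips))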
GTFS Realtime
The second part of the standard specifies the way real-time information is provided. The specification uses Protocol Buffers, a language- and platform-independent mechanism for serializing structured data. It is a Google standard with bindings for several languages like C#, Go, Java and Python. Details of the GTFS Realtime standard can be found on the site of Google Transit.
GTFS Realtime for the Netherlands uses the three feed types defined by GTFS, plus an additional feed for train updates:
- Trip Updates – Updates on trips. For each active trip exactly one update is available. If there is no update message for a specific trip, the assumption is that the trip is not running.
- Vehicle Positions – If available (depends on vehicle), the current location of a vehicle on a trip. It provides information on the next stop and current delay of the vehicle on this specific trip.
- Service Alerts – A service alert is generated for each disruption in the network. If a disruption leads to cancellations and/or delays, these are communicated as Trip Updates and Vehicle Positions.
- Train Updates – Updates on trains, comparable to the trip updates but only for trains. This feed is not part of the default GTFS Realtime specification. It provides updates on arrival and departure times and scheduled tracks. For each stop in a trip, the updates are part of the message.
Decoding Protocol Buffers
To start using protocol buffers we need the protoc tool from Github. The latest version can be found here; find the protoc-<release>-<platform>.zip and download it.
Download the Protocol Buffer definitions from OVapi. You need both gtfs-realtime.proto and gtfs-realtime-OVapi.proto. The latter contains the OVapi-specific extensions.

You can also download the latest protocol buffer files from this location, named tripUpdates.pb, vehiclePositions.pb, alerts.pb and trainUpdates.pb.
When all files are placed in the same directory, the protoc tool can be used to decode the protocol buffer messages:
protoc --decode=transit_realtime.FeedMessage *.proto < vehiclePositions.pb
which gives the decoded contents of vehiclePositions.pb:
header {
  gtfs_realtime_version: "1.0"
  incrementality: FULL_DATASET
  timestamp: 1672668285
  1000 {
    1: 1193795
    2: 60
  }
}
entity {
  id: "2023-01-02:QBUZZ:g309:8149"
  vehicle {
    trip {
      trip_id: "161300003"
      start_time: "14:38:00"
      start_date: "20230102"
      schedule_relationship: SCHEDULED
      route_id: "2626"
      direction_id: 0
      [transit_realtime.ovapi_tripdescriptor] {
        realtime_trip_id: "QBUZZ:g309:8149"
      }
    }
    position {
      latitude: 53.1998672
      longitude: 6.56498432
    }
    current_stop_sequence: 7
    current_status: IN_TRANSIT_TO
    timestamp: 1672668264
    stop_id: "2464829"
    vehicle {
      label: "7602"
    }
    [transit_realtime.ovapi_vehicle_position] {
      delay: 38
    }
  }
}
...
After the header, an entry follows for each entity (only one is shown here; the file contains approximately 3,200 entries) with the update information for a specific vehicle, in a JSON-like style. The transit_realtime.ovapi* fields are the OVapi-specific data fields. The header specifies whether the file is an increment or a full set; this source always returns a full dataset. This data stream contains information on all forms of public transport, except for trains.
In the trip updates feed, each active trip is returned as an entity. Within the trip, all stops are listed with a stop_time_update (repeated for all stops in a trip). Each update contains around 1,600 entities (depending on time, day of week and holiday season) and a total of 50,000 stop time updates. This data stream also contains updates on trains, which are not in the vehicle positions, but it lacks the current geographical locations.
The numbers above show that we are dealing with serious data streams, both in size and in frequency (full updates are published every minute).
Reading protocol buffers in Python
The next step is to read the protocol buffers in Python. After reading different blogs and websites this seemed a straightforward process, but in practice it is not that simple. Several attempts were needed, with different combinations of Python and package versions. The following combination works for me:
python 3.8.5
protobuf 3.20.1
protobuf3-to-dict 0.1.5
gtfs-realtime-bindings 0.0.7
The full install of the packages used in this article:
pip install protobuf==3.20.1 gtfs-realtime-bindings==0.0.7 \
            protobuf3-to-dict==0.1.5 requests simplejson \
            pandas geopandas folium urllib3
Now we can compile the protocol buffer definitions to the required python files:
protoc --python_out=. *.proto
This will generate two files: gtfs_realtime_pb2.py and gtfs_realtime_OVapi_pb2.py.
If you are working in an Anaconda environment with Jupyter notebooks, it might be required to install protobuf using conda:
conda install protobuf
ipython kernel install --user
Linux environments are simpler, but require libprotobuf to be installed:
sudo apt install python3-protobuf
This part involves some hassle and does not always feel predictable, but once it is running, you are good to go!
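A quick sanity check that the compiled bindings and the pinned protobuf version work together (assuming the generated files are in the current directory):

import google.protobuf
import gtfs_realtime_pb2
import gtfs_realtime_OVapi_pb2

# Should print 3.20.1 when the pinned version is active
print(google.protobuf.__version__)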
Parsing protocol buffer messages
Now we are able to decode the protocol buffers in Python:
import requests
import gtfs_realtime_OVapi_pb2  # Required for finding additional fields
import gtfs_realtime_pb2
from protobuf_to_dict import protobuf_to_dict

feed = gtfs_realtime_pb2.FeedMessage()

response = requests.get('https://gtfs.ovapi.nl/nl/vehiclePositions.pb',
                        allow_redirects=True)
feed.ParseFromString(response.content)
vehiclePositions = protobuf_to_dict(feed)
print("Vehicle positions : {}".format(len(vehiclePositions['entity'])))

response = requests.get('https://gtfs.ovapi.nl/nl/trainUpdates.pb',
                        allow_redirects=True)
feed.ParseFromString(response.content)
trainUpdates = protobuf_to_dict(feed)
print("Train updates     : {}".format(len(trainUpdates['entity'])))

response = requests.get('https://gtfs.ovapi.nl/nl/tripUpdates.pb',
                        allow_redirects=True)
feed.ParseFromString(response.content)
tripUpdates = protobuf_to_dict(feed)
print("Trip updates      : {}".format(len(tripUpdates['entity'])))

response = requests.get('https://gtfs.ovapi.nl/nl/alerts.pb',
                        allow_redirects=True)
feed.ParseFromString(response.content)
alerts = protobuf_to_dict(feed)
print("Alerts            : {}".format(len(alerts['entity'])))
This will result in four Python dictionaries, containing the real-time updates from the four different protocol buffer streams.
Parse data to dataframes
Panda dataframes have a dictionary converter built into the constructor, but this only works well with dictionaries with one level of data and no nested structures. There are tools like flatten_jon
that can help in the process, but it is complex to realize and slow to execute.
The files have the following structure:
{
  "header": {
    "...": "..."
  },
  "entity": [
    {
      "A": "A1",
      "B": "B1",
      "C": [
        {"C_A": "CA1"},
        {"C_A": "CA2"}
      ]
    },
    {
      "A": "A2",
      "B": "B2",
      "C": [
        {"C_A": "CA3"},
        {"C_A": "CA4"}
      ]
    }
  ]
}
This will be translated to:
A | B | C_A |
-------------------
A1 | B1 | CA1 |
A1 | B1 | CA2 |
A2 | B2 | CA3 |
A2 | B2 | CA4 |
After some experimentation, it seems best to manually write code that converts the nested dictionaries to new one-level dictionaries and then converts these to a dataframe. The protobuf3-to-dict package (source) is used to convert the protocol buffer to a Python dictionary first. This dictionary has the same nested structure as the original protocol buffer.
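As a minimal sketch of this manual flattening, applied to the toy structure above (the names A, B and C come from that example, not from the GTFS feeds; pandas is assumed to be imported as pd):

def flatten(message):
    rows = []
    for entity in message['entity']:
        # One output row per nested child, repeating the parent fields
        for child in entity.get('C', []):
            rows.append({'A': entity['A'], 'B': entity['B'],
                         'C_A': child['C_A']})
    return pd.DataFrame(rows)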
Alerts
The least complex buffer is the alerts buffer (after conversion to a dictionary):
{
  "header": {
    "gtfs_realtime_version": "1.0",
    "incrementality": 0,
    "timestamp": 1672851585
  },
  "entity": [
    {
      "id": "KV15:RET:2015-05-12:53",
      "alert": {
        "active_period": [{"start": 1431470580, "end": 1704048875}],
        "informed_entity": [{"stop_id": "1541226"}],
        "cause": 1,
        "effect": 7,
        "header_text": {
          "translation": [
            {
              "text": "Rotterdam Airport: bus 33 richting Meijersplein - bus 33 direction Meijersplein.",
              "language": "nl"
            }
          ]
        },
        "description_text": {
          "translation": [
            {
              "text": "Oorzaak : onbekend\nRotterdam Airport: bus 33 richting Meijersplein - bus 33 direction Meijersplein.\n",
              "language": "nl"
            }
          ]
        }
      }
    }
  ]
}
Structure-wise, the entity array needs to be flattened to get one row per validity period per alert. On the data level, a conversion is needed from UNIX timestamps (the field timestamp in the header and the start and end fields in the active_period) to datetime objects. The cause and effect fields are enumerations specified in the GTFS specification.
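These enumerations can be mapped with plain dictionaries. A sketch based on the Cause and Effect enums in the GTFS Realtime specification (the 0 and -1 keys are fallbacks for absent fields, matching the defaults used in the code below):

causes = {0: 'UNKNOWN_CAUSE', 1: 'UNKNOWN_CAUSE', 2: 'OTHER_CAUSE',
          3: 'TECHNICAL_PROBLEM', 4: 'STRIKE', 5: 'DEMONSTRATION',
          6: 'ACCIDENT', 7: 'HOLIDAY', 8: 'WEATHER', 9: 'MAINTENANCE',
          10: 'CONSTRUCTION', 11: 'POLICE_ACTIVITY', 12: 'MEDICAL_EMERGENCY'}
effects = {-1: 'UNKNOWN_EFFECT', 1: 'NO_SERVICE', 2: 'REDUCED_SERVICE',
           3: 'SIGNIFICANT_DELAYS', 4: 'DETOUR', 5: 'ADDITIONAL_SERVICE',
           6: 'MODIFIED_SERVICE', 7: 'OTHER_EFFECT', 8: 'UNKNOWN_EFFECT',
           9: 'STOP_MOVED'}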
A utility function is written to convert a timestamp column to a column with datetime objects. All UNIX timestamps are in UTC, so a conversion to the local time in the Netherlands is required:
import pandas as pd

def convert_times(df, columns):
    for c in columns:
        # Interpret UNIX seconds as UTC, then convert to Dutch local time
        df[c] = pd.to_datetime(df[c], unit='s', utc=True) \
                  .map(lambda x: x.tz_convert('Europe/Amsterdam'))
        # Drop the timezone info to get naive local datetimes
        df[c] = df[c].apply(lambda x: x.replace(tzinfo=None))
    return df
Now it is time to convert the alerts dictionary to a dataframe with one alert per active period per row:
updates = []
timestamp = alerts['header']['timestamp']
causes = {0: 'UNKNOWN_CAUSE', ...}
effects = {...}
for al in alerts['entity']:
    aid = al['id']
    alert = al['alert']
    cause = int(alert['cause']) if 'cause' in alert else 0
    effect = int(alert['effect']) if 'effect' in alert else -1
    header_text = alert['header_text']['translation'][0]['text']
    description_text = alert['description_text']['translation'][0]['text']
    for ap in alert['active_period']:
        start = ap['start']
        end = ap['end']
        updates.append({'id': aid, 'timestamp': timestamp,
                        'cause': causes[cause], 'effect': effects[effect],
                        'start': start, 'end': end,
                        'header': header_text,
                        'description': description_text})
df_alerts = pd.DataFrame(updates)
df_alerts = convert_times(df_alerts, ['timestamp', 'start', 'end'])
The result is the following dataframe:

Cause and effect are optional fields, so a check is needed to see whether they are part of the dictionary. This overview of alerts needs to be related to the stops and routes the alert influences. Two separate tables are created to couple routes and stops to alerts:
routemapping = []
stopmapping = []
...
    for ap in alert['active_period']:
        start = ap['start']
        end = ap['end']
        if 'informed_entity' in alert:
            for inf in alert['informed_entity']:
                if 'stop_id' in inf:
                    stopmapping.append({'alert_id': aid,
                                        'stop_id': inf['stop_id'],
                                        'start': start, 'end': end})
                if 'route_id' in inf:
                    routemapping.append({'alert_id': aid,
                                         'route_id': inf['route_id'],
                                         'start': start, 'end': end})
        updates.append(...)
df_alerts_to_stops = pd.DataFrame(stopmapping)
df_alerts_to_stops = convert_times(df_alerts_to_stops, ['start', 'end'])
df_alerts_to_routes = pd.DataFrame(routemapping)
df_alerts_to_routes = convert_times(df_alerts_to_routes, ['start', 'end'])
With the result:

Trip Updates
The next step is the conversion of the trip updates. There are some optional fields, like arrival and departure time, some timestamps, some additional fields and something special about the start time of a trip: the so-called business day. The hours field does not run from 00 to 23 but from 00 to 27. The business day in public transport is 28 hours, running till 4:00 in the morning. If a trip is technically part of the previous day, its hours are extended beyond 23, up to 27. If a trip belongs to the actual day, the hour is 00 to 04.
For our purposes, we recalculate the business day to a normal 24-hour day. This implies that when the hour is greater than 23, we subtract 24 from the hours and add one day to the date, moving the trip to the first four hours of the next day:
from datetime import datetime, timedelta

def businessday_to_datetime(date: str, time: str):
    try:
        res = datetime.strptime(date, '%Y%m%d')
        hr = int(time[:2])
        if hr >= 24:
            # Business-day hours 24..27 belong to the next calendar day
            res = res + timedelta(days=1)
            hr -= 24
        res = res + timedelta(hours=hr, minutes=int(time[3:5]),
                              seconds=int(time[6:8]))
        return res
    except (ValueError, TypeError):
        return None
This method takes a date (string format ‘20230131’) and a time (string format ’13:23:45′) and converts them to a datetime object with a ‘normal’ 24-hour-based date and time.
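For example, a trip starting at business-day hour 25 on January 2nd is mapped to 01:15 on January 3rd:

businessday_to_datetime('20230102', '25:15:00')
# -> datetime.datetime(2023, 1, 3, 1, 15)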
The additional fields added by OVapi are parsed by the Protocol Buffer code, but are not replaced with their human-readable names. I have not been able to parse the buffers and have the field…
[transit_realtime.ovapi_tripdescriptor] {
  realtime_trip_id: "ARR:26004:1125"
}
…parsed with its name. The result is always:
'___X': {'1003': {'realtime_trip_id': 'KEOLIS:4062:40462'}}
These keys must be used to find the realtime_trip_id in the dictionary.
It is now possible to convert the trip updates to a dataframe:
rtid_keys = ['___X', '1003']
updates = []
timestamp = tripUpdates['header']['timestamp']
for tu in tripUpdates['entity']:
    uid = tu['id']
    trip_update = tu['trip_update']
    vehicle = (trip_update['vehicle']['label']
               if 'vehicle' in trip_update else None)
    trip = trip_update['trip']
    trip_id = trip['trip_id']
    start_time = trip['start_time'] if 'start_time' in trip else None
    start_date = trip['start_date']
    start_time = businessday_to_datetime(start_date, start_time)
    route_id = trip['route_id']
    direction_id = (int(trip['direction_id'])
                    if 'direction_id' in trip else None)
    rt_id = (trip[rtid_keys[0]][rtid_keys[1]]['realtime_trip_id']
             if rtid_keys[0] in trip else None)
    for stu in (trip_update['stop_time_update']
                if 'stop_time_update' in trip_update else []):
        stop_sequence = stu['stop_sequence']
        if 'arrival' in stu:
            arr = stu['arrival']
            arrival_time = arr['time'] if 'time' in arr else None
            arrival_delay = arr['delay'] if 'delay' in arr else None
        else:
            arrival_time = None
            arrival_delay = None
        if 'departure' in stu:
            dep = stu['departure']
            departure_time = dep['time'] if 'time' in dep else None
            departure_delay = dep['delay'] if 'delay' in dep else None
        else:
            departure_time = None
            departure_delay = None
        updates.append({'id': uid, 'RT_id': rt_id, 'trip_id': trip_id,
                        'start_time': start_time, 'route_id': route_id,
                        'direction_id': direction_id, 'vehicle': vehicle,
                        'stop_sequence': stop_sequence,
                        'arrival_time': arrival_time,
                        'arrival_delay': arrival_delay,
                        'departure_time': departure_time,
                        'departure_delay': departure_delay,
                        'timestamp': timestamp})
df_trip_updates = pd.DataFrame(updates)
df_trip_updates = convert_times(df_trip_updates, ['departure_time',
                                                  'arrival_time',
                                                  'timestamp'])
df_trip_updates.head(2)
and the resulting dataframe:

Performance
The dataframe creation is an expensive task. Creating the array of dictionaries is efficient, but the DataFrame constructor takes a significant amount of time. In simplified form, the dataframe is created in this format:
pd.DataFrame([{'a': 1, 'b': 2, 'c': 'c1'},
              {'a': 11, 'b': 12, 'c': 'c2'}])
It is faster to create a separate array per column and pass these to the constructor:
pd.DataFrame({'a': [1, 11],
              'b': [2, 12],
              'c': ['c1', 'c2']})
Both implementations result in the same dataframe. In the first implementation, for each row added the columns need to be matched on column name; the alternative version prevents this. It requires more code, but the dataframe creation is about 10 to 20 times faster. The implementation on Github uses the alternative form.
The implementation on Github also ensures that all IDs are integers instead of strings, to improve lookup and merge performance.
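A minimal sketch of such a cast (the column name is taken from the trip updates dataframe above; the nullable Int64 dtype is an assumption to cope with missing values):

df_trip_updates['trip_id'] = pd.to_numeric(df_trip_updates['trip_id'],
                                           errors='coerce').astype('Int64')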
Data model
Parsing of the vehicle positions and train updates is more or less the same; see the source code on Github. The GTFS Realtime data refers to the trips, routes and stops from the GTFS Static data. The relationships are as follows (white is static, orange is realtime):

The calendar, shapes and transfers are not shown to keep the image clear.
The code on Github consists of a GTFS class and a notebook with some example usages. The GTFS class contains both the static and the real-time information, which can be updated independently. Some caching is added to prevent unnecessary parsing of the input files. This is especially useful for the static data, where parsing the stop times is time-consuming as the file contains more than 14 million entries. The cached version is filtered for a specific day and contains around 2 million rows.
The class is used as follows:
from GTFS import GTFS
gtfs = GTFS()
gtfs.update_static()
gtfs.update_realtime(5)
The static GTFS file is cached for a week, the parsed contents for a few hours and a maximum of a day. The protocol buffer files are also cached for a few minutes to improve performance during development. The latter can be overridden by specifying the maximum age in minutes as a parameter of the update_realtime method.
The downloaded and parsed data is stored in the class:
class GTFS:
    stops = ...
    routes = ...
    trips = ...
    calendar = ...
    stoptimes = ...
    trip_updates = ...
    train_updates = ...
    vehicle_positions = ...
    alerts = ...
    alerts_to_routes = ...
    alerts_to_stops = ...
A few of them will be used in the following example. Other use cases might need the other information.
Plotting stops and the actual locations
The dataset contains all the public transport stops in the Netherlands. It can be used to generate a heatmap of stops in the country. The folium package is used, with its HeatMap plugin, to generate the heatmap.
import folium
from folium import plugins

heat_data = [[point.xy[1][0], point.xy[0][0]] for point in gtfs.stops.geometry]
map = folium.Map(location=[52.0, 5.1], zoom_start=8,
                 width=1000, height=1500, tiles="Cartodb dark_matter")
plugins.HeatMap(heat_data, radius=3, blur=1).add_to(map)
map

Please note that this heatmap only shows the number of stops as they are distributed over the country. In this plot, a stop with only one stopping bus a day weighs the same as a stop with a bus every 5 minutes. It is an indication of potential accessibility, not of the actual level of service.
With the GTFS Realtime information, the locations of all vehicles can be added (except trains; the train updates do not contain locations). For this, we zoom in on a region, in this case the city of Utrecht.
bbox = ((4.99, 52.05), (5.26, 52.15))  # Utrecht
gdf_veh = gtfs.vehicle_positions.cx[bbox[0][0]:bbox[1][0],
                                    bbox[0][1]:bbox[1][1]]
gdf_halte = gtfs.stops.cx[bbox[0][0]:bbox[1][0], bbox[0][1]:bbox[1][1]]
map = folium.Map(location=[(bbox[0][1] + bbox[1][1]) / 2,
                           (bbox[0][0] + bbox[1][0]) / 2],
                 zoom_start=12)
for _, h in gdf_halte.iterrows():
    marker = folium.CircleMarker(location=[h["stop_lat"], h["stop_lon"]],
                                 popup=h["stop_name"],
                                 radius=1, color='blue')
    map.add_child(marker)
for _, v in gdf_veh.iterrows():
    marker = folium.CircleMarker(location=[v["latitude"], v["longitude"]],
                                 popup=v['route_short_name'] + " to " +
                                       v['trip_headsign'],
                                 radius=5, color='red')
    map.add_child(marker)
map
First, the bounding box of the region is defined. The stops and vehicle_positions dataframes are GeoDataFrames, so it is possible to filter them on the bounding box with the cx indexer. This filters the dataframes so that all remaining rows fall within the specified bounding box.
A folium map is created at the center of the bounding box, with a zoom factor that shows only the region of the bounding box. Then, for all stops a small blue circle is drawn on the map, and for all vehicles a red circle. Each stop has a popup with the stop name, and each vehicle a popup with the line number and direction. This results in the following map:

This map was created during the afternoon of a weekday. The same map during a weekend evening shows fewer vehicles on the road:

By merging the actual data with the static stops, stop times, trips and routes, all information is available per item on the map.
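A sketch of such a merge (dataframe and column names follow the GTFS files described earlier; whether the class exposes exactly these frames is an assumption):

# Enrich vehicle positions with the route name and headsign of their trip
gdf = gtfs.vehicle_positions.merge(
    gtfs.trips[['trip_id', 'route_id', 'trip_headsign']],
    on='trip_id', how='left')
gdf = gdf.merge(gtfs.routes[['route_id', 'route_short_name']],
                on='route_id', how='left')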
Final words
It took some real effort to reach the goal of this article: a map with the current positions of public transport vehicles. We needed to parse all static information, decode protocol buffers and combine all this information. The final map is then made in a breeze.
But all this hard work can be re-used for other purposes: creating dynamic departure lists for a stop, an up-to-date travel planner, tools to help us in case of disruptions, and so on.
To be honest, the work was not always enjoyable; especially getting the protocol buffers to work took some effort, and the performance needed a lot of tweaking. But I am happy with the end result. The final class and notebook can be found on Github.
I hope you enjoyed this article. For more inspiration, check some of my other articles:
- Summarize a text in Python – continued
- Using Eurostat statistical data on Europe with Python
- Solar panel power generation analysis
- Perform a function on columns in a CSV file
- Create a heatmap from the logs of your activity tracker
- Parallel web requests with Python
If you like this story, please hit the Follow button!
Disclaimer: The views and opinions included in this article belong only to the author.