
Where is the bus? GTFS will tell us!

Show the real-time locations of public transport vehicles in the Netherlands based on GTFS real-time data.

Map of actual locations public transport vehicles in Utrecht (image by author)

Public transport and open data form a combination with tremendous potential. Timetables, disruptions, routes: it is all there in the public domain, ready to be used for all kinds of applications. This time we will look at the available real-time information in the Netherlands. The data is published in GTFS Realtime format at ovapi.nl (for the Netherlands).

Warning: it will take some work to get it up and running and to implement a first use case.

General Transit Feed Specification

GTFS is a standard for sharing public transportation schedules, including the associated geographical information. It consists of two parts: the static specification for scheduled information on transport services (GTFS Static) and the real-time status information (GTFS Realtime).

The standard originates from 2005, when efforts were made to integrate public transport services into Google Maps. At that time there was no standardized format for sharing timetable information. Originally, the G of GTFS stood for Google, but to increase adoption it was changed to General.

All details of the GTFS standard can be found on Google Transit pages and on gtfs.org.

GTFS Static

The latest version of the static information is available from OVapi here and is always named gtfs-nl.zip. It changes, on average, every three or four days. There is also an archive of previous versions of the GTFS file in the folder ‘archive‘.

The zipped file contains the following data files:

  • agency.txt – The list of agencies for which transit data is provided
  • routes.txt – All transit routes. A route is a group of trips and is seen as a single service by the customer. Examples are a bus line (line 5 in Amsterdam to Westergasfabriek) or a train service (series 3300 is the local train between Hoorn Kersenboogerd and Den Haag Centraal).
  • trips.txt – All transit trips. A trip is one bus/train on a route, connecting two or more stop locations. Stop locations can differ per trip (e.g. skipping specific stations). A trip belongs to one route.
  • calendar_dates.txt – A table linking dates to services. For each date, an entry is present for each service running that day. The GTFS standard uses this file as an exception file for the service patterns (e.g. weekly patterns) specified in the optional file calendar.txt. This GTFS provider only uses calendar_dates to map services to dates. A service is one or more trips and is defined by the service ID in the trips specification.
  • stops.txt – All stop locations. A stop can be a bus stop or a train station platform. Stops are defined at platform level and combined into locations in the form of stop areas. A train station has one stop for each platform and one stop area (the station).
  • feed_info.txt – Generic feed info like source, versioning and validity period
  • shapes.txt – For each route, a list of geographical locations (lat, lon) to draw the transit service on a map. A trip is associated with a shape, as individual trips on a route can follow different paths.
  • stop_times.txt – For each stop on each trip the arrival and departure time. The largest file of the dataset (1 GB of data).
  • transfers.txt – List of all possible transfers between two stop locations, e.g. one platform to another platform at the same station.
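To get a feel for these files, they can be read straight from the zip with pandas. The sketch below builds a tiny in-memory stand-in for gtfs-nl.zip (the file names and columns follow the GTFS spec; the sample values are made up) and joins trips to their routes:

```python
import io
import zipfile
import pandas as pd

# Tiny in-memory stand-in for gtfs-nl.zip; with the real feed you would
# open "gtfs-nl.zip" from disk instead. Sample values are made up.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("routes.txt", "route_id,route_short_name\n2626,5\n")
    z.writestr("trips.txt", "trip_id,route_id,service_id\n161300003,2626,1\n")

with zipfile.ZipFile(buf) as z:
    routes = pd.read_csv(z.open("routes.txt"))
    trips = pd.read_csv(z.open("trips.txt"))

# Every trip belongs to exactly one route via route_id.
df = trips.merge(routes, on="route_id")
print(df[["trip_id", "route_short_name"]])
```

For the real feed, stop_times.txt is by far the largest file; passing usecols and explicit dtypes to read_csv keeps memory usage manageable.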

The image below shows the used parts of the GTFS Static standard and their relations:

GTFS Static for dutch public transport (image by author)

GTFS Realtime

The second part of the standard specifies the way real-time information is provided. The specification uses Protocol Buffers, a language- and platform-independent mechanism for serializing structured data. It is a Google standard with bindings for several languages like C#, Go, Java and Python. Details of the GTFS Realtime standard can be found on the site of Google Transit.

The GTFS Realtime for The Netherlands uses the three feed types defined by GTFS with an additional feed for train updates:

  • Trip Update – Updates on trips. For each active trip one, and no more than one, update is available. If there is no update message for a specific trip, the assumption is that the trip is not running.
  • Vehicle Positions – If available (depends on vehicle), the current location of a vehicle on a trip. It provides information on the next stop and current delay of the vehicle on this specific trip.
  • Service Alerts – A service alert is generated for each disruption in the network. If a disruption leads to cancellations and/or delays, these are communicated as Trip Updates and Vehicle Positions.
  • Train updates – Updates on trains, comparable with the trip update but only for the trains. This feed is not part of the default GTFS Realtime specification. It provides updates on arrival and departure times and scheduled tracks. For each stop in a trip the updates are part of the message.

Decoding Protocol Buffers

To start using protocol buffers we need the protoc tool from GitHub. The latest version can be found here. Find the protoc-<release>-<platform>.zip and download it.

Download the Protocol Buffer definitions from OVapi. You need both gtfs-realtime.proto and gtfs-realtime-OVapi.proto. The latter contains the specific OVapi extensions.

GTFS Realtime data from OVapi (screenshot from the website of OVapi)

You can also download the latest protocol buffers from this location, named tripUpdates.pb, vehiclePositions.pb, alerts.pb and trainUpdates.pb.

When all files are placed in the same directory, it is possible to use the protoc tool to decode the protocol buffer messages:

protoc --decode=transit_realtime.FeedMessage *.proto < vehiclePositions.pb

which gives the decoded contents of vehiclePositions.pb:

header {
  gtfs_realtime_version: "1.0"
  incrementality: FULL_DATASET
  timestamp: 1672668285
  1000 {
    1: 1193795
    2: 60
  }
}
entity {
  id: "2023-01-02:QBUZZ:g309:8149"
  vehicle {
    trip {
      trip_id: "161300003"
      start_time: "14:38:00"
      start_date: "20230102"
      schedule_relationship: SCHEDULED
      route_id: "2626"
      direction_id: 0
      [transit_realtime.ovapi_tripdescriptor] {
        realtime_trip_id: "QBUZZ:g309:8149"
      }
    }
    position {
      latitude: 53.1998672
      longitude: 6.56498432
    }
    current_stop_sequence: 7
    current_status: IN_TRANSIT_TO
    timestamp: 1672668264
    stop_id: "2464829"
    vehicle {
      label: "7602"
    }
    [transit_realtime.ovapi_vehicle_position] {
      delay: 38
    }
  }
}
...

After the header, an entry follows for each entity (only one is shown here; the file contains approx. 3200 entries) with the update information for a specific vehicle in a JSON-like style. The transit_realtime.ovapi* fields are the OVapi-specific data fields. The header specifies whether the file is an increment or a full set; this source always returns a full dataset. This data stream contains information on all forms of public transport, except trains.

In the trip updates feed, each active trip is returned as an entity. Within the trip, all stops are listed with a stop_time_update (repeated for all stops in a trip). Each update contains around 1600 entities (depending on time, day of week and holiday season) and a total of 50,000 stop time updates. This data stream also contains updates on trains, which are not in the vehicle updates, but it lacks the current geographical locations.

The numbers above show we are dealing with some serious data streams in size and frequency (full updates are published every minute).

Reading protocol buffers in Python

The next step is to read the protocol buffers in Python. After reading different blogs and websites this seemed a straightforward process, but in practice it is not that simple. Several attempts were needed, with different combinations of Python and package versions. The following combination works for me:

python                        3.8.5
protobuf                      3.20.1
protobuf3-to-dict             0.1.5
gtfs-realtime-bindings        0.0.7

The full install of the packages used in this article contains:

pip install protobuf==3.20.1 \
            gtfs-realtime-bindings==0.0.7 \
            protobuf3-to-dict==0.1.5 \
            requests simplejson pandas geopandas folium urllib3

Now we can compile the protocol buffer definitions to the required python files:

protoc --python_out=. *.proto

This will generate two files: gtfs_realtime_pb2.py and gtfs_realtime_OVapi_pb2.py.

If you are working in an Anaconda environment with Jupyter notebooks, it might be required to install protobuf using conda:

conda install protobuf
ipython kernel install --user

Linux environments are simpler but require libprotobuf to be installed:

sudo apt install python3-protobuf

This part takes some hassle and does not always feel predictable, but once it is running, you are good to go!

Parsing protocol buffers messages

Now we are able to decode the protocol buffers in Python:

import requests
import gtfs_realtime_OVapi_pb2  # Required for finding additional fields
import gtfs_realtime_pb2
from protobuf_to_dict import protobuf_to_dict

feed = gtfs_realtime_pb2.FeedMessage()

response = requests.get('https://gtfs.ovapi.nl/nl/vehiclePositions.pb', 
                        allow_redirects=True)
feed.ParseFromString(response.content)
vehiclePositions = protobuf_to_dict(feed)
print("Vehicle positions : {}".format(len(vehiclePositions['entity'])))

response = requests.get('https://gtfs.ovapi.nl/nl/trainUpdates.pb', 
                        allow_redirects=True)
feed.ParseFromString(response.content)
trainUpdates = protobuf_to_dict(feed)
print("Train updates     : {}".format(len(trainUpdates['entity'])))

response = requests.get('https://gtfs.ovapi.nl/nl/tripUpdates.pb', 
                        allow_redirects=True)
feed.ParseFromString(response.content)
tripUpdates = protobuf_to_dict(feed)
print("Trip updates      : {}".format(len(tripUpdates['entity'])))

response = requests.get('https://gtfs.ovapi.nl/nl/alerts.pb', 
                        allow_redirects=True)
feed.ParseFromString(response.content)
alerts = protobuf_to_dict(feed)
print("Alerts            : {}".format(len(alerts['entity'])))

This will result in four Python dictionaries, containing the real-time updates from the four different protocol buffer streams.

Parse data to dataframes

Pandas dataframes have a dictionary converter built into the constructor, but this only works well for dictionaries with one level of data and no nested structures. There are tools like flatten_json that can help in the process, but they are complex to apply and slow to execute.

The files have the following structure:

{
    "header": {
        "...": "...",
    },
    "entity": [
        {
            "A": "A1",
            "B": "B1",
            "C": [
                {"C_A": "CA1"},
                {"C_A": "CA2"}
            ]
         },
         {
            "A": "A2",
            "B": "B2",
            "C": [
                {"C_A": "CA3"},
                {"C_A": "CA4"}
            ]
         },
    ]
}

This will be translated to:

    A |    B | C_A |
 -------------------
   A1 |  B1  | CA1 |
   A1 |  B1  | CA2 |
   A2 |  B2  | CA3 |
   A2 |  B2  | CA4 |

After some experimentation, it seems best to manually write code that converts the nested dictionaries to new one-level dictionaries and then converts these to a dataframe. The protobuf3-to-dict package (source) is used to convert the protocol buffer to a Python dictionary first. This dictionary has the same nested structure as the original protocol buffer.
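A minimal sketch of this manual flattening, applied to the example structure shown above:

```python
import pandas as pd

# The nested example structure from above.
data = {
    "header": {},
    "entity": [
        {"A": "A1", "B": "B1", "C": [{"C_A": "CA1"}, {"C_A": "CA2"}]},
        {"A": "A2", "B": "B2", "C": [{"C_A": "CA3"}, {"C_A": "CA4"}]},
    ],
}

# Flatten: one output row per inner C element, repeating the outer fields.
rows = []
for ent in data["entity"]:
    for c in ent["C"]:
        rows.append({"A": ent["A"], "B": ent["B"], "C_A": c["C_A"]})

df = pd.DataFrame(rows)
print(df)
```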

Alerts

The least complex buffer is the alert buffer (after conversion to a dictionary):

{
    "header": {
        "gtfs_realtime_version": "1.0",
        "incrementality": 0,
        "timestamp": 1672851585,
    },
    "entity": [
        {
            "id": "KV15:RET:2015-05-12:53",
            "alert": {
                "active_period": [{"start": 1431470580, "end": 1704048875}],
                "informed_entity": [{"stop_id": "1541226"}],
                "cause": 1,
                "effect": 7,
                "header_text": {
                    "translation": [
                        {
                            "text": "Rotterdam Airport: bus 33 richting Meijersplein - bus 33 direction Meijersplein.",
                            "language": "nl",
                        }
                    ]
                },
                "description_text": {
                    "translation": [
                        {
                            "text": "Oorzaak : onbekend\nRotterdam Airport: bus 33 richting Meijersplein - bus 33 direction Meijersplein.\n",
                            "language": "nl",
                        }
                    ]
                },
            },
        }
    ],
}

Structure-wise, the entity array needs to be flattened to get one row per validity period per alert. On the data level, a conversion is needed from UNIX timestamps (the timestamp field in the header and the start and end fields in active_period). The cause and effect fields are enumerations specified in the GTFS specification.

A utility function is written to convert a timestamp column to a column with datetime objects. All UNIX timestamps are in UTC so a conversion to the local time in the Netherlands is required:

def convert_times(df, columns):
    for c in columns:
        df[c] = pd.to_datetime(df[c], unit='s', utc=True) \
                    .map(lambda x: x.tz_convert('Europe/Amsterdam'))
        df[c] = df[c].apply(lambda x: x.replace(tzinfo=None))
    return df

Now it is time to convert the alert dictionary to a dataframe with one alert per active period per row:

updates=[]
timestamp = alerts['header']['timestamp']
causes = {0: 'UNKNOWN_CAUSE', ...}
effects = ...}
for al in alerts['entity'] :
    aid = al['id']
    alert = al['alert']
    cause = int(alert['cause']) if 'cause' in alert else 0
    effect = int(alert['effect']) if 'effect' in alert else -1
    header_text = alert['header_text']['translation'][0]['text']
    description_text = alert['description_text']['translation'][0]['text']
    for ap in alert['active_period']:
        start = ap['start']
        end = ap['end']
        updates.append({'id': aid, 'timestamp': timestamp, 
                       'cause': causes[cause], 'effect': effects[effect], 
                       'start': start, 'end': end, 
                       'header': header_text, 'description': description_text})
df_alerts = pd.DataFrame(updates)
df_alerts = convert_times(df_alerts, ['timestamp', 'start', 'end'])

The result is the following dataframe:

Alerts dataframe (image by author)

Cause and effect are optional fields, so a check is needed to see if they are part of the dictionary. This overview of alerts needs to be related to the stops and routes the alert influences. Two separate tables are created to couple routes and stops to alerts:

routemapping = []
stopmapping = []
...
    for ap in alert['active_period']:
        start = ap['start']
        end = ap['end']
        if 'informed_entity' in alert:
            for inf in alert['informed_entity']:
                informed_stop = inf['stop_id']
                stopmapping.append({'alert_id': aid, 'stop_id': informed_stop,
                                    'start': start, 'end': end})
                if 'route_id' in inf:
                    informed_route = inf['route_id']
                    routemapping.append({'alert_id': aid, 
                                         'route_id': informed_route, 
                                         'start': start, 'end': end})
        updates.append(.....

df_alerts_to_stops = pd.DataFrame(stopmapping)
df_alerts_to_stops = convert_times(df_alerts_to_stops, ['start', 'end'])
df_alerts_to_routes = pd.DataFrame(routemapping)
df_alerts_to_routes = convert_times(df_alerts_to_routes, ['start', 'end'])

With result:

Mapping of alerts on stops and routes (image by author)

Trip Updates

The next step is the conversion of the trip updates. There are some optional fields, like arrival and departure time, some timestamps, some additional fields, and something special about the start time of a trip: the so-called business day. The hours field does not run from 00 to 23 but from 00 to 27. The business day in public transport is 28 hours long, running until 4:00 in the morning. If a trip is technically part of the previous day, its hours are extended to 24–27. If a trip belongs to the current day, the hour is 00 to 04.

For our purposes, we recalculate the business day to a normal 24 hours day. This implies that when the hour is greater than 23, we subtract 24 from the hours and add one day to the date, moving it to the first four hours of a day:

def businessday_to_datetime(date: str, time: str):
    try:
        res = datetime.strptime(date, '%Y%m%d')
        hr = int(time[:2])
        if hr >= 24:
            res = res + timedelta(days = 1)
            hr -= 24
        res = res + timedelta(hours=hr, minutes=int(time[3:5]), 
                              seconds=int(time[6:8]))
        return res
    except:
        return None

This method takes a date (string format ‘20230131’) and a time (string format ’13:23:45′) and converts them to a datetime object with a ‘normal’ 24-hour based date and time.

The additional fields added by OVapi are parsed by the Protocol Buffer code but are not replaced with their human-readable names. I have not been able to parse the buffers and have the field…

[transit_realtime.ovapi_tripdescriptor] {
        realtime_trip_id: "ARR:26004:1125"
      }

…parsed with their names. The result is always:

'___X': {'1003': {'realtime_trip_id': 'KEOLIS:4062:40462'}}},

These keys must be used to find the realtime_trip_id in the dictionary.
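A minimal sketch of that lookup, using the opaque keys as they appear in the converted dictionary (the example values are taken from the output above):

```python
# Trip descriptor as protobuf3-to-dict returns it: the OVapi extension
# ends up under opaque keys instead of its human-readable name.
trip = {
    "trip_id": "161300003",
    "___X": {"1003": {"realtime_trip_id": "KEOLIS:4062:40462"}},
}

rtid_keys = ["___X", "1003"]
rt_id = (trip[rtid_keys[0]][rtid_keys[1]]["realtime_trip_id"]
         if rtid_keys[0] in trip else None)
print(rt_id)  # KEOLIS:4062:40462
```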

It is now possible to convert the Trip Update to a data frame:

rtid_keys  = ['___X','1003']
updates=[]
timestamp = tripUpdates['header']['timestamp']
for tu in tripUpdates['entity']:
    uid = tu['id']
    trip_update = tu['trip_update']
    vehicle = (trip_update['vehicle']['label']
               if 'vehicle' in trip_update else None)
    trip = trip_update['trip']
    trip_id = trip['trip_id']
    start_time = trip['start_time'] if 'start_time' in trip else None
    start_date = trip['start_date']
    start_time = businessday_to_datetime(start_date, start_time)
    route_id = trip['route_id']
    direction_id = (int(trip['direction_id'])
                    if 'direction_id' in trip else None)
    rt_id = (trip[rtid_keys[0]][rtid_keys[1]]['realtime_trip_id']
             if rtid_keys[0] in trip else None)
    for stu in (trip_update['stop_time_update']
                if 'stop_time_update' in trip_update else []):
        stop_sequence = stu['stop_sequence']
        if 'arrival' in stu:
            arr = stu['arrival']
            arrival_time = arr['time'] if 'time' in arr else None
            arrival_delay = arr['delay'] if 'delay' in arr else None
        else:
            arrival_time = None
            arrival_delay = None
        if 'departure' in stu:
            dep = stu['departure']
            departure_time = dep['time'] if 'time' in dep else None
            departure_delay = dep['delay'] if 'delay' in dep else None
        else:
            departure_time = None
            departure_delay = None
        updates.append({'id': uid, 'RT_id': rt_id, 'trip_id': trip_id, 
                        'start_time': start_time, 'route_id': route_id, 
                        'direction_id': direction_id, 'vehicle': vehicle,
                        'stop_sequence': stop_sequence,
                        'arrival_time': arrival_time, 
                        'arrival_delay': arrival_delay,
                        'departure_time': departure_time, 
                        'departure_delay': departure_delay,
                        'timestamp': timestamp})     
df_trip_updates = pd.DataFrame(updates)
df_trip_updates = convert_times(df_trip_updates, ['departure_time', 
                                                  'arrival_time', 'timestamp'])
df_trip_updates.head(2)

and the resulting dataframe:

Trip updates dataframe (without id column, image by author)

Performance

The dataframe creation is an expensive task. Creating the array of dictionaries is efficient, but the DataFrame constructor takes a significant amount of time. In simplified form, the dataframe is created as:

pd.DataFrame([{'a': 1, 'b': 2, 'c': 'c1'}, 
              {'a': 11, 'b': 12, 'c': 'c2'}
             ])

It is faster to create a separate array per column and pass these to the constructor:

pd.DataFrame({'a': [1, 11], 
              'b': [2, 22],
              'c': ['c1', 'c2']}
            )

Both implementations result in the same dataframe. In the first implementation, the columns of each added row need to be matched by column name; the alternative version prevents this. It requires more code to be written, but the dataframe creation is about 10 to 20 times faster. The implementation on GitHub uses the alternative form.
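A rough way to see the difference is to time both constructions on the same synthetic data (the exact speedup will vary per machine and per column mix):

```python
import time
import pandas as pd

n = 100_000
rows = [{"a": i, "b": 2 * i, "c": str(i)} for i in range(n)]

t0 = time.perf_counter()
df_rows = pd.DataFrame(rows)                  # row-oriented construction
t_rows = time.perf_counter() - t0

cols = {"a": [r["a"] for r in rows],          # column-oriented construction
        "b": [r["b"] for r in rows],
        "c": [r["c"] for r in rows]}
t0 = time.perf_counter()
df_cols = pd.DataFrame(cols)
t_cols = time.perf_counter() - t0

print(f"rows: {t_rows:.3f}s, columns: {t_cols:.3f}s")
```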

The implementation on GitHub also ensures that all IDs are integers instead of strings, to improve lookup and merge performance.

Data model

Parsing of VehiclePositions and TrainUpdates works more or less the same; see the source code on GitHub. The GTFS Realtime data refers to the trips, routes and stops from the GTFS Static data. The relationships are as follows (white is static, orange is real-time):

Relation between GTFS Realtime and GTFS Static (image by author)

The calendar, shapes and transfers are not shown to keep the image clear.

The code on Github consists of a GTFS class and a notebook with some example usages. The GTFS class contains both the static and real-time information which can be updated independently. Some caching is added to prevent unnecessary parsing of the input files. This is especially useful for the static data where parsing the stop times is time consuming as it contains more than 14 million entries. The cached version is filtered for a specific day and contains around 2 million rows.
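The per-day reduction described above boils down to two joins; a miniature sketch with made-up IDs (the real class on GitHub may organize this differently):

```python
import pandas as pd

# Miniature stand-ins for calendar_dates, trips and stop_times (made-up IDs).
calendar_dates = pd.DataFrame({"service_id": [1, 2],
                               "date": [20230102, 20230103]})
trips = pd.DataFrame({"trip_id": [10, 11], "service_id": [1, 2]})
stop_times = pd.DataFrame({"trip_id": [10, 10, 11],
                           "stop_id": ["a", "b", "a"]})

# Keep only stop times of trips whose service runs today.
today = 20230102
services = calendar_dates.loc[calendar_dates["date"] == today, "service_id"]
trips_today = trips[trips["service_id"].isin(services)]
stoptimes_today = stop_times[stop_times["trip_id"].isin(trips_today["trip_id"])]
print(len(stoptimes_today))  # 2
```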

The class is used as follows:

from GTFS import GTFS

gtfs = GTFS()
gtfs.update_static()
gtfs.update_realtime(5)

The static GTFS file is cached for a week, the parsed contents for a few hours up to a maximum of a day. The protocol buffer files are also cached for a few minutes to improve performance during development. The latter can be overridden by specifying the maximum age in minutes as a parameter of the update_realtime method.
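One way to implement such an age-based cache check (a sketch; the actual GTFS class on GitHub may do this differently):

```python
import tempfile
import time
from pathlib import Path

def is_fresh(path: Path, max_age_minutes: float) -> bool:
    """True if the file exists and is younger than max_age_minutes."""
    if not path.exists():
        return False
    return (time.time() - path.stat().st_mtime) < max_age_minutes * 60

# Demo with a temporary file standing in for a downloaded .pb file.
with tempfile.TemporaryDirectory() as d:
    cache = Path(d) / "vehiclePositions.pb"
    cache.write_bytes(b"stub")                     # pretend we just downloaded it
    fresh = is_fresh(cache, 5)                     # file is only seconds old
    missing = is_fresh(Path(d) / "missing.pb", 5)  # never downloaded

print(fresh, missing)
```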

The downloaded and parsed data is stored in the class:

class GTFS:
    stops = ...
    routes = ...
    trips = ...
    calendar = ...
    stoptimes = ...
    trip_updates =...
    train_updates = ...
    vehicle_positions = ...
    alerts = ...
    alerts_to_routes = ...
    alerts_to_stops = ...

A few of them will be used in the following example. Other use cases might need the other information.

Plotting stops and the actual locations

The dataset contains all the public transport stops in the Netherlands. It can be used to generate a heatmap of stops in the country. The folium package is used with the plugin to generate heatmaps.

from folium import plugins

heat_data = [[point.xy[1][0], point.xy[0][0]] for point in gtfs.stops.geometry]

map = folium.Map(location=[52.0, 5.1], zoom_start=8, 
                 width=1000, height=1500, tiles="Cartodb dark_matter")
plugins.HeatMap(heat_data, radius=3, blur=1).add_to(map)
map
Heatmap of public transport stops (image by author)

Please note that this heatmap only shows the number of stops as they are distributed over the country. In this plot, a stop served by only one bus a day weighs the same as a stop with a bus every 5 minutes. It is an indication of potential accessibility, not actual accessibility.

With the GTFS Realtime information it is possible to add the locations of all vehicles (except trains; the train updates do not contain locations). For this, we zoom in to a region, in this case the city of Utrecht.

bbox=((4.99, 52.05), (5.26, 52.15)) # Utrecht

gdf_veh = gtfs.vehicle_positions.cx[bbox[0][0]:bbox[1][0], bbox[0][1]:bbox[1][1]] 
gdf_halte = gtfs.stops.cx[bbox[0][0]:bbox[1][0], bbox[0][1]:bbox[1][1]] 

map = folium.Map(location=[(bbox[0][1] + bbox[1][1])/2, 
                           (bbox[0][0] + bbox[1][0])/2], 
                 zoom_start=12)

for _, h in gdf_halte.iterrows():
    marker = folium.CircleMarker(location=[h["stop_lat"], h["stop_lon"]], 
                                 popup=h["stop_name"],
                                 radius=1, color='blue')
    map.add_child(marker)
for _, v in gdf_veh.iterrows():
    marker = folium.CircleMarker(location=[v["latitude"], v["longitude"]], 
                                 popup=v['route_short_name'] + " to " + 
                                       v['trip_headsign'],
                                 radius=5, color='red')
    map.add_child(marker)

map

First, the bounding box of the region is defined. The stops dataframe and vehicle_positions dataframe are GeoDataFrames, so it is possible to filter them on the bounding box with the cx method, which keeps only the rows within the specified bounding box.
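The same bounding-box selection can be mimicked with plain pandas on lon/lat columns (hypothetical sample points; with geopandas, gdf.cx[xmin:xmax, ymin:ymax] works directly on the geometry column):

```python
import pandas as pd

bbox = ((4.99, 52.05), (5.26, 52.15))  # Utrecht: (min lon, min lat), (max lon, max lat)

# Hypothetical sample stops with plain lon/lat columns.
stops = pd.DataFrame({
    "stop_name": ["Utrecht CS", "Amsterdam Centraal"],
    "stop_lon": [5.11, 4.90],
    "stop_lat": [52.09, 52.38],
})

inside = (stops["stop_lon"].between(bbox[0][0], bbox[1][0]) &
          stops["stop_lat"].between(bbox[0][1], bbox[1][1]))
print(stops.loc[inside, "stop_name"].tolist())  # ['Utrecht CS']
```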

A folium map is created at the center of the bounding box, with a zoom factor that shows only the region of the bounding box. Then a small blue circle is drawn on the map for each stop and a red circle for each vehicle. Each stop has a popup with the stop name, and each vehicle a popup with the line number and direction. This results in the following map:

Map of actual locations public transport vehicles in Utrecht (image by author)

This map was created during the afternoon of a weekday. The same map during a weekend evening shows fewer vehicles on the road:

Map of actual locations public transport vehicles in Utrecht (image by author)

By merging the actual data with the static stops, stop times, trips and routes, all information is available for each item on the map.

Final words

It took some true effort to reach the goal of this article: a map with the current positions of public transport vehicles. We needed to parse all static information, decode protocol buffers and combine all this information. The final map is then made in a breeze.

But all this hard work can be reused for other purposes: creating dynamic departure lists for a stop, an up-to-date travel planner, tools to help us in case of disruptions, etc.

To be honest, the work was not always enjoyable; especially getting the protocol buffers to work took some effort, and the performance needed a lot of tweaking. But I am happy with the end result. The final class and notebook can be found on GitHub.

I hope you enjoyed this article. For more inspiration, check some of my other articles.

If you like this story, please hit the Follow button!

Disclaimer: The views and opinions included in this article belong only to the author.

