In this article, I explore the public transport systems of four selected cities, relying on General Transit Feed Specification (GTFS) data and various tools of spatial data science.
For this notebook, I picked four cities, Budapest, Berlin, Stockholm, and Toronto, to overview their public transport systems using publicly available GTFS (General Transit Feed Specification) data. The notebook aims to serve as an introductory tutorial on accessing, manipulating, aggregating, and visualising public transport data using Pandas, GeoPandas, and other standard data science tools to derive insights about public transport. Later on, such an understanding can be helpful in various use cases, such as transport planning, urban planning, and location intelligence.
Additionally, while the GTFS format is meant to be general and universal, throughout the following analytical steps I will also point out situations that still require city-by-city inspection and manual validation.
In this article, all images were created by the author.
1. Collect and parse GTFS data
For this article, I downloaded public transport data from Transitfeeds.com, an online aggregator of public transport feeds. In particular, I downloaded the feeds with the latest available update dates for the following cities:
In the following code blocks, I will revisit each of these cities multiple times, create comparative plots, and highlight the universality of the GTFS format. Also, to make sure my analytics are easy to update with newer data dumps, I store each city's GTFS data in a folder named after its update date:
import os
root = 'data'
cities = ['Budapest', 'Toronto', 'Berlin', 'Stockholm']
updated = {city : [f for f in os.listdir(root + '/' + city) if '20' in f][0] for city in cities}
updated
The output of this cell:
Now, let’s take a closer look at the different files stored in these folders:
for city in cities:
    print(len(os.listdir(root + '/' + city + '/' + updated[city])), sorted(os.listdir(root + '/' + city + '/' + updated[city])), '\n')
The output of this cell:
It looks like eight files are present in all of them, corresponding to the core of the GTFS structure.
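To double-check this, here is a quick sketch (an addition, reusing the root, cities and updated variables from above) that computes the set of files shared by all four feeds:
# files present in every one of the four GTFS feeds
common_files = set.intersection(
    *[set(os.listdir(root + '/' + city + '/' + updated[city])) for city in cities])
print(len(common_files), sorted(common_files))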
While you can read more about the GTFS structure here, I will now move on to probably the most basic geospatial feature: the location of the public transport stops.
2. Public transport stop locations
First, I create a sample visual by parsing the stops.txt of Budapest. I use Shapely and GeoPandas to build a geometric data structure, a GeoDataFrame, creating Point geometries from the lon and lat coordinates of each stop location. Then, using the built-in Matplotlib-based plotting of GeoPandas, I create a simple map. Additionally, I use OSMnx to add the boundary of the city as an enclosing polygon.
import geopandas as gpd
from shapely.geometry import Point
import matplotlib.pyplot as plt
import pandas as pd
import osmnx as ox
city = 'Budapest'
df_stops = pd.read_csv(root + '/' + city + '/' + updated[city] + '/stops.txt')
geometry = [Point(xy) for xy in zip(df_stops['stop_lon'], df_stops['stop_lat'])]
gdf_stops = gpd.GeoDataFrame(df_stops, geometry=geometry)
gdf_stops.crs = 'EPSG:4326'
admin = ox.geocode_to_gdf(city)
f, ax = plt.subplots(1,1,figsize=(10,10))
admin.plot(ax=ax, color = 'none', edgecolor = 'k')
gdf_stops.plot(ax=ax, alpha = 0.2)
The resulting figure:
Looks good at first! The vast majority of stop locations are within the city; there are just a few commuter lines that fall outside of it. Also, in the centre of the plot, we can clearly see the lack of stop locations where the river Danube flows. Now, let’s see this plot for all the cities!
stops = {}
admin = {}
f, ax = plt.subplots(2,2,figsize=(15,15))
indices = [(i,j) for i in range(2) for j in range(2)]
for idx, city in enumerate(cities):
    bx = ax[indices[idx]]
    # parse the stops and turn them into a GeoDataFrame
    df_stops = pd.read_csv(root + '/' + city + '/' + updated[city] + '/stops.txt')
    geometry = [Point(xy) for xy in zip(df_stops['stop_lon'], df_stops['stop_lat'])]
    gdf_stops = gpd.GeoDataFrame(df_stops, geometry=geometry)
    gdf_stops.crs = 'EPSG:4326'
    # add the admin boundary from OpenStreetMap and plot
    admin_ = ox.geocode_to_gdf(city)
    gdf_stops.plot(ax=bx, markersize = 3, alpha = 0.5)
    admin_.plot(ax=bx, color = 'none', edgecolor = 'k', linewidth = 4)
    bx.set_title(city, fontsize = 15)
    bx.axis('off')
    # keep only the stops within the city boundary
    stops[city] = gpd.overlay(gdf_stops, admin_)
    admin[city] = admin_
plt.tight_layout()
Now that we have seen all four cities, we can add a few more nuances to our conclusion about the generality of GTFS and public transport data. On the one hand, from the tech perspective, the feeds indeed look very similar across cities and local transport agencies. On the other hand, one certainly has to inspect each city separately when adding it to a pipeline because, as the cases of Stockholm and Toronto show, the data associated with the city may actually belong to a larger, e.g. municipality-level, territory. The good news is that adding the admin boundary, for instance, from OpenStreetMap, makes it easy to filter the data down to the actual city!
In fact, I already did this filtering in the last-but-one row of the previous code block and stored the city-level stop locations in the stops dictionary.
f, ax = plt.subplots(2,2,figsize=(15,15))
indices = [(i,j) for i in range(2) for j in range(2)]
for idx, (city, admin_) in enumerate(admin.items()):
    bx = ax[indices[idx]]
    gdf_stops = stops[city]
    gdf_stops.plot(ax=bx, markersize = 3, alpha = 0.5)
    admin_.plot(ax=bx, color = 'none', edgecolor = 'k', linewidth = 4)
    bx.axis('off')
plt.tight_layout()
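Before moving on, here is a short sanity-check sketch (an addition, reusing the stops dictionary) showing how many stops the boundary filter removes in each city:
# compare the raw stop counts in each feed with the counts after clipping to the city boundary
for city in cities:
    df_all = pd.read_csv(root + '/' + city + '/' + updated[city] + '/stops.txt')
    print(city, ':', len(df_all), 'stops in the feed ->', len(stops[city]), 'inside the city boundary')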
3. Departure times
First, we explored where the stops are; now, let's take a look at when one should approach them. Additionally, assuming that public transport agencies have significant insight into demand and into the inhabitants' schedules and habits, we may also learn something from the temporal patterns of these schedules.
from datetime import datetime
# GTFS allows departure times past midnight (e.g. 25:10:00) for services running after 12 am,
# so hours above 23 are wrapped back into the 0-23 range before parsing
def parse_time_string(time_string):
    hour_value = int(time_string.split(':')[0])
    if hour_value > 23: hour_value = hour_value - 24
    time_string = str(hour_value) + ':' + time_string.split(':', 1)[1]
    parsed_time = datetime.strptime(time_string, "%H:%M:%S")
    return parsed_time
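# quick illustrative check (an addition, not part of the original notebook): a GTFS-style time past midnight
print(parse_time_string('25:10:00').time())  # expected: 01:10:00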
stop_times = {}
for city in cities:
    print(city)
    df_stop_times = pd.read_csv(root + '/' + city + '/' + updated[city] + '/stop_times.txt')
    df_stop_times['departure_time'] = df_stop_times['departure_time'].apply(parse_time_string)
    stop_times[city] = df_stop_times
for city, df_stop_times in stop_times.items():
    print(city, len(df_stop_times))
df_stop_times.head(3)
The output of this code block:
After reading, parsing, and transforming the departure time information, let's create a series of visuals showing each city's departure time distribution:
f, ax = plt.subplots(4,1,figsize=(15,20))
for idx, (city, df_stop_times) in enumerate(stop_times.items()):
    bx = ax[idx]
    # count the number of departures for each hour-minute of the day
    df_stop_times['hour_minute'] = df_stop_times['departure_time'].dt.strftime('%H:%M')
    departure_counts = df_stop_times['hour_minute'].value_counts().sort_index()
    departure_counts.plot(kind='bar', color='steelblue', alpha = 0.8, width=0.8, ax = bx)
    bx.set_xlabel('Hour-Minute of Departure', fontsize = 18)
    bx.set_ylabel('Number of Departures', fontsize = 18)
    bx.set_title('Histogram of Departure Times by Hour-Minute in ' + city, fontsize = 26)
    # only label the full hours to keep the axis readable
    bx.set_xticks([ijk for ijk, i in enumerate(departure_counts.index) if ':00' in i])
    bx.set_xticklabels([i for i in departure_counts.index if ':00' in i])
plt.tight_layout()
The output image of this code block:
While giving a proper, detailed interpretation of these graphs probably requires some serious local urban planning knowledge, one can also speculate a bit based on personal experience and common sense.
First, given these cities' quite different climates, cultural backgrounds, and day-night rhythms, I was surprised to see how well-aligned they all are to the 8 am start of the day. Budapest, Berlin, and Stockholm show nearly the same wake-up pattern, while, interestingly, Toronto takes a little longer to get up and running.
Second, there seems to be a clear order in when the afternoon ends, finishing earliest in Berlin, followed by Budapest, Toronto, and Stockholm. A curious question is whether these times mark the end of working shifts or the time people go home.
Third, the morning and afternoon peaks are clearly distinguishable for Budapest and Stockholm, while not so much for Berlin and Toronto, which may relate to how heavily centralised these cities are.
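These impressions can also be backed with a rough quantitative check. The sketch below (an addition, reusing the stop_times dictionary) reports, for each city, the hour with the most departures and the first hour that reaches at least half of that peak, as a crude proxy for the start of the day:
# crude proxies: the busiest hour and the first hour reaching half of the peak volume
for city, df_stop_times in stop_times.items():
    hourly = df_stop_times['departure_time'].dt.hour.value_counts().sort_index()
    peak_hour = hourly.idxmax()
    first_busy_hour = hourly[hourly >= hourly.max() / 2].index.min()
    print(city, '| busiest hour:', peak_hour, '| first hour at half of the peak volume:', first_busy_hour)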
4. Spatial distributions
Just as we can compute general trends for tabular data with float or integer variables, we can do so for spatial information. The process behind it is called spatial indexing: in practice, we split the area, such as the admin boundary of a city, into a regular grid. My personal favourite is Uber's H3 hexagon grid (you will find more about it here). Using this grid, we can efficiently conduct spatial aggregation by counting the number of stops in each grid cell. Let's try it at different hexagon resolutions!
import geopandas as gpd # version: 0.13.1
import h3 # version: 3.7.3
from shapely.geometry import Polygon # version: 1.8.5.post1
import numpy as np # version: 1.22.4
# use this function to split an admin polygon into a hexagon grid
def split_admin_boundary_to_hexagons(admin_gdf, resolution):
    coords = list(admin_gdf.geometry.to_list()[0].exterior.coords)
    admin_geojson = {"type": "Polygon", "coordinates": [coords]}
    hexagons = h3.polyfill(admin_geojson, resolution, geo_json_conformant=True)
    hexagon_geometries = {hex_id : Polygon(h3.h3_to_geo_boundary(hex_id, geo_json=True)) for hex_id in hexagons}
    return gpd.GeoDataFrame(hexagon_geometries.items(), columns = ['hex_id', 'geometry'])
# let's test two resolutions for Budapest
hexagons_gdf_h8 = split_admin_boundary_to_hexagons(admin['Budapest'], 8)
hexagons_gdf_h9 = split_admin_boundary_to_hexagons(admin['Budapest'], 9)
hexagons_gdf_h8.plot()
hexagons_gdf_h9.plot()
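To get a feel for the two resolutions, here is a short sketch (an addition, using the h3 v3 API as above) comparing the number of hexagons covering Budapest with the nominal average cell area of each resolution:
# compare grid sizes: number of cells covering Budapest and the average cell area per resolution
for res, grid in [(8, hexagons_gdf_h8), (9, hexagons_gdf_h9)]:
    print('resolution', res, ':', len(grid), 'hexagons,',
          round(h3.hex_area(res, unit='km^2'), 2), 'km2 average cell area')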
Technical note: When I was working on the next cell, I ran into an error: it turned out that the admin boundary of Berlin was a MultiPolygon, while all the other city admin areas are simple Polygons, which is also what my split_admin_boundary_to_hexagons function expects as input. I checked, and it turned out that, for some reason, there was an extra tiny polygon with near-zero area inside, so I cleaned it up by running the following command:
admin['Berlin'] = gpd.GeoDataFrame(admin['Berlin'].geometry.explode(), columns = ['geometry']).head(1)
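Should other cities come with MultiPolygon boundaries as well, a more general safeguard could look like the following sketch (keep_largest_polygon is a hypothetical helper, not part of the original notebook):
# a more general safeguard (hypothetical helper): keep only the largest part of a (Multi)Polygon boundary
def keep_largest_polygon(admin_gdf):
    exploded = admin_gdf.explode(index_parts=False).reset_index(drop=True)
    # ranking by area in degrees is fine here, as we only need to find the largest part
    return exploded.loc[[exploded.geometry.area.idxmax()]]
# admin['Berlin'] = keep_largest_polygon(admin['Berlin'])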
Now that we have a working pilot of the H3 grid construction, let's count the number of stops and the number of stoppings (scheduled stop events) in each hexagon of the four example cities. While computing the number of stop locations only requires a spatial join, for the number of stoppings we also need to incorporate the departure times. This way, the spatial and temporal dimensions meet!
from matplotlib.colors import LogNorm
resolution = 7
for city in cities:
    # create the hexagon grid
    hexagons_gdf = split_admin_boundary_to_hexagons(admin[city], resolution)
    # merge stops and stoppings
    gdf_stop_times = stops[city].merge(stop_times[city], left_on = 'stop_id', right_on = 'stop_id')
    # compute the number of unique stops and stoppings in each hexagon
    gdf_stops = gpd.sjoin(gdf_stop_times, hexagons_gdf)
    nunique = gdf_stops.groupby(by = 'hex_id').nunique().to_dict()['stop_id']
    total = gdf_stops.groupby(by = 'hex_id').count().to_dict()['stop_id']
    hexagons_gdf['nunique'] = hexagons_gdf.hex_id.map(nunique).fillna(0)
    hexagons_gdf['total'] = hexagons_gdf.hex_id.map(total).fillna(0)
    # visualize the number of stops and stoppings
    f, ax = plt.subplots(1,2,figsize=(15,5))
    plt.suptitle(city + ', resolution = ' + str(resolution), fontsize=25)
    hexagons_gdf.plot(column = 'nunique', cmap = 'RdYlGn', legend = True, ax = ax[0])
    hexagons_gdf.plot(column = 'total', cmap = 'RdYlGn', legend = True, ax = ax[1])
    # for log-scaled coloring:
    # hexagons_gdf.plot(column = 'total', cmap = 'RdYlGn', legend = True, ax = ax[1], norm=LogNorm(vmin=1, vmax = hexagons_gdf.total.max()))
    ax[0].set_title('Number of unique stops', fontsize = 17)
    ax[1].set_title('Number of stoppings', fontsize = 17)
    for aax in ax: aax.axis('off')
Let's see how this looks:
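Beyond the per-city maps, a compact numeric summary can help the comparison. The sketch below (an addition, reusing the helpers above) prints the share of hexagons that contain at least one stop in each city; as in the cell above, GeoPandas may warn about the missing CRS of the hexagon grid:
# share of hexagons with at least one public transport stop, per city
for city in cities:
    hexagons_gdf = split_admin_boundary_to_hexagons(admin[city], resolution)
    joined = gpd.sjoin(stops[city], hexagons_gdf)
    share = 100 * joined['hex_id'].nunique() / len(hexagons_gdf)
    print(city, ':', round(share, 1), '% of hexagons contain at least one stop')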
5. Different modes of transport
Now that we have seen aggregated trends in time and space, let's zoom in and extract the different means of transport corresponding to the data records. This information is usually stored in routes.txt, under the column route_type. This, this, and this mapping can be used to translate the transportation mode codes into their English names.
Based on these, I first encoded the complete mapping and then a simplified version of it, which I will use later. In the simplified version, I renamed, for instance, both the category 'Tram, Streetcar, Light rail' (code 0) and the category 'Tram Service' (code 900) to simply 'Tram'.
map_complete = { 0 : 'Tram, Streetcar, Light rail',
1 : 'Subway, Metro',
2 : 'Rail',
3 : 'Bus',
4 : 'Ferry',
11 : 'Trolleybus',
100 : 'Railway Service',
109 : 'Suburban Railway',
400 : 'Urban Railway Service',
401 : 'Metro Service',
700 : 'Bus Service',
717 : 'Regional Bus',
900 : 'Tram Service',
1000: 'Water Transport Service'}
map_simple = { 0 : 'Tram',
1 : 'Subway',
2 : 'Railway',
3 : 'Bus',
4 : 'Ferry',
11 : 'Trolleybus',
100 : 'Railway',
109 : 'Railway',
400 : 'Railway',
401 : 'Subway',
700 : 'Bus',
717 : 'Bus',
900 : 'Tram',
1000 : 'Ferry', }
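Since the coverage of these codes varies across feeds, here is a quick sketch (an addition) verifying that every route_type appearing in the four feeds has an entry in the simplified mapping:
# check that every route_type present in the feeds is covered by map_simple
for city in cities:
    df_route = pd.read_csv(root + '/' + city + '/' + updated[city] + '/routes.txt')
    missing = set(df_route['route_type'].astype(int)) - set(map_simple)
    print(city, ':', 'all route types mapped' if not missing else 'unmapped route types: ' + str(sorted(missing)))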
Now, let's visualise the frequency of each transport mode, measured by the number of routes:
from collections import Counter
import matplotlib.pyplot as plt
# this function does some nice formatting on the axis and labels
def format_axis(ax):
    for pos in ['right', 'top']: ax.spines[pos].set_edgecolor('w')
    for pos in ['bottom', 'left']: ax.spines[pos].set_edgecolor('k')
    ax.tick_params(axis='x', length=6, width=2, colors='k', labelsize=12)
    ax.tick_params(axis='y', length=6, width=2, colors='k', labelsize=12)
f, ax = plt.subplots(1,4,figsize = (15,4))
routes = {}
for idx, city in enumerate(cities):
    # get the data, map the english names
    df_route = pd.read_csv(root + '/' + city + '/' + updated[city] + '/routes.txt')
    df_route['route_type'] = df_route['route_type'].astype(int)
    df_route['route_type_en'] = df_route['route_type'].map(map_simple)
    D = dict(Counter(df_route.route_type_en))
    routes[city] = df_route
    # define my color palette
    transport_colors = {'Railway': '#A65C47',
                        'Bus': '#0BB3D9',
                        'Tram': '#F2B705',
                        'Ferry': '#997CA6',
                        'Trolleybus': '#D91818',
                        'Subway': '#0869A6'}
    # create the bar charts
    labels = list(D.keys())
    values = list(D.values())
    colors = [transport_colors[l] for l in labels]
    ax[idx].bar(labels, values, color = colors)
    ax[idx].set_xticks(range(len(labels)))
    ax[idx].set_xticklabels(labels, fontsize = 10, rotation = 60, ha = 'right')
    format_axis(ax[idx])
    ax[idx].set_title(city, fontsize = 18)
    ax[idx].set_ylabel('Number of routes', fontsize = 15)
    ax[idx].set_yscale('log')
plt.tight_layout()
This code block outputs the following figure:
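The same information can also be condensed into a single table, which is handy for a side-by-side comparison; a short sketch (an addition, reusing the routes dictionary filled above):
# number of routes per (simplified) mode per city, as one table
mode_counts = pd.DataFrame({city: routes[city]['route_type_en'].value_counts() for city in cities})
print(mode_counts.fillna(0).astype(int))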
6. Route shapes
After the spatio-temporal aggregations and the zoom-in on transport modes, last but not least, I would like to highlight my favourite part: the GPS point sequences describing the shape of each route.
Access it this way:
city = 'Budapest'
df_shapes = pd.read_csv(root + '/' + city + '/' + updated[city] + '/shapes.txt')
df_shapes
The output of this cell:
Now, let's convert this table of shape points, for each city, into a GeoDataFrame consisting of LineStrings, add the route type mapping from the previous section, and use the trips.txt GTFS file, which connects routes (and means of transport) to shapes.
The result will be a geographical map of the public transport road network, which we can colour, for instance, based on the means of transport following the previous section’s colour coding.
import contextily as ctx
from shapely.geometry import LineString
for city in cities:
    df_shapes = pd.read_csv(root + '/' + city + '/' + updated[city] + '/shapes.txt')
    # transform the shape points into a GeoDataFrame of LineStrings
    df_shapes['shape_pt_lat'] = df_shapes['shape_pt_lat'].astype(float)
    df_shapes['shape_pt_lon'] = df_shapes['shape_pt_lon'].astype(float)
    df_shapes['geometry'] = df_shapes.apply(lambda row: Point(row['shape_pt_lon'], row['shape_pt_lat']), axis=1)
    df_shapes = gpd.GeoDataFrame(df_shapes[['shape_id', 'geometry']])
    df_shapes.crs = 4326
    gdf_shapes = gpd.GeoDataFrame(df_shapes[['shape_id', 'geometry']].groupby(by = 'shape_id').agg(list))
    gdf_shapes = gdf_shapes[[len(g)>1 for g in gdf_shapes['geometry'].to_list()]]
    gdf_shapes['geometry'] = gdf_shapes['geometry'].apply(lambda x: LineString(x))
    gdf_shapes = gpd.GeoDataFrame(gdf_shapes)
    # let's only keep those routes which have at least one segment that falls into the city
    gdf_shapes['shape_id'] = gdf_shapes.index
    shapes_in_city = set(gpd.overlay(gdf_shapes, admin[city]).shape_id.to_list())
    gdf_shapes = gdf_shapes[gdf_shapes.shape_id.isin(shapes_in_city)]
    # map the means of transport
    df_route = routes[city][['route_id', 'route_type_en']].drop_duplicates()
    df_trips = pd.read_csv(root + '/' + city + '/' + updated[city] + '/trips.txt')[['route_id', 'shape_id']].drop_duplicates()
    df_trips = df_trips.merge(df_route, left_on = 'route_id', right_on = 'route_id')
    # merge these
    gdf_shapes = gdf_shapes.merge(df_trips, left_index = True, right_on = 'shape_id')
    # create the visuals in matplotlib
    f, ax = plt.subplots(1,1,figsize=(15,15))
    cax = admin[city].plot(ax=ax, edgecolor = 'k', color = 'none')
    cax = admin[city].plot(ax=ax, edgecolor = 'k', alpha = 0.52)
    for transport, color in transport_colors.items():
        gdf_shapes[gdf_shapes.route_type_en==transport].plot(ax=ax, color = color, alpha = 0.9, linewidth = 1)
    ax.axis('off')
    ctx.add_basemap(ax, alpha = 0.8, crs = 4326, source = ctx.providers.CartoDB.DarkMatterNoLabels)
    plt.tight_layout()
    plt.savefig('figure7_'+city+ '.png', dpi = 150, bbox_inches = 'tight')
    plt.close()
The resulting figures for each city:
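As a final, optional sketch (an addition, not part of the original workflow), one can also quantify the network: the snippet below estimates the total shape length per mode for the last city processed in the loop above, after projecting to a local UTM CRS. Note that overlapping trip shapes are counted multiple times, so this overestimates the physical network length.
# approximate total shape length per mode for the last city processed above
gdf_metric = gpd.GeoDataFrame(gdf_shapes, geometry='geometry', crs='EPSG:4326')
gdf_metric = gdf_metric.to_crs(gdf_metric.estimate_utm_crs())
gdf_metric['length_km'] = gdf_metric.length / 1000
print(gdf_metric.groupby('route_type_en')['length_km'].sum().round(1))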
Conclusion
In this article, I gave a brief overview of the technicalities needed to explore GTFS data and to analyse different cities' public transport systems in a streamlined way. While the general format used by all these cities makes it very handy and easy to adopt analytical tasks from one town to another, my findings and results also underline the importance of city-by-city validation and sanity checks, due to some minor flexibility in definitions and terminology across different cities.
Finally, if you would like to travel back in time and see a similar, transport-related analysis going back to ancient Rome, check my previous TDS article, "Do All the Roads Lead to Rome?".