
Modeling Traffic Density of the City of Vienna

Network Analysis / Machine Learning

Connecting official records of the road network with Uber data

Photo by Ahmad Tolba on Unsplash

As a resident of a larger European city, I find urban traffic messy and unpredictable. Still, from personal experience, each of us has built up a gut feeling for the busy streets and the quickest routes through our hometown, perhaps even depending on the time of day.

In this article, I attempt to test these conjectures by analytically deriving the traffic density of the City of Vienna on a granular level. I do this by combining two publicly available sources:

  • Road network: The official records of the road network of the City of Vienna containing ~30,000 road segments, each with its respective geo-location, length, and street type (https://data.gov.at).
  • Uber: Aggregated data on Uber rides, showing travel-time between any combination of sub-districts in Vienna (https://movement.uber.com).

Selected Approach

As a first step, the road network enables the simulation of the shortest path between any two points based on the given maximum speed limit on each street. Such a path results in a hypothetical "traffic-free" travel time.

Next, these "traffic-free" travel times can be contrasted with real-world "traffic-included" travel times sourced from Uber rides. By framing a constrained optimization problem, coefficients representing traffic density at each segment of the network are thereby obtained. The traffic density and average travel speed during different scenarios can be observed as a result.

The Python implementation behind this article can be accessed and forked from my repository.

My Project Portfolio


1. Simulating Paths

1.1. Road Network

The official records of the road network contain all street segments and crossings within the city. After a few cleaning steps, they can be represented as a DataFrame.

Crossings represent "network nodes"

Since Network Analysis will be applied, the data is loaded as "nodes" and "edges" into a network graph. The indexes of both DataFrames are identifiers to connect the objects.

Streets represent "network edges", that connect nodes

Within this project, I rely on several functions from the NetworkX package, which is designed for analyzing complex network graphs. The graph is initialized with the nodes and edges taken from the two DataFrames.
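The original initialization code is not reproduced in this text, so the following is only a minimal sketch of how such a graph could be built. The toy DataFrames, the column names (`source`, `target`, `length_m`, `street_type`) and the identifiers are hypothetical and not taken from the actual data set.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy data; the real DataFrames hold ~30,000 road segments.
nodes = pd.DataFrame(
    {"lon": [16.37, 16.38, 16.39], "lat": [48.21, 48.21, 48.22]},
    index=["n1", "n2", "n3"],
)
edges = pd.DataFrame(
    {
        "source": ["n1", "n2"],
        "target": ["n2", "n3"],
        "length_m": [750.0, 1100.0],
        "street_type": ["local", "main"],
    },
    index=["e1", "e2"],
)

G = nx.Graph()
# Each crossing becomes a node that carries its coordinates as attributes.
G.add_nodes_from((idx, row.to_dict()) for idx, row in nodes.iterrows())
# Each street segment becomes an edge between two crossings.
G.add_edges_from(
    (
        row["source"],
        row["target"],
        {"edge_id": idx, "length_m": row["length_m"], "street_type": row["street_type"]},
    )
    for idx, row in edges.iterrows()
)
```

Keeping the DataFrame index as `edge_id` on each edge makes it possible to join results from the graph back to the original records.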

By plotting the graph and adding color codes for different street types, the full road network is made visible. The data source distinguishes between three different types of roads: local streets (green), main streets (yellow), and federal streets (red).¹ The street type serves as a proxy for the maximum speed limit, although it might not match reality in every case.

Network graph color-coded by street type

Since we will later search for paths between various start and end points in the network, we have to ensure that all nodes are connected to the main network graph.

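# Largest connected component; key=len compares components by their size
nodes_main = set(max(nx.connected_components(self.G), key=len))
nodes_all = set(self.G.nodes())
# Every node outside the main component is disconnected and will be pruned
nodes_disconnect = nodes_all - nodes_main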

Pruning the network by the identified disconnected nodes and corresponding edges leaves a fully connected network to be analyzed.
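The pruning step itself can be sketched on a toy graph as follows. The node labels are made up for illustration; `remove_nodes_from` is the NetworkX call that deletes the nodes together with their attached edges.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 1)])  # main network
G.add_edges_from([(10, 11)])                # isolated side street
G.add_node(20)                              # isolated crossing

# Largest connected component by node count
nodes_main = max(nx.connected_components(G), key=len)
nodes_disconnect = set(G.nodes()) - nodes_main

# Removing a node also removes every edge attached to it.
G.remove_nodes_from(nodes_disconnect)
```

After this step, `nx.is_connected(G)` returns `True`, so a path exists between any two remaining nodes.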

Network graph color-coded by disconnected nodes & edges

1.2. Uber Areas

Uber splits the City of Vienna into 1370 sub-districts ("areas") in order to aggregate sensitive data on individual rides. Based on actual rides, the mean travel time between any two areas is synthetically derived, resulting in an overwhelming 14 million measurements of daily travel times during Q1 2020. For more details on Uber's aggregation methodology, see the official manual.

Uber travel times during Q1 2020 in Vienna

To connect the Uber data with the road network introduced above, each node of the network needs to be mapped to the area it lies in. The area boundaries are provided as polygons and can be plotted as well.

Uber areas are segmenting the urban region of Vienna

By taking each node and iteratively checking against all polygons whether the node lies within their boundaries, a mapping for all nodes can be obtained.
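To illustrate the idea, here is a small, dependency-free sketch that uses a ray-casting point-in-polygon test. This is an assumption about one possible implementation (a geometry library would be the more common choice in practice); the function names and the toy polygons are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices; the last vertex connects
    back to the first one.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at the point cross this side?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def map_nodes_to_areas(nodes, areas):
    """Return {node_id: area_id}; nodes outside every polygon map to None."""
    mapping = {}
    for node_id, (x, y) in nodes.items():
        mapping[node_id] = next(
            (area_id for area_id, poly in areas.items() if point_in_polygon(x, y, poly)),
            None,
        )
    return mapping


# Two adjacent square "areas" and three nodes, one of them outside both.
areas = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
nodes = {"n1": (1.0, 1.0), "n2": (3.0, 0.5), "n3": (5.0, 5.0)}
node_to_area = map_nodes_to_areas(nodes, areas)
```

Checking every node against every polygon is quadratic in the worst case, which is tolerable for a one-off mapping but could be sped up with a spatial index.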

1.3. Shortest Paths

After the successful mapping of network components to Uber-defined areas, the available attributes for each edge have the following structure:

Attributes of a random network edge stored as a dictionary

The travel time attribute is the distance (meters) divided by the speed (km/h), converted into seconds. It represents the hypothetical time spent on a street when constantly driving at the maximum speed limit. Thus, it can also be referred to as the "traffic-free" travel time.
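The unit conversion is easy to get wrong, so here is a short sketch of it. The function name is hypothetical; the only fact used is that 1 km/h equals 1000 m per 3600 s.

```python
def traffic_free_travel_time(length_m: float, speed_kmh: float) -> float:
    """Seconds needed to cover `length_m` meters at a constant `speed_kmh`."""
    speed_ms = speed_kmh / 3.6  # 1 km/h = 1000 m / 3600 s
    return length_m / speed_ms


# A 500 m segment with a 50 km/h limit
travel_time_s = traffic_free_travel_time(length_m=500, speed_kmh=50)
```

For this example the result is 36 seconds (up to floating-point rounding).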

To estimate the shortest path through the network between any of the areas, the NetworkX algorithm looks for a route that minimizes a specified edge attribute. In this case, we minimize the traffic-free travel time.
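A minimal sketch of such a query on a toy graph is given here. The attribute name `travel_time` and the node labels are assumptions made for illustration.

```python
import networkx as nx

G = nx.Graph()
# (node_u, node_v, traffic-free travel time in seconds); toy values
G.add_weighted_edges_from(
    [("a", "b", 30.0), ("b", "d", 30.0), ("a", "c", 20.0), ("c", "d", 60.0)],
    weight="travel_time",
)

# The route that minimizes the summed edge attribute
path = nx.shortest_path(G, source="a", target="d", weight="travel_time")
total_s = nx.shortest_path_length(G, source="a", target="d", weight="travel_time")
```

Here the route via `b` (60 seconds) beats the route via `c` (80 seconds), even though the first hop towards `c` is faster.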

Since paths can only be calculated between nodes, and not between areas, a random node from the source- and destination-area needs to be drawn. To avoid noise from unlucky draws, several of these random nodes are tested, while only the path of median travel time is selected as output.
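The sampling idea can be sketched as below. The number of draws, the seed handling, and the helper name `median_path_between_areas` are assumptions; the source only states that several random nodes are tested and the path with the median travel time is kept.

```python
import random

import networkx as nx


def median_path_between_areas(G, nodes_by_area, src_area, dst_area, n_draws=5, seed=0):
    """Draw random node pairs and return (seconds, path) of the median draw.

    `n_draws` should be odd so that the middle element is the exact median.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_draws):
        src = rng.choice(nodes_by_area[src_area])
        dst = rng.choice(nodes_by_area[dst_area])
        # With a target given, this returns (distance, path).
        seconds, path = nx.single_source_dijkstra(G, src, dst, weight="travel_time")
        candidates.append((seconds, path))
    candidates.sort(key=lambda c: c[0])
    return candidates[len(candidates) // 2]


# Toy network: a chain of five crossings, 10 seconds per segment
G = nx.Graph()
G.add_weighted_edges_from(
    [("a", "b", 10.0), ("b", "c", 10.0), ("c", "d", 10.0), ("d", "e", 10.0)],
    weight="travel_time",
)
nodes_by_area = {1: ["a", "b"], 2: ["d", "e"]}
seconds, path = median_path_between_areas(G, nodes_by_area, src_area=1, dst_area=2)
```

Taking the median rather than the mean keeps a single unlucky draw, for example two nodes at opposite corners of their areas, from dominating the result.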

A visual representation of such a shortest path can be seen below. While the dark blue line marks the actual route, areas that are being passed through are highlighted in light blue and labeled with their area number.

Various statistics can be extracted from the chosen path. The relevant metric that will serve as an input for the optimization problem is the meter distance traveled by areas on the respective path.
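As an illustration of how this metric could be collected, the sketch below sums the segment lengths per area along a path. It assumes that every edge carries a single `area` and a `length_m` attribute; both names and the toy values are hypothetical.

```python
from collections import defaultdict


def meters_by_area(path, edge_attrs):
    """Sum the edge lengths along `path`, grouped by the area of each edge."""
    totals = defaultdict(float)
    # Consecutive node pairs are the edges traveled on the path.
    for u, v in zip(path, path[1:]):
        attrs = edge_attrs[frozenset((u, v))]  # undirected edge lookup
        totals[attrs["area"]] += attrs["length_m"]
    return dict(totals)


edge_attrs = {
    frozenset(("a", "b")): {"area": 12, "length_m": 120.0},
    frozenset(("b", "c")): {"area": 12, "length_m": 80.0},
    frozenset(("c", "d")): {"area": 47, "length_m": 200.0},
}
distances = meters_by_area(["a", "b", "c", "d"], edge_attrs)
```

Each simulated path then contributes one row of per-area distances, with zeros for all areas the path never touches.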

Traveled meter distance by areas on the path

2. Constrained Optimization

2.1. Identifying Coefficients

We are already very close to finding out what traffic looks like in each area of the city. After iterating through all source and destination areas from the Uber data set and collecting meter distances for the areas on each path, we can relate the obtained data matrix to the actual travel times from Uber rides.² A split of the DataFrame into X and y variables is shown below.

Meter distances per area serve as independent variables to estimate mean travel time

To put the relationship into simple mathematical notation, the actual Uber travel time (y) should equal the sum of meter distances (x) multiplied by area-specific coefficients (beta).
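The formula itself is not reproduced in this text. Based on the verbal description, it can be written as follows, where the path index i, the number of areas J, and the error term are notational assumptions:

```latex
y_i = \sum_{j=1}^{J} x_{ij}\,\beta_j + \varepsilon_i
```

Here y_i is the mean Uber travel time (seconds) for path i, x_{ij} is the distance (meters) that path i travels within area j, and beta_j is the coefficient of area j.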

It follows that a higher beta for area "j" allocates more of the total travel time to that area, meaning higher traffic density. Also, since the equation is measured in seconds (y) and meters (x), beta is expressed in seconds per meter. This can, of course, be transformed into km/h to give it a real-world meaning.
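The transformation between seconds per meter and km/h is a simple reciprocal, sketched here with hypothetical helper names.

```python
def beta_to_kmh(beta_s_per_m: float) -> float:
    """Convert a coefficient in seconds per meter into a speed in km/h."""
    meters_per_second = 1.0 / beta_s_per_m
    return meters_per_second * 3.6


def kmh_to_beta(speed_kmh: float) -> float:
    """Convert a speed in km/h into a coefficient in seconds per meter."""
    return 3.6 / speed_kmh


speed = beta_to_kmh(0.2)   # 0.2 s/m is 5 m/s, i.e. about 18 km/h
beta = kmh_to_beta(120.0)  # 120 km/h is about 0.03 s/m
```

Because the relationship is inverse, a larger coefficient always means a lower average speed.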

2.2. Running the Optimization

Now that the problem is specified, a model needs to be fit that minimizes the RMSE of the given function. In order to obtain sensible beta values, the coefficients are constrained to stay within fixed bounds. Translated into km/h, these correspond to a range of 5–120 km/h.
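The optimizer used in the project is not shown in this text. As a stand-in, the following sketch solves the same kind of problem with SciPy's bounded linear least squares (`scipy.optimize.lsq_linear`); minimizing the sum of squared errors is equivalent to minimizing the RMSE. The synthetic data, sizes, and "true" coefficients are invented for illustration.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
n_paths, n_areas = 200, 5  # toy sizes; the article works with 1370 areas

# X: meters traveled per area on each path, y: travel time in seconds
X = rng.uniform(0, 500, size=(n_paths, n_areas))
beta_true = np.array([0.05, 0.10, 0.20, 0.40, 0.60])  # seconds per meter
y = X @ beta_true

# Speed range of 5-120 km/h expressed in seconds per meter
lower, upper = 3.6 / 120, 3.6 / 5  # 0.03 and 0.72 s/m

result = lsq_linear(X, y, bounds=(lower, upper))
beta_hat = result.x
rmse = np.sqrt(np.mean((X @ beta_hat - y) ** 2))
```

On this noise-free toy data the fitted coefficients match `beta_true`; with real travel times, the bounds keep individual areas from absorbing noise through implausible speeds.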

After 10 iterations, the RMSE appears to reach a minimum at ~130. On an unseen test set, the RMSE increases only slightly to 134, which suggests that the model is robust.

RMSE of optimization at each iteration

The 1370 coefficients, which are initialized from a normal distribution centered around 0.2, seem to settle into a smooth distribution after the 10th iteration as well.

Distribution of coefficients at each iteration

Plotting the coefficients on a map draws a picture of the geospatial traffic density. While brighter areas indicate low traffic, darker areas are associated with busier streets and slower traffic. Gray areas are those that none of the simulated paths passed through.

Traffic density by area

Two observations from the results turned out as expected. First, the city center is exposed to the highest traffic volume. This effect also extends to the western districts, where high population density is coupled with a dense web of smaller streets. Second, the larger, brighter areas in the south and along the Danube (the main river flowing through the city) map quite well to the existing highway network, which allows for higher travel speeds.

2.3. Scenario Analysis

While modeling the overall traffic density can already identify hot spots in the city, it might also be insightful to compare the traffic situation at different times of the day or compare the dynamics during weekdays vs. weekends.

To get scenario results, the Uber rides used to train the model are simply filtered by the selected time-of-day or day-of-week category. Below is a comparison of the distribution of traffic density coefficients in the early morning (0–7 AM) versus the afternoon (4–7 PM). As expected, the coefficients in the morning are on average lower, corresponding to less traffic and higher average travel speed (as shown in the lower subplot).
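A sketch of such a filter is given below. The column names (`hour`, `mean_travel_time_s`, and so on) and the half-open hour intervals are assumptions; the actual Uber export may encode time categories differently.

```python
import pandas as pd

# Hypothetical ride records; all column names are illustrative.
rides = pd.DataFrame(
    {
        "source_area": [1, 1, 2, 2, 3, 3, 4],
        "dest_area": [5, 6, 5, 6, 5, 6, 5],
        "hour": [2, 6, 7, 12, 16, 18, 19],
        "mean_travel_time_s": [410, 395, 520, 560, 640, 655, 600],
    }
)

# Scenario name -> (start hour inclusive, end hour exclusive)
SCENARIOS = {"early_morning": (0, 7), "afternoon": (16, 19)}


def filter_scenario(rides: pd.DataFrame, name: str) -> pd.DataFrame:
    """Keep only the rides whose hour falls into the selected scenario."""
    start, end = SCENARIOS[name]
    mask = (rides["hour"] >= start) & (rides["hour"] < end)
    return rides[mask]


morning = filter_scenario(rides, "early_morning")
afternoon = filter_scenario(rides, "afternoon")
```

Each filtered subset is then fed through the same fitting procedure, which yields one set of coefficients per scenario.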

Distribution of coefficients and speed per area for different scenarios

A map of the coefficient differences between the two scenarios supports this finding in the spatial dimension as well: red areas indicate higher traffic density in the afternoon compared to the early morning.

Differences in traffic density (morning vs. afternoon)

3. Final Considerations

With the presented approach, we were able to combine two different publicly available data sources by mapping the street network to the "Uber-defined" segmentation of the city. This enabled us to enrich the source and destination areas of each Uber ride with information about the path taken during the ride. By fitting a model that relates the traveled path to the travel time, sensible estimates of traffic density were derived.

Building on this, various extensions come to mind that could make the insights more actionable:

  • Including data on one-way streets and traffic lights in the network (at least the latter is publicly available)
  • Introducing an artificial traffic jam, to observe its implications on alternative routes
  • Estimating traffic density coefficients for each street segment
  • Comparing traffic dynamics pre-Covid vs. post-Covid

If you find any other interesting applications or want to rebuild the project for your own hometown, please feel free to make use of my work. You can find the source code in my GitHub repository.



[1] Loosely translated from German

[2] Due to limited computational capacity, only 100,000 out of 14 million paths have been simulated

