Five Essential Skills for Transportation Data Science
Public sector transportation agencies are ravenous to hire data scientists; no PhD required
Transportation touches the lives of nearly every living person every day. The public sector entities that build and operate the world’s roads, highways, and other public transportation networks are ravenous to hire data scientists with the skills to make sense of their voluminous data. Functional proficiency with the five skills described in this article will equip you to answer many of the questions that transportation agencies grapple with daily. Joining a department of transportation or metropolitan planning organization is a perfect entry-point for any new data scientist looking to apply her or his skills in service of the greater good.
Interested? Here are five essential skills you’ll use as a data scientist supporting a transportation agency:
- Data Management & Transformation
- GIS
- Decision Trees
- Plotting and Mapping
- Count Regression Models
This article illustrates the application of these skills through an applied investigation of the relationship between roadway lighting and traffic safety in Pennsylvania. This example uses R, which provides an excellent workbench for many transportation questions thanks to its rich ecosystem of users and analytical packages.
The Setup
Suppose you’re an analyst for the Pennsylvania Department of Transportation. The agency is weighing spending more of its annual roadway safety budget on installing highway lighting versus other safety improvements like guardrails. Your charge is to help answer the question: “what impact does street lighting have on crashes?” We investigate this question below.
You can download the code and follow along yourself by cloning the repository from: https://github.com/markegge/safety-demo/
Data Preparation: GIS + Data Transformation
Data analysis is almost always preceded by data preparation to transform source data into a format suitable for analysis. This includes operations like deriving new attributes, joining tables, aggregating, and reshaping tables.
R’s data.table and dplyr packages are both powerful and versatile data transformation multi-tools. In this article, I use data.table, which I prefer for its speed and concise syntax.
Safety analysis typically uses five years’ worth of historic crash data. Accordingly, we begin by loading in five years of crash history from the Pennsylvania Crash Information clearinghouse:
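A minimal sketch of that loading step with data.table — the file paths and naming pattern are assumptions, so adjust them to match your download from the clearinghouse:

```r
library(data.table)

# One crash CSV per year (hypothetical paths); fread() reads each quickly,
# and rbindlist() stacks the five years into a single table
years <- 2015:2019
crashes <- rbindlist(
  lapply(years, function(y) fread(sprintf("data/CRASH_%d.csv", y))),
  use.names = TRUE, fill = TRUE
)
```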
Because transportation describes the movement of people and goods through space, transportation data is often spatial. To work with spatial data you’ll need to be familiar with basic GIS operations like buffering, dissolving, and joining, as well as reprojecting spatial data between Web Mercator, State Plane, and UTM projections.
Projections are a way of mapping spatial data from a round planet onto a flat computer screen. Web maps typically store coordinates as degrees of latitude and longitude (WGS 84, EPSG:4326) and display them using the Web Mercator projection (EPSG:3857). Since the length of a degree of longitude varies with latitude, measuring, buffering, and intersection operations are typically performed in State Plane or UTM projections, which measure distances in feet or meters rather than degrees.
For this analysis, we’ll use a spatial representation of Pennsylvania’s highway road network provided by the Federal Highway Administration. Each roadway is divided into segments. To count the number of crashes per road segment, we buffer the road segments by 50' and then spatially join the crash points to the buffered lines.
In R, the sf package provides an interface to the powerful GDAL spatial library, letting you use the same spatial functions in R that you may already know from working with PostGIS or Python.
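The buffer-and-join step described above can be sketched with sf as follows. The file names and the crash coordinate columns (DEC_LONG/DEC_LAT) are assumptions; EPSG:2272 is Pennsylvania State Plane South in US feet, so the 50' buffer is measured in feet rather than degrees:

```r
library(sf)

# Road segments from FHWA and crash points from the clearinghouse,
# both reprojected to PA State Plane South (EPSG:2272, US feet)
segments_sf <- st_transform(st_read("data/fhwa_pa_highways.shp"), 2272)
crash_pts   <- st_transform(
  st_as_sf(crashes, coords = c("DEC_LONG", "DEC_LAT"), crs = 4326),
  2272
)

# Buffer each segment by 50 feet, then attach each crash to the
# segment whose buffer contains it
buffered <- st_buffer(segments_sf, dist = 50)
joined   <- st_join(crash_pts, buffered, join = st_within)
```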
The result looks like this:
Next, we’ll tabulate our crash counts and join these results back to our spatial data. We also use the crash attributes to impute whether a given road segment is lighted or not, based on the lighting conditions reported in the joined crashes.
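One way to sketch that tabulation with data.table — segment_id, ILLUMINATION, and the "lit" illumination codes are all assumed names, standing in for whatever your crash schema actually uses:

```r
library(data.table)

jt <- as.data.table(st_drop_geometry(joined))
lit_codes <- c(2, 3)  # hypothetical illumination codes indicating street lights

# Crashes per segment, plus a lighting flag inferred from the lighting
# conditions reported on the joined crashes
seg_stats <- jt[!is.na(segment_id),
  .(total_crashes = .N,
    lighting = fifelse(mean(ILLUMINATION %in% lit_codes) > 0.5,
                       "lit", "unlit")),
  by = segment_id]

# Join the counts back to the segment table; segments with no joined
# crashes get zero crashes and an NA lighting attribute (imputed later)
segments <- merge(as.data.table(st_drop_geometry(segments_sf)),
                  seg_stats, by = "segment_id", all.x = TRUE)
segments[is.na(total_crashes), total_crashes := 0]
```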
Decision Trees
Decision trees are useful tools for identifying structural relationships in data. Their principal use is classification: deriving a rule set that assigns observations to two or more classes. But the algorithm that derives those rule sets is also a useful exploratory data analysis tool for surfacing relationships between predictors and outcome variables.
Below, a decision tree is used to fill in the lighting attribute for road segments without any crashes to impute from, and to learn which attributes in the dataset are predictive of crash rates.
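A sketch of that imputation using the rpart package; the predictor column names (aadt, speed_limit, urban_rural, lanes) are assumptions about what the roadway segment table carries:

```r
library(rpart)

# Fit a classification tree on segments where lighting is known
fit <- rpart(lighting ~ aadt + speed_limit + urban_rural + lanes,
             data = segments[!is.na(lighting)], method = "class")

# Impute lighting for segments with no crashes to infer it from
segments[is.na(lighting),
         lighting := as.character(predict(fit, .SD, type = "class"))]

# Printing the tree also shows which attributes carry predictive signal
fit
```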
Plotting and Mapping
Data visualization enables humans to process and spot trends in vast volumes of data at a glance. For spatial data, this typically means mapping. For example, we can use R’s leaflet package (which provides an API to the popular Leaflet JavaScript web mapping library) to inspect whether our crash-based roadway lighting assignment makes sense.
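A sketch of such a map — segments_map is a hypothetical sf object carrying the imputed lighting attribute, transformed back to WGS 84 longitude/latitude as leaflet expects:

```r
library(leaflet)

pal <- colorFactor(c("gold", "grey40"), domain = c("lit", "unlit"))

leaflet(st_transform(segments_map, 4326)) |>
  addProviderTiles(providers$CartoDB.DarkMatter) |>
  addPolylines(color = ~pal(lighting), weight = 2, opacity = 0.8) |>
  addLegend(pal = pal, values = ~lighting, title = "Lighting (imputed)")
```

Segments whose imputed lighting disagrees with what you'd expect from the basemap (e.g. an "unlit" urban interchange) are good candidates for a closer look.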
For plotting tabular data, ggplot2 is R’s workhorse data visualization library. Below, we plot road segment crash counts against exposure (vehicle miles travelled, defined as segment length times daily traffic).
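A minimal version of that plot; the mvmt column (million vehicle miles travelled) is an assumed name:

```r
library(ggplot2)

# Crash counts vs. exposure, colored by the lighting attribute
ggplot(segments, aes(x = mvmt, y = total_crashes, color = lighting)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(x = "Exposure (million VMT)",
       y = "Crashes (5 years)",
       color = "Lighting")
```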
Count Regression Models
Regression is a highly useful statistical tool for identifying quantitative relationships in data. Linear Regression with Ordinary Least Squares is the most common type of regression (for predicting continuous outcome variables with linear predictor relationships), but regression is actually a family of models with many different types and applications. Expanding your regression repertoire to include count models has many applications in a transportation context.
Statewide crash data is frequently modeled with Zero-Inflated Negative Binomial (ZINB) regression, which accounts for the probability that short or low traffic segments will have zero recorded crashes. We can investigate the relationship between roadway lighting and crashes by including lighting as an explanatory variable in a ZINB regression model.
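The model can be fit with the pscl package. The formula's count component (before the `|`) models crash counts, while the zero-inflation component (after the `|`) models the probability of a structural zero; passing dist = "negbin" makes the count component negative binomial (the default is Poisson):

```r
library(pscl)

# Count component: crashes ~ lighting + exposure
# Zero component:  probability of a structural zero ~ exposure
model <- pscl::zeroinfl(total_crashes ~ lighting + mvmt | mvmt,
                        data = segments)
summary(model)
```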
Our ZINB model predicts crashes based on exposure (VMT) and lighting. Here are the model outputs:
Call:
pscl::zeroinfl(formula = total_crashes ~ lighting + mvmt | mvmt, data = segments)

Count model coefficients (poisson with log link):
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    2.1916955  0.0040869  536.28   <2e-16 ***
lightingunlit -0.3935121  0.0056469  -69.69   <2e-16 ***
mvmt           0.0370332  0.0001227  301.72   <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.14028    0.03982    3.523 0.000427 ***
mvmt        -1.97937    0.06055  -32.690  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model estimates that road segments without lighting (lightingunlit) experience about a third fewer crashes, on average, than lighted segments, everything else being equal: the count model uses a log link, so the −0.39 coefficient corresponds to exp(−0.39) ≈ 0.68 times as many crashes.
This counter-intuitive finding suggests the presence of confounding variables. Lighting isn’t installed at random, after all. We intended to investigate lighting’s impact on crashes; we seem to have found crashes’ impact on lighting (i.e. lighting seems to be installed at inherently dangerous locations).
Our failure, however, points the way to additional approaches that may succeed. Since lighting is not installed at random, a better approach may be to find data where lighting conditions have changed over time, such as newly installed lighting or where maintenance records indicate a burned out light bulb.
Data Science in Practice
Data science is iterative. Thomas Edison famously said of his repeated failure to produce a functioning lightbulb, “I have not failed. I’ve just found 10,000 ways that won’t work.” A dose of humility goes a long way in the practical application of data science; if a topic is important, it has likely been studied before. Don’t expect your first inquiry to radically upend existing norms. Do expect to fail more often than you succeed; try new iterations incorporating the learnings from previous iterations until you arrive at an actionable finding (or run out of data, as is often the case).
Nearly every living person interacts with and is impacted by the quality and effectiveness of our transportation systems, whether through its success — mobility and economic opportunity — or shortcomings (global warming; 40,000+ annual deaths due to automobiles in America alone; congestion; etc.). Most public sector transportation agencies have an abundance of data but lack the skilled data scientists to expand their use of data-informed decision making.
If you’re willing to be persistent and possess a working knowledge of the five skills identified above, a public sector transportation agency is a great place for an impact-oriented data scientist.