LA Traffic Data Analysis
Using open-source data to analyze collision patterns in Los Angeles
Traffic is an issue that's familiar to pretty much everyone. As a 7-year Los Angeles resident, I've sat in more than my fair share of gridlock, seemingly regardless of time of day or day of week. That's why I was so interested when I stumbled upon a traffic collision data set maintained by the city of Los Angeles. This data is cool for several reasons. While it doesn't directly measure traffic, it measures a closely related proxy: it's not a stretch to hypothesize that more traffic correlates with more collisions, which in turn cause more traffic.
I am hopeful that data sets like this one can be used to create safer and more efficient communities for everyone. In that spirit, this data set (and a bunch of others) is actively maintained by the city of Los Angeles and is freely available to the public.
After browsing the data, I settled on 3 major questions I wanted to attempt to answer:
- How do traffic collision patterns vary by time of day, day of week, and time of year?
- How are collisions distributed geographically? Is it possible to identify high-risk areas or intersections?
- Is it possible to predict the number of collisions in a given time frame?
Before getting into the questions above, let's learn more about the data set.
The Data
The data begins in January 2010 and is updated weekly. In my particular case, I use data from January 2010 - July 2019, which ends up being ~500K rows. Each row corresponds to a collision. This data is transcribed from original paper traffic reports, so it's very likely that there are errors. Below is a sample of some of the key fields:
The availability of these fields inspired the key questions listed above.
As with any data set, this one needs cleaning before starting any analysis. There are a few columns with only one value, reflecting the fact that all rows in this data correspond to traffic collisions.
There are also multiple fields with the approximate street names of collisions (not shown above). These text fields need cleaning, specifically removing extra spaces. Similarly, in the image above we can see the latitude/longitude coordinates contained in a string. I extract these coordinates into separate columns for later use.
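The original analysis is in R (code linked at the bottom of the post), but the coordinate-extraction step can be sketched in Python. The exact string format of the coordinate field and the function name are assumptions here:

```python
import re

# Hypothetical coordinate string format: "(34.0522, -118.2437)"
def extract_coords(location_str):
    """Pull (latitude, longitude) floats out of a '(lat, lon)' style string."""
    match = re.search(r"\((-?\d+\.?\d*),\s*(-?\d+\.?\d*)\)", location_str or "")
    if not match:
        return (None, None)  # rows without valid coordinates stay null
    return (float(match.group(1)), float(match.group(2)))

print(extract_coords("(34.0522, -118.2437)"))  # (34.0522, -118.2437)
print(extract_coords("no coordinates here"))   # (None, None)
```

Cleaning the street-name fields is similarly mechanical: `" ".join(name.split())` collapses repeated spaces.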
The next step is to check for null or missing values. ~16% of collisions (~78K) don't have an associated victim age. There is also a small number (~400) of collisions that do not have valid latitude/longitude coordinates and will be excluded from the mapping section in part 2.
Data Exploration
After cleaning, but before diving into the main questions, I want to do some general data exploration. This sort of exploration is typically useful: even without a specific goal in mind, I often find helpful trends or insights. I'll start by plotting a few of the variables in the data. By the way, all of this work was done in R and the code is linked at the bottom of the post!
Here's what jumps out at me:
- There are hardly any victims below age 15.
- Most victims are in their 20s. The number of collision victims per age generally decreases after age 30.
- There are spikes at most multiples of 5 (25, 30, 35, etc). This suggests that some ages are estimated and that official identification (such as a driver's license) isn't always used in collision reports.
- Age 99 seems to be a catch-all bucket; it's unlikely that there are actually as many age-99 victims as shown above.
This plot also raises questions:
- How are collisions with multiple victims dealt with?
- What's going on with the spike at age 99?
I emailed the data owner about these questions, but unfortunately haven't heard back. I'll add an update if I get a response.
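The multiples-of-5 spikes can be quantified: if ages were reported exactly, roughly 1 in 5 would land on a multiple of 5, so a much larger share suggests estimation. A toy sketch with made-up ages:

```python
from collections import Counter

# Illustrative ages only; the real distribution comes from the victim age field
ages = [23, 25, 25, 30, 30, 30, 27, 35, 41, 40, 22, 45, 33]

counts = Counter(ages)
share_at_5s = sum(c for age, c in counts.items() if age % 5 == 0) / len(ages)
print(round(share_at_5s, 2))  # 0.62, well above the ~0.2 expected by chance
```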
Next, let's look at collisions by gender.
- This plot tells us that given that a collision occurred, the victim is much more likely to be male than female.
This work would be more interesting if I had the total number of drivers by gender, which would allow a collisions-per-capita measure. This missing denominator is a recurring shortcoming of the data, and addressing it would be one of my main extensions of this analysis.
The next thing I want to do is look at the lowest and highest collision days in the data. I only include the top and bottom 0.5% on either end so I can review the results manually. Here are the lowest collision days:
- Most low-collision days fall on or around holidays. Intuitively, this makes sense, as there are likely fewer people driving on (certain) holidays.
- Most low-collision days occur before early 2014. We'll see later in this post that monthly collisions start rising after 2014.
And here are the highest collision days:
- Most high-collision days are Fridays occurring after 2015.
- We'll see later in this post that Fridays typically have the highest number of collisions out of any day of the week.
These results bring up a few questions and thoughts outside the scope of this analysis.
- Why are only some holidays associated with low-collision days? For example, MLK Day shows up twice as a low-collision day but Independence Day never does.
- Why donāt holidays like Memorial Day or Labor Day show up as either low or high collision days?
- Daylight Saving Time does not show up as any kind of outlier in any year. That surprised me.
- Would it be interesting to look at weather conditions on high-collision days?
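The low/high day selection above is a quantile filter on daily totals. A sketch on simulated counts (the real totals come from grouping collisions by date):

```python
import random

random.seed(0)
# Simulated daily collision totals; illustrative only
daily_counts = [random.gauss(150, 20) for _ in range(1000)]

values = sorted(daily_counts)
n = len(values)
low_cut = values[max(int(0.005 * n) - 1, 0)]  # bottom 0.5% threshold
high_cut = values[int(0.995 * n)]             # top 0.5% threshold

low_days = [c for c in daily_counts if c <= low_cut]
high_days = [c for c in daily_counts if c >= high_cut]
print(len(low_days), len(high_days))  # 5 5 -> small enough to review by hand
```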
With this descriptive analysis in mind, I'm ready to tackle the key questions I outlined at the beginning of this post. The first one is about analyzing collisions over time.
Collisions by Time
In this section, I analyze how collision patterns vary by time of day, day of week, and time of year.
First, I plot daily collisions for 2018. Plotting all of the data starting from 2010 is too chaotic and so I focus only on 2018 for many parts of this section. In addition to collision date, the data has a field for the reporting date. This is the date the collision was actually reported to police. In most cases, the reporting date is the same day or one day after the collision date.
- Daily collisions and collisions reported have noticeable differences. Look at the mid-April spike in collisions reported… there's nothing similar in actual collisions!
- These differences make me think that there may be administrative dynamics at play regarding when collisions are reported or processed.
- The outliers in the data don't have an obvious pattern.
Let's see a similar plot aggregated by month. At this level, I can include the entire time frame from 2010–2019.
- At the monthly level, these two quantities seem to track each other more closely than at the daily level.
- Monthly collisions were roughly constant from 2010 - 2014, rose from 2014 - 2017, and were roughly constant from 2017 - 2019. Remember that the data used for this analysis ends in July 2019.
So, why do monthly collisions rise in the plot above? Did the number of people living and driving in LA rise from 2014 - 2017? Could it be related to the rise of ride-sharing services? It's not clear from this data, but these are possible starting points for a separate post.
Next, I analyze the distribution of collisions throughout the day.
Collisions are:
- sharply increasing from ~4am to ~8am
- decreasing from ~8am to ~9:30am
- generally increasing from ~9:30am to ~6pm
- sharply decreasing from ~8pm to ~4am
- at their daily minimum [maximum] at ~4am [~5pm]
These results likely mirror the number of vehicles on the road. It seems intuitive that many collisions occur during the evening rush hour. As I mentioned before, having access to a measure of total vehicles on the road per hour would allow me to calculate collisions per capita. There are no hourly timestamps available for when collisions are reported, so I can't plot that field by hour.
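Bucketing collisions by hour drives the distribution above. A sketch, assuming the time field is an HHMM-style integer (the field name and format are assumptions):

```python
from collections import Counter

# Toy HHMM collision times; 1700-1745 stands in for the evening rush
times = [430, 745, 800, 815, 930, 1215, 1700, 1715, 1730, 1745, 2300]

by_hour = Counter(t // 100 for t in times)  # 1715 -> hour 17
peak_hour = max(by_hour, key=by_hour.get)
print(peak_hour, by_hour[peak_hour])  # 17 4 -> the ~5pm peak
```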
The next step is to examine collisions by day of week.
Collisions are:
- increasing from Sunday - Friday, with a sharp increase from Thursday to Friday
- at their weekly minimum [maximum] on Sunday [Friday]
The end of the working week is the obvious hypothesis for the high number of Friday collisions. I haven't come up with any others so far!
For collisions reported:
- Sunday [Friday] still has the fewest [most] collisions
- However, collisions reported per weekday are essentially constant
Finally, I plot results by month.
Collisions are:
- generally constant from April to August
- generally lower from September to February, especially from September to December
- highest in March
So, collisions tend to be lower in colder months. One possible explanation is fewer tourists visiting LA in the colder months. Regarding the high number of collisions in March, my initial hypothesis was Daylight Saving Time. However, none of the highest collision days fell on that date. Perhaps this result stems from spring break tourists?
This concludes my analysis of the temporal patterns of collisions. Here's my summary of the results above.
- Collisions and collisions reported vary substantially at the daily level, but not at the monthly level.
- Monthly collisions were roughly constant from 2010 - 2014, rose from 2014 - 2017, and were roughly constant from 2017 - mid-2019 (the end date of this analysis).
- The number of collisions is lowest [highest] at ~4 AM [~5 PM].
- The number of collisions is lowest [highest] on Sunday [Friday].
- The number of collisions is lowest [highest] in September-December [March].
- Getting information on the total number of vehicles on the road at a given time would allow interesting and useful per-capita calculations of the results above.
Next up is looking at collisions by geography.
Collisions by Geography
In addition to latitude/longitude coordinates, the data has multiple fields describing where a collision occurred. I'll start out by looking at these fields. First, I'll plot the distribution of collisions by `area`.
Some areas obviously have more collisions than others. But without additional information such as size or traffic density per `area`, this graphic isn't too informative.
The data also includes fields called `location` and `cross_street`. `location` is the main street a collision occurred on, while `cross_street` is the nearest cross street. I'll look at the 10 most common values for these fields and their combination.
The 10 most common `location` values are some of the longest and most trafficked roads in LA. These top 10 streets account for >10% of total collisions. There are >25K distinct `location` values, so there's a very long tail. Now, I'll look at the `cross_street` field.
5% of collisions have no associated `cross_street`. Otherwise, this list has a lot of overlap with the previous one. The obvious next step is to look at the most common intersections for collisions by combining these two fields.
The list of most common intersections (`location`/`cross_street` combinations) contains many of the same streets that were in the previous lists. However, there are exceptions: the components of row 2 (Tampa Ave. and Nordhoff St.) don't appear among either the most common `location` or `cross_street` values. Even the most collision-prone intersections account for only a small proportion of overall collisions.
Now itās time to take advantage of the geographic coordinates in the data and start mapping collisions. As a reminder, there are a small number of collisions that do not have valid latitude/longitude data and so are excluded from this section.
- The spatial distribution of points shows the interesting shape of LA.
- Blue [Red] points indicate latitude/longitude coordinates with a low [high] number of 2018 collisions.
- Even on this zoomed-out map, high collision coordinates are visible in the Valley (the northern part of the map) and the central and eastern parts of the city.
This map is pretty cluttered. For a better view, I'll zoom in to a window showing much of central and downtown Los Angeles.
- A number of medium and high collision coordinates are clearly visible. Many of these high collision points tend to occur on various intersections of the same street.
The previous maps showed overall collisions. How does this data look if I add time of day?
- This plot only includes coordinates with 5+ collisions. Each coordinate is assigned to the daypart in which most of its collisions occur. Coordinates with ties between dayparts are removed.
- There are no coordinates with a majority of collisions occurring in the `Early Morning`. This makes sense given what I found in the Collisions by Time analysis.
- Many coordinates have most of their collisions in the `Afternoon` and `Evening`. This result also matches the results of the Collisions by Time section.
- Interestingly, there's a cluster of coordinates where `Late Night` collisions are common.
Let's look at a similar map broken out by weekday/weekend.
- This plot only includes coordinates with 3+ collisions. Each coordinate is assigned to the weekpart in which most of its collisions occur. Coordinates with ties between weekparts are removed.
- There are obviously more days and collisions in the `Weekday` bucket.
- This map identifies areas with many weekend collisions. It might also be interesting to include part or all of Friday in the `Weekend` bucket.
To conclude this section, I plot collisions by time of year.
- This plot only includes coordinates with 3+ collisions. Each coordinate is assigned to the season in which most of its collisions occur. Coordinates with ties between seasons are removed.
- It looks like more collisions occur in the summer months (Mar-Aug). This would line up with the results in the Collisions by Time section.
- Given that Los Angeles doesn't have distinct seasons like fall or winter, there may be other ways to split up the year for a plot like this.
These are my conclusions for the Collisions by Geography section:
- Using the `location` and `cross_street` fields, in addition to mapping, it is possible to identify the most accident-prone coordinates in Los Angeles.
- The most common dayparts for collisions are the `Afternoon` and `Evening`.
- Many more collisions occur during the weekdays than the weekend. Mapping collisions by weekpart shows areas where weekend collisions are more common.
- More collisions happen in the summer than winter months.
Predicting Collisions
The final section deals with trying to predict collisions. I specifically try to predict the number of collisions that will occur per month and `area`.
I'll start by looking at an example of the collision time series for a single `area`:
- The trend for `area` 2 generally matches the trend of overall monthly collisions (see the Collisions by Time analysis).
Let's decompose this time series into trend, seasonality, and remainder components. As an aside, I'll be focusing on the analysis results in this post and won't delve into the theory of the time series methods I'm using. However, there are lots of resources available online if you'd like to learn more!
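For a concrete picture of what the decomposition does: an additive decomposition splits the series into trend + seasonality + remainder. The post's decomposition uses standard R tooling; below is a minimal, illustrative stdlib Python version, using a 2x12 centered moving average for the trend (an assumption about the method, not the post's exact code):

```python
def decompose(series, period=12):
    """Additive decomposition: returns (trend, seasonal, remainder) lists.
    Trend entries near the edges are None (not enough data for the window)."""
    n = len(series)
    half = period // 2
    # Trend: 2x12 centered moving average (half weight on the window endpoints)
    trend = [None] * n
    weights = [0.5] + [1.0] * (period - 1) + [0.5]
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        trend[i] = sum(w * v for w, v in zip(weights, window)) / period
    # Seasonality: mean detrended value at each position in the yearly cycle
    detrended = [s - t if t is not None else None for s, t in zip(series, trend)]
    seasonal_means = []
    for m in range(period):
        vals = [d for j, d in enumerate(detrended) if j % period == m and d is not None]
        seasonal_means.append(sum(vals) / len(vals) if vals else 0.0)
    center = sum(seasonal_means) / period  # force seasonal effects to average to 0
    seasonal = [seasonal_means[j % period] - center for j in range(n)]
    # Remainder: whatever the trend and seasonality don't explain
    remainder = [s - t - c if t is not None else None
                 for s, t, c in zip(series, trend, seasonal)]
    return trend, seasonal, remainder
```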
- Keep the shape of the seasonality curve in mind. I'll compare it against the decomposition of overall monthly collisions next.
- The trend looks similar to what we saw in the previous graphic.
Let's compare the `area` 2 decomposition to the overall monthly collisions decomposition.
- The seasonality curve for the overall data looks very different compared to `area` 2! This indicates that different areas can have different dynamics.
Next, I'll look at the auto-correlation function (ACF) and partial auto-correlation function (PACF) for `area` 2. The ACF measures how correlated lagged values of collisions are with the current value. The PACF shows how much previously unexplained variance each lag explains.
- These plots indicate that `area` 2 collision values are correlated with their lags. However, after the first two lagged values, additional lags do not account for much unexplained variance.
- So, any model predicting `area` 2 collisions should include at least 2 lagged terms.
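For intuition, the sample ACF at lag k is essentially the correlation between the series and a copy of itself shifted k steps. A minimal sketch:

```python
def acf(series, max_lag):
    """Sample autocorrelation for lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    return [
        sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n)) / var
        for k in range(1, max_lag + 1)
    ]

print(acf(list(range(20)), 2))  # trending series -> strongly positive lags
print(acf([1, -1] * 10, 1))     # alternating series -> strongly negative lag 1
```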
I try a few different model types to predict collisions:
- 3 and 6 month moving average models (MA)
- Auto Regressive Integrated Moving Average (ARIMA)
- Prophet library
MA models average past values to generate predictions and are the simplest type of time series model. ARIMA models can use past values, differencing, and previous errors. Prophet is an additive forecasting model that fits a non-linear trend together with seasonal effects.
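As a baseline reference, the k-month MA forecast is just the mean of the last k observed values:

```python
def ma_forecast(history, k):
    """Predict next month's collisions as the mean of the last k months."""
    window = history[-k:]
    return sum(window) / len(window)

monthly = [120, 130, 125, 140, 135, 150]  # toy monthly collision counts
print(ma_forecast(monthly, 3))  # (140 + 135 + 150) / 3 = 141.66...
```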
Prophet is much newer than ARIMA or MA models. You can find more info on it here:
I evaluate each model on the final 12 months of data (August 2018 to July 2019) with the following metrics:
- Mean absolute percentage error (MAPE): average absolute percentage difference between predictions and actual values
- Bias: average percentage difference between predictions and actual values
The MAPE gives me an idea of how far my predictions are from actual collision values while the bias tells me if I am systematically over- or under-estimating the data.
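Both metrics are simple averages over the validation months:

```python
def mape(actual, predicted):
    """Mean absolute percentage error (in %)."""
    return 100 * sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)

def bias(actual, predicted):
    """Mean signed percentage error (in %); positive means over-prediction."""
    return 100 * sum((p - a) / a for a, p in zip(actual, predicted)) / len(actual)

actual, predicted = [100, 200, 100], [110, 180, 100]
print(round(mape(actual, predicted), 2))  # 6.67 -> off by ~6.7% on average
print(round(bias(actual, predicted), 2))  # 0.0  -> over/under-predictions cancel
```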
I auto-fit ARIMA models for each `area`, so each one can have different p, d, and q orders. I also test multiple Prophet model specifications (and show the best results below). Here are the model MAPE and bias results averaged over all areas:
- Overall, I think these results are pretty good!
- The 3 and 6 month MA models have very similar performance.
- The ARIMA model has a similar MAPE to the MA models, but worse bias.
- The Prophet model has the worst results by far (even after tuning).
- It's surprising to me that the MA models (the simplest ones) have the best performance!
Let's look at model predictions for `area` 2 only.
- The Prophet model has higher predictions than other models for the first 5 months.
- All models miss the drop in Jan 2019 and the up/down pattern of April 2019 - July 2019.
- The 6 month MA model predictions don't vary that much.
- The MA and ARIMA models seem to make conservative predictions that don't capture the fluctuating nature of the data.
Next, I'll look at the average MAPE per month per model. This plot includes all areas.
- For the first half of the validation data, the MAPE for the MA and ARIMA models move together.
It's also worth looking at the average bias per month per model.
- The bias for the MA and ARIMA models move together.
- The Prophet model has positive bias (overpredicts) for the entire validation set.
Now, I'll look at the average MAPE and bias per model per `area`.
- Average MAPE performance varies substantially by area. For example, model performance on area 12 [3] is relatively good [bad].
- Average bias also varies substantially by `area`.
- Surprisingly, most models have positive bias (which indicates over-prediction) for most areas!
Next, I'll look at the worst `area`/month predictions per model.
- `area` 14 shows up twice: both January and February of 2019 appear.
- It's interesting to see cases where all models struggled (Jan 2019 in `area` 2) vs. cases where one model in particular struggled (Sept 2018 in `area` 14).
Based on the worst predictions, I'll zoom into `area` 14.
- The trend for `area` 14 is completely different from that of `area` 2 above.
- All models except Prophet miss the spike in July 2019.
Here are my conclusions for the Collision Prediction section:
- Overall monthly performance is not bad (<10% MAPE and bias in most cases).
- However, the MA models have the best performance, which indicates that longer-term lagged data, differencing, and previous errors don't improve error rates. This is surprising, but suggests that the number of collisions per `area`/month is largely random within a certain range.
- Trends seem to vary by `area`. This should in theory be addressed by the ARIMA and Prophet methods, which fit a separate model per `area`.
- To improve these models, I would want to dig into one or two areas in depth and attempt to understand the trends. Getting data about the specific traffic patterns of an `area` would definitely help too. Additional data sources (like weather) could also be promising to explore.
Conclusion
This concludes my analysis on Los Angeles collision data! Feel free to get in touch if you have other approaches to these questions.
You can find all of the code I used in the GitHub repository below:
If you enjoyed this post, check out some of my other work below!
Thanks for reading!