LA Traffic Data Analysis
Using open-source data to analyze collision patterns in Los Angeles
Traffic is an issue that's familiar to pretty much everyone. As a 7-year Los Angeles resident, I've sat in more than my fair share of gridlock, seemingly regardless of time of day or day of week. That's why I was so interested when I stumbled upon a traffic collision data set maintained by the city of Los Angeles. This data is cool for several reasons. While it doesn't directly measure traffic, it measures a closely related proxy: it's not a stretch to hypothesize that more traffic correlates with more collisions, which in turn cause more traffic.
I am hopeful that data sets like this one can be used to create safer and more efficient communities for everyone. In that spirit, this data set (and a bunch of others) is actively maintained by the city of Los Angeles and is freely available to the public.
After browsing the data, I settled on 3 major questions I wanted to attempt to answer:
- How do traffic collision patterns vary by time of day, day of week, and time of year?
- How are collisions distributed geographically? Is it possible to identify high-risk areas or intersections?
- Is it possible to predict the number of collisions in a given time frame?
Before getting into the questions above, let's learn more about the data set.
The Data
The data begins in January 2010 and is updated weekly. In my particular case, I use data from January 2010 - July 2019, which ends up being ~500K rows. Each row corresponds to a collision. This data is transcribed from original paper traffic reports, so it's very likely that there are errors. Below is a sample of some of the key fields:
The availability of these fields inspired the key questions listed above.
As with any data set, this one needs cleaning before starting any analysis. There are a few columns with only one value, reflecting the fact that all rows in this data correspond to traffic collisions.
There are also multiple fields with the approximate street names of collisions (not shown above). These text fields need cleaning, specifically removing extra spaces. Similarly, in the image above we can see the latitude/longitude coordinates contained in a string. I extract these coordinates into separate columns for later use.
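The original analysis is in R (code linked at the bottom of the post), but the coordinate-extraction step can be sketched in Python. The exact string format of the coordinate field and the function name are assumptions here:

```python
import re

# Hypothetical coordinate string format: "(34.0522, -118.2437)"
def extract_coords(location_str):
    """Pull (latitude, longitude) floats out of a '(lat, lon)' style string."""
    match = re.search(r"\((-?\d+\.?\d*),\s*(-?\d+\.?\d*)\)", location_str or "")
    if not match:
        return (None, None)  # rows without valid coordinates stay null
    return (float(match.group(1)), float(match.group(2)))

print(extract_coords("(34.0522, -118.2437)"))  # (34.0522, -118.2437)
print(extract_coords("no coordinates here"))   # (None, None)
```

Cleaning the street-name fields is similarly mechanical: `" ".join(name.split())` collapses repeated spaces.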
The next step is to check for null or missing values. ~16% of collisions (~78K) don't have an associated victim age. There is also a small number (~400) of collisions that do not have valid latitude/longitude coordinates and will be excluded from the mapping section in part 2.
Data Exploration
After cleaning, but before diving into the main questions, I want to do some general data exploration. This sort of exploration is typically useful: even without a specific goal in mind, I often find helpful trends or insights. I'll start by plotting a few of the variables in the data. By the way, all of this work was done in R and the code is linked at the bottom of the post!
Here's what jumps out at me:
- There are hardly any victims below age 15.
- Most victims are in their 20s. The number of collision victims per age generally decreases after age 30.
- There are spikes at most multiples of 5 (25, 30, 35, etc). This suggests that some ages are estimated and that official identification (such as a driver's license) isn't always used in collision reports.
- Age 99 seems to be a catch-all bucket; it's unlikely that there are actually as many age-99 victims as shown above.
This plot also raises questions:
- How are collisions with multiple victims dealt with?
- What's going on with the spike at age 99?
I emailed the data owner about these questions, but unfortunately haven't heard back. I'll add an update if I get a response.
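The multiples-of-5 spikes can be quantified: if ages were reported exactly, roughly 1 in 5 would land on a multiple of 5, so a much larger share suggests estimation. A toy sketch with made-up ages:

```python
from collections import Counter

# Illustrative ages only; the real distribution comes from the victim age field
ages = [23, 25, 25, 30, 30, 30, 27, 35, 41, 40, 22, 45, 33]

counts = Counter(ages)
share_at_5s = sum(c for age, c in counts.items() if age % 5 == 0) / len(ages)
print(round(share_at_5s, 2))  # 0.62, well above the ~0.2 expected by chance
```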
Next, let's look at collisions by gender.
- This plot tells us that given that a collision occurred, the victim is much more likely to be male than female.
This work would be more interesting if I had the total number of drivers by gender, which would allow a collisions-per-capita measure. This missing denominator is a recurring shortcoming of the data, and addressing it would be one of my main extensions of this analysis.
The next thing I want to do is look at the lowest and highest collision days in the data. I only include the top and bottom 0.5% on either end so I can review the results manually. Here are the lowest collision days:
- Most low-collision days fall on or around holidays. Intuitively, this makes sense, as there are likely fewer people driving on (certain) holidays.
- Most low-collision days occur before early 2014. We'll see later in this post that monthly collisions start rising after 2014.
And here are the highest collision days:
- Most high-collision days are Fridays occurring after 2015.
- We'll see later in this post that Fridays typically have the highest number of collisions out of any day of the week.
These results bring up a few questions and thoughts outside the scope of this analysis.
- Why are only some holidays associated with low-collision days? For example, MLK Day shows up twice as a low-collision day but Independence Day never does.
- Why donāt holidays like Memorial Day or Labor Day show up as either low or high collision days?
- Daylight Saving Time does not show up as any kind of outlier in any year. That surprised me.
- Would it be interesting to look at weather conditions on high-collision days?
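The low/high day selection above is a quantile filter on daily totals. A sketch on simulated counts (the real totals come from grouping collisions by date):

```python
import random

random.seed(0)
# Simulated daily collision totals; illustrative only
daily_counts = [random.gauss(150, 20) for _ in range(1000)]

values = sorted(daily_counts)
n = len(values)
low_cut = values[max(int(0.005 * n) - 1, 0)]  # bottom 0.5% threshold
high_cut = values[int(0.995 * n)]             # top 0.5% threshold

low_days = [c for c in daily_counts if c <= low_cut]
high_days = [c for c in daily_counts if c >= high_cut]
print(len(low_days), len(high_days))  # 5 5 -> small enough to review by hand
```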
With this descriptive analysis in mind, I'm ready to tackle the key questions I outlined at the beginning of this post. The first one is about analyzing collisions over time.
Collisions by Time
In this section, I analyze how collision patterns vary by time of day, day of week, and time of year.
First, I plot daily collisions for 2018. Plotting all of the data starting from 2010 is too chaotic and so I focus only on 2018 for many parts of this section. In addition to collision date, the data has a field for the reporting date. This is the date the collision was actually reported to police. In most cases, the reporting date is the same day or one day after the collision date.
- Daily collisions and collisions reported have noticeable differences. Look at the mid-April spike in collisions reported… there's nothing similar in actual collisions!
- These differences make me think that there may be administrative dynamics at play regarding when collisions are reported or processed.
- The outliers in the data don't have an obvious pattern.
Let's see a similar plot aggregated by month. At this level, I can include the entire time frame from 2010–2019.
- At the monthly level, these two quantities seem to track each other more closely than at the daily level.
- Monthly collisions were roughly constant from 2010 - 2014, rose from 2014 - 2017, and were roughly constant from 2017 - 2019. Remember that the data used for this analysis ends in July 2019.
So, why do monthly collisions rise in the plot above? Did the number of people living and driving in LA rise from 2014 - 2017? Could it be related to the rise of ride-sharing services? It's not clear from this data, but these are possible starting points for a separate post.
Next, I analyze the distribution of collisions throughout the day.
Collisions are:
- sharply increasing from ~4am to ~8am
- decreasing from ~8am to ~9:30am
- generally increasing from ~9:30am to ~6pm
- sharply decreasing from ~8pm to ~4am
- at their daily minimum [maximum] at ~4am [~5pm]
These results likely mirror the number of vehicles on the road. It seems intuitive that many collisions occur during the evening rush hour. As I mentioned before, having access to a measure of total vehicles on the road per hour would allow me to calculate collisions per capita. There are no hourly timestamps available for when collisions are reported, so I can't plot that field by hour.
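Bucketing collisions by hour drives the distribution above. A sketch, assuming the time field is an HHMM-style integer (the field name and format are assumptions):

```python
from collections import Counter

# Toy HHMM collision times; 1700-1745 stands in for the evening rush
times = [430, 745, 800, 815, 930, 1215, 1700, 1715, 1730, 1745, 2300]

by_hour = Counter(t // 100 for t in times)  # 1715 -> hour 17
peak_hour = max(by_hour, key=by_hour.get)
print(peak_hour, by_hour[peak_hour])  # 17 4 -> the ~5pm peak
```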
The next step is to examine collisions by day of week.
Collisions are:
- increasing from Sunday - Friday, with a sharp increase from Thursday to Friday
- at their weekly minimum [maximum] on Sunday [Friday]
The end of the working week is the obvious hypothesis for the high number of Friday collisions. I haven't come up with any others so far!
For collisions reported:
- Sunday [Friday] still has the fewest [most] collisions
- However, collisions reported per weekday are essentially constant
Finally, I plot results by month.
Collisions are:
- generally constant from April to August
- generally lower from September to February, especially from September to December
- highest in March
So, collisions tend to be lower in colder months. One possible explanation is fewer tourists visiting LA in the colder months. Regarding the high number of collisions in March, my initial hypothesis was Daylight Saving Time. However, none of the highest collision days fell on that date. Perhaps this result stems from spring break tourists?
This concludes my analysis of the temporal patterns of collisions. Here's my summary of the results above.
- Collisions and collisions reported vary substantially at the daily level, but not at the monthly level.
- Monthly collisions were roughly constant from 2010 - 2014, rose from 2014 - 2017, and were roughly constant from 2017 - mid-2019 (the end date of this analysis).
- The number of collisions is lowest [highest] at ~4 AM [~5 PM].
- The number of collisions is lowest [highest] on Sunday [Friday].
- The number of collisions is lowest [highest] in September-December [March].
- Getting information on the total number of vehicles on the road at a given time would allow interesting and useful per-capita calculations of the results above.
Next up is looking at collisions by geography.
Collisions by Geography
In addition to latitude/longitude coordinates, the data has multiple fields describing where a collision occurred. I'll start out by looking at these fields. First, I'll plot the distribution of collisions by `area`.
Some areas obviously have more collisions than others. But without additional information such as size or traffic density per `area`, this graphic isn't too informative.
The data also includes fields called `location` and `cross_street`. `location` is the main street a collision occurred on, while `cross_street` is the nearest cross street. I'll look at the 10 most common values for these fields and their combination.
The 10 most common `location` values are some of the longest and most trafficked roads in LA. These top 10 streets account for >10% of total collisions. There are >25K distinct `location` values, so there's a very long tail. Now, I'll look at the `cross_street` field.
5% of collisions have no associated `cross_street`. Otherwise, this list has a lot of overlap with the previous one. The obvious next step is to look at the most common intersections for collisions by combining these two fields.
The list of most common intersections (`location`/`cross_street` combinations) contains many of the same streets that were in the previous lists. However, there are exceptions: the components of row 2 (Tampa Ave. and Nordhoff St.) don't appear among either the most common `location` or `cross_street` values. Even the most collision-prone intersections account for only a small proportion of overall collisions.
Now itās time to take advantage of the geographic coordinates in the data and start mapping collisions. As a reminder, there are a small number of collisions that do not have valid latitude/longitude data and so are excluded from this section.
- The spatial distribution of points shows the interesting shape of LA.
- Blue [Red] points indicate latitude/longitude coordinates with a low [high] number of 2018 collisions.
- Even on this zoomed-out map, high collision coordinates are visible in the Valley (the northern part of the map) and the central and eastern parts of the city.
This map is pretty cluttered. For a better view, I'll zoom in to a window showing much of central and downtown Los Angeles.
- A number of medium and high collision coordinates are clearly visible. Many of these high collision points tend to occur on various intersections of the same street.
The previous maps showed overall collisions. How does this data look if I add time of day?
- This plot only includes coordinates with 5+ collisions. Each coordinate is assigned to the daypart in which most of its collisions occur. Coordinates with ties between dayparts are removed.
- There are no coordinates with a majority of collisions occurring in the `Early Morning`. This makes sense given what I found in the Collisions by Time analysis.
- Many coordinates have most of their collisions in the `Afternoon` and `Evening`. This result also matches the results of the Collisions by Time section.
- Interestingly, there's a cluster of coordinates where `Late Night` collisions are common.
Let's look at a similar map broken out by weekday/weekend.
- This plot only includes coordinates with 3+ collisions. Each coordinate is assigned to the weekpart in which most of its collisions occur. Coordinates with ties between weekparts are removed.
- There are obviously more days and collisions in the `Weekday` bucket.
- This map identifies areas with many weekend collisions. It might also be interesting to include part or all of Friday in the `Weekend` bucket.
To conclude this section, I plot collisions by time of year.
- This plot only includes coordinates with 3+ collisions. Each coordinate is assigned to the season in which most of its collisions occur. Coordinates with ties between seasons are removed.
- It looks like more collisions occur in the summer months (Mar-Aug). This would line up with the results in the Collisions by Time section.
- Given that Los Angeles doesn't have distinct seasons like fall or winter, there may be other ways to split up the year for a plot like this.
These are my conclusions for the Collisions by Geography section:
- Using the `location` and `cross_street` fields, in addition to mapping, it is possible to identify the most accident-prone coordinates in Los Angeles.
- The most common dayparts for collisions are the `Afternoon` and `Evening`.
- Many more collisions occur during the weekdays than the weekend. Mapping collisions by weekpart shows areas where weekend collisions are more common.
- More collisions happen in the summer than winter months.
Predicting Collisions
The final section deals with trying to predict collisions. I specifically try to predict the number of collisions that will occur per month and `area`.
I'll start by looking at an example of the collision time series for a single `area`:
- The trend for `area` 2 generally matches the trend of overall monthly collisions (see the Collisions by Time analysis).
Let's decompose this time series into trend, seasonality, and remainder components. As an aside, I'll be focusing on the analysis results in this post and won't delve into the theory of the time series methods I'm using. However, there are lots of resources available online if you'd like to learn more!
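For a concrete picture of what the decomposition does: an additive decomposition splits the series into trend + seasonality + remainder. The post's decomposition uses standard R tooling; below is a minimal, illustrative stdlib Python version, using a 2x12 centered moving average for the trend (an assumption about the method, not the post's exact code):

```python
def decompose(series, period=12):
    """Additive decomposition: returns (trend, seasonal, remainder) lists.
    Trend entries near the edges are None (not enough data for the window)."""
    n = len(series)
    half = period // 2
    # Trend: 2x12 centered moving average (half weight on the window endpoints)
    trend = [None] * n
    weights = [0.5] + [1.0] * (period - 1) + [0.5]
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        trend[i] = sum(w * v for w, v in zip(weights, window)) / period
    # Seasonality: mean detrended value at each position in the yearly cycle
    detrended = [s - t if t is not None else None for s, t in zip(series, trend)]
    seasonal_means = []
    for m in range(period):
        vals = [d for j, d in enumerate(detrended) if j % period == m and d is not None]
        seasonal_means.append(sum(vals) / len(vals) if vals else 0.0)
    center = sum(seasonal_means) / period  # force seasonal effects to average to 0
    seasonal = [seasonal_means[j % period] - center for j in range(n)]
    # Remainder: whatever the trend and seasonality don't explain
    remainder = [s - t - c if t is not None else None
                 for s, t, c in zip(series, trend, seasonal)]
    return trend, seasonal, remainder
```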
- Keep the shape of the seasonality curve in mind. I'll compare it against the decomposition of overall monthly collisions next.
- The trend looks similar to what we saw in the previous graphic.
Let's compare the `area` 2 decomposition to the overall monthly collisions decomposition.
- The seasonality curve for the overall data looks very different compared to `area` 2! This indicates that different areas can have different dynamics.
Next, I'll look at the auto-correlation function (ACF) and partial auto-correlation function (PACF) for `area` 2. The ACF measures how correlated lagged values of collisions are with the current value. The PACF shows how much previously unexplained variance each lag explains.
- These plots indicate that `area` 2 collision values are correlated with their lags. However, after the first two lagged values, additional lags do not account for much unexplained variance.
- So, any model predicting `area` 2 collisions should include at least 2 lagged terms.
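For intuition, the sample ACF at lag k is essentially the correlation between the series and a copy of itself shifted k steps. A minimal sketch:

```python
def acf(series, max_lag):
    """Sample autocorrelation for lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    return [
        sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n)) / var
        for k in range(1, max_lag + 1)
    ]

print(acf(list(range(20)), 2))  # trending series -> strongly positive lags
print(acf([1, -1] * 10, 1))     # alternating series -> strongly negative lag 1
```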
I try a few different model types to predict collisions:
- 3 and 6 month moving average models (MA)
- Auto Regressive Integrated Moving Average (ARIMA)
- Prophet library
MA models average past values to generate predictions and are the simplest type of time series model. ARIMA models can use past values, differencing, and previous errors. Prophet is an additive forecasting model that fits a non-linear trend together with seasonal effects.
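As a baseline reference, the k-month MA forecast is just the mean of the last k observed values:

```python
def ma_forecast(history, k):
    """Predict next month's collisions as the mean of the last k months."""
    window = history[-k:]
    return sum(window) / len(window)

monthly = [120, 130, 125, 140, 135, 150]  # toy monthly collision counts
print(ma_forecast(monthly, 3))  # (140 + 135 + 150) / 3 = 141.66...
```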
Prophet is much newer than ARIMA or MA models. You can find more info on it here:
I evaluate each model on the final 12 months of data (August 2018 to July 2019) with the following metrics:
- Mean absolute percentage error (MAPE): average absolute percentage difference between predictions and actual values
- Bias: average percentage difference between predictions and actual values
The MAPE gives me an idea of how far my predictions are from actual collision values while the bias tells me if I am systematically over- or under-estimating the data.
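Both metrics are simple averages over the validation months:

```python
def mape(actual, predicted):
    """Mean absolute percentage error (in %)."""
    return 100 * sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)

def bias(actual, predicted):
    """Mean signed percentage error (in %); positive means over-prediction."""
    return 100 * sum((p - a) / a for a, p in zip(actual, predicted)) / len(actual)

actual, predicted = [100, 200, 100], [110, 180, 100]
print(round(mape(actual, predicted), 2))  # 6.67 -> off by ~6.7% on average
print(round(bias(actual, predicted), 2))  # 0.0  -> over/under-predictions cancel
```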
I auto-fit ARIMA models for each `area`, so each one can have different p, d, and q orders. I also test multiple Prophet model specifications (and show the best results below). Here are the model MAPE and bias results averaged over all areas:
- Overall, I think these results are pretty good!
- The 3 and 6 month MA models have very similar performance.
- The ARIMA model has a similar MAPE to the MA models, but worse bias.
- The Prophet model has the worst results by far (even after tuning).
- It's surprising to me that the MA models (the simplest ones) have the best performance!
Let's look at model predictions for `area` 2 only.
- The Prophet model has higher predictions than other models for the first 5 months.
- All models miss the drop in Jan 2019 and the up/down pattern of April 2019 - July 2019.
- The 6 month MA model predictions don't vary that much.
- The MA and ARIMA models seem to make conservative predictions that don't capture the fluctuating nature of the data.
Next, I'll look at the average MAPE per month per model. This plot includes all areas.
- For the first half of the validation data, the MAPE for the MA and ARIMA models move together.
It's also worth looking at the average bias per month per model.
- The bias for the MA and ARIMA models move together.
- The Prophet model has positive bias (overpredicts) for the entire validation set.
Now, I'll look at the average MAPE and bias per model per `area`.
- Average MAPE performance varies substantially by area. For example, model performance on area 12 [3] is relatively good [bad].
- Average bias also varies substantially by `area`.
- Surprisingly, most models have positive bias (which indicates over-prediction) for most areas!
Next, I'll look at the worst `area`/month predictions per model.
- `area` 14 shows up twice: both January and February of 2019 appear.
- It's interesting to see cases where all models struggled (Jan 2019 in `area` 2) vs. cases where one model in particular struggled (Sept 2018 in `area` 14).
Based on the worst predictions, I'll zoom into `area` 14.
- The trend for `area` 14 is completely different from that of `area` 2 above.
- All models except Prophet miss the spike in July 2019.
Here are my conclusions for the Collision Prediction section:
- Overall monthly performance is not bad (<10% MAPE and bias in most cases).
- However, the MA models have the best performance, which indicates that longer-term lagged data, differencing, and previous errors don't improve error rates. This is surprising, but suggests that the number of collisions per `area`/month is largely random within a certain range.
- Trends seem to vary by `area`. This should in theory be addressed by the ARIMA and Prophet methods, which fit a separate model per `area`.
- To improve these models, I would want to dig into one or two areas in depth and attempt to understand the trends. Getting data about the specific traffic patterns of an `area` would definitely help too. Additional data sources (like weather) could also be promising to explore.
Conclusion
This concludes my analysis on Los Angeles collision data! Feel free to get in touch if you have other approaches to these questions.
You can find all of the code I used in the GitHub repository below:
If you enjoyed this post, check out some of my other work below!
Thanks for reading!