What I Learned From Analyzing and Visualizing Traffic Accidents Data

Susan Li
Towards Data Science
Oct 4, 2017


Source: Power BI

Overview

The National Highway Traffic Safety Administration (NHTSA) makes some really interesting data available to the public. I downloaded several datasets that contain information on fatal motor vehicle crashes and fatalities from 1994 to 2015. The purpose of this analysis is to explore and gain a better understanding of some of the factors that affect the likelihood of vehicle crashes.

The analysis and visualization are done in R. R is awesome, as you will come to find out.

Data

Load the libraries

I’ll be using the following libraries for the analysis and visualization. To keep the post concise, I don’t show the code for most of the data cleaning and analysis steps, but as with all of my posts, the code can be found on GitHub.

library(XML)
library(RCurl)
library(rvest)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggthemes)
library(reshape)
library(treemap)

Traffic fatalities in the United States have been trending downwards. Notably, fatalities in 2014 (less than 33,000) were far lower than the peak in 2005 (more than 43,000).

ggplot(aes(x=Year, y=Val), data = df_long_total) + geom_line(size = 2.5, alpha = 0.7, color = "mediumseagreen", group=1) + 
geom_point(size = 0.5) +
ggtitle('Total Number of Accidents and Fatalities in the US 1994 - 2015') +
ylab('count') +
xlab('Year') +
theme_economist_white()
Figure 1

The figures above do not take into account the ever-increasing number of cars on the road. Americans are driving more than ever before.

ggplot(aes(x=Year, y=Val), data = df_long_travel) + geom_line(size = 2.5, alpha = 0.7, color = "mediumseagreen", group=1) + 
geom_point(size = 0.5) +
ggtitle('Total Vehicle Miles Traveled 1994 - 2015') +
ylab('Billion Miles') +
xlab('Year') +
theme_economist_white()
Figure 2
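
One way to combine Figure 1 and Figure 2 is to express fatalities as a rate per 100 million vehicle miles traveled. The sketch below is not part of the original analysis: the helper name `fatality_rate` is hypothetical, and the input numbers are made-up placeholders, not NHTSA figures.

```r
# Hypothetical helper: fatalities per 100 million vehicle miles traveled.
# `billion_miles` is expressed in billions of miles, like the y-axis of Figure 2.
fatality_rate <- function(fatalities, billion_miles) {
  # 1 billion miles equals 10 units of 100 million miles
  fatalities / (billion_miles * 10)
}

# Placeholder inputs for illustration only (not real NHTSA values)
toy <- data.frame(
  Year = c("Year A", "Year B"),
  fatalities = c(40000, 30000),
  billion_miles = c(2500, 3000)
)

# Add the rate as a new column and show the result
toy$rate <- fatality_rate(toy$fatalities, toy$billion_miles)
print(toy)
```

With a rate like this, a year with fewer deaths and more miles driven stands out more clearly than it does in the raw counts.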

2015 Traffic Fatalities by the State and Percent Change from 2014

# Keep the columns of interest, then sort by 2015 fatalities, highest first
state <- state[c('State', '2015', '2014', 'Percent.Change')]
newdata <- state[order(-state$`2015`),]
newdata
  • Texas led the U.S. with the most traffic fatalities in both 2014 and 2015.
  • Understandably, the states that have the fewest traffic fatalities are also among those that have the fewest residents, including the District of Columbia, followed by Rhode Island and Vermont.
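
For readers who want to rebuild a column like `Percent.Change` themselves, here is a minimal sketch. It assumes the column is the usual relative change from 2014 to 2015; the helper `percent_change` and the values in `toy_state` are hypothetical and are not taken from the NHTSA table.

```r
# Hypothetical helper: percent change from an earlier count to a later count
percent_change <- function(later, earlier) {
  (later - earlier) / earlier * 100
}

# Placeholder table with the same column layout as `state` (values are made up)
toy_state <- data.frame(
  State = c("State A", "State B", "State C"),
  `2015` = c(120, 400, 80),
  `2014` = c(100, 500, 80),
  check.names = FALSE  # keep the year column names as they are
)

toy_state$Percent.Change <- percent_change(toy_state$`2015`, toy_state$`2014`)

# Same ordering as in the post: most fatalities in 2015 first
toy_state[order(-toy_state$`2015`), ]
```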

Nationwide, more than twice as many males as females were killed in motor vehicle crashes every year.

ggplot(aes(x = year, y=count, fill=killed), data=kill_full) +
geom_bar(stat = 'identity', position = position_dodge()) +
xlab('Year') +
ylab('Killed') +
ggtitle('Number of Persons Killed in Traffic Accidents by Gender 1994 - 2015') + theme_economist_white()
Figure 3
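
The "more than double" claim can be checked directly by dividing male counts by female counts for each year. This is only a sketch: `toy_kill` mimics the long layout of `kill_full`, but the labels "Male" and "Female" and all the counts are assumptions made for illustration.

```r
# Placeholder data in the same long layout as `kill_full` (values are made up)
toy_kill <- data.frame(
  year   = c(2014, 2014, 2015, 2015),
  killed = c("Male", "Female", "Male", "Female"),
  count  = c(210, 100, 250, 110)
)

# Sum counts into a year-by-gender matrix, then divide the two columns
by_gender <- tapply(toy_kill$count, list(toy_kill$year, toy_kill$killed), sum)
ratio <- by_gender[, "Male"] / by_gender[, "Female"]
ratio

# TRUE only if male fatalities were more than double female fatalities in every year
all(ratio > 2)
```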

The age group of 25 to 34 had the highest number of fatalities.

age_full$age <- ordered(age_full$age, levels = c('< 5', '5 -- 9', '10 -- 15', '16 -- 20', '21 -- 24', '25 -- 34', '35 -- 44', '45 -- 54', '55 -- 64', '65 -- 74', '> 74'))
ggplot(aes(x = age, y=count), data =age_full) + geom_bar(stat = 'identity') +
xlab('Age') +
ylab('Number of Killed') +
ggtitle('Fatalities Distribution by Age Group 1994 - 2015') + theme_economist_white()
Figure 4

From 2005 to 2015, fatalities increased in only two age groups: 55 to 64 and 65 to 74. The 16 to 20 and 35 to 44 age groups had the largest decreases in fatalities.

ggplot(age_full, aes(x = year, y = count, colour = age)) + 
geom_line() +
geom_point() +
facet_wrap(~age) + xlab('Year') +
ggtitle('Traffic Fatalities by Age 1994 - 2015') +
theme(legend.position="none")
Figure 5
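
A comparison like the 2005 to 2015 one above can be computed by putting the two years side by side for each age group. The sketch below assumes a long table with `year`, `age`, and `count` columns like `age_full`; the counts in `toy_age` are made-up placeholders.

```r
# Placeholder data in the same long layout as `age_full` (values are made up)
toy_age <- data.frame(
  year  = c(2005, 2015, 2005, 2015),
  age   = c("16 -- 20", "16 -- 20", "55 -- 64", "55 -- 64"),
  count = c(200, 120, 100, 130)
)

# Split out the two years, then join them by age group
start <- toy_age[toy_age$year == 2005, c("age", "count")]
end   <- toy_age[toy_age$year == 2015, c("age", "count")]
change <- merge(start, end, by = "age", suffixes = c("_2005", "_2015"))

# Negative values are decreases, positive values are increases
change$diff <- change$count_2015 - change$count_2005
change[order(change$diff), ]
```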

From this treemap, we see that 3pm to 5:59pm and 6pm to 8:59pm had the most fatalities. Let’s dive deeper.

treemap(kill_by_hour_group, index=c("hours","variable"), vSize="sum_hour", type="index",
fontsize.labels=c(15,12), title='Fatalities by time of the day',
fontcolor.labels=c("white","orange"), fontface.labels=c(2,1), bg.labels=c("transparent"),
align.labels=list(c("center", "center"), c("right", "bottom")),
overlap.labels=0.5, inflate.labels=FALSE)
Figure 6

The most fatalities occurred between midnight and 2:59am on Saturdays and Sundays. Let’s dive even deeper to find out why.

ggplot(aes(x = variable, y = sum_hour, fill = hours), data = kill_by_hour_group) +
geom_bar(stat = 'identity', position = position_dodge()) +
xlab('Hours') +
ylab('Total Fatalities') +
ggtitle('Fatalities Distribution by Time of the Day and Day of the week 1994-2015') + theme_economist_white()
Figure 7
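
The treemap and the bar chart answer slightly different questions: one ranks time slots by their total across all days, the other looks for the single worst slot on a given day. A small sketch of both lookups follows, assuming the same `hours`, `variable`, and `sum_hour` columns as `kill_by_hour_group`; the rows in `toy_hours` are made-up placeholders.

```r
# Placeholder data in the same layout as `kill_by_hour_group` (values are made up)
toy_hours <- data.frame(
  hours    = c("Midnight to 2:59am", "Midnight to 2:59am",
               "3pm to 5:59pm", "3pm to 5:59pm"),
  variable = c("Saturday", "Monday", "Saturday", "Monday"),
  sum_hour = c(90, 20, 60, 55)
)

# Single worst combination of time slot and day of the week
top_cell <- toy_hours[which.max(toy_hours$sum_hour), ]
top_cell

# Total per time slot across all days, which is what the treemap areas reflect
totals <- aggregate(sum_hour ~ hours, data = toy_hours, FUN = sum)
totals[which.max(totals$sum_hour), ]
```

In this toy table the afternoon slot has the largest total, while the single largest cell is the late-night slot on Saturday, which is the same kind of pattern the two figures show.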

Midnight to 2:59am on Saturdays and Sundays is when many people leave the bars. How many more times do we have to say it? Don’t drink and drive.

ggplot(aes(x = year, y = count, fill = hour), data = pair_all) +
geom_bar(stat = 'identity', position = position_dodge()) +
xlab('Year') +
ylab('Number of Fatalities') +
ggtitle('Fatal Crashes caused by Alcohol-Impaired Driving, by Time of Day 1994-2015') + theme_economist_white()
Figure 8

The percentage of alcohol-impaired driving fatalities has actually been flat over the past 10 years.

ggplot(aes(x = year, y = mean, color = bac), data = al_all_by_bac) +
geom_jitter(alpha = 0.05) +
geom_smooth(method = 'loess') +
xlab('Year') +
ylab('Percentage of Killed') +
ggtitle('Fatalities and Blood Alcohol Concentration of Drivers 1994-2015') + theme_economist_white()
Figure 9

Your Turn

NHTSA provides a rich data source for information on traffic fatalities. There are hundreds of ways to analyze it, and the best one really depends on the data and the questions you are trying to answer. Our job is to tell a story backed up by the data. What type of vehicle is more likely to be involved in a crash? Where is the safest seat in a vehicle? So, come up with your own story, and let me know what you find in the data!

Data does not inspire people, stories do.
