USA Accidents Data Analysis

The data of countrywide traffic accidents from February 2016 to March 2019 is analyzed.

Shubhankar Rawat

Published in Towards Data Science, Feb 21, 2020

INTRODUCTION

Road accidents have become very common these days. Nearly 1.25 million people die in road crashes each year, an average of 3,287 deaths a day. Moreover, 20–50 million people are injured or disabled annually. Road traffic crashes rank as the 9th leading cause of death and account for 2.2% of all deaths globally. Road crashes cost USD 518 billion globally, costing individual countries 1–2% of their annual GDP.
In the USA, over 37,000 people die in road crashes each year, and 2.35 million are injured or disabled. Road crashes cost the U.S. $230.6 billion per year or an average of $820 per person. Road crashes are the single greatest annual cause of death of healthy U.S. citizens travelling abroad.
(Source)

Given the severity of road accidents, I decided to analyze the accident data to see what it reveals. Here are my results.

DATA

The dataset is taken from Kaggle. You can find it here.
This is a countrywide traffic accident dataset, which covers 49 states of the United States. The data was collected continuously from February 2016 to March 2019, using several data providers, including two APIs which provide streaming traffic event data. These APIs broadcast traffic events captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks.

The dataset contains 2,243,939 (about 2.24 million) rows and 49 columns, which makes it quite large. Note that even though the dataset covers only about three years, it already records 2.24 million accidents.

Feature Description

As discussed earlier, the dataset contains 49 features, and the following is their description.

THE APPROACH

Before getting into the analysis part, let’s look at the null values present in the dataset.

The figure below shows only those features which have null values.
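The null-value check can be sketched with pandas as follows. This is a minimal illustration on a made-up four-row table, not the actual code behind the figure; the values are hypothetical and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the accidents table; all values are made up for illustration.
df = pd.DataFrame({
    "Severity": [2, 3, 2, 2],
    "TMC": [201.0, np.nan, 201.0, 241.0],
    "Visibility(mi)": [np.nan, 10.0, np.nan, np.nan],
})

# Count missing values per feature.
null_counts = df.isnull().sum()

# Keep only the features that actually have missing values, largest first.
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

# Express the counts as a percentage of all rows.
null_share = (null_counts / len(df) * 100).round(2)

print(pd.DataFrame({"nulls": null_counts, "percent": null_share}))
```

On the full dataset, the same two steps (count, then filter out the zero-count features) would give the data needed for a bar chart of features with nulls.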

Exploratory Data Analysis

I will start by eliminating the unnecessary features first.

The Country feature contains only one value, USA, which is expected since the dataset covers only the USA. Hence, I will drop this feature.

The Turning_Loop feature also contains a single value: False. This means that there was no turning loop in the vicinity of any of the accidents. As this feature includes only one value, I’ll drop it as well.
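One way to find and drop such single-valued features is sketched below. The three-row frame is a hypothetical stand-in, and detecting constant columns automatically (rather than naming them by hand) is my assumption about how one might do it.

```python
import pandas as pd

# Toy stand-in; values are illustrative only.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA"],
    "Turning_Loop": [False, False, False],
    "Severity": [2, 3, 2],
})

# A column with exactly one distinct value carries no information.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
print(constant_cols)

# Drop those columns and keep the rest.
df = df.drop(columns=constant_cols)
print(df.columns.tolist())
```

Passing `dropna=False` means a column that mixes one value with missing entries is not treated as constant.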

Let’s look at the Source feature. It represents the API that reported the accident.

Number of accidents reported by each source

Only three API sources reported the accidents. Most of the accidents (around 1,700,000) were reported by MapQuest, followed by Bing.
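Counts like these come from a simple frequency table. The sketch below uses a six-entry toy series with made-up proportions; the same pattern applies to any categorical feature in the dataset.

```python
import pandas as pd

# Toy sample; the real column has roughly 2.24 million entries.
source = pd.Series(
    ["MapQuest", "MapQuest", "Bing", "MapQuest", "Bing", "MapQuest"],
    name="Source",
)

counts = source.value_counts()               # absolute counts, largest first
share = source.value_counts(normalize=True)  # fraction of all rows

print(counts)
print((share * 100).round(1))
```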

There are nine standard time zones in the US officially defined by federal law. The 50 states and the District of Columbia span six main time zones.

Out of these six main time zones, only four are present in the dataset (see the figure below).

Most accidents took place in regions in the Eastern Standard Time zone, followed by Pacific Standard Time.

The dataset includes an interesting feature: TMC.
A quick Google search gives the following:

Traffic Message Channel (TMC) is a technology for delivering traffic and travel information to motor vehicle drivers. It is digitally coded using the ALERT C or TPEG protocol into RDS Type 8A groups carried via conventional FM radio broadcasts. It can also be transmitted on Digital Audio Broadcasting or satellite radio. TMC allows silent delivery of dynamic information suitable for reproduction or display in the user’s language without interrupting audio broadcast services.

Now, let us plot the number of accidents with respect to the TMC feature.

The plot shows that the most common TMC code is 201. You can refer to the TMC code list for a better understanding.

The most interesting feature is Severity. It represents the severity of an accident.

The plot shows that most accidents had a severity of 2 (average), followed by 3 (above average), which is unfortunate. There are hardly any accidents with very low severity (0 and 1).

Let’s look at the Start_Time and End_Time features:

The Start_Time and End_Time features give the start and end time of an accident. To gain a better understanding, I computed the duration of each accident as the difference between the two.

It is interesting to see that the duration of accidents varies from minutes to years.
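The duration computation can be sketched as below. The three timestamps are hypothetical and chosen only to produce a short, a medium, and a long duration; the `Duration` and `Duration_min` column names are mine, not the dataset's.

```python
import pandas as pd

# Toy rows with made-up timestamps.
df = pd.DataFrame({
    "Start_Time": ["2016-02-08 05:46:00", "2016-02-08 06:07:59", "2017-01-01 00:00:00"],
    "End_Time":   ["2016-02-08 06:15:00", "2016-02-08 12:07:59", "2017-01-03 00:00:00"],
})

# Parse the strings into datetimes so they can be subtracted.
df["Start_Time"] = pd.to_datetime(df["Start_Time"])
df["End_Time"] = pd.to_datetime(df["End_Time"])

# Duration as a timedelta, plus a plain number of minutes for plotting.
df["Duration"] = df["End_Time"] - df["Start_Time"]
df["Duration_min"] = df["Duration"].dt.total_seconds() / 60

print(df[["Duration", "Duration_min"]])
```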

Number of accidents vs duration for top 10 values

The above plot is less informative than the one that follows, but I wanted to check the most common durations. Around 950,000 (43%) accidents had a duration of 29 minutes, followed by 6 hours.

Note that the dataset description says that Start_Time and End_Time represent the starting and ending time of the accident. The duration (the difference between Start_Time and End_Time) comes out to hours, or even months and years, for some accidents, but it is implausible that an accident itself lasted for days or years. The two features may also include the repair time, but not much can be concluded.

A more informative way to look at the durations is to group them by unit.

Number of accidents for each duration unit

The plot shows that about 77% of the accidents have a duration in minutes, whereas about 22% have a duration in hours. Only 975 accidents have a duration in days and 52 in years. This means that long-duration accidents are rare in this dataset, while short ones are far more frequent.
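Grouping durations by unit can be sketched with `pd.cut`. The bin edges below (1 hour, 1 day, 30 days, 365 days) are my assumption about how the units are separated, and the five durations are made up.

```python
import pandas as pd

# Toy durations; values are illustrative only.
durations = pd.Series(
    pd.to_timedelta(["29 minutes", "45 minutes", "6 hours", "3 days", "400 days"])
)
seconds = durations.dt.total_seconds()

# Assumed unit boundaries, in seconds: 1 hour, 1 day, 30 days, 365 days.
day = 86400
bins = [0, 3600, day, 30 * day, 365 * day, float("inf")]
labels = ["minutes", "hours", "days", "months", "years"]

# right=False makes each bin include its lower edge, e.g. exactly 1 hour -> "hours".
unit = pd.cut(seconds, bins=bins, labels=labels, right=False)
unit_counts = unit.value_counts()

print(unit_counts)
```

Negative durations, if any exist, fall outside the bins and come out as missing values, which makes them easy to inspect separately.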

Now, let’s see the trend of accidents for each severity over time.

Number of accidents vs time for each severity

It can be observed that the number of accidents has increased over time for each severity. This is alarming and requires serious action. Even though there was a decrease in the number of accidents in 2017, around week 12, the number of accidents increased after that. Accidents with severity = 2 are more frequent and have increased the most, followed by accidents with severity = 3.
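A weekly count per severity level can be built as sketched below. The weekly frequency is an assumption based on the "week 12" remark, and the five rows are hypothetical.

```python
import pandas as pd

# Toy rows with made-up timestamps and severities.
df = pd.DataFrame({
    "Start_Time": pd.to_datetime([
        "2017-03-20 08:00", "2017-03-21 09:30", "2017-03-22 17:45",
        "2017-03-28 07:15", "2017-03-29 18:00",
    ]),
    "Severity": [2, 2, 3, 2, 3],
})

# Group by calendar week (weeks end on Sunday) and by severity,
# then pivot severities into columns so each one becomes a line in a plot.
weekly = (
    df.groupby([pd.Grouper(key="Start_Time", freq="W"), "Severity"])
      .size()
      .unstack(fill_value=0)
)

print(weekly)
```

Calling `weekly.plot()` on the result would draw one line per severity level, provided matplotlib is installed.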

The Start_Lat and Start_Lng features are interesting since they can be plotted on a map, to get the exact location of the accident. First, I will draw a scatterplot between the two.

The scatterplot looks nice, but at the same time, it is alarming that almost every corner of the USA is covered, meaning that the accidents occurred over a large number of locations over the past few years.

To get a clearer picture, I plotted the accident sites on a map of the USA for each severity, using the coordinates given in the dataset.

The above plot looks messy! So let’s break it down.

I have plotted the locations for each severity individually.

From the above plots, we can conclude that most accidents occurred in the eastern and western parts of the USA, which is consistent with the earlier observation that most accidents took place in the Eastern Standard Time and Pacific Standard Time zones.

Now, let’s look at the Distance(mi) feature. This feature gives the length (in miles) of the road extent affected by the accident.

The above plot shows that most accidents affect only a short stretch of road.
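A quick numeric check of this kind of claim is sketched below. The eight distances are made up, and the 0.01-mile threshold for "short" is a hypothetical choice, not something taken from the dataset.

```python
import pandas as pd

# Toy distances in miles; values are illustrative only.
distance = pd.Series(
    [0.0, 0.01, 0.0, 0.5, 0.01, 3.2, 0.0, 0.01], name="Distance(mi)"
)

# Median and 90th percentile summarize a heavily skewed distribution
# better than the mean does.
summary = distance.quantile([0.5, 0.9])

# Share of accidents affecting at most 0.01 miles of road (assumed threshold).
share_short = (distance <= 0.01).mean()

print(summary)
print(f"{share_short:.0%} of accidents affect at most 0.01 mi")
```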

Now let’s look at the most accident-prone cities in the USA.

Top 10 cities in terms of number of accidents

We see that most of the accidents occur in Houston, followed by Charlotte and Los Angeles.

Let’s look at the most accident-prone states of the USA.

Top 10 states in terms of number of accidents

The list of states of the USA with their code is given here.

The plot shows that California (CA) has the most accidents, followed by Texas (TX) and Florida (FL). It is interesting to see that the number of accidents in California is almost twice that in Texas.

The most accident-prone city in the USA is Houston, which is in Texas, followed by Charlotte (in North Carolina, which ranks fourth among states) and Los Angeles (California).
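Ranking cities together with their states can be sketched as below. Grouping on both columns is my suggestion, since city names can repeat across states; the column names `City` and `State` are assumed to match the dataset, and the six rows are made up.

```python
import pandas as pd

# Toy rows; counts are illustrative only.
df = pd.DataFrame({
    "City":  ["Houston", "Houston", "Charlotte", "Los Angeles", "Houston", "Charlotte"],
    "State": ["TX", "TX", "NC", "CA", "TX", "NC"],
})

# Count accidents per (city, state) pair and keep the ten largest.
top_cities = df.groupby(["City", "State"]).size().nlargest(10)

# The state ranking is a plain frequency table.
top_states = df["State"].value_counts().head(10)

print(top_cities)
print(top_states)
```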

Let’s go even deeper and plot the number of accidents with respect to zipcode.

Most accidents occurred in the zip code 91706, followed by 91765 and 91761. Refer to this link for more information about zip codes.

Now let’s look at the Visibility(mi) feature. It denotes the visibility in miles.
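A sketch of the zip code count follows. It assumes the column is named `Zipcode` and that some entries may use the extended ZIP+4 form (for example `91706-1234`), in which case they are truncated to five digits before counting. The sample values are hypothetical.

```python
import pandas as pd

# Toy values, including ZIP+4 entries and a missing one (all assumed).
zipcode = pd.Series(
    ["91706", "91706-1234", "91765", "91706", None, "91761-0001"],
    name="Zipcode",
)

# Keep the first five characters so "91706-1234" counts as "91706".
zip5 = zipcode.str.slice(0, 5)

# value_counts ignores missing values by default.
top_zip = zip5.value_counts().head(10)

print(top_zip)
```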

The plot shows that most of the accidents occurred when visibility was high, which suggests that low visibility is not the main driver of accidents in this dataset. This is not surprising, since low visibility is only one of many contributing factors.

Now let’s see the weather conditions during the accidents.

Number of accidents vs Weather_Condition

The plot shows that the weather was clear for most of the accidents, followed by overcast and mostly cloudy. Overcast and mostly cloudy skies could plausibly contribute to accidents, but clear weather could not, which suggests that weather conditions do not play an important role either.

Let’s look at the Amenity feature. This feature indicates the presence of an amenity in a nearby location.

We see that for almost all (98.84%) accidents, there was no amenity nearby, which is unfortunate.

Now let’s look at the Bump feature. This feature indicates the presence of a speed bump or hump in a nearby location.

We see that 99.99% of the accidents occurred with no speed bump nearby.
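Percentages like these follow directly from the boolean columns, since the mean of a True/False column is the fraction of True values. The four rows below are made up.

```python
import pandas as pd

# Toy flags; values are illustrative only.
df = pd.DataFrame({
    "Amenity": [False, False, True, False],
    "Bump": [False, False, False, False],
})

# Mean of a boolean column = fraction of rows where the flag is True.
pct_true = df[["Amenity", "Bump"]].mean() * 100
pct_false = 100 - pct_true

print(pct_false.round(2))
```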

With this, we come to the end of this analysis.

Conclusion

The USA accidents dataset, taken from Kaggle, was analyzed, and results were discussed above.

We learned a lot of interesting things: which cities and states witnessed the most accidents in the USA, where the accidents occurred on a map, and how they are distributed by severity.

I hope you learned something and enjoyed the article!