Car Crashes and the Weather: An Exploratory Analysis of Environmental Conditions’ Impact on Traffic Accident Rates

Published in Towards Data Science, Sep 21, 2018

It would seem to be a fair assumption that driving in poor weather conditions and low light increases the risk of accidents, but my analysis of UK traffic incident data showed some surprising results.

This post is adapted from a more detailed report, which can be found on GitHub, along with the code and data used in the analysis.

Each year in the UK there are over 130,000 traffic collisions involving injury which are reported to the police, resulting in between 1,700 and 1,800 deaths [1]. Data on these incidents [2] is published annually by the Department for Transport, and has been recorded and published in a similar format since 1979, although in this work I looked only at 2016, the most recent year for which results were available.

A report is published alongside the data [1], which gives some interesting insights into annual trends in accident and fatality rates, and breaks down the data by specific road user type (e.g. car drivers, cyclists, pedestrians). It also considers the impact of possibly related factors, such as the weather, drink-driving rates, and even GDP, on incident rates. However, the report has very little detail about regional trends, patterns across the year or across the day, or whether there are relationships between the physical environment and the rate of crashes.

I wanted to investigate these points by digging into the data to explore the pattern of accident conditions across the country, and to see if I could uncover any relationships between environmental conditions and the rate of accidents.

The Data

The data consists of three main tables:

Accidents: 32 variables, detailing the location, time, date, lighting, weather, and road surface conditions, number of casualties, road type and other variables. Each observation represents one of 136,621 collisions involving injury reported to the police in 2016.
Casualties: Linked via ‘Accident Index’ to the Accidents table, the table has 16 columns, giving further detail on the casualties involved. There are 181,384 rows, each representing a single person injured in a collision.
Vehicles: This table gives details of the vehicles involved in collisions but is not used in this investigation.

I also obtained government estimates for distance travelled on different types of road in the UK in 2016 [3], the relative density of road traffic for each hour of each day of the week in 2016 [3], and the UK population, at Local Authority (LA) level, for 2016 [4]. A shapefile provided by the Office for National Statistics giving the geographical boundaries of each LA was used for mapping purposes [5].

Data Preparation

The categorical variables in the data are stored as numeric codes, with a separate spreadsheet detailing the meaning of each code for each variable. To translate the data into a readable format, I loaded it, along with the data dictionary, into pandas DataFrames and translated the codes to their values. The result was exported to a new CSV file, which was then used for all remaining work.
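The decoding step could look roughly like the sketch below. It is an illustration rather than the code from the report: the column names, codes, and labels are made-up stand-ins for the real tables and data dictionary.

```python
import pandas as pd

# Hypothetical stand-in for the raw accidents table (codes are illustrative only).
accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3"],
    "Light_Conditions": [1, 4, 1],
    "Weather_Conditions": [1, 2, 9],
})

# Hypothetical stand-in for the data dictionary: one row per (variable, code) pair.
data_dictionary = pd.DataFrame({
    "variable": ["Light_Conditions", "Light_Conditions",
                 "Weather_Conditions", "Weather_Conditions", "Weather_Conditions"],
    "code": [1, 4, 1, 2, 9],
    "label": ["Daylight", "Darkness - lights lit",
              "Fine no high winds", "Raining no high winds", "Unknown"],
})

decoded = accidents.copy()
for variable, lookup in data_dictionary.groupby("variable"):
    # Build a code -> label mapping for this variable and apply it to the column.
    mapping = dict(zip(lookup["code"], lookup["label"]))
    decoded[variable] = decoded[variable].map(mapping)

# decoded.to_csv("accidents_decoded.csv", index=False)  # hypothetical output path
```

Keeping the dictionary in long form (one row per variable and code) means a single loop handles every categorical column.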

On initial viewing the data seemed to be almost totally complete, but upon closer inspection it turned out that some of the values were equivalent to missing data (e.g. ‘Unknown’). There were 3,999 records with data missing from key variables. These represented just under 3% of the data, so I decided that they would have little impact on the overall results and removed them.
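A minimal sketch of this filtering step is shown below. The column names and the labels treated as missing are assumptions for illustration, and the four rows are invented.

```python
import pandas as pd

# Invented records; the real Accidents table has 32 variables.
decoded = pd.DataFrame({
    "Light_Conditions": ["Daylight", "Daylight", "Darkness - lights lit", "Daylight"],
    "Weather_Conditions": ["Fine no high winds", "Unknown",
                           "Raining no high winds", "Fine no high winds"],
})

key_columns = ["Light_Conditions", "Weather_Conditions"]       # assumed names
missing_labels = ["Unknown", "Data missing or out of range"]   # assumed labels

# A record counts as incomplete if any key column holds a missing-equivalent label.
is_missing = decoded[key_columns].isin(missing_labels).any(axis=1)

share_removed = is_missing.mean()   # fraction of records that will be dropped
cleaned = decoded[~is_missing]
```

Checking `share_removed` before dropping anything is a quick way to confirm that the loss is small enough to ignore.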

To investigate incident rates by road type I needed to integrate estimates for distance travelled on different types of road [3]. The two data sets use different classifications for road type, so I translated the more granular road types in the main data set into those used in the distance travelled estimates.

The National Level

First I chose to examine the lighting conditions, road surface conditions, and weather conditions in which each accident occurred, and their timing.

Lighting, Road and Weather Conditions

All of these variables showed a similar distribution (see Figure 1), with one category accounting for 70–85% of observations, a secondary category making up 10–25%, and the remainder covered by uncommon conditions. This is unsurprising as, although data on weather conditions were not available, we know that in the UK it is generally dry and clear (contrary to stereotype!), but sometimes rainy, and more driving is done during daylight hours.

Fig 1: Occurrence of Environmental Conditions

Distribution Across Time

The next thing I wanted to look for was a seasonal pattern. Grouping incidents by date produced the plot shown in Figure 2. Although the variance is slightly higher in winter months, there is very little change in the overall rate throughout the year (less than 5% standard deviation in monthly mean), as indicated by the dashed regression line. Taking into account the change in traffic volume by month, the standard deviation is still only 8%.
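The seasonal check can be sketched as follows. The daily counts here are random synthetic values, not the real data, and I am assuming that the percentage figure means the standard deviation of the monthly means expressed as a share of their average.

```python
import numpy as np
import pandas as pd

# Synthetic daily accident counts for 2016 (random values, not the real data).
rng = np.random.default_rng(0)
dates = pd.date_range("2016-01-01", "2016-12-31", freq="D")
daily_counts = pd.Series(rng.poisson(370, size=len(dates)), index=dates)

# Mean daily count for each calendar month.
monthly_mean = daily_counts.groupby(daily_counts.index.month).mean()

# Spread of the monthly means relative to their average (assumed definition).
relative_sd = monthly_mean.std() / monthly_mean.mean()

# Straight-line trend across the year, comparable to a dashed regression line.
day_number = np.arange(len(daily_counts))
slope, intercept = np.polyfit(day_number, daily_counts.to_numpy(), 1)
```

A slope close to zero together with a small `relative_sd` is what "very little change throughout the year" looks like numerically.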

One point of interest is the three outliers in late December. These are the three days of the year with the lowest number of crashes: Christmas Day, Boxing Day, and New Year’s Eve.

Fig 2: Traffic accidents throughout the year. Outliers (>2sd from mean) are highlighted in red.
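The outlier rule from the caption (more than two standard deviations from the mean) can be sketched like this. The eight daily counts are invented for illustration.

```python
import pandas as pd

# Invented daily counts for eight days in late December.
daily_counts = pd.Series(
    [370, 365, 380, 372, 368, 375, 150, 371],
    index=pd.date_range("2016-12-19", periods=8, freq="D"),
)

# Distance of each day from the mean, measured in standard deviations.
z_scores = (daily_counts - daily_counts.mean()) / daily_counts.std()

# Keep only the days lying more than 2 standard deviations from the mean.
outliers = daily_counts[z_scores.abs() > 2]
```

In this toy series only the day with 150 accidents is flagged.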

Grouping incidents by time of day showed unsurprising peaks during the morning and evening rush hours, and a far lower rate during the night than the day. Dividing the raw figures by the relative traffic volume for each hour showed a very different pattern, with a far higher relative incident rate at night than during the day, and less of an impact from rush hours.

However, it seems unlikely that this change is as a result of lighting conditions, as there is no significant difference between the patterns for December and June, even though daylight hours are very different in these months.

Fig 3: Distribution of accidents throughout the day
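The traffic-volume adjustment behind the hourly pattern can be sketched as below. All figures are invented, and the relative traffic index is an assumed stand-in for the published hourly traffic density.

```python
import pandas as pd

# Invented figures: accident counts and relative traffic volume for four hours.
hourly = pd.DataFrame({
    "hour": [3, 8, 13, 17],
    "accidents": [40, 300, 250, 400],
    "relative_traffic": [10, 150, 125, 200],   # assumed index, arbitrary units
}).set_index("hour")

# Accidents per unit of traffic, so quiet and busy hours can be compared fairly.
hourly["relative_rate"] = hourly["accidents"] / hourly["relative_traffic"]

busiest_hour = hourly["accidents"].idxmax()        # most accidents in raw terms
riskiest_hour = hourly["relative_rate"].idxmax()   # most accidents per unit of traffic
```

In this toy table the evening rush hour has the most accidents, but the small hours have the highest rate once traffic volume is taken into account.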

Comparing Local Authorities

To make comparisons between areas, I transformed the categorical variables ‘Lighting Conditions’, ‘Weather Conditions’, and ‘Surface Conditions’ into dummy variables and grouped by LA, giving a count of how often each value occurred for each LA.
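A minimal sketch of the dummy-variable step is shown below, using pandas `get_dummies`. The column names and the five rows are illustrative assumptions.

```python
import pandas as pd

# Invented accident records with an assumed Local Authority column.
accidents = pd.DataFrame({
    "Local_Authority": ["Leeds", "Leeds", "Leeds", "Bristol", "Bristol"],
    "Weather_Conditions": ["Fine no high winds", "Raining no high winds",
                           "Fine no high winds", "Fine no high winds",
                           "Fine no high winds"],
})

# One indicator column per category value.
dummies = pd.get_dummies(accidents["Weather_Conditions"], prefix="Weather")

# Summing the indicators within each LA gives a count per condition per LA.
counts = dummies.groupby(accidents["Local_Authority"]).sum()
```

The same pattern applies to the lighting and surface variables, with the resulting tables joined on the LA index.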

Population estimates were also incorporated at this stage, enabling calculation of LA accident rates per 1000 inhabitants. This is clearly an imperfect measure, as it assumes a relationship between the resident population of an area and those driving there, which is not necessarily valid, but it provides a good starting point for analysis. Road traffic volumes by LA would be very useful here, but they are only available at the national level.

One extreme outlier LA, the City of London, was identified and removed to avoid distorting results. Because it has a low resident population but is very busy, the City had an accident rate of 38.8 per 1000 inhabitants, over six times the rate of the next highest LA (Westminster, at 5.94). Westminster is probably affected by a similar bias, but to a lesser degree, and it did not noticeably affect the results.
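The per-capita rate and the removal of an extreme outlier can be sketched as follows. The area names, accident counts, and populations are invented, and the outlier is dropped by name after inspection rather than by an automatic rule.

```python
import pandas as pd

# Invented accident counts and resident populations for three areas.
la = pd.DataFrame(
    {"accidents": [400, 1500, 900],
     "population": [10_000, 250_000, 450_000]},
    index=["Area A", "Area B", "Area C"],
)

# Accidents per 1000 residents.
la["rate_per_1000"] = la["accidents"] / la["population"] * 1000

# How far the highest rate sits above the next highest.
ranked = la["rate_per_1000"].sort_values(ascending=False)
ratio_to_next = ranked.iloc[0] / ranked.iloc[1]

# Drop the area judged to be an extreme outlier (chosen by inspection here).
la_filtered = la.drop(index="Area A")
```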

Dummy variable counts were translated into proportions of the total accident count for each LA. In the absence of detailed weather data, this provided me with a proxy for the weather conditions which tend to be prevalent in each location.

To account for the skewed nature of these variables (see the Lighting, Road and Weather Conditions section), and to enable easier comparison between variables and between LAs, the distribution of these proportions across LAs was rescaled between 0 and 1.
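The two transformations above, counts to proportions and then rescaling to the 0 to 1 range, can be sketched as follows. The counts are invented, and the min-max formula is written out in pandas here; scikit-learn's MinMaxScaler performs the same arithmetic.

```python
import pandas as pd

# Invented surface-condition counts for three areas.
counts = pd.DataFrame(
    {"Dry": [80, 60, 90], "Wet or damp": [20, 40, 10]},
    index=["Area A", "Area B", "Area C"],
)

# Share of each area's accidents that occurred in each condition.
proportions = counts.div(counts.sum(axis=1), axis=0)

# Min-max rescaling, column by column: the lowest area becomes 0, the highest 1.
column_min = proportions.min()
column_range = proportions.max() - column_min
rescaled = (proportions - column_min) / column_range
```

After rescaling, a rare condition and a common one both span the full 0 to 1 range, which makes them easier to compare across LAs.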

This process produced 18 new columns. Plotting each of these against the accident, casualty, serious injury, and death rates, and calculating Pearson’s and Spearman’s rank scores, showed that most of these combinations did not have a strong correlation. Of the 72 comparisons made, the five giving an absolute correlation coefficient greater than 0.3 are listed in Table 1. Notably, none of the selected environmental conditions relate to lighting.

Table 1: Correlation between environmental conditions and accident or casualty rate
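The correlation screen can be sketched as below for a single pair of columns. The six data points are invented. Spearman's rank coefficient is computed here as Pearson's coefficient on the ranks, which is its definition and avoids an extra dependency.

```python
import pandas as pd

# Invented values: rescaled share of rainy-weather accidents and accident rate.
la = pd.DataFrame({
    "raining_share": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    "accident_rate": [3.1, 2.9, 2.4, 2.5, 2.0, 1.8],
})

# Pearson's r (the default method of Series.corr).
pearson = la["raining_share"].corr(la["accident_rate"])

# Spearman's rho: Pearson's r applied to the ranks of each variable.
spearman = la["raining_share"].rank().corr(la["accident_rate"].rank())

# Threshold used in the text for keeping a comparison.
is_notable = abs(pearson) > 0.3 or abs(spearman) > 0.3
```

Looping this over every condition column and every rate column would produce the full grid of comparisons.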

Having identified potential correlations, the next step was to build a regression model and see how this fitted across the country. A multiple regression assumes that the predictor variables are not strongly correlated with one another, but it is obvious that they will be in this case. The two Weather Conditions values and the two Road Surface Condition values are pairs originating from the same variables. In addition, a wet road surface is clearly linked to rain and a dry surface linked to fine weather. Figure 4 confirms that all four variables are strongly correlated.

Fig 4: Confirming collinearity of potential predictors

Because the candidate predictors were so strongly correlated, a simple univariate regression model could be used. ‘Raining no high winds’ was selected as it exhibited the strongest correlation to accident rate. Figure 5 shows their relationship. The predictions of the resulting polynomial regression model (shown in red) were then subtracted from the true accident rate of each local authority, and the residuals mapped in Figure 6.

Fig 5: Raining, no high winds against Accident Rate
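A minimal sketch of fitting a single-predictor polynomial model and taking residuals is shown below. The data points are invented and the polynomial degree of 2 is an assumption, since the degree used in the analysis is not stated here.

```python
import numpy as np

# Invented values: rescaled share of rainy-weather accidents and accident rate per LA.
raining_share = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
accident_rate = np.array([3.1, 2.9, 2.4, 2.5, 2.0, 1.8])

# Least-squares polynomial fit; degree 2 is an assumed choice.
coefficients = np.polyfit(raining_share, accident_rate, deg=2)
predicted_rate = np.polyval(coefficients, raining_share)

# Residual = true rate minus predicted rate, so positive values mean
# a higher accident rate than the model expects.
residuals = accident_rate - predicted_rate
```

Joining `residuals` back to the LA boundaries would give the values needed for a residual map.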

The model showed no obvious strong regional pattern, although higher than expected accident rates can be seen in the suburban areas around London and Birmingham.

Fig 6: A map of the UK showing the difference between true accident rate and that predicted by a linear regression model based on rainy, low wind weather conditions. High values indicate a higher than expected accident rate.

With weather and surface conditions having been identified as possible contributory factors in accident rates, I wanted to see if it was possible to identify groups of LAs with similar frequencies of these conditions. K-means clustering on rescaled road surface conditions did not provide good results. Clustering was also attempted with the variables rescaled to a uniform distribution, but this gave too much weight to the rarer conditions. Returning to the original proportions produced three clear clusters. The profile of the LA closest to the centroid of each cluster and the geographic distribution of clusters are shown in Figure 7.

Fig 7: Clustering on Surface Conditions. Left: Geographical distribution of clusters. Right: Parallel coordinates plot showing the profile of the LAs closest to the cluster centroids.
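The clustering step, including picking the LA closest to each centroid, can be sketched with scikit-learn's KMeans and pairwise_distances_argmin_min. The six area profiles are invented, and the `n_init` and `random_state` settings are assumptions made for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Invented surface-condition proportions (dry, wet, frost or ice) for six areas.
area_names = ["Area A", "Area B", "Area C", "Area D", "Area E", "Area F"]
profiles = np.array([
    [0.85, 0.14, 0.01],
    [0.86, 0.13, 0.01],
    [0.70, 0.29, 0.01],
    [0.72, 0.27, 0.01],
    [0.60, 0.30, 0.10],
    [0.58, 0.31, 0.11],
])

# Three clusters, matching the number found in the analysis.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(profiles)

# For each centroid, the index of the area whose profile lies nearest to it.
closest_index, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, profiles)
representative_areas = [area_names[i] for i in closest_index]
```

Using a real LA as the representative of each cluster, rather than the centroid itself, keeps the plotted profiles tied to actual places.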

Discussion of Results

The expectation going into this investigation was that driving in the dark and in poor weather is dangerous, and that this would be evidenced by a pattern of higher crash rates during the winter and in areas which experience generally colder and wetter conditions (the north and west of the country). This is not supported by the results. There is no significant change in the rate of traffic accidents throughout the year; the monthly crash rate varies very little, and December and June have been shown to have very similar hourly patterns of incidents. The analysis also found no connection between the proportion of accidents occurring in different lighting conditions and the rate or severity of those accidents.

A relationship was identified between weather/road surface conditions and accident rate, but it was the reverse of that expected. LAs with a higher proportion of accidents involving wet conditions tended to have a lower overall rate of accidents.

A possible reason for this could be that rather than causing more accidents, the poor weather reduces them, by discouraging driving, leading to fewer drivers, less congested roads and lower speeds. The relationship was not especially strong, but this could support the above hypothesis, as often a journey is required, regardless of the weather. Accurate daily weather records and traffic flow estimates at the LA level would allow greater investigation in this area.

It was possible to identify clusters of LAs with a similar profile of accident conditions (Figure 7) and these match expectations, following a clear geographic pattern of urban areas, the warmer, drier south and east of the country, and the colder, wetter north and west, along with some areas of higher altitude, such as The Pennines.

Although the results have been somewhat surprising, the objectives of the investigation were met, but there are still many areas of the data which could be investigated further, such as how junction layout or vehicle type relate to collision rates in different conditions.

Finally…

Thanks for reading! If you liked my work please feel free to leave some applause. If you want to learn more, the code I used is available on GitHub, along with the original report which accompanied the work, which is a bit more in depth than this post.

Tools

Aside from some minor formatting changes to prepare data for loading, which were done in Excel, all work was done using Python. NumPy and pandas were used for the majority of data manipulation. scikit-learn was used for normalising the data (quantile_transform and MinMaxScaler), for k-means clustering (KMeans), and for identifying the LA closest to the centre of each cluster (pairwise_distances_argmin_min). Matplotlib and Seaborn were used for graph-based visualisations and GeoPandas was used for geographic visualisations.

References

[1] D. for T. UK Government, “Reported road casualties Great Britain, annual report: 2016 — GOV.UK.” [Online]. Available: https://www.gov.uk/government/statistics/reported-road-casualties-great-britain-annual-report-2016. [Accessed: 30-Oct-2017].

[2] D. for T. UK Government, “Road Safety Data — Datasets,” 2017. [Online]. Available: https://data.gov.uk/dataset/road-accidents-safety-data. [Accessed: 30-Oct-2017].

[3] D. for T. UK Government, “GB Road Traffic Counts — Datasets.” [Online]. Available: https://data.gov.uk/dataset/gb-road-traffic-counts. [Accessed: 05-Nov-2017].

[4] O. for N. S. UK Government, “Population Estimates for UK, England and Wales, Scotland and Northern Ireland — Office for National Statistics.” [Online]. Available: https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationestimatesforukenglandandwalesscotlandandnorthernireland. [Accessed: 30-Nov-2017].

[5] O. for N. S. UK Government, “Local Authority Districts (December 2016) Full Clipped Boundaries in Great Britain — Datasets.” [Online]. Available: https://data.gov.uk/dataset/local-authority-districts-december-2016-full-clipped-boundaries-in-great-britain2. [Accessed: 30-Nov-2017].