Analysis of car accidents in Barcelona using Pandas, Matplotlib, and Folium

Published in Towards Data Science · 10 min read · Sep 1, 2019

Open Data Barcelona is Barcelona's data service, which contains around 400 datasets covering a wide range of topics such as population, business, and housing. The project was born in 2010 with the main objective of maximizing available public resources, allowing companies, citizens, researchers, and other public institutions to make use of the data generated.

In this article, we employ the dataset that contains the accidents managed by the local police in the city of Barcelona in 2017. This dataset includes information such as the number of injuries by severity, the number of vehicles involved, the date, and the geographic location of the accident.

Accidents managed by the local police in the city of Barcelona - Open Data Barcelona

List of accidents handled by the local police in the city of Barcelona. Incorporates the number of injuries by…

opendata-ajuntament.barcelona.cat

Since the dataset is in Catalan, we are going to use an English version available on Kaggle.

Barcelona data sets

www.kaggle.com

Exploratory data analysis and data cleaning

Exploratory data analysis consists of analyzing the main characteristics of a data set, usually by means of visualization methods and summary statistics. The objective is to understand the data, discover patterns and anomalies, and check assumptions before we perform further evaluations.

After downloading the csv file from Kaggle, we can load it into a Pandas dataframe using the pandas.read_csv function and visualize the first 5 rows using the pandas.DataFrame.head method.
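As a minimal sketch of this loading step, the snippet below reads a small in-memory CSV instead of the real download. The column names follow the article, but the three rows and the file name mentioned in the comment are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: the column names follow the
# article, but the three rows are invented values for illustration only.
sample_csv = io.StringIO(
    "Id,District Name,Neighborhood Name,Street,Weekday,Month,Day,Hour,"
    "Part of the day,Mild injuries,Serious injuries,Victims,"
    "Vehicles involved,Longitude,Latitude\n"
    "A001   ,Unknown,Unknown,Street A,Friday,October,13,8,Morning,2,0,2,2,2.125,41.340\n"
    "A002   ,Eixample,Unknown,Street B,Sunday,January,8,22,Night,0,1,1,1,2.160,41.390\n"
    "A003   ,Eixample,Unknown,Street C,Monday,May,1,14,Afternoon,1,0,1,3,2.170,41.400\n"
)

# With the real download, pass the file path instead of the buffer,
# e.g. pd.read_csv("accidents_2017.csv") (file name assumed).
df = pd.read_csv(sample_csv)

print(df.head())  # head() shows the first 5 rows by default
print(df.shape)   # (number of rows, number of columns)
```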

As we can observe, the dataframe contains 15 columns: Id, District Name, Neighborhood Name, Street, Weekday, Month, Day, Hour, Part of the day, Mild injuries, Serious injuries, Victims, Vehicles involved, Longitude, Latitude.

We can easily print a list with the column names using the pandas.DataFrame.columns attribute. Additionally, the pandas.DataFrame.info method provides information about a DataFrame, including column types, non-null counts, and memory usage.

Apparently, there are no null values, since all the columns have 10339 non-null entries. However, some entries contain the string Unknown. We can replace this string with NaN (not a number) and assess the number of null values again by using the pandas.DataFrame.info method.
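The replacement could look roughly like the sketch below. The toy data frame is invented for illustration; only the column names and the Unknown placeholder come from the article.

```python
import numpy as np
import pandas as pd

# Toy data (invented): two text columns that use "Unknown" as a placeholder.
df = pd.DataFrame({
    "District Name": ["Unknown", "Eixample", "Unknown"],
    "Neighborhood Name": ["Unknown", "la Dreta de l'Eixample", "Unknown"],
    "Victims": [2, 1, 1],
})

# Turn the placeholder string into a real missing value.
df = df.replace("Unknown", np.nan)

# Count missing values per column; df.info() reports the same information
# as non-null counts.
print(df.isna().sum())
df.info()
```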

As we can see, only the columns District Name and Neighborhood Name contain null values. Since we are not going to use these columns in our further analysis, we do not need to take the null values into consideration. We won’t analyze where the accidents happened by using the District Name and Neighborhood Name, but by using the Longitude and Latitude.

Before we start to draw conclusions using our data, we are going to clean it. The first cleaning step consists of dropping unnecessary columns to simplify the dataframe.
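A sketch of the dropping step follows. The article does not list which columns are removed at this point; the choice below (District Name, Neighborhood Name, Part of the day) is an assumption inferred from the columns that remain at the end of the cleaning, and the sample row is invented.

```python
import pandas as pd

# Toy one-row frame (invented values) with a subset of the original columns.
df = pd.DataFrame({
    "Id": ["A001   "],
    "District Name": ["Eixample"],
    "Neighborhood Name": ["la Dreta de l'Eixample"],
    "Part of the day": ["Morning"],
    "Victims": [2],
})

# Assumed list of columns that are not needed for the later analysis.
cols_to_drop = ["District Name", "Neighborhood Name", "Part of the day"]
df = df.drop(columns=cols_to_drop)

print(df.columns.tolist())
```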

After dropping the columns, we modify some data types. We can consult the data types by using the pandas.DataFrame.info method or the pandas.DataFrame.dtypes attribute. The latter returns a Series with the data type of each column.

As we can observe, Month, Day, and Hour are not datetime objects. We can easily combine those columns into a single column using the pandas.to_datetime function. Before using this function, we modify the column names, replacing spaces with underscores and upper case letters with lower case letters. Additionally, we include a column with the year.
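One possible way to normalize the column names and add the year is sketched below, using the string methods of the column index. The sample values are invented.

```python
import pandas as pd

# Toy frame (invented values) with original-style column names.
df = pd.DataFrame({
    "Mild injuries": [2, 0],
    "Vehicles involved": [2, 1],
    "Month": ["October", "January"],
})

# Replace spaces with underscores and lower-case every column name.
df.columns = df.columns.str.replace(" ", "_", regex=False).str.lower()

# All accidents in this dataset are from 2017.
df["year"] = 2017

print(df.columns.tolist())
```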

Now, we can combine month, day, hour, and year into one single column called date. To avoid a ValueError, we have to convert the month names into integers before using the pandas.to_datetime function.

After the conversion, we can obtain a datetime column using the aforementioned function as follows:
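A minimal sketch of both steps, assuming the month column holds English month names and the columns are called year, month, day, and hour. pandas.to_datetime can assemble a datetime from a data frame whose columns carry these names. The sample rows are invented.

```python
import pandas as pd

# Toy frame (invented values) after renaming the columns.
df = pd.DataFrame({
    "month": ["October", "January"],
    "day": [13, 8],
    "hour": [8, 22],
    "year": [2017, 2017],
})

# Map month names to the integers 1..12.
month_names = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
month_to_int = {name: number for number, name in enumerate(month_names, start=1)}
df["month"] = df["month"].map(month_to_int)

# Assemble a single datetime column from the year/month/day/hour columns.
df["date"] = pd.to_datetime(df[["year", "month", "day", "hour"]])

print(df.dtypes)
```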

We can easily check the data type of the date column by using the pandas.DataFrame.dtypes attribute.

To access individual elements of the date such as month, day, or hour, we can use the pandas.Series.dt attribute. We can even access the day of the week by using the pandas.Series.dt.dayofweek attribute, where Monday=0 and Sunday=6. 💪
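For illustration, here is how the accessor behaves on two invented timestamps:

```python
import pandas as pd

# Two invented timestamps: a Friday morning and a Sunday night in 2017.
dates = pd.Series(pd.to_datetime(["2017-10-13 08:00", "2017-01-08 22:00"]))

months = dates.dt.month        # month as an integer (1-12)
hours = dates.dt.hour          # hour of the day (0-23)
weekdays = dates.dt.dayofweek  # Monday=0 ... Sunday=6

print(months.tolist(), hours.tolist(), weekdays.tolist())
```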

Since we can access all the information we need related to the date of the accident using the pandas.Series.dt accessor, we can drop the columns month, year, hour, day, and weekday, as they are no longer needed.

Finally, we can drop the street column as well, since we are going to visualize where the accidents happened using only the longitude and latitude.

As shown above, the dataframe obtained contains 8 columns: id, mild_injuries, serious_injuries, victims, vehicles_involved, longitude, latitude, and date.

To easily access information about a car accident, we are going to set id as the index of the data frame, after first removing the trailing spaces present in the id entries.
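A sketch of this step with invented ids:

```python
import pandas as pd

# Toy frame (invented values); note the trailing spaces in the ids.
df = pd.DataFrame({"id": ["A001   ", "A002   "], "victims": [2, 1]})

# Remove surrounding whitespace, then use the cleaned id as the index.
df["id"] = df["id"].str.strip()
df = df.set_index("id")

# A single accident can now be looked up by its id.
print(df.loc["A001"])
```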

The final cleaning step consists of evaluating whether there are duplicated entries in the data frame. If so, we will remove these duplicated entries from the data frame, as they represent the same car accident.
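The check could be sketched as follows. Because pandas.DataFrame.duplicated compares column values only and ignores the index, this sketch temporarily moves id back into the columns so that it takes part in the comparison. That is a choice made for this example, not necessarily what the original notebook did, and the rows are invented.

```python
import pandas as pd

# Toy frame (invented values) in which the accident A002 appears twice.
df = pd.DataFrame(
    {"victims": [2, 1, 1], "vehicles_involved": [2, 1, 1]},
    index=pd.Index(["A001", "A002", "A002"], name="id"),
)

# Count fully repeated rows, with the id included in the comparison.
n_duplicates = df.reset_index().duplicated().sum()
print(n_duplicates)

# Keep the first occurrence of each accident and restore the index.
df = df.reset_index().drop_duplicates().set_index("id")
```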

Data cleaning finished!! 👏 Now! We are ready to answer questions and draw conclusions using our data. 👌 🍀

Answering questions and drawing conclusions

Exploratory data analysis and data cleaning are the steps that allow us to get a feeling about the dataset and to get the dataset ready to easily draw conclusions using it. Now! We are ready to answer the following questions using the dataset.

Time analysis

How many accidents were registered by the police in Barcelona in 2017?

We can easily obtain the total number of accidents registered in Barcelona by using the pandas.DataFrame.shape attribute, since each entry of the data frame represents a different car accident.

In 2017, 10330 accidents were registered by the police in Barcelona.

Distribution of car accidents per month

To analyze the distribution of car accidents per month, we employ the pandas.DataFrame.groupby method. A groupby operation involves a combination of splitting the object, applying a function, and combining the results. First, we group by month, and then we calculate the number of accidents in each month. We can easily interpret the result by using a bar plot as follows:
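A minimal sketch of the grouping and the bar plot, using six invented accident dates. The variable name accidents_per_month and the axis labels are choices made for this example.

```python
import pandas as pd

# Toy frame: six invented accident timestamps.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-01-05 08:00", "2017-01-20 17:00",
        "2017-08-03 12:00",
        "2017-10-13 08:00", "2017-10-14 23:00", "2017-10-30 09:00",
    ])
})

# Split by month, then count the rows (accidents) in each group.
accidents_per_month = df.groupby(df["date"].dt.month).size()
print(accidents_per_month)

# Bar plot of the counts (pandas uses Matplotlib under the hood).
ax = accidents_per_month.plot(kind="bar")
ax.set_xlabel("Month")
ax.set_ylabel("Number of accidents")
```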

As we can observe, the number of accidents decreases in August and December. One reason could be that fewer people are driving to work in these months.

Distribution of car accidents per day of the week

As we did with months, we can analyze the distribution of car accidents according to the day of the week by using a bar plot as well.

As shown in the plot above, the number of car accidents decreases at the weekend. Each weekday (Monday to Friday) accounts for an average of 1656 car accidents over the year, around 600 more than each weekend day (1025 car accidents on average).
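The two averages quoted above are consistent with the total of 10330 accidents, since 5 × 1656 + 2 × 1025 = 10330. The comparison could be computed roughly as in this sketch, which uses invented dates, so the resulting numbers are not the article's.

```python
import pandas as pd

# Toy data: two full weeks starting on Monday 2017-01-02, plus one extra
# accident on each weekday of the first week (all invented).
dates = pd.date_range("2017-01-02", periods=14).append(
    pd.date_range("2017-01-02", periods=5)
)
df = pd.DataFrame({"date": dates})

# Number of accidents for each day of the week (Monday=0 ... Sunday=6).
per_day = df.groupby(df["date"].dt.dayofweek).size()

# Average count over Monday-Friday and over Saturday-Sunday.
weekday_mean = per_day[per_day.index < 5].mean()
weekend_mean = per_day[per_day.index >= 5].mean()
print(weekday_mean, weekend_mean)
```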

The next plot depicts the number of accidents during each day of the year. As we can observe, there are between 10 and 50 accidents per day, and the number of accidents on Fridays is, as a rule, much higher than the number of accidents on Sundays.

Distribution of car accidents per hour

Following the same procedure as before, we plot the distribution of car accidents this time according to time.

As we can observe in the plot, most accidents occur in the early morning (8–9) and between 12 and 20.

Distribution of car accidents per day of the week and hour

We can also analyze the number of accidents per day of the week and hour using a side-by-side bar plot. In this particular case, we use a horizontal plot for better visualization.
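One way to build such a plot is to group by hour and day of the week at the same time and then unstack the result into a table with one column per day. The sketch below does this on five invented timestamps. The helper names hour and weekday are choices made for this example.

```python
import pandas as pd

# Toy frame: three Monday-morning accidents and two weekend-night ones
# (invented).
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-01-02 08:00", "2017-01-02 08:30", "2017-01-02 09:00",
        "2017-01-07 02:00", "2017-01-08 02:00",
    ])
})

hour = df["date"].dt.hour.rename("hour")
weekday = df["date"].dt.dayofweek.rename("weekday")  # Monday=0 ... Sunday=6

# Rows: hour of the day; columns: day of the week; values: accident counts.
counts = df.groupby([hour, weekday]).size().unstack(fill_value=0)
print(counts)

# Horizontal side-by-side bars: one group per hour, one bar per weekday.
ax = counts.plot(kind="barh")
ax.set_xlabel("Number of accidents")
```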

As we can easily observe, there are more accidents at night on weekends than during weekdays. On the contrary, there are many more accidents from early morning (8) until the evening (19) during weekdays than at the weekend.

Time analysis — conclusions

(1) August presents the lowest number of car accidents in 2017 (651). The rest of the months present around 800–900 accidents each.

(2) The number of car accidents decreases on weekends.

(3) Most car accidents occur from 8 to 9 and from 12 to 20.

(4) At night, most accidents happen on weekends.

We can always group by different temporal variables and create more complex plots in order to extract more complicated patterns and conclusions in regard to time dependency.

Type of accident analysis

The data we are analyzing contains information related to (1) the date of the accident, (2) the type of accident, and (3) the location of the accident. Regarding the type of accident, the data frame includes information such as the number of victims, the number of vehicles involved in the accident, and the type of injuries (mild or serious). As before, we can examine the distribution of all those variables using bar plots.

Vehicles involved

The previous plot depicts the number of accidents in 2017 according to the number of vehicles involved. In most accidents, two vehicles were involved (7028 accidents in 2017). Furthermore, the police recorded car accidents where up to 14 vehicles were involved; however, car accidents with many vehicles are not common.

Mild — Serious injuries

The data frame includes information about how many victims suffered mild and serious injuries in each car accident. We can easily represent the percentage of mild and serious injuries using a pie plot as follows:
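A sketch of the pie plot on invented injury counts follows. The percentages it produces are not the article's.

```python
import pandas as pd

# Toy frame (invented values): injuries recorded in four accidents.
df = pd.DataFrame({
    "mild_injuries": [2, 0, 1, 3],
    "serious_injuries": [0, 1, 0, 0],
})

# Total number of mild and serious injuries over all accidents.
injuries = df[["mild_injuries", "serious_injuries"]].sum()
print(injuries)

# Pie plot with the share of each injury type shown as a percentage.
ax = injuries.plot(kind="pie", autopct="%1.0f%%")
ax.set_ylabel("")  # hide the default y-axis label
```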

The plot shows that only 2% of the injuries are serious injuries. Although most of the injuries in car accidents were mild, it would be interesting to analyze under which circumstances (time, date, location) serious injuries are more frequent.

The following plot shows the percentage of injuries according to the day of the week. Mild injuries follow an expected pattern, since they present higher rates during weekdays, when more accidents happen. However, serious injuries present high rates on weekends, although the average number of accidents per weekend day (1025) is lower than per weekday (1656). This indicates that accidents on weekends tend to be more severe than during weekdays.

We can also plot the percentage of injuries according to the hour.

As we can observe, accidents tend to be more severe in the late evening and at night.

Type of accident analysis — conclusions

(1) In most accidents, 1, 2, or 3 vehicles were involved. In 2017, the police of Barcelona registered car accidents where up to 14 vehicles were involved.

(2) Most of the people injured in car accidents in 2017 suffered mild injuries (98%).

(3) Accidents tend to be more severe at night, in the late evening, and on weekends.

Location analysis

The best way to analyze spatial data is by using maps. Folium is a Python library that helps you create several types of Leaflet maps. We can easily generate a map of Barcelona by creating a Folium Map object. The location argument allows us to center the map on a specific location (in our case Barcelona). We can also provide an initial zoom level to zoom the map into that location.

Despite the initial zoom level, the map generated is interactive, meaning you can easily zoom in and out.

The dataset includes the latitude and longitude of each car accident. We can easily visualize them by using circle markers. The following map shows the accidents where serious injuries were caused, displaying the number of serious injuries with a popup label.

In Folium, we can also group markers into different clusters using the MarkerCluster object. The following plot depicts car accidents with seriously injured victims as before, but this time the accidents are grouped into clusters.

One striking feature of Folium is the possibility of creating animated heat maps, changing the data being shown based on a certain dimension (e.g. hours). We can easily achieve that by using the HeatMapWithTime class from folium.plugins. First, we create a nested list where each position contains the latitude and longitude of all car accidents in that specific hour. For instance, hour_list[0] contains the car accidents that happened from 00:00:00 to 00:59:59 (e.g. hour_list[0] → [[lat1,lon1],[lat2,lon2],[lat3,lon3],…,[latn,lonn]]). Then, we create the HeatMapWithTime object and add it to our map.

Looking at the above timeline, we can observe how the number of accidents increases from 8:00 and remains high until 21:00, when it starts to decrease.

Thanks for reading!!! 😊 🍀