A basic guide to time series analysis

An analysis of daily accidents in the UK from 2014 to 2017 using Time Series

Fabrice Mesidor
Towards Data Science


Introduction

Linear regression is a very common model used by data scientists: an outcome or target variable is explained by a set of features. But there are cases where the same variable is collected over time, as a sequence of measurements made at regular time intervals. Welcome to time series. One difference from standard linear regression is that the data are not necessarily independent and not necessarily identically distributed. Working with time series can be frustrating, as it implies modeling the correlation between the variable and its own lagged values (or the errors of previous predictions). The ordering also matters: changing the order changes the meaning of the data. Because of this complexity, data scientists sometimes get lost in the process of time series analysis. In this blog, I am going to share a full time series analysis guided by one of the well-known data science methodologies: OSEMN (Obtain, Scrub, Explore, Model, iNterpret).

Context and Data used

The visual above shows the methodology used in my study from gathering the data to drawing conclusions.

The data used for this analysis contains the date and the number of accidents for each of the 1,461 days from January 1st, 2014 to December 31st, 2017. I used a dataset from Kaggle for this exercise: I downloaded a CSV file and loaded it into a DataFrame with pandas' pd.read_csv. No other independent variables were considered in this analysis, as I am focused on the time series itself.
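Here is a minimal loading sketch; the file name and the column names ("Date", "Total_Accidents") are placeholders for illustration, not the actual layout of the Kaggle file:

```python
import pandas as pd

# Load the Kaggle CSV; the file name and column names are assumptions.
accidents_df = pd.read_csv("uk_daily_accidents_2014_2017.csv")
print(accidents_df.shape)   # expected: (1461, 2), one row per day
print(accidents_df.head())
```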

The main purpose of this study is to explain the different steps of a full data science project. Another objective is to find out whether the number of accidents on a given day depends on the number of accidents on the days before it.

The three questions that the study seeks to answer are:

  1. What is the relation between the number of accidents on a given day and on the day prior?
  2. Is there any pattern that can help predict (or prevent) the number of accidents in the UK on any given day?
  3. Is the month of the year or the day of the week related to the number of accidents?

Treating the data

The data was relatively clean and ready to use. However, I had to do some transformations for analysis purposes.

Firstly, I changed the column containing the dates to be the index of my DataFrame. While working with time series data in Python, it's important to always ensure that dates are used as index values and are understood by Python as true "date" objects.

Secondly, it is important to check for any missing data, as it can considerably change the dataset. No missing values were found in the data.

Lastly, to perform relevant EDA, I added three (3) more columns to my data: one for the month, one for the day and one for the weekday name. This should help me understand the trends based on the month of the year and the day of the week. I can also group the data on these new features to understand my data better.
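A minimal sketch of these treatment steps, reusing the hypothetical column names from the loading snippet above:

```python
import pandas as pd

data_cleaned_df = accidents_df.copy()

# 1. Parse the dates and use them as the index of the DataFrame.
data_cleaned_df["Date"] = pd.to_datetime(data_cleaned_df["Date"])
data_cleaned_df = data_cleaned_df.set_index("Date").sort_index()

# 2. Check for missing values (none were found in this dataset).
print(data_cleaned_df.isnull().sum())

# 3. Add the month, day and weekday-name columns used in the EDA.
data_cleaned_df["Month"] = data_cleaned_df.index.month
data_cleaned_df["Day"] = data_cleaned_df.index.day
data_cleaned_df["Weekday Name"] = data_cleaned_df.index.day_name()
```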

Below is how my data looks after treatment and the addition of the 3 columns.

A sample of my data generated using: data_cleaned_df.sample(5, random_state=0)

Exploring my data

One of the most vital steps in a data science project is the EDA. Exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics, often with visual methods. A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task. The EDA gives me a good understanding of my data.

Quick presentation of my data:

The average daily number of accidents decreased from about 400 per day in 2014 to 356 in 2017; in 2017 the daily count reached a low of 322. The standard deviation is around 17 accidents per day, less than 5% of the daily average.

Descriptive statistics of the series, provided by df.describe()
All the observations represented in a line graph

Plotting the data as a line informs us of any existing trends, while a scatter plot is a good way to spot outliers. We can observe that any day with more than 475 accidents or fewer than 200 accidents is quite abnormal. However, for my analysis, I didn't remove any of those observations. A box plot is another great graphic for seeing whether the data has outliers within given intervals.

A few additional interesting remarks: the number of accidents always dropped in December, as seen in the graph below. When starting the EDA, I was expecting the number to go up due to year-end parties. However, the drop might be due to people traveling, or staying at home to spend more time with family and friends. The decrease in the number of accidents on Sundays is in line with the second hypothesis.

Daily accidents per year.
Daily accidents per day of the week — don't focus on the order of the days ;)

Finally, building a histogram of the dataset gives me an idea of the distribution of the data. The number of accidents approximates a normal distribution that is slightly skewed.
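A sketch of these EDA steps, reusing the hypothetical data_cleaned_df and column names from the earlier snippets:

```python
import matplotlib.pyplot as plt

counts = data_cleaned_df["Total_Accidents"]

# Average number of accidents per day, by year and by weekday.
print(counts.resample("A").mean())
print(data_cleaned_df.groupby("Weekday Name")["Total_Accidents"].mean())

# Line plot of the full series and a histogram of the daily counts.
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
counts.plot(ax=ax1, title="Daily accidents, 2014-2017")
counts.plot(kind="hist", bins=30, ax=ax2, title="Distribution of daily accidents")
plt.tight_layout()
plt.show()
```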

Model

Before any modeling, I need to check whether the time series is stationary. A time series is said to be stationary if its statistical properties, such as its mean and variance, remain constant over time. Most time series models work on the assumption that the series is stationary, so it is important to validate that hypothesis: if a stationary series has shown a particular behavior over time, there is a very high probability that it will follow a similar behavior in the future, which is what makes modeling and prediction possible. If the series is not stationary, it becomes difficult to find a correct model or to make any prediction. I am going to plot the data with the rolling mean and rolling standard deviation to look for any trend, and I will also perform the Dickey-Fuller test.

The Dickey-Fuller test is a statistical test for stationarity. The null hypothesis of the test is that the time series is not stationary. So if the test statistic is less than the critical value, we reject the null hypothesis and say that the series is stationary.
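Here is a sketch of both checks with statsmodels, reusing the counts series from the EDA snippet (the 30-day rolling window is an assumption):

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller

# Rolling statistics: a roughly flat rolling mean and std suggest stationarity.
ax = counts.plot(figsize=(12, 5), label="Daily accidents")
counts.rolling(window=30).mean().plot(ax=ax, label="30-day rolling mean")
counts.rolling(window=30).std().plot(ax=ax, label="30-day rolling std")
ax.legend()
plt.show()

# Dickey-Fuller test: a p-value below 0.05 rejects the null hypothesis
# of non-stationarity at the 95% confidence level.
adf_stat, p_value, _, _, critical_values, _ = adfuller(counts)
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")
print("Critical values:", critical_values)
```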

After performing the Dickey-Fuller test, at a confidence level of 95%, we reject the null hypothesis: the series is stationary, as seen in the graph below.

The data looks stationary. Also, the Dickey-Fuller test returns a p-value < 0.05

After confirming the stationarity of the series, I can continue with the model. I make sure to split my data into a training and a testing set. The testing set is not used in the modeling process and will be used to evaluate the performance of the selected model on unseen data.
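A chronological split sketch (the cut-off date below is my assumption; the post does not state the actual split):

```python
# Keep the ordering: never shuffle a time series before splitting.
train = counts.loc[:"2017-06-30"]
test = counts.loc["2017-07-01":]
print(len(train), len(test))
```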

To select the relevant time series model, I built the ACF and PACF plots to determine, respectively, the values of q and p for the ARIMA model.
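For example, with statsmodels (the number of lags shown is an assumption):

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# The ACF suggests the MA order q, the PACF the AR order p.
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(train, lags=30, ax=ax1)
plot_pacf(train, lags=30, ax=ax2)
plt.show()
```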

After different iterations, I picked an ARMA(2,3) to represent the data. At each step, I moved to another model whenever I got coefficients that were not significant.

Based on this result, the number of accidents would be explained by the counts of the two days prior (autoregressive part) and a moving-average part up to lag 3.
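A sketch of the fit with current statsmodels, where an ARIMA with a differencing order of zero is equivalent to an ARMA(2,3):

```python
from statsmodels.tsa.arima.model import ARIMA

# ARIMA(p=2, d=0, q=3) on the already stationary series, i.e. an ARMA(2, 3).
arma_model = ARIMA(train, order=(2, 0, 3))
arma_results = arma_model.fit()
print(arma_results.summary())   # check the significance of the AR and MA coefficients
```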

Although the residuals of this model approximate a normal distribution, the fitted ARMA failed at predicting unseen data. Remember, I split my data and kept a testing set: when passing the testing set to the model, the predictions were awful. Reducing the number of observations in the testing data didn't improve the quality of the forecast. I concluded that an ARMA wasn't the best way to represent the data. Logical, right? Other factors can impact the number of accidents in a day, like the type of vehicles, the area, the age of the driver and so on.
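A sketch of how such an evaluation on the hold-out set could look (the RMSE metric is my choice; the post does not name the one actually used):

```python
import numpy as np

# Forecast over the hold-out period and compare with the observed counts.
forecast = arma_results.forecast(steps=len(test))
rmse = np.sqrt(np.mean((forecast.values - test.values) ** 2))
print(f"RMSE on the test set: {rmse:.1f} accidents/day")
```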

Conclusion and considerations

Even though we didn't get to a "perfect model" — not ideal for a data scientist anyway — some of the questions were answered through the EDA. We saw the relation between the days of the week and the number of accidents, and how the month of the year impacts the numbers. That's why this step cannot be ignored in a data science project.

While modeling our data, we found the correlation between the number of accidents on a day and the number on the day prior. Our model would definitely have performed better if we had integrated some exogenous variables. Those should be found using domain knowledge and a thorough literature review. An ARIMAX or a SARIMAX would be more appropriate for our problem. It can be frustrating to keep going back one or two steps in the process, but I do believe it is one of the beauties of data science, as each step brings more answers and more narrative from the numbers.
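To illustrate the SARIMAX idea, here is a hypothetical sketch with a weekly seasonal component and a simple weekend flag standing in for real exogenous variables (weather, holidays or traffic volumes would be better candidates):

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Placeholder exogenous feature: a weekend indicator. In practice this would
# be replaced by domain-driven variables gathered from other sources.
exog_train = (train.index.dayofweek >= 5).astype(int)
exog_test = (test.index.dayofweek >= 5).astype(int)

sarimax_model = SARIMAX(train, exog=exog_train,
                        order=(2, 0, 3), seasonal_order=(1, 0, 1, 7))
sarimax_results = sarimax_model.fit(disp=False)
sarimax_forecast = sarimax_results.forecast(steps=len(test), exog=exog_test)
```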

Any thoughts, please leave me a comment and we can discuss.

My code, models and the data used are available on my GitHub. Please use the link below:
