Understanding ARIMA Models using Valenbisi (Valencia’s bike share system)

Natalie Olivo
Towards Data Science
11 min read · Mar 21, 2018


My brilliant colleague Nicole Eickhoff and I were going to Spain together, and we wanted to meet data scientists there. Since we’d be there for Fallas, a month-long celebration of community and tradition, we collected Valenbisi data for the week leading up to Crida Fallas, the first day of activities that kicks off the month of events. We wanted to see how Fallas affected bike share activity. Here I detail the project we presented and discussed with the Valencia Big Data Meetup on Tuesday, March 13, 2018.

As part of the celebration of Fallas, each neighborhood in Valencia creates fantastic street light displays.

In this blog I briefly describe my process of collecting data with a scraper, conducting an ARIMA analysis, deliberating over the results, and proposing next steps toward building a model that can predict bike and dock availability at each Valenbisi station in Valencia. For more in-depth coverage of each of these, keep an eye out for the links provided throughout this blog and for my Jupyter Notebooks hosted on my Github.

Collecting the data

Nicole requested Valenbisi’s API key access on 16-Feb-2018. After finding that the API only provided data from two years ago, she reached out to me for ideas. Given that the site provides a real-time downloadable CSV, and that its Terms of Use did not describe scraping as prohibited at the time, I knew we could build a scraper and give it a try. Using very little besides the Python library ‘requests’, I was able to build a scraper and host it on an AWS instance for the 10 days before I left for Spain.

Here’s a link to my Github-hosted Jupyter Notebook where I walk through the backbone of the scraper I hosted on AWS.

I collected the real-time data every 15 minutes, 1001 times in total. The CSV includes information on ‘Bikes Available’, ‘Station Spots Free’, ‘Total’, ‘X’ and ‘Y’.
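For reference, here is a minimal sketch of what such a scraper can look like. The CSV URL below is a placeholder, not the actual endpoint I used:

import time
import requests

# Placeholder URL; the real endpoint is the real-time CSV on Valencia's open data portal.
CSV_URL = 'http://example.com/valenbisi-realtime.csv'

while True:
    response = requests.get(CSV_URL)
    response.raise_for_status()
    # Timestamp each snapshot so the files can be stitched into a time series later.
    filename = time.strftime('valenbisi_%Y%m%d_%H%M.csv')
    with open(filename, 'wb') as f:
        f.write(response.content)
    time.sleep(15 * 60)  # wait 15 minutes between snapshots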

This is a little over 10 days of data, which prevents us from observing seasonality over the year; we can only observe seasonality over the week. Since we have so little data, this will be a great scaled-down example and will potentially be more intuitively interpretable. We will be able to focus on weekly and hourly trends.

Our data does not include information on which activity reflects rebalancing efforts.

Our data does not include individual ride data, making it unfit for network analysis.

X and Y are UTM coordinates and can be converted to latitude and longitude using the ‘utm’ Python library. Valencia lies in UTM zone 30, zone letter ‘S’, as shown in the example below.

[Input]:
import utm
utm.to_latlon(725755.944, 4372972.613, 30, 'S')
[Output]:
(39.47674728414135, -0.37534073558984327)

And using Bokeh, I was able to plot each Valenbisi station (blue) and the location of each major event programmed for the first day of activities, Crida Fallas, February 25, 2018 (red).

#Plaza Ayuntamiento @39.4697661,-0.4113992
#Church of San Juan del Hospital @39.4743981,-0.3814647
#Torres de los Serranos @39.4791963,-0.378187
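A minimal sketch of that plot, assuming the station coordinates have already been converted to latitude/longitude columns in a DataFrame called ‘stations’ (the exact plotting code is in my notebook):

from bokeh.plotting import figure, show

# Crida event locations (latitude, longitude), from the comments above.
events = [(39.4697661, -0.4113992),   # Plaza Ayuntamiento
          (39.4743981, -0.3814647),   # Church of San Juan del Hospital
          (39.4791963, -0.378187)]    # Torres de los Serranos

p = figure(title='Valenbisi stations and Crida Fallas event locations')
p.circle(stations['lon'], stations['lat'], color='blue')
p.circle([lon for lat, lon in events], [lat for lat, lon in events], color='red', size=10)
show(p)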

Exploratory Data Analysis

First off, we anticipated a ‘spike’ or some irregular behavior across the whole system around Crida Fallas. The following plot of all available bikes at all stations shows that there are other shocks occurring, not necessarily related to Fallas. The vertical lines indicate Crida Fallas, February 25, 2018.

When we presented to Valencia’s Big Data Meetup, the suggested explanation for such frequent plummets, unrelated to commute times, was system power outages. For this analysis, I retained all datapoints.

Next, I plot three individual Valenbisi stations. Perhaps proximity to Crida Fallas activities is relevant to how a station is affected. First we’ll look at a station situated close to the Crida event locations, i.e. close to the red dots on the map.

Wow, we see a big spike for Crida Fallas, and another later in the week! This station is, as we expected, very close to the city center, the Ayuntamiento. Overall, the return to normal behavior after Crida Fallas seems to happen pretty quickly.

Visual Observations of a station situated closer to the city center, the Ayuntamiento.

  1. A big spike of Available Bikes during Crida Fallas
  2. Another spike later in the week
  3. Overall, a relatively quick return to ‘normal’ behavior after Crida Fallas

Now, the following stations are nearer to the city boundaries of Valencia.

I see very little recognizable pattern here. Whatever occurred on February 25th looks as if it could have occurred on any day for this station.

Visual Observations of a station situated towards the South West boundary of the city:

  1. Very little recognizable pattern here.
  2. Whatever behavior this station exhibits on February 25th looks as if it may have occurred on any other day.

Visual Observations of a station situated near the edge of the city.

  1. Without looking up what was near this station (a hospital, maybe?), it’s clear that the bikes here are used for commuting.
  2. We see that Crida affects this station not only on the day of, but also the day before, and it takes two days afterward before behavior returns to ‘normal’.

We see that whatever solution we come up with will need to work for stations with a number of different kinds of behavior. For our purposes, let’s conduct an ARIMA analysis on all the data together. I chose to begin the analysis with all the data because it doesn’t exhibit volatile behavior around Crida Fallas and makes a great learning example for ARIMA modeling. ARIMA models rely on the assumption that the state the bike station was in most recently influences the future number of available bikes/docks. Additionally, ARIMA models can capture seasonal relationships in our time series data.

ARIMA analysis

  1. Decompose time series into Trend, Seasonality, and Residuals
  2. Test for Stationarity (Dickey-Fuller Test)
  3. Create ARIMA model
  4. Results interpretation

1. Decomposition

Visual observations:

  1. Over the course of 10 days, we see a slight upward trend. This upward trend spans only about 100 available bikes.
  2. In the seasonal decomposition, we observe repetition. It is futile to attempt to interpret each crevice, but the repetitive pattern in this feature indicates a daily pattern of behavior. (We have 10 days and can count 10 repetitions.)
  3. Our residuals chart shows what we expected: a major plunge for only 5 of the 6 plunges that Valencia Meetup goers suggested were ‘possible outages’.
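For reference, a minimal sketch of the decomposition step, assuming ‘series’ holds the system-wide available-bike count at 15-minute frequency:

from statsmodels.tsa.seasonal import seasonal_decompose

# 4 observations per hour * 24 hours = 96 observations per day,
# so a period of 96 captures the daily seasonality.
# (Older statsmodels versions name this argument freq instead of period.)
decomposition = seasonal_decompose(series, model='additive', period=96)
decomposition.plot()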

2. Is our data stationary? — the Augmented Dickey-Fuller Test

First, I test whether our data is stationary. It is important that your time series be stationary before forecasting; otherwise your model will be a poor predictor, since ARIMA relies on the assumption of stationarity.

An example of non-stationary data made stationary. Provided by https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc442.htm

This can be its own blog post and is incredibly interesting. For a more quantitative explanation, I recommend this source.

For the purposes of our conversation, it is important to know that the Dickey-Fuller test is a hypothesis test:

H0: The data is non-stationary

H1: The data is stationary

I used the following code to perform the Dickey-Fuller Test on the data. It was provided as a learning exercise from my Data Science Bootcamp:
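A condensed sketch of that pattern, built around statsmodels’ adfuller (the full helper is in my notebook):

from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries):
    # Augmented Dickey-Fuller test, with the lag length chosen by AIC.
    result = adfuller(timeseries, autolag='AIC')
    labels = ['Test Statistic', 'p-value', '# Lags Used', '# Observations Used']
    for label, value in zip(labels, result[:4]):
        print(label, ':', value)
    if result[1] <= 0.05:
        print('Reject H0: the data is stationary.')
    else:
        print('Fail to reject H0: the data is non-stationary.')

test_stationarity(df.available)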

Output:

Our p-value is definitely below .05, meaning we can reject the null hypothesis; our data is stationary!

We used statsmodels.tsa.stattools.adfuller with the following inputs:

  • maxlag : None (because we specify autolag = ‘AIC’, this will not be used in our case.)
  • regression : ‘c’ : constant only (default)
  • autolag : ‘AIC’ : if ‘AIC’ (default) or ‘BIC’, then the number of lags is chosen to minimize the corresponding information criterion. This will choose the model with the best AIC score.

More information on AIC and BIC:

AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so that a lower AIC means a model is considered to be closer to the truth.

BIC is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model.

… Despite various subtle theoretical differences, their only difference in practice is the size of the penalty; BIC penalizes model complexity more heavily. The only way they should disagree is when AIC chooses a larger model than BIC.

Our p-value for the Dickey-Fuller test is definitely below .05, meaning we can reject the null hypothesis. Our data is stationary and can therefore be plugged into an ARIMA model.

3. ARIMA modelling

This is the point in the blog where you will find deviations from our original presentation to Valencia’s Big Data Meetup. My hope is that this blog can drive further discussion about methodology and some of the uncertainties I ran into with ARIMA models and my data. I am open to conversation and dialogue about how to improve my understanding and my model.

I found ARIMA modelling to be rather uncertain for the following reasons:

  • The empirical way to discover seasonality parameters for our statsmodels SARIMAX model (ACF/PACF plots; see the sketch after this list) didn’t exactly produce an effective model. (See Github for more details.)
  • The model that appeared to predict the best (of the number of models I ran) was one that exhibited few statistically significant lag variables.
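For completeness, the standard ACF/PACF diagnostics look along these lines, assuming df.available is the series:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Autocorrelation and partial autocorrelation out to one full day of lags
# (96 lags at 4 observations per hour).
plot_acf(df.available, lags=96)
plot_pacf(df.available, lags=96)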

Here is the code for the model I found to be best for forecasting. You can find (almost) all my other models on my Github.

import statsmodels.api as sm

mod = sm.tsa.statespace.SARIMAX(df[0:960].available,
                                trend='n',
                                order=(1, 0, 1),
                                seasonal_order=(4, 0, 7, 24),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
results.summary()
Wow, why does a model with nearly no significant lags produce the best predictions? Perhaps it’s not so bad. Discussion below.

With no other independent variables besides its own behavior, our ARIMA model can predict pretty well, I’d say.
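The forecast itself follows a standard pattern: fit on the first 960 observations, then predict the held-out tail and compare against the actuals. A minimal sketch of that step (the exact plotting code is in my notebook):

import matplotlib.pyplot as plt

# Forecast observations 960-1000 out of sample and compare to the actuals.
pred = results.get_prediction(start=960, end=1000, dynamic=True)
ax = df.available.plot(color='blue', label='Actual')
pred.predicted_mean.plot(ax=ax, color='green', label='Forecast')
ax.legend()
plt.show()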

As a reminder: we only have 10 days of data, with 4 data points per hour. This prohibits us from testing for annual seasonal relationships. For our purposes, I’m going to say that a season lasts 6 hours, meaning there are 4 seasons per day. This should capture rush hours, mid-days, and evenings. This, most intuitively, makes the number of datapoints per season m = 4 * 6 = 24.

Motivation behind choosing my (pdq)(PDQ)m values:

Non-seasonally, the most influential lag on the current state of the model is the lag directly before it: p = 1, q = 1 (and d = 0, because our data is stationary).

P, the seasonal AR order: I suspect that there is seasonality between days, so I want to see what occurred one day ago (4 seasonal lags of 24 data points = 96 points = 24 hours).

D, seasonal differencing is 0 because our data is stationary.

Q, the seasonal moving average order. Now I’ll admit, at first glance this seems the least intuitive. I found it by trial and error, but predictability seems best if we account for 7 lags of 24 points, meaning 42 hours prior; not quite 2 full days. If you think of the week in terms of 168 hours, these numbers divide cleanly: 168/42 = 4. This is an indication of 4 behaviorally distinct times of the week, and we want our model to use 1/4th of a week as its moving average window accordingly.

Please feel free to refer to my Github to see (almost) all the models I tried, and leave any suggestions for model parameter improvements in the comments.

4. Results Interpretation:

First, I’d like to acknowledge that while there were very few statistically significant lags, removing a variable that was deemed statistically insignificant by its P>|z| value adversely affected the model’s prediction power.

ARIMA(1,0,0)(4,0,7,24). Green = forecast, blue = actual.

An example of what happens when you remove the ma.L1 with p>|z| value .993. ARIMA(1,0,0)(4,0,7,24).

The ARIMA(1,0,1)(4,0,7,24) model did have the overall lowest AIC and BIC scores. These indicators are not absolute measures of model quality, but they provide a means of comparison with other models; the lower, the better.

Secondly, I’d like to take a dive into some of the other results provided by the Statespace Model Results:

The Ljung-Box (Q) test is done to confirm that the residuals are independent. This means that our errors are not correlated with each other and we’ve successfully accounted for patterns in the data with our model.

The hypothesis test we perform for Ljung-Box is as follows:

H0: The data are independently distributed (i.e. the correlations in the population from which the sample is taken are 0, so that any observed correlations in the data result from randomness of the sampling process).

H1: The data are not independently distributed; they exhibit serial correlation.

The test statistic given by Ljung-Box (Q) follows a χ² distribution with (h - p - q) degrees of freedom, where h is the number of lags tested.

If the p-value (Prob(Q)) is greater than 0.05 then we fail to reject the null hypothesis; the residuals are independent which indicates that the model provides an adequate fit to the data. We see our Prob(Q) is .62. Great. So despite not having many statistically significant lag variables, our model has managed to explain enough of the variance that the residuals are random.
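If you want to recompute this outside the summary table, statsmodels exposes the same test directly; a quick sketch:

from statsmodels.stats.diagnostic import acorr_ljungbox

# Ljung-Box test on the model residuals; p-values above .05 are consistent
# with independent (white-noise) residuals.
acorr_ljungbox(results.resid, lags=[24])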

Heteroskedasticity: an assumption of linear regression is that there is no heteroskedasticity in the residuals. This means that the variance of the residuals should not increase with the fitted values of the response variable; we prefer our residuals to be homoskedastic.

H0: The variance of the residuals is constant; no Heteroskedasticity.

H1: Heteroskedasticity is present.

Here we fail to reject the null hypothesis at the 95% confidence level, since our p-value is .07. We can assume homoskedasticity in our residuals. Okay, not bad. Another test that our model manages to pass.

The Jarque-Bera and kurtosis metrics are about the normality of the residuals. Our data is not normally distributed, so I will skip these measures for now.

Possible Next Steps

  1. Scale up, which means collect data for longer and apply models to individual stations. Consider a recurrent neural network to model time series sequencing.
  2. Consider a Poisson Point Process model to predict bikes available and docks open with temporal precedence established using .shift() like this DSSG project did.
  3. Request data on rebalancing efforts.
  4. Consider feature engineering additional variables for major city events and distance from specific event activities.
  5. Consider shock modelling for affected bike stations.

#datavacation

This was one of two presentations Nicole and I put together for our trip to Spain. It was shared with the Valencia Data Science Meetup.

Our other work was a CRM data seminar presented to an undergraduate Marketing Class at the University of Valencia’s Business School. Keep a lookout for a blog on this coming from Nicole soon!

Data Source:

http://www.valencia.es/ayuntamiento/datosabiertos.nsf/resultadoCapas/95D6756F6861F9FDC1257C70003E4FBD?OpenDocument&lang=1&nivel=2&seccion=1&bdorigen=&idapoyo=22ADF97C1FD223B5C1257C55003BD01F

Github Repository:

Additional Resources:
https://www.datacamp.com/courses/introduction-to-time-series-analysis-in-python

