Time Series Modeling using Scikit, Pandas, and Numpy

Intuitive use of seasonality to improve model accuracy

Dr. Varshita Sher
Towards Data Science


Welcome to Part 2 of Time Series Analysis! In this post, we will be working our way through modeling time series data. This is a continuation of my previous post on Time Series Data.

In our previous blog post, we talked about what time series data is, how to format such data to maximize its utility, and how to handle missing data. We also learned how to resample time series data by month, week, year, etc., and calculate rolling means. We dove into concepts such as trends, seasonality, first-order differencing, and autocorrelation. If you are familiar with most of these concepts, you are good to go. In case you need a refresher, you can do a quick Google search for these topics or read my previous post here.

A few words before we begin:

There are other, undoubtedly better, tools available for time series forecasting, such as ARIMA or Facebook’s Prophet library. However, this article was inspired by a friend’s take-home assignment that required her to use only Scikit, Numpy, and Pandas (or face instant disqualification!).

Let’s dive into our dataset

We will be working with the publicly available Open Power System Data dataset. You can download the data here. It contains daily electricity consumption, wind power production, and solar power production for Germany from 2006 to 2017.

import pandas as pd

url = 'https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv'
data = pd.read_csv(url, sep=",")

After setting the Date column as the index, this is what our dataset looks like:

# to explicitly convert the date column to type DATETIME
data['Date'] = pd.to_datetime(data['Date'])
data = data.set_index('Date')
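
A quick, optional sanity check (not in the original post) to confirm that the conversion worked:

# the index should now be a DatetimeIndex spanning 2006-2017
print(data.index)
print(data.head())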

Defining the Modeling task

Goals of Prediction

Our aim is to predict Consumption (ideally for future unseen dates) from this time series dataset.

Training and Test set

We will be using 10 years of data for training i.e. 2006–2016 and last year’s data for testing i.e. 2017.

Performance Measure

In order to evaluate how good our model is, we will be using R-squared and Root Mean Squared Error (RMSE), but we will print all the relevant metrics so you can make the final call.

Helper Functions

In order to print all the performance metrics relevant to a regression task (such as MAE and R-squared), we define the regression_results function below.

import numpy as np
import sklearn.metrics as metrics

def regression_results(y_true, y_pred):
    # Regression metrics
    explained_variance = metrics.explained_variance_score(y_true, y_pred)
    mean_absolute_error = metrics.mean_absolute_error(y_true, y_pred)
    mse = metrics.mean_squared_error(y_true, y_pred)
    mean_squared_log_error = metrics.mean_squared_log_error(y_true, y_pred)
    median_absolute_error = metrics.median_absolute_error(y_true, y_pred)
    r2 = metrics.r2_score(y_true, y_pred)
    print('explained_variance: ', round(explained_variance, 4))
    print('mean_squared_log_error: ', round(mean_squared_log_error, 4))
    print('r2: ', round(r2, 4))
    print('MAE: ', round(mean_absolute_error, 4))
    print('MSE: ', round(mse, 4))
    print('RMSE: ', round(np.sqrt(mse), 4))
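
As a quick illustration of the helper, here it is called on a couple of toy arrays (made-up numbers, purely to show the call signature):

# toy example - in practice y_true and y_pred come from the test set and the model
regression_results([100, 150, 200], [110, 140, 190])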

Feature Engineering

As a baseline, we choose a simple model, one that predicts today’s consumption value based on:

  • yesterday’s consumption value, and
  • the difference between yesterday’s and the day before yesterday’s consumption values.

# creating a new dataframe from the consumption column
data_consumption = data[['Consumption']]
# inserting a new column with yesterday's consumption values
data_consumption.loc[:, 'Yesterday'] = data_consumption.loc[:, 'Consumption'].shift()
# inserting another column with the difference between yesterday's and the day before yesterday's consumption values
data_consumption.loc[:, 'Yesterday_Diff'] = data_consumption.loc[:, 'Yesterday'].diff()
# dropping NAs
data_consumption = data_consumption.dropna()

Defining training and test sets

X_train = data_consumption[:'2016'].drop(['Consumption'], axis = 1)
y_train = data_consumption.loc[:'2016', 'Consumption']
X_test = data_consumption['2017'].drop(['Consumption'], axis = 1)
y_test = data_consumption.loc['2017', 'Consumption']
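
A quick, optional sanity check to confirm that the training data runs through 2016 and the test data covers only 2017:

# confirm the temporal split boundaries and sizes
print(X_train.index.min(), '->', X_train.index.max())
print(X_test.index.min(), '->', X_test.index.max())
print(X_train.shape, X_test.shape)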

Cross-Validation on Time-Series data

A question that often comes up during data science interviews is: Which cross-validation technique would you use on time-series data?

You may be tempted to gravitate towards the all-time favorite K-Fold Cross-Validation (believe me, up until recently — don’t ask how recently! — I did not know there exist CV techniques other than K-fold). Unfortunately, that would not be the correct answer. The reason is that it ignores the natural ordering of time-series data: the randomization in standard K-fold cross-validation shuffles that ordering away, so the model can end up being validated on observations that come before some of the ones it was trained on.

A better alternative to K-fold CV for cross-validation on time series data is the forward-chaining strategy.

In forward chaining, say with 3 folds, the train and validation sets look like:

  • fold 1: training [1], validation [2]
  • fold 2: training [1 2], validation [3]
  • fold 3: training [1 2 3], validation [4]

where 1, 2, 3, 4 represent years. This way, successive training sets are supersets of those that come before them.

Luckily for us, sklearn provides exactly this kind of train-test split through TimeSeriesSplit.

from sklearn.model_selection import TimeSeriesSplit

The TimeSeriesSplit class takes as input the number of splits. Since our training data has 11 unique years (2006–2016), we set n_splits = 10. This way we have neat training and validation sets (the small sketch after the list below shows how to inspect the splits that are actually produced):

  • fold 1: training [2006], validation [2007]
  • fold 2: training [2006 2007], validation [2008]
  • fold 3: training [2006 2007 2008], validation [2009]
  • fold 4: training [2006 2007 2008 2009], validation [2010]
  • fold 5: training [2006 2007 2008 2009 2010], validation [2011]
  • fold 6: training [2006 2007 2008 2009 2010 2011], validation [2012]
  • fold 7: training [2006 2007 2008 2009 2010 2011 2012], validation [2013]
  • fold 8: training [2006 2007 2008 2009 2010 2011 2012 2013], validation [2014]
  • fold 9: training [2006 2007 2008 2009 2010 2011 2012 2013 2014], validation [2015]
  • fold 10: training [2006 2007 2008 2009 2010 2011 2012 2013 2014 2015], validation [2016]
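
To see what each fold actually covers, here is a minimal sketch (note that TimeSeriesSplit splits by row count, so with daily data the folds line up only approximately with calendar years):

# print the date range covered by each training and validation fold
tscv = TimeSeriesSplit(n_splits=10)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_train), start=1):
    print('fold %d: train %s to %s | validation %s to %s' % (
        fold,
        X_train.index[train_idx[0]].date(), X_train.index[train_idx[-1]].date(),
        X_train.index[val_idx[0]].date(), X_train.index[val_idx[-1]].date()))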

Spot check Algorithms

# Spot-check algorithms
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

models = []
models.append(('LR', LinearRegression()))
models.append(('NN', MLPRegressor(solver = 'lbfgs')))  # neural network
models.append(('KNN', KNeighborsRegressor()))
models.append(('RF', RandomForestRegressor(n_estimators = 10)))  # ensemble method - collection of many decision trees
models.append(('SVR', SVR(gamma='auto')))  # support vector regression

# Evaluate each model in turn
results = []
names = []
for name, model in models:
    # TimeSeries cross validation
    tscv = TimeSeriesSplit(n_splits=10)
    cv_results = cross_val_score(model, X_train, y_train, cv=tscv, scoring='r2')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

# Compare Algorithms
plt.boxplot(results, labels=names)
plt.title('Algorithm Comparison')
plt.show()

Both KNN and RF perform equally well. But I personally prefer RF, since ensemble models (which combine multiple ‘individual’, diverse models to deliver superior predictive power) work almost out of the box, and that is one reason why they are so popular.

Grid Searching Hyperparameters

I discussed the need for grid searching hyperparameters in one of my previous articles.

An optimal combination of hyperparameters maximizes a model’s performance without leading to a high variance problem (overfitting).

The Python code for performing the grid search is as follows:

from sklearn.model_selection import GridSearchCV

model = RandomForestRegressor()
param_search = {
    'n_estimators': [20, 50, 100],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [i for i in range(5, 15)]
}
tscv = TimeSeriesSplit(n_splits=10)
gsearch = GridSearchCV(estimator=model, cv=tscv, param_grid=param_search, scoring = rmse_score)
gsearch.fit(X_train, y_train)
best_score = gsearch.best_score_
best_model = gsearch.best_estimator_

If you look at the code above, you will notice that we have defined a custom scorer by setting scoring = rmse_score instead of using one of the common scoring metrics defined in sklearn. We define our custom scorer as follows:

from sklearn.metrics import make_scorer

def rmse(actual, predict):
    predict = np.array(predict)
    actual = np.array(actual)
    distance = predict - actual
    square_distance = distance ** 2
    mean_square_distance = square_distance.mean()
    score = np.sqrt(mean_square_distance)
    return score

rmse_score = make_scorer(rmse, greater_is_better = False)
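
One thing to keep in mind: with greater_is_better = False, make_scorer negates the scorer’s output so that larger is always better internally, which means gsearch.best_score_ comes back as a negative number. A small sketch for recovering the actual RMSE:

# best_score_ is the negated RMSE because greater_is_better=False
print('Best CV RMSE: ', round(-gsearch.best_score_, 4))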

Checking best model performance on test data

y_true = y_test.values
y_pred = best_model.predict(X_test)
regression_results(y_true, y_pred)

This is not bad for starters. Let’s see if we can further improve our model.

Feature Engineering Returns

Up until now, we have been using values from day (t-1) to predict the value on day t. Now, let us also use values from day (t-2) to predict consumption:

# creating copy of original dataframe
data_consumption_2o = data_consumption.copy()
# inserting column with yesterday-1 values
data_consumption_2o['Yesterday-1'] = data_consumption_2o['Yesterday'].shift()
# inserting column with difference in yesterday-1 and yesterday-2 values.
data_consumption_2o['Yesterday-1_Diff'] = data_consumption_2o['Yesterday-1'].diff()
# dropping NAs
data_consumption_2o = data_consumption_2o.dropna()

Resetting the train and test set

X_train_2o = data_consumption_2o[:'2016'].drop(['Consumption'], axis = 1)
y_train_2o = data_consumption_2o.loc[:'2016', 'Consumption']
X_test = data_consumption_2o['2017'].drop(['Consumption'], axis = 1)
y_test = data_consumption_2o.loc['2017', 'Consumption']

Checking to see if the ‘best’ random forest using ‘new’ predictors performs better

model = RandomForestRegressor()
param_search = {
    'n_estimators': [20, 50, 100],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [i for i in range(5, 15)]
}
tscv = TimeSeriesSplit(n_splits=10)
gsearch = GridSearchCV(estimator=model, cv=tscv, param_grid=param_search, scoring = rmse_score)
gsearch.fit(X_train_2o, y_train_2o)
best_score = gsearch.best_score_
best_model = gsearch.best_estimator_
y_true = y_test.values
y_pred = best_model.predict(X_test)
regression_results(y_true, y_pred)

Great news! We have significantly brought down the RMSE and MAE values, and the R-squared value has also gone up.

Feature Engineering strikes back

Let us see if adding the value of solar production is beneficial in some way to predicting electricity consumption.

data_consumption_2o_solar = data_consumption_2o.join(data[['Solar']])
data_consumption_2o_solar = data_consumption_2o_solar.dropna()

Resetting Train/Test + GridSearch + Checking performance

X_train_2o_solar = data_consumption_2o_solar[:'2016'].drop(['Consumption'], axis = 1)
y_train_2o_solar = data_consumption_2o_solar.loc[:'2016', 'Consumption']
X_test = data_consumption_2o_solar['2017'].drop(['Consumption'], axis = 1)
y_test = data_consumption_2o_solar.loc['2017', 'Consumption']

model = RandomForestRegressor()
param_search = {
    'n_estimators': [20, 50, 100],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [i for i in range(5, 15)]
}
tscv = TimeSeriesSplit(n_splits=5)
gsearch = GridSearchCV(estimator=model, cv=tscv, param_grid=param_search, scoring = rmse_score)
gsearch.fit(X_train_2o_solar, y_train_2o_solar)
best_score = gsearch.best_score_
best_model = gsearch.best_estimator_
y_true = y_test.values
y_pred = best_model.predict(X_test)
regression_results(y_true, y_pred)

Voila, the model’s performance is even better now.

Variable Importance Plot

imp = best_model.feature_importances_
features = X_train_2o_solar.columns
indices = np.argsort(imp)
plt.title('Feature Importances')
plt.barh(range(len(indices)), imp[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

As we can see, the amount of solar production is not as strong a predictor of power consumption as other time-based predictors.

Feature Engineering Endgame

If you followed the narrative in Part 1 (my previous blog post), you will remember that our dataset has a seasonal element to it, weekly seasonality to be more precise. Thus, it would make sense to also feed the model the consumption value from one week prior to the given date. That means that if the model is trying to predict the consumption value on Jan 8, it must be fed information about the consumption on Jan 1.

data_consumption_2o_solar_weeklyShift = data_consumption_2o_solar.copy()
data_consumption_2o_solar_weeklyShift['Last_Week'] = data_consumption_2o_solar['Consumption'].shift(7)
data_consumption_2o_solar_weeklyShift = data_consumption_2o_solar_weeklyShift.dropna()

Resetting Train/Test + GridSearch + Checking performance

X_train_2o_solar_weeklyShift = data_consumption_2o_solar_weeklyShift[:'2016'].drop(['Consumption'], axis = 1)
y_train_2o_solar_weeklyShift = data_consumption_2o_solar_weeklyShift.loc[:'2016', 'Consumption']
X_test = data_consumption_2o_solar_weeklyShift['2017'].drop(['Consumption'], axis = 1)
y_test = data_consumption_2o_solar_weeklyShift.loc['2017', 'Consumption']

model = RandomForestRegressor()
param_search = {
    'n_estimators': [20, 50, 100],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [i for i in range(5, 15)]
}
tscv = TimeSeriesSplit(n_splits=10)
gsearch = GridSearchCV(estimator=model, cv=tscv, param_grid=param_search, scoring = rmse_score)
gsearch.fit(X_train_2o_solar_weeklyShift, y_train_2o_solar_weeklyShift)
best_score = gsearch.best_score_
best_model = gsearch.best_estimator_
y_true = y_test.values
y_pred = best_model.predict(X_test)
regression_results(y_true, y_pred)

And we did it again! The errors have been reduced further and R-squared has increased. We could keep adding more relevant features, but I guess you get the idea by now!

Feature Importance Plot
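
The plot can be regenerated with the same snippet as before, now pointed at the final model and the expanded feature set (assuming best_model and X_train_2o_solar_weeklyShift from the previous block are still in scope):

# re-plot feature importances for the final model
imp = best_model.feature_importances_
features = X_train_2o_solar_weeklyShift.columns
indices = np.argsort(imp)
plt.title('Feature Importances')
plt.barh(range(len(indices)), imp[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()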

As we rightly hypothesized, the value from day (t-7) is a much stronger predictor than the value from day (t-1).

Conclusion

In this article, we learned how to model time series data, conduct cross-validation on time series data, and fine-tune our model hyperparameters. We also successfully managed to reduce the RMSE from 85.61 to 54.57 for predicting power consumption.

In Part 3 of this series, we will be working on a case study analyzing the time series data generated by call centers, essentially working towards analyzing the (dreaded) increment in abandonment rates. Stay tuned…

Until next time :)
