Preprocessing Time Series Data for Supervised Machine Learning

Applying standard ML algorithms to time-series forecasting

Sijuade Oguntayo
Towards Data Science


Photo by Aron Visuals on Unsplash

We shall be exploring some techniques to transform Time Series data into a structure that can be used with the standard suite of supervised ML models.

Time Series vs Cross-Sectional Data

A time series is an ordered sequence of data points collected at regular intervals. One consequence of this ordering is that successive observations are potentially correlated with one another.

An example of time-series is the daily closing price of a stock. In this example, the observations are of a single phenomenon (stock prices) over a period of time. The time unit of observation of a time series could be daily, weekly, monthly, or yearly.

This is quite different from cross-sectional data, which represents phenomena observed at a single point in time. For example, in healthcare, data could be collected on the height, weight, blood pressure, and other health information for a sample of the population at one point in time. The order of the observations does not necessarily matter, and they are assumed to be independent of one another.

Supervised Machine Learning

Supervised learning is an approach to machine learning where the machine learns from labeled data. By feeding the learner with examples together with the true labels for those examples, the machine learns a mapping from input to output.

After this learning/training stage, samples not seen before by the learner are fed to the model and a prediction is made based on the mapping learned.

One classical example of supervised learning is predicting house prices. The type of input that may be fed to the learner includes details such as the square footage of the house, number of rooms, whether or not a garden is present, and the neighborhood the house is located in.

This approach differs from Unsupervised learning where the machine learns to find patterns in the data without guidance or labels. An example of this — Separating Netflix subscribers into clusters based on watch history and ratings data to better recommend similar content.

Unlike with cross-sectional data, traditional ML algorithms are not typically applied directly to time series. Here are some reasons why —

Non-Stationarity

Non-stationarity is when the statistical properties of a series (e.g. the mean, variance, and covariance) or the process generating it change over time. Non-stationary series are typically difficult to model and forecast, and since many statistical tools and processes require stationarity, such series usually need to be made stationary first to obtain meaningful results. A proven method of stationarizing a non-stationary series is differencing.
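As a quick illustration of both points, here is a minimal sketch (using a synthetic random-walk series as a stand-in for real data) that checks stationarity with the Augmented Dickey-Fuller test before and after first-order differencing:

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Synthetic non-stationary series (a random walk with drift) standing in
# for real data such as the solar output used later in this article
rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(0.1, 1.0, 500)))

def adf_pvalue(s):
    # p-value of the Augmented Dickey-Fuller test; values below ~0.05
    # suggest the series can be treated as stationary
    return adfuller(s.dropna())[1]

differenced = series.diff()  # first difference: y(t) - y(t-1)

print("raw series p-value:        ", adf_pvalue(series))       # large -> non-stationary
print("differenced series p-value:", adf_pvalue(differenced))  # small -> stationary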

Creative Commons License — https://commons.wikimedia.org/wiki/File:Stationarycomparison.png

Multicollinearity

Multicollinearity is when two or more independent variables in a regression model are highly linearly correlated. This does not typically reduce the predictive power of the model, but it violates a core assumption of linear regression, can make the coefficient estimates unstable, and may affect the statistical significance of the independent variables, leading to incorrect interpretation of the coefficients. Lagged values of the same series are usually strongly correlated with one another, so this issue arises naturally when lags are used as features. Multicollinearity, however, is really only a problem for parametric models like linear regression; non-parametric learners such as decision trees are largely unaffected, so it is often enough simply to be aware of it.
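A quick way to see this in practice is to build a few lagged copies of a series and look at their pairwise correlations (a small sketch, again using a synthetic series as a stand-in for real data):

import numpy as np
import pandas as pd

# Synthetic series standing in for the solar output used later
rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(0.1, 1.0, 500)))

# Three lagged copies of the series as candidate features
lagged = pd.concat({f"lag_{k}": series.shift(k) for k in (1, 2, 3)}, axis=1).dropna()

# The pairwise correlations are typically close to 1 -- exactly the
# multicollinearity issue described above
print(lagged.corr())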

Some advantages of using standard ML algorithms for time series include —

  • Many ML practitioners are already familiar with and experienced in standard algorithms like decision trees, ensemble tree-based models, and neural network regression.
  • Hyper-parameter optimization: the usual tooling, such as grid search and random search, can be applied directly.
  • Ease of feature engineering. Using an example we will look at later: if we wished to forecast the energy output of a solar farm at a specific time, we could engineer new features such as the season, and we could also include additional features such as forecasts of temperature and humidity at that time.

ARIMA

ARIMA is one example of a traditional method of forecasting time series. ARIMA stands for Auto-Regressive Integrated Moving Average and is divided into 3 parts —

  • AR(p) — The auto-regressive part; p is the number of lagged time periods to use. A p of 2 means we consider the two time steps before each observation as the explanatory variables for that observation in the autoregressive portion of the calculation. The observation itself becomes the target variable.
  • I(d) — For the integrated portion, d represents the number of differencing transformations applied to the series to turn a non-stationary time series into a stationary one. Differencing is explained in more detail further down.
  • MA(q) — A time series can be thought of as a combination of three components: trend, the gradual increase or decrease of the series over time; seasonality, a repeating short-term cycle in the series; and error, the noise not explained by trend or seasonality. The moving-average order q is the number of lags of the error component to consider.

AR and MA are quite similar; the difference is that AR considers lagged values of the series from previous time periods, while MA considers the errors from previous periods.
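For comparison with the supervised-learning approach below, here is what a minimal ARIMA fit might look like with statsmodels (the order (2, 1, 2) is an illustrative choice rather than a tuned one, and the synthetic series is a stand-in for real data):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic series standing in for the half-hourly solar output
rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(0.1, 1.0, 500)))

# ARIMA(p=2, d=1, q=2): two autoregressive lags, one round of differencing,
# and two lagged error terms
results = ARIMA(series, order=(2, 1, 2)).fit()

print(results.summary())
print(results.forecast(steps=48))  # forecast the next 48 half-hourly steps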

The processes described above are very similar to the operations we may apply to transform time series data into a form that may be fed into a supervised learning model —

Sliding Window

Given a time series, the observation at a particular time becomes the target (response) variable, and the specified lag determines how many values prior to that time period form the explanatory variables.

Solar Energy Time Series

Time series data of solar power output in GW measured at 30-minute intervals.

Sliding Window

Two features are created from the two time steps prior to each observation to form the explanatory variables.
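A minimal sketch of this windowing with pandas (the values and the column name output_gw are only stand-ins for the real solar series):

import pandas as pd

# Toy series standing in for the half-hourly solar output
df = pd.DataFrame({"output_gw": [0.0, 0.2, 0.5, 0.9, 1.2, 1.0, 0.6]})

# Sliding window of size 2: the two previous observations become the
# explanatory variables, and the current observation becomes the target
df["lag_2"] = df["output_gw"].shift(2)
df["lag_1"] = df["output_gw"].shift(1)
df["target"] = df["output_gw"]

print(df.dropna())  # rows without a full window are dropped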

Differencing

Differencing is the change from one period to the next; d is the number of such transformations required to stationarize a time series. If y(t) refers to the value of a time series y at time t, then the first difference of y at time t is y(t) − y(t−1). Higher-order differencing applies this operation repeatedly: the second difference, for example, is the first difference of the first-differenced series.

Differencing
First Difference

Rather than implementing these operations manually, we can use the Python library tsExtract (full disclosure: I authored this library).

Data Exploration

The data we’ll be working with is from the University of Sheffield. Solar data may be requested through the Sheffield Solar API. Data is available from 2015 to the present and is measured at 30-minute intervals.

pvLive API

In the above code, we install the pvlive library and, via the API, put in a request for data from the 21st of November 2014 until the 21st of November 2020.
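The embedded code is not reproduced here. Roughly, a request with the pvlive_api package published by Sheffield Solar might look like the sketch below; treat the method name and arguments as assumptions to be checked against the package documentation.

# pip install pvlive-api
from datetime import datetime
import pytz
from pvlive_api import PVLive

pvl = PVLive()

# Assumed interface: half-hourly GB PV output between two (UTC) datetimes,
# returned as a pandas DataFrame
data = pvl.between(
    start=datetime(2014, 11, 21, tzinfo=pytz.utc),
    end=datetime(2020, 11, 21, tzinfo=pytz.utc),
    dataframe=True,
)
print(data.head())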

Sheffield Solar Time Series

We notice a yearly seasonal pattern as well as an upward trend. The trend implies an increase in energy output over time, possibly due to an increase in the number of solar farms and/or more panels being added at existing farms. This is an example of a non-stationary series.

Sheffield Solar 2015

Here, we look only at the year 2015. We notice maximum output between April and August, which more or less coincides with late spring through to the end of summer.

Hourly Solar Output

Looking at a random day in January 2018, what we see is to be expected: solar output is present from about 8 am until about 4 pm and peaks around noon.

Build Features

pip install tsextract

tsExtract

The code above imports some functions from tsExtract that we’ll find useful for preprocessing the data for supervised learning. The features_request dictionary defines the features we wish to build from the time-series data —

  • window — We take a window of size 48. Since the observations are recorded at 30-minute intervals, this is a window of one day. The window size can typically be chosen by looking at repeating patterns in the data, or by taking a grid-search approach and trying different values to see which works best.
  • window_statistic — This function also takes a sliding window of size 48; a mean operation is then applied to collapse the data to 1-d. This is especially useful when all we want is a statistical summary of the windowed data rather than the raw values.
  • difference_statistic — The first value again refers to the window size, the second value refers to the order of differencing to be applied, and the third input is the statistic we wish to apply. To summarize, this takes a window of size 48, applies first-order differencing to that window, and then collapses the matrix to 1-d using the standard deviation along the second axis.

The arguments passed into the build_features function are the data to preprocess and the features_request dictionary defining the features to build.

Also passed in is the number of time steps in the future we wish to predict. A specified target_lag of 48 means that we would be training the model to predict 24 hours into the future.

Finally, we specify whether or not we want t_zero. t_zero refers to the present time: the point from which we both look back for the explanatory variables and look forward to the target variable.

Another benefit of having t_zero is that it may be used to calculate the difference between the target variable and t_zero. This difference could then be used as the new target variable. In that case, we would be training our model to predict the change in energy output between the present time and the target time rather than the raw energy values.
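The tsExtract call itself is in the linked notebook. Conceptually, the requested features, t_zero, and the lagged target amount to something like the following pandas operations (an illustrative sketch only, not the library’s actual API):

import pandas as pd

def build_features_sketch(series, window=48, target_lag=48):
    # window: the raw values of the previous `window` time steps
    windowed = pd.DataFrame({f"window_{k}": series.shift(k) for k in range(1, window + 1)})

    features = windowed.copy()

    # window_statistic: the mean of that window, collapsed to a single column
    features["window_mean"] = windowed.mean(axis=1)

    # difference_statistic: first-order differences within the window,
    # collapsed to a single column with the standard deviation (axis=1)
    diffs = pd.DataFrame({f"diff_{k}": series.diff().shift(k) for k in range(1, window + 1)})
    features["diff_std"] = diffs.std(axis=1)

    # t_zero (the present value) and the target 48 steps (24 hours) ahead
    features["t_zero"] = series
    features["target"] = series.shift(-target_lag)

    return features.dropna()

# usage: supervised_df = build_features_sketch(solar_series)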

build_df output
Scaling and train-test split

We apply a standard scaler and split the data 70–30 into train and test sets. This standardizes the features to have zero mean and unit variance.
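A sketch of this step with scikit-learn, assuming X and y are the feature matrix and target from the previous step (here the split is done without shuffling, so the test set is the most recent 30% of observations; the original notebook may differ):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in arrays; in practice X and y come from the feature-building step
X, y = np.random.rand(1000, 50), np.random.rand(1000)

# 70-30 split without shuffling, keeping the most recent observations for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)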

Modeling with Keras

We set up a two-layer neural network architecture in Keras, with dropout to reduce overfitting. The model is trained with a batch size of 32, using the Adam optimizer and MSE as the loss function. Training runs for 100 epochs; the test loss is shown after the sketch below.
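A minimal Keras sketch consistent with this description; the layer sizes and dropout rate are assumptions, and the exact architecture is in the linked notebook.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in arrays; in practice these come from the scaling step above
X_train, y_train = np.random.rand(1000, 50), np.random.rand(1000)

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),   # dropout to reduce overfitting
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1),       # single regression output
])

model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, batch_size=32, epochs=100, validation_split=0.1)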

493/493 [==============================] - 0s 781us/step - loss: 0.1598
Actual & Prediction Line Plot

A line plot of both the true and predicted values shows similar movements.

Actual vs Prediction Scatter Plot

The scatterplot shows a positive correlation between the actual and predicted values. This could be tighter.

Lagged Correlation

With time series, we don’t only consider error metrics like MSE, MAE, etc. This is because such metrics don’t capture lag. To get a better picture of how well our model is performing, we also look at Lag Correlation (Cross-Correlation) between the true and predicted values.

A stock price predictor model that correctly predicts the general movement of a stock price for example, but predicts sharp movements much later than the actual values would not be a useful model since it would lead to missed opportunities.

The plot below shows an example of this. A model that produces the results below might return a relatively low error. Such a model, however, would show a high lag correlation coefficient at a lag of 5.

Example — Lag Correlation

To check for this, we can compute the lagged correlation between the true and predicted values. Specifically, we’ll be using Spearman’s correlation coefficient. This is calculated as —

corr(∂) = cov(y(t), ŷ(t+∂)) / (σ_y · σ_ŷ)

where —

  • ∂ represents the lag value
  • cov() represents the covariance of two vectors
  • σ represents the standard deviation
  • y(t) is the vector of true values, and ŷ(t+∂) is the vector of predicted values shifted by ∂ time periods
  • with Spearman’s coefficient, the covariance and standard deviations are computed on the ranks of the values rather than the raw values
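A small sketch of how this lagged correlation can be computed with scipy (the maximum lag of 20 is an arbitrary choice):

import numpy as np
from scipy.stats import spearmanr

def lagged_spearman(y_true, y_pred, max_lag=20):
    # For each lag, compare the prediction at time t with the true value
    # at time t + lag; lag 0 is the ordinary Spearman correlation
    coefficients = []
    for lag in range(max_lag + 1):
        if lag == 0:
            corr, _ = spearmanr(y_true, y_pred)
        else:
            corr, _ = spearmanr(y_true[lag:], y_pred[:-lag])
        coefficients.append(corr)
    return np.array(coefficients)

# usage: lag_corrs = lagged_spearman(y_test, predictions)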
Lag Correlation Coefficient

We notice a high correlation coefficient at lag 0, which is ideal. We also notice a gradual reduction as the lag value increases. Ideally, we would like to see a sharp drop after lag 0, as that would give us more confidence that the predictions are aligned in time rather than trailing the true values. More could be done to improve on this model.

Predicting the next 24 hours

At the moment, we have solar output data up until the 21st of November 2020. Since we built our dataset using a target lag of 48 time steps (24 hours), this means we only have feature-label pairs for data up until the 20th of November 2020.

While we have features up until the 21st, we only have labels up until the 20th, exactly 24 hours prior. For this reason, we can make predictions for the next 24 hours, but we do not yet have the actual values for that period and will need to request the real values later to make comparisons.

Forecasting the next 24 hours
Line Plot

The model, it appears, has learned to follow the general movement of solar energy production during the day, though it tends to overestimate the output.

Scatterplot

We see this in the scatter plot as well, with the predictions taking higher values than the true values.

What’s Next

We’ll look into further techniques for improving the performance and reducing the lag of the model in a future article. We will also look into different ways to deploy the model and make predictions on real-time data.

Here are some further techniques we could try —

  • Playing around with the windowing and differencing values. One way to determine the right values is to create visualizations exploring patterns in the data. Another way would be to simply try different values and pick whichever works best.
  • Engineer additional features. tsExtract offers a wide range of different feature engineering possibilities including statistical summaries using temporal & spectral functions, for example, absolute energy, zero-crossing rate, and spectral centroid. Also included are momentum & force terms which represent first and second-order differencing.
  • With regards to modeling, possible options are to try out different architectures by varying the number of layers and/or the number of neurons, or ensembling different models.
  • Implement a difference network. This is where we subtract t_zero from the response variable and train our model to predict that instead. This extra level of differencing works particularly well with difference features (where we subtract consecutive observations, as discussed earlier). The prediction from the model may then be added back to t_zero for a final output, as sketched below.
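A rough sketch of that last idea (the variable names are placeholders):

import numpy as np

# Stand-in arrays; in practice t_zero and target come from the feature-building step
t_zero = np.random.rand(1000)   # value at the present time step
target = np.random.rand(1000)   # value 48 steps (24 hours) ahead

# Train the model on the change rather than the raw value
diff_target = target - t_zero
# model.fit(X_train, diff_target, ...)

# At prediction time, add t_zero back to recover the raw forecast
# forecast = model.predict(X_new).ravel() + t_zero_new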

The Jupyter Notebook used may be found here.

References

Yagcioglu S (2020) Classical Examples of Supervised vs. Unsupervised Learning in Machine Learning https://www.springboard.com/blog/lp-machine-learning-unsupervised-learning-supervised-learning/

Brownlee J (2016) Time-Series Forecasting as Supervised Learning https://machinelearningmastery.com/time-series-forecasting-supervised-learning/

Ching Nok Yee, Nancy Zhang, Rosana de Oliveira Gomes, Sijuade Oguntayo, Valerie Koh Hui Yi Wind Energy Trade with Deep Learning — Time Series Forecasting https://towardsdatascience.com/wind-energy-trade-with-deep-learning-time-series-forecasting-580bd41f163

White S (2020) Mind-Blowing Information on Cross-sectional data https://www.allassignmenthelp.com/blog/cross-sectional-data/

Iordanova T (2020) An Introduction to Stationary and Non-Stationary Processes https://www.investopedia.com/articles/trading/07/stationary.asp

Frost J (2017) Multicollinearity in Regression Analysis: Problems, Detection, and Solutions https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/

Chen J (2019) Autoregressive Integrated Moving Average (ARIMA) https://www.investopedia.com/terms/a/autoregressive-integrated-moving-average-arima.asp

Leloux J, Taylor J, Moreton, Desportes A (2015) Monitoring 30,000 PV systems in Europe: Performance, Faults, and State of the Art https://www.researchgate.net/publication/283211424_Monitoring_30000_PV_systems_in_Europe_Performance_Faults_and_State_of_the_Art

Kohzadi N, Boyd M S, Kermanshahi B, Kaastra I (1996) A comparison of artificial neural network and time series models for forecasting commodity prices https://www.sciencedirect.com/science/article/abs/pii/0925231295000208
