Time series forecasting with 2D convolutions

Johnny Wales
Towards Data Science
13 min read · Aug 19, 2020


In this article, I will show you a time series forecasting method I haven’t seen documented elsewhere. I doubt it is a new method, but since I haven’t seen a great article on it, here it is.

The Dataset

The data I used for this project is the data from the Global Energy Forecasting competition, put on by my hometown university, UNC Charlotte. You can find more about it here: http://www.drhongtao.com/gefcom/2017

What you need to know is that the data consists of various readings from an energy grid. Our target is to forecast real-time energy demand for the grid using these data points. The data points also include dew point and dry bulb temperature, since air conditioning is a huge driver of energy consumption.

Our target variable is RTDemand: Real Time energy demand for the energy grid we are working with. The data has clear daily cycles. Here are three days of our data:

Hourly for 3 days

In the middle of the night when the sun is down and everyone is asleep, our power consumption reaches a minimum. We wake up in the morning, head off to work, and our power consumption reaches its maximum as the sun reaches peak intensity. I think the daily dips correspond to commuting times.

If we zoom out a little more, we can see clear auto-correlation and trends in days, just as you see in weather. Here’s about 3 weeks of data:

3 Weeks of the target variable

We can also notice a larger seasonal trend if we zoom out even more and look at the data over an entire year:

This is a very good dataset for time series forecasting.

Single variable pure time series models

For time series forecasting, we will need to make time sequences leading up to a target outcome. In our examples, I am choosing 72 hours as the length of the time sequence. What this means is the input to our model is 72 individual numbers representing the last 72 hours of data, and the target output we want from our model is its forecast for the 73rd hour. I thought 72 hours was a good length because it can capture local trends and the day/night cycle well.

Let me show you what I mean by time sequences. Here is our first 3 days of X, our input to the model:

array([[12055., 11430., 10966., 10725., 10672., 10852., 11255., 11583.,
12238., 12877., 13349., 13510., 13492., 13314., 13156., 13364.,
14632., 15653., 15504., 15088., 14579., 13882., 12931., 11883.,
10978., 10406., 10089., 9982., 10031., 10289., 10818., 11444.,
12346., 13274., 13816., 14103., 14228., 14154., 14055., 14197.,
15453., 16531., 16410., 15954., 15337., 14347., 13178., 12106.,
11400., 11059., 10959., 11073., 11485., 12645., 14725., 15863.,
16076., 16222., 16358., 16362., 16229., 16123., 15976., 16127.,
17359., 18818., 18724., 18269., 17559., 16383., 14881., 13520.],
[11430., 10966., 10725., 10672., 10852., 11255., 11583., 12238.,
12877., 13349., 13510., 13492., 13314., 13156., 13364., 14632.,
15653., 15504., 15088., 14579., 13882., 12931., 11883., 10978.,
10406., 10089., 9982., 10031., 10289., 10818., 11444., 12346.,
13274., 13816., 14103., 14228., 14154., 14055., 14197., 15453.,
16531., 16410., 15954., 15337., 14347., 13178., 12106., 11400.,
11059., 10959., 11073., 11485., 12645., 14725., 15863., 16076.,
16222., 16358., 16362., 16229., 16123., 15976., 16127., 17359.,
18818., 18724., 18269., 17559., 16383., 14881., 13520., 12630.],
[10966., 10725., 10672., 10852., 11255., 11583., 12238., 12877.,
13349., 13510., 13492., 13314., 13156., 13364., 14632., 15653.,
15504., 15088., 14579., 13882., 12931., 11883., 10978., 10406.,
10089., 9982., 10031., 10289., 10818., 11444., 12346., 13274.,
13816., 14103., 14228., 14154., 14055., 14197., 15453., 16531.,
16410., 15954., 15337., 14347., 13178., 12106., 11400., 11059.,
10959., 11073., 11485., 12645., 14725., 15863., 16076., 16222.,
16358., 16362., 16229., 16123., 15976., 16127., 17359., 18818.,
18724., 18269., 17559., 16383., 14881., 13520., 12630., 12223.]
])

Each number in the array is a reading of RTDemand: how many megawatts of power the grid required for that hour. Each of the three big arrays holds 72 hours' worth of data. If you look carefully at the first 8 or so readings in each of these three arrays of 72, you'll notice that each new set of 72 is the same series shifted forward by one hour. So each one of these 72-length input arrays represents the last 72 hours of real-time demand readings for this energy grid.

We then want to forecast the 73rd hour, so our y array will look like this:

array([[12630.],
[12223.],
[12070.]])

Notice that the final entry in the second X array above is also the first entry here in our y, and the final entry in the third X array is the second entry in our y. With the first X array, then, we're trying to predict 12630, the first value in the y series, which then rolls in as the final reading of the second X window.
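
If you're wondering how those windows get built, here is a minimal sketch of the idea. The repo wraps this step in a window_data helper whose exact signature differs, and rtdemand_values below is just an assumed 1D array of the hourly RTDemand readings:

import numpy as np

def make_windows(series, window_size=72):
    # Slide a window over the series: each X row is 72 consecutive hourly readings,
    # and the matching y value is the reading for the following (73rd) hour.
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i : i + window_size])
        y.append(series[i + window_size])
    return np.array(X), np.array(y).reshape(-1, 1)

# rtdemand_values: a 1D array of hourly RTDemand readings (name is illustrative)
X, y = make_windows(rtdemand_values, window_size=72)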

Data transformations

Once we have our data loaded and windowed, we next need to transform it into a proper set for training machine learning models. First, I’ll be scaling all input variables. Later we will look at using all 12 inputs of the dataset, but for now I’ll introduce the idea with just 1 variable. I will not be scaling my target variable, my Y, because I feel it makes monitoring the progress of the model much easier for minimal cost. Next, we’ll be splitting this into a train/test split:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)

split = int(0.8 * len(X))
X_train = X[: split - 1]
X_test = X[split:]
y_train = y[: split - 1]
y_test = y[split:]

Finally, our shapes are a little off. The input to the models we’ll be working with is (Samples, Timesteps, Features). In this first model, we’re only using the time-windowed target variable as our input. So, we only have 1 feature. Our X_train.shape is currently (Samples, Timesteps). We’ll reshape it now, after the train/test split above:

X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
X_train.shape
(61875, 72, 1)

That is 61,875 samples, each made of 72 hourly readings, and 1 feature. Ready to rock.

Benchmark model:

First, a benchmark. Our optimization metric will be mean squared error (and its square root, RMSE), so we first need a sense of what a good vs. bad reading of that number looks like. We'll also track R², but if the two ever disagree, mean squared error remains our loss function and optimization target.

For the benchmark model, we’ll see what reading we get for mean squared error and R². The benchmark model here will be to guess the previous value in our time series. Here’s some code to come up with a quick read on this model:

# Benchmark model: guess the previous hour's value
prev_val = y_test[0]
sse = 0
for n in range(0, len(y_test) - 1):
    err = y_test[n] - prev_val
    sq_err = err ** 2
    sse = sse + sq_err
    prev_val = y_test[n]
mse = sse / n
mse

With our test dataset, this produces an answer of 411,577.17, and the square root of that is 641.54. One way to interpret this is that, on average, this benchmark model is off by 641.54 megawatts for a given hour. Here is a graph of the benchmark model vs. the real results.
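
For reference, the same persistence benchmark can be written in a couple of NumPy lines. This is a sketch: np.diff averages over all len(y_test) - 1 hour-to-hour differences, so it gives a nearly identical (though not bit-for-bit identical) number to the loop above.

import numpy as np

# Persistence benchmark, vectorized: each hour is "predicted" by the previous hour's value.
diffs = np.diff(y_test.ravel())
benchmark_mse = np.mean(diffs ** 2)
benchmark_rmse = np.sqrt(benchmark_mse)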

This won’t be an easy model to beat, even though it is quite simple.

1-Variable LSTM model:

Now that we have our dataset set up, we can start working with machine learning models to make our forecast.

One common way to forecast time series is LSTM models. This will provide a good benchmark learned model to compare with our convolved models. Here’s some code for setting up and predicting with an LSTM model:

import tensorflow as tf
from tensorflow import keras

def basic_LSTM(window_size=5, n_features=1):
    new_model = keras.Sequential()
    new_model.add(tf.keras.layers.LSTM(100,
                                       input_shape=(window_size, n_features),
                                       return_sequences=True,
                                       activation='relu'))
    new_model.add(tf.keras.layers.Flatten())
    new_model.add(tf.keras.layers.Dense(1500, activation='relu'))
    new_model.add(tf.keras.layers.Dense(100, activation='linear'))
    new_model.add(tf.keras.layers.Dense(1))
    new_model.compile(optimizer="adam", loss="mean_squared_error")
    return new_model

ls_model = basic_LSTM(window_size=window_size, n_features=X_train.shape[2])
ls_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm (LSTM) (None, 72, 100) 40800
_________________________________________________________________
flatten (Flatten) (None, 7200) 0
_________________________________________________________________
dense (Dense) (None, 1500) 10801500
_________________________________________________________________
dense_1 (Dense) (None, 100) 150100
_________________________________________________________________
dense_2 (Dense) (None, 1) 101
=================================================================
Total params: 10,992,501
Trainable params: 10,992,501
Non-trainable params: 0
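
The fit call itself isn't shown in the listing above; something along these lines works. The epoch count, batch size, and validation split here are illustrative choices, not the exact settings behind the reported results:

# Illustrative training call; hyperparameters are assumptions, not the article's exact values.
ls_model.fit(X_train, y_train,
             epochs=10,
             batch_size=128,
             validation_split=0.1,
             verbose=1)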

After training, we can evaluate the model:

ls_model.evaluate(X_test, y_test, verbose=0)
1174830.0587427279
from sklearn.metrics import r2_score
predictions = ls_model.predict(X_test)
test_r2 = r2_score(y_test, predictions)
test_r2
0.8451637094740732

The results we get are Ok, but not stellar. Specifically, we wind up with a higher error than our previous benchmark model. Here’s the graph to get an idea of what it was predicting:

As you can see, it captures a decent amount of the variability, but the outcome is still not great. The biggest issue seems to be relatively large errors around the morning 'dip'. I also found this model to be very unreliable: its loss frequently collapsed to NaN during training, producing no usable output.

Introducing a 1D convolution method

Another method for forecasting time series is a 1D convolution model. A 1D convolution uses a filter window and slides that window across your data to produce a new output. Depending on the learned parameters of the convolution filters, they can act like moving averages, direction indicators, or detectors of patterns across time. Let me explain the method with some images.

Step 1

Here we have a dataset with 8 elements and a filter of size 4. The four numbers in the filter are the parameters learned by a Conv1D layer. In the first step, we multiply each element of the filter by the corresponding element of the input data and add up the results to produce one convolved output.

Step 2

In the second step of a convolution, the window is moved over by one and the same process is repeated to produce a second output.

Last step in 1D convolution

This process continues until the window hits the end of the input data. In our case, one input sequence is the 72 hours of data we set up previously. If we add the option padding="same", the input is padded with zeros at the beginning and end so that the output length equals the input length. The demonstration above uses a linear activation, meaning that last multi-colored array is our output. However, you can use a whole host of activation functions here, which run each of these numbers through an additional step. So, in our example below, a ReLU activation is applied to this last output to produce the final output.
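
To make the sliding multiply-and-sum concrete, here is the same idea written out directly in NumPy for a toy 8-element input and a 4-element filter (made-up numbers, no padding, linear activation):

import numpy as np

data = np.array([1., 2., 3., 4., 5., 6., 7., 8.])   # toy 8-element input
filt = np.array([0.25, 0.25, 0.25, 0.25])           # toy 4-element filter (acts like a moving average)

outputs = []
for i in range(len(data) - len(filt) + 1):
    window = data[i : i + len(filt)]        # the filter's current stop
    outputs.append(np.sum(window * filt))   # multiply element-wise, then add up

print(np.array(outputs))   # [2.5 3.5 4.5 5.5 6.5] -- 5 outputs from 8 inputs with a width-4 filter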

Here is the code for setting up and running a 1D convolution, given the data setup we previously described:

def basic_conv1D(n_filters=10, fsize=5, window_size=5, n_features=2):
    new_model = keras.Sequential()
    new_model.add(tf.keras.layers.Conv1D(n_filters, fsize, padding="same", activation="relu",
                                         input_shape=(window_size, n_features)))
    # Flatten will take our convolution filters and lay them out end to end
    # so our dense layer can predict based on the outcomes of each
    new_model.add(tf.keras.layers.Flatten())
    new_model.add(tf.keras.layers.Dense(1800, activation='relu'))
    new_model.add(tf.keras.layers.Dense(100))
    new_model.add(tf.keras.layers.Dense(1))
    new_model.compile(optimizer="adam", loss="mean_squared_error")
    return new_model

Here’s what it looks like with our dataset:

univar_model = basic_conv1D(n_filters=24, fsize=8, window_size=window_size, n_features=X_train.shape[2])
univar_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d (Conv1D) (None, 72, 24) 216
_________________________________________________________________
flatten_1 (Flatten) (None, 1728) 0
_________________________________________________________________
dense_3 (Dense) (None, 1800) 3112200
_________________________________________________________________
dense_4 (Dense) (None, 100) 180100
_________________________________________________________________
dense_5 (Dense) (None, 1) 101
=================================================================
Total params: 3,292,617
Trainable params: 3,292,617
Non-trainable params: 0

Notice here I have 24 convolution filters, each with a filter size of 8. So, in our case, the input is the 72-hour window, each filter slides a window of size 8 over it, and there are 24 of those filters. Because I used padding="same", the output of each of those filters is 72 wide, just like our input data, so we get 24 convolved arrays. Flatten then lays these out into a single array of length 72 * 24 = 1,728. To continue our sample convolution from above, it would look like this:

How flatten works
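
To make that 72 * 24 arithmetic concrete, here is a tiny shape check (purely illustrative, using a zero array as a stand-in for one sample's convolution output):

import numpy as np

conv_out = np.zeros((1, 72, 24))                     # stand-in for one sample: 72 timesteps x 24 filters
flat = conv_out.reshape((conv_out.shape[0], -1))
print(flat.shape)                                    # (1, 1728) -- the 24 convolved arrays laid end to end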

So, how does this method work vs. an LSTM and a dumb model?

Ok, so it worked a bit better than the LSTM, but it still isn’t up to par with the original benchmark model of ‘just guess the previous value’. When we look at the graph, we can see a clear bias in this model:

1-variable Conv1D

Adding more data

In our case, we’re only using the feature we want to predict as our input variable. However, our dataset came with 12 possible input variables. We can stack up all of our input variables and use them together to make a prediction. Since many of the input variables are moderately to strongly correlated to our output variable, it should be possible to make a better prediction with more data, right? Well, there’s a bit of a problem with that in 1D convolutions.

Multivariate Conv1D

I believe this is what is going on when a 1D convolution window is applied to a multi-series input: the filter spans every input series at once, so each output is a single weighted sum across all of the features. If that is right, then adding more series would tend to 'blur out' the impact of any one particular input changing, and could actually produce a less accurate model.
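
You can see this directly by inspecting the kernel of a Conv1D layer built for 12 stacked series: in Keras the kernel has shape (filter_width, input_channels, n_filters), so every filter covers all 12 features at once and collapses them into one weighted sum per time step. A quick illustrative check:

import tensorflow as tf

probe = tf.keras.layers.Conv1D(filters=24, kernel_size=8, padding="same")
probe.build(input_shape=(None, 72, 12))   # 72-hour window, 12 stacked series

kernel, bias = probe.get_weights()
print(kernel.shape)   # (8, 12, 24): each of the 24 filters spans 8 timesteps AND all 12 features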

If I want to stack a different data series into my model, I first have to run it through the same windowing process to produce a set of observations which each contain the last 72 readings of the variable. So, for instance, if I wanted to add in the column 1 variable, DADemand (day ahead demand, the demand from the previous day at this time), I would do the following to it:

(DADemand, _) = window_data(gc_df, window_size, 1, 1)

scaler = StandardScaler()
DADemand = scaler.fit_transform(DADemand)

split = int(0.8 * len(X))
DADemand_train = DADemand[: split - 1]
DADemand_test = DADemand[split:]

DADemand_test.shape
(61875, 72, 1)

Then, I can continue that process for all 12 of my variables, and stack them up into a single set like this:

data_train = np.concatenate((X_train, db_train, dew_train, DADemand_train, DALMP_train, DAEC_train,
                             DACC_train, DAMLC_train, RTLMP_train, RTEC_train, RTCC_train, RTMLC_train), axis=2)
data_test = np.concatenate((X_test, db_test, dew_test, DADemand_test, DALMP_test, DAEC_test,
                            DACC_test, DAMLC_test, RTLMP_test, RTEC_test, RTCC_test, RTMLC_test), axis=2)

data_train.shape
(61875, 72, 12)

So that is 61,875 examples, each containing 72 hourly readings of 12 different time series. We'll now run this through a Conv1D net and see what results we get. If you take a peek back at our function for creating these models, you'll notice that one of its arguments is the number of features, so running this (and an LSTM) on the stacked data was as straightforward as creating a model with the existing code, pointing it at our new stacked data, then fitting, evaluating, and graphing; a sketch is below.
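
Something along these lines, with the training settings being illustrative rather than the exact values behind the reported numbers:

# Same constructor as before, now pointed at the stacked 12-feature data.
multivar_model = basic_conv1D(n_filters=24, fsize=8,
                              window_size=window_size,
                              n_features=data_train.shape[2])
multivar_model.fit(data_train, y_train, epochs=10, batch_size=128, verbose=1)
multivar_model.evaluate(data_test, y_test, verbose=0)

Here's how it all worked out: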

Yowza!

As expected, our performance here actually declined with the additional variables, which I believe is due to the blurring effect. As I sat upon a midnight dreary and pondered, weak and weary, I found a solution to this problem.

2D Convolutions

Ok, so what we need here is a convolution window that looks through our features and figures out which features are the good ones. It basically needs to look like this:

What I need

After doing some research, I found this shape can be achieved with a 2D convolution window shaped as (1, filter_size); in the image above, filter_size=3. Back in our energy forecasting problem, we have 12 features. In order to feed the data to a 2D convolution layer, it needs 4 dimensions: (samples, timesteps, features, channels). We can add the extra channel dimension with:

data_train_wide = data_train.reshape((data_train.shape[0], data_train.shape[1], data_train.shape[2], 1))
data_test_wide = data_test.reshape((data_test.shape[0], data_test.shape[1], data_test.shape[2], 1))
data_train_wide.shape
(61875, 72, 12, 1)

I did some testing of various shapes for this 2D window and found that doing 2 features at a time worked best:

def basic_conv2D(n_filters=10, fsize=5, window_size=5, n_features=2):
    new_model = keras.Sequential()
    new_model.add(tf.keras.layers.Conv2D(n_filters, (1, fsize), padding="same", activation="relu",
                                         input_shape=(window_size, n_features, 1)))
    new_model.add(tf.keras.layers.Flatten())
    new_model.add(tf.keras.layers.Dense(1000, activation='relu'))
    new_model.add(tf.keras.layers.Dense(100))
    new_model.add(tf.keras.layers.Dense(1))
    new_model.compile(optimizer="adam", loss="mean_squared_error")
    return new_model

m2 = basic_conv2D(n_filters=24, fsize=2, window_size=window_size, n_features=data_train_wide.shape[2])
m2.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 72, 12, 24) 72
_________________________________________________________________
flatten_4 (Flatten) (None, 20736) 0
_________________________________________________________________
dense_12 (Dense) (None, 1000) 20737000
_________________________________________________________________
dense_13 (Dense) (None, 100) 100100
_________________________________________________________________
dense_14 (Dense) (None, 1) 101
=================================================================
Total params: 20,837,273
Trainable params: 20,837,273
Non-trainable params: 0
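
Training and scoring follow the same pattern as before (the settings here are illustrative, not the exact values behind the reported results):

from sklearn.metrics import r2_score

# Illustrative training call on the widened 4D data.
m2.fit(data_train_wide, y_train, epochs=10, batch_size=128, verbose=1)
m2.evaluate(data_test_wide, y_test, verbose=0)

predictions = m2.predict(data_test_wide)
test_r2 = r2_score(y_test, predictions)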

Ok, so the model is quite huge. It took about 4 minutes per epoch to train on my regular CPU. When it was done, though, I evaluated it and graphed it, and my friend it paid out like the lottery:

YOWZA

And how does this model compare to our previous models?

The proof in the pudding

So this model far outperformed all previous models and beat our benchmark ‘dumb model’ by a giant margin.

But wait, there’s more!

Ok, so just as a bonus model you didn’t think you’d get: What if we used a similar idea, but also convolved over 8 hours at a time with a filter shape of (8,1)?

Here’s the code for that next deeper layer:

def deeper_conv2D(n_filters=10, fsize=5, window_size=5, n_features=2, hour_filter=8):
    new_model = keras.Sequential()
    new_model.add(tf.keras.layers.Conv2D(n_filters, (1, fsize), padding="same", activation="linear",
                                         input_shape=(window_size, n_features, 1)))
    new_model.add(tf.keras.layers.Conv2D(n_filters, (hour_filter, 1), padding="same", activation="relu"))
    new_model.add(tf.keras.layers.Flatten())
    new_model.add(tf.keras.layers.Dense(1000, activation='relu'))
    new_model.add(tf.keras.layers.Dense(100))
    new_model.add(tf.keras.layers.Dense(1))
    new_model.compile(optimizer="adam", loss="mean_squared_error")
    return new_model
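
Building the two-layer model looks like this (the variable name deep_model is mine; training and evaluation then follow exactly the same pattern as m2 above):

# (1, fsize) convolves across pairs of features; (hour_filter, 1) then convolves over 8 hours at a time.
deep_model = deeper_conv2D(n_filters=24, fsize=2,
                           window_size=window_size,
                           n_features=data_train_wide.shape[2],
                           hour_filter=8)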

This model performed very well too:

Based on our loss metric/optimization target, this model performed better than any other:

Our final results!

You can find all the code for this, as well as the sample dataset, at this location: https://github.com/walesdata/2Dconv_pub

You can find out more about me and what I’m up to here: https://www.linkedin.com/in/john-wales-62832b5/
