
Understanding RNNs (Recurrent Neural Networks)

The Neural Network That Can Remember The Past

Photo by Gemma Evans on Unsplash

The first time I heard of an RNN (Recurrent Neural Network), I was perplexed. The article I read claimed that an RNN is a neural net with memory – that it could remember the sequential ups and downs of the data in order to make more informed predictions.

My first thought back then was – how is an RNN different from a linear regression with many lags (an autoregressive model)? It turns out an RNN is not only a lot different but also more versatile and more powerful.

But before we add it to our forecasting toolkit, we should do our best to develop an intuitive understanding of how it works – starting with how an RNN is able to remember the past. Let’s find out.


Sequential Data

An RNN is a neural network that works best on sequential data. If you are unfamiliar with neural nets, then you should start with my Understanding Neural Networks post. Going forward in this article, I will assume that the reader has a basic understanding of what a neural net is and how one works.

What’s sequential data? It is data where the order matters. Some examples of sequential data include stock prices and interest rates (ordered by time), the words in a blog post (the order of the words conveys context and meaning), or the average temperature for each day (ordered by time).

An example of sequential data (Apple stock price)

Usually with sequential data we want to predict what’s coming next. For example, we might want to forecast what the temperature will be tomorrow or whether a stock’s price will be higher or lower next month.


Simple Linear Regression Forecast

The simplest forecast that I can think of is an AR(1) model (an autoregressive model with a single lag) where you simply use the previous observation to attempt to forecast the next one. Let’s use stock returns as our example:

  • We want to forecast Ret, Apple’s return over the next month. This is our target variable.
  • Our lone feature variable is the most recent month’s return.
  • Our dataset consists of a 20 year time series of Apple’s monthly stock returns (calculated as the percentage change in price of Apple stock from the last day of the previous month to the last day of the current month).
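As a quick illustration of that return calculation, here is a sketch using made-up month-end prices (not real Apple data):

```python
import numpy as np

# Hypothetical month-end closing prices (not real Apple data)
month_end_prices = np.array([100.0, 105.0, 102.9, 110.1])

# Monthly return = percentage change from one month-end to the next
monthly_returns = month_end_prices[1:] / month_end_prices[:-1] - 1
# First two returns: +5% and -2%
```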

An AR(1) model would have the following equation:

Predicted_Ret(t) = m*Ret(t-1) + B

This should look familiar as it’s the equation for a line (Y = mX + B). The picture below depicts our AR(1) model in more detail. Let’s go through it piece by piece:

  • We use our data to estimate the optimal values of the parameters in green – m is the slope of our line and B is the intercept.
  • Notice that the equation on the left produces an output with a hat symbol (^). The hat denotes that the output is merely an estimate of our target variable, the actual stock return.
  • Our goal is to minimize the differences between the forecasted return (^) and the actual return.
  • Notice that horizontally Ret(0), the first month return, lines up with Ret(1), the second monthly return, and Ret(1) lines up with Ret(2) and so on. This is what we mean when we say that we use the previous observation (our feature) to forecast the next one (our target) – given last month’s return and our estimated values of m and B, we can produce a forecast of next month’s return.
AR(1) model
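In code, estimating m and B for an AR(1) model is just a linear regression of each return on the prior month’s return. A minimal sketch, using simulated returns as a stand-in for the real Apple data:

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated stand-in for 20 years (240 months) of monthly returns
returns = rng.normal(loc=0.01, scale=0.05, size=240)

X = returns[:-1]  # feature: Ret(t-1), last month's return
y = returns[1:]   # target:  Ret(t), the next month's return

# Least-squares fit of y = m*X + B
m, B = np.polyfit(X, y, 1)
predicted = y_hat = m * X + B  # the "hat" forecasts
```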

You don’t need to have been around the market for very long to know that this is not a good model. Stock prices can move for various reasons (company fundamentals, economic shocks, investor fear/euphoria) or sometimes no reason at all. So we shouldn’t expect a simple AR(1) model to do a good job.

Rather we should look further beyond just the most recent observation and consider the entire sequence (really we should be considering data beyond price movements, but that is a story for another day). RNNs allow us to do just that.


Considering The Entire Sequence Using RNNs

Before we dive into RNNs, one question an alert reader might have is "Why not just increase the number of lags in our autoregressive model?" That is, instead of using a single lag, why not use something like AR(20), which forecasts next month’s return using the 20 most recent monthly returns? The answer is twofold:

  1. The number of monthly returns to use for our model is actually an important parameter that we need to tune. Choosing the wrong number could lead to poor performance. Moreover, depending on the economic regime, the optimal number of monthly returns to use can potentially vary widely. An RNN gets around this because it can see the entire available history of returns – and more importantly, it automatically decides at each point in time how much weight to give to the historical returns. So we don’t need to tell an RNN to look at the previous 5, 10, or 20 returns because it inherently already knows to do so.
  2. Autoregressive models are linear models and thus assume a linear relationship between our features and the target. In situations where there is non-linearity, this could cause performance issues. RNNs, especially when stacked on more RNNs or on dense layers (a dense layer is a layer of normal neural net neurons), can detect and capture nonlinear relationships in our data. In fact, with RNNs (and neural nets in general) we should worry more about over-fitting rather than under-fitting.
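To make the first point concrete, here is a sketch of how an AR(p) feature matrix would be built – the lag count p is a choice we have to tune by hand (the helper function below is mine, not a library routine):

```python
import numpy as np

def make_ar_features(returns, p):
    """Build an AR(p) design matrix: each row holds the p
    most recent returns used to predict the next one."""
    n = len(returns)
    X = np.column_stack([returns[i:n - p + i] for i in range(p)])
    y = returns[p:]
    return X, y

# Every choice of p gives a different model we must evaluate separately
X5, y5 = make_ar_features(np.arange(240.0), 5)
X20, y20 = make_ar_features(np.arange(240.0), 20)
```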

What Does It Mean For A Model To Have Memory?

Let us recreate our AR(1) model using a neural net neuron. Nothing complicated here – recall that a single neuron (if you are unfamiliar with neurons, please take a second and read my previous blog on neural nets) takes an input, multiplies it by a weight (m), and adds a bias (B) to it. Those are exactly the operations of a single-variable linear regression, which is what an AR(1) model is. Note that we are neglecting the activation function here to simplify the explanation.

AR(1) via a neural net neuron
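The diagram above amounts to a one-line function – a single linear neuron (activation omitted) is just weight times input plus bias, exactly an AR(1) step. The parameter values here are arbitrary placeholders:

```python
def neuron(x, m=0.1, B=0.002):
    # weight * input + bias -- the same operation as AR(1)
    return m * x + B

# Forecast next month's return from last month's 5% return
forecast = neuron(0.05)
```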

Now let’s think about how we can add memory to this model. What is memory in the case of a quantitative model? There’s no 100% correct answer, but in my opinion, memory is the ability to draw from relevant past experiences in order to aid decision making. In modeling terms, we want the model to be dynamic – in other words, we want it to be able to shift according to its read of the situation (based on its past experiences).

Our AR(1) model might be trained with historical data (recall that we fed it the last 20 years of Apple’s monthly stock returns), but it definitely doesn’t have memory. The AR(1) model equally weights each data point that we give it when estimating the regression parameters m and B. Thus, it is not able to decide which data points are more relevant and which are less (the regression parameters are static, once estimated). It is static, not dynamic.

Where Does The Memory Come From?

Finally it’s time to give our model memory. It actually requires just a simple trick. Notice the additions at the bottom in the picture below:

Super simple RNN with memory

The key addition is that we are now taking in the previous output, Output(t-1), multiplying it against a new parameter u, and adding it to what we previously had. So our updated equation looks like:

Predicted_Ret(t) = u*Predicted_Ret(t-1) + m*Ret(t-1) + B

Or with a more general notation:

Output(t) = u*Output(t-1) + m*Ret(t-1) + B

So why exactly does this constitute memory? To see why, we need to jump ahead a bit and first understand how an RNN analyzes data. Here’s some Python pseudocode for how an RNN generates its predictions:

# Our input data is 20 years of Apple's monthly returns
inputs = appl_monthly_returns
time_steps = range(inputs.shape[0])
# List to store outputs
predictions = []
# Initialize state to 0 (state is output[t-1] from above)
state = 0
# Initialize u, m, and B
u = 1
m = 1
B = 1
for t in time_steps:
    input_t = inputs[t]
    current_prediction = (u * state) + (m * input_t) + B
    predictions.append(current_prediction)
    state = current_prediction

    # Function that updates u, m, and B via backpropagation through time
    u, m, B = BPTT(state, predictions, inputs)

OK, now let’s walk through the code:

  1. First we initialize state, which is what I am calling Output(t-1), to 0 because there is no previous state at t=0.
  2. Then we start the loop. The loop will run for as many times as there are inputs – in our case we have 20 years of monthly returns so it will run 20*12 = 240 times.
  3. In each iteration of the loop, we calculate the current prediction. We then append the calculated prediction to our output list, predictions – this list is our predictions of the next month’s return.
  4. Next, we set state equal to the current prediction so that we can use it in the next loop – that is, we need the prediction from t=0 to calculate the prediction at t=1.
  5. Finally, we use backpropagation through time (beyond the scope of this post) to update the RNN’s parameters u, m, and B.

Take note of a few key things:

  • The state is where the memory comes from. The state at time t is the previous output (from time t-1) – and this previous output includes the previous model parameters (time t-1’s u, m, and B) as well as the output from time t-2. When the current time-step of the RNN (each time-step is one iteration of the for loop) looks at the output from the previous time-step (state), it is in essence looking at its past self. The previous output is the RNN’s way of snapshotting its past self and passing it forward. That’s why I call it state in my pseudocode – it’s just one way to summarize (very roughly) the most up to date state of the model’s decision making process.
  • We multiply state by the parameter u. This allows the RNN to decide how much or how little to use its memory (past snapshots of itself).
  • The last thing we do in each iteration of our for loop is update the RNN’s parameters (u, m, and B). This means that in each iteration of the loop we may and probably will see different values for u, m, and B.
  • The purpose of the for loop is to move the RNN forward through time. Unlike linear regression where the model is estimated all at once, the RNN gradually converges by incrementally examining the sequential data one time-step at a time.
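Stripping out the parameter updates, the state recursion on its own is runnable. Here it is with fixed, made-up values for u, m, and B so the memory mechanism is easy to trace:

```python
def rnn_forward(inputs, u=0.5, m=1.0, B=0.0):
    state = 0.0           # no previous output at t = 0
    outputs = []
    for x in inputs:
        # each output folds in the previous output (the memory)
        state = u * state + m * x + B
        outputs.append(state)
    return outputs

print(rnn_forward([1.0, 1.0, 1.0]))  # [1.0, 1.5, 1.75]
```

Even with a constant input, each output differs because the state carries the past forward.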

Let’s write out the equation (using simpler notation) to make sure we understand everything. I will call the output at time t, O(t), and the input, X(t). Let’s write out O(3):

O(3) = u*O(2) + m*X(3) + B

We can also write out O(2):

O(2) = u_2*O(1) + m_2*X(2) + B_2

Substituting for O(2), we get:

O(3) = u*(u_2*O(1) + m_2*X(2) + B_2) + m*X(3) + B

See how the output at time t=3 includes previous parameters (u_2, m_2, and B_2) – these parameters are how the RNN made its decisions at time t=2 and they have now been passed on to the current iteration of the model. If we wanted to, we could expand O(1) as well to see that the parameters from time t=1 are included too.
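We can sanity-check the substitution numerically with arbitrary values (all the numbers below are made up):

```python
# Parameters at t=3 and t=2 (arbitrary made-up values)
u, m, B = 0.5, 1.0, 0.1
u2, m2, B2 = 0.4, 0.9, 0.2
O1, X2, X3 = 0.03, 0.01, 0.02

O2 = u2 * O1 + m2 * X2 + B2
direct = u * O2 + m * X3 + B                          # O(3) via O(2)
expanded = u * (u2 * O1 + m2 * X2 + B2) + m * X3 + B  # substituted form

assert abs(direct - expanded) < 1e-12  # the two forms agree
```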

Recall that we defined model memory as being dynamic and able to shift based on the situation. The RNN can do that now – by including past snapshots of itself via its previous output, the RNN acquires access to its historical parameters and can, as the situation requires, decide whether or not to include them in its decision making process.


Conclusion

I learned a great deal writing this post. Previously I knew at a very high level how RNNs worked, but I always wanted to better understand what people meant when they said that RNNs had memory. I hope you understand it better now too.

Our work is not yet finished though. Theoretically RNNs should be able to masterfully harness past experience to make decisions. But in reality, they suffer from something called the vanishing gradient problem. In a future post, we will explore why this is so and how an augmented RNN called an LSTM helps us get around this problem. Until then, cheers!

More Data Science and Business Related Posts By Me:

Business Strategy For Data Scientists

How Much Analysis Is Too Much

Business Simulations With Python

Understanding PCA

Understanding Bayes’ Theorem

Understanding The Naive Bayes Classifier
