
The Essence of RNNs

The intuition behind the building blocks of recurrent neural networks (RNNs)

Image source: https://en.wikipedia.org/wiki/Recurrent_neural_network

While ordinary multilayer perceptrons (MLPs) with only one hidden layer are undoubtedly capable of approximating any continuous function (the universal approximation theorem), different applications require more complex architectures to solve a given problem. Convolutional neural networks (CNNs) are great at processing spatial information, while recurrent neural networks (RNNs) are geared towards handling sequential data. In this article, we will look at how RNNs work at a fundamental level and learn about different variations of this widely used network architecture.

Sequential Data

Image source: https://en.wikipedia.org/wiki/File:Stocks29.jpg

Before explaining the general principles of RNNs, it is essential to understand what sequential data means. With CNNs and MLPs, we always assumed that the data points were independently and identically distributed (i.i.d.), but with sequential data that is not the case: previous input points affect the next output. Since RNNs are most widely used in natural language processing (NLP), an analogy from that field makes the point clear. In textual data, all the words seen so far affect the next word, and that word, in turn, affects the one that follows it. Every NLP-oriented task therefore requires some notion of memory; even the first word ought to affect the outcome. In more formal notation, the input point xₜ has an effect on xₜ₊₁. Applications of RNNs are not bound to NLP but extend to any problem that involves sequential data. Most notably, RNNs are used for stock market prediction, music generation, time series analysis, and many other tasks.

Building Blocks

In this section, we assume that you are familiar with feed-forward neural networks. We take the concept a bit further by imposing recursion on the hidden layer; you can think of RNNs as ordinary neural networks with recursion. The hidden unit's output is a linear combination of the current input point and the previous timestamp's hidden state, with a tanh non-linearity applied. The illustrations below make the point concrete.

Image by author

The image above does not provide the best intuition for how RNNs work. That is why the unfolded version of an RNN is preferred for illustrative purposes, with each input point xₜ referred to as a timestamp. If you are building a model to predict the stock market and your training data is divided into 150 discrete days, you would have a model with 150 timestamps. The attributes ⟨x₁, x₂, x₃, …, xₙ⟩ of each day would be fed into the system, one day after another. One might ask: isn't this just like having a multitude of MLPs stacked next to each other? The answer is no, because we are not initiating a new model at each timestamp; we are simply using the same weight matrices over and over again. Different MLPs stacked next to each other sequentially would require a different weight matrix for each MLP. The main idea behind RNNs is that the same weight matrices are used recursively. That is what makes them well suited for sequential data: you are not adding new parameters for each input point, which keeps the model's size manageable and efficient. Below is an unfolded version of an RNN. Notice that each timestamp has its own output and a loss associated with it.

Image by author.

If we look at each timestamp individually, the result is the following:

hₜ = tanh(Wₓₕ xₜ + Wₕₕ hₜ₋₁ + b)

As mentioned above, we simply take a linear combination of the input data and the previous timestamp's hidden state. There are two weight matrices here: one for the input data, denoted Wₓₕ, and one for the previous timestamp's hidden state, denoted Wₕₕ. A bias term is added, and a tanh activation function is applied to the combination. The same weight matrices are used at every timestamp.
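
As a minimal sketch of this update (assuming NumPy and small, purely illustrative dimensions), a single hidden-state step could look like this:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One hidden-state update: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Hypothetical sizes: 4 input features, 3 hidden units.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4))   # input-to-hidden weights
W_hh = rng.normal(size=(3, 3))   # hidden-to-hidden weights (reused at every timestamp)
b = np.zeros(3)

h = np.zeros(3)                  # initial hidden state h_0
x_t = rng.normal(size=4)         # one input point
h = rnn_step(x_t, h, W_xh, W_hh, b)
```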

Now let’s look at the output for each timestamp.

ȳₜ = activation(Wₒₕ hₜ)

Here we see that another non-linear activation function is applied to an additional weight matrix multiplied by the current timestamp's hidden state. For simplicity, we will refer to Wₕₕ as W, Wₓₕ as U, and Wₒₕ as V.
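
Putting the pieces together, here is a minimal sketch of the unrolled forward pass (assuming NumPy, a softmax output activation, and hypothetical dimensions); note that the same U, W, and V are reused at every timestamp, which is the weight sharing discussed above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, W, U, V, b):
    """Unrolled forward pass: the same W, U, V are applied at every timestamp."""
    h = np.zeros(W.shape[0])            # initial hidden state h_0
    outputs = []
    for x_t in xs:                      # one iteration per timestamp
        h = np.tanh(U @ x_t + W @ h + b)
        outputs.append(softmax(V @ h))  # y_bar for this timestamp
    return outputs

# Hypothetical sizes: 150 timestamps, 4 features, 3 hidden units, 2 output classes.
rng = np.random.default_rng(1)
xs = rng.normal(size=(150, 4))
W = rng.normal(size=(3, 3))
U = rng.normal(size=(3, 4))
V = rng.normal(size=(2, 3))
b = np.zeros(3)
y_bars = rnn_forward(xs, W, U, V, b)
```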

Training RNNs and BPTT

Neural networks update parameters through automatic differentiation, which is just the repeated application of the chain rule to find ∂J/∂w. We will not get into the mathematical details here, but we would like to provide the intuition behind two inherent problems of deep RNNs, known as the vanishing gradient and the exploding gradient. To compute the gradient, we have to multiply many terms together. It is then not too difficult to see that multiplying many terms smaller than one results in a gradient that approaches zero. Conversely, the exploding gradient problem arises when many terms larger than one are multiplied. The maths, of course, is much more subtle, but the derivation of the automatic differentiation provided at the end of this article makes the notion of multiplying many terms clearer. The vanishing gradient problem can be summarized as follows: the gradient signal from far away is lost because it is much smaller than the gradient signal close by. Intuitively, this means that only the effect of the previous input point is realized; in the case of NLP, you are not learning context, only the effect of the previous word.
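
A quick numeric illustration (with hypothetical per-timestamp factors) shows how repeated multiplication drives the gradient toward zero or toward very large values:

```python
import numpy as np

# Hypothetical per-timestamp gradient factors over 100 timestamps.
factors_small = np.full(100, 0.9)   # terms slightly smaller than 1
factors_large = np.full(100, 1.1)   # terms slightly larger than 1

print(np.prod(factors_small))  # ~2.7e-05  -> vanishing gradient
print(np.prod(factors_large))  # ~1.4e+04  -> exploding gradient
```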

Solutions

To address the problems discussed in the previous section, we first look at how gradient clipping is used for the exploding gradient problem. Gradient clipping is a simple method in which we take a step in the same direction (remember gradient descent), but the step is scaled down. The algorithm is as follows (a short code sketch is given after the steps):

Decide a maximum allowable gradient size Gₘₐₓ.

Let g = ∂L/∂w.

If ||g|| < Gₘₐₓ, proceed as usual.

If not, use gˢ = (g / ||g||) · Gₘₐₓ.
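
A minimal sketch of this clipping rule (assuming NumPy and a flattened gradient vector g):

```python
import numpy as np

def clip_gradient(g, g_max):
    """Rescale g so its norm never exceeds g_max; the direction is preserved."""
    norm = np.linalg.norm(g)
    if norm < g_max:
        return g                  # proceed as usual
    return (g / norm) * g_max     # scaled-down step in the same direction

# Hypothetical example: a large gradient gets rescaled to norm 1.0.
g = np.array([30.0, 40.0])        # ||g|| = 50
print(clip_gradient(g, g_max=1.0))  # -> [0.6, 0.8]
```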

As for the vanishing gradient problem, new architectures such as LSTM and GRU must be introduced.

Gated Recurrent Units (GRUs)

The main idea behind GRUs and LSTMs is to add more parameters in order to counter the problem of short-term memory. They have internal gates that adjust the amount of information to be kept and the amount to be discarded. In this article, we will only consider the case of a simplified GRU.

Image source: https://en.wikipedia.org/wiki/Gated_recurrent_unit

Our initial formula was the following:

hₜ = tanh(W hₜ₋₁ + U xₜ)

Now we shall define Z = W hₜ₋₁ + U xₜ and call the vanilla update (what hₜ would have been) g:

g = tanh(Z)

We now take a linear combination of this vanilla candidate and the previous hidden state:

hₜ = (1 − λ) g + λ hₜ₋₁,   λ ∈ [0, 1]

when λ = 1 → hₜ = hₜ₋₁ → pure memory

when λ = 0 → hₜ = g → vanilla RNN

The extra parameter λ is an example of how GRUs and LSTMs use such techniques to achieve better results. In the full versions, the scalar parameter we introduced is replaced by a learned gate computed via matrix multiplications, but the idea is the same: take linear combinations of more parameters and apply a non-linear activation. Links where you can learn about GRUs and LSTMs in depth are provided in the references.
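
A minimal sketch of this simplified gated update (assuming NumPy and a hand-picked scalar λ rather than the learned gate used in full GRUs):

```python
import numpy as np

def gated_step(x_t, h_prev, W, U, lam):
    """Simplified GRU-style update: blend the vanilla candidate with the old state."""
    g = np.tanh(W @ h_prev + U @ x_t)      # vanilla RNN candidate
    return (1 - lam) * g + lam * h_prev    # lam = 1 -> pure memory, lam = 0 -> vanilla RNN

# Hypothetical sizes: 3 hidden units, 4 input features.
rng = np.random.default_rng(2)
W = rng.normal(size=(3, 3))
U = rng.normal(size=(3, 4))
h = np.zeros(3)
x_t = rng.normal(size=4)
h = gated_step(x_t, h, W, U, lam=0.5)
```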


Handwritten Derivation for the Automatic Differentiation

Images by author

Conclusion

  1. RNNs are the go-to method when it comes to modeling sequential data.
  2. RNNs can handle variable-length inputs because you can simply unroll the network again when new data is available.
  3. Automatic differentiation makes taking gradients easier through the chain rule.
  4. Vanishing gradients result from the chain rule multiplying many terms together; when many of those terms are smaller than one, their product shrinks toward zero.
  5. Gradient clipping is used to fix the problem of exploding gradients.
  6. Different architectures such as GRUs and LSTMs need to be introduced to solve the vanishing gradient problem.
  7. Always think of neural networks as matrix multiplications; Keras and other deep learning packages will make much more sense.

References

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning (Adaptive Computation and Machine Learning series) (Illustrated ed.). The MIT Press.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303–314.

NPTEL :: Computer Science and Engineering – NOC:Machine Learning for Engineering and Science…

