
Implementing Recurrent Neural Network using Numpy

A comprehensive tutorial on how a recurrent neural network can be implemented using NumPy.

Getting Started

Photo by cheng feng on Unsplash

Introduction

The recurrent neural network (RNN) is one of the earliest neural networks to deliver a breakthrough in the field of NLP. The beauty of this network is its capacity to store the memory of previous sequences, which is why it is also widely used for time-series tasks. High-level frameworks like TensorFlow and PyTorch abstract away the mathematics behind these neural networks, which makes it difficult for an AI enthusiast to code a deep learning architecture with a proper understanding of its parameters and layers. Resolving this kind of inefficiency requires mathematical knowledge of how these networks work. Coding these algorithms from scratch gives an extra edge by helping AI enthusiasts understand the different notations in research papers and implement them in practice.

If you are new to the concept of RNNs, please refer to the MIT 6.S191 course, one of the best lectures for building an intuitive understanding of how RNNs work. This knowledge will help you follow the notation and the concept implementations explained in this tutorial.

End goal

The end goal of this blog is to make AI enthusiasts comfortable with coding the theoretical knowledge they gain from research papers in the field of Deep Learning.

Parameter Initialization

Unlike traditional neural networks, an RNN possesses three weight parameters: input weights, internal state weights (used to store the memory), and output weights. We start by initializing these parameters with random values. We initialize the word-embedding dimension and the output dimension as 100 and 80 respectively, where the output dimension is the total count of unique words present in the vocabulary.

The variable prev_memory refers to the internal state (the memory of the previous sequences). Other parameters, such as the gradients used for updating the weights, are initialized as well: the input-weight, internal-state-weight, and output-weight gradients are named dU, dW, and dV respectively. The variable bptt_truncate specifies the number of time-stamps the network looks back while back-propagating; this truncation helps overcome the vanishing gradient problem.
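Since the original code listings are not reproduced here, below is a minimal sketch of this parameter initialization, assuming a hidden-state size of 64 (the article fixes only the embedding and output dimensions) and a simple uniform initialization:

```python
import numpy as np

np.random.seed(0)

word_embedding_dim = 100   # size of each word vector (from the article)
output_dim = 80            # unique words in the vocabulary (from the article)
hidden_dim = 64            # size of the internal state -- an assumption

# Three weight matrices: input, internal state (memory), and output
U = np.random.uniform(0, 1, (hidden_dim, word_embedding_dim))
W = np.random.uniform(0, 1, (hidden_dim, hidden_dim))
V = np.random.uniform(0, 1, (output_dim, hidden_dim))

# Internal state carried over from the previous sequences
prev_memory = np.zeros((hidden_dim, 1))

# Gradients for updating each weight matrix
dU = np.zeros_like(U)
dW = np.zeros_like(W)
dV = np.zeros_like(V)

bptt_truncate = 2       # time-stamps to look back while back-propagating
learning_rate = 0.0001
```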

Forward Propagation Intuition:

Input and output vectors:

Consider the sentence "I like to play.". In the vocabulary list, let's assume that I is mapped to index 2, like to index 45, to to index 10, play to index 64, and the punctuation "." to index 1. To get a realistic input-to-output scenario working, let's randomly initialize the word embedding for each word.

Note: You could also try it with a one hot encoded vector for each word and pass that as an input.
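As an illustration of the input setup above, the sentence can be held as a list of vocabulary indices with one randomly initialized embedding per word (the variable names here are placeholders, not necessarily the article's own):

```python
import numpy as np

np.random.seed(1)

word_embedding_dim = 100

# "I like to play ." mapped to the indices assumed in the article
input_string = [2, 45, 10, 64, 1]

# One random embedding vector per word in the sentence
embeddings = [np.random.uniform(-1, 1, (word_embedding_dim, 1))
              for _ in input_string]
```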

Now that we are done with the input, we need to consider the output for each input word. The RNN cell should output the most probable next word for the current input. For training, we provide the (t+1)'th word as the target output for the t'th input; for example, the RNN cell should output the word like for the input word I.

Note: The output at each individual time-stamp is not determined exclusively by the current input, but also by the previous set of inputs, which is captured by the internal state parameter.

Now that the input is in the form of embedding vectors, the output required to calculate the loss should be in the form of one-hot encoded vectors. This is done for each word in the input string except the first, because we are training the network on just one example sentence and the initial input is the first word of that sentence.

Why do we one-hot encode the output words ?

The reason is that the raw outputs are just scores for each unique word, and these are not directly useful to us. Instead, we need the probability of each word with respect to the previous words.

How do we find the probabilities from raw output values ?

To solve this problem, a softmax activation function is applied to the vector of scores so that the resulting probabilities add up to one. Img 1 shows the input-output pipeline at a single time-stamp. The top row is the ground-truth output and the second row represents the predicted output.

Img 1: Input and output pipeline for RNN, Image by Author

Note: During the implementation we need to take care of the keys of output_mapper. We reset the keys to time-stamp values so that the algorithm knows which ground-truth word to use at a particular time-stamp when calculating the loss.

Before reset:
45: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
After reset:
{0: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])}
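The reset shown above can be sketched as follows: build one-hot ground-truth vectors keyed by word index, then re-key the dictionary by time-stamp (the helper code is an assumption about the shape of the original):

```python
import numpy as np

output_dim = 80
input_string = [2, 45, 10, 64, 1]   # "I like to play ."

# One-hot ground truth for every word except the first,
# initially keyed by the word's vocabulary index
output_mapper = {}
for word_index in input_string[1:]:
    one_hot = np.zeros(output_dim)
    one_hot[word_index] = 1.0
    output_mapper[word_index] = one_hot

# Reset the keys to time-stamps so the loss at step t can look up
# the ground-truth word for that step directly
output_t = {t: vec for t, (_, vec) in enumerate(output_mapper.items())}
```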

The RNN box calculation:

Now that we have the weights, know how the input is passed, and know what output is expected, we can start with the forward propagation calculation. The calculations below are required for training the neural network.

Img 2: Image by Author

Here U represents the input weights, W the internal state weights, and V the output weights. The input weights are multiplied with the input (x), and the internal state weights are multiplied with the previous activation, which in our notation is prev_memory. The activation function used between the layers is tanh; it provides the non-linearity that eventually helps in learning.

Note: A bias term is not used in the RNN calculations here, as it would complicate the tutorial.
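A sketch of the single time-stamp calculation from Img 2, with tanh between the layers and no bias term (the function name rnn_cell is illustrative):

```python
import numpy as np

def rnn_cell(x, prev_memory, U, W, V):
    """One forward step: ht = tanh(U.x + W.prev_memory), yt = V.ht."""
    U_frd = np.dot(U, x)               # input weights * current input
    W_frd = np.dot(W, prev_memory)     # internal state weights * previous activation
    h_unactivated = U_frd + W_frd
    ht_activated = np.tanh(h_unactivated)
    yt_unactivated = np.dot(V, ht_activated)   # raw scores; softmax comes later
    return ht_activated, yt_unactivated
```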

Since the above code calculates the output for only one particular time-stamp, we now have to code the forward propagation for the entire sequence of words.

In the code below, the variable output_string contains a list of output vectors for each time-stamp, and memory is a dictionary that stores the parameters for each time-stamp that will be essential during back-propagation.
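A possible shape for that loop, caching the activations needed later for back-propagation (output_string and memory follow the text; the cache key format is an assumption):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))       # shift for numerical stability
    return e / e.sum()

def full_forward_prop(embeddings, U, W, V, prev_memory):
    output_string = []              # predicted probability vector per time-stamp
    memory = {}                     # cached activations for back-propagation
    ht = prev_memory
    for t, x in enumerate(embeddings):
        h_unactivated = np.dot(U, x) + np.dot(W, ht)
        ht = np.tanh(h_unactivated)
        yt = softmax(np.dot(V, ht))
        memory["ht" + str(t)] = ht
        memory["yt" + str(t)] = yt
        output_string.append(yt)
    return output_string, memory
```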

Loss Calculations:

We define our loss, or error, to be the cross-entropy loss, given by:

Img 3: Image by Author

Most importantly, what we need to look at in the above code is line 5. As we know, the ground-truth output (y) is of the form [0,0,…,1,…,0] and the predicted output (y_hat) is of the form [0.34,0.03,…,0.45], and we need the loss to be a single value. For this reason we use the sum function to collapse the per-element errors between the y and y_hat vectors at that particular time-stamp into one number. The total_loss is the loss for the entire model, inclusive of all time-stamps.
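A sketch of the loss computation described above (calculate_loss is an assumed name; the summation line is the step the paragraph highlights):

```python
import numpy as np

def calculate_loss(output_t, output_string):
    """Cross-entropy loss summed over all time-stamps.

    output_t:      {time-stamp: one-hot ground truth y}
    output_string: list of predicted probability vectors y_hat
    """
    total_loss = 0.0
    layer_loss = []
    for t, y in output_t.items():
        y_hat = output_string[t]
        # the sum collapses the per-element errors into a single value
        loss = -np.sum(y * np.log(y_hat + 1e-12))
        layer_loss.append(loss)
        total_loss += loss
    return total_loss, layer_loss
```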

Back-propagation

If you have heard of back-propagation, then you must have heard of the chain rule, the vital mechanism behind calculating gradients.

Img 4: Image by Author
Img 5: Image by Author

Based on Img 4 above, the cost C represents the error, i.e. the change required for y_hat to reach y. Since the cost is a function of the activation a, the change reflected by the activation is represented by dCost/da. Practically, it is the change (error) value seen from the point of view of the activation node. Similarly, the change of the activation with respect to z is represented by da/dz, and the change of z with respect to w is given by dz/dw. We are ultimately concerned with how much the change (error) is with respect to the weights. Since there is no direct relation between the weights and the cost, the intermediate derivative values from the cost all the way down to the weights are multiplied together (as seen in the equation above).

Back-propagation for RNN:

Since there are three weights in the RNN, we require three gradients: the gradient of the input weights (dLoss/dU), of the internal state weights (dLoss/dW), and of the output weights (dLoss/dV).

The chain of these three gradients can be represented as follows:

Img 6: Gradient equation of the weights used in RNN, Image by Author

Note: Here the T represents the transpose.

The dLoss/dy_unactivated term is coded as follows:

To know more about the loss derivatives, please refer to this blog. There are two gradient functions that we will be calculating: multiplication_backward and addition_backward. In the case of multiplication_backward we return two values: the gradient with respect to the weights (for example dLoss/dV) and a chain gradient that becomes part of the chain used to calculate another weight gradient. In the case of addition_backward, the derivative of each individual component in the add function (h_unactivated) is 1: for example, dh_unactivated/dU_frd = 1, since h_unactivated = U_frd + W_frd and dU_frd/dU_frd = 1 (the matrix of ones matches the dimension of U_frd). To know more about the gradients, you can refer to this source. These are the only two functions required to calculate the gradients: multiplication_backward is used on equations that contain a dot product of vectors, and addition_backward on equations that contain an addition of two vectors.

Img 7: Intuition of the mathematics behind the gradient function that is coded below, Image by Author
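A sketch of those two gradient functions (the signatures are assumptions based on the description above):

```python
import numpy as np

def multiplication_backward(weights, x, dz):
    """Backward pass through z = weights . x: returns the weight gradient
    and the chain gradient with respect to x (T is the transpose)."""
    gradient_weight = np.dot(dz, x.T)
    chain_gradient = np.dot(weights.T, dz)
    return gradient_weight, chain_gradient

def add_backward(x1, x2, dz):
    """Backward pass through z = x1 + x2: each component's local derivative
    is 1 (a matrix of ones matching the addend's dimension), so the incoming
    gradient passes through unchanged to both branches."""
    dx1 = dz * np.ones_like(x1)
    dx2 = dz * np.ones_like(x2)
    return dx1, dx2
```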

Now that you have analyzed and understood back-propagation for an RNN, it's time to implement it for one single time-stamp; this will later be used to calculate the gradients across all time-stamps. As seen in the code below, forward_params_t is a list that contains the forward parameters of the network at a particular time-step. The variable ds is a vital part, as it incorporates the hidden-state gradient from previous time-stamps, which helps extract the useful information required while back-propagating.
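A self-contained sketch of that single time-stamp backward pass, reusing the two gradient helpers from the previous section (all names are assumptions consistent with the text):

```python
import numpy as np

def multiplication_backward(weights, x, dz):
    # weight gradient and chain gradient for z = weights . x
    return np.dot(dz, x.T), np.dot(weights.T, dz)

def add_backward(x1, x2, dz):
    # z = x1 + x2: the incoming gradient passes through to both addends
    return dz * np.ones_like(x1), dz * np.ones_like(x2)

def tanh_activation_backward(pre_activation, top_diff):
    out = np.tanh(pre_activation)
    return (1.0 - np.square(out)) * top_diff

def single_backprop(X, U, W, V, d_y, forward_params_t, diff_s, prev_s):
    """Gradients for a single time-stamp.
    forward_params_t: [h_unactivated, U_frd, W_frd, ht_activated] cached
    during the forward pass; diff_s (ds) carries the hidden-state gradient
    arriving from later time-stamps."""
    h_unactivated, U_frd, W_frd, ht_activated = forward_params_t
    dV_t, dsv = multiplication_backward(V, ht_activated, d_y)
    ds = dsv + diff_s                      # merge with the future gradient
    dadd = tanh_activation_backward(h_unactivated, ds)
    dU_branch, dW_branch = add_backward(U_frd, W_frd, dadd)
    dW_t, dprev_s = multiplication_backward(W, prev_s, dW_branch)
    dU_t, _ = multiplication_backward(U, X, dU_branch)
    return dprev_s, dU_t, dW_t, dV_t
```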

For the RNN, instead of using vanilla back-propagation, we use truncated back-propagation because of the vanishing gradient problem. In this technique, instead of looking just one time-stamp back, the current cell looks k time-stamps back, where k is the number of previous cells to examine, so that more knowledge is retrieved.
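One way to sketch truncated BPTT end to end: a forward pass that caches states, then a backward pass that, at every time-stamp, walks at most bptt_truncate steps into the past (all names and the softmax output layer are assumptions consistent with the earlier sections):

```python
import numpy as np

def tanh_backward(pre_activation, top_diff):
    return (1.0 - np.tanh(pre_activation) ** 2) * top_diff

def truncated_bptt(xs, ys, U, W, V, bptt_truncate=2):
    """xs: list of input vectors; ys: {time-stamp: one-hot target column}.
    Returns the summed gradients dU, dW, dV."""
    T = len(xs)
    hidden_dim = U.shape[0]
    # ---- forward pass, caching pre-activations and states ----
    hs = {-1: np.zeros((hidden_dim, 1))}
    pre, preds = {}, {}
    for t in range(T):
        pre[t] = np.dot(U, xs[t]) + np.dot(W, hs[t - 1])
        hs[t] = np.tanh(pre[t])
        z = np.dot(V, hs[t])
        e = np.exp(z - z.max())
        preds[t] = e / e.sum()
    # ---- backward pass, truncated at bptt_truncate steps ----
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    for t in reversed(range(T)):
        if t not in ys:
            continue
        d_y = preds[t] - ys[t]        # dLoss/dy_unactivated (softmax + CE)
        dV += np.dot(d_y, hs[t].T)
        ds = np.dot(V.T, d_y)         # gradient flowing into the hidden state
        # walk back at most bptt_truncate time-stamps
        for step in range(t, max(-1, t - bptt_truncate - 1), -1):
            dadd = tanh_backward(pre[step], ds)
            dW += np.dot(dadd, hs[step - 1].T)
            dU += np.dot(dadd, xs[step].T)
            ds = np.dot(W.T, dadd)    # pass the gradient one step further back
    return dU, dW, dV
```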

Updating Weights:

Once we have calculated the gradients using back-propagation, we update the weights using the batch gradient descent approach.
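A minimal sketch of that update step (sgd_step is an assumed name):

```python
import numpy as np

def sgd_step(U, W, V, dU, dW, dV, learning_rate=0.0001):
    """Batch gradient descent: move each weight against its gradient."""
    U -= learning_rate * dU
    W -= learning_rate * dW
    V -= learning_rate * dV
    return U, W, V
```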

Training the sequence:

Once we have all our functions in place, we can approach the climax: training the neural network. The learning rate used for training is static; you could also use a dynamic approach, changing the learning rate with a technique such as step decay.
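Tying everything together, here is a compact self-contained training loop on toy data. Tiny dimensions, random "embeddings", and full (untruncated) BPTT are used here for brevity; the article trains on the real sentence with truncated BPTT:

```python
import numpy as np

np.random.seed(2)

# Toy setup: a 5-step sequence with assumed small dimensions
embed_dim, hidden_dim, vocab = 10, 8, 12
xs = [np.random.uniform(-1, 1, (embed_dim, 1)) for _ in range(5)]

# Target at step t is the (t+1)'th word, keyed by time-stamp
targets = {}
for t, idx in enumerate([4, 1, 7, 0]):
    one_hot = np.zeros((vocab, 1))
    one_hot[idx] = 1.0
    targets[t] = one_hot

U = np.random.uniform(-0.5, 0.5, (hidden_dim, embed_dim))
W = np.random.uniform(-0.5, 0.5, (hidden_dim, hidden_dim))
V = np.random.uniform(-0.5, 0.5, (vocab, hidden_dim))
lr = 0.01                            # static learning rate

losses = []
for epoch in range(50):
    # ---- forward pass, caching what the backward pass needs ----
    h = np.zeros((hidden_dim, 1))
    cache, loss = [], 0.0
    for t, x in enumerate(xs):
        h_new = np.tanh(U @ x + W @ h)
        z = V @ h_new
        p = np.exp(z - z.max()); p /= p.sum()   # softmax probabilities
        cache.append((x, h, h_new, p))
        h = h_new
        if t in targets:
            loss -= float(np.sum(targets[t] * np.log(p + 1e-12)))
    losses.append(loss)
    # ---- backward pass (full BPTT here for brevity) ----
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ds = np.zeros((hidden_dim, 1))
    for t in reversed(range(len(xs))):
        x, h_prev, h_new, p = cache[t]
        d_y = (p - targets[t]) if t in targets else np.zeros_like(p)
        dV += d_y @ h_new.T
        dpre = (1 - h_new ** 2) * (V.T @ d_y + ds)
        dU += dpre @ x.T
        dW += dpre @ h_prev.T
        ds = W.T @ dpre
    # ---- batch gradient descent update ----
    U -= lr * dU; W -= lr * dW; V -= lr * dV
```

With a fixed toy sequence like this, the summed cross-entropy loss should fall steadily over the epochs, mirroring the losses plot in the article.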

Img 8: Losses output, Image by Author

Conclusion:

Now that you have implemented a recurrent neural network, it's time to take a step forward to advanced architectures like LSTM and GRU, which utilize the hidden states in a much more efficient manner to retain the meaning of longer sequences. There is still a long way to go: with the many advancements in the field of NLP, there are highly sophisticated models such as ELMo and BERT. Understand them and try to implement them yourself. They follow the same concept of memory, but bring in an element of word weighting. Since these models are highly complex, NumPy alone will not suffice; rather, pick up the skills of PyTorch or TensorFlow to implement them and build amazing AI systems that serve the community.

The inspiration to create this tutorial came from this GitHub blog.

You can access the notebook for this tutorial here.

Hope you all enjoyed the tutorial!

References:

[1] Sargur Srihari, RNN-Gradients, https://cedar.buffalo.edu/~srihari/CSE676/10.2.2%20RNN-Gradients.pdf

[2] Yu Gong, RNN-from-scratch, https://github.com/pangolulu/rnn-from-scratch

