Recurrent Neural Networks Explained with a Real Life Example and Python Code

Using Recurrent Neural Networks in a Sentiment Analysis task

Carolina Bento
Towards Data Science



This is the second article in a series dedicated to Deep Learning, a group of Machine Learning methods whose roots date back to the 1940s. Deep Learning gained attention in recent decades for its groundbreaking applications in areas like image classification, speech recognition, and machine translation.

The first article focused on the MultiLayer Perceptron. Stay tuned if you’d like to see different Deep Learning algorithms explained with real-life examples and some Python code.

The development of the MultiLayer Perceptron was an important landmark for Artificial Neural Networks. For the first time, we could stack many perceptrons together and organize them in layers, to create models that better represent complex problems.

The MultiLayer Perceptron works in an atemporal, discrete way. It takes one input vector, performs a feedforward computational step, back-propagates the errors, and stops once the loss function can't be minimized any further, generating the output.

Example of a MultiLayer Perceptron with one hidden layer. (Image by author)

But lots of real-world problems involve a time dimension. What about when the problem at hand comes in the form of a sequence?

Neural Networks for Sentiment Analysis

In the previous article you used the MultiLayer Perceptron for the task of Sentiment Analysis. You took the reviews for your parents’ cozy bed and breakfast in the countryside, trained a MultiLayer Perceptron and predicted the overall sentiment of the review.

MultiLayer Perceptron used to learn the sentiment of reviews from your parents' bed and breakfast. (Image by author)

The MultiLayer Perceptron used the tokenized word vector of each review as input, but it looked at each review as a single, atomic unit.

A more precise way of analyzing these reviews would take into account the position of each word in the review, because the structure of a sentence plays a role in giving it meaning.

A more powerful model than the MultiLayer Perceptron would analyze reviews according to how each sentence is constructed. For instance, with sentences like This time around the service was great and The service was great this time around, it would be clever enough to determine that these sentences have the same sentiment[1], even though the words were shuffled around.

To go one step further in that sentiment analysis task, you need a different model.

You need a model that looks at each review as an ordered sequence of words, not an atomic unit. Each sequence of words can also have an arbitrary length, since each sentence can be made up of a different number of words.

You need to build a Recurrent Neural Network.

Recurrent Neural Networks

Recurrent Neural Networks are used in several domains. For instance, in Natural Language Processing (NLP), they've been used to generate handwritten text and to perform machine translation and speech recognition. But their applications are not restricted to processing language. In Computer Vision, Recurrent Neural Networks have been used in tasks like image captioning and image question answering.

What distinguishes a Recurrent Neural Network from the MultiLayer Perceptron is that a Recurrent Neural Network is built to handle inputs that represent a sequence, like the sequence of words in a review from your parents’ bed and breakfast. But it also handles an output sequence, like when you’re translating a sentence from one language to another.

Recurrent Neural Networks act like a chain. The computation performed at each time step depends on the previous computation.

The magic of Deep Learning training is in the hidden layers.

At the beginning of training the architecture of the network is defined, in this case, a Recurrent Neural Network. But we only know which inputs match with which outputs. The algorithm will learn how to use the hidden layers to make the best approximation of each input to output data point[1].

Visualizing a Recurrent Neural Network

One of the best ways to visualize a Recurrent Neural Network is as a cyclic computational graph[1]. In this representation the Recurrent Neural Network has three major states:

  • Input state, which captures the input data for the model.
  • Output state, which captures the results of the model.
  • Recurrent state, which is in fact a chain of hidden states, and captures all the computations between the input and output states.

Similarly to other Supervised Machine Learning models, Recurrent Neural Networks use a loss function to compare the output of the model to the ground truth. That loss is later back-propagated and the model weights are updated, like in the MultiLayer Perceptron.

Recurrent Neural Network represented as a computation graph. (Image by author)

This is a compressed view, more like a summary of the mechanics of Recurrent Neural Networks. In practice, it’s easier to visualize the recurrence when you unfold this graph. Especially when working with text sequences.

In this unfolded view of a Recurrent Neural Network each computation corresponds to one step, also referred to as internal state. And each step depends on the computation from the previous step.

Recurrent Neural Network represented as an unfolded computational graph. (Image by author)

As each internal state relies on the previous one, you have information that is propagated onto each layer of neurons in the network since the beginning of the sequence. Like an old memory that is passed on to future generations.

If the MultiLayer Perceptron meant stacking multiple neurons in layers, Recurrent Neural Networks mean chaining MultiLayer Perceptrons, to create a sequence of dependent computations.

In the case of a Recurrent Neural Network, memories are information about the computations applied to the sequence so far.

Recurrent Neural Network Superpower: Parameter Sharing

A single weight vector is shared across all time steps in the network.

A key characteristic of Recurrent Neural Networks is parameter sharing. There's only one set of parameters that is used, and optimized, across all parts of the network. If those parameters were not shared, the model would have to learn the parameters for each part of the input sequence and would have a much harder time generalizing to examples it had not seen yet[1].

Sharing parameters gives Recurrent Neural Networks the ability to handle inputs of different lengths and still perform predictions in an acceptable time frame.

Shared parameters are particularly important for generalizing to sequences that share inputs, although in different positions. In the case of your parents' bed and breakfast reviews, without shared parameters, the network would have a much harder time, and would do repeated work learning the same language rules multiple times, to figure out that sentences like This time around the service was great and The service was great this time around have the same output sentiment.

This is a major advantage of RNNs because, without parameter sharing, you’d have to learn a different model for each time step in your sequence, and you’d need a large training set to accommodate the different model training steps.

In practice, parameter sharing means the output at each time step is a function of the outputs from previous time steps, with each step updated using the same rule. The latter is another important aspect of Recurrent Neural Networks: the update rule is the same, which means that, at each time step, the network applies the same activation function.
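
To make this concrete, here is a minimal NumPy sketch, added purely as an illustration (it is not the TensorFlow model built later): a vanilla RNN forward pass in which the same weight matrices and the same tanh activation are applied at every time step, no matter how long the input sequence is.

import numpy as np

# Minimal sketch of a vanilla RNN forward pass.
# The same parameters (W_xh, W_hh, b_h) are reused at every time step.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 4                       # illustrative sizes
W_xh = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

def rnn_forward(inputs):
    # inputs: a sequence of vectors, one per time step, of any length
    h = np.zeros(hidden_dim)                       # initial hidden state, the "memory"
    for x_t in inputs:                             # each step depends on the previous one
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # same update rule at every step
    return h                                       # final state summarizes the sequence

sequence = [rng.normal(size=input_dim) for _ in range(5)]
print(rnn_forward(sequence))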

Many Hidden States, the same Activation Function

The network can have as many hidden states as you'd like, but there's one important constant: in each hidden state you're always computing the same activation function. The output of each layer is calculated using the same function[3].

Activation function applied to a Recurrent Neural Network to compute the hidden state h. (Image by author)

The activation function can be as simple as a linear function or the sigmoid function. But the hyperbolic tangent is also commonly used in Deep Learning, because it tends to have fewer occurrences of vanishing gradients when compared to the sigmoid.

Training a deep neural network with hyperbolic tangent is as simple as training a linear model, as long as computations are small [1].

The hyperbolic tangent, represented as tanh(x), is a solid choice for activation function. It behaves like the identity function near zero, such that tanh(0) = 0.

So, as long as the computations for this activation function are small, training a deep neural network with hyperbolic tangent is as simple as training a linear model[1].
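
You can check this behavior with a couple of lines of Python (a quick illustration, not part of the sentiment model):

import numpy as np

# tanh is roughly the identity near zero and saturates for larger inputs
for x in [0.0, 0.1, 0.5, 2.0]:
    print(x, np.tanh(x))
# 0.0 -> 0.0, 0.1 -> ~0.0997, 0.5 -> ~0.462, 2.0 -> ~0.964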

The output layer is special: it may need Softmax

Right now you might be thinking Didn’t you say that every layer uses the same activation function?

Yes, but in classification tasks the output layer is special.

Specifically for Neural Networks that tackle classification problems, there's also another activation function, applied only to the output layer, almost like a post-processing step: meet the Softmax function.

The output of a binary classification problem is either a 0 or a 1. But sometimes you’re tackling a multi-class problem, for instance, if the reviews for your parents’ bed and breakfast were categorized as Positive, Neutral or Negative.

In this case, with 3 possible output classes, it’s more useful to know how likely the observation is to belong to the positive class. This is why, along with the activation function you choose for the network, the Softmax function is applied in the output layer.

Softmax equation. (Image by author)

Vector z is the result of all the computations since the first layer; it's the vector that reaches the output layer.

For instance, if your neural network only has one linear layer, vector z can look like:

Vector z for a linear layer. (Image by author)

With Softmax, this vector is exponentiated and, since this is a classification task, normalized across all possible K classes, turning the vector into a probability distribution. That’s why Softmax is also called the Normalized exponential function.

After applying Softmax, the values of vector z always add up to one. They represent the probability that the observation given to the network in the input layer belongs to each class.

If you don't apply Softmax to the output layer, the output of your Recurrent Neural Network is a number for each of the possible classes. The output with the highest value is the winning class for that observation[3].
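
Here is a minimal NumPy sketch of Softmax, added as an illustration: the scores are exponentiated and normalized so they always sum to one.

import numpy as np

def softmax(z):
    # subtract the max for numerical stability, then exponentiate and normalize
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])   # hypothetical scores for Positive, Neutral, Negative
print(softmax(z))               # approximately [0.659 0.242 0.099]
print(softmax(z).sum())         # 1.0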

Backpropagation Through Time

So far you've looked into the broader architecture and components of a Recurrent Neural Network, i.e., the activation and loss functions. Let's focus on the learning.

As the computation flows from each hidden layer to the next, it moves forward, towards the output layer. Upon reaching the output layer, you compute the loss function, meaning you compare the output generated to the expected true value for that training observation.

If the process stopped here, the network would not be able to learn. It would be a feedforward network, since information only moves forward in the network structure. But learning needs some sort of loop. Just like when you learn something in school or on your own: information gets to your brain, this is the feedforward part, then you process it and, as you do this, you sanity-check and sometimes re-learn certain things, this is the part I call the loop.

Neural networks are inspired by the brain, mostly by how neurons work. But these neural network architectures also have this loop part, the part where they learn how to map the input data to one of the possible outcomes.

Back to Recurrent Neural Networks!

The input goes through all the layers and, when it reaches the output layer, the network computes the loss function. Now, knowing how different, or distant, from the expected result that chain of computations was, it takes the value of the loss function and computes its gradient with respect to the parameters.

Then, with the help of another algorithm, like Stochastic Gradient Descent, that gradient is sent back in the opposite direction. All the way back to the input layer.

This way you’re back-propagating the loss function.

The actual computation of the gradient is what is called back-propagation[1]. But it is usually confused with the actual learning part, which is accomplished by algorithms like Stochastic Gradient Descent.

Once the gradient reaches the input layer, Stochastic Gradient Descent, or another gradient-based optimization algorithm, adjusts the network weights and the activation function is computed again through every hidden layer.

Each weight in the network is updated by subtracting the weight change amount (vt), derived from the gradient of the loss function (J) with respect to the weights, from the current weight vector (theta).

Backpropagation weight update rule (Image by author).
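
In code, a single gradient-descent weight update might look like this sketch, with hypothetical numbers:

import numpy as np

learning_rate = 0.01
theta = np.array([0.5, -0.3])     # current weight vector (theta)
grad_J = np.array([0.2, -0.1])    # gradient of the loss J with respect to the weights

v_t = learning_rate * grad_J      # weight change amount (vt)
theta = theta - v_t               # subtract the change from the current weights
print(theta)                      # [ 0.498 -0.299]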

The process goes on until the loss function can’t be minimized anymore, so there’s no point in adjusting weights once again, and the performance of the neural network is improved to the maximum given the current training set and architecture.

More complex algorithms and Neural Network architectures may increase performance, but they also add complexity to all computations, leading to longer training times.

In Machine Learning, you’re always dealing with trade-offs. With more complex models some of the trade-offs are (a) increasing performance, but reducing model interpretability or (b) increasing performance, but increasing training time.

RNNs in Action: Sentiment Analysis 👍 👎

Instead of building a Recurrent Neural Network from scratch, you've decided to use TensorFlow's robust library to help classify the sentiment of the reviews of your parents' bed and breakfast.

The first thing you need to do is install TensorFlow on your machine via pip, since you're going to use the local Python environment.

Following the TensorFlow docs:

# Requires the latest pip
> pip install --upgrade pip
# Current stable release for CPU and GPU
> pip install tensorflow

Then you noticed there's a robust example of Text Classification using RNNs in the TensorFlow resources[6]. That's a good basis for what you want to do; you just need to adapt it to your own task.

At a glance, setting up a Recurrent Neural Network to classify the sentiment of all reviews from your parents’ bed and breakfast involves:

  1. Organizing Training and Testing Files
  2. Loading Training and Testing Datasets
  3. Dimensionality Problem: Compressing your Dataset (Vectorization)
  4. Building the Recurrent Neural Network
  5. Model Fitting and Evaluation
  6. Accuracy and Loss Visualized

Organizing Training and Testing Files

The first step is taking the reviews your cousins have helped classify and organizing them into the corresponding directories. There will be train and test directories, each one with sub-directories that contain positive and negative reviews.

After this organization, if you run the command tree on your project directory, you’ll have something like this:

Train and testing dataset directories (Image by author)
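
As a rough sketch, assuming the ../datasets path used in the code later on and hypothetical pos and neg sub-directory names, the tree output looks something like:

datasets
├── train
│   ├── neg
│   │   ├── review_1.txt
│   │   └── review_2.txt
│   └── pos
│       ├── review_3.txt
│       └── review_4.txt
└── test
    ├── neg
    │   └── review_5.txt
    └── pos
        └── review_6.txt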

What's important in this step is that there is only one review per file, so each review can be properly loaded and processed in the following steps.

Loading Training and Testing Datasets

To load the train and test datasets you can leverage the Keras utility method text_dataset_from_directory, which specifically requires the directory structure you put together in the previous step.

Loading train and testing datasets (Image by author)
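
In code, and mirroring the full listing at the end of the article, the loading step boils down to:

import os
import tensorflow as tf

# Load the training and testing datasets, inferring the labels
# from the sub-directory names
train_dir = os.path.join('', '../datasets/train')
train_dataset = tf.keras.utils.text_dataset_from_directory(
    train_dir, label_mode='int', labels='inferred', follow_links=True
)

test_dir = os.path.join('', '../datasets/test')
test_dataset = tf.keras.utils.text_dataset_from_directory(
    test_dir, label_mode='int', labels='inferred', follow_links=True
)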

Dimensionality Problem: Compressing your Dataset (Vectorization)

Your parents' bed and breakfast reviews are not that lengthy. There will be the occasional more enthusiastic customer who writes a short essay about their experience but, from a data perspective, the result is a vector of relatively low dimensionality.

But what if your parents' business was so popular that Travel and Leisure magazines started writing articles, essentially reviews, about them? In that case we're talking about a much larger dataset, where each review is a magazine article of at least 800 words.

In this scenario, you're dealing with a dimensionality problem. You have very large and sparse vectors, which makes all matrix calculations computationally challenging.

To handle this situation you want to do something similar to what you do with Principal Component Analysis. You want to compress your dataset, while keeping all of its expressiveness, all its core characteristics.

“These are built from a very large corpus of documents by a variant of principal components analysis. The idea is that the positions of words in the embedding space preserve semantic meaning; e.g. synonyms should appear near each other.” [5]

A typical preprocessing step is to reduce the dimensionality with word2vec[4]. In TensorFlow, the Keras TextVectorization layer does something similar. It takes a string and maps it either to a 1-dimensional tensor of indices or to a 1-dimensional tensor of floats that represent the data in the string.

Loading data and Vectorizing training dataset. (Image by author)
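
Again mirroring the full listing at the end of the article, the vectorization step creates a TextVectorization layer and adapts it to the text in the training set:

# Vectorize the training dataset, building a vocabulary of at most VOCAB_SIZE tokens
VOCAB_SIZE = 5000
encoder = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))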

At the end of this step, the training dataset is vectorized and the data preparation phase is complete.

Now you’re ready to build the Recurrent Neural Network.

Building the Recurrent Neural Network

Your Recurrent Neural Network model is, in practice, a group of Sequential layers.

The first one is the vocabulary encoder, created in the previous step. It's used in the Embedding layer, which converts the values in the encoded vectors into a specific range.

For instance, the vocabulary in your reviews consists of 151 words, obtained by running:

len(encoder.get_vocabulary())

So the model takes the vocabulary size as input, via input_dim, and returns an output of size 64, defined using output_dim. And because the vocabulary size is 151, the largest integer in the mapping will be 150.

Loading data and Vectorizing training dataset and Building the RNN. (Image by author)

The next layer, Bidirectional, indicates you want to create a bidirectional Recurrent Neural Network. This means the input of the network is propagated forwards and backwards through the RNN layers. The memories the network creates over time, as it processes the input, are not only passed forward to the following cells, but also passed back to previous cells. Now, each cell in the network has information about both the past and what lies ahead in the input sequence.

The advantage of using a Bidirectional Recurrent Neural Network is that it is not just the previous information in the network that contributes to the output prediction. Knowing what is coming up ahead in the sequence can have a significant influence on how the model learns.

This advantage is a double-edged sword. As information is now propagated backwards and forwards, these networks tend to be much slower, because gradients now have a much longer dependency chain.

However, your parents' bed and breakfast reviews form a small dataset, so you've decided it's worth a try!

The last two layers in the model are Dense layers. The second to last is a fully-connected layer with the hyperbolic tangent as activation. The last one reshapes the output to be of size 1, given that you want the output of the model to be the positive or negative class index.

In this case, the RNN is created using a GRUCell with 30 units. This Gated Recurrent Unit (GRU) uses the hyperbolic tangent as the activation function for the recurrent step.
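
Putting these layers together, the model definition, taken from the full listing at the end of the article, looks like this:

# Build the Recurrent Neural Network
# using a GRU cell and the hyperbolic tangent as activation function
cell = tf.keras.layers.GRUCell(30, recurrent_activation='tanh')

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.RNN(cell)),
    tf.keras.layers.Dense(60, activation='tanh'),
    tf.keras.layers.Dense(1)
])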

Model Fitting and Evaluation

Now that you’ve built and compiled the Recurrent Neural Network, it’s time to fit it to the training dataset and make some predictions.

A few parameters you can tune are:

  • epochs how many times you’d like the algorithm to go through the entire training set
  • shuffle in case you’d like to shuffle the training data before each epoch iteration. True by default
  • validation_steps the number of validation batches to draw before the validation step is concluded and the algorithm starts a new epoch
From Loading data and Vectorizing training dataset to Building and evaluating the RNN. (Image by author)
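
The compile and fit calls, again taken from the full listing at the end of the article, look like this (shuffle is left at its default of True):

# Compile the model, using Adam as the optimization algorithm
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(1e-2),
    metrics=['accuracy']
)

# Fit the model: 10 passes over the training set, drawing 10
# validation batches at the end of each epoch
history = model.fit(
    train_dataset, epochs=10,
    validation_data=test_dataset, validation_steps=10
)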

Accuracy and Loss Visualized

As a final step, it’s always interesting to visualize the loss and accuracy of the model. Especially if it’s running through multiple iterations.

import os
import tensorflow as tf
import matplotlib.pyplot as plt

# Loading Training and Test Datasets
train_dir = os.path.join('', '../datasets/train')
train_dataset = tf.keras.utils.text_dataset_from_directory(
    train_dir, label_mode='int', labels='inferred', follow_links=True
)

test_dir = os.path.join('', '../datasets/test')
test_dataset = tf.keras.utils.text_dataset_from_directory(
    test_dir, label_mode='int', labels='inferred', follow_links=True
)

# Vectorize training dataset
VOCAB_SIZE = 5000
encoder = tf.keras.layers.TextVectorization(max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))

# Building the Recurrent Neural Network
# using a GRU cell and the hyperbolic tangent as activation function
cell = tf.keras.layers.GRUCell(30, recurrent_activation='tanh')

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.RNN(cell)),
    tf.keras.layers.Dense(60, activation='tanh'),
    tf.keras.layers.Dense(1)
])

# Compile the model and use the Adam algorithm as the optimization function
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(1e-2),
    metrics=['accuracy']
)

# Fitting the model
history = model.fit(
    train_dataset, epochs=10,
    validation_data=test_dataset, validation_steps=10
)

# Model Evaluation
test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)

# Visualize Model Loss and Accuracy
def plot_graphs(history, metric):
    # Plot the training metric and its validation counterpart per epoch
    plt.plot(history.history[metric])
    plt.plot(history.history['val_' + metric], '')
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, 'val_' + metric])

plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
plot_graphs(history, 'accuracy')
plt.ylim(None, 1)
plt.subplot(1, 2, 2)
plot_graphs(history, 'loss')
plt.ylim(0, None)

Here you're also leveraging code from the handy example in TensorFlow's documentation[6] to plot the loss and accuracy for the training and validation datasets.

Plotting model loss and accuracy throughout each epoch. (Image by author)

After making predictions on the test dataset, the accuracy is far from perfect. It hovers around 0.5, which means your model is, technically, not much better than a random guess.

The accuracy stays stable at around 50% throughout all epochs, and the loss starts steadily increasing after the third epoch.

But these results shouldn’t discourage you. Your dataset is very small and the RNN architecture used is very simplistic, and could definitely be refined. This is just a first run on understanding RNNs.

Conclusion

There are several different ways to build a recurrent neural network, depending on the task at hand.

It can have a one-to-many structure, like when the model has to create a caption for a given image. But if you’re translating English to French or vice-versa, you’ll be building a Recurrent neural network with a many-to-many structure.

For the Sentiment Analysis task of classifying the reviews from your parents' cozy bed and breakfast, the network had a many-to-one structure: several words in a review contributing to a single output, the sentiment class, either positive or negative.

The network you’ve created was relatively simple, and had an unimpressive 50% accuracy.

But I hope you got a better sense of what a Recurrent Neural Network is, why it is such a game-changing Deep Learning architecture, and the kinds of real-life problems it can be applied to.

Thanks for reading!

References

  1. Goodfellow, Ian J., Bengio, Yoshua, and Courville, Aaron. Deep Learning. MIT Press, 2016
  2. Karpathy, Andrej. The Unreasonable Effectiveness of Recurrent Neural Networks (2015)
  3. Heaton, Jeff. Applications of Deep Neural Networks
  4. Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient Estimation of Word Representations in Vector Space (2013)
  5. James, Gareth, Witten, Daniela, Hastie, Trevor, and Tibshirani, Robert. An Introduction to Statistical Learning: with Applications in R (2nd Edition). Springer, 2021
  6. Text classification with an RNN, TensorFlow Documentation
