Recurrent Neural Network - Head to Toe

Akhilesh Rai
Towards Data Science
8 min read · Oct 24, 2019


The neurone is a building block of the human brain. It analyses complex signals within microseconds and sends signals through the nervous system to perform tasks. The architecture of a neurone is the same for every neurone, which means the structural layers do not change from neurone to neurone. Stack these layers in succession and you could, in principle, replicate much of what our brain does. These successive “layers” help us with our daily activities, complex decision making and language processing.

But how do we generalise a problem across these layers? What kind of modelling is required to do so?

The answer came to researchers in the form of parameter sharing, which makes it possible to extend and apply a model to examples of different forms (here, different lengths). Each member of the output is produced by the same update rule applied to previous members of the output. An easier way to comprehend this structure of computations is to use ‘unfolded computational graphs’: unfolding the graph exposes how the parameters are shared across the deep structure of the network.

Consider the classical form of a dynamical system:
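
s(t) = f(s(t−1); θ)

where s(t) is the state of the system at time step t and θ is a set of parameters shared across all time steps.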

Unfolding the structure of this dynamical system we get:
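
s(3) = f(s(2); θ) = f(f(s(1); θ); θ)

(written here, for example, for t = 3 steps)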

As you can see, the unrolled expression no longer involves recurrence.

Recurrent neural networks can be built in different ways, and most of them have hidden units. When a recurrent neural network is trained to perform a task based on past inputs, the summary it keeps is necessarily lossy, as we are mapping an arbitrary-length sequence to a fixed-size vector h(t). Depending on the task at hand, the network may learn to selectively keep some aspects of the past sequence with more precision than others.

Unfolded graph: a recurrent network with no outputs. It processes the information from the input x by incorporating it into the state h that is passed forward through time.
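
In equation form, the state update of such a network is

h(t) = f(h(t−1), x(t); θ)

where h(t) is the hidden state at time t and x(t) is the input at time t.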

We can represent the unfolded recurrence after t steps with a function g(t).

The function g(t) takes the whole past sequence (x(t), x(t−1), …, x(1)) as input and produces the current state:
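
h(t) = g(t)(x(t), x(t−1), …, x(2), x(1)) = f(h(t−1), x(t); θ)

The unfolded function g(t) can therefore be factorised into repeated applications of the same function f.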

The unfolding process leads to two major factors that make it possible to learn a single model f that operates on all time steps and all sequence lengths, which in turn allows generalisation:

a) Regardless of the length of the input sequence, the learned model always has the same input size, because it is specified in terms of a transition from one state to another rather than in terms of a variable-length history of states.

b) It is possible to use the same transition function f with the same parameters at every time step.

The unfolded graph also provides an explicit description of which computations to perform, and it illustrates the flow of information both forward in time (computing outputs and losses) and backward in time (computing gradients) by showing the path along which this information flows.

These ideas were central to building recurrent neural networks. Important design patterns include RNNs that produce an output at each time step and have recurrent connections between hidden units, and RNNs with recurrent connections between hidden units that read an entire sequence and then produce a single output. Networks of the first kind are remarkably powerful: any function computable by a Turing machine can be computed by such a recurrent network of finite size. It is this use of past outputs and hidden-to-hidden connections that has earned the RNN its laurels today.

Recurrent neural network

With the knowledge of graph unrolling and parameter sharing, we can now develop the RNN. We assume the hyperbolic tangent activation function for the hidden units. A natural way to regard the output is as unnormalised log probabilities of each possible value of the target variable; we then apply the softmax as a post-processing step.

Computation graph for computing the training loss of a recurrent neural network. The sequence of output values o is compared to the training targets y, which leads to the computation of the loss. We assume o contains unnormalised log probabilities; the loss function L internally computes ŷ = softmax(o) and compares this to the target y. The RNN has input-to-hidden connections parameterised by a weight matrix U, hidden-to-hidden recurrent connections parameterised by a weight matrix W, and hidden-to-output connections parameterised by a weight matrix V.

The computation in RNN can be decomposed to three blocks of parameters:

1. From input to hidden state

2. From previous hidden state to present hidden state

3. From hidden state to the output

Each of these blocks is associated with a separate weight matrix. When the network is unfolded, each of these blocks corresponds to a shallow transformation (a transformation that can be represented by a single layer).
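
To make this concrete, here is a minimal sketch of how these three blocks show up as separate weight matrices in PyTorch (a single-layer version for simplicity, with the same dimensions as the model we build later):

# A minimal sketch of the three parameter blocks of a vanilla RNN.
import torch.nn as nn

rnn = nn.RNN(input_size=28, hidden_size=100, num_layers=1, batch_first=True)
fc = nn.Linear(100, 10)

U = rnn.weight_ih_l0  # input-to-hidden weights, shape (hidden_dim, input_dim)
W = rnn.weight_hh_l0  # hidden-to-hidden weights, shape (hidden_dim, hidden_dim)
V = fc.weight         # hidden-to-output weights, shape (output_dim, hidden_dim)
print(U.shape, W.shape, V.shape)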

Forward propagation of this model begins with a specification of the initial state h(0); then, for each time step from t = 1 to t = T, we apply the following update equations (the standard formulation of a vanilla RNN with a discrete output):
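
a(t) = b + W h(t−1) + U x(t)
h(t) = tanh(a(t))
o(t) = c + V h(t)
ŷ(t) = softmax(o(t))

where b and c are bias vectors and U, W and V are the weight matrices introduced above.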

This is an image of an RNN that maps an input sequence to an output sequence of the same length. The total loss is the sum of the losses over all time steps, where L(t) is the negative log-likelihood of y(t).
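
Written out, the total loss for a sequence of inputs x(1), …, x(T) paired with targets y(1), …, y(T) is

L = Σ_t L(t) = − Σ_t log p_model( y(t) | x(1), …, x(t) )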

The gradient computation involves a forward-propagation pass moving from left to right through the unrolled graph, followed by a back-propagation pass moving from right to left. The states computed in the forward pass must be stored until they are reused in the backward pass. The back-propagation algorithm applied to the unrolled graph, with the loss taken through the outputs o(t), is called ‘back-propagation through time’ (BPTT). Computing the gradient through a recurrent neural network is straightforward: one simply applies the generalised back-propagation algorithm to the unrolled computational graph. The gradients obtained by back-propagation may then be used to train the RNN.
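
As a small illustration (a hand-rolled toy example, not the nn.RNN module used later), back-propagation through time is just ordinary reverse-mode autodiff applied to the explicitly unrolled recurrence:

# Toy illustration of BPTT: unroll the recurrence explicitly and let autograd
# traverse the unrolled graph from right to left.
import torch

T, input_dim, hidden_dim = 5, 3, 4
U = torch.randn(hidden_dim, input_dim, requires_grad=True)
W = torch.randn(hidden_dim, hidden_dim, requires_grad=True)
V = torch.randn(1, hidden_dim, requires_grad=True)

x = torch.randn(T, input_dim)
h = torch.zeros(hidden_dim)

loss = 0.0
for t in range(T):                    # forward pass, left to right
    h = torch.tanh(U @ x[t] + W @ h)  # each state stays in the computational graph
    loss = loss + (V @ h).sum() ** 2  # accumulate a simple squared loss

loss.backward()                       # backward pass, right to left
print(U.grad.shape, W.grad.shape, V.grad.shape)  # gradients w.r.t. the shared weights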

Let's make the RNN model in Pytorch. I present the essential code here and shall post the entire code on my GitHub: Link

PyTorch MNIST Training

First, we import the required classes and load the MNIST dataset.

import torch
import torchvision
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.datasets as dsets
import numpy as np
from matplotlib import pyplot as plt

n_epochs = 3
batch_size_train = 64
batch_size_test = 1000
learning_rate = 0.01
momentum = 0.5
log_interval = 10

random_seed = 1
torch.backends.cudnn.enabled = False
torch.manual_seed(random_seed)

train_loader = torch.utils.data.DataLoader(
    dsets.MNIST('/Users/akhileshrai/Downloads', train=True, download=True,
                transform=torchvision.transforms.Compose([
                    torchvision.transforms.ToTensor(),
                    torchvision.transforms.Normalize(
                        (0.1307,), (0.3081,))
                ])),
    batch_size=batch_size_train, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    dsets.MNIST('/Users/akhileshrai/Downloads', train=False, download=True,
                transform=torchvision.transforms.Compose([
                    torchvision.transforms.ToTensor(),
                    torchvision.transforms.Normalize(
                        (0.1307,), (0.3081,))
                ])),
    batch_size=batch_size_test, shuffle=True)

examples = enumerate(test_loader)
batch_idx, (example_data, example_targets) = next(examples)

# Plot a few test samples (matplotlib.pyplot was already imported above as plt)
fig = plt.figure()
for i in range(6):
    plt.subplot(2, 3, i + 1)
    plt.tight_layout()
    plt.imshow(example_data[i][0], cmap='gray', interpolation='none')
    plt.title("Number: {}".format(example_targets[i]))
    plt.xticks([])
    plt.yticks([])
plt.show()

The data loaders load the MNIST dataset, downloading it into the specified folder. The transforms first convert the data to tensors and then normalise it.

Let’s look at the data by plotting a few samples from the test dataset.

Samples of the MNIST Dataset

Let's build our neural network based on the architecture shown above:

class RNNModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        super(RNNModel, self).__init__()
        # Hidden dimensions
        self.hidden_dim = hidden_dim

        # Number of hidden layers
        self.layer_dim = layer_dim

        # Building your RNN
        # batch_first=True causes input/output tensors to be of shape
        # (batch_dim, seq_dim, input_dim)
        # batch_dim = number of samples per batch
        self.rnn = nn.RNN(input_dim, hidden_dim, layer_dim, batch_first=True, nonlinearity='tanh')

        # Readout layer
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Initialize hidden state with zeros
        # (layer_dim, batch_size, hidden_dim)
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_()

        # Detach the initial hidden state so gradients are not propagated into
        # earlier batches (the idea behind truncated backpropagation through time, BPTT)
        out, hn = self.rnn(x, h0.detach())

        # Index hidden state of last time step
        # out.size() → (batch_size, 28, hidden_dim)
        # out[:, -1, :] → (batch_size, hidden_dim) → just want last time step hidden states!
        out = self.fc(out[:, -1, :])
        # out.size() → (batch_size, 10)
        return out

input_dim = 28
hidden_dim = 100
layer_dim = 3
output_dim = 10

model = RNNModel(input_dim, hidden_dim, layer_dim, output_dim)

print(model)
print(len(list(model.parameters())))
for i in range(len(list(model.parameters()))):
    print(list(model.parameters())[i].size())

Our model has 3 hidden layers with 100 hidden neurons per layer; it takes in 28-dimensional input at each time step and produces a 10-dimensional output. The activation function we assume is the hyperbolic tangent, ‘tanh’. Stochastic gradient descent is used, which estimates the gradient of the cost function from a single example (or mini-batch) at each iteration, instead of summing the gradient of the cost function over all the examples.
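
In equation form, each SGD step updates the parameters θ with learning rate η using the gradient of the loss on a single example (or mini-batch) (x(i), y(i)):

θ ← θ − η ∇θ L(θ; x(i), y(i))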

The model is then trained over 5000 iterations, and a method called early stopping is used to prevent the model from ‘overfitting’.

Early stopping is a regularisation technique based on monitoring the validation loss: if the validation loss does not decrease over a specified number of epochs, the model halts its training.

learning_rate = 0.01
min_val_loss = np.Inf
epochs_no_improve = 0
n_epochs_stop = 2
early_stop = False

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

# Number of steps to unroll
seq_dim = 28
iter = 0

for epoch in range(n_epochs):

    val_loss = 0
    for i, (images, labels) in enumerate(train_loader):
        # Load images as a torch tensor with gradient accumulation abilities
        images = images.view(-1, seq_dim, input_dim).requires_grad_()

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        # outputs.size() → (batch_size, 10)
        outputs = model(images)

        # Calculate loss: softmax → cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        val_loss += loss.item()
        iter += 1

        if iter % 500 == 0:
            # Calculate accuracy
            correct = 0
            total = 0
            # Iterate through test dataset
            for images, labels in test_loader:
                # Resize images
                images = images.view(-1, seq_dim, input_dim)

                # Forward pass only to get logits/output
                outputs = model(images)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += labels.size(0)

                # Total correct predictions
                correct += (predicted == labels).sum()

            accuracy = 100 * correct / total

            # Print loss and accuracy
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iter, loss.item(), accuracy))

    # Average the accumulated loss over the epoch
    val_loss = val_loss / len(train_loader)

    # If the loss is at a minimum
    if val_loss < min_val_loss:
        # Save the model
        # torch.save(model)
        epochs_no_improve = 0
        min_val_loss = val_loss
    else:
        epochs_no_improve += 1

    # Check early stopping condition
    if epochs_no_improve == n_epochs_stop:
        print('Early stopping!')
        early_stop = True

    if early_stop:
        print('Stopped')
        break

Description: training the RNN using early stopping.

Model Investigation and Conclusion:

The model does well at detecting handwritten digits, which have no relation to previously seen digits. If the inputs were correlated across the sequence (with both long-term and short-term correlations), a vanilla RNN of this kind would not do well.

The back-propagation-through-time algorithm is expensive: the forward pass is inherently sequential, so its runtime is O(T), and the states computed in the forward pass must be stored until they are reused in the backward pass, so the memory cost is also O(T).

Another problem RNNs can face is ‘cliffs’: highly nonlinear functions tend to have derivatives that are either very large or very small in magnitude, so the loss surface contains very steep regions. ‘Clipping gradients’ is a technique used to make gradient descent behave more reasonably by restricting the step size.
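
As a rough sketch, gradient clipping could be slotted into the training loop above between the backward pass and the optimiser step (the max_norm value of 1.0 here is an arbitrary choice):

# Clip the overall gradient norm before the parameter update to restrict the step size.
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()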

I hope this gave you an insight into how RNNs are built and how their intuition shaped the world of deep learning.

Some good research papers and books to get you started:

  1. Alex Graves: Generating Sequences With Recurrent Neural Networks.
  2. Ian Goodfellow, Yoshua Bengio and Aaron Courville: Deep Learning.
  3. Christopher Bishop: Neural Networks for Pattern Recognition.
  4. Yoshua Bengio, Patrice Simard and Paolo Frasconi: Learning Long-Term Dependencies with Gradient Descent is Difficult.

As always, comments are welcome.
