LSTMs were, and still are, an essential part of the growth of machine learning. They allow for temporal pattern recognition, in the same way that convolutional networks allow for spatial pattern recognition. Yet ever since it was proposed, the architecture has mostly remained the same. How much better could it be if we experimented with its architecture and features?
A few months ago, I built a trainable LSTM from scratch. It will serve as the control for this experiment: I will add, remove, magnify, and downscale certain features and observe their effect on convergence.
Let me explain what each of the features means specifically:
Architecture:
The input of an LSTM passes through a certain sequence of operations before an output is produced. Here are the parts and how they work:
General Gate:
The general gate is an ordinary neural network that takes in the input to the LSTM. It kickstarts the process by which the data is passed through a set of different gates.
Select Gate:
The select gate is a twin neural network that takes the same input as the general gate. Although all the gates share the same architecture, it is the way each one is connected that gives it its purpose.
The select gate has a sigmoid as its activation function, meaning that all values in its output lie between 0 and 1. This output is multiplied by the output from the general gate. Since both neural networks' architectures are identical, the output shapes of the networks are the same. This means that each value produced by the select gate can be matched up with a value from the general gate.
We can therefore read a 1 from the select gate as "give this value full weight in the output", a 0 as "give it no weight in the output", and every value in between as something in between.
See? By observing the use of the sigmoid function and the multiply gate, we understand how the same neural network from the general gate can be repurposed to form the select gate.
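As a tiny numpy illustration (the numbers here are made up), this elementwise gating is all it amounts to:
general_out = np.array([2.0, -1.0, 0.5])   # output of the general gate
select_out = np.array([1.0, 0.0, 0.5])     # sigmoid outputs of the select gate
print(general_out * select_out)            # [ 2.   -0.    0.25]: kept, dropped, halved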
Forget Gate:
The forget gate (wait for it…) shares its architecture with the general gate and the select gate. The magic (again) comes from the positioning and the activation function.
The forget gate has the same positioning and sigmoid activation as the select gate, but it adds one more ingredient: its output is multiplied by a value called the collected possibilities, which is carried over from the previous propagation of the network. This is what makes the forget gate a forget gate: it can draw from its "memory".
The output of the forget gate (after the multiplication with the previous propagation) is added to the general output. In other words, a scaled version of the previous propagation's output is added to the general output, much like current memories are tinted by previous ones.
As you go through these gates, observe how they start to build up an ability to access temporal features of the data.
Here is a comprehensive look at this process:

- The input data is simultaneously fed into the three different neural networks.
- The general output uses the hyperbolic tangent function to prevent exploding gradients.
- The forget gate uses sigmoid and is multiplied by the collected possibilities. It is then added to the general output.
- The select gate uses the sigmoid function and is then multiplied by the previous value (general output + forget gate).
A few important things to note:
- The line along which the parts are added to the output of the general gate is called the cell state. This is important because, when the network is adapted later on, it lets you see exactly how each part is connected to the others.
- When I said that the networks are identical, I was only referring to their configuration, not their weights.
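To tie the bullets and notes together, here is a minimal numpy sketch of that forward pass; general_net, forget_net, and select_net are hypothetical stand-ins for the three networks, not part of the implementation built below:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x, collected, general_net, forget_net, select_net):
    # general gate: tanh-activated network output
    general_out = np.tanh(general_net(x))
    # forget gate: its sigmoid output scales the memory from the previous step
    forget_out = sigmoid(forget_net(x)) * collected
    # the cell state: forget-gated memory added to the general output
    cell = general_out + forget_out
    # select gate: its sigmoid output decides how much of the cell state to emit
    pred = np.tanh(cell) * sigmoid(select_net(x))
    return pred, cell   # cell becomes the collected possibilities for the next step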
Let’s build this vanilla LSTM network from scratch:
Step 1| Prerequisites:
import numpy as np
from matplotlib import pyplot as plt

def sigmoid(x):
    return 1/(1+np.exp(-x))
def sigmoid_p(x):
    return sigmoid(x)*(1-sigmoid(x))
# relu and its derivative are defined here so that deriv_func can reference them
def relu(x):
    return np.maximum(0, x)
def relu_p(x):
    return (x > 0).astype(float)
def tanh(x):
    return np.tanh(x)
def tanh_p(x):
    return 1.0 - np.tanh(x)**2

def deriv_func(z, function):
    if function == sigmoid:
        return sigmoid_p(z)
    elif function == relu:
        return relu_p(z)
    elif function == tanh:
        return tanh_p(z)
For the program to work, it needs numpy for array manipulation and matplotlib for plotting the loss values. This step also contains the mathematical definitions of the activation functions, as well as their derivatives.
Step 2| LSTM class:
class LSTM:
    def __init__(self,network):
        def plus_gate(x,y):
            return np.array(x) + np.array(y)
        def multiply_gate(x,y):
            return np.array(x) * np.array(y)
        class NeuralNetwork:
            def __init__(self,network):
                self.weights = []
                self.activations = []
                for layer in network:
                    input_size = layer[0]
                    output_size = layer[1]
                    activation = layer[2]
                    index = network.index(layer)
                    if layer[3] == 'RNN':
                        increment = network[-1][1]
                    else:
                        increment = 0
                    self.weights.append(np.random.randn(input_size+increment,output_size))
                    self.activations.append(activation)
The best programs have a rigid structure. The LSTM class has an init section containing all the important parts of the program: the plus gate and the multiply gate, as well as the start of the neural network class, so that it can easily be changed.
The neural network holds the definitions of the weights, as well as the activation function at each layer.
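For reference, each layer in the network list is given as [input_size, output_size, activation, type]. Here is a hypothetical spec of my own (the sizes are just an example); the 'RNN' flag widens that layer's input by the final layer's output size, making room for the previous prediction:
# A hypothetical 2 -> 4 -> 1 spec; the 'RNN' flag makes the first layer expect
# the input concatenated with the previous time step's prediction.
network = [[2, 4, tanh, 'RNN'],
           [4, 1, sigmoid, '']]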
Step 3| Propagation:
# Method of the NeuralNetwork class defined above:
def propagate(self,data):
    input_data = data
    Zs = []
    As = []
    for i in range(len(self.weights)):
        z = np.dot(input_data,self.weights[i])
        if self.activations[i]:
            a = self.activations[i](z)
        else:
            a = z
        As.append(a)
        Zs.append(z)
        input_data = a
    return As,Zs
This section is fairly self-explanatory: every neural network needs propagation to produce results. This is the basic forward pass that all perceptron-type networks share: matrix multiplication followed by an activation.
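In essence, each loop iteration performs one matrix multiplication followed by an activation. A standalone sketch of a single layer's step (the shapes here are arbitrary):
x = np.random.randn(3)      # input vector to this layer
W = np.random.randn(3, 4)   # weights of a 3 -> 4 layer
z = np.dot(x, W)            # pre-activation, shape (4,)
a = tanh(z)                 # activation applied elementwise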
Step 4| Training:
# Method of the NeuralNetwork class:
def network_train(self, As,Zs,learning_rate,input_data,extended_gradient):
    As.insert(0,input_data)
    g_wm = [0] * len(self.weights)
    for z in range(len(g_wm)):
        a_1 = As[z].T
        pre_req = extended_gradient
        z_index = 0
        weight_index = 0
        for i in range(0, z*-1 + len(self.weights)):   # len(self.weights) == number of layers
            if i % 2 == 0:
                z_index -= 1
                if self.activations[z]:
                    pre_req = pre_req * deriv_func(Zs[z_index],self.activations[z])
                else:
                    pre_req = pre_req * Zs[z_index]
            else:
                weight_index -= 1
                pre_req = np.dot(pre_req,self.weights[weight_index].T)
        a_1 = np.reshape(a_1,(a_1.shape[0],1))
        pre_req = np.reshape(pre_req,(pre_req.shape[0],1))
        pre_req = np.dot(a_1,pre_req.T)
        g_wm[z] = pre_req
    for i in range(len(self.weights)):
        self.weights[i] += g_wm[i]*learning_rate

# Back in LSTM.__init__, after the NeuralNetwork class definition:
self.plus_gate = plus_gate
self.multiply_gate = multiply_gate
self.recurrent_nn = NeuralNetwork(network)
self.forget_nn = NeuralNetwork(network)
self.select_nn = NeuralNetwork(network)
The training step is conceptually simple: it finds the partial derivative of each weight with respect to the loss function. How does it bridge such a large mathematical gap? The program takes it step by step.
First, it finds the partial derivative of the layer output with respect to the weight, then of the network output with respect to the layer output, then of the loss function with respect to the network output. By the chain rule, multiplying all of these derivatives together gives the derivative linking the weight to the loss function. The actual implementation is more complicated, since the program needs to track which weight and layer it is working on, and where in the network it currently is.
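To make the chain rule concrete, here is a minimal, self-contained sketch for a single sigmoid layer with a squared-error loss (the input, target, and learning rate are made-up values; the sign convention follows the article's (y - pred) * 2):
x = np.array([0.5, -0.2])                 # input to the layer
W = np.random.randn(2, 1)                 # the layer's weights
y = np.array([1.0])                       # target
z = np.dot(x, W)                          # pre-activation
yhat = sigmoid(z)                         # layer output
dloss_dyhat = (y - yhat) * 2              # loss -> output
dyhat_dz = sigmoid_p(z)                   # output -> pre-activation
dz_dW = x                                 # pre-activation -> weights
grad_W = np.outer(dz_dW, dloss_dyhat * dyhat_dz)   # chain rule: multiply the pieces together
W += grad_W * 0.1                         # same update convention as network_train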
Step 5| Defining network parts:
def cell_state(self,input_data,memo,select):
    global rnn_As,rnn_Zs
    rnn_As,rnn_Zs = self.recurrent_nn.propagate(input_data)
    yhat_plus = tanh(rnn_As[-1])
    plus = self.plus_gate(yhat_plus,memo)
    collect_poss = plus
    yhat_mult = tanh(plus)
    mult = self.multiply_gate(yhat_mult,select)
    pred = mult
    return pred,collect_poss

def forget_gate(self,input_data,colposs):
    global forget_As,forget_Zs
    forget_As,forget_Zs = self.forget_nn.propagate(input_data)
    yhat_mult = sigmoid(forget_As[-1])
    mult = self.multiply_gate(colposs,yhat_mult)
    memo = mult
    return memo

def select_gate(self,input_data):
    global select_As,select_Zs
    select_As,select_Zs = self.select_nn.propagate(input_data)
    yhat_mult = sigmoid(select_As[-1])
    select = yhat_mult
    return select
This is the implementation of the parts described above. Each function defines how the input is handled inside its gate, so that the gates can be wired into the cell state to produce the network's output.
Step 6| Defining full LSTM propagation:
def propagate(self,X,network):
    colposs = 1
    As = []
    for i in range(len(X)):
        input_data = X[i]
        if i == 0:
            increment = network[-1][1]
            input_data = list(input_data) + [0 for _ in range(increment)]
        else:
            input_data = list(input_data) + list(pred)
        input_data = np.array(input_data)
        memory = self.forget_gate(input_data,colposs)
        select = self.select_gate(input_data)
        pred,colposs = self.cell_state(input_data,memory,select)
        As.append(pred)
    return As
This section implements the propagation process described earlier: each gate adds its contribution to the cell state, where the final output is formed.
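As a hypothetical usage sketch (the spec, sequence length, and inputs are my own example values, reusing the layer-spec format from Step 2):
network = [[2, 4, tanh, 'RNN'],
           [4, 1, sigmoid, '']]
lstm = LSTM(network)
X = [np.random.randn(2) for _ in range(5)]   # a toy sequence of five 2-dimensional inputs
predictions = lstm.propagate(X, network)     # one prediction per time step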
Step 7| Training the LSTM in full:
def train(self,X,y,network,iterations,learning_rate):
    colposs = 1
    loss_record = []
    for _ in range(iterations):
        for i in range(len(X)):
            input_data = X[i]
            if i == 0:
                increment = network[-1][1]
                input_data = list(input_data) + [0 for _ in range(increment)]
            else:
                input_data = list(input_data) + list(pred)
            input_data = np.array(input_data)
            memory = self.forget_gate(input_data,colposs)
            select = self.select_gate(input_data)
            pred,colposs = self.cell_state(input_data,memory,select)
            loss = sum(np.square(y[i]-pred).flatten())
            gloss_pred = (y[i]-pred)*2
            gpred_gcolposs = select
            gpred_select = colposs
            gloss_select = gloss_pred * gpred_select
            gpred_forget = select*sigmoid_p(colposs)*colposs
            gloss_forget = gloss_pred * gpred_forget
            gpred_rnn = select*sigmoid_p(colposs)
            gloss_rnn = gloss_pred*gpred_rnn
            self.recurrent_nn.network_train(rnn_As,rnn_Zs,learning_rate,input_data,gloss_rnn)
            self.forget_nn.network_train(forget_As,forget_Zs,learning_rate,input_data,gloss_forget)
            self.select_nn.network_train(select_As,select_Zs,learning_rate,input_data,gloss_select)
        As = self.propagate(X,network)
        loss = sum(np.square(y[i]-pred))
        loss_record.append(loss)
    return loss_record
This part of the network is the most intricate. I need to calculate the partial derivative of each individual weight with respect to the LSTM's loss function, which means chaining many partial derivatives: from the output to the loss function, from the output to each gate, from each gate to its neural network, from each network to each layer, and finally from each layer to each weight. It is a lot of manual bookkeeping to feed the right variables to the right place.
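Continuing the hypothetical sketch from the previous step (the targets, iteration count, and learning rate are made-up values):
y = [np.array([0.5]) for _ in range(5)]      # a toy target for each time step
loss_record = lstm.train(X, y, network, iterations=200, learning_rate=0.05)
plt.plot(loss_record)                        # loss recorded per iteration
plt.show()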
Now that we have the vanilla LSTM, let's talk about the features I experimented with:
Resistors and Strengtheners:
Resistors and strengtheners are among the simplest features to add to an LSTM. A resistor in an LSTM works like one in an electrical circuit: it decreases the strength of a signal by a fixed amount. Let's see what changes when we put resistors at different parts of the LSTM.
def resistor(x):
    resistance = 1/resistor_strength          # resistor_strength is set globally
    resist = np.full(x.shape, resistance)
    return x*resist
Pretty simple! We generate an array of values in the same shape as the value it will be applied to. A global value called resistor_strength controls how much the signal is scaled down: the higher the resistor strength, the lower the resulting values.
self.resistor = resistor
Remember to add this line within the LSTM's init function so the resistor can be referenced.
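As a quick sanity check on a dummy array (the resistor_strength value here is just an example):
resistor_strength = 2.0
print(resistor(np.array([0.4, 0.8])))   # prints [0.2 0.4]: each value is halved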
def forget_gate(self,input_data,colposs):
    global forget_As,forget_Zs
    forget_As,forget_Zs = self.forget_nn.propagate(input_data)
    yhat_mult = sigmoid(forget_As[-1])
    mult = self.multiply_gate(colposs,yhat_mult)
    memo = self.resistor(mult)
    return memo
I placed the resistor inside the forget gate to try to dampen its effect. This is not only testing the resistor's effect; where the resistor is placed also tests how important that value is to the training of the network.
Let’s compare the loss of the LSTM on a generated dataset:

This is a graph of the resistance against the minimum loss reached. Clearly, resistance does not help convergence!

In fact, a negative resistance value (strengthening) is actually more effective than the neutral resistance of 1.
Have we concluded that resistance is bad for networks?
Observe this graph when the resistance is applied on the select gate instead:
def select_gate(self,input_data):
    global select_As,select_Zs
    select_As,select_Zs = self.select_nn.propagate(input_data)
    yhat_mult = sigmoid(select_As[-1])
    select = self.resistor(yhat_mult)
    return select

The results are inconclusive. Like other parts of the LSTM, these additions only work when placed in particular parts of the network.
Architecture Restructuring:
As I have already stated, the function of each neural network comes from its positioning relative to the other neural networks in the LSTM. I will now create an "ignore" gate that should be able to perform its namesake function.
The ignore gate is placed in between the forget gate and the general RNN. It uses the sigmoid activation function, together with a multiplication gate through which its output is multiplied with the general neural network's output.
How does this give the LSTM the ability to ignore? The gate produces a matrix of values I call the "filtered possibilities". Its sigmoid output, much like the select gate's, expresses how much the network approves of each value in the general neural network's output. Since this sits so low down in the cell state, the collected possibilities used for memory are now based on the filtered possibilities. The LSTM has therefore gained a form of selective memory.
Here is the code that I used to implement this:
Defining propagation of ignore gate:
def ignore_gate(self,input_data):
    global ignore_As,ignore_Zs
    ignore_As,ignore_Zs = self.ignore_nn.propagate(input_data)
    ignore = sigmoid(ignore_As[-1])
    return ignore
Adding to the init function of the LSTM:
self.plus_gate = plus_gate
self.multiply_gate = multiply_gate
self.recurrent_nn = NeuralNetwork(network)
self.forget_nn = NeuralNetwork(network)
self.select_nn = NeuralNetwork(network)
self.ignore_nn = NeuralNetwork(network)
self.resistor = resistor
Adding to LSTM propagation function:
def propagate(self,X,network):
    colposs = 1
    As = []
    for i in range(len(X)):
        input_data = X[i]
        if i == 0:
            increment = network[-1][1]
            input_data = list(input_data) + [0 for _ in range(increment)]
        else:
            input_data = list(input_data) + list(pred)
        input_data = np.array(input_data)
        ignore = self.ignore_gate(input_data)
        memory = self.forget_gate(input_data,colposs)
        select = self.select_gate(input_data)
        # note: cell_state must also be updated to accept and apply the ignore gate's output
        pred,colposs = self.cell_state(input_data,ignore,memory,select)
        As.append(pred)
    return As
Adding to the training function (new section with manually calculated derivatives):
def train(self,X,y,network,iterations,learning_rate):
    colposs = 1
    loss_record = []
    for _ in range(iterations):
        for i in range(len(X)):
            input_data = X[i]
            if i == 0:
                increment = network[-1][1]
                input_data = list(input_data) + [0 for _ in range(increment)]
            else:
                input_data = list(input_data) + list(pred)
            input_data = np.array(input_data)
            ignore = self.ignore_gate(input_data)
            memory = self.forget_gate(input_data,colposs)
            select = self.select_gate(input_data)
            pred,colposs = self.cell_state(input_data,ignore,memory,select)
            loss = sum(np.square(y[i]-pred).flatten())
            gloss_pred = (y[i]-pred)*2
            gpred_gcolposs = select
            gpred_select = colposs
            gloss_select = gloss_pred * gpred_select
            gpred_forget = select*sigmoid_p(colposs)
            gloss_forget = gloss_pred * gpred_forget
            gpred_ignore = select*sigmoid_p(colposs)*yhat_mult
            gloss_ignore = gloss_pred * gpred_ignore
            gpred_rnn = select*sigmoid_p(colposs)*ignore
            gloss_rnn = gloss_pred*gpred_rnn
            self.recurrent_nn.network_train(rnn_As,rnn_Zs,learning_rate,input_data,gloss_rnn)
            self.ignore_nn.network_train(ignore_As,ignore_Zs,learning_rate,input_data,gloss_ignore)
            self.forget_nn.network_train(forget_As,forget_Zs,learning_rate,input_data,gloss_forget)
            self.select_nn.network_train(select_As,select_Zs,learning_rate,input_data,gloss_select)
        As = self.propagate(X,network)
        loss = sum(np.square(y[i]-pred))
        loss_record.append(loss)
    return loss_record
Conclusion:
What I have done with resistance, strengthening, and architectural restructuring of the LSTM is just the beginning! I am sure you have the creativity and intuition to do something truly good with the basic framework I have given you in this article. You might even come up with some new features of your own!
My links:
If you want to see more of my content, click this link.