Creating Deep Neural Networks from Scratch, an Introduction to Reinforcement Learning

Part II: Reinforcement Learning and Backpropagation

Abhav Kedia
Towards Data Science


A dog learning to play fetch [Photo by Humphrey Muleba on Unsplash]

This post is the second of a three-part series that gives a detailed walk-through of a solution to the CartPole-v1 problem on OpenAI Gym, using only numpy from the Python libraries.

The first part laid the foundations, creating an outline of the program and building the feed-forward functions to propagate the state of the environment to its action values. This part will focus on the theory behind cumulative reward and action values in reinforcement learning and on building the backpropagation mechanism. These are foundational pieces for our agent’s learning process.

By the end of the previous article, we had a simple program loop: on every time step, the agent observes the state of the environment and passes it through its (randomly initialized) neural network to obtain predicted values for each action. It then, with probability 1-epsilon, picks the action with the greatest predicted value, and the process repeats on the next time step.
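
As a quick refresher, the selection logic from Part I looks roughly like the sketch below. This is a simplified sketch, not a verbatim copy of the Part I code; it assumes the agent exposes self.epsilon and that forward returns a 1-D array of action values, as it does elsewhere in this series.

import numpy as np

def select_action(self, observation):
    # Predicted value of each action for the current observation
    values = self.forward(observation, remember_for_backprop=True)
    # Explore with probability epsilon, otherwise act greedily
    if np.random.random() < self.epsilon:
        return np.random.randint(len(values))
    return np.argmax(values)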

At the moment, however, the agent does not know what happened after a particular action was taken because there is no feedback. When we are training a pet, for example, it is important that we create feedback loops of positive or negative reinforcement depending on whether their actions are desirable or undesirable. A dog remembers that fetching the ball earned him a treat (a valuable reward) in the past, and is more likely to prioritize fetching the next time a similar situation arises. Similarly, it is important that we implement memory in our program such that the agent can keep track of the rewards and the resulting state after taking actions.

Experience Replay

We’ll do this by adding experiences to the agent’s memory and using them to improve the agent, in a process called experience replay. Modify the main program loop as follows,

# The main program loop
for i_episode in range(NUM_EPISODES):
    observation = env.reset()
    # Iterating through time steps within an episode
    for t in range(MAX_TIMESTEPS):
        env.render()
        action = model.select_action(observation)
        prev_obs = observation
        observation, reward, done, info = env.step(action)
        # Keep a store of the agent's experiences
        model.remember(done, action, observation, prev_obs)
        model.experience_replay(20)
        # epsilon decay
        ...

We will also add the relevant code to the RLAgent, first in the init function,

self.memory = deque([], maxlen=1000000)

and the declaration of remember,

def remember(self, done, action, observation, prev_obs):
    self.memory.append([done, action, observation, prev_obs])

Note that this memory implementation uses deque, a simple data structure that allows us to ‘append’ memories and keeps track of only the latest n entries, where n is the maximum size of the deque. deque lives in the collections module, so we also need to add its import statement,

from collections import deque
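
To see the ‘latest n entries’ behaviour concretely, here is a tiny standalone example (illustrative only, not part of the agent):

from collections import deque

recent = deque([], maxlen=3)  # keep only the 3 most recent items
for i in range(5):
    recent.append(i)
print(recent)  # deque([2, 3, 4], maxlen=3): the oldest entries were discarded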

Finally, we add the experience_replay method that will help the agent learn from experiences to improve its gameplay,

1 def experience_replay(self, update_size=20):
2     if (len(self.memory) < update_size):
3         return
4     else:
5         batch_indices = np.random.choice(len(self.memory), update_size)
6         for index in batch_indices:
7             done, action_selected, new_obs, prev_obs = self.memory[index]
8             action_values = self.forward(prev_obs, remember_for_backprop=True)
9             next_action_values = self.forward(new_obs, remember_for_backprop=False)
10            experimental_values = np.copy(action_values)
11            if done:
12                experimental_values[action_selected] = -1
13            else:
14                experimental_values[action_selected] = 1 + self.gamma*np.max(next_action_values)
15            self.backward(action_values, experimental_values)
16        self.epsilon = self.epsilon if self.epsilon < 0.01 else self.epsilon*0.995
17        for layer in self.layers:
18            layer.lr = layer.lr if layer.lr < 0.0001 else layer.lr*0.995

There’s a lot to unpack in this method, but let’s go through it step by step.

First, there’s a simple check to see if we have enough experiences to start learning. If not, we wait until we do.

Next, we randomly sample from all our stored memories and get the indices of update_size memories. For each of these indices, we retrieve (line 7) the memory datum associated with it.

We then calculate 3 things —

  1. action_values that are calculated with the prev_obs variable,
  2. next_action_values, which are calculated with the new_obs variable and are used for calculating the experimental value in the next step, and
  3. experimental_values (explained in the section on Cumulative Reward and Value Functions)

Once these values have been calculated, we feed the experimental values and the action value predictions to a self.backward function that will calculate the difference between these values and use that to make changes to the weight matrices. The backward function is examined and implemented in a later section on Backpropagation.

Finally, we update the epsilon (rate of exploration) and learning rate variables for the system. The learning rate is a property used by the backpropagation algorithm that determines the size of the step it takes during learning. Note that we have moved the epsilon update to this method from its original place in the main loop.

Action and Experimental Values

The code block pasted above has 3 calculations on lines 8–14. We’ll now go through each of these.

The first, action_values, is the agent’s current estimate of the value of each action in a given state (prev_obs) — the same value calculated and used in the select_action function from Part I.

The second calculation is next_action_values. This is the set of predicted values for both actions from the next state (new_obs), i.e. the state obtained after the agent took an action during the episode that created this memory.

next_action_values is only a temporary variable used in the subsequent calculation of experimental_values. This is the target value that our agent has learnt from this particular ‘experience’, and it takes two forms:

  1. [lines 11–12] If the pole has tipped over in the next observation, then the episode ends and the sum of all future rewards after this state is (-1).
  2. [lines 13–14] If the pole has not tipped over in the next observation, then the expected future reward is the sum of the immediate reward observed (+1, since the pole has not tipped over) and a discounted — by a factor of gamma — proportion of expected future reward from that state.

Note that these two forms for the experimental value are only applied to the action that was selected during the episode. The other actions (only one in the cartpole problem) are not updated since we do not have any new experimental knowledge about those actions from the given state.

The experimental_values quantity is crucial. It captures new empirical information that the agent has learned about the value of taking an action in a given state.
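
To make the two forms concrete, here is a small worked example with made-up numbers for a two-action agent and gamma = 0.95 (illustrative only):

import numpy as np

gamma = 0.95
action_values = np.array([12.4, 11.9])       # current estimates for the two actions
next_action_values = np.array([11.8, 12.1])  # estimates from the next state
action_selected = 0                          # the action that was actually taken

# Case 1: the pole tipped over (done is True), so the target for that action is -1
experimental_values = np.copy(action_values)
experimental_values[action_selected] = -1
# -> [-1.0, 11.9]; the other action's value is left untouched

# Case 2: the episode continued, so the target is +1 plus the discounted best next value
experimental_values = np.copy(action_values)
experimental_values[action_selected] = 1 + gamma * np.max(next_action_values)
# -> [12.495, 11.9]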

Cumulative Reward and Value Functions

This section will take a step back to formalize some reinforcement learning theory that is implicit in the code we have written so far. First, note that so far we have talked about values of actions in a particular state — but what does that really mean? How is that value obtained? Why is the variable experimental_values calculated in the way that it is?

The rest of this section discusses some reinforcement learning theory. It references freely and borrows heavily from a great paper published in 2015 that showed the power of Deep Q-Networks, using a single architecture to play Atari 2600 games. If this does not interest you, feel free to skip to the Backpropagation section, where we continue with the implementation.

We begin by defining the ‘cumulative reward’ or ‘return’ at a particular time step t as the (discounted) sum of all future rewards from that time step until the end of the episode. Formally,

R_t = \sum_{t'=t}^{T} \gamma^{\,t'-t} \, r_{t'}

where r_{t'} is the immediate reward received at time step t', and \gamma is the discounting factor. We also define a quantity called the optimal action value function Q*(s,a),

Q^*(s,a) = \max_{\pi} \, \mathbb{E}\left[ R_t \mid s_t = s,\; a_t = a,\; \pi \right]

This is a maximum of the ‘expected’ cumulative reward for the state-action pair (sₜ,aₜ) over all policies, where a policy defines what action must be taken in a given state. It represents the ‘true’ value of an action in a given state — i.e. the maximum expected return starting from a particular state s, selecting action a, and then playing optimally thereafter with perfect knowledge of the environment. The expectation notation captures the fact that the environment may be stochastic in general.

The optimal action value function above obeys an important identity called the Bellman equation,

Q^*(s,a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s',a') \;\middle|\; s, a \,\right]
What this identity means is that if we know the optimal action values for all the actions from the resulting state s’ after taking action a in state s, then the optimal action value function Q*(s,a) is the expected sum of the immediate reward observed after this action and the (discounted) maximum optimal action value over all actions a’ that can be taken from state s’. The expectation notation captures the fact that s’ (and therefore also r) may be probabilistically determined by the initial state s and the action a.

Most reinforcement learning agents learn by using the Bellman equation as an iterative update, which in our case would be

Q_{i+1}(s,a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q_i(s',a') \;\middle|\; s, a \,\right],
a quantity that will converge to the optimal action value function Q*(s,a) as i tends to infinity. Despite its similarity to line 14 in the code block above, there is a difference between what we are doing and what this equation represents. This iterative update suggests we keep a table of all possible state action pairs and update each value in that table incrementally. This is practically impossible since we have a continuous state space! Instead, we are using a neural network based function approximator Q(s,a;W) to estimate the action-value function, where W represents the weights in our network. This function approximator is called a Q-network — a deep Q-network or DQN in our case. We have indeed been building a DQN all this time!
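
For contrast, a tabular version of that iterative update would look something like the following hypothetical sketch, which only works for a small, discrete state space (exactly what we do not have here):

import numpy as np

# Hypothetical tabular Q-learning update, for illustration only
num_states, num_actions = 10, 2
Q = np.zeros((num_states, num_actions))
alpha, gamma = 0.1, 0.95  # step size and discount factor

def tabular_update(s, a, r, s_next, done):
    # Bellman target: immediate reward plus the discounted best value of the next state
    target = r if done else r + gamma * np.max(Q[s_next])
    # Nudge the stored estimate towards the target
    Q[s, a] += alpha * (target - Q[s, a])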

Instead of updating all state-action pairs, it is more practical to iteratively minimize the expectation of the loss function, defined as:

L_i(W_i) = \mathbb{E}_{s,a,r,s'}\left[ \left( r + \gamma \max_{a'} Q(s',a';\, W_{i-1}) - Q(s,a;\, W_i) \right)^2 \right]
There are two points to note here,

  1. Instead of calculating and minimizing the total expected loss for this value, given all (s,a,r,s’) combinations from the agent’s memory, we use stochastic gradient descent to calculate the loss for a single sample (the memory datum that we retrieve in line 7) in each step.
  2. The subscripts (i - 1) and (i) on the weights W in the equation above indicate that a snapshot of the weights is kept and used to compute the targets for the next update. That is the ideal, complete implementation, and we will most likely get to it in Part III, but for now we will not implement this detail: we use a single set of weights for calculating both the predicted and the target values (i.e. experimental_values in the code block above). In reality, running stochastic gradient descent on the loss function above without a fixed network for calculating target values ends up minimizing a sum of the empirical loss and the variance of the target value, which is not what we want (we only want to minimize the empirical loss). You can read more about this detail here [page 9]. For our simple use case, this difference luckily does not destabilize our solution. A rough sketch of the snapshot idea follows this list.
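
For the curious, here is a minimal sketch of the bookkeeping such a snapshot (‘target’) network could involve. It is illustrative only and assumes the RLAgent’s existing self.layers and layer.weights attributes; actually wiring the frozen weights into the forward pass used for next_action_values is deferred to Part III.

import numpy as np

# Illustrative sketch only: keep a periodically refreshed copy of the weights
# to be used when computing target values.
class TargetSnapshot:
    def __init__(self, model, sync_every=100):
        self.model = model
        self.sync_every = sync_every
        self.updates_seen = 0
        # Frozen copies of the current layer weights
        self.frozen_weights = [np.copy(layer.weights) for layer in model.layers]

    def maybe_sync(self):
        # Refresh the frozen copy every sync_every weight updates
        self.updates_seen += 1
        if self.updates_seen % self.sync_every == 0:
            self.frozen_weights = [np.copy(layer.weights) for layer in self.model.layers]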

Armed with this knowledge, we can now go back and answer the question on the meaning of action values.

The action-values for a particular state (computed by the forward function) at a given point in training are simply the model’s current estimate of the optimal action value function. In other words, they estimate, for each action, the total discounted reward the agent expects to receive from that state onwards. In the specific case of the cartpole problem, this is a discounted count of the timesteps the agent expects to stay alive from that state. The experimental value, on the other hand, is the target value that the network tries to move closer to on each iteration.

With the points mentioned above, the last equation for the loss function empirically boils down to

L = \left( \text{experimental\_values} - \text{action\_values} \right)^2,
with the experimental_values and action_values as we have defined them in the definition of the experience_replay function in the previous section.

This is the quantity that we will seek to minimize in every iteration (for a given sample). The difference between experimental_values and action_values will form the basis for updating the weights of our network, implemented in the backward function.

Backpropagation

Reinforcement Learning theory helped us define the loss function in the last section as the squared difference between experimental and action values. Now that we have these values, the optimization problem is the same as minimizing the error function in any other neural network, through backpropagation.

We will now implement the last piece of the RLAgent.experience_replay code, the backward function:

def backward(self, calculated_values, experimental_values):
    delta = (calculated_values - experimental_values)
    for layer in reversed(self.layers):
        delta = layer.backward(delta)

This function first calculates the difference between the calculated values (the predicted action_values) and the experimental value of taking an action in a state. This error is then ‘propagated’ backwards through each layer, from the output layer towards the input layer, and the weights of each layer are updated based on it. Intuitively, each layer is told how far its output was from the output it would have needed to produce in order to generate the experimental values at the output layer.

This function calls the NNLayer.backward function:

1 def backward(self, gradient_from_above):
2     adjusted_mul = gradient_from_above
3     # apply the derivative of the activation function pointwise
4     if self.activation_function != None:
5         adjusted_mul = np.multiply(relu_derivative(self.backward_store_out), gradient_from_above)
6     D_i = np.dot(np.transpose(np.reshape(self.backward_store_in, (1, len(self.backward_store_in)))), np.reshape(adjusted_mul, (1, len(adjusted_mul))))
7     delta_i = np.dot(adjusted_mul, np.transpose(self.weights))[:-1]
8     self.update_weights(D_i)
9     return delta_i

First, (lines 4–5) if this layer has an associated activation function for its output, we pointwise-multiply the derivative of the activation function (the ReLU function in our case) with the error received from the layer above. The resulting value — adjusted_mul — is used for two further calculations in each layer:

  1. D_i [line 6]: This is the derivative of the loss function with respect to the weights of this layer. It is passed to the NNLayer.update_weights method to make the actual updates. Stripping away the reshaping needed to handle the matrices, it is an outer product of the (column) input received by this layer and the (row) adjusted_mul value calculated above.
  2. delta_i [line 7]: This is the calculated ‘error’ for the inputs to this layer (i.e. the output from the previous layer). This is passed backward to the previous layer to enable its own derivative and delta calculations.

A full treatment of the derivation of the backpropagation algorithm and a step-by-step implementation is given in this instructive blog, but we will not go into the details here. Although my implementation is slightly different from the ones given in that blog, it might be a good exercise to check that the functions I have described actually implement backpropagation correctly.
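
One practical way to run such a check is a finite-difference comparison against D_i. The sketch below is illustrative only; it assumes access to the model’s layers and forward function, and uses a squared-error loss with a factor of ½ so that its gradient matches the delta = calculated - experimental convention used in backward.

import numpy as np

def numerical_gradient(model, obs, target, layer_index, eps=1e-5):
    # Approximate dLoss/dW for one layer by nudging each weight and
    # re-evaluating the loss 0.5 * sum((forward(obs) - target)^2).
    layer = model.layers[layer_index]
    grad = np.zeros_like(layer.weights)
    for i in range(layer.weights.shape[0]):
        for j in range(layer.weights.shape[1]):
            original = layer.weights[i, j]
            layer.weights[i, j] = original + eps
            loss_plus = 0.5 * np.sum((model.forward(obs, remember_for_backprop=False) - target) ** 2)
            layer.weights[i, j] = original - eps
            loss_minus = 0.5 * np.sum((model.forward(obs, remember_for_backprop=False) - target) ** 2)
            layer.weights[i, j] = original  # restore the weight
            grad[i, j] = (loss_plus - loss_minus) / (2 * eps)
    return grad

The result can then be compared (for the same observation and target, before any weight update is applied) against the analytic D_i, for example with np.allclose and a small tolerance.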

The two remaining pieces of the algorithm are the relu_derivative and the NNLayer.update_weights functions. Let’s go through them now.

The derivative of the ReLU function is straightforward: the ReLU function returns the input value if it is greater than 0, and 0 otherwise. Its derivative is therefore simply 1 for x > 0, and 0 otherwise. Here's the implementation for a matrix input:

def relu_derivative(mat):
    return (mat > 0) * 1
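
For example, on a small array:

import numpy as np

print(relu_derivative(np.array([-2.0, 0.0, 3.5])))  # -> [0 0 1]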

Finally, the NNLayer.update_weights function:

def update_weights(self, gradient):
    self.weights = self.weights - self.lr * gradient

Great, this simple implementation should be enough to get a convergent algorithm working and train our agent! A working implementation of all the code from these two parts can be found on my GitHub.

Trained Cartpole Agent

When we run this program (remember to call env.render() on every timestep!), we get a fairly well-trained agent after a few hundred episodes. The task is considered solved once the agent achieves an average reward above 195 over 100 consecutive episodes. This usually takes between 100 and 300 episodes, though it may take longer on some runs because of an unlucky initialization. A more detailed analysis of convergence guarantees and performance will be done in the next post, but here is a sample full-length run from that agent,
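
If you want the program to announce when this criterion is met, a small rolling average over episode rewards does the trick. Here is a sketch; the variable names are placeholders and not part of the original code.

from collections import deque

episode_rewards = deque([], maxlen=100)  # total reward of the last 100 episodes

# At the end of each episode append its total reward, e.g. episode_rewards.append(t + 1)
# for cartpole, where the agent collects +1 per timestep survived.

def is_solved(episode_rewards):
    # Solved once the average reward over 100 consecutive episodes exceeds 195
    return len(episode_rewards) == 100 and sum(episode_rewards) / 100 > 195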

The trained cartpole agent!

Great! Seems like our agent is able to balance the pole pretty well. Compare this with the random agent that we started out with in the previous article.

Although the code that we have built is sufficient to train this agent, it can be optimized for faster convergence and higher stability. This will be the focus of the next post.

Whew! We got quite a lot done in this part. Here’s a summary:

  1. Implemented an ‘experience replay’ mechanism, where the agent stores all the (state, action, reward, next_state) tuples in memory and randomly samples from them for its weight updates.
  2. Went through the theory of reinforcement learning and its relevance to the cartpole problem, and derived a mechanism for updating the weights to eventually learn the optimal state-action values.
  3. Implemented the backpropagation algorithm to actually train the network.

There are still a few more tasks to be done here, and we will pick these up in the next and final part of the series.

  1. Replace the update_weights method with an Adam-based optimizer, which keeps track of individual learning rates for all parameters. This will help the algorithm converge faster.
  2. Refactor the global and model specific constants for better configurability.
  3. Analyze the performance of the method with a few different hyperparameter configurations.
  4. Implement a snapshot network for calculating the target values, periodically updated with the current weights of the Q-network.

See you once again next time!

Links to Part I & Part III.

References

  1. Playing Atari with Deep Reinforcement Learning, DeepMind Technologies [Mnih et al., 2013].
  2. Human-level control through deep reinforcement learning [Mnih et al., 2015].
  3. http://www.briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4 [Brian Dolhansky]
  4. A Theoretical Analysis of Deep Q-Learning [Fan et al., 2019]
