How does the Bellman equation work in Deep RL?
The connection between the Bellman equation and Neural Networks, with formulas, examples and Python code
In the Bellman equation, the value function Φ(t) depends on the value function Φ(t+1). Despite this, the value of Φ(t) can be obtained before the state reaches time t+1. We can do this using neural networks, because they can approximate the function Φ(t) for any time t. We will see how it looks in Python. In the last two sections, we present an implementation of Deep Q-learning algorithm and some details of tensor calculations using the PyTorch package.
States, actions and policy map
State space and action space
The Markov decision process (MDP) provides the mathematical framework for Deep Reinforcement Learning (RL or Deep RL). For the real problem, we define in MDP the following parameters: {S, A, R, P, γ}, where S is the state space, A is the action space, R is the set of rewards, P is the set of probabilities, γ is the discount rate.
While in Computer Vision, the agent learns from a Big Number of Images, the agent in Deep RL learns from a Big Number of Episodes, where for any state, the agent explores several actions and receives different replies (rewards) from the MDP environment.
If there is only one action for each state and all rewards are the same, the MDP is reduced to a Markov chain. For a large number of MDP environments, see Table of environments of OpenAI/gym.
CartPole or Inverted Pendulum
One example of an MDP environment is Cartpole (a.k.a. an Inverted Pendulum), a pendulum with a center of gravity above its pivot point. It’s unstable, but can be controlled by moving the pivot point under the center of mass. For environment CartPole-v0, the states and actions are as follows:
For this environment, the state space has dimension=4 and is of type Box(4). The action space has dimension=2 and is of type Discrete(2).
Types of gym spaces:
gym.spaces.Discrete(n)
: discrete values from 0 to n-1.gym.spaces.Box
: a multi-dimensional vector of numeric values, the upper and lower bounds of each dimension are defined byBox.low
andBox.high
.
Policy map 𝜋, deterministic and stochastic policies
The policy map 𝜋 is defined as 𝜋(a|s) = Pr{At = a | St = s} meaning that the policy 𝜋 is the probability of action a performed at state s (at time t).
Example. Consider recycling robot equipped with arms to grab the cans. The state space S =[low, high] , where ‘low’ and ‘high’ are the states of the robot charge, the action space A =[search, recharge, wait]. We consider two policy types: deterministic and stochastic.
State-value function and Bellman equation
Return value
Assume that the state space is discrete, which means that the agent interacts with its environment in discrete time steps. At each time t, the agent receives a state St including the reward Rt. The cumulative reward is named return, we denote it as Gt. The future cumulative discounted reward is calculated as follows:
Here, γ is the discount factor, 0 < γ < 1. Thus, the return at the time t can be obtained using the return at the time t+1, namely
This is the recursive relation for the return value Gt.
State-value function and Bellman equation
The state-value function for the policy 𝜋 is defined as follows:
Here, 𝔼𝜋 is the expectation for Gt, and 𝔼𝜋 is named as expected return. By (1) and (2) we derive the eq. (3). This is the Bellman equation.
Thus, the state-value v_𝜋(s) for the state s at time t can be found using the current reward R_{t+1} and the state-value at the time t+1.
Action-value function and optimal policy
Action-value function
Now, we would like to define the action-value function associated with the policy 𝜋 :
We can introduce comparison of two policies as follows:
In this case, we say that policy 𝜋’ is better than policy 𝜋.
Optimal state-value function
It is not necessary that any two policies are comparable, however, there is always a policy which is better than all other policies. Such a policy is said to be optimal policy, it is denoted by 𝜋*. An optimal policy is guaranteed to exist, but may be not the only one.
The goal of the agent is to find the optimal policy. Finding the optimal policy is the main goal of Deep RL.
Different optimal policies have the same value function, we denote it by v*. In fact, we have v* = v(𝜋*). The function v* it is said to be the optimal state-value function. The optimal state-value function can be defined as follows:
Optimal action-value function
For any deterministic policy 𝜋, the action a is uniquely determined by the current state s, i.e, a = 𝜋(s). Then, for the deterministic policy 𝜋 in (4), the action can be dropped, i.e., we get eq. (2). In other words, for the deterministic policy 𝜋, we have the following relation between state-value function and action-value function:
This is not so for stochastic policies
Similarly to optimal action-value function v*(s), see (5), we define the optimal action-value function q*(s,a) as follows:
Suppose, we have an optimal action-value function q*(s,a). Then the optimal policy can be determined as follows:
Here, A(s) is the set of actions possible for the state s.
For the deterministic policy 𝜋 , we find the new action for the current state by relation a = 𝜋(s). For the stochastic policy 𝜋, we can find the new action by relation a = 𝜋*(s), where 𝜋* is the optimal policy, see (7).
Here’s what an agent should do: first find the optimal action-value function, and then find the optimal policy using formula (7). The last statement is true with some limitations. Exceptions are possible, for example, due to ε-greedy mechanism.
Q-table and Temporal Difference Learning
Q-table
A Q-table is the matrix of the shape [state, action]. We initialize all slots of this matrix to zero. The appropriate Python code is as follows:
import numpy as np
Q = np.zeros((state_size, action_size))
We will do updates in Q(s, a) for every pair (s,a) after each step or action.
Temporal Difference
How is the Q-table is updated after a single step? The most popular method for updating Q-table is the Temporal Difference Learning or TD-learning. We add updates on each step until the episode ends.
The return Gt in eq.(8) is called an alternative estimate , see (1). This value a.k.a. TD-target. The value Q(s_t, a_t) in (8) is called a current estimate. Using (1), we can rewrite eq. (8) as follows:
Algorithm Sarsa
Sarsa is acronym for the sequence state–action–reward–state–action. These five elements of the sequence are as follows:
The agent is in the current state s_t, then the agent chooses the action a_t, gets the reward r_t, after that the agent enters the state s_{t+1}, and chooses the following action a_{t+1}. This loop is executed for all episodes until value num_episodes, see pseudo-code of algorithm Sarsa below. At each step, the Q-value Q(s, a) is updated by (9), see the yellow line in of the Sarsa pseudo-code.
Learning rate α
The learning rate α determines the behavior of the algorithm Sarsa. Too large values α will keep our algorithm far from convergence to optimal policy. If α=1 then Q(s_t, a_t) ← Gt, i.e. Q-value always will be most recent return, no any learning. The too small values α lead to learning too slow. If α=0 then Q(s_t, a_t) ← Q(s_t, a_t), never updated.
Algorithm Q-learning
Q-learning
Algorithm Q-learning (a.k.a Sarsamax) differ from Sarsa in eq.(9) as follows: Instead of the value of Q(s,a) at time t, we use the maximum value of Q(s,a), where a runs through all possible actions in the moment t, see (10) and the yellow line in the Q-learning pseudo-code.
Greedy action
At any time step t, for state s_t, there exists at least one action a, whose estimated value Q(s_t, a) is maximal. This action a is called greedy action. The associated policy 𝜋*(s) is called greedy policy, see eq. (7). When we select one of greedy actions, we are exploiting our current knowledge of the actions. If instead, we choose one of the non-greedy actions, then we are exploring, because this enables to improve our estimate of the non-greedy action’s value.
Sarsa vs. Q-learning
An off-policy agent learns the optimal policy independently of the agent’s actions. An on-policy agent learns the policy being carried out by the agent. Q-learning is an off-policy algorithm because the optimal policy is learning by greedy action a_gr in the formula for maximum, see (10), however, the next action a_t can be another one. Sarsa is an on-policy algorithm because in (9) the agent learns optimal policy and behaves using the same policy Q(s_t,a_t).
Q-learning may have worse performance in each episode than Sarsa, however, Q-learning learns the optimal policy.
Neural Network in Python
Question
Answer: by a neural network. We will see how it looks in Python.
Function Approximation
The reason that so much attention is paid to neural networks is because they can approximate the output of any continuous mathematical function. This is possible due to the Kolmogorov theorem stating that multivariate functions can be expressed via a combination of sums and compositions of (a finite number of) univariate functions.
Our Q-value function, the function of two vector parameters Q(s, a) can be represented as a certain artificial neural network (nn) as exact as we want.
The realization of Q-learning algorithm with the Deep Learning technology, i.e., with neural networks is called Deep Q-Network or DQN.
Project DQN
Python package PyTorch is an open source deep learning library developed by Facebook’s AI Research lab. We provide several DQN snippets implemented with using PyTorch. This code is taken from my implementation of training an agent with the ‘Banana’ environment. The agent is trained to navigate and collect bananas in a certain square world. However, this code is fairly general and can be used for many environments with discrete state space.
Snippets
We present several fragments that help to understand how, using neural networks, we can elegantly implement the DQN algorithm.
Function Agent.__init__
: two neural networks (q_local
and q_target
) are constructed by the model Qnetwork.
Each model Qnetwork
contains two hidden layers.
Method
Agent.learn()
:
The difference between Q_expected
and Q_targets
should be minimized using PyTorch methods, see method learn()
.
In class ReplayBuffer:
values of s_t
(state
) and s_{t+1}
(next_state
) are sampled by function sample()
, the data is stored by function add()
.
In method dqn():
double loop by episodes and time steps; here, the values ‘state’, ‘next_ state’, ‘action’, ‘reward’ and ‘done’ are generated.
Tips related to PyTorch
Here are some tips related to PyTorch methods, see figure “Three tensors” below.
detach()
Method detach()
indicates that no backpropagation for gradient of tensor loss
will be executed for Q_targets_next.
This is possible since tensor loss
depends only on Q_targets
and Q_expected,
see method learn()
.
max(1)
The shape of each network here is [64, 4] where 64 is the number of states in the batch (BATCH_SIZE
=64), and 4 is the number of possible actions( move forward, move backward, turn left, turn right
) . The expression max(1)
means getting the maximum for each of the 64 states. The maximum is taken by running through all 4 actions. Recall that Q-Learning finds max value running over all actions, see (10). In the figure below, we give a numerical example of 64 x 4 tensor self.q_target(next_states).detach()
.
Tensors associated with neural networks
max (1)[0]
The fact is that max(1)
returns the list of two tensors: max(1)[0],
the tensor containing maximum values; max(1)[1],
the tensor containing values ‘column number’ at which maximum was found. We need only max(1)[0]
, see figure above.
unsqueeze(1)
Now we want to place this row vector of the shape [64] in the column with the form [64,1]. This is done using the unsqueeze(1)
method, see tensor Q_targets_next
in the figure above.
Q_targets
How we calculate Q_targets
? For any ‘state’ in the batch, the value ‘done’ is 1 if the episode is finished, otherwise ‘done’ is 0. Then the line of Q_targets
is calculated by eq.(10) if and only if the associated episode is not finished.
gather(1, actions)
This method gathers values along the axis specified by dim = 1. Note that dim=0 is associated with rows, dim=1 means columns. Each row in self.q_local(states)
consists of four Q-values associated with four actions. Thus, for each row, along the columns, the method gather
takes Q-value associated with the action number in the tensor actions
, see figure below.
Conclusion
Combining the Bellman equation, Neural Networks, Kolmogorov’s theorem, we get an amazing technology, Deep RL. This technology provides new approaches and new algorithms that can solve previously unsolvable problems.
What “unsolvable problems” do Bellman equations and Deep RL actually allow? Let us point, for example, to the project AlphaZero, a computer program which is master the games of Chess, Shogi and Go. AlphaZero within 24 hours of training achieved a superhuman level of play in Chess by defeating world-champion program Stockfish.
We examined one particular case of Deep RL, the Deep Q-learning algorithm. In the last two sections, we presented an implementation of this algorithm and some details of tensor calculations using the PyTorch package.
References
[1] M. Tavora, The Approximation Power of Neural Networks (with Python codes) (2019), TowardsDataScience
[2] X. Zheng, Brief Introduction to Reinforcement Learning (2019), LinkedIn
[3] H. van Hasselt, A. Guez, D. Silver, Deep Reinforcement Learning with Double Q-learning (2015), arXiv:1509.06461
[4] A. Singh, Reinforcement Learning: Bellman Equation and Optimality (Part 2) (2019), TowardsDataScience
[5] J. Hui, RL — DQN Deep Q-network (2018), Medium
[6] D. Silver, T. Hubert, J. Schrittwieser, D. Hassabis, AlphaZero: Shedding new light on chess, shogi, and Go (2018), DeepMind