
The Bellman Equation

V-function and Q-function Explained

DEEP REINFORCEMENT LEARNING EXPLAINED – 08

In the previous post, we used the Frozen-Lake Environment example to see the limitations of the Cross-Entropy Method. Continuing with the style of this series of gradually adding knowledge, in this post we will present another type of Agent that allows us to solve that problem: the so-called Value-based Agents, which store the value function and base their decisions on it. For this purpose, we will present the Bellman equation, one of the central elements of many Reinforcement Learning algorithms and the tool required to calculate the value functions in this post.

Spanish version of this publication: "3. Funciones de valor y la ecuación de Bellman"

Value-based Agents

Remember that the Agent’s goal is to find a sequence of actions that will maximize the return: the sum of rewards (discounted or undiscounted, depending on the value of gamma) during an episode or over the entire life of the Agent, depending on the task.

In a continuing task, this sum can become infinite. We can solve this with the help of the discount factor already introduced in post 2, which lies in the [0,1] range. The formula for the discounted Return G at time step t is as follows:
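Written in the notation of Sutton and Barto (R for rewards, γ for the discount factor):

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$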

Although the sum still has infinitely many terms, if γ<1 (and the rewards are bounded), then Gt will have a finite value. If γ=0, the Agent is only interested in the immediate reward and discards the long-term return. Conversely, if γ=1, the Agent will weigh all future rewards equally with the immediate reward.

We can rewrite this equation with a recursive relationship:
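Pulling the first reward out of the sum gives the recursive form:

$$G_t = R_{t+1} + \gamma\, G_{t+1}$$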

In short, the Agent must be able to exploit the information expressed by this Return G to make its decisions. We also refer to this expression as the Discounted Return.

Almost all Reinforcement Learning algorithms executed by Agents involve estimating value functions: functions of states or of state-action pairs. The Agents that rely on them are the so-called Value-based Agents.

A value function estimates how good it is for the Agent to be in a given state (or how good it is to perform a given action in a given state) in terms of the return G. Note that the return G an Agent obtains may depend on the actions it will take. Accordingly, the value functions introduced in post 2 are defined with respect to particular ways of acting, called policies, usually denoted by 𝜋.

The V-function: the value of the state

The first value function we will introduce is the V-function. Generally speaking, we can say that the V-function answers the basic question "What can we expect from here?".

More formally, the V-function, also referred to as the state-value function, or even the value function, or simply V, measures the goodness of each state. In other words, it measures how good or bad it is to be in a particular state, in terms of the return G, when following a policy 𝜋.

That is, we can define the V-function as the expected total reward (discounted or undiscounted, depending on the value of gamma) that is obtainable from the state. In a formal way, the value V𝜋(𝑠) is:
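In expectation notation:

$$V_\pi(s) = \mathbb{E}_\pi\big[\, G_t \mid S_t = s \,\big]$$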

This expression describes the expected value of the total return G at time step t, starting from state s at time t and then following policy 𝜋. The expectation 𝔼[.] is used in this definition because the Environment transition function might act in a stochastic way.

To clarify the concept a little more for beginners, let’s consider a straightforward Environment with three states:

  • The Agent’s initial state 0
  • The final state 1, which the Agent reaches after executing action "left" from the initial state. The reward obtained is r=+1
  • The final state 2, which the Agent reaches after executing action "right" from the initial state. The reward obtained is r=+2

We can represent this with the following diagram of the Environment’s state transitions and rewards:

The Environment is always deterministic – every action succeeds, and we always start from state 0. Once we reach either state 1 or state 2, the episode ends.

Now, the question is: what is the value of state 0 (denoted by V(0))? An important detail is that the value of a state is always calculated with respect to some policy that our Agent follows. Even in a simple Environment, our Agent can have different behaviors, each of which will have its own value for state 0. Consider some example policies:

  1. The Agent always goes left
  2. The Agent always goes right
  3. The Agent goes left with a probability of 0.5 and right with a probability of 0.5
  4. The Agent goes left with a probability of 0.2 and right with a probability of 0.8

The value of state 0, V(0), in each of the above policies is:

  1. The value of state 0 in the case of the "always left" Agent is V(0) = 1.0 (every time it goes left, it obtains +1 and the episode ends).
  2. For the "always right" Agent, the value of state 0 is V(0) = 2.0 (every time it goes right, it obtains +2 and the episode ends).
  3. For the "0.5 left + 0.5 right" Agent, the value is V(0) = 1.0 x 0.5 + 2.0 x 0.5 = 1.5
  4. For the "0.2 left + 0.8 right" Agent, the value is V(0) = 1.0 x 0.2 + 2.0 x 0.8 = 1.8

Because the goal of the Agent is to get as much total reward as possible, the optimal policy for this Agent in this simple one-step Environment is policy 2, the policy "always right".

But the preceding example may give the false impression that we should "be greedy" and always take the action with the highest reward. Unfortunately, it’s not that simple. For example, let’s extend our preceding Environment with yet another state that is reachable from state 2. State 2 is no longer a terminal state but a transition to state 3, with a (very) bad reward of –10:

The new Environment’s state transitions and rewards can be represented as:

With that addition, for each policy, the V(0) will be:

  1. For the "always left", is the same: V(0)_=+_1.0
  2. For the "always right": V(0)= 2.0 + (–10.0) = –8.0
  3. For the "0.5 left+0.5 right": V(0)_=_1.0 0.5 + (2.0+(-10.0) ) 0.5 = -3.5
  4. For the "0.2 left+0.8 right": V(0)_=_1.0 0.2 + (2.0+(-10.0) ) 0.8 = -6.2

So, the best policy for this new Environment is now policy 1: "always go left". Notice that once the Agent has chosen the "right" action in state 0, the bad reward is unavoidable, as from state 2, we have only one exit.
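As a quick sanity check, here is a minimal Python sketch (the function and variable names are my own) that reproduces these V(0) values for both versions of this deterministic Environment, assuming γ = 1:

```python
# V(0) for the toy Environment: "left" ends the episode with reward +1;
# "right" gives +2 and, in the extended Environment, is followed by a
# forced transition to state 3 with reward -10 (no discounting, gamma = 1).

def v0(p_left, extended=False):
    """Expected return from state 0 under the policy (p_left, 1 - p_left)."""
    return_left = 1.0
    return_right = 2.0 + (-10.0 if extended else 0.0)
    return p_left * return_left + (1.0 - p_left) * return_right

policies = {
    "always left": 1.0,
    "always right": 0.0,
    "0.5 left + 0.5 right": 0.5,
    "0.2 left + 0.8 right": 0.2,
}

for name, p_left in policies.items():
    print(f"{name:22s} V(0) = {v0(p_left):5.1f}"
          f"   extended V(0) = {v0(p_left, extended=True):5.1f}")
```

Running it prints 1.0, 2.0, 1.5 and 1.8 for the original Environment, and 1.0, –8.0, –3.5 and –6.2 for the extended one, matching the values computed by hand above.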

This example, based on a naïve Environment, is meant to make the reader realise the complexity of this optimality problem and to prepare him or her to see the importance of the Bellman equation. So, what is the Bellman equation?

The Q-function: The value of the action

In post 2, we extended the definition of the state-value function to state-action pairs, defining a value for each state-action pair, which is called the action-value function, also known as the Q-function or simply Q. It defines the value of taking action a in state s under a policy π, denoted by Q𝜋(𝑠,𝑎), as the expected Return G starting from s, taking the action 𝑎, and thereafter following policy π:
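In the same notation as the V-function above:

$$Q_\pi(s, a) = \mathbb{E}_\pi\big[\, G_t \mid S_t = s,\, A_t = a \,\big]$$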

In this equation, again, the expectation 𝔼[.] is used because the Environment transition function might act in a stochastic way.

Now that we have both Q and V defined, let’s formalize their relationship. We denote by π(𝑎|𝑠) the probability that a policy π selects an action 𝑎 given a current state 𝑠. Note that the sum of the probabilities of all outbound actions from s is equal to 1:
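That is, for every state s:

$$\sum_{a} \pi(a \mid s) = 1$$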

We can assert that the state-value function is equivalent to the sum of the action-value functions of all outgoing (from s) actions a, each multiplied by the policy’s probability of selecting that action:
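Written out, this relationship is:

$$V_\pi(s) = \sum_{a} \pi(a \mid s)\; Q_\pi(s, a)$$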

The Bellman equation

The Bellman equation shows up everywhere in the Reinforcement Learning literature, being one of the central elements of many Reinforcement Learning algorithms. In summary, we can say that the Bellman equation decomposes the value function into two parts, the immediate reward plus the discounted future values.

This equation simplifies the computation of the value function, such that rather than summing over multiple time steps, we can find the optimal solution of a complex problem by breaking it down into simpler, recursive subproblems and finding their optimal solutions.
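In its most compact form, this decomposition can be written as an expectation over the next reward and the next state:

$$V_\pi(s) = \mathbb{E}_\pi\big[\, R_{t+1} + \gamma\, V_\pi(S_{t+1}) \mid S_t = s \,\big]$$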

To facilitate understanding of the formulation in the following sections, the next two backup diagrams summarize a convention of names given to the variables and their relationship:

Backup diagrams for (a) V𝜋(s) and (b) Q𝜋(s,a).

In these diagrams, P denotes the probability that action a, taken in state s, ends up in state s’ (with reward r).

Bellman equation for the State-value function

We already saw that we could define the discounted return, G, in recursive terms. Let’s now see how the Bellman equation is defined recursively for the state-value function:
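Using the names from the backup diagrams above (π(a|s) for the policy, P(s’|s,a) for the transition probability and r(s,a) for the expected immediate reward), one common way to write it is:

$$V_\pi(s) = \sum_{a} \pi(a \mid s) \Big[\, r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_\pi(s') \,\Big]$$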

This equation tells us how to find the value of a state s following a policy 𝜋. We can intuitively see that it recursively breaks down the value computation into an immediate expected reward, r(s,a), plus the value of a successor state, V𝜋(s‘), with a discount factor 𝛾. The equation also captures the stochasticity of the policy and of the Environment through the sums over the policy probabilities and the transition probabilities.

The Bellman equation is important because it gives us the ability to describe the value of a state s, V𝜋(s), in terms of the value of the successor state s‘, V𝜋(s‘). With an iterative approach that we will present in the next post, we can calculate the values of all states.

Unfortunately, in most scenarios we do not know the probabilities P and the rewards r, so we cannot solve MDPs by directly applying the Bellman equation. But do not worry: in the next post we present some alternatives to estimate them from experience.

Bellman equation for the Action-value function

We also have the Bellman equation for the action-value function:
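With the same notation:

$$Q_\pi(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_\pi(s')$$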

In the same way as for the state-value function, this equation tells us how to recursively find the value of a state-action pair when following a policy 𝜋.

And since we have shown that the state-value function V(s‘) is equivalent to the sum of the action-value functions Q(s’,a’) of all outgoing actions a’, multiplied by the policy probability of selecting each action, 𝜋(a’|s’), the previous formula can be expressed as follows:
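In full (same notation as before):

$$Q_\pi(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q_\pi(s', a')$$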

Optimal Policy

Remember that the goal of the Agent is to maximize the total cumulative reward in the long run. The policy that maximizes the total cumulative reward is called the optimal policy.

Optimal value function

A policy π′ is defined to be better than or equal to a policy π if and only if Vπ′(s) ≥ Vπ(s) for all states s. An optimal policy π∗ satisfies π∗ ≥ π for all policies π. An optimal policy is guaranteed to exist, but it may not be unique. That means that there could be different optimal policies, but they all share the same value functions, the "optimal" value functions.

The optimal value function is the one that yields the maximum value compared to all other value functions (i.e., those obtained by following other policies).

When we say we are solving an MDP, it actually means we are finding the optimal value function. So, mathematically, the optimal state-value function can be expressed as:
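That is, the best value achievable over all policies:

$$V_*(s) = \max_{\pi} V_\pi(s)$$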

In the above formula, V∗(s) tells us the maximum expected return we can obtain from state s.

Similarly, the optimal action-value function indicates the maximum expected return we can obtain if we are in state s and take action a from there onwards:
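Analogously:

$$Q_*(s, a) = \max_{\pi} Q_\pi(s, a)$$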

We can also define V(s) via Q(s,a), so that the value of a state equals the value of the best action we can execute from that state:
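In symbols:

$$V_*(s) = \max_{a} Q_*(s, a)$$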

and
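conversely, following Sutton and Barto’s notation for the next reward and next state, the optimal value of a state-action pair can be written in terms of the optimal value of the successor state:

$$Q_*(s, a) = \mathbb{E}\big[\, R_{t+1} + \gamma\, V_*(S_{t+1}) \mid S_t = s,\, A_t = a \,\big]$$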

The Bellman equation of optimality

Bellman proved that the optimal state-value function of a state s equals the maximum, over actions a, of the expected immediate reward plus the discounted long-term value of the next state s’:
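In the notation used above, this is:

$$V_*(s) = \max_{a} \Big[\, r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_*(s') \,\Big]$$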

Bellman also proved that the optimal action-value function for being in state s and taking action a is:
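That is:

$$Q_*(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q_*(s', a')$$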

In future posts of this series, we will show examples of how to use the Bellman equation of optimality. As we will see throughout this series, the Bellman equation is a keystone for finding the optimal value functions and thus obtaining an optimal policy for an Agent.

What is next?

In future posts, you will see these formulas in practice by solving the Frozen-Lake Environment. However, to be able to do this, we have one important thing still missing: a general way to calculate those V-values and Q-values. In the next post, we will present the Value Iteration method for it.

See you in the next post!

For more detail on the content of this post, the reader can review the excellent book Reinforcement Learning: An Introduction (Second Edition), by Richard S. Sutton and Andrew G. Barto.


Deep Reinforcement Learning Explained Series

by UPC Barcelona Tech and Barcelona Supercomputing Center

A relaxed introductory series that gradually and with a practical approach introduces the reader to this exciting technology that is the real enabler of the latest disruptive advances in the field of Artificial Intelligence.

Deep Reinforcement Learning Explained – Jordi TORRES.AI

About this series

I started to write this series in May, during the period of lockdown in Barcelona. Honestly, writing these posts in my spare time helped me to #StayAtHome because of the lockdown. Thank you for reading this publication in those days; it justifies the effort I made.

Disclaimers – These posts were written during this period of lockdown in Barcelona as a personal distraction and as a way to disseminate scientific knowledge, in case it could be of help to someone, but without the purpose of being an academic reference document in the DRL area. If the reader needs a more rigorous document, the last post in the series offers an extensive list of academic resources and books that the reader can consult. The author is aware that this series of posts may contain some errors and that the English text would need further revision if the purpose were an academic document. But although the author would like to improve the content in quantity and quality, his professional commitments do not leave him free time to do so. However, the author commits to refining all the errors that readers report, as soon as he can.


