Reinforcement Learning: an Easy Introduction to Value Iteration

Learn the fundamentals of RL and how to apply Value Iteration to a simple example problem

Carl Bettosi
Towards Data Science


Value Iteration (VI) is typically one of the first algorithms introduced on the Reinforcement Learning (RL) learning pathway. The algorithm embodies some of the most fundamental concepts of RL, so it is important to master VI before progressing to more complex RL algorithms. Yet it can be tricky to get your head around.

This article is designed to be an easy-to-understand introduction to VI and assumes the reader is new to the field of RL. Let’s get started.

Already know the basics of RL? → Skip to how to use Value Iteration.

The basics of RL

Let’s start with a textbook definition, then I’ll break it down using an easy example.

Overview of the reinforcement learning training process

RL is one of the three key machine learning paradigms, alongside supervised and unsupervised learning. RL is not a single algorithm, but rather a framework encompassing a range of techniques and approaches for teaching agents to learn and make decisions in their environments.

In RL, an agent interacts with an environment by taking various actions. The agent is rewarded when those actions lead to desired states and punished when they don’t. The agent’s objective is to learn a strategy, called a policy, that guides its actions to maximize the reward it accumulates over time. This trial-and-error process refines the agent's policy, allowing it to behave optimally, or near-optimally, in its environment.
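To make this loop concrete, here is a minimal sketch of the interaction in Python. The env and agent objects and their methods are hypothetical placeholders (loosely modelled on common RL library interfaces), not any specific API:

def run_episode(env, agent, max_steps=100):
    # `env` and `agent` are hypothetical placeholders, not a specific library API.
    state = env.reset()                                 # observe the starting state
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.choose_action(state)             # the current policy picks an action
        next_state, reward, done = env.step(action)     # the environment responds
        agent.learn(state, action, reward, next_state)  # refine the policy from experience
        total_reward += reward
        state = next_state
        if done:                                        # e.g. the ball is in the hole
            break
    return total_reward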

The book “Reinforcement Learning: An Introduction” by Richard S. Sutton and Andrew G. Barto is considered the best in the field for those wishing to gain a solid understanding of RL, and it’s available for free!

Let’s define an example problem

Possible states for the game of golf

This image depicts the game of golf in its simplest form. We will use this example because golf has a clearly defined goal: get the ball in the hole.

In our example, the golf ball can be in one of three positions: on the fairway, on the green, or in the hole. We start on the fairway and aim to get closer to the hole with each shot; the hole sits on the green.

In RL, each of these positions is referred to as a state of the environment. You can think of the state as being a snapshot of the current environment (the golf course), which also records the ball position.

Possible actions in the game of golf

In our game, the agent takes actions by hitting the ball, starting in the ball on fairway state. The agent is simply the entity in control of taking actions. Our game has three available actions: hit to fairway, hit to green, and hit in hole.

Transition probabilities in the game of golf

Of course, when you take a shot, the ball may not go where you want it to. Therefore, we introduce a transition function linking actions to states with some probability weighting.

For example, we may miss the green when we take a shot from the fairway and end up still on the fairway. Written in an RL context, if we are in the ball on fairway state and take the action hit to green, there is a 90% probability we will enter the ball on green state but a 10% probability we will re-enter the ball on fairway state.

A reward of 10 for getting the ball in the hole

Every time the agent takes an action, we refer to this as a step through the environment. Based upon the action just taken, the agent observes the new state it ends up in as well as a reward. A reward function is an incentive mechanism for pushing the agent in the right direction. In other words, we design the reward function to shape the desired behavior of our agent. In our simplified golf example, we provide a reward of 10 for getting the ball in the hole.

The design of environment dynamics (the transition and reward functions) is not a trivial task. If the environment does not represent the problem you are attempting to solve, the agent will learn a policy reflecting a correct solution for an incorrect problem. We will not cover such design elements here, but it’s worth noting.

Markov Decision Processes

To represent a problem in a way the agent understands, we must formalise it as a Markov Decision Process (MDP).

An MDP is a mathematical model which describes our problem in a structured way. It represents the agent's interactions with the environment as a sequential decision-making process (i.e., one action after another).

It consists of the environment dynamics (I’m going to add some math notation here to shorten things):

  • a finite set of states s ∈ S.
  • a finite set of actions a ∈ A.
  • a transition function T(s′ | s, a) returning the probability of reaching next state s′, given the current state s and the action a taken.
  • a reward function R(s, a, s′) returning a scalar reward for reaching next state s′ after taking action a in state s.

Note that if there is some uncertainty or randomness involved in the transitions between states (i.e. taking the same action in the same state twice may lead to different outcomes), we refer to this as a stochastic MDP. We could also create a deterministic MDP, where transitions and rewards are entirely predictable. This means when an agent takes an action in a specific state, there is a one-to-one correspondence between the action and the resulting next state and the reward.

Visualised as an MDP, our golf problem looks pretty much the same as the images depicted earlier. We will use S = {s0, s1, s2} for shorthand.

Our golf example problem written as an MDP
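As a rough sketch, the same MDP can be written down in Python as a couple of dictionaries, one for the transition probabilities and one for the rewards. The state names, action names, probabilities and reward of 10 are taken straight from the figures above; the data layout itself is just one reasonable choice:

# The golf MDP written as plain Python dictionaries.
# T[(s, a)] is a list of (next_state, probability) pairs;
# R[(s, a, s_next)] is the scalar reward (0 unless listed).

states  = ["s0", "s1", "s2"]              # fairway, green, hole
actions = {
    "s0": ["hit to green"],
    "s1": ["hit to fairway", "hit in hole"],
    "s2": [],                             # terminal state: no actions
}

T = {
    ("s0", "hit to green"):   [("s1", 0.9), ("s0", 0.1)],
    ("s1", "hit to fairway"): [("s0", 0.9), ("s1", 0.1)],
    ("s1", "hit in hole"):    [("s2", 0.9), ("s1", 0.1)],
}

R = {("s1", "hit in hole", "s2"): 10.0}   # every other transition gives 0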

The use of an MDP assumes that what’s going to happen next in the environment only depends on what’s happening right now — the current state and action — and not on what happened before. This is called the Markov property, and it’s important in RL as it reduces computational complexity. I’ll explain this more later.

What is Value Iteration?

Value Iteration (VI) is an algorithm used to solve RL problems like the golf example mentioned above, where we have full knowledge of all components of the MDP. It works by iteratively improving its estimate of the ‘value’ of being in each state. It does this by considering the immediate rewards and expected future rewards when taking different available actions. These values are tracked using a value table, which updates at each step. Eventually, this sequence of improvements converges, yielding an optimal policy of state → action mappings that the agent can follow to make the best decisions in the given environment.

VI leverages the concept of dynamic programming, where solving a big problem is broken down into smaller subproblems. To achieve this in VI, the Bellman equation is used to guide the process of iteratively updating value estimates for each state, providing a recursive relationship that expresses the value of a state in terms of the values of its neighbouring states.

This won’t make much sense now. The easiest way to learn VI is to break it down step-by-step, so let’s do that.

How does the Value Iteration algorithm work?

The image below depicts the steps of the algorithm. Don’t be put off; it's easier than it appears.

The Value Iteration algorithm

Firstly, we need to define some parameters for our training.

  • Theta θ represents a threshold for convergence. Once the change in values falls below θ, we can terminate the training loop and generate the policy. It’s essentially just a way to ensure the policy we create is accurate enough. If we stop training too early, we may not learn the best actions to take.
  • Gamma γ represents the discount factor. This is a value which determines how much our agent values future rewards compared to immediate rewards. A higher discount factor (closer to 1) indicates that the agent values long-term rewards more, while a lower discount factor (closer to 0) places greater emphasis on immediate rewards.

To understand the discount factor better, consider an RL agent playing chess. Let’s say you have the opportunity to capture your opponent’s queen in the next move, which would yield a significant immediate reward. However, you also notice that by sacrificing a less valuable piece now, you could set up a future advantage that might lead to checkmate and an even bigger reward later. The discount factor helps you balance this decision.
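To put a number on it: with γ = 0.9, a reward of 10 received two steps from now contributes 0.9² × 10 = 8.1 to the value of the current state, while the same reward received immediately contributes the full 10. The closer γ is to 1, the smaller this gap becomes.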

(1) Initialisation: Now we have the parameters defined, we want to initialise our value function V(s) for all states in S. This typically means we set all values to 0 (or some other arbitrary constant) for every state. Think of the value function as a table that tracks a value for each state, updating frequently.

An initialised value table
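In code, the value table is nothing more than a mapping from states to numbers. A minimal sketch, assuming the states list from the earlier snippet:

# Initialise the value of every state to zero.
V = {s: 0.0 for s in states}   # {"s0": 0.0, "s1": 0.0, "s2": 0.0}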

(2) Outer loop: Now everything is set up, we can start the iterative process of updating our values. We begin in the outer loop, which repeats until the convergence criteria are met (until Δ < θ).

At each pass of the outer loop, we begin by setting Δ = 0. Delta Δ is used to represent the change in value estimates across all states, and the algorithm continues iterating until this change Δ falls below the specified threshold θ.

(3) Inner loop: For every state s in S, we:

  • set a variable v to the current value of that state, V(s); remember, this is fetched from our value table (so on the first pass, v = V(s) = 0)
  • apply the Bellman equation to update V(s)
  • update Δ (we’ll come back to this)

The Bellman Equation

This line of the algorithm is the most important. It requires that we update the value of the current state we are looking at in the loop. This value is calculated by considering all available actions from that specific state (a 1-step look ahead). Each of those possible actions presents us with a set of possible next states s′ and respective rewards r.

So, for each of those next states s′ and respective rewards r, we compute p(s′, r|s, a)[r + γV(s′)]. Let's break this up:

  • p(s′, r|s, a) is the probability of ending up in next state s′ with reward r, given that we are in state s and take action a (this is just our transition function)
  • [r + γV(s′)] is the reward r of ending up in next state s′ (we get that from our reward function) + our discount γ multiplied by the value of that next state s′ (we get that from our value table)
  • We then multiply these two parts: p(s′, r|s, a) * [r + γV(s′)]

Remember, this calculation is just for one next state s′ (the third level of the tree); we need to repeat it for each possible next state s′ after taking a.

Once we have done this, we sum all the results we just got Σₛ′, ᵣ p(s′, r|s, a) * [r + γV(s′)]. We then repeat this for each action a (the second level in the tree).

Once these steps are complete, we will have a value associated with each possible action a from the current state we are looking at in the inner loop s. We choose the highest using maxₐ and set this equal to our new value for that state V(s)←maxₐ Σₛ′, ᵣ p(s′, r|s, a) * [r + γV(s′)].

Remember, this process covered only one state s (the first level in the tree).

If we were programming this, it would be 3 for loops for each level in the tree:
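Here is a rough Python sketch of one full sweep, reusing the states, actions, T, R and V structures from the earlier snippets (gamma holds the discount factor γ):

gamma = 0.9   # discount factor γ (0.9 is just an example value)

# One sweep of the Bellman update: loop over states, then actions, then next states.
for s in states:                               # level 1: the state being updated
    if not actions[s]:                         # skip terminal states
        continue
    action_values = []
    for a in actions[s]:                       # level 2: each available action
        total = 0.0
        for s_next, prob in T[(s, a)]:         # level 3: each possible next state
            r = R.get((s, a, s_next), 0.0)
            total += prob * (r + gamma * V[s_next])
        action_values.append(total)
    V[s] = max(action_values)                  # keep the value of the best action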

(3) Inner loop (continued): Before moving to the next pass of the inner loop, we compare the current value of Δ with the absolute difference between the previous value of this state, v, and the new value we just calculated, V(s). We update Δ to the larger of the two: Δ ← max(Δ, |v - V(s)|). This helps us track how close we are to convergence.

Ok, this process completes one pass of the inner loop. We perform step (3) for each s in S before breaking out of the inner loop and checking the convergence condition Δ < θ. If this condition is met, we break out of the outer loop; if not, we go back to step (2).

(4) Policy extraction: By this point, we have likely performed multiple passes through the outer loop and have converged. This means our value table now holds the final value of each state (in other words, ‘how good it is to be in each state’). We can now extract a policy from this.

Remember, the policy π is essentially a mapping from states → actions, and for each state, it selects the action that maximizes the expected return. To calculate this, we perform the exact same process as before, but instead of getting a value for state s using maxₐ, we get the action a that gives us the best value using argmaxₐ.
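As a sketch, the only change from the value-update code is swapping the max over action values for an argmax that remembers which action achieved it (the function below reuses the same hypothetical data structures as before):

# Extract a greedy policy from the converged value table.
def extract_policy(states, actions, T, R, V, gamma):
    policy = {}
    for s in states:
        if not actions[s]:                    # terminal state: nothing to do
            policy[s] = None
            continue
        best_action, best_value = None, float("-inf")
        for a in actions[s]:
            total = sum(prob * (R.get((s, a, s_next), 0.0) + gamma * V[s_next])
                        for s_next, prob in T[(s, a)])
            if total > best_value:            # argmax over actions
                best_action, best_value = a, total
        policy[s] = best_action
    return policy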

And that’s it!

Policy Iteration is another dynamic programming algorithm. It is similar to VI, except that it alternates between evaluating the current policy until its value function converges and improving the policy by making it greedy with respect to that value function. It often requires fewer iterations than VI, but each iteration involves more computation.

Solving the example using Value Iteration

VI should make even more sense once we work through an example, so let's get back to our golf problem. We have formalised it as an MDP, but the agent doesn’t yet know the best strategy for playing golf, so let's solve the golf MDP using VI.

We’ll start by defining our model parameters using fairly standard values:

γ = 0.9 // discount factor
θ = 0.01 // convergence threshold

We will then initialise our value table to 0 for all states in S:

// value table

V(s0) = 0
V(s1) = 0
V(s2) = 0

We can now start in the outer loop:

Δ = 0

And perform three passes of the inner loop, one for each state in S:

// Bellman update rule
// V(s) ← maxₐ Σₛ′, ᵣ p(s′, r|s, a) * [r + γV(s′)]

//******************* state s0 *******************//

v = 0

// we have only looked at one action here as only one is available from s0
// we know that the others are not possible and would therefore sum to 0

V(s0) = max[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]

V(s0) = max[0.1 * (0 + 0.9 * 0) +
0.9 * (0 + 0.9 * 0)]

V(s0) = max[0] = 0


// Delta update rule
// Δ ← max(Δ,| v - V(s)|)

Δ = max[Δ, |v - V(s0)|] = max[0, |0 - 0|] = 0

//******************* state s1 *******************//

v = 0

// there are 2 available actions here

V(s1) = max[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]

V(s1) = max[0.9 * (0 + 0.9 * 0) +
0.1 * (0 + 0.9 * 0),
0.1 * (0 + 0.9 * 0) +
0.9 * (10 + 0.9 * 0)]

V(s1) = max[0, 9] = 9

Δ = max[Δ, |v - V(s1)|] = max[0, |0 - 9|] = 9

//******************* state s2 *******************//

// terminal state with no actions

This gives us the following update to our value table:

V(s0) = 0
V(s1) = 9
V(s2) = 0

We don’t need to worry about s2 as this is a terminal state, meaning no actions are possible here.

We now break out of the inner loop and continue in the outer loop, performing a convergence check:

Δ < θ = 9 < 0.01 = False

Since we have not converged, we do a second iteration of the outer loop:

Δ = 0

And another 3 passes of the inner loop, using the updated value table:

//******************* state s0 *******************//

v = 0

V(s0) = max[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]

V(s0) = max[0.1 * (0 + 0.9 * 0) +
0.9 * (0 + 0.9 * 9)]

V(s0) = max[7.29] = 7.29

Δ = max[Δ, |v - V(s0)|] = max[0, |0 - 7.29|] = 7.29

//******************* state s1 *******************//

v = 9

V(s1) = max[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]

V(s1) = max[0.9 * (0 + 0.9 * 7.29) +
0.1 * (0 + 0.9 * 9),
0.1 * (0 + 0.9 * 9) +
0.9 * (10 + 0.9 * 0)]

V(s1) = max(6.7149, 9.81) = 9.81

Δ = max[Δ, |v - V(s1)|] = max[7.29, |9 - 9.81|] = 7.29

//******************* state s2 *******************//

// terminal state with no actions

At the end of the second iteration, our values are:

V(s0) = 7.29
V(s1) = 9.81
V(s2) = 0

Check convergence once again:

Δ < θ = 7.29 < 0.01 = False

Still no convergence, so we continue the same process as above until Δ < θ. I won’t show all the calculations; the two iterations above are enough to understand the process.

After 6 iterations, our values converge. Here are the values and the convergence measure Δ as they change over each iteration:

Iteration   V(s0)          V(s1)          V(s2)   Δ              Converged
1           0              9              0       9              False
2           7.29           9.81           0       7.29           False
3           8.6022         9.8829         0       1.3122         False
4           8.779447       9.889461       0       0.177247       False
5           8.80061364     9.89005149     0       0.02116664     False
6           8.8029969345   9.8901046341   0       0.0023832945   True

Now we can extract our policy:

// Policy extraction rule
// π(s) = argmaxₐ Σₛ′, ᵣ p(s′, r|s, a) * [r + γV(s′)]

//******************* state s0 *******************//

// we know there is only one possible action from s0, but let's just do it anyway

π(s0) = argmax[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]

π(s0) = argmax[0.1 * (0 + 0.9 * 8.8029969345) +
0.9 * (0 + 0.9 * 9.8901046341)]

π(s0) = argmax[8.80325447773]

π(s0) = hit to green

//******************* state s1 *******************//

π(s1) = argmax[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]

π(s1) = argmax[0.9 * (0 + 0.9 * 8.8029969345) +
0.1 * (0 + 0.9 * 9.8901046341),
0.1 * (0 + 0.9 * 9.8901046341) +
0.9 * (10 + 0.9 * 0)]

π(s1) = argmax[8.02053693401, 9.89010941707]

π(s1) = hit in hole

Our final policy is:

π(s0) = hit to green
π(s1) = hit in hole
π(s2) = terminal state (no action)

So, when our agent is in the ball on fairway state (s0), the best action is to hit to green. This seems pretty obvious since that is the only available action. However, in s1, where there are two possible actions, our policy has learned to hit in hole. We can now give this learned policy to other agents who want to play golf!

And there you have it! We have just solved a very simple RL problem using Value Iteration.
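If you want to verify the whole thing yourself, here is a small self-contained Python script that builds the golf MDP and runs VI with in-place updates, followed by policy extraction, along the lines of the hand calculation above. (The later decimal places may differ very slightly from the table because of rounding in the manual working.)

# Value Iteration on the golf MDP, mirroring the worked example above.

GAMMA, THETA = 0.9, 0.01

states  = ["s0", "s1", "s2"]
actions = {"s0": ["hit to green"],
           "s1": ["hit to fairway", "hit in hole"],
           "s2": []}                                    # terminal state

T = {("s0", "hit to green"):   [("s1", 0.9), ("s0", 0.1)],
     ("s1", "hit to fairway"): [("s0", 0.9), ("s1", 0.1)],
     ("s1", "hit in hole"):    [("s2", 0.9), ("s1", 0.1)]}

R = {("s1", "hit in hole", "s2"): 10.0}                 # all other rewards are 0

def backup(s, a, V):
    # Expected return of taking action a in state s under the current values V.
    return sum(p * (R.get((s, a, s_next), 0.0) + GAMMA * V[s_next])
               for s_next, p in T[(s, a)])

V = {s: 0.0 for s in states}
iteration = 0
while True:
    iteration, delta = iteration + 1, 0.0
    for s in states:
        if not actions[s]:                               # skip the terminal state
            continue
        v = V[s]
        V[s] = max(backup(s, a, V) for a in actions[s])  # Bellman update (in place)
        delta = max(delta, abs(v - V[s]))
    print(iteration, {s: round(V[s], 6) for s in states}, "delta =", round(delta, 6))
    if delta < THETA:                                    # convergence check
        break

policy = {s: max(actions[s], key=lambda a: backup(s, a, V)) if actions[s] else None
          for s in states}
print(policy)   # {'s0': 'hit to green', 's1': 'hit in hole', 's2': None}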

Limitations to dynamic programming

It’s important to note that Value Iteration, along with other Dynamic Programming algorithms, has its limitations. Firstly, it assumes that we have complete knowledge of the dynamics of the MDP (we call this model-based RL). However, this is rarely the case in real-world problems; for example, we may not know the transition probabilities. For problems where this is the case, we need to use other approaches such as Q-learning (model-free RL).

Curse of dimensionality in chess

Secondly, as problems get bigger, the number of possible states and actions explodes (think about trying to enumerate all the possible states of chess), and with it the size of the value table and the cost of each sweep. This is the ‘curse of dimensionality’ problem, where the computational and memory requirements escalate rapidly, making it challenging to apply dynamic programming to high-dimensional problems.

Nevertheless, VI is great to learn as it introduces some of the key foundational concepts of RL which form the basis of more complex algorithms that you may go on to learn.

Thanks for reading!

I hope this article has provided an easy-to-understand introduction to reinforcement learning, and Value Iteration specifically.

If you learnt something new here, please give this a 👏 and follow!

Unless otherwise stated, all images are created by the author.
