
In the age of artificial intelligence, exact algorithms are not exactly hot. If a machine cannot learn it by itself, what’s the point? Why bother solving Markov Decision Process models with solutions that do not scale anyway? Why not dive straight into reinforcement learning algorithms?
It still pays off to learn the classical Dynamic Programming techniques. First of all, they are still widely used in practice. Many software developer jobs incorporate dynamic programming as part of their interview process. Although there are only so many states and actions you can enumerate, you may be surprised by the real-world problems that can still be solved to optimality.
Second, even if you are only interested in reinforcement learning, many algorithms in that domain are firmly rooted in dynamic programming. Four policy classes may be distinguished in reinforcement learning, one of them being value function approximation. Before moving to such approaches, an understanding of the classical value iteration algorithm is essential. Fortunately, it happens to be outlined in this very article.
Value iteration
The elegance of the value iteration algorithm is something to be admired. It really takes just a few lines of mathematical expressions, and not many more lines of code. Let’s look at the seminal formulation by Sutton & Barto:
![Value iteration algorithm [source: Sutton & Barto (publicly available), 2019]](https://towardsdatascience.com/wp-content/uploads/2021/12/1Rg3xj63Uwbw7_pBgEqEJdA.png)
The intuition is fairly straightforward. First, you initialize a value for each state, for instance at 0. Then, for every state you compute the value V(s): for each action a, you weight the reward (the direct reward r plus the downstream value V(s')) by the transition probability p, sum over the possible next states s', and keep the maximum over the actions.
Suppose we indeed initialized at 0 and the direct rewards r are non-zero. The discrepancies will immediately become visible in the difference expression |v-V(s)|, where v is the old estimate and V(s) the new one. So, the error Δ will exceed the threshold θ (a small value) and a new iteration follows.
By performing sufficiently many iterations, the algorithm converges to a point where |v-V(s)| < θ for every state. You can then resolve the argmax to find the optimal action for each state; knowing the true value functions V(s) is equivalent to having the optimal policy π.
Note that optimality is not guaranteed if θ is set too large. For practical purposes a reasonably small error tolerance will do, but for the mathematical sticklers among us it is good to keep in mind the optimality conditions.
A minimal working example
With the mathematics out of the way, let’s continue with the coding example. To keep the focus on the algorithm, the problem is extremely simple.
The problem
Consider a one-dimensional world (a row of tiles) with a single terminating state. Hitting the terminating state yields a reward of +10; every other action costs -1. The agent can move left or right, but (to keep it from being too trivial) the agent moves in the wrong direction 10% of the time. It’s a very simple problem, yet it has (i) a direct reward, (ii) an expected downstream reward, and (iii) transition probabilities.
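To make this concrete, one possible encoding is sketched below. The tile count, the behaviour at the walls and the exact reward bookkeeping are assumptions rather than a transcript of the actual implementation; the +10 for reaching the goal is represented here as a fixed value of the terminal tile (consistent with the value vector shown later), while every move costs -1.

```python
N_TILES = 5                     # a row of five tiles (assumption)
TERMINAL = N_TILES // 2         # terminating tile in the middle
SLIP = 0.1                      # chance of moving in the wrong direction
STATES = [s for s in range(N_TILES) if s != TERMINAL]   # non-terminal tiles

def actions(s):
    return (-1, +1)             # move left or move right

def transitions(s, a):
    """(probability, next_state, reward) outcomes of taking action a in state s."""
    outcomes = []
    for step, prob in ((a, 1 - SLIP), (-a, SLIP)):       # intended move vs. slip
        s_next = min(max(s + step, 0), N_TILES - 1)      # bounce off the walls
        outcomes.append((prob, s_next, -1))              # every move costs -1
    return outcomes
```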
The algorithm
The Python algorithm stays pretty close to the mathematical outline provided by Sutton & Barto, so no need to expand too much. The full code fits in a single Gist:
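As a rough stand-in for that Gist, here is a minimal sketch of how the full loop could look, reusing the hypothetical transitions, actions, STATES and TERMINAL names from the previous snippet; the discount factor of 0.9 and the function interface are assumptions, not the author's exact implementation.

```python
from collections import defaultdict

def value_iteration(states, actions, transitions, gamma=0.9, theta=1e-6,
                    initial_values=None):
    """Value iteration in the spirit of the Sutton & Barto pseudocode."""
    V = defaultdict(float, initial_values or {})      # V(s) starts at 0 unless fixed

    def q(s, a):
        # Expected value of action a in s: sum over s' of p * (r + gamma * V(s'))
        return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions(s, a))

    sweeps = 0
    while True:
        delta = 0.0
        for s in states:
            v = V[s]                                   # old estimate
            V[s] = max(q(s, a) for a in actions(s))    # Bellman optimality update
            delta = max(delta, abs(v - V[s]))          # largest change this sweep
        sweeps += 1
        if delta < theta:                              # |v - V(s)| < theta everywhere
            break

    # Resolve the argmax: the greedy policy with respect to the converged values
    policy = {s: max(actions(s), key=lambda a: q(s, a)) for s in states}
    return dict(V), policy, sweeps


V, policy, sweeps = value_iteration(
    STATES, actions, transitions,
    gamma=0.9, theta=1e-6,
    initial_values={TERMINAL: 10.0},    # the goal tile keeps its fixed value of +10
)
print(f"converged after {sweeps} sweeps")
print("values:", [round(V[s], 5) for s in range(N_TILES)])
print("policy:", {s: ("left" if a == -1 else "right") for s, a in policy.items()})
```

Note that the updates happen in place: states later in the same sweep already see the freshly updated values, just as in the pseudocode.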
Some experiments
Ok, some experiments then. We start by showing two iterations of the algorithm in full detail.
First, we set v equal to V(0), which is 0: v = V(0) = 0.
Next, we update V(0). Note r is fixed for each state; we effectively only sum over the set of next states via s'.
![Value iteration step 1, state 0 [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/12/14No0QEdh1J0v06FqKqz-pA.png)
That seems like a lot of computational effort for such a small problem. Indeed, it’s easy to see why dynamic programming does not scale well. In this case, all values V(s) are still at 0 (as we just started), so the estimated state value V(0) is simply the direct reward -1.
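Written out, with every V(s') still at zero so that the discounted downstream term contributes nothing:

$$V(0) \leftarrow \max_a \sum_{s'} p(s' \mid 0, a)\,\bigl[r + \gamma V(s')\bigr] = 0.9\,(-1 + 0) + 0.1\,(-1 + 0) = -1$$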
Let’s try one more, somewhat further along the line. Same computation, but now the values V(s) have been updated throughout several iterations. We now have V = [5.17859, 7.52759, 10.0, 7.52759, 5.17859]. Again, we plug in the values:
![Value iteration step 4, state 0 [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/12/1JYX5aS6CA6sOVXlFBY3-cg.png)
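Spelled out, the update looks roughly like this; the discount factor of 0.9 is an assumption (it is not stated explicitly, but it is consistent with the numbers reported here):

$$V(0) \leftarrow 0.9\,\bigl(-1 + 0.9 \cdot 7.52759\bigr) + 0.1\,\bigl(-1 + 0.9 \cdot 5.17859\bigr) \approx 5.56$$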
Thus, we update V(0) from 5.18 to 5.56. The error would be Δ = 5.56 - 5.18 = 0.38. In turn, this will affect the updates of the other states, and this process continues until Δ < θ for all states. For state 0, the optimal value is 5.68, hit within 10 iterations.
The most interesting parameter to test here is the error tolerance θ, which influences the number of iterations before convergence.
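Using the hypothetical helpers sketched earlier, such an experiment could look like this:

```python
# Convergence speed for different error tolerances (hypothetical helpers
# from the earlier sketches; gamma=0.9 remains an assumption).
for theta in (1e-1, 1e-3, 1e-6):
    V, _, sweeps = value_iteration(STATES, actions, transitions,
                                   gamma=0.9, theta=theta,
                                   initial_values={TERMINAL: 10.0})
    print(f"theta={theta:g}: {sweeps} sweeps, V(0)={V[0]:.4f}")
```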

Final words
Value iteration is one of the cornerstones of reinforcement learning. It is easy to implement and understand. Before moving on to more advanced implementations, make sure to grasp this fundamental algorithm.
Some other minimal working examples you might be interested in:
Implement Policy Iteration in Python – A Minimal Working Example
A Minimal Working Example for Deep Q-Learning in TensorFlow 2.0
A Minimal Working Example for Continuous Policy Gradients in TensorFlow 2.0
A Minimal Working Example for Discrete Policy Gradients in TensorFlow 2.0
References
Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.