Sugun’s take on…

Value Iteration to solve OpenAI Gym’s FrozenLake

Understanding and Implementation of Value Iteration from scratch…

Sugun Segu
Towards Data Science
Jun 14, 2020


In this story, we will formulate Value Iteration and implement it to solve the FrozenLake8x8-v0 environment from OpenAI’s Gym.

This story helps beginners in Reinforcement Learning understand a Value Iteration implementation from scratch and introduces them to OpenAI Gym’s environments.

Introduction: FrozenLake8x8-v0 is a discrete, finite MDP.

We will compute the Optimal Policy for an agent (the best possible action in a given state) to reach the goal in the given environment, thereby getting the maximum Expected Reward (return).

Dumb Agent using Random Policy

The agent controls the movement of a character in an 8x8 grid world. Some tiles of the grid are walkable (F), and others lead to the agent falling into the water (H). Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction (the Environmental Dynamics). The agent is rewarded 1 for finding a walkable path to the goal tile and 0 for every other step.

Taken from OpenAI Gym
FrozenLake-v0

action_space: Discrete(4). The agent can take 4 discrete actions: Left (0), Down (1), Right (2), Up (3).

state_space (observation_space): Discrete(64), one discrete state per grid cell (64 in total).

Transition Probability: Due to the environment’s stochasticity, taking action (1) in state (0), for example, can lead to several successor states; the transition probabilities will be…

Attributes of the environment: ‘env.env.nA’ and ‘env.env.nS’ give the total number of actions and states, respectively.

env.env.P[s][a] → a list of (P(s’), s’, r, done) tuples

i.e., the probability of reaching each successor state (s’), the reward (r) received, and whether the episode is done.

For more details on the environment, see FrozenLake8x8-v0.
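Here is a minimal sketch of how to inspect these attributes, assuming the classic (pre-0.26) gym API that this story uses; the exact probabilities printed will be the environment’s own:

import gym

env = gym.make("FrozenLake8x8-v0")

print(env.env.nA)  # 4  -> number of actions
print(env.env.nS)  # 64 -> number of states

# env.env.P[s][a] is a list of (probability, next_state, reward, done) tuples
print(env.env.P[0][1])
# e.g. [(0.333..., 0, 0.0, False), (0.333..., 8, 0.0, False), (0.333..., 1, 0.0, False)]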

Let’s understand Policy Iteration:

Prediction and Control

Policy Evaluation: Determining the State-Value function Vπ(s) for a given policy (π).

For a given policy (π), the initial approximation v0 is chosen arbitrarily (with ‘0’ for the terminal state), and successive approximations of the value function are computed using the Bellman expectation equation as an update rule.

Bellman Expectation Equation as the update rule
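In Sutton & Barto’s notation, the update rule shown above is:

v_{k+1}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_k(s') \right]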

where π(a|s) is the probability of taking action (a) in state (s) under policy (π); P(s’,r|s,a) is the Environmental Transition Dynamics, i.e., the probability of reaching successor state (s’) and receiving reward (r) after taking action (a) in state (s); and Gamma (γ) is the discount factor.

Take a second to understand the pseudo-code of Iterative Policy Evaluation

We iterate the update rule until the change in the value estimate between iterations becomes negligible.
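As a minimal sketch of iterative policy evaluation against the env.env.P transition model above (the names policy_evaluation, gamma and theta are mine, not necessarily those in the original screenshots):

import numpy as np

def policy_evaluation(env, policy, gamma=0.99, theta=1e-8):
    """Iterative policy evaluation for a stochastic policy pi(a|s)
    given as an array of shape (nS, nA). Returns the value function V."""
    nS, nA = env.env.nS, env.env.nA
    V = np.zeros(nS)
    while True:
        delta = 0.0
        for s in range(nS):
            v = 0.0
            for a in range(nA):
                for prob, s_next, reward, done in env.env.P[s][a]:
                    v += policy[s][a] * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:  # stop once the value estimate barely changes
            break
    return V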

Policy Control: Improving the existing policy (π)

In our case, we act greedily on the current value function, which gives us a deterministic policy: from a state (s), take the action (a) with the highest expected value. Simple.

The argmax() function returns the action that takes us to the highest-valued successor state.
Implementation of argmax()
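A sketch of the greedy action selection via a one-step lookahead (one_step_lookahead and argmax_action are my own hypothetical names for the code in the screenshot above):

def one_step_lookahead(env, V, s, gamma=0.99):
    """Expected return of each action from state s under the value estimate V."""
    q = np.zeros(env.env.nA)
    for a in range(env.env.nA):
        for prob, s_next, reward, done in env.env.P[s][a]:
            q[a] += prob * (reward + gamma * V[s_next])
    return q

def argmax_action(env, V, s, gamma=0.99):
    """Greedy action: the one with the highest one-step expected return."""
    return int(np.argmax(one_step_lookahead(env, V, s, gamma)))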

Policy iteration:

Taken from Richard Sutton’s Introduction to Reinforcement Learning

Policy Evaluation and Policy Improvement execute iteratively over every state to improve the policy and the value function in turn.

Have a quick look at the pseudo-code of Policy Iteration
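Putting the two pieces together, a rough Policy Iteration loop could look like this sketch, reusing the policy_evaluation and argmax_action helpers above (again, my own naming):

def policy_improvement(env, V, gamma=0.99):
    """Greedy (deterministic) policy w.r.t. V, one-hot encoded as pi(a|s)."""
    nS, nA = env.env.nS, env.env.nA
    policy = np.zeros((nS, nA))
    for s in range(nS):
        policy[s][argmax_action(env, V, s, gamma)] = 1.0
    return policy

def policy_iteration(env, gamma=0.99):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    nS, nA = env.env.nS, env.env.nA
    policy = np.ones((nS, nA)) / nA  # start from the uniform random policy
    while True:
        V = policy_evaluation(env, policy, gamma)
        new_policy = policy_improvement(env, V, gamma)
        if np.array_equal(new_policy, policy):  # stable -> done
            return new_policy, V
        policy = new_policy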

Dude…

Who uses these synchronous DP methods with multiple steps and multiple sweeps? Let’s combine both steps using the “Bellman Optimality Equation”, call it “Value Iteration”, and note that it falls under “Generalized Policy Iteration”.

Cool…

Value Iteration:

So forget everything and let’s move on to Value Iteration (no, I’m just joking… don’t forget anything).

In Value Iteration, we do not run policy evaluation to completion. We perform just one sweep over all the states and act greedily with respect to the current value function.

Implementation of Bellman’s Optimality Equation

And by the time the change in the value estimate becomes negligible (< theta), the policy derived from it will strictly converge to the Optimal Policy… Dude, I am not saying it, Bellman did.

Bellman Optimality Equation as the update rule
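In the same notation, the Bellman optimality backup used as the update rule is:

v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_k(s') \right]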
Implementation of Value Function

So, the update rule does not refer to any specific policy; instead, it backs up the action that maximizes the current value estimate.
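A compact sketch of the whole Value Iteration loop, reusing the one_step_lookahead and argmax_action helpers from earlier (the function name value_iteration is mine, not necessarily the one in the screenshots):

def value_iteration(env, gamma=0.99, theta=1e-8):
    """Value Iteration: one sweep of the Bellman optimality backup per iteration."""
    nS, nA = env.env.nS, env.env.nA
    V = np.zeros(nS)
    while True:
        delta = 0.0
        for s in range(nS):
            # Bellman optimality backup: max over actions of the one-step lookahead
            best = np.max(one_step_lookahead(env, V, s, gamma))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # change in the value estimate is negligible
            break
    # Extract the deterministic greedy policy from the converged V
    policy = np.array([argmax_action(env, V, s, gamma) for s in range(nS)])
    return policy, V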

The only thing left is to call the value iteration function, evaluate it, and apply the resulting policy:

Solution:

From the policy, we extract the value of each state and the action to take from it.

Deterministic Policy
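For example, the deterministic policy and the converged values can be displayed as 8x8 grids (the action letters follow gym’s Left/Down/Right/Up encoding; this is just one way to show them):

policy, V = value_iteration(env)

# One greedy action per state, laid out on the 8x8 grid
action_names = np.array(["L", "D", "R", "U"])
print(action_names[policy].reshape(8, 8))

# The corresponding converged state values
print(V.reshape(8, 8).round(3))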

We apply the policy to the environment and run it for 100 episodes:

Implementation of Policy onto FrozenLake8x8-v0
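A rough rollout loop under the classic gym API (reset returning a state, step returning a 4-tuple); run_policy is a hypothetical helper name, not the author’s exact code:

def run_policy(env, policy, episodes=100):
    """Roll out the deterministic policy and report how often the goal is reached."""
    wins = 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            state, reward, done, info = env.step(policy[state])
        wins += reward  # reward is 1 only when the goal tile is reached
    return wins / episodes

print("Success rate over 100 episodes:", run_policy(env, policy))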

Conclusion:

We saw the formulation of Value Iteration, extracted the Optimal Policy to reach the goal, and implemented it on OpenAI’s FrozenLake environment.

Agent using Optimal Policy and reaching the goal in 8 steps… but getting affected by Environmental Dynamics

Take away:

Even though we obtained a strictly converged Optimal Policy, the agent cannot always reach the goal, given the stochastic “Environmental Dynamics”.

From the next post onwards, we will use algorithms that can formulate the Optimal Policy better, so tune in for more…

For the complete code of the algorithm.

References:

[1] Reinforcement Learning: An Introduction | Second Edition by Richard S. Sutton & Andrew G. Barto.

[2] RL Course by David Silver, DeepMind.

[3] OpenAI Gym.

Sources:

[4] Reinforcement Learning Specialization: University of Alberta.

[5] Deep Reinforcement Learning: UC Berkeley.
