Sugun’s take on…

Value Iteration to solve OpenAI Gym’s FrozenLake

Understanding and Implementation of Value Iteration from scratch…

Sugun Segu
Towards Data Science
Jun 14, 2020


In this story, we will formulate Value Iteration and implement it to solve the FrozenLake8x8-v0 environment from OpenAI’s Gym.

This story helps beginners in Reinforcement Learning understand a Value Iteration implementation from scratch and introduces them to OpenAI Gym’s environments.

Introduction: FrozenLake8x8-v0 is a discrete, finite MDP.

We will compute the Optimal Policy for an agent (the best possible action in a given state) to reach the goal in the given environment, thereby getting the maximum Expected Reward (return).

Dumb Agent using Random Policy

The agent controls the movement of a character in an 8x8 grid world. Some tiles of the grid are walkable (F), and others lead to the agent falling into the water (H). Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction (the Environmental Dynamics). The agent is rewarded 1 for finding a walkable path to the goal tile and 0 for every other step.

Taken from OpenAI Gym
FrozenLake-v0

action_space: Discrete(4). The agent can take 4 discrete actions: Left (0), Down (1), Right (2), Up (3).

state_space (observation_space): Discrete(64), one discrete state per grid cell (64 in total).

Transition Probability: Due to the environment’s stochasticity, taking action (1) in state (0), for example, can lead to several successor states; the transition probabilities will be…

Attributes of the environment: ‘env.env.nA’ and ‘env.env.nS’ give the total number of actions and states, respectively.

env.env.P[s][a] → a list of (P(s’), s’, r, done) tuples

i.e., the probability of reaching each successor state (s’), the reward (r) received, and whether the episode is done.

For more details on the environment, see FrozenLake8x8-v0.
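Here is a minimal sketch of how to inspect these attributes, assuming the classic (pre-0.26) gym API that this story uses; the exact probabilities printed will be the environment’s own:

import gym

env = gym.make("FrozenLake8x8-v0")

print(env.env.nA)  # 4  -> number of actions
print(env.env.nS)  # 64 -> number of states

# env.env.P[s][a] is a list of (probability, next_state, reward, done) tuples
print(env.env.P[0][1])
# e.g. [(0.333..., 0, 0.0, False), (0.333..., 8, 0.0, False), (0.333..., 1, 0.0, False)]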

Let’s understand Policy Iteration:

Prediction and Control

Policy Evaluation: Determining the State-Value function Vπ(s) for a given policy (π).

For a given policy (π), the initial approximation v0 is chosen arbitrarily (with ‘0’ for the terminal state), and successive approximations of the value function are computed using the Bellman expectation equation as an update rule.

Bellman Expectation Equation as the update rule
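In Sutton & Barto’s notation, the update rule shown above is:

v_{k+1}(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_k(s') \right]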

where π(a|s) is the probability of taking action (a) in state (s) under policy (π); P(s’,r|s,a) is the Environmental Transition Dynamics, i.e., the probability of reaching successor state (s’) and receiving reward (r) after taking action (a) in state (s); and Gamma (γ) is the discount factor.

Take a second to understand the pseudo-code of Iterative Policy Evaluation

We iterate the update rule until the change in the value estimate between iterations becomes negligible.
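As a minimal sketch of iterative policy evaluation against the env.env.P transition model above (the names policy_evaluation, gamma and theta are mine, not necessarily those in the original screenshots):

import numpy as np

def policy_evaluation(env, policy, gamma=0.99, theta=1e-8):
    """Iterative policy evaluation for a stochastic policy pi(a|s)
    given as an array of shape (nS, nA). Returns the value function V."""
    nS, nA = env.env.nS, env.env.nA
    V = np.zeros(nS)
    while True:
        delta = 0.0
        for s in range(nS):
            v = 0.0
            for a in range(nA):
                for prob, s_next, reward, done in env.env.P[s][a]:
                    v += policy[s][a] * prob * (reward + gamma * V[s_next])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:  # stop once the value estimate barely changes
            break
    return V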

Policy Control: Improving the existing policy (π)

In our case, we act greedily on the current value function, which gives us a deterministic policy: from a state (s), take the action (a) with the highest expected value. Simple.

The argmax() function returns the action that takes us to the highest-valued successor state.
Implementation of argmax()
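A sketch of the greedy action selection via a one-step lookahead (one_step_lookahead and argmax_action are my own hypothetical names for the code in the screenshot above):

def one_step_lookahead(env, V, s, gamma=0.99):
    """Expected return of each action from state s under the value estimate V."""
    q = np.zeros(env.env.nA)
    for a in range(env.env.nA):
        for prob, s_next, reward, done in env.env.P[s][a]:
            q[a] += prob * (reward + gamma * V[s_next])
    return q

def argmax_action(env, V, s, gamma=0.99):
    """Greedy action: the one with the highest one-step expected return."""
    return int(np.argmax(one_step_lookahead(env, V, s, gamma)))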

Policy iteration:

Taken from Richard Sutton’s Introduction to Reinforcement Learning

Policy Evaluation and Policy Improvement execute iteratively over every state to improve the policy and the value function in turn.

Have a quick look at the pseudo-code of Policy Iteration
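Putting the two pieces together, a rough Policy Iteration loop could look like this sketch, reusing the policy_evaluation and argmax_action helpers above (again, my own naming):

def policy_improvement(env, V, gamma=0.99):
    """Greedy (deterministic) policy w.r.t. V, one-hot encoded as pi(a|s)."""
    nS, nA = env.env.nS, env.env.nA
    policy = np.zeros((nS, nA))
    for s in range(nS):
        policy[s][argmax_action(env, V, s, gamma)] = 1.0
    return policy

def policy_iteration(env, gamma=0.99):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    nS, nA = env.env.nS, env.env.nA
    policy = np.ones((nS, nA)) / nA  # start from the uniform random policy
    while True:
        V = policy_evaluation(env, policy, gamma)
        new_policy = policy_improvement(env, V, gamma)
        if np.array_equal(new_policy, policy):  # stable -> done
            return new_policy, V
        policy = new_policy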

Dude…

Who uses these synchronous DP methods with multiple steps and multiple sweeps? Let’s combine both steps using the “Bellman Optimality Equation”, call it “Value Iteration”, and note that it falls under “Generalized Policy Iteration”.

Cool…

Value Iteration:

So forget everything and let’s move on to Value Iteration (no, I’m just joking… don’t forget anything).

In Value Iteration, we do not run policy evaluation to completion. We perform just one sweep over all the states and act greedily with respect to the current value function.

Implementation of Bellman’s Optimality Equation

And by the time the change in the value estimate becomes negligible (< theta), the policy derived from it will strictly converge to the Optimal Policy… Dude, I am not saying it, Bellman did.

Bellman Optimality Equation as the update rule
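In the same notation, the Bellman optimality backup used as the update rule is:

v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_k(s') \right]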
Implementation of Value Function

So, the update rule does not refer to any specific policy; instead, it backs up the action that maximizes the current value estimate.
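A compact sketch of the whole Value Iteration loop, reusing the one_step_lookahead and argmax_action helpers from earlier (the function name value_iteration is mine, not necessarily the one in the screenshots):

def value_iteration(env, gamma=0.99, theta=1e-8):
    """Value Iteration: one sweep of the Bellman optimality backup per iteration."""
    nS, nA = env.env.nS, env.env.nA
    V = np.zeros(nS)
    while True:
        delta = 0.0
        for s in range(nS):
            # Bellman optimality backup: max over actions of the one-step lookahead
            best = np.max(one_step_lookahead(env, V, s, gamma))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # change in the value estimate is negligible
            break
    # Extract the deterministic greedy policy from the converged V
    policy = np.array([argmax_action(env, V, s, gamma) for s in range(nS)])
    return policy, V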

The only thing left is to call the value iteration function, evaluate it, and apply the resulting policy:

Solution:

From the policy, we extract the value of each state and the action to take from it.

Deterministic Policy
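For example, the deterministic policy and the converged values can be displayed as 8x8 grids (the action letters follow gym’s Left/Down/Right/Up encoding; this is just one way to show them):

policy, V = value_iteration(env)

# One greedy action per state, laid out on the 8x8 grid
action_names = np.array(["L", "D", "R", "U"])
print(action_names[policy].reshape(8, 8))

# The corresponding converged state values
print(V.reshape(8, 8).round(3))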

We apply the policy to the environment and run it for 100 episodes:

Implementation of Policy onto FrozenLake8x8-v0
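A rough rollout loop under the classic gym API (reset returning a state, step returning a 4-tuple); run_policy is a hypothetical helper name, not the author’s exact code:

def run_policy(env, policy, episodes=100):
    """Roll out the deterministic policy and report how often the goal is reached."""
    wins = 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            state, reward, done, info = env.step(policy[state])
        wins += reward  # reward is 1 only when the goal tile is reached
    return wins / episodes

print("Success rate over 100 episodes:", run_policy(env, policy))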

Conclusion:

We saw the formulation of Value Iteration, extracted the Optimal Policy to reach the goal, and implemented it on OpenAI’s FrozenLake environment.

Agent using Optimal Policy and reaching the goal in 8 steps… but getting affected by Environmental Dynamics

Take away:

Even though we obtained a strictly converged Optimal Policy, the agent cannot always reach the goal, given the stochastic “Environmental Dynamics”.

From the next post onwards, we will use algorithms that can formulate the Optimal Policy better, so tune in for more…

For the complete code of the algorithm.

References:

[1] Reinforcement Learning: An Introduction | Second Edition by Richard S. Sutton & Andrew G. Barto.

[2] RL Course by David Silver, DeepMind.

[3] OpenAI Gym.

Sources:

[4] Reinforcement Learning Specialization: University of Alberta.

[5] Deep Reinforcement Learning: UC Berkeley.
