Sugun’s take on…
Value Iteration to solve OpenAI Gym’s FrozenLake
Understanding and Implementation of Value Iteration from scratch…
In this story, we will formulate Value Iteration and implement it to solve the FrozenLake8x8-v0 environment from OpenAI's Gym.
This story helps beginners in Reinforcement Learning understand the implementation of Value Iteration from scratch and introduces OpenAI Gym's environments.
Introduction: The FrozenLake8x8-v0 environment is a discrete, finite MDP.
We will compute the Optimal Policy for an agent (the best possible action in a given state) to reach the goal in the given environment, thereby maximizing the Expected Reward (return).
The agent controls the movement of a character in an 8x8 grid world. Some tiles of the grid are walkable (F, frozen), and others lead to the agent falling into the water (H, hole). Additionally, the movement direction of the agent is uncertain because the ice is slippery, so it only partially depends on the chosen direction (the environmental dynamics). The agent receives a reward of 1 for finding a walkable path to the goal tile and 0 for every other step.
action_space: Discrete(4), the agent can take 4 discrete actions: Left (0), Down (1), Right (2), Up (3).
observation_space: Discrete(64), one discrete state per grid cell (64 in total).
Transition Probability: Due to the environmental uncertainty, the transition probabilities for, say, state (0) and action (1) will be…
Attributes of the environment: ‘env.env.nA’ and ‘env.env.nS’ give the total number of possible actions and states.
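A quick sketch of how these attributes can be inspected (assuming the classic gym package, where FrozenLake8x8-v0 is registered):

```python
import gym

# Create the environment; env.env unwraps to the raw discrete environment
env = gym.make('FrozenLake8x8-v0')

print(env.env.nA)  # 4  -> total number of actions
print(env.env.nS)  # 64 -> total number of states
```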
env.env.P[s][a] = [(p, s’, r, done), …]
Each tuple gives the probability (p) of reaching successor state (s’), the reward (r) received, and whether the episode terminates (done).
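Continuing with the env object from the snippet above, the transition list for state 0 and action 1 can be printed like this (a sketch; the exact numbers come from the environment):

```python
# Each entry of P is a list of (probability, next_state, reward, done) tuples.
# The slippery ice splits the probability between the intended direction
# and the perpendicular ones.
for prob, next_state, reward, done in env.env.P[0][1]:
    print(prob, next_state, reward, done)
```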
For more details on the environment, see FrozenLake8x8-v0.
Let’s understand Policy Iteration:
Prediction and Control
Policy Evaluation: determining the state-value function Vπ(s) for a given policy (π).
For a given policy (π), the initial approximation, v0, is chosen arbitrarily (‘0’ for the terminal state), and successive approximations of the value function are computed using the Bellman expectation equation as an update rule.
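Following Sutton & Barto, the update rule can be written as:

$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a)\,\big[\, r + \gamma\, v_k(s') \,\big]$$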
where the policy π(a|s) is the probability of taking action (a) in state (s) under policy (π), the environmental transition dynamics P(s’,r|s,a) is the probability of reaching successor state (s’) and receiving reward (r) from state (s) after taking action (a), and gamma (γ) is the discount factor.
We iterate the update rule until the change in the value estimate between iterations becomes negligible.
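A minimal sketch of this sweep, assuming the env object from the snippets above and a policy represented as an nS x nA array of action probabilities (the function name and defaults are illustrative):

```python
import numpy as np

def policy_evaluation(env, policy, gamma=0.99, theta=1e-8):
    # Iteratively approximate V_pi for the given policy
    V = np.zeros(env.env.nS)
    while True:
        delta = 0
        for s in range(env.env.nS):
            v = 0
            # Expectation over the policy's actions and the environment's dynamics
            for a, action_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.env.P[s][a]:
                    v += action_prob * prob * (reward + gamma * V[next_state])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:  # stop once the change in estimate is negligible
            break
    return V
```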
Policy Improvement (Control): improving the existing policy (π).
In our case, we act greedily on the current value function, which gives us a deterministic policy: take the action (a) that has the highest expected value from state (s). Simple.
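In symbols, the greedy improvement step is:

$$\pi'(s) = \arg\max_{a} \sum_{s', r} P(s', r \mid s, a)\,\big[\, r + \gamma\, V(s') \,\big]$$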
Policy Iteration:
Policy Evaluation and Policy Improvement execute iteratively, sweeping over every state, until the policy and the value function stop improving.
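Putting the two steps together, a sketch of Policy Iteration that reuses the policy_evaluation function above (the q_values helper name is illustrative):

```python
import numpy as np

def q_values(env, V, s, gamma=0.99):
    # One-step lookahead: expected return of each action from state s
    q = np.zeros(env.env.nA)
    for a in range(env.env.nA):
        for prob, next_state, reward, done in env.env.P[s][a]:
            q[a] += prob * (reward + gamma * V[next_state])
    return q

def policy_iteration(env, gamma=0.99):
    # Start from a uniform random policy
    policy = np.ones((env.env.nS, env.env.nA)) / env.env.nA
    while True:
        V = policy_evaluation(env, policy, gamma)  # prediction
        stable = True
        for s in range(env.env.nS):                # control: act greedily
            best_a = np.argmax(q_values(env, V, s, gamma))
            if policy[s][best_a] != 1.0:
                stable = False
            policy[s] = np.eye(env.env.nA)[best_a]
        if stable:
            return policy, V
```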
Dude…
Who uses these synchronous DP methods with multiple steps and multiple sweeps? Let’s combine both steps using the “Bellman Optimality Equation”, call it “Value Iteration”, and say it comes under “Generalized Policy Iteration”.
Cool…
Value Iteration:
So forget everything and let’s get to Value Iteration (no, I’m just joking… don’t forget anything).
In Value Iteration, we do not run policy evaluation to completion. We perform just one sweep over all the states and act greedily with respect to the current value function.
By the time the change in the value estimate becomes negligible (< theta), the policy strictly converges to the Optimal Policy… Dude, I am not saying it, Bellman did.
So the update rule does not refer to any specific policy; instead, it backs up the action that maximizes the current value estimate.
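In symbols, the max over actions replaces the expectation over the policy:

$$v_{k+1}(s) = \max_{a} \sum_{s', r} P(s', r \mid s, a)\,\big[\, r + \gamma\, v_k(s') \,\big]$$

A minimal sketch of Value Iteration under these definitions, reusing the q_values helper from the Policy Iteration sketch (names and defaults are illustrative):

```python
import numpy as np

def value_iteration(env, gamma=0.99, theta=1e-8):
    V = np.zeros(env.env.nS)
    while True:
        delta = 0
        for s in range(env.env.nS):
            # Single sweep: back up the best action's value directly
            v = np.max(q_values(env, V, s, gamma))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:  # change in value estimate is negligible
            break
    # Extract the deterministic greedy policy from the converged values
    policy = np.zeros((env.env.nS, env.env.nA))
    for s in range(env.env.nS):
        policy[s][np.argmax(q_values(env, V, s, gamma))] = 1.0
    return policy, V
```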
The only thing left is to call the value iteration function and apply the resulting policy:
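With the sketches above, that call might look like:

```python
policy, V = value_iteration(env, gamma=0.99, theta=1e-8)
```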
Solution:
From the policy, we extract the value of a given state and the action to take from it.
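For instance, using the policy and V arrays returned above (the state index is illustrative):

```python
state = 0
print(V[state])                  # value estimate of this state
print(np.argmax(policy[state]))  # greedy action: 0=Left, 1=Down, 2=Right, 3=Up
```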
We apply the policy to the environment and run it for 100 episodes:
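A sketch of that evaluation loop, assuming the classic Gym reset/step API:

```python
episodes = 100
wins = 0
for _ in range(episodes):
    state = env.reset()
    done = False
    while not done:
        # Follow the deterministic greedy policy
        action = np.argmax(policy[state])
        state, reward, done, info = env.step(action)
    wins += reward  # reward is 1 only when the goal tile is reached
print('Agent reached the goal in {} of {} episodes'.format(int(wins), episodes))
```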
Conclusion:
We walked through the formulation of Value Iteration, extracted the Optimal Policy to reach the goal, and implemented it on OpenAI’s FrozenLake environment.
Takeaway:
Even though we obtained a strictly converged Optimal Policy, the agent can’t always reach the goal, given the stochastic “environmental dynamics” (the slippery ice).
In the next post, we will use algorithms that can better formulate the Optimal Policy, so tune in for more…
For the complete code of the algorithm:
References:
[1] Reinforcement Learning: An Introduction, Second Edition, by Richard S. Sutton & Andrew G. Barto.
[2] RL Course by David Silver, DeepMind.
[3] OpenAI Gym.
[4] Reinforcement Learning Specialization, University of Alberta.