Reinforcement Learning — Model Based Planning Methods Extension

Implementation of Dyna-Q+ and Priority Sweeping

Jeremy Zhang
Towards Data Science


In the last article, we walked through how to model an environment in a reinforcement learning setting and how to leverage that model to accelerate the learning process. In this article, I would like to take the topic further and introduce two more algorithms, Dyna-Q+ and Priority Sweeping, both based on the Dyna-Q method that we learnt in the last article. (If you find some of the game settings confusing, please check my last article.)

In the following paragraphs, we will leverage these two algorithms to address two problems:

  1. What if the model is wrong?
  2. How to update the Q function more efficiently?

What if the model is wrong?

In the last article, I introduced the Dyna Maze example, where actions are deterministic and the agent learns a model, a mapping from (currentState, action) -> (nextState, reward) , by recording the steps it experiences along the way. At the planning stage, the learnt model is then applied n times to reinforce the learning process. In a nutshell, the process can be summarised as:

  1. Learn a model from experience
  2. Fully trust the model and apply it to reinforce the value function

But things do not always go this way, as the environment can be complex and dynamic.

Models may be incorrect because the environment is stochastic and only a limited number of samples have been observed, or because the model was learned using function approximation that has generalised imperfectly, or simply because the environment has changed and its new behaviour has not yet been observed. When the model is incorrect, the planning process is likely to compute a suboptimal policy.

Shortcut Maze

Consider a case called the shortcut maze, in which the environment changes dynamically.

An agent starts at S and aims to reach G as fast as possible, and the grey blocks are areas that the agent cannot pass through. The picture on the left shows the original setting, and our agent is able to find the shortest path, through the left side of the board all the way to G , using the Dyna-Q method. However, the environment changes at a certain time step, and a shortcut on the rightmost side of the board opens up. Under this setting, will the Dyna-Q agent still be able to find the optimal solution?

The answer is no. Even with the ϵ-greedy method, under which the agent always explores with a certain probability, it is very unlikely to discover the new path on the rightmost side, as the already-learnt Q function keeps guiding the agent to choose the path on the left, and there is no motivation, or reward, strong enough to push the agent to explore other paths.

Dyna-Q+ & Implementation

How do we resolve this problem? The essence is to keep the agent able to explore new states so that it can adapt to the changing environment, and the trick that drives the agent to explore is to give it an extra reward. Thus, here we introduce the idea behind Dyna-Q+:

The agent keeps track for each state–action pair of how many time steps have elapsed since the pair was last tried in a real interaction with the environment. The more time that has elapsed, the greater (we might presume) the chance that the dynamics of this pair has changed and that the model of it is incorrect. To encourage behaviour that tests long-untried actions, a special “bonus reward” is given on simulated experiences involving these actions. In particular, if the modelled reward for a transition is r, and the transition has not been tried in τ time steps, then planning updates are done as if that transition produced a reward of r + κ*sqrt(τ), for some small κ. This encourages the agent to keep testing all accessible state transitions and even to find long sequences of actions in order to carry out such tests.

To summarise, the algorithm is exactly the same as Dyna-Q except that it keeps track of when each state-action pair was last tried, and it gives a bonus reward to pairs that have not been tried for a long time (as their dynamics could have changed as time goes on).

Now let’s make some modifications to Dyna-Q and implement Dyna-Q+ (as the basic settings are mostly the same, the following code focuses mainly on the differences). You can also check the full implementation here.

Init

In the init function, two components are added in order to give a bonus reward to long-untried state-action pairs. self.time keeps track of the total number of time steps within each episode (it is reset after the game ends), and self.timeWeight is essentially the κ in the reward function, which signifies how much we would like the agent to explore. Besides those general settings, a model of the environment is again initialised, but this time it is a mapping of (currentState, action) -> (reward, nxtState, timestep) .
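For reference, a minimal sketch of what these additions might look like is shown below. The attribute names self.time , self.timeWeight and self.model follow the text; everything else (learning rate, exploration rate, Q-value storage) is assumed to mirror the Dyna-Q agent from the last article rather than the exact repository code.

class DynaQPlusAgent:
    def __init__(self, exp_rate=0.1, lr=0.1, n_steps=5, time_weight=1e-4):
        # general Dyna-Q settings, assumed from the last article
        self.exp_rate = exp_rate        # epsilon for epsilon-greedy action selection
        self.lr = lr                    # learning rate
        self.steps = n_steps            # planning updates per real step
        self.Q_values = {}              # state -> {action: value}, initialised elsewhere
        # Dyna-Q+ additions
        self.time = 0                   # time steps elapsed in the current episode
        self.timeWeight = time_weight   # the kappa in r + kappa * sqrt(tau)
        # model: state -> {action: (reward, nxtState, time step last tried)}
        self.model = {}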

Model Update

To make the code more organised, an update model function is defined:

The function updates the model at each time step. One thing to notice is that we would like the planning stage to also consider actions that have never been tried, so we define those actions as leading the agent back to the same state with a reward of 0. Thus, an action that has never occurred gets a model entry of:

self.model[state][a] = (0, state, 1)

with the reward set to 0, the next state set to the same state, and the time step set to 1 (so that the elapsed time since this pair was last tried appears large). In addition, the action actually taken in that state is recorded with the current time step.
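A sketch of such an update function under these assumptions might look as follows ( self.actions is assumed to hold the list of legal actions; this is illustrative, not the exact repository code):

def updateModel(self, state, nxtState, action, reward):
    if state not in self.model:
        self.model[state] = {}
        for a in self.actions:
            # never-tried actions are modelled as leading back to the same
            # state with reward 0 and time step 1, so the elapsed time since
            # their "last visit" looks large during planning
            self.model[state][a] = (0, state, 1)
    # the action actually taken is recorded with the current time step
    self.model[state][action] = (reward, nxtState, self.time)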

Play Function

The play function follows the same structure as in the Dyna-Q method:

After each step, we update the model by calling the self.updateModel() function, and the time step stored in the model is used in the planning loop that follows to add a bonus to the reward:

_reward += self.timeWeight * np.sqrt(self.time - _time)

Besides the original reward for taking the specific action in a state, an extra reward is assigned. The quantity self.time - _time is essentially the number of time steps since that state-action pair was last tried.
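Putting the pieces together, the planning portion of the play loop might look roughly like the sketch below (pulled out into its own method for readability). self.gamma and the pre-initialised self.Q_values are assumptions carried over from the Dyna-Q agent, not guaranteed names from the repository.

import random
import numpy as np

def planning(self):
    # simulate self.steps transitions from the learnt model
    for _ in range(self.steps):
        # sample a previously observed state and one of its modelled actions
        state = random.choice(list(self.model.keys()))
        action = random.choice(list(self.model[state].keys()))
        _reward, _nxtState, _time = self.model[state][action]
        # Dyna-Q+ bonus for transitions that have not been tried for a long time
        _reward += self.timeWeight * np.sqrt(self.time - _time)
        # standard Q-learning update on the simulated experience
        best_next = max(self.Q_values[_nxtState].values())
        self.Q_values[state][action] += self.lr * (
            _reward + self.gamma * best_next - self.Q_values[state][action]
        )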

Dyna-Q & Dyna-Q+ Comparison

By giving an extra reward to under-explored state-action pairs, Dyna-Q+ is far more likely to detect a changing environment, while Dyna-Q can hardly do so.

From Reinforcement Learning: An Introduction

Referring to the result from Sutton’s book, when the environment changes at time step 3000, the Dyna-Q+ method is able to gradually sense the change and eventually find the optimal solution, while Dyna-Q keeps following the path it discovered previously. In fact, since planning in Dyna-Q only reinforces past experience, the more planning steps there are, the less likely the Dyna-Q agent is to find the new optimal path. In contrast, the planning steps in Dyna-Q+ increase the values of under-explored states and actions, making the agent more likely to explore and find the optimal path.

Priority Sweeping (Updating More Efficiently)

We have now gone through the basics of handling reinforcement learning in a dynamically changing environment. You might have noticed that in the planning stage there are actually many useless updates, especially in the beginning phase when the Q values of all states and actions are 0 and the rewards along the way are also 0. In these scenarios, the temporal difference,

R + γ*max_a Q(S', a) - Q(S, A)

equals 0, so the Q value updates of many state-action pairs are still 0. So here comes the question: can we update more efficiently?

Here I will introduce a method called priority sweeping, which focuses on non-trivial updates in the planning stage. The intuition is: since many updates are 0, can we restrict ourselves to updates whose magnitude exceeds a certain threshold, and thus make learning faster?

It is natural to prioritise the updates according to a measure of their urgency, and perform them in order of priority. This is the idea behind prioritised sweeping. A queue is maintained of every state–action pair whose estimated value would change nontrivially if updated, prioritised by the size of the change.

Priority Sweeping

There are a few points to notice:

  1. The model is a mapping of (currentState, action) -> (reward, nxtState)
  2. Only state-action pairs whose temporal difference is larger than θ are inserted into the queue
  3. In the planning stage, all states that lead to the selected state are also updated

The reason that all predecessor states of a non-trivially updated state also need to be updated is that, since the current state’s update is non-trivial, propagating it backwards will also produce non-zero updates. (Consider the value iteration example we discussed, where we carried out a backward update once the goal was reached: all the state values updated along the way were non-zero and useful.)

Algorithm Implementation

Time to get our hands dirty with the implementation; you can check the full implementation here.

Init

In the init function, we initialise a threshold θ , a priority queue that stores state-action pairs by their priority, and a dictionary of predecessors used to update all states that lead to the currently updated state.
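A minimal sketch of this constructor, assuming Python’s standard queue.PriorityQueue and reusing the general Dyna-Q settings from before (the exact attribute names beyond self.theta , self.queue and self.predecessors are assumptions):

from queue import PriorityQueue

class PrioritySweepingAgent:
    def __init__(self, theta=0.1, lr=0.1, gamma=0.95, n_steps=5):
        # general settings, assumed to mirror the Dyna-Q agent
        self.lr = lr
        self.gamma = gamma
        self.steps = n_steps             # planning updates per real step
        self.Q_values = {}               # state -> {action: value}, initialised elsewhere
        # priority sweeping additions
        self.theta = theta               # threshold on the size of an update
        # queue of (priority, (state, action)); PriorityQueue pops the smallest
        # value first, so priorities are stored as negative magnitudes
        self.queue = PriorityQueue()
        self.model = {}                  # (currentState, action) -> (reward, nxtState)
        # predecessors: nxtState -> list of (state, action) pairs leading to it
        self.predecessors = {}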

Play Function

The major difference lies in the play function:

At each step, instead of directly updating the Q value of the current state-action pair, a tmp_diff is computed and inserted into the priority queue if its magnitude is large enough. Then both the model and the predecessors are updated. Notice that the predecessor dict is a mapping of nxtState -> List((currentState, action), ...) , as many state-action pairs can lead to the same state.
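In code, that bookkeeping might look roughly like the sketch below, wrapped in a hypothetical record_step helper called from the play loop (the helper name and structure are illustrative, not the exact repository code):

def record_step(self, state, action, reward, nxtState):
    # how much the Q value of (state, action) would change if updated now
    tmp_diff = abs(
        reward + self.gamma * max(self.Q_values[nxtState].values())
        - self.Q_values[state][action]
    )
    if tmp_diff > self.theta:
        # negate the priority so that the largest change is popped first
        self.queue.put((-tmp_diff, (state, action)))
    # record the transition and the predecessor relation
    self.model[(state, action)] = (reward, nxtState)
    pre_list = self.predecessors.setdefault(nxtState, [])
    if (state, action) not in pre_list:
        pre_list.append((state, action))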

In the planning stage (inside the for loop), the top-priority state-action pair is retrieved ( self.queue.get(1) ), and all state-action pairs (within the pre_state_action_list ) that lead to that state are updated.
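Under the same assumptions, the planning loop might be sketched as follows: pop the highest-priority pair, update it, then push each of its predecessors whose own update would be non-trivial.

def planning(self):
    for _ in range(self.steps):
        if self.queue.empty():
            break
        # retrieve the pair with the largest pending change
        _, (state, action) = self.queue.get()
        reward, nxtState = self.model[(state, action)]
        best_next = max(self.Q_values[nxtState].values())
        self.Q_values[state][action] += self.lr * (
            reward + self.gamma * best_next - self.Q_values[state][action]
        )
        # propagate backwards: re-prioritise every pair leading into this state
        for (pre_state, pre_action) in self.predecessors.get(state, []):
            pre_reward, _ = self.model[(pre_state, pre_action)]
            best_cur = max(self.Q_values[state].values())
            priority = abs(
                pre_reward + self.gamma * best_cur
                - self.Q_values[pre_state][pre_action]
            )
            if priority > self.theta:
                self.queue.put((-priority, (pre_state, pre_action)))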

Dyna-Q & Priority Sweeping Comparison

As discussed above, priority sweeping only performs non-trivial updates along the way and is thus more efficient and faster.

From Reinforcement Learning: An Introduction

Referring to the plot from Sutton’s book, priority sweeping finds the optimal solution much faster than Dyna-Q.

The algorithm introduced above is for deterministic environments; for non-deterministic environments, as Sutton states:

Extensions of prioritized sweeping to stochastic environments are straightforward. The model is maintained by keeping counts of the number of times each state–action pair has been experienced and of what the next states were. It is natural then to update each pair not with a sample update, as we have been using so far, but with an expected update, taking into account all possible next states and their probabilities of occurring.
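A sketch of such an expected update, under the assumption that the agent keeps hypothetical self.transition_counts and self.reward_sums structures for the observed next states and rewards (neither appears in the article’s code):

def expected_update(self, state, action):
    counts = self.transition_counts[(state, action)]    # nxtState -> visit count
    total = sum(counts.values())
    expected = 0.0
    for nxtState, count in counts.items():
        prob = count / total                             # empirical transition probability
        avg_reward = self.reward_sums[(state, action, nxtState)] / count
        expected += prob * (
            avg_reward + self.gamma * max(self.Q_values[nxtState].values())
        )
    # a full expected backup rather than a sample update
    self.Q_values[state][action] = expected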

Conclusion

In this article, we learnt two algorithms, and the key points are:

  1. Dyna-Q+ is designed for a changing environment; it gives a bonus reward to long-untried state-action pairs to drive the agent to explore
  2. Priority Sweeping accelerates the learning process by performing and propagating only non-trivial value updates

And lastly, please check out my Github. You are welcome to contribute, and if you have any questions or suggestions, please leave a comment below!

Reference:

Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction