In previous articles, we have talked about reinforcement learning methods that are all model-free, which is one of the key advantages of RL, as in most cases learning a model of the environment can be tricky and tough. But what if we want to learn a model of the environment, or what if we already have one? How can we leverage it to help the learning process? In this article, we will explore RL methods that make use of a model of the environment. The article is structured as follows:
- Start with the basic idea of how to model an environment
- Implement an example in Python using the theory we just learnt
- Discuss further ideas to extend the theory to more general cases

Model the Environment
An agent starts from a state; by taking an available action in that state, it receives feedback from the environment: the agent lands in a next state and receives a reward, if any. In this general setting, the environment gives an agent two signals, one being its next state and the other the reward. So when we say we model an environment, we are modelling a function that maps (state, action) to (nextState, reward). For example, consider a grid world in which an agent bashes its head into a wall; in response, the agent stays where it is and gets a reward of 0. In its simplest form, the model function for this transition is (state, action) --> (state, 0), indicating that with this specific state and action, the agent stays at the same place and gets reward 0.
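As a concrete illustration, a deterministic model can be stored as a plain dictionary keyed by state-action pairs. Below is a minimal sketch; the state tuples and the helper name update_model are purely illustrative.

```python
# A deterministic model: (state, action) -> (nextState, reward).
# States here are (row, col) tuples and are purely illustrative.
model = {}

def update_model(state, action, next_state, reward):
    # Record what the environment actually returned for this pair.
    model[(state, action)] = (next_state, reward)

# The agent tried to walk "up" into a wall from (1, 0):
# it stayed in place and received a reward of 0.
update_model((1, 0), "up", (1, 0), 0)

# Later, the model can replay this experience without touching the environment.
print(model[((1, 0), "up")])   # ((1, 0), 0)
```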
Algorithm
Let's now look into how a model of the environment can help improve Q-learning. We start by introducing the simplest form of such an algorithm, called Dyna-Q:

The way Dyna-Q leverages the model to back up the policy is simple and straightforward. Steps a, b, c and d are exactly the same as in general Q-learning (if you are not familiar with Q-learning, please check out my examples here). The only difference lies in steps e and f. In step e, a model of the environment is recorded under the assumption of a deterministic environment (for non-deterministic and more complex environments, a more general model can be formulated for the particular case). Step f can be summarised as applying the learnt model to update the Q function n times, where n is a predefined parameter. The backup in step f is exactly the same as in step d, and you may think of it as repeating what the agent has experienced several times in order to reinforce the learning process.
Typically, as in Dyna-Q, the same reinforcement learning method is used both for learning from real experience and for planning from simulated experience. The reinforcement learning method is thus the "final common path" for both learning and planning.
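To make this concrete, here is a minimal tabular sketch of one Dyna-Q iteration, where the same Q-learning backup serves both the direct update of step d and the planning updates of step f. The environment interface (env.step), the containers and the parameter values are assumptions of this sketch, not the exact code used later in the article.

```python
import random
from collections import defaultdict

alpha, gamma, n_planning = 0.1, 0.95, 5   # learning rate, discount, planning steps
Q = defaultdict(float)                    # Q[(state, action)]
model = {}                                # model[(state, action)] = (next_state, reward)

def q_backup(s, a, r, s2, actions):
    # The standard Q-learning backup, shared by learning and planning.
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q_step(env, s, a, actions):
    s2, r = env.step(s, a)                # (b)-(c): act in the real environment
    q_backup(s, a, r, s2, actions)        # (d): direct RL update from real experience
    model[(s, a)] = (s2, r)               # (e): record the deterministic model
    for _ in range(n_planning):           # (f): planning from simulated experience
        ps, pa = random.choice(list(model.keys()))
        ps2, pr = model[(ps, pa)]
        q_backup(ps, pa, pr, ps2, actions)
    return s2
```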

The graph shown above displays the general structure of Dyna methods more directly. Notice the two upward arrows into the policy/value functions, which in most cases are the Q functions we talked about before: one arrow comes from the direct RL update through real experience, which here corresponds to the agent exploring the environment, and the other comes from the planning update through simulated experience, which here means replaying the model the agent learnt from real experience. So at each action step, the learning process is strengthened by updating the Q function from both actual action taking and model simulation.
Implement Dyna Maze
I believe the best way to understand an algorithm is to implement an actual example. I will take the example from Reinforcement Learning: An Introduction, implement it in Python and compare it with general Q-learning without planning steps (model simulation).
Game Setting

Consider the simple maze shown in the figure. In each of the 47 states there are four actions, up, down, right and left, which take the agent deterministically to the corresponding neighbouring states, except when movement is blocked by an obstacle or the edge of the maze, in which case the agent remains where it is. Reward is zero on all transitions, except those into the goal state, on which it is +1. After reaching the goal state (G), the agent returns to the start state (S) to begin a new episode.
The implementation consists of two classes. The first class represents the board, which is also the environment, and is able to
- Take in an action and output the agent's next state (or position)
- Give the reward accordingly
The second class represents the agent, which is able to
- Explore the board
- Keep track of a model of the environment
- Update the Q functions along the way
A minimal skeleton of this two-class layout is sketched below.
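Here is that skeleton, just to show how the responsibilities split; the method names nxtPosition, giveReward and play are assumptions of this sketch rather than names taken from the full implementation.

```python
class Maze:
    """The board, i.e. the environment."""

    def nxtPosition(self, state, action):
        # Return the next position for an action; the agent stays put
        # if the move hits an obstacle or the edge of the maze.
        ...

    def giveReward(self, state):
        # +1 for reaching the goal state, 0 otherwise.
        ...


class Agent:
    """The learner: explores the maze, keeps a model, updates Q."""

    def chooseAction(self):
        # Epsilon-greedy action selection (detailed below).
        ...

    def play(self, rounds):
        # Explore, update Q from real experience, then replay the learnt model.
        ...
```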
Board Implementation
The board settings of the first class are similar to many of the board games we talked about before; you can check out the full implementation here. I will skip the detailed explanation (you can check my previous articles for more examples). As a result, we will have a board that looks like this:

The board is represented as a numpy array, where z indicates a blocked cell, * indicates the agent's current position and 0 indicates empty, available spots.
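A minimal sketch of how such a board can be built and printed is shown below. The obstacle, start and goal coordinates roughly follow the Dyna Maze figure in Sutton's book, but treat the exact layout and the helper name show_board as assumptions of this sketch.

```python
import numpy as np

BOARD_ROWS, BOARD_COLS = 6, 9
START, GOAL = (2, 0), (0, 8)
OBSTACLES = [(1, 2), (2, 2), (3, 2), (4, 5), (0, 7), (1, 7), (2, 7)]

def show_board(agent_position):
    # The board is a numpy array: 0 = free spot, -1 = obstacle, 1 = agent.
    board = np.zeros((BOARD_ROWS, BOARD_COLS))
    for obstacle in OBSTACLES:
        board[obstacle] = -1
    board[agent_position] = 1

    symbols = {0: "0", -1: "z", 1: "*"}
    for row in board:
        print(" ".join(symbols[int(value)] for value in row))

show_board(START)
```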
Agent Implementation
Initialisation
Firstly, in the init function we will initialise all the parameters required for the algorithm.
Besides the general Q-learning settings (learning rate, state_actions, …), a model of (state, action) -> (reward, state) is also initialised as a Python dictionary, and the model will only be updated along with the agent's exploration of the environment. self.steps is the number of times the model is used to update the Q function at each action step, and self.steps_per_episode records the number of steps in each episode (we will take it as a key metric in the following algorithm comparison).
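A minimal sketch of such an init is shown below; the default values and the start position are assumptions here rather than the article's exact settings.

```python
class Agent:
    def __init__(self, exp_rate=0.3, lr=0.1, steps=5):
        self.actions = ["up", "down", "left", "right"]
        self.exp_rate = exp_rate        # epsilon for epsilon-greedy exploration
        self.lr = lr                    # learning rate
        self.steps = steps              # planning updates per real action
        self.steps_per_episode = []     # steps taken in each episode, for comparison plots

        self.state = (2, 0)             # assumed start position S
        self.Q_values = {}              # Q_values[state][action]
        self.model = {}                 # model[state][action] = (reward, nextState)
```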
Choose Actions
In the chooseAction function, the agent still takes ϵ-greedy actions: with probability self.exp_rate it takes a random action, and with probability 1 - self.exp_rate it takes a greedy action.
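A sketch of such an ϵ-greedy selection, written as a standalone function for readability; the container names follow the init sketch above and are assumptions.

```python
import numpy as np

def choose_action(state, actions, Q_values, exp_rate):
    # With probability exp_rate take a random action (exploration) ...
    if np.random.uniform(0, 1) <= exp_rate:
        return np.random.choice(actions)
    # ... otherwise take the greedy action with the highest Q-value.
    q_values = Q_values.get(state, {})
    return max(actions, key=lambda a: q_values.get(a, 0))

# Example call with the assumed attribute names from the init sketch:
# action = choose_action(self.state, self.actions, self.Q_values, self.exp_rate)
```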
Model learning and Policy updating
Now let's get to the key point: updating the policy using the model learnt along the agent's exploration.
This implementation follows exactly the algorithm listed above. In each episode (game play), after the first round of Q function updates, the model is also updated with self.model[self.state][action] = (reward, nxtState), and then the Q function is repeatedly updated self.steps more times. Notice that inside the loop, the state and action are both randomly selected from the previously observed ones.
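Here is a minimal sketch of that inner planning loop using plain nested dictionaries; the parameter defaults and variable names follow the sketches above and are assumptions, not the article's exact code.

```python
import random

def planning_update(Q_values, model, lr=0.1, gamma=0.9, steps=5):
    # Replay `steps` transitions from the learnt model to back up Q further.
    for _ in range(steps):
        # Pick a previously observed state and a previously taken action in it.
        state = random.choice(list(model.keys()))
        action = random.choice(list(model[state].keys()))
        reward, next_state = model[state][action]

        # Same Q-learning backup as step d, now driven by simulated experience.
        best_next = max(Q_values.get(next_state, {}).values(), default=0)
        current = Q_values.setdefault(state, {}).get(action, 0)
        Q_values[state][action] = current + lr * (reward + gamma * best_next - current)
```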
Experimenting with different steps
When the number of planning steps is set to 0, the Dyna-Q method reduces to plain Q-learning. Let's compare the learning process with steps of 0, 5 and 50.

The x-axis is the number of episodes and the y-axis is the number of steps taken to reach the goal; the task is to get to the goal as fast as possible. From the curves, we observe that the learning curve of the planning agent (with the simulated model) stabilises faster than that of the non-planning agent. Referring to Sutton's book:
Without planning (n = 0), each episode adds only one additional step to the policy, and so only one step (the last) has been learned so far. With planning, again only one step is learned during the first episode, but here during the second episode an extensive policy has been developed that by the end of the episode will reach almost back to the start state
The additional model simulation and backups further reinforce the agent's experience, resulting in a faster and more stable learning process (check out the full implementation).
How to generalise the idea?
The example we explored here surely has limited use, as the states are discrete and the actions are deterministic. But the idea of modelling the environment to accelerate the learning process applies far more widely.
For discrete states with non-deterministic actions
A probability model could be learnt rather than the straightforward one-to-one mapping we introduced above. The probability model should be constantly updated through the learning process, and during the backup stage, the (reward, nextState) pair could be sampled from a probability distribution rather than chosen deterministically.
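One simple way to do this is to keep counts of the observed outcomes per (state, action) and sample from the empirical distribution during planning. This is only a sketch of that idea; the names and the example transitions are illustrative.

```python
import random
from collections import defaultdict

# counts[(state, action)][(next_state, reward)] = how often this outcome was observed
counts = defaultdict(lambda: defaultdict(int))

def update_model(state, action, next_state, reward):
    counts[(state, action)][(next_state, reward)] += 1

def sample_model(state, action):
    # Sample an outcome with probability proportional to its observed frequency.
    outcomes = counts[(state, action)]
    outcome = random.choices(list(outcomes.keys()),
                             weights=list(outcomes.values()))[0]
    return outcome  # (next_state, reward)

# Example: from (1, 0), "up" led to (0, 0) twice and bounced back once,
# so the sampled outcome is (0, 0) about two thirds of the time.
update_model((1, 0), "up", (0, 0), 0)
update_model((1, 0), "up", (0, 0), 0)
update_model((1, 0), "up", (1, 0), 0)
print(sample_model((1, 0), "up"))
```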
For continuous states
The Q function update will be slightly different (I will introduce it in further articles), and the key would be to learn a more complex, general parametric model of the environment. This could involve general supervised learning algorithms, with the current state and action as input and the next state and reward as output.
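As a rough illustration of that last point, the model-learning part then becomes a standard regression problem. The sketch below uses scikit-learn's LinearRegression purely as a placeholder for whatever supervised learner fits the problem, and the toy data is made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy transitions: each input row is [state_feature_1, state_feature_2, action],
# each target row is [next_state_feature_1, next_state_feature_2, reward].
X = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.2, 0.0],
              [1.0, 0.4, 1.0]])
y = np.array([[0.1, 0.0, 0.0],
              [0.5, 0.3, 0.0],
              [1.0, 0.5, 1.0]])

# Fit the environment model: (state, action) -> (next_state, reward).
env_model = LinearRegression().fit(X, y)

# During planning, the learnt model predicts simulated transitions.
print(env_model.predict([[0.2, 0.1, 1.0]]))
```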
In the next post, we will learn further ideas to improve Dyna methods and talk about situations where the model is wrong!
And lastly, please check out the full code here. You are welcome to contribute, and if you have any questions or suggestions, please leave a comment below!
Reference:
Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction.