A high-level view of how Model-Based Reinforcement Learning works.

Update: The best way of learning and practicing Reinforcement Learning is by going to http://rl-lab.com
Introduction
If you have ever played a real-time strategy game (RTS) such as Age of Empires, you surely know that you start with an almost black screen. The first thing you do is send units in every direction to scout the terrain and discover the enemy location, as well as strategic locations and resources such as mines, forests, etc.

What you are actually doing is building a map (or model) of the world you are in. After creating this map, whether partially or completely, you start planning your moves and preparing your upcoming battles.

By planning, we mean simulating actions in your head without really executing them in the environment of the game. This simulation saves you time; however, it is based on the model you have built. If this model is inaccurate or biased, your planning will fall apart.
Let’s consider the case of a chess game:

Suppose you are playing a chess game: you make a move, then wait for your opponent to make his move, then you make another move and your opponent follows with his own, and then you discover that your first move was not that good after all! Usually, if you are learning, your teacher will let you roll back the moves so you can learn from your mistake. However, you have lost quite some time.
On the other hand, if you simulate the moves in your head (which is what everyone does) and tell yourself "if I make this move my opponent can counter with that move, then I make this move, etc…", you would avoid the previous scenario altogether.
What you are actually doing is unfolding in your head a search tree based on the model you have of chess, and from this tree you choose the move that is most likely to lead to a win.
Now replace yourself with an AI agent, and you get Model-Based Reinforcement Learning.
You can clearly see how this saves training time. Of course, the gain won't be apparent in small, highly reactive environments (Grid World, for example), but for more complex environments such as Atari games, learning via model-free RL methods is time consuming, while performing a reduced set of actions to build a model and then using this model to simulate episodes is much more efficient.
What is a Model?
Abstractly, a model is your own representation of reality, or of the environment you are in. In RL this translates into having a representation M of the [MDP](https://medium.com/@zsalloum/basics-of-reinforcement-learning-the-easy-way-fb3a0a44f30e) [S, A, P, R]. This means having a version Mi (ideally as accurate as possible) of the real MDP.
If we assume that the state space S and the action space A are known, the model Mi becomes [S, A, Pi, Ri], where Pi and Ri are estimates of the true transition probabilities P and rewards R.
So according to the model Mi, going from state S to S' after performing action A happens with probability Pi(S' | S, A); similarly, receiving a reward r' when performing action A in state S is governed by Ri(r' | S, A).
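As a rough illustration (the data structures and names below are my own, not from any specific library), such a model Mi can be stored as simple lookup tables and sampled in place of the real environment:

```python
import random

# Minimal sketch: the learned model Mi = [S, A, Pi, Ri] as lookup tables.
# The states, actions and numbers below are placeholders for illustration.

# Pi[(s, a)][s'] ~ estimated transition probability Pi(s' | s, a)
Pi = {("s0", "a0"): {"s1": 0.9, "s0": 0.1}}

# Ri[(s, a)] ~ estimated expected reward when performing a in s
Ri = {("s0", "a0"): 0.5}

def sample_model(s, a):
    """Simulate one step with the learned model instead of the real environment."""
    next_states = list(Pi[(s, a)].keys())
    probs = list(Pi[(s, a)].values())
    s_next = random.choices(next_states, weights=probs, k=1)[0]
    return s_next, Ri[(s, a)]
```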
Difference between Model-Based and Model-Free
Model-Free methods learn directly from experience: they perform actions either in the real world (e.g. robots) or in a computer (e.g. games), collect the reward from the environment, whether positive or negative, and update their value functions.
This is the key difference with the Model-Based approach: Model-Free methods act in the real environment in order to learn.
Conversely, a Model-Based algorithm uses a reduced number of interactions with the real environment during the learning phase. It aims to construct a model from these interactions, and then uses this model to simulate further episodes, not in the real environment but by applying them to the constructed model and getting the results that the model returns.
As described previously, this has the advantage of speeding up learning, since there is no need to wait for the environment to respond, nor to reset the environment to some state in order to resume learning.
On the downside, however, if the model is inaccurate, we risk learning something completely different from reality.
Another point worth noting is that Model-Based algorithms still use Model-Free methods, either to construct the model or in the planning/simulation phase.
Learning the Model
Learning the model consists of executing actions in the real environment and collecting the feedback. We call this experience.
So for each state and action, the environment provides a new state and a reward. Based on this collection of experiences we try to deduce the model. As one can guess, this is none other than a supervised learning problem.

We solve this problem using one of the available supervised learning techniques; it can be regression, a neural network, or something else.
Depending on the supervised learning technique, the model will be represented by a lookup table, a neural network, or some other structure.
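For the table-lookup case, a minimal sketch could simply count the observed transitions and average the observed rewards (the function and variable names below are illustrative, not taken from a particular library):

```python
from collections import defaultdict

def fit_table_lookup_model(experience):
    """Estimate Pi(s' | s, a) and Ri(s, a) by counting real experience.
    experience: list of (s, a, r, s_next) tuples collected in the real environment."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': visit count}
    reward_sum = defaultdict(float)                  # (s, a) -> sum of observed rewards
    visits = defaultdict(int)                        # (s, a) -> number of visits

    for s, a, r, s_next in experience:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1

    Pi = {sa: {s2: c / visits[sa] for s2, c in nxt.items()}
          for sa, nxt in counts.items()}
    Ri = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return Pi, Ri
```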
Concrete Example
The following example depicts two states A and B, with transitions from A to B and from B to two possible terminal states. We assume there is no discounting (𝛄 = 1).
To build the model, we run several episodes in the real environment, collect the resulting states and rewards, and then deduce the [P, R] of the model as shown below:

We can see that, according to our experience (the tests we ran), going from A to B happens 100% of the time with reward 0, while from B we reach the upper terminal state 75% of the time with reward 1, and the lower terminal state 25% of the time with reward 0.
This means that if we run a simulation using this model (not the real environment) with an arbitrary set of samples, then each time we are at A the model tells us that we move to B 100% of the time with r = 0, and each time we are at B we go up 75% of the time with r = 1 and down 25% of the time with r = 0.

Now we can compute the value function using a method such as Monte Carlo, which in our example leads to V(A) = 1 and V(B) = 0.75. As a reminder of the Monte Carlo value computation: G(s) = r + 𝛄 G(s'), where G(s) is the return at state s after each episode, and V(s) is the average of G(s) over all episodes.
V(B) = (6 * G1 + 2 * G2) / 8 = (6 * 1 + 2 * 0) / 8 = 0.75
V(A) = ([r + G(B)] + [r + G(B)]) / 2 = ((0 + 1) + (0 + 1)) / 2 = 1
PS: we considered that there is no discounting, so 𝛄 = 1.
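As a quick sanity check of these numbers, here is a minimal sketch that reproduces the same averaging (the episode returns are the ones listed in the example above):

```python
# Quick check of the arithmetic above (no discounting, gamma = 1).

# Eight episodes that pass through B: six end with return 1, two with return 0.
returns_B = [1] * 6 + [0] * 2
V_B = sum(returns_B) / len(returns_B)      # (6*1 + 2*0) / 8 = 0.75

# Two simulated episodes starting at A: reward 0 for A -> B, then the return of B.
returns_A = [0 + 1, 0 + 1]
V_A = sum(returns_A) / len(returns_A)      # ((0+1) + (0+1)) / 2 = 1.0

print(V_A, V_B)                            # 1.0 0.75
```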
Main Loop
In summary, the main loop of Model-Based RL is as follows (a minimal code sketch follows the list):
- We act in the real environment, collect experience (states and rewards),
- then we deduce a model, and use it to generate samples (planning),
- we update the value functions and policies from samples,
- use these value functions and policies to select actions to perform in the real environment,
- then restart the loop again.
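A minimal structural sketch of this loop, where `collect_experience`, `fit_model`, `simulate` and `update_values` are hypothetical helpers passed in by the caller (they stand for the steps listed above and are not from any library):

```python
def model_based_rl_loop(env, policy, collect_experience, fit_model,
                        simulate, update_values,
                        n_iterations=100, n_planning_episodes=50):
    """Structural sketch of the Model-Based RL main loop; the helper callables
    are placeholders for the steps listed above."""
    value_fn, model = {}, None
    for _ in range(n_iterations):
        # 1. act in the real environment, collect experience (states and rewards)
        experience = collect_experience(env, policy)
        # 2. deduce a model from that experience
        model = fit_model(model, experience)
        # 3. planning: use the model to generate simulated samples
        samples = [simulate(model, policy) for _ in range(n_planning_episodes)]
        # 4. update value functions and policy from the samples
        value_fn, policy = update_values(value_fn, policy, samples)
        # 5. the updated policy selects actions in the next round of real interaction
    return value_fn, policy
```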

Dyna Architecture
A variation of Model-Based RL is called the Dyna Architecture. Instead of using the real experience only to construct a model, Dyna also uses it to directly update the value functions.

Dyna-Q Algorithm
The algorithm starts by initializing Q and the Model, then enters the main loop. It starts at the current state, selects an action according to a policy, executes the action in the real environment, and observes the reward R and the new state S'. With R and S' available, it updates Q(S, A) and the Model; note that S' now becomes the current state. Next it enters the planning phase: a second loop that iterates n times. Inside this loop the algorithm randomly selects a previously observed state and an associated action, applies them to the Model, gets the reward and the new state from the Model, and then updates Q(S, A) just as in the outer loop.
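A minimal tabular sketch of Dyna-Q, assuming a deterministic table-lookup model and an environment exposing `reset()` and `step(action)` plus an action list (these interface names are assumptions on my part, not taken from the article):

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_steps=1000, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q sketch. Assumes env.reset() -> state and
    env.step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)   # Q[(state, action)]
    model = {}               # model[(state, action)] = (reward, next_state, done)

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    for _ in range(n_steps):
        a = epsilon_greedy(s)                       # select action from the policy
        s2, r, done = env.step(a)                   # act in the real environment
        # direct RL update from the real experience (Q-learning rule)
        target = r + (0 if done else gamma * max(Q[(s2, a2)] for a2 in actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        # model learning: remember what the environment did (table lookup)
        model[(s, a)] = (r, s2, done)
        # planning: n simulated one-step updates using the model
        for _ in range(n_planning):
            (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
            ptarget = pr + (0 if pdone else gamma * max(Q[(ps2, a2)] for a2 in actions))
            Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
        s = env.reset() if done else s2
    return Q
```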

Simulation Based Search
You may have noticed that the planning phase, as described previously, randomly samples states from the observed state space and does not give any emphasis to the current state the agent is in.
Simulation-Based Search changes this process: it takes the current state St that we have reached in the real environment, and during the planning phase it runs multiple simulations starting from St. Notice that the starting state is no longer random, as it was in Dyna-Q.
So the planning phase now becomes a loop that generates K episodes from the model M, starting from the current state St.

The superscript k refers to the index of the episode, where K is the total number of episodes.
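A minimal sketch of this planning loop, where `sample_model(s, a)` and the simulation policy `pi(s)` are placeholders for whatever model and policy are in use (they are not defined in the article):

```python
def simulate_episodes(St, K, sample_model, pi, max_steps=100):
    """Generate K simulated episodes from the model, all starting at the
    current real state St. sample_model(s, a) -> (next_state, reward, done)."""
    episodes = []
    for _ in range(K):                       # k = 1 .. K
        episode, s = [], St
        for _ in range(max_steps):
            a = pi(s)                        # simulation policy chooses the action
            s_next, r, done = sample_model(s, a)
            episode.append((s, a, r))
            s = s_next
            if done:
                break
        episodes.append(episode)
    return episodes
```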

Simulation-Based Search of the planning phase gives rise to many types of implementations, such as Simple Monte Carlo Search and Monte Carlo Tree Search (MCTS).
Simple Monte Carlo Search
Simple Monte Carlo Search consists of a Model and a simulation policy π (a policy dedicated to the planning phase). Starting from the real current state St, we take each possible action a and generate K simulations starting from St with action a.

After that, we evaluate each action by computing Q(St, a) as the average of the returns of its simulated episodes.

We then choose the action to execute in the real environment by selecting the one that maximizes the Q value function at St.
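A minimal sketch of Simple Monte Carlo Search, where `simulate_return(model, s, a, pi)` is a placeholder that returns the return of one episode simulated with the model, starting in state s with action a under the simulation policy pi:

```python
def simple_mc_search(St, actions, K, model, pi, simulate_return):
    """For each candidate action, average the returns of K simulated episodes,
    then act greedily with respect to these Q(St, a) estimates."""
    Q = {}
    for a in actions:
        returns = [simulate_return(model, St, a, pi) for _ in range(K)]
        Q[a] = sum(returns) / K            # Q(St, a) = average simulated return
    return max(Q, key=Q.get)               # action to execute in the real environment
```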

Monte Carlo Tree Search
The Monte Carlo Tree Search algorithm has been discussed in detail in the article Monte Carlo Tree Search in Reinforcement Learning. It is used as a building block of AlphaZero from DeepMind.
Other Resources
You might also be interested in the following article: Model-Based and Model-Free Reinforcement Learning – Pytennis Case Study