
A Top-Down View of Reinforcement Learning

Stitch together the different parts and branches of Reinforcement Learning

Photo by Josh Power on Unsplash

Update: The best way of learning and practicing Reinforcement Learning is by going to http://rl-lab.com

When you are new to Reinforcement Learning, you will no doubt be bombarded with weird terms like Model-Based, Model-Free, On-Policy, Off-Policy, etc.

Soon you will find it exhausting to keep track of this terminology, which seems to appear all over the place with no obvious link between its terms.

This article will try to put all these terms into perspective so that beginners don’t feel overwhelmed.

Disclaimer: this article assumes that you already know what Reinforcement Learning is and are familiar with some of the existing algorithms. It does not introduce or explain any particular algorithm, but it tries to put the different branches together so that you get a comprehensive big picture of how they fit in.

Reinforcement Learning

As already established, Reinforcement Learning is a framework that lets an agent learn decision making from experience. It consists of an agent interacting with an environment, where it takes actions and collects rewards. The goal of the agent is to collect the maximum rewards.
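To make the loop concrete, here is a minimal, self-contained sketch in Python. The CorridorEnv environment and the random policy are made up purely for illustration; they do not come from any particular library.

```python
import random

# A toy environment, invented for illustration: the agent walks on
# positions 0..4 and receives a reward of +1 when it reaches position 4.
class CorridorEnv:
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action is -1 (left) or +1 (right)
        self.pos = max(0, min(4, self.pos + action))
        reward = 1.0 if self.pos == 4 else 0.0
        done = self.pos == 4
        return self.pos, reward, done

# The agent-environment loop: observe the state, take an action,
# collect the reward, repeat until the episode ends.
env = CorridorEnv()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random.choice([-1, +1])  # a (very naive) random policy
    state, reward, done = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)
```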

To make all this precise, we need the following definitions.

State

The state is the collection of elements or features that describe a situation at a certain moment or time step. An example of a state could be the position of a robot, its orientation, the landscape around it, the wind speed, the temperature, etc.

Environment State

The Environment State is the state that describes the environment at a given moment or time step. The state of the environment might contain so much detail that it is neither possible nor interesting to include all of it in any computation. For example, the state of the atoms in a robot movement problem is not interesting to take into account.

Agent State

The Agent State is the state as it is perceived by the agent. The agent might not be able to detect the full state of the environment; for example, a robot with a fixed camera can't see a 360° view. In a poker game, the agent only knows the opponent's public cards. Generally the agent state differs from the environment state, but in the simplest cases, such as some board games, they are the same.

In a fully observable environment, the agent sees the full environment state, so the observation is the environment state. The agent is then said to be in a Markov Decision Process (MDP).

In a partially observable environment, the agent gets only partial information. This setting is called a Partially Observable MDP (POMDP); the environment can still be an MDP, but the agent does not know its full state.

History

The history is the sequence of observations, actions, and rewards.

Markov Decision Process

A Markov Decision Process (MDP) is a mathematical framework for modeling decision making. A process is Markov if the next state depends only on the current state; any past state is irrelevant.
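In standard notation, with S_t the state at time step t, the Markov property reads:

```latex
P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, S_2, \dots, S_t)
```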

State/Action Value Functions

State and Action Value functions are functions that give a value to a certain state s, or to an action a performed in state s. The idea is to assess the importance of being in a certain state and/or performing a certain action relative to other states or actions. In short, they tell how valuable it is to be in that state and how good it is to take that action. Imagine a game of chess where white has an opportunity for a checkmate. Being in such a position is a very valuable state, and among all possible actions at that position, performing the move that delivers checkmate is the best action to take.

In problems with a limited number of states, it is easy to compute the exact value of states and actions. However, when the number of states becomes extremely large, approximation becomes necessary in order to save time and resources. For these kinds of problems, Function Approximation is used.
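For reference, the standard definitions are given below, where G_t is the discounted return, γ the discount factor, and the expectation is taken while following policy 𝜋:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
V_\pi(s) = \mathbb{E}_\pi[\, G_t \mid S_t = s \,]
Q_\pi(s, a) = \mathbb{E}_\pi[\, G_t \mid S_t = s, A_t = a \,]
```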

Policy

A policy 𝜋(s) is a function that maps a state to an action. It is like being in some situation and asking yourself "what should I do now?". The policy tells you what action to take. A policy can be deterministic, which means the same state always leads to the same action, or stochastic, in which case the same state leads to different actions according to some probability distribution. A gentle introduction to policies can be found in the article "Reinforcement Learning Policy for Developers".
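As a quick illustration, here is what the two flavors might look like in Python. The states, actions, and probabilities are invented for the example and do not refer to any specific environment:

```python
import random

# Deterministic policy: the same state always maps to the same action.
def deterministic_policy(state):
    return "right" if state < 4 else "stay"

# Stochastic policy: the same state maps to different actions
# according to a probability distribution (here 20% left, 80% right).
def stochastic_policy(state):
    return random.choices(["left", "right"], weights=[0.2, 0.8])[0]

print(deterministic_policy(2))  # always "right"
print(stochastic_policy(2))     # "left" or "right", at random
```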

Model

A model predicts what the environment will do next. For example, a transition probability predicts what the next state will be, and a reward function predicts the next reward. A model does not automatically give us a good policy; we still need to plan.
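A tabular model for a tiny, made-up environment could look like the sketch below; the two states and the actions are purely illustrative:

```python
# transition[s][a] maps to a probability distribution over next states;
# reward[(s, a)] is the expected immediate reward for taking a in s.
transition = {
    "A": {"go": {"B": 0.9, "A": 0.1}, "stay": {"A": 1.0}},
    "B": {"go": {"B": 1.0}},
}
reward = {("A", "go"): 0.0, ("A", "stay"): 0.0, ("B", "go"): 1.0}

# The model predicts what happens next, but it does not tell us which
# action to take: we still need to plan (search, dynamic programming, ...).
```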

Learning & Planning

Two fundamental problems in RL:

  • Learning: the environment is initially unknown, and the agent learns by interacting with it.
  • Planning: a model is given, and the agent plans in the model (without external interaction). By planning we mean reasoning, thinking, searching.

Prediction & Control

Prediction is the evaluation of the future given a policy, while Control is the optimisation of the future by finding the best policy, the one that maximizes the cumulative reward.

Exploration vs Exploitation

Exploration is about finding more information about the environment. Exploitation is about taking advantage of the acquired information to maximize rewards. It is important to know how much to explore and how much to exploit.
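A common way to balance the two is an ε-greedy rule: with probability ε the agent explores with a random action, otherwise it exploits the action with the highest estimated value. Here is a minimal sketch; the Q-values are made up for the example:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

# Example: estimated action values for a single state.
q_values = {"left": 0.2, "right": 0.7}
print(epsilon_greedy(q_values))  # usually "right", sometimes "left"
```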

Model based / Model Free

Model-Based Reinforcement Learning relies on knowledge of the environment dynamics, for instance the transition probabilities between states as well as the rewards. These can be known for board games, but they are very difficult to obtain in real-life problems.

The model can be given or learned; it can then be used for planning, that is, studied in a way that does not require taking real actions. The planning phase uses specialized algorithms such as Monte Carlo Tree Search, which is used in AlphaZero. It goes without saying that the model MUST be accurate enough to represent the real problem; otherwise it would be a waste of time and resources to plan actions based on an inaccurate model, which will lead to poor performance in the real environment.

Dynamic Programming is one of the model-based algorithms.
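As a rough illustration of planning with Dynamic Programming, here is a small value iteration sweep over a made-up tabular model; the states, actions, and rewards are invented for the example:

```python
# transitions[s][a] is a list of (probability, next_state, reward) tuples.
transitions = {
    "A": {"go": [(1.0, "B", 0.0)], "stay": [(1.0, "A", 0.0)]},
    "B": {"go": [(1.0, "B", 1.0)], "stay": [(1.0, "B", 0.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in transitions}

for _ in range(100):  # repeat sweeps until the values (roughly) converge
    for s, actions in transitions.items():
        V[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )

print(V)  # approximate optimal state values for this toy model
```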

On the other hand, Model-Free algorithms do not rely on models to learn; they learn through direct experience, meaning they take actions in the real environment. Among such algorithms we can find Monte Carlo and Temporal Difference (TD) learning.
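The two families differ in the target they use to update an estimate: Monte Carlo waits for the complete return G_t observed at the end of an episode, while TD bootstraps from the current estimate of the next state. In standard notation, with α the learning rate:

```latex
\text{Monte Carlo:} \quad V(S_t) \leftarrow V(S_t) + \alpha \, [\, G_t - V(S_t) \,]
\text{TD(0):} \quad V(S_t) \leftarrow V(S_t) + \alpha \, [\, R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \,]
```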

On Policy / Off Policy

In TD learning we compute the action value Q(s, a) at state s by also taking into consideration the action value at the next state, Q(s', a').

How Q at the next state is obtained is what distinguishes On-Policy from Off-Policy methods.

On-Policy methods compute the Q(s, a) value based on a certain policy 𝜋, meaning Q(s, a) needs the value Q(s', a'). To get the value of Q(s', a') we need the action a', which is obtained using the same policy 𝜋 that produced action a. When moving to state s', we still follow the action a' that has been previously determined. The algorithm using On-Policy learning is called SARSA.
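The resulting SARSA update is, with a' the action actually selected by 𝜋 in state s':

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \, [\, r + \gamma Q(s', a') - Q(s, a) \,]
```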

In contrast, Off-Policy methods compute Q(s, a) by using the maximum Q(s', a') over all available actions at s'. This means that we don't have to choose a specific action a' according to policy 𝜋; we simply take the Q(s', a') that has the highest value, and when moving to state s', we don't necessarily follow the action that produced the maximum Q value. The algorithm using this technique is called Q-Learning.
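The corresponding Q-Learning update replaces Q(s', a') with the maximum over all actions available at s':

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \, [\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,]
```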

Categorizing Agents

Agents can be categorized in different ways according to the types of components their algorithms use to learn and make decisions:

Value Based

  • No policy (it is implicit)
  • Value function

Policy Based

  • Policy
  • No value function

Actor Critic

  • Policy
  • Value Function

Orthogonally to the above categories, each of the above agent types can be:

Model Free

  • Policy / Value functions
  • No model

Model Based

  • Optional Policy / Value functions
  • Model

The following picture summarizes the categories of agents that are commonly used in Reinforcement Learning.

This picture gives a glimpse of some RL algorithms and their categories:

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html#a-taxonomy-of-rl-algorithms
