
Crystal Clear Reinforcement Learning

Comprehensive & concise concepts of Reinforcement Learning (RL)

Pixabay images used for this illustration picture

Reinforcement learning (RL) is one of the hottest fields of Artificial Intelligence and Machine Learning, with many breathtaking breakthroughs in the last couple of years. This article is an attempt to give a concise view of the entire RL spectrum without going too deep into math and formulas, while not losing sight of the trees in this dense and complex forest.

Now first, let’s understand what Reinforcement Learning, or RL, is.

RL means taking optimal actions with long-term results, or cumulative rewards, in mind.

Pixabay images used for this illustration picture

Reinforcement learning (RL) is learning by interacting with an environment. An RL agent learns from the consequences of its actions rather than from being explicitly taught, and it selects its actions based both on its past experiences (exploitation) and on new choices (exploration), which is essentially trial-and-error learning, just as a child learns. The reinforcement signal that the RL agent receives is a numerical reward, which encodes the success of an action’s outcome, and the agent seeks to learn to select actions that maximize the cumulative reward over time.

Pixabay images used for this illustration picture

Before we dive deep into RL, let’s first take a look at why RL is so important across the Artificial Intelligence and Machine Learning landscape. Refer to the below diagram, which shows that RL is used in every sphere of life.

Pixabay images used for this illustration picture

Now, I will cover a few terms frequently used to explain Reinforcement Learning; one must understand these before diving into the algorithms and more involved concepts.

Pixabay images used for this illustration picture

In the above diagram, I have listed the name, standard notation, and pictorial representation of each term. Now I will define each of them.

Agent

Anything that senses its environment using some kind of sensor and is able to take actions in that environment is called an agent. The agent executes actions and receives observations and rewards.

Environment

The environment is the overall representation of everything the agent interacts with; the agent itself is not considered part of the environment. The environment receives an action and emits an observation and a reward.

State

The state describes the current situation and determines what happens next. The agent may have only a partial view of the state, which is called an observation.

Reward

When an agent takes an action in a state, it receives a reward. Here the term "reward" is an abstract concept that describes feedback from the environment. A reward can be positive or negative. When the reward is positive, it corresponds to our usual meaning of reward. When the reward is negative, it corresponds to what we usually call "punishment."

Pixabay, Atari & Tesla images used for this illustration picture

The reward is the feedback the agent gets from the environment at every time step. It can then use this reward signal (positive for a good action, negative for a bad one) to work out how to behave in a state. The goal, in general, is to solve a given task with the maximum reward possible. That is why many algorithms assign a tiny negative reward to every action the agent takes, to encourage it to solve the task as fast as possible. Reward design is a critical decision in RL, as the reward is what the agent optimizes its behavior against: for example, win the chess game in less time, or drive the car at a moderate speed without any collision (the car can speed up on an empty highway but should drive slowly on a busy road). The reward tells us what is good in an immediate sense. For example, in an Atari game, the agent gains or loses points at every time step.

Discount Factor

Reward now is more valuable than reward in the future. The discount factor, usually denoted as γ, is a factor multiplying the future expected reward, and it varies over the range [0,1]. It controls the importance of future rewards versus immediate ones. The lower the discount factor, the less important future rewards are, and the agent will tend to focus only on actions that yield immediate rewards. The fact that the discount rate is bounded to be smaller than 1 is a mathematical trick to make an infinite sum finite, which helps to prove the convergence of certain algorithms. The discount factor is used to determine the present value of a future reward (similar to the present value of future money, which is discounted due to factors like inflation).
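
To make the discounting idea concrete, here is a tiny Python sketch (my own illustration, not from the original article) that computes the discounted return of a reward sequence:

```python
def discounted_return(rewards, gamma=0.9):
    """Return r0 + gamma*r1 + gamma^2*r2 + ... for one episode's rewards."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# A reward received three steps in the future is worth less today:
print(discounted_return([0, 0, 0, 1], gamma=0.9))  # 0.9**3, roughly 0.729
```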

State Value

State Value ‘V(S)’ is the total amount of reward an agent can expect to accumulate in the future, starting from that state. Unlike the reward signal, the value function specifies what is good in the long run.

Action

Action is what an agent can do in each state. For example, a robot stride can only be, say, 0.01 meter to 1 meter. The Go program can only put down its piece in one of 19 x 19 (that is 361) positions. A self-driving car may move in different directions.

Action Value

Action Value, or the state-action value function (Q function), specifies how good it is for an agent to perform a particular action in a state under a policy π. The Q function is denoted by Q(S, A). It indicates the value of taking an action in a state while following policy π. ‘Q’ here stands for the quality of an action.

Policy

The policy is the agent’s way of selecting an action in a state. A policy is a mapping from states to actions that can be either deterministic or stochastic. The policy is denoted by π.

Model of the Environment

The model mimics the environment. It knows the transitions from state to state and provides a high-level understanding of the environment.

Pixabay images used for this illustration picture

Now, I will cover a few essential concepts like exploration, exploitation, on-policy, and off-policy before we dive into the RL algorithms.

Exploitation and Exploration

The exploitation-and-exploration concept is inherently linked to human nature: we, as humans, prefer the known to the unknown. For example, when going out to eat, you can choose your favorite restaurant since you already like the food there, but unless and until you try another restaurant, you won’t know whether a better one exists.

Pixabay images used for this illustration picture

Exploitation thus means repeating the action that gives the best value from a state (often called the greedy action), while exploration means trying out new actions that may give a better return in the long run, even though the immediate reward may not be encouraging. In the above diagram, if the agent only considers the immediate reward and follows the red path to gain the maximum reward, it will later find that the blue path has a higher value even though its immediate reward is lower. That is why exploration is needed to achieve a better long-term return.
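
A common way to balance the two is ε-greedy action selection. Below is a minimal sketch (my own illustration); `q_values` is assumed to hold the current action-value estimates for one state:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

action = epsilon_greedy([0.2, 0.5, 0.1])  # usually 1 (the greedy action), occasionally random
```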

On-policy & Off-policy

In off-policy learning, we evaluate a target policy (π) while following another policy called the behavior policy (μ); this is like a robot learning from a video, or an agent learning from experience gained by another agent. On-policy means the agent follows the same policy that it uses to update its target.

Pixabay images used for this illustration picture

Planning & Learning

An RL algorithm can be classified as either a planning or a learning algorithm. In planning, the agent knows the model of the environment (i.e., the transition probabilities), and it plans the optimal policy under that model. In the learning problem, the agent does not know the dynamics of the environment, and it needs to interact with it through trial and error to learn about the environment and come up with an optimal policy or optimal state values.

Model-based and Model-free

In a model-based RL problem, the agent knows the model of the environment, meaning it knows the transition dynamics, and the main objective is to plan optimal actions. In model-free RL, the agent does not know the transition or other dynamics; rather, it has to learn, by interacting with the environment, the optimal actions that lead to an optimal outcome.

Reinforcement Learning ‘Planning’ Algorithm

Markov Decision Processes (MDP)

A Markov Decision Process (MDP) is a mathematical representation of a complex decision-making process. MDPs formally describe an environment for Reinforcement Learning. The environment is fully observable and stationary (meaning the rules do not change with time). Almost all RL problems can be formalized using MDPs. An MDP is defined by:

  • A state S, which represents every state that one could be in within a defined world. Only the present matters, which means that the transition function depends solely on the current state S and not on any of the previous states.
  • A model, or transition function P, which is a function of the current state, the action taken, and the state where we end up. This transition gives the probability of ending up in state S’, earning reward R along the way, starting from state S and taking action A.
  • A reward, which is a scalar value for being in a state. It tells us the usefulness of entering the state.

The final goal of the MDP is to find a policy that can tell us, for any state, which action to take (remember, a mapping of states to actions is our policy). The optimal policy is the one that maximizes the long-term expected reward. Once an optimal policy is found, the RL problem is solved.

In RL, we consider how to compute the state-value function V(S) for an arbitrary policy π. This is called policy evaluation in the RL literature (also referred to as the prediction problem). The process of making a new policy that improves on an original policy, by making it greedy with respect to the value function of the original policy, is called policy improvement or control.

Generalized Policy Iteration

Generalized policy iteration (GPI) refers to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes. It’s like Messi and Ronaldo playing on the same team in a football match: they pass the ball to each other and compete on personal goal tallies, yet they also assist each other to win the game for the team. Similarly, GPI is like a football (soccer) game whose ultimate aim is to find the optimal policy and optimal value function. In one sweep (maybe a full episode, every step, or a few steps), the policy is evaluated to figure out the values under that policy; then the policy is improved by taking greedy actions with respect to the old policy’s values. This is repeated until convergence, and an optimal policy or optimal value function is found.

Bellman Equation

The value of a particular state is equal to the reward in the current state, plus the discounted rewards one gets from that point on, transitioning from the current state S to a future state S’.
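
Written out with the notation introduced earlier (my rendering of the standard equation, not the article’s original), the Bellman expectation equation for a policy π reads:

```latex
V^{\pi}(S) = \sum_{A} \pi(A \mid S) \sum_{S'} P(S' \mid S, A)\,\bigl[ R(S, A, S') + \gamma\, V^{\pi}(S') \bigr]
```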

Value Iteration

To solve the Bellman equation, we normally start with arbitrary values for all states and then update them based on their neighbors (all the states that can be reached from the current state). Finally, we repeat this until convergence. This process is called "value iteration." Value iteration combines one (or more than one) sweep of policy evaluation with a sweep of policy improvement, and the process is repeated until convergence. Value iteration amounts to finding the optimal value function plus one policy extraction. There is no need to repeat the two, because once the value function is optimal, the policy extracted from it is also optimal (i.e., converged).
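
As a minimal sketch of value iteration (my own illustration, assuming a small tabular MDP given as `P[s][a] = [(prob, next_state, reward, done), ...]`, similar to the Gym toy-text convention):

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup, then extract the greedy policy."""
    def backup(s, a, V):
        # Expected one-step return of taking action a in state s under the current values V
        return sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])

    V = np.zeros(n_states)
    while True:
        V_new = np.array([max(backup(s, a, V) for a in range(n_actions)) for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:   # stop when the values no longer change
            break
        V = V_new
    policy = [int(np.argmax([backup(s, a, V) for a in range(n_actions)])) for s in range(n_states)]
    return V, policy
```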

Policy Iteration

This is a GPI process that produces a sequence of monotonically improving policies and value functions by repeating policy evaluation and then performing policy improvement. Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations. This way of finding an optimal policy is called policy iteration. Policy iteration consists of policy evaluation plus policy improvement, and the two are repeated iteratively until the policy converges.

RL ‘Learning’ algorithm

In the RL problem, our objective is to find either optimal values for every state or an optimal policy. Based on how the algorithms are formulated, they are broadly grouped into three categories. We can also distinguish methods based on the policy they follow (on-policy or off-policy).

Value-Based Algorithm

Pixabay images used for this illustration picture

Monte Carlo Learning

The Monte Carlo method is a straightforward concept: the agent learns about states and rewards as it interacts with the environment. In this method, the agent generates experienced samples, and then the value of a state or state-action pair is calculated as the average return. Below are the key characteristics of the Monte Carlo (MC) method:

  1. There is no model (the agent does not know the MDP state transitions).
  2. The agent learns from sampled experience.
  3. It learns the state value vπ(s) under policy π by averaging the returns experienced over all sampled episodes (value = average return).
  4. Values are updated only after a complete episode (because of this, convergence is slow and updates happen only once an episode is complete).
  5. There is no bootstrapping.
  6. It can only be used for episodic problems.

In the Monte Carlo method, instead of the expected return we use the empirical return that the agent has sampled while following the policy. Consider the below diagram, where we see five different trajectories (a trajectory is the path taken by the agent in an episode) that the agent experiences from the start to the end of an episode. In this case, the starting state is present in all five samples, and the value of that state would be the average of the 5 values sampled in the five rollouts.

There are two varieties of the MC method: First Visit MC and Every Visit MC.

Pixabay images used for this illustration picture

In First Visit MC, only the first visit to a state is counted in the computation of its value, even if that state is visited multiple times in an episode. In Every Visit MC, every visit to the state in an episode contributes, so the state value is impacted by the number of visits (refer to the S3 state value in the below example).
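
A minimal sketch of First Visit MC prediction (my own illustration; each episode is assumed to be a list of `(state, reward)` pairs, where the reward is the one received after leaving that state):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """Estimate V(s) as the average return following the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:
        # Discounted return from each time step, computed backwards through the episode
        g, rets = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            g = episode[t][1] + gamma * g
            rets[t] = g
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)        # remember only the first time each state appears
        for s, t in first_visit.items():
            returns[s].append(rets[t])
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}
```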

Pixabay images used for this illustration picture

TD(0)

TD(0) is the simplest form of temporal-difference (TD) learning. In this form of TD learning, after every step the value function is updated with the value of the next state plus the reward obtained along the way. This observed reward is the key factor that keeps the learning grounded, and the algorithm converges after a sufficient amount of sampling (in the limit of infinity). All TD methods have the below characteristics:

  1. There is no model (the agent does not know the MDP state transitions).
  2. The agent learns from sampled experience (similar to MC).
  3. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap, like DP).
  4. They can learn from incomplete episodes; thus, these methods can be used for continuing problems as well.
  5. TD updates a guess towards a guess and revises the guess based on real experience.

Refer to the below diagram, where the state value is updated after every step from the next state’s value (the learning rate α is a factor that decides how much of the difference between the next-state estimate and the current-state estimate is used to update the current state). This difference is called the TD error.
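
A minimal sketch of the TD(0) update (my own illustration; `V` is assumed to be a dictionary of state-value estimates):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Move V[s] a fraction alpha toward the bootstrapped target r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]   # the TD error mentioned above
    V[s] += alpha * td_error
    return td_error

V = {"A": 0.0, "B": 0.5}
td0_update(V, "A", r=1.0, s_next="B")   # V["A"] moves toward 1.0 + 0.99 * 0.5
```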

Pixabay images used for this illustration picture

SARSA

One of the TD algorithms for control (improvement) is SARSA. The SARSA name comes from the fact that the agent takes one step from one state-action pair to the next state-action pair and collects the reward R along the way (so it is the S(t), A(t), R(t+1), S(t+1) & A(t+1) tuple that creates the term S, A, R, S, A). SARSA is an on-policy method. SARSA uses the action-value function Q and follows the policy π. GPI (Generalized Policy Iteration) is used to take actions based on policy π (ε-greedy to ensure exploration, as well as greedy to improve the policy). Refer to the below diagram, where we can see that the next action can be any of the actions allowed by the policy.
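
A minimal sketch of the SARSA update (my own illustration; `Q` is assumed to be a dictionary of dictionaries, `Q[state][action]`):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the target uses a_next, the action actually chosen by the current (e.g. epsilon-greedy) policy."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```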

Pixabay images used for this illustration picture

Q-Learning

Q-learning is an off-policy algorithm. DQN (Deep Q-Learning), which made a Nature front-page entry, is a Q-learning-based algorithm (with a few additional tricks) that surpassed human-level expertise in Atari games (I will cover DQN in detail in a future post). In Q-learning, the target policy is the greedy policy and the behavior policy is the ε-greedy policy (this ensures exploration).

Pixabay images used for this illustration picture

Refer to the below diagram. The next action used in the update target is chosen by the greedy policy.
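
For comparison with SARSA, a minimal sketch of the Q-learning update (my own illustration, same `Q[state][action]` layout as before):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: the target uses the greedy (max) action in s_next,
    regardless of which action the epsilon-greedy behavior policy takes next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```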

Pixabay images used for this illustration picture

Expected SARSA

Expected SARSA is just like Q-learning, except that instead of the maximum over next state-action pairs it uses the expected value (a probability-weighted average), taking into account how likely each action is under the current policy. Given the next state, the Q-learning target moves deterministically in the greedy direction, while Expected SARSA follows the expectation under the policy, and accordingly it is called Expected SARSA. Refer to the diagram below; the next action’s contribution is based on the expected value.
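
A minimal sketch of the Expected SARSA update (my own illustration; `policy_probs[state][action]` is assumed to give π(action | state)):

```python
def expected_sarsa_update(Q, policy_probs, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """The target averages the next-state action values, weighted by the current policy,
    instead of taking the max (Q-learning) or a single sampled next action (SARSA)."""
    expected_q = sum(policy_probs[s_next][a2] * Q[s_next][a2] for a2 in Q[s_next])
    Q[s][a] += alpha * (r + gamma * expected_q - Q[s][a])
```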

Pixabay images used for this illustration picture

The main difference in these control algorithms based on the SARSA concept is shown in the below diagram.

Pixabay images used for this illustration picture

Next, we will cover a few algorithms that are based on the temporal-difference concept, but where values are not updated immediately; instead, they are updated after a few steps, or based on a weighted average of multi-step returns.

n-Step TD / SARSA

In n-step TD, the value of a state or action is updated after a certain number of time steps. In these cases, the agent collects more rewards along the way before values are updated. The update target is built from all the discounted rewards collected along the way plus the value estimate of the state (or state-action pair) reached after that number of steps.

Pixabay images used for this illustration picture

TD(λ)

TD(λ) is an extension of n-step TD. The intuition is to average all of the possible n-step returns into a single return, weighting each n-step return with a weight that decays exponentially with time. This is done by introducing a factor λ (a value between 0 and 1) and weighting the n-th return by λ to the power (n−1). Since we want all of these weights to sum to one (to have a weighted average), we normalize them (by multiplying by 1−λ).

We can see that when λ = 1, only the final (complete-episode) return is kept, which is essentially the Monte Carlo method, as the state-action process goes all the way to the end; when λ = 0, the method reduces to TD(0); and for 0 < λ < 1, the method becomes a weighted mix of n-step TD methods, i.e., a mix of Monte Carlo and TD.

Pixabay images used for this illustration picture

This can be further improved using eligibility traces, a nifty idea that lets us update every state value by remembering what happened in the past and using the current information to update the state values of every state we have seen so far. Eligibility traces combine two things: how frequently and how recently a state has been visited. In RL, it is useful to propagate what is learned at the current step back to all previously visited states in order to accelerate learning. To achieve this, we need a short-term memory mechanism that stores the states visited in the last steps; this is where the eligibility-trace concept comes into the picture. λ∈[0,1] is a decay parameter, called trace decay (or accumulating trace), which defines the update weight for each visited state. The traces decrease over time, which gives a small weight to states visited infrequently or long ago and a significant weight to recently visited states. For λ=0, we recover the TD(0) case, and only the immediately preceding prediction is updated. TD(1) can be considered an extension of MC methods within a TD framework: in MC methods we need to wait until the end of the episode to update the states, whereas in TD(1) we can update all the previous states online, without waiting for the end of the episode.

Let’s see now what happens to a specific state’s trace during an episode. I will take an episode of seven steps in which five states are visited, with the state S1 visited more than once. Let’s see what happens to its trace.

Refer to the above illustration: in the beginning, the trace is equal to zero. After the first visit to S1 (1st step), the trace goes up to 1.0 and then starts decaying. After the second visit (3rd step), +1 is added to the current value (0.50), giving a trace of 1.50. Three steps later, S1 is revisited, making the trace value 1.75 (note that the exponential decay is shown here as a flat 0.25 reduction per step purely for ease of illustration; the actual values would differ).

How does TD(λ) update the value function? In TD(λ), all previously visited states are eligible for an update, but they are updated in proportion to their eligibility-trace value. States with a small eligibility trace are updated by a small amount, whereas states with a high eligibility trace are updated substantially.
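
A minimal sketch of TD(λ) with accumulating eligibility traces (my own illustration; `env_step(s)` is an assumed helper that returns `(reward, next_state, done)` under the current policy):

```python
def td_lambda_episode(env_step, start_state, V, alpha=0.1, gamma=0.99, lam=0.9):
    """Run one episode, updating every state in proportion to its eligibility trace."""
    traces = {s: 0.0 for s in V}
    s, done = start_state, False
    while not done:
        r, s_next, done = env_step(s)
        td_error = r + gamma * V[s_next] * (not done) - V[s]
        traces[s] += 1.0                                       # the visited state becomes fully "eligible"
        for state in V:
            V[state] += alpha * td_error * traces[state]       # big traces get big updates
            traces[state] *= gamma * lam                       # traces decay with time
        s = s_next
    return V
```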

Deep Q Network (DQN)

This was one of the major success stories of integrating Deep Learning and Reinforcement Learning, and it opened a floodgate of new algorithms and renewed interest in the field of Reinforcement Learning. A deep neural network is used as a function approximator for the Q value. Deep Q-Learning simply applies a neural network to approximate the Q function of each action given a state, which can save vast amounts of memory and computation compared with a table and can potentially generalize to very large or continuous state spaces.

Experience replay: We run the agent and store, say, the last 100 thousand transitions (or video frames) in a buffer, and we sample a mini-batch of, say, 512 samples from this buffer to train the deep network. Because we sample randomly from the replay buffer, the data are more independent of each other and closer to i.i.d. (independent and identically distributed). This makes training stable.

Target network: We create two deep networks, θ- (the target network) and θ (the Q-network). We use the first one to compute the Q-value targets while the second one receives all the updates during training. After, say, 50,000 updates, we synchronize θ- with θ. The purpose is to fix the Q-value targets temporarily, so we do not have a moving target to chase. In addition, parameter changes do not impact θ- immediately, which keeps the training targets stable.

Below are the diagram and steps for DQN.

  1. Preprocess and feed the state (S) to the DQN, which will return the Q-values of all possible actions in that state.
  2. Select an action using the epsilon-greedy policy (behavior policy). With probability epsilon, we select a random action A, and with probability 1-epsilon, we choose the action that has the maximum Q-value, i.e., A = argmax(Q(S, A, θ)).
  3. Perform this action in state S and move to a new state S’ to receive a reward R. This state S’ is the preprocessed image of the next game screen in the case of the Atari simulator. We store this transition in our replay buffer as <S, A, R, S’>.
  4. Next, sample some random batches of transitions from the replay buffer and calculate the loss, which is just the squared difference between the target Q (reward + discounted maximum next Q-value, the greedy target policy) and the predicted Q (a minimal sketch of this loss follows the list).
  5. Perform gradient descent with respect to our actual network parameters (θ) to minimize this loss.
  6. After every N iterations, copy our actual network parameters (θ) to the target network parameters (θ-).
  7. Repeat these steps for M episodes until the parameters θ converge.
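
To make step 4 concrete, here is a minimal PyTorch-style sketch of the DQN loss (my own illustration; `q_net` and `target_net` are assumed to map a batch of states to one Q-value per action, and `actions` is an int64 tensor):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared error between the predicted Q(S, A) and the frozen greedy target."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(S, A) of the taken actions
    with torch.no_grad():                                             # targets come from the target network
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q
    return F.mse_loss(q_sa, target)
```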

Double Q-Learning

Q-learning performs very poorly in some stochastic environments. The poor performance is caused by a large overestimation of action values, due to the use of max Q(S’, A’) in the Q-learning target. The main idea is to reduce overestimation by decomposing the max operation in the target into action selection and action evaluation. In Double Q-learning, we maintain two Q-value functions, Q-A and Q-B (in a Double Deep Q Network, these two roles are played by two different function approximators, the target network and the Q-network), and each one gets its update target from the other. Q-A selects the action with the best action value in the next state, and Q-B evaluates the expected value of that state-action pair using the action selected by Q-A (so the algorithm maintains two networks).
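
A minimal tabular sketch of the Double Q-learning update (my own illustration; `Q_A` and `Q_B` are dictionaries of dictionaries as before):

```python
import random

def double_q_update(Q_A, Q_B, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One table selects the best next action, the other evaluates it; roles are swapped at random."""
    if random.random() < 0.5:
        Q_A, Q_B = Q_B, Q_A                               # swap roles half of the time
    best_a = max(Q_A[s_next], key=Q_A[s_next].get)        # action selection by one table...
    target = r + gamma * Q_B[s_next][best_a]              # ...action evaluation by the other
    Q_A[s][a] += alpha * (target - Q_A[s][a])
```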

Pixabay images used for this illustration picture

Dueling DQN

Q-values can be decomposed into two pieces: the state value function V and the advantage value A. The advantage function captures how much better an action is compared to the average action in a given state (refer to A3C for a visual explanation of the advantage function), while, as we know, the value function captures how good it is to be in that state.

The whole idea behind Dueling Q Networks relies on the representation of the Q function as a sum of the Value and the advantage function.

We simply have two heads in the network to learn each part of the sum, and then we aggregate their outputs. Refer to the below diagram: after the fully connected (FC) layer, we have two heads (one for V and one for all the A’s).

Adding this type of structure allows the network to better differentiate actions from one another, and it significantly improves learning. In many states, the values of the different actions are very similar, and it is less critical which action is taken. This is especially important in environments where there are many actions to choose from.
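
A minimal PyTorch sketch of the dueling head (my own illustration), using the common mean-subtraction trick so that V and A remain identifiable when they are summed back into Q:

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Splits shared features into a state-value head and an advantage head, then recombines them."""
    def __init__(self, feature_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)               # V(s)
        self.advantage = nn.Linear(feature_dim, n_actions)   # A(s, a)

    def forward(self, features):
        v = self.value(features)                             # shape (batch, 1)
        a = self.advantage(features)                         # shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)           # Q(s, a)
```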

Policy-Based Algorithm

Pixabay images used for this illustration picture

In policy-based methods, instead of learning a value function that tells us the expected sum of rewards given a state and an action, we learn directly the policy function that maps states to actions (selecting actions without using a value function). The key idea underlying policy gradients is to push up the probabilities of actions that lead to higher returns and push down the probabilities of actions that lead to lower returns, until we arrive at the optimal policy.

The policy gradient methods target modeling and optimizing the policy directly. The policy π(A|S) is usually modeled with a parameterized function of θ. The value of the reward (objective) function depends on this policy, and then various algorithms can be applied to optimize θ for the best reward. In Policy-based methods, we explicitly build a representation of a policy (mapping π:S→A) and keep it in memory during learning.

It is natural to expect policy-based methods to be more useful in continuous spaces, because there is an infinite number of actions and/or states to estimate values for, and hence value-based approaches are far too expensive computationally in continuous spaces. For example, in generalized policy iteration, the policy improvement step requires a full scan of the action space and thus suffers from the curse of dimensionality.

Pixabay images used for this illustration picture

REINFORCE

REINFORCE (Monte Carlo policy gradient) relies on returns estimated by Monte Carlo methods from episode samples to update the policy parameter θ. REINFORCE works because the expectation of the sample gradient is equal to the actual gradient. In REINFORCE, the agent collects a trajectory τ of one episode using its current policy and uses it to update the policy parameters. Below are the steps of REINFORCE.

  1. Initialize the policy parameter θ at random.
  2. Generate one trajectory on policy π.
  3. Estimate the return for each time step up to the terminal state.
  4. Update the policy parameters θ by one step of gradient ascent (the gradient of the objective can be computed as the expectation of the gradient of the log of the policy times the return; this is the main trick in policy gradients).
  5. Repeat 2–4 until convergence.

REINFORCE trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both the initial conditions and the training procedure. Throughout training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found. This may cause the policy to get trapped in local optima. The below diagram is an explanation of REINFORCE: we want to increase the probability of the actions for which we got a positive return and reduce it for those with a negative return.

One important thing to note: no Markov property is used here, so the method can also be applied to partially observed MDPs without any change to the algorithm.

One major drawback of this method is high variance and slow convergence.
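
A minimal PyTorch-style sketch of the REINFORCE update rule (my own illustration; `log_probs` is assumed to be a list of log π(a_t | s_t) tensors collected along one trajectory, and `returns` the corresponding discounted returns):

```python
import torch

def reinforce_loss(log_probs, returns):
    """Minimizing -log pi(a|s) * G pushes up the probability of actions that led to high returns."""
    log_probs = torch.stack(log_probs)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    return -(log_probs * returns).sum()   # one gradient-ascent step is taken by descending this loss
```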

REINFORCE with Baseline

To reduce the variance, one idea is to subtract a value called a baseline, B(S), from the return G, where B(S) does not depend on the action A. Mathematically, it can be shown that the policy gradient still gives an unbiased result in expectation.

So now, the question is how to choose B(S). One choice for the baseline is an estimate of the state value, V(S, w), where w is a parameter vector learned by some method such as Monte Carlo.

Actor-Critic Algorithm

Pixabay images used for this illustration picture

The two main components in the policy gradient are the policy model and the value function. It makes a lot of sense to learn the value function in addition to the policy, since knowing the value function can assist the policy update, for example by reducing gradient variance in vanilla policy gradients, and that is precisely what the Actor-Critic method does, where the actor is the policy and the critic is the value function.

Actor-critic methods consist of two models, which may or may not share parameters:

  • The critic updates the value function parameters ‘w’; depending on the algorithm, this could be the action-value Q or the state-value V.
  • The actor updates the policy parameter ‘θ’ in the direction suggested by the critic.

As we saw earlier, one way to reduce variance and increase stability for the policy gradient method is to subtract a baseline from the cumulative reward (return). This baseline subtraction is unbiased in expectation. So what we are doing here is adjusting the return by some baseline, which reduces the variance. There are many ways to improve on the REINFORCE algorithm along these lines.
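
A minimal sketch of a one-step actor-critic update (my own illustration; `log_prob` and `value` are assumed to be PyTorch tensors produced by the actor and the critic for the current state):

```python
import torch.nn.functional as F

def actor_critic_losses(log_prob, value, reward, next_value, done, gamma=0.99):
    """The critic's TD error serves as the (baseline-adjusted) advantage for the actor."""
    target = reward + gamma * next_value * (1.0 - done)
    advantage = (target - value).detach()        # do not backpropagate the actor loss into the critic
    actor_loss = -log_prob * advantage           # push up actions with a positive advantage
    critic_loss = F.mse_loss(value, target.detach())
    return actor_loss, critic_loss
```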

A3C

The Asynchronous Advantage Actor-Critic (A3C) algorithm is a classic policy gradient method with a particular focus on parallel training. The algorithm uses multiple agents, each with its own network parameters and its own copy of the environment. These agents interact with their respective environments asynchronously, learning with each interaction. A global network coordinates the agents: as each agent gains more knowledge, it contributes to the total knowledge of the global network, and the global network in turn gives each agent more diversified training experience. This setup mimics the real-life environment in which humans live, where each human gains knowledge from the experiences of other humans, allowing the whole "global network" to become better.

Unlike more straightforward techniques that are based on either value-iteration methods or policy-gradient methods alone, the A3C algorithm combines the best parts of both: it predicts both the value function and the optimal policy function.

Typically, in this kind of policy gradient implementation, the agent also learns the advantage value, i.e., how much better the rewards were than its average (expectation). This gives the agent new-found insight into the environment, and thus the learning process is better. The advantage function is given by the following expression.

Advantage (S,A) = Action-Value (S,A) – State Value (S).

Refer to the below diagram, which explains the advantage function for a state with four actions, with the probability of each action shown (for the sake of explanation this is shown as a discrete, deterministic action space, but the concept applies equally to stochastic continuous action spaces). Here the highest advantage value belongs to action A3, and the actor would follow this action more often in this state, as suggested by the critic (advantage function).

Refer to the below diagram, which shows how the A3C architecture works with workers and global network parameters. Different agents acquire different information from the environment at the same time, and all collaborate to learn a global understanding of the environment and the problem.

A2C

A2C is a "synchronous," deterministic version of A3C; that is why it is named "A2C," with the first "A" (for "asynchronous") removed. In A3C, each agent talks to the global parameters independently, so it is possible that the thread-specific agents are sometimes playing with policies of different versions, and therefore the aggregated update is not optimal. To resolve this inconsistency, a coordinator in A2C waits for all the parallel actors to finish their work before updating the global parameters; then, in the next iteration, the parallel actors start from the same policy. The synchronized gradient update keeps the training more cohesive and can make convergence faster. A2C is also able to utilize GPUs more efficiently and works better with large batch sizes, while achieving the same or better performance than A3C.

Refer to the below diagram. All n agent threads wait until they all have an update to perform; then we average the gradients over these n threads and apply one update to the global network(s). Indeed, this should be more effective due to the larger batch sizes.

TRPO

To improve training stability, we should avoid parameter updates that change the policy too much at one step. Trust region policy optimization (TRPO) carries out this idea by enforcing a KL divergence constraint on the size of policy update at each iteration. TRPO is an on-policy algorithm. TRPO can be used for environments with either discrete or continuous action spaces.

TRPO updates policies by taking the largest step possible to improve performance while satisfying a special constraint on how close the new and old policies are allowed to be. This constraint is expressed in terms of KL-Divergence, a measure of the distance between probability distributions of old and new policies.

This is different from the standard policy gradient, which keeps new and old policies close in parameter space. But even seemingly small differences in parameter space can produce considerable differences in performance, so a single wrong step can collapse the policy's performance. This makes it dangerous to use large step sizes with vanilla policy gradients, which hurts their sample efficiency. TRPO nicely avoids this kind of collapse and tends to improve performance quickly and monotonically. Refer to the below diagram, a cartoon illustration of how TRPO uses the trust region (circular space) to restrict the next policy so that the new policy does not drift too far away and fail to reach the optimal policy. The red, yellow, and orange policies fall outside the trust region (set by the KL-divergence) and thus did not manage to recover, or settled in local optima.

Pixabay images used for this illustration picture

TRPO aims to maximize the objective function subject to a trust-region constraint, which enforces that the distance between the old and new policies, measured by KL-divergence, stays small enough, within a parameter δ. In this way, the old and new policies do not diverge too much as long as this hard constraint is met. Refer to the below diagram to understand how the constraint works for the two policies.

PPO

TRPO is quite complicated to implement. Proximal policy optimization (PPO) simplifies it by using a clipped surrogate objective while retaining similar constraints and similar performance. PPO gets rid of the computation created by constrained optimization as it proposes a clipped surrogate objective function. Instead of imposing a hard constraint, it formalizes the constraint as a penalty in the objective function.

With PPO, we simplify the problem by turning the KL divergence from a constraint into a penalty term, similar, for example, to an L1 or L2 weight penalty (used to prevent weights from growing to large values).

PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient throughout training so that it’s scaled appropriately.

PPO-Clip doesn’t have a KL-divergence term in the objective and doesn’t have a constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to move far from the old policy. PPO hard-clips the policy ratio (the ratio of the updated policy to the old policy) to stay within a small range around 1.0, where 1.0 means the new policy is the same as the old one.
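
A minimal PyTorch-style sketch of the PPO-Clip surrogate loss (my own illustration; all inputs are assumed to be tensors of the same batch shape):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective: the probability ratio is kept inside [1 - eps, 1 + eps]."""
    ratio = torch.exp(new_log_probs - old_log_probs)               # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # maximize the surrogate = minimize its negative
```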

DDPG

DDPG (Deep Deterministic Policy Gradient) is a model-free, off-policy actor-critic algorithm that learns a Q-function (action value) and a policy concurrently. It uses off-policy data and the Bellman equation to learn the Q-function, and it uses the Q-function to learn the policy. DDPG can be thought of as deep Q-learning for continuous action spaces.

  • DDPG is an off-policy algorithm (use replay buffer)
  • DDPG can only be used for environments with continuous action spaces

When there are a finite number of discrete actions, the max poses no problem because we can just compute the Q-values for each action separately and directly compare them, which gives us the action that maximizes the Q-value. But when the action space is continuous, we cannot exhaustively evaluate it (we would have to evaluate an infinite number of actions), and since this maximization would need to be run every time the agent wants to take an action in the environment, it is not feasible.

In DDPG, we use a replay buffer to compute targets, followed by an update of the Q-function (gradient descent) and then an update of the policy (gradient ascent). We want to learn a deterministic policy that outputs the action maximizing the Q-function. Since the action space is continuous, and we assume the Q-function is differentiable with respect to the action, we perform gradient ascent on the Q-value with respect to the policy parameters only. Note that the Q-function parameters are treated as constants here when we do gradient ascent to update the policy parameters.

TD3: Twin-Delayed Deep Deterministic Policy Gradient

The twin-delayed deep deterministic policy gradient (TD3) algorithm is a model-free, online, off-policy actor-critic algorithm. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking because it exploits the errors in the Q-function. The TD3 algorithm is an extension of the DDPG algorithm.

TD3 makes the following improvements over DDPG.

  1. A TD3 agent learns two Q-value ("Twin") functions and uses the minimum value function estimate during policy updates.
  2. TD3 updates the policy (and the target networks) less frequently ("delayed") than the Q-functions (one policy update for every two Q-function updates; that is how the "twin delayed" name came about).
  3. When updating the policy, a TD3 agent adds noise to the target action, which makes the policy less likely to exploit actions with high Q-value estimates.

The above diagram shows the steps of the algorithm. I will try to describe the steps here in a simplified manner.

  1. The first step is to initialize the function-approximation parameters for the policy and the Q-functions (two sets of Q parameters), and to set the target parameters equal to the main parameters (one set for the policy and two for the Q-functions).
  2. Repeat for many episodes, collecting the state, action (with noise added and clipping applied), reward, next state, and terminal-state indicator to build the replay buffer.
  3. Randomly sample a batch of transitions from the replay buffer and compute the target action A’(S’) (using the Actor Target network, with clipping and noise).
  4. Compute the target Q-value Q’1 with the Critic Target1 network.
  5. Compute the target Q-value Q’2 with the Critic Target2 network.
  6. Take the minimum of Q’1 and Q’2.
  7. Calculate the loss for the first Q-function and update Critic Model1 by one step of gradient descent (using the Q-value from step 6 in the target).
  8. Calculate the loss for the second Q-function and update Critic Model2 by one step of gradient descent (using the Q-value from step 6 in the target).
  9. Take the total loss from steps 7 & 8 and update the parameters of Critic Model1 and Critic Model2.
  10. Once every two iterations, update the policy parameters of the Actor Model by one step of gradient ascent, using the Q-value output from Critic Model1 (step 7).
  11. Once every two iterations, update the parameters of the target networks (Actor Target, Critic Target1 & Critic Target2) by Polyak averaging.

We continue the above steps until the parameters converge. A minimal sketch of the target computation (steps 3–6) follows.
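
This sketch is my own illustration, where `target_actor`, `target_q1`, and `target_q2` are assumed network callables and actions are assumed to lie in [-act_limit, act_limit]:

```python
import torch

def td3_targets(target_actor, target_q1, target_q2, rewards, next_states, dones,
                gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Smooth the target action with clipped noise, then take the minimum of the twin target critics."""
    with torch.no_grad():
        next_actions = target_actor(next_states)
        noise = (torch.randn_like(next_actions) * noise_std).clamp(-noise_clip, noise_clip)
        next_actions = (next_actions + noise).clamp(-act_limit, act_limit)   # target-policy smoothing
        target_q = torch.min(target_q1(next_states, next_actions),
                             target_q2(next_states, next_actions))           # the "twin" (clipped double-Q) trick
        return rewards + gamma * (1.0 - dones) * target_q
```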

SAC (Soft Actor-Critic)

Soft Actor-Critic (SAC) is based on maximum-entropy Reinforcement Learning, a framework that aims both to maximize the expected reward (which is the standard RL objective) and to maximize the policy’s entropy (policies with higher entropy are more random). Intuitively, this means that maximum-entropy reinforcement learning prefers the most random policy that still achieves a high reward.

Soft Actor-Critic (SAC) incorporates the clipped double-Q trick (like TD3), and a central feature of SAC is entropy regularization. The policy is trained to maximize a trade-off between expected return and entropy, a measure of the randomness of the policy. This has a close connection to the exploration-exploitation trade-off: increasing entropy results in more exploration, which can accelerate learning later on. It can also prevent the policy from prematurely converging to a bad local optimum.

  • SAC is an off-policy algorithm.
  • SAC can work in continuous or discrete action spaces.

IMPALA

IMPALA (Importance Weighted Actor-Learner Architecture) is inspired by the popular A3C architecture which uses multiple distributed actors to learn the agent’s parameters. Each of the actors uses a clone of the policy parameters to act in the environment.

IMPALA uses an actor-critic setup to learn a policy π and a baseline function V. The architecture consists of a set of actors that repeatedly generate trajectories of experience, and one or more learners that use the experience sent from the actors to learn the policy off-policy.

In A3C-style architectures, actors periodically pause their exploration to share the gradients they have computed with a central parameter server, which applies the updates. In IMPALA, by contrast, the process of generating experience is decoupled from learning the parameters of the policy and the baseline function V. Multiple actors generate experience in parallel, while the learner optimizes both policy and value-function parameters using all the generated experience. Actors periodically update their parameters with the latest policy from the learner. Because acting and learning are decoupled, we can add many more actor machines to generate a lot more trajectories per time unit. As the training policy and the behavior policy are not fully synchronized, there is a gap between them, and thus we need an off-policy correction known as V-trace.

Image Ref: Deep Mind

In IMPALA, the central learner computes the gradients, resulting in an architecture with completely independent actors and learners.

Image from Google AI Blog

To take advantage of the scale of modern computing systems, Importance Weighted Actor-Learner Architectures can be implemented using a single learner machine or multiple learners performing synchronous updates between themselves. Separating the learning and acting in this way also has the advantage of increasing the throughput of the whole system since the actors no longer need to wait for the learning step like in architectures such as batched A2C (as explained in below diagram).

Image Ref: Deep Mind

However, decoupling the acting and learning causes the policy in the actor to lag behind the learner. In order to compensate for this difference, IMPALA uses an off-policy advantage actor-critic formulation called V-trace which compensates for the trajectories obtained by actors being off policy.

Final Note

Reinforcement learning is complicated, but that does not mean it needs to be esoteric. This field is evolving rapidly; new ideas and papers are coming out in rapid succession. I have tried to keep the mathematical expressions to a minimum, although they are essential for understanding the more complex algorithms. Interested readers can dive into the reference list below to enjoy the mathematical details and further explanation.

I look forward to seeing your feedback, and to you sharing any simpler explanation or any other algorithm that you would like to see covered as part of this article.

Thanks for reading. You can connect with me on LinkedIn.



Reference:

  • https://spinningup.openai.com/en/latest/
  • https://github.com/dennybritz/reinforcement-learning
  • https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html

