Mixing policy gradient and Q-learning
Policy gradient algorithms form a big family of reinforcement learning algorithms, including REINFORCE, A2C/A3C, PPO and others. Q-learning is another family, with many significant improvements over the past few years: target networks, double DQN, experience replay and prioritized sampling, among others.
I’ve always wondered whether it was possible to take the best of the two and build a better learning algorithm. That wish came true when I discovered Mean Actor-Critic [1].
Quick background
In reinforcement learning, there are rewards, and the goal is to build an agent maximizing its overall reward over an episode. This quantity, the return, is:

$$G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}$$

starting from a time t, where r is the reward and γ is a discount factor. The infinite sum runs to the end of the episode, so if the episode is finite almost surely, we can take γ = 1.
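As a quick sanity check, the discounted return can be computed by folding the reward sequence backwards (the function name and list representation here are my own, for illustration):

```python
def discounted_return(rewards, gamma=0.99):
    """Return G_0 = sum over k of gamma^k * r_k for one episode.

    Iterating backwards uses the recursion G_t = r_t + gamma * G_{t+1},
    which avoids recomputing powers of gamma.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With γ = 1 this is just the sum of rewards, matching the finite-episode case mentioned above.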
In reality, we act according to a policy π, a probability distribution over actions for each state s. This implies that we want to maximize the expected value of G starting from t, and we call this number the value:

$$V^\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid s_t = s \,\right]$$
We can also define Q, the value conditioned on an action: that is, the expected overall reward starting from state s and taking action a:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\, G_t \mid s_t = s, a_t = a \,\right]$$
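The two quantities are linked by a standard identity: the value of a state is the policy-weighted average of its action values, V^π(s) = Σ_a π(a|s) Q^π(s, a). A minimal sketch for a discrete action space (the function name and list representation are my own):

```python
def state_value(pi_s, q_s):
    """V(s) as the policy-weighted average of Q(s, a).

    pi_s: list of action probabilities pi(a|s), summing to 1.
    q_s:  list of Q(s, a) values, one per action.
    """
    return sum(p * q for p, q in zip(pi_s, q_s))
```

This averaging over all actions, rather than only the sampled one, is the core idea that Mean Actor-Critic builds on.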