Mixing policy gradient and Q-learning
Policy gradient algorithms form a big family of reinforcement learning algorithms, including REINFORCE, A2C/A3C, PPO and others. Q-learning is another family, with many significant improvements over the past few years: target networks, double DQN, experience replay and prioritized sampling, among others.
I’ve always wondered whether it was possible to take the best of the two and build a better learning algorithm. That wish came true when I discovered Mean Actor-Critic [1].
Quick background
In reinforcement learning, there are rewards, and the goal is to build an agent maximizing its overall reward over an episode. This quantity, the return, is:

$$G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}$$

starting from a time t, where r is the reward and γ is a discount factor. The infinite sum runs to the end of the episode, so if the episode is finite almost surely, we can take γ = 1.
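As a quick sanity check, the discounted return can be computed by folding the reward sequence backwards (the function name and list representation here are my own, for illustration):

```python
def discounted_return(rewards, gamma=0.99):
    """Return G_0 = sum over k of gamma^k * r_k for one episode.

    Iterating backwards uses the recursion G_t = r_t + gamma * G_{t+1},
    which avoids recomputing powers of gamma.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With γ = 1 this is just the sum of rewards, matching the finite-episode case mentioned above.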
In reality, we act according to a policy π, a probability distribution over actions for each state s. This implies that we want to maximize the expected value of G starting from t, and we call this number the value:

$$V^\pi(s) = \mathbb{E}_\pi\left[\, G_t \mid s_t = s \,\right]$$
We can also define Q, the value conditioned on an action: that is, the expected overall reward starting from state s and taking action a:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\, G_t \mid s_t = s, a_t = a \,\right]$$
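The two quantities are linked by a standard identity: the value of a state is the policy-weighted average of its action values, V^π(s) = Σ_a π(a|s) Q^π(s, a). A minimal sketch for a discrete action space (the function name and list representation are my own):

```python
def state_value(pi_s, q_s):
    """V(s) as the policy-weighted average of Q(s, a).

    pi_s: list of action probabilities pi(a|s), summing to 1.
    q_s:  list of Q(s, a) values, one per action.
    """
    return sum(p * q for p, q in zip(pi_s, q_s))
```

This averaging over all actions, rather than only the sampled one, is the core idea that Mean Actor-Critic builds on.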