Update 1: The best way of learning and practicing Reinforcement Learning is by going to http://rl-lab.com
Update 2: If you are new to the subject, it might be easier for you to start with the Reinforcement Learning Policy for Developers article.
Overview
In a previous article we learned about Policy-Based learning and explored the advantages and disadvantages of this family of methods. In this article we will delve into the mathematical details in order to derive the correct equation for the Policy Gradient.
But before we start, let's have a quick reminder of why policy learning is useful:
- Policy learning is effective in high-dimensional or continuous action spaces. Large state and action spaces consume huge amounts of memory and computing power. On the other hand, learning the policy directly is far less demanding, since we only learn a manageable set of parameters that define the policy.
- It can learn stochastic policies. Stochastic policies are often better than deterministic ones, especially in two-player games, since any deterministic behavior can easily be met with a countermeasure.
The Math
We need to constantly remind ourselves of our objective, because it is very easy to lose sight of it. Keep telling yourself that in every Reinforcement Learning problem the aim is to maximize the rewards.
Starting from this fundamental principle, we define J(θ) as the expected rewards R(s, a) collected at every time step from zero to the end of the episode T, while following a policy π(θ). If we define the episode as a trajectory τ going from t=0 to T, then the expected reward is the sum, over all possible trajectories, of the probability that τ is selected under π times the return of that trajectory, R(τ). This leads us to the following equation:
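In LaTeX notation (one way to write what is described above), with θ denoting the policy parameters and τ a trajectory:

J(\theta) \;=\; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} R(s_t, a_t)\Big] \;=\; \sum_{\tau} P(\tau;\theta)\, R(\tau)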

Again, our aim is to find the set of parameters θ that maximizes J(θ) (which is the expected return).
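In other words, we are looking for:

\theta^{*} \;=\; \arg\max_{\theta} J(\theta) \;=\; \arg\max_{\theta} \sum_{\tau} P(\tau;\theta)\, R(\tau)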

From calculus we know that in order to find the extremum (maximum or minimum) of a function we compute its derivative, which we do in the following set of equations:
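One way to lay out these steps (the multiply-and-divide trick is explained just below) is:

\nabla_{\theta} J(\theta)
\;=\; \nabla_{\theta} \sum_{\tau} P(\tau;\theta)\, R(\tau)
\;=\; \sum_{\tau} \nabla_{\theta} P(\tau;\theta)\, R(\tau)
\;=\; \sum_{\tau} P(\tau;\theta)\, \frac{\nabla_{\theta} P(\tau;\theta)}{P(\tau;\theta)}\, R(\tau)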

Notice the trick above, in which we multiply and divide by P(τ; θ) (this is legal since P(τ; θ)/P(τ; θ) = 1). The aim of this trick is to obtain a term of the form ∇P(τ; θ)/P(τ; θ), because the derivative of log f(x) is f′(x)/f(x) (equivalently, ∫ df/f = log f(x)). From this we find that ∇log P(τ; θ) = ∇P(τ; θ)/P(τ; θ), and magically we get the following equation.
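That is:

\nabla_{\theta} J(\theta) \;=\; \sum_{\tau} P(\tau;\theta)\, \nabla_{\theta} \log P(\tau;\theta)\, R(\tau)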

However, since we can't practically compute every possible trajectory τ, we fall back on sampling a reduced number (m) of trajectories. Here too, we can see that P(τ; θ) (the one before the log) magically vanishes. The reason is that once we sample the (m) trajectories according to P(τ; θ), the weighting by that probability is already accounted for by the sampling itself, so the weighted sum over all trajectories becomes a simple average over the sampled ones.
We end up with this equation:
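With τ_i denoting the i-th sampled trajectory:

\nabla_{\theta} J(\theta) \;\approx\; \frac{1}{m} \sum_{i=1}^{m} \nabla_{\theta} \log P(\tau_i;\theta)\, R(\tau_i)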

Notice that we moved from an expected value (over all possible trajectories) to an average over the (m) sampled trajectories.
Now let's focus on ∇log P(τ; θ) alone. The probability that a trajectory is followed is equal to the product of the probabilities that each step of this trajectory is reached when following a certain policy π(θ). The following line shows how P(τ; θ) decomposes into a product of terms of the form P(s(t+1) | st, at) · π(at | st, θ).
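Written out:

P(\tau;\theta) \;=\; \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\; \pi(a_t \mid s_t, \theta)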

Intuitively, we say that we followed a trajectory when we have passed through all of its steps. So the probability of following it to the end is equal to the product of the probabilities of passing through every step of this trajectory.
We know that the log of a product is equal to the sum of the logs:
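Applied to P(τ; θ), and then taking the gradient:

\log P(\tau;\theta)
\;=\; \sum_{t=0}^{T} \log\big[ P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t, \theta) \big]
\;=\; \sum_{t=0}^{T} \big[ \log P(s_{t+1} \mid s_t, a_t) + \log \pi(a_t \mid s_t, \theta) \big]
\;=\; \sum_{t=0}^{T} \log P(s_{t+1} \mid s_t, a_t) \;+\; \sum_{t=0}^{T} \log \pi(a_t \mid s_t, \theta)

\nabla_{\theta} \log P(\tau;\theta) \;=\; \nabla_{\theta} \sum_{t=0}^{T} \log \pi(a_t \mid s_t, \theta)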

The set of equations above shows that we decomposed Σ log(P·π) even further, first into Σ(log P + log π) and then into Σ log P + Σ log π. When we apply the derivative with respect to θ to these two terms, ∇Σ log P disappears, because the transition probability P does not depend on θ.
Now that we have found that ∇log P(τ; θ) = ∇ Σ log π(a|s, θ), we can put it back into the initial equation and retrieve the final result, shown below:
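In full, the Policy Gradient estimate over m sampled trajectories is:

\nabla_{\theta} J(\theta) \;\approx\; \frac{1}{m} \sum_{i=1}^{m} \Big( \sum_{t=0}^{T} \nabla_{\theta} \log \pi\big(a_t^{(i)} \mid s_t^{(i)}, \theta\big) \Big)\, R(\tau_i)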

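To make the final estimator concrete, here is a minimal Python/NumPy sketch of it. It is not from the article: the linear-softmax policy and the toy random environment are illustrative assumptions, chosen only so that the code runs end to end.

import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, horizon, m = 4, 3, 10, 16
theta = np.zeros((n_actions, n_features))  # policy parameters

def policy_probs(theta, s):
    # pi(.|s, theta): softmax over the linear scores theta @ s
    logits = theta @ s
    logits = logits - logits.max()  # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    # For a linear-softmax policy: grad_theta log pi(a|s, theta) = (one_hot(a) - pi(.|s)) outer s
    probs = policy_probs(theta, s)
    one_hot = np.zeros(n_actions)
    one_hot[a] = 1.0
    return np.outer(one_hot - probs, s)

def sample_trajectory(theta):
    # Toy stand-in for an environment: random state features and random rewards.
    score = np.zeros_like(theta)  # accumulates sum_t grad log pi(a_t|s_t, theta)
    ret = 0.0                     # accumulates R(tau), the return of the trajectory
    for _ in range(horizon):
        s = rng.normal(size=n_features)
        a = rng.choice(n_actions, p=policy_probs(theta, s))
        ret += rng.normal()
        score += grad_log_pi(theta, s, a)
    return score, ret

# grad J(theta) ~ (1/m) * sum_i [ sum_t grad log pi(a_t|s_t, theta) ] * R(tau_i)
grad_J = np.zeros_like(theta)
for _ in range(m):
    score, ret = sample_trajectory(theta)
    grad_J += score * ret
grad_J /= m

theta = theta + 0.01 * grad_J  # one gradient-ascent step on J(theta)
print(grad_J)

In a real setting the toy environment would be replaced by an actual MDP, and the gradient step would be repeated over many batches of sampled trajectories.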
Conclusion
As we have seen, deriving the Policy Gradient is an elegant and clever procedure. It is essential to fully grasp the essence of this derivation, since it is the basis of many well-known policy-based methods, such as Actor-Critic.