
The last article in this Reinforcement Learning series on the cliff walking problem focused on the vanilla policy gradient algorithm (REINFORCE). Although the deep learning variant – using a neural network representation – bears strong similarities to it, there are some important nuances that require attention. This article shows a full Python implementation using TensorFlow 2.0.
The Cliff Walking Problem
The cliff walking problem entails an agent (starting in the bottom left corner) that needs to reach the bottom right corner to earn a reward and win the game[1]. Each step yields a small cost to encourage following the shortest path, yet the cliff at the bottom should be avoided.
![Representation of cliff walking world [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1M79MspX4MckRX3wqKCgAxA.png)
Deep discrete policy gradient
In policy gradient solutions to this problem, the stochastic policy π_θ assigns a probability to each of the four actions (left, right, up, down) that can be taken on any tile. Formally, the policy (using a softmax function) can be denoted as follows:

π_θ(a|s) = exp(f(ϕ(s,a);θ)) / ∑_{a′∈A} exp(f(ϕ(s,a′);θ))
The keen observer might note a slight variation compared to the vanilla implementation – instead of the vector dot product ϕ(s,a)^⊤θ = ∑_i ϕ(s,a)_i ⋅ θ_i encountered in textbooks, we now adopt a generic parameterized function f(ϕ(s,a);θ).
The feature vector ϕ(s,a) used in the previous article was directly based on the state (four θ parameters per state, one for each action). As such, we effectively trained the model on individual states. With a neural network representation, such a separation is not possible. The weights connecting the input layer to the first hidden layer can be trained individually, yet all remaining weights form a compact representation shared by all states.
In effect, this is both the strength and weakness of the actor network. For large state spaces we simply cannot observe every single state – we need a general representation, which is precisely what the network offers. However, designing a single representation that captures all states is a major challenge.
Concretely, we replace ϕ(s,a)^⊤θ by a generic parameterized function f(ϕ(s,a);θ), with f being the neural network and θ the network weights. Other than that, it’s the same beast as before.
There are two main variants for the network architecture. We may input a feature vector based on both state and action and output a single value (the post-decision approach, which requires running the same model four times), or input a feature vector based solely on the state and output values for all four actions. We implement the latter here.
![Example of discrete actor network. The network takes the state as input (a 48-dimensional one-hot vector for this problem) and outputs probabilities for each action (left, right, up, down). The weights in the first layer are unique for each tile, but the hidden layers are a general representation for all states. [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1g5-e0SWHWWTBPcmKkecH4A.png)
A convenience of working with TensorFlow (or any other deep learning library, for that matter) is that training is largely automated; we simply need to feed the network a loss function and training happens almost on its own. For discrete policy gradients, we adopt the log loss (or cross-entropy loss) function:

L(θ) = −log(π_θ(a|s)) ⋅ R, where R is the (cumulative) reward used to weight the observed state-action pair.
TensorFlow implementation
First things first: we need an actor network. The snippet below constructs it. The softmax activation in the final layer ensures that the output is a vector of probabilities summing to one, whereas the weight initialization ensures that the initial action probabilities are equal.
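As an illustration, a minimal sketch of such an actor network could look as follows; the hidden-layer sizes are assumptions, and the zero initializer on the output layer is one way to obtain equal logits (and hence uniform initial action probabilities):

```python
import tensorflow as tf
from tensorflow.keras import layers

def construct_actor_network(state_dim: int = 48, num_actions: int = 4) -> tf.keras.Model:
    """Sketch of a discrete actor network: one-hot state in, action probabilities out."""
    inputs = layers.Input(shape=(state_dim,))              # one-hot encoded tile
    hidden = layers.Dense(25, activation="relu")(inputs)   # hidden-layer sizes are assumptions
    hidden = layers.Dense(25, activation="relu")(hidden)
    # Zero-initialized output weights give equal logits, i.e. a 0.25 probability per action
    action_probs = layers.Dense(num_actions, activation="softmax",
                                kernel_initializer=tf.keras.initializers.Zeros())(hidden)
    return tf.keras.Model(inputs=inputs, outputs=action_probs)
```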
Next, we need the loss function. We use the cross-entropy loss function in this case:
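A sketch of this weighted log loss is shown below; the small offset added inside the logarithm (for numerical stability) is an assumption rather than part of the original code:

```python
import tensorflow as tf

def cross_entropy_loss(action_probs: tf.Tensor, action: int, reward: float) -> tf.Tensor:
    """Weighted log loss: -log pi_theta(a|s), multiplied by the reward signal."""
    log_prob = tf.math.log(action_probs[0, action] + 1e-8)  # probability of the chosen action
    return -log_prob * reward
```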
We don’t have to bother with manually computing gradients; all that is needed is the loss function. The GradientTape functionality keeps track of our mathematical operations and uses that memory to compute the gradients for us. The update procedure – performed after each episode – is outlined below. Note that the gradients are computed at once for all losses in the trajectory; intermediate updates would alter the actor network, and thereby the computed probabilities needed for the gradients (this approach is certainly not the only solution, though).
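A sketch of such an episode-level update could look as follows, reusing the `cross_entropy_loss` sketch from above; the optimizer choice and learning rate are assumptions:

```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # optimizer settings are assumptions

def update_policy(actor_network, states, actions, rewards):
    """Accumulate the losses over the full trajectory, then apply a single gradient step."""
    with tf.GradientTape() as tape:
        # Sum the weighted log losses for every (state, action, reward) in the episode,
        # using the cross_entropy_loss sketch defined earlier
        episode_loss = tf.add_n([
            cross_entropy_loss(actor_network(tf.expand_dims(state, axis=0)), action, reward)
            for state, action, reward in zip(states, actions, rewards)
        ])
    # Gradients are computed at once for the entire episode and applied in one update
    grads = tape.gradient(episode_loss, actor_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, actor_network.trainable_variables))
```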
The full implementation can be found on my GitHub:
GitHub – woutervanheeswijk/cliff_walking_public: Cliff walking reinforcement learning example, with…
Didn’t quite get it yet? No problem: this minimal working example walks you through the discrete policy gradient algorithm step by step:
A Minimal Working Example for Discrete Policy Gradients in TensorFlow 2.0
Experiments
As usual, the implementation would not be complete without testing it. Let’s give it a try, comparing it to the vanilla policy gradient (the basic REINFORCE implementation), SARSA (the value-based on-policy comparison) and Deep Q-learning (the value-based neural network counterpart).
Vanilla policy gradient struggled quite a bit to attain a stable solution, suffering from high variance and adverse effects of exploration. Previously, Deep Q-learning also struggled much more than vanilla Q-learning. The deep policy gradient seemingly combines the worst of both worlds – how does it fare? Place your bets.

It did not go well. In fact, this is the first time I decided to tailor the reward structure, raising the relative reward of hitting the goal. Initially, probabilities for each move are 0.25; as such, the odds of reaching the target are fairly low. The actor network also generalizes over states, e.g., if moving "right" at the start is bad, that reduced probability tends to carry over to other tiles.
Effectively, if the target is not found during the first thousand iterations or so, the chances of finding a satisfactory policy are fairly minimal. Adding an entropy bonus (sketched below) didn’t make a dent either. With high learning rates, the algorithm might converge to a local optimum (e.g., directly jump into the cliff or stay in a corner) without ever observing the target. With low learning rates, occasionally reaching the target simply does not provide a strong enough reward signal for repetition.
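For reference, an entropy bonus can be folded into the loss along these lines (a sketch only; the weight β is an assumption):

```python
import tensorflow as tf

def loss_with_entropy_bonus(action_probs: tf.Tensor, action: int, reward: float,
                            beta: float = 0.01) -> tf.Tensor:
    """Weighted log loss minus an entropy bonus that rewards keeping the policy spread out."""
    log_prob = tf.math.log(action_probs[0, action] + 1e-8)
    entropy = -tf.reduce_sum(action_probs * tf.math.log(action_probs + 1e-8))
    return -log_prob * reward - beta * entropy
```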
So, I cheated a bit and increased the reward by a factor of 10, considerably improving reliability. Let’s call it reward shaping. With that out of the way, it is time to compare performances, emphasizing that individual runs display large fluctuations.
![Comparison between vanilla policy gradient and deep policy gradient. Both struggle with high variance, but typically converge to comparable policies in the end. The deep variant seems less stable though. [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1q-3_91iolDq-C7r7lumd3Q.png)
![Comparison between vanilla SARSA and Deep policy gradient. SARSA converges much quicker and reaches a more stable policy. [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1G-oiT-zeYPQOjIZjwDqvLA.png)
![Comparison between Deep Q-learning and Deep Policy Gradient. Although both struggle to learn, Deep Q-learning often converges earlier and to a better policy. [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1fOCSQ5KlsCfVecEWKUX-AA.png)
Although SARSA is not known for its strong exploration mechanism – in this case a simple ϵ-greedy approach, with ϵ=0.05 – the initialization of Q-values has a major impact on this problem. With all Q-values set at 0, there is a strong inclination to try new states; the agent likely discovers the target fairly quickly (for Deep Q-learning, this logic does not hold because Q-values are generalized rather than obtained from a lookup table). The policy gradient algorithm – despite having a very explicit exploration mechanism – actually has a fairly small chance to find the target, and it often converges to a local optimum before ever hitting the target.
![Sample path obtained with deep policy gradient. Due to the inherent randomness in the policy, the agent frequently takes detours or revisits tiles. [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1K2mHFHARqwvqqWxJY92qWw.png)
In the end, it all worked out though. The journey is not always as smooth as hoped beforehand, but we safely avoided the cliff once more.
Takeaways
- The crucial difference between the vanilla policy gradient and the deep variant is that the deep variant generalizes action probabilities across states. The linear expression ϕ(s,a)^⊤θ is replaced by a neural network f(ϕ(s,a);θ).
- It is challenging to learn policies, because (i) the probabilistic policy makes it hard to discover the target and (ii) probabilities are often generalized across states. For this reason, the target reward was increased.
- Due to the high variance of policy gradient algorithms, reward shaping might be helpful. Good reward trajectories are rare in this problem; the update signal needs to be sufficiently strong to have an impact.
_The full code of the deep policy gradient algorithm can be found on my GitHub repository._
The implementations of tabular Q-learning and SARSA for the cliff walking problem are shown in this article:
Walking Off The Cliff With Off-Policy Reinforcement Learning
The Deep Q-learning algorithm has been implemented as well:
Finally, the discrete policy gradient variant is shown in this article:
Cliff-Walking Problem With The Discrete Policy Gradient Algorithm
References
[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.