
A Minimal Working Example for Discrete Policy Gradients in TensorFlow 2.0

A multi-armed bandit example for training discrete actor networks. With the aid of the GradientTape functionality, the actor network can…

Photo by Hello I’m Nik via Unsplash

Training discrete actor networks with TensorFlow 2.0 is easy once you know how to do it, but also rather different from implementations in TensorFlow 1.0. As the 2.0 version was only released in September 2019, most examples circulating on the web are still designed for TensorFlow 1.0. In a related article, in which we also discuss the mathematics in more detail, we already treated the continuous case. Here, we use a simple multi-armed bandit problem to show how we can implement and update an actor network in the discrete setting [1].


A bit of mathematics

We use the classical policy gradient algorithm REINFORCE, in which the actor is represented by a neural network known as the actor network. In the discrete case, the network output is simply the probability of selecting each of the actions. So, if the set of actions is defined by A and an action by a ∈ A, then the network outputs are the probabilities p(a), ∀a ∈ A. The input layer contains the state s or a feature array ϕ(s), followed by one or more hidden layers that transform the input, with the output being the probabilities for each action that might be selected.

The policy π is parameterized by θ, which in deep reinforcement learning represents the neural network weights. After each action we take, we observe a reward v. Computing the gradients for θ and using learning rate α, the update rule typically encountered in textbooks looks as follows [2,3]:
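In the notation above, with v the observed reward and ∇_θ the gradient with respect to the network weights, a standard way to write this REINFORCE update is:

$$\theta \leftarrow \theta + \alpha \, v \, \nabla_{\theta} \log \pi_{\theta}(a \mid s)$$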

When applying backpropagation updates to neural networks we must slightly modify this update rule, but the procedure follows the same lines. Although we might update the network weights manually, we typically prefer to let TensorFlow (or whatever library you use) handle the update. We only need to provide a loss function; the computer handles the calculation of gradients and other fancy tricks such as customized learning rates. In fact, the sole thing we have to do is add a minus sign, as we perform gradient descent rather than ascent. Thus, the loss function, which is known as the log loss or cross-entropy loss function [4], looks like this:
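For a single observed action a with reward v, this pseudo-loss can be written as:

$$\mathcal{L}(\theta) = -v \, \log \pi_{\theta}(a \mid s)$$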

TensorFlow 2.0 implementation

Now let’s move on to the actual implementation. If you have some experience with TensorFlow, you likely first compile your network with model.compile and then perform model.fit or model.train_on_batch to fit the network to your data. As TensorFlow 2.0 requires a loss function to have exactly two arguments (y_true and y_pred), we cannot use these methods here, since we need the action, state and reward as input arguments. The GradientTape functionality, which did not exist in TensorFlow 1.0 [5], conveniently solves this problem. By storing a forward pass through the actor network on a ‘tape’, it is able to perform automatic differentiation in a backward pass later on.

We start by defining our cross entropy loss function:
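A minimal sketch of such a loss function is given below; the names cross_entropy_loss, action_probs, action and reward are illustrative choices for this sketch rather than fixed by the paper:

```python
import tensorflow as tf

def cross_entropy_loss(action_probs: tf.Tensor, action: int, reward: float) -> tf.Tensor:
    """Pseudo-loss: negative log probability of the chosen action, weighted by the reward."""
    log_prob = tf.math.log(action_probs[0, action])  # probability of the action actually played
    return -reward * log_prob                        # minus sign: we minimize rather than maximize
```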

In the next step, we use the trainable_variables attribute to retrieve the network weights. Subsequently, tape.gradient calculates all the gradients for you; you simply plug in the loss value and the trainable variables. With optimizer.apply_gradients we update the network weights using a selected optimizer. As mentioned earlier, it is crucial that the forward pass (in which we obtain the action probabilities from the network) is included in the GradientTape. The code to update the weights is as follows:
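A sketch of this update step, reusing the loss function defined above, might look as follows (the name train_step and the single-sample setup are assumptions of this sketch):

```python
import tensorflow as tf

def train_step(actor_network: tf.keras.Model,
               optimizer: tf.keras.optimizers.Optimizer,
               state: tf.Tensor, action: int, reward: float) -> tf.Tensor:
    """Perform one REINFORCE update for a single (state, action, reward) observation."""
    with tf.GradientTape() as tape:
        # Forward pass: must be recorded on the tape for automatic differentiation
        action_probs = actor_network(state, training=True)
        loss = cross_entropy_loss(action_probs, action, reward)
    # Backward pass: gradients of the loss w.r.t. the trainable network weights
    grads = tape.gradient(loss, actor_network.trainable_variables)
    # Apply the gradients with the selected optimizer (e.g., Adam)
    optimizer.apply_gradients(zip(grads, actor_network.trainable_variables))
    return loss
```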

Multi-armed bandit

In the multi-armed bandit problem, we are able to play several slot machines with unique pay-off properties [6]. Each machine i has a mean payoff μᵢ and a standard deviation σᵢ, which are unknown to the player. At every decision moment we play one of the machines and observe the reward. After sufficient iterations and exploration, we should be able to estimate the mean reward of each machine fairly accurately. Naturally, the optimal policy is to always play the slot machine with the highest expected payoff.
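As an illustration, drawing a payoff for the chosen machine could look like this (the Gaussian reward model and the function name are assumptions of this sketch):

```python
import numpy as np

def sample_reward(machine: int, means: np.ndarray, std_devs: np.ndarray) -> float:
    """Draw a stochastic payoff for the selected slot machine."""
    return float(np.random.normal(loc=means[machine], scale=std_devs[machine]))
```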

Using Keras, we define a dense actor network. It takes a fixed state (a tensor with value 1) as input. We add two hidden layers of five neurons each, using ReLU activation functions. The network outputs the probabilities of playing each slot machine. The bias weights are initialized in such a way that each machine has equal probability at the beginning. Finally, the chosen optimizer is Adam with its default learning rate of 0.001.
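A sketch of this network in Keras could look as follows; zero-initializing the output layer is one way to start with equal probabilities, but the exact initializers are an assumption of this sketch:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_actor_network(n_machines: int, learning_rate: float = 0.001):
    """Dense actor network taking a fixed dummy state as input."""
    inputs = layers.Input(shape=(1,))               # fixed state, always equal to 1
    x = layers.Dense(5, activation="relu")(inputs)  # first hidden layer, five ReLU units
    x = layers.Dense(5, activation="relu")(x)       # second hidden layer, five ReLU units
    # Zero-initialized output layer: the softmax starts out uniform over the machines
    outputs = layers.Dense(n_machines, activation="softmax",
                           kernel_initializer="zeros",
                           bias_initializer="zeros")(x)
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)  # default learning rate
    return model, optimizer

# Example usage: probabilities for the fixed state
# model, optimizer = build_actor_network(n_machines=4)
# probs = model(tf.constant([[1.0]]))
```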

We test four settings with differing mean payoffs; for simplicity, all standard deviations are set equal. The figures below show the learned probabilities for each of the four slot machines. As expected, the policy learns to play the machine(s) with the highest expected payoff. Some exploration naturally persists, especially when payoffs are close together. A bit of fine-tuning and you will surely do a lot better during your next Vegas trip.


Key points

  • We define a pseudo-loss to update actor networks. For discrete control, the pseudo-loss is simply the negative log probability of the chosen action multiplied by the reward signal, also known as the log loss or cross-entropy loss function.
  • Standard TensorFlow 2.0 training routines such as model.fit and model.train_on_batch only accept loss functions with exactly two arguments (y_true and y_pred). The GradientTape does not have this restriction.
  • Actor networks are updated using three steps: (i) define a custom loss function, (ii) compute the gradients for the trainable variables and (iii) apply the gradients to update the weights of the actor network.

This article is partially based on my method paper: ‘Implementing Actor Networks for Discrete Control in TensorFlow 2.0’ [1]

The GitHub code (implemented using Python 3.8 and TensorFlow 2.3) can be found in my GitHub repository.

Looking to implement the continuous variant or deep Q-learning? Check out:

A Minimal Working Example for Continuous Policy Gradients in TensorFlow 2.0

A Minimal Working Example for Deep Q-Learning in TensorFlow 2.0

References

[1] Van Heeswijk, W.J.A. (2020) Implementing Actor Networks for Discrete Control in TensorFlow 2.0. https://www.researchgate.net/publication/344102641_Implementing_Actor_Networks_for_Discrete_Control_in_TensorFlow_20

[2] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256.

[3] Levine, S. (2019) CS 285 at UC Berkeley Deep Reinforcement Learning: Policy Gradients. http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf

[4] McCaffrey, J.D. (2016) Log Loss and Cross Entropy Are Almost the Same. https://jamesmccaffrey.wordpress.com/2016/09/25/log-loss-and-cross-entropy-are-almost-the-same/

[5] Rosebrock, A. (2020) Using TensorFlow and GradientTape to train a Keras model. https://www.tensorflow.org/api_docs/python/tf/GradientTape

[6] Ryzhov, I. O., Frazier, P. I., and Powell, W. B. (2010). On the robustness of a one-period look-ahead policy in multi-armed bandit problems. Procedia Computer Science, 1(1):1635–1644.

