A Minimal Working Example for Discrete Policy Gradients in TensorFlow 2.0

A multi-armed bandit example for training discrete actor networks. With the aid of the GradientTape functionality, the actor network can be trained using only a few lines of code.

Wouter van Heeswijk, PhD
Towards Data Science
6 min read · Sep 4, 2020

Photo by Hello I’m Nik via Unsplash

Assistant professor in Financial Engineering and Operations Research. Writing about reinforcement learning, optimization problems, and data science.