Actor-Critic With TensorFlow 2.x [Part 1 of 2]

Implementing the Actor-Critic method in different ways with TensorFlow 2.x

Abhishek Suran
Towards Data Science


Photo by David Veksler on Unsplash

In this series of articles, we will try to understand the actor-critic method and implement it in three ways: naive actor-critic, A2C without multiple workers, and A2C with multiple workers.

This is the first part of the series, in which we will implement a naive actor-critic using TensorFlow 2.2. Let us first understand what the actor-critic method is and how it works. Knowing the Reinforce policy gradient method will be beneficial; you can find it here.

Overview:

If you have read about the Reinforce policy gradient method, then you know that its update rule is

Update Rule for Reinforce
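
In standard notation (a sketch of the usual REINFORCE form, with step size alpha, return G_t, and policy parameters theta):

\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(A_t \mid S_t)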

In the actor-critic method, we subtract a baseline from the discounted reward, and the common baseline used for these methods is the state-value function. So our update rule for actor-critic will look like the following.

Actor-Critic update rule
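
A sketch of how this rule is commonly written, with the state value V(S_t) as the baseline (the implementation below replaces G_t - V(S_t) with a one-step temporal-difference error, since we update at every timestep):

\theta \leftarrow \theta + \alpha \, \big( G_t - V(S_t) \big) \, \nabla_\theta \log \pi_\theta(A_t \mid S_t)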

In actor-critic methods, we have two neural networks, namely the actor and the critic. The actor is used for action selection, and the critic is used to estimate the state value. If you look at the update equation, you can see that the state value is being used as a baseline. Having a baseline helps determine whether an action taken was good or bad, or whether it was the state itself that was good or bad. You can find very good resources for the theory in the reference section.

Naive Actor-Critic:

In this implementation, we will update our neural networks at every timestep. This differs from A2C, where we update the networks every n timesteps. We will implement A2C in the next part of this series.

Neural Networks:

The neural networks can be implemented in two basic ways.

  1. One network for both actor and critic functionalities, i.e., one network with two output layers: one for the state value and another for the action probabilities.
  2. Separate networks, one for the actor and another for the critic.

We will use separate networks for the actor and the critic in this article, because I find that this setup learns more quickly.

Code:

Actor and Critic Networks:

  1. The critic network outputs one value per state, and the actor network outputs a probability for every action in that state (see the sketch after the note below).
  2. Here, the 4 output neurons in the actor's network correspond to the number of actions.
  3. Note that the actor has a softmax activation in its output layer, which produces a probability for each action.

Note: the number of neurons in the hidden layers is very important for agent learning and varies from environment to environment.
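
A rough sketch of the two networks (layer sizes here are illustrative, assuming a 4-action environment such as LunarLander-v2):

import tensorflow as tf


class Critic(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.d1 = tf.keras.layers.Dense(128, activation='relu')
        self.d2 = tf.keras.layers.Dense(128, activation='relu')
        self.v = tf.keras.layers.Dense(1, activation=None)  # one state value per state

    def call(self, input_data):
        x = self.d1(input_data)
        x = self.d2(x)
        return self.v(x)


class Actor(tf.keras.Model):
    def __init__(self, n_actions=4):
        super().__init__()
        self.d1 = tf.keras.layers.Dense(128, activation='relu')
        self.d2 = tf.keras.layers.Dense(128, activation='relu')
        # Softmax output layer: one probability per action.
        self.pi = tf.keras.layers.Dense(n_actions, activation='softmax')

    def call(self, input_data):
        x = self.d1(input_data)
        x = self.d2(x)
        return self.pi(x)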

Agent class’s __init__ method:

  1. Here, we initialize the optimizers for our networks. Please note that the learning rate is also important and can vary with the environment and the method used.
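
A minimal sketch of such an __init__, assuming the Actor and Critic classes above; the Adam learning rates are only illustrative:

import tensorflow as tf


class Agent:
    def __init__(self, gamma=0.99):
        self.gamma = gamma  # discount factor
        self.actor = Actor()
        self.critic = Critic()
        # Learning rates are illustrative; tune them per environment and method.
        self.a_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)
        self.c_opt = tf.keras.optimizers.Adam(learning_rate=5e-4)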

Action Selection:

  1. This method makes use of the TensorFlow Probability library.
  2. First, the actor outputs action probabilities; these are then turned into a categorical distribution using TensorFlow Probability, and an action is sampled from that distribution.
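
A sketch of such an action-selection method, assuming the Agent class above (shown standalone for brevity):

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp


# Method of the Agent class from the sketch above.
def act(self, state):
    # The actor returns action probabilities for a batch containing one state.
    prob = self.actor(np.array([state]))
    # Turn the probabilities into a categorical distribution and sample an action.
    dist = tfp.distributions.Categorical(probs=prob, dtype=tf.float32)
    action = dist.sample()
    return int(action.numpy()[0])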

Learn function and losses:

  1. We will make use of tf.GradientTape for our custom training (see the sketch after the note below).
  2. The actor loss is the negative log probability of the action taken, multiplied by the temporal-difference error (the same quantity used in Q-learning).
  3. For the critic loss, we take a naive approach and simply square the temporal-difference error. You can use the mean squared error function from TensorFlow 2 if you want, but then you need to modify the temporal-difference calculation. We will be using MSE in the next part of this series, so don’t worry.
  4. You can find more about custom training loops on the official TensorFlow website.

Note: Make sure that you call the networks inside the with statement (the GradientTape context manager) and use only tensors for the network predictions; otherwise you will get an error about no gradients being provided.
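
A sketch of the losses and the learn step under these assumptions (methods of the Agent class from the sketches above, shown standalone; the helper name actor_loss is illustrative):

def actor_loss(self, prob, action, td):
    # Negative log probability of the taken action, scaled by the TD error.
    dist = tfp.distributions.Categorical(probs=prob, dtype=tf.float32)
    log_prob = dist.log_prob(action)
    return -log_prob * td


def learn(self, state, action, reward, next_state, done):
    state = np.array([state])
    next_state = np.array([next_state])
    with tf.GradientTape() as tape1, tf.GradientTape() as tape2:
        # Call the networks inside the tapes so gradients can be recorded.
        p = self.actor(state, training=True)
        v = self.critic(state, training=True)
        v_next = self.critic(next_state, training=True)
        # One-step temporal-difference error; no bootstrap on terminal states.
        td = reward + self.gamma * v_next * (1 - int(done)) - v
        a_loss = self.actor_loss(p, action, td)
        c_loss = td ** 2  # naive critic loss: squared TD error
    grads1 = tape1.gradient(a_loss, self.actor.trainable_variables)
    grads2 = tape2.gradient(c_loss, self.critic.trainable_variables)
    self.a_opt.apply_gradients(zip(grads1, self.actor.trainable_variables))
    self.c_opt.apply_gradients(zip(grads2, self.critic.trainable_variables))
    return a_loss, c_loss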

Training loop:

  1. The agent takes an action in the environment, and then both networks are updated.
  2. For the LunarLander environment, this implementation performs well.
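
A sketch of such a loop, assuming the classic Gym API (where env.reset() returns only the observation) and the Agent sketched above:

import gym

env = gym.make('LunarLander-v2')
agent = Agent(gamma=0.99)

for episode in range(2000):
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        # Update both networks at every timestep.
        agent.learn(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
    print(f'episode {episode}  total reward {total_reward:.1f}')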

Note: what I noticed while implementing these methods is that the learning rate and the number of neurons in the hidden layers hugely affect learning.

You can find the full code for this article here. Stay tuned for upcoming articles where we will be implementing A2C with and without multiple workers.

The Second Part of this series can be accessed here.

So, this concludes this article. Thank you for reading; I hope you enjoyed it and were able to understand what I wanted to explain. I hope you will read my upcoming articles. Hari Om…🙏

References:
