Photo by Sammy Wong on Unsplash

TD3: Learning To Run With AI

Donal Byrne
Towards Data Science
11 min read · Jun 15, 2019


This article looks at one of the most powerful state-of-the-art algorithms in Reinforcement Learning (RL): Twin Delayed Deep Deterministic Policy Gradients (TD3) (Fujimoto et al., 2018). By the end of this article you should have a solid understanding of what makes TD3 perform so well, be capable of implementing the algorithm yourself, and be able to use TD3 to train an agent to run successfully in the HalfCheetah environment.

However, before tackling TD3 you should already have a good understanding of RL and of common algorithms such as Deep Q Networks and DDPG, which TD3 is built upon. If you need to brush up on your knowledge, check out these excellent resources: the DeepMind Lecture Series, Let’s make a DQN, and Spinning Up: DDPG. This article will cover the following:

  1. What is TD3
  2. Explanation of each core mechanic
  3. Implementation & code walkthrough
  4. Results & Benchmarking

The full code can be found here on my GitHub. If you want to quickly follow along with the code used here, click on the icon below to be taken to a Google Colab workbook with everything ready to go.

What is TD3?

TD3 is the successor to Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015). Up until recently, DDPG was one of the most widely used algorithms for continuous control problems such as robotics and autonomous driving. Although DDPG is capable of providing excellent results, it has its drawbacks. Like many RL algorithms, training DDPG can be unstable and heavily reliant on finding the correct hyperparameters for the task at hand (OpenAI Spinning Up, 2018). This is caused by the algorithm continuously overestimating the Q values of the critic (value) network. These estimation errors build up over time and can lead to the agent falling into a local optimum or experiencing catastrophic forgetting. TD3 addresses this issue by focusing on reducing the overestimation bias seen in previous algorithms. This is done with the addition of three key features:

  1. Using a pair of critic networks (The twin part of the title)
  2. Delayed updates of the actor (The delayed part)
  3. Action noise regularisation (This part didn’t make it to the title :/ )

Twin Critic Networks

The first feature added to TD3 is the use of two critic networks. This was inspired by the technique seen in Deep Reinforcement Learning with Double Q-learning (Van Hasselt et al., 2016), which involved estimating the current Q value using a separate target value function, thus reducing the bias. However, the technique doesn’t work perfectly for actor critic methods. Because the policy changes slowly, the current and target networks remain too similar to give an independent estimate, which brings bias back into the picture. Instead, an older approach seen in Double Q-learning (Van Hasselt, 2010) is used: TD3 uses clipped double Q-learning, which takes the smaller value of the two critic networks (the lesser of two evils, if you will).

Fig 1. The lesser of the two value estimates will cause less damage to our policy updates. Image found here

This method favours underestimation of Q values. This underestimation bias isn’t a problem, as low values will not be propagated through the algorithm, unlike overestimated values. This provides a more stable approximation and so improves the stability of the entire algorithm.

Bottom Line: TD3 uses two separate critic networks, using the smallest value of the two when forming its targets.
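To make this concrete, below is a minimal, self-contained sketch of forming the clipped double-Q target in PyTorch. The tensors are illustrative placeholders; the full critic update later in the article shows where these values actually come from.

import torch

# Illustrative value estimates from the two target critics for a batch of two transitions
q1_target = torch.tensor([[10.2], [8.7]])
q2_target = torch.tensor([[9.5], [9.1]])
reward = torch.tensor([[1.0], [0.5]])
not_done = torch.tensor([[1.0], [1.0]])  # 0 where the episode terminated
discount = 0.99

# Clipped double Q-learning: use the smaller of the two estimates when forming the target
target_Q = reward + not_done * discount * torch.min(q1_target, q2_target)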

Delayed Updates

Fig 2. Making our policy wait a little while the critic network becomes more stable. Image found here

Target networks are a great tool for introducing stability to an agent’s training. However, in the case of actor critic methods there are some issues with this technique, caused by the interaction between the policy (actor) and critic (value) networks. Training diverges when a poor policy is overestimated: the agent’s policy then continues to get worse, as it is updating on states with a lot of error.

In order to fix this, we simply carry out updates of the policy network less frequently than the value network. This allows the value network to become more stable and reduce errors before it is used to update the policy network. In practice, the policy network is updated every fixed number of time steps, while the value network continues to update after each time step. These less frequent policy updates use value estimates with lower variance and therefore should result in a better policy.

Bottom Line: TD3 uses a delayed update of the actor network, only updating it every 2 time steps instead of after each time step, resulting in more stable and efficient training.
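The scheduling itself is just a modulo check inside the training loop. Below is a minimal sketch with stand-in update functions; the real critic and actor updates are covered step by step in the implementation section.

def update_critic(it):
    # Stand-in for the critic update described in step 5 of the implementation
    print(f"iteration {it}: critic updated")

def update_actor_and_targets(it):
    # Stand-in for the delayed actor and target network updates (steps 6 and 7)
    print(f"iteration {it}: actor and target networks updated")

policy_freq = 2  # the paper updates the policy every second critic update

for it in range(6):
    update_critic(it)
    if it % policy_freq == 0:
        update_actor_and_targets(it)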

Noise Regularisation

The final portion of TD3 looks at smoothing the target policy. Deterministic policy methods have a tendency to produce target values with high variance when updating the critic. This is caused by overfitting to spikes in the value estimate. In order to reduce this variance, TD3 uses a regularisation technique known as target policy smoothing. Ideally there would be no variance between target values, with similar actions receiving similar values. TD3 reduces this variance by adding a small amount of random noise to the target and averaging over mini batches. The range of noise is clipped in order to keep the target value close to the original action.

Fig 3. By training with the added noise to regularise the agent’s actions, we favour a more robust policy. Image found here

By adding this additional noise to the value estimate, policies tend to be more stable as the target value is returning a higher value for actions that are more robust to noise and interference.

Bottom Line: Clipped noise is added to the selected action when calculating the targets, favouring higher values for actions that are more robust to noise.
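Below is a minimal sketch of the smoothing step on its own, using the noise scale and clip range from the paper and a placeholder action bound; the same idea appears inside the critic update later on.

import torch

policy_noise = 0.2  # std of the Gaussian noise added to the target action (paper value)
noise_clip = 0.5    # noise is clipped so the smoothed action stays near the original (paper value)
max_action = 1.0    # action bound of the environment

next_action = torch.tensor([[0.3, -0.8]])  # illustrative output of the target actor

noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
smoothed_action = (next_action + noise).clamp(-max_action, max_action)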

Implementation

This implementation is based on the original repo for the paper, found here. The major sections of code are covered below, with the complete self-contained notebook found here. The implementation is written in PyTorch; if you are not familiar with it, I would suggest checking out some of the example documentation here. All network architecture and hyperparameters are the same as those used in the original paper. Below is the pseudo code from the paper. Although it may look complicated, once you break it down and get past the mathematical notation, it is actually very intuitive.

Fig 4. TD3 algorithm with key areas highlighted according to their steps detailed below

Algorithm Steps:

I have broken the previous pseudo code up into logical steps that you can follow in order to implement the TD3 algorithm; a sketch of how these steps fit together in a single training loop follows the list:

  1. Initialise networks
  2. Initialise replay buffer
  3. Select and carry out action with exploration noise
  4. Store transitions
  5. Update critic
  6. Update actor
  7. Update target networks
  8. Repeat until sentient
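Before diving into each step, here is a rough sketch of how they fit together in a single training loop. The StubAgent, StubBuffer and StubEnv classes are placeholders standing in for the real TD3 agent, replay buffer and Gym environment built up over the rest of this section.

import numpy as np

class StubAgent:
    def select_action(self, state, noise=0.1):
        return np.zeros(2)                       # step 3: noisy action from the actor

    def train(self, replay_buffer, iterations):  # steps 5, 6 and 7
        pass

class StubBuffer:
    def __init__(self):
        self.storage = []

    def add(self, transition):
        self.storage.append(transition)          # step 4

class StubEnv:
    def reset(self):
        return np.zeros(4)

    def step(self, action):
        return np.zeros(4), 0.0, True, {}

agent, env, replay_buffer = StubAgent(), StubEnv(), StubBuffer()  # steps 1 and 2

state = env.reset()
for t in range(1000):                                     # step 8: repeat until sentient
    action = agent.select_action(state, noise=0.1)        # step 3
    next_state, reward, done, _ = env.step(action)
    replay_buffer.add((state, next_state, action, reward, float(done)))  # step 4
    state = next_state
    agent.train(replay_buffer, iterations=1)              # steps 5, 6 and 7
    if done:
        state = env.reset()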

1. Initialise Networks

This is a fairly standard set up for both the Actor and Critic networks. Note that the Critic class actually contains both critic networks to be used. The critic’s forward() method returns the Q values of both critics, to be used later, while the get_Q method simply returns the value estimate of the first critic.

import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()

        self.l1 = nn.Linear(state_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, action_dim)

        self.max_action = max_action

    def forward(self, x):
        x = F.relu(self.l1(x))
        x = F.relu(self.l2(x))
        x = self.max_action * torch.tanh(self.l3(x))
        return x


class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()

        # Q1 architecture
        self.l1 = nn.Linear(state_dim + action_dim, 400)
        self.l2 = nn.Linear(400, 300)
        self.l3 = nn.Linear(300, 1)

        # Q2 architecture
        self.l4 = nn.Linear(state_dim + action_dim, 400)
        self.l5 = nn.Linear(400, 300)
        self.l6 = nn.Linear(300, 1)

    def forward(self, x, u):
        # Returns the value estimates of both critics for the state-action pair
        xu = torch.cat([x, u], 1)

        x1 = F.relu(self.l1(xu))
        x1 = F.relu(self.l2(x1))
        x1 = self.l3(x1)

        x2 = F.relu(self.l4(xu))
        x2 = F.relu(self.l5(x2))
        x2 = self.l6(x2)
        return x1, x2

    def get_Q(self, x, u):
        # Returns only the first critic's value estimate (used for the actor update)
        xu = torch.cat([x, u], 1)

        x1 = F.relu(self.l1(xu))
        x1 = F.relu(self.l2(x1))
        x1 = self.l3(x1)
        return x1
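As a quick sanity check, the networks above can be wired together as follows. This is a minimal sketch: the dimensions are illustrative (in practice they come from the environment’s observation and action spaces), the target networks start as exact copies, and Adam is a reasonable choice of optimiser.

import copy
import torch

state_dim, action_dim, max_action = 17, 6, 1.0  # illustrative; take these from the env's spaces

actor = Actor(state_dim, action_dim, max_action)
actor_target = copy.deepcopy(actor)     # target networks start as exact copies
critic = Critic(state_dim, action_dim)
critic_target = copy.deepcopy(critic)

actor_optimizer = torch.optim.Adam(actor.parameters())
critic_optimizer = torch.optim.Adam(critic.parameters())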

2. Initialise Buffer

This is a standard replay buffer borrowed from the OpenAI baseline repo here
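If you don’t want to pull in the Baselines code, a minimal buffer only needs two operations: add a transition and sample a random mini batch. Below is a bare-bones sketch that returns batches in the (state, next_state, action, reward, done) order used by the update code later in this article.

import numpy as np

class ReplayBuffer:
    """Minimal FIFO replay buffer: stores transitions and samples random mini batches."""

    def __init__(self, max_size=1_000_000):
        self.storage = []
        self.max_size = max_size
        self.ptr = 0

    def add(self, transition):
        # transition = (state, next_state, action, reward, done)
        if len(self.storage) == self.max_size:
            self.storage[self.ptr] = transition
            self.ptr = (self.ptr + 1) % self.max_size
        else:
            self.storage.append(transition)

    def sample(self, batch_size):
        idx = np.random.randint(0, len(self.storage), size=batch_size)
        states, next_states, actions, rewards, dones = [], [], [], [], []
        for i in idx:
            s, s_, a, r, d = self.storage[i]
            states.append(np.array(s, copy=False))
            next_states.append(np.array(s_, copy=False))
            actions.append(np.array(a, copy=False))
            rewards.append(np.array(r, copy=False))
            dones.append(np.array(d, copy=False))
        return (np.array(states), np.array(next_states), np.array(actions),
                np.array(rewards).reshape(-1, 1), np.array(dones).reshape(-1, 1))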

3. Select Action with Exploration Noise

This is a standard step in the Markov decision process of the environment. Here the agent picks an action with exploration noise added.

def select_action(self, state, noise=0.1):
    state = torch.FloatTensor(state.reshape(1, -1)).to(device)

    action = self.actor(state).cpu().data.numpy().flatten()
    if noise != 0:
        action = action + np.random.normal(0, noise, size=self.env.action_space.shape[0])

    return action.clip(self.env.action_space.low, self.env.action_space.high)

4. Store Transitions

After taking an action we store the information about that time step in the replay buffer. These transitions will be used later while updating our networks.

replay_buffer.add((self.obs, new_obs, action, reward, done_bool))
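One detail worth noting: done_bool is not always the raw done flag. A common pattern in TD3-style implementations is to zero it out when an episode ends only because it hit the environment’s time limit, since that is not a true terminal state. A sketch with illustrative values:

# Illustrative values; in the training loop these come from the environment
episode_timesteps = 999
max_episode_steps = 1000  # e.g. env._max_episode_steps for Gym time-limited tasks
done = True

# Treat time-limit terminations as non-terminal so bootstrapping is not cut off
done_bool = 0.0 if episode_timesteps + 1 == max_episode_steps else float(done)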

5. Update Critic

Once we have carried out a full time step through the environment, we train our model for several iterations. The first part of the update involves the critic. This is one of the most important parts of the algorithm and where most of TD3’s additional features are implemented. The first thing to do is sample a mini batch of stored transitions from the replay buffer.

# Sample mini batch
s, s_, a, r, d = replay_buffer.sample(batch_size)
state = torch.FloatTensor(s).to(device)
next_state = torch.FloatTensor(s_).to(device)
action = torch.FloatTensor(a).to(device)
reward = torch.FloatTensor(r).to(device)
done = torch.FloatTensor(1 - d).to(device)  # 1 where the episode continues, 0 where it terminated

Next we select an action for each of the states pulled in from our mini batch and apply target policy smoothing. As described earlier, this just means picking an action with the target actor network and adding clipped noise to it, so that the noisy action doesn’t stray too far from the original action value.

# Select action with the actor target and apply clipped noise
noise = torch.FloatTensor(a).data.normal_(0, policy_noise).to(device)
noise = noise.clamp(-noise_clip, noise_clip)
next_action = (self.actor_target(next_state) + noise).clamp(-self.max_action, self.max_action)

Next we compute the target Q values for the critic. This is where the double critic networks come into play. We get the Q values from each target critic and then take the smaller of the two as our target Q value.

# Compute the target Q value
target_Q1, target_Q2 = self.critic_target(next_state, next_action)
target_Q = torch.min(target_Q1, target_Q2)
target_Q = reward + (done * discount * target_Q).detach()

Finally, we calculate the loss for the two current critic networks. This is the sum of the MSE between each current critic’s estimate and the target Q value we just calculated. We then optimise the critic as normal.

# Get current Q estimates
current_Q1, current_Q2 = self.critic(state, action)
# Compute critic loss
critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)
# Optimize the critic
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()

6. Update Actor

The actor is much simpler to update than the critic. First we make sure we only update the actor every d time steps. In our case, as in the paper, the actor is updated every second time step.

# Delayed policy updates
if it % policy_freq == 0:

    # Compute actor loss
    actor_loss = -self.critic.get_Q(state, self.actor(state)).mean()

    # Optimize the actor
    self.actor_optimizer.zero_grad()
    actor_loss.backward()
    self.actor_optimizer.step()

    # Update the frozen target models
    for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

    for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

The actor’s loss is simply the mean of the negative Q values from our critic network, with the actor choosing the action to take for each state in the mini batch. Just like before, we optimise the actor network through backpropagation.

7. Update Target Networks

Finally, we update our frozen target networks using a soft update. This is done alongside the actor update and is therefore also delayed.

# Update the frozen target models
for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

Full Code

Results

The authors’ results in the original paper boast excellent scores across a variety of benchmark environments from the MuJoCo control suite. The results below show how TD3 outperforms almost all other algorithms, including the recent SAC (Haarnoja et al., 2018) and PPO (Schulman et al., 2017), OpenAI’s go-to algorithm for groundbreaking research such as their Dota 2 agent.

Fig 5. Benchmarking results from the TD3 paper

The algorithms used as benchmarks include the OpenAI Baselines implementations of DDPG, PPO, ACKTR (Wu et al., 2017) and TRPO (Schulman et al., 2015). SAC was implemented from the authors’ GitHub.

However, since the release of TD3, improvements have been made to SAC, as seen in Soft Actor-Critic Algorithms and Applications (Haarnoja et al., 2019), where Haarnoja shows new results that outperform TD3 across the board. For a less biased comparison, we can look at the benchmarking results from OpenAI Spinning Up’s implementations of the main RL algorithms. As you can see in Fig 6, TD3 manages to outperform SAC in the Ant environment, while SAC achieves higher performance in the HalfCheetah environment.

Fig 6. OpenAI Spinning Up’s benchmark results for the Ant and HalfCheetah MuJoCo environments

Below are the training results from my own implementation of TD3, tested on the Roboschool HalfCheetah environment. The graph shows the agent’s average score over the last 100 episodes. As you can see, the agent quickly learns to stand and then to walk successfully.

Fig 7. Training results of TD3 HalfCheetah. Shows the average score over the previous 100 episodes

Although it does briefly fall into a local optimum, the agent is able to recover quickly, converging on an optimal policy after 500k time steps. The video below shows the results of the fully trained agent.

Fig 8. Results of the trained TD3 HalfCheetah agent

Conclusion

Congratulations, we have covered everything you need to start implementing one of the best state-of-the-art reinforcement learning algorithms on the market! We have gone through what TD3 is and explained the core mechanics that make the algorithm perform so well. Not only that, but we have stepped through the algorithm and learned how to build it with PyTorch. Finally, we looked at the results of the algorithm from the original paper and from this article’s implementation. I hope you found this article helpful and learned something about Reinforcement Learning!

References

[1] Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

[2] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[3] OpenAI — Spinning Up, 2018: https://spinningup.openai.com/en/latest/algorithms/td3.html#background

[4] Hado van Hasselt (2010). Double Q-learning. Advances in Neural Information Processing Systems 23 (NIPS 2010), Vancouver, British Columbia, Canada, pp. 2613–2622.

[5] Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In AAAI, pp. 2094– 2100, 2016.

[6] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

[7] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[8] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

[9] Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems, pp. 5285–5294, 2017.

[10] Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft Actor-Critic Algorithms and Applications. arXiv preprint arXiv:1812.05905v2, 2019



Senior Research Engineer @InstaDeep, specialising in training large scale reinforcement learning agents. Currently hoping that the machines remember me fondly.