
[RL] Train the Robotic Arm to Reach a Ball — Part 02

Comparing Learning Efficiency Across DDPG, D4PG and A2C

Tom Lin
Towards Data Science
10 min read · Feb 21, 2019


Recap

As mentioned in Part 01, DDPG does not successfully solve the task, no matter how long the agent trains. The average episodic reward lingers around 0.04 to 0.05, nowhere near the target reward of 30. Below, I therefore experiment with the other two algorithms: D4PG in a single-agent scenario and A2C in a multi-agent scenario.

As before, here is a quick guide to the structure of this article.

1. Structure of the Report

  • 1.1 Train on a Single-Agent Scenario — D4PG
    This time, I experiment with a D4PG agent instead. It is the most recently published of the three algorithms and is reported to be more efficient than DDPG in complex environments. This section walks through the model structure, replay memory, action exploration and learning process, highlighting the differences from DDPG and only briefly skimming the parts that are similar.
  • 1.2 Train on a Multi-Agent Scenario — A2C
    The A2C model is implemented in a multi-agent environment. This section covers how the model is built, how it collects experience and how it learns along the way. The code for A2C differs considerably from the DDPG and D4PG code, since it has to reflect how A2C uses multiple agents to collect experience and update the network parameters. Again, the training result is included at the end.
  • 1.3 Comparison of All Algorithms and Conclusion
    Finally, we are ready to compare how the three algorithms perform. The rolling episodic reward of each one is plotted on the same graph, so that we can better see the learning trend and efficiency of each model.

2. Train on a Single-Agent Scenario — D4PG

As we saw in Part 01, the DDPG model does not solve the task successfully, so I turn to another algorithm — [D4PG], which was published in 2018 and is the newest of the three. The code is mainly adapted from the book [Deep-Reinforcement-Learning-Hands-On].

First, I import some self-defined modules to set everything up before training. These modules include:

  1. d4pg_model: Module file containing the Actor and Critic neural network classes for D4PG.
  2. replay_memory: Collects and samples transition experiences for training.
  3. d4pg_agent: Module file defining how a D4PG agent interacts with the environment and carries out the training process.

2.1 Model Structure

I follow the same model structure as specified for DDPG, except that the critic network's output layer is changed to N_ATOMS units, since the D4PG critic predicts a categorical distribution over returns rather than a single Q-value. For the rest, both the Critic and the Actor have two hidden layers of 128 and 64 units, the same as in DDPG.

Code — Critic Network for D4PG (excerpted)
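For a concrete point of reference, here is a minimal PyTorch sketch of what such a distributional critic can look like, assuming the hyper-parameters listed further below (N_ATOMS = 51, Vmin = -10, Vmax = 10). The class and helper names are illustrative and may differ from the exact code in my repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ATOMS, V_MIN, V_MAX = 51, -10, 10   # distributional parameters (see section 2.6)

class CriticD4PG(nn.Module):
    """Distributional critic: outputs N_ATOMS logits instead of a single Q-value."""
    def __init__(self, state_size, action_size, fc1_units=128, fc2_units=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        # As in DDPG, the action is concatenated after the first hidden layer.
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, N_ATOMS)
        # Fixed support of the categorical value distribution.
        self.register_buffer("supports", torch.linspace(V_MIN, V_MAX, N_ATOMS))

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.fc3(x)                      # raw logits over the atoms

    def distr_to_q(self, logits):
        """Collapse a predicted distribution back into an expected Q-value."""
        probs = F.softmax(logits, dim=1)
        return (probs * self.supports).sum(dim=1, keepdim=True)
```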

2.2 Replay Memory

To conform to the data types required by the D4PG agent, which is adapted from [Deep-Reinforcement-Learning-Hands-On], sampling is done through the sample2() function defined in the replay memory object. The replay memory size is set to 100,000. The detailed code snippet is in this [link].
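A plausible sketch of such a buffer is shown below. The exact fields and ordering returned by sample2() depend on the repository code, so treat the layout here (stacked NumPy arrays) as an assumption.

```python
import random
from collections import deque, namedtuple

import numpy as np

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayMemory:
    """Fixed-size buffer storing transition experiences."""
    def __init__(self, buffer_size=100_000):
        self.memory = deque(maxlen=buffer_size)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample2(self, batch_size=64):
        """Return a random batch as stacked NumPy arrays, the format the D4PG agent consumes."""
        batch = random.sample(self.memory, k=batch_size)
        states = np.vstack([e.state for e in batch]).astype(np.float32)
        actions = np.vstack([e.action for e in batch]).astype(np.float32)
        rewards = np.array([e.reward for e in batch], dtype=np.float32)
        next_states = np.vstack([e.next_state for e in batch]).astype(np.float32)
        dones = np.array([e.done for e in batch], dtype=np.float32)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```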

2.3 Action Exploration

One minor difference between D4PG and DDPG is the action exploration. D4PG uses simple random noise drawn from a normal distribution to encourage exploration, instead of the OU noise used in DDPG. The modified code snippet is in this [link].
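Conceptually, the exploration step looks roughly like this (a sketch; the noise scale epsilon is an assumed parameter, not a value taken from my code):

```python
import numpy as np
import torch

def act(actor, state, epsilon=0.3):
    """Deterministic policy output plus simple Gaussian noise (replacing DDPG's OU noise)."""
    state_t = torch.from_numpy(state).float().unsqueeze(0)
    with torch.no_grad():
        action = actor(state_t).cpu().numpy().squeeze(0)
    action += epsilon * np.random.normal(size=action.shape)
    return np.clip(action, -1.0, 1.0)   # the environment expects actions in [-1, 1]
```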

2.4 Loss Function

D4PG can train on multi-step transition trajectories (N-step returns), but I choose to train on a single time-step for simplicity. According to other reviews, one-step training is the most unstable and is not recommended, but I go with it anyway. Hence, the following code for the loss and the agent's learning step is based on one-step transition trajectories.

Code — Loss Computation and Learning Process in D4PG Agent (excerpted)
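Below is a rough sketch of that one-step distributional loss, following the general pattern from Deep-Reinforcement-Learning-Hands-On. The projection helper and function names are illustrative, and distr_to_q() refers to the critic sketch in section 2.1; the batch comes from sample2() above.

```python
import numpy as np
import torch
import torch.nn.functional as F

GAMMA, N_ATOMS, V_MIN, V_MAX = 0.99, 51, -10.0, 10.0
DELTA_Z = (V_MAX - V_MIN) / (N_ATOMS - 1)

def distr_projection(next_distr, rewards, dones, gamma=GAMMA):
    """Project the target distribution onto the fixed support after the one-step
    Bellman update r + gamma * z; terminal transitions collapse onto the reward."""
    batch_size = len(rewards)
    proj_distr = np.zeros((batch_size, N_ATOMS), dtype=np.float32)
    for atom in range(N_ATOMS):
        z = V_MIN + atom * DELTA_Z
        tz = np.clip(rewards + (1.0 - dones) * gamma * z, V_MIN, V_MAX)
        b = (tz - V_MIN) / DELTA_Z
        lo, hi = np.floor(b).astype(np.int64), np.ceil(b).astype(np.int64)
        eq = lo == hi
        proj_distr[eq, lo[eq]] += next_distr[eq, atom]
        ne = ~eq
        proj_distr[ne, lo[ne]] += next_distr[ne, atom] * (hi - b)[ne]
        proj_distr[ne, hi[ne]] += next_distr[ne, atom] * (b - lo)[ne]
    return proj_distr

def compute_losses(actor, critic, target_actor, target_critic, batch):
    """Critic loss: cross-entropy against the projected target distribution.
    Actor loss: maximise the expected Q of the actor's own actions.
    `batch` is the (states, actions, rewards, next_states, dones) tuple from sample2()."""
    states, actions, rewards, next_states, dones = batch
    states_t = torch.from_numpy(states)
    actions_t = torch.from_numpy(actions)
    next_states_t = torch.from_numpy(next_states)

    with torch.no_grad():
        next_actions = target_actor(next_states_t)
        next_distr = F.softmax(target_critic(next_states_t, next_actions), dim=1).numpy()
    proj = torch.from_numpy(distr_projection(next_distr, rewards, dones))

    log_probs = F.log_softmax(critic(states_t, actions_t), dim=1)
    critic_loss = -(log_probs * proj).sum(dim=1).mean()

    actor_loss = -critic.distr_to_q(critic(states_t, actor(states_t))).mean()
    return critic_loss, actor_loss
```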

2.5 Weight Update

The weights are soft-updated by soft_updated(), the same as in DDPG.
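For completeness, the soft update is the usual Polyak averaging with τ = 1e-3 (a minimal sketch of the helper mentioned above):

```python
def soft_updated(local_model, target_model, tau=1e-3):
    """theta_target <- tau * theta_local + (1 - tau) * theta_target, same as in DDPG."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```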

2.6 Hyper-parameters in a Nutshell

The following is an overview of the hyper-parameter settings:

  • Learning Rate (Actor/Critic): 1e-4
  • Batch Size: 64
  • Buffer Size: 100000
  • Gamma: 0.99
  • Tau: 1e-3
  • Weight Updates per Learning Session: 10
  • Learning Session Triggered Every (time-steps): 150
  • Max Gradient Clipped for Critic: 1
  • N-step: 1 # transition trajectory
  • N-Atoms: 51 # for critic network output
  • Vmax: 10 # parameter for critic network
  • Vmin: -10 # parameter for critic network
  • Hidden Layer 1 Size: 128
  • Hidden Layer 2 Size: 64

2.7 Construct Training Function

To monitor the training progress, I again define a training function, train_d4pg(), which is pretty much the same as train_ddpg(); see the full code snippet in this [link].
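In outline, the monitoring loop looks something like the sketch below. It assumes the Unity ML-Agents style environment from Part 01; names such as brain_name, agent.act() and agent.step() are assumptions rather than the exact repository code.

```python
from collections import deque

import numpy as np

def train_d4pg(env, agent, brain_name, n_episodes=5000, target_score=30.0):
    """Run episodes, track a 100-episode rolling average and stop once the target is reached."""
    scores, scores_window = [], deque(maxlen=100)
    for i_episode in range(1, n_episodes + 1):
        env_info = env.reset(train_mode=True)[brain_name]
        state, score = env_info.vector_observations[0], 0.0
        while True:
            action = agent.act(state)
            env_info = env.step(action)[brain_name]
            next_state = env_info.vector_observations[0]
            reward, done = env_info.rewards[0], env_info.local_done[0]
            agent.step(state, action, reward, next_state, done)  # store transition, learn every 150 steps
            state, score = next_state, score + reward
            if done:
                break
        scores.append(score)
        scores_window.append(score)
        if np.mean(scores_window) >= target_score:
            print(f"Solved in {i_episode} episodes.")
            break
    return scores
```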

2.8 Training Result — Low Efficiency

Owing to the prior failure with DDPG, this time I lower the learning frequency a little: training is triggered every 150 time-steps, and the weights are updated 10 times at each training session.

I hope this will further stabilize the training, even though it may take the agent much longer to learn. The result below shows that the D4PG agent successfully reaches the target episodic score of 30; nonetheless, it takes up to 5,000 episodes to get there. The learning progress is clearly quite slow.

Below is the graph of the rolling average score over episodes for D4PG. Looking at the plot, we can see that the episodic score deviates hugely from one episode to the next. Clearly, the training progress is not very stable, reflecting the nature of an off-policy algorithm.

Rolling Episodic Score over Episodes (D4PG)

Now, let's take a look at how it performs in animation.

A Single D4PG Agent Controls a Ball (final training result)

3. Train on a Multi-Agent Scenario — A2C

The modules/functions used to build the A2C model are as follows:

  1. A2CModel: Neural network for the A2C reinforcement learning algorithm.
  2. collect_trajectories(): Collects n-step experience transitions.
  3. learn(): Computes the training loss from the collected trajectories and updates the network's weights.

3.1 Brief Background of the Environment

This time, I use another environment that activates 20 agents simultaneously, each with its own copy of the environment. The experiences of these 20 agents are gathered together and shared with one another.

Preview of the Environment (Multi-Agents)

3.2 Model Structure

The model is a simple network of two fully-connected layers, with 128 and 64 units respectively.

It then splits into an actor layer and a critic layer (instead of separate actor and critic networks as in the previous models). Both the actor and critic heads are fully-connected layers, following the approach used in [the original A3C algorithm paper].

Code — Initialize A2C Model (excerpted)
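A minimal sketch of this shared-body architecture in PyTorch is shown below. The state size of 33 and action size of 4 come from the environment; the layer names are illustrative rather than the exact code.

```python
import torch.nn as nn
import torch.nn.functional as F

class A2CModel(nn.Module):
    """One shared body with separate actor and critic output layers."""
    def __init__(self, state_size=33, action_size=4, fc1_units=128, fc2_units=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.actor = nn.Linear(fc2_units, action_size)   # mean (mu) of the action distribution
        self.critic = nn.Linear(fc2_units, 1)            # state value V(s)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.actor(x), self.critic(x)
```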

3.3 Collect On-Policy Transition Trajectories

A2C is an on-policy RL algorithm, so there is no such thing as a replay memory. Instead, it uses the most recently collected transition experiences to update its network.

In the next code snippet, I define the collect_trajectories() function. It takes the A2C model, the environment and the number of time-steps to collect as inputs. As the model interacts with the environment, all actions and feedback are stored in objects such as batch_s, batch_a and batch_r, standing for states, actions and rewards respectively. Once the collected experiences reach the required number of time-steps, or when the episode ends, the function normalizes and discounts the rewards of each step, producing the final target value (processed reward) for each time-step, which is stored in batch_v_t.

Function — Used to Collect Trajectories for A2C Model Training
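The sketch below captures that logic. It assumes the Unity ML-Agents style environment and a fixed exploration σ (see section 3.4); the exact ordering of normalization and discounting, the bootstrap handling, and the extra return values (the state to continue from and a done flag) are assumptions that may differ from the repository code.

```python
import numpy as np
import torch

def collect_trajectories(model, env, brain_name, states, n_steps=10, gamma=0.99, sigma=0.1):
    """Roll all 20 agents forward for up to n_steps and return processed training batches."""
    batch_s, batch_a, batch_r = [], [], []
    episode_done = False

    for _ in range(n_steps):
        states_t = torch.from_numpy(states).float()
        with torch.no_grad():
            mu, _ = model(states_t)
            raw_actions = torch.distributions.Normal(mu, sigma).sample()
            actions = torch.tanh(raw_actions)             # squash into [-1, 1] for the env
        env_info = env.step(actions.numpy())[brain_name]

        batch_s.append(states)
        batch_a.append(raw_actions.numpy())
        batch_r.append(np.array(env_info.rewards, dtype=np.float32))

        states = env_info.vector_observations
        if np.any(env_info.local_done):                   # stop the batch when an episode ends
            episode_done = True
            break

    # Discount rewards backwards through the batch to build the target values batch_v_t.
    batch_r = np.stack(batch_r)                           # shape: (steps, n_agents)
    if episode_done:
        running_return = np.zeros(batch_r.shape[1], dtype=np.float32)
    else:
        with torch.no_grad():                             # bootstrap from the critic's last value
            _, last_values = model(torch.from_numpy(states).float())
        running_return = last_values.numpy().squeeze(-1)

    batch_v_t = np.zeros_like(batch_r)
    for t in reversed(range(batch_r.shape[0])):
        running_return = batch_r[t] + gamma * running_return
        batch_v_t[t] = running_return

    batch_v_t = (batch_v_t - batch_v_t.mean()) / (batch_v_t.std() + 1e-8)  # reward normalization

    return (np.concatenate(batch_s).astype(np.float32),
            np.concatenate(batch_a).astype(np.float32),
            batch_v_t.reshape(-1, 1),
            states, episode_done)
```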

3.4 Action Exploration

Actions are sampled from a normal distribution whose mean μ depends on the state, while σ is given as an argument. The sampled action is then passed through a tanh() activation so that its values are squashed between -1 and 1, as required by the environment.

Besides, in order to retrieve the log probability of the actions later on, I use a small trick: get_action() returns both actions_tanh and the raw action values. The raw action values are stored in batch_a. During the learning phase, they are passed along with the states to get_action_prob(batch_s, batch_a) to obtain the corresponding log probabilities of the actions.

As for the critic state value, it is simply the output of the state passed through the critic layer.

Code — Action Output and Critic State Value for A2C Model (excerpted)
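Here is a sketch of the two helpers. In the actual repository they may be methods of the model or agent; here they are standalone functions taking the model as an argument, and the fixed σ of 0.1 is an assumption (see section 3.5 on the constant σ).

```python
import torch

SIGMA = 0.1   # assumed fixed exploration standard deviation

def get_action(model, states, sigma=SIGMA):
    """Sample raw actions from N(mu, sigma) and squash them with tanh for the environment."""
    mu, _ = model(states)
    dist = torch.distributions.Normal(mu, sigma)
    raw_actions = dist.sample()
    return torch.tanh(raw_actions), raw_actions          # (actions_tanh, raw action values)

def get_action_prob(model, batch_s, batch_a, sigma=SIGMA):
    """Recompute log-probabilities of the stored raw actions, plus the critic state values."""
    mu, values = model(batch_s)
    dist = torch.distributions.Normal(mu, sigma)
    log_probs = dist.log_prob(batch_a).sum(dim=-1, keepdim=True)
    return log_probs, values
```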

3.5 Loss Function

In A2C, the loss function is also commonly referred to as the objective function.

Note that the original loss function includes an entropy term. Entropy is used to encourage action exploration in many algorithms, including A2C. However, I drop the entropy term from the policy loss, contrary to most other implementations, because I am dealing with a multi-dimensional action space and have no clear way to specify the entropy for it.

Instead, I follow [ShangtongZhang's work], which assumes σ, the spread that drives action exploration, to be constant, so that the entropy is constant in all cases as well. This way, I can safely drop the entropy term from the policy loss.

As for the value loss, it reuses the same quantity that appears in the policy loss: the difference between the target value and the critic's estimate. That leads to my policy loss and value loss as follows.
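In a standard advantage actor-critic form (a reference formulation; the expressions in my notebook may differ slightly in notation), with $v_t$ denoting the processed target value stored in batch_v_t and $V(s_t)$ the critic's estimate:

$$L_{\text{policy}} = -\frac{1}{N}\sum_{t}\log \pi(a_t \mid s_t)\,\big(v_t - V(s_t)\big)$$

$$L_{\text{value}} = \frac{1}{N}\sum_{t}\big(v_t - V(s_t)\big)^2$$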

Now, I wrap up the loss computation and the network parameter update in a self-defined function — learn(). You may notice that the learning code here is structured rather differently from the previous models: it is a stand-alone function instead of a method of an agent class.

Function — Compute Loss and Trigger Learning for A2C Model
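A minimal sketch of such a stand-alone learn() function is given below, assuming an Adam-style optimizer over the shared A2C model and the same fixed σ; the real function may add details such as gradient clipping or loss weighting.

```python
import torch

def learn(model, optimizer, batch_s, batch_a, batch_v_t, sigma=0.1):
    """Compute policy and value losses from one collected batch and update the network."""
    states = torch.from_numpy(batch_s).float()
    raw_actions = torch.from_numpy(batch_a).float()
    targets = torch.from_numpy(batch_v_t).float()

    mu, values = model(states)
    dist = torch.distributions.Normal(mu, sigma)
    log_probs = dist.log_prob(raw_actions).sum(dim=-1, keepdim=True)

    advantages = targets - values
    policy_loss = -(log_probs * advantages.detach()).mean()   # entropy term dropped (constant sigma)
    value_loss = advantages.pow(2).mean()

    loss = policy_loss + value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return policy_loss.item(), value_loss.item()
```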

3.6 Weight Update

In the A2C model, all weights are updated directly from the gradients of the current trajectory batch, meaning no soft update is applied here.

3.7 Hyper-parameters in a Nutshell

The following is an overview of the hyper-parameter settings:

  • Number of learning episodes: 1000
  • N-Step: 10 # length of transition trajectories
  • Learning rate: 0.00015
  • GAMMA: 0.99

3.8 Construct Training Process

In this section, I don't wrap the training process in a separate function. The code linked here directly monitors the learning progress and saves the model once training is finished.

3.9 Training Result — High Efficiency

During training, once the agents collect a new batch of N-step transition experience, the batch is immediately used to compute the loss and update the actor and critic layers' parameters. Notice that the last state of each batch becomes the initial state of the next batch if none of the agents has finished its episode. Conversely, if any agent's episode is done, all agents stop, leave the current episode and restart a new one.

From the result shown below, you can tell the A2C model is very efficient. The agents learn the task and reach the goal episodic score of 30 in fewer than 1,000 episodes. Moreover, the training progress is consistent and stable, both during a single learning run and when re-training from scratch; you get pretty much the same result whenever you re-train the agents.

Average Episodic Score for A2C Agents

Looking at the rolling-score plot, we can further confirm that learning progresses quite smoothly. The deviation and fluctuation are far smaller than in the D4PG model. Finally, the last picture shows an animation of the training result.

Rolling Episodic Score over Episodes (A2C)
Multiple A2C Agents Control a Ball (final training result)

4. Comparison of All Algorithms and Conclusion

Among these trials, the A2C model achieves the best performance and efficiency; on top of that, its re-training results remain very consistent. Given that it is a multi-agent scenario and uses 10-step transition trajectories, the outcome shouldn't be too surprising.

On the other hand, D4PG runs in a single-agent scenario, and I use only one-step transition trajectories. Although it is not as efficient as A2C, it still gives a reasonably satisfactory outcome. However, the re-training results are not so consistent; in some trials you may find the agent stuck in a local optimum. In my case, it takes 5,000 episodes to reach the goal score. My setting triggers a parameter update every 150 time-steps; perhaps I could increase the update frequency to improve efficiency, but that would risk sacrificing stability, which is already wobbly. It is fundamentally a trade-off between efficiency and stability.

The last one is DDPG. Well, it doesn't work out for this task. The training result shown in the previous article uses 1,000 episodes, but I have also experimented with longer training, up to 5,000 episodes. None of the runs solve the task. Perhaps the task is simply too complicated: the observation state contains 33 variables and the agent has 4 action variables. It seems DDPG just isn't effective enough to deal with that level of complexity.

Learning Progress Compared on all Algorithms — DDPG, D4PG and A2C

As a final reflection on how to further improve the whole experiment, I may attempt to re-write the code using Python's multiprocessing module, which would allow the algorithm to run in parallel environments. This is an ongoing trend in reinforcement learning and thus worth trying. Besides, I might try to re-code D4PG to train on multi-step trajectories in the future, which should improve the stability of the D4PG model.

If you enjoy the post, feel free to point out any mistakes I made or leave feedback in the comment box. The full implementation, including the code and the Jupyter notebook, can be found in this [link].
