[RL] Train the Robotic Arm to Reach a Ball — Part 02
Comparing Learning Efficiency Across DDPG, D4PG and A2C
Recap
As mentioned in Part 01, DDPG fails to solve the task no matter how long the agent trains. The average episodic reward lingers around 0.04 to 0.05, nowhere near the target reward of 30. So below, I experiment with the other two algorithms: D4PG in a single-agent scenario and A2C in a multi-agent scenario.
Again, here is a quick guide to the structure of this article.
1. Structure of the Report
- 1.1 Train on a Single-Agent Scenario — D4PG
This time, I experiment with a D4PG agent. It is the most recently published of the three algorithms, claimed to cope with complicated environments more efficiently than DDPG. This section walks through the model structure, replay memory, action exploration and learning process, highlighting the differences from DDPG while quickly skimming the parts they share.
- 1.2 Train on a Multi-Agent Scenario — A2C
The A2C model is implemented in a multi-agent environment. This section covers how it is built, how it collects experience and how it learns along the way. The A2C code is rather different from that of DDPG and D4PG, since it must reflect how A2C uses multiple agents to collect experience and update network parameters. Again, the training result is included at the end.
- 1.3 Comparison of All Algorithms and Conclusion
Finally, we are ready to compare how the three algorithms perform. The rolling episodic reward of each one is plotted on the same graph, so we can better perceive the learning trend and efficiency of each model.
2. Train on a Single Agent Scenario — D4PG
As we saw in Part 01, the DDPG model doesn't solve the task successfully, so I turn to another algorithm, [D4PG], the most recent of the three, published in 2018. The code is mainly adapted from this book: [Deep-Reinforcement-Learning-Hands-On].
First, I import some self-defined modules to configure the whole setting before training. These modules include:
- d4pg_model: classes defining the Actor and Critic neural network structures for D4PG.
- replay_memory: collects and samples transition experience for training.
- d4pg_agent: defines how a D4PG agent interacts with the environment and implements the training process.
2.1 Model Structure
I follow the same model structure as specified for DDPG, except that the critic network's output size is changed to N_ATOMS. For the rest, both the Critic and the Actor have two hidden layers of 128 and 64 units, the same as in DDPG.
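As a rough sketch of what this could look like in PyTorch (the layer names and constructor signatures here are my own illustration; the state size of 33 and action size of 4 come from the environment described later in this article):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ATOMS = 51  # number of atoms in the critic's value distribution

class Actor(nn.Module):
    """Same structure as the DDPG actor: two hidden layers of 128 and 64 units."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, 64)
        self.out = nn.Linear(64, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.out(x))  # actions bounded in [-1, 1]

class Critic(nn.Module):
    """Like the DDPG critic, except the output is a distribution over N_ATOMS
    value atoms rather than a single scalar Q-value."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128 + action_size, 64)
        self.out = nn.Linear(64, N_ATOMS)  # logits over value atoms

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.out(x)  # raw logits; softmax is applied inside the loss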
2.2 Replay Memory
To conform to the data types required by the D4PG agent (adapted from [Deep-Reinforcement-Learning-Hands-On]), sampling is done via the sample2() function defined in the replay memory object. The replay memory size is set to 100,000. The detailed code snippet is in this [link].
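A minimal sketch of such a buffer is below; the exact return layout of the article's sample2() lives in the linked snippet, so treat this unpacked-arrays version as an assumption:

```python
import random
from collections import deque
import numpy as np

class ReplayMemory:
    """Fixed-size buffer storing (s, a, r, s', done) transitions."""
    def __init__(self, buffer_size=100_000):
        self.memory = deque(maxlen=buffer_size)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample2(self, batch_size=64):
        """Return a random batch as separate numpy arrays, the layout the
        D4PG agent code adapted from the book expects."""
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)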
2.3 Action Exploration
One minor difference between D4PG and DDPG is action exploration. D4PG encourages exploration with simple random noise drawn from a normal distribution instead of OU noise. The modified code snippet is in this [link].
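The Gaussian-noise exploration can be sketched in a few lines (the function name and noise scale here are illustrative, not taken from the linked snippet):

```python
import numpy as np

def add_exploration_noise(action, noise_scale=0.3):
    """D4PG exploration: perturb the deterministic action with simple
    Gaussian noise (instead of the OU process used in DDPG), then clip
    back into the environment's valid action range [-1, 1]."""
    noisy = action + noise_scale * np.random.randn(*action.shape)
    return np.clip(noisy, -1.0, 1.0)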
2.4 Loss Function
D4PG can train on multi-step transition trajectories (N-step returns), but I choose to train on a single time-step for simplicity. According to other reviews, one-step training is the most unstable and is not recommended, but I go with it anyway. Hence, the following code for the loss and agent learning is based on one-step transition trajectories.
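The heart of the one-step D4PG loss is the categorical projection: shift each value atom by the Bellman update, then redistribute its probability onto the fixed support, and train the critic with cross-entropy against that projected target. The sketch below follows the approach in [Deep-Reinforcement-Learning-Hands-On]; the hyper-parameter values match section 2.6, but the function names are my own illustration:

```python
import torch
import torch.nn.functional as F

N_ATOMS, VMIN, VMAX, GAMMA = 51, -10.0, 10.0, 0.99
DELTA_Z = (VMAX - VMIN) / (N_ATOMS - 1)

def project_distribution(next_probs, rewards, dones):
    """Project the target atoms z' = r + gamma * z back onto the fixed
    support {VMIN, ..., VMAX}, for one-step transitions."""
    batch_size = next_probs.size(0)
    support = torch.linspace(VMIN, VMAX, N_ATOMS)
    # Bellman update of each atom, clipped to the support range
    tz = rewards.unsqueeze(1) + GAMMA * (1.0 - dones.unsqueeze(1)) * support.unsqueeze(0)
    tz = tz.clamp(VMIN, VMAX)
    b = (tz - VMIN) / DELTA_Z            # fractional atom index
    lower = b.floor().long()
    upper = lower + 1
    w_lower = upper.float() - b          # mass assigned to the lower atom
    w_upper = b - lower.float()          # mass assigned to the upper atom
    upper = upper.clamp(max=N_ATOMS - 1) # keep the index in range at b == N_ATOMS - 1
    proj = torch.zeros(batch_size, N_ATOMS)
    offset = (torch.arange(batch_size) * N_ATOMS).unsqueeze(1)
    proj.view(-1).index_add_(0, (lower + offset).view(-1),
                             (next_probs * w_lower).view(-1))
    proj.view(-1).index_add_(0, (upper + offset).view(-1),
                             (next_probs * w_upper).view(-1))
    return proj

def critic_loss(critic_logits, target_proj):
    """Cross-entropy between the projected target and the predicted distribution."""
    log_probs = F.log_softmax(critic_logits, dim=1)
    return -(target_proj * log_probs).sum(dim=1).mean()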
2.5 Weight Update
The weights are soft-updated by soft_updated(), the same as in DDPG.
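For reference, a soft update is just Polyak averaging of the target network toward the local one; a minimal sketch (the body is assumed, matching the article's function name and the Tau of 1e-3 below):

```python
import torch
import torch.nn as nn

def soft_updated(local_model, target_model, tau=1e-3):
    """Polyak averaging: target <- tau * local + (1 - tau) * target."""
    for target_param, local_param in zip(target_model.parameters(),
                                         local_model.parameters()):
        target_param.data.copy_(tau * local_param.data +
                                (1.0 - tau) * target_param.data)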
2.6 Hyper-parameter in a Nutshell
The following is an overview of the hyper-parameter settings:
- Learning Rate (Actor/Critic): 1e-4
- Batch Size: 64
- Buffer Size: 100000
- Gamma: 0.99
- Tau: 1e-3
- Weight Updates per Learning Trigger: 10
- Learning Triggered Every N Time-steps: 150
- Max Gradient Clipped for Critic: 1
- N-step: 1 # transition trajectory
- N-Atoms: 51 # for critic network output
- Vmax: 10 # parameter for critic network
- Vmin: -10 # parameter for critic network
- Hidden Layer 1 Size: 128
- Hidden Layer 2 Size: 64
2.7 Construct Training Function
To monitor the training progress, I again define a training function, train_d4pg(), which is pretty much the same as train_ddpg(); see the full code snippet in this [link].
2.8 Training Result — Low Efficiency
Owing to the prior failure with DDPG, this time I lower the learning frequency a bit: training is triggered every 150 time-steps, and each training round iterates the weight update 10 times.
I hope this further stabilizes the training, although it may take the agent much longer to learn. The result below shows that the D4PG agent successfully reaches the target episodic score of 30; nonetheless, it takes up to 5,000 episodes to get there, so the learning progress is pretty slow.
Below is the graph of rolling average scores over episodes for D4PG. Observing the plot, we can tell the episodic score deviates hugely from one episode to the next. Clearly, the training progress is not very stable, precisely reflecting the nature of an off-policy algorithm.
Now, let's take a look at how it performs in animation.
3. Train on a Multi-Agent Scenario — A2C
The modules/functions used to build the A2C model are as follows:
- A2CModel: the neural network for the A2C reinforcement learning algorithm.
- collect_trajectories(): collects n-step experience transitions.
- learn(): computes the training loss from collected trajectories and updates the network's weights.
3.1 Brief Background of the Environment
This time, I use another environment that activates 20 agents simultaneously, each with its own copy of the environment. The experiences of these 20 agents are gathered and shared with one another.
3.2 Model Structure
The model is a simple network of two fully-connected layers, with 128 and 64 units respectively.
It then branches into an actor layer and a critic layer (instead of separate actor and critic networks as in the previous models). Both the actor and critic layers are fully-connected, following the approach of [the original A3C algorithm paper].
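A sketch of this shared-body architecture is below (layer names and sizes follow the description above; returning the raw action mean μ here is an assumption, since the tanh squashing is applied after sampling in section 3.4):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class A2CModel(nn.Module):
    """Shared body of two fully-connected layers (128 and 64 units) that
    branches into an actor head (action means) and a critic head (state value)."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.fc2 = nn.Linear(128, 64)
        self.actor = nn.Linear(64, action_size)  # mu of the action distribution
        self.critic = nn.Linear(64, 1)           # state value V(s)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.actor(x), self.critic(x)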
3.3 Collect On-Site Transition Trajectory
A2C is an on-policy RL algorithm, so there is no such thing as replay memory. Instead, it uses the currently collected transition experience to update its network.
In the next code snippet, I define the collect_trajectories() function. It takes as inputs the A2C model, the environment, and the number of time-steps to collect. As the model interacts with the environment, all actions and feedback are stored in objects such as batch_s, batch_a and batch_r, standing for state, action and reward respectively. Once the collected experience reaches the required number of time-steps, or the episode ends, the function normalizes and discounts the reward of each step, arrives at the final target value (processed reward) for each time-step, and stores it in the batch_v_t object.
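The reward processing step can be sketched as follows. The article only states that rewards are normalized and discounted, so the exact normalization scheme here (per step, across the 20 agents) is an assumption:

```python
import numpy as np

def process_rewards(batch_r, gamma=0.99, normalize=True):
    """Convert raw per-step rewards of a collected trajectory into the
    discounted targets batch_v_t. batch_r has shape (n_steps, n_agents)."""
    batch_r = np.asarray(batch_r, dtype=np.float64)
    if normalize:
        # normalize each step's rewards across the agents (assumed scheme)
        mean = batch_r.mean(axis=1, keepdims=True)
        std = batch_r.std(axis=1, keepdims=True) + 1e-8
        batch_r = (batch_r - mean) / std
    # discounted running sum, computed backwards from the last step
    batch_v_t = np.zeros_like(batch_r)
    running = np.zeros(batch_r.shape[1])
    for t in reversed(range(batch_r.shape[0])):
        running = batch_r[t] + gamma * running
        batch_v_t[t] = running
    return batch_v_t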
3.4 Action Exploration
Actions are sampled from a normal distribution where μ depends on each state while σ is given as an argument. Furthermore, the action output is passed through tanh() activation so its values are squashed between -1 and 1, as required by the environment.
Besides, in order to retrieve the log probability of actions later on, I use a trick: I define get_action() to return both the tanh-squashed actions and the raw action values. The raw action values are stored in batch_a. Then, during the learning phase, they are passed along with the states to get_action_prob(batch_s, batch_a) to get the corresponding log probabilities of the actions.
As for the critic state value, it is simply the output of the state passed through the critic layer.
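The trick described above can be sketched like this, assuming the model returns the action mean μ and state value (the fixed σ value and function bodies are my own illustration):

```python
import torch
from torch.distributions import Normal

SIGMA = 0.5  # fixed exploration std, assumed value

def get_action(model, states, sigma=SIGMA):
    """Sample raw actions from Normal(mu(s), sigma) and return both the
    tanh-squashed version sent to the environment and the raw values, so
    their log probability can be recomputed during learning."""
    mu, _ = model(states)
    dist = Normal(mu, sigma)
    raw_actions = dist.sample()
    return torch.tanh(raw_actions), raw_actions

def get_action_prob(model, batch_s, batch_a, sigma=SIGMA):
    """Log probability of the stored raw actions under the current policy,
    plus the critic's state values."""
    mu, values = model(batch_s)
    dist = Normal(mu, sigma)
    # sum the log probs over the action dimensions
    return dist.log_prob(batch_a).sum(dim=-1), values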
3.5 Loss Function
The loss function in A2C is also known as the objective function.
Note that the original loss function includes an entropy term. Entropy is used to encourage action exploration in many algorithms, including A2C. However, I drop the entropy term from the policy loss, contrary to most other implementations, because I am dealing with a multi-dimensional action space and have no clue how to specify entropy for it.
Instead, I consult [ShangtongZhang's work], which assumes σ, the spread that drives action exploration, to be constant, so the entropy is constant in all cases as well. This way, I can safely drop the entropy term from the policy loss.
As for the value loss, the critic's state value is itself a component of the policy loss, through the advantage. That leads to my policy loss and value loss as follows,
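In code, the two losses reduce to a few lines (a sketch under the assumptions above: entropy dropped, advantage computed against the discounted targets batch_v_t):

```python
import torch
import torch.nn.functional as F

def a2c_losses(log_probs, values, batch_v_t):
    """Policy loss = -E[log pi(a|s) * advantage]; value loss = MSE between
    the critic's V(s) and the discounted targets. No entropy term, since
    sigma is held constant."""
    # stop the gradient through the critic when computing the actor's advantage
    advantages = batch_v_t - values.detach()
    policy_loss = -(log_probs * advantages).mean()
    value_loss = F.mse_loss(values, batch_v_t)
    return policy_loss, value_loss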
Now, I wrap up the loss computation and network parameter update in the self-defined function learn(). You may notice that the learning code here is structured rather differently from the previous models: it is a stand-alone function instead of a sub-class object belonging to the agent.
3.6 Weight Update
In the A2C model, all weights are directly updated by the gradients from the current trajectory batch, meaning no soft-update is applied here.
3.7 Hyper-parameters in a Nutshell
The following is an overview of the hyper-parameter settings:
- Number of learning episode: 1000
- N-Step: 10 # length of transition trajectories
- Learning rate: 0.00015
- GAMMA: 0.99
3.8 Construct Training Process
In this section, I don't wrap the training process in a function. The code linked here directly monitors the learning progress and saves the model after training finishes.
3.9 Training Result — High Efficiency
During training, as soon as the agents collect a new batch of N-step transition experience, the batch is used to compute the loss and immediately update the actor and critic layers' parameters. Note that the last state of each batch becomes the initial state of the next batch if none of the agents' episodes is done yet. Conversely, if any episode is done, all agents stop, leave the current episode and restart a new one.
From the result shown below, you can tell the A2C model is very efficient. The agents pick up the task and reach the goal episodic score of 30 in fewer than 1,000 episodes. Moreover, the training progress is consistent and stable, both during the learning process and when re-training from scratch: you get pretty much the same result whenever you re-train the agents.
Observing the plot of rolling scores, we are further convinced that the learning progress is pretty smooth. The deviation, or fluctuation, is far smaller than in the D4PG model. Finally, the last picture is an animation of the training result.
4. Comparison of All Algorithms and Conclusion
Among these trials, the A2C model achieves the best performance and efficiency; on top of that, its re-training results remain very consistent. But given that it is a multi-agent scenario using 10-step transition trajectories, the outcome shouldn't be too surprising.
On the other hand, D4PG runs in a single-agent scenario, and I use only one-step transition trajectories. Although not as efficient as A2C, it still gives a somewhat satisfactory outcome. However, its re-training results are not so consistent: you may find your agent stuck in a local optimum in some trials. In my case, it takes 5,000 episodes to reach the goal score. My setting triggers a parameter update every 150 time-steps; perhaps I could increase the update frequency to improve efficiency, but that would risk sacrificing stability, which is already wobbly. It is fundamentally a trade-off between efficiency and stability.
The last one is DDPG. Well, it doesn't work out for this task. The training result demonstrated in the previous article uses 1,000 episodes, but I have experimented with other training lengths, up to 5,000 episodes; none of them solves the task. Perhaps the task is genuinely complicated: the observation state contains 33 variables and the agent has 4 action variables. It seems DDPG simply isn't effective enough to deal with complexity at that level.
As a final reflection on how to further improve the experiment, I may attempt to re-write the code using Python's multiprocessing module, which would let the algorithms run in parallel environments. This is an ongoing trend in reinforcement learning and thus worth trying. Besides, I might try to re-code D4PG to train on multi-step trajectories in the future, which should improve its stability.
If you enjoyed the post, feel free to point out any mistakes I made or leave feedback in the comment box. The full implementation of the code and the Jupyter notebook can be found in this [link].
Reference
[1] M. Lapan, Hands-on Deep Reinforcement Learning (2018), Github
[2] S. Zhang, Modularized Implementation of Deep RL Algorithms in PyTorch (2018), Github
[3] M. Zhou, Simple implementation of Reinforcement Learning (A3C) using PyTorch (2018), Github
[4] Hungryof, On the Use of Buffers in Models (2018), CSDN