Unity-ML Agents Course

Diving deeper into Unity-ML Agents

Train a curious agent to destroy Pyramids.

Thomas Simonini
Towards Data Science
9 min read · Feb 3, 2020


We launched a new free, updated, Deep Reinforcement Learning Course from beginner to expert, with Hugging Face 🤗

👉 The new version of the course: https://huggingface.co/deep-rl-course/unit0/introduction

The chapter below is the former version, the new version is here 👉 https://huggingface.co/deep-rl-course/unit5/introduction?fw=pt

Unity ML-Agents

Last time, we learned about how Unity ML-Agents works and trained an agent that learned to jump over walls.

This was a nice experience, but we want to create agents that can solve more complex tasks. So today we’ll train a smarter one that needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top.

To train this new agent, which must seek out that button and then the pyramid to destroy, we’ll use a combination of two types of rewards: the extrinsic one given by the environment, and an intrinsic one called curiosity. This second reward will push our agent to be curious, or in other terms, to better explore its environment.

So today we’ll learn about the theory behind this powerful idea of curiosity in deep reinforcement learning and we’ll train this curious agent.

Let’s get started!

What is Curiosity in Deep RL?

I already covered curiosity in detail in two other articles, here and here, if you want to dive into the mathematical and implementation details.

Two Major Problems in Modern RL

To understand what curiosity is, we first need to understand the two major problems with RL:

First, the sparse rewards problem: that is, most rewards do not contain information, and hence are set to zero.

Remember that RL is based on the reward hypothesis: the idea that every goal can be described as the maximization of expected cumulative reward. Rewards therefore act as feedback for RL agents; if they don’t receive any, their knowledge of which actions are appropriate (or not) cannot change.

Thanks to the reward, our agent knows that this action at that state was good

For instance, in the ViZDoom “DoomMyWayHome” scenario, your agent is only rewarded if it finds the vest. However, the vest is far away from your starting point, so most of your rewards will be zero. Therefore, if our agent does not receive useful feedback (dense rewards), it will take much longer to learn an optimal policy, and it may spend most of its time wandering around without ever finding the goal.

A big thanks to Felix Steger for this illustration

The second big problem is that the extrinsic reward function is handmade: in each environment, a human has to implement a reward function. But how can we scale that to big and complex environments?

So what is curiosity?

Therefore, a solution to these problems is to develop a reward function that is intrinsic to the agent, i.e., generated by the agent itself. The agent will act as a self-learner, since it is both the student and its own feedback master.

This intrinsic reward mechanism is known as curiosity, because it pushes the agent to explore states that are novel/unfamiliar. To achieve that, our agent receives a high reward when it explores new trajectories.

This reward is in fact modeled on how humans act: we naturally have an intrinsic desire to explore our environment and discover new things.

There are different ways to calculate this intrinsic reward; Unity ML-Agents implements curiosity through the next-state prediction method.

Curiosity Through Prediction-Based Surprise (or Next-State Prediction)

I already covered this method here if you want to dive into the mathematical details.

So we just said that curiosity is high when we are in unfamiliar/novel states. But how can we calculate this “unfamiliarity”?

We can calculate curiosity as the error our agent makes when predicting the next state, given the current state and the action taken. More formally, we can define this as:
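
As a rough sketch following the notation of the Intrinsic Curiosity Module paper (where phi is the learned feature encoding of a state, phi-hat is the forward model’s prediction of it, and eta is a scaling factor), the intrinsic reward at time t can be written as:

$$ r^{i}_{t} = \frac{\eta}{2} \, \bigl\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \bigr\rVert^{2}_{2} $$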

Why? Because the idea of curiosity is to encourage our agent to perform actions that reduce the uncertainty in the agent’s ability to predict the consequences of its own actions (uncertainty will be higher in areas where the agent has spent less time, or in areas with complex dynamics).

If the agent spends a lot of time in these states, it will become good at predicting the next state (low curiosity); on the other hand, if it’s a new, unexplored state, it will be bad at predicting the next state (high curiosity).

Let’s break it down further. Say you play Super Mario Bros:

  • If you spend a lot of time at the beginning of the game (which is not new), the agent will be able to accurately predict what the next state will be, so the reward will be low.
  • On the other hand, if you discover a new room, our agent will be very bad at predicting the next state, so the agent will be pushed to explore this room.

Using curiosity will push our agent to favor transitions with high prediction error (which will be higher in areas where the agent has spent less time, or in areas with complex dynamics) and consequently better explore our environment.

But because predicting the next state directly in pixel space is too complicated, we instead use a better feature representation: one that keeps only the elements that can be controlled by our agent, or that affect our agent.

And to calculate curiosity, we will use a module introduced in the paper Curiosity-driven Exploration by Self-Supervised Prediction, called the Intrinsic Curiosity Module (ICM).

If you want to know how it works, check our detailed article.
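
To make the idea concrete, here is a minimal PyTorch sketch of prediction-based curiosity. It is only an illustration under simplified assumptions, not ML-Agents’ actual implementation: ForwardModel, curiosity_reward and the scale value are made up for the example, and the state embeddings below are random, whereas the real ICM learns them jointly with an inverse model.

import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    # Predicts the next-state embedding from the current embedding and the action taken.
    def __init__(self, embed_size=32, action_size=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_size + action_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, embed_size),
        )

    def forward(self, state_embed, action_onehot):
        return self.net(torch.cat([state_embed, action_onehot], dim=-1))

def curiosity_reward(model, state_embed, action_onehot, next_state_embed, scale=0.02):
    # Intrinsic reward = scaled prediction error on the next-state embedding:
    # high in states the forward model has rarely seen, low in well-explored ones.
    with torch.no_grad():
        predicted = model(state_embed, action_onehot)
        error = 0.5 * (predicted - next_state_embed).pow(2).sum(dim=-1)
    return scale * error

# Toy usage with random embeddings and a one-hot action.
model = ForwardModel()
s, a, s_next = torch.randn(1, 32), torch.eye(4)[[0]], torch.randn(1, 32)
print(curiosity_reward(model, s, a, s_next))

Roughly speaking, ML-Agents then combines such an intrinsic reward with the extrinsic one during training, weighted by the curiosity strength set in the trainer configuration.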

Train an agent to destroy pyramids

Now that we understand what curiosity through next-state prediction is and how it works, let’s train this new agent.

We published our trained models on GitHub; you can download them here.

The Pyramid Environment

The goal in this environment is to train our agent to get the gold brick on top of the pyramid. In order to do that, it needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top.

The reward system is: +2 for reaching the gold brick, and a small existential penalty of -0.001 per step to encourage the agent to be fast.

In terms of observations, we use the raycast version: 148 raycasts that detect the switch, bricks, the golden brick, and walls.

We also use a boolean variable indicating the switch state.

The action space is discrete, with 4 possible actions corresponding to forward/backward movement and left/right rotation.

Our goal is to hit the benchmark with a mean reward of 1.75.

Let’s destroy some pyramids!

First of all, let’s open the UnitySDK project.

In the examples search for Pyramids and open the scene.

Like in WallJump, you can see a lot of agents in the scene; each of them comes from the same Prefab, and they all share the same Brain (policy).

Multiple copies of the same Agent Prefab.

In fact, just as in classical Deep Reinforcement Learning we launch multiple instances of a game (for instance, 128 parallel environments), here we do the same by copying and pasting the agents, in order to gather more varied states.

So first, because we want to train our agent from scratch, we need to remove the brain from the agent Prefab. Go to the Prefabs folder and open the Prefab.

Now in the Prefab hierarchy, select the Agent and go into the inspector.

In Behavior Parameters, we need to remove the Model. If you have some GPU you can change Inference Device from CPU to GPU.

For this first training, we’ll just modify the total number of training steps, since the default is too high and we can hit the benchmark in only 500k training steps. To do that, go to config/trainer_config.yaml and change max_steps to 5.0e5 in the Pyramids section:
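
The Pyramids entry should then look roughly like this. Treat it as a sketch: the surrounding values are the defaults shipped with ML-Agents at the time and may differ in your version; the line that matters here is max_steps, and note the curiosity reward signal that this whole article relies on.

Pyramids:
    summary_freq: 30000
    time_horizon: 128
    batch_size: 128
    buffer_size: 2048
    hidden_units: 512
    num_layers: 2
    max_steps: 5.0e5
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.99
        curiosity:
            strength: 0.02
            gamma: 0.99
            encoding_size: 256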

To train this agent, we will use PPO (Proximal Policy Optimization); if you don’t know about it, or you need to refresh your knowledge, check my article.

We saw that to train this agent, we need to call our External Communicator using the Python API. This External Communicator will then ask the Academy to start the agents.

So, open your terminal, go to where ml-agents-master is, and type this:

mlagents-learn config/trainer_config.yaml --run-id="Pyramids_FirstTrain" --train

It will ask you to run the Unity scene:

Press the ▶️ button at the top of the Editor.

You can monitor your training by launching Tensorboard using this command:

tensorboard --logdir=summaries

Watching your agent destroying pyramids

You can watch your agent during the training by looking at the game window.

When the training is finished you need to move the saved model files contained in ml-agents-master/models to UnitySDK/Assets/ML-Agents/Examples/Pyramids/TFModels.

And again, open the Unity Editor and select the Pyramids scene.

Select the Pyramids prefab object and open it.

Select Agent

In Agent Behavior Parameters, drag the Pyramids.nn file to Model Placeholder.

Then, press the ▶️ button at the top of the Editor.

Time for some experiments

We’ve just trained our agent to press the button, destroy the pyramid, and reach the gold brick. Now that we have good results we can try some experiments.

Remember that the best way to learn is to be active by experimenting. So you should try to make some hypotheses and verify them.

By the way, there is an amazing video by Immersive Limit about hyperparameter tuning for the Pyramids environment that you should definitely watch.

Increasing the time horizon to 256

The time horizon, as explained in the documentation, is the number of steps of experience to collect per-agent before putting it into the experience buffer. This trades off between a long time horizon (less biased, but higher variance estimate), and a short time horizon (more biased, but less varied estimate).

In this experiment, we doubled the time horizon from 128 to 256. Increasing it allows our agent to capture more of the important behavior in its sequence of actions than before.
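
If you want to reproduce this, it is just the time_horizon line in the same Pyramids section of config/trainer_config.yaml (a sketch, with the other lines left unchanged):

Pyramids:
    time_horizon: 256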

However, this didn’t have an impact on the training of our new agent: both runs gave roughly the same results.

We published our trained models on GitHub; you can download them here.

That’s all for today!

You’ve just trained a smarter agent than last time. And you’ve also learned about Curiosity in Deep Reinforcement Learning. That’s awesome!

Now that we’ve done that, you might want to go deeper with Unity ML-Agents. Don’t worry: next time we’ll create our own environments, and in the article after that we’ll write our own reinforcement learning implementations.

So in the next article, we’ll create our first environment from scratch. What will this environment be? I don’t want to spoil everything now, but here’s a hint:

Say hello to Mr. Bacon 🐽

See you next time!

If you have any thoughts, comments, questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me @ThomasSimonini.

Keep learning, stay awesome!

Chapter 3: The Mayan Adventure
