CubeTrack: Deep RL for active tracking with Unity + ML-Agents

CubeTrack in Game View (Image by author)

CubeTrack, for me, was an exercise in getting my hands dirty with active object tracking (AOT) and game engine supported deep reinforcement learning (RL), whilst I await the development of a much more complex, realistic simulation of my real-world deployment environment. Up until this point I had worked on off-the-shelf Gym environments, but the need for a custom environment had led me to Unity and ML-Agents.

ML-Agents example environments (Image from Unity ML-Agents Toolkit Github Repository)

The example environments that ship with ML-Agents do not include a tracking problem, so I followed the guidelines for creating a new environment, keeping to the same setup of a cube on a platform. With such a similar, simple design, I figured it wouldn’t take much effort to tidy the project up, swap out the game objects I had flung together with some of the ML-Agents project prefabs (just in the interest of style consistency), and make the project a public repo. The example environments are super useful for learning, practising, testing and building upon. Hopefully CubeTrack can be used in the same way but in a different problem space.

The project is available at https://github.com/kirstenrichardson/CubeTrack

If you just want to use the environment, then everything you need to know is in the ReadMe. If you are new to RL, I recommend taking a look at some of the videos on Xander Steenbrugge‘s YouTube channel Arxiv Insights for a general overview and a closer look at PPO (proximal policy optimization), the algorithm used in the work presented here. Or, if you have more time and would like to dive deeper, I recommend the Introduction to RL lecture series with David Silver of DeepMind.

If you are already familiar with RL but are interested in how to tackle a custom problem with a custom environment in Unity, then read on and I’ll introduce ML-Agents and how CubeTrack was put together. If you are already familiar with Unity and ML-Agents and just want to know about the learning-based decisions that facilitated successful tracking, then skip ahead to Design Choices.


ML-Agents

ML-Agents is a plugin that transforms regular Unity scenes into learning environments. It provides a Python low-level API (contained in the Python package mlagents_envs) that handles communication between a communicator residing within the Unity environment and a Python training script. The training script can be one of the implementations that ship with the toolkit (part of their own Python package, mlagents) or an implementation of your choice, with the option of wrapping the Unity environment in a gym wrapper using the third Python package, gym-unity. After cloning the ML-Agents repo¹ into a local directory (set up as a pipenv) and installing the various ML-Agents packages, I opened Unity Hub and created a new 3D Unity project at the same location.

Version information:
  ml-agents: 0.18.0,
  ml-agents-envs: 0.18.0,
  Communicator API: 1.0.0,
  TensorFlow: 1.14.0,
  Unity: 2019.4.4f1,
  Python: 3.6.9

Unity Scene

Nothing flashy here. All I needed was a non-player character (NPC, a game object not controlled by a human player) moving around an area randomly to provide a moving target, and a game object representing the tracker (or agent, to use RL terminology) moving in response to some external ‘player’ input. Confusingly, Unity can refer to NPCs as AI controlled, but AI is being used there to mean preprogrammed and therefore independent, as opposed to actively learning, as is the case with the ‘player’: our RL model, which sits external to Unity.

Example of Hierarchy Window (Image by author)

CubeTrack has two near-identical scenes, CubeTrack and VisualCubeTrack, the latter with the addition of a first person view camera on the agent cube. Other than game cameras and lighting, the scenes simply contain a prefab called Area (or VisualArea) – an empty game object containing all the components that make up the RL environment. This container prefab makes it easy to duplicate the environment and having multiple environment instances can speed up training, exposing the ‘shared brain’ of our agent cube fleet to multiple experiences in parallel. The Area can be explained in three parts:

  • Training court: A series of cubes that form the ground and walls of the court. In addition, 13 cylinders (Waypoints) are positioned on the ground in a grid pattern, with their Mesh Renderer and Capsule Collider components unticked, so that they are both invisible and able to be passed through.
  • Target: A purple cube, with smaller child object cubes as the headband, eyes and mouth. To have the cube move around the platform, a component called a NavMesh Agent was added along with a script called TargetMovement. A NavMesh was baked onto the ground (with the Area prefab open so that it exists for each Area instance) and every waypoint was given the tag RandomPoint. It is then possible from the script to select a random waypoint from the list of objects with the tag RandomPoint and set that waypoint’s position as the destination for the NavMesh Agent, i.e. the target cube (a minimal sketch of this appears below the figure). See this video for more detail.
  • Agent: A blue cube with headband, eyes and mouth, this time with a Rigidbody component enabling it to act under the control of physics and have forces applied. You can play with the mass and drag parameters of the Rigidbody to change how the cube slides. When it came to using visual observations, a camera was added as a child object and positioned to point out of the front face (the one looking down transform.forward).
Scene View showing waypoint objects and agent camera FOV (Image by author)
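
The waypoint-hopping logic for the target can be kept very small. Below is a minimal sketch of what a script like TargetMovement might contain; it assumes the NavMesh Agent component sits on the same game object and that the waypoints carry the RandomPoint tag, as described above. The actual CubeTrack script may differ in detail.

using UnityEngine;
using UnityEngine.AI;

public class TargetMovementSketch : MonoBehaviour
{
    private NavMeshAgent navAgent;
    private GameObject[] waypoints;

    void Start()
    {
        navAgent = GetComponent<NavMeshAgent>();
        // The invisible waypoint cylinders all carry the RandomPoint tag.
        // With multiple Area instances you would restrict this search to the current Area.
        waypoints = GameObject.FindGameObjectsWithTag("RandomPoint");
        PickNewDestination();
    }

    void Update()
    {
        // Once the current destination is (almost) reached, pick another waypoint
        if (!navAgent.pathPending && navAgent.remainingDistance < 0.5f)
        {
            PickNewDestination();
        }
    }

    void PickNewDestination()
    {
        var waypoint = waypoints[Random.Range(0, waypoints.Length)];
        navAgent.SetDestination(waypoint.transform.position);
    }
}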

Learning Environment Set Up

Transforming this otherwise ordinary Unity scene into a training arena for artificial intelligence is made super simple by ML-Agents. The documentation for doing this is here, but these are the basic steps ..

  • Install the Unity package

Select the GameObject acting as your RL agent and ..

  • Add a script component (a C# class inheriting from Agent; a minimal skeleton is sketched after this list)
  • Add a Behaviour Parameters component
  • Add a Decision Requester component
  • (optional) add a Camera Sensor component
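
As a rough guide, the script component ends up looking something like the skeleton below. This is a minimal sketch assuming the Agent API as it stood around the 0.18 / Release 4 toolkit used here (later releases changed the action callback signature), and the class and field names are illustrative rather than the exact CubeTrack code.

using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;

public class CubeTrackAgentSketch : Agent
{
    public GameObject Target;   // assigned in the Inspector
    private Rigidbody rBody;

    public override void Initialize()
    {
        rBody = GetComponent<Rigidbody>();
    }

    public override void OnEpisodeBegin()
    {
        // Reset agent and target positions, velocities etc. at the start of each episode
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        // Add vector observations here (left empty when relying on a Camera Sensor)
    }

    public override void OnActionReceived(float[] vectorAction)
    {
        // Apply forces according to the chosen action, then calculate and assign reward
    }
}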

Go to your local ML-Agents repo directory (or local CubeTrack repo directory) and ..

  • Add a config file

With all of these pieces in place, your Unity scene is set up to train a model. To begin training, make sure the Model field under the Behaviour Parameters component of the agent cube says None. Open a terminal and use the single command-line utility …

mlagents-learn ./config/ppo/CubeTrack.yaml --run-id=runName

… with whatever options tagged on the end (use -h to explore the full usage of the mlagents-learn command). Then go to the Unity Editor and hit ▶️. A great deal of flexibility is offered, including resuming a training run that hasn’t reached its end (the global max_steps parameter) or training on top of an existing model. See here for more detail.
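
For example, at the time of writing, resuming an interrupted run or initialising a new run from an existing model looked roughly like the following (double-check the exact flag names with -h for your version):

mlagents-learn ./config/ppo/CubeTrack.yaml --run-id=runName --resume
mlagents-learn ./config/ppo/CubeTrack.yaml --run-id=newRunName --initialize-from=runName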

When a training run is complete or is terminated early, the model weights are saved to a ‘.nn’ (neural network) file in the ‘results’ folder. The mlagents-learn command needs to be called from the same location as the config and results folders you are using. If you have cloned the ML-Agents repo, then you can place your config file alongside the yaml files for the example environments and run the command from the root of the ML-Agents repo directory. The CubeTrack repo has its own config and results folders so that it can be used without a local copy of the ML-Agents repo. In this instance, issue the command from the root of the CubeTrack repo directory.

Inference is as simple as copying and pasting the .nn file to your project assets folder, dragging it into the Model field, and hitting ▶️.

Design Choices

Observations

To begin with, I attempted to foster tracking behaviour using vector observations. The group of variables chosen as the observation set varied across experimental runs, but ended up as the following twelve obs:

  • the agent’s position (2 obs, the x and z component of the object’s Transform Position)
  • the target’s position
  • the agent’s speed (velocity.magnitude)
  • the target’s speed
  • the agent’s facing direction (transform.forward) (a vector, so 3 obs)
  • the target’s facing direction
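
Collected in code, that observation set might look something like the sketch below. The field names follow the skeleton shown earlier and are assumptions on my part; the target’s speed could equally be read from its NavMesh Agent rather than a Rigidbody.

public override void CollectObservations(VectorSensor sensor)
{
    // Agent and target position on the platform (x and z only: 4 obs)
    sensor.AddObservation(transform.position.x);
    sensor.AddObservation(transform.position.z);
    sensor.AddObservation(Target.transform.position.x);
    sensor.AddObservation(Target.transform.position.z);

    // Agent and target speed (2 obs); targetRigidbody is an assumed reference
    sensor.AddObservation(rBody.velocity.magnitude);
    sensor.AddObservation(targetRigidbody.velocity.magnitude);

    // Agent and target facing direction (a Vector3 each, so 6 obs)
    sensor.AddObservation(transform.forward);
    sensor.AddObservation(Target.transform.forward);
}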

Once tracking was achieved in this way, a Camera Sensor was added, the CollectObservations() function was removed, and the Space Size in the Inspector window was set to 0. The Camera Sensor was set up to pass the model 84 x 84 PNG images. The GIF below demos the agent’s POV (note that this is before image resizing – the model receives images from this same camera but in lower resolution). Shout out to the Immersive Limit Visual Chameleon tutorial by Adam Kelly here.

View from agent’s onboard camera – Game View Display 2 (Image by author)

Before I tidied up the project and utilised the game objects used in the ML-Agents example environments, I had given my agent the best chance of succeeding with visual observations by choosing material colours with the highest possible contrast (black target cube, light grey ground and white walls), turning off shading and turning off shadow casting. In this setting, it was sufficient to pass the model grayscale 84 x 84 images (tick box under Camera Sensor component), but in the later version of VisualCubeTrack colour images were required.

(left) Latest version of CubeTrack (right) Early version of CubeTrack (Image by author)

Actions

Initially, the action space was set to continuous. The model passed in two float values, both of which were clipped between -1 and 1 using Mathf.Clamp before serving as the magnitude of force to be applied along the x and z axes concurrently. I later decided on a simple 5-choice discrete action space ..

Action space (Image by author, inspired by Thomas Simonini‘s An Introduction to Unity ML-Agents)

The magnitude of force to apply in each case was tested by allocating each action choice to an arrow key and experimenting with what looked reasonable. A magnitude of 500 was set for forward / reverse and a magnitude of 250 for turning.
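
One plausible mapping from the five discrete choices to physics calls is sketched below. The branch ordering, and whether turning is implemented with torque or direct rotation, are assumptions on my part; only the 500/250 magnitudes come from the text above.

public override void OnActionReceived(float[] vectorAction)
{
    var action = Mathf.FloorToInt(vectorAction[0]);
    switch (action)
    {
        case 1:  // forward
            rBody.AddForce(transform.forward * 500f);
            break;
        case 2:  // reverse
            rBody.AddForce(transform.forward * -500f);
            break;
        case 3:  // turn right
            rBody.AddTorque(transform.up * 250f);
            break;
        case 4:  // turn left
            rBody.AddTorque(transform.up * -250f);
            break;
        default: // 0: do nothing
            break;
    }
}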

One of the problems I ran into could be described as the agent consistently ‘overshooting’, resulting in either a spinning behaviour or a pendulum-like movement of travelling forward then back. I realised I had the Decision Period of the Decision Requester set to 10, with ‘Take Actions Between Decisions’ ticked, resulting in the same action being taken for ten consecutive steps. Changing the Decision Period to 1 did not overwhelm the sim and got rid of the overshooting.

Rewards

My first stab at reward engineering was inspired by the blog below on using curriculum learning for pursuit-evasion by Adam Price (see his open source pursuit-evasion game here).

Curriculum Learning With Unity ML-Agents

Replicating the ‘reaching’ behaviour worked really nicely. I incremented the target’s speed and decremented the distance considered ‘close enough’ to the target. I then attempted to train on top of this model with a new curriculum that encouraged tracking using a counter variable, counting the number of consecutive FixedUpdate() calls within range of the target. Again, the target’s speed was incremented, and this time so was the ‘duration’ threshold (the value the counter variable needed to reach). Progress in this second phase was slow! And the agent’s resulting behaviour, once the episode termination code was removed, was to track only for the length of time that had led to a reward and then lose interest (obvious in retrospect).

I needed a function that allocated scaled rewards at every step and I had seen an example of just that in a state-of-the-art AOT paper by Luo et al. (2018)². The function is as follows:

Reward function (Image from Luo et al. (2018))
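
Written out, in the notation used by the code implementation shown later in this section, the function is:

r = A - \left( \frac{\sqrt{x^2 + (y - d)^2}}{c} + \lambda \cdot a \right)
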
  • A, d, c and λ are all tuning parameters. A and d are purely design choices: A is the maximum reward that can be allocated in one step (set to 1, as in the Luo paper) and d is the optimal forward/back distance (set to 3, based on what looked like a reasonable gap in Game view).
  • The left side of the bracket is rewarding the agent’s positioning and the right side the agent’s facing direction. In order to generate rewards in the range [-1,1], c and λ need to be set such that the maximum value the bracket can be is 2. Rotation in Unity goes from -180 to 180, so λ should always be 1/180 if rotation and positioning are equally weighted. The value of c changes depending on the size of the training court being used.
  • The x and y represent the x and z components of the heading vector from agent to target. Taking the square root of the summed and squared heading components gives the length of the vector, i.e. the distance. Reward is highest when this whole term is zero, and it is zero when the agent’s and target’s positions match in the x plane and heading.z is +3 (the value of d). I realised that the sign of heading.z changes depending on the travelling direction of the target, resulting in the agent sometimes being rewarded for being behind the target and sometimes for being in front (not helpful!). There is therefore an extra chunk of code that ensures the highest reward is issued for being behind the target even when the agent is technically forward of the target according to the world z axis.
if (Vector3.Dot(heading.normalized, Target.transform.forward) > 0)
{
    // Agent is behind the target (heading roughly aligned with the target's facing
    // direction), so keep heading.z positive and let the +d offset reward following
    heading.z = Mathf.Abs(heading.z);
}
else
{
    // Agent is in front of the target, so make heading.z negative
    heading.z = -Mathf.Abs(heading.z);
}
  • The a is the angle between the agent’s transform.forward vector and the heading vector. This encourages the agent to look at the target. Initially I had interpreted the description of a (in the quote below) as the angle between the agent’s transform.forward and the target’s transform.forward, but the later implementation (angle to heading) worked better, which I think makes sense, especially in the visual setting.

"the maximum reward A is achieved when the object stands perfectly in front of the agent with a distance d and exhibits no rotation" – Luo et al. (2018)

Below is the code implementation of the function for this project.

var rDist = Mathf.Sqrt(Mathf.Pow(heading.x, 2f) + Mathf.Pow((heading.z - d), 2f));  // distance term
var r = A - ((rDist / c) + (a * lam));  // reward in the range [-1, 1]
AddReward(r);

Results

Training was performed locally on my Lenovo Thinkpad T480s laptop (Intel® Core™ i7–8550U Processor, 4 cores, 8 threads, 1.80 GHz processor base frequency), setting max_steps to 5M. The graph below is the Cumulative Reward graph generated by TensorBoard with smoothing at 0.99.

Training with vector observations (pink) initially took around 5 hours but this was reduced to roughly 2 hours using six training courts (see Game View Display 2 in the CubeTrack scene). The reward values in this case represent reward averaged across the six Area instances. The graph axes are faint, but cumulative reward levels off at around 2,500. Note that the maximum reward achievable in an episode is 3,000 (episodes terminate at 3,000 steps unless the cumulative reward for the episode dips below -450) but would require the random starting position and facing direction of the agent to be the position and facing direction that elicits an immediate +1 from the reward function.
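
For reference, the early-termination condition described above amounts to a couple of lines in the agent script; a minimal sketch (assuming the 3,000-step limit is handled by the agent’s Max Step setting) would be:

// End the episode early if the cumulative reward for the episode drops too low
if (GetCumulativeReward() < -450f)
{
    EndEpisode();
}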

Cumulative Reward over 5 million steps for 5 different training runs (Image by author)

The dark blue line just below illustrates the success of visual observation tracking in the first version of VisualCubeTrack, with high contrast colours and no shading or shadowing. Training with grayscale images in the later version of VisualCubeTrack led to a significant performance drop (light orange), and changing to colour observations alone did little to improve things (red). In both cases the training run was cut short. It wasn’t until the visual encoder was changed from simple (2 convolutional layers) to resnet (IMPALA ResNet: three stacked layers, each with two residual blocks) that progress improved (light blue), with cumulative reward levelling off around 2,300 and inference presenting reasonable behaviour. This training run involved one training court and took approximately 56 hours.

Scope

What is essentially discussed in this blog is the first ‘set up’ (reward function, environment settings, training configuration and so on) that elicited convincing following behaviour, and no more. The performance of the trained agent in VisualCubeTrack especially leaves room for improvement. Avenues for improvement include ..

  • Further tuning of the hyperparameters in the config file
  • Further reward engineering e.g. addition of long-term rewards
  • Using a different algorithm (ML-Agents provides an implementation of soft actor-critic) or a different implementation (before using ML-Agents my RL library of choice was Stable Baselines³ from Antonin Raffin and others)
  • Using additional techniques, such as the already mentioned curriculum learning or an intrinsic reward signal referred to as curiosity
  • Making the learned policy more robust using domain randomisation, for example
  • Making training more efficient by training over multiple environment instances or training with an executable rather than in the editor

Other than improving performance in the current version of the environment, AOT with deep RL can be furthered by ..

  • Increasing the complexity of the learning problem e.g. introduce a penalty for collisions with the target or added obstacles
  • Using a more complex and realistic sim!!
Inference on CubeTrack.nn (Image by author)

If you spot any oversights or have any thoughts or questions etc. then please do comment below or tweet me at @KN_Richardson. Stay safe!

[1]: Juliani, A., Berges, V., Teng, E., Cohen, A., Harper, J., Elion, C., Goy, C., Gao, Y., Henry, H., Mattar, M., Lange, D. (2020). Unity: A General Platform for Intelligent Agents. arXiv preprint arXiv:1809.02627. https://github.com/Unity-Technologies/ml-agents.

[2]: Luo, W., Sun, P., Zhong, F., Liu, W., Zhang, T., & Wang, Y. (2018, July). End-to-end active object tracking via reinforcement learning. In International Conference on Machine Learning (pp. 3286–3295).

[3]: Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., & Wu, Y. (2018). Stable Baselines. https://github.com/hill-a/stable-baselines.

