Hierarchical Reinforcement Learning: FeUdal Networks

Letting computers see the bigger picture

Austin Nguyen
Towards Data Science


Every task can be naturally divided into subtasks. When we prepare dinner, we don’t micromanage every little movement our hands make. Instead, we segment the task into smaller pieces (taking out ingredients, cutting, cooking, serving) and then focus on how to accomplish each one individually. This idea of decomposition is what inspired hierarchical approaches to reinforcement learning. Specifically, FeUdal Networks (FuNs) divide computation between a manager and a worker using a modular neural network. The manager assigns the worker specific, local goals, which the worker learns to achieve optimally. At the same time, the manager learns how to assign these subgoals so as to best accomplish a “bigger-picture” task.

In this article, we outline the architecture, intuition, and mathematics behind FeUdal Networks.


FuN’s Architecture

FuN is a modular neural network (MNN) consisting of two independent networks: the manager and the worker. Here, we describe an environment with a discrete action space.

FuN Modular Neural Network Architecture

The architecture can seem dense, so here’s a breakdown of what each node in the diagram represents:

Variable Definitions

It’s useful to take a step back and dissect what’s going on. Given a task to learn, we divide labor between two entities: the “manager” and the “worker” sections of our MNN. Notice that the variable z is just another representation of our observation x: it carries the same information, only encoded differently. We pass this shared representation to both the worker and the manager, but each handles it somewhat differently.
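
To make the shared representation concrete, here is a minimal sketch in PyTorch of a perceptual module that maps a raw image observation x to the embedding z consumed by both the manager and the worker. The class name, layer sizes, and the assumption of an RGB image input are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class Percept(nn.Module):
    """Shared encoder: raw observation x -> embedding z (illustrative sizes)."""
    def __init__(self, z_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),  # assumes an RGB image observation
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(z_dim)  # z_dim is the dimensionality of z

    def forward(self, x):
        # Both the manager and the worker receive this same embedding z.
        return torch.relu(self.fc(self.conv(x)))
```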

The Manager

After receiving z, the manager creates a new latent state, s, by passing z through another function. This latent state is yet another representation of the environment, but in a higher-dimensional space. The manager operates in a much higher-dimensional vector space than the worker, which reflects the fact that the manager considers the bigger picture rather than only local information.

Then, the manager pushes this latent state s into a recurrent neural network (RNN), which outputs a goal for the worker to achieve. This goal represents a relative change in state for the worker to pursue. More formally:

Manager RNN Forward Function
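
Written out in the paper’s notation, the manager’s recurrent step and goal normalization are roughly:

$$h^M_t,\; \hat{g}_t = f^{\,M\mathrm{rnn}}(s_t,\, h^M_{t-1}), \qquad g_t = \hat{g}_t \,/\, \lVert \hat{g}_t \rVert$$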

where h represents the RNN’s hidden state. After normalizing the goal, we do something special: we pool the goals over a finite horizon c and then pass the result through a linear transformation with no bias. This effectively translates from the manager’s vector space to the worker’s vector space and encodes a representation of the previous c goals assigned by the manager.

Pooling Followed By Linear Transformation
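
In symbols, again following the paper, the pooled and transformed goal is roughly:

$$w_t = \phi\!\left(\sum_{i=t-c}^{t} g_i\right),$$

where $\phi$ is a linear transformation with no bias that maps into the worker’s k-dimensional space.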

The vector w has k dimensions and two key properties:

  • Since the transformation has no bias, it can never produce a constant, non-zero vector. As a result, the worker will never ignore the manager’s input. There is always some “meaning” for the worker to extract.
  • Due to pooling, the manager’s conditioning varies smoothly over time. This prevents any erratic changes in goals that the worker cannot understand or handle.
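
Putting the manager together, here is a minimal PyTorch sketch. It is only an illustration under simplifying assumptions: a plain LSTM cell stands in for the paper’s dilated LSTM, the hidden state serves directly as the pre-normalized goal, and the class name and dimensions (d for the manager’s space, k for the worker’s) are chosen for readability rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Manager(nn.Module):
    """z -> latent state s -> normalized goal g; pooled goals -> w in the worker's space."""
    def __init__(self, z_dim=256, d=256, k=16, horizon=10):
        super().__init__()
        self.f_space = nn.Linear(z_dim, d)      # z -> s (higher-level latent state)
        self.rnn = nn.LSTMCell(d, d)            # stand-in for the paper's dilated LSTM
        self.phi = nn.Linear(d, k, bias=False)  # no bias, so w cannot collapse to a constant
        self.horizon = horizon                  # the horizon c over which goals are pooled

    def forward(self, z, hidden, goal_history):
        s = F.relu(self.f_space(z))
        h, c = self.rnn(s, hidden)
        g = F.normalize(h, dim=-1)              # unit-norm goal: a direction in the manager's space
        goal_history = (goal_history + [g])[-self.horizon:]
        w = self.phi(torch.stack(goal_history).sum(dim=0))  # pool the last c goals, project to R^k
        return s, g, w, (h, c), goal_history
```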

The Worker

Once the worker receives z, it passes z into a different recurrent neural network. However, instead of outputting a vector, the worker’s RNN outputs a matrix U with one row per possible action and k columns.

Worker RNN Forward Function
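
In the paper’s notation, with |A| the number of available actions, this step reads roughly:

$$h^W_t,\; U_t = f^{\,W\mathrm{rnn}}(z_t,\, h^W_{t-1}), \qquad U_t \in \mathbb{R}^{|A| \times k}$$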

where h represents the worker RNN’s hidden state. To build intuition for why we output a matrix instead of a vector, consider the equation below:

Action Output for Worker
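
Concretely, following the paper, the worker’s policy is:

$$\pi_t = \mathrm{SoftMax}(U_t\, w_t)$$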

This output is the probability distribution over the worker’s actions. However, let’s take a slightly different perspective. Assume that each row of U encodes the state that would result from choosing the corresponding action. Then each element of the vector Uw is the dot product between a row of U and the encoded goal w. Treating the dot product as a measure of similarity, and noting that SoftMax preserves relative ordering, each element of this vector reflects how likely the corresponding action is to advance the manager’s goal. As a result, it makes sense to sample actions according to this distribution.
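
A minimal PyTorch sketch of the worker makes this matrix-vector structure explicit. As before, a plain LSTM cell and illustrative names and sizes are assumptions for readability, not details from the paper:

```python
import torch
import torch.nn as nn

class Worker(nn.Module):
    """z -> action-embedding matrix U; policy pi = SoftMax(U w)."""
    def __init__(self, z_dim=256, k=16, n_actions=18, hidden_size=256):
        super().__init__()
        self.rnn = nn.LSTMCell(z_dim, hidden_size)
        self.to_U = nn.Linear(hidden_size, n_actions * k)  # reshaped into an |A| x k matrix
        self.n_actions, self.k = n_actions, k

    def forward(self, z, hidden, w):
        h, c = self.rnn(z, hidden)
        U = self.to_U(h).view(-1, self.n_actions, self.k)
        logits = torch.bmm(U, w.unsqueeze(-1)).squeeze(-1)  # dot product of each row of U with w
        pi = torch.softmax(logits, dim=-1)                  # distribution over actions
        return pi, (h, c)
```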

The entire forward process for FuN is shown below.

FuN Forward

How It Learns


Let’s consider how training works. After executing an action, we receive a reward and a new set of observations. We could then train the entire MNN as usual with TD learning, optimizing over the actions taken by the worker, and propagate those gradients to the manager as well. However, this defeats the purpose of hierarchical learning, since the manager’s outputs g would lose all semantic meaning. FuN would then be no different from any other network, with g becoming just another internal latent variable. As a result, we instead train the manager and worker independently.

The Manager

Intuitively, we want the manager to give the worker goals that not only maximize reward over time but are also achievable by the worker. Therefore, we maximize a similarity measure between the worker’s change in state and the goal set by the manager.

The manager’s section of the MNN is updated according to the equation:

Manager Policy Update Equation
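
Concretely, in the paper’s notation, the manager’s update (its “transition policy gradient”) is:

$$\nabla g_t = A^M_t\, \nabla_\theta\, d_{\cos}\!\left(s_{t+c} - s_t,\; g_t(\theta)\right), \qquad A^M_t = R_t - V^M_t(x_t; \theta)$$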

where d_cos represents the cosine similarity between two vectors, A is the manager’s advantage function, and c represents the manager’s horizon. By weighting the similarity measure by the advantage, this update rule effectively balances feasibility against payoff. The advantage is computed using the manager’s internal value function, which is updated in a similar way to other actor-critic algorithms. The manager’s reward function is defined according to the task at hand.
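
As a rough sketch of how this update can be expressed as a loss in PyTorch, assuming the latent states, goals, and the manager’s advantage have already been collected (the function name and argument layout are illustrative):

```python
import torch.nn.functional as F

def manager_loss(s_t, s_t_plus_c, g_t, advantage):
    """Transition policy gradient: push goals toward advantageous directions of state change.

    s_t, s_t_plus_c : latent states at times t and t + c (treated as constants here)
    g_t             : goals emitted by the manager (gradients flow through these)
    advantage       : manager advantage R_t - V_M(x_t), treated as a constant
    """
    delta_s = (s_t_plus_c - s_t).detach()            # state change over the horizon c
    cos = F.cosine_similarity(delta_s, g_t, dim=-1)  # d_cos(s_{t+c} - s_t, g_t)
    return -(advantage.detach() * cos).mean()        # minimizing this maximizes the objective
```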


The Worker

We want to encourage the worker to follow the manager’s provided goal. As a result, let’s define an intrinsic reward:

Worker Intrinsic Reward
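
Written out in the paper’s notation:

$$r^I_t = \frac{1}{c} \sum_{i=1}^{c} d_{\cos}\!\left(s_t - s_{t-i},\; g_{t-i}\right)$$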

This reward is an average of how closely the worker has followed the manager’s instructions over a finite horizon. The worker is trained to maximize a weighted sum of the environment reward and this intrinsic reward. Using these, we train the worker’s value function, just as we did for the manager. Then, we update the worker’s policy using:

Worker Policy Update Equation
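
In symbols, following the paper, with α weighting the intrinsic reward:

$$\nabla \pi_t = A^D_t\, \nabla_\theta \log \pi(a_t \mid x_t; \theta), \qquad A^D_t = R_t + \alpha R^I_t - V^D_t(x_t; \theta)$$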

where we maximize the log probabilities scaled by the advantage function. This is analogous to typical actor-critic algorithms.

The paper notes that the manager and worker can use different discount factors. As a result, the worker can focus on immediate, local rewards while the manager focuses on long-term outcomes.
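
Here is a minimal sketch of the worker’s side of training, under the assumption that latent states, goals, action log-probabilities, and the two returns (environment and intrinsic, each computed with its own discount factor) have already been collected; the function names and the choice of alpha are illustrative:

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(s_hist, g_hist, t, c):
    """Average cosine similarity between recent state changes and the corresponding past goals.

    Assumes t >= c, with s_hist and g_hist indexable by time step.
    """
    sims = [F.cosine_similarity(s_hist[t] - s_hist[t - i], g_hist[t - i], dim=-1)
            for i in range(1, c + 1)]
    return torch.stack(sims).mean(dim=0)

def worker_loss(log_prob_a, env_return, intr_return, value, alpha=0.5):
    """Actor-critic update on the mixed return R_t + alpha * R^I_t."""
    advantage = (env_return + alpha * intr_return - value).detach()
    return -(advantage * log_prob_a).mean()
```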

The Results

FuN Performance on DeepMind Lab’s Labyrinth

The FuN paper [1] uses a range of experiments to demonstrate the algorithm’s robust learning ability, most notably on Montezuma’s Revenge and DeepMind Lab’s games. With recurrent LSTM networks trained with A3C as baselines, FuN outperforms the alternatives in both sets of experiments.

FuN Performance on Montezuma’s Revenge

Even more impressively, FuN learns semantically meaningful subgoals. In the visualization above, the tall bars represent goals consistently issued by the manager, each of which corresponds to a major “turning point” in the game.

That’s It!

FeUdal Networks are an important stepping stone for reinforcement learning, giving agents the ability to autonomously decompose a task into semantically meaningful subgoals. Next time, we’ll explore how this algorithm can be extended to various multi-agent frameworks.

References

[1] A. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, FeUdal Networks for Hierarchical Reinforcement Learning (2017), ICML.


Part-time writer · Full-time learner · PhD Student @ University of Michigan