
Hierarchical Reinforcement Learning

With the Option-Critic framework using tabular Q-learning

Photo: unsplash.com

Hierarchical Reinforcement Learning decomposes a long-horizon decision-making process into simpler sub-tasks. The idea is similar to breaking a large program into smaller functions, each performing one specific task.

Let’s look at an example. Suppose the agent has to clear or set a dining table. This includes the tasks of reaching for and grasping dishes, which are high-level tasks. At a lower level, it requires controlling and moving the limbs and then the fingers to reach out, grasp objects, and put them in the proper place. Hierarchical Reinforcement Learning is designed with the same logic: there are multiple levels of policies, with each lower-level policy handling a task like moving the fingers and the higher-level policies handling tasks like grasping objects.

HRL gives us multiple benefits during training and exploration:

  1. Training: since high-level actions correspond to multiple environment steps, episodes are effectively shorter, so rewards propagate faster and learning improves.
  2. Exploration: since exploration happens at a higher level, the agent learns more meaningful policies and takes more meaningful actions than it would at the atomic level. For example, the agent learns better policies for reaching the goal at the level of grasping objects than at the level of the joint movements of individual fingers.

A few common architectures for HRL are:

  1. Option-Critic Framework
  2. Feudal Reinforcement Learning

Let’s look at how to build your own Option-Critic framework in a simple four-rooms setting using Q-learning. You can look at this blog to understand more about how the Option-Critic framework works.

We will use a 2D four-rooms environment here. The environment has 4 rooms. env.reset() resets the environment and returns a random start state, and env.goal() changes the goal, sampling it at random from one of the corners of the 4 rooms. In this blog, we will change the goal once after a thousand episodes, similar to the Option-Critic paper. The environment code is at https://github.com/anki08/Option-Critic/blob/main/fourrooms.py.
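
To make the setup concrete, here is a minimal usage sketch. Only reset() and goal() are described above; the class name `Fourrooms` and the gym-style return of step() are assumptions based on the linked file.

```python
# Minimal usage sketch of the four-rooms environment.
# Assumptions: the file defines a `Fourrooms` class and step() returns
# the usual gym-style tuple (next_state, reward, done, info).
from fourrooms import Fourrooms

env = Fourrooms()

state = env.reset()   # random start state (an integer cell index)
env.goal()            # re-sample the goal from one of the room corners

next_state, reward, done, _ = env.step(0)  # take one of the four movement actions
```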

fourrooms environment – image by author

Now let’s create our policies.

Q_Omega is the higher-level meta-policy that informs the lower-level policies.

Let’s define our Q_Omega policy as a 2D Q-table where each state has a set of options. The options direct the lower-level policy on which action to take to maximize its reward. The number of options is defined by noptions. We sample from our Q_Omega table in an epsilon-greedy manner.
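
A minimal sketch of such a meta-policy is shown below; the class name, epsilon, and learning rate are illustrative, not the exact code from the repo.

```python
import numpy as np

class EpsGreedyQOmega:
    """Meta-policy Q_Omega: an (n_states x noptions) table sampled epsilon-greedily."""

    def __init__(self, n_states, noptions, epsilon=0.1, lr=0.25):
        self.Q = np.zeros((n_states, noptions))
        self.epsilon = epsilon
        self.lr = lr

    def sample(self, state):
        # With probability epsilon pick a random option, otherwise the greedy one.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.Q.shape[1])
        return int(np.argmax(self.Q[state]))
```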

An option is a generalisation of actions and lets us define macro-actions. In Sutton et al. (1999), it is defined as:

Options consist of three components: a policy π : S × A → [0, 1], a termination condition β : S+ → [0, 1], and an initiation set I ⊆ S. An option ⟨I, π, β⟩ is available in state s_t if and only if s_t ∈ I. If the option is taken, then actions are selected according to π until the option terminates stochastically according to β.


Let’s create our lower-level policy as a softmax policy called Q_U. The Q_U table stores the value of each action for the lower-level policy. We sample actions with a softmax over these values, and the update is a Bellman update.
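
A sketch of what this intra-option policy could look like, continuing the illustrative names from above; the temperature, learning rate, and discount factor are assumptions.

```python
import numpy as np

class SoftmaxQU:
    """Intra-option policy Q_U: an (n_states x noptions x n_actions) table,
    sampled with a softmax and updated towards a one-step Bellman target."""

    def __init__(self, n_states, noptions, n_actions, temperature=1.0, lr=0.25, gamma=0.99):
        self.Q = np.zeros((n_states, noptions, n_actions))
        self.temperature = temperature
        self.lr = lr
        self.gamma = gamma

    def sample(self, state, option):
        prefs = self.Q[state, option] / self.temperature
        probs = np.exp(prefs - prefs.max())   # subtract max for numerical stability
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))

    def update(self, state, option, action, reward, u_next):
        # Bellman update towards reward + gamma * U(s', option),
        # where U(s', option) is supplied by the critic (see below).
        target = reward + self.gamma * u_next
        self.Q[state, option, action] += self.lr * (target - self.Q[state, option, action])
```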

Finally, let us define our termination policy. It decides when the current option should end, so that the higher-level policy can select a new option for the lower level to follow.
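
A simple tabular sketch of such a termination policy follows. In the Option-Critic paper, β is a learned, state-dependent termination function; the fixed initial probability and the class name here are illustrative only.

```python
import numpy as np

class TerminationPolicy:
    """One termination probability beta per (state, option), sampled as a Bernoulli."""

    def __init__(self, n_states, noptions, init_prob=0.5):
        self.beta = np.full((n_states, noptions), init_prob)

    def terminate(self, state, option):
        # True means: stop the current option and let Q_Omega pick a new one.
        return np.random.rand() < self.beta[state, option]
```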

We have our two levels ready. Now we will define our critic. This is an extension of the actor-critic framework: the critic evaluates the options and tells the higher level how good each option is. The intra-option policy (sampled from Q_U) and the termination policy form the option part, while Q_Omega, the meta-policy over options, and Q_U, the value of executing an action in the context of a state and option, form the critic.
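
Putting the pieces together, one critic step could look like the sketch below. It reuses the illustrative classes above and the utility U(s', option) = (1 - beta) * Q_Omega(s', option) + beta * max_o Q_Omega(s', o) from the Option-Critic paper; the function name and discount factor are assumptions.

```python
def critic_update(q_omega, q_u, termination, state, option, action,
                  reward, next_state, done, gamma=0.99):
    """One critic step after a transition (state, option, action, reward, next_state)."""
    if done:
        u_next = 0.0
    else:
        # U(s', option): continue the option with probability (1 - beta),
        # otherwise switch to the best option.
        beta = termination.beta[next_state, option]
        u_next = (1.0 - beta) * q_omega.Q[next_state, option] \
                 + beta * q_omega.Q[next_state].max()

    # Evaluate the option itself (Q_Omega) ...
    q_omega.Q[state, option] += q_omega.lr * (reward + gamma * u_next
                                              - q_omega.Q[state, option])
    # ... and the action taken in the context of (state, option) (Q_U).
    q_u.update(state, option, action, reward, u_next)
```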

Now that we have all the parts, let’s train and test the agent. I have created a Colab notebook for the training and testing part: https://colab.research.google.com/drive/1q5J0CeAhGP2M_bgJM3NhFpKVqgzWzJ_M?usp=sharing

During training, the meta-policy learns the options and termination policies, which are then used during testing to tell the lower-level policy which action to take. We change the goal every 1000 episodes to show that, with every change of goal, the agent needs less time because it does not have to learn the policies from scratch and uses prior knowledge to inform its actions.
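
For reference, here is a condensed training-loop sketch tying the illustrative pieces together. The hyperparameters, episode counts, and the exact goal-switching schedule are assumptions, not the values from the notebook.

```python
def train(env, q_omega, q_u, termination, n_episodes=2000, max_steps=1000):
    for episode in range(n_episodes):
        if episode > 0 and episode % 1000 == 0:
            env.goal()                  # move the goal; the learned tables are kept
        state = env.reset()
        option = q_omega.sample(state)
        for _ in range(max_steps):
            action = q_u.sample(state, option)
            next_state, reward, done, _ = env.step(action)
            critic_update(q_omega, q_u, termination, state, option, action,
                          reward, next_state, done)
            if done:
                break
            if termination.terminate(next_state, option):
                option = q_omega.sample(next_state)   # option ended; pick a new one
            state = next_state
```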

image by author

As we can see, the agent learns to reach the goal in just 6 steps, as opposed to around 50 steps for Q-learning.

You can find the entire code here → https://github.com/anki08/Option-Critic

References:

Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2), 181–211.

Bacon, P.-L., Harb, J., & Precup, D. (2017). The Option-Critic Architecture. AAAI.

Hierarchical Reinforcement Learning

