
Accelerate Training in RL Using Distributed Reinforcement Learning Architectures

How do the most common ones differ from one another?

Photo by Possessed Photography on Unsplash

Reinforcement Learning (RL) is an area of ML focused on teaching agents to make sequential decisions in complex, dynamic environments. An agent learning to play chess against a human player is one example of an RL application. In an RL setting, the agent learns to operate in an environment by exploring and exploiting the actions available to it. The agent's goal is to maximize its long-term cumulative reward, such as winning as many games of chess as possible.

Source: https://en.wikipedia.org/wiki/Reinforcement_learning
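To make the interaction loop concrete, here is a minimal sketch of the agent-environment cycle described above. It assumes the Gymnasium library and its CartPole environment, neither of which appears in the original papers discussed below; the random action is merely a stand-in for a learned policy.

```python
# A minimal sketch of the agent-environment loop, assuming a
# Gymnasium-style API. The random action is a placeholder for
# whatever policy the agent has learned.
import gymnasium as gym

env = gym.make("CartPole-v1")

obs, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Cumulative reward for this episode: {total_reward}")
```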

RL algorithms are divided into model-based and model-free algorithms. In a model-based setting, we have a true representation of the environment: we know the probability of reaching the next state when an action is taken in the current state. Such environments are simple to simulate. For example, if you want to teach an agent to solve a maze, it is easy to represent the maze as a series of (current position, action, next position) tuples, where the actions are left, right, forward, and back.
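As a toy illustration (the maze layout, actions, and rewards below are made up for this post), such a fully known model can be written down directly as a lookup table of transitions:

```python
# A toy model-based view: the maze dynamics are fully known and can be
# written down as (current position, action) -> (next position, reward)
# entries. All values here are illustrative, not from the article.
transition_model = {
    ((0, 0), "right"):   ((0, 1), -1),
    ((0, 1), "right"):   ((0, 2), -1),
    ((0, 2), "forward"): ((1, 2), -1),
    ((1, 2), "forward"): ((2, 2), +10),  # reaching the goal cell
}

def step(state, action):
    """Look up the next state and reward from the known model."""
    return transition_model[(state, action)]

print(step((0, 0), "right"))  # ((0, 1), -1)
```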

Model-free learning deals with real-world environments that are difficult to simulate and for which no true model exists – for example, an agent learning to drive a car in the real world. In this case, it is hard to represent the world around the car as (current position, action, next position) tuples. Instead, the agent gathers experience by taking random actions at first and optimizes its actions in later stages. As a result, most of the time spent on model-free RL problems goes into generating data. It is not just the stochasticity of the environment that slows things down; dynamic environments also have large state and action spaces, which makes gathering training data even slower.
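A rough sketch of this data-generation loop, again assuming a Gymnasium-style environment, might look like the following; the replay buffer simply accumulates (state, action, reward, next state) tuples produced by random exploration.

```python
# A sketch of model-free experience gathering: no transition model is
# available, so the agent interacts with the environment (random actions
# here) and stores transitions for later learning.
import gymnasium as gym

env = gym.make("CartPole-v1")
replay_buffer = []

obs, _ = env.reset(seed=0)
for _ in range(1000):                       # data generation dominates runtime
    action = env.action_space.sample()      # random exploration early on
    next_obs, reward, terminated, truncated, _ = env.step(action)
    replay_buffer.append((obs, action, reward, next_obs, terminated))
    obs = next_obs
    if terminated or truncated:
        obs, _ = env.reset()

print(f"Collected {len(replay_buffer)} transitions")
```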

Most RL applications need high decision-making throughput, and this is not limited to games. Imagine an industrial robot responsible for ensuring safety by keeping pressure at an optimum level. Inference times in such industrial applications must be on the order of milliseconds or microseconds so that disasters are prevented. Applying RL algorithms to industrial AI solutions is therefore always a challenge when the time to train, optimize, and infer is limited and the environment is complex.

In such cases, distributed reinforcement learning can help. In distributed reinforcement learning, the responsibilities of acting on the environment and learning from the experience are divided between actors and learners, respectively. The experiences gathered by the actors are shared with the learner, which is responsible for learning the best actions to take. In turn, the policies learned by the learner(s) are sent back to the actors. This decoupling has enabled research on a variety of distributed RL architectures.
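The sketch below is a deliberately simplified illustration of this decoupling and is not taken from any of the papers discussed here: an actor process pushes experience onto a shared queue, while a learner process consumes it and periodically publishes updated parameters back to the actor.

```python
# A highly simplified actor/learner split: actors push experience onto a
# shared queue, the learner consumes it and publishes updated parameters.
# The "environment" and "gradient step" are placeholders.
import multiprocessing as mp
import random

def actor(experience_q, param_q, n_steps=100):
    params = param_q.get()                  # start from initial parameters
    for step in range(n_steps):
        transition = (step, random.random(), params)  # placeholder interaction
        experience_q.put(transition)
        if not param_q.empty():             # pick up newer parameters if any
            params = param_q.get()

def learner(experience_q, param_q, n_updates=100):
    params = 0.0
    param_q.put(params)
    for update in range(n_updates):
        transition = experience_q.get()     # learn from actor experience
        params += 0.01                      # placeholder "gradient step"
        param_q.put(params)                 # share new parameters with actors

if __name__ == "__main__":
    experience_q, param_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=actor, args=(experience_q, param_q)),
             mp.Process(target=learner, args=(experience_q, param_q))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```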

This blog post introduces the most common architectures in use and how they differ from one another. The list presented here is not exhaustive; as RL grows in popularity in both research and application, more variants are being proposed. However, each of these is largely a derivative of earlier architectures and differs in how it optimizes training time, communication overhead, resource utilization, running cost, and so on. The underlying design of decoupling actors and learners is common to many of them.

Note: Model-based RL algorithms like value iteration or policy iteration can also be parallelized. However, they are confined to problems where a model of the environment is available, which is rare in the real world. Hence this blog post focuses only on model-free RL algorithms.

GORILA

GORILA, which stands for General Reinforcement Learning Architecture, was the first massively distributed reinforcement learning architecture, proposed in 2015. The paper introduced the idea of using actors responsible for creating experience trajectories and neural-network-based learners that represent the value function or the policy function. The image below shows the GORILA architecture.

GORILA Architecture, source: https://arxiv.org/abs/1507.04296

GORILA uses DQN to learn the value function. Each actor receives the best action to take on the environment from the latest Q-network parameters and sends its experience trajectory to the replay buffer. There can be multiple actors in an environment, which makes gathering experience trajectories (training data) faster. Each learner samples the training data to train its own Q-network using Deep Q-Learning.

However, the gradients are not applied directly to the target network. Instead, each learner sends its gradients to the parameter server, which collects the gradients from the learners and consolidates them into a single update. The parameter server is built on DistBelief, a distributed system for training large networks in parallel. GORILA uses model parallelism to collect and apply gradients from the learner models and to send the latest Q-network parameters to the target Q-network and the actors. Because this synchronization happens only after every N gradient updates, the parameters used to generate the best action lag behind the latest parameters learned by the learners.
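The following sketch is my own simplification of that data flow and does not use DistBelief: several learners produce gradients, the parameter server consolidates and applies them, and the target parameters are refreshed only every N updates, which is exactly why the acting parameters can lag behind.

```python
# A conceptual sketch of GORILA-style gradient consolidation on a
# parameter server. Learning rates, shapes, and gradients are made up.
import numpy as np

class ParameterServer:
    def __init__(self, n_params, lr=0.01, sync_every=10):
        self.params = np.zeros(n_params)          # latest Q-network parameters
        self.target_params = self.params.copy()   # target network (lags behind)
        self.lr, self.sync_every, self.updates = lr, sync_every, 0

    def apply_gradients(self, grads_from_learners):
        # Consolidate gradients from all learners into one update.
        consolidated = np.mean(grads_from_learners, axis=0)
        self.params -= self.lr * consolidated
        self.updates += 1
        if self.updates % self.sync_every == 0:   # sync every N updates
            self.target_params = self.params.copy()
        return self.params                        # shipped back to the actors

server = ParameterServer(n_params=4)
for step in range(25):
    fake_grads = [np.random.randn(4) for _ in range(3)]  # 3 learners
    latest = server.apply_gradients(fake_grads)

print("learner params:", server.params)
print("target  params:", server.target_params)   # behind by up to N updates
```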

IMPALA

IMPALA, which stands for Importance Weighted Actor-Learner Architecture, is another distributed RL approach, proposed in 2018, and it is quite different from GORILA. IMPALA is better suited to multi-task learning, where you need to train on millions of samples across different domains. Unlike GORILA, which learns only the value function, IMPALA learns both the policy function and the value function using an actor-critic algorithm. In actor-critic methods there are two function approximators: one learns the policy and the other learns the value function. Learning the policy function has an advantage over pure value-function methods because the learned policy is stochastic. In policy-gradient-based methods, the agent is free to explore actions sampled from the policy. In contrast, in traditional Q-learning the exploration-vs.-exploitation decision is made only during the learning phase, and the actions predicted by the agent afterwards simply follow the learned values, which makes the policy deterministic.

(Note: It is possible to learn a stochastic policy using Q-learning; such a policy is called quasi-deterministic.)
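A minimal sketch of these two function approximators, assuming PyTorch (the network sizes and dimensions are arbitrary): a shared trunk feeds a policy head that defines a stochastic distribution over actions and a value head that provides the baseline.

```python
# Two function approximators of an actor-critic agent: a policy head
# (the "actor") and a value head (the "critic"), sharing a small trunk.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # the "actor"
        self.value_head = nn.Linear(hidden, 1)           # the "critic"

    def forward(self, obs):
        h = self.trunk(obs)
        dist = torch.distributions.Categorical(logits=self.policy_head(h))
        value = self.value_head(h).squeeze(-1)
        return dist, value

net = ActorCritic(obs_dim=4, n_actions=2)
dist, value = net(torch.randn(1, 4))
action = dist.sample()   # actions are sampled, so the policy is stochastic
print(action.item(), value.item())
```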

IMPALA uses multiple actors and a single learner to overcome communication overhead. The learner uses accelerators (GPUs) to train the agent on the experiences collected from the various actors, applying a variant of an actor-critic algorithm to learn the policy function and a baseline value function. However, unlike distributed versions of A3C, the actors store trajectories locally and share experiences with the centralized learner, not gradient updates. IMPALA employs various optimization techniques to reduce the lag between the actors and the learner. Learning is performed with a CNN + LSTM + FC network, where the inputs to the CNN and FC layers are processed in parallel. The architecture still follows an off-policy learning process, because the actors do not use the latest policy to derive their actions. Nonetheless, IMPALA reduces the impact of this lag by employing V-trace, an off-policy actor-critic correction algorithm. As the algorithm's name suggests, importance sampling is an estimation technique for correcting estimates computed from samples drawn under a different distribution and for controlling their variance; IMPALA applies truncated importance weights when computing its value targets and policy-gradient updates. The image below shows a sample IMPALA architecture.

IMPALA Architecture (Image by author)
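The sketch below computes V-trace targets following the formulas in the IMPALA paper (https://arxiv.org/abs/1802.01561); it is a simplified NumPy version that ignores episode boundaries, with the truncated importance weights rho and c correcting for the gap between the actors' behaviour policy mu and the learner's target policy pi.

```python
# Simplified V-trace target computation (ignores episode boundaries).
# rho and c are truncated importance weights pi(a|x) / mu(a|x).
import numpy as np

def v_trace_targets(rewards, values, bootstrap_value,
                    log_pi, log_mu, gamma=0.99,
                    rho_bar=1.0, c_bar=1.0):
    T = len(rewards)
    ratios = np.exp(log_pi - log_mu)        # pi(a|x) / mu(a|x)
    rhos = np.minimum(rho_bar, ratios)      # truncated IS weights
    cs = np.minimum(c_bar, ratios)

    values_t_plus_1 = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * values_t_plus_1 - values)

    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    vs_minus_v = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v              # the V-trace targets v_s

# Toy usage with made-up numbers:
targets = v_trace_targets(rewards=np.array([1.0, 0.0, 1.0]),
                          values=np.array([0.5, 0.4, 0.6]),
                          bootstrap_value=0.3,
                          log_pi=np.array([-0.1, -0.2, -0.3]),
                          log_mu=np.array([-0.3, -0.2, -0.1]))
print(targets)
```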

SEED-RL

SEED RL is the latest of the three, proposed by Google in 2020. SEED stands for Scalable and Efficient Deep RL. SEED RL addresses several shortcomings of the distributed-actor, centralized-learner approach used by IMPALA, the most important of which is the use of CPUs on the actors for inference. SEED solves this problem by moving inference and trajectory collection to the learner. With this change, the actor is only responsible for stepping the environment. Since the actor now has only one job to do, it is much more lightweight than the one used in IMPALA. Additionally, the actors only exchange observations and actions with the learner, so the bandwidth requirements are limited compared to IMPALA, where model parameters, experience trajectories, and the LSTM state all cross the network.

Since the actor now relies on the learner for every step it takes, a new problem of latency appears. SEED solves this by using streaming, always-open connections with a low-latency communication framework such as gRPC. Additionally, when an actor is co-located with the learner, latency is further reduced by using Unix domain sockets. The image below shows the architecture of SEED-RL.

SEED-RL Architecture (Image by Author)
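The toy sketch below illustrates only the data flow of this split; the real system uses gRPC streaming and batched accelerator inference rather than multiprocessing pipes, and the environment and policy here are placeholders. The actor merely steps its environment and ships observations, while action selection and trajectory collection happen next to the learner.

```python
# A toy SEED-style split: the actor only steps the environment and sends
# observations; inference and trajectory collection live on the learner
# side. A multiprocessing Pipe stands in for the gRPC stream.
import multiprocessing as mp
import random

def actor(conn, n_steps=20):
    obs = 0.0                                 # placeholder environment state
    for _ in range(n_steps):
        conn.send(obs)                        # only observations cross the wire
        action = conn.recv()                  # action chosen by the central learner
        obs = obs + action + random.random()  # placeholder env.step(action)
    conn.send(None)                           # signal end of episode

def inference_server(conn):
    trajectory = []                           # trajectories are collected learner-side
    while True:
        obs = conn.recv()
        if obs is None:
            break
        action = 1 if obs < 10 else -1        # placeholder for batched model inference
        trajectory.append((obs, action))
        conn.send(action)
    print(f"collected {len(trajectory)} steps on the learner side")

if __name__ == "__main__":
    actor_end, learner_end = mp.Pipe()
    p = mp.Process(target=actor, args=(actor_end,))
    p.start()
    inference_server(learner_end)
    p.join()
```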

SEED uses the V-trace algorithm introduced by IMPALA to learn the target policy. Where SEED differs is in how the value function is learned: it uses R2D2 (from "Recurrent Experience Replay in Distributed Reinforcement Learning"), a distributed value-based agent, to learn the value function.


The following table summarizes the pros and cons of each of the approaches mentioned above.

Pros and Cons of Distributed RL approaches. (Image by author)

In summary, each of the approaches above addresses training time, cost, bandwidth requirements, and resource utilization in its own way, and the right choice depends on the nature of the problem. Nonetheless, industrial applications of AI beyond gaming have gained momentum, and distributed RL architectures, combined with on-demand pricing models and the availability of accelerators from cloud providers, will aid further advancements in this area.

