Scalable Reinforcement Learning Using Azure ML and Ray

Accelerate deep RL training on custom gym environments using RLLIB on an Azure ML cluster.

Srikanth Machiraju
Towards Data Science


Single-machine, single-agent RL training has many challenges, the most important being the time it takes for the rewards to converge. Most of the time in RL training is spent gathering experiences: simple applications take a few hours, and complex applications take days. Deep learning frameworks like TensorFlow support distributed training; can the same idea be applied to RL as well? The answer is yes.

This article focuses on the specific pain points of single-machine training through a practical example and demonstrates how scaled RL solves them. It assumes the reader has a fundamental understanding of RL, deep RL, and deep learning; for a detailed explanation and walkthrough, please watch the full video presented at the bottom of the page.

Scalable reinforcement learning techniques tackle the problem of long training times in environments with high-dimensional state spaces by decoupling learning from acting. Following the practices of distributed deep learning, parallelism can be achieved by spreading the gradient computation across models or by distributing the data-collection process. A wide range of algorithms has been proposed for distributed learning.

This article mainly focuses on accelerated learning using distributed experience-trajectory collection.

We begin with:

  • Understanding the problem statement.
  • Building a simple DQN agent and creating a baseline for training duration and rewards.
  • Creating a distributed RL environment, running the experiment in a distributed fashion, and comparing the improvements with the baseline above.

The code artifacts used for training are available here: https://github.com/sriksmachi/supercabs

Custom Environment

Many samples on the web demonstrate distributed RL on built-in gym environments, but very few use custom environments. In practice, when RL is chosen to solve a problem, it is important to know how to develop and register custom gym environments for distributed training. This article shows how to create such a custom gym environment. For the purpose of this article and the accompanying video, we will use the pseudo-learning problem defined below.

Problem: The custom environment we deal with is called “Contoso Cabs”. Contoso Cabs is a fictitious cab company that wants to increase its profits by increasing the total number of hours logged by each driver in a month. It currently operates across 5 cities and provides 24/7 service. To maximize its profits, Contoso Cabs wants to build an RL agent that helps its drivers make the right decision when a cab request arrives. Since profits are linked to hours driven, the agent learns which ride requests to accept so as to maximize the cumulative discounted reward, which in this case is the total hours driven. The agent's goal is to accumulate the maximum number of hours per episode (each episode is one month). The custom gym environment class used for this training is sketched below.
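The actual environment class is part of the linked repository; the sketch below only illustrates its general shape. The class name ContosoCabsEnv and the reward/transition details are illustrative placeholders rather than the exact code from the repo, and the sketch follows the classic gym API (reset returning an observation; step returning observation, reward, done, info) that RLlib expected at the time of writing.

```python
import gym
import numpy as np
from gym import spaces


class ContosoCabsEnv(gym.Env):
    """Illustrative custom gym environment for the 'Contoso Cabs' problem.

    The state encodes (city, day of week, hour of day) as a one-hot vector of
    size 5 + 7 + 24 = 36. An action is a (source, destination) city pair,
    giving 5 * 5 = 25 discrete actions. The reward approximates hours driven;
    same source/destination actions are penalized.
    """

    NUM_CITIES, NUM_DAYS, NUM_HOURS = 5, 7, 24

    def __init__(self, env_config=None):
        super().__init__()
        self.observation_space = spaces.Box(
            low=0.0, high=1.0,
            shape=(self.NUM_CITIES + self.NUM_DAYS + self.NUM_HOURS,),
            dtype=np.float32)
        self.action_space = spaces.Discrete(self.NUM_CITIES * self.NUM_CITIES)
        self.max_hours_per_episode = 24 * 30  # one episode = one month
        self.reset()

    def _encode(self):
        # One-hot encode (city, day, hour) into a single 36-element vector.
        state = np.zeros(self.observation_space.shape, dtype=np.float32)
        state[self.city] = 1.0
        state[self.NUM_CITIES + self.day] = 1.0
        state[self.NUM_CITIES + self.NUM_DAYS + self.hour] = 1.0
        return state

    def reset(self):
        self.city, self.day, self.hour, self.hours_driven = 0, 0, 0, 0
        return self._encode()

    def step(self, action):
        source, destination = divmod(action, self.NUM_CITIES)
        if source == destination:
            reward, trip_hours = -10.0, 1          # penalty for a no-op ride
        else:
            trip_hours = np.random.randint(1, 5)   # placeholder trip duration
            reward = float(trip_hours)             # reward ~ hours driven
        self.city = destination
        self.hour = (self.hour + trip_hours) % self.NUM_HOURS
        self.day = (self.day + (self.hour < trip_hours)) % self.NUM_DAYS
        self.hours_driven += trip_hours
        done = self.hours_driven >= self.max_hours_per_episode
        return self._encode(), reward, done, {}
```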

DQN Agent

We first build a simple DQN agent using the following architecture.

  • Input: The size of the encoded state space is 36 (number of cities (5) + number of days per week (7) + number of hours per day (24)).
  • Two fully connected (FC) hidden layers of size 36.
  • The action space is represented as a tuple of (source, destination), so its size is 5 * 5 = 25. For simplicity, tuples with the same source and destination are included, but a penalty is applied if the agent picks such an action. In essence, we feed the network a batch of 64 state vectors, each of size 36, and the output represents the Q-value of each action (a minimal sketch of this network follows the list).
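As a reference, a minimal Keras definition of such a network might look as follows. The layer sizes come from the list above; the optimizer, loss, and learning rate are typical DQN choices and are assumptions, not necessarily what the repository uses.

```python
import tensorflow as tf

STATE_SIZE = 36   # 5 cities + 7 days + 24 hours, one-hot encoded
NUM_ACTIONS = 25  # all (source, destination) pairs: 5 * 5


def build_q_network():
    """Q-network mapping an encoded state vector to one Q-value per action."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(36, activation="relu", input_shape=(STATE_SIZE,)),
        tf.keras.layers.Dense(36, activation="relu"),
        tf.keras.layers.Dense(NUM_ACTIONS, activation="linear"),
    ])
    # Mean-squared TD error is the usual DQN loss; Adam is a common optimizer choice.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
    return model


# A batch of 64 encoded states yields a (64, 25) matrix of Q-values.
q_net = build_q_network()
print(q_net.predict(tf.zeros((64, STATE_SIZE))).shape)  # (64, 25)
```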

When the DQN agent is trained, the rewards stabilize between 2500 and 3000 within 1000 episodes, as shown in the image below. This means we now have a model that can help a driver earn a cumulative reward of 2500–3000 per episode.

DQN training results. (Image created by Author)

The challenge, however, is that the agent took ~35 minutes to converge on a simple environment. The encoded state considered here contains only 36 elements; in real-world applications, state spaces are easily 10-fold larger.

Ray on Azure ML

Before diving into solving the state-space problem, let us understand the key ingredients: distributed RL algorithms, Ray, Azure ML, and APEX.

Distributed deep RL decouples acting (interacting with the environment) from learning (updating the model from samples), and this decoupling allows the two to scale independently. Depending on how the components interact and on the underlying RL methodology, several methods and algorithms have been published, a few of which are described in my article here.

Ray’s RLLIB is an excellent open-source library that supports most of the proposed algorithms. It is easy to configure and run training jobs using existing or custom Gym environments. Enabling RLLIB on any distributed cluster is simple.

Azure ML is a service on Azure (Microsoft’s cloud) that acts as a workspace for running ML experiments. Using Azure ML, we can run experiments in interactive mode or in job mode for long-running workloads. Azure ML provides managed compute clusters that can be configured with CPU or GPU cores for running distributed ML experiments. To run distributed RL jobs, we need to install Ray on the Azure ML cluster; ray-on-aml is an open-source library that converts an Azure ML cluster into a Ray cluster.
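As an illustration, attaching Ray from inside an Azure ML job typically takes only a few lines. The snippet below follows the usage pattern documented by the ray-on-aml project; the exact class and method names may differ between versions of the library, so treat it as a sketch rather than a drop-in script.

```python
# Entry script of an Azure ML job running on a multi-node compute cluster.
# Pattern based on the ray-on-aml project's documented usage; verify against
# the version of the library you install.
from ray_on_aml.core import Ray_On_AML

ray_on_aml = Ray_On_AML()
ray = ray_on_aml.getRay()  # Ray handle on the head node, None on worker nodes

if ray:
    # Head node: drive the RLlib experiment from here.
    print("Ray cluster resources:", ray.cluster_resources())
else:
    # Worker nodes simply join the Ray cluster and wait for work.
    print("This node joined the Ray cluster as a worker.")
```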

APEX (Ape-X) is a distributed architecture for deep reinforcement learning at scale. It decouples acting from learning so that experience collection and model updates can proceed efficiently in parallel. Actors in APEX can be scaled to thousands of worker nodes, each working with its own copy of the environment. The collected experiences are pushed to a shared replay buffer, along with initial priorities computed from the temporal-difference (TD) error. The learner samples prioritized experiences, updates the network, and re-computes the priorities. The updated network parameters are periodically copied back to the actors.

The below diagram shows the architecture of APEX.

APEX Architecture (Image created by Author)

This architecture allows the learner to learn from a far richer set of experiences than a single agent could gather on its own. Prioritized (importance) sampling also leads to faster convergence than uniform sampling.
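To make the priority computation concrete, the sketch below shows how a single transition’s priority can be derived from its TD error, reusing the Keras network sketched earlier. This is a generic illustration of the prioritized-replay idea behind APEX, not code taken from RLlib.

```python
import numpy as np


def td_error_priority(q_net, target_net, transition, gamma=0.99, eps=1e-6):
    """Priority of one transition = |TD error| + eps (generic APEX-style rule)."""
    state, action, reward, next_state, done = transition
    # Double-DQN style target: choose the action with the online network,
    # evaluate it with the target network.
    next_q_online = q_net.predict(next_state[None], verbose=0)[0]
    next_q_target = target_net.predict(next_state[None], verbose=0)[0]
    best_next_action = int(np.argmax(next_q_online))
    td_target = reward + (0.0 if done else gamma * next_q_target[best_next_action])
    td_error = td_target - q_net.predict(state[None], verbose=0)[0][action]
    return abs(td_error) + eps  # eps keeps every transition sampleable
```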

The code below shows the job that is executed on the Azure ML cluster once it has been converted into a Ray cluster. It is run using AML in job mode; for the full details of the AML job, see the code here (https://github.com/sriksmachi/supercabs/blob/master/run_experiment.py).
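Since the embedded snippet is not reproduced here, the sketch below shows what the training driver on the head node might look like: it registers the custom environment and launches RLlib’s Ape-X DQN (registered under the name "APEX" in the Ray 1.x releases this article is based on) with the same 10 workers and 3 training iterations used in the experiment. The module name contoso_cabs_env is hypothetical; see run_experiment.py in the repository for the actual job code.

```python
# Illustrative training driver for the head node; the real script lives in the
# supercabs repository and may differ in configuration details.
from ray import tune
from ray.tune.registry import register_env

from contoso_cabs_env import ContosoCabsEnv  # hypothetical module with the env sketched above

# Make the custom environment known to RLlib under a string id.
register_env("ContosoCabs-v0", lambda env_config: ContosoCabsEnv(env_config))

analysis = tune.run(
    "APEX",                              # Ape-X DQN trainer shipped with RLlib (Ray 1.x)
    config={
        "env": "ContosoCabs-v0",
        "num_workers": 10,               # 10 rollout actors, as in the experiment
        "num_gpus": 0,
        "framework": "tf",
    },
    stop={"training_iteration": 3},      # 3 training iterations, as in the experiment
    local_dir="./ray_results",
)
print(analysis.get_best_trial("episode_reward_mean", mode="max"))
```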

Results

Using the methods explained above, the ‘Contoso Cabs’ environment was trained on a Ray cluster with 10 actors for 3 training iterations. This experiment took only ~3 minutes to run 6K episodes and converged to the maximum rewards.

The training time is reduced by 90%, and at the same time we ran 6K episodes by distributing the workload. The image below shows the maximum reward per episode across all episodes.

Rewards per episode with APEX (Image created by Author)

The video below explains the training experiments in detail. If you are new to RL, I recommend watching the video from the beginning.

Code Walkthrough: Distributed Deep RL on Azure ML using Ray’s RLLIB and Custom GYM environments.

Summary

In summary, single-machine training of RL agents for realistic environments is a time-consuming process. Distributed deep RL training improves training time and reaches convergence faster. RLLIB (Ray) is a powerful framework for production-grade distributed RL workloads, and an Azure ML cluster can be converted into a Ray cluster using ray-on-aml in just a few lines of code. Training times can be drastically reduced by leveraging the methodology described in this article.
