ElegantRL-Podracer: A Scalable and Elastic Library for Cloud-Native Deep Reinforcement Learning

XiaoYang-ElegantRL
Towards Data Science
6 min read · Dec 11, 2021

Agents as pods running a racing tournament on the cloud.

“When the storm is over, you can see my racer. I’m building a podracer!”

— by Anakin Skywalker

We are building a Podracer on the cloud, using ElegantRL.

If you want to see a “podracer” that trains a powerful deep reinforcement learning (DRL) agent in minutes, this article is for you. ElegantRL-Podracer is a cloud solution that supports millions of GPU cores to carry out massively parallel DRL training at multiple levels. Let’s dive into it!

This article by Steven Li and Xiao-Yang Liu describes our recent paper ElegantRL-Podracer: Scalable and Elastic Library for Cloud-Native Deep Reinforcement Learning, presented at NeurIPS 2021: Deep RL Workshop.

The code, documentation, and demos are available on GitHub.

What Is the Major Challenge?

Deep reinforcement learning, which balances exploration (of uncharted territory) and exploitation (of current information), has revolutionized learning and actuation in applications such as game playing and robotic control.

However, the cost of data collection remains a major challenge for wider DRL adoption in real-world problems with complex and dynamic environments. Therefore, a compelling solution is massively parallel training on hundreds or even thousands of GPUs, say millions of GPU cores.

Existing DRL frameworks/libraries either lack scalability or involve a steep learning curve.

Figure 1: A comparison of different frameworks/libraries. [Image by authors.]

Design Principles

We develop a user-friendly open-source library that exploits cloud resources to train DRL agents. The library emphasizes the following design principles:

  • Scaling-out: scalability and elasticity.
  • Efficiency: low communication overhead, massively parallel simulations, and robust agents.
  • Accessibility: light weight and customization.

Tournament-Based Ensemble Scheme

ElegantRL-Podracer employs a tournament-based ensemble scheme to orchestrate the training process on hundreds or even thousands of GPUs, scheduling the interactions between a leaderboard and a training pool with hundreds of agents (pods).

Figure 2: Tournament-based ensemble scheme. [Image by authors.]

In contrast to Evolution Strategies (ES), where a population of agents evolves over generations, our tournament-based ensemble scheme updates agents asynchronously and in parallel, decoupling population evolution from single-agent learning. The key to the tournament-based ensemble scheme is the interaction between a leaderboard and a training pool.

  1. An orchestrator instantiates a new agent (pod) and puts it into a training pool.
  2. A generator initializes an agent (pod) with networks and optimizers selected from a leaderboard.
  3. After an agent (pod) has been trained for a certain number of steps or a certain amount of time, an updater determines whether and where to insert it into the leaderboard according to its performance (a minimal sketch of this loop follows below).
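
To make the loop concrete, here is a minimal, self-contained Python sketch of the generator/updater interaction with a leaderboard. The class and helper names (Leaderboard, train_pod) and the toy "training" loop are illustrative stand-ins, not the library's actual API.

```python
import copy
import random


class Leaderboard:
    """An illustrative leaderboard: keeps the top-k agent checkpoints by score."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.entries = []  # list of (score, checkpoint) pairs, best first

    def sample_checkpoint(self):
        # Generator step: a new pod starts from one of the stored checkpoints, if any.
        return copy.deepcopy(random.choice(self.entries)[1]) if self.entries else None

    def submit(self, score, checkpoint):
        # Updater step: decide whether and where the trained agent enters the board.
        self.entries.append((score, copy.deepcopy(checkpoint)))
        self.entries.sort(key=lambda e: e[0], reverse=True)
        del self.entries[self.capacity:]


def train_pod(leaderboard, train_steps=100):
    """One pod's life cycle: initialize from the leaderboard, 'train', report back."""
    weights = leaderboard.sample_checkpoint() or [random.gauss(0, 1) for _ in range(4)]
    for _ in range(train_steps):  # stand-in for real DRL training on a GPU
        weights[random.randrange(len(weights))] += random.gauss(0, 0.1)
    score = -sum(w * w for w in weights)  # stand-in evaluation score
    leaderboard.submit(score, weights)


board = Leaderboard(capacity=5)
for _ in range(20):  # the orchestrator would launch these pods asynchronously, in parallel
    train_pod(board)
print("best score on the leaderboard:", board.entries[0][0])
```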

Intuitively, ElegantRL-Podracer is inspired by two ideas from Kevin Kelly's book Out of Control [4]: "the virtues of nested hierarchies" and "getting smart from dumb things". For low-level training, ElegantRL-Podracer realizes nested hierarchies by employing hardware-oriented optimizations. For high-level scheduling, we obtain a smart agent from hundreds of weak agents.

Cloud-Native Paradigm

ElegantRL-Podracer follows a cloud-native paradigm by realizing the development principles of microservices and containerization. We decompose the training process into five components and implement them as microservices, namely an orchestrator, a leaderboard, a worker, a learner, and an evaluator. We employ Kubernetes (K8s) as the resource manager and execute each process in separate containers and pods.
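
As a rough illustration of what "agents as pods" can look like in practice, the snippet below uses the official Kubernetes Python client to request one single-GPU agent pod from the cluster. The image name, namespace, and labels are placeholders; ElegantRL-Podracer's actual manifests and orchestration code may differ.

```python
# pip install kubernetes
from kubernetes import client, config


def launch_agent_pod(pod_name, namespace="default"):
    """Ask Kubernetes to start one single-GPU agent pod (names/image are placeholders)."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

    container = client.V1Container(
        name="agent",
        image="example.com/elegantrl-agent:latest",  # placeholder container image
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    )
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=pod_name, labels={"role": "agent"}),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)


# the orchestrator could call this once per agent it wants in the training pool
# launch_agent_pod("agent-0")
```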

Features of ElegantRL-Podracer

Scalable parallelism: The multi-level parallelism of ElegantRL-Podracer leads to high scalability.

  • Agent parallelism: The agents in the training pool train in parallel and can thus scale out to a large number. The asynchronous training of parallel agents also reduces the frequency of agent-to-agent communication.
  • Learner parallelism: Instead of using distributed SGD, an agent employs multiple learners to train the neural networks in parallel and then fuses the network parameters to obtain a resulting agent. Such model fusion through network parameters requires only low-frequency communication, since the fusion happens at the end of an epoch (see the parameter-fusion sketch after this list).
  • Worker parallelism: An agent utilizes multiple rollout workers to sample transitions in parallel.
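
The parameter fusion behind learner parallelism can be sketched in a few lines of PyTorch: each learner trains its own copy of the network, and the copies are fused through their parameters once per epoch. A plain average is used here for illustration; the fusion rule actually used by ElegantRL-Podracer may be more elaborate.

```python
import torch
import torch.nn as nn


def fuse_state_dicts(state_dicts):
    """Average the parameters of several learners' networks (simple model fusion)."""
    fused = {}
    for key in state_dicts[0]:
        fused[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return fused


# toy usage: three learners hold copies of the same architecture
learners = [nn.Linear(8, 4) for _ in range(3)]
fused_weights = fuse_state_dicts([net.state_dict() for net in learners])

result_net = nn.Linear(8, 4)
result_net.load_state_dict(fused_weights)  # parameters travel once per epoch, not per step
```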

Elastic resource allocation: Elasticity is critical for cloud-level applications, as it helps users adapt to changes in cloud resources and prevents over-provisioning and under-provisioning. ElegantRL-Podracer elastically allocates the number of agents (pods) by employing an orchestrator to monitor the available computing resources and the current training status.
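
A minimal sketch of such an elastic rule, assuming one GPU per agent pod: the orchestrator periodically reconciles the number of running agents with the GPUs currently free. The function names and the cap on agents below are illustrative, not the library's configuration.

```python
def target_num_agents(total_gpus, gpus_per_agent=1, max_agents=80):
    """How many agent pods the available GPUs can support, capped by a configured maximum."""
    return min(total_gpus // gpus_per_agent, max_agents)


def reconcile(current_agents, free_gpus, gpus_per_agent=1, max_agents=80):
    """Return how many agent pods to add (positive) or remove (negative)."""
    total_gpus = free_gpus + current_agents * gpus_per_agent
    return target_num_agents(total_gpus, gpus_per_agent, max_agents) - current_agents


# e.g., 10 single-GPU agents running and 6 GPUs idle -> scale out by 6 more agents
print(reconcile(current_agents=10, free_gpus=6))  # -> 6
```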

Cloud-oriented optimizations: ElegantRL-Podracer co-locates microservices on GPUs to accelerate the parallel computation of both data collection and model training. For data transfer and storage, ElegantRL-Podracer represents data as tensors to speed up communication and allocates the shared replay buffer in contiguous GPU memory to increase addressing speed.
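
The GPU-resident replay buffer can be sketched as a set of pre-allocated, contiguous tensors that live on one device, so that inserting and sampling transitions never crosses the CPU-GPU boundary. The field names and shapes below are illustrative, not the library's exact buffer layout.

```python
import torch


class TensorReplayBuffer:
    """Illustrative replay buffer stored as pre-allocated, contiguous tensors on one GPU."""

    def __init__(self, capacity, state_dim, action_dim, device="cuda:0"):
        self.capacity, self.ptr, self.size = capacity, 0, 0
        self.states = torch.empty((capacity, state_dim), device=device)
        self.actions = torch.empty((capacity, action_dim), device=device)
        self.rewards = torch.empty((capacity, 1), device=device)

    def add(self, states, actions, rewards):
        """Insert a batch of transitions (tensors already on the buffer's device)."""
        n = states.shape[0]
        idx = torch.arange(self.ptr, self.ptr + n, device=states.device) % self.capacity
        self.states[idx], self.actions[idx], self.rewards[idx] = states, actions, rewards
        self.ptr = (self.ptr + n) % self.capacity
        self.size = min(self.size + n, self.capacity)

    def sample(self, batch_size):
        idx = torch.randint(0, self.size, (batch_size,), device=self.states.device)
        return self.states[idx], self.actions[idx], self.rewards[idx]
```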

Experiments

Finance is a promising and challenging real-world application of DRL algorithms. We apply ElegantRL-podracer to a stock trading task as an example to show its potential in quantitative finance.

We aim to train a DRL agent that decides what to trade, at what price, and in what quantity in a stock market; the objective is to maximize the expected return and minimize the risk. We model the stock trading task as a Markov Decision Process (MDP), as in FinRL. We follow a training-backtesting pipeline and split the dataset into two sets: the data from 01/01/2016 to 05/25/2020 for training, and the data from 05/26/2020 to 05/26/2021 for backtesting.
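
For concreteness, here is a minimal, FinRL-style sketch of the stock trading MDP and the date split: the state packs cash, holdings, and prices; the action is the signed number of shares to trade per stock; and the reward is the change in total asset value. The environment class below is an illustrative stand-in, not FinRL's actual implementation.

```python
import numpy as np

TRAIN_START, TRAIN_END = "2016-01-01", "2020-05-25"   # training split
TEST_START, TEST_END = "2020-05-26", "2021-05-26"     # backtesting split


class StockTradingEnv:
    """Illustrative MDP: state = (cash, holdings, prices); action = shares to trade per stock."""

    def __init__(self, prices, initial_capital=1_000_000.0, cost_pct=0.002):
        self.prices = prices              # shape (T, n_stocks), minute-level prices
        self.initial_capital = initial_capital
        self.cost_pct = cost_pct          # 0.2% transaction cost

    def reset(self):
        self.t, self.cash = 0, self.initial_capital
        self.holdings = np.zeros(self.prices.shape[1])
        return self._state()

    def step(self, action):               # action: signed number of shares per stock
        price = self.prices[self.t]
        trade_value = action * price
        self.cash -= trade_value.sum() + np.abs(trade_value).sum() * self.cost_pct
        self.holdings += action
        prev_total = self._total(price)
        self.t += 1
        reward = self._total(self.prices[self.t]) - prev_total  # reward tracks return only
        done = self.t >= len(self.prices) - 1
        return self._state(), reward, done, {}

    def _total(self, price):
        return self.cash + (self.holdings * price).sum()

    def _state(self):
        return np.concatenate(([self.cash], self.holdings, self.prices[self.t]))
```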

The experiments were executed using NVIDIA DGX-2 servers in a DGX SuperPOD cloud, a cloud-native infrastructure.

Figure 3: Stock trading performance on NASDAQ-100 constituent stocks with minute-level data. Left: cumulative return (initial capital $1,000,000, transaction cost 0.2%). Right: training time (wall-clock time) to reach cumulative returns of 1.7 and 1.8, using the model snapshots of ElegantRL-podracer and RLlib. [Image by authors.]

All DRL agents achieve better performance than the market benchmark with respect to the cumulative return, demonstrating the algorithm's effectiveness. We observe that ElegantRL-podracer has a cumulative return of 104.743%, an annual return of 103.591%, and a Sharpe ratio of 2.20, substantially outperforming RLlib. However, ElegantRL-podracer is not as stable as RLlib during the backtesting period: it has an annual volatility of 35.357%, a maximum drawdown of -17.187%, and a Calmar ratio of 6.02. There are two possible reasons for this instability:

  1. the reward design in the stock trading environment is mainly tied to the cumulative return, leading the agent to pay less attention to risk;
  2. ElegantRL-podracer holds a large amount of funds around 2021-03, which naturally leads to larger slippage.

We compare the training performance on varying numbers of GPUs: 8, 16, 32, and 80. We measure the training time required to reach cumulative returns of 1.7 and 1.8, respectively. Both ElegantRL-podracer and RLlib require less training time to achieve the same cumulative return as the number of GPUs increases, which directly demonstrates the advantage of cloud computing resources for DRL training. With 80 GPUs, ElegantRL-podracer requires (1900s, 2200s) to reach cumulative returns of 1.7 and 1.8; with 32 and 16 GPUs, it needs (2400s, 2800s) and (3400s, 4000s), respectively, to achieve the same cumulative returns. This demonstrates the high scalability of ElegantRL-podracer and the effectiveness of our cloud-oriented optimizations. For the experiments using RLlib, increasing the number of GPUs does not lead to much speed-up.

Acknowledgement

This research used computational resources of the GPU cloud platform provided by the IDEA Research Institute.

References

[1] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. RLlib: Abstractions for distributed reinforcement learning. In ICML, pages 3053–3062. PMLR, 2018.

[2] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[3] Erik Wijmans, Abhishek Kadian, Ari S. Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. DD-PPO: Learning near-perfect PointGoal navigators from 2.5 billion frames. In ICLR, 2020.

[4] Kevin Kelly. Out of control: The rise of neo-biological civilization. Addison-Wesley Longman Publishing Co., Inc., 1994.
