OpenAI’s MADDPG Algorithm

An actor-critic approach to multi-agent RL problems

Austin Nguyen
Towards Data Science

A New Approach

Multi-agent reinforcement learning is an ongoing, rich field of research. However, naively applying single-agent algorithms in multi-agent contexts puts us in a pickle: learning becomes difficult for many reasons, most notably:

  • The non-stationarity between independent agents
  • The exponential increase in state and action space

Researchers have proposed plenty of approaches to mitigate the effects of these challenges. A large subset of these methods falls under the umbrella of “centralized planning with decentralized execution.”

Centralized Planning

Each agent only has direct access to local observations. These observations can be many things: an image of the environment, relative positions to landmarks, or even relative positions of other agents. Also, during learning, all agents are guided by a centralized module or critic.

Even though each agent only has local information and local policies to train, there is an entity overseeing the entire system of agents, advising them on how to update their policies. This reduces the effect of non-stationarity: all agents learn with the help of a module that has global information.

Decentralized Execution

Then, during testing, the centralized module is removed, leaving only the agents, their policies, and their local observations. This reduces the detriment of an increasing state and action space because joint policies are never explicitly learned. Instead, we hope that the centralized module provided enough guidance during training that each local policy is optimal for the entire system once test time comes around.
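To make the split concrete, here is a minimal sketch of what decentralized execution looks like at test time. The env and agent.act(obs) interfaces are hypothetical placeholders, not part of the paper:

```python
# Minimal sketch of decentralized execution (hypothetical interfaces).
# At test time there is no centralized critic: each agent maps only its
# own local observation to its action.

def run_episode(env, agents, max_steps=100):
    observations = env.reset()  # one local observation per agent
    for _ in range(max_steps):
        # Each agent acts independently, using only its own observation.
        actions = [agent.act(obs) for agent, obs in zip(agents, observations)]
        observations, rewards, done = env.step(actions)
        if done:
            break
```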

OpenAI

Researchers at OpenAI, UC Berkeley, and McGill University introduced a novel approach to multi-agent settings: Multi-Agent Deep Deterministic Policy Gradient (MADDPG). Inspired by its single-agent counterpart, DDPG, this approach uses actor-critic style learning and has shown promising results.

Architecture

We assume familiarity with the single-agent version of MADDPG: Deep Deterministic Policy Gradient (DDPG). For a quick refresher, Chris Yoon has a fantastic article giving an overview of it.

Every agent has an observation space and continuous action space. Also, each agent has three components:

  • An actor network that maps local observations to deterministic actions
  • A target actor network with identical functionality, used for training stability
  • A critic network that uses joint state-action pairs to estimate Q-values

As the critic learns the joint Q-value function over time, it sends appropriate Q-value approximations to the actor to help training. We’ll take a more in-depth look at this interaction in the next section.

Keep in mind that the critic can be a shared network between all N agents. In other words, instead of training N networks that estimate the same value, simply train one network and use it to help all of the actors learn. The same applies to the actor networks if the agents are homogeneous.
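As a rough illustration, here is what a single agent’s actor and centralized critic might look like in PyTorch. The layer widths and the two-hidden-layer structure are placeholders, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps one agent's local observation to a deterministic continuous action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # bound actions to [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)


class CentralizedCritic(nn.Module):
    """Estimates Q(x, a_1, ..., a_N) from the joint observation and joint action."""
    def __init__(self, joint_obs_dim, joint_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_obs, joint_act):
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))
```

The target actor is simply a second copy of Actor whose parameters track the trained one, as described above.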

MADDPG Architecture (Lowe, 2018)

Learning

First, MADDPG uses an experience replay buffer for efficient off-policy training. At each timestep, the agent stores the following transition:

$$(\mathbf{x}, \mathbf{x}', a_1, \dots, a_N, r_1, \dots, r_N)$$

where we store the joint state, next joint state, joint action, and each of the agents’ received rewards. Then, we sample a batch of these transitions from the experience replay to train our agent.
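A bare-bones version of such a joint replay buffer might look like the following sketch (a plain Python deque of tuples; the field names are my own):

```python
import random
from collections import deque

class JointReplayBuffer:
    """Stores one joint transition per timestep: (x, a_1..a_N, r_1..r_N, x')."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, joint_obs, joint_actions, rewards, next_joint_obs):
        # joint_obs, joint_actions, rewards are lists with one entry per agent.
        self.buffer.append((joint_obs, joint_actions, rewards, next_joint_obs))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        # Regroup into per-field tuples: (all joint_obs, all joint_actions, ...).
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)
```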

Critic Updates

To update an agent’s centralized critic, we minimize a one-step lookahead TD error:

$$\mathcal{L}(\theta_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}'}\Big[\big(Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \dots, a_N) - y\big)^2\Big], \qquad y = r_i + \gamma\, Q_i^{\boldsymbol{\mu}'}\big(\mathbf{x}', a_1', \dots, a_N'\big)\Big|_{a_j' = \mu_j'(o_j)}$$

where $\boldsymbol{\mu}$ denotes the actors and $\boldsymbol{\mu}'$ the target actors. Keep in mind that this is a centralized critic, meaning it uses joint information to update its parameters. The primary motivation is that, if we know the actions taken by all agents, the environment is stationary even as policies change.

Notice the calculation of the target Q-value on the right. Even though we never explicitly store the next joint action, we use each agent’s target actor to compute it during the update, which helps training stability. The target actor’s parameters are updated periodically to match the agent’s actor parameters.
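In code, agent i’s critic update could be sketched roughly as below. It assumes PyTorch, the CentralizedCritic from the earlier sketch plus a target copy of it, one target actor per agent, and a batch already sampled from the replay buffer; none of this is the authors’ implementation:

```python
import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actors, critic_optim,
                  obs, actions, rewards_i, next_obs, gamma=0.95):
    """One TD update for agent i's centralized critic.

    obs, next_obs: lists of per-agent observation batches
    actions:       list of per-agent action batches
    rewards_i:     agent i's rewards, shape (batch, 1)
    """
    with torch.no_grad():
        # The next joint action comes from every agent's *target* actor.
        next_actions = [mu_t(o) for mu_t, o in zip(target_actors, next_obs)]
        q_next = target_critic(torch.cat(next_obs, dim=-1),
                               torch.cat(next_actions, dim=-1))
        y = rewards_i + gamma * q_next  # one-step lookahead TD target

    q = critic(torch.cat(obs, dim=-1), torch.cat(actions, dim=-1))
    loss = F.mse_loss(q, y)

    critic_optim.zero_grad()
    loss.backward()
    critic_optim.step()
```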

Actor Updates

Similar to single-agent DDPG, we use the deterministic policy gradient to update each agent’s actor parameters:

$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}}\Big[\nabla_{\theta_i}\mu_i(a_i \mid o_i)\, \nabla_{a_i} Q_i^{\boldsymbol{\mu}}(\mathbf{x}, a_1, \dots, a_N)\Big|_{a_i = \mu_i(o_i)}\Big]$$

where $\mu_i$ denotes the i-th agent’s actor and $\mathcal{D}$ the replay buffer.

Let’s dig into this update equation just a little bit. We take the gradient with respect to the actor’s parameters, using a central critic to guide us. The most important thing to notice is that even though the actor only sees local observations and actions, we use a centralized critic during training to provide information about the optimality of its actions for the entire system. This reduces the effects of non-stationarity while keeping each policy’s input space small!
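A matching sketch of the actor update (same hypothetical names as before): agent i’s action is recomputed with its current actor so the gradient can flow through the centralized critic, while the other agents’ actions stay fixed at the values sampled from the buffer.

```python
import torch

def actor_update(i, actor_i, critic_i, actor_optim, obs, actions):
    """Deterministic policy gradient step for agent i.

    obs, actions: lists of per-agent observation / action batches
    """
    # Recompute agent i's action with its current actor; keep the other
    # agents' actions fixed as sampled from the replay buffer.
    joint_actions = [a.detach() for a in actions]
    joint_actions[i] = actor_i(obs[i])

    q = critic_i(torch.cat(obs, dim=-1), torch.cat(joint_actions, dim=-1))
    loss = -q.mean()  # ascend the Q-value by descending its negative

    actor_optim.zero_grad()
    loss.backward()
    actor_optim.step()
```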

Policy Inference and Policy Ensembles

We can take decentralization one step further. In earlier critic updates, we assumed each agent automatically knew other agents’ actions. However, MADDPG suggests inferring other agents’ policies to make learning even more independent. In effect, each agent adds N-1 more networks to estimate the true policy of each of the other agents. We use a probabilistic network and maximize the log probability of outputting another agent’s observed action.

$$\mathcal{L}(\phi_i^j) = -\,\mathbb{E}_{o_j, a_j}\Big[\log \hat{\mu}_i^j(a_j \mid o_j) + \lambda H\big(\hat{\mu}_i^j\big)\Big]$$

where $\hat{\mu}_i^j$ is the i-th agent’s estimate of the j-th agent’s policy and $H$ is an entropy regularizer. As a result, the Q-value target becomes slightly different: we replace the other agents’ actions with the actions predicted by these inferred policies.

So, what exactly have we done? We’ve removed the assumption that agents know each other’s policies. Instead, we train each agent to predict the other agents’ policies from their observed behavior. In effect, each agent learns independently by extracting global information from the environment instead of automatically having it on hand.
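One way to train such an inference network is to maximize the log-likelihood of the other agent’s observed actions plus an entropy bonus. Here is a rough sketch that models the inferred policy as a small Gaussian network built on torch.distributions; the class and function names are my own:

```python
import torch
import torch.nn as nn

class InferredPolicy(nn.Module):
    """Agent i's probabilistic estimate of agent j's policy."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs_j):
        return torch.distributions.Normal(self.mean(obs_j), self.log_std.exp())


def inference_loss(inferred, obs_j, act_j, entropy_coef=1e-3):
    """Negative log-likelihood of agent j's observed actions, with an entropy bonus."""
    d = inferred.dist(obs_j)
    log_prob = d.log_prob(act_j).sum(dim=-1)
    entropy = d.entropy().sum(dim=-1)
    return -(log_prob + entropy_coef * entropy).mean()
```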

Policy Ensembles

There’s one big issue with the approach above. In many multi-agent settings, especially competitive ones, agents can craft policies that overfit to other agents’ behaviors. This makes the learned policies brittle, unstable, and typically suboptimal. To compensate, MADDPG trains a collection of K sub-policies for each agent. At the start of each episode, an agent randomly selects one of its sub-policies and executes it for that episode.

The policy gradient is modified slightly: we average over the K sub-policies, use linearity of expectation, and propagate updates through the Q-value function:

$$\nabla_{\theta_i^{(k)}} J_e(\mu_i) = \frac{1}{K}\,\mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}_i^{(k)}}\Big[\nabla_{\theta_i^{(k)}}\mu_i^{(k)}(a_i \mid o_i)\, \nabla_{a_i} Q^{\mu_i}(\mathbf{x}, a_1, \dots, a_N)\Big|_{a_i = \mu_i^{(k)}(o_i)}\Big]$$

where $\mathcal{D}_i^{(k)}$ is a separate replay buffer maintained for each sub-policy $\mu_i^{(k)}$.
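Operationally, the ensemble machinery is simple: each agent keeps K actors (and, in the paper, a separate replay buffer per sub-policy) and draws one uniformly at random for each episode. A minimal sketch of the selection step, with hypothetical actor objects:

```python
import random

class EnsembleAgent:
    """Keeps K sub-policies and samples one uniformly per episode."""
    def __init__(self, sub_policies):
        self.sub_policies = sub_policies  # list of K actor networks
        self.current = None

    def start_episode(self):
        # Uniformly pick which sub-policy will act for this whole episode.
        self.current = random.choice(self.sub_policies)

    def act(self, obs):
        return self.current(obs)
```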

Take a Step Back

That outlines the entire algorithm! At this point, it’s important to take a step back and internalize what exactly we’ve done and intuitively understand why it works. In essence, we’ve done the following things:

  • Defined actors for agents that only use local observations. This helps curb the effects of an exponentially increasing state and action space.
  • Defined a centralized critic for each agent that uses joint information. This helps reduce the effects of non-stationarity and guides the actor toward actions that are optimal for the global system.
  • Defined policy inference networks to estimate other agents’ policies. This helps limit agent interdependence and removes the need for agents to have perfect information.
  • Defined policy ensembles to reduce the possibility and effects of overfitting to other agents’ policies.

Every component of the algorithm serves a specific, delegated purpose. This is what makes MADDPG a powerful algorithm: its various components are meticulously designed to overcome big obstacles multi-agent systems usually have in spades. Next, we take a look at the algorithm’s performance.

Results

MADDPG was tested in many environments. For the full overview of its performance, feel free to check out the paper [1]. Here, we’ll only discuss the cooperative communication task.

Environment Overview

Here, there are two agents: a speaker and a listener. During each episode, the listener is given a colored landmark to travel to and receives a reward that grows the closer it gets to that landmark. Here’s the catch: the listener only knows its position relative to the landmarks and their colors. It doesn’t know which landmark it’s supposed to travel to. On the other hand, the speaker knows the color of the correct landmark for this episode. As a result, the two agents must communicate and collaborate to solve the task.
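As a toy illustration of the listener’s reward signal, something like the negative distance to the correct landmark captures the idea (a simplified stand-in, not the environment’s exact reward function):

```python
import numpy as np

def listener_reward(listener_pos, target_landmark_pos):
    # Closer to the correct landmark => higher (less negative) reward.
    diff = np.asarray(listener_pos) - np.asarray(target_landmark_pos)
    return -float(np.linalg.norm(diff))
```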

Comparisons

For this task, the paper pits MADDPG against state-of-the-art single-agent methods. We can see a significant improvement with the use of MADDPG.

It was also shown that policy inference, even though the inferred policies did not fit the true ones perfectly, achieved the same success rate as using the true policies. Even better, there was no significant slowdown in convergence.

Lastly, policy ensembles showed promising results. The paper [1] tested the effect of ensembles in competitive environments and demonstrated significantly better performance than agents with only a single policy.

Closing Notes

And that’s it! Here we overviewed a novel approach to multi-agent reinforcement learning problems. Of course, there’s an endless sea of methods under the “MARL umbrella,” but MADDPG provides a strong starting point for approaches that tackle multi-agent systems’ biggest problems.

References

[1] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, I. Mordatch, Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments (2018).
