
Motivation: Sample Efficiency in Deep Reinforcement Learning Agents
Training deep reinforcement learning agents requires significant trial and error for the agent to learn robust policies that accomplish one or more tasks in its environment. For these applications, the agent is typically only told whether its behavior results in a large or small reward; therefore, the agent must indirectly learn both the behavior and the value of certain actions and states. As one can imagine, this typically requires the agent to experiment with its behavior and estimates for quite a while.
Generating these experiences can be difficult, time-consuming, and expensive, particularly for real-life applications such as humanoid robotics. Therefore, a question many roboticists and Machine Learning researchers have been considering is: "How can we minimize the number of experiences we need to generate in order to successfully train robust and high-performing agents?"
Just getting started with deep reinforcement learning? Check out this fantastic intro from OpenAI.

Enter: Experience Replay
As you probably suspected, this is where experience replay comes in! The idea behind this crucial technique is simple – rather than regenerating new experiences each time we want to train the agent, why don’t we continue learning from experiences we already have available?
Experience replay is a crucial component of off-policy deep Reinforcement Learning algorithms, improving the sample efficiency and stability of training by storing the previous environment interactions experienced by an agent [1].
What Do Experience Replay Buffers Store?
To answer this question, we first need to visit the common implementations of "experiences" in deep reinforcement learning:
Representing Training Experience as Transitions and Rollouts
In reinforcement learning, experiences are represented as transitions and rollouts, the latter of which is a sequence of temporally contiguous transitions. These transitions, in their most general form, are composed of a quintuple of features/signals given to the agent as a training sample:
- State (s): This represents the information available to the agent that can be used for action selection. You can think of s as the representation of the world that the agent is able to observe. Typically, the agent cannot observe the true state of the world, just a subset of it.
- Actions (a): These represent the choices, either discrete or continuous, the agent can make as it interacts with its environment. The agent’s choice of action will typically impact both its next state s’ and reward r.
- Reward (r): This represents the reward given to the agent for taking action a in state s, or, in some cases, simply for observing/being in the state that results from taking action a.
- Next State (s'): This represents the state the agent transitions to after being in state s and taking action a.
- Done Signals (d): These are binary signals representing whether the current transition represents the final transition in a given rollout/episode. These are not necessarily used for all environments – when they are not needed (e.g. environments with no termination conditions), they can simply always be set to 1 or 0.
Written symbolically, these transition quintuples T are given as:

T = (s, a, r, s', d)
And a rollout, also known as an episode or trajectory and denoted R, is given as an ordered sequence of N transitions:

R = (T_1, T_2, …, T_N)
These transitions T and rollouts R are the primary representations of how an agent’s experiences are stored. The transitions themselves are produced by the environment’s transition function, which we denote τ. The domain of τ is given by the Cartesian product of the agent’s state and action spaces (S × A, respectively), and the codomain is given by the Cartesian product of the reward and state spaces (R × S). Mathematically, when it is deterministic, the transition function is defined by:

τ : S × A → R × S,  τ(s, a) = (r, s')
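As a concrete illustration, a transition can be represented as a simple container whose fields mirror the quintuple above. The class and field names below are my own, chosen for illustration rather than drawn from any particular library:

```python
from typing import List, NamedTuple
import numpy as np

class Transition(NamedTuple):
    """One (s, a, r, s', d) quintuple experienced by the agent."""
    state: np.ndarray       # observation s the agent acted from
    action: np.ndarray      # action a the agent selected
    reward: float           # scalar reward r received
    next_state: np.ndarray  # resulting observation s'
    done: bool              # d = True if this transition ends the episode

# A rollout/episode is then simply an ordered list of transitions.
Rollout = List[Transition]
```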
How Do We Implement Experience Replay?

Experience replay is typically implemented as a circular, first-in-first-out (FIFO) replay buffer (think of it as a database storing our agent’s experiences). We use the following definitions for categorizing our experience replay buffers [1]:
- Replay Capacity: The maximum number of transitions that can be stored in the replay buffer.
- Age of a Transition: The number of gradient steps taken by the learner since the transition was generated. The age of the oldest policy represented in the buffer corresponds to the age of the oldest transition in the buffer.
- Replay Ratio: The number of gradient updates per environment transition. Provided the agent can continue to learn stable policies, behaviors, and skills by training on the same sets of experiences repeatedly, a higher replay ratio can be helpful for improving the sample efficiency of off-policy reinforcement learning agents.
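Putting the circular FIFO idea above into code, here is a minimal, uniform-sampling sketch. The class and method names are illustrative and not drawn from any specific library:

```python
import random

class ReplayBuffer:
    """Minimal circular (FIFO) experience replay buffer."""

    def __init__(self, capacity: int):
        self.capacity = capacity      # replay capacity: max transitions stored
        self.storage = []             # underlying transition storage
        self.position = 0             # next write index (wraps around)

    def add(self, transition) -> None:
        """Insert a transition, overwriting the oldest one once full."""
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size: int):
        """Draw a uniform random minibatch of stored transitions."""
        return random.sample(self.storage, batch_size)

    def __len__(self) -> int:
        return len(self.storage)
```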
How Do We Train with Experience Replay?

We’ve talked about how to describe replay buffers, but how do they work? In short, replay buffers "replay" experiences for an agent, allowing it to revisit and train on its memories. Intuitively, this lets agents "reflect" on and "learn" from their previous mistakes. As the saying goes, we learn from the mistakes we make, and this is certainly true for experience replay.
Experience replay buffers are typically applied to off-policy reinforcement learning algorithms: they capture the samples generated by an agent interacting with its environment and store them for later reuse. Crucially, since the agent is off-policy (the policy being trained differs from the behavior policy that generated the data), the samples replayed to the agent need not follow a sequential order.
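Putting these pieces together, a minimal off-policy training loop with experience replay might look like the sketch below. It reuses the illustrative Transition and ReplayBuffer classes from earlier, and assumes a classic Gym-style `env` plus a placeholder `agent` with `act` and `update` methods; all of these names are assumptions for illustration, not a specific library’s API:

```python
# Illustrative off-policy training loop with experience replay.
# Assumes the Transition and ReplayBuffer sketches above, a classic
# Gym-style `env`, and a placeholder `agent` with act()/update() methods.
buffer = ReplayBuffer(capacity=100_000)
replay_ratio = 1            # gradient updates per environment step
batch_size = 256
total_env_steps = 1_000_000

state = env.reset()
for step in range(total_env_steps):
    action = agent.act(state)                          # behavior policy
    next_state, reward, done, info = env.step(action)
    buffer.add(Transition(state, action, reward, next_state, done))
    state = env.reset() if done else next_state

    if len(buffer) >= batch_size:
        for _ in range(replay_ratio):                  # replay-ratio updates
            batch = buffer.sample(batch_size)          # order-free minibatch
            agent.update(batch)                        # off-policy gradient step
```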
What are some libraries I can use for Replay Buffers?
Functionality for implementing experience replay can be found in many popular Python reinforcement learning libraries, such as:
- TensorFlow Agents (Replay buffer page)
- Ray RLLib (Replay buffer API)
- Stable-Baselines (Using a replay buffer with Soft Actor-Critic)
- Spinning Up (Home page)
- Keras-RL (Home page)
- Tensorforce (Replay buffer page)
Many of these libraries implement replay buffers modularly, allowing different replay buffers to be paired with different reinforcement learning algorithms.
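For instance, with Stable-Baselines3, an off-policy algorithm such as SAC manages its replay buffer internally, and the replay capacity is set through a constructor argument. The snippet below assumes Stable-Baselines3 and a standard Gym environment such as Pendulum-v1 are installed, and the hyperparameter values are purely illustrative:

```python
from stable_baselines3 import SAC

# SAC is off-policy, so it maintains an internal replay buffer.
# buffer_size sets the replay capacity; the values here are illustrative.
model = SAC("MlpPolicy", "Pendulum-v1", buffer_size=100_000, batch_size=256)
model.learn(total_timesteps=50_000)
```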
Recent Advances in Experience Replay

Significant advances have been made that build upon the foundations of experience replay to further improve sample efficiency and robustness of reinforcement learning agents. These advances can largely be categorized into two topics:
(i) Determining sample selection
(ii) Generating new training samples
Each of these, along with a sample of corresponding examples from literature, is discussed below.
Determining Sample Selection (PER, LFIW, HER, ERO)
One way for an experience replay buffer to explicitly manage training for a reinforcement learning agent is to give it control over which experiences are replayed for the agent. Some literature examples include:
- Prioritized Experience Replay (PER) [4]: Assigns a numeric "prioritization" value to each experience according to how much "surprise" the agent would receive from learning from it. Essentially, the more "surprise" (typically encoded as TD error) a sample has, the greater its prioritization weight (a simplified sampling sketch follows this list).
- Likelihood-Free Importance Weights (LFIW) [5]: Like PER, LFIW assigns prioritization weights to experiences; however, it reweights experiences based on their likelihood under the current policy. To balance bias and variance, LFIW uses a likelihood-free density ratio estimator between on-policy and off-policy experiences, and this ratio is in turn used as the prioritization weight.
- Hindsight Experience Replay (HER) [6]: Addresses issues associated with sparse-reward environments by storing transitions not only with the original goal used for a given episode, but also with a subset of other goals for the RL agent.
- Experience Replay Optimization (ERO) [7]: Learns a separate neural network for determining which samples to select from the replay buffer. Therefore, in addition to the underlying agent’s neural networks (typically actor and critic networks), this architecture also trains a neural network that determines sample selection for the other learners.
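To make the PER idea concrete, here is a simplified sketch of proportional prioritization. It omits the sum-tree data structure and importance-sampling correction used in [4], and the function and variable names are my own:

```python
import numpy as np

def sample_prioritized(td_errors: np.ndarray, batch_size: int,
                       alpha: float = 0.6, eps: float = 1e-6) -> np.ndarray:
    """Sample transition indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha   # larger surprise -> larger priority
    probs = priorities / priorities.sum()             # normalize into a distribution
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

# Example: transitions with larger TD error are sampled more often.
indices = sample_prioritized(np.array([0.1, 2.0, 0.5, 4.0]), batch_size=2)
```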
These approaches all control how samples are selected for training agents, and in turn improve overall training for deep reinforcement learning agents. Rather than just supplying the agent with a random set of experiences to train on (and subsequently optimize over using gradient-based methods), these experience replay variants learn, either heuristically or through additional gradient optimization, which samples to provide the agent in order to maximize learning. The replay buffer not only stores and supplies the metaphorical books and lessons used to teach the agent, but is actually in charge of deciding which books and lessons to give the agent at the right times.
Generating New Training Samples (CT, S4RL, NMER)
Another class of experience replay buffers focuses on generating novel samples for an agent to train using existing samples. Some literature examples include:
- Continuous Transition (CT) [8]: Performs data augmentation for reinforcement learning agents in continuous control environments by interpolating adjacent transitions along a trajectory using Mixup [9], a stochastic linear recombination technique (see the sketch after this list).
- Surprisingly Simple Self-Supervised RL (S4RL) [10]: Proposes, implements, and evaluates seven different augmentation schemes and studies how they behave with existing offline RL algorithms. These augmentation mechanisms help to smooth out the state space of the deep reinforcement learning agent.
- Neighborhood Mixup Experience Replay (NMER) (Disclaimer: my research) [11]: Like CT, NMER recombines nearby samples with Mixup to generate new samples. However, rather than combining temporally adjacent samples, NMER combines nearest-neighbor samples in the (state, action) space according to a provided distance metric.
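To illustrate the Mixup-style interpolation underlying CT and NMER, here is a simplified sketch that linearly recombines two transitions. It reuses the illustrative Transition container from earlier; the real methods add constraints on which transition pairs may be mixed (temporal adjacency for CT, nearest neighbors for NMER):

```python
import numpy as np

def mixup_transitions(t1: Transition, t2: Transition, alpha: float = 0.8) -> Transition:
    """Linearly interpolate two transitions with a Beta-distributed coefficient."""
    lam = np.random.beta(alpha, alpha)   # interpolation coefficient in [0, 1]
    mix = lambda x, y: lam * np.asarray(x) + (1.0 - lam) * np.asarray(y)
    return Transition(
        state=mix(t1.state, t2.state),
        action=mix(t1.action, t2.action),
        reward=float(mix(t1.reward, t2.reward)),
        next_state=mix(t1.next_state, t2.next_state),
        done=t1.done,                    # done flags are not interpolated here
    )
```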
The Future of Experience Replay (My Outlook)
Originally, replay buffers were only tasked with storing the experiences of an agent, and had little control over which samples the agent used to improve its policy and value functions, or how it used them. However, as new experience replay techniques emerge, the replay buffer is gaining an increasingly important role not just as an experience storage mechanism for reinforcement learning agents, but as a trainer and sample generator for the agent. From the techniques referenced above, as well as many more, here are several directions I believe experience replay is heading in.
- Interpolated Experiences (Disclaimer: This was my main area of research for my Master’s Thesis.) – Using existing experiences, replay buffers will augment a reinforcement learning agent’s set of experiences available for training, leading to more robust policies and decision-making.
- Low-bias, low-variance sample selection – Replay buffers will continue to improve how samples are selected, ensuring that the distribution of experiences they implicitly teach helps the agent learn a realistic representation of the environment and the transition function/manifold it interacts with.
- Neural Experience Replay – As seen with approaches such as ERO, some mechanisms in experience replay (e.g., sample selection) can themselves be learned and implemented as neural networks! As experience replay approaches continue to mature and become more sophisticated, I believe we will see continued integration of different neural network architectures (MLPs, CNNs, GNNs, and Transformers).
Thank you for reading! To see more on computer vision, reinforcement learning, and robotics, please follow me.
References
[1] Fedus, William, et al. "Revisiting fundamentals of experience replay." International Conference on Machine Learning. PMLR, 2020.
[2] Brockman, Greg, et al. "OpenAI Gym." arXiv preprint arXiv:1606.01540 (2016).
[3] Todorov, Emanuel, Tom Erez, and Yuval Tassa. "MuJoCo: A physics engine for model-based control." 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012.
[4] Schaul, Tom, et al. "Prioritized Experience Replay." ICLR (Poster). 2016.
[5] Sinha, Samarth, et al. "Experience replay with likelihood-free importance weights." Learning for Dynamics and Control Conference. PMLR, 2022.
[6] Andrychowicz, Marcin, et al. "Hindsight experience replay." Advances in neural information processing systems 30 (2017).
[7] Zha, Daochen, et al. "Experience Replay Optimization." IJCAI. 2019.
[8] Lin, Junfan, et al. "Continuous transition: Improving sample efficiency for continuous control problems via mixup." 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021.
[9] Zhang, Hongyi, et al. "mixup: Beyond Empirical Risk Minimization." International Conference on Learning Representations. 2018.
[10] Sinha, Samarth, Ajay Mandlekar, and Animesh Garg. "S4RL: Surprisingly simple self-supervision for offline reinforcement learning in robotics." Conference on Robot Learning. PMLR, 2022.
[11] Sander, Ryan, et al. "Neighborhood Mixup Experience Replay: Local Convex Interpolation for Improved Sample Efficiency in Continuous Control Tasks." Learning for Dynamics and Control Conference. PMLR, 2022.