Value-based Methods in Deep Reinforcement Learning

Deep reinforcement learning has been a rapidly growing field over the last few years. A good approach to start with is the value-based method, where state (or state-action) values are learned. In this post, a comprehensive review is provided, focusing on Q-learning and its extensions.

Dr Barak Or
Towards Data Science



A Short Introduction to Reinforcement Learning (RL)

There are three common types of machine learning approaches: 1) supervised learning, where a learning system learns a latent mapping from labeled examples; 2) unsupervised learning, where a learning system models the data distribution from unlabeled examples; and 3) reinforcement learning, where a decision-making system is trained to make optimal decisions. From the designer’s point of view, all three kinds of learning are supervised by some objective: the source of supervision must be defined by a human, and one common way to provide it is through a loss function.

Image by author

In supervised learning, ground-truth labels are provided. In RL, by contrast, we teach an agent by letting it explore an environment: we must design the world in which the agent tries to solve a task, and this design is part of the RL problem. A formal definition of the RL framework is given in [1]:

an agent acting in an environment. At every point in time, the agent observes the state of the environment and decides on an action that changes the state. For each such action, the agent is given a reward signal. The agent’s role is to maximize the total received reward.

RL Diagram (image by author)

So, how does it work?

RL is a framework for learning to solve sequential decision-making problems by trial and error in a world that provides occasional rewards. The task is to decide, from experience, which sequence of actions to perform in an uncertain environment in order to achieve some goal. Inspired by behavioral psychology, reinforcement learning (RL) proposes a formal framework for this problem: an artificial agent learns by interacting with its environment and, using the experience gathered, optimizes an objective given in the form of cumulative rewards. This approach applies in principle to any sequential decision-making problem relying on past experience; the environment may be stochastic, the agent may observe only partial information about the current state, and so on.

Why go deep?

Over the past few years, RL has become increasingly popular due to its success in addressing challenging sequential decision-making problems. Several of these achievements are due to the combination of RL with deep learning techniques. For instance, a deep RL agent can successfully learn from visual perceptual inputs made up of thousands of pixels (Mnih et al. [2]).

As Lex Fridman said:

“One of the most exciting fields in AI. It’s merging the power and the capabilities of deep neural networks to represent and comprehend the world with the ability to act and then understanding the world”.

It has solved a wide range of complex decision-making tasks that were previously out of reach for a machine. Deep RL opens up many new applications in healthcare, robotics, smart grids, finance, and more.

Types of RL

Value-based: learn the state or state-action values; act by choosing the best action in the current state; exploration is necessary.
Policy-based: directly learn the stochastic policy function that maps states to actions; act by sampling from the policy.
Model-based: learn a model of the world, then plan using that model; update the model and re-plan often.

Mathematical background

We now focus on value-based methods, which belong to the family of “model-free” approaches; more specifically, we will discuss the DQN method, which builds on Q-learning. For that, we first quickly review some necessary mathematical background.


Let’s define some mathematical quantities:

1. Expected return

An RL agent’s goal is to find a policy that maximizes the expected return, captured by the V-value function, where E is the expected value operator, gamma is the discount factor, and pi is a policy. The optimal expected return is obtained by maximizing over all policies: the optimal V-value function is the expected discounted return when, from a given state s, the agent follows the optimal policy pi* thereafter.
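Since the notation here is standard, these two quantities can be written (following [1]) as:

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k}\,r_{t+k}\;\middle|\;s_{t}=s,\ \pi\right], \qquad V^{*}(s) = \max_{\pi} V^{\pi}(s).$$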

2. Q value

There are more functions of interest. One of them is the quality (Q) value function, defined analogously to the V-function but conditioned on both a state and an action. Similarly to the V-function, the optimal Q value is obtained by maximizing over policies. The optimal Q-value is the expected discounted return when, in a given state s and for a given action a, the agent follows the policy pi* thereafter, and the optimal policy can be obtained directly from this optimal value.
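In the same notation, the Q-function, its optimum, and the greedy policy extracted from it are:

$$Q^{\pi}(s,a) = \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k}\,r_{t+k}\;\middle|\;s_{t}=s,\ a_{t}=a,\ \pi\right], \qquad Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a), \qquad \pi^{*}(s) = \operatorname*{argmax}_{a} Q^{*}(s,a).$$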

3. Advantage function

We can relate the last two functions through the advantage: it describes “how good” action a is compared to the expected return obtained by simply following the policy pi.
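Written out, the advantage is simply the gap between the two functions defined above:

$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s).$$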

4. Bellman equation

To learn the Q value, the Bellman equation is used. It guarantees a unique solution Q*, obtained as the fixed point of the Bellman operator B. Convergence to this optimal value is guaranteed when state-action pairs are represented discretely and all actions are repeatedly sampled in all states.
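Written out, with P(s′|s, a) the (possibly unknown) transition probabilities and K an arbitrary Q-function estimate, the fixed-point condition and the Bellman operator take the standard form:

$$Q^{*}(s,a) = (\mathcal{B}Q^{*})(s,a), \qquad (\mathcal{B}K)(s,a) = \sum_{s'} P(s'\mid s,a)\Big(r(s,a,s') + \gamma \max_{a'} K(s',a')\Big).$$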

Q-Learning

Q-learning is an off-policy method: it learns the value of taking an action in a state, and the learned Q values then determine how to act in the world. We define the state-action value function as the expected return when starting in s, performing a, and following pi thereafter, and we represent it in tabulated form. In Q-learning, the agent can use any behavior policy to estimate the Q function that maximizes future reward; Q directly approximates Q* as the agent keeps updating every state-action pair.

For non-deep learning approaches, this Q function is just a table:

Image by author

In this table, each element is a value estimate that is updated during training so that, at steady state, it converges to the expected discounted return, i.e., the Q* value. A minimal tabular update is sketched below; as the Breakout example that follows shows, maintaining such a table is impractical in real-world scenarios.
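A minimal sketch of the tabular update, with illustrative (hypothetical) sizes and hyperparameters rather than any real game’s dimensions:

```python
import numpy as np

n_states, n_actions = 16, 4          # illustrative sizes, not a real game's
alpha, gamma = 0.1, 0.99             # learning rate and discount factor
Q = np.zeros((n_states, n_actions))  # the Q-table: one row per state, one column per action

def q_learning_update(s, a, r, s_next, done):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def greedy_action(s):
    """Act by choosing the best action in the current state."""
    return int(np.argmax(Q[s]))
```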

Image by Google

In the Breakout game, the state consists of screen pixels: four consecutive 84x84 images with 256 gray levels per pixel. Hence, the number of rows in the Q-table is:
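Writing the count out with the numbers above (four consecutive 84x84 frames, 256 gray levels per pixel), the number of distinct states is:

$$256^{\,84\times 84\times 4} = 256^{28224} \approx 10^{67{,}970}.$$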

Just to put this in perspective, there are about 10⁸² atoms in the universe. This is a good reason to tackle problems like the Breakout game with deep reinforcement learning…

DQN: Deep Q-Networks

We use a neural network with parameters theta to approximate the Q function, Q(s, a; theta) ≈ Q*(s, a). A neural network is a good function approximator, and DQN used one to play the Atari games. The loss function involves two Q terms:

Target: the bootstrapped estimate built from the observed reward plus the discounted maximum Q value over the actions in the next state. Prediction: the Q value the network outputs for the action actually taken in the current state.

Parameter updating:
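Concretely, with alpha the learning rate, the (naive) loss and gradient update take the form below; note that the same parameters theta appear in both the prediction and the target, which is exactly what the next paragraph criticizes:

$$L(\theta) = \mathbb{E}\!\left[\Big(\underbrace{r+\gamma\max_{a'}Q(s',a';\theta)}_{\text{target}} - \underbrace{Q(s,a;\theta)}_{\text{prediction}}\Big)^{2}\right], \qquad \theta \leftarrow \theta + \alpha\Big(r+\gamma\max_{a'}Q(s',a';\theta) - Q(s,a;\theta)\Big)\nabla_{\theta}Q(s,a;\theta).$$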

When updating the weights, one also changes the target. Due to the generalization and extrapolation behavior of neural networks, large errors can build up across the state-action space, and convergence of the Bellman iteration with probability 1 is no longer guaranteed. Errors may therefore propagate with this update rule, making learning slow or unstable [4].

Nevertheless, the DQN algorithm can obtain strong performance in an online setting for a variety of Atari games while learning directly from pixels. Two heuristics are used to limit the instabilities: 1. The parameters of the target Q-network are updated only every N iterations, which prevents instabilities from propagating quickly and minimizes the risk of divergence. 2. The experience replay memory trick is used (both are detailed below).
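A minimal PyTorch-style sketch of heuristic 1 (the periodically synced target network), with an illustrative fully connected network in place of the original CNN and hypothetical hyperparameters; `dqn_update` is an assumed helper that receives a minibatch of transitions already converted to tensors (actions as int64):

```python
import torch
import torch.nn as nn

def make_q_net(n_inputs, n_actions):
    # Small fully connected stand-in for the original convolutional network.
    return nn.Sequential(nn.Linear(n_inputs, 128), nn.ReLU(), nn.Linear(128, n_actions))

q_net = make_q_net(4, 2)                        # online network, Q(s, a; theta)
target_net = make_q_net(4, 2)                   # target network, Q(s, a; theta^-)
target_net.load_state_dict(q_net.state_dict())  # start both networks from the same weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, sync_every = 0.99, 1000

def dqn_update(step, s, a, r, s_next, done):
    """One gradient step on a minibatch of transitions (all torch tensors)."""
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta) for the taken actions
    with torch.no_grad():                                   # no gradient flows through the target
        q_target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % sync_every == 0:                              # heuristic 1: sync theta^- only every N steps
        target_net.load_state_dict(q_net.state_dict())
```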

DQN architecture (image from [3])

DQN Tricks: experience replay and epsilon greedy

Experience replay

In DQN, a CNN architecture is used, and approximating Q-values with non-linear functions is not stable on its own. With the experience replay trick, all experiences are stored in a replay memory; when training the network, random samples drawn from this memory are used instead of only the most recent transition. In other words, the agent stores its experience (state transitions, actions, and rewards) and builds mini-batches from it for training, as sketched below.
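A sketch of a simple replay memory under those assumptions (transitions stored as (state, action, reward, next_state, done) tuples; the class name and capacity are illustrative):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped once full

    def store(self, transition):
        # transition is a (state, action, reward, next_state, done) tuple
        self.buffer.append(transition)

    def sample(self, batch_size=32):
        # Random mini-batch rather than the most recent transitions,
        # which breaks the correlation between consecutive samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```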

Epsilon Greedy Exploration

As the Q function converges towards Q*, a purely greedy agent tends to settle on the first effective strategy it finds, so explicit exploration is needed. An effective way to explore is to choose a random action with probability epsilon and otherwise (with probability 1 - epsilon) take the greedy action (the one with the highest Q value). The experience is collected by this epsilon-greedy policy.
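A sketch of epsilon-greedy action selection; `q_values(state)` is a hypothetical helper returning the current Q estimates for all actions in that state:

```python
import random

def epsilon_greedy_action(state, n_actions, epsilon=0.1):
    if random.random() < epsilon:               # explore: random action with probability epsilon
        return random.randrange(n_actions)
    values = q_values(state)                    # hypothetical helper: Q(state, a) for every action a
    return max(range(n_actions), key=lambda a: values[a])   # exploit: greedy action
```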

DDQN: Double Deep Q-Networks

The max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action. This makes it more likely to pick overestimated values (in the presence of noise or inaccuracies), resulting in over-optimistic value estimates. In DDQN, a separate network is used for each of these two roles, so there are two neural networks; this helps reduce the bias, while the action is still selected according to the values given by the current (online) weights.

There are two neural networks, each with its own set of weights and thus its own Q function. The loss is now built by selecting the action with one network and evaluating it with the other:
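In the standard double-DQN formulation, the online network (weights theta) selects the next action and the second network (weights theta′) evaluates it:

$$y^{\mathrm{DDQN}} = r + \gamma\, Q\big(s',\ \operatorname*{argmax}_{a'} Q(s',a';\theta);\ \theta'\big), \qquad L(\theta) = \mathbb{E}\Big[\big(y^{\mathrm{DDQN}} - Q(s,a;\theta)\big)^{2}\Big].$$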

Dueling Deep Q-Networks

Q contains the advantage (A) in addition to the value (V) of being in that state. A was defined earlier as the advantage of taking action a in state s relative to the other possible actions. Even if all the available actions are “pretty good”, we still want to know: how much better is each one?

The dueling network uses two separate estimators: one for the state value function and one for the state-dependent action advantage function. For further reading with a code example, we refer to the dueling-deep-q-networks post by Chris Yoon.
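In the dueling architecture, the two streams are typically combined with a mean-subtracted aggregation so that V and A remain identifiable (here alpha and beta denote the parameters of the advantage and value streams, not the learning rate, and theta the shared parameters):

$$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \Big(A(s,a;\theta,\alpha) - \tfrac{1}{|\mathcal{A}|}\sum_{a'}A(s,a';\theta,\alpha)\Big).$$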

Summary

We presented Q-learning as a value-based method, with a general introduction to reinforcement learning and the motivation for placing it in a deep learning context. The mathematical background, DQN, DDQN, a few training tricks, and the dueling DQN have been explored.

About the author

Dr. Barak Or is a well-versed professional in the field of artificial intelligence and data fusion. He is a researcher, lecturer, and entrepreneur who has published numerous patents and articles in professional journals. Dr. Or is also the founder of ALMA Tech. LTD, an AI and advanced navigation company. He has worked with Qualcomm as a DSP and machine learning algorithms expert. He completed his Ph.D. in machine learning for sensor fusion at the University of Haifa, Israel. He holds M.Sc. (2018) and B.Sc. (2016) degrees in Aerospace Engineering and a B.A. in Economics and Management (2016, Cum Laude) from the Technion, Israel Institute of Technology. He has received several prizes and research grants from the Israel Innovation Authority, the Israeli Ministry of Defence, and the Israeli Ministry of Economy and Industry. In 2021, he was nominated by the Technion for “graduate achievements” in the field of high-tech.

Website www.barakor.com Linkedin www.linkedin.com/in/barakor/ YouTube www.youtube.com/channel/UCYDidZ8GUzUy_tYtxvVjRiQ

References

[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.

[2] Mnih, Volodymyr, et al. “Human-level control through deep reinforcement learning.” Nature 518.7540 (2015): 529–533.

[3] Mosavi, Amirhosein, et al. “Comprehensive review of deep reinforcement learning methods and applications in economics.” Mathematics 8.10 (2020): 1640.

[4] Baird, Leemon. “Residual algorithms: Reinforcement learning with function approximation.” Machine Learning Proceedings 1995. Morgan Kaufmann, 1995. 30–37.
