The Ultimate Beginner’s Guide to Reinforcement Learning

Siddharth Sharma
Towards Data Science
15 min read · Apr 2, 2020


Motivation and Background

One of my favorite quotes about machine learning is from Peter Norvig, Director of Research at Google: “We don’t have better algorithms, we just have more data.” Today, Artificial Intelligence is truly the new electricity and is lighting up new discoveries across both industry and academia.

One major field in AI that is opening new frontiers is Reinforcement Learning. Much as classical machine learning has moved toward more flexible logic (e.g., fuzzy logic) and more adaptable algorithms, Reinforcement Learning denotes the set of algorithms with the potential to make highly intelligent decisions based on their local environment. Reinforcement Learning (RL) is a growing subset of Machine Learning in which software agents take actions, or make moves, in hopes of maximizing some prioritized reward. In other words, it is an iterative feedback loop between an agent and its environment. RL algorithms such as Monte Carlo Methods, SARSA, and Q-learning are more dynamic in their behavior than traditional machine learning algorithms.

We will begin by examining specific reinforcement learning algorithms and discussing their behavior in comparison to supervised learning. We will also use the cartpole problem as a physical example to demonstrate RL in action! This guide will cover Q-learning, DQNs (Deep Q-Networks), MDPs, Value and Policy Iteration, Monte Carlo Methods, SARSA, and DDPG.

What is Reinforcement Learning?

As defined above, Reinforcement Learning (RL) is a growing subset of Machine Learning in which software agents take actions in hopes of maximizing some prioritized reward. There are several different forms of feedback which may govern the methods of an RL system. Compared to Supervised Learning algorithms, which map inputs to known target outputs, RL algorithms typically are not given target outputs (only inputs are provided). There are 3 elements of a basic RL algorithm: the agent (which chooses actions in its current state), the environment (which responds to each action and provides new input to the agent), and the reward (the incentive signal returned by the environment). The basic schema for an RL algorithm is given below:

Flowchart of an RL algorithm (Source: https://i.stack.imgur.com/eoeSq.png)

The broad goal of most RL algorithms is to achieve a balance between exploration (trying new actions and gathering new data points) and exploitation (using previously captured experience). The immediate goal is to maximize the reward, with trials alternating between exploitation and exploration. It is important to note that there are three types of RL implementations: policy-based, value-based, and model-based. Policy-based RL involves coming up with a policy, a deterministic or stochastic strategy, that maximizes the cumulative reward. Value-based RL attempts to learn and optimize a value function, V(s). Model-based RL is based on creating a virtual model of a certain environment, and the agent learns to perform within the constraints of that model.
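To make the agent-environment-reward loop concrete, here is a minimal sketch using OpenAI Gym’s CartPole-v1 environment (discussed in detail later) with a random policy standing in for a learned one. The reset/step signatures follow the classic pre-0.26 gym API, which is an assumption of this sketch:

```python
import gym

# A minimal agent-environment loop: the agent observes a state, chooses an
# action, and the environment responds with a new state and a reward.
env = gym.make("CartPole-v1")

for episode in range(5):
    state = env.reset()                      # environment provides the initial state
    total_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()   # random action, standing in for a learned policy
        state, reward, done, info = env.step(action)  # environment responds
        total_reward += reward               # cumulative reward for this episode
    print(f"Episode {episode}: total reward = {total_reward}")

env.close()
```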

Definitions

(Note: Credit for this section goes to Kung-Hsiang, Huang (Steeve). Here is his original article)

  1. Action (A): All the possible moves that the agent can take
  2. State (S): Current situation returned by the environment.
  3. Reward (R): An immediate return sent back from the environment to evaluate the last action.
  4. Policy (π): The strategy that the agent employs to determine next action based on the current state.
  5. Value (V): The expected long-term return with discount, as opposed to the short-term reward R. Vπ(s) is defined as the expected long-term return of the current state s under policy π.
  6. Q-value or action-value (Q): Q-value is similar to Value, except that it takes an extra parameter, the current action a. Qπ(s, a) refers to the long-term return of the current state s, taking action a under policy π.

The Cartpole Problem

The cartpole problem is a famous problem in dynamics and control theory: a pendulum whose center of gravity sits above its pivot point is mounted on a cart. This naturally creates an unstable system; without any swing-up force or dynamic control, the pendulum will simply fall and hang vertically downward. The cart has a single degree of freedom along a horizontal axis, and the system has no vertical movement. The goal of most cartpole systems is to keep the pole balanced by applying forces to the cart (and hence to the pivot point) along its axis of movement (the horizontal direction).

Demo of Cartpole being balanced (Source: Mc.AI)

We will examine the cartpole and show how Reinforcement Learning may be used to effectively balance the system.

Reinforcement Learning for the Cartpole System

Since RL is a form of learning characterized by trial-and-error responses to actions and their effects on the environment, it makes sense to model the cartpole system via RL: the cartpole is heavily subject to parameter changes while having a clearly defined agent-action-environment-reward schema. The agent is the controller or algorithm that controls the movement of the cart. The action is the physical movement of the cart in response to the applied forces and torques following the swing-up phase. The environment is the physical setting of the cartpole, including the constrained area in which the cart may move. The reward is the ability of the cartpole to achieve sustained balance in its current state. We will now identify the specific actions and states of the cartpole problem:

Cartpole Actions and possible States (Source: LiveBook-Manning)

The cartpole agent is limited to two possible actions: (1) exert a constant force on the cart to the right, or (2) exert a constant force on the cart to the left. As seen in the diagram, both forces are directed horizontally. The actions taken by the agent change the position of the cart and the environment accordingly. The state of the cartpole is determined by four quantities: the position and velocity of the cart, the pole angle 𝜃, and the velocity of the pole at its tip. These quantities form the basis of the differential equations that govern how the cartpole system is adjusted and controlled.
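If you want to inspect these actions and states directly, OpenAI Gym’s CartPole-v1 environment (assumed here) exposes exactly this structure:

```python
import gym

env = gym.make("CartPole-v1")

# Two discrete actions: 0 = push the cart to the left, 1 = push it to the right.
print(env.action_space)          # Discrete(2)

# Four continuous state variables: cart position, cart velocity,
# pole angle, and pole velocity at the tip.
print(env.observation_space)     # Box with shape (4,)
print(env.observation_space.low, env.observation_space.high)
```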

Each time the controller applies a force, it checks whether the cumulative reward has been achieved or maximized. In the cartpole problem, the angle of the pole with respect to the cart and the distance of the cart from the center determine the value/reward achieved. If the pole is roughly upright and the cart is near the center of the environment, a reward is given for that step. If the reward is not obtained, the controller makes the necessary adjustments to the force and the subsequent displacement. It is also important to note that there are two crucial conditions that may terminate or restart the action-environment-reward loop:

Termination Criteria 1 (Source: LiveBook-Manning)
Termination Criteria 2 (Source: LiveBook-Manning)

In both of the above cases, the agent’s actions led to a situation that did not maximize the reward in the specified environment (the cartpole was either not upright or not within the specified bounds). This can be seen as the “punishment” of the model. On the other hand, each time the reward is gained, the “score” of the cartpole increases by 1.
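As a small sketch of this reward/punishment logic, the check below mirrors the two termination criteria. The 12-degree and ±2.4-unit thresholds are the defaults used by Gym’s CartPole-v1 and are assumed here; the helper name is ours:

```python
import math

# Assumed thresholds (Gym's CartPole-v1 defaults): the episode terminates
# when the pole tilts more than ~12 degrees or the cart leaves +/- 2.4 units.
MAX_POLE_ANGLE = 12 * math.pi / 180   # radians
MAX_CART_POSITION = 2.4

def step_outcome(cart_position, pole_angle):
    """Return (reward, done) for one time step, mirroring the text above."""
    within_bounds = abs(cart_position) <= MAX_CART_POSITION
    upright = abs(pole_angle) <= MAX_POLE_ANGLE
    done = not (within_bounds and upright)   # either failure ends the episode
    reward = 1.0 if not done else 0.0        # +1 "score" for each surviving step
    return reward, done

print(step_outcome(0.1, 0.05))   # (1.0, False): balanced and in bounds
print(step_outcome(3.0, 0.05))   # (0.0, True): cart left the track
```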

Now that we have discussed the physical properties of the cartpole system and the application of RL towards it, we can analyze two algorithms that allow for effective control and stabilization of the cartpole.

Q-Learning for the Cartpole Problem

We know, based on the possible states of the cartpole problem, that if we make the right decisions the cartpole will stay upright and balanced. Thus, we can identify the action-state pairs which lead to higher reward in the cartpole system. We can model each pair as a function that returns the expected reward: 𝑅𝑒𝑤𝑎𝑟𝑑 = 𝑄(𝑠, 𝑎). In this case, the reward is known as a Q-value. The goal of Q-learning, an RL algorithm, is to find this function 𝑄(𝑠, 𝑎) while applying it iteratively to 𝑠′ (the future state). In other words, a Q-function represents the expected total reward an agent in state 𝑠 can receive by executing a specific action 𝑎. The initial Q-learning function can be represented as:

Q-learning function

After obtaining some reward 𝑟 by taking an action 𝑎, we reach the next state 𝑠′. Upon reaching 𝑠′, the agent considers a new action 𝑎′ and the reward it would bring. The weight we place on this future reward is the discount factor 𝛾. Thus, we update the equation as follows:

Updated Q-function

Before we apply Q-learning to the cartpole problem, it is important to acknowledge that Q-learning is an example of model-free learning. We will later discuss MDPs and how policy and value iteration relate to Q-learning.
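To make the update concrete, here is a minimal tabular sketch of the rule Q(s, a) ← Q(s, a) + α[r + γ·max over a′ of Q(s′, a′) − Q(s, a)]. The discretized state tuples are hypothetical placeholders, since applying a table to the cartpole’s continuous state would require binning it first:

```python
import numpy as np
from collections import defaultdict

alpha, gamma = 0.1, 0.99          # learning rate and discount factor (illustrative)
n_actions = 2                     # left / right

# Q-table keyed by a discretized state; unseen states start at zero.
Q = defaultdict(lambda: np.zeros(n_actions))

def q_learning_update(state, action, reward, next_state):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state][action] += alpha * (td_target - Q[state][action])

# Example update for hypothetical discretized cartpole states:
q_learning_update(state=(0, 1, 2, 1), action=0, reward=1.0, next_state=(0, 1, 1, 1))
print(Q[(0, 1, 2, 1)])
```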

Application of Deep Q-Learning to the Cartpole System

Based on an analysis of the elements of Q-learning, it is evident that the cartpole system can be effectively modeled using Q-learning. The cartpole problem has a state space of 4 continuous dimensions (cart position, cart velocity, pole angle 𝜃, and pole velocity at the tip) and an action space of 2 discrete values (move right or left). However, in typical Q-learning we would need a distinct table entry for every slight change in the angle or position of the cartpole, and this would require extreme memory storage capabilities. Moreover, to apply Q-learning to balancing the cartpole system, we must approximate the model-free function 𝑄(𝑠, 𝑎), where the input is a state-action pair (𝑠, 𝑎) and the output is some expected reward. This technique of approximating the 𝑄(𝑠, 𝑎) function with a neural network is known as a Deep Q-Network (DQN) and is more robust against parameter variations and frequent changes in the state. It follows the same process as Q-learning, but uses a deep neural network to compute 𝑄(𝑠, 𝑎) from a trained network of nodes:

DQN for the Cartpole System (from Greg Surma)

As seen in the diagram above, the DQN uses the current state of the cartpole to calculate the expected reward and the next action, returning a 𝑄(𝑠, 𝑎) for both movement to the right and movement to the left. The DQN also needs a loss function in order to be trained. We know that the updated Q-learning equation already calculates a target value for 𝑄(𝑠, 𝑎):

Updated Q-Learning Equation

Thus, it is essential to have a loss function that minimizes the error between the approximation produced by the DQN and the target 𝑄(𝑠, 𝑎) obtained from the equation. In summary, it is best to think of the overall process behind Q-learning and DQNs as “controlled trial and error” that seeks to approximate the expected reward 𝑄(𝑠, 𝑎). Q-learning makes use of the updated Q-function, which makes iterative adjustments between discrete state-action pairs. A DQN avoids the memory overuse that can occur in Q-learning with a near-infinite number of state-action pairs, in favor of a neural network that approximates the expected reward from previous continuous state-action pairs.
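As a rough illustration (not the exact architecture from the diagram), a Keras sketch of such a network and its mean-squared-error loss might look like the following; the layer sizes and optimizer are assumed choices:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# A small fully-connected network that maps the 4-dimensional cartpole state
# to one Q-value per action (left, right).
model = keras.Sequential([
    layers.Dense(24, activation="relu", input_shape=(4,)),
    layers.Dense(24, activation="relu"),
    layers.Dense(2, activation="linear"),   # Q(s, left), Q(s, right)
])

# The loss penalizes the gap between the network's Q(s, a) and the target
# r + gamma * max_a' Q(s', a') computed from the updated Q-learning equation.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

state = np.random.rand(1, 4)     # a dummy state, just to show the shapes
print(model.predict(state))      # two Q-values, one per action
```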

Training a DQN for the cartpole problem

DQNs are commonly used for the cartpole problem, and we may now examine the implementation of a DQN that maximizes the reward of the cartpole (the reward being the controller’s ability to both balance and control the cartpole). It is first important to note that we can summarize the state-action-reward-state dynamics and environment of the cartpole system as a tuple (𝑆, 𝐴, 𝑅, 𝑃, 𝜌), where 𝑆 is the state space, 𝐴 is the action space, 𝑅 is the reward function, 𝑃 is the set of transition probabilities, and 𝜌 is the initial state distribution. The reward function is:

Reward function for the Cartpole system

This formulation of the cartpole system as a tuple is known as a Markov Decision Process (MDP). An MDP provides us with a method to select an action 𝑎 given a state 𝑠; we then observe the next state 𝑠′ (and choose the next action 𝑎′) according to the transition probabilities 𝑃. MDPs also provide techniques for helping agents find the optimal policy within the specified environment in the long run. DQN implementations that learn from raw pixels use convolutional networks; for a low-dimensional state like the cartpole’s four numbers, a small fully-connected network (often with batch normalization) is typical. Either way, the network makes iterative adjustments toward 𝑄(𝑠, 𝑎).

After a DQN is implemented, it is necessary to train the model. The training of the DQN acts as a learning phase for the cartpole system in its environment, much like the commonplace example of a child learning to walk. There are a few phases of learning that the cartpole must go through before it can effectively balance both the angle and position of the system:

  1. Learning to balance the pole alone
  2. Staying in bounds
  3. Staying in bounds but unable to balance the pole
  4. Staying in bounds while effectively balancing the pole

The ultimate goal is to solve the environment as quickly as possible (in the fewest steps/episodes). As the model trains on state-action pairs, it gradually reduces the number of steps needed to solve the environment. The figure below demonstrates a sample training run for a DQN cartpole; as seen in the graph, the agent eventually solves the environment in fewer steps:

Sample Training steps and trials for an RL Cartpole Simulation (source: Greg Surma)
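For concreteness, here is a schematic sketch of such a training loop, combining ε-greedy exploration with a small experience-replay buffer. The network sizes, hyperparameters, and the classic gym step/reset signatures are assumptions for illustration, not the exact setup behind the plot above (training sample-by-sample like this is slow; practical implementations batch the updates):

```python
import random
from collections import deque

import gym
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

env = gym.make("CartPole-v1")
gamma, epsilon, batch_size = 0.99, 1.0, 32
memory = deque(maxlen=10_000)                 # experience replay buffer

model = keras.Sequential([
    layers.Dense(24, activation="relu", input_shape=(4,)),
    layers.Dense(24, activation="relu"),
    layers.Dense(2, activation="linear"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

for episode in range(200):
    state = env.reset().reshape(1, 4)
    total_reward, done = 0.0, False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(model.predict(state, verbose=0)))
        next_state, reward, done, _ = env.step(action)
        next_state = next_state.reshape(1, 4)
        memory.append((state, action, reward, next_state, done))
        state, total_reward = next_state, total_reward + reward

        # Replay: fit the network toward r + gamma * max_a' Q(s', a') on past transitions.
        if len(memory) >= batch_size:
            for s, a, r, s2, d in random.sample(list(memory), batch_size):
                target = model.predict(s, verbose=0)
                target[0][a] = r if d else r + gamma * np.max(model.predict(s2, verbose=0))
                model.fit(s, target, epochs=1, verbose=0)

    epsilon = max(0.01, epsilon * 0.995)      # decay exploration over time
    print(f"Episode {episode}: reward = {total_reward}, epsilon = {epsilon:.2f}")
```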

Now that we have demonstrated the use of Q-learning and DQN for the Cartpole Problem, we will generalize some other RL algorithms and contrast policy and value iteration.

Markov Decision Processes (MDPs)

Let’s discuss the concept of Markov Decision Processes (MDPs), which we touched on while discussing how to train the DQN. In every RL algorithm there is an agent interacting with an environment by taking actions to maximize some reward. An MDP is a discrete-time stochastic control process for decision making in which outcomes are partly random and partly under the control of the decision-maker. This is the basis of most RL algorithms.

An MDP consists of a tuple of 5 elements (all credit for these definitions goes to Moustafa Alzantot; here is his article), with a small code sketch of the tuple following the list:

  • S : Set of states. At each time step, the state of the environment is an element s ∈ S.
  • A: Set of actions. At each time step, the agent chooses an action a ∈ A to perform.
  • p(s_{t+1} | s_t, a_t): State transition model that describes how the environment state changes when the agent performs an action, depending on the action a and the current state s.
  • p(r_{t+1} | s_t, a_t): Reward model that describes the real-valued reward the agent receives from the environment after performing an action. In an MDP, the reward value depends on the current state and the action performed.
  • 𝛾 : discount factor that controls the importance of future rewards.
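As promised above, here is a small sketch that encodes this tuple for a toy two-state problem as plain Python data; every state, action, probability, and reward below is made up purely for illustration:

```python
# A toy two-state MDP encoded directly from the tuple (S, A, P, R, gamma).
S = ["balanced", "fallen"]
A = ["push_left", "push_right"]
gamma = 0.99

# P[s][a] is a list of (probability, next_state) pairs.
P = {
    "balanced": {"push_left":  [(0.9, "balanced"), (0.1, "fallen")],
                 "push_right": [(0.9, "balanced"), (0.1, "fallen")]},
    "fallen":   {"push_left":  [(1.0, "fallen")],
                 "push_right": [(1.0, "fallen")]},
}

# R[s][a] is the expected immediate reward for taking action a in state s.
R = {
    "balanced": {"push_left": 1.0, "push_right": 1.0},
    "fallen":   {"push_left": 0.0, "push_right": 0.0},
}
```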

In an MDP, we are searching for a policy: the function the agent or decision-maker uses to choose an action in each state s. Once we specify the policy and fix the action for each state, the agent behaves like a Markov chain, and the next state depends only on the following:

Markov operators (policy and current state and action)

Solving an MDP this way is an example of a policy iteration approach, since it instantiates a policy and samples future actions according to that fixed policy. We assume that the objective is the total discounted reward (the return), computed using a discount factor:

Discount Factor (graphic from MIT Intro to Deep Learning)

We use the discount factor to keep the total reward from going to infinity (because it is between 0 and 1). The discount factor also lets us model the agent’s preference for immediate versus future rewards. Beyond the discount factor, there are two possible kinds of environment in which an MDP may exist: deterministic and stochastic.
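A quick worked example of the discounted return, with made-up rewards:

```python
# Discounted return: R_total = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
gamma = 0.9
rewards = [1.0, 1.0, 1.0, 1.0]    # made-up per-step rewards

discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(discounted_return)          # 1 + 0.9 + 0.81 + 0.729 = 3.439

# With gamma < 1 the infinite sum stays bounded: an endless stream of
# +1 rewards converges to 1 / (1 - gamma) = 10.0 here.
print(1 / (1 - gamma))
```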

Value v. Policy Iteration

Value and Policy Iteration are RL algorithms that assume MDP characteristics. Value Iteration repeatedly refines the value function V (or the Q-function) until it converges to the optimal values, while Policy Iteration repeatedly refines the policy itself until it converges to the optimal policy. Through a policy gradient, we can also directly optimize a parameterized policy. Q-learning is considered model-free and is closely related to value iteration. The difference is outlined below:

Value v. Policy Iteration (Source: MIT [12])

To summarize, value iteration updates the value function that will be used to compute the Q-function, while policy iteration directly searches for the optimal policy. Here are some great explanations of the two as given by Moustafa Alzantot:

Value Iteration: “Value iteration computes the optimal state value function by iteratively improving the estimate of V(s). The algorithm initializes V(s) to arbitrary random values. It repeatedly updates the Q(s, a) and V(s) values until they converge. Value iteration is guaranteed to converge to the optimal values.”

Policy Iteration: “While value-iteration algorithm keeps improving the value function at each iteration until the value-function converges. Since the agent only cares about finding the optimal policy, sometimes the optimal policy will converge before the value function. Therefore, another algorithm called policy-iteration instead of repeated improving the value-function estimate, it will re-define the policy at each step and compute the value according to this new policy until the policy converges. Policy iteration is also guaranteed to converge to the optimal policy and it often takes less iterations to converge than the value-iteration algorithm.”
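To ground the distinction, here is a compact value-iteration sketch on the toy two-state MDP from earlier; all numbers are illustrative, and the convergence threshold is an arbitrary choice:

```python
# Value iteration on the toy two-state MDP sketched earlier (values made up).
S = ["balanced", "fallen"]
A = ["push_left", "push_right"]
gamma, theta = 0.99, 1e-6

P = {"balanced": {"push_left":  [(0.9, "balanced"), (0.1, "fallen")],
                  "push_right": [(0.9, "balanced"), (0.1, "fallen")]},
     "fallen":   {"push_left":  [(1.0, "fallen")], "push_right": [(1.0, "fallen")]}}
R = {"balanced": {"push_left": 1.0, "push_right": 1.0},
     "fallen":   {"push_left": 0.0, "push_right": 0.0}}

V = {s: 0.0 for s in S}                       # arbitrary initial values
while True:
    delta = 0.0
    for s in S:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')
        q = {a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]) for a in A}
        new_v = max(q.values())               # V(s) <- max_a Q(s, a)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:                         # stop once the values converge
        break

# The greedy policy with respect to the converged values.
policy = {s: max(A, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
          for s in S}
print(V, policy)
```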

Other RL Algorithms

SARSA

SARSA (State-Action-Reward-State-Action) is a reinforcement learning algorithm that updates the Q-function using the full transition tuple (𝑠, 𝑎, 𝑟, 𝑠′, 𝑎′), i.e., the next state and the action actually taken in it. We can therefore think of SARSA as an on-policy cousin of Q-learning: where Q-learning bootstraps from the maximum Q-value in the next state, SARSA bootstraps from the Q-value of the action its current policy actually chooses, as sketched below.
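A minimal sketch of the two update rules side by side (tabular form, with our own variable names):

```python
import numpy as np
from collections import defaultdict

alpha, gamma, n_actions = 0.1, 0.99, 2
Q = defaultdict(lambda: np.zeros(n_actions))

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy: uses the action a' actually chosen in the next state."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

def q_learning_update(s, a, r, s_next):
    """Off-policy: uses the greedy (maximum) Q-value in the next state."""
    Q[s][a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s][a])

# Example updates on arbitrary state labels:
sarsa_update("s0", 0, 1.0, "s1", 1)
q_learning_update("s0", 0, 1.0, "s1")
print(Q["s0"])
```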

Monte Carlo Methods

Monte Carlo methods, like Q-learning, are model-free: they learn directly from episodes of experience and past action-state pairs, without any prior knowledge of the MDP’s transition probabilities. Monte Carlo methods fit naturally into policy iteration, since the goal is to learn the value function V of a defined policy by averaging the returns observed after visiting each state.
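Here is a small sketch of first-visit Monte Carlo prediction, which estimates V(s) by averaging the returns observed after the first visit to each state; the episodes are made-up lists of (state, reward) pairs:

```python
from collections import defaultdict

gamma = 0.9
returns = defaultdict(list)        # all first-visit returns observed per state
V = {}

# Each made-up episode is a list of (state, reward) pairs from one full run.
episodes = [
    [("s1", 1.0), ("s2", 0.0), ("s1", 1.0)],
    [("s2", 1.0), ("s1", 0.0)],
]

for episode in episodes:
    G = 0.0
    # Walk the episode backwards, accumulating the discounted return G.
    for t in reversed(range(len(episode))):
        state, reward = episode[t]
        G = reward + gamma * G
        # First-visit: only record G for the first occurrence of the state.
        if state not in [s for s, _ in episode[:t]]:
            returns[state].append(G)

for state, gs in returns.items():
    V[state] = sum(gs) / len(gs)   # V(s) is the average observed return
print(V)
```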

DDPG (Deep Deterministic Policy Gradient)

This is an extension of the policy-gradient family of RL algorithms to continuous action spaces: an actor network outputs a deterministic action, a critic network (a Deep Q-Network-style value estimator) evaluates it, exploration comes from adding noise to the actions, and the target networks are kept stable with a soft-update strategy. This article explains it quite well:
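The full pseudocode is too long to reproduce here, but the soft-update step mentioned above is easy to sketch: the target networks track the learned actor and critic through a small interpolation factor τ. The parameter names and τ value below are illustrative:

```python
import numpy as np

tau = 0.005   # soft-update rate: target weights move slowly toward the learned ones

def soft_update(target_weights, source_weights, tau=tau):
    """theta_target <- tau * theta_source + (1 - tau) * theta_target."""
    return [tau * w_src + (1.0 - tau) * w_tgt
            for w_src, w_tgt in zip(source_weights, target_weights)]

# Toy example with a single weight matrix per "network":
critic = [np.ones((2, 2))]
target_critic = [np.zeros((2, 2))]
target_critic = soft_update(target_critic, critic)
print(target_critic[0])   # every entry nudged from 0.0 toward 1.0 by tau
```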

Applications of Reinforcement Learning

While the fundamentals of RL require heavy logic and abstract evaluation of either a policy or a value function, there are many concrete applications of RL in the real world. PID or LQR controllers from classical control theory can, in some settings, be replaced with a robust RL algorithm. There are also commonplace applications of RL in robotics and movement (where there is a well-defined environment, action set, and reward). RL can also be used in applicable game-theory situations; a great example is the AlphaGo reinforcement learning model, which was able to beat the world’s best human Go players.

References

The content for this article is primarily borrowed from both a short research paper and book that I wrote.

Medium Articles which I cite:

Markus Buchholz: https://medium.com/@markus.x.buchholz/deep-reinforcement-learning-deep-deterministic-policy-gradient-ddpg-algoritm-5a823da91b43

Thomas Simonini: https://medium.com/free-code-camp/an-introduction-to-reinforcement-learning-4339519de419

Thomas Simonini: https://medium.com/free-code-camp/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe

Kung-Hsiang, Huang (Steeve): https://towardsdatascience.com/introduction-to-various-reinforcement-learning-algorithms-i-q-learning-sarsa-dqn-ddpg-72a5e0cb6287?source=search_post---------2

Moustafa Alzantot: https://medium.com/@m.alzantot/deep-reinforcement-learning-demysitifed-episode-2-policy-iteration-value-iteration-and-q-978f9e89ddaa?source=search_post

Mohammed Ashraf: https://towardsdatascience.com/reinforcement-learning-demystified-markov-decision-processes-part-1-bf00dda41690?source=search_post---------8

Greg Surma: https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288

[1] Comparison of Reinforcement Learning Algorithms applied to … (n.d.). Retrieved from https://arxiv.org/pdf/1810.01940.

[2] Surma, G. (2019, January 18). Cartpole — Introduction to Reinforcement Learning (DQN — Deep Q-Learning). Retrieved from https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288.

[3] Kwon, S. (2007). Two Alternate Fuzzy Controllers for Cartpole System. 2007 International Conference on Machine Learning and Cybernetics. doi: 10.1109/icmlc.2007.4370220

[4] (n.d.). Retrieved from http://ctms.engin.umich.edu/CTMS/index.php?example=InvertedPendulum&section.

[5] (1995, February 10). Retrieved from http://pages.cs.wisc.edu/~finton/qcontroller.html.

[6] (2016, November 15). Cart-Pole Balancing with Q-Learning. Retrieved from https://medium.com/@tuzzer/cart-pole-balancing-with-q-learning-b54c6068d947.

[7] IIT Bombay Graduate. (2019, May 6). Introduction to Deep Q-Learning for Reinforcement Learning (in Python). Retrieved from https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/.

[8] Phy, V. (2019, November 4). Reinforcement Learning Concept on Cart-Pole with DQN. Retrieved from https://towardsdatascience.com/reinforcement-learning-concept-on-cart-pole-with-dqn-799105ca670.

[9] Rodriguez, J. (2017, August 31). Reinforcement Learning Soup: MDPs, Policy vs. Value Learning, Q-Learning and Deep-Q-Networks. Retrieved from https://medium.com/@jrodthoughts/reinforcement-learning-soup-mdps-policy-vs-value-learning-q-learning-and-deep-q-networks-4ac137acd07.

[10] Deep Q-Learning with Keras and Gym. (2017, February 6). Retrieved from https://keon.io/deep-q-learning/.

[11] Brunskill, E. (n.d.). Slides on Q-learning and Monte Carlo Models by Stanford University. Stanford.

[12] © Alexander Amini and Ava Soleimany MIT 6.S191 Introduction to Deep Learning: IntroToDeepLearning.com (Slides)

[13] Ankit Choudary: https://www.analyticsvidhya.com/blog/2018/11/reinforcement-learning-introduction-monte-carlo-learning-openai-gym/
