Lite Intro into Reinforcement Learning 🤖

Tomas Turek
Towards Data Science
15 min read · Sep 3, 2018


This is a brief introduction to Reinforcement Learning (RL), going through the basics in simplified terms. We start with a short overview of RL and then move into some practical examples of techniques for solving RL problems. By the end, you may even think of places where you can apply these techniques yourself. I think we can all agree that building our own Artificial Intelligence (AI) and having a robot do chores for us is cool.

… so let’s get to it!

1) What is RL? 🧐

“Reinforcement learning is an area of machine learning (artificial intelligence), originating from the psychological concept of reinforcement, that is concerned with the decision making of software agents toward maximizing some notion of cumulative reward in model environments.”

Nowadays, reinforcement learning has found its way into AI and Machine Learning (ML) techniques; however, its origins lie in behavioral psychology. The American psychologist E. L. Thorndike laid the groundwork at the beginning of the 20th century with his studies of how rewarded behaviors are strengthened. Another American psychologist, B. F. Skinner, followed up on Thorndike’s work and formulated a learning process called Operant Conditioning (OC). Around the middle of the century, the American applied mathematician R. E. Bellman developed dynamic programming, which helps solve RL problems with computers. Since then, RL has shifted heavily into the digital world, in large part thanks to the computer scientist R. S. Sutton, who pioneered computational reinforcement learning and, together with A. G. Barto, produced the influential RL textbook at the end of the 20th century. Today, machine learning is arguably the most prominent field taking the plunge into decision making and implementing it at large scale!

… reinforcement in behavioral psychology 🐹

Is an effect that strengthens the future behavior of an organism after it has been exposed to a specific stimulus prompting it to execute the learned behavior. It is utilized in specific types of conditioning that use stimuli to elicit a response as part of learning. In OC, behaviors are controlled by detectable changes in the environment, aka environmental stimuli, which are external influences on an activity. Environmental stimuli come in diverse forms that living organisms can detect, e.g. our bodies can detect touch, sound, sight, etc. To bring it back to reinforcement, OC uses reinforcement or punishment to modify the likelihood of a behavior. It also involves voluntary behavior, which can be described with the following example of animal behavior.

  • a dog can be trained to jump higher when rewarded by dog treats, meaning its behavior was reinforced by treats to perform specific actions
[Klein & Abbeel 2018]

… reinforcement in machine learning 🤖

Is an effect on the following actions of a software agent exploring a model environment, after it has been given a reward that strengthens its future behavior. Software agents are sent into model environments to take actions with the intention of achieving some desired goal. Such agents can be thought of as computer programs that are rewarded for performing the right actions or punished for performing the wrong ones, and that learn from this experience. What counts as right or wrong depends on the environment, but the right actions are usually the ones leading toward the best way (whatever that may be) of achieving the goal, as shown in the following examples.

  • an agent in a maze (environment) can be programmed to autonomously find (via rewards) the quickest way (actions) to escape it (goal)
  • an agent in the ocean (environment) can be programmed to autonomously find (via rewards) the longest way (actions) to survive underwater so as to restore damaged coral reefs (goal)

… let’s see how to define such theoretical RL examples via mathematical models!

2) How to define RL problems? ✍

“RL tasks can be formulated as a combination of the Markov decision process and the ‘right’ policy, which software agents can use to make their own decisions when solving model environments.”

The Markov Decision Process (MDP) provides a neat mathematical framework for modeling decision making in RL. An MDP is used in situations where the decision maker (software agent) partially controls the outcomes while the rest are random. It equips an agent with the knowledge of states, actions, transition probabilities, rewards, and discount factors. Those properties are ‘lightly’ described below, along with some other common terms.

maze example — grid world [Klein & Abbeel 2018]

🐹 agent -> the guy that does the dirty job for you!

🗺 environment -> a universe / space / world where an agent is doing that job, e.g. it could be a stochastic grid world (one where actions have random outcomes) describing a maze, as in the picture above

🏔 state (s) -> a particular state / position of an agent inside an environment

💃🕺action (a) -> an act of an agent, possibly leading to new states

🤷‍ transition probability (P) -> the probability of how an agent moves in an environment, i.e. the likelihood that an agent’s action in a particular state will or won’t lead to other possible states

💵 reward (R) -> literally, some money as a reward or lumps of fossil fuel as a punishment (i.e. negative reward) for an agent’s decision that gets it into a state

⚖️ discount factor (ɣ) -> helps an agent focus on its job by not getting distracted by the size of individual rewards, i.e. it scales received rewards according to the difference in importance between future and present rewards

👴🏻 experience -> the experience an agent receives by entering an environment and performing some actions, e.g. learned behavior such as observation of new states, rewards, etc.

🔮 policy (π) -> tells an agent how to act in states and even how to go from start to goal, i.e. it is a mapping from states to probabilities of selecting each possible action… the policy can be a solution to an MDP by giving an action for each state in an environment, such as “π(s) -> a”
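To make these terms concrete, here is a minimal sketch of how such a tiny stochastic grid-world-style MDP could be written down as plain Python data. The state names, action set, probabilities, and reward values below are invented for illustration; they are not the exact maze from the figure.

```python
# A tiny, hypothetical MDP written out as plain Python data (illustrative values only).

GAMMA = 0.9  # discount factor: how much future rewards matter relative to present ones

STATES = ["start", "corridor", "diamond", "pit"]
ACTIONS = ["left", "right"]

# Transition probabilities P(s' | s, a): for each (state, action) pair,
# a distribution over possible next states.
P = {
    ("start", "right"):    {"corridor": 0.8, "start": 0.2},
    ("start", "left"):     {"start": 1.0},
    ("corridor", "right"): {"diamond": 0.7, "pit": 0.3},
    ("corridor", "left"):  {"start": 1.0},
}

# Rewards R(s') received for entering a state.
R = {"start": 0.0, "corridor": 0.0, "diamond": 10.0, "pit": -10.0}

# A (deterministic) policy: a mapping from states to actions, pi(s) -> a.
policy = {"start": "right", "corridor": "right"}
```

Here ‘diamond’ and ‘pit’ play the role of terminal states, so the policy only needs to name an action for the non-terminal ones.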

⇒ it’s all about getting the long term reward! 💰

MDPs are all about the long term reward, the so-called delayed reward. These rewards lead to the completion of the task (goal) the agent is attempting to solve in an environment. For instance, an agent may receive a significant short term reward, the so-called immediate reward, by getting into a particular state. However, following this path of immediate rewards may not be the best way to complete the task; it depends on how the rewards are set up in the environment.

... it’s time for some questions!

… how can an agent find a new state? 👀

It is done through transitions, i.e. some transition function T(s, a, r, s′), which describes the process by which an agent finds a new state. In a model environment, an agent is in some state (s), takes an action (a), then gets a reward (r) for it, and through this process lands in a new state (s′).
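As a small, self-contained illustration of one such transition, here is a sketch in plain Python; the two-entry P and R dictionaries are placeholders in the same shape as the MDP sketch above.

```python
import random

# Placeholder dynamics in the same shape as the earlier MDP sketch:
# P maps (s, a) to a distribution over next states, R maps a next state to its reward.
P = {("corridor", "right"): {"diamond": 0.7, "pit": 0.3}}
R = {"diamond": 10.0, "pit": -10.0}

def step(state, action):
    """One transition: from (s, a), sample a next state s' and return (r, s')."""
    dist = P[(state, action)]
    s_next = random.choices(list(dist), weights=list(dist.values()))[0]
    return R[s_next], s_next

reward, new_state = step("corridor", "right")  # e.g. (10.0, "diamond") or (-10.0, "pit")
```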

… how can an agent know how good it is to be in states? 👍

Decision making in RL is done by an agent that usually needs to know ‘how good’ it is to be in a given state or to perform a given action. For instance, an agent wants to know the expected return, aka future rewards, which depends on the actions it will take. Many RL algorithms account for this need by estimating functions of states, or of state-action pairs, called value functions. Such functions are defined with respect to particular ways of acting (policies). Value functions are used to organize and structure the search for policies, which is one of the key ideas in RL!

Bellman equation — value function calculating values in states for MDPs

One way to utilize value functions is the Bellman equation, which is one of the fundamental RL equations and can take many forms depending on the problem. The equation above is one example aimed at MDPs. It shows how to get a value (V) for being in a particular state (s) and following a policy (π) in that state. In detail, it uses the reward in that state plus the discounted sum of transition probabilities (P) times the values of next states (sʹ).
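A standard textbook form of this relationship, for a deterministic policy π (reward conventions differ slightly between texts; here the reward is attached to the current state), is:

$$V^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^{\pi}(s')$$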

… what is the ultimate goal in MDPs? 🏒 🥅

The goal in MDPs is usually to find the optimal policy (π*), the one that maximizes the long term expected reward. As seen from its high-level definition below, the optimal policy returns a specific action to take in each state of an environment. The expected sum of discounted rewards is calculated by following a policy (π) through a sequence of states at particular times (t).

optimal policy (π*) — finding the right actions in states for MDPs
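In symbols, this is commonly written along the lines of:

$$\pi^{*} = \arg\max_{\pi} \; \mathbb{E}\left[\sum_{t \ge 0} \gamma^{t} R(s_t, \pi(s_t))\right]$$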
maze example — map [Klein & Abbeel 2018]

… now imagine an agent having a map of the maze as in the picture above; then it will know what to do in each situation to get the diamond or avoid getting killed, since there is a specific action (green arrow) for each state -> this is what ‘having a policy’ means!

… so where is RL in this maze example? ‍🕵️‍

It is there, but it depends on how much we know about the maze (environment) and what we are trying to accomplish. The maze in which an agent takes actions to find rewards can be fully described with an MDP only if all of its characteristics are known. Then, an agent knows all actions in each state, and which new states and rewards those actions lead to. However, the agent still needs to figure out which series of actions optimally leads to the ultimate reward / goal, i.e. figure out the optimal policy. This is sometimes called offline planning; it can be solved just by using knowledge of the MDP and finding the policy, which is not really an RL problem.

In the ‘real’ world, however, there are environments of arbitrary size that can be enormous and combinatorial. For instance, a maze can have many states with many possible actions and outcomes. Then there is not necessarily any knowledge of transitions or rewards until an agent actually tries some actions and figures this out! This is sometimes called online learning, which still assumes an MDP and continues looking for the policy, but it is closer to the ‘true’ nature of RL. It means an agent learns from data as it comes in, in contrast with the offline case that uses only static data. To summarize, RL problems can be represented with MDPs that may be complex and incomplete in terms of their main properties.

… let’s see how to solve RL tasks by leveraging introduced MDP definitions!

3) How to solve RL problems? 🎮

“RL problems can be solved with dynamic programming, temporal difference learning, or policy gradient methods. However, for agents operating in more complex environments, it is often necessary to find good approximate solutions by employing deep neural networks and some cunning techniques that can tackle the fundamental exploration-exploitation tradeoff.”

Approaches for solving RL problems vary with the nature of the environment, such as how much we know about it or what tasks we want to solve in it. In general, RL applications can be represented by MDPs and solved with some form of Dynamic Programming (DP). DP is an optimization technique that divides problems into smaller tasks and solves them recursively. RL problems can also be solved with methods using just some aspects of DP, such as the Temporal Difference (TD) learning method. In TD learning, algorithms learn by combining the sampling of Monte Carlo methods (estimating expected values) with the bootstrapping of DP (updating estimates on the basis of other estimates). One of the modern RL solutions is the Policy Gradient (PG) method, which can directly approximate the optimal policy without using any approximate value functions. The PG method is usually implemented via deep neural networks, which are Artificial Neural Networks (ANNs) with many layers. There are many other methods for solving RL problems, but they all share a similar aim: using the experience gathered during learning to somehow obtain the unknown properties of environments, such as the probabilities or rewards of MDPs.

⇒ RL is solved via tabular or approximate solutions! 🤖 🍽 📐

  1. In simple MDPs, solutions are mostly found via tabular solutions, in which tables (arrays) represent the approximate value functions, since the environments here are small or simple. These methods can often find exact solutions, such as exactly the optimal policy or optimal value function [Sutton & Barto 2017]. In other words, it is about finding an optimal policy or optimal value function, given possibly infinite time and data, in fully described environments. Algorithms calculate this through a recursion over values and policies for all states of an environment. This can be done quite simply with some variant of recursion over the equations for policies and values introduced in the previous section, or by iterating them, such as in value iteration or policy iteration (a short value-iteration sketch follows this list).
  2. In complex MDPs, solutions are mostly found via approximate solutions that can generalize from previously encountered states to new ones, since the environments here are large. In other words, the problem shifts heavily toward finding a good approximate solution that is not computationally heavy, since we cannot expect to find an optimal policy or optimal value function in such environments [Sutton & Barto 2017]. It is also often necessary in this case to help agents with their job by using ANNs. For instance, ANNs can process an agent’s experience by acting as nonlinear function approximators, learning value functions or maximizing expected rewards.
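As a concrete illustration of the tabular case, here is a minimal value-iteration sketch in plain Python on a made-up three-state chain; the states, transitions, rewards, and number of sweeps are invented for illustration only.

```python
# Minimal value iteration on a tiny, made-up MDP (the tabular case).
# V(s) is stored in a plain dict -- the "table" of the tabular approach.

GAMMA = 0.9
STATES = ["A", "B", "GOAL"]
ACTIONS = ["stay", "move"]

# P[(s, a)] -> {s': probability}, R[(s, a, s')] -> reward (illustrative numbers)
P = {
    ("A", "stay"): {"A": 1.0},
    ("A", "move"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0},
    ("B", "move"): {"GOAL": 0.8, "A": 0.2},
    ("GOAL", "stay"): {"GOAL": 1.0},
    ("GOAL", "move"): {"GOAL": 1.0},
}
R = {("B", "move", "GOAL"): 10.0}  # all other rewards default to 0

V = {s: 0.0 for s in STATES}
for _ in range(100):  # repeat synchronous Bellman backups; 100 sweeps is plenty here
    V = {
        s: max(
            sum(p * (R.get((s, a, s2), 0.0) + GAMMA * V[s2])
                for s2, p in P[(s, a)].items())
            for a in ACTIONS
        )
        for s in STATES
    }

# Greedy policy read off the converged values: pi(s) = argmax_a Q(s, a)
policy = {
    s: max(ACTIONS, key=lambda a: sum(
        p * (R.get((s, a, s2), 0.0) + GAMMA * V[s2])
        for s2, p in P[(s, a)].items()))
    for s in STATES
}
print(V, policy)
```

Each sweep applies the Bellman backup to every state; for larger environments this is exactly the kind of computation that no longer fits in a table.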

… it seems like a good time for some questions again!

… so what’s the best way an agent can solve the task? 🎢

It is usually the one by which an agent can find the right balance between exploring and exploiting while completing tasks in an environment. Exploration is the capability to learn from new data, while exploitation is the capability to use what is already known. Both are important abilities for an agent to have. However, balancing them is actually a hard task in RL, since the two clash when trying to come up with policies for an environment.

This conflict can be handled by two different learning methods for finding ‘good’ policies: on-policy and off-policy learning. Both attempt to evaluate or improve a policy while ensuring that all actions are selected infinitely often, meaning that an agent keeps trying them. The main difference is that on-policy learning does this for the policy that is used to make decisions, while off-policy learning does it for a policy different from the one used to generate the data [Sutton & Barto 2017]. Both can be combined with function approximation, which is very useful in complex environments. There are then usually challenges with the convergence or stability of such approximate solutions, but we will not talk about those in this introductory material :)

… can algorithms differentiate between agent’s goals and habits? ☯

Yes, algorithms can be built to do this quite well, and there are actually various terms across fields describing such a distinction in behavior. In cognitive science, it would be about reflective versus reflexive decision making and choice. In RL, thinking about this psychological distinction between goal-directed and habitual behavior, it is about model-based versus model-free algorithms. Model-based methods select actions by planning ahead via a model of the agent’s environment. Model-free algorithms make decisions by accessing information that has been stored in a policy or an action-value function [Sutton & Barto 2017].

Let’s tie this back to the MDP terminology introduced in the previous section. The most likely job for model-free algorithms, such as TD learning, is the one where not all properties of the MDP are known. In connection with the rather colloquial planning terms used earlier, model-free algorithms would be the ones solving the online learning case and model-based ones the offline planning case.

… what are some common RL algorithms? 🔦

It is perhaps worth mentioning the fundamental algorithms in RL that can be used, to some extent, for approximating solutions, since this is what RL implementations are all about. There are two distinct algorithms called Q-learning and SARSA that belong to the TD method. They both require action-value estimates to come up with their policies, i.e. they must first learn the values of actions in order to select the best ones. The main difference is their policy learning type: Q-learning is off-policy and SARSA is on-policy. Then, there are two distinct algorithms called REINFORCE and Actor-Critic that belong to the PG method. They can learn some parameterized policy that can select actions without consulting a value function (action-value estimates) [Sutton & Barto 2017]. A parameterized policy is a policy expressed as a function of some independent quantities called parameters, e.g. the parameters can be the weights of a neural net implementing the PG method.
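For the PG side, ‘parameterized policy’ simply means a policy written as a function π(a | s, θ) of parameters θ (e.g. network weights), which is then improved by gradient ascent on some performance measure J(θ); schematically (exact notation varies between texts):

$$\theta_{t+1} = \theta_{t} + \alpha \, \nabla_{\theta} J(\theta_{t})$$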

To summarize, TD algorithms can learn action values and then use them to determine action selections, while PG algorithms can learn a parameterized policy that enables actions to be taken without consulting action-value estimates [Sutton & Barto 2017]. In other words, it is about representing the policy via value functions versus parametrically.

… well this may sound a bit confusing, so let’s just look at the Q-learning alone in the rest of this RL introduction text!

… how does Q-learning work? 🎛

0001 Q-learning is a model-free algorithm implementing off-policy learning that belongs to the TD method. It bases an agent’s learning on experience, through state-action pairs, without an explicit specification of transition probabilities. The letter Q stands for the ‘quality’ of taking an action in a state, which is expressed by Q values representing state-action pairs. A high Q value means that a state-action pair is better to be in than one with a low Q value.

0010 Q-learning usually follows some exploration-exploitation policy while learning the Q values associated with the optimal policy. It stores a Q value for each state-action pair of the environment and updates them directly as it learns, i.e. it uses its own transitions (stored data) to directly produce solutions to the equations. The goal is to infer the optimal policy (π*) by approximating the Q function, as shown below.

Q-learning — optimal policy
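In symbols: once the Q values have been learned, the (greedy) policy simply picks the highest-valued action in each state,

$$\pi^{*}(s) = \arg\max_{a} Q(s, a)$$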

0100 Q-learning works by repeatedly evaluating an update equation that produces Q values for each state-action pair in an environment, such as the equation below. It contains something called the learning rate (α), which determines how strongly newly acquired information overrides what the agent has already learned (an agent that learns only the newest information vs. nothing new). There is also the discount factor (ɣ), which determines the importance of future rewards by setting up an agent for instant or delayed gratification (aka a nearsighted or farsighted agent). Colloquially speaking, the new Q value produced by this equation roughly records what happened (sʹ) when an agent was in a state (s) and tried doing something (a).

Q-learning — equation for Q values
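In its standard form, this update reads:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

The term in brackets is the temporal-difference error: the gap between the new estimate r + ɣ·max Q(sʹ, aʹ) and the old estimate Q(s, a).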

1000 In its simplest implementation, Q-learning can use a table to store its data. However, this may not be feasible for large problems (environments with lots of states or actions). This is where ANNs come into play as function approximators, allowing Q-learning to scale up where tables just can’t. They provide more flexibility to the solution, but at the cost of sacrificing a bit of its stability. Regardless of the specific Q-learning implementation, there is also a need for a game plan when choosing actions. In the following coding example, we use an epsilon-greedy strategy, borrowed from multi-armed bandit algorithms, for deciding on actions with regard to the exploration-exploitation dilemma.
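A minimal sketch of such an epsilon-greedy action choice in plain Python (the toy Q-table, state, and action names here are placeholders):

```python
import random

EPSILON = 0.1  # fraction of the time the agent explores instead of exploiting

def epsilon_greedy(Q, state, actions, epsilon=EPSILON):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest Q value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Placeholder usage: a toy Q-table with two actions available in state "s0".
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
action = epsilon_greedy(Q, "s0", ["left", "right"])  # usually "right"
```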

… so let’s get straight to implementation of our own RL example!

4) Coding Example: Maze Solved by Q-Learning 💻

“We will try to solve a well known maze-like environment with RL by coding the Q-learning algorithm via table and neural net implementations.”

Q-Learning with Table 🤖 🎛 🍽

(Kaggle Kernel version to run in your browser)
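The original notebook lives in the Kaggle kernel linked above; as a stand-in, here is a minimal, self-contained sketch of tabular Q-learning on a made-up 1-D corridor ‘maze’. The environment, its rewards, and all hyperparameters are illustrative assumptions, not the kernel’s actual code.

```python
import random

# A made-up 1-D corridor "maze": states 0..5, start at 0, the exit (goal) at 5.
# Each move costs -1 and reaching the exit gives +10. All numbers are illustrative.
N_STATES, GOAL = 6, 5
ACTIONS = [-1, +1]                       # step left / step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1    # learning rate, discount factor, exploration rate

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}  # the Q table

def step(s, a):
    """Environment dynamics: clip moves to the corridor and reward the exit."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    reward = 10.0 if s_next == GOAL else -1.0
    return reward, s_next

for episode in range(500):
    s = 0
    for _ in range(100):                 # cap the episode length, just in case
        # epsilon-greedy action choice (explore vs. exploit)
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        r, s_next = step(s, a)
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        td_target = r + GAMMA * max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])
        s = s_next
        if s == GOAL:
            break

# Read the learned greedy policy out of the table: it should point right (+1) toward the exit.
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)}
print(policy)
```

After training, reading the greedy action out of the table for every state should send the agent straight toward the exit.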

Q-Learning with ANN 🤖 🎛 ☎️

(Kaggle Kernel version to run in your browser)
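Again as a stand-in for the linked kernel, here is a minimal sketch of the same idea with a small neural network approximating the Q values. It is written with PyTorch purely for illustration (the original kernel may use a different framework), reuses the same made-up corridor environment, and skips refinements such as replay buffers or target networks.

```python
# The same made-up corridor "maze", but with a small neural net approximating Q values.
import random
import torch
import torch.nn as nn

N_STATES, GOAL = 6, 5
ACTIONS = [-1, +1]                      # step left / step right
GAMMA, EPSILON = 0.9, 0.1

def one_hot(s):
    x = torch.zeros(N_STATES)           # states are fed to the net as one-hot vectors
    x[s] = 1.0
    return x

def step(s, a):
    s_next = min(max(s + a, 0), N_STATES - 1)
    return (10.0 if s_next == GOAL else -1.0), s_next

# A tiny Q-network: one-hot state in, one Q value per action out.
qnet = nn.Sequential(nn.Linear(N_STATES, 16), nn.ReLU(), nn.Linear(16, len(ACTIONS)))
optimizer = torch.optim.Adam(qnet.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for episode in range(500):
    s = 0
    for _ in range(100):                # cap the episode length, just in case
        q_values = qnet(one_hot(s))
        if random.random() < EPSILON:   # epsilon-greedy over the predicted Q values
            a_idx = random.randrange(len(ACTIONS))
        else:
            a_idx = int(torch.argmax(q_values))
        r, s_next = step(s, ACTIONS[a_idx])
        # TD target r + gamma * max_a' Q(s', a'), with value 0 at the terminal state
        with torch.no_grad():
            target = torch.tensor(r) if s_next == GOAL else \
                     torch.tensor(r) + GAMMA * qnet(one_hot(s_next)).max()
        loss = loss_fn(q_values[a_idx], target)  # move Q(s, a) toward the TD target
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        s = s_next
        if s == GOAL:
            break

# The greedy policy read from the network should typically point right (+1) toward the exit.
with torch.no_grad():
    greedy = {s: ACTIONS[int(torch.argmax(qnet(one_hot(s))))] for s in range(N_STATES)}
print(greedy)
```

The only real change from the table version is that Q values are now produced by a function approximator and nudged toward the TD target with gradient steps, which is what lets the same idea scale to environments far too large for a table.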

Conclusions

We have just walked through some applications of learning processes that may help readers start thinking about RL. We saw that problems can be defined with MDPs and solved with methods enabling agents to operate in various situations. Then, we coded our own RL solution addressing some situations in a maze-like environment. There are many other computational ways to solve RL problems, but hopefully this article conveys the basic ideas of RL and serves as a quick starting point for those searching for one.

Now, I recommend getting your hands dirty and making your own RL applications! Start perhaps by just exploring more examples of learning processes with reinforcement, but try to solve them with RL at the same time. This may help you later on when creating unique solutions in the ML space… which is currently flooded by copious images of cute 🐱 & 🐶 :)

References

  • R. S. Sutton and A. G. Barto: “Reinforcement Learning: An Introduction”, 2nd edition, A Bradford Book, 2017
  • D. Klein and P. Abbeel: “CS188 Intro to AI”, course notes, UC Berkeley, <http://ai.berkeley.edu>, 2018 (downloaded)

