The intuition behind Reinforcement Learning

A short and simple introduction to Reinforcement Learning

Anwesan De
Towards Data Science

--

Video provided by the author via Vimeo. It shows a trained agent avoiding oncoming traffic by changing lanes and adjusting its speed. This was achieved using the DQN algorithm.

At first glance, Reinforcement Learning may appear incredibly overwhelming. State, action, environment, reward, value function, Q and a plethora of other terms and definitions may be quite a burden for a beginner. But what if I told you that we have been implementing this concept, quite literally, all our lives? Yes, you read that right! Almost every day we employ RL techniques in our day-to-day lives. Through this article I hope to show you just how intuitive Reinforcement Learning really is and how commonplace it is in our lives. Perhaps this approach will help demystify the concept a bit. I focus on the intuition rather than the implementation, so I have tried to use mathematical formulas as sparingly as possible.

THE EXAMPLE

Before we talk about the terminology and the definitions, let's analyze an example.

Examination: Imagine you have an examination tomorrow. Due to some emergency you were unable to prepare for the exam, so as a last resort you go through the previous years' exam papers for the course. After going through a few of them you begin to notice a pattern: more than 60% of the marks are allotted to questions from the first three chapters, while the remaining 40% is distributed equally among the other five chapters. Due to the shortage of time you can study either the first three chapters or the last five. You need to score at least 40% to pass the course.

image provided by Unsplash

Now my question to you is this: which chapters will you devote your remaining time to?

The first three chapters, of course! Based on your analysis of the previous papers, there is a high probability that a considerable share of the marks will once again come from the first three chapters. Choosing them gives you breathing room: even if you make a few mistakes, you can still clear the 40% bar and pass the course. Had you selected the other five chapters instead, they cover only about 40% of the marks, so even a single mistake could make your life difficult.
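
As a rough back-of-the-envelope check (assuming you can answer whatever you actually studied, and using the 60/40 split noticed above; the numbers are only an illustration):

\text{first three chapters: attainable} \approx 60 \text{ marks} \;\Rightarrow\; \text{margin over the pass mark} = 60 - 40 = 20

\text{last five chapters: attainable} \approx 40 \text{ marks} \;\Rightarrow\; \text{margin} = 40 - 40 = 0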

TERMINOLOGY

Let’s dig deeper into the terminology used in Reinforcement Learning.

Environment:

As the name suggests, the environment is the space with which the agent interacts. In our case the exam was the environment: we, the agent, interacted with the exam environment. It is the space in which we need our agent to behave in a particular manner, and we use algorithms to define that behavior. In the video at the beginning, the agent (the green car) interacts with its environment (the highway filled with blue cars).

Agent:

The agent is the program itself, the entity that interacts with the environment. In the examination example we are the agent; in the highway video the green car is the agent interacting with the highway environment.

Action:

This is the particular operation the agent performs on the environment. In our case it was the choice between the two options: the first three chapters or the last five. In the driving example it is the agent's decision about how to navigate the traffic, whether to slow down, change lanes or speed up.
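
To make the environment-agent-action loop concrete, here is a minimal Python sketch. The HighwayEnv class and its reset/step methods are made up for this article (loosely modelled on the common Gym-style interface); this is not the code behind the video above.

import random

class HighwayEnv:
    """Toy, hypothetical stand-in for a highway driving environment."""

    ACTIONS = ["slow_down", "keep_lane", "speed_up", "change_lane"]

    def reset(self):
        # Start in the right lane at a moderate speed; return the first observation.
        self.speed, self.lane = 1.0, 0
        return (self.speed, self.lane)

    def step(self, action):
        # Apply the chosen action and return (observation, reward, done).
        if action == "speed_up":
            self.speed = min(self.speed + 0.1, 2.0)
        elif action == "slow_down":
            self.speed = max(self.speed - 0.1, 0.1)
        elif action == "change_lane":
            self.lane = 1 - self.lane
        crashed = random.random() < 0.02 * self.speed  # toy crash model
        reward = -10.0 if crashed else self.speed      # faster driving earns more
        return (self.speed, self.lane), reward, crashed

env = HighwayEnv()
observation, done = env.reset(), False
while not done:
    action = random.choice(HighwayEnv.ACTIONS)    # a real agent would use a learned policy here
    observation, reward, done = env.step(action)  # the environment reacts and hands back a reward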

Reward:

The incentive to maximize reward is what makes this concept so exciting. If we look at the exam example carefully, we came to the conclusion of choosing the first three chapters on our own. Unlike supervised learning, where we are given example situations with specific, rigid instructions on what to do in each of them, here we are only given the situations (the previous years' question papers) and have to draw our own conclusions. The thing that motivated our choice was marks: our main goal was to maximize our marks, our reward so to speak. This incentive to maximize reward is the central motivation for the agent. It is what allows the agent to draw conclusions, and almost every algorithm in Reinforcement Learning revolves around maximizing reward in the most efficient way possible.

We can define the rewards so as to communicate our intentions to the agent. In the examination example the reward was the marks obtained in the exam, our goal was to maximize it, and all our decisions were influenced by this incentive. In the highway video I defined the reward so that it increases by a large amount whenever the car learns to turn and increase its speed, and a large negative reward is given whenever it crashes or slows down. By defining the reward this way we encoded the intuition that the more the car learns to maneuver at high speed without crashing or slowing down, the better; and just like that we taught the agent to drive recklessly. That is how powerful the concept of reward is.
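
A reward of the kind described above might look roughly like the hypothetical function below. The exact magnitudes used for the video are not stated in this article, so the numbers here are placeholders chosen only to convey the idea.

def highway_reward(crashed, slowed_down, changed_lane, sped_up):
    """Hypothetical reward shaping that mirrors the description above."""
    if crashed or slowed_down:
        return -100.0        # large penalty for crashing or slowing down
    reward = 1.0             # small reward for simply staying on the road at speed
    if changed_lane or sped_up:
        reward += 10.0       # big bonus for turning or gaining speed
    return reward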

State and History:

This is yet another very important concept in Reinforcement Learning. Whenever we (the agent) interact with the environment and perform some operation (an action), we observe a reaction in the environment and receive a reward in return. Based on which chapters we choose to study, we get marks in return; based on what the car chooses to do (slow down, speed up, turn, etc.) it gets a reward, and we see it either cruising along the highway or crashing into another car. Let's denote the observation at time t by Ot. Now we need a way to keep track of the observations, actions and rewards obtained at each step. This is where the history comes in: it is the sequence of actions, rewards and observations up to time t.

History at time t [picture provided by author]
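
Written out, the history at time t is simply everything the agent has seen and done so far; in the usual notation,

H_t = O_1, R_1, A_1, O_2, R_2, A_2, \dots, A_{t-1}, O_t, R_t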

How the agent interacts with the environment depends on the history: it analyzes the history and, based on its priorities, decides the next action. At this point it is obvious that maintaining the sequence of all observations, actions and rewards becomes extremely space-consuming and impractical. What we need is a concise summary of the history, something that tells the agent all the important details without being impractical. That is exactly what the state does: the state is essentially just a function of the history.

State as a function of history. [image provided by author]
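
In symbols, the figure says nothing more than

S_t = f(H_t)

As a rough illustration of why this summary matters (everything below is made-up sketch code, not the author's implementation), compare keeping the entire history with keeping only a small fixed-size summary:

import collections

history = []                          # full record of every step: grows without bound
recent = collections.deque(maxlen=4)  # compact summary: a fixed-size window of observations

def record(observation, action, reward):
    """Log one step in full, and refresh the fixed-size summary used as the state."""
    history.append((observation, action, reward))  # O(t) memory: the whole history
    recent.append(observation)                     # O(1) memory: a function of the history
    return tuple(recent)                           # the state the agent actually decides from

state = record(observation=(1.0, 0), action="speed_up", reward=1.0)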

We use a very special assumption, the Markov assumption, to define this function. The Markov assumption states the following:

Markov state definition. [image provided by author]
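
In symbols, a state S_t is Markov if and only if

P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, S_2, \dots, S_t]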

This means that the future is independent of the past given the present: given the present state, we can predict the future just as well as we could from the full history, which makes the history redundant. That is how powerful the Markov assumption is. In the highway example, for instance, it does not matter what happened in the past; as long as we can correctly describe the current situation, i.e. the positions and velocities of the blue cars, we are fine. (We could also treat it as a sequential model, but that is a different issue.)

Policy and Value function:

The policy is the function that describes the behavior of the agent, while the value function is a metric that tells us how good or bad it is to be in a particular state. It helps predict the reward we can expect to receive given that we are in that state.

The policy, represented by π, basically maps a state to a particular action. [image provided by author]
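
In symbols, a deterministic policy maps each state directly to an action, while a stochastic policy gives a probability of taking each action in each state:

a = \pi(s)

\pi(a \mid s) = P[A_t = a \mid S_t = s]
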
This is the value function for a given policy π and state s at time t. It is an expectation of the rewards we hope to collect in the future (at t+1, t+2 and so on). The discount factor γ controls how much importance we place on future rewards when taking this expectation; it lies between 0 and 1. A value of 0 means we only care about the immediate reward, while a discount factor of 0.99 means the subsequent steps far into the future still matter a lot. [image provided by author]
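
Written out in the usual notation, the state-value function in the caption is

v_\pi(s) = E_\pi[\, R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \,]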

If we look at the equation we get an idea of what the value function really is: it estimates the rewards we will obtain in the future under a given policy. There are many variations of the value function formula and it can be applied in various ways, but its main job is to provide a way of estimating the future rewards we hope to get; essentially, we are using a mathematical expression to predict the future. Beautiful, isn't it?
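
Since the value function is just an expectation of this discounted sum, a tiny helper in Python makes the role of γ concrete (purely illustrative code, not tied to any particular library):

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**k * R_{t+1+k} over a list of future rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.0))   # 1.0: only the immediate reward counts
print(discounted_return([1.0, 1.0, 1.0], gamma=0.99))  # 2.9701: later rewards still matter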

CONCLUSION

Hopefully this short introduction gives you an idea of the intuition behind Reinforcement Learning. It is indeed a beautiful concept and can be applied in many varied fields, including robotics, gaming and stock markets. I'd highly recommend checking out David Silver's playlist on Reinforcement Learning to understand the mathematics and implementation in depth. If you want to see Reinforcement Learning in action, visit my repo, where I used a Deep Q Network to train an agent to play the Google Chrome dino game.


I am an Electronics undergraduate at BITS Pilani interested in Machine Learning and Robotics | LinkedIn: https://www.linkedin.com/in/anwesan-de-66913a1ab |