Reinforcement Learning Uncovered

A gentle introduction to reinforcement learning: from agents and environments to state values and the Bellman equation

Alberto Prospero
Towards Data Science

--

A random image to make things cool! (photo from geralt on pixabay)

Introduction

Reinforcement learning (RL) is one of the most exciting and promising research areas of AI. Over the last few years it has repeatedly proven its effectiveness, producing agents able to defeat humans in many environments. In 2016 an RL agent beat Lee Sedol at the game of Go, and in 2019 another one competed at grandmaster level in StarCraft II. But how does reinforcement learning work, exactly?

In this tutorial, we will explain the basics of reinforcement learning in the simplest way possible, trying to decompose and justify the main concepts without complicated formulas. I am a mathematician, so I will die a little inside, but I hope that by the end you will have a fine grasp of what reinforcement learning is and, more importantly, that you will want to know more!

After all, RL is not that complicated. Trust me, as with everything in this world, it is just a matter of truly mastering the basics.

Agent and environment

At the very core, reinforcement learning is based on an agent that repeatedly interacts with an environment.

The agent can represent anything able to act: a poker player, a trader in the stock market, my poor, old, tired robot vacuum that cleans my house. The environment, instead, is everything external to the agent: for the poker player it is the opponents, the dealer and the deck; for the trader it is all the other people investing in the market and the market itself; for my vacuum it is my house and all the objects inside it.

As a result of every action the agent takes in the environment, two things take place:

  1. A reward is provided to the agent, which quantifies the advantage the agent receives from having taken that action in that particular situation.
  2. The environment changes: our action causes the environment to move from a certain situation, condition, circumstance (we will use the term state) to another one, which in general differs.

It is like in video games: when you (the agent) kill (act) the bad dragon in the game (the environment), your skill points increase (you get a reward) and the dragon is not part of the game anymore (the game state has changed).

I am pretty sure many of you are familiar with the following chart, which describes the loop going on between the agent and the environment.

Reinforcement learning loop: at time t, the agent takes an action and the environment returns a reward and a new state. At time t+1 the loop repeats.
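For the code-minded, here is a minimal sketch of that loop in Python. The ToyEnv and RandomAgent classes below are made up purely for illustration (they are not part of any real library): what matters is the shape of the loop, which is exactly the one in the chart.

    import random

    class ToyEnv:
        # A tiny made-up environment: the episode ends after 5 steps,
        # and each step returns a random reward of +1 or -1.
        def reset(self):
            self.t = 0
            return self.t                     # the state is just the time step
        def step(self, action):
            self.t += 1
            reward = random.choice([+1, -1])  # uncertain reward
            done = (self.t == 5)              # the ending state is reached
            return self.t, reward, done

    class RandomAgent:
        # A made-up agent: in every state it picks an action at random.
        def act(self, state):
            return random.choice(["left", "right"])

    # The reinforcement learning loop shown in the chart above.
    env, agent = ToyEnv(), RandomAgent()
    state, done, total_reward = env.reset(), False, 0
    while not done:
        action = agent.act(state)               # the agent acts...
        state, reward, done = env.step(action)  # ...the environment returns a reward
        total_reward += reward                  #    and a new state
    print(total_reward)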

One fair question at this point might be: ok, ok, that’s great, dude! But how long does this agent-environment interaction last? Well, we have two options:

  • The interaction between the agent and the environment lasts forever. This might be the case of a thermal regulator, for example, which sets the right temperature for the house every day and, in principle, has no scheduled end.
  • The interaction between the agent and the environment ends at some point. This might be the case of a game of solitaire, for example, which ends when we (the agent) win or lose. An important note: in this case there is always one particular state that represents the end of the game. It is called the ending state (or terminal state), and a single run of the game up to the ending state is called an episode.

Since the second case is more intuitive and easier to explain, we will focus on it, but keep in mind that everything extends naturally to the never-ending setting.

Return and policy

Now, like everybody in this world, our agent is interested in maximizing its reward (everybody wants to get rich, right?). But our agent is a smart guy! It does not aim to maximize just the reward it gets here and now (we will call this the immediate reward); it would rather maximize the rewards it receives across all the interactions it has with the environment, now and in the future. I do not know about you guys, but I would not choose 10 bucks right now if I could get 2 bucks every day for the next 20 days! So, what the agent really wants is to take actions that maximize the rewards received over time from here on. This quantity is called the return, and the agent wants to maximize its return. Here below is the mathematical translation of what we just said. In the following equation, t represents the generic current time step we are at right now, while T is the time step at which we reach the ending state.
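Using the standard symbols G for the return and R for the reward collected at each time step, this is simply:

    G_t = R_{t+1} + R_{t+2} + \dots + R_T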

In so doing, our poor agent still has a problem: in many, many settings there is high uncertainty about the state it reaches by taking some action, or about the reward it gets. Think about it. If we (the agent) buy (act) a stock on the market (the environment), we are not sure whether the price will rise the next day (positive reward) or drop (negative reward). If we (the agent) declare our love (act) to that girl (the environment), we are not completely sure of the answer, right? The people who formalized reinforcement learning identified three kinds of uncertainty occurring during the agent-environment interaction:

  • Reward. By applying our action, we are not sure the environment will provide a unique and deterministic reward.
  • Next state. By applying our action, we are not sure the environment will move to a unique and deterministic next state.
  • Agent action. We might act in the environment in a non-deterministic way: in a given state, we might choose to go left or to go right according to some probability.

Among these, the agent can only control the last one. While rewards and next states depend on the environment dynamics, the agent can only decide how to act. In each state, it can either play deterministically by always selecting one action, or choose randomly among multiple possible actions. We say that the agent has a policy, that is, a strategy to use across the whole game: in a given state, the policy tells which actions the agent can choose and with which probability it will take each of them.
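To make this concrete, here is a minimal sketch of a policy in Python: a table mapping each state to a probability distribution over actions, from which the agent samples. The states and actions below are made up for illustration.

    import random

    # A (made-up) stochastic policy: for each state, the probability of
    # choosing each available action. The probabilities in a state sum to 1.
    policy = {
        "crossroads": {"left": 0.7, "right": 0.3},  # mostly go left
        "dead_end":   {"turn_back": 1.0},           # a deterministic choice
    }

    def act(state):
        # Sample an action in the given state according to the policy.
        actions, probs = zip(*policy[state].items())
        return random.choices(actions, weights=probs, k=1)[0]

    print(act("crossroads"))  # prints 'left' about 70% of the time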

Another random image to make things even cooler! (photo from geralt on pixabay)

The value of a state

So, we realized that our agent has uncertainties about its future! But how can it deal with them? Poor guy, it just wanted to maximize its return! Well, the trick is to consider not just one possible return, related to one specific sequence of actions, rewards and next states, but rather all the possible returns, related to all the sequences of actions, rewards and next states which can occur. The agent needs to account for all the returns it might receive from all the possible futures that can be reached, and average them with their probabilities of happening.

Starting from the current state, which we might imagine for a moment as the root of a tree, the agent simulates one possible evolution of the dynamics (i.e. a sequence of actions, rewards and next states) up to the end of the episode. While traversing this branch, the agent multiplies the rewards it gets from state to state by the probability of that dynamic happening (i.e. the probability of choosing that action, getting that reward and ending up in that state). Once the ending state is reached, the agent creates another branch by considering a different possible evolution and repeats the process. This is done again and again until all the possible branches starting from our state have been considered. As a final step, all these quantities are summed up to compute the weighted average that defines the value of the state: it expresses how good it is for the agent to be in that state, according to all the possible future returns and the related uncertainties.

Notice an important point. We said above that among the three types of uncertainty, the agent can only control the way it acts (i.e. its policy), while next states and rewards depend on the environment dynamics. Hence, when assessing the value of a state, we can fix a certain policy and compute the value with respect to all the possible next states and rewards. In other words, we can make the state value depend on the policy, so that changing the policy changes the state value. For the mathematician who lives in you, this is written as:
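In the standard notation, where v_pi(s) is the value of state s under policy pi, E_pi denotes the weighted average (expectation) obtained when acting with pi, and G_t is the return:

    v_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right]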

It reads roughly like this: the value of the state s when we play with policy pi is equal to the weighted average of all the returns we get, considering all the possible futures which can take place starting from state s and playing with policy pi.

Now let’s try to make these concepts concrete with an example. Assume you are in the situation of deciding whether to declare your hidden, huge and crazy love to that girl (or boy). You have two options (actions): you declare, or you do not declare. If you declare, there is a 0.5 chance she says yes. If she says yes you are very happy, and so this gives you a return of +100. But if she says no you get sad, a little more unsure of yourself and slightly more distrustful of humankind and relationships in general: we could quantify all this strange but passionate whirlwind of emotions with a good -60. On the other hand, if you choose the other option and do not declare, you live in an eternal what-if scenario but spare yourself the risk of suffering, and so with probability 1.0 you get a reward of (let’s say) -10. So, what is the value of this state? Well, we still need to specify the policy. Indeed, the number of possible policies is infinite, because you (the agent) could not only choose to declare or not to declare, but you could also declare or not declare according to some probability. To keep things simple, let’s compare the two simplest cases, coming from the two deterministic policies:

  • You deterministically choose to declare. In this case, you have a return of +100 occurring with probability 0.5 and a return of -60 occurring with probability 0.5. By rolling out the branches of the tree, we have 100*0.5 = 50 and 0.5*(-60) = -30. Summing up, we get 50 - 30 = 20. We also need to consider the other branch, but since it has zero probability of occurring (our policy prevents us from choosing it), it contributes 0 (it would be 0*(-10)). So, summing up again, we obtain the state value, which in this case is 1*20 + 0*(-10) = 20. Not so bad!
  • You deterministically choose not to declare. In this case, the reasoning is similar and the state value is 0*20 + 1.0*(-10) = -10.

That’s it! But 20 is greater than -10, so here is the mathematical explanation of why you should always go for it!
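If you prefer code to arithmetic, here is the same computation as a tiny Python sketch (the numbers and names come from the example above, nothing else is assumed):

    # Each action leads to a list of (probability, return) outcomes.
    outcomes = {
        "declare":        [(0.5, +100), (0.5, -60)],
        "do_not_declare": [(1.0, -10)],
    }

    def state_value(policy):
        # Weighted average of all returns under `policy`, a dict mapping
        # each action to the probability of choosing it.
        return sum(
            p_action * p_outcome * ret
            for action, p_action in policy.items()
            for p_outcome, ret in outcomes[action]
        )

    print(state_value({"declare": 1.0, "do_not_declare": 0.0}))  # 20.0
    print(state_value({"declare": 0.0, "do_not_declare": 1.0}))  # -10.0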

The Bellman Equation

The Bellman equation is the last step of this tutorial. Simply put, it is a way to express the value of a state without performing, every single time, all the computations required to consider all the possible futures. In many settings, in fact, applying the weighted-average process we described above is literally impossible: there are too many states and too many possible episode evolutions. In chess, for example, the number of possible games has been estimated at around 10^120 (a 1 followed by 120 zeros). As you can imagine, computing the possible returns for each state would take ages! The Bellman equation comes to the rescue: even if it does not completely solve the problem, it provides a way to simplify and speed up the computations.

The core point of the Bellman equation is that it expresses the value of a state just in terms of the values of the next states that can be reached from it, neglecting in this way all the remaining future dependencies. In other words, if we know the exact values of the next states (and the immediate environment dynamics), we can exactly compute the value of the state we are in right now. You can see this crucial point from its expression:
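In the standard notation, where pi(a|s) is the probability that the policy chooses action a in state s and p(s', r | s, a) is the probability that the environment answers with next state s' and reward r (we omit the discount factor, which this introduction does not use):

    v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + v_\pi(s') \right]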

As you can notice, the value of the state s when we play with policy pi depends on the immediate environment dynamics (i.e. the action we take in s and the probability of getting a certain next state and reward) and on the values of the next states.

The Bellman equation also suggests an interesting way to compute the value of a state. We said that we can now compute the value of a state just in terms of the values of the next states that can be reached from it. But this reasoning also applies to each of those next states: if we know the exact values of the next-next states, we can exactly derive the value of each next state. The reasoning iterates state after state, up to the ending one: if we know the exact value of the ending state, we can compute the values of the states immediately before it, and knowing those we can go backwards all the way to the first one.
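As a sketch of this backward reasoning, here is a tiny Python example on a made-up chain of three states (start, middle, end): knowing that the ending state is worth 0, we recover the values of the earlier states by applying the Bellman update until the values stop changing. The transitions and rewards are invented for illustration, and the policy here is trivially deterministic.

    # A made-up deterministic chain: start --(+1)--> middle --(+2)--> end.
    # transitions[state] = (next_state, reward); "end" is the ending state.
    transitions = {"start": ("middle", 1.0), "middle": ("end", 2.0)}

    values = {"start": 0.0, "middle": 0.0, "end": 0.0}  # the ending state is worth 0

    # Repeatedly apply the Bellman update v(s) = r + v(s') until nothing changes.
    for _ in range(100):
        new_values = dict(values)
        for state, (next_state, reward) in transitions.items():
            new_values[state] = reward + values[next_state]
        if new_values == values:
            break
        values = new_values

    print(values)  # {'start': 3.0, 'middle': 2.0, 'end': 0.0}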

Finally, it is important to notice that even if it speeds up the computations, the Bellman equation cannot be applied concretely to games with really large state spaces: we still have to scan all the possible states at least once, which is prohibitive in many environments.

Conclusions

That’s it. Congratulations, you managed to come this far! I hope you enjoyed this introduction to reinforcement learning as much as I enjoyed writing it! By now, you should have a fair grasp of the basic concepts of reinforcement learning and be able to move on to the next level (..or episode!). Please feel free to comment if you have any questions!

See you around, stay gold! :)
