
The Fundamentals of Reinforcement Learning

Markov Decision Process, Policy, Value Function, Bellman Equation, and Dynamic Programming Implementation

Photo by Lenin Estrada on Unsplash

Reinforcement learning is probably the most relatable scientific approach to learning, because it resembles the way humans learn about things. Every day we, the learners, interact with our environment to find out what to do in certain situations, what the consequences of our actions are, and so on.

When we were babies, we didn’t know that touching a hot kettle would hurt our hand. However, as we learned how the environment responded to our action, i.e., that touching the hot kettle hurts, we learned not to touch it again. This illustrates the fundamental idea of reinforcement learning.

Reinforcement learning is learning what to do in order to maximize a numerical reward. This means that the learner should discover which actions yield the highest reward in the long run by trying them.

In this article, I want to discuss the fundamentals behind reinforcement learning, which include: the Markov decision process, policies, value functions, the Bellman equation, and, of course, dynamic programming.


Markov Decision Process

The Markov Decision Process is the fundamental problem that we try to solve in reinforcement learning. But what exactly is a Markov Decision Process?

A Markov Decision Process, or MDP, is a formulation of the sequential interaction between an agent and its environment.

Here, the learner and decision maker is called the agent, while the thing it interacts with is called the environment. In an MDP, the agent makes certain decisions or actions, and the environment responds by giving the agent a new situation, or state, and an immediate reward.

Agent-environment interaction in Markov Decision Process

In reinforcement learning, the main goal of an agent is to make decisions or actions that maximize the total amount of reward it receives from the environment in the long run.

Let’s say we want to train a robot to play the game of chess. Each time the robot wins a game, the reward would be +1, and if it loses a game, the reward would be -1. In another example, if we want to train a robot to escape from a maze, the reward would decrease by 1 for every time step that passes before it escapes.

The reward in reinforcement learning is how you communicate to the agent what you want it to achieve, not how you want it achieved.

Now the question is: how do we compute the cumulative amount of reward the agent has gathered over a sequence of actions? The mathematical formulation of the cumulative reward is defined as follows.

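In the standard notation of Sutton and Barto, with T denoting the final time step of the episode:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_T
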
Above, R is the reward the agent receives at each step of the sequence and G is the cumulative reward, or expected return. The goal of the agent in reinforcement learning is to maximize this expected return G.

Discounted Expected Return

However, the equation above only applies when we have an episodic MDP problem, meaning that the sequence of agent-environment interaction is episodic or finite. What if we have a situation where the interaction between agent-environment is continuous and infinite?

Suppose we have a problem where the agent works like an air conditioner: its task is to adjust the temperature given certain situations or states. In this problem:

  • States: the current temperature, number of people in a room, time of the day.
  • Action: Increase or decrease the room temperature.
  • Reward: -1 if a person in the room needs to manually adjust the temperature and 0 otherwise.

In order to avoid negative rewards, the agent needs to learn and interact with the environment continuously, meaning that there is no end to the MDP sequence. To handle this continuing task, we can use the discounted expected return.

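In the standard notation, the discounted return is:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}
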
In the equation above, γ is the discount rate, whose value lies in the range 0 ≤ γ ≤ 1.

The intuition behind the discount rate is that a reward the agent receives earlier in the sequence is worth more than a reward it receives several steps later. This assumption also makes sense in real life: 1 Euro today is worth more than 1 Euro several years from now because of inflation.

If γ = 0, the agent is short-sighted: it only cares about the immediate reward of the action it takes in the next step.

If γ is closer to 1, the agent is far-sighted: it puts more and more weight on future rewards.

With the discounted return, as long as the rewards are bounded and γ < 1, the expected return remains finite even though the sequence of interactions is infinite.


Policy, Value Function, and Bellman Equation

Now we know that the goal of an agent in reinforcement learning is to maximize the cumulative reward. In order to do so, the agent needs to choose which action to take in a given state such that it gets a high cumulative reward. The rule the agent uses to choose its actions, i.e., the probability of choosing each action in a given state, is called its policy.

A policy is the probability of an agent selecting an action A in a given state S.

In reinforcement learning, a policy is normally represented by π. This means that π(A|S) is the probability of the agent choosing action A given that it is in state S.

Now, if an agent is in state S and it follows policy π, then its expected return is called the state-value function for policy π. The state-value function is normally denoted as v_π(s).

Similarly, if an agent is in state S, takes an action A, and from then on follows policy π, then its expected return is called the action-value function for policy π. The action-value function is normally denoted as q_π(s, a).

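Formally, in the standard notation, both are expected returns, conditioned on the state (and, for the action-value function, also on the first action):

v_π(s) = E_π[ G_t | S_t = s ]
q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]
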
In some sense, the value function and the reward have some similarities. However, a reward refers to what is good in an immediate sense, while a value function refers to what is good in the long run. Thus, a state might have a low immediate reward but has a high value function because it is regularly followed by other states that have high rewards.

To compute the value function, the Bellman equation is commonly applied.

In Reinforcement Learning, the Bellman equation works by relating the value function in the current state with the value in the future states.

Mathematically, the Bellman equation can be written as the following.

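v_π(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]
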
As you can see from the mathematical expression above, the Bellman equation averages over all of the possible next states and rewards from any given state, weighting each by its probability of occurring under the environment dynamics p and the policy π.

To make it easier to understand how the Bellman equation works intuitively, let’s relate it to an everyday experience.

Let’s say that two months ago you learned how to ride a bike for the first time. One day while riding, the bike lost its balance when you pulled the brake on a surface covered in sand, and you slipped and got injured. This means that you received a negative reward from this experience.

One week later, you rode the bike again. When you rode it over a surface covered in sand, you slowed down. This is because you knew that if the bike lost its balance, bad things would happen, even though this time you didn’t actually experience them. In other words, the state “riding fast on sand” has a low value because it tends to be followed by states with negative rewards, which is exactly the relationship the Bellman equation captures.


Optimal Policy and Optimal Value Function

Whenever we try to solve a reinforcement learning task, we want the agent to choose actions that maximize the cumulative reward. To achieve this, the agent should follow a policy that maximizes the value function we have just discussed. The policy that maximizes the value function in all of the states is called the optimal policy, and it is normally denoted as π*.

To understand the optimal policy better, let’s take a look at the following illustration.

Optimal policy definition

As shown in the illustration above, we can say that a policy π' is as good as or better than a policy π if its value function at every state is as good as or better than that of π. An optimal policy is one that is as good as or better than every other policy at every state.

If we have an optimal policy, then we can actually rewrite the Bellman equation into the following:

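v_*(s) = max_a q_*(s, a)
       = max_a E[ R_{t+1} + γ v_*(S_{t+1}) | S_t = s, A_t = a ]
       = max_a Σ_{s', r} p(s', r | s, a) [ r + γ v_*(s') ]
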
The final equation above is called the Bellman optimality equation. Note that in this final form there is no reference to any specific policy π. The Bellman optimality equation basically tells us that the value of a state under an optimal policy must be equal to the expected return of the best action from that state.

From the equation above, it would be straightforward to find the optimal state-value function if we knew the optimal policy. However, in real life, we usually don’t know what the optimal policy is.

To find the optimal policy, dynamic programming is normally applied. With dynamic programming, the state-value function of each state is evaluated iteratively until we find the optimal policy.


Dynamic Programming to Find Optimal Policy

Now let’s dive into the theory behind using dynamic programming to find the optimal policy. At its core, the dynamic programming algorithm uses the Bellman equation iteratively to do two things:

  • Policy evaluation
  • Policy improvement

Policy evaluation is the step that evaluates how good a given policy is. In this step, the state-value function for an arbitrary policy π is computed. We have seen that the Bellman equation helps us compute the state-value function through a system of linear equations, as follows:

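v_π(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]   for every state s

In practice, instead of solving this linear system directly, dynamic programming turns it into an iterative update:

v_{k+1}(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ v_k(s') ]
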
With dynamic programming, the state-value function is approximated iteratively with this update until the value function converges in every state. The converged approximation is the value function v_π of the given policy π.

Iterative policy evaluation

After we have found the value function v_π of a given policy π, we need to improve the policy. Recall that a policy is optimal if and only if its value function is equal to or greater than that of any other policy in every state. With policy improvement, a new policy that is at least as good in every state can be generated by acting greedily with respect to v_π:

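π'(s) = argmax_a Σ_{s', r} p(s', r | s, a) [ r + γ v_π(s') ]
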
Notice that in the above equation, we use the v_π that we computed in the policy evaluation step to improve the policy. If the policy no longer changes after we apply the above equation, it means that we have found the optimal policy.

Overall, these two steps, policy evaluation and policy improvement, are done iteratively using dynamic programming. First, under any given policy, the corresponding value function is computed. Then, the policy is improved. With the improved policy, the next value function is computed, and so on. If the policy no longer improves compared to the previous iteration, it means that we have found the optimal policy for our problem.

Iterative policy evaluation and policy improvement in dynamic programming

Implementation of Dynamic Programming to Find Optimal Policy

Now that we know the theory behind dynamic programming and the optimal policy, let’s implement it in code with a simple use case.

Suppose that we want to manage the increasing demand for parking space in a city. To do so, we need to control the parking price depending on the city’s preferences. In general, the city council takes the view that the more parking spaces are used, the higher the social welfare. However, the city council also prefers that at least one spot be left unoccupied for emergency use.

We can define the use case above as a Markov Decision Process (MDP) with:

  • State: the number of occupied parking spaces.
  • Action: the parking price.
  • Reward: city’s preference for the situation.

For this example, let’s assume that there are ten parking spots and four different price levels. This means that we have eleven states (ten plus one, because there can be a situation where no parking space is occupied) and four actions.

To find the optimal policy for the given use case, we can use dynamic programming with the Bellman optimality equation. First we evaluate the policy, and then we improve it. We repeat these two steps until the result converges.

As a first step, let’s define a function to compute the Bellman optimality equation as shown below.

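Below is a minimal sketch of what such a function might look like (the complete code is on my GitHub page, linked at the end of this article). It assumes a hypothetical helper transitions(state, action) that returns a list of (probability, next_state, reward) tuples describing the dynamics p(s', r | s, a) of the parking MDP, and a value function V stored as a dictionary keyed by state.

    def bellman_optimality_update(V, state, actions, transitions, gamma=0.9):
        # Update V[state] in place using the Bellman optimality equation:
        # V[state] = max over actions of the expected one-step return.
        # `transitions(state, action)` is a hypothetical helper assumed to
        # return (probability, next_state, reward) tuples for p(s', r | s, a).
        action_values = []
        for action in actions:
            expected_return = sum(
                prob * (reward + gamma * V[next_state])
                for prob, next_state, reward in transitions(state, action)
            )
            action_values.append(expected_return)
        V[state] = max(action_values)
        return V[state]
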
The function above applies the Bellman optimality equation to evaluate the value function at any given state.

Next, let’s define a function to improve the policy. We can improve the policy by greedifying it: we transform the policy so that it assigns probability 1 to the action that maximizes the value function at a given state, as sketched below.

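Again, this is only a sketch under the same assumed transitions(state, action) interface; the actual implementation is in the linked repository.

    def improve_policy(V, states, actions, transitions, gamma=0.9):
        # Greedify: build a deterministic policy that, in every state,
        # assigns probability 1 to the action with the highest expected
        # return under V, and probability 0 to all other actions.
        policy = {}
        for state in states:
            def expected_return(action, s=state):
                # Same hypothetical transitions() helper as in the previous sketch.
                return sum(prob * (reward + gamma * V[next_state])
                           for prob, next_state, reward in transitions(s, action))
            best_action = max(actions, key=expected_return)
            policy[state] = {a: (1.0 if a == best_action else 0.0) for a in actions}
        return policy
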
Finally, we can wrap the policy evaluation and policy improvement into one function.

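A sketch of that wrapper, reusing the two functions above, could look like this: it sweeps over all states with the Bellman optimality update until the value function stops changing by more than a small threshold, and then greedifies the result.

    def find_optimal_policy(states, actions, transitions, gamma=0.9, theta=1e-6):
        # Evaluation step: repeatedly apply the Bellman optimality update to
        # every state until the largest change in a sweep falls below `theta`.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for state in states:
                old_value = V[state]
                bellman_optimality_update(V, state, actions, transitions, gamma)
                delta = max(delta, abs(old_value - V[state]))
            if delta < theta:
                break
        # Improvement step: act greedily with respect to the converged V.
        policy = improve_policy(V, states, actions, transitions, gamma)
        return V, policy
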
Now if we run the function above, we will get the following result:

From the result above, we can see that the value function increases as the number of occupied parking spaces increases, except when all of the parking spaces are occupied. This is exactly what we expect given the city council’s preferences in the use case.

The city council takes the view that the more the parking space is used, the higher the social welfare, and it prefers to have at least one parking space left unoccupied. Thus, the more parking spots are occupied, the higher the value function, apart from the last state.

Also note that when the parking occupancy is high (states nine and ten), the action changes from 0 (the lowest price) to 4 (the highest price) to avoid full occupancy.


References

The material in this article was inspired by the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, as well as the Fundamentals of Reinforcement Learning course by the University of Alberta on Coursera.

Make sure you read the book or take the course to dive deeper into the detail of Reinforcement Learning.

If you want to play around with the code in the above example, you can find it on my GitHub page.

