
Reinforcement Learning Policy for Developers

A pragmatic approach to understanding policies in Reinforcement Learning

Update: The best way to learn and practice Reinforcement Learning is to go to http://rl-lab.com

Policy is a somewhat tricky concept, especially for Reinforcement Learning beginners. This article tries to clarify the topic in plain and simple English, away from mathematical notation. It is written with developers in mind.

Policy

If you have ever heard of best practices or guidelines, then you have heard of a policy. Consider, for example, fire safety guidelines for people living in high-rise buildings. Probably the most important guideline is not to use the elevator during a fire. People should close doors, stock water, use wet sheets, and make their position known to firefighters. This series of actions comes from guidelines that are a compilation of best practices accumulated through time and experience. So the policy tells you how to act when facing a certain situation. It does not tell you the result or value of your action. However, you implicitly expect the best possible outcome when following this policy, for example escaping the fire unharmed.

Action Value

Action value is what you get when you take an action and assess its result. For example, a person trapped on a high floor of a burning building considers his/her options (the available actions), then tries each one of them to assess its outcome. Of course, in reality this is not possible, because once hurt there is no coming back. But in simulations, sports, and games, it is possible to play many scenarios and spot the actions that give the best results.

Relation Between Policy and Action Value

It goes without saying that a policy is the compilation of best practices, based on the actions that have the highest score (or value).

Thus we write

𝛑(s) = argmaxₐ q(s,a)

meaning that at state s we take the action a that gives the best result, where q(s,a) is the value of action a taken at state s.
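As a concrete illustration, here is a minimal Python sketch of that rule; the q-table, state, and action names are made up for the example:

# Hypothetical q-table: q[(state, action)] is the estimated value of
# taking `action` in `state`.
q = {
    ("fire_on_floor", "take_elevator"):       -10.0,
    ("fire_on_floor", "take_stairs"):           2.0,
    ("fire_on_floor", "signal_firefighters"):   5.0,
}

def greedy_action(state, actions):
    # pi(s) = the action a with the highest q(s, a)
    return max(actions, key=lambda a: q[(state, a)])

print(greedy_action("fire_on_floor",
                    ["take_elevator", "take_stairs", "signal_firefighters"]))
# prints: signal_firefighters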

But there is a catch. You need to keep in mind that the environment is stochastic, full of uncertainty. This means that the same action at the same state may lead to different results each time it is executed.

For example, a basketball player practicing three-point throws will score sometimes and miss other times, even when shooting from the same spot (or angle) with the same hand. He/she will have to compute an average (score / number of throws) to find out the efficiency of the combination of spot (or angle) and hand.

To discover the best angle and hand to use during a match, he/she will have to practice shooting from different angles and with either hand, then average the results to deduce which combination works best for him/her.

Of course, one way to solve this issue is to perform a large number of iterations. At each iteration, perform all available actions at that state, take the average of q(s,a) for each action, and update the policy 𝛑(s). We say that we are updating the policy on each iteration, so after each iteration we have a new policy 𝛑 that reflects the computed q-values.

After a sufficiently large number of iterations, when the average of q(s,a) starts to vary slowly enough to be considered stable, we say that we have reached a stable phase and that the policy has become optimal (meaning this is the best that can be reached). We call it the optimal policy and we write it 𝛑*.
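A rough sketch of that loop in Python; the try_action(s, a) function below is a hypothetical stand-in for the stochastic environment, and none of the names come from a specific library:

import random
from collections import defaultdict

# Hypothetical "true" mean outcome of each (state, action) pair. The agent
# never sees these directly, only noisy samples of them.
true_mean = {("s0", "a0"): 1.0, ("s0", "a1"): 3.0,
             ("s1", "a0"): 2.0, ("s1", "a1"): 0.5}

def try_action(s, a):
    # Stand-in for the stochastic environment: the same (s, a) pair
    # gives a different result each time it is executed.
    return random.gauss(true_mean[(s, a)], 1.0)

q_sum, q_count = defaultdict(float), defaultdict(int)
policy = {}

for iteration in range(1000):
    for (s, a) in true_mean:                 # perform all available actions
        q_sum[(s, a)] += try_action(s, a)
        q_count[(s, a)] += 1
    for s in ("s0", "s1"):                   # pi(s) = action with the best average q
        policy[s] = max(("a0", "a1"),
                        key=lambda a: q_sum[(s, a)] / q_count[(s, a)])

print(policy)   # settles on {'s0': 'a1', 's1': 'a0'}

For simplicity the sketch runs a fixed number of iterations instead of testing whether the averages have stopped moving.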

There are different ways to approach the topic of policy, and different ways to improve (i.e. update) it.

Optimal Policy

The optimal policy is reached when the policy being trained arrives at a stage after which it cannot be substantially improved. The importance of the optimal policy is that it allows an agent following it to achieve the "best possible" results.

Policy in Dynamic Programming Method

This is the easiest scenario, and also the least realistic one. All the environment dynamics are known; we just have to iteratively compute the q-values and, after each iteration, assign the action with the highest q(s,a) value to the policy 𝛑(s). In Dynamic Programming we know all the states and all the transition probabilities, so there is nothing to discover, only values to compute.
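As an illustration, here is a minimal sketch of that computation on a made-up two-state MDP; the transition table, rewards, and discount factor are invented for the example:

# Known dynamics: P[(s, a)] = list of (probability, next_state, reward).
P = {
    ("s0", "a0"): [(1.0, "s0", 0.0)],
    ("s0", "a1"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "a0"): [(1.0, "s1", 2.0)],
    ("s1", "a1"): [(1.0, "s0", 0.0)],
}
states, actions, gamma = ("s0", "s1"), ("a0", "a1"), 0.9

# Iteratively compute the q-values from the known transition probabilities.
V = {s: 0.0 for s in states}
for _ in range(100):
    q = {(s, a): sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
         for s in states for a in actions}
    V = {s: max(q[(s, a)] for a in actions) for s in states}

# After the sweeps, pi(s) is simply the action with the highest q(s, a).
policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
print(policy)   # -> {'s0': 'a1', 's1': 'a0'}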

More on Dynamic Programming in this article.

Policy in Monte Carlo Method

In Monte Carlo, we don’t know much about the inner workings of the environment, so we need to discover them by playing episodes. We play each episode to the end, until we reach a terminal state, then we compute the q-value q(s,a) for every state visited and every action performed.

However, unlike Dynamic Programming, we can’t compute the q-value for all states because we simply don’t know them beforehand. We have to discover them as we play the episodes. Now the question is: how do we play these episodes in a meaningful way, so that we can extract the optimal policy? Of course, there is the random walk policy, which consists of picking a random action at each state. This works at the beginning, but later on, when we start discovering that some actions are better than others, keeping the random walk quickly becomes inefficient because it won’t learn anything. The reason for this inefficiency is that the policy is never updated with the new values of q.

So it is better to use a policy that plays the best possible action at each state. But the problem with such a policy is that we will be executing the same action every time, so we won’t be able to discover new states; besides, we are not sure that the chosen action is truly the best one. Remember that the environment is full of uncertainty, and nothing guarantees that an action that yielded a good result at time t will yield the same good result at time t+1. So in order to get the best of both approaches, we adopt a policy called 𝜀-greedy. In the 𝜀-greedy policy, we follow the greedy policy (going for the best action that we know of so far) (1 - 𝜀) of the time, and we pick a random action 𝜀 of the time. This allows us to discover new states through random moves some of the time and to exploit the best action most of the time.
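Putting those ideas together, a rough Monte Carlo sketch could look like this; the episode format and helper names are assumptions for the example, not a specific library API:

import random
from collections import defaultdict

epsilon = 0.1
actions = ("left", "right")
returns_sum, returns_count = defaultdict(float), defaultdict(int)
Q = defaultdict(float)              # current q-value estimates

def epsilon_greedy(state):
    # Explore epsilon of the time, exploit the best known action otherwise.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update_from_episode(episode):
    # episode = [(state, action, reward), ...] played until a terminal state.
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + G              # return from this step onwards (undiscounted)
        returns_sum[(state, action)] += G
        returns_count[(state, action)] += 1
        Q[(state, action)] = (returns_sum[(state, action)]
                              / returns_count[(state, action)])

# Example usage on one hand-written episode:
update_from_episode([("s0", "left", 0.0), ("s1", "right", 1.0)])
print(Q)

Each finished episode pushes its returns into the running averages, so the 𝜀-greedy choice gradually shifts toward the actions with the best observed q-values.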

Follow these links to learn more about Monte Carlo and Temporal Difference methods.

Stochastic Policy

Not everything is about picking the action that yields the best q-value. To really grasp this idea, try the following experiment: play Rock-Paper-Scissors with someone, and pay attention to the fact that playing the same action repeatedly is a bad idea (even if it worked the first few times), because your opponent will counter it. For example, if you frequently play Rock, your opponent will play Paper and win. So the policy is no longer about the best result of an action at a certain state, but about the probability distribution over actions that will let you win. In other words, it is about how unpredictable your actions should be.

Note that a stochastic policy does not have to be stochastic in all states; it suffices that it is in some of them. In the states where the policy acts deterministically, its action probability distribution assigns 100% to one action and 0% to all the others.
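For instance, a stochastic policy for Rock-Paper-Scissors can be just a probability distribution over the three actions; the uniform numbers below are one illustrative choice:

import random

# A stochastic policy: a probability distribution over actions per state.
# Rock-Paper-Scissors has a single state here, and the classic
# unexploitable choice is the uniform distribution.
policy = {"game": {"rock": 1/3, "paper": 1/3, "scissors": 1/3}}

def get_action(state):
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(get_action("game"))   # rock, paper or scissors, each about 1/3 of the time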

Policy Based Reinforcement Learning and Policy Gradient Step by Step explain stochastic policies in more detail.

Policy Evaluation

What does it mean to evaluate a policy? Well, it is like testing anything else: you try to find out whether it delivers what it promises. In the same way, policy evaluation consists of asking a policy to provide an action for a certain state, then performing this action and computing the result (a q-value, for example). Eventually, not all actions will yield the expected result, so the policy needs some fine-tuning. This is the job of Policy Control, also called Policy Improvement.

Policy Control / Improvement

Policy control, or improvement, is about giving feedback to the policy after executing an action at a certain state. The result of the action is fed into the policy so that it updates its internal behaviour, making that action more (or less) likely to be chosen the next time the same state is reached. This feedback is called the update.

Consider again the example of the basketball player shooting from the three-point line. Starting with the left hand, he/she scored a few times, so the policy is updated to use the left hand. As the practice session goes on, he/she turns out to be successful 30% of the time when using the left hand, but 70% of the time when using the right hand. The policy should then be updated to use the right hand rather than the left hand when on this spot (or angle).

Conceptual Implementation

To conceptually implement a policy, let’s consider the following interface called IPolicy. It contains two methods:

  • getAction(s) returns the action to perform when in state s.
  • update(s, a, some_value) updates the current policy so as to adjust its behaviour at state s. This is usually done by telling the policy about the result (q-value or other) of action a performed at state s.

interface IPolicy{
   // get action given state s as indicated by the current policy
   getAction(s);

   // update policy at state s and action a
   update(s, a, some_value);
}

Each specific policy implements this interface in a way that reflects its behaviour. For example, RandomPolicy implements getAction(s) so that it returns a random action at state s. It does not implement the update method, since there is no need for any update when the behaviour is always random.

class RandomPolicy : IPolicy {
   // get random action at state s
   getAction(s){
      return random action at state s
   }

   // no need for implementation since the policy is random
   update(s, a, some_value){
   }
}

The GreedyPolicy returns the action at state s that gives the best q-value. For this reason, it has to store these values in an appropriate structure. The update method, on the other hand, updates the q-value computed at state s and action a. This is usually done by averaging it with previously computed values.

class GreedyPolicy : IPolicy {
   // store q values by state and action
   QValueByStateAndAction store
   // get action that has the highest q for the given state s
   getAction(s){
      return store.getActionOfMaxQ(s)
   }

   // update policy by storing the average of the q computed at
   // state s and action a
   update(s, a, q){
      prev_q = store.getQForStateAndAction(s, a)
      store.updateQ(s, a, average(prev_q, q))
   }
}

The EpsilonGreedyPolicy returns a random action at state s 𝜀 of the time, and the action that has the best q-value (1 - 𝜀) of the time. As for the update method, it has the same logic as GreedyPolicy.

class EpsilonGreedyPolicy : IPolicy {
   // store q values by state and action
   QValueByStateAndAction store
   epsilon = 0.1
   // get a random action epsilon of the time, and the action with
   // the highest q (1 - epsilon) of the time
   getAction(s){
     rnd = computeRandomValue()
     if(rnd < epsilon) return random action at state s
     else return store.getActionOfMaxQ(s)
   }

   // update policy by storing the average of the q computed at
   // state s and action a
   update(s, a, q){
      prev_q = store.getQForStateAndAction(s, a)
      store.updateQ(s, a, average(prev_q, q))
   }
}

StochasticPolicy does not directly rely on q-values (though it might do so indirectly); it returns an action according to the probability distribution of the actions’ success. For example, if action A has been successful 60% of the time, action B 30% of the time, and action C 10% of the time, then the action returned by the method follows the distribution 60% A, 30% B, 10% C.

As for the update method, it updates the underlying probability distribution of the actions.

class StochasticPolicy : IPolicy {
   // store success counts by state and action
   ActionByFrequency store
   // get action following the frequency of success
   getAction(s){
      return store.getActionByProbabilityOfSuccess(s)
   }

   // update policy by storing the occurrence of success
   // result : success = 1; failure = 0
   update(s, a, result){
      success_count = store.getSuccessCount(s, a)
      store.setSuccessCount(s, a, success_count + result)
   }
}

High Level Algorithm

To put everything together and reach an optimal policy, we consider a high-level algorithm that does the following (a minimal sketch in code follows the list):

  • Loop over the steps below for as long as necessary.
  • Get an action from the policy at state s: a = policy.getAction(s). This is policy evaluation.
  • Execute the action in the environment and observe the reward r and the new state s’.
  • Compute whatever value v is needed to assess the effectiveness of the action taken. This could be a q-value or something else, and it might involve a neural network or another estimator.
  • Give feedback to the policy so it can improve itself: policy.update(s, a, v). This is policy improvement.
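In code, that loop could look roughly like this; env, compute_value, and the policy object are placeholders for whatever environment, value estimate, and IPolicy implementation you are using, not a specific library API:

def train(env, policy, compute_value, num_iterations=10_000):
    # env, compute_value and policy are placeholders: any environment with
    # reset()/step(a), any value estimate, any IPolicy-like implementation.
    s = env.reset()
    for _ in range(num_iterations):
        # Policy evaluation: ask the policy for an action and measure its outcome.
        a = policy.getAction(s)
        s_next, r, done = env.step(a)
        v = compute_value(s, a, r, s_next)    # q-value, return estimate, ...

        # Policy improvement: feed the result back into the policy.
        policy.update(s, a, v)

        s = env.reset() if done else s_next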


Conclusion

This article focuses on policies in Reinforcement Learning in a practical way, with a descriptive explanation rather than a mathematical one. The policy is a building block of the whole RL structure, and this presentation is aimed first and foremost at developers.
