Reinforcement Learning from Scratch: Designing and Solving a Task All Within a Python Notebook

Part 1: Defining the Environment, Finding the Optimal Policy with Value-Iteration and Introducing Q-Learning

Philip Osborne, PhD Researcher
Towards Data Science

--

Summary

In this article, I will introduce a new project that attempts to help those learning Reinforcement Learning by fully defining and solving a simple task all within a Python notebook. The environment and basic methods will be explained within this article and all the code is published on Kaggle in the link below. In addition, I have created a “Meta” notebook that can be forked easily and only contains the defined environment for others to try, adapt and apply their own code to.

Context

When I first started learning about Reinforcement Learning, I went straight into replicating online guides and projects but found myself getting lost and confused. “Why do the results show this? What does this parameter do? Why does the environment act in this way?” were some of the questions I began asking myself. It wasn’t until I took a step back, started from the basics of fully understanding how the probabilistic environment is defined, and built up a small example that I could solve on paper that things began to make more sense. However, I found it hard to find environments that I could apply my knowledge to and that didn’t need to be imported from external sources.

Therefore, I set myself a challenge:

Can I fully define and find the optimal actions for a task environment all self-contained within a Python notebook?

By following my work, I hope that others may use this as a basic starting point for their own learning.

Stage 1: Defining the Environment

The Task

Very simply, I want to know the best action in order to get a piece of paper into a bin (trash can) from any position in a room. I can throw the paper in any direction or move one step at a time.

Although this is simple for a human, who can judge the location of the bin by eyesight and has a huge amount of prior knowledge about distances, a robot has to learn from nothing.

This defines the environment, where the probability of a successful throw is calculated based on the direction in which the paper is thrown and the current distance from the bin.

For example, in the image below we have three people labelled A, B and C. A and B both throw in the correct direction but person A is closer than B and so will have a higher probability of landing the shot.

Person C is closer than person B but throws in completely the wrong direction and so will have a very low probability of hitting the bin. It may seem illogical that person C would throw in this direction but, as we will show later, an algorithm has to try a range of directions first to figure out where the successes are, and it has no visual guide as to where the bin is.

Task Environment Example

To create the environment in Python, we convert the diagram into a 2-d space of x and y values and use bearing mathematics to calculate the angles of the throws. We use normalised integer x and y values bounded between -10 and 10.

Environment Mapped to 2-d Space

Environment Probabilities

The probability of a successful throw is relative to the distance and direction in which it is thrown. Therefore, we need to calculate two measures:

  • The distance the current position is from the bin
  • The difference between the angle at which the paper was thrown and the true direction to the bin

Distance Measure

As shown in the plot above, the position of person A is set to be (-5,-5). This is their current state, and their distance from the bin can be calculated using the Euclidean distance measure:

For the final calculations, we normalise this and reverse the value so that a high score indicates that the person is closer to the target bin:

Because we have fixed our 2-d dimensions between (-10, 10), the maximum possible distance the person could be from the bin is sqrt(100 + 100) = sqrt(200). Therefore our distance score for person A is:
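A minimal sketch of these calculations, assuming the score is simply the distance subtracted from the maximum and divided by the maximum (one reasonable way to “normalise and reverse” the value; the notebook’s exact code may differ):

    import numpy as np

    bin_position = np.array([0, 0])       # bin fixed at the origin
    person_a = np.array([-5, -5])         # person A's current state

    # Euclidean distance from person A to the bin
    distance = np.linalg.norm(person_a - bin_position)     # sqrt(50) ≈ 7.07

    # Normalise by the maximum possible distance and reverse it so that
    # a higher score means the person is closer to the bin
    max_distance = np.sqrt(10**2 + 10**2)                   # sqrt(200) ≈ 14.14
    distance_score = (max_distance - distance) / max_distance
    print(distance_score)                                   # 0.5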

Direction Measure

Person A then has a decision to make: do they move, or do they throw in a chosen direction? For now, let’s imagine they choose to throw the paper; their first throw is at 50 degrees and their second at 60 degrees from due north. The direction of the bin from person A can be calculated by simple trigonometry:

Therefore, the first throw is 5 degrees off the true direction and the second is 15 degrees.

When we consider that good throws are bounded by 45 degrees either side of the true direction (i.e. not throwing the wrong way), we can use the following to calculate how good the chosen direction is. Any direction beyond the 45 degree bounds will produce a negative value and be mapped to a probability of 0:

Both throws are fairly close to the true direction, but the first is more likely to hit the bin.
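A small sketch of this direction score, assuming a linear scaling within the +/- 45 degree window (an assumption, but one that reproduces the probabilities used below):

    # Score a chosen throwing direction against the true bearing to the bin
    def direction_score(throw_direction, true_bearing, window=45):
        error = abs(throw_direction - true_bearing)
        score = (window - error) / window     # negative once beyond the window
        return max(score, 0.0)                # negative values map to 0

    true_bearing = 45                         # bearing to the bin from (-5,-5)
    print(direction_score(50, true_bearing))  # first throw:  ≈ 0.889
    print(direction_score(60, true_bearing))  # second throw: ≈ 0.667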

Probability Calculation

We therefore calculate our probability of a successful throw to be relative to both these measures:
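Assuming the two scores are simply multiplied together (again an assumption, but it matches the 0.444 probability quoted for this throw later in the article):

    # Probability of person A's 50 degree throw succeeding from (-5,-5)
    throw_probability_50 = distance_score * direction_score(50, true_bearing)
    print(throw_probability_50)               # ≈ 0.444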

Creating a Generalised Probability Function

Although the previous calculations were fairly simple, some considerations need to be taken into account when we generalise these and begin to consider that the bin or current position are not fixed.

In our previous example, person A is south-west of the bin and therefore the angle was a simple calculation, but if we applied the same calculation to, say, a person placed north-east of the bin then it would be incorrect. Furthermore, because the bin can be placed anywhere, we first need to find where the person is relative to the bin, not just the origin, and then use this to establish the angle calculation required.

This is summarised in the diagram below where we have generalised each of the trigonometric calculations based on the person’s relative position to the bin:

Angle Calculation Rules

With this diagram in mind, we create a function that calculates the probability of a throw’s success from any given position relative to the bin.

We then calculate the bearing from the person to the bin following the previous figure and calculate a score bounded within a +/- 45 degree window. Throws that are closest to the true bearing score higher, whilst those further away score less; anything more than 45 degrees off (in either direction) is negative and is set to a zero probability.

Lastly, the overall probability is related to both the distance and direction given the current position as shown before.

Note: I have chosen 45 degrees as the boundary, but you may choose to change this window or manually scale the probability calculation to weight the distance or direction measure differently.
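A condensed sketch of such a function is shown below. It folds the case-by-case trigonometry of the diagram into a single atan2 call and reuses the assumed scoring from the worked example; the notebook’s own implementation is structured differently but should give the same numbers.

    import numpy as np

    def throw_probability(person, throw_direction, bin_position=(0, 0), window=45):
        dx = bin_position[0] - person[0]
        dy = bin_position[1] - person[1]

        # Bearing from the person to the bin, measured clockwise from due north
        bearing = np.degrees(np.arctan2(dx, dy)) % 360

        # Direction score, bounded within a +/- 45 degree window of the true bearing
        error = abs((throw_direction - bearing + 180) % 360 - 180)
        direction_score = max((window - error) / window, 0.0)

        # Distance score, normalised by the maximum distance for a bin at the origin
        max_distance = np.sqrt(10**2 + 10**2)
        distance_score = (max_distance - np.hypot(dx, dy)) / max_distance

        return direction_score * distance_score

    # Re-check person A's two throws from (-5,-5)
    print(throw_probability((-5, -5), 50))    # ≈ 0.444
    print(throw_probability((-5, -5), 60))    # ≈ 0.333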

We re-calculate the previous examples and find the same results as expected.

Plotting Probabilities for Each State

Now that we have this as a function, we can easily calculate and plot the probabilities of all points in our 2-d grid for a fixed throwing direction.

The probabilities are defined by the angle we set in the previous function; currently this is 45 degrees, but it can be reduced or increased if desired and the results will change accordingly. We may also want to scale the probability differently for distances.

For example, the probability when the paper is thrown at a 180 degree bearing (due South) for each x/y position is shown below.
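A rough way to reproduce such a plot using the sketch function above (the notebook uses its own plotting code):

    import numpy as np
    import matplotlib.pyplot as plt

    throw_direction = 180    # due south

    # Success probability from every integer (x, y) position in the room
    xs = ys = np.arange(-10, 11)
    probs = [[throw_probability((x, y), throw_direction) for x in xs] for y in ys]

    plt.imshow(probs, origin="lower", extent=[-10, 10, -10, 10])
    plt.colorbar(label="P(success)")
    plt.title("Throw direction = 180 degrees (due south)")
    plt.show()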

Animated Plot for All Throwing Directions

To demonstrate this further, we can iterate through a number of throwing directions and create an interactive animation. The code becomes a little complex, and you can always simply use the previous code chunk and change the “throw_direction” parameter manually to explore different directions. However, this helps explore the probabilities and can be found in the Kaggle notebook.

Stage 2: Finding the Optimal Policy for the Environment where the Probabilities are Known

Model-based Methods

The aim is to find the optimal action in each state, either throwing or moving in a given direction. Because the probabilities are known, we can use model-based methods. We will demonstrate this first, using value iteration via the following formula:

Value Iteration Update Rules

Value iteration starts with an arbitrary function V0 and uses the following equations to get the functions for k+1 stages to go from the functions for k stages to go (https://artint.info/html/ArtInt_227.html).
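In the notation used throughout this article, the standard form of these updates is:

Q_k+1(s, a) = sum over s' of P(s'|s, a) * ( R(s, a, s') + gamma * V_k(s') )

V_k+1(s) = max over a of Q_k+1(s, a)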

Initial Value of Each State

The calculation for MOVE actions is fairly simple because I have defined the probability of a movement’s success to be guaranteed (equal to 1). Therefore, the Q value of, for example, action MOVE(1,1) from state (-5,-5) is equal to:

Q((-5,-5), MOVE(1,1)) = 1*(R((-5,-5), (1,1), (-4,-4)) + gamma*V((-4,-4)))

For now, the rewards are all set to 0; therefore the value of this first calculation is simply:

Q((-5,-5), MOVE(1,1)) = 1*(0 + gamma*0) = 0

First Q update for Move Actions

All move actions within the first update are calculated similarly. Value is added to the system by successful throws, so we can now calculate the Q value for a specific throw action. Previously, we found the probability of a throw at 50 degrees from (-5,-5) to be 0.444. Therefore, the Q value for this action updates accordingly:

Q((-5,-5), THROW(50)) =
0.444*(R((-5,-5), THROW(50), bin) + gamma*V(bin+)) +
(1 - 0.444)*(R((-5,-5), THROW(50), bin) + gamma*V(bin-))

Again the rewards are set to 0, and the positive bin value V(bin+) is 1 while the negative bin value V(bin-) is -1. Therefore we have:

Q((-5,-5), THROW(50)) =
0.444*(0 + gamma*1) + (1 - 0.444)*(0 + gamma*(-1))

With gamma = 0.8, this gives 0.3552 - 0.4448 = -0.0896.

It becomes clear that, although the value of moving doesn’t change from the initialised values after the first update, throwing at 50 degrees is worse due to the distance and the probability of missing.

Once each Q(s,a) is calculated for all states and actions, the value of each state, V(s), is updated as the maximum Q value for this state. The process is repeated back and forth until the results converge.

Value-Iteration Update Procedure

There is no set limit on how many times this needs to be repeated; it depends on the problem. Because our environment is so simple, it actually converges to the optimal policy within just 10 updates.
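A minimal value-iteration sketch under the assumptions used so far (deterministic moves, zero rewards, terminal bin values of +1/-1, gamma = 0.8 as implied by the worked numbers, the throw_probability sketch from earlier, and a coarse grid of throw directions to keep it quick):

    gamma = 0.8
    states = [(x, y) for x in range(-10, 11) for y in range(-10, 11)]
    moves = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    throws = range(0, 360, 10)          # coarse set of throwing directions

    V = {s: 0.0 for s in states}

    for update in range(10):            # converges within roughly 10 updates
        V_new = {}
        for (x, y) in states:
            q_values = []
            # Move actions: guaranteed success, zero reward, clipped to the room
            for dx, dy in moves:
                nxt = (max(-10, min(10, x + dx)), max(-10, min(10, y + dy)))
                q_values.append(gamma * V[nxt])
            # Throw actions: land in the bin (+1) or miss (-1)
            for direction in throws:
                p = throw_probability((x, y), direction)
                q_values.append(p * (gamma * 1) + (1 - p) * (gamma * -1))
            V_new[(x, y)] = max(q_values)
        V = V_new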

Convergence of Value-Iteration Updates

We first show whether the best action is a throw or a move using a simple coloured scatter plot, shown below.

Optimal Policy Plot v1

Improving Visualisation of Optimal Policy

Although the chart shows whether the optimal action is either a throw or move it doesn’t show us which direction these are in. Therefore, we will map each optimal action to a vector of u and v and use these to create a quiver plot (https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.quiver.html).

We define a scale for the arrows and use this to define the horizontal component, labelled u. For movement actions, we simply multiply the movement in the x direction by this factor, and for throw actions the arrow points one unit left or right (accounting for there being no horizontal movement at 0 or 180 degrees and no vertical movement at 90 or 270 degrees).

The horizontal component is then used to calculate the vertical component with some basic trigonometry where we again account for certain angles that would cause errors in the calculations.
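One simple way to produce the (u, v) components, slightly different from the step-by-step approach just described because sin/cos handles the special angles automatically, is sketched below (actions are assumed to be stored as ("move", (dx, dy)) or ("throw", bearing), which is an assumption about representation, not the notebook’s exact structure):

    import numpy as np

    # Convert an optimal action into quiver arrow components (u, v)
    def action_to_vector(action, scale=0.5):
        kind, value = action
        if kind == "move":
            dx, dy = value
            return scale * dx, scale * dy
        bearing = np.radians(value)              # bearing from due north
        return scale * np.sin(bearing), scale * np.cos(bearing)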

We see that some states have multiple best actions. Those directly north, east, south or west of the bin can move in multiple directions, whereas the states (1,1), (1,-1), (-1,-1) and (-1,1) can either move or throw towards the bin.

Lastly, I decided to show how the optimal policy changes over each update by exporting each plot and passing them into a small animation.

Stage 3: Finding the Optimal Policy with Reinforcement Learning where the Probabilities are Hidden

Q-Learning Algorithm

We will now imagine that the probabilities are unknown to the person and therefore experience is needed to find the optimal actions.

First, let’s try to find the optimal action if the person starts in a fixed position and the bin is fixed to (0,0) as before.

We will be applying Q-learning and initialise all state-action pairs with a value of 0 and use the update rule:

Q-learning Update Rule
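In the notation of the article, the standard Q-learning update is:

Q(s, a) <- Q(s, a) + alpha * ( r + gamma * max over a' of Q(s', a') - Q(s, a) )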

We give the algorithm the choice to throw in any 360 degree direction (to a whole degree) or to move to any surrounding position of the current one. There are therefore 8 places it can move: north, north-east, east, etc.

When it chooses to throw the paper, it will either receive a positive reward of +1 or a negative reward of -1 depending on whether it hits the bin or not, and the episode ends.

It will need to establish, through a number of trial-and-error attempts, where the bin is located and whether it is better to move first or to throw from the current position.

Q-Learning Pseudo-code

First, as before, we initialise the Q-table with arbitrary values of 0.

For now, the episode’s starting position will be fixed to one state, and we also introduce a cap on the number of actions in each episode so that it doesn’t keep going endlessly.

Each episode ends naturally when the paper is thrown. The action the algorithm performs is decided by the epsilon-greedy action selection procedure, whereby the action is selected randomly with probability epsilon and greedily (the current max) otherwise. To balance the random selection slightly between move and throw actions (as there are only 8 move actions but 360 throw actions), I decided to give the algorithm a 50/50 chance of moving or throwing when exploring, and it then picks an action randomly from the chosen type.
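A rough sketch of this selection procedure, assuming the Q values are held in a dictionary keyed by (state, action) pairs (not necessarily how the notebook stores them):

    import random

    MOVES = [("move", (dx, dy)) for dx in (-1, 0, 1)
             for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    THROWS = [("throw", angle) for angle in range(360)]

    def choose_action(Q, state, epsilon=0.2):
        if random.random() < epsilon:
            # Explore: 50/50 between a random move and a random throw
            return random.choice(random.choice([MOVES, THROWS]))
        # Exploit: the current best-known action for this state
        return max(MOVES + THROWS, key=lambda a: Q.get((state, a), 0))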

As before, the random movement action cannot go beyond the boundary of the room, and once the next state is found we update the current Q(s,a) based on the maximum Q(s’,a’) over all possible subsequent actions. For example, if we move from (-9,-9) to (-8,-8), Q((-9,-9), (1,1)) is updated according to the maximum of Q((-8,-8), a) over all possible actions, including the throwing ones.

If the algorithm throws the paper, the probability of success is calculated for this throw and we simulate whether it was successful, receiving a positive terminal reward, or unsuccessful, receiving a negative terminal reward.

The algorithm continues to update the Q values for each state-action pair until the results converge.

We will analyse the effect of varying parameters in the next post, but for now simply introduce some arbitrary parameter choices (a condensed sketch of the resulting training loop follows the list):

  • num_episodes = 100
  • alpha = 0.5
  • gamma = 0.5
  • epsilon = 0.2
  • max_actions = 1000
  • pos_terminal_reward = 1
  • neg_terminal_reward = -1
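The sketch below condenses the training loop with these parameters, reusing the throw_probability, choose_action, MOVES and THROWS sketches from earlier. It is illustrative only; the notebook’s implementation differs in detail.

    import random

    ALL_ACTIONS = MOVES + THROWS

    def run_q_learning(start_state=(-5, -5), num_episodes=100, alpha=0.5,
                       gamma=0.5, epsilon=0.2, max_actions=1000,
                       pos_terminal_reward=1, neg_terminal_reward=-1):
        Q = {}                                   # Q(s, a) values, defaulting to 0
        for _ in range(num_episodes):
            state = start_state
            for _ in range(max_actions):
                action = choose_action(Q, state, epsilon)
                kind, value = action
                if kind == "throw":
                    # Simulate the throw against the hidden success probability
                    success = random.random() < throw_probability(state, value)
                    reward = pos_terminal_reward if success else neg_terminal_reward
                    target = reward              # terminal: no future value
                else:
                    dx, dy = value               # move, clipped to the room boundary
                    next_state = (max(-10, min(10, state[0] + dx)),
                                  max(-10, min(10, state[1] + dy)))
                    best_next = max(Q.get((next_state, a), 0) for a in ALL_ACTIONS)
                    target = 0 + gamma * best_next
                old_q = Q.get((state, action), 0)
                Q[(state, action)] = old_q + alpha * (target - old_q)
                if kind == "throw":
                    break                        # episode ends after a throw
                state = next_state
        return Q

    Q = run_q_learning()                         # one training run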

Running the algorithm with these parameters 10 times, we produce the following ‘optimal’ actions for state (-5,-5):

Clearly these are not aligned, which strongly suggests the actions are not in fact optimal. Therefore, we need to consider how the parameters we have chosen affect the output and what can be done to improve the results.

Conclusion

We have introduced an environment from scratch in Python and found the optimal policy. Furthermore, I have begun to introduce the method for finding the optimal policy with Q-learning.

I will continue this in a follow-up post and improve these initial results by varying the parameters. For now, I hope this demonstrates enough for you to begin trying your own algorithms on this example.

If you have any questions, please feel free to comment below or on the Kaggle pages.

Thanks

Phil
