The first article in this series introduced the basic concepts and components of any Reinforcement Learning system and explained the theory behind the Q-Learning algorithm. In this article, the goal is to implement the algorithm in Python 3 and apply it in a real training environment. All the concepts mentioned in the first article (Applied Reinforcement Learning I: Q-Learning) will be applied here, and it is assumed that the reader knows and understands them, so if you are not familiar with these concepts or have not read the first article, it is recommended to read it first.
Environment – Taxi-v3
In order to make this article didactic, a simple and basic environment has been chosen that does not add too much complexity to the training, so that the learning of the Q-Learning algorithm can be fully appreciated. The environment is OpenAI Gym's Taxi-v3 [1], which consists of a grid world where the agent is a taxi driver who must pick up a customer and drop him off at his destination.
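As a quick sketch of how this environment can be loaded and inspected (assuming a classic gym release, where reset() returns the encoded state directly and render() prints the grid to the console):

```python
import gym

# Load the Taxi-v3 environment
env = gym.make("Taxi-v3")

state = env.reset()           # encoded initial state (an integer)
env.render()                  # prints the grid world to the console

print(env.action_space)       # Discrete(6)  -> 6 possible actions
print(env.observation_space)  # Discrete(500) -> 500 possible states
```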
Actions
As for the action space, the following discrete actions are available for the agent to interact with the environment: move south, move north, move east, move west, pick up the passenger and drop the passenger off. This makes a total of 6 possible actions, which in turn are encoded as integers from 0 to 5 for ease of programming. The correspondence between actions and numbers is shown in Figure 1.
States
The discrete state space is much larger, as each state is represented as a tuple containing the agent/taxi driver's position on the grid, the location of the passenger to be picked up and the passenger's destination. Since the map is a 2-dimensional grid, the agent's position can be fully represented by its position on the x-axis and its position on the y-axis, so the tuple representing the agent's state is: (pos_agent_x, pos_agent_y, passenger_location, destination). As in the case of actions, the tuples representing states are encoded as integers between 0 and 499 (500 possible states). Figure 3 shows the correspondence between the passenger's location/destination and the integer representing that location/destination, which is used in the tuple describing the state.
To understand the representation of the possible states as tuples in a visual way, two examples of states for the agent are shown in Figure 3.
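For illustration, the Taxi implementation referenced in [1] provides encode() and decode() helpers that translate between the state tuple and its integer index; a minimal example, accessed through env.unwrapped since gym.make() returns a wrapped environment:

```python
# Encode the tuple (taxi_row, taxi_col, passenger_location, destination)
# into the integer in [0, 499] used to index the Q-Table, and back again.
encoded_state = env.unwrapped.encode(3, 1, 2, 0)
print(encoded_state)

decoded_state = list(env.unwrapped.decode(encoded_state))
print(decoded_state)  # [3, 1, 2, 0]
```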
Rewards
As for the rewards received for each step performed by the agent, these will be (a quick sanity check of the default step reward is sketched after the list):
- +20 for successfully delivering the passenger (terminal state)
- -10 for executing pickup or drop-off actions illegally
- -1 for any step in which no other reward is triggered
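A quick sanity check of the default per-step reward (a sketch; it assumes the classic four-value step() API and uses action 0, which is a movement action):

```python
state = env.reset()
next_state, reward, done, info = env.step(0)  # a movement action
print(reward)                                  # -1: the default step reward
```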
Q-Learning Implementation
Now that the environment is well understood, the implementation of the Q-Learning algorithm can proceed. As in the first article, the pseudocode from Sutton and Barto's book [2] will be used as a reference to support the implementation of the algorithm.
1) Initialize Q-Table
The Q-Table is initialized as an m×n matrix with all its values set to zero, where m is the size of the state space and n is the size of the action space.
Since both actions and states are encoded to integers, the construction of the Q-Table can be done using these integers as indices.
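A minimal sketch of this initialization, using NumPy and the space sizes exposed by the environment:

```python
import numpy as np

# m = number of states (500), n = number of actions (6)
q_table = np.zeros((env.observation_space.n, env.action_space.n))
print(q_table.shape)  # (500, 6)
```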
2) Define Epsilon-Greedy Policy
The epsilon-greedy policy selects either the action with the highest Q-Value for the given state or a random action, depending on the selected epsilon parameter. For example, an epsilon of 0.15 means that 15% of the time an action will be chosen randomly, while an epsilon of 1 means that the action will always be chosen randomly (100% of the time).
A very interesting point of this policy is that the epsilon value can vary throughout training, allowing the agent to take more random actions at the beginning with high epsilon values (exploration phase), and to finally take the actions with the highest Q-Values using a very low epsilon (exploitation phase). For this environment, however, training is carried out with a constant epsilon value.
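A possible implementation of this policy, using the get_epsilon_greedy_action() name referenced later in the article (the body is a sketch of one common way to write it, relying on the env and q_table objects defined above):

```python
import random

def get_epsilon_greedy_action(q_table, state, epsilon):
    """Return a random action with probability epsilon,
    otherwise the action with the highest Q-Value for the given state."""
    if random.random() < epsilon:
        return env.action_space.sample()       # exploration: random action
    return int(np.argmax(q_table[state]))      # exploitation: greedy action
```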
3) Define the execution of an Episode
For each episode the agent will perform as many timesteps as necessary to reach a terminal state. In each of these timesteps, the agent will 1) choose an action following the epsilon-greedy policy, and execute that action. After executing it, the agent will 2) observe the new state reached and the reward obtained, information that will be used to 3) update the Q-Values of its Q-Table.
This process is repeated for each timestep, until the optimal Q-Values are obtained.
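A sketch of such an episode loop is shown below; it assumes the classic gym API (four values returned by step()) and uses the execute_episode() name mentioned later in the article, with the Q-Learning update rule described in the previous article:

```python
def execute_episode(q_table, epsilon, alpha, gamma):
    """Run one episode, updating the Q-Table at every timestep,
    and return the total reward obtained during the episode."""
    state = env.reset()
    total_reward, done = 0, False

    while not done:
        # 1) Choose an action following the epsilon-greedy policy and execute it
        action = get_epsilon_greedy_action(q_table, state, epsilon)
        next_state, reward, done, _ = env.step(action)

        # 2) Observe the new state and the reward, and
        # 3) update the Q-Value with the Q-Learning update rule
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )

        state = next_state
        total_reward += reward

    return total_reward
```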
As can be seen, the code is very simple: the get_epsilon_greedy_action() function defined before is used to select the action, the agent performs the selected action through the step() method of the environment, and finally the Q-Value is updated by applying the adaptation of Bellman’s optimality equation, as described and explained in the previous article.
4) Training the Agent
At this point, only the hyperparameters of the algorithm need to be defined, which are: the learning rate α, the discount factor γ and epsilon. In addition, it will also be necessary to specify the number of episodes the agent must perform in order to consider the training completed.
After defining all these variables, the execution of the training will consist of running the execute_episode() function defined above for each training episode. Each episode (and every timestep within each episode) will update the Q-Table until optimal values are reached.
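A sketch of this training loop; the hyperparameter values below are illustrative placeholders, not necessarily the exact ones used in the original notebook:

```python
# Hyperparameters (illustrative values)
alpha = 0.1          # learning rate
gamma = 0.95         # discount factor
epsilon = 0.15       # exploration rate, kept constant during training
num_episodes = 50    # number of training episodes

rewards_history = []
for episode in range(num_episodes):
    episode_reward = execute_episode(q_table, epsilon, alpha, gamma)
    rewards_history.append(episode_reward)
```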
It is important to note that the training execution keeps a record of the reward obtained in each episode in the rewards_history variable, in order to be able to show the results for evaluation after the training.
5) Evaluate the Agent
To evaluate the agent, three things will be used: the rewards obtained in each training episode, a visualization of the trained agent carrying out an episode, and the metrics of several executions of the trained agent.
The rewards obtained during the training process are a very important metric, as they should show the convergence of the rewards towards the optimal values. In this case, as the reward of each timestep is negative except for the terminal state, the algorithm should bring the episode rewards as close as possible to 0, or even above it. Plots of the rewards per episode obtained during two training sessions with different hyperparameters are shown below. As can be seen, the first plot shows how the rewards quickly reach their maximum values, approaching an asymptote at 0, which means that the agent has learned to obtain the best possible rewards in each state (optimal Q-Values). On the other hand, the second plot shows that the rewards neither improve nor worsen while ranging from 0 to -20000, which implies that the agent is not learning the task.
In this case, the reason that the second training works so poorly is due to the excessively high value of epsilon, since an epsilon of 0.99 will cause most actions to be chosen randomly, completely ignoring the exploitation phase.
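A minimal way to produce such a rewards-per-episode plot from the rewards_history list, using matplotlib:

```python
import matplotlib.pyplot as plt

plt.plot(rewards_history)
plt.xlabel("Episode")
plt.ylabel("Total reward")
plt.title("Rewards per episode during training")
plt.show()
```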
With regards to the visualization of the agent, the render mode will be used, which allows the agent to be visualized while interacting with the environment. To see what the agent has learned, an episode is executed in which the epsilon is 0, that is: the agent always takes the action with the highest Q-Value, thus following the optimal policy.
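A sketch of such a greedy rollout with rendering (the time.sleep() call only slows down the console animation so it can be followed):

```python
import time

state = env.reset()
done = False

while not done:
    env.render()                              # draw the current grid
    action = int(np.argmax(q_table[state]))   # epsilon = 0: always greedy
    state, reward, done, _ = env.step(action)
    time.sleep(0.5)

env.render()                                  # final state: passenger delivered
```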
As shown, executing the code above will display the trained agent picking up the passenger and taking him to his destination, unequivocally showing that the agent has learned how to correctly perform his task.
Finally, the behavior of the trained agent will also be evaluated in several different episodes, thus assessing the agent's ability to successfully complete the task from different starting points and with different passenger locations and destinations. As in the visualization, the agent will always choose the action with the highest Q-Value, thus showing what it has learned and never behaving stochastically.
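A sketch of this evaluation loop, averaging the number of timesteps and the total reward over several greedy episodes (the number of evaluation episodes is an arbitrary choice here):

```python
num_eval_episodes = 10
timesteps_per_episode, reward_per_episode = [], []

for _ in range(num_eval_episodes):
    state = env.reset()
    done, steps, total_reward = False, 0, 0
    while not done:
        action = int(np.argmax(q_table[state]))   # always the greedy action
        state, reward, done, _ = env.step(action)
        steps += 1
        total_reward += reward
    timesteps_per_episode.append(steps)
    reward_per_episode.append(total_reward)

print("Average timesteps per episode:", np.mean(timesteps_per_episode))
print("Average reward per episode:", np.mean(reward_per_episode))
```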
As the execution logs show, the average number of timesteps needed to complete an episode is 13.8, and the average reward received for each episode is 7.2. Both metrics show that the agent has perfectly learned to perform the task, since it needs very few timesteps to complete the task, and achieves a reward greater than 0 in all executions.
Conclusion
Q-Learning has shown itself able to learn the task very easily, since it only needed 50 episodes to reach the optimal Q-Values when executed with the appropriate hyperparameters. Despite these very good results, the importance of hyperparameters in this type of training must be highlighted (the two rewards-per-episode plots show how decisive epsilon can be), as adequate values make the agent converge to the optimal Q-Values quickly, while erroneous values can lead to the agent never learning. It must also be taken into account that the Taxi-v3 environment is a discrete and simple environment that has always shown very good results for the Q-Learning algorithm, so the fact that the agent has quickly and efficiently learned the task does not imply that the learning will be as efficient if it is applied to other types of environments.
In fact, it is essential for the application of Q-Learning that both the action space and the state space are discrete, since the construction of the Q-Table would be infeasible for continuous states or actions (the Q-Table would have infinite rows or columns). Therefore, despite the good results obtained from this training, Q-Learning is not always applicable to other types of environments, such as continuous problems, which will be introduced in future articles of this series.
Code
The complete implementation of the Q-Learning algorithm, together with the mentioned visualization and evaluation of the trained agent, can be found in my GitHub repository as a Jupyter Notebook, in order to facilitate understanding and learning.
REFERENCES
[1] OpenAI Gym, Taxi-v3. https://github.com/openai/gym/blob/master/gym/envs/toy_text/taxi.py
[2] Sutton, Richard S.; Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, 2018.