Explaining Double Q Learning for OpenAI environments using the movie TENET

With admiration and no spoilers. Not a problem if you haven’t watched the movie. See how CNN-based Double DQNs can solve your OpenAI Gym problems!

Paarvendhan
Towards Data Science

--

Red on the left going forwards and Blue on the right going backwards image by author

Please Read:

This post is by no means aimed at click-baiting people. I assure you that there will be no compromise on the actual topic; we will just use movie analogies whenever required to add a little bit of fun to the read.

Not a vibe but a wave.

No Spoilers

As a fellow fan, I understand some of us haven’t watched the movie yet due to the pandemic, so there will be absolutely no spoilers in this post. We are only going to discuss some of the concepts that made the movie amazing, not the movie itself, and you don’t need to have watched it to read this post.

Tl;Dr: https://github.com/perseus784/Vehicle_Overtake_Double_DQN

  • Unlike posts that explain all the concepts of machine learning, this read focuses on the learnings of the project: how each element affects it and how those elements can be engineered.
  • This is important because you can get the concepts and a deeper understanding from the original paper as well as from other amazing articles on the internet, but this approach will help you understand the practical hurdles if you are going to implement one yourself.

The Plan

  • Basics (Just to get it out of the way)
  • Requirements, Environment and System Configuration
  • Deep Reinforcement Learning
  • Model Architecture
  • The Pincer Movement
  • The Algorithm
  • Posterity
  • Takeaways
  • Results
  • TENET

Basics

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.

Why?

In other AI methods, we set a particular objective for our model to learn and tune the parameters accordingly. Here, we let the model choose what it wants to learn based only on the rewards it gets from taking a particular action.

This is important because we humans don't train for each task separately. We just think about the reward and equip ourselves with the skills to solve the task. I see RL as the first step towards a more generic Artificial Intelligence rather than a task-tailored one. If you want to learn a new environment or task, just change the rewards you get; the model should learn your new environment too.

How?

We can use RL to train an agent to do any task. This is a very similar idea to how we humans work.

Do some work, get candy.

Do more work, get more candy.

Do bad or no work, get no candy.

Simple enough, right? Read more about it here: https://medium.com/@ipaar3/saturnmind-94586f0d0158

It is like a baby: put it in any environment and it will adapt to that environment based on the rewards it gets for each of its actions.

Is this useful / Should I continue?

This post targets people who want to learn about Deep Reinforcement Learning, a deep-learning-based reinforcement learning method. Continue if you are interested in any aspect of Deep RL, such as how to create an agent that can learn to play a game or handle a task on its own.

People who want good insights into Deep Q-learning, Double Q-learning and their practical implementation are the target audience for this post.

That being said, there is nothing wrong in learning a new concept on the internet and I will try to keep you as engaged as possible.

Requirements, Environment and System Configuration

  • Python 3.7
  • Tensorflow 2.3.0
  • Numpy, OpenCV
  • Your OpenAI-Gym like Environment.

Our goal is to make the user car overtake the bot cars on its own in a roadway environment. We train the user car with deep reinforcement learning; the reward function penalises the user car every time it slows down, every time it crashes into a bot car, and whenever there are bot cars in front of it. Since we use the raw pixels from the environment as our state space, we can easily change the environment and the reward function to adapt to a new problem.
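As a rough illustration of that reward shaping, here is a minimal sketch with hypothetical names (shaped_reward and its arguments are made up for illustration; this is not the actual reward code of highway-env or of the repository):

# Hypothetical sketch of the reward shaping described above,
# not the exact function used by highway-env or the project code.
def shaped_reward(speed, max_speed, crashed, bot_car_ahead):
    reward = speed / max_speed      # slower driving earns less reward
    if crashed:
        reward -= 10.0              # heavy penalty for crashing into a bot car
    if bot_car_ahead:
        reward -= 0.5               # penalised while a bot car is directly in front
    return reward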

  • The OpenAI-Gym environment used here is the Highway Environment, which provides a clean pipeline for our RL experiments. Since there can be many versions and variations in implementing RL environments, OpenAI made this process standard.
  • So, any environment that you get with OpenAI like environments will have the same format for getting the rewards, variables and game stats. For our task, we are going to take the Highway Environment from here as shown below.
Highway Environment image by the author

The Protagonist:

In this environment, we are going to train our agent (The Protagonist) to learn to overtake the vehicles by directly taking in an image and outputting an action. That doesn't mean the same pipeline isn't useful for other cases; as I mentioned earlier, the same code can be applied to a very different problem since we follow the Gym format.

Please look at the zoo for examples of what you can build by changing just a couple of lines of code. The format below is the same for all environments, and the environment can be swapped by changing only the id passed to gym.make (line 3 of the snippet).

import gym
import highway_env
env = gym.make("highway-v0")
observation = env.reset()
done = False
while not done:
    env.render()
    action = agent(observation)  # your agent
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()
env.close()

System Config: Since the network architecture involves a couple of CNNs, it is good to have a machine with a decent GPU with more than 4 GB of VRAM and 16 GB of system RAM.

Code: https://github.com/perseus784/Vehicle_Overtake_Double_DQN

Deep Reinforcement Learning

Deep Reinforcement Learning in simple terms is just Reinforcement learning but instead of a Q-table, we will be using a neural network to learn policies for our environment.

Why?

  • We all know neural networks are effective in learning patterns. To put that into use, we are replacing the Q-table of our RL algorithm with a neural network.
  • The benefits of this are manifold. Firstly, we don’t need to memorize a separate policy for each state in the environment, which means we can now do RL in larger environments without memory constraints; keeping a policy for every state of a large environment would cause an exponential rise in Q-table size.
  • Secondly, we can now learn much more complicated policies that require temporal understanding. Plain tabular RL does not consider the previous states of the environment, whereas DQNs can learn to take a sequence of steps and generalize on that as well.
  • Finally, we can have different types of inputs to the network, such as an image, a sequence of images, or even numerical sequences. These can be fed to CNNs and RNNs, which are very powerful at understanding those respective data types.

Model Architecture

The most exciting part of the project is selecting the right neural network for the problem. While we could use a vanilla neural network to learn the states and output policies, it is better to build a CNN, which is far better at processing and handling images. This decision also lets us use the same network for problems with no defined state by feeding in the raw input we see, saving us some data-engineering work.

Initially, I went with a MobileNet V2-like model but realized during testing that it was not helping, and that it was better to build our own network for the task.

Model Architecture image by author
  • Four sequential images of the environment are taken and given as the input. This contains our agent moving through the environment.
  • Our network has the architecture shown in the diagram above; the layer blocks are basically made up of multiple convolutional layers followed by a MaxPool layer.
Image by author
  • Dropout is introduced into the architecture so that the network does not simply memorize the images; this forces it to learn about the task and the policies.
  • The network goes from a 200x100x4 input (4 for the 4 stacked grayscale frames) down to 9x3x512 feature maps. The number of convolutional blocks, the feature maps in each, and the pooling layers can be seen in the figure; a code sketch of this stack is given after the list.
  • These feature maps are then flattened and fed into two dense layers with 512 hidden units each.
  • This architecture helps the network learn a complex objective of the agent. Our network contains approximately 9M parameters.
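A minimal Keras sketch of such a network is given below. The exact number of blocks, filter counts and dropout placement in the repository may differ; treat this as an illustration of the 200x100x4 input → convolutional blocks → two 512-unit dense layers idea rather than the actual model.

import tensorflow as tf

def build_q_network(n_actions, input_shape=(200, 100, 4)):
    # Illustrative stack: Conv + MaxPool blocks shrink the 4 stacked grayscale
    # frames into small feature maps, which two dense layers turn into Q values.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPool2D(),
        tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPool2D(),
        tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPool2D(),
        tf.keras.layers.Conv2D(512, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPool2D(),
        tf.keras.layers.Dropout(0.3),       # keep the network from memorizing the images
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(n_actions)    # one Q value per action, linear output
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
    return model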

Code: https://github.com/perseus784/Vehicle_Overtake_Double_DQN

The Pincer Movement

So far we have seen how a simple neural network can replace the Q-table approach, but in experiments this approach has a lot of issues, which can be overcome by using two different networks operating simultaneously to make our agent better.

The pincer movement or double envelopment is a tactic in which forces attack both sides of an enemy formation at the same time.

What’s wrong?

Deep Q-learning makes so much sense in theory, right? But when I tried implementing it, it really did not perform well once the complexity rose a little.

Image from Double DQN paper from arxiv

The above equation is the Bellman equation for calculating the new policy value for a state by taking the max over the available Q values. The issue is that it takes the estimate of the maximum Q value of the next state as the target, as seen in the equation above. This overestimation during learning can introduce a maximization bias.
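Written out, that single-network target is:

Y_t^{DQN} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^-)

The same set of weights both selects the maximizing action and evaluates it, which is exactly where the maximization bias comes from.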

  • To put it simply, the network tends to overestimate the Q values of the actions it already rates highest (the max) for each state, and this can make the learned values fluctuate, slowing convergence considerably.
  • It makes sense because we are estimating the new policy and updating it at the same time, which definitely causes disruptions in the training process.
  • To think of it in even simpler terms, we don’t want to learn and play the game at the same time (even though that sounds ideal). It is practically better to play the game, learn from your mistakes, level up, make new mistakes and learn from them.
  • So, instead of a single network, we are going to introduce a second one. One of the estimators will focus on selecting the action with the maximum Q value and the other will be used to evaluate and update that value.

Team Red and Blue:

Image by author

In the movie TENET, they do something called a Temporal Pincer Movement through time to gain knowledge about the environment and attack at the same time. Meaning, one team will be focusing on gathering knowledge about the environment as events pan out and the other team will use it to attack the enemy.

Double Q-learning image by author
  • Similarly, here we are going to have two networks at play. One is our training network (Team Red), which trains our agent on the data gained from playing, and the other is the prediction network (Team Blue), which plays in the environment and collects new experiences for the training network, saved in memory.
  • This operation continues, and we can see that the rewards obtained by the agent get higher over time. You can also see the updated mathematical equation for calculating the policy using both of our networks.
Modified Bellman Equation for Double Q learning image from Double DQN paper from arxiv
  • The original paper where this technique was introduced was one of the major boosters in the Deep RL world, and its equation is written out just below the list. From here on, we will simply call the approach Double Q-learning.
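Written out, the modified target from the paper (the equation in the image above) is:

Y_t^{DoubleDQN} = R_{t+1} + \gamma \, Q(S_{t+1}, \operatorname{argmax}_a Q(S_{t+1}, a; \theta_t); \theta_t^-)

One set of weights (θ_t) picks the best action for the next state and the other set (θ_t^-) evaluates it, which is what removes the maximization bias.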

The Algorithm

We have the protagonist and the environment ready for us to start with the training pipeline. Since this is an online training procedure, the data is collected, processed and fed in real-time.

Workflow:

Training pipeline image by the author
  • You can look at the pseudocode of our training procedure in the image. We will talk about the epsilon and the memory variables in the next section.
  • Once the environment starts, our predicting network will predict actions for the given state and save the rewards that it got from taking the action.
  • This is repeated several times to collect more data to be saved in memory and to avoid training too often.
  • Every fourth iteration, the training network will train on the data gathered in the memory. Once the network is trained, the process continues to get more data.
memory = [ [current_state, action, next_state, reward, done], ...]
  • The memory data structure shown above stores the current state, the action taken by the agent, the next state reached after taking that action, the reward acquired by taking it, and whether the episode is over or not.
  • Every 10th episode, we update our prediction network with the trained model from the locally saved directory. I consider this a levelling-up process.
Training flow image by author
  • The image on the left shows the flow of our training process. The training network trains and stores the model locally, while the prediction network updates itself every 10 episodes by loading the locally stored model; a condensed code sketch of this loop follows the list.
  • This levelling up process makes the training and convergence very smooth and from an intuitive perspective, the agent learns to make better mistakes over time than the silly ones.
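A condensed sketch of this loop is shown below. The names predict_action, train_on_batch, training_network and prediction_network are hypothetical placeholders (the repository organizes the actual code differently), and the numbers simply restate the ones above: a buffer of 15000 experiences, training every 4 steps, levelling up every 10 episodes.

import random
from collections import deque
import tensorflow as tf

batch_size = 32
memory = deque(maxlen=15000)    # experience buffer; the oldest entries fall out automatically

for episode in range(3000):
    state, done, step = env.reset(), False, 0
    while not done:
        action = predict_action(state)      # prediction network (Team Blue) picks the action
        next_state, reward, done, info = env.step(action)
        memory.append([state, action, next_state, reward, done])
        if step % 4 == 0 and len(memory) >= batch_size:
            train_on_batch(random.sample(memory, batch_size))   # training network (Team Red) learns
        state, step = next_state, step + 1
    if episode % 10 == 0:
        training_network.save("saved_model")                              # store the trained model locally
        prediction_network = tf.keras.models.load_model("saved_model")    # "levelling up" the prediction network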

Training:

The training runs for hours, so it is better to record all of the logs using TensorBoard for live tracking and visualization. The two main metrics we track here are the loss and the average episodic reward.

Loss:

  • The loss graph shows that the network learns better with each epoch, meaning that overall the network is making better decisions.
Loss graph by the author
  • The loss we optimize in this system is the mean squared error between the predictions from the training network and the ground truth computed with the prediction network.
  • Please take a look at this section of code to see how we get the prediction and the ground truth from the training network and the prediction network respectively.
loss = tf.keras.losses.mean_squared_error(ground_truth, prediction)
  • We use the new Bellman equation to compute the loss between the Q values that were predicted and the corrected values for that state obtained using the reward we received; a sketch of this computation follows the list.
  • The loss graph from the training network became much smoother after we reduced the learning rate to a minimal value, and we have also smoothed the curve using TensorBoard. We’ll come back to this shortly.
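To make the ground truth concrete, here is a sketch of how a batch target could be assembled with the new Bellman equation. The variable names (states, actions, rewards, next_states, dones, gamma, and the two network objects) are assumptions for illustration, and the repository may arrange the two networks differently.

import numpy as np
import tensorflow as tf

# states, actions, next_states, rewards, dones: arrays unpacked from a sampled memory batch
q_current = training_network.predict(states)                             # predictions to be corrected
best_next = np.argmax(training_network.predict(next_states), axis=1)     # one network selects the action
q_next = prediction_network.predict(next_states)                         # the other network evaluates it

gamma = 0.99                                                             # assumed discount factor
ground_truth = q_current.copy()
for i in range(len(states)):
    target = rewards[i]
    if not dones[i]:
        target += gamma * q_next[i, best_next[i]]                        # Double DQN target
    ground_truth[i, actions[i]] = target                                 # only the taken action's Q value changes

loss = tf.keras.losses.mean_squared_error(ground_truth, q_current)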

Average Episodic Rewards:

  • The rewards graph shows that as the episodes increase, the rewards also increase, because as the network learns more it performs the actions that give it the maximum reward.
Reward graph by the author
  • In the original paper where the Double DQN technique was developed, the authors ran the environment for 250M iterations; in comparison, we ran the training for only about 3K episodes.
  • The perfect policies are not achieved because of this, but the whole point is to learn the process rather than achieve an ideal result.
  • Particularly, in our experiment we gave more weight to the car staying on the right of the road for better rewards. As a side effect, the car causes more accidents when it is on the left side of the road. This helps us understand the intricacies of designing a reward mechanism for a reinforcement learning task.
  • Generally, getting increased rewards over time is a good way to know our agent is learning the task we intend, and even visually the agent performs well and makes fewer mistakes as the episodes increase.
  • Accuracy is not a good measure of how well our network performs in reinforcement learning, because it may be the result of Q-value memorization rather than learning the task.

Code: https://github.com/perseus784/Vehicle_Overtake_Double_DQN

Posterity

Honestly, this is the most important part of the entire post. I consider these experiments and decisions the most important because they are the hurdles or walls you will face while trying to build a project similar to this.

Posterity — The descendants of a person.

As we build our pipeline, it is important to add and tune some crucial elements of the system to get the most out of it. This is done by creating multiple versions of the networks, experimenting with different hyperparameters and even with mechanisms that add value to our training, and finally choosing the right one.

You see… Tenet wasn’t founded in the past, it will be founded in the future.

Experience Replay

  • As the agent learns and traverses through the episodes, it forgets the previously learned experiences. So, when we train, it is important to recursively remind the network of what it has learnt.
  • The way to do that is an experience replay mechanism, which helps the network remember its experiences by randomly picking from them while training.
Memory usage image by author
  • The program basically stores a data structure with the current observation, action taken, the next state, the reward for that action and if the episode is over.
  • Also, we don’t want to store the entire history so we have a buffer memory size of 15000 such data structures.
  • The buffer memory is basically a queue with a fixed size of 15000 and it pops out the old memories as it accumulates more. The network uses this memory pool to select its batches and train on it.
  • Say our agent plays the game and learns to pass a particular hurdle. When it moves on to the next hurdle, we don’t want it to forget what it learnt from the previous section. This mechanism helps deep Q networks train on past experience and sort of refresh their learnings during every training period; a minimal buffer sketch follows the list.
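In code, the buffer itself is nothing more than a fixed-size queue; the snippet below is a tiny illustration with placeholder variable names, not the repository's memory class:

import random
from collections import deque

memory = deque(maxlen=15000)        # old experiences are pushed out automatically once the buffer is full
memory.append([current_state, action, next_state, reward, done])   # stored after every step
batch = random.sample(memory, 32)   # a random minibatch is drawn for each training round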

Exploration Vs Exploitation

  • When trained from the start, the agent overfits to the experience it has of a situation and does not explore new options for getting more rewards (note: do not confuse this with neural network overfitting).
  • Meaning, if the agent finds some sub-optimal path or way of handling a situation, and that path has no negative effect, it will settle on it.
  • This ultimately causes the agent to settle for sub-optimal solutions, and the rewards never reach the ideal level.
Exploration vs Exploitation graph by author
  • In order to overcome this problem of getting stuck in a rigid state, we can use a value called Epsilon.
  • Epsilon is an episodically decaying variable which starts at 1 and decays over the first half of the total episodes run in the session.
  • For every render of the environment, our agent decides to randomly explore the action space or choose one from the predicted network actions. This can be implemented as code like the following:
if np.random.random() > epsilon:
    action = np.argmax(get_prediction(state))  # Exploitation
else:
    action = np.random.randint(0, no_of_actions)  # Exploration
  • The epsilon value decides between the exploration and exploitation modes of the agent and is reduced over time. This is done for the first half of the training; in our case, we reduced the epsilon value from 1 to 0.1 over a period of 1500 episodes out of 3000 episodes, as shown in the figure above. A small helper implementing this schedule is sketched below.
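That schedule (1 down to 0.1 over the first 1500 of the 3000 episodes, then held constant) can be written, for instance, as:

def epsilon_for(episode, total_episodes=3000, eps_start=1.0, eps_end=0.1):
    # Linear decay over the first half of training, then held at the minimum.
    decay_episodes = total_episodes // 2
    if episode >= decay_episodes:
        return eps_end
    return eps_start - (eps_start - eps_end) * (episode / decay_episodes)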

Hyperparameter Tuning

  • The usual neural network training involves a good amount of hyperparameter tuning, but thanks to years of effort and findings from the deep learning community, that process has become somewhat streamlined and easier over the years.
  • The process gets more complicated with Deep RL, especially with epsilon, learning rate and batch size tuning. This was very complex because everything mattered, and even a small change in these parameters caused huge differences over time.
  • The epsilon (exploration vs exploitation) schedule was tried with different decay values and over different periods: not decaying until the first half of the training and then decaying over the last half, decaying within the first third of the episodes, no exploration at all, and so on. What worked for us was slowly reducing epsilon over the first half of the training.
  • For the learning rate, we initially went with the default TensorFlow Adam learning rate, but it caused more fluctuations in the loss during training, so we went with a reduced learning rate of 1e-4 for our training network. This helped reduce the fluctuating loss.
  • We used a batch size of 32 for the final training. Initially, we tried higher batch sizes such as 256 and 512, but those showed no improvement and occasionally caused memory-exhaustion errors.
  • Training frequency is also a major factor. It is basically how often the training network is trained over an interval; we trained the network once every four steps the agent takes in the environment. This gave the network enough of an interval to train as well as to collect new data.
  • Levelling up is how often we update the online or prediction network to transfer the experience. This was done every 10 episodes so that the network’s predictions improved over time, without updating the online network so often that it caused too much instability in the predictions. The settings that finally worked are collected below.
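The values reported above, gathered in one place (the key names are just illustrative, not the repository's config):

config = {
    "learning_rate": 1e-4,            # reduced from the Adam default to calm the loss fluctuations
    "batch_size": 32,                 # 256/512 gave no improvement and occasionally exhausted memory
    "memory_size": 15000,             # experience replay buffer length
    "train_every_n_steps": 4,         # training frequency of the training network
    "level_up_every_n_episodes": 10,  # how often the prediction network loads the saved model
    "total_episodes": 3000,
    "epsilon_start": 1.0,
    "epsilon_end": 0.1,
    "epsilon_decay_episodes": 1500,   # first half of training
}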

Code: https://github.com/perseus784/Vehicle_Overtake_Double_DQN

Takeaways

  • Choose the problem you want to solve first and then the solution. Having a solution, or in this case a particular AI technique, and trying to fit your problem space to it is just bad engineering.
  • OpenAI gym is an amazing source for our RL based experiments. Get to know the format once and we can handle all the environments in the same way.
  • Know that our assumptions will go wrong, however right they may sound. For example, I thought that using MobileNet V2 as a feature extractor could never go wrong, yet it did not work for me.
  • I also tried combining some inception modules in our CNN for better results, an idea from my previous project. It worked amazingly there but gave no improvement in this project at all.
  • While working on an NN architecture, start small and increase the complexity as you go, rather than starting with a huge model where you will not be able to identify what went wrong.
  • Designing the reward function is the most important aspect of RL, because it decides the task that our agent has to learn. For example, in our case we rewarded the agent for staying in a favourite lane, which caused an undesirable side effect: more collisions in the other lanes, because the agent always tried to get back to the rightmost lane.
  • Always think about the ramifications of the designed reward function; they are mostly hidden in plain sight but will make sense intuitively.
  • We could make a complicated reward function that covers all edge cases and run it for a million iterations to make it near perfect. But the point is to create a general setup where you feed in only raw images (not states that need to be engineered for each task) and rewards to train an agent.
Bird’s eye view image from the waymo site
  • Not a gimmick. This project is not just a demonstration of the power of RL; much of the self-driving research uses this top-view approach to learn control mechanisms for vehicle agents.
  • This includes Waymo, Tesla, Uber and more. The image shows the kind of input given for training the control agents of SDCs, which helps with navigation and cruise control.
  • The field of Artificial Intelligence today is saturated with more teachers than students. I present this only as a humble contribution from a fellow learner, not a master.

Results

Please look at the images below to see how our agent learned to overtake the other agents in the field. Here is the continuously recorded video of over a minute. You can see that it makes more mistakes on the left side than the right because of the favourite-lane bias in our reward function.

Trained agent demo images by the author

TENET

My learnings from today belong to you tomorrow. Use them and don’t reinvent the wheel over and over.

TENET: Principles that are held true by a group or a movement.

Red on the left going forwards and Blue on the right going backwards image by author

Overall, the project gave us good knowledge of raw-pixel inputs for Deep RL, network architecture selection, hyperparameter tuning, reward function design, experience replay, exploration vs exploitation mechanisms and Double Deep Q-learning. From here, we can extend the project to more advanced techniques such as A2C or DDPG for continuous action spaces.

Take a look at the code here.

Full Demo:

https://youtu.be/sH00TWLwBoA

Other posts:

https://medium.com/@ipaar3

! eraC ekaT
