
At first glance, moving from vanilla Q-learning to deep Q-learning seems like a minor step. Just replace the lookup table with a neural network and you’re done. There’s more to it than that, though – even for the simplest of problems, deep Q-learning might struggle to achieve good results.
To show how several common pitfalls can be avoided, this article walks through a TensorFlow 2.0 implementation of deep Q-learning for the well-known cliff walking problem. First, we show how normalization and one-hot encoding align the input and output with the neural network. We then deploy three common techniques that are said to drastically improve deep Q-learning: experience replay, target networks and mini batches.
The Cliff Walking Problem
The cliff walking problem (article with vanilla Q-learning and SARSA implementations here) is fairly straightforward [1]. The agent starts in the bottom left corner and must reach the bottom right corner. Stepping into the cliff that separates those tiles yields a massive negative reward and ends the episode. Otherwise, each step comes at a small cost, meaning the shortest path is the optimal policy.
![Example of cliff walking world. The target tile yields a positive reward, each step yields a small negative reward, and falling into the cliff yields a large negative reward [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/11/1M79MspX4MckRX3wqKCgAxA.png)
The problem can be resolved with Q-learning, a reinforcement learning method that stores Q-values representing the expected future reward for each state-action pair (48⋅4 in total). It is worth noting that Q-learning is an off-policy approach – for the downstream actions, we assume the best possible ones rather than those of the actual sample trajectory. As such, Q-learning encourages exploration, as the impact of failure is relatively limited.
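For reference, the tabular update that deep Q-learning will replace can be sketched in a few lines of Python (a minimal sketch; the discount factor of 0.9 is a generic placeholder, not a value taken from the article):

```python
import numpy as np

n_states, n_actions = 48, 4
q_table = np.zeros((n_states, n_actions))  # one Q-value per state-action pair

def q_learning_update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Off-policy update: the target uses the best next action, max_a Q(s', a)."""
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```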

From Vanilla Q-Learning to Deep Q-Learning
In an earlier article, I provided a minimal working example for implementing deep Q-learning in TensorFlow 2.0. Here, I assume you are familiar with both vanilla Q-learning and at least the basics of deep Q-learning. To briefly recap: in deep Q-learning we train a neural network that takes the state as input and outputs a Q-value for each action. Whereas lookup tables explode for large state spaces, deep Q-networks provide a compact representation that holds for all states. Simply replacing the lookup table is often insufficient though; we typically need the following techniques (described with implementation details here) to achieve better results [2]:
- Normalization & One-Hot Encoding
- Experience Replay
- Target Networks
- Mini Batches
Normalization & One-Hot Encoding
First things first. Neural networks tend to work better when dealing with inputs and outputs of the same magnitude. Thus, we divide all rewards by 100 so they (approximately) range between -1 and 1. The cumulative reward signal is the target output of the network.
We also have to pay some attention to defining the state, which is fed as input to the neural network. In ordinary Q-learning, we might call the goal tile ‘state 48’ and the adjacent cliff tile ‘state 47’, but such numerical values naturally mean little to a neural network. It would simply multiply them with some vector of weights, unable to recognize that 47 and 48 have completely different meanings. To resolve this problem, we apply one-hot encoding, defining the input as an array of length 48. The element corresponding to the current tile has value 1, all others have value 0. As a consequence, we can learn a unique set of weights corresponding to each tile.
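A minimal sketch of both preprocessing steps could look as follows (the scaling factor and tile indexing follow the description above; the function names are my own):

```python
import numpy as np

N_TILES = 48

def encode_state(tile: int) -> np.ndarray:
    """One-hot encode the current tile into a length-48 input vector."""
    state = np.zeros(N_TILES, dtype=np.float32)
    state[tile] = 1.0
    return state

def normalize_reward(reward: float) -> float:
    """Scale rewards so they roughly fall in [-1, 1]."""
    return reward / 100.0
```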
Both input and output are normalized now, so we can safely initialize the weights with something like He- or Glorot initialization and no longer worry about scale differences.
Experience Replay
Q-learning relies on temporal differences, using the difference between the ‘expected’ value Q_t and the ‘observed’ value r_t + Q_{t+1} as the error. Unfortunately, Q_{t+1} is as much a guess as Q_t is – they are determined using the same neural network.
To make things worse, subsequent observations are often highly similar, using nearly identical states as input. This creates a strong correlation between adjacent observations. Non-linear approximators (such as neural networks) in particular tend to deal poorly with such correlations.
In less abstract terms: consider an agent stuck in a corner of the cliff world. This agent might constantly receive the same reward signals and overfit the network on the actions in that particular corner. You will often see this happening: the agent learns that going right is bad (due to the cliff) and tends to stay on the left side of the world, constantly fitting the network to that region.
To break the correlation, we deploy experience replay. Instead of always selecting the most recent observation to update our network, we store all of our past observations and sample from this replay buffer whenever we update. Each observation is represented by an (s, a, r, s') tuple. To compute the current Q-values (rather than those from the time we obtained the observation), we feed s and s' to the prevailing network, using a to obtain Q_t and the argmax action to obtain Q_{t+1}.
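A replay buffer can be as simple as a bounded deque of (s, a, r, s') tuples that we sample from uniformly, as in the sketch below (the buffer size is an arbitrary placeholder; the default batch size of 5 matches the experiments later on):

```python
import random
from collections import deque

replay_buffer = deque(maxlen=10_000)  # discard the oldest observations once full

def store(state, action, reward, next_state):
    """Add an (s, a, r, s') tuple to the buffer."""
    replay_buffer.append((state, action, reward, next_state))

def sample(batch_size=5):
    """Draw a random mini batch, breaking the correlation between subsequent observations."""
    return random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
```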
Mini Batches
In theory, we could gather many observations and fit the neural network in a single batch, determining the final policy with a single update. In many supervised learning problems, it is very common to train the network at once on a large data set. In reinforcement learning, however, we would then only ever be observing under our initial policy (which might be very poor). We want to explore enough to avoid getting stuck in local minima, yet also learn primarily from good moves.
Thus, large batches are not very useful. We want to intertwine observing and updating to gradually improve our policy. This does not mean we have to update after every observation though – a single step in the cliff walking problem often teaches us very little because we don’t observe anything meaningful. The obvious compromise is mini batches, meaning that we frequently update our network with a small number of observations at a time. Particularly combined with experience replay, this is a powerful technique to get stable updates based on a vast pool of previous observations.
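Combined with the buffer above, a mini-batch update might look roughly as follows (a sketch, assuming `q_network` is a Keras model mapping the one-hot state to four Q-values; here the same network still computes the target, which the next section refines, and the discount factor is again a placeholder):

```python
import numpy as np

def train_on_mini_batch(q_network, batch, gamma=0.9):
    """Recompute Q-values with the current network and fit on the sampled mini batch."""
    states = np.array([obs[0] for obs in batch])
    next_states = np.array([obs[3] for obs in batch])

    q_values = q_network.predict(states, verbose=0)            # current estimates Q_t
    next_q_values = q_network.predict(next_states, verbose=0)  # estimates for s'

    for i, (_, action, reward, _) in enumerate(batch):
        # TD target for the action actually taken; other outputs keep their current values
        q_values[i, action] = reward + gamma * np.max(next_q_values[i])

    q_network.fit(states, q_values, verbose=0)  # one gradient update on the mini batch
```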
Target Networks
So far, we have drawn both the expectation Q_t and the ‘true’ value r_t + Q_{t+1} from the same network. Thus, the observation and target are correlated to one another, once again using one guess to update the other.
To mitigate this problem, we make use of a target network. A target network is no more than a periodic copy of the Q-network, updated at a lower frequency (say, once every 100 episodes). As such, we decouple the expected values from the target values, reducing the correlation between the two. The target remains fairly stable, while the policy itself improves more gradually.
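In code, the target network is simply a second copy of the model whose weights we overwrite every so often (a sketch; the default 10-episode interval matches the experiments further down, and the function names are my own):

```python
import tensorflow as tf

def make_target_network(q_network):
    """Create a copy of the Q-network used to compute the targets Q_{t+1}."""
    target_network = tf.keras.models.clone_model(q_network)
    target_network.set_weights(q_network.get_weights())
    return target_network

def maybe_update_target(q_network, target_network, episode, update_every=10):
    """Copy the Q-network weights into the target network at a lower frequency."""
    if episode % update_every == 0:
        target_network.set_weights(q_network.get_weights())
```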
Deep Q-learning Into Action
Having outlined the theory required to implement deep Q-learning, let’s see how it actually performs in practice. The full implementation of the learning approach on the cliff walking problem can be found in my GitHub repository.
Let’s give our Q-network – three hidden layers with 25 neurons each – a spin, using the same 0.1 learning rate as for the tabular variants and no serious tuning.
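As a rough sketch of that architecture (the ReLU activations, Adam optimizer and MSE loss are my assumptions, not necessarily the choices made in the repository):

```python
import tensorflow as tf

# 48-dimensional one-hot state in, one Q-value per action out
q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu", input_shape=(48,)),
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(4, activation="linear"),  # Q-values for up/down/left/right
])

q_network.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),  # the untuned initial setting
    loss="mse",
)
```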
![Reinforcement learning with learning rate α=0.1 for all algorithms. Deep Q-learning effectively does not learn anything here, whereas Q-learning and SARSA converge quickly. [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/09/11bjqH_cgTNu9F5Kos8yOdg.png)
Yikes! Q-learning and SARSA rapidly converge to stable policies, whereas deep Q-learning seemingly learns nothing of note. On top of that, the latter takes much more computational effort. On my laptop, 10,000 iterations take about 20 minutes, compared to mere seconds for Q-learning and SARSA.
Fortunately, after some tuning – in particular, decreasing the learning rate to 0.001 – things look a lot rosier. Convergence still takes a lot longer than for the tabular variants, but overall deep Q-learning arrives at the same policy as regular Q-learning.
![Reinforcement learning with learning rate α=0.001 for deep Q-learning. Convergence is slower than for the tabular variants, but deep Q-learning ultimately arrives at the same policy as Q-learning and SARSA. [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/09/1VfCUm-OdZtAyheOBGAmPHQ.png)
Let’s take out the rest of our arsenal: experience replay, a batch size of 5 (with the update frequency also reduced by a factor of 5 to retain the same number of data points) and a target network updated every 10 episodes. The final result:
![Deep Q-learning with experience replay, mini batches and a target network. Deploying these stabilization techniques did not result in faster convergence. [image by author]](https://towardsdatascience.com/wp-content/uploads/2021/09/1pf4Grruyuzm_0MFV92BcdA.png)
That’s…actually quite disappointing. Despite all the effort we took, a simple Q-table still beats our fancy neural network. What happened? What went wrong?
Nothing happened. Nothing went wrong. Deep learning is simply challenging. We try to learn a generic function that holds for all states. Backpropagating errors through multiple layers takes time as well. Our unfiltered replay buffer holds many poor observations. Slowing down update frequencies – to accommodate target networks and batch learning – also slows down convergence. Neural networks are not magic black boxes. Machine learning does not equate to supernatural intelligence.
Of course, we could have spent more time on tuning to improve performance. How many observations should there be in a mini batch? What is the ideal network architecture? What priority should we give to experiences? How frequently must we update the target network? All important questions, but also questions that take time to answer. Parameter tuning is expensive.
The somewhat sobering conclusion: although potentially incredibly powerful, deep learning is also hard to implement successfully. Depending on your problem and your objectives, it might not always be the best way to go.
Takeaways
- Deep Q-learning involves more than replacing the lookup table with a neural network. It generally yields less stable performance and requires substantially more modelling and tuning effort.
- Use appropriate normalization and one-hot encoding to make states and actions suitable for the neural network.
- Experience replay – random sampling from a buffer of past (s, a, r, s') tuples – breaks the correlation between subsequent observations.
- A target network – a periodic copy of the Q-network – may be used to compute Q_{t+1}. This reduces the correlation between the expected value Q_t and the target value r_t + Q_{t+1}.
- Mini batches stabilize updates, utilizing multiple observations at once to update the network.
_The full code of the deep Q-learning algorithm can be found on my GitHub repository._
My implementations of tabular Q-learning and SARSA for the cliff walking problem are detailed here:
Walking Off The Cliff With Off-Policy Reinforcement Learning
The discrete policy gradient variant is shown in this article:
Cliff-Walking Problem With The Discrete Policy Gradient Algorithm
For the deep policy gradient, check:
A minimal working example for Deep Q-learning is given in the following article:
A Minimal Working Example for Deep Q-Learning in TensorFlow 2.0
Finally, I discuss implementation examples for experience replay, batch learning and target learning here:
How To Model Experience Replay, Batch Learning and Target Networks
References
[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
[2] Matiisen, T. (December 19, 2015). "Demystifying Deep Reinforcement Learning". neuro.cs.ut.ee. Computational Neuroscience Lab. Retrieved September 2021.