Introduction to Reinforcement Learning (DDPG and TD3) for News Recommendation

Deep learning methods for recommender systems, exposed

Mikhail Scherbina
Towards Data Science


Photo by Juliana Malta on Unsplash

TL;DR: Reinforcement learning is an ideal framework for a recommendation system because it has the Markov property. The state is the movies rated by a user, the action is the movie chosen to watch next, and the reward is its rating. I made a DDPG/TD3 implementation of the idea. The main section of the article covers implementation details, discusses parameter choices for RL, introduces novel ways of evaluating actions, addresses the optimizer choice (RAdam for life), and analyzes the results.

I have also released a version of the ML20M dataset specifically adapted to the Markov decision process and ready to use with RL.

Reinforcement learning as-is is a pretty hard topic. When I started to dig deeper, I realized the need for a good explanation. This article, coupled with the code, is my school project. I am currently a sophomore in high school, and I understand the hard mathematical concepts in a more ‘social studies’ kind of way. I hope this article proves helpful for newcomers like me.

I created a GitHub project you can clone and follow along with, so make sure to check it out. You can download everything that I processed on my PC in the downloads section, along with an FAQ, a dataset description, some docs, how-tos, and more. It is frequently updated; I haven’t pushed for a week because I was writing this article. I hope you like it! https://github.com/awarebayes/RecNN

This article provides a comparison of the most popular recommendation methods as well as an in-depth look at various reinforcement learning algorithms, overviewing each tweakable parameter and the consequences of changing it. I also propose alternative ways to evaluate deep-learning-powered recommenders and discuss different optimizers for this application. It is also my legacy, because my final year of high school is coming and I won’t have enough time to keep working on it. If you are interested in supporting the project, your contributions are welcome!

Contents

Introduction

A great tale of optimization

  • Static vs. Dynamic time series dataset
  • Why you should HDF5
  • Encoding time series

Methods for News Recommendation

  • Similarity search
  • Matrix Factorization
  • Restricted Boltzmann machines
  • Factorization Machines
  • Reinforcement Learning
  • Methods comparison

Embeddings recap

  • Information Theory in Deep Learning
  • Information Plane Theorem
  • Why embeddings = ‘bottleneck features’ make sense

Markov stuff recap

  • Markov property, chains, games, and decisions
  • Reward vs. Value
  • Continuous State Markov Process

Off-Policy methods:

DDPG: Deep Deterministic Policy Gradients

  • Simple explanation
  • Advanced explanation
  • Implementing in code
  • Why it doesn’t work
  • Optimizer choice
  • Results

TD3: Twin Delayed DDPG

  • Explanation
  • Implementation
  • Results

Conclusion

On-Policy methods: (coming next article…)

  • PPO: Proximal Policy Optimization
  • GAIL: Generative Adversarial Imitation Learning

Off-Policy Deep Reinforcement Learning without Exploration (BCQ) (coming next article…)

Introduction

There are a couple of reasons this article took so long to write.

Firstly, there was the frustration of the dataset being dynamic. When I started prototyping, a single iteration used to take more than 40 hours. With basic pandas optimization done, it shrank to 1.5 hours. With the dynamic dataset implemented, it took 10 minutes, and if you encode the states with state representation, it comes down to 3. Also, I couldn’t get DDPG working at all, and that had quite an impact. Thus, I ended up using a static time series dataset + TD3. More about that later.

However, above all: most of the articles on TDS are paid, but this one is not. There are no premium paywalls, no Patreon, no money begging. You can clap for this article multiple times (please do so with the button on the left), and go to the GitHub page and star the repo.

It is my school project, and starring it is essential to me. It would also give me better chances of winning the project competition, and maybe even a cut in university tuition.

Tips and tricks for time series

Avoid pandas altogether!

As you can see, pandas can be optimized, but in the end it is still quite costly to run, because even my best optimization does not scale well. The x-axis represents powers of 10; the y-axis is the time it took (in seconds). Also, the thing with deep learning is that we often run the model on the same dataset over and over, so it makes total sense to make the dataset completely static, eliminating all pandas interaction whatsoever. Let’s just run the dataset generator once and save the results. If you have forked my repo and are following along, the notebook is located under notes/1. Vanilla RL/1. Generating the static dataset.ipynb. Note: this step is not mandatory; you can download the dataset I generated.

Scaling for different approaches

Store your data in the HDF5 format!

Sometimes the time series cannot fit entirely into your RAM. The HDF5 format was developed specifically for that purpose. Use it whenever possible: it works way faster than PyTorch’s built-in serialization and comes with native numpy support. The only limit is your solid-state drive, so you might want to buy a PCI Express one with fast reads.
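To make this concrete, here is a minimal h5py sketch (the file name, dataset name, and array shapes are made up for illustration):

import h5py
import numpy as np

# Hypothetical rolling time series: 100k windows of 10 movie embeddings, dim 128 each.
windows = np.random.rand(100_000, 10, 128).astype(np.float32)

with h5py.File("static_dataset.h5", "w") as f:
    f.create_dataset("states", data=windows, compression="gzip")

# Later: open lazily; only the slices you actually index are loaded into RAM.
with h5py.File("static_dataset.h5", "r") as f:
    batch = f["states"][:256]   # numpy array of shape (256, 10, 128)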

Encode Dimensions!

If you are using a static-size time series (also called a ‘rolling’ ts), make sure you encode the data into lower dimensions. For the classic ML approach, we have Principal Component Analysis, or PCA for short. Here is a video if this is a new word for you.

You can also use LSTM autoencoders for dynamic-length time series. From my experiments, I noticed that linear AEs perform poorly for rolling ts. However, I use state representation as the authors of the paper proposed. Rule #1337 of DL states that 90% of the actual learning happens in the first 10 minutes, so I ran the TD3 model and used its state representation module to encode the ts.
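As a sketch of the classic route, here is how one might squeeze flattened rolling windows down with scikit-learn’s PCA (the sizes are arbitrary placeholders):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical rolling windows: 10 movies x 128-dim embeddings, flattened per window.
windows = np.random.rand(50_000, 10 * 128)

pca = PCA(n_components=64)                    # keep 64 principal components
encoded = pca.fit_transform(windows)          # shape (50_000, 64)
print(pca.explained_variance_ratio_.sum())    # how much variance survived the squeeze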

Methods for News Recommendation

When I first started to dig into this stuff, I realized that there is no comprehensive guide to even the basic techniques of recommendation; I had only recently found out about Restricted Boltzmann Machines myself. This section aims to fix that. I attempt to overview some of the most popular methods and make a quick comparison. For more analytical results, look at the memes below.

Similarity Search

SS is the most straightforward concept to understand: just look for similar films liked or disliked among the users. The state (being the films rated) is often represented in a metric space. There are a couple of ways to encode it from raw movie indexes. The first is to use an embedding layer, which is often the case in modern DL applications. A similarity metric such as cosine or Euclidean distance is then used to rank the states. Looking back to a more classical ML approach, we have Locality Sensitive Hashing. LSH is an algorithmic technique that hashes similar input items into the same “buckets” with high probability. Either way, we end up with a bunch of ranked states that are similar to the one we are predicting for. Then we look at the films those users liked/disliked and recommend them. If you want to use this method, I suggest you check out Facebook’s Faiss library: GitHub link.
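For illustration, a bare-bones Faiss sketch, assuming the user states are already encoded as fixed-size float32 vectors (all names and sizes here are made up):

import faiss
import numpy as np

d = 128                                          # dimension of the state vectors
user_states = np.random.rand(10_000, d).astype(np.float32)

index = faiss.IndexFlatL2(d)                     # exact L2 search, no training needed
index.add(user_states)                           # index every known user state

query = user_states[:1]                          # the user we are recommending for
distances, neighbours = index.search(query, 10)  # indices of the 10 most similar users
# then look at what the users in `neighbours` liked/disliked and recommend that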

Matrix Factorization

The idea of factorizing matrices, i.e., breaking a big matrix into a product of smaller ones, further extends similarity search. The big matrix can be expressed as a table with rows being the movies, columns being the users, and the values being the ratings. We extend that idea by assuming that the big matrix can be expressed as a dot product of two smaller matrices, which represent the hidden (embedding) representation. The process is easily implemented using PyTorch:

user_matrix = user_embedding(users)            # (batch, k) latent user factors, an nn.Embedding
film_matrix = film_embedding(films)            # (batch, k) latent film factors, an nn.Embedding
ratings = (user_matrix * film_matrix).sum(1)   # row-wise dot product = predicted rating
loss = MeanSquares(ratings, target_ratings)    # e.g. torch.nn.functional.mse_loss
loss.backward()

‘users’ is an integer vector of user IDs; ‘films’ is an integer vector of film IDs. The user and film matrices are 2D embeddings for the corresponding indexes. We calculate the dot product because we want to know the rating. As you might have noticed, the method is pretty limited due to the usage of embeddings: you cannot add new films/users to the existing ones unless you are using something like incremental SGNS or reservoir computing. Here is a good overview article of the methods above: link. Also, if you want to get an in-depth understanding of MF, I highly recommend this video by Luis Serrano.

Restricted Boltzmann Machines

The RBM is an early variant of the autoencoder. It falls under the energy-based methods. Like an autoencoder, it is used for dimensionality reduction. The ‘restricted’ part of the name means that there are no connections within a layer, only between the visible and hidden layers. The architecture looks like a usual two-layered linear network, and the forward pass looks precisely like that of a feedforward net.

The critical difference is that RBMs are probabilistic. They use Bayes stuff to work. Whenever you try to calculate the state of the network, i.e., sample from its weights and biases distributions, you are met with the Boltzmann distribution, an equation from particle physics. Learning such a model consists of two main steps: Gibbs sampling and contrastive divergence.

I found out about these machines from Andrew Ng’s interview with Geoffrey Hinton. When asked about his greatest achievement, the latter acknowledged his contributions to the training algorithms for RBMs. Just a reminder: G.H. is the man behind backpropagation. Indeed, RBMs achieved state-of-the-art performance in the Netflix competition. If you want to learn more about energy-based models, here are Yann LeCun’s notes.

RBM structure

Factorization Machines (not personalized)

Factorization Machines have proven to be super useful for click-through rate prediction. Their speed makes them highly scalable, but they are only applicable to data with categorical features. Nevertheless, they are worth a shout-out. We need to incorporate feature data into our factorization process somehow. Of course, we can consider a single feature to be resourceful enough:

linear = torch.nn.Linear(features.size(1), 1)   # a single linear layer over the feature vector
ratings = linear(features)
loss = MeanSquares(ratings, target_ratings)     # e.g. torch.nn.functional.mse_loss
loss.backward()

As you can see, they cannot be used for a personalized recommendation!

However, it would be cool to take label-label cross-correlations of the features into consideration. We have just arrived at the concept of order: the order is the number of features we calculate the cross-correlation for. Assuming the order to be 2, we need to calculate the CC for pairs of features. Nevertheless, a feature is a categorical variable, so how does one calculate the dot product of two categories? More latent variables to the god of latent variables! Feature labels can be described using vectors, and those vectors can be regressed using the same idea of embeddings we utilized for matrix factorization.

linear = torch.nn.Linear(features.size(1), 1)
ratings = linear(features)                              # first-order (linear) term
# factorization machine: second-order term
latent = latent_embeddings(features)                    # latent vector for every feature label
latent_gram = latent @ latent.transpose(-2, -1)         # pairwise <v_i, v_j>
features_gram = features @ features.transpose(-2, -1)   # pairwise x_i * x_j
ratings += (latent_gram * features_gram).sum(1)
loss = MeanSquares(ratings, target_ratings)             # e.g. torch.nn.functional.mse_loss
loss.backward()

Here is an article that helped me to understand this concept better: link.

Reinforcement Learning

The key advantages of using RL for news recommendation are the Markov property and state representation. Because we do not rely on an embedding layer, we can recommend any movie to any user. The movie embeddings generated for this application are built from simple statistics, such as average rating, revenue, TF-IDF for texts, genres, etc., plus PCA. Thus, you can add a new movie for recommendation without re-training the network. Alternatively, you can use these new embeddings for state representation. The Markov property ensures that we can use static-length time series. More about that later.

Comparison

warning: satire

Reinforcement Learning
Restricted Boltzmann Machines
Matrix Factorization

To sum it up: RL allows learning on minibatches of any size, takes static-length time series as input, does not depend on static embeddings, works on the client side, can be used for transfer learning, has an adjustable adversary rate (in TD3), supports ensembling, works way faster than MF, and retains the Markov property. The most significant trade-off is accuracy: big corporations such as Netflix and Amazon still rely on MF/RBM.

Embeddings explained

This particular application, unlike Q-Learning, aims to solve a continuous control problem. In Q-Learning the state is often continuous, but the action itself is discrete, whereas in our case the action (= movie) is not discrete; it is a vector instead.

But how do we get this vector that we will later try to regress? Last time I checked, the ML20M dataset had no vectors to be found. The answer is simple: I generated these vector representations of the numerically indexed movies myself. Most of it is trivial: I parsed the IMDB/TMDB data and applied basic statistics to make a vector out of the gathered data (by encoding categories, using TF-IDF, applying PCA). One of the most important things I utilized is Google’s BERT for text embeddings. I know, embeddings are technically a different thing, and these are called ‘bottleneck features’, but I will stick to that word.
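A rough sketch of that ‘simple statistics’ pipeline (the numbers, columns, and sizes below are invented; the real pipeline in the repo also mixes in TMDB fields and BERT text embeddings):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw data: one plot summary and a couple of numeric stats per movie.
overviews = ["a hacker learns the truth about reality",
             "toys come to life when people leave the room",
             "a clownfish crosses the ocean to find his son"]
stats = np.array([[8.7, 463.5],   # [average rating, revenue in $M], made-up values
                  [8.3, 373.6],
                  [8.1, 940.3]])

tfidf = TfidfVectorizer().fit_transform(overviews).toarray()
features = np.hstack([stats, tfidf])                          # raw per-movie feature vector

movie_vectors = PCA(n_components=2).fit_transform(features)   # compact 'embedding' per movie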

However, why does some middle layer of a neural network make sense? Why can you use that data as contextual information? Those questions are answered by the field called Information Theory. It is not widely popular in the context of DL, but there are fascinating lectures by Naftali Tishby about the ‘maximization of preserved information’ and other hints at why nets learn. I recommend you check these out!

Some highlights:

  • 00:19:00 Information Plane Theorem: for a sizeable typical X, the sample complexity of a DNN is entirely determined by the encoder mutual information, I(X; T), of the last hidden layer; the decoder information, I(T; Y), of the same layer determines the accuracy (generalization error). This phenomenon is the thing I use in this app. You can see a visualization of it at 00:23:00.
  • https://www.youtube.com/watch?v=pFWiauHOFpY

Markov stuff recap

Now that we have the dataset working and understand how I transformed movie IDs into contextual vectors, it’s time to recap some things about reinforcement learning and game theory. If you read Arxiv papers on DL/RL, it is a common thing to see the basics sorted out first.

News recommendation can be thought of as a game that we are trying to win. We act based on the state, and the state is what we know about the user: the movies watched combined with their ratings. The action is produced based on the state and describes a point in space.

Markov Property

Everything from now on strictly obeys the Markov property. Quoting Wikipedia: ‘A stochastic process has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it.’ Why should I care, you might ask. We assume that we can act based only on the current state, ignoring anything that happened before. With that in mind, the problem becomes much easier to solve, because we don’t have to worry about the past. Markov methods provide a framework that lets you focus on what happens at the moment.

Markov Chain

You may also have heard the name because DeepMind’s AlphaGo uses Markov Chain Monte Carlo. The Monte Carlo part is only used for finite-state games such as chess or Go, but Markov chains are everywhere! For simplicity’s sake, let’s consider a discrete state. We are also assuming that recommendation is a stochastic process, meaning that we randomly walk over the dataset. Like any other physical chain, a Markov chain consists of ‘loops’ called nodes. Each node has conditional transition probabilities. Think of it as a random graph walk: each currently visited node has probabilities that determine which adjacent node we go to next. Here is a short article with 19k likes that goes into the details. The cool thing is that it preserves the Markov property: the transition depends only on the current state, being the node visited.
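A toy version of that random walk, with three made-up nodes and a hand-written transition matrix:

import numpy as np

# Row = current node, column = probability of moving to that node next.
transition = np.array([[0.1, 0.6, 0.3],
                       [0.4, 0.4, 0.2],
                       [0.5, 0.0, 0.5]])

rng = np.random.default_rng(0)
node, walk = 0, [0]
for _ in range(10):
    node = rng.choice(3, p=transition[node])  # the next node depends only on the current one
    walk.append(int(node))
print(walk)   # e.g. [0, 1, 0, 2, ...]: the Markov property in action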

Markov Decision Process

The idea of Markov chains can be further applied to our ‘game’. We want to utilize the Markov framework for our application because it is super handy. However, how do you apply an abstract chain to a state-action-reward-state-action… process? Remember the graph example I introduced? A Markov decision process can also be interpreted as a graph. We assume our current State to be some node in the graph. Because it retains that exact property, we don’t need to know anything that happened before we arrived at that State (node). From that node, there is a multitude of actions that can be taken, with assigned probabilities. These actions can be interpreted as adjacent edges that bring us to the Next_State. When we arrive at the new State, we immediately receive a Reward. There is also one more thing in the dataset that is used for TD: Done, because we don’t want to propagate the temporal difference beyond the last step. More about that later.

States are green, actions are red, and the rewards are yellow

The dataset I built consists of State-Action-Reward-Next_State-Done values. There is an example on GitHub.
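To make the layout concrete, a batch from the dataset looks roughly like this (the field names and shapes are illustrative; check the repo docs for the exact format):

import torch

batch = {
    "state":      torch.rand(32, 10 * 129),  # 10 previous movies: 128-dim embedding + 1 rating, flattened
    "action":     torch.rand(32, 128),       # embedding of the movie watched next
    "reward":     torch.randint(-5, 6, (32, 1)).float(),  # its rating
    "next_state": torch.rand(32, 10 * 129),  # the same window shifted by one movie
    "done":       torch.zeros(32, 1),        # 1 only at the end of a user's history
}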

Continuous State Markov Process

One last thing needed to fully understand what is happening: the action is not discrete. I have already mentioned a couple of times that the action is a vector. It is easy to understand how to move in a graph based on a discrete number, but where do you go with a vector? So instead of a graph, think about the MDP as an N-dimensional plane. For simplicity’s sake, I will use a 2D plane as an example. Let’s consider an ant agent moving on a playground. At the moment it knows that it probably needs to bring some leaves to its house. The only problem is that there is no such discrete action ‘hard coded’ into its brain. So it needs to take continuous steps in the environment: move its limbs, clench its jaws, and avoid the evil antlion. With each step it takes, a reward is added. It might be some realization that the leaves are somewhere nearby and it is going in the right direction. One other important distinction is that we assume the time steps to be discrete. A recommendation is only possible if we know a certain number of films and the ratings assigned by the user. It would be weird to try to suggest something taking 5.6 films and 1.3 ratings into consideration.

And if you look at the ‘continuous vs. discrete action’ example above, there is a MuJoCo Ant agent. It takes limb positions and angular velocities and supplies them to the environment as the action. Actually, there are a lot more funny agents to fiddle with, such as the cheetah and Boston Dynamics robot models.


Value Function vs. Reward

This part is essential for understanding the Temporal Difference loss, which will be covered a little later. In the context of Markov games, we have such a thing as the Value function. The Value function does not estimate the immediate reward; the value only means how good an action is for the current state. As you will see, it does not necessarily mean ‘give me the reward’ for that action and state. It is a more abstract measure of ‘goodness’ and can be expressed as a continuous real-valued function whose range can be any real number. Thus, the value is not an integer in [-5, 5].

Part 1: Deep Learning

Now let’s forget for a brief moment about all the Markov stuff we learned and start with a basic DL approach. So you have decided to build a deep-learning-powered recommender system; you already know about the Markov decision process and the dataset structure, and you are eager to jump straight into the action. Let’s try a more fundamental approach, without reinforcement learning just yet. All you have is a simple linear perceptron and the dataset: a bunch of states, corresponding actions, and rewards for those actions. Translated into more human language: films watched, the movie chosen to see next, and its rating by the user. I want you to go ahead and think about what you would do with these tools.

We want to suggest good movies; hence, we train the network to generate actions that mimic the logged action:

generated_action = model(state)
if is_good(reward):                                # e.g. the rating is above some threshold
    loss = MeanSquares(generated_action, action)   # imitate only the 'good' logged actions
    loss.backward()

This approach is called ‘policy learning’ because we learn the policy, i.e., the action. It has its applications, but in the end PL is very limited. If you train a network like this, it will work fine and be somewhat usable. Still, did you notice the ‘is_good’ function? We learn only from ‘good’ actions, disregarding everything else.

Rewards distribution

Part 2: Act and Criticise

So far we have been looking at the reward as our ‘learning’ criterion. However, wouldn’t it be wiser to evaluate the action instead? That is the main idea behind actor-critic methods. For clarity’s sake, let me introduce names for the networks. The network we already contemplated is called the Actor, and it acts based on the state. The network that tries to predict the reward based on the state and the Actor’s action is called the Critic. As we all know from watching Bob Ross: everyone needs a friend. So let’s give our lonely Actor someone to mock his actions.

# teach the critic to predict rewards
generated_reward = critic(state, action)
reward_loss = MeanSquares(generated_reward, reward)
reward_loss.backward()

# teach the actor, but only on actions the critic approves of
generated_action = actor(state)
value = critic(state, generated_action)
if value > 0:
    action_loss = MeanSquares(generated_action, action)   # learn the action
    action_loss.backward()

Eventually the actor will take better actions (maybe, maybe, maybe) and the loss will converge to zero or thereabouts. However, since we are working in chad PyTorch, we can do wacky things with losses without worrying about the backpropagation! What if we want to directly use the critic’s predicted reward for the actor-generated action as the loss metric? With PyTorch it has never been easier!

# teach the critic
generated_reward = critic(state, action)
reward_loss = MeanSquares(generated_reward, reward)
reward_loss.backward()

# teach the actor: the critic itself becomes the loss function
generated_action = actor(state)
action_loss = -critic(state, generated_action)
action_loss.backward()

What we did here is use the critic as our loss function, as was intended from the beginning. Note the minus before the criterion: we want to maximize the reward, but there is no such thing as ‘maximize’ in machine learning, so we do the opposite and minimize the negative.

That is basically the DDPG algorithm. Although, if you look at the paper, the code for training the critic will be different. That’s what we are covering next.

Part 3: Implementing Temporal Difference

There is something wrong with our Critic. We don’t need to estimate immediate rewards; let’s bootstrap the Value of the action instead. Why, you might ask? The answer is simple: future rewards may depend on the current action. Considering the actions as individual Markov chain steps, we want to maximize the Value. What is Value?

Value, in the context of a Markov game, is a real-valued function V(state, action) that indicates how proper a particular action is for the corresponding state (not in terms of integer rewards from -5 to 5; it can be any real number).

So that’s where we bring back our Markov chains. In this particular application, I use the TD(1) version of the algorithm, meaning that I bootstrap the value one step ahead. You can implement TD(n), but the number of value iterations grows linearly.

To learn the value function, we bootstrap using the reward and the value of the next state and next action. You can read more about TD here.

# train the actor
generated_action = actor(state)
action_loss = -critic(state, generated_action)
action_loss.backward()

# train the critic with a one-step TD target
next_action = actor(next_state)
next_value = critic(next_state, next_action)
expected_value = reward + next_value              # bootstrap one step ahead
value = critic(state, action)
value_loss = MeanSquares(value, expected_value)
value_loss.backward()

Part 4: Are you copying me?

There is only one last thing you need to understand DDPG completely: it uses the concept of target and learning networks. The target network is more stable than the learning one because it is updated with the learning network’s parameters through a soft update. It shows less tendency to overfit and overall performs better. Also, the TD loss is slightly tweaked: we don’t bootstrap the value for end-of-episode actions. The gamma parameter serves for stability, and I set it to around 0.9.

# train the actor
action_loss = -critic(state, actor(state))
action_loss.backward()

# train the critic with a soft-updated target network
next_action = target_actor(next_state)
target_value = target_critic(next_state, next_action)
expected_value = reward + (1 - done) * gamma * target_value   # no bootstrapping past the last step
value = critic(state, action)
value_loss = MeanSquares(value, expected_value)
value_loss.backward()

# soft update: slowly move target weights towards the learning weights
def soft_update(learning, target, soft_tau=1e-2):
    for t, l in zip(target.parameters(), learning.parameters()):
        t.data.copy_(t.data * (1.0 - soft_tau) + l.data * soft_tau)

soft_update(actor, target_actor)
soft_update(critic, target_critic)

Here is the actual screenshot of the update function I use

That is basically what DDPG is about.

Interpreting the results

In the next section, we will compare and, primarily, evaluate different reinforcement learning algorithms. But how do we tell whether the results are good or not? The critic network assigns values to our actions; however, are you sure those values are meaningful? Well, they are, based on the critic loss: if the critic loss is small and the policy loss makes sense, we have taught the actor something. But those metrics are not enough. I also consider Euclidean and cosine distances between actions, often represented as a matrix. You can see the distances as individual grid cells varying by color: the warmer the color, the bigger the distance. For instance, here is how real actions (the ones I produced with the statistics embeddings) look.

Pairwise distances example

You will see similar matrices for training and testing actions later on. Another way to evaluate the ‘artificiality’ of actions is the autoencoder reconstruction error. I train an autoencoder to reconstruct the action embeddings and use the reconstruction error as an artificiality metric. This technique is widely used for anomaly detection because it is unsupervised. I also perform Kernel Density Estimation on the error distribution for pretty plotting and visual comparison. The Wasserstein distance or KL divergence can also be applied.
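A condensed sketch of that evaluation trick; the tiny autoencoder and the random action batches below are only placeholders for the trained model and the real data:

import numpy as np
import torch
from torch import nn
from scipy.stats import gaussian_kde

# Stand-in autoencoder; in the project it is trained to reconstruct real action embeddings.
ae = nn.Sequential(nn.Linear(128, 16), nn.ReLU(), nn.Linear(16, 128))

def reconstruction_error(actions):
    with torch.no_grad():
        rec = ae(actions)
    return ((rec - actions) ** 2).mean(dim=1).numpy()   # per-action 'artificiality'

real_actions = torch.rand(1000, 128)         # placeholders for the real and generated batches
generated_actions = torch.rand(1000, 128)

err_real = reconstruction_error(real_actions)
err_gen = reconstruction_error(generated_actions)
xs = np.linspace(0, max(err_real.max(), err_gen.max()), 200)
kde_real, kde_gen = gaussian_kde(err_real)(xs), gaussian_kde(err_gen)(xs)  # curves to plot and compare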

An example of reconstruction error KDE

The goal is to make the generated test distribution close to the true (real) one. The generated train distribution (or just ‘generated’) may take different forms.

DDPG: an in-depth look

DDPG Training
GENERATED Actions
REAL actions

All I did at this point was copy-paste Higgsfield’s RL Adventure code with minor tweaks to make it work with torch 1.1.

It looks like we have something! The matrices look cool, some correlation is showing up, the loss is descending… Wait, what is that on the scale bar? Why do I have next to no cosine distance? Why is the policy loss falling into the nether realm? Why don’t my pet AIs work as they are supposed to?

And that’s where the true joy of RL comes in!

Countless things can go wrong! Maybe I should play around with the learning rate? Admittedly, that’s the first thing that comes to mind for a person familiar with DL. However, why is my error growing in the first place? I haven’t encountered these things in DL… WTF is happening with the value-net optimization, PyTorch? It doesn’t seem to have much of an impact as the process starts, but then the gradients explode and everything goes into oblivion… So, most likely, the learning rate is fine. Maybe it’s because of the temporal difference bootstrapping? But not only does the TD bootstrapping use the target value network, which is soft-updated via soft tau, it also utilizes a parameter called gamma for weighting the future expectation.

P_gamma = 0.9
P_min_value=-5
P_max_value=5
P_soft_tau=1e-2
P_policy_lr = 1e-5
P_value_lr = 1e-6
P_actor_init_w = 3e-4
P_critic_init_w = 3e-4

What am I supposed to tweak? Maybe I set the wrong parameter for clipping the value:

expected_value = torch.clamp(expected_value, min_value, max_value)

Making the DDPG understand

One thing that bamboozled me is that the weight-init trick works slightly differently than one would expect. If we initialize the last layer’s weights of the net with smaller numbers, it will generate smaller vectors, right? Wrong! It works exactly the opposite way. That is the main reason the cosine distances are so small compared to the distribution of the real actions. However, it doesn’t seem to have much of an effect on the Euclidean distances.
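Concretely, the knob in question is just the init range of the actor’s last layer; a minimal sketch, with a hypothetical layer size:

import torch

P_actor_init_w = 3e-4                    # the init range from the parameter listing above
actor_out = torch.nn.Linear(256, 128)    # hypothetical last layer of the actor
torch.nn.init.uniform_(actor_out.weight, -P_actor_init_w, P_actor_init_w)
torch.nn.init.uniform_(actor_out.bias, -P_actor_init_w, P_actor_init_w)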

Also, have you noticed that the min and max parameters for the value clipping are the min and max rewards, respectively? That also needs to be changed, as I mentioned back in the Markov property section. I set it to reward * n_td_steps. That works fine, and in the end the loss rarely goes beyond 6-7.

You may not realize it, but a smaller learning rate for the critic leads to actor overfitting. If the policy loss falls weirdly fast, change the LR, because the actor is overfitting. Keep in mind: not the most intuitive thing!

Always debug and keep track of the following: value and target_value (mean, std), expected_value (mean, std), and the generated actions (cosine distances, Gramian, variance, std, means, KL distance of samples to the original distribution). Always plot those metrics at the end of testing.

P_critic_init_w=3e-5
P_actor_init_w=3e-1
P_min_value=-10 # or -100
P_max_value=10 # or 100

We have solved the RL part of the non-working network; now the reasons for the loss to grow are purely DL ones. Here are some bits of advice for you:

  1. Play around with the optimizers. There are so many of them for a reason. Start with Adam, try Hinton’s RMSprop, pure SGD, SGD with warm restarts and CosineAnnealingLR. Although, if you don’t care, there is a neat new algorithm for you to try: RAdam (GitHub link). I use it whenever possible because it reduces the number of parameters to tune. It is explained later.
  2. Use grad clipping extremely carefully. Using it on the value network changes the sign of the value function, and I don’t understand why that is the case; the policy seems to work just fine with it. I spent almost a week figuring this out. No, for real.
  3. Weight decay is crucial to avoid overfitting, and it is easy to use with PyTorch optimizers. Set it to a small value, and the exploding gradients are gone within seconds. Also, use it together with grad clipping for the actor.

This is what is supposed to bring the DDPG back to life on your/my data.

What the DDPG parameters change:

  • Value network learning rate: affects the Euclidean distances between the generated actions. It also improves the cosine distance, but on a smaller scale. It only matters for SGD/Adam and makes no difference for RAdam.
Value LR too big: the Euclidean distance skyrockets / value LR balanced
  • Actor weight initialization: changes the diversity of the cosine distances in the action space. If you see next to no variance in your action distances, start by changing this parameter.
Small cosine distances as the result of lousy weight initialization
  • Soft tau: responsible for the target network update. It affects the delay of the loss ‘rollercoaster’ as it carries over to the target network. You can see that both losses resemble each other; if you run the code, you will see the value loss slowly translating into the policy loss. As I said, soft tau influences the smoothness of that translation. It is another critical parameter to consider if something does not work for you, but don’t overestimate it.
  • Gamma: the weight of the expected next value in the temporal difference. Imho not that important, though smaller values may lead to gradient explosions.
  • Min/Max values: I set them to +-10. You can experiment with other large numbers. Alternatively, you can use x * n_td_iterations, where x is the min/max possible reward; it makes the learning process more rigid but defies the value function property. Or, if you don’t care at all about the rollercoaster losses, you can use numpy infinity. The parameter has a positive influence on the value test loss.

Always look at the scale of the loss function. The value loss should be somewhere in [0, 10], whereas the policy loss should stay within the min/max values used for TD clipping. If it goes up to 10e7, you are wasting your time debugging!

Why the optimizer choice is important

I have already drawn attention to this, but I cannot stress it enough: always use a state-of-the-art optimizer. Check Arxiv daily, subscribe to the Twitter bot and be on the lookout, because the machine learning community is now at its peak of productivity.

Adam is often the default choice for optimization, and this should change. Adam is one of the worst options you have. Why? It is very, very learning-rate sensitive. Think back to the learning rate schedulers: carefully memorizing which step to put the milestone at, using SGD with CosineAnnealingLR warm restarts, CyclicLR… All of the techniques above are now in the past; no more fiddling with the learning rate. Just use RAdam: On the Variance of the Adaptive Learning Rate and Beyond!

RAdam is an ideological extension of Adam, and it uses a smart trick akin to learning rate warmup. Unfortunately, the authors of RAdam did not make their Medium article free, so there will be no detailed explanation here; you can read their Arxiv paper.

As in Figure 8, we assume gradients follow a normal distribution (mean: μ, variance: 1). The variance of the adaptive learning rate is simulated and plotted in Figure 8 (blue curve). We can see that the adaptive learning rate has a significant variance in the early stage of training.

Nevertheless, I claim that just by changing the optimizer, I was able to achieve TD3-like performance with the DDPG algorithm, and with way less overfitting in both cases.

Let’s stick with Adam and see how it performs.

Here is how the Adam loss looks. One does not have to be a Ph.D. to spot clear value overfitting. Also, do you notice the policy loss? It seems strangely small: the max reward is 5, and the max TD clipping is 10.

The train actions seem to overfit; not much Euclidean distance is showing up, but alright…

Adam train actions: looking good

The loss seems reasonable at 7.5. Or does it? Alright, let’s look at the autoencoder reconstruction error.

and_that_was_the_moment_he_knew.png

You can spot that something is wrong just by looking at the AE reconstruction error distributions and/or comparing various metrics between them. The dumbest choice would be to only calculate the KL/Wasserstein distance; you can also look at more meaningful statistical metrics such as mean/std/kurtosis/skewness.

Adam test actions: oops! All overfit copies!

CosDist < 0.4, Euc < 3. No words needed. Also, an observation of mine: the cosine matrix defines the angle distribution of the vectors’ directions. You can use basic math to visualize it.

Woah, that’s some good memory! I wish I could memorize that much. Unfortunately, that is not learning. You could jump straight into TD3, and it would undoubtedly help with the overfitting, but let’s try RAdam instead. Just change optim.Adam to RAdam when you define the optimizers.
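The swap itself is a one-liner per optimizer, assuming you have the RAdam implementation from the repo linked above importable as a module (its constructor mirrors Adam’s):

from radam import RAdam   # from the linked repo; not part of torch.optim in 1.x

# before:
# policy_optimizer = torch.optim.Adam(actor.parameters(), lr=P_policy_lr)
# after:
policy_optimizer = RAdam(actor.parameters(), lr=P_policy_lr)
value_optimizer = RAdam(critic.parameters(), lr=P_value_lr, weight_decay=1e-2)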

Moving to RAdam

So I ran the same code with RAdam, and that’s what happened.

DDPG RAdam error
RAdam train actions

First off: no nasty value overfitting, and the value loss is lower. The convergence is much more delayed compared to Adam: with Adam you can see a clear downward trend at around iteration 400, while here it takes twice as many iterations to notice the pattern. The value loss is also much smoother. The train actions are much further apart in terms of Euclidean distance, while the cosine distance seems to be preserved across all the experiments. All right, let’s see the test actions:

RAdam test actions: as you can see, we gained three times as much Euclidean distance and 1.5x the cosine

The results speak for themselves! The test distribution is approximated way better with RAdam, while the training one is more obscure. Nevertheless, we only care about testing.

RAdam results for the reconstruction error

However, RAdam comes with a drawback: it is hard to fine-tune models with it. If you leave the model running with RAdam, the training distribution will quickly converge to the true one, and the testing distribution will turn into something like a small pine cone at the very right. You need to watch for the policy loss approaching -5, or stop the learning manually. This is what happens if you do not stop it:

RAdam overfitting

The policy network overfits: the values are quickly dropping. That is not the case with Adam.

The value tends to deteriorate; as we will find out in a moment, that is due to overfitting.

Combining the Two

My end solution was to combine the two optimizers. First, I learn with RAdam, creating checkpoints before overfitting, which ensures that I have a nicely warmed-up model. Then I load these checkpoints and train them with Adam to achieve even better performance. Unfortunately, there is no room for automation: you find the smallest policy loss and start from the nearest checkpoint.
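A rough outline of that two-stage schedule (the paths, the epoch count, and the chosen checkpoint are only examples; `actor` and `RAdam` come from the earlier snippets):

# Stage 1: warm up with RAdam, saving a checkpoint every epoch.
policy_optimizer = RAdam(actor.parameters(), lr=1e-5)
for epoch in range(10):
    # ... run the usual DDPG/TD3 updates for one epoch here ...
    torch.save(actor.state_dict(), f"checkpoints/actor_{epoch}.pt")

# Stage 2: reload the checkpoint taken just before the policy loss started collapsing
# (epoch 7 is just an example) and continue with plain Adam.
actor.load_state_dict(torch.load("checkpoints/actor_7.pt"))
policy_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-5)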

This results in a better representation of the training distribution, but the testing one is way worse. Still, this method of fine-tuning will be more useful for the TD3 algorithm, which we are switching to… now!

P.S. All models are available for download on the GitHub page (including raw RAdam and Adam fine-tuned).

TD3

TD3 stands for Twin Delayed DDPG; three is the number of improvements the authors of the paper propose. As I already stated, it is an extension of Deep Deterministic Policy Gradients. The differences are the following:

  1. Double Q-Learning with no clip. We have two value networks in this algorithm, and instead of clipping the expected value, we take the smaller of the two target values. This simple trick drastically reduces the chance of the policy agent exploiting and fooling the critic, because now it has to fool both of them.
# DDPG: clip the bootstrapped value
target_q_value = target_value_net(next_state, next_action)
expected_value = reward + (1.0 - done) * gamma * target_q_value
expected_value = torch.clamp(expected_value, min_value, max_value)

# TD3: take the minimum of the two target critics instead of clipping
target_q_value1 = target_value_net1(next_state, next_action)
target_q_value2 = target_value_net2(next_state, next_action)
target_q_value = torch.min(target_q_value1, target_q_value2)
expected_value = reward + (1.0 - done) * gamma * target_q_value

2. Delayed Policy Updates. We update the policy less frequently than the values: for every three value updates, we update the actor just once. This allows for better value estimation and also prevents the actor from fooling the critic. P.S. It is not a hard concept and is super easy to implement (see the sketch below); I used delayed policy updates with DDPG as well.
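A sketch of the delay, with policy_delay = 3 matching the ‘three value updates per actor update’ above (update_critics and update_actor stand in for the loss code shown earlier):

policy_delay = 3
for step, batch in enumerate(dataloader):
    update_critics(batch)                  # both value networks learn every step
    if step % policy_delay == 0:           # the actor (and the targets) only every third step
        update_actor(batch)
        soft_update(actor, target_actor)
        soft_update(critic1, target_critic1)
        soft_update(critic2, target_critic2)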

3. Action smoothing. TD3 adds noise to the target action, making it harder for the policy to exploit Q-function errors by smoothing out Q along changes in action. In my case, the noise is drawn from ~Normal(0, 0.1) and clipped to fit [-0.3, 0.3].

next_action = target_policy_net(next_state)
noise = torch.normal(torch.zeros(next_action.size()), noise_std)
noise = torch.clamp(noise, -noise_clip, noise_clip)
next_action += noise

TD3 Results

The algorithm has pretty much the same parameters as DDPG (excluding the min/max clip and adding the noise params). All the code/dataset/pretrained models are on GitHub. TD3 is located under notes/1. Vanilla RL/2.ddpg

TD3: Value loss is growing if you do not stop it. I stop it at 1000
Final loss
Test distribution is way better approximated
TD3: Train action pairwise distances 3x cosine 4x euclidean (compared to ddpg Adam)
TD3: Test action pairwise distances 3x cosine 4x euclidean (compared to ddpg Adam)
Real Action Distances. Most of them are somewhat between 10–12

Phew! We’ve made it! What you see is my final result for the off-policy methods. Now let’s discuss the losses and other stuff. First off: the value loss is growing. This is reasonable, because if we look at the values, they are not overfitting. In DDPG, they were increasing and then decreasing again.

TD3: value over time

The cosine distances in real and generated actions differ: the generated ones are around ~0.6, whereas the real ones are ~1. This can be increased by setting the noise std to a higher value, though I consider it cheating because we just add more noise. Alternatively, you can use another loss function coupled with the policy loss to add more cosine diversity.

In one of my older commits, I implemented cosine and Euclidean distance loss penalties with the new PyTorch JIT compiler. P.S. It returns the pairwise distance matrix, like the ones shown above. You can do whatever you want with this thing (i.e., compare variance, std, mean, KL) to make it look more like the real actions.

import torch

@torch.jit.script
def torch_cdist_euc(x1, x2):
    # pairwise Euclidean distance matrix
    x1_norm = x1.pow(2).sum(dim=-1, keepdim=True)
    x2_norm = x2.pow(2).sum(dim=-1, keepdim=True)
    res = torch.addmm(x2_norm.transpose(-2, -1), x1,
                      x2.transpose(-2, -1), alpha=-2).add_(x1_norm)
    res = res.clamp_min_(1e-30).sqrt_()
    return res

@torch.jit.script
def torch_cdist_cos(x1, x2):
    # pairwise cosine distance matrix
    x1_norm = x1 / x1.norm(dim=1, p=2).unsqueeze(1)
    x2_norm = x2 / x2.norm(dim=1, p=2).unsqueeze(1)
    res = 1 - torch.mm(x1_norm, x2_norm.transpose(0, 1))
    return res

Ranking the predictions

It’s time to test our algorithms. You can download all the pre-trained models and test them yourself. The notebook is under notes/results/1. Ranking/ (clickable).

Below you will see examples of distance ranking. It supports scipy.spatial distances or your own. In the notebook, I included the following: Euclidean, cosine, correlation, Canberra, Minkowski, Chebyshev, Bray-Curtis, and city block (Manhattan). Cosine ranking allows for better language and genre diversity and looks very similar to correlation.
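A bare-bones version of that ranking, assuming the movie embeddings sit in a numpy matrix and the actor has just produced an action vector (names and sizes are illustrative):

import numpy as np
from scipy.spatial import distance

movie_embeddings = np.random.rand(5000, 128)   # one row per movie
action = np.random.rand(1, 128)                # output of the actor for the current state

dists = distance.cdist(action, movie_embeddings, metric="cosine")[0]
top_k = np.argsort(dists)[:10]                 # indices of the 10 closest movies
# swap metric= to "euclidean", "canberra", "chebyshev", "cityblock", ... for the other rankings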

DDPG

DDPG Euclidean ranking
DDPG cosine ranking

TD3

TD3 Euclidean ranking
TD3 cosine ranking

That is it! You can see all the ranking examples for both of the algorithms here.

Conclusion

So here we are. Congratulations if you have finally made it. It might seem like a simple project, just using Higgsfield’s existing algorithm implementations on new data, but I have been working 6 hours a day for the last month to get it working, to figure out the parameters of DDPG, and to understand the depths of PyTorch, although the article was the hardest part of it. I am already writing this conclusion, but I haven’t finished the Restricted Boltzmann Machines part of the comparison section yet, and the repo is unpushed. I wonder how it turned out… Did you understand it?

The TD3 implementation is promising, and the ranking works fine, although there is room for improvement: I am using an O(n) algorithm. I highly recommend you check out the Milvus library if you want to use embeddings in production.

I am also yet to implement a web app. I already have a commit in my other repo with React and a basic layout. Hopefully, it will be published soon; if not, you can do it yourself: all the models are released.

I honestly don’t have much else to say. I am past 7,131 words already.

Anyway, here are some ideas from me:

  1. Because this is a recommendation project, you could use top-k ranking for the action. The ranking results can then be used to help the learning process:
  • You could add another loss that calculates distances between the real ranked actions and the generated ones (another loss for the critic).
  • How different is the value function for the ranked actions? Can it be adjusted further with respect to the distances?

2. Add yet another cosine/Euclidean loss on the generated actions. The scripts are published above.

3. Implement a ranking network. It is an extension of the policy: it takes an action as input and learns to generate real actions based on top-k ranking.

If you happen to implement these, feel free to commit.

Thank you:

Medium: Ping Guo, ‍임한동[ 학부재학 / 기계공학부 ], Jacky Noah, Shady Hassab, Alexander Makeev, Shivam Akhauri, Diksha Garg, Siddharth Prabhu, Zohar Komarovsky, Hary Prasad, Panagiotis Kapros, Vishal Shrinivas, Yvette Li, Lxs Lxs, Nils Schluter, Ayush Kumar, Dean Soe, Cody Bushnell, Marcus Au, باربری تهران اتوبار تهران, Navi Xie, Sang Huynh, Simon Yu, Yuzhou Zhang, Hoglan Huang, Lambjed Ben, Axel Schwanke, Anirban Saha, Baris Can Tauris, Mingju He, Jean-Philippe Corbeil, Shoaib Hafiz — It was your follow that made me feel somewhat important for the community.

GitHub: navy-xie, KnightofK9, tomatatto, lystahi, jungi21cc, nutorbit, davidjiangt, kiminh, hb1500, YerinMin, Saleh-Hassan, ImranRolo, limtete, erikson92, miladfa7, sudongtan, Belobobr, noaRricky, q710245300, andrebola, chenmingxia, ashdtu, dindanovitasari, kk-hainq, bazitur, youyiandyou, Vi-Sri, sanjmen, YvetteLi — You believed in my project back before it was implemented. Those stars begiventh by thee motivated me to keep on working.

DeepLearningSchool (www.dlschool.org): I wouldn’t be writing this article if it wasn’t for your excellent lessons and mentoring. I hope I make it to MIPT next year.

To the newcomers:

Feel free to tell me about yourself. I started deep learning when I went to high school. Back then, I created countless GitHub projects on other topics that were by no means less ambitious. The only problem was that no one knew about them, not a single soul except for a couple of friends. If you are working hard on something, be sure to tell people about it. It’s others’ appreciation that keeps you moving forward. P.S. Don’t write yet another Keras-for-preschoolers tutorial ;)

Next article

Recently I published a new article covering REINFORCE recommender systems:

What’s also coming

The next article will address the fact that we have no exploration. It will surely take some time to make, but as I see it, BCQ integrates easily with DDPG. If you want to know more about RL applications with a static dataset, here are the links:

Arxiv article: https://arxiv.org/abs/1812.02900

Pytorch Code: https://github.com/sfujim/BCQ

Have some questions? The comments are open…

GitHub repo: https://github.com/awarebayes/RecNN

Don’t forget to clap!

Goodbye!
