Playing Chess With Offline Reinforcement Learning

Adding A Splash Of Offline Reinforcement Learning To Our Generalized AI

Ben Bellerose
Towards Data Science


Image by author

Hi everyone! Today we will be improving the generalized AI we created in my blog post Playing Chess With A Generalized AI. For this, we will make two significant upgrades to our training process. First, we will use offline reinforcement learning (RL). Second, we will make a minor change to the data used to train our backbone layer.

In the original implementation of our AI, we used a training method called online RL. Online RL is a form of learning where the agent gathers training data through direct interaction with the environment and processes each experience only once before discarding it.

Online Reinforcement Learning | Image by author

Previous Training Setup

In our original implementation, we saved the game state into a cache after each action. Then, at the end of the game, we trained all model layers on that cache.

Initial Training Diagram | Image by author

We updated the model at the end of the game, rather than after every action, so the model could learn zero-sum games (chess, Go, checkers). These games provide no immediate reward, which makes training after every action difficult: you cannot tell whether the agent's chosen action was good or bad until the game is over. This is why we store all actions in a cache. At the end of the game, we can assume every move made by the winner had a positive impact on winning and every move made by the loser hurt that player's chances of winning.
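To make this flow concrete, here is a minimal sketch of that per-game online loop. The `env` and `model` interfaces used here (`select_action`, `step`, `train_all_layers`) are hypothetical stand-ins, not the actual code from the repository.

```python
# Minimal sketch of the original online training loop (hypothetical interfaces).

def play_and_train_online(env, model):
    cache = []                                  # (player, state, action) for every move
    state, player = env.reset(), "white"
    done, winner = False, None
    while not done:
        action = model.select_action(state)            # assumed model API
        next_state, done, winner = env.step(action)    # no per-move reward in chess
        cache.append((player, state, action))
        state = next_state
        player = "black" if player == "white" else "white"

    # Only once the game ends do we know the outcome, so each cached move is
    # labelled +1 if its player won and -1 if it lost (0 on a draw).
    samples = [
        (s, a, 0.0 if winner is None else (1.0 if p == winner else -1.0))
        for p, s, a in cache
    ]

    # Every layer (representation, backbone, heads) is updated on this single
    # game's cache, which is then thrown away.
    model.train_all_layers(samples)
```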

Current Issues

When analyzing the loss values during training, I noticed destabilization across many layers.

Representation Layer Loss Value | Image by author
Backbone Layer Loss Value | Image by author
Value Head Loss Value | Image by author
Policy Head Loss Value | Image by author
Reward Head Loss Value | Image by author
Next State Head Loss Value | Image by author

Destabilization is fairly common in RL, and since our model uses the MuZero paper as a reference, we can infer that these losses would converge at some point, allowing the model to effectively master the game as in the original paper. Supporting this inference, there is clear evidence that the model is learning: the latest model consistently outperforms the previous one. However, by addressing this destabilization, I believe we can speed up that convergence.

It's tough to say exactly what causes the destabilization, but I believe it is partly the result of retraining the whole model using online RL; more specifically, of training the representation and backbone layers online after every game. The reasoning behind this belief is that retraining those layers on a single game at a time does not give them enough data to learn effective encodings. The backbone loss graph above supports this: the variability in its loss values shows that the backbone layer is having trouble producing consistent game encodings. Ineffective game encodings are detrimental to overall model health because the model's heads are forced to make inaccurate predictions when they cannot extract the appropriate information from the encoding. For this reason, our primary focus will be improving the representation and backbone layers. By doing this, I believe the complete model will see significant gains.

Improved Training Setup

As stated in the intro, the first improvement we will implement is the use of offline RL. Offline RL trains an RL agent on logged data collected during earlier exploration.

Offline Reinforcement Learning Diagram | Image by author

In our case, this is a clear benefit: since we no longer need to throw out all of the previous games, we can provide much more data to the backbone layer during training.

Our new method will not use offline RL on its own; instead, it will be a hybrid of offline and online RL. First, we fine-tune the model's heads after every game using online RL, while also logging the game for later use. Fine-tuning the heads after every game improves our data exploration by having the model try out newly learned moves.

After playing x games (I used 20 in my tests), the complete model is updated with offline learning using all of the logged game data. We retrain all layers only after every xth game, rather than after every game, because training on all logged game data is computationally intensive. Since computation costs money, we figured it would be best to minimize total training time rather than the number of games. We choose to retrain the complete model offline at this stage, instead of continuing online, because the larger dataset keeps the model from placing too much weight on its last game and gives it more examples from which to build better game encodings.
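A rough sketch of this hybrid schedule is shown below. As before, the interfaces (`play_game`, `model.train_heads`, `model.train_all_layers`) are hypothetical stand-ins, and `play_game` is assumed to return the labelled samples from one game of self-play, analogous to the cache-building part of the earlier sketch.

```python
# Sketch of the hybrid online/offline schedule (hypothetical interfaces).

OFFLINE_EVERY = 20   # the "x" above: retrain offline after every 20 games

def hybrid_training(env, model, num_games):
    replay_log = []                          # logged data is never discarded
    for game_idx in range(1, num_games + 1):
        samples = play_game(env, model)      # self-play with end-of-game rewards
        replay_log.extend(samples)

        # Online step: fine-tune only the heads on the game just played so
        # the next game explores with the newest policy.
        model.train_heads(samples)

        # Offline step: every x games, retrain every layer on all logged data.
        if game_idx % OFFLINE_EVERY == 0:
            model.train_all_layers(replay_log)
    return replay_log
```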

To reduce the chance of bias towards certain game states, all duplicate game states are dropped, keeping only the most recent occurrence of each state. Keeping the most recent instance should also help keep the model from regressing, since it provides examples of what the model believed was the better choice at the time of exploration.
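One simple way to implement that deduplication is to key each sample by its game state and let later occurrences overwrite earlier ones. This is only an illustrative sketch; it assumes the state is hashable (e.g. a FEN string or a bytes encoding of the board).

```python
def deduplicate_states(replay_log):
    """Keep only the most recent sample for each distinct game state.

    Assumes replay_log is a chronologically ordered list of
    (state, action, reward) tuples with a hashable `state`.
    """
    latest = {}
    for state, action, reward in replay_log:
        latest[state] = (state, action, reward)   # later games overwrite earlier ones
    return list(latest.values())
```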

With these modifications, our new approach looks like the diagram below.

New Training Diagram | Image by author

The second improvement we will make is to the data we use when training the backbone layer. Previously, we trained it with a combination of the loss functions from the model's different heads.

Old Backbone Loss Function | Image by author

This loss function is not entirely ineffective, so it was not immediately obvious that it needed to change. However, after some thought about the purpose of this layer, I realized I was not training it as effectively as I could.

The purpose of the backbone layer is to create game state encodings, which means it needs to produce consistent encodings across game states. In particular, a game state paired with "no action" must have the same encoding as the previous game state paired with the action that produced that game state.

Game State Encoding Definition | Image by author
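In symbols, the condition in the figure above could be written roughly as follows. The notation here is my own, since the original figure defines the condition pictorially: b is the backbone layer, s_t the game state at time t, a_t the action taken, and a_∅ the "no action" placeholder.

```latex
% Consistency condition for the backbone encodings (my own notation):
% encoding the next state with "no action" should match encoding the
% previous state together with the action that produced that next state.
b\!\left(s_{t+1}, a_{\varnothing}\right) \;\approx\; b\!\left(s_t, a_t\right)
```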

The equation above captures this consistent encoding logic. With it, we can use self-supervised learning to train the backbone layer more effectively, making our new backbone loss function look like this.

New Backbone Loss Function | Image by author
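To make the idea concrete, here is a rough PyTorch-style sketch of what such a self-supervised consistency term might look like. This is my own illustration rather than the repository's implementation: the `backbone` callable and the `no_action` placeholder are assumed interfaces, the full loss still combines the head losses as before, and the choice of which side (if any) to stop gradients on depends on the actual implementation.

```python
import torch
import torch.nn.functional as F

def backbone_consistency_loss(backbone, state, action, next_state, no_action):
    """Self-supervised term pushing b(state, action) toward b(next_state, no_action)."""
    prediction = backbone(state, action)       # encoding predicted from s_t and a_t
    with torch.no_grad():                      # treat the next-state encoding as the target
        target = backbone(next_state, no_action)
    return F.mse_loss(prediction, target)
```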

With these two upgrades in place, here is what our loss values now look like.

New Representation Layer Loss Value | Image by author
New Backbone Layer Loss Value | Image by author
Value Head Loss Value | Image by author
Policy Head Loss Value | Image by author
Reward Head Loss Value | Image by author
Next State Loss Value | Image by author

As you can see in the graphs above, the loss values are much more stable, especially in the representation and backbone layers, which is where we believed the problems originated.

With the new updates to the model, we have also seen improvements in training time. To validate this claim, I held a small tournament between three models with the same number of parameters, each trained with a different method. The first model (Old) was trained for a longer duration with the old training method and the old backbone loss function. The second model (New) was trained in less time with the new training method and the new backbone loss function. Finally, the third model (Full Offline) was also trained in less time, using offline RL after every game together with the new backbone loss function.

Tournament Results | Image by author

Based on the results in the table above, we can say the new model performs better despite being trained for less time, which validates the new training method. An interesting observation from this experiment is that the full offline model also performed similarly to the old model while being trained for less time. It's difficult to know precisely which change caused this: it could mean that the offline method is better than the fully online method, that the new backbone loss function is superior to the old one, or that the combination of the two is what helped.

Model Training Time Comparison | Image by author

Above is another set of training statistics that illustrates the difference between the new and full offline training methods and helps explain why the new method performs better in less time. The table shows that the new method emphasizes exploration: it spends more time exploring than the full offline model while maintaining a lower total training time, and it also played more games. I believe this emphasis on exploration is what gives the new method its strength. It also produces a larger dataset in a shorter time, which benefits the full retraining of all model layers.

I believe DeepMind's recent Chinchilla paper supports this idea that more data matters, as it explores the concept in depth for Transformer-based models. I won't go into detail about the Chinchilla paper here, but a quick summary is that previous scaling laws undervalued data, and models should be trained on more data relative to their number of parameters.

Thanks

And there you have it: we have successfully upgraded our generalized AI, allowing it to train more efficiently. You can check out the full version of the code on my GitHub here. Feel free to download it and try it out yourself.

Thanks for reading. If you liked this, consider subscribing to my account to be notified of my most recent posts.

References

[1] J. Schrittwieser et al., Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (2020), Nature

[2] J. Hoffmann et al., Training Compute-Optimal Large Language Models (2022), arXiv:2203.15556
