Many applications of Reinforcement Learning (RL) are specifically aimed at taking the human out of the loop. OpenAI Gym [1], for example, provides a framework for training RL models to act as the player in Atari games, and numerous publications describe using RL for robotics. A less commonly discussed area, however, is applying RL methods to improve a human's subjective experience.
To demonstrate one such application, I have developed a simple game called "Trials of the Forbidden Ice Palace" [2]. This game uses Reinforcement Learning to improve the user's experience by tailoring the difficulty of the game to them.
How the Game Works
The game is a traditional roguelike: a turn-based dungeon crawler with RPG elements and a large amount of procedural generation. The player's goal is to escape the ice palace, floor by floor, fighting monsters and gathering helpful items along the way. While the enemies and items appearing on each floor are traditionally generated at random, this game lets the RL model choose these entities based on the data it has collected.
As Reinforcement Learning algorithms are notoriously data hungry, the game was created with the following constraints to reduce the complexity of the RL model's task (a rough code sketch of these constraints follows the list):
1) The game has a total of 10 floors, after which the player is victorious
2) The number of enemies and items that can be spawned on each floor is fixed
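As a minimal illustration, these constraints could be captured as a handful of game constants; the names and the per-floor spawn counts below are assumptions, not values taken from the game's source.

```python
# Illustrative constants encoding the two constraints above.
# The spawn budgets are hypothetical placeholders, not the game's real values.
NUM_FLOORS = 10          # the player is victorious after clearing floor 10
ENEMIES_PER_FLOOR = 5    # fixed number of enemy spawns per floor (placeholder)
ITEMS_PER_FLOOR = 3      # fixed number of item spawns per floor (placeholder)
```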
Reinforcement Learning and the Environment
The core concept of Reinforcement Learning is that an autonomous Agent interacts with an Environment by making observations and taking actions, as depicted in Fig. 1. Through these interactions the Agent may receive rewards (either positive or negative), which it uses to learn and to influence its future decision making.

For this application, the Agent is the RL algorithm which tailors the difficulty of the game based on which entities it chooses to spawn, and the game is the Environment that the RL algorithm may observe and have some control over.
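As a rough sketch, the loop described above might look as follows for this game; the class and method names are illustrative assumptions, not the game's actual API.

```python
# Minimal sketch of the Agent-Environment loop for one floor of the game.
# All object and method names here are hypothetical.

def play_floor(agent, game):
    state = game.observe()               # e.g. (floor_number, player_level)
    action = agent.choose_spawns(state)  # which enemies/items to spawn
    game.spawn(action)                   # the Environment applies the action
    reward, next_state, done = game.resolve_floor()
    agent.update(state, action, reward, next_state)
    return done                          # True once the run has ended
```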
State
The state is any observation the Agent makes about the Environment, which may be used in deciding which actions to take. While there is a wealth of data the Agent could observe (the health of the player, the number of turns required for the player to advance a floor, etc.), the first version of the game considers only two variables: the floor the player has reached and the level of the player's character.
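A minimal sketch of that two-variable state, assuming it is simply used as a key into the Q matrix (the class below is illustrative, not the game's actual code):

```python
from typing import NamedTuple

class State(NamedTuple):
    """Observation used by the Agent: only the floor reached and the player's level."""
    floor: int         # 1 to 10
    player_level: int  # character level at the start of the floor

# Example: the key used to look up Q values for floor 3 with a level-2 character
state = State(floor=3, player_level=2)
```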
Actions
Due to the procedurally generated nature of the game, the Agent spawns monsters and items stochastically rather than making a deterministic decision each time. Because of this large element of randomness, the Agent does not explore/exploit in the typical RL manner; instead, it controls the weighted probabilities with which different enemies and items spawn in game.
When the Agent chooses to act by exploiting the best pattern learned so far, it decides which enemy or item to spawn by weighted random sampling of the learned Q matrix; when the Agent chooses to explore, it instead spawns an enemy or item drawn with equal probability from all entities in the game.
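A minimal sketch of that spawn policy, assuming the Q matrix row for a state holds one non-negative weight per spawnable entity and that a fixed exploration rate is used (both are assumptions for illustration):

```python
import random

def choose_spawn(q_row, entities, epsilon=0.1):
    """Pick one entity to spawn for the current state.

    q_row    -- learned weights for this state, one per entity (assumed non-negative)
    entities -- list of spawnable enemy/item types
    epsilon  -- probability of exploring instead of exploiting (illustrative value)
    """
    if random.random() < epsilon:
        # Explore: every entity is equally likely to spawn.
        return random.choice(entities)
    # Exploit: weighted random sampling of the learned Q values.
    return random.choices(entities, weights=q_row, k=1)[0]
```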
Rewards
The reward model for the Reinforcement Learning algorithm is crucial for developing the behaviours the learned model is intended to display, as Machine Learning methods notoriously take shortcuts to achieve their goals. Since the objective is to maximize enjoyment for the player, the following assumptions were made to quantify enjoyment in terms of rewards for the RL algorithm:
- A player who advances further in the game has more fun than one who dies early
- A game that the player wins every time without a challenge is dull
With these objectives in mind, the RL model receives a reward when the player progresses to a new floor, as shown in Table I, and when the game ends, as outlined in Table II.
Table I: Reward Model for Player Progression

Table II: Reward Model for Game Completion

Considering both the progression and completion scoring mechanisms above, the RL algorithm maximizes its reward by allowing the player to progress to Floor 8, at which point the player should ultimately meet their demise. To minimize the chance of unintended behaviour, the RL algorithm is also penalized for early player death.
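Since the exact values live in Tables I and II, the sketch below uses placeholder numbers purely to illustrate the shape of the scheme: progression is rewarded, early death is penalized, and an effortless win pays out less than a hard-fought defeat deep in the palace.

```python
# Illustrative reward shaping only: these placeholder numbers are NOT the values
# from Tables I and II; they merely echo the incentives described above.
def progression_reward(new_floor):
    # Reward each floor the player reaches, tapering off after Floor 8.
    return min(new_floor, 8)

def completion_reward(player_won, floor_reached):
    if player_won:
        return 10      # an unchallenged win is worth comparatively little
    if floor_reached >= 8:
        return 50      # a defeat deep in the palace pays out the most
    return -20         # early deaths are penalized
```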
Updating the Model
The RL algorithm employs Q-Learning, modified to accommodate the stochastic actions performed by the Agent. Whereas in traditional Q-Learning [3] the Agent takes a single action between states, here the update considers the probability distribution over all the enemies/items that were spawned for the floor, as shown in the equation below.

Q'(s_t, a_t) = Q(s_t, a_t) + α [ r_t + γ Q̄(s_{t+1}, a) − Q(s_t, a_t) ]

where Q'(s_t, a_t) is the updated value of the Q matrix, Q(s_t, a_t) is the Q matrix entry for the state-action pair (s_t, a_t) at time step t, α is the learning rate, r_t is the reward received for transitioning to the state at time step t+1, γ is the discount factor, and the overlined term Q̄(s_{t+1}, a) is the estimate of the future value, taken as the mean of the Q values at time step t+1.
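In code, one possible reading of this update is sketched below: the future value is the mean of the next state's Q row, and the update is spread across the actions in proportion to how likely each entity was to be spawned on the floor (the data structures and variable names are assumptions).

```python
import numpy as np

def update_q(Q, state, spawn_probs, reward, next_state, alpha=0.1, gamma=0.9):
    """Modified Q-Learning update for stochastic spawn actions (sketch).

    Q           -- dict mapping state -> np.array of Q values, one per entity
    spawn_probs -- np.array with the probability each entity was spawned this floor
    """
    # The overlined term: mean of the next state's Q values.
    future_value = np.mean(Q[next_state])
    # Temporal-difference error for every entity in this state's row.
    td_error = reward + gamma * future_value - Q[state]
    # Weight the update by how likely each entity was to be spawned.
    Q[state] += alpha * spawn_probs * td_error
    return Q
```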
Since Reinforcement Learning methods require a large amount of training data, game data from player sessions is collected to train a global AI model which new players can use as a starting point.
Globalizing RL Training Through GCP
The global AI model is trained using game data collected from all players, and it serves as the base RL model for players who have not yet played a game. A new player receives a local copy of the global RL model when first starting; this copy becomes tailored to their own play style as they play, while their game data is used to further enhance the global AI model for future new players.
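A hedged sketch of the client-side logic: on first launch, if no local model exists, the global model is downloaded and saved as the player's starting point. The file name and the download helper below are hypothetical.

```python
import json
import os

LOCAL_MODEL_PATH = "local_q_model.json"  # hypothetical local file name

def load_starting_model(download_global_model):
    """Return the player's local Q model, falling back to the global model.

    download_global_model -- callable that fetches the global model (hypothetical).
    """
    if os.path.exists(LOCAL_MODEL_PATH):
        with open(LOCAL_MODEL_PATH) as f:
            return json.load(f)
    # New player: start from the globally trained model and keep a local copy.
    model = download_global_model()
    with open(LOCAL_MODEL_PATH, "w") as f:
        json.dump(model, f)
    return model
```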

The architecture shown in Fig. 2 outlines how data is collected and how the global model is updated and distributed. GCP was chosen because its free tier products are well suited to collecting and storing game data for model training [4]. To this end, the game routinely calls a Cloud Function on GCP to store data in the Firebase database.
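Uploading a finished session might then amount to a single HTTP call to a Cloud Function endpoint, roughly as sketched below; the URL and the payload shape are placeholders, not the game's actual interface.

```python
import requests

# Placeholder endpoint -- the real Cloud Function URL is not shown here.
CLOUD_FUNCTION_URL = "https://REGION-PROJECT.cloudfunctions.net/storeGameSession"

def upload_session(session):
    """POST one game session's data so it can be stored in Firebase for training.

    session -- dict of per-floor states, spawn choices, and rewards (assumed shape).
    """
    response = requests.post(CLOUD_FUNCTION_URL, json=session, timeout=10)
    response.raise_for_status()  # surface upload failures to the caller
```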
Conclusions
The work presented in this article describes how Reinforcement Learning was used to enhance a player's experience of a game, in contrast to the more common RL applications that automate human actions. Game session data from all players was gathered using free tier GCP components, allowing the creation of a global RL model. While players begin the game with the global RL model, their individual play sessions produce a custom tailored local RL model better suited to their own play style.
References
[1] OpenAI Gym, https://gym.openai.com
[2] RoguelikeRL – Trials of the Forbidden Ice Palace, https://drmkgray.itch.io/roguelikerl-tfip
[3] Kaelbling L. P., Littman M. L., & Moore A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4, 237–285. https://arxiv.org/pdf/cs/9605103.pdf
[4] GCP Free Tier, https://cloud.google.com/free