
Pendragon Four: Training Pipeline Deeper Dive for Multi Agent Reinforcement Learning

Deeper dive into training multiple RL agents simultaneously to play the mobile phone game Fate Grand Order

Over the past year I have made various versions of neural network powered bots to play the game Fate Grand Order (FGO), loosely called Project Pendragon. The work around Project Pendragon ranges from feature extraction with a series of neural networks to get information about the current game state, to my most recent additions: three neural networks, one to control each of the three characters active on the game screen. These three, plus one additional bot for picking action cards, are the four reinforcement learning (RL) agents that make up my current version of the project, Pendragon Four.

My recent post introducing Pendragon Four covered the overarching additions and upgrades I had to make in order to train my new agents, along with some of the results. Here I want to go over some aspects of training that I found useful and some lessons learned while continuing to improve upon my initial Pendragon Four pipelines.

  1. Moving from DQN to Policy Gradient based training
  2. Handling invalid moves
  3. Game balancing and properly challenging the agents

Pure policy gradient bots: across multiple training runs, a fairly consistent behavior was that the bots would use Ishtar’s (middle character) second skill to instantly clear the first wave. For the bots I think this is a more stable strategy that at least gives them the opportunity to get to the late game, rather than saving that skill for a point when they may or may not still be alive.

Letting Agents Learn Their Own Way to Play

My first real foray into RL was building the custom environment for my original Pendragon Alter bot, a DQN-based agent whose goal was to pick command cards to play on any given turn. The basic process for a DQN-based training approach is as follows:

  1. The Agent (the network in this case) is given the current state of the game. This could be the pixels of an Atari Pong game or whatever representation you choose. With my FGO Pendragon Alter bot it was a representation of the 5 cards it was dealt.
  2. The Agent chooses an action out of its action space. In the Pong example Andrej Karpathy has it as the probability of going up; with my FGO game it is which of the 60 possible card slot combinations is best (there are 60 ways to choose 3 cards out of 5). It should be noted that in reinforcement learning there is the idea of exploration vs. exploitation: sometimes an action should be chosen at random rather than simply doing whatever the Agent thinks is best. This helps the Agent explore and find additional rewards that it would not have found if it had simply exploited the rewards it already knows about.
  3. The action is fed into the environment, any rewards are collected, and the environment moves on to its next state, frame, or turn. Mechanically I do this by adding the rewards to the card slot combination that the network outputted. For a positive reward that category’s value in the output array increases, so the network will see that, given that input again, that particular category is a beneficial one.
  4. The Agent is updated based on the rewards it received. After the rewards are used to modify the output array, the network is trained on the initial input state with the modified output array as the target (see the sketch after this list). This helps to reinforce good choices while taking bad choices into account as well.
  5. Rinse and repeat.
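
To make that loop concrete, below is a minimal sketch of the per-turn update, assuming a Keras-style model with one output node per card slot combination. The function names, the epsilon value, and the state format are placeholders for illustration, not the actual Pendragon code.

import numpy as np

def choose_action(model, state, epsilon=0.1):
    # step 2: epsilon-greedy choice over the card slot combinations
    q_values = model.predict(state[np.newaxis])[0]
    if np.random.rand() < epsilon:              # explore
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))             # exploit

def dqn_style_update(model, state, action, reward):
    # steps 3-4: fold the reward into the chosen slot's value, then retrain
    # on the original state with the modified array as the target
    target = model.predict(state[np.newaxis])[0]
    target[action] += reward
    model.fit(state[np.newaxis], target[np.newaxis], verbose=0)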

This process is effective and reasonable, but one issue I found with my custom environments was that it forced me to constrain the agents to play from my point of view, because there is no score or built-in reward I could use to assign value to an individual action. So what happened is that I had to reward the bots for things that I thought were good. For example, with the card bot that meant combining 3-of-a-kind card combinations, since there are bonuses associated with that, and with the character bots in Pendragon Four it meant using skills at reasonable times. While these rewards can be effective in teaching a bot useful behavior, it doesn’t really let them explore new styles of play and it imprints my beliefs onto my agents rather than letting them learn for themselves.

So in my next round of training for Pendragon Four, one of my goals was to successfully upgrade to more of a policy gradient approach. While DQN-based training methods assign rewards on individual turns, the purest form of a policy gradient method assigns reward based only on winning or losing the game, e.g. +1 or -1.

Conceptually this was always confusing to me, because an agent may take the same action in both a win and a loss, and so be rewarded positively one time and negatively the next. This is true and turns out to be fine: the idea is that if, on average, that move is a good one in that situation, you should win more times than you lose, which gives that move a net positive value in that situation.

Policy gradient example for one agent over 3 games:

Diagram of 3 sample games: each circle is a game state and the arrows are the actions the agent selected on that turn. On the bottom are three sample rendered game states.

In the example above there are three games: 2 wins and 1 loss. Basically what we do is assign a positive reward to all the actions that occurred in the winning games and a negative reward to the actions in the losing game. One outcome is that on turn 5 both winning games used skill 3 (sk3), so that is a move that would be fairly positively rewarded in this subset. Using skill 2 on turn 1, on the other hand, only occurs in the losing game, so it is disincentivized. A more interesting case is using skill 1 on turn 4, which occurs in 1 win and 1 loss: after this subset it would have 1 example where it is incentivized and 1 where it is disincentivized, and additional games are needed to show whether or not it is a good move.
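As a rough sketch of that credit assignment, the snippet below labels every (state, action) pair from a finished game with +1 for a win and -1 for a loss. The (states, actions, won) game format here is an assumption for illustration, not the exact bookkeeping my pipeline uses.

def assign_game_rewards(games):
    # games: list of (states, actions, won) tuples for completed games
    labeled = []
    for states, actions, won in games:
        reward = 1.0 if won else -1.0          # same reward for every turn in that game
        labeled.extend((s, a, reward) for s, a in zip(states, actions))
    return labeled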

For me it is somewhat magical that this approach is highly viable and generates good results. It starts to make more sense from a law of large numbers point of view: if we take thousands upon thousands of games, on average better moves will appear in more of the wins than poorer ones, so we end up with agents learning to make winning moves.

In my previous Pendragon Four pipeline I got a policy gradient setup working where I assigned rewards at the end of the game, but I was also giving intermediary rewards for behavior I thought was good, so I was still imprinting some of my views onto the agents. In the spirit of removing as much of myself from the pipeline as possible, I wanted to move to a pure policy gradient approach where I only assigned rewards based on winning or losing. To get to this point, I found that doing more exploration than I had previously was helpful for letting the bots discover more ways to play.

Once I moved to applying rewards only once a game was complete, another thought I had was to aggregate games together into larger "datasets" and do large updates to an agent, instead of updating turn by turn as I did previously. I thought this would be good for training, but also good for utilizing more of my computational resources.

One thing I always found annoying in my older pipelines was that, when training on a per-turn basis, I got essentially no benefit from training on a GPU, since I think most of the overhead in that setup came from sending things to and from the GPU. So now I gather several hundred games together as a sort of "dataset" and train in large batches to make use of my GPUs.
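A hedged sketch of what that batching might look like, assuming a Keras-style model and a list of finished games where each game already has per-turn input states and per-turn target arrays (for example, built by applying the end-of-game reward to the network's outputs); the data format and batch size here are hypothetical:

import numpy as np

def train_on_game_batch(model, finished_games, batch_size=512):
    # stack several hundred finished games into one "dataset"
    X = np.concatenate([np.stack(states) for states, _ in finished_games])
    Y = np.concatenate([np.stack(targets) for _, targets in finished_games])
    # one large fit call instead of hundreds of tiny per-turn updates
    model.fit(X, Y, batch_size=batch_size, verbose=0)

The point is simply that the GPU finally gets enough work per transfer to be worth using.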

Handling Invalid Moves

In previous reinforcement learning projects I have kept the action spaces for my agents restricted to all valid actions, mostly because I did not know how to elegantly handle invalid moves. However, for this recent iteration of Pendragon Four I had to figure out how to handle them, since skills go on cooldown after use and are no longer available for the bots to play.

At this point in time I only allow my agents to select a single action per turn, and each of the current characters has 4 possible moves:

pass
skill_1
skill_2
skill_3

This set of 4 moves works because none of these bots have skills that are cast on specific other characters; each of their moves targets either themselves or the entire party. Choosing characters with these types of moves was a choice I made for simplicity as I prototyped my bot pipelines.

In the future I want to add additional characters with a more diverse moveset and allow bots to play more than 1 move per turn.

With this first iteration of handling invalid moves I followed the advice of Stack Overflow:

"Just ignore the invalid moves." -Stack Overflow

This is still more or less the way I ended up handling invalid moves, but it got a little more nuanced. When I trained my first round of Pendragon Four bots I had the final layer of the networks set to linear activations, and when a skill went on cooldown I would set its output to a low value, such as the lowest value in the array. I found that sometimes this caused a sort of race-to-the-bottom effect where the outputs would become quite negative.

In order to avoid these large negative outputs I decided I wanted to use something like a sigmoid, or more ideally a softmax, activation on the final layer of the network, which would at least bound each node’s possible value to between 0 and 1. With either of these activations I could just set skills on cooldown to 0 rather than some increasingly large negative number. Between the two, the softmax activation was the one I really wanted to use, but I was unsure how to apply a reward, zero out invalid moves, and still create a valid probability distribution. So I did a good bit of testing using sigmoids.

Using sigmoid activations it was simple enough to keep the values between 0 and 1: if a reward pushed a value outside those bounds you could just set it back to 1 or 0. And because a sigmoid treats every node as independent, you don’t have to worry about reconstructing valid probability distributions. For my problem, though, it felt disingenuous to model it this way, since the actions aren’t really independent of one another when I only allow a bot to take a single action per turn (which is exactly what a softmax models).

So while mechanically this system worked, it felt wrong for the problem, which brought me back to the drawing board on how to use a softmax activation appropriately here.

With a softmax the output from the network is a valid probability distribution, and we take the action with the highest value as the action for that turn. A reward for one action also means the other actions are disincentivized, and vice versa. So mentally this made the most sense to me; I just had to figure out the mechanics, since modifying values of the output array destroys that nice probability distribution.

While this took me longer than it should have to realize, I eventually figured out that I could just zero out the invalid indexes of the predicted array, add or subtract the reward from the selected move, and then recalculate the probability distribution by summing the array and dividing every index by that sum. Basic math for the win! See below:

Game state: skill 3 has been used (now invalid) and the chosen move (skill 2) receives a reward of +1
Original: [.2, .5, .3] # sample probabilities for network
Modified: [.2, 1.5, 0] # added reward to skill 2 and zeroed skill 3
Final: [.117, .882, 0] # .2/1.7 and 1.5/1.7
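
Here is a small sketch of that zero-out-and-renormalize step as a function, reproducing the worked example above. The argument names are my own shorthand rather than the pipeline’s actual code.

import numpy as np

def build_target(probs, invalid, action, reward):
    target = np.array(probs, dtype=float)
    target[invalid] = 0.0                               # zero out skills on cooldown
    target[action] = max(target[action] + reward, 0.0)  # apply the reward to the chosen move
    total = target.sum()
    return target / total if total > 0 else target      # renormalize to a valid distribution

print(build_target([0.2, 0.5, 0.3], invalid=[2], action=1, reward=1.0))
# -> [0.1176..., 0.8823..., 0.]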

While this isn’t a huge change to the environment, it helped make the reward structure and goals of the agents more concrete in my brain and helped me work through other parts of this problem. It is also a good thing to have figured out overall, since it sets the stage for handling other interesting RL problems where there will likely be a large number of invalid moves.

Bots dumping their remaining skills in the final round. Funnily enough, the first bot on the left had previously used a skill to increase its critical damage; this turn it used a skill to greatly increase its damage, and bot 3 used a skill that essentially increases the likelihood of a critical hit. The first bot then got a very large critical hit and ended the round.

Game Balancing and Training Various Agents

I had never really given much thought to how to balance a game environment until I started training Pendragon Four. I found that I had to think rather carefully about what environments I placed the bots into in order to get them to learn useful behavior.

I have had a few versions of these custom FGO environments. The first one I made a year ago could roughly be thought of as a single round of combat between an RL agent and an enemy, where they just took turns hitting one another until one died. My more recent one has 3 waves of enemies, and the agent has to defeat all three waves before the agent’s health drops to 0. That environment gave the enemies around 120 hit points (HP) in total and had them deal between 1 and 3 damage per turn, which roughly mirrors a standard FGO farming level. For context, the agents have around 30 HP.

After my Pendragon Four environment upgrades I started training all 4 agents in an environment with similar total HP and damage, and I found that the agents would sometimes just not play skills, or other times play all their skills immediately, without really having to worry about anything.

Neither behavior is very interesting, since the agents aren’t really doing any problem solving, but their win rates weren’t terrible. In my original post I stated that

the random guess win rate for the 4 bots was around 38% and would peak at around 72% after maybe 20–30K games. After doing some additional work and a lot of experimenting I got the win rate up to around 84%.

What that 72% win rate actually represented was the 3 character agents making uninteresting choices and the card picker bot learning to basically hard carry the 3 character agents through the environment. What got me to 84% at the time, and closer to an 88–90% win rate in current versions, was actually a change in my training protocol.

What I found works is to first train the card picker agent (different characters have different decks, so the decks the card bot has to learn to play with differ depending on the characters), and then, once the card picker agent is trained, do a round of training for the individual character agents. This two-step approach helped me with initial development and with isolating what was learning when. I think it also helps with something else: if the card picker agent is learning while the character agents are learning, the character agents get less consistent feedback during their policy gradient training, since a lot of the team’s damage comes from the cards that are picked. So breaking the pipeline into pieces also helps ensure the character agents get consistent feedback for the actions they take.

Two-stage training process:

  1. The card picker agent gets trained up in one version of the game environment. This is mostly just to get it to learn how to play with a given deck, and it is trained in a DQN pipeline.
  2. Then I take the trained card picker and the three randomly initialized character agents and drop them into a MUCH harder game environment (sketched in code after this list).
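
In code this schedule is just a thin wrapper around the two training routines. The sketch below uses hypothetical stand-ins (train_card_picker_dqn, train_character_agents_pg, make_env) passed in as arguments, since the real pipeline’s functions and environments differ:

def two_stage_training(train_card_picker_dqn, train_character_agents_pg, make_env):
    # stage 1: the card picker learns its deck in the standard environment (DQN)
    card_picker = train_card_picker_dqn(make_env("standard"))
    # stage 2: freeze the card picker, then train three freshly initialized
    # character agents alongside it in the harder environment (policy gradient)
    character_agents = train_character_agents_pg(card_picker, make_env("hard"))
    return card_picker, character_agents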

I mentioned the original game environment had roughly 120 HP spread out over 3 waves and dealt between 1–3 damage per turn. I adjusted the environment to have increasing HP and increasing damage per wave. My thought was that since the bots were able to get by without really having to try in the original game setup, if I made the game a lot harder the bots wouldn’t be able to coast their way through and would need to actually start learning useful information in order to win. See a sample of what a harder environment looks like:

Round 1: 40 HP, 1–3 damage
Round 2: 60 HP, 2–5 damage
Round 3: 80 HP, 4–6 damage

So instead of the 120 HP of the previous environment, there is an extra 60 HP that the bots need to fight through, and if they get stuck on the later waves for long periods of time they will probably die.
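
As a simple illustration, the harder wave setup can be encoded as something like the following (numbers from the table above; the real environment tracks more state than this):

HARD_WAVES = [
    {"hp": 40, "damage_range": (1, 3)},   # wave 1
    {"hp": 60, "damage_range": (2, 5)},   # wave 2
    {"hp": 80, "damage_range": (4, 6)},   # wave 3
]

total_hp = sum(wave["hp"] for wave in HARD_WAVES)   # 180 HP vs the original 120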

The bots’ performance in this harder environment doesn’t actually look great: they start out at around a 35% win rate with a trained card picker and randomly initialized character bots (the win rate for 4 random bots is around 3%) and end up at around 54% after maybe 100K games. This isn’t the result we necessarily care about, though, since I can then drop these bots back into the original, easier game environment that more closely mirrors an FGO farming level, where they win 88–90% of the time.

Bots saved their damage buffs for wave 3 and dropped them to help clear the harder final wave content.

Closing Thoughts

I have not really had many occasions to simultaneously train multiple neural networks that all need to learn and cooperate to be successful. Most of the time, if there are multiple networks involved, it is in the form of an ensemble where everything can be trained separately. So a lot of the underlying work I had to do here was ironing out my mental model of how I wanted all of this to function and isolating it down to the fewest moving parts at a given point in time. Not the most unique lesson learned, but a good one to relearn every now and then.

With my current training pipeline, what I see is that the networks learn strategies that are reliable but maybe not the most fun or interesting. For instance, many training runs end up using Ishtar’s second skill at the beginning of either wave 1 or 2, because it lets the team instantly clear that wave and minimizes the amount of early damage the team takes.

Same gif as above

Pure policy gradient bots: across multiple training runs, a fairly consistent behavior was that the bots would use Ishtar’s (middle character) second skill to instantly clear the first wave. For the bots I think this is a more stable strategy that at least gives them the opportunity to get to the late game, rather than saving that skill for a point when they may or may not still be alive.

When I play, I tend to save those sorts of abilities for round 3. However, I think this is because I, as a player, have a pretty good understanding that I am not likely to die before round 3 on the missions I am playing. The bots don’t really have the same mental guarantee that they won’t die, so if they don’t make use of their skills as early as possible they may never get the chance to use them at all. This could also be a lack of long-term planning getting built into the bots, based on my training protocols, environment, modeling, etc. So for the bot teams, over thousands of games, using Ishtar’s second skill to clear early waves instead of saving it seems like a reliable way to play, since it at least gives them the chance to reach the endgame versus risking dying along the way.

So now that I have ironed out more of the training process for these bots, I am in a better spot to start training up some additional agents for different characters and to work on incorporating more interesting team compositions.

