Tuning a Multi-Task Pytorch Network on Fate Grand Order
In a previous post I did some multi-task learning in Keras (here), and after finishing that one I wanted to do a follow-up post on multi-task learning in Pytorch. This was mostly because I thought it would be a good exercise to build it in another framework, but in this post I will also go through a bit of extra tuning I did after building the model, which I didn't cover when I built the Keras based model.
I also used this model as part of a facial similarity pipeline I built in another series of posts (part 1 and part 2).
As a quick recap, a multi-task model is a single model optimized to solve a series of usually related problems. Mechanically this is done by feeding the output from some core part of the model pipeline into a series of output "heads" whose losses can be scored and combined, typically by addition, and the network can then adjust its weights against that summed total loss.
Something I have noticed training a number of these is that it is a little harder to focus on improving a single task head, since the loss function used for optimization is the sum of the individual loss functions. However, with some early experimentation on the Pytorch networks I found that traditional tuning strategies are still reasonably effective.
The basic tuning strategy I used is the one outlined by Andrew Ng in his online lectures: in most machine learning algorithms there is a trade-off between variance and bias, but with neural networks this trade-off isn't as much of a worry because we have separate mechanisms to address each side. If the network is underfitting you can add computational capacity, and if it is overfitting you can apply regularization like dropout or batch normalization. By balancing the application of these two mechanisms you can tune your networks.
For this post all of the art images are captioned with output from the final batch normalized pytorch model.
![Gender: Female, Region: Europe, Fighting Style: melee, Alignment: LG, Main Colors: ['Silver', 'Gold', 'Blue']. Two characters: both are from Europe, use melee weapons, and are LG in alignment, and the colors seem pretty good.](https://towardsdatascience.com/wp-content/uploads/2019/03/1XdSKAHCyXZdMs3Z2XWhAxg.jpeg)
Dataset and Pipeline
![Gender: Female, Region: Asia, Fighting Style: melee, Alignment: CG, Main Colors: ['Red', 'Black', 'White']. This character is actually European and has a Chaotic Neutral alignment, but I figured I should include it to show that the model puts out alignments besides LG and NE.](https://towardsdatascience.com/wp-content/uploads/2019/03/1FnsVxOXvZD_gJD44ZpT7XA.jpeg)
The dataset for this project is the same as in my previous Keras based multi-task learning post: around 400 images of characters from the mobile phone game Fate Grand Order (FGO), covering roughly 40 different characters, with 26 different labels to create a multi-label style dataset.
The categories cover areas like the gender of the character, the region they are from, their fighting style, the main colors of the image, and the character's alignment (Lawful Good, True Neutral, Chaotic Evil, and so on). For more detail feel free to check out that previous post.
The only other real modification I had to make was a custom Pytorch dataset class that takes in a series of lists and outputs an image and the 5 target vectors for the model. Pytorch makes it easy to take their Dataset class and modify it as needed, typically by just writing your own __init__, __getitem__, and __len__ functions. Most of my modifications came in the __getitem__ section, where I specify how to read in an image and get the corresponding targets from the list of target lists I cheekily called "king_of_lists".
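As a rough sketch, here is what that Dataset class can look like; the class name, field names, and transform handling are my own guesses (only king_of_lists comes from the notebook):

from PIL import Image
import torch
from torch.utils.data import Dataset

class FGODataset(Dataset):  # hypothetical name for illustration
    def __init__(self, image_paths, king_of_lists, transform=None):
        self.image_paths = image_paths
        self.king_of_lists = king_of_lists  # list of 5 target lists, one per task
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # read in the image and apply any preprocessing/augmentation
        image = Image.open(self.image_paths[idx]).convert('RGB')
        if self.transform:
            image = self.transform(image)
        # pull the matching target vector out of each task's list
        targets = [torch.tensor(task[idx]) for task in self.king_of_lists]
        return image, targets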

![Gender: Female, Region: Europe, Fighting Style: melee, Alignment: NE, Main Colors: ['Silver', 'Gold', 'Purple']. Similar to the Keras model, it looks like it has learned the relationship between color and character alignment, in the sense that characters with evil alignments tend to be drawn with darker colors in FGO.](https://towardsdatascience.com/wp-content/uploads/2019/03/11RG-Cy1y5HrwPLxpAwKcRw.png)
Building the Base Pytorch Model
For this one my first step was to start with a very basic model: a Resnet50 backbone that feeds its output into 5 output heads.
Building custom networks in Pytorch is pretty straightforward: you initialize the layers and anything that needs to be optimized in __init__, then define how data flows through the model in the forward method. For this use case all I really do is initialize that core model (resnet50) and then feed its output into each of the 5 heads I created (y1o, y2o, y3o, y4o, y5o). These are then the outputs of the model, rather than the standard single output you would normally see.
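As a rough sketch of that architecture (the class name and per-head label counts are placeholders, and the sigmoid on the color head is my assumption based on how its loss is computed later):

import torch
import torch.nn as nn
from torchvision import models

class MultiTaskNet(nn.Module):  # hypothetical name
    def __init__(self, n_gender, n_region, n_fight, n_align, n_color):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        # everything except the final fc layer becomes the shared backbone
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # one head per task, all fed from the same 2048-d backbone features
        self.y1o = nn.Linear(2048, n_gender)
        self.y2o = nn.Linear(2048, n_region)
        self.y3o = nn.Linear(2048, n_fight)
        self.y4o = nn.Linear(2048, n_align)
        self.y5o = nn.Linear(2048, n_color)

    def forward(self, x):
        x = self.backbone(x).flatten(1)
        # sigmoid on the color head since it is multi-label
        return [self.y1o(x), self.y2o(x), self.y3o(x), self.y4o(x),
                torch.sigmoid(self.y5o(x))]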

To see the training loop feel free to check out the notebooks (here); it is largely just me modifying standard Pytorch networks. The fun part, however, is how the loss function is calculated. I thought this would be more complicated to actually do… but it's actually super easy. Basically I just calculate the loss for each of the heads (loss0, loss1, loss2, loss3, loss4), add them together, and then use that total for the backpropagation.
# the first four heads are single-label: torch.max(..., 1)[1] converts
# the one-hot targets into the class indices CrossEntropyLoss expects
loss0 = criterion[0](outputs[0], torch.max(gen.float(), 1)[1])    # gender
loss1 = criterion[1](outputs[1], torch.max(reg.float(), 1)[1])    # region
loss2 = criterion[2](outputs[2], torch.max(fight.float(), 1)[1])  # fighting style
loss3 = criterion[3](outputs[3], torch.max(ali.float(), 1)[1])    # alignment
# the color head is multi-label, so it keeps the full float target vector
loss4 = criterion[4](outputs[4], color.float())
# sum the per-head losses and backpropagate against the total
loss = loss0 + loss1 + loss2 + loss3 + loss4
loss.backward()
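For this to work, criterion is a list of 5 loss functions, one per head. Given the target handling above (class indices for the first four heads, float vectors for color), it presumably looks something like:

import torch.nn as nn

criterion = [nn.CrossEntropyLoss(), nn.CrossEntropyLoss(),
             nn.CrossEntropyLoss(), nn.CrossEntropyLoss(),
             nn.BCELoss()]  # color is multi-label, so binary cross entropy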
I trained this basic model for 50 epochs with an Adam optimizer at a learning rate of 0.0001 that decayed over time, and saved the version with the best validation loss. A sketch of that setup is below, followed by the scores across all 5 tasks.
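Something like the following, where the step size and decay factor are my guesses rather than the notebook's exact values:

import torch.optim as optim
from torch.optim import lr_scheduler

optimizer = optim.Adam(model.parameters(), lr=1e-4)
# decay the learning rate as training progresses; these settings are illustrative
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)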
Validation
gender_Acc: 0.9390
region_acc: 0.6707
fighting_acc: 0.8537
alignment_acc: 0.7439
color_acc: 0.8384
Overall not terrible, but also not great. In general, though, it does outperform my previous post's Keras VGG based network.

The base Pytorch model generally outperforms the Keras model. This makes sense given that the Keras model used a VGG19 backbone with frozen weights, while the Resnet in this round is being fine-tuned on the dataset.
![Gender: Male, Region: Middle East, Fighting Style: magic, Alignment: LG, Main Colors: ['White', 'Blue', 'Purple']. This one was good to see in the Pytorch model, because the Keras one would always label this character as "Female", so this is another area where the new Pytorch model performs well.](https://towardsdatascience.com/wp-content/uploads/2019/03/1FrU5C3RHs3KZg-nzv6RQew.png)
This is a decent start, but I think I can improve the all-around model performance. Basically, I believe the Pytorch model is still underfitting. In response I added two more dense layers of size 256 between the backbone and the output heads. The snippet below shows the modifications: basically just adding two layers, x1 and x2, of size 256.
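A sketch of what that modification can look like, building on the earlier model sketch (the ReLU activations and resizing of the heads to 256 inputs are my assumptions):

import torch
import torch.nn as nn
from torchvision import models

class DeeperMultiTaskNet(nn.Module):  # hypothetical name
    def __init__(self, n_gender, n_region, n_fight, n_align, n_color):
        super().__init__()
        resnet = models.resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # the two new dense layers of size 256
        self.x1 = nn.Linear(2048, 256)
        self.x2 = nn.Linear(256, 256)
        # the heads now read from the 256-d output of x2
        self.y1o = nn.Linear(256, n_gender)
        self.y2o = nn.Linear(256, n_region)
        self.y3o = nn.Linear(256, n_fight)
        self.y4o = nn.Linear(256, n_align)
        self.y5o = nn.Linear(256, n_color)

    def forward(self, x):
        x = self.backbone(x).flatten(1)
        x = torch.relu(self.x1(x))
        x = torch.relu(self.x2(x))
        return [self.y1o(x), self.y2o(x), self.y3o(x), self.y4o(x),
                torch.sigmoid(self.y5o(x))]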

After training this new model there was an increase in both training and validation accuracy, and overall better performance compared to the base Keras and Pytorch models I built.

On one hand we could call it a day, but something else I noticed is that while the model was performing better overall, it was now overfitting the training set. See the scores below for this second model.
Training
gender_Acc: 0.9877
region_acc: 0.9417
fighting_acc: 0.9509
alignment_acc: 0.7577
color_acc: 0.8723
Validation
gender_Acc: 0.9390
region_acc: 0.9146
fighting_acc: 0.8537
alignment_acc: 0.7439
color_acc: 0.8506
Now that it was overfitting, I figured I could add some regularization to try and counteract it. This took some tinkering, but I found that batch normalization with a relatively high parameter value was useful in this situation; on this run I ended up using 2e-1.
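In the model this amounts to following each of the new dense layers with a batch normalization layer. I'm reading the 2e-1 as the momentum argument to BatchNorm1d, whose Pytorch default is 1e-1; that interpretation, and the ReLUs, are my assumptions:

import torch.nn as nn

# the shared trunk after the backbone, now with batch norm after each dense layer
trunk = nn.Sequential(
    nn.Linear(2048, 256),
    nn.BatchNorm1d(256, momentum=2e-1),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.BatchNorm1d(256, momentum=2e-1),
    nn.ReLU(),
)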

After this third round, the model with added batch normalization showed a fairly large ~10 percentage point increase over the previous best accuracy on alignment, which I think is the hardest category, and a 5 percentage point increase on fighting style. However, it showed decreases on gender and color, and it tied on region of origin. So overall I would say this round had mixed success, but it did help on a hard category.

Training
gender_Acc: 0.9816
region_acc: 0.9540
fighting_acc: 0.9632
alignment_acc: 0.8926
color_acc: 0.8129
Validation
gender_Acc: 0.9146
region_acc: 0.9146
fighting_acc: 0.9024
alignment_acc: 0.8537
color_acc: 0.7912
Something to note is that while I added batch norm to try and reduce overfitting, the gap between training and validation is similar to before… However, the model's predictions seem to generalize better than before, which is also one of the goal outcomes of adding regularization.
![Gender: Female, Region: Middle East, Fighting Style: ranged, Alignment: NE, Main Colors: ['Blue', 'Gold', 'Black']. This character has an LG alignment and I don't quite see the blue here, but the rest of the details are pretty good.](https://towardsdatascience.com/wp-content/uploads/2019/03/16UKHAZq2j4Z1Z5pJnWV6kg.jpeg)
Conclusion and Wrapping Up
I think this simple tuning process is a good indicator that the tactics you would use on a normal network can still be applied to a multi-task model. One difficulty is that it is harder to target a specific deficiency in the multi-task model; here I was just targeting larger overarching issues (overfitting or underfitting across all heads). Adding additional layers and nodes on individual heads is an option, but those then become additional hyperparameters you need to tune.
The final two models performed fairly similarly: the deeper base network performs better on color and slightly better on gender, the two tie on region, and the batch normalization model performs better on fighting style and alignment.

This raises the question of which model to choose. You could compute something like an F-beta score per task and combine the results into a single metric across all the different tasks. Having that singular metric makes sense if you are aiming to have a single best model.
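As a rough sketch with scikit-learn (the placeholder data and the choices of beta and averaging here are mine):

import numpy as np
from sklearn.metrics import fbeta_score

# per-task true and predicted labels (placeholder data for illustration)
y_true = [np.array([0, 1, 1, 0]) for _ in range(5)]
y_pred = [np.array([0, 1, 0, 0]) for _ in range(5)]

# beta < 1 weights precision more heavily; macro-average within each task
task_scores = [fbeta_score(t, p, beta=0.5, average='macro')
               for t, p in zip(y_true, y_pred)]
combined = sum(task_scores) / len(task_scores)  # one number to rank models by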
If you are open to using multiple models, another option would be to take the well performing models and ensemble them to make predictions. This would let you leverage the better performing areas of each of the different models. I think that would be viable in this case since one model does better on the alignment category while the second does slightly better across a number of other categories.
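A rough sketch of that kind of per-head averaging (the helper name is hypothetical, and the softmax/sigmoid handling follows my earlier model sketch):

import torch

def ensemble_predict(models, image_batch):
    # average each head's probabilities across the ensembled models
    with torch.no_grad():
        all_outputs = [m(image_batch) for m in models]
    averaged = []
    for head in range(5):
        if head < 4:
            # single-label heads output logits, so softmax them first
            probs = [torch.softmax(out[head], dim=1) for out in all_outputs]
        else:
            # the color head already outputs sigmoid probabilities
            probs = [out[head] for out in all_outputs]
        averaged.append(torch.stack(probs).mean(dim=0))
    return averaged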
In another situation where you have a lower performing task (here, color doesn't do great), you might have some success adding specialized models to the ensemble to try and boost performance in those areas.
![Gender: Female, Region: Europe, Fighting Style: magic, Alignment: LG, Main Colors: ['Blue', 'White', 'Silver']. 15th anniversary photo for the character, only thing wrong with the predicted outputs I would say is the fighting style... there is a sword stuck in the ground but the model says she uses magic](https://towardsdatascience.com/wp-content/uploads/2019/03/1yzolF7ccktxsqWGqOdFuLA.jpeg)