End to End ResNet

Now we have to tackle the meat of the entire project. Everything that came before this was set up to allow training to happen seamlessly – downloading the dataset, formatting everything for entry into the network, creating data transforms for image manipulation, and creating dataloaders that feed the images to the network in a structured way.
If you are just joining, I’ve made a Colab notebook that lets you run the entire code (one click) and, in the end, allows you to upload a picture for testing. This particular dataset and network are designed to differentiate between dogs and cats – but you can upload whatever picture you want to generate an output.
It is actually pretty instructive to do so, especially for any future work you may want to do. Knowing the underlying data, its issues and biases is critical to interpreting the results. Every dataset has bias and is imperfect, which doesn’t necessarily imply that it’s useless.
Here are links to the previous parts:
End to End Adaptation of ResNet in Google Colab – Part 1
End to End Adaptation of ResNet in Google Colab – Part 2: Hardware & Dataset Setup
End to End Adaptation of ResNet in Google Colab – Part 3: Image Pre-Processing
Let’s first unpack what’s going on in train_model below:
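Here is the full function for reference. This is a sketch reconstructed along the lines of PyTorch’s official transfer learning tutorial [3]; it assumes that dataloaders, dataset_sizes, and device were defined in the earlier parts, and your notebook’s version may differ in small details:

import copy
import torch

def train_model(model, criterion, optimizer, scheduler, num_epochs=10):
    # Assumes dataloaders, dataset_sizes and device exist from earlier parts
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print(f'Epoch {epoch}/{num_epochs - 1}')

        # Each epoch has a training phase and a validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()   # training mode (dropout/batchnorm active)
            else:
                model.eval()    # evaluation mode

            running_loss = 0.0
            running_corrects = 0

            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)   # move the batch onto the GPU
                labels = labels.to(device)

                optimizer.zero_grad()        # start fresh for this mini-batch

                # Track gradients only during the training phase
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    if phase == 'train':
                        loss.backward()      # calculate gradients
                        optimizer.step()     # update the weights

                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            if phase == 'train':
                scheduler.step()             # decay the learning rate

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]
            print(f'{phase} loss: {epoch_loss:.4f} acc: {epoch_acc:.4f}')

            # Keep a copy of the best-performing weights seen so far
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_model_wts)    # return the best model
    return model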
I pasted the entire function above just to give an overview of how ‘large’ it is; we will now break it down in parts.
A Few Definitions
Epoch – one complete presentation of the underlying data to the model (er, architecture, as I told myself I would say).
Training Phase – If you’ll recall, we split the dataset randomly with a 75/25 ratio (80/20 and 70/30 are also common), assigning 75% of the images to the training set and 25% to the validation set. Doing so randomly is important: if the pictures were stored in order, you could end up over-representing dogs or cats in the training set, and as a result the architecture would be better ‘trained’ on one animal than the other.
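As an illustration, here is a minimal, hypothetical sketch of such a random split using torch.utils.data.random_split – a stand-in dataset replaces the real cats-vs-dogs dataset built in the earlier parts:

import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset: 100 fake 'images' with random cat/dog labels
full_dataset = TensorDataset(torch.randn(100, 3, 224, 224),
                             torch.randint(0, 2, (100,)))

n_train = int(0.75 * len(full_dataset))   # 75% for training
train_set, val_set = random_split(
    full_dataset, [n_train, len(full_dataset) - n_train])
print(len(train_set), len(val_set))       # 75 25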
Validation Phase – The 25% that we assigned to the validation set should not be used for training. This is your best way of showing new data to the network to see whether it has learned well. Like teaching a child addition: you wouldn’t present the exact same problem again, but a similar one, to see if the child has learned to add.
Optimizer Zero Grad – This isn’t a ‘vocabulary’ term per se, but it is important to know. When gradients are computed for a mini-batch, PyTorch accumulates them on top of the previous gradients by default (which you do not want between training iterations; you want to start fresh). Calling the zero_grad function clears the old gradients so that each mini-batch’s gradients are calculated from scratch, as the short demonstration below shows.
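To see the accumulation behavior for yourself, here is a tiny standalone demonstration (not part of the project code):

import torch

w = torch.tensor([1.0], requires_grad=True)

(w * 2).sum().backward()
print(w.grad)      # tensor([2.])

(w * 2).sum().backward()
print(w.grad)      # tensor([4.]) – the new gradient was ADDED to the old one

w.grad.zero_()     # what optimizer.zero_grad() does for every parameter
print(w.grad)      # tensor([0.])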
The Heavy Lifting
Most of the heavy lifting is done here:
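(The snippet below is excerpted from the train_model sketch above.)

for phase in ['train', 'val']:
    if phase == 'train':
        model.train()    # training mode
    else:
        model.eval()     # evaluation mode

    for inputs, labels in dataloaders[phase]:
        inputs = inputs.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()

        with torch.set_grad_enabled(phase == 'train'):
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, labels)

            if phase == 'train':
                loss.backward()
                optimizer.step()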
The first if/else statement sets the model to train or evaluation mode (a PyTorch feature). By default, models are in train mode. Interestingly enough, this setting doesn’t matter much if you are not using dropout layers or batch normalization – though note that ResNet18 does use batch normalization, so it matters here – and following the convention is important for future-proofing your code (and how you approach PyTorch) in any case.
You then iterate over the data using the dataloaders, and the "inputs.to(device)" and "labels.to(device)" lines move each batch of images and labels onto the GPU. We discussed the ability to use high-end GPUs through Colab earlier.
With torch.set_grad_enabled(phase == ‘train’) – the expression in parentheses being true only if the phase is indeed ‘train’ – you allow gradients to be calculated during training. loss.backward() (again, only during training) calculates the gradients, and optimizer.step() then updates all the weights using those freshly calculated gradients.
Now, if the phase is NOT train, then set_grad_enabled becomes False, gradients are not calculated, and nothing is updated. The model only evaluates the input images (which is exactly what you want during validation).
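A quick standalone illustration of this switch (again, not part of the project code):

import torch

x = torch.ones(1, requires_grad=True)

with torch.set_grad_enabled(False):   # validation-style: no gradient tracking
    y = x * 2
print(y.requires_grad)                # False

with torch.set_grad_enabled(True):    # training-style: gradients are tracked
    y = x * 2
print(y.requires_grad)                # True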
The ‘scheduler.step()’ call is a tiny line but critical to obtaining lower error rates. Decaying the learning rate is a key part of hyperparameter tuning: you want large learning rates in the beginning, to be efficient, and as training proceeds you want to decrease the learning rate. Imagine trying to find the lowest point in a field. You want large steps initially so you don’t miss any valleys and don’t get stuck in a small corner of the map. But once you find a valley, you don’t want to keep bouncing off its walls because your steps are too large. Making your steps (here, the learning rate) smaller allows you to home in on the lowest point more efficiently.
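You can watch this shrinking-step behavior directly. The sketch below uses a toy model just to have some parameters; step_size=7 and gamma=0.1 are the values from the PyTorch tutorial [3], not requirements:

import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(3, 2)   # toy model, only to give the optimizer parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = StepLR(optimizer, step_size=7, gamma=0.1)

for epoch in range(15):
    # ... one epoch of training would happen here ...
    optimizer.step()                       # scheduler.step() comes after this
    scheduler.step()
    print(epoch, scheduler.get_last_lr())  # the lr drops 10x every 7 steps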
Finally, the accuracies are calculated, and if this epoch’s validation accuracy is better than the best accuracy seen so far (initially set to 0), the best_acc variable is updated, a copy of the model weights is saved, and the whole process repeats for the next epoch.
Model Download & Parameter Determination
True to the namesake, we are using a ResNet18 model. The fully connected layer (model_ft.fc) is replaced so that it has two final outputs (cat vs. dog). We then load the model onto the GPU and define the loss function as CrossEntropyLoss.
A discussion of this loss function is beyond an introductory series, but suffice it to say that it is very commonly used for classification (not regression) models and is well covered by many other posts and videos.
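Still, a tiny example gives the flavor: CrossEntropyLoss takes the network’s raw outputs (logits) and the true class indices, and the loss is lower when the logits favor the correct class. The values here are made up for illustration:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Dummy logits for a batch of 4 images over 2 classes (0 = cat, 1 = dog)
logits = torch.tensor([[2.0, 0.1],     # leans 'cat'
                       [0.2, 1.5],     # leans 'dog'
                       [0.5, 0.4],     # uncertain
                       [3.0, -1.0]])   # strongly 'cat'
labels = torch.tensor([0, 1, 1, 0])    # ground-truth classes

print(criterion(logits, labels).item())   # mean loss over the batch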
The optimizer (SGD – stochastic gradient descent) is also something I encourage you to read about (or watch YouTube videos on).
Finally, the learning rate scheduler – as discussed above, we want the learning rate to get smaller as the epoch number gets larger, to allow for finer tuning of the architecture’s weights and a lower loss.
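Putting those three pieces together, the setup likely looks something like the following. This is a sketch following the standard PyTorch transfer learning recipe [3]; the exact hyperparameter values (lr, momentum, step_size, gamma) are the tutorial’s defaults, not necessarily what the notebook uses:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import models

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Download a ResNet18 pretrained on ImageNet and replace the final layer
model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features     # 512 for ResNet18
model_ft.fc = nn.Linear(num_ftrs, 2)   # two outputs: cat vs dog
model_ft = model_ft.to(device)         # load the model onto the GPU

criterion = nn.CrossEntropyLoss()

# Stochastic gradient descent with momentum
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)

# Decay the learning rate by a factor of 10 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)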
Now we put all of it together and let it run:
model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler, num_epochs=10)
I recommend you run at least 5-10 epochs to convince yourself there is a stopping point that makes sense; iterating over a hundred epochs brings profoundly diminishing returns.
If you are reading the above and find it to be simplistic, I applaud you for being light-years ahead of me. However, if there is some part above that doesn’t make sense, please leave a comment and I will connect you to an appropriate resource.
The goal of this series is not a deep dive into neural networks – I am not qualified to do that. The goal is to convince you that this architecture is accessible to you: you can run it on high-end compute and upload your own image for testing at the end – and I have followed that up with an overview of what’s going on under the hood.
Next time – uploading a picture and the gritty details surrounding that.
References
[1] Deep Learning with PyTorch. Accessed October 2020.
[2] Neural Networks with PyTorch. Accessed October 2020.
[3] Transfer Learning for Computer Vision. Accessed October 2020.
[4] Kaggle API. Accessed October 2020.