How do pre-trained models work?

Dipam Vasani
Towards Data Science
5 min read · Mar 16, 2019


Introduction

In most of my deep learning projects, I’ve used pre-trained models. I’ve also mentioned that it is generally a good idea to start with them instead of training from scratch. In this article, I’ll explain why that is, and in the process help you understand most of the code snippets. At the end of the article, I will also talk about a technique in computer vision that helps improve the performance of your model.

Let’s get started.

Training neural networks

When we train a neural network, the initial layers of the network can identify really simple things. Say a straight line or a slanted one. Something really basic.

As we go deeper into the network, we can identify more sophisticated things.

Layer 2 can identify shapes like squares or circles.

Layer 3 can identify intricate patterns.

And finally, the deepest layers of the network can identify things like dog faces. The network can identify these things because the weights of our model have been set to certain values during training.

ResNet34 is one such model. It is trained on ImageNet to classify images into 1,000 categories.

The intuition for using pretrained models

Now think about this. If you want to train a classifier, any classifier, the initial layers are going to detect slanted lines no matter what you want to classify. Hence, it does not make sense to train them from scratch every time you create a neural network.

It is only the final layers of our network, the ones that learn to identify classes specific to your project, that need training.

Hence, what we do is take a ResNet34, remove its final layers, add some layers of our own (randomly initialized), and train this new model.

Before we look at how we do this in code, I’d like to mention that pretrained models are usually trained on large amounts of data and using resources that aren’t usually available to everyone. Take ImageNet for example. It contains over 14 million images, with 1.2 million of them assigned to one of 1,000 categories. Hence it would be really beneficial for us to use these models.

The code

Now let’s take a look at how we do this in fastai. We start by loading a pretrained model.
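As a minimal sketch, assuming the fastai v1 API; the folder path, image size, and batch size below are placeholder assumptions, with the data laid out as one sub-folder of images per class:

from fastai.vision import *

# hypothetical dataset location: one sub-folder per class
path = Path('data/my_images')
data = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2, size=224, bs=32).normalize(imagenet_stats)

# cnn_learner (create_cnn in earlier fastai v1 releases) downloads ResNet34
# weights pretrained on ImageNet, removes the final layers, and attaches a
# new, randomly initialized head sized for our classes
learn = cnn_learner(data, models.resnet34, metrics=accuracy)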

Initially, we only train the added layers. We do so because the weights of these layers are initialized to random values and need more training than the ResNet layers. Hence we freeze the ResNet and only train the rest of the network.
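Continuing the sketch above: cnn_learner freezes the pretrained layers by default, so fitting at this point updates only the new head.

# the ResNet body is frozen by default; this trains only the added layers
learn.freeze()           # explicit here for clarity
learn.fit_one_cycle(4)   # 4 epochs with the one-cycle schedule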

Once we’ve trained the last layers a little, we can unfreeze the layers of ResNet34. We then find a good learning rate for the whole model and train it.
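A sketch of that step with the same fastai v1 learner:

learn.unfreeze()       # make every layer group trainable again
learn.lr_find()        # sweep learning rates over a short trial run
learn.recorder.plot()  # plot loss against learning rate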

Our plot of loss against learning rate looks as follows.

We don’t want our loss to increase. Hence, we choose a learning rate just before the point where the graph starts to rise (1e-04 here). The other option, and the one I have used, is to select a slice of learning rates.

This means that if we had only 3 layers in our network, the first would train at a learning rate of 1e-6, the second at 1e-5, and the last one at 1e-4. Frameworks usually divide the layers of a network into groups, and in that case slicing means different groups train at different learning rates.

We do this because we don’t want to update the values of our initial layers a lot, but we want to update our final layers by a considerable amount. Hence the slice.
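In fastai v1, passing a slice as max_lr is what spreads the learning rates across the layer groups, smallest for the earliest group and largest for the last. Continuing the sketch:

# the earliest layer group trains at 1e-6, the final group at 1e-4,
# and groups in between get values spread across that range
learn.fit_one_cycle(4, max_lr=slice(1e-6, 1e-4))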

This concept of training different parts of a neural network at different learning rates is known as using discriminative learning rates, and it is a relatively recent idea in deep learning.

We continue the process of unfreezing layers, finding a good learning rate and training some more till we get good results.

Finally, pretrained models are available not just for computer vision applications but also for other domains such as natural language processing.

We can now move on to tricks for computer vision projects.

Progressive image resizing

One trick to improve the performance of your computer vision model is to train a model on lower-resolution images (for example, size = 128) and use those weights as the initial values for training on higher-resolution images (size = 256, then 512, and so on). This trick is known as progressive image resizing; a rough sketch of the loop follows below. I used it in one of my projects and the performance of my model increased by a good 2%.
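Here is that loop as a sketch, reusing the hypothetical path and fastai v1 setup from earlier; the sizes and epoch counts are only placeholders:

# stage 1: train on small (128 px) images
data_128 = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2, size=128, bs=64).normalize(imagenet_stats)
learn = cnn_learner(data_128, models.resnet34, metrics=accuracy)
learn.fit_one_cycle(4)

# stage 2: keep the learned weights, but feed larger (256 px) images
data_256 = ImageDataBunch.from_folder(path, train='.', valid_pct=0.2, size=256, bs=32).normalize(imagenet_stats)
learn.data = data_256
learn.freeze()     # retrain the head at the new resolution first
learn.fit_one_cycle(4)
learn.unfreeze()   # then fine-tune the whole network
learn.fit_one_cycle(4, max_lr=slice(1e-6, 1e-4))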

Now an increase from 92% to 94% may not sound like such a big deal, but if we are dealing with medical applications we want to be as accurate as possible. And it’s these small tricks that separate the good models from the competition-winning models. Read this research paper to learn about more such tricks. Not all of them will work every time, so you really have to experiment with what works and what doesn’t until you get a feel for it.

That will be it for this article.

If you want to learn more about deep learning, check out my series of articles on the topic:

~Happy learning.
