How Transfer Learning works

“Transfer Learning will be the next driver of ML success” — Andrew Ng.

HAFEEZ JIMOH
Towards Data Science


Transfer learning is the process of taking a pre-trained neural network and adapting it to a new, different dataset by transferring or repurposing the learned features. For example, we can take a model trained on ImageNet and use the learned weights in that model to initialize the training and classification of an entirely new dataset. In research work published by Sebastian Thrun and his team, 129,450 clinical skin cancer images were classified using pre-trained models, and the approach achieved excellent results: performance on par with experts on two tasks, skin cancer identification and identification of the deadliest skin cancers. This example shows that artificial intelligence is indeed capable of classifying skin cancer with a level of competence comparable to dermatologists. The CEO of DeepMind has this to say about transfer learning:

“I think transfer learning is the key to general intelligence. And I think the key to doing transfer learning will be the acquisition of conceptual knowledge that is abstracted away from perceptual details of where you learned it from” - Demis Hassabis (CEO, DeepMind)

Transfer learning is also particularly useful when computing resources are limited. Many state-of-the-art models take several days, and in some cases weeks, to train even on highly powerful GPU machines. Thus, in order not to repeat such a long training process, transfer learning allows us to use the pre-trained weights as a starting point.

Learning from scratch is hard to do, and it is difficult to reach the same performance level as with a transfer learning approach. To put this in perspective, I used the AlexNet architecture to train a dog breed classifier from scratch and achieved 10% accuracy after 200 epochs. The same AlexNet pre-trained on ImageNet, used with frozen weights, achieved 69% accuracy in 50 epochs.

Transfer learning often involves keeping the pre-trained weights in the first layers, which capture features that are general to many datasets, and randomly initializing the last layers before training them for the new classification task. Thus, in this approach, learning (backpropagation) occurs only in the last layers that were initialized with random weights. There are, however, several approaches to transfer learning, and the approach we use depends on the nature of the new dataset we want to classify relative to the dataset the model was pre-trained on. There are four main scenarios or cases of transfer learning.

Transfer Learning Scenarios

  1. New dataset that is small and similar to the original training dataset
  2. New dataset that is small but different from the original training dataset
  3. New dataset that is large and similar to the original training dataset
  4. New dataset that is large but different from the original training dataset

What do we mean by similar dataset?

Images are similar if they share similar features. E.g., images of cats and dogs would be considered similar since they contain similar characteristics like eyes, faces, hair, legs, etc. A dataset of plant images would be different from a dataset of animal images and would not be considered similar.

A large dataset would contain on the order of a million images. A small dataset would generally be in the range of a few thousand images.

AlexNet Architecture

I will use AlexNet to demonstrate this procedure. The architecture of AlexNet is as shown below.
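If you want to inspect the architecture yourself, a quick way (assuming PyTorch and torchvision are installed) is to load the model and print it:

```python
from torchvision import models

# Load AlexNet; pretrained=True also downloads the ImageNet weights
model = models.alexnet(pretrained=True)

# Printing the model lists the convolutional ("features") part and the
# fully connected ("classifier") part of the architecture
print(model)
```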

Case 1: Small and Similar Dataset

If the new data set is small and similar to the original training data:

  • remove the end of the fully connected neural network
  • add a new fully connected layer that has an output dimension equal to the number of classes in the new dataset
  • randomize the weights of the new fully connected layer
  • freeze all the weights from the pre-trained network
  • train the network to update the weights of the new fully connected layer

All the convolutional layers of the pre-trained model are kept constant, i.e., frozen, because the images are similar and the pre-trained network already extracts the relevant higher-level features. Since the new dataset is small, retraining these layers would have a tendency to overfit it. Thus, the weights of the original pre-trained model are held constant and not retrained.

The visualization of this approach is as follows:

Transfer Learning approach for a small and similar dataset

The code demonstration of this approach is as follows:
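Here is a minimal PyTorch sketch of this case, assuming torchvision's AlexNet and a 133-class dog breed dataset as in the example above (the optimizer and learning rate are illustrative):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load AlexNet pre-trained on ImageNet
model = models.alexnet(pretrained=True)

# Freeze all the pre-trained weights so no learning takes place in them
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer: 1000 ImageNet classes -> 133 new classes.
# The new layer is randomly initialized and is the only part that will be trained.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 133)

# Only the parameters of the new layer are passed to the optimizer
optimizer = optim.SGD(model.classifier[6].parameters(), lr=0.001, momentum=0.9)
```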

Setting requires_grad to False freezes the pre-trained weights so that no learning takes place in them and their values are held constant.

Replacing the final fully connected layer changes the output size of the network. If we compare it to the original AlexNet architecture, we observe that the output size is 1000, but our new dataset has an output size of 133 in this case.

Case 2: Small and Different Dataset

If the new data set is small and different from the original training data, the approach is as follows:

  • remove the end of the fully connected neural network and some of the CNN layers at the end of the network
  • add a new fully connected layer that has an output dimension equal to the number of classes in the new dataset
  • randomize the weights of the new fully connected layer
  • freeze all the weights from the remaining pre-trained CNN network
  • train the network to update the weights of the new fully connected layer

In this case, we note that the dataset is small but different. Because our datasets are images, we keep the beginning of the network, which extracts lower-level features that generalize well, and remove the CNN layers just prior to the fully connected layers, which extract higher-level features specific to the original dataset. As before, retraining on a small dataset has a tendency to overfit, so the weights of the remaining pre-trained layers are held constant and not retrained.
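A sketch of this case in PyTorch might look as follows. Where exactly to cut the convolutional stack (and therefore the input size of the new fully connected layer) is a design choice; the numbers below assume AlexNet with 224x224 inputs, cut after the second pooling layer:

```python
import torch.nn as nn
from torchvision import models

# Load AlexNet pre-trained on ImageNet
model = models.alexnet(pretrained=True)

# Keep only the earlier convolutional layers; drop the later ones that
# extract higher-level features specific to the original dataset
features = nn.Sequential(*list(model.features.children())[:6])

# Freeze the remaining pre-trained convolutional weights
for param in features.parameters():
    param.requires_grad = False

# New, randomly initialized fully connected layer for the 133 new classes.
# 192 * 13 * 13 is the feature-map size after the second pooling layer
# for 224x224 inputs.
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(192 * 13 * 13, 133),
)

new_model = nn.Sequential(features, classifier)
```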

The visualization of this approach is as follows:

Case 3: Large and Similar Dataset

If the new data set is large and similar to the original training data, the approach is as follows:

  • remove the end of the fully connected neural network
  • add a new fully connected layer that has an output dimension equal to the number of classes in the new dataset
  • randomize the weights of the new fully connected layer
  • initialize the rest of the network with the weights from the pre-trained network
  • train the entire network to update the weights of all the layers

Since the new dataset is similar to the original training data, the higher-layer features are not removed from the pre-trained network. And because the dataset is large, overfitting is not a major concern; therefore, we can re-train all of the weights.

Here is how to visualize this approach:

Transfer Learning approach for a large and similar dataset
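A PyTorch sketch of this case, again assuming a 133-class target dataset:

```python
import torch.nn as nn
from torchvision import models

# Start from the ImageNet pre-trained weights
model = models.alexnet(pretrained=True)

# Replace the final fully connected layer: 1000 ImageNet classes -> 133 new classes
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 133)

# No freezing here: requires_grad stays True by default, so every layer is
# fine-tuned and the pre-trained weights only serve as the starting point.
```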

Replacing the final fully connected layer modifies the output size of the network from 1000 to 133. The network is initialized with the pre-trained weights, and learning occurs across all layers of the network.

We do not need to set requires_grad to False here. Its default value is True, so the new network simply uses the pre-trained weights as the starting point for training.

Case 4: Large and Different Dataset

If the new data set is large and different from the original training data, the approach is as follows:

  • remove the end of the fully connected neural network and add a new fully connected layer that has an output dimension equal to the number of classes in the new dataset
  • randomize the weights of the new fully connected layer and initialize the rest of the network with random weights
  • train the network to update the weights of all the layers

In this case, the CNN layers are mostly retrained from scratch, but we could just as well initialize them with the pre-trained weights.

Here is how to visualize this approach:

Transfer Learning approach for large and different dataset
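A PyTorch sketch of this case (again with an illustrative 133-class dataset):

```python
import torch.nn as nn
from torchvision import models

# Load AlexNet without the pre-trained ImageNet weights,
# so the whole network starts from random weights
model = models.alexnet(pretrained=False)

# Final fully connected layer sized for the new dataset
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 133)
```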

In this case, when loading our model, we simply specify that we do not need the pre-trained weights, so it is closer to training from scratch. Nevertheless, it is still usually better not to start with completely random weights, and starting with pre-trained weights is more common in practice and in the literature. Alternatively, we can initialize the weights with approaches like Xavier initialization, uniform distribution initialization, or just constant-value initialization.
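For example, a small hypothetical helper that applies Xavier initialization to the model above could look like this:

```python
import torch.nn as nn

def init_weights(layer):
    # Xavier (Glorot) uniform initialization for conv and linear layers
    if isinstance(layer, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(layer.weight)
        if layer.bias is not None:
            nn.init.zeros_(layer.bias)

# Recursively apply the initializer to every submodule of the model
model.apply(init_weights)
```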

So anytime you have a lot of images to classify, why not try transfer learning first? You can then begin to think about how to tweak the model and apply different transformation techniques to improve its performance.

Thanks for reading all along.

Some useful resources

  1. Model Zoo: this site contains a lot of pre-trained models you can download and use for your dataset. This GitHub page is another useful resource with collections of pre-trained models.
  2. This repository contains a dog breed classifier project I did using transfer learning.
  3. PyTorch tutorial on transfer learning can be found here
  4. Tensorflow tutorial on transfer learning can be found here
  5. Keras tutorial on transfer learning can be found here.

References

  1. Esteva, Andre, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun. “Dermatologist-level classification of skin cancer with deep neural networks.” Nature 542, no. 7639 (2017): 115–118.
  2. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” In Advances in neural information processing systems, pp. 1097–1105. 2012.
  3. https://www.deeplearningwizard.com/deep_learning/boosting_models_pytorch/weight_initialization_activation_functions/
