
Intermediate Deep Learning with Transfer Learning

A practical guide for fine-tuning Deep Learning models for computer vision and natural language processing

Getting started with Deep Learning is easy. You can have a neural network set up and training within just a few lines of code. But it can become overwhelming when you go from a beginner to an intermediate level. You are confronted with many new terms like "EfficientNet" or "DeBERTa". What are all these terms? What do they have to do with neural networks?

The beginner tutorials don’t tell you that only a few people train neural networks from scratch in practice. This is because neural networks take a lot of time and computational resources to train on a sufficiently large dataset to perform well. Hence, re-using a pre-trained model as a starting point is a common practice. This practice is called Transfer Learning [2].


Prerequisites

This article will guide you from a beginner to an intermediate level in the field of Deep Learning. Thus, it assumes you already have some basic understanding of Deep Learning concepts.

Disclaimer

This article is inspired by four resources on best practices in Deep Learning I have recently come across [1, 4, 9, 11] and summarizes their key points in article format. None of the original ideas in this article are mine – instead, view this article as study notes.


A Brief Introduction to Transfer Learning

Transfer Learning describes the practice of re-using a pre-trained neural network instead of training one from scratch to save time and computational resources. Thus, we are transferring the pre-learned weights and knowledge from one task to another. The pre-trained models are also called backbones.

Transfer learning is practical when your dataset is small or similar to the one your backbone was trained on [2]. But it never hurts to use transfer learning [8], even if your dataset is sufficiently large or your task is different from the one your backbone was trained on, because usually, the first few layers contain generic information.

While you can freeze the transferred model weights and only retrain the classifier, it is common to fine-tune the weights of the whole model. Both options are sketched below.
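For illustration, here is a minimal PyTorch sketch of both options, assuming a recent torchvision is installed and using a ResNet-18 backbone with a placeholder 10-class head:

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 backbone with pre-trained ImageNet weights
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Option 1: freeze the transferred weights and only train a new classifier head
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)  # new head; 10 classes is a placeholder

# Option 2: full fine-tuning – keep every parameter trainable
# (simply skip the freezing loop above and replace the head as before)
```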

Some popular backbones for Computer Vision (CV) problems currently are:

  • ResNet (Residual Network) [6]
  • DenseNet [7]
  • EfficientNet [12]

Similarly, some popular backbones for natural language processing (NLP) problems are:

  • BERT (Bidirectional Encoder Representations from Transformers) [3]
  • RoBERTa (Robustly Optimized BERT Pretraining Approach) [10]
  • DeBERTa (Decoding-enhanced BERT with disentangled attention) [5]

Of course, Transfer Learning is only possible due to researchers sharing their model checkpoints for the benefit of others [2].


The main steps of approaching a CV or NLP problem with Deep Learning are:

  1. Building a baseline
  2. Increasing complexity and improving the performance in small increments
  3. Squeezing the last bits of performance

Step 1: Building a Baseline

As a first step, you should set up a baseline to which you will compare any experiments you run in the second step. You should not spend too much time on this step. Just make sure you have a good enough working baseline.

Backbone

The backbone describes the pre-trained model you are using as your starting point. Don’t spend too much time choosing the perfect backbone for your baseline. You should switch out the backbone in the second step and experiment with others anyway (see Model complexity) – so just pick one. How about EfficientNet for CV and DeBERTa for (English) NLP problems [1, 9, 11]?
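For example, both suggested backbones can be loaded in a few lines. The sketch below assumes the timm and transformers libraries are available; the model names and the number of classes/labels are placeholders you would adapt to your problem:

```python
import timm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# CV: a pre-trained EfficientNet from timm with a fresh classification head
cv_backbone = timm.create_model("efficientnet_b0", pretrained=True, num_classes=10)

# NLP: a pre-trained DeBERTa from Hugging Face with a fresh classification head
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
nlp_backbone = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2
)
```

Swapping the backbone later in Step 2 then mostly comes down to changing the model name string.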

Batch size

Decide on a batch size at this stage of the model development. Don’t change the batch size unless necessary because it requires starting the tuning process all over again [4, 11].

Usually, increasing the batch size will increase the training speed. Thus, it is common practice to use the biggest batch size in powers of two (e.g., 16, 32, 64, etc.) that fits in the available memory [4].
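A rough way to find that batch size is to keep doubling it until a forward/backward pass no longer fits in GPU memory. The sketch below assumes a CV model and a recent PyTorch version (for torch.cuda.OutOfMemoryError); the input shape and the upper limit are placeholders:

```python
import torch

def find_max_batch_size(model, input_shape=(3, 224, 224), start=16, limit=4096, device="cuda"):
    """Double the batch size until a forward/backward pass no longer fits in GPU memory."""
    model = model.to(device)
    batch_size, max_that_fits = start, None
    while batch_size <= limit:
        try:
            dummy = torch.randn(batch_size, *input_shape, device=device)
            model(dummy).sum().backward()        # simulate one training step
            model.zero_grad(set_to_none=True)
            max_that_fits = batch_size
            batch_size *= 2                      # try the next power of two
        except torch.cuda.OutOfMemoryError:      # requires a recent PyTorch version
            torch.cuda.empty_cache()
            break
    return max_that_fits
```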

Number of training steps (epochs)

Number of epochs – You don’t require many additional training steps for fine-tuning a pre-trained model. Usually, 5 epochs is a good starting point for the initial baseline [11].

Early stopping – While early stopping has been popular in the past [5], it is not so much anymore [1, 4, 11]. Early stopping is a technique that stops training when the validation metric no longer improves to avoid overfitting to the training data. However, this technique can cause you to overfit to and leak information from the validation set [1, 11].

Additionally, early stopping always requires a validation set, so you won’t be able to retrain the final model on the whole dataset for an extra performance boost [1] (see Retraining on the full dataset). For these reasons, early stopping is not recommended.

Best checkpoint picking – There is no common understanding of the best practices regarding best checkpoint picking. Best checkpoint picking is the practice of training the neural network for a fixed number of steps (without early stopping) and then selecting the checkpoint with the best validation metric. While the Deep Learning Tuning Playbook [4] recommends best checkpoint picking, Kaggle Grandmasters don’t recommend it because it requires the use of a validation set [1, 11] (see Retraining on the full dataset).

Optimizer and its learning rate

Optimizer – The Deep Learning Tuning Playbook [4] recommends starting with either…

  • Stochastic gradient descent (SGD) or
  • Adam.

However, Kaggle Grandmasters shared that SGD requires more tuning effort to achieve performance similar to Adam’s, and thus they recommend Adam [1]. Specifically, they recommend AdamW as the optimizer [1].

Learning rate – While you would need to start with a larger learning rate when training a neural network from scratch, for fine-tuning, we can set a smaller initial learning rate of, e.g., 1e-3 [11]. This is because we assume that the weights of the pre-trained model are already good and should not be changed too much too quickly [2].

If you are unsure what learning rate to choose, you could also check the paper of the model architecture you are using to find out what learning rate they used [9].

Learning rate scheduler – While it is recommended to start with a constant learning rate for the initial baseline when training a neural network from scratch [4], you should use a learning rate scheduler for your initial baseline when fine-tuning [1, 4, 11].

As with the optimizer, people like to argue about the best learning rate scheduler. While it was popular to reduce the learning rate when the validation metric reached a plateau, this approach depends on a validation set [1]. Kaggle Grandmasters recommend the cosine annealing learning rate scheduler [1, 11]; a minimal setup is sketched below.
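Putting the optimizer and scheduler recommendations together, a minimal PyTorch sketch could look like the following; the model, number of epochs, and steps per epoch are placeholders for your own training setup:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(10, 2)            # placeholder model
epochs, steps_per_epoch = 5, 100          # placeholders for your training setup

optimizer = AdamW(model.parameters(), lr=1e-3)   # small initial LR for fine-tuning
scheduler = CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch)

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        # forward pass, loss computation, and loss.backward() go here
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()                  # decay the learning rate every step
```

Calling scheduler.step() once per batch, with T_max set to the total number of steps, gives one smooth cosine decay over the whole fine-tuning run.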

For NLP, you could also keep your learning rate constant if you are fine-tuning for only a few epochs and your initial learning rate is already small [1].

Bells and whistles

Every month, there is a fancy new technique you can apply to your training pipeline. But for the sake of the baseline, it is recommended to keep it as simple as possible and add fancy features later [4].

Regularizing vs. overfitting to the validation set

A few years ago, early stopping and reducing the learning rate on a plateau were considered regularization techniques to avoid overfitting of the model to the training set [8]. Today, these techniques are considered to be leaky in terms of overfitting to the validation set [1].

Additionally, any technique dependent on a validation set prevents us from re-training the model on our final configuration on the entire dataset, which can help increase the model’s performance.

Step 2: Increasing Complexity and Improving the Performance in Small Increments

You should spend most of your time in this second step. In this step, you will run many experiments with adjustments made to your baseline model. You can also add all the bells, whistles, and whatever technique is hot that month to your model. Just add them in small increments to be able to evaluate their impact.

For this step, it is recommended to have some sort of experiment tracking system in place, whether with pen and paper or an experiment tracking tool.

Hyperparameter tuning

Theoretically, you could maximize your model’s performance by running an automated hyperparameter optimization algorithm over the entire search space of possible hyperparameters. But this is not practical. Thus, you should first define a search space by running a few initial experiments [4].

The most important hyperparameters to tune are the learning rate and the number of epochs. Kaggle Grandmasters recommend the following ranges as starting points [11]:

  • Learning rate: 1e-4 to 1e-3
  • Epochs: 2 to 10

For NLP, you will need only a few epochs because the models are better pre-trained than in CV.

Because we already have reduced ranges for the hyperparameter search spaces, we can use an algorithm to automate hyperparameter tuning [4, 8]. It is recommended to prefer random search or bayesian optimization over grid search [4, 8].
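As an example, here is a hedged sketch using the Optuna library (one possible choice; any hyperparameter-optimization library works similarly). The helper train_and_validate is hypothetical and stands in for your own training routine that returns a validation metric:

```python
import optuna

def objective(trial):
    # Sample from the reduced search spaces recommended above
    lr = trial.suggest_float("lr", 1e-4, 1e-3, log=True)
    epochs = trial.suggest_int("epochs", 2, 10)
    # train_and_validate is a hypothetical helper returning the validation metric
    return train_and_validate(lr=lr, epochs=epochs)

# TPE, a form of Bayesian optimization, is Optuna's default sampler
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```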

Data augmentations

A Deep Learning model’s performance heavily relies on the amount of data. Thus, increasing the amount of data with augmented data can help improve your model’s performance [1, 8, 11].

Common data augmentations in CV are (a torchvision sketch of the simpler ones follows the list):

  • Horizontal or vertical flipping
  • Rotating
  • Resizing
  • Random cropping
  • Shifting
  • Mixup
  • Cutout
  • Cutmix
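
The simpler geometric augmentations from the list above can be added with a few torchvision transforms; the probabilities, angles, and sizes below are placeholders:

```python
from torchvision import transforms

# Basic geometric augmentations for training images (values are placeholders)
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224),                     # random crop + resize
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # shifting
    transforms.ToTensor(),
])
```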

You can find the implementations of cutout, mixup, and cutmix in PyTorch in this article:

Cutout, Mixup, and Cutmix: Implementing Modern Image Augmentations in PyTorch

For NLP, data augmentations that work in CV, like random cropping, resizing, or cutout, seem to work on text data as well [9]. But not all CV data augmentation techniques translate directly to text data, so NLP also has augmentation techniques of its own (a simplified masking sketch follows the list):

  • Back translation: Translating text to another language and then back to the original language
  • Masked Entity Language Modeling (MELM) [13]: Randomly masking a percentage of tokens in a sentence (similar to cutout in CV [9])
  • Replacing: Replacing words in a sentence with synonyms.
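
As a simplified illustration of the masking idea, the sketch below randomly replaces a fraction of words with a mask token. MELM proper goes further and uses a pre-trained masked language model to fill the masks with plausible replacements:

```python
import random

def random_mask(sentence: str, mask_ratio: float = 0.15, mask_token: str = "[MASK]") -> str:
    """Randomly replace a fraction of words with a mask token (simplified illustration)."""
    words = sentence.split()
    n_to_mask = max(1, int(len(words) * mask_ratio))
    for idx in random.sample(range(len(words)), n_to_mask):
        words[idx] = mask_token
    return " ".join(words)

print(random_mask("Transfer learning re-uses a pre-trained model as a starting point"))
```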

Model complexity

The best-suited backbone will be different for every problem. Thus, you should try a few different backbones [1, 11].

Garbage in, garbage out

Any ML model stands or falls with the quality of data you feed it. Thus, it is essential to experiment with different configurations for your model’s training pipeline and review the training data. A good approach is to review which samples your model is predicting well and which it is performing poorly on.

If you visualize these samples, you will make your life much easier [9]. This will give you a hint about possible issues in the data or training pipeline.
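One simple way to find these samples is to rank the validation set by per-sample loss and inspect the worst cases. The sketch below assumes a classification model and a non-shuffling PyTorch DataLoader, so the returned indices match the dataset order:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def worst_samples(model, dataloader, device="cuda", k=20):
    """Return the indices of the k validation samples with the highest loss."""
    model.eval().to(device)
    losses = []
    for inputs, targets in dataloader:   # dataloader must not shuffle
        logits = model(inputs.to(device))
        loss = F.cross_entropy(logits, targets.to(device), reduction="none")
        losses.append(loss.cpu())
    losses = torch.cat(losses)
    return losses.topk(k).indices        # inspect and visualize these samples manually
```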

Step 3: Squeezing the Last Bits of Performance

Once you reach the end of experimentation (e.g., deadline or satisfactory performance), you can add some finishing touches and squeeze out the last bits of performance of your model.

Retraining on the full dataset

Deep Learning models are data-hungry. That’s why you can boost your model’s performance if you retrain your model with the final training configuration on the full dataset [1, 11].

Ensembles

While big ensembles were fashionable a few years ago, today smaller ensembles of up to three models are in vogue [1]. When ensembling models, you could try the following strategies (a simple prediction-averaging sketch follows the list):

  • Combine models with the same training configuration but with different seeds
  • Combine different models with a lot of diversity (e.g., different backbones)
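
A minimal way to ensemble either strategy is to average the predicted probabilities of the individual models, as in this sketch (the models and inputs are placeholders):

```python
import torch

@torch.no_grad()
def ensemble_predict(models, inputs):
    """Average the softmax probabilities of several models (seed or backbone ensemble)."""
    probs = [model(inputs).softmax(dim=-1) for model in models]
    return torch.stack(probs).mean(dim=0)
```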

Adding more bells and whistles

If you are still unsatisfied with your results, try pseudo-labeling [9] or test time augmentations [11] and see if you can squeeze out that last bit of performance.
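Test time augmentation, for example, averages predictions over several augmented views of the same input. A minimal sketch with a single horizontal-flip view might look like this; more views can be added the same way:

```python
import torch

@torch.no_grad()
def tta_predict(model, images):
    """Average predictions over the original and the horizontally flipped images."""
    views = [images, torch.flip(images, dims=[-1])]   # original + horizontal flip
    probs = [model(view).softmax(dim=-1) for view in views]
    return torch.stack(probs).mean(dim=0)
```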

Conclusion

If you have just finished your "Introduction to Deep Learning" course and want to level up your skills, here is the high-level recipe to approach any Deep Learning problem.

Start with a simple baseline:

  • Backbone: Just one to start with – you should experiment with others anyway: EfficientNet for CV and DeBERTa for NLP
  • Batch size (fixed): Find the biggest power-of-two batch size (e.g., 16, 32, 64) that fits in the available memory, and then don’t change it unless really necessary.
  • Number of training steps: 5 epochs without early stopping
  • Optimizer (fixed): AdamW
  • Learning rate: 1e-3
  • Learning rate scheduler (fixed): Cosine Annealing

Set up an experiment tracking system and get cracking:

  • Hyperparameter tuning: Start with tuning the learning rate (0.0001–0.001) and the number of epochs (2–10). Use random search or Bayesian optimization for automated hyperparameter tuning.
  • Data augmentation: The more, the merrier. Mixup, cutout, and cutmix are especially popular because of their effectiveness.
  • Backbone: Try different backbones and model complexities
  • Bells and whistles: In this step, you can go wild with trying every hot new trend in Deep Learning and checking if it will improve your model’s performance

To squeeze out the last bit of performance, you should select up to three of your best training configurations, retrain a model on the entire dataset, and ensemble them.


Enjoyed This Story?

Subscribe for free to get notified when I publish a new story.


Find me on LinkedIn, Twitter, and Kaggle!

References

[1] S. Bhutani with H20.ai (2023). Best Practises for Training ML Models | @ChaiTimeDataScience #160 presented on YouTube in January 2023.

[2] CS231n Convolutional Neural Networks for Visual Recognition (2023). Transfer Learning (accessed February 3rd, 2023)

[3] J. Devlin, M. W. Chang, K. Lee & K. Toutanova (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[4] V. Godbole, G. E. Dahl, J. Gilmer, C. J. Shallue and Z. Nado (2023). Deep Learning Tuning Playbook (Version 1.0) (accessed February 3rd, 2023)

[5] P. He, X. Liu, J. Gao, & W. Chen (2020). Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654.

[6] K. He, X. Zhang, S. Ren, & J. Sun (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

[7] G. Huang, Z. Liu, L. Van Der Maaten & K. Q. Weinberger (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700–4708).

[8] A. Karpathy (2019). A Recipe for Training Neural Networks (accessed February 3rd, 2023)

[9] D. Kłeczek with Munich NLP (2023). Ten Proven Techniques to Improve Your NLP Model Training (accessed February 9th, 2023)

[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen & V. Stoyanov (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

[11] P. Singer and Y. Babakhin (2022). Practical Tips for Deep Transfer Learning presented at Kaggle Days Paris in November 2022.

[12] M. Tan, & Q. Le (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105–6114). PMLR

[13] R. Zhou, X. Li, R. He, L. Bing, E. Cambria, L. Si, & C. Miao (2021). MELM: Data augmentation with masked entity language modeling for low-resource NER. arXiv preprint arXiv:2108.13655.

