Tips and Tricks

Transformers – Hello, we meet again. We have a date, don’t we, RoBERTa?
If you have read and followed through with my earlier post, Transformers, can you rate the complexity of reading passages?, that is great! It means you are most probably already familiar with the basics of fine-tuning or training a Transformer. If you have not seen that post, you may visit the link below.
Transformers, can you rate the complexity of reading passages?
So, how’s your model doing? Does it achieve reasonably good results? Or does your Transformer model suffer from poor performance and instability? If so, the root cause is often difficult to diagnose. Such issues are usually more prevalent with large models and small datasets. The nature and characteristics of the data and the downstream task can also play a part.
If your Transformer is not performing up to your expectations, what can you do? You may try hyperparameter tuning. In addition, you may try some of the advanced training techniques that I am going to cover in this post. These techniques can be used for fine-tuning Transformers such as BERT, ALBERT, RoBERTa, and others.
Contents
1. Layer-wise Learning Rate Decay (LLRD)
2. Warm-up Steps
3. Re-initializing Pre-trained Layers
4. Stochastic Weight Averaging (SWA)
5. Frequent Evaluation
Results
Summary
For all the advanced fine-tuning techniques covered in this post, we will use the same model and dataset as in Transformers, can you rate the complexity of reading passages?
In the end, we will be able to compare the results of basic fine-tuning with those obtained by applying the advanced fine-tuning techniques.
1. Layer-wise Learning Rate Decay (LLRD)
In Revisiting Few-sample BERT Fine-tuning, the authors describe layer-wise learning rate decay as "a method that applies higher learning rates for top layers and lower learning rates for bottom layers. This is accomplished by setting the learning rate of the top layer and using a multiplicative decay rate to decrease the learning rate layer-by-layer from top to bottom".
A similar concept, called discriminative fine-tuning, is also expressed in Universal Language Model Fine-tuning for Text Classification.
"Discriminative fine-tuning allows us to tune each layer with different learning rates instead of using the same learning rate for all layers of the model"
All this makes sense, as different layers in a Transformer model usually capture different kinds of information. Bottom layers often encode more common, general, and broad-based information, while the top layers closer to the output encode information that is more localized and specific to the task at hand.
Before we go into the implementation, let’s quickly recap the basic fine-tuning we did in Transformers, can you rate the complexity of reading passages?
On a roberta-base model that consists of one embeddings layer and 12 hidden layers, we used a linear scheduler and set an initial learning rate of 1e-6 (that is 0.000001) in the optimizer. As depicted in Figure 1, the scheduler created a schedule with a learning rate that linearly decreases from 1e-6 to zero across training steps.

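For reference, that basic setup can be sketched as follows. The bare roberta-base model here stands in for the custom regression model from the earlier post, and train_steps is just a placeholder value, so treat this as an illustrative sketch rather than the exact code.

import torch
import transformers

# A minimal sketch of the basic configuration: a single learning rate of 1e-6
# for all parameters and a linear schedule that decays it to zero (no warm-up).
model = transformers.AutoModel.from_pretrained("roberta-base")  # stand-in for the custom model
train_steps = 1000  # placeholder: epochs * batches per epoch
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
scheduler = transformers.get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=train_steps,
)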
Now, to implement layer-wise learning rate decay (or discriminative fine-tuning), there are two possible ways of doing it.
The first way follows the method described in Revisiting Few-sample BERT Fine-tuning. We choose a learning rate of 3.5e-6 for the top layer and use a multiplicative decay rate of 0.9 to decrease the learning rate layer-by-layer from top to bottom. This results in the bottom layers (embeddings and layer 0) having a learning rate roughly close to 1e-6. We do this in a function called roberta_base_AdamW_LLRD.
Okay, we have set the learning rates for the hidden layers. How about the pooler and regressor head? For them, we choose 3.6e-6, a learning rate that is slightly higher than the top layer.
In the code below, head_params, layer_params, and embed_params are dictionaries defining the parameters, learning rates, and weight decays that we want to optimize. All these parameter groups are passed to the AdamW optimizer, which is returned by the function.
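The full helper is in the linked notebook; a minimal sketch of what roberta_base_AdamW_LLRD could look like is shown below. The weight-decay split (0.01 for weights, 0.0 for biases and LayerNorm) and the parameter-name matching are assumptions; only the learning rates follow the scheme described above.

from torch.optim import AdamW

def roberta_base_AdamW_LLRD(model):
    # Sketch of layer-wise learning rate decay for roberta-base.
    named_params = list(model.named_parameters())
    no_decay = ("bias", "LayerNorm.weight")

    def group(keyword, lr):
        # Two parameter groups per block: decayed weights, and
        # non-decayed biases / LayerNorm weights.
        return [
            {"params": [p for n, p in named_params
                        if keyword in n and not any(nd in n for nd in no_decay)],
             "lr": lr, "weight_decay": 0.01},
            {"params": [p for n, p in named_params
                        if keyword in n and any(nd in n for nd in no_decay)],
             "lr": lr, "weight_decay": 0.0},
        ]

    opt_parameters = []

    # Pooler and regressor head: slightly higher than the top layer.
    head_params = group("pooler", 3.6e-6) + group("regressor", 3.6e-6)
    opt_parameters += head_params

    # Hidden layers 11 -> 0: start at 3.5e-6 and multiply by 0.9 per layer.
    lr = 3.5e-6
    for layer in range(11, -1, -1):
        layer_params = group(f"encoder.layer.{layer}.", lr)
        opt_parameters += layer_params
        lr *= 0.9

    # Embeddings end up at roughly 1e-6 (3.5e-6 * 0.9**12).
    embed_params = group("embeddings", lr)
    opt_parameters += embed_params

    return AdamW(opt_parameters, lr=3.5e-6)

The optimizer returned here is then passed to the same linear scheduler as before.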
Below is what a linear schedule with layer-wise learning rate decay looks like:

The second approach of implementing layer-wise learning rate decay (or discriminative fine-tuning) is to group layers into different sets and apply different learning rates to each. We will refer to this as grouped LLRD.
Using a new function, roberta_base_AdamW_grouped_LLRD, we split the 12 hidden layers of the roberta-base model into 3 sets, with embeddings attached to the first set.
- Set 1: Embeddings + Layers 0, 1, 2, 3 (learning rate: 1e-6)
- Set 2: Layers 4, 5, 6, 7 (learning rate: 1.75e-6)
- Set 3: Layers 8, 9, 10, 11 (learning rate: 3.5e-6)
As in the first approach, we use 3.6e-6 for the pooler and regressor head, a learning rate that is slightly higher than the top layer.
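A sketch of what roberta_base_AdamW_grouped_LLRD could look like is below. Weight decay is omitted for brevity and the substring matching on parameter names is an assumption, but the set boundaries and learning rates follow the list above.

from torch.optim import AdamW

def roberta_base_AdamW_grouped_LLRD(model):
    # Sketch of grouped LLRD: 12 hidden layers split into three sets,
    # with the embeddings attached to the first set.
    set_2 = [f"encoder.layer.{i}." for i in range(4, 8)]    # layers 4-7
    set_3 = [f"encoder.layer.{i}." for i in range(8, 12)]   # layers 8-11

    opt_parameters = []
    for name, param in model.named_parameters():
        if "pooler" in name or "regressor" in name:
            lr = 3.6e-6        # head: slightly higher than the top set
        elif any(k in name for k in set_3):
            lr = 3.5e-6        # Set 3: layers 8-11
        elif any(k in name for k in set_2):
            lr = 1.75e-6       # Set 2: layers 4-7
        else:
            lr = 1e-6          # Set 1: embeddings + layers 0-3
        opt_parameters.append({"params": [param], "lr": lr})

    return AdamW(opt_parameters, lr=3.6e-6)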
Below is what a linear schedule with grouped LLRD looks like:

2. Warm-up Steps
For the linear scheduler that we used, we can apply warm-up steps. For example, applying 50 warm-up steps means the learning rate will increase linearly from 0 to the initial learning rate set in the optimizer during the first 50 steps (warm-up phase). After that, the learning rate will start to decrease linearly to 0.

The following plot shows each layer’s learning rate at step 50. These are the learning rates we set in the optimizer.

To apply warm-up steps, pass the num_warmup_steps parameter to the get_scheduler function.
scheduler = transformers.get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=50,
    num_training_steps=train_steps,
)
Alternatively, you may also use get_linear_schedule_with_warmup.
scheduler = transformers.get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=50,
    num_training_steps=train_steps,
)
3. Re-initializing Pre-trained Layers
Fine-tuning a Transformer is a breeze since we are using pre-trained models. It means we are not training one from scratch, which could take substantial resources and time. These models have usually been pre-trained on a large corpus of text data, and they contain pre-trained weights that we can use. However, to achieve better fine-tuning results, sometimes we need to discard some of these weights and re-initialize them during fine-tuning.
So how do we do this? Earlier, we talked about different layers of the Transformer capturing different kinds of information. The bottom layers usually encode more general information. These are useful, and so we want to preserve these low-level representations. What we want to refresh are the top layers closer to the output. They are layers that encode information more specific to the pre-training task, and now we want them to adapt to ours.
We can do this in the MyModel class that we created previously. When initializing the model, we pass in a parameter that specifies the top n layers to re-initialize. You may ask, why n? It turns out that choosing an optimal value for n is crucial and can lead to faster convergence. That is, how many of the top layers should be re-initialized? Well, it depends, as every model and dataset is different. For our case, the optimal value for n is 5. You may start to experience deteriorating results if you re-initialize more layers beyond the optimal point.
In the code below, we re-initialize the weights of nn.Linear modules with mean 0 and a standard deviation defined by the model’s initializer_range, and re-initialize the weights of nn.LayerNorm modules to 1. Biases are re-initialized to 0.
As seen in the code, we also re-initialize the pooler layer. If you’re not using the pooler in your model, you may omit the part relating to it in _do_reinit.
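The full MyModel code is in the linked notebook. As a rough illustration, the same re-initialization logic could be written as a standalone helper like the hypothetical reinit_top_layers below, assuming a Hugging Face RobertaModel is passed in.

import torch.nn as nn
from transformers import AutoModel

def reinit_top_layers(roberta, n_layers=5):
    # Hypothetical standalone version of _do_reinit: re-initialize the pooler
    # and the top n_layers encoder layers of a Hugging Face RoBERTa model.
    std = roberta.config.initializer_range

    # Pooler: normal(0, initializer_range) for weights, zeros for biases.
    # Omit this block if your model does not use the pooler.
    roberta.pooler.dense.weight.data.normal_(mean=0.0, std=std)
    roberta.pooler.dense.bias.data.zero_()

    # Top n encoder layers.
    for layer in roberta.encoder.layer[-n_layers:]:
        for module in layer.modules():
            if isinstance(module, nn.Linear):
                module.weight.data.normal_(mean=0.0, std=std)
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.LayerNorm):
                module.weight.data.fill_(1.0)
                module.bias.data.zero_()

# Example: re-initialize the top 5 layers of roberta-base.
model = AutoModel.from_pretrained("roberta-base")
reinit_top_layers(model, n_layers=5)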
4. Stochastic Weight Averaging (SWA)
Stochastic Weight Averaging (SWA) is a deep neural network training technique presented in Averaging Weights Leads to Wider Optima and Better Generalization. According to the authors,
"SWA is extremely easy to implement and has virtually no computational overhead compared to the conventional training schemes"
So, how does SWA work? As stated in this PyTorch blog, SWA comprises two ingredients:
- First, it uses a modified learning rate schedule. For example, we can use the standard decaying learning rate strategy (such as the linear schedule that we are using) for the first 75% of training time and then set the learning rate to a reasonably high constant value for the remaining 25% of the time.
- Second, it takes an equal average of the weights of the networks traversed. For example, we can maintain a running average of the weights obtained at the end of each epoch within the last 25% of training time. After training is complete, we then set the weights of the network to the computed SWA averages.
How to use SWA in PyTorch?
"In torch.optim.swa_utils we implement all the SWA ingredients to make it convenient to use SWA with any model. In particular, we implement AveragedModel class for SWA models, SWALR learning rate scheduler, and update_bn utility function to update SWA batch normalization statistics at the end of training."
Source: PyTorch blog
SWA is easy to implement in PyTorch. You may refer to the sample code below, provided in the PyTorch documentation, for implementing SWA.

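The embedded snippet is not reproduced here, but it is along the lines of the example in the PyTorch documentation, sketched below. model, loader, and loss_fn are placeholders for your own network, training DataLoader, and loss function, and the concrete numbers are illustrative rather than the ones we use.

import torch
from torch.optim.swa_utils import AveragedModel, SWALR

# `model`, `loader`, and `loss_fn` are placeholders.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
swa_model = AveragedModel(model)              # keeps the running average of weights
swa_start = 160
swa_scheduler = SWALR(optimizer, swa_lr=0.05)

for epoch in range(300):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        optimizer.step()
    if epoch > swa_start:
        swa_model.update_parameters(model)    # accumulate the SWA average
        swa_scheduler.step()                  # constant SWA learning rate
    else:
        scheduler.step()                      # regular schedule before swa_start

# Update batch-norm statistics for the averaged model, then use swa_model for inference.
torch.optim.swa_utils.update_bn(loader, swa_model)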
To implement SWA in our run_training function, we take in a parameter for swa_lr. This parameter is the SWA learning rate, which is set to a constant value. In our case, we will use 2e-6 for swa_lr.
Because we want to switch to the SWA learning rate schedule and start collecting SWA averages of the parameters at epoch 3, we assign 3 to swa_start.
For each fold, we initialize the swa_model and swa_scheduler along with the data loader, model, optimizer, and scheduler. swa_model is the SWA model that accumulates the averages of the weights.
Next, we loop through the epochs, calling train_fn and passing it the swa_model, swa_scheduler, and a boolean indicator, swa_step. This indicator tells the program to switch to the swa_scheduler at epoch 3.
In train_fn, the swa_step parameter passed in from the run_training function controls the switch to SWALR and the updates of the parameters of the averaged model, swa_model.
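As a rough sketch of how these pieces could fit together (the actual run_training and train_fn in the notebook also handle cross-validation folds, evaluation, and checkpointing, and the signatures here are assumptions):

import torch
from torch.optim.swa_utils import AveragedModel, SWALR

def run_training(model, optimizer, scheduler, train_loader,
                 epochs=5, swa_lr=2e-6, swa_start=3):
    # Sketch only: initialize the SWA model and scheduler, then switch to them
    # from epoch `swa_start` onward.
    swa_model = AveragedModel(model)
    swa_scheduler = SWALR(optimizer, swa_lr=swa_lr)

    for epoch in range(1, epochs + 1):
        swa_step = epoch >= swa_start    # True from epoch 3 onward
        train_fn(model, optimizer, scheduler, train_loader,
                 swa_model, swa_scheduler, swa_step)
    return swa_model

def train_fn(model, optimizer, scheduler, train_loader,
             swa_model, swa_scheduler, swa_step):
    model.train()
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if not swa_step:
            scheduler.step()                  # regular linear schedule, stepped per batch

    if swa_step:
        swa_model.update_parameters(model)    # accumulate the running weight average
        swa_scheduler.step()                  # hold the SWA learning rate constant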
The sweet thing about SWA is that we can use it with any optimizer and most schedulers. With our linear schedule with LLRD, we can see from Figure 6 how the learning rate remains constant at 2e-6 after switching to the SWA learning rate schedule at epoch 3.

Below is what a linear schedule looks like after implementing SWA on grouped LLRD with 50 warm-up steps:

You can read more details about SWA on this PyTorch blog and on this PyTorch documentation.
5. Frequent Evaluation
Frequent evaluation is another technique worth exploring. It simply means that instead of validating once per epoch, we perform validation every x batches of training data within the epoch. This requires a small change to the structure of our code, as currently the training and validation functions are separate and each is called once per epoch.
What we will do is create a new function, train_and_validate. For each epoch, run_training will then call this new function instead of separately calling train_fn and validate_fn.

Inside train_and_validate, we run the model training code for each batch of training data. For validation, however, validate_fn is called only every x batches of training data. Thus, if x is 10 and we have 50 batches of training data, there will be 5 validations per epoch.
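A stripped-down sketch of train_and_validate is shown below; the real version in the notebook also handles the scheduler, SWA, and checkpointing, and the signatures here (including that of the existing validate_fn) are assumptions.

def train_and_validate(model, optimizer, loss_fn, train_loader, valid_loader,
                       validate_every=10):
    # Train on every batch; validate every `validate_every` training batches.
    model.train()
    for step, (inputs, targets) in enumerate(train_loader, start=1):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

        if step % validate_every == 0:
            valid_loss = validate_fn(model, valid_loader, loss_fn)  # existing validation function
            print(f"step {step}: train loss {loss.item():.4f}, valid loss {valid_loss:.4f}")
            model.train()   # switch back to training mode after validation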

Results
All right, the exciting part is here… the results!
It’s pretty amazing how much improvement these techniques can contribute. The results are displayed in the table below.
The mean RMSE score improves from 0.589 with basic fine-tuning to 0.5199 after applying all the advanced techniques covered in this post.
Summary
In this post, we went through the various techniques used for fine-tuning Transformers.
☑️ First, we used layer-wise learning rate decay (LLRD). The main idea behind LLRD is to have different learning rates applied to each layer of the Transformer, or applied to the grouping of layers in the case of grouped LLRD. Specifically, top layers should have higher learning rates than bottom layers.
☑️ Next, we added warm-up steps to the learning rate schedule. With warm-up steps in a linear schedule, the learning rates increase linearly from 0 to the initial learning rates set in the optimizer during the warm-up phase, after which they decrease linearly to 0.
☑️ We also performed re-initialization of the top n layers of the Transformer. Choosing an optimal value for n is crucial, as you may start to experience deteriorating results if you re-initialize more layers beyond the optimal point.
☑️ Then we applied Stochastic Weight Averaging (SWA), a deep neural network training technique that uses a modified learning rate schedule. It also maintains a running average of the weights obtained during the last segment of training time.
☑️ Last but not least, we introduced frequent evaluation into our Transformer fine-tuning process. Instead of validating once per epoch, we validate every x batches of training data within the epoch.
With all these techniques, we see great improvement in results as exhibited in Table 1.
📑 Visit this GitHub repo for all the code and notebooks I’ve shared in this post.
© 2021 All rights reserved.
References
[1] T. Zhang, F. Wu, A. Katiyar, K. Weinberger, and Y. Artzi, Revisiting Few-sample BERT Fine-tuning (2021)
[2] A. Thakur, Approaching (Almost) Any Machine Learning Problem (2020)
[3] C. Sun, X. Qiu, Y. Xu, and X. Huang, How to Fine-Tune BERT for Text Classification? (2020)
[4] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. Wilson, Averaging Weights Leads to Wider Optima and Better Generalization (2019)
[5] J. Howard and S. Ruder, Universal Language Model Fine-tuning for Text Classification (2018)