
Advanced Techniques for Fine-tuning Transformers

Learn these advanced techniques and see how they can help improve results


Transformer word clouds generated with Python codes. Image by author

Transformers – Hello and we’re meeting again. We have a date, don’t we, RoBERTa?

If you have read and followed along with my earlier post, Transformers, can you rate the complexity of reading passages?, that is great! It means you are most probably already familiar with the basics of the Transformer fine-tuning or training process. If you have not seen that post, you may visit the link below.

Transformers, can you rate the complexity of reading passages?

So, how’s your model doing? Does it achieve reasonably good results? Or does your Transformer model suffer from performance issues and instability? If so, the root cause is often difficult to diagnose and pin down. Such issues are usually more prevalent with large models and small datasets. The nature and characteristics of the data and the downstream task can also play a part.

If your Transformer is not performing up to your expectations, what can you do? You may try hyperparameter tuning. In addition, you may also try to implement some of the advanced training techniques that I’m going to cover in this post. These techniques can be used for fine-tuning Transformers such as BERT, ALBERT, RoBERTa, and others.

Contents

1. Layer-wise Learning Rate Decay (LLRD)
2. Warm-up Steps
3. Re-initializing Pre-trained Layers
4. Stochastic Weight Averaging (SWA)
5. Frequent Evaluation
Results
Summary

For all the advanced fine-tuning techniques in this post, we will use the same model and dataset as in Transformers, can you rate the complexity of reading passages?

In the end, we will be able to compare the results of basic fine-tuning with those obtained by applying the advanced fine-tuning techniques.


1. Layer-wise Learning Rate Decay (LLRD)

In Revisiting Few-sample BERT Fine-tuning, the authors describe layer-wise learning rate decay as "a method that applies higher learning rates for top layers and lower learning rates for bottom layers. This is accomplished by setting the learning rate of the top layer and using a multiplicative decay rate to decrease the learning rate layer-by-layer from top to bottom".

A similar concept, called discriminative fine-tuning, is also expressed in Universal Language Model Fine-tuning for Text Classification.

"Discriminative fine-tuning allows us to tune each layer with different learning rates instead of using the same learning rate for all layers of the model"

All these make sense, as different layers in the Transformer model usually capture different kinds of information. Bottom layers often encode more common, general, and broad-based information, while the top layers closer to the output encode information that is more localized and specific to the task at hand.

Before we go into the implementation, let’s quickly recap the basic fine-tuning we did in Transformers, can you rate the complexity of reading passages?

On a roberta-base model that consists of one embeddings layer and 12 hidden layers, we used a linear scheduler and set an initial learning rate of 1e-6 (that is 0.000001) in the optimizer. As depicted in Figure 1, the scheduler created a schedule with a learning rate that linearly decreases from 1e-6 to zero across training steps.

Figure 1: Linear schedule with an initial learning rate of 1e-6. Image by author

Now, to implement layer-wise learning rate decay (or discriminative fine-tuning), there are two possible ways of doing it.

The first way follows the method described in Revisiting Few-sample BERT Fine-tuning. We choose a learning rate of 3.5e-6 for the top layer and use a multiplicative decay rate of 0.9 to decrease the learning rate layer-by-layer from top to bottom. This results in the bottom layers (embeddings and layer 0) having a learning rate roughly close to 1e-6. We do this in a function called roberta_base_AdamW_LLRD.

Okay, we have set the learning rates for the hidden layers. How about the pooler and regressor head? For them, we choose 3.6e-6, a learning rate that is slightly higher than the top layer.

In the codes below, head_params, layer_params, and embed_params are dictionaries defining the parameters, learning rates, and weight decays that we want to optimize. All these parameter groups are passed into the AdamW optimizer, which is returned by the function.
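The full implementation is in the shared notebook; below is a minimal sketch of what roberta_base_AdamW_LLRD might look like. It assumes the RoBERTa submodules are exposed under parameter names containing embeddings, encoder.layer.{i}, pooler, and the custom regressor head, and it applies a single weight decay of 0.01 to every group (the actual code may, for instance, exclude biases and LayerNorm weights from weight decay).

from torch.optim import AdamW

def roberta_base_AdamW_LLRD(model, head_lr=3.6e-6, init_lr=3.5e-6,
                            decay_factor=0.9, weight_decay=0.01):
    named_parameters = list(model.named_parameters())
    opt_parameters = []

    # Pooler and regressor head: slightly higher learning rate than the top layer.
    head_params = {
        "params": [p for n, p in named_parameters
                   if "pooler" in n or "regressor" in n],
        "lr": head_lr,
        "weight_decay": weight_decay,
    }
    opt_parameters.append(head_params)

    # Hidden layers 11 down to 0: the learning rate decays multiplicatively
    # by decay_factor as we move from the top layer towards the bottom.
    lr = init_lr
    for layer in range(11, -1, -1):
        layer_params = {
            "params": [p for n, p in named_parameters
                       if f"encoder.layer.{layer}." in n],
            "lr": lr,
            "weight_decay": weight_decay,
        }
        opt_parameters.append(layer_params)
        lr *= decay_factor

    # Embeddings layer: after twelve multiplications by 0.9 the learning rate
    # lands close to 1e-6.
    embed_params = {
        "params": [p for n, p in named_parameters if "embeddings" in n],
        "lr": lr,
        "weight_decay": weight_decay,
    }
    opt_parameters.append(embed_params)

    return AdamW(opt_parameters, lr=init_lr)

Starting from 3.5e-6 at layer 11, the decay brings the embeddings layer to roughly 1e-6, which matches the schedule shown in Figure 2 below.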

Below is what a linear schedule with layer-wise learning rate decay looks like:

Figure 2: Linear schedule with layer-wise learning rate decay. Image by author

The second approach of implementing layer-wise learning rate decay (or discriminative fine-tuning) is to group layers into different sets and apply different learning rates to each. We will refer to this as grouped LLRD.

Using a new function, roberta_base_AdamW_grouped_LLRD, we split the 12 hidden layers of the roberta-base model into 3 sets, with the embeddings attached to the first set.

  • Set 1 : Embeddings + Layer 0, 1, 2, 3 (learning rate: 1e-6)
  • Set 2 : Layer 4, 5, 6, 7 (learning rate: 1.75e-6)
  • Set 3 : Layer 8, 9, 10, 11 (learning rate: 3.5e-6)

As with the first approach, we use 3.6e-6 for the pooler and regressor head, a learning rate that is slightly higher than the top layers.
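Here is a similar sketch of roberta_base_AdamW_grouped_LLRD, under the same naming assumptions as before; the only change is that layers share a learning rate within each set rather than decaying layer-by-layer.

from torch.optim import AdamW

def roberta_base_AdamW_grouped_LLRD(model, weight_decay=0.01):
    named_parameters = list(model.named_parameters())
    opt_parameters = []

    # Three sets of layers, each with its own learning rate.
    set_1 = ["embeddings"] + [f"encoder.layer.{i}." for i in range(0, 4)]
    set_2 = [f"encoder.layer.{i}." for i in range(4, 8)]
    set_3 = [f"encoder.layer.{i}." for i in range(8, 12)]

    for patterns, lr in [(set_1, 1e-6), (set_2, 1.75e-6), (set_3, 3.5e-6)]:
        opt_parameters.append({
            "params": [p for n, p in named_parameters
                       if any(pattern in n for pattern in patterns)],
            "lr": lr,
            "weight_decay": weight_decay,
        })

    # Pooler and regressor head: slightly higher learning rate than the top set.
    opt_parameters.append({
        "params": [p for n, p in named_parameters
                   if "pooler" in n or "regressor" in n],
        "lr": 3.6e-6,
        "weight_decay": weight_decay,
    })

    return AdamW(opt_parameters, lr=3.5e-6)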

Below is what a linear schedule with grouped LLRD looks like:

Figure 3: Linear schedule with grouped LLRD. Image by author

2. Warm-up Steps

For the linear scheduler that we used, we can apply warm-up steps. For example, applying 50 warm-up steps means the learning rate will increase linearly from 0 to the initial learning rate set in the optimizer during the first 50 steps (warm-up phase). After that, the learning rate will start to decrease linearly to 0.

Figure 4: Linear schedule with LLRD and 50 warm-up steps. Image by author

The following plot shows each layer’s learning rate at step 50. These are the learning rates we set for the optimizer.

Figure 5: Hover text reflects the learning rates at step-50. Image by author

To apply warm-up steps, pass the num_warmup_steps parameter to the get_scheduler function.

scheduler = transformers.get_scheduler(
                "linear",    
                optimizer = optimizer,
                num_warmup_steps = 50,
                num_training_steps = train_steps
)

Alternatively, you may also use get_linear_schedule_with_warmup.

scheduler = transformers.get_linear_schedule_with_warmup(                
                optimizer = optimizer,
                num_warmup_steps = 50,
                num_training_steps = train_steps
)

3. Re-initializing Pre-trained Layers

Fine-tuning Transformers is a breeze since we are using pre-trained models. That means we are not training one from scratch, which could take up substantial resources and time. These models have usually been pre-trained on a large corpus of text data, and they contain pre-trained weights that we can use. However, to achieve better fine-tuning results, we sometimes need to discard some of these weights and re-initialize them during the fine-tuning process.

So how do we do this? Earlier, we talked about different layers of the Transformer capturing different kinds of information. The bottom layers usually encode more general information. These are useful, and so we want to preserve these low-level representations. What we want to refresh are the top layers closer to the output. They are layers that encode information more specific to the pre-training task, and now we want them to adapt to ours.

We can do this in the MyModel class that we created previously. When initializing the model, we pass in a parameter that specifies the top n layers to re-initialize. You may ask, why n? That is, how many of the top layers should we re-initialize? Well, it depends, as every model and dataset is different, and choosing an optimum value for n is crucial and can lead to faster convergence. For our case, the optimal value for n is 5. You may start to see deteriorating results if you re-initialize more layers beyond the optimal point.

In the codes below, we re-initialize the weights of nn.Linear modules with mean 0 and a standard deviation given by the model’s initializer_range, and re-initialize the weights of nn.LayerNorm modules with a value of 1. Biases are re-initialized with a value of 0.
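As a rough sketch, and assuming MyModel keeps the pre-trained model in an attribute such as self.roberta and the number of layers to re-initialize in self.reinit_n_layers (both hypothetical names), _do_reinit could look like this:

import torch.nn as nn

def _do_reinit(self):
    # Re-initialize the pooler layer (omit this part if you do not use the pooler).
    self.roberta.pooler.dense.weight.data.normal_(
        mean=0.0, std=self.roberta.config.initializer_range)
    self.roberta.pooler.dense.bias.data.zero_()

    # Re-initialize the top n encoder layers.
    for layer in self.roberta.encoder.layer[-self.reinit_n_layers:]:
        for module in layer.modules():
            if isinstance(module, nn.Linear):
                # Weights ~ N(0, initializer_range); biases set to 0.
                module.weight.data.normal_(
                    mean=0.0, std=self.roberta.config.initializer_range)
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.LayerNorm):
                # LayerNorm weights set to 1; biases set to 0.
                module.weight.data.fill_(1.0)
                module.bias.data.zero_()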

As seen in the codes, we also re-initialize the pooler layer. If you’re not using the pooler in your model, you may omit the part relating to it in _do_reinit.

4. Stochastic Weight Averaging (SWA)

Stochastic Weight Averaging (SWA) is a deep neural network training technique presented in Averaging Weights Leads to Wider Optima and Better Generalization. According to the authors,

"SWA is extremely easy to implement and has virtually no computational overhead compared to the conventional training schemes"

So, how does SWA work? As stated in this PyTorch blog, SWA comprises two ingredients:

  • First, it uses a modified learning rate schedule. For example, we can use the standard decaying learning rate strategy (such as the linear schedule that we are using) for the first 75% of training time and then set the learning rate to a reasonably high constant value for the remaining 25% of the time.
  • Second, it takes an equal average of the weights of the networks traversed. For example, we can maintain a running average of the weights obtained at the end of every epoch within the last 25% of training time. After training is complete, we then set the weights of the network to the computed SWA averages.

How to use SWA in PyTorch?

"In torch.optim.swa_utils we implement all the SWA ingredients to make it convenient to use SWA with any model.

In particular, we implement AveragedModel class for SWA models, SWALR learning rate scheduler, and update_bn utility function to update SWA batch normalization statistics at the end of training."

Source: PyTorch blog

SWA is easy to implement in PyTorch. You may refer to the sample codes below, provided in the PyTorch documentation, for implementing SWA.

Sample codes from PyTorch documentation for implementing SWA
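This is a lightly adapted version of that example; loader, model, optimizer, loss_fn, and test_inputs are placeholders that you would define in your own code.

import torch
from torch.optim.swa_utils import AveragedModel, SWALR
from torch.optim.lr_scheduler import CosineAnnealingLR

# loader, model, optimizer, and loss_fn are assumed to be defined elsewhere.
swa_model = AveragedModel(model)                     # keeps the running average of the weights
scheduler = CosineAnnealingLR(optimizer, T_max=300)  # regular schedule for the first phase
swa_start = 160                                      # epoch at which we switch to SWA
swa_scheduler = SWALR(optimizer, swa_lr=0.05)        # constant SWA learning rate

for epoch in range(300):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        optimizer.step()
    if epoch > swa_start:
        swa_model.update_parameters(model)           # accumulate weights into the average
        swa_scheduler.step()
    else:
        scheduler.step()

# Update batch-norm statistics for the SWA model at the end of training,
# then use swa_model to make predictions on test data.
torch.optim.swa_utils.update_bn(loader, swa_model)
preds = swa_model(test_inputs)                       # test_inputs is a placeholder

Note that roberta-base uses LayerNorm rather than batch normalization, so the update_bn step is not strictly needed for our model.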

To implement SWA in our run_training function, we take in a parameter for swa_lr. This parameter is the SWA learning rate set to a constant value. In our case, we will use 2e-6 for swa_lr.

Because we want to switch to the SWA learning rate schedule and start to collect SWA averages of the parameters at epoch 3, we assign 3 for swa_start.

For each fold, we initialize the swa_model and swa_scheduler along with the data loader, model, optimizer, and scheduler. swa_model is the SWA model that accumulates the averages of the weights.

Next, we loop through the epochs, calling train_fn and passing it the swa_model, swa_scheduler, and a boolean indicator, swa_step, which tells the program to switch to the swa_scheduler at epoch 3.

In train_fn, the swa_step parameter passed in from the run_training function controls the switch to SWALR and the updates of the parameters of the averaged model, swa_model.
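The exact code lives in the shared notebook; the fragment below is only a sketch of how these pieces could be wired together, reusing the model, optimizer, scheduler, train_loader, and train_fn from the earlier post (the train_fn signature shown here is an assumption).

from torch.optim.swa_utils import AveragedModel, SWALR

# Inside run_training, after the data loader, model, optimizer, and
# scheduler have been created for the current fold:
swa_model = AveragedModel(model)               # accumulates the averages of the weights
swa_scheduler = SWALR(optimizer, swa_lr=2e-6)  # constant SWA learning rate
swa_start = 3                                  # epoch at which we switch to SWA

for epoch in range(1, num_epochs + 1):
    swa_step = epoch >= swa_start              # tells train_fn to switch to SWA
    train_fn(train_loader, model, optimizer, scheduler,
             swa_model, swa_scheduler, swa_step)

# Inside train_fn, after each optimizer.step():
#     if swa_step:
#         swa_model.update_parameters(model)   # update the averaged model
#         swa_scheduler.step()                 # hold the learning rate at swa_lr
#     else:
#         scheduler.step()                     # regular linear schedule with warm-up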

The sweet thing about SWA is that we can use it with any optimizer and most schedulers. On our linear schedule with LLRD, we can see from Figure 6 how the learning rate remains constant at 2e-6 after switching to the SWA learning rate schedule at epoch 3.

Figure 6: Linear schedule with LLRD, 50 warm-up steps, and SWA. Image by author

Below is what a linear schedule looks like after implementing SWA on grouped LLRD with 50 warm-up steps:

Figure 7: Linear schedule with grouped LLRD, 50 warm-up steps, and SWA. Image by author

You can read more details about SWA in this PyTorch blog and in the PyTorch documentation.

5. Frequent Evaluation

Frequent evaluation is another technique worth exploring. It simply means that instead of validating once per epoch, we perform validation every x batches of training data within the epoch. This requires a small structural change to our codes, as the training and validation functions are currently separate and each is called once per epoch.

What we will do is create a new function, train_and_validate. For each epoch, run_training will then call this new function instead of separately calling train_fn and validate_fn.

Figure 8: Image by author

Inside train_and_validate, the model training codes run for every batch of training data. For validation, however, validate_fn is only called every x batches of training data. Thus, if x is 10 and we have 50 batches of training data, five validations are done in each epoch. A simplified sketch of this is shown after Figure 9.

Figure 9: Image by author
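Below is a simplified sketch of what train_and_validate could look like; the batch keys, the loss_fn argument, and the validate_every parameter name are assumptions, and validate_fn is the validation function from the earlier post.

def train_and_validate(train_loader, valid_loader, model, optimizer,
                       scheduler, loss_fn, device, validate_every=10):
    """Train for one epoch, validating every `validate_every` training batches."""
    model.train()
    for step, batch in enumerate(train_loader, start=1):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        targets = batch["target"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask)
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
        scheduler.step()

        # Validate every `validate_every` training batches instead of once per epoch.
        if step % validate_every == 0:
            valid_loss = validate_fn(valid_loader, model, loss_fn, device)
            model.train()  # switch back to training mode after validation
            # Track valid_loss here, e.g. to keep the best checkpoint seen so far.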

Results

All right, the exciting part is here… the results!

It’s pretty amazing how much improvement these techniques can bring. The results are displayed in the table below.

The mean RMSE score improves from 0.589 with basic fine-tuning to 0.5199 after applying all the advanced techniques covered in this post.

Summary

In this post, we went through the various techniques used for fine-tuning Transformers.

☑️ First, we used layer-wise learning rate decay (LLRD). The main idea behind LLRD is to apply a different learning rate to each layer of the Transformer, or to each group of layers in the case of grouped LLRD. Specifically, top layers should have higher learning rates than bottom layers.

☑️ Next, we added warm-up steps to the learning rate schedule. With warm-up steps in a linear schedule, the learning rates increase linearly from 0 to the initial learning rates set in the optimizer during the warm-up phase, after which they decrease linearly to 0.

☑️ We also re-initialized the top n layers of the Transformer. Choosing an optimum value for n is crucial, as you may start to see deteriorating results if you re-initialize more layers beyond the optimal point.

☑️ Then we applied Stochastic Weight Averaging (SWA), a deep neural network training technique that uses a modified learning rate schedule. It also maintains a running average of the weights obtained within the last segment of training time.

☑️ Last but not least, we introduced frequent evaluation into our Transformer fine-tuning process. Instead of validating once per epoch, we validate every x batches of training data within the epoch.

With all these techniques, we see great improvement in results as exhibited in Table 1.


If you like my post, don’t forget to hit Follow and Subscribe to get notified via email.

Optionally, you may also sign up for a Medium membership to get full access to every story on Medium.

📑 Visit this GitHub repo for all codes and notebooks that I’ve shared in my post.



References

[1] T. Zhang, F. Wu, A. Katiyar, K. Weinberger, and Y. Artzi, Revisiting Few-sample BERT Fine-tuning (2021)

[2] A. Thakur, Approaching (Almost) Any Machine Learning Problem (2020)

[3] C. Sun, X. Qiu, Y. Xu, and X. Huang, How to Fine-Tune BERT for Text Classification? (2020)

[4] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. Wilson, Averaging Weights Leads to Wider Optima and Better Generalization (2019)

[5] J. Howard and S. Ruder, Universal Language Model Fine-tuning for Text Classification (2018)

