Hyperparameter Tuning of HuggingFace Models with AWS Sagemaker SDK

Optimizing deep neural networks with the HuggingFace Estimator and Sagemaker Tuner

Ciarán Cooney
Towards Data Science

--

Image from pexels.com (https://www.pexels.com/photo/person-holding-volume-knob-1345630/)

Introduction

Even in the era of enormous pretrained neural networks, hyperparameter tuning offers the opportunity to maximize model performance for a specific downstream task. Fine-tuning, just like training from scratch, requires a reasonable set of initial hyperparameters to enable efficient and optimal training, so finding an effective method for tuning these parameters is an important piece of the deep learning jigsaw.

Hyperparameter tuning is an important concept to think about when working with some of the large pre-trained models available on HuggingFace, such as BERT, T5, wav2vec or ViT. It is easy to think that most of the potential of these models has already been exhausted through large-scale pretraining, but hyperparameters such as learning rate, number of warmup steps, weight decay, and the type of learning rate scheduler can have a significant effect on the ultimate objective of your fine-tuning task.

Fortunately, there are several strategies for searching for optimal hyperparameter configurations (e.g., grid search or Bayesian optimization), with varying levels of sophistication behind their approaches. In addition, deep learning frameworks and cloud providers are doing more and more to make it easy for practitioners to integrate hyperparameter tuning into their ML workflow. One of these is Amazon Web Services’ (AWS) Sagemaker HyperparameterTuner. In this article, I am going to do a code walk-through of how to use Sagemaker to fine-tune a HuggingFace transformer using its hyperparameter tuner and the Sagemaker HuggingFace Estimator.

Notebooks and scripts are available here, and are part of a repo being written to demonstrate the utility of Sagemaker training, evaluating, and deploying deep learning models.

Hyperparameter Tuner and HuggingFace Estimator

Sagemaker’s HyperparameterTuner makes running hyperparameter tuning jobs easy to maintain and cost-effective. This class takes a Sagemaker estimator (the base class for running machine learning training jobs in AWS) and configures a tuning job based on arguments provided by the user. The user can specify the tuning strategy, the metric to maximize or minimize, the hyperparameter ranges to search through, and several other arguments. You call .fit() on the tuner just as you would with a standard estimator, and it also provides functionality for deployment after training is complete.

I am going to demonstrate the HyperparameterTuner alongside the Sagemaker HuggingFace Estimator. This is a bespoke estimator for working with HuggingFace models in AWS. In this example, I am going to fine-tune DistilBERT on the tweet_eval dataset for a sentiment classification task. The dataset is provided under a Creative Commons Attribution 3.0 Unported License.

Follow along with code

Following some imports, we need to set up a Sagemaker session and initialize an S3 bucket that we can read from and write to. A Sagemaker session is a convenient class that manages the resources and entities Sagemaker typically works with, such as endpoints and data in S3. If you do not specify a bucket, the session will assign a default bucket.
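As a minimal sketch, that setup might look like this (getting the execution role this way assumes you are running in a Sagemaker notebook or Studio environment):

```python
# Minimal session setup sketch: create a Sagemaker session, grab its default S3 bucket,
# and fetch the IAM execution role that training and tuning jobs will run under.
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()          # assigned automatically if no bucket is specified
role = sagemaker.get_execution_role()   # assumes a Sagemaker notebook/Studio environment
```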

With the initial admin complete, the next thing to do is get the data.

Data

To demonstrate hyperparameter tuning with the HuggingFace estimator, we’re going to use the tweet_eval dataset and download it directly from the datasets library.

Load the dataset.
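A minimal sketch of that download; the “sentiment” configuration of tweet_eval is assumed here:

```python
# Download the tweet_eval dataset from the HuggingFace hub; the "sentiment"
# configuration is assumed for the sentiment classification task.
from datasets import load_dataset

train_dataset = load_dataset("tweet_eval", "sentiment", split="train")
test_dataset = load_dataset("tweet_eval", "sentiment", split="test")
```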

Following a few tokenization and processing steps, we want to convert the dataset to tensors and then store the train and test sets in the bucket we defined for our Sagemaker session.

Fortunately, HuggingFace datasets and Sagemaker make saving data relatively simple: a datasets object provides a save_to_disk() method that accepts a file system argument, so passing in an s3fs.S3FileSystem takes care of moving the data to S3.

Store datasets in S3 with save_to_disk() method.
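A sketch of the tokenization and upload steps, assuming the distilbert-base-uncased tokenizer and illustrative S3 prefixes; note that older releases of datasets accept an fs argument on save_to_disk(), while newer releases use storage_options instead:

```python
# Tokenize the splits, convert them to torch tensors, and write them to S3.
import s3fs
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# store input_ids, attention_mask, and label as torch tensors
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

s3 = s3fs.S3FileSystem()
training_input_path = f"s3://{bucket}/tweet_eval/train"   # illustrative prefixes
test_input_path = f"s3://{bucket}/tweet_eval/test"

train_dataset.save_to_disk(training_input_path, fs=s3)    # use storage_options on newer datasets versions
test_dataset.save_to_disk(test_input_path, fs=s3)
```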

Now we have our train and test data stored in an S3 location where our training job can access it.

S3 locations for train and test sets. Image by author.

Hyperparameter Settings

Before running a tuning job, we want to think about the hyperparameters we want to optimize and the range of values we think might be appropriate. Common hyperparameters that get optimized by tuning include learning rate, weight decay, dropout probability, or even structural parameters like the number of layers in a neural network or the pooling strategy. In the scenario I am demonstrating here, the base model itself could even be tuned as a hyperparameter as we could load in several HuggingFace models for comparison.

However, for this example we will fine-tune DistilBERT and tune four hyperparameters. These are:

  • Learning rate
  • Number of warmup steps
  • Optimizer
  • Weight decay

Initialize Estimator and Tuner

Before we initialize our tuning job, we need to initialize our estimator. An estimator is a class in Sagemaker that handles end-to-end training and deployment tasks. The HuggingFace estimator allows us to run custom HuggingFace code in a Sagemaker training environment by using a pre-built docker container developed specifically for the task.

We pass the estimator our training script using the entry_point argument. We also pass several additional parameters to configure the environment, the package versioning, and the instance settings. The hyperparameters argument passed to the estimator does not contain the parameters to be tuned, but arguments to be passed to our training script.

Initialize the HuggingFace estimator.
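A sketch of that configuration; the container versions, instance type, and fixed hyperparameters are illustrative and should be matched to your own environment and training script:

```python
# Configure the HuggingFace estimator around the custom training script.
from sagemaker.huggingface import HuggingFace

# fixed arguments forwarded to training_script.py (not the parameters being tuned)
hyperparameters = {
    "epochs": 2,
    "train_batch_size": 32,
    "model_name": "distilbert-base-uncased",
}

huggingface_estimator = HuggingFace(
    entry_point="training_script.py",
    source_dir="./scripts",            # illustrative location of the training code
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    hyperparameters=hyperparameters,
)
```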

The training script training_script.py contains our code for fine-tuning DistilBERT, here. HuggingFace provides a Trainer class that handles virtually all of the training setup and procedures, and there are examples of tuning using that approach here. However, this is not always desirable and there are advantages to having more direct control over the training loop. For that reason, I have written a custom training loop in PyTorch for this task.

Check out the custom training loop if it helps, but here are a couple of snippets showing the dataloader and the model training.

PyTorch dataloader for the training set.
Training loop in native PyTorch.
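A condensed sketch of what those two pieces might look like inside training_script.py, with hyperparameters arriving as command-line arguments and the data path coming from Sagemaker’s SM_CHANNEL_TRAIN environment variable (optimizer selection and the warmup scheduler are omitted for brevity):

```python
# Condensed sketch of the custom PyTorch training code inside training_script.py.
import argparse
import os

import torch
from torch.utils.data import DataLoader
from datasets import load_from_disk
from transformers import AutoModelForSequenceClassification

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=2)
parser.add_argument("--train_batch_size", type=int, default=32)
parser.add_argument("--learning_rate", type=float, default=5e-5)
parser.add_argument("--weight_decay", type=float, default=0.0)
parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
parser.add_argument("--training_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
args, _ = parser.parse_known_args()

# dataloader for the training set saved to S3 earlier
train_dataset = load_from_disk(args.training_dir)
train_loader = DataLoader(train_dataset, batch_size=args.train_batch_size, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForSequenceClassification.from_pretrained(args.model_name).to(device)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=args.learning_rate, weight_decay=args.weight_decay
)

# training loop in native PyTorch
model.train()
for epoch in range(args.epochs):
    for batch in train_loader:
        labels = batch.pop("label").to(device)
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch, labels=labels)   # loss is returned when labels are passed
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```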

The snippet below shows the configuration of our hyperparameter ranges. The Sagemaker tuner comes with a suite of classes for representing parameter ranges. ContinuousParameter allows us to set a range between which we can search for continuous values. Here, it is used for learning rate and weight decay. IntegerParameter provides the same functionality for ints and we use it for warmup steps. Finally, CategoricalParameter allows us to pass a list of variables to tune — here, this is used for optimizer type.

The tuner also requires an objective metric and an objective type: something to tune the model towards and the direction in which to tune it. The metric_definitions argument contains the name of one or more metrics and a regular expression used to extract each metric from CloudWatch logs (this is a common feature of the Sagemaker SDK).

Define hyperparameter ranges and objective metric.
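A sketch of that configuration; the ranges and the metric regex are illustrative, and the regex must match whatever your training script actually prints to the logs:

```python
# Define the search space and the objective metric scraped from the CloudWatch logs.
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    IntegerParameter,
)

hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-5, 5e-4),
    "weight_decay": ContinuousParameter(1e-6, 1e-2),
    "warmup_steps": IntegerParameter(0, 500),
    "optimizer": CategoricalParameter(["adamw", "adafactor"]),
}

objective_metric_name = "test_loss"
objective_type = "Minimize"
metric_definitions = [
    # the regex must match the line printed by training_script.py, e.g. "test_loss = 0.345"
    {"Name": "test_loss", "Regex": "test_loss = ([0-9\\.]+)"},
]
```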

Now we can define the HyperparameterTuner before beginning our tuning jobs. As well as the HuggingFace estimator, metric arguments, and hyperparameter ranges, we also need to set the maximum number of jobs and the number of parallel jobs we want to run. This is what makes the Sagemaker tuner so great and so easy to use. Then we call tuner.fit() to start the tuning job.

Initialize HyperparameterTuner and call .fit() to begin tuning.
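A sketch of the tuner setup; max_jobs and max_parallel_jobs are illustrative and directly drive the cost of the experiment:

```python
# Wrap the estimator in a HyperparameterTuner and kick off the tuning job.
from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(
    estimator=huggingface_estimator,
    objective_metric_name=objective_metric_name,
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    objective_type=objective_type,
    max_jobs=8,             # total training jobs to run
    max_parallel_jobs=2,    # jobs running at the same time
)

# point the train/test channels at the S3 paths written earlier
tuner.fit({"train": training_input_path, "test": test_input_path})
```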

Compare tuned hyperparameters

When the tuning job ends, we have our tuned hyperparameters. The tuner comes with an analytics() method whose dataframe() output summarizes the results in a pandas DataFrame. The FinalObjectiveValue is the loss metric we established when configuring the tuning job.
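A short sketch of pulling those results into pandas:

```python
# Summarize the tuning job results as a pandas DataFrame, best (lowest) objective first.
results_df = tuner.analytics().dataframe()
results_df = results_df.sort_values("FinalObjectiveValue", ascending=True)
print(results_df.head())
```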

Results DataFrame from tuner analytics. Image by author.

The optimal hyperparameters are:

  • Learning rate = 0.000175
  • Optimizer = Adafactor
  • Warmup steps = 192
  • Weight decay = 0.000111

…and a cursory glance at the results suggests that learning rate is probably the most significant factor.

Of course, we can go ahead and plot our results directly from the DataFrame, but there is another way. From the Sagemaker console, we can click through the Training and Hyperparameter tuning jobs tabs. From there, we can find our completed jobs and click on the View algorithm metrics link. This takes us to AWS CloudWatch, where we can see various interactive plots and perform queries on the data returned from our tuner. The plot below is an example line plot showing the test loss over two epochs.

AWS CloudWatch. Image by author.

Now that we have the results, we have a few options for using the tuned values. First, we could simply take the model trained with these parameters as our final model for inference. Second, we could use the optimal parameters to perform a longer training run in order to improve our model. Third, we could reset our hyperparameter ranges based on these results and run another tuning job to get a more granular result.

For now, I am just going to use the best model achieved by the training job to deploy and perform inference.

Deploy Endpoint and predict

To select the best model, our tuner object has a best_estimator() method. Having initialized the best-performing model, it is very simple to deploy it to a Sagemaker endpoint using the deploy() method. Here, I am specifying the number of instances to use for inference (1) and the instance type (‘ml.g4dn.xlarge’ for accelerated computing). Deployment can take a few minutes to complete, and when it is done you have your model endpoint hosted on Sagemaker.

Deploy model.
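A sketch of those two calls, using the instance settings described above:

```python
# Retrieve the best-performing estimator from the tuning job and deploy it to an endpoint.
best_estimator = tuner.best_estimator()
predictor = best_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)
```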

With the model deployed, we can make predictions of the sentiment of some input text. If I input the sentence, “Best thing ever!” I would expect a positive sentiment prediction with a very high confidence value. That is what we get. However, the output labels are generically set to ‘LABEL_0’ and ‘LABEL_1’, so I’ve written a little post-processing code to give us more meaningful outputs and you can see that we end up with a ‘positive’ result.

Make a prediction.
Using the deployed model to make predictions. Image by author.
Predict a class label.
Formatting predictions to be readable. Image by author.
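A sketch of the prediction call and the post-processing step; the label mapping here is an assumption based on the output shown above:

```python
# Send a sentence to the endpoint and map the generic labels to readable ones.
sentiment_input = {"inputs": "Best thing ever!"}
prediction = predictor.predict(sentiment_input)   # e.g. [{"label": "LABEL_1", "score": 0.99}]

label_map = {"LABEL_0": "negative", "LABEL_1": "positive"}   # assumed mapping
readable = [
    {"label": label_map.get(p["label"], p["label"]), "score": round(p["score"], 4)}
    for p in prediction
]
print(readable)
```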

Finally, if we no longer require the model for inference, we can delete the endpoint so that it is no longer hosted (your model artefacts are still stored in S3).

Delete endpoint when all tasks are complete.
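Cleaning up is a single call on the predictor:

```python
# Tear down the hosted endpoint; the model artefacts remain in S3.
predictor.delete_endpoint()
```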

Overall, my experience of using the Sagemaker HyperparameterTuner has been very positive, but there are a few potential downsides to consider. As with all cloud services, one of the things to be aware of is cost. This is particularly salient for a service like this, in which multiple jobs, including parallelization and GPUs, are being used. Another potential downside is the high-level nature of the HyperparameterTuner and the Sagemaker SDK. Some people prefer to have more control over their programs, and for this something like boto3 may be preferable.

Conclusion

This post demonstrates how to perform hyperparameter tuning in AWS Sagemaker using the HuggingFace estimator. I hope the code walkthrough shows just how easy it is to tune hyperparameters using the Sagemaker SDK and that there is a lot to be gained in model development by using it. Jupyter Notebooks for using the hyperparameter tuner are available here and here. The main GitHub repository for Sagemaker examples is here.


PhD in brain-computer interfaces | data science, machine learning, linguistics @ AflacNI | bikes and books | LinkedIn: linkedin.com/in/ciaran-cooney-42b031117/