How to Run Machine Learning Hyperparameter Optimization in the Cloud — Part 3

Cloud Tuning by Parallelizing Managed Training Jobs

Chaim Rand
Towards Data Science

Photo by Kenny Eliason on Unsplash

This is the final part of a three-part post on the topic of hyperparameter tuning (HPT) of machine learning models in the cloud. In the first part we set the stage by introducing the problem and defining a toy model and a training function for our tuning demonstrations. In the second part we reviewed two options for cloud based optimization, both of which involved parallel experimentation on a dedicated tuning cluster. In this part we will introduce two additional methods: optimization using a managed tuning API, and building an HPT solution in which each experiment is an individual managed training job.

Option 3: Managed HPT Services

Some CSPs include dedicated HPT APIs as part of their managed training service offering. Amazon SageMaker supports HPT via its automatic model tuning APIs. To adapt our script for SageMaker HPT, we need to modify the entry point of our training script (train.py) as follows:

if __name__ == "__main__":
    import argparse, os
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--model-dir", type=str,
                        default=os.environ["SM_MODEL_DIR"])
    args, _ = parser.parse_known_args()
    train({'lr': args.lr})
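The SageMaker tuner will invoke this script inside the training container with each sampled hyperparameter passed as a command-line argument (e.g. --lr 0.001) and with the SM_MODEL_DIR environment variable set. As a quick sanity check, the invocation can be emulated locally; the sketch below assumes train.py resides in the current directory and that local resources suffice to run it:

import os
import subprocess

# SageMaker sets SM_MODEL_DIR inside the training container; emulate it locally
env = dict(os.environ, SM_MODEL_DIR="/tmp/model")

# the tuner injects each sampled hyperparameter as a command-line argument
subprocess.run(["python", "train.py", "--lr", "0.001"], env=env, check=True)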

In the code block below, we show how to configure and run a SageMaker HPT job. In order to align the HPT run with our previous examples, we will use the recently announced SageMaker HPT support for the HyperBand algorithm with a similar configuration. Regrettably, as of the time of this writing, the SageMaker SDK (version 2.114.0) does not include built-in support for running a HyperBand tuning job. In the code block below we show how to work around this limitation by extending the SageMaker Session class.

from sagemaker.session import Session

class HyperBandSession(Session):
    def _map_tuning_config(
        self,
        strategy,
        max_jobs,
        max_parallel_jobs,
        early_stopping_type="Off",
        objective_type=None,
        objective_metric_name=None,
        parameter_ranges=None,
    ):
        tuning_config = super()._map_tuning_config(
            strategy, max_jobs, max_parallel_jobs,
            early_stopping_type, objective_type,
            objective_metric_name, parameter_ranges)
        tuning_config["StrategyConfig"] = {
            "HyperbandStrategyConfig": {
                "MinResource": 1,
                "MaxResource": 8}
        }
        return tuning_config

# 1. define estimator (reused for each experiment)
from sagemaker.pytorch import PyTorch
estimator = PyTorch(
    entry_point='train.py',
    source_dir='./',  # contains train.py and requirements file
    role=<role>,
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    py_version='py38',
    framework_version='1.12',
    sagemaker_session=HyperBandSession()
)

from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
)

# 2. define search space
hyperparameter_ranges = {
    "lr": ContinuousParameter(1e-6, 1e-1,
                              scaling_type='Logarithmic'),
}

# 3. define metric
objective_metric_name = "accuracy"
objective_type = "Maximize"
metric_definitions = [{"Name": "accuracy",
                       "Regex": "'eval_accuracy': ([0-9\\.]+)"}]

# 4. define algorithm strategy
algo_strategy = 'Hyperband'
max_jobs = 32
max_parallel_jobs = 8

# 5. define tuner
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    strategy=algo_strategy,
    max_jobs=max_jobs,
    max_parallel_jobs=max_parallel_jobs,
    objective_type=objective_type,
)

# 6. tune
tuner.fit(wait=False)
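Since we launched the tuning job with wait=False, we can block on its completion and review the per-trial results programmatically. The snippet below is a minimal sketch that uses the tuner object defined above together with the SageMaker analytics APIs; the exact DataFrame columns (e.g. FinalObjectiveValue) may vary across SDK versions:

# block until the tuning job completes, then pull the per-trial results
tuner.wait()
results_df = tuner.analytics().dataframe()

# sort the trials by the objective metric (highest accuracy first)
print(results_df.sort_values("FinalObjectiveValue", ascending=False).head())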

Pros and Cons

The main advantage of this method is its convenience. If you are already using Amazon SageMaker for training, you can enhance your code to support hyperparameter tuning with just a few additional SageMaker APIs. In particular, you do not need to adopt and integrate a dedicated HPT framework (such as Ray Tune).

The main disadvantage is its limited choice of HPT algorithms. The SageMaker HPT APIs define a closed set of algorithms to choose from, which will not necessarily include the state-of-the-art (SOTA) or most suitable algorithm for your problem (see here for some examples).

Note that, contrary to the previous HPT methods we have seen, in which sequential trials can run on the same node with little delay, sequential runs in SageMaker HPT may incur some start-time overhead. Recently, this overhead was dramatically reduced, by up to 20x(!!), by reusing the same (warm) training instances for subsequent trials. This removes the overhead of requesting (and waiting for) an available instance from Amazon EC2, as well as the overhead of pulling the desired Docker image. However, some startup steps, such as downloading the source code and input data, are still repeated for each trial.

For more details on SageMaker HPT see the API documentation as well as this cool post that shows how to overcome potential limitations on the number of instances per HPT job in order to run up to 10,000 trials.

Results

Our SageMaker HPT job ran for roughly 37 minutes and produced the results summarized in the screen capture below from the SageMaker console:

Screen Capture of SageMaker Console (by Author)

Option 4: Wrapping Managed Training Experiments with HPT

In this final method, our HPT algorithm runs locally (or on a cloud based notebook instance) and each experiment is an independently spawned, cloud based training job. The key ingredient of this type of solution is a reporting mechanism that enables the HPT algorithm to track the progress of each of the cloud based training jobs. We will provide a brief example of how to do this using SageMaker training metrics. Subsequently, we will demonstrate a full-fledged example of this method using the Syne Tune library.

In the code block below, we define a launch_experiment function and wrap it with a simple Ray Tune experiment. The launched jobs are defined with the learning rate value chosen by the HPT algorithm and with SageMaker metric_definitions that are designed to collect the evaluation accuracy output prints from the HuggingFace Trainer API. When the training job completes, the collected metric is extracted and reported to the Ray Tune session. The SageMaker estimator is also configured with the keep_alive_period_in_seconds flag to take advantage of the new SageMaker support for warm instance pools.

def launch_experiment(config):
    # define estimator
    from sagemaker.pytorch import PyTorch
    estimator = PyTorch(
        entry_point='train.py',
        source_dir='./',  # contains train.py and requirements file
        role=<role>,
        instance_type='ml.g4dn.xlarge',
        instance_count=1,
        py_version='py38',
        framework_version='1.12',
        hyperparameters={"lr": config['lr']},
        keep_alive_period_in_seconds=240,
        metric_definitions=[
            {
                "Name": "accuracy",
                "Regex": "'eval_accuracy': ([0-9\\.]+)"
            }]
    )

    # train (SageMaker job names must be unique and may not contain underscores)
    import uuid
    job_name = f'tune-model-{uuid.uuid4().hex[:8]}'
    estimator.fit(job_name=job_name)

    # use boto3 to access the completed job
    import boto3
    search_params = {
        "MaxResults": 1,
        "Resource": "TrainingJob",
        "SearchExpression": {
            "Filters": [
                {"Name": "TrainingJobName",
                 "Operator": "Equals",
                 "Value": job_name},
            ]
        },
    }
    smclient = boto3.client(service_name="sagemaker")
    res = smclient.search(**search_params)

    # extract final metric
    metrics = res["Results"][0]["TrainingJob"]["FinalMetricDataList"]
    accuracy = metrics[[x["MetricName"]
                        for x in metrics].index("accuracy")]["Value"]

    # report metric to ray
    from ray.air import session
    session.report({"accuracy": accuracy})


# configure a local Ray Tune HPT job
from ray import tune
from ray.tune.search.bayesopt import BayesOptSearch

config = {
    "lr": tune.uniform(1e-6, 1e-1),
}
bayesopt = BayesOptSearch(
    metric="accuracy",
    mode="max")
tuner = tune.Tuner(
    tune.with_resources(
        tune.with_parameters(launch_experiment),
        resources={"cpu": 1}),
    tune_config=tune.TuneConfig(num_samples=2,
                                max_concurrent_trials=1,
                                search_alg=bayesopt,
                                ),
    param_space=config,
)
results = tuner.fit()
best_result = results.get_best_result("accuracy", "max")
print("Best trial config: {}".format(best_result.config))
print("Best final validation accuracy: {}".format(
    best_result.metrics["accuracy"]))

In this simple example, we collected just the final reported metric value. In practice, we will want to monitor the metric throughout each experiment in order to terminate the ones that are underperforming. This requires a more sophisticated solution with the ability to collect and compare metrics from multiple jobs, early-stop failing jobs, and start new jobs on freed up instances. For this we adopt Syne Tune, a Python library that supports large-scale distributed hyperparameter optimization where experiments can be run both locally and in the cloud. Syne Tune has built-in support for training using Amazon SageMaker and can be enhanced to support other managed training environments. Syne Tune supports a wide variety of popular HPT algorithms and can be easily extended to support additional ones. For more details, please see the Syne Tune announcement.

The following code block contains a HuggingFace TrainerCallback that implements the metric report required by Syne Tune. This callback must be added to the list of callbacks of the HuggingFace Trainer object in our train function.

from transformers import TrainerCallback

class SyneReport(TrainerCallback):
    def __init__(self) -> None:
        from syne_tune.report import Reporter
        super().__init__()
        self.reporter = Reporter()

    def on_evaluate(self, args, state, control, metrics, **kwargs):
        self.reporter(epoch=int(state.epoch),
                      loss=metrics['eval_loss'],
                      accuracy=metrics['eval_accuracy'])
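For completeness, the snippet below is a minimal sketch of how the callback might be attached, assuming our train function constructs a HuggingFace Trainer object as in Part 1 of this post. The helper name is ours for illustration; the registration itself uses the standard Trainer.add_callback API (the callback can equivalently be passed via the callbacks argument of the Trainer constructor):

from transformers import Trainer

def add_syne_tune_reporting(trainer: Trainer) -> Trainer:
    # register the callback so that every evaluation round is reported to Syne Tune
    trainer.add_callback(SyneReport())
    return trainer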

The code block below demonstrates how to run HPT of our toy example using the HyperBand algorithm with the same settings as in the previous sections. Once again we define the keep_alive_period_in_seconds flag to take advantage of warm instance pools.

# define the estimator
from sagemaker.pytorch import PyTorch
estimator = PyTorch(
    entry_point='train.py',
    source_dir='./',  # contains train.py and requirements file
    role=<role>,
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    py_version='py38',
    framework_version='1.12',
    keep_alive_period_in_seconds=240,
)

# define the search space
from syne_tune.search_space import loguniform
max_epochs = 8
config = {
    "lr": loguniform(1e-6, 1e-1),
}

# define the HyperBand scheduler
from syne_tune.optimizer.schedulers.hyperband import (
    HyperbandScheduler
)
scheduler = HyperbandScheduler(
    config,
    max_t=max_epochs,
    resource_attr='epoch',
    searcher='random',
    metric="accuracy",
    mode="max",
    reduction_factor=2
)

# define and run the tuner with a SageMaker backend
from syne_tune.tuner import Tuner
from syne_tune.stopping_criterion import StoppingCriterion
from syne_tune.backend.sagemaker_backend.sagemaker_backend import (
    SagemakerBackend
)
tuner = Tuner(
    backend=SagemakerBackend(sm_estimator=estimator),
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_num_trials_started=32),
    n_workers=8,
    tuner_name="synetune"
)
tuner.run()
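When the run completes, Syne Tune persists the tuning results locally. The following is a minimal sketch for loading and inspecting them, assuming the load_experiment utility from syne_tune.experiments and the tuner object defined above:

from syne_tune.experiments import load_experiment

# Syne Tune appends a timestamp suffix to the tuner_name we provided;
# tuner.name holds the full experiment name
tuning_experiment = load_experiment(tuner.name)
print(tuning_experiment.best_config())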

Note that, contrary to our previous methods, Syne Tune does not (at the time of this writing and to the best of our knowledge) include an option for fixing the total number of experiments. A Syne Tune job is stopped using a StoppingCriterion, which can be based on the amount of time that has passed, the number of experiments that have started, the number of experiments that have completed, and more. Using the StoppingCriterion we can configure the tuning job to exit after 32 experiments have completed, but so long as this criterion has not been met, Syne Tune will continue to spawn additional (wasteful) jobs.
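For reference, here is a minimal sketch of how alternative stopping criteria might be configured, assuming the StoppingCriterion fields available at the time of this writing (max_num_trials_completed and max_wallclock_time):

from syne_tune.stopping_criterion import StoppingCriterion

# stop once 32 trials have completed (additional trials may still be spawned until then)
stop_on_completed = StoppingCriterion(max_num_trials_completed=32)

# alternatively, bound the overall tuning time, e.g. one hour of wallclock time
stop_on_time = StoppingCriterion(max_wallclock_time=3600)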

Pros and Cons

Wrapping SageMaker experiments with an HPT solution such as Syne Tune offers a great deal of freedom in terms of algorithm choice, in contrast to the SageMaker HPT method. And since each experiment is an individual training job, this method supports auto-scaling and is conducive to the use of spot instances.

On the downside, compared to SageMaker HPT, this method requires learning and adopting an HPT framework. As in the case of the SageMaker HPT method, there is a concern regarding the start-up time of new trials, but this is mitigated by our use of warm instance pools.

Results

Our Syne Tune job ran for roughly half an hour and output the following results:

accuracy: best 0.7969924812030075 for trial-id 7 (lr=0.000027)

Summary

The table below summarizes our subjective experience with the different methods we covered:

Subjective Summary of Attributes of Cloud Based HPT Options (by Author)

Note that in this summary we mark the SageMaker HPT and Syne Tune methods as suffering from high overhead. In practice, the new support for warm instance usage reduces the overhead to the point where you might find it negligible. Nevertheless, we chose to highlight the distinction from the cluster based solutions where sequential experiments run in the same instance session.

How to Choose an HPT Option

The best option will likely depend on the details of your project and your own personal preferences. For example, if you already have a reliable orchestration solution in place that supports training, you may choose to extend it to support HPT (option 1). If you despise orchestration based solutions and are unconcerned with the possibility of idle instances, you might prefer running HPT within a SageMaker instance cluster (option 2). If your priority is reducing cost and you are perfectly content with a limited set of HPT algorithms, then you should probably run SageMaker HPT on spot instances (option 3). If you want to run an advanced scheduling algorithm with a high degree of variance in the number of parallel trials, Syne Tune (option 4) might be your best option. Of course, the best option for you might also be one that is not covered in this post.

Please feel free to reach out with comments, questions, and corrections. In the meantime… Happy Tuning!!


I am a Machine Learning Algorithm Developer working on Autonomous Vehicle technologies at Mobileye. The views expressed in my posts are my own.