
6 Steps to Migrating Your Machine Learning Project to the Cloud

How to use cloud training resources to scale up your training capacity


Photo by Jeremy Bishop on Unsplash

Whether you are an algorithm developer in a growing startup, a data scientist in a university research lab, or a Kaggle hobbyist, there may come a point when the training resources that you have onsite no longer meet your training demands. In this post we target development teams that are (finally) ready to move their machine learning (ML) workloads to the cloud, and we discuss some of the important decisions that need to be made during this big transition. Naturally, any attempt to encompass all of the steps of such an endeavor is doomed to fail. Machine learning projects come in many shapes and forms, and as their complexity increases so does the undertaking of a change as significant as migrating to the cloud. We will therefore highlight what we believe to be some of the most important considerations common to typical deep learning projects.

We will demonstrate the steps that we lay out on a simple pytorch cifar10 model taken from this pytorch tutorial, using pytorch version 1.9. While our demonstration is limited to a specific machine learning framework and framework version, the general points we make are relevant to other frameworks and versions as well.

I would like to acknowledge Yitzhak Levi for his contributions to this post.

Why Train in the Cloud?

Before diving into the migration steps let’s review some of the main advantages to training in the cloud:

  1. Accessibility: Cloud services offer access to a wide variety of machine learning training environments. A training environment has two components: 1. the hardware system(s) and 2. the supporting software stack. Cloud services typically support a long list of hardware instance types that differ in parameters such as: the type and number of CPU cores, the type and number of training accelerators, network bandwidth, memory sizes, disk sizes, and more. The wide selection enables you to choose the instance type that best suits your project needs. Furthermore, cloud services provide a long list of pre-configured software images supporting a wide variety of popular training frameworks. Not only can these software images save you a considerable amount of system configuration time, but they also often include optimizations specifically tuned to the underlying cloud environment.
  2. Scalability: Contrary to a local training environment where you are limited by the number of training instances in your possession, a cloud based training environment enables virtually infinite scalability. This enables the option of accelerating your training project by distributing single training jobs over multiple machines and/or running multiple training experiments in parallel.
  3. Storage Infrastructure: The driving factor behind your desire to migrate to the cloud may come from training-related infrastructure needs other than the computation instance. For example, your training job might use datasets so large that they require a storage infrastructure you cannot support in-house. Once you have moved your data to the cloud, you may find it preferable to train in an environment that is co-located with your training data rather than to stream the data back into your own on-premise environment.

Given these (and other) advantages, it is no wonder that there is a growing trend towards Cloud Training. However, it is important to be aware of the adaptations that you will need to make to your training applications in order to migrate to the cloud. In the next sections we will attempt to summarize what we view as some of the critical steps in making this transition successful.

  1. Choosing a training cloud service
  2. Choosing a training software image
  3. Adapting your program to access training data in the cloud
  4. Scaling your program to run on multiple workers
  5. Adapting your program to run performance analysis in the cloud
  6. Monitoring cloud based training experiments

Step 1 – Choosing a Cloud Service

Given the large number of cloud service providers and the ever increasing list of AI related offerings, it is easy to see how one might find the task of choosing the best training service overwhelming. Among the most popular cloud service providers are Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), each of which offers multiple different options for performing training. One might say that the difficulty of this step stems precisely from this abundance of options.

The differences between the many options can be divided into two types: 1. differences that can be attributed to documented features of the cloud services, and 2. differences that are difficult to anticipate and that you are likely to identify only once you actually test out the cloud services.

Documented differences might include parameters such as the cost of the service, the list of supported instance types, and the list of supported training frameworks. For example, you might find that the instance type most suited to your workload is only offered by one service provider.

Differences that are hard to anticipate might include the availability of the instance type of your choice, the speed at which your data can be streamed from storage, or limitations on scaling to multi-instance training. For example, you might find that the rate of streaming data from cloud storage is measurably faster when using one training service than it is when using another service.

Unfortunately, the differences that are hard to anticipate might have a meaningful impact on your training performance and thus may be decisive in identifying the training option most suitable for you.

Managed vs. Un-managed Training Services

The large variety of cloud based machine learning services enables developers to interface at a wide range of levels of abstraction. On one end of the spectrum you will find services that provide nothing more than a bare metal training instance, leaving the task of full instance configuration and software stack creation to the user. On the other end of the spectrum you will find managed services in which the user can delegate many elements of the training project, such as the choice of training algorithm or even the creation of training data, to the service provider. To highlight some of the differences between managed and un-managed services, we will provide an example of how each service type might be used.

Managed Training Example: To perform a managed training session using a service such as Amazon SageMaker or Google Vertex AI you would use the service API to specify details such as the training instance type, the number of desired instances, the framework and version of choice, the location of the training code, and the path to which to upload the training results. The managed service will proceed to start up the requested systems and automatically download and run the training code. When the training is concluded, the resultant artifacts will be uploaded to the preconfigured location and the training instance(s) will be terminated. A record of the session will be logged, capturing details such as pointers to the training code and resultant artifacts. These records can be used to analyze usage patterns of the training service and facilitate better cost management and governance.

Managed services might provide additional features to facilitate the end to end machine learning pipeline. For example, Amazon SageMaker includes a wide range of features and tools such as Amazon SageMaker Ground Truth for labeling data, Amazon SageMaker Debugger for monitoring and profiling training sessions, Amazon SageMaker Studio for comparing training experiments, and Amazon SageMaker Neo for optimizing models for deployment.

In the code block below we provide an example of how to instantiate a managed cifar10 pytorch training job using Amazon SageMaker’s Python API.

from sagemaker.pytorch import PyTorch

estimator=PyTorch(entry_point='cifar10.py',
                    role=<AWS IAM role>,
                    py_version='py3',
                    framework_version='1.9.0', #pytorch version
                    instance_count=1,
                    instance_type='ml.g4dn.xlarge',
                    )
estimator.fit()

Un-managed Training Example: One way to run an un-managed training session, sketched in the code block after this list, would be to:

  1. Use AWS EC2 or Google’s Compute Engine to create one or more training instances with a machine learning SW image of your choosing.
  2. Connect to the training instance over SSH, manually configure the environment, download the training code, and start the training session.
  3. Manually monitor the training progress and shutdown the instance upon completion.
  4. Optionally record the training session details for purposes of experiment reproduction, usage pattern analysis, or governance.
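
As a rough illustration of the first step, the snippet below uses boto3 to launch a single GPU training instance. This is only a minimal sketch; the AMI ID, key pair name, and region are placeholders that you would replace with your own choices (for example, a Deep Learning AMI matching your framework version).

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# launch a single training instance from a (placeholder) machine learning AMI
response = ec2.run_instances(
    ImageId='<your ML AMI of choice>',
    InstanceType='g4dn.xlarge',
    KeyName='<your SSH key pair>',
    MinCount=1,
    MaxCount=1)
instance_id = response['Instances'][0]['InstanceId']
print('launched training instance ' + instance_id)

From here you would connect to the instance over SSH and proceed with steps 2 through 4 manually.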

Clearly, the un-managed option requires a great deal of manual intervention. At the same time the direct access to the training machines enables a great deal of flexibility. This is contrary to the managed option which typically places restrictions on access to the training machines.

Training Orchestration: Training orchestration refers to the automated deployment and management of machine learning training jobs. There are dozens of tools that support orchestrating cloud based training jobs, varying in cost and in the details of their feature support. A training orchestration tool can be used to build a managed training environment that is specifically tailored to your needs. You can design your environment in a manner that provides the advantages of both a managed service, such as automation and experiment tracking, and an un-managed service, such as the freedom to connect directly to the training instances. Naturally, creating and maintaining a personalized managed environment may require a non-trivial effort.

Choose an Option One Must

Despite all of our lengthy explanations, you may still feel somewhat lost as to how to choose the right training service. This decision is complicated by the potential unknowns, as we discussed above. Here is a strategy that one of our development teams might adopt. Borrowing a term from the world of Reinforcement Learning, I like to call it the Exploration Exploitation strategy for choosing a cloud based training service.

  1. Choose from one of the many options based on your current (limited) knowledge. This decision might be determined by your level of comfort or by cost considerations.
  2. Try to design your code in a way that minimizes the dependency on any one option and maximizes your ability to easily transition from one service to another (see the sketch after this list).
  3. Identify the weaknesses of your current solution and the opportunities to increase efficiency.
  4. Dedicate a portion of your development time towards exploration of the alternative training service options. In the event that an improved option is identified, return to step 1.
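
To make step 2 a bit more concrete, one approach is to hide the service-specific launch code behind a thin interface of your own. The sketch below is only an illustration of the idea, with a single SageMaker backed implementation; the class names are our own inventions and the estimator settings simply mirror the example above.

class TrainingLauncher:
    """A minimal service-agnostic interface for launching training jobs."""
    def launch(self, entry_point, instance_type, data_path):
        raise NotImplementedError

class SageMakerLauncher(TrainingLauncher):
    def launch(self, entry_point, instance_type, data_path):
        from sagemaker.pytorch import PyTorch
        estimator = PyTorch(entry_point=entry_point,
                            role='<AWS IAM role>',
                            py_version='py3',
                            framework_version='1.9.0',
                            instance_count=1,
                            instance_type=instance_type)
        estimator.fit(data_path)

# switching training services then amounts to swapping in a new launcher
launcher = SageMakerLauncher()
launcher.launch('cifar10.py', 'ml.g4dn.xlarge', '<path to training data in S3>')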

Step 2 – Choosing a Training SW Image

Regardless of your choice of training service, the configuration APIs will include controls for determining the SW training image. Depending on the service this might be a virtual machine image (such as an Amazon Machine Image (AMI) or a Google Compute Engine image) or a Docker image. As noted above, cloud services provide a long list of pre-configured software images. These vary based on OS type and version, driver versions, package ingredients, and more. In most cases, our preference is to use one of the official images offered by the training service we have chosen. Not only do these images undergo rigorous testing, but they also often include optimizations specifically designed for the underlying cloud environment. However, you might find that none of the images quite fit your needs. You might require a specific CUDA driver installation or a specific software installation package. Alternatively, you might already have a finely cultivated Docker image that you wish to carry over to the cloud. Many cloud services support the creation and use of custom images.

Even in the case that you opt for a custom image, we have found it safest to derive the custom image from an existing official service image. Here we demonstrate a case where we extend an official AWS pytorch Docker image with the latest version of opencv and with a homegrown python utilities package.

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.0-gpu-py38-cu111-ubuntu20.04
RUN apt-get update && apt-get -y install python3-opencv
COPY my_utils.pkg ./my_utils.pkg
RUN pip install my_utils.pkg

Once you have created your Docker image, you will need to upload it to a cloud registry such as Amazon ECR or Google Container Registry in order to use it. See the documentation for more details. In the block below we have modified our Amazon SageMaker startup script to use the custom image we created.

from sagemaker.pytorch import PyTorch

estimator=PyTorch(entry_point='cifar10.py',
                    role=<AWS IAM role>,
                    image_uri=<path to image in ECR>, 
                    instance_count=1,
                    instance_type='ml.g4dn.xlarge',
                    )
estimator.fit()

Step 3 – Accessing Your Training Data in the Cloud

The next decision you will need to make when migrating your machine learning application to the cloud is how you will access your training data. In this post we will assume that your dataset is maintained by a cloud based object storage solution such as Amazon S3 or Google Storage. The simplest way to access your data would be to download your full training dataset onto the training instance’s local disk and proceed to train precisely as you would in your local environment. However, your dataset might be too large to fit on a local disk or you may be reluctant to let your training instance remain idle for the duration of the time that it would take to download the full dataset. Cloud services provide a number of ways to interface with data residing in cloud storage (a few of which we have covered in this post). Regardless of the option you choose it is likely that you will need to introduce some changes to your code.
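
For reference, the simplest option mentioned above, downloading the full dataset to the local disk before training, might look roughly like the sketch below. The bucket name, prefix, and helper name are our own placeholders.

import os
import boto3

def download_dataset(bucket, prefix, local_dir):
    # copy every object under the given S3 prefix to the local disk
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            target = os.path.join(local_dir,
                                  os.path.relpath(obj['Key'], prefix))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file(bucket, obj['Key'], target)

download_dataset('my-bucket', 'cifar-training-data/', '/tmp/cifar')

Keep in mind that the time (and local disk space) this requires grows with the size of the dataset, which is precisely the motivation for the streaming alternatives we turn to next.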

We will demonstrate the type of adaptation required to your code by modifying our cifar10 pytorch example to pull data from a dataset that is stored in Amazon S3. The method we will use for interfacing with the object storage is the recently announced Amazon SageMaker Fast File Mode. Fast File Mode (FFM) exposes the data in S3 to the machine learning application in such a way that it appears to be accessing a local file system. This provides the convenience of accessing the data as if it were stored locally, without the overhead and cost of actually downloading it before training. To adapt our program to use Fast File Mode, small adjustments are required to both the data loading portion of the training script and the Amazon SageMaker training job instantiation script. The changes are highlighted in the code blocks below. For this example, we assume that the training data is located at s3://my-bucket/cifar-training-data/.

Training script:

import os
import torch
import torchvision
import torchvision.transforms as transforms

# the training channel is exposed at the path stored in SM_CHANNEL_TRAINING
dataroot=os.environ["SM_CHANNEL_TRAINING"]
transform=transforms.Compose([transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
batch_size=4

trainset=torchvision.datasets.CIFAR10(root=dataroot, train=True,
                                      transform=transform)
trainloader=torch.utils.data.DataLoader(trainset,                 
              batch_size=batch_size, shuffle=True, num_workers=2)

testset=torchvision.datasets.CIFAR10(root=dataroot, train=False,
                                     transform=transform)
testloader=torch.utils.data.DataLoader(testset, 
              batch_size=batch_size, shuffle=False, num_workers=2)

SageMaker startup script:

from sagemaker.pytorch import PyTorch

estimator=PyTorch(entry_point='cifar10.py',
                    role=<AWS IAM role>,
                    py_version='py3',
                    framework_version='1.9.0', #pytorch version
                    instance_count=1,
                    instance_type='ml.g4dn.xlarge',
                    input_mode='FastFile'
                    )
estimator.fit("s3://my-bucket/cifar-training-data/")

For more on the topic of Fast File Mode check out this recent post.

It should be noted that the details of how you store a large dataset, including how your data is partitioned and the format in which it is stored, require careful design, as they can have a meaningful impact on the speed at which the data can be accessed. You can find more details on this topic in this blog post.
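
As a simple illustration of why the storage format matters, grouping many small samples into a smaller number of larger shard files generally reduces the per-object overhead of reading from object storage. The sketch below is one arbitrary way to do this; the shard size and file naming are our own choices, not a recommendation for any particular format.

import pickle

def write_shards(samples, shard_size=10000, prefix='train-shard'):
    # group individual samples into larger shard files so that the
    # training instance reads a few large objects instead of many small ones
    for start in range(0, len(samples), shard_size):
        shard = samples[start:start + shard_size]
        with open('%s-%05d.pkl' % (prefix, start // shard_size), 'wb') as f:
            pickle.dump(shard, f)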

Step 4 – How to Use Multiple Workers to Accelerate Training

As discussed above, one of the advantages of migrating to the cloud is the opportunity it opens up to scale your training job to run on multiple workers in parallel. One common way to do this is to perform data distributed training. In data distributed training, each worker performs a training step on a different subset (local batch) of the training data. It then publishes its resultant gradients and updates its own model taking into account the combined knowledge learned by all of the workers. By running data distributed training over N workers you can potentially accelerate your training by up to a factor of N. Inevitably, distributing your training over multiple workers will also require adaptations to your code. We will demonstrate this on our simple cifar10 pytorch example. There are a number of popular frameworks for implementing distributed training in pytorch, most notably the built-in Distributed Data Parallel library and Horovod (https://horovod.readthedocs.io/en/stable/). Here we demonstrate the changes required to our two scripts in order to run distributed training on a training instance with 4 GPUs using Horovod.

The code block below includes the changes required to the optimizer definition and the training loop. For more details on pytorch distributed training using Horovod, check out the official Horovod pytorch tutorial.

import os
import torch
import torch.nn as nn
import torch.optim as optim
import horovod.torch as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
torch.cuda.set_device(hvd.local_rank())
net.cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
# Add Horovod Distributed Optimizer
optimizer = hvd.DistributedOptimizer(optimizer,  
                         named_parameters=net.named_parameters())

# Broadcast parameters from rank 0 to all other processes.
hvd.broadcast_parameters(net.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data[0].cuda(), data[1].cuda()
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0
if hvd.rank() == 0:
    # save checkpoint to a location that will be automatically
    # uploaded to S3
    model_dir = os.environ.get('SM_MODEL_DIR', '/tmp')
    torch.save(net.state_dict(), 
               os.path.join(model_dir, 'cifar_net.pth'))
print('Finished Training')
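
As an aside, if you prefer pytorch’s built-in Distributed Data Parallel library mentioned above over Horovod, the equivalent model wrapping might look roughly like the sketch below. It assumes that the process launcher sets the standard RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and LOCAL_RANK environment variables, and that net is the same cifar10 model as above.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# initialize the default process group (NCCL backend for GPU training)
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# wrap the model; gradients are averaged across the processes automatically
net = net.cuda()
net = DDP(net, device_ids=[local_rank])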

The following code block demonstrates how to use Amazon SageMaker’s API to start up a Horovod job:

from sagemaker.pytorch import PyTorch
distribution={
            'mpi': {
                'enabled': True,
                'processes_per_host': 4,
                'custom_mpi_options': '-x NCCL_DEBUG=INFO'
            }
        }
estimator=PyTorch(entry_point='cifar10.py',
                    role=<AWS IAM role>,
                    py_version='py3',
                    framework_version='1.9.0', #pytorch version
                    instance_count=1,
                    instance_type='ml.g4dn.12xlarge',#4 GPU instance
                    input_mode='FastFile',
                    distribution=distribution
                  )
estimator.fit("s3://my-bucket/cifar-training-data/")

Note that we have relied on the shuffling operation on the input dataset to ensure that each of the four Horovod training processes receives different training batches. Alternatively, you could implement a data sharding scheme in which the input data is divided into 4 disjoint data subsets, one for each training process, as sketched below. If you choose to shard your data you should be aware that some sharding implementations download/stream the entire dataset to each training process and only then shard, rather than sharding at the data source. This can overload your training resources unnecessarily and result in wasted costs.
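
One way to shard at the data loader level is pytorch’s DistributedSampler, as in the sketch below. Keep in mind that this partitions the sample indices; whether the underlying data is read lazily or first downloaded in full still depends on the dataset implementation.

from torch.utils.data.distributed import DistributedSampler

# assign each Horovod process a disjoint subset of the training samples
train_sampler = DistributedSampler(trainset,
                                   num_replicas=hvd.size(),
                                   rank=hvd.rank(),
                                   shuffle=True)
# (optionally call train_sampler.set_epoch(epoch) at the start of each epoch)
trainloader = torch.utils.data.DataLoader(trainset,
                                          batch_size=batch_size,
                                          sampler=train_sampler,
                                          num_workers=2)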

In this post we have not addressed the adaptations that may be required to the optimizer settings in order to converge on the increased global batch size. For more details on distributed training, including some of the challenges you might face, check out the following post.

A Guide to (Highly) Distributed DNN Training

Step 5 – How to Analyze Your Runtime Performance

It is no secret that training machine learning models can be expensive. One of the primary ways in which you, the ML application developer, can reduce cost is to seek opportunities for improving training efficiency by conducting regular analyses of the speed at which the training is running and the manner in which the system resources are being utilized. While this is a good habit in any training environment, it is of critical importance when training in the cloud, where the cost implications of training inefficiencies are immediate. If your training code does not already include hooks for analyzing runtime performance, now is the time to incorporate them. There are a number of different tools and methodologies for conducting performance analysis. Here we will demonstrate how we modify our cifar10 training loop to use pytorch’s built-in performance profiling hook. We have configured the profiler to store the results in Amazon S3 cloud storage. For simplicity we have programmed the script to capture the profiling data on a single GPU.

# configure profile on rank 0 only
active = 3 if hvd.rank()==0 else 0
log_dir = os.path.join(model_dir, 'profile')
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1,warmup=1, 
                                     active=active,repeat=1),     
    on_trace_ready=torch.profiler.tensorboard_trace_handler(
                                              log_dir),
    activities=[torch.profiler.ProfilerActivity.CPU,   
                torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True) as prof:
    for epoch in range(2):  # loop over the dataset multiple times
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data[0].cuda(), data[1].cuda()
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            prof.step()
            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:    # print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0

The image below shows an example of the profiling summary in TensorBoard. For details on how to analyze the profiling results in TensorBoard, check out this pytorch profiling tutorial.

Profiling Summary (from TensorBoard Profiler Tutorial)

Many cloud services collect training system utilization metrics and expose them to the user (e.g. via Amazon CloudWatch or Google Cloud Monitoring). These can be used to identify training inefficiencies and make appropriate changes. For example, you may find that your GPU (generally the most important training resource) is largely idle and deduce that a different training instance type (with lower GPU power) might suit your needs better. Here is an example of how the system utilization metrics appear in the Amazon SageMaker web interface:

Instance Metrics (from Amazon SageMaker console)

Amazon SageMaker offers additional tools for profiling performance as detailed in this blog post.
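
If you prefer to pull these utilization metrics programmatically rather than view them in the console, a sketch along the lines of the one below might serve as a starting point. Note that the namespace, metric name, and dimension value are assumptions about how SageMaker publishes training instance metrics and should be verified against the CloudWatch documentation.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# pull the average GPU utilization of a training job over the last hour
# (namespace, metric, and dimension names are assumptions to verify)
stats = cloudwatch.get_metric_statistics(
    Namespace='/aws/sagemaker/TrainingJobs',
    MetricName='GPUUtilization',
    Dimensions=[{'Name': 'Host', 'Value': '<training job name>/algo-1'}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=['Average'])
print(stats['Datapoints'])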

Step 6 – How to Monitor Your Training Experiments

During the course of a machine learning training project, you might run dozens of different experiments with different architectures, input data, or hyper-parameters. In order to glean useful information from these experiments and be able to make improvements, it is essential that you have the appropriate tools for comparing them. Monitoring in a cloud based training environment requires collecting evaluation metrics from all of the independent training experiments and storing them in a centralized location. One of the key design decisions you will need to make is whether to push or pull the metric data. In a push based solution, each of the experiments is programmed to push its metrics to a predefined centralized monitoring service. While this may seem like the most straightforward option, it may require sophisticated design of cloud networking and cloud permissions. In a pull based solution, each experiment uploads its metrics to cloud storage and the monitoring service is programmed to pull the metrics from there. While this does not require any special cloud configuration, you may find it to be less convenient.

There are a great many tools on the market for monitoring cloud based training experiments. These tools include Comet.ml, Neptune, Weights & Biases, Sacred, MLflow, Guild AI, cnvrg.io, ClearML, and many, many more. These differ based on cost and the details of their feature offerings.

Automated Early Stopping

In a typical machine learning project you may find yourself running multiple experiments in parallel with the expectation that some may fail. In a cloud based environment it is highly desirable to automatically detect and terminate failing experiments early. This could potentially reduce cost considerably. Integrating automated monitoring will require you to program rules for detecting common training failures such as exploding or vanishing gradients, non-decreasing loss, etc.
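
For example, a rule for the non-decreasing loss case might track the best loss seen so far and signal a halt after a fixed patience window with no improvement. The sketch below is one possible formulation; the class name and default thresholds are our own arbitrary choices.

class LossPlateauRule:
    """Signal a halt when the loss has not improved for `patience` steps."""
    def __init__(self, patience=2000, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.steps_since_improvement = 0

    def should_halt(self, loss):
        # reset the counter whenever the loss improves meaningfully
        if loss < self.best - self.min_delta:
            self.best = loss
            self.steps_since_improvement = 0
        else:
            self.steps_since_improvement += 1
        return self.steps_since_improvement >= self.patience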

In the code block below we provide an example of how to incorporate monitoring and early stopping into our cifar10 pytorch script. For simplicity we use TensorBoard for comparing experiment metrics.

import numpy as np
from torch.utils.tensorboard import SummaryWriter
# choose path in S3 so that metrics can be tracked during training
writer = SummaryWriter(<predetermined path in S3>)
halt_training = False
for epoch in range(2):  # loop over the dataset multiple times
    if halt_training:
        break
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data[0].cuda(), data[1].cuda()
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        # terminate training if loss is invalid
        if np.isnan(loss.item()) or np.isinf(loss.item()):           
            print('Invalid loss, terminating training')
            halt_training = True
            break                   
        loss.backward()
        optimizer.step()
        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            if hvd.rank()==0:
                # log the running loss before resetting it
                writer.add_scalar('training loss',
                            running_loss / 2000,
                            epoch * len(trainloader) + i)
            running_loss = 0.0

In this case we have programmed the script to upload the training metrics to cloud storage. To visualize and compare them to other experiments, you will need to pull the results into TensorBoard. Here is an example of the TensorBoard metric scalar visualization.

Experiment Comparison in TensorBoard (by author)

For more details on using TensorBoard in pytorch, check out this tutorial.

Amazon SageMaker offers dedicated tools for monitoring and automated stopping. See here (https://towardsdatascience.com/upgrade-your-dnn-training-with-amazon-sagemaker-debugger-d302fab5ee61) and here for more details.

Summary

Cloud based training offers a degree of flexibility and scalability that would be virtually impossible to recreate in your local environment. In this post we have gone through some of the steps that are required to make your transition to the cloud successful. I believe that aside from the mechanics of this transition, training in the cloud requires a shift in mindset and an adjustment to your basic development habits, as discussed in a previous blog post.

6 Development Habits for Increasing Your Cloud ML Productivity

For me, thoughts of clouds in the sky conjure up feelings of excitement, innovation, and endless opportunity. I have found such feelings to be no less appropriate for describing the potential of cloud ML. I hope you do too.

