Customizing Your Cloud-Based Machine Learning Training Environment — Part 1

How to leverage the power of the cloud without compromising development flexibility

Chaim Rand
Towards Data Science


Cloud-based machine learning (ML) services offer a great number of conveniences to the AI developer, perhaps none as important as the access they provide to a wide variety of fully provisioned, fully functional, and fully maintained ML training environments. For example, managed training services, such as Amazon SageMaker and Google Vertex AI, enable users to specify (1) the desired instance types (e.g., with the latest available GPUs), (2) an ML framework and version, (3) a code source directory, and (4) a training script, and will automatically start up the chosen instances with the requested environment, run the script to train the AI model, and tear everything down upon completion. Among the advantages of such offerings is the potential for significant savings in the time and cost of building and maintaining your own training cluster. See here for more on the benefits and considerations of cloud-based training as well as a summary of some of the common steps required to migrate an ML workload to the cloud.

However, coupled with the convenience of using a predefined, fully provisioned, and fully validated ML environment comes the potential for limitations on development flexibility. This is in contrast to a local, “on-prem” environment that you can freely define to your heart’s desire. Here are a few scenarios that demonstrate the potential limitations:

  1. Training dependencies: It is easy to conceive of training flows that have dependencies on specific packages (e.g., Linux packages or Python packages) that might not be included in the predefined environments provided by your cloud service of choice.
  2. Development platform independence: You may desire (or require) a development environment that is independent of the underlying runtime platform. For example, you may want to have the ability to use the same training environment regardless of whether you are running on your own PC, on a local (“on-prem”) cluster, or in the cloud, and regardless of the cloud service provider and cloud service you choose. This can reduce the overhead of needing to adapt to multiple environments and might also facilitate debugging issues.
  3. Personal development preferences: Engineers (especially seasoned ones) can be quite particular about their development habits and environments. The mere possibility of a limitation introduced by a change to the development process (e.g., migrating to cloud-based ML), however valuable and important that change may be to the team, might stir up significant resistance.

In this two-part blog post we will cover some of the options available for overcoming the potential development limitations of cloud-based training. Specifically, we will assume that the training environment is defined by a Docker image containing a Python environment and demonstrate a few methods for customizing the environment to meet our specific needs.

For the sake of demonstration, we will assume that the training service of choice is Amazon SageMaker. However, please note that the general methods apply equally to other training services as well and that this choice should not be viewed as an endorsement of one cloud-based service over any other. The best option for you will likely depend on many factors including the details of your project, budgetary considerations, and more.

Throughout the blog we will reference and demonstrate certain library APIs and behaviors. Note that these are accurate as of the time of this writing and are subject to change in future versions of the libraries. Please be sure to refer to the most up-to-date official documentation before relying on anything we write.

I would like to thank Yitzhak Levi whose experimentation formed the basis of this blog post.

Managed Training Example

Let’s begin with a simple demonstration of training in the cloud using a managed service. In the code block below we use Amazon SageMaker to run the train.py PyTorch (1.13.1) script from the source_dir folder on a single ml.g5.xlarge GPU instance. Note that this is a rather simple example of training with Amazon SageMaker; for more details be sure to see the official AWS documentation.

from sagemaker.pytorch import PyTorch

# define the training job
estimator = PyTorch(
    entry_point='train.py',
    source_dir='./source_dir',
    framework_version='1.13.1',
    role='<arn role>',
    py_version='py39',
    base_job_name='demo',
    instance_type='ml.g5.xlarge',
    instance_count=1
)

# deploy the job to the cloud
estimator.fit()

To get a better idea of the potential limitations on development flexibility when training in the cloud and how to overcome them, let’s review the steps that occur behind the scenes when deploying a training job using a service such as Amazon SageMaker.

What Happens Behind the Scenes

We will focus here on the primary steps that occur; for the full details please see the official documentation. Although we will review the actions that are taken by the SageMaker API, other managed training APIs exhibit similar behaviors.

Step 1: Tar and upload the code source directory to cloud storage.

Step 2: Provision the requested instance type(s).

Step 3: Pull the appropriate pre-built Docker image to the instance(s). The appropriate image is determined by the properties of the training job. In the example above, we requested Python 3.9, PyTorch 1.13.1, and a GPU instance type in the us-east-1 AWS region and will, accordingly, end up with the 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker image.

Step 4: Run the Docker image. The Docker ENTRYPOINT is a script (defined by the service) that downloads and unpacks the training code from cloud storage and runs the user-defined training script (more details on this below).

Step 5: When the training script has completed, stop and release the instance.

Note that we have greatly simplified the actions taken by the service and focused on those that will pertain to our discussion. In practice there are many more “management” activities that take place behind the scenes that include accessing training data, orchestrating distributed training, recovery from spot interruptions, monitoring and analyzing training, and much more.

About the Docker Image

The AWS Deep Learning Containers (DLCs) github repository includes the pre-built AWS Docker images used by the SageMaker service. These can be analyzed to get a better understanding of the environment in which the training script runs. In particular, the Dockerfile defining the image used in the example above is located here. From a cursory review of the file we can see that in addition to standard Linux and Python packages (e.g., OpenSSH, pandas, scikit-learn, etc.), the image contains several AWS-specific packages, such as an enhanced version of PyTorch for AWS and libraries for utilizing Amazon EFA. The AWS-specific enhancements include features for managing and monitoring training jobs and, more importantly, optimizations for maximizing resource utilization and runtime performance when running on AWS’s training infrastructure. Furthermore, upon closer analysis of the Dockerfile it becomes clear that its creation required diligent work, occasional workarounds, and extensive testing. Barring other considerations (see below), our first choice will always be to use the official AWS DLCs when training on AWS.
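If you would like to verify which pre-built image a particular job configuration resolves to, the SageMaker Python SDK exposes an image_uris.retrieve utility. The snippet below is a minimal sketch based on the configuration of our example above; please check the SDK documentation for the exact signature supported by your version.

from sagemaker import image_uris

# resolve the pre-built DLC image URI for the training configuration used
# in the example above (PyTorch 1.13.1, Python 3.9, GPU instance, us-east-1)
image_uri = image_uris.retrieve(
    framework='pytorch',
    region='us-east-1',
    version='1.13.1',
    py_version='py39',
    instance_type='ml.g5.xlarge',
    image_scope='training'
)
print(image_uri)
# expected to print the pytorch-training:1.13.1-gpu-py39-... URI mentioned in step 3 above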

The ENTRYPOINT in the Dockerfile points to a script called start_with_right_hostname.sh. This calls the train.py script from the sagemaker-training Python package which, in turn, calls a function that parses the SAGEMAKER_TRAINING_MODULE environment variable and ultimately runs the PyTorch-specific entry-point from the sagemaker-pytorch-training Python package. It is this entry-point that downloads the source code and starts up the training as described in step 4 above. Keep in mind that the flow we just described and the code to which we linked are valid as of the time of this writing and may change as the SageMaker APIs evolve. While we have analyzed the startup flow for a SageMaker PyTorch 1.13 training job as configured in the example above, the same analysis can be performed for other types of cloud-based training jobs.
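One simple way to observe this runtime environment for yourself is to print the SageMaker-related environment variables from within your training script. The sketch below assumes the SAGEMAKER_ and SM_ variable prefixes used by the training toolkit; the exact set of variables may vary between framework versions.

import os

# print the SageMaker-related environment variables visible to the training
# script (e.g., SAGEMAKER_TRAINING_MODULE and the SM_-prefixed settings)
for key, value in sorted(os.environ.items()):
    if key.startswith(('SAGEMAKER_', 'SM_')):
        print(f'{key}={value}')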

The publicly accessible AWS DLC Dockerfiles and SageMaker Python package source code allow us to get a sense of the cloud-based training runtime environment. This enables us to gain a better understanding of the capabilities and limitations of training in a cloud-based environment and, as we will see in the next sections, understand the tools at our disposal for introducing changes and customizations.

Customizing the Python Environment

The first and simplest way to customize your training environment is by adding Python packages. Training scripts will often depend on Python packages (or on specific versions of packages) that are not included in the Python environment of the cloud service’s default Docker image. The SageMaker APIs address this by allowing you to include a requirements.txt file in the root of the source_dir folder passed into the SageMaker estimator (see the API call example above). Following the downloading and unpacking of the source code (in step 4 above), the SageMaker script will search for a requirements.txt file and install all of its contents using the pip package installer. See here for more details on this feature. Other cloud services include similar mechanisms for automating the installation of package dependencies. This type of solution can also be accomplished by simply including a designated package installation routine at the beginning of your own training script, as in the sketch below (though this needs to be carefully coordinated in cases where you are running multiple processes, e.g., with MPI, on a single instance).
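A minimal sketch of such an in-script installation routine is shown below. The package name is purely illustrative; in a multi-process setting you would typically want only one process per instance to perform the installation.

import subprocess
import sys

def install_packages(packages):
    # install the given pip packages into the running Python environment
    subprocess.check_call([sys.executable, '-m', 'pip', 'install'] + packages)

if __name__ == '__main__':
    # hypothetical package dependency, for illustration only
    install_packages(['opencv-python-headless'])
    # ... the rest of the training script ...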

While this solution is simple and easy to use, it does have some limitations:

Installation Time

One thing to take into consideration is the time overhead required to install the package dependencies. If we have a long list of dependencies, or if any of them take a long time to install, customizing the environment in this manner might increase the overall time and cost of training to an unreasonable degree, and we might find that one of the alternative methods we will discuss better suits our needs.

Access to Repository

Another thing to keep in mind about this method is that it relies on access to a Python package repository. If the network of your training environment is configured to allow free access to the internet, then this is not a problem. However, if you are training in a private network environment, such as Amazon Virtual Private Cloud (Amazon VPC), then this access might be restricted. In such cases, you will need to create a private package repository. In AWS this can be done using AWS CodeArtifact or by creating a private PyPI mirror. (Although the use case is a bit different, you might find the recipes described here and here to be helpful.) In the absence of an accessible package repository, you will need to consider one of the alternative customization options we will present.
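The in-script installation routine shown earlier can be adapted to pull packages from a private repository by pointing pip at a private index. In the sketch below the repository URL and package name are hypothetical placeholders, and the authentication details (e.g., the temporary tokens issued by AWS CodeArtifact) will depend on your setup. The same --index-url option can also be placed at the top of a requirements.txt file.

import subprocess
import sys

# hypothetical private package index -- replace with the endpoint of your own
# repository (e.g., an AWS CodeArtifact repository accessible from your VPC)
PRIVATE_INDEX_URL = 'https://my-private-pypi.example.com/simple/'

def install_from_private_index(packages):
    # direct pip at the private index instead of the public PyPI
    subprocess.check_call(
        [sys.executable, '-m', 'pip', 'install',
         '--index-url', PRIVATE_INDEX_URL] + packages
    )

install_from_private_index(['my-internal-package'])  # hypothetical package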

Conda Package Dependencies

The customization solution we have presented is great if all of our dependencies are pip packages. But what if a dependency exists only as a conda package? For example, suppose we wish to use s5cmd to speed up data streaming from cloud storage. When we have full control over the training environment, we can choose to build our Python environment using conda and freely install conda package dependencies. If we use a Docker container provided by a cloud service, we do not control the creation of the Python environment and may not enjoy the same freedoms. In fact, attempting to install s5cmd via subprocess on SageMaker (conda install -y s5cmd) fails. Even if we were to figure out a way to make it work, installing a package in this manner may lead to undesired side effects, such as overwriting other AWS-optimized packages or generally destabilizing the conda environment.
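For reference, the kind of in-script conda installation described above might look like the sketch below. As noted, this call fails in the SageMaker environment, and even where it succeeds it is not a safe way to modify a pre-built environment.

import subprocess

# attempt to install a conda package (s5cmd) from within the training script;
# as discussed above, this fails on SageMaker and, where it does succeed,
# risks destabilizing the pre-built conda environment
subprocess.run(['conda', 'install', '-y', 's5cmd'], check=True)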

Linux Package Dependencies

In theory, this method could also be extended to support Linux packages. If, for example, our training script depended on a specific Linux package or package version, we could install it using a subprocess call at the beginning of our script. In practice, many services, including Amazon SageMaker, limit the Linux user permissions in a manner that prevents this option.
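For illustration, such an attempt might look like the sketch below (the package name is hypothetical); on services that restrict the Linux user permissions, the call will simply fail.

import subprocess

# attempt to install a Linux package from within the training script; on many
# managed services the training process lacks the permissions for this to succeed
subprocess.run(['apt-get', 'install', '-y', 'libgl1'], check=True)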

Summary

The first solution we have presented for customizing our training environment is simple to use and allows us to take full advantage of the Docker images that were specially designed and optimized by the cloud service provider. However, as we have seen, it has a number of limitations. In the second part of our post we will discuss two additional approaches that will address these limitations.


I am a Machine Learning Algorithm Developer working on Autonomous Vehicle technologies at Mobileye. The views expressed in my posts are my own.