Amazon SageMaker Operators for Kubernetes—examples for distributed training, hyperparameter tuning and model hosting

Learn how to write your own YAML config files to use Amazon SageMaker Operators for Kubernetes

Shashank Prasanna
Towards Data Science

Use Amazon SageMaker Operators for Kubernetes to run training jobs, model tuning jobs, and batch transform jobs, and to set up inference endpoints on Amazon SageMaker, using Kubernetes config files and kubectl

At re:Invent 2019, AWS announced Amazon SageMaker Operators for Kubernetes, which enables Kubernetes users to train machine learning models, optimize hyperparameters, run batch transform jobs, and set up inference endpoints using Amazon SageMaker, without leaving their Kubernetes cluster. You can invoke Amazon SageMaker functionality by writing familiar Kubernetes config files in YAML and applying them to your Kubernetes cluster using the kubectl CLI tool.

This lets you extend the capacity and capability of your Kubernetes cluster for machine learning by offloading training and inference workloads to Amazon SageMaker. For a more introductory treatment of Amazon SageMaker Operators for Kubernetes, read the following blog post:

Kubernetes and Amazon SageMaker for machine learning — best of both worlds

In this blog post, I’ll present step-by-step instructions for creating Kubernetes config files for running distributed training jobs, hyperparameter tuning jobs, and scalable model inference endpoints using Amazon SageMaker.

The intended reader for this guide is a developer, researcher, or DevOps professional with basic familiarity with Kubernetes. Even if you’re new to Kubernetes and Amazon SageMaker, I walk through all the steps required to submit training jobs and host inference endpoints.

All code, config files and demo Jupyter notebooks are available on GitHub: https://github.com/shashankprasanna/kubernetes-sagemaker-demos.git

Amazon SageMaker Operators for Kubernetes and how to use it

Amazon SageMaker Operators for Kubernetes is implemented as a custom resource in Kubernetes and enables Kubernetes to invoke Amazon SageMaker functionality. Below, I’ll provide step-by-step instructions for implementing each of these use cases:

  • Use case 1: Distributed training with TensorFlow, PyTorch, MXNet and other frameworks
  • Use case 2: Distributed training with a custom container
  • Use case 3: Hyperparameter optimization at scale with TensorFlow
  • Use case 4: Hosting an inference endpoint with BYO model

To follow along, I assume you have an AWS account and the AWS CLI installed on your host machine.

Setup

Let’s start by spinning up a Kubernetes cluster. With the eksctl CLI tool, all it takes is a single command and about 15 minutes of your time to get a basic cluster with a couple of nodes up and running.

Follow the instructions in the AWS documentation to install the eksctl CLI tool. After that, run the following command and go get a cup of coffee. This command launches a single-node Amazon Elastic Kubernetes Service (EKS) cluster, which is sufficient for the examples in this post. Note that you can still run large-scale distributed training and hyperparameter tuning jobs on hundreds of nodes on Amazon SageMaker using the Amazon SageMaker Operators for Kubernetes.

Create a Kubernetes cluster

eksctl create cluster \
--name sm-operator-demo \
--version 1.14 \
--region us-west-2 \
--nodegroup-name test-nodes \
--node-type c5.xlarge \
--nodes 1 \
--node-volume-size 50 \
--node-zones us-west-2a \
--timeout=40m \
--auto-kubeconfig

Install Amazon SageMaker Operators for Kubernetes

Once the cluster is up and running, follow the instructions in the user guide to install Amazon SageMaker Operators for Kubernetes. You can also refer to this helpful blog post to guide your installation process: Introducing Amazon SageMaker Operators for Kubernetes

To verify the installation, run:

kubectl get crd | grep sagemaker

You should get an output that looks something like this:

batchtransformjobs.sagemaker.aws.amazon.com         2020-02-29T21:21:24Z
endpointconfigs.sagemaker.aws.amazon.com            2020-02-29T21:21:24Z
hostingdeployments.sagemaker.aws.amazon.com         2020-02-29T21:21:24Z
hyperparametertuningjobs.sagemaker.aws.amazon.com   2020-02-29T21:21:24Z
models.sagemaker.aws.amazon.com                     2020-02-29T21:21:24Z
trainingjobs.sagemaker.aws.amazon.com               2020-02-29T21:21:24Z

These are all the tasks you can perform on Amazon SageMaker using the Amazon SageMaker Operators for Kubernetes. We’ll take a closer look at (1) training jobs, (2) hyperparameter tuning jobs, and (3) hosting deployments.

Download examples from GitHub

Download training scripts, config files and Jupyter notebooks to your host machine.

git clone https://github.com/shashankprasanna/kubernetes-sagemaker-demos.git

Download training dataset and upload to Amazon S3

cd kubernetes-sagemaker-demos/0-upload-dataset-s3

Note: TensorFlow must be installed on the host machine to download the dataset and convert it into the TFRecord format.

Run through upload_dataset_s3.ipynb to upload your training dataset to Amazon S3.

Use case 1: Distributed training with TensorFlow, PyTorch, MXNet and other frameworks

If you’re new to Amazon SageMaker, one of its nice features when using popular frameworks such as TensorFlow, PyTorch, MXNet, and XGBoost is that you don’t have to worry about building custom containers with your code in them and pushing them to a container registry. Amazon SageMaker can automatically download your training scripts and dependencies into a framework container and run them at scale for you. So you only have to version and manage your training scripts and don’t have to deal with containers at all. With Amazon SageMaker Operators for Kubernetes, you get the same experience.

Navigate to the directory with the 1st example:

cd kubernetes-sagemaker-demos/1-tf-dist-training-training-script/
ls -1

Output:

cifar10-multi-gpu-horovod-sagemaker.py
k8s-sm-dist-training-script.yaml
model_def.py
upload_source_to_s3.ipynb

The two Python files in this directory, cifar10-multi-gpu-horovod-sagemaker.py and model_def.py, are TensorFlow training scripts that implement the Horovod API for distributed training.

Run through upload_source_to_s3.ipynb to create a tar file with the training scripts and upload it to the specified Amazon S3 bucket.

k8s-sm-dist-training-script.yaml is a config file that, when applied using kubectl, kicks off a distributed training job. Open it in your favorite text editor to take a closer look.

First, you’ll notice kind: TrainingJob. This indicates that applying this config file submits an Amazon SageMaker training job.
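
The top of the config file follows the usual Kubernetes custom resource layout. Here is a minimal sketch, assuming the operator’s sagemaker.aws.amazon.com/v1 API group and using placeholder values for the IAM role and region; refer to the actual file in the repository for the exact contents:

apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: k8s-sm-dist-training-script
spec:
  roleArn: arn:aws:iam::123456789012:role/YOUR_SAGEMAKER_EXECUTION_ROLE   # placeholder IAM role that SageMaker assumes
  region: us-west-2
  ...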

Under hyperParameters, specify the hyperparameters that cifar10-multi-gpu-horovod-sagemaker.py can accept as inputs.

Specify additional parameters for distributed training (these are sketched in the config excerpt after this list):

  • sagemaker_program — cifar10-multi-gpu-horovod-sagemaker.py, the TensorFlow training script that implements the Horovod API for distributed training
  • sagemaker_submit_directory — location on Amazon S3 where the training scripts are located
  • sagemaker_mpi_enabled and sagemaker_mpi_custom_mpi_options — enable MPI communication for distributed training
  • sagemaker_mpi_num_of_processes_per_host — set to the number of GPUs on the requested instance. For a p3dn.24xlarge instance with 8 GPUs, set this value to 8.
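
Put together, the hyperparameter section of the spec looks something like the sketch below. The hyperparameter names, values, and S3 path here are illustrative placeholders, not the exact contents of k8s-sm-dist-training-script.yaml:

  hyperParameters:
    - name: learning-rate
      value: "0.001"                   # illustrative training-script hyperparameter
    - name: batch-size
      value: "256"
    - name: sagemaker_program
      value: "cifar10-multi-gpu-horovod-sagemaker.py"
    - name: sagemaker_submit_directory
      value: "s3://YOUR_BUCKET/training-scripts/sourcedir.tar.gz"   # placeholder; tar file created by upload_source_to_s3.ipynb
    - name: sagemaker_mpi_enabled
      value: "true"
    - name: sagemaker_mpi_custom_mpi_options
      value: "-verbose"                # illustrative MPI options
    - name: sagemaker_mpi_num_of_processes_per_host
      value: "8"                       # number of GPUs per instance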

Specify the deep learning framework container by selecting the appropriate container from here:

https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-images.html

Amazon SageMaker will automatically download the training scripts specified under sagemaker_submit_directory into the container instantiated from trainingImage.

To track performance, you can also specify a metric definition.

Under resourceConfig, specify how many instances, or nodes, you want to run this multi-node training job on. This config file specifies that it’ll run distributed training on 32 GPUs.

Finally, specify the dataset location on Amazon S3. This should be the same bucket you chose when running the upload_dataset_s3.ipynb Jupyter notebook to upload the training dataset.
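
Sketched together, these remaining sections of the spec might look like the following. The image placeholder, instance type and count, metric regex, and S3 paths are assumptions for illustration; the config file in the repository and the operator documentation define the exact schema and values:

  algorithmSpecification:
    trainingImage: <DEEP_LEARNING_CONTAINER_IMAGE>   # pick the TensorFlow training image for your region from the list linked above
    trainingInputMode: File
  metricDefinitions:
    - name: val_acc
      regex: "val_acc: ([0-9.]+)"                    # illustrative regex matched against the training logs
  resourceConfig:
    instanceType: ml.p3.16xlarge                     # assumption: 8 GPUs per instance
    instanceCount: 4                                 # 4 x 8 GPUs = 32 GPUs total
    volumeSizeInGB: 50
  outputDataConfig:
    s3OutputPath: s3://YOUR_BUCKET/jobs/             # placeholder; where model.tar.gz is written
  inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: s3://YOUR_BUCKET/cifar10-dataset/   # placeholder; bucket used in upload_dataset_s3.ipynb
          s3DataDistributionType: FullyReplicated
      compressionType: None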

To start distributed training, run:

kubectl apply -f k8s-sm-dist-training-script.yaml

Output:

trainingjob.sagemaker.aws.amazon.com/k8s-sm-dist-training-script created

To get the training job information, run:

kubectl get trainingjob

Output:

NAME                          STATUS       SECONDARY-STATUS   CREATION-TIME          SAGEMAKER-JOB-NAME
k8s-sm-dist-training-script   InProgress   Starting           2020-03-03T08:25:40Z   k8s-sm-dist-training-script-91027b685d2811ea97a20e30c8d9dadc

Now navigate to AWS Console > Amazon SageMaker > Training jobs

You’ll see a new training job with the same name as shown in the output of kubectl get trainingjob.

To view the training logs, click on the training job in the console and click on “View Logs” under the Monitor section. This will take you to CloudWatch where you can view the training logs.

Alternatively, if you installed the smlogs plugin, you can run the following to view logs using kubectl:

kubectl smlogs trainingjob k8s-sm-dist-training-script

Use case 2: Distributed training with a custom container

If you’re working with custom or proprietary algorithms, then you’ll have to build your own Docker containers. To submit a training job with a custom container, you’ll first have to build the container image locally and push it to Amazon Elastic Container Registry (ECR). After you push the image to ECR, you’ll update the Kubernetes job config file with your ECR image path, rather than the TensorFlow framework container path used in the previous example.

Navigate to the directory with the 2nd example:

cd kubernetes-sagemaker-demos/2-tf-dist-training-custom-container/docker
ls -1

Output:

build_docker_push_to_ecr.ipynb
cifar10-multi-gpu-horovod-sagemaker.py
Dockerfile
model_def.py

Run through build_docker_push_to_ecr.ipynb to build the Docker image and push it to an ECR registry.

Navigate to AWS Console > Amazon ECR. You should see your newly pushed Docker image in the repository.

Navigate to:

cd kubernetes-sagemaker-demos/2-tf-dist-training-custom-container/

Open the k8s-sm-dist-custom-container.yaml config file in your favorite text editor to take a closer look.

The only change you need to make is to the trainingImage field, where you’ll provide the location of the custom container you pushed to your ECR registry.
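
In other words, the algorithmSpecification now points at your own image. A sketch, with a placeholder account ID, region, repository name, and tag:

  algorithmSpecification:
    trainingImage: 123456789012.dkr.ecr.us-west-2.amazonaws.com/YOUR_ECR_REPO:latest   # your ECR image from build_docker_push_to_ecr.ipynb
    trainingInputMode: File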

To start distributed training with your custom container, run:

kubectl apply -f k8s-sm-dist-custom-container.yaml

Output:

trainingjob.sagemaker.aws.amazon.com/k8s-sm-dist-custom-container created

Use case 3: Hyperparameter optimization at scale with TensorFlow

Hyperparameters of a machine learning model are options that are not optimized or learned during the training phase, but that affect the performance of the model. To submit an Amazon SageMaker hyperparameter tuning job, you’ll need to create a Kubernetes config file of kind: HyperParameterTuningJob, instead of TrainingJob as you did in the previous two examples.

Another difference is that instead of fixed hyperparameters, here you’ll specify ranges of hyperparameters so that Amazon SageMaker can try different options to arrive at the best model.

Navigate to the directory with the 3rd example:

cd kubernetes-sagemaker-demos/3-tf-hyperopt-training-script
ls -1

Output

cifar10-training-script-sagemaker.py
inference.py
k8s-sm-hyperopt-training-script.yaml
requirements.txt
upload_source_to_s3.ipynb

Run through upload_source_to_s3.ipynb to upload training scripts to Amazon S3.

Open the k8s-sm-hyperopt-training-script.yaml config file in your favorite text editor to take a closer look.

kind: HyperParameterTuningJob suggests that this is an Amazon SageMaker Model Tuning job.

Under resourceLimits, specify how many training jobs you want the hyperparameter tuner to run in order to explore and find the best set of hyperparameters. maxNumberOfTrainingJobs specifies the total number of jobs to run with different hyperparameter combinations, and maxParallelTrainingJobs specifies how many of those training jobs should run in parallel at any given time. The strategy can be Bayesian or Random.

Hyperparameters can be specified as integerParameterRanges, continuousParameterRanges, or categoricalParameterRanges. In this example, optimizer and batch size are categorical, which means Amazon SageMaker will pick one of the specified values for each training job. For learning rate and momentum, Amazon SageMaker will pick a continuous value within the specified range.
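
Here is a sketch of what these sections of the spec might look like, with illustrative hyperparameter names, values, and ranges rather than the exact contents of k8s-sm-hyperopt-training-script.yaml:

  hyperParameterTuningJobConfig:
    strategy: Bayesian
    resourceLimits:
      maxNumberOfTrainingJobs: 32      # total jobs across all hyperparameter combinations
      maxParallelTrainingJobs: 2       # jobs running at the same time
    hyperParameterTuningJobObjective:
      type: Maximize
      metricName: val_acc              # illustrative objective metric
    parameterRanges:
      categoricalParameterRanges:
        - name: optimizer
          values: ["sgd", "adam"]
        - name: batch-size
          values: ["128", "256", "512"]
      continuousParameterRanges:
        - name: learning-rate
          minValue: "0.0001"
          maxValue: "0.1"
        - name: momentum
          minValue: "0.9"
          maxValue: "0.99"
  trainingJobDefinition:
    ...                                # algorithmSpecification, resourceConfig, and data config, as in a TrainingJob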

To start a hyperparameter tuning job, run:

kubectl apply -f k8s-sm-hyperopt-training-script.yaml

Output:

hyperparametertuningjob.sagemaker.aws.amazon.com/k8s-sm-hyperopt-training-script created

To get more details about the hyperparameter tuning job, run:

kubectl get hyperparametertuningjob

Output

NAME                              STATUS       CREATION-TIME          COMPLETED   INPROGRESS   ERRORS   STOPPED   BEST-TRAINING-JOB   SAGEMAKER-JOB-NAME
k8s-sm-hyperopt-training-script   InProgress   2020-03-03T09:13:58Z   0           2            0        0                             50d11d175d2f11ea89ac02f05b3bb36a

The hyperparameter tuning job spawns multiple training jobs, which you can see by asking kubectl for a list of training jobs:

kubectl get trainingjob

Output

NAME                                            STATUS       SECONDARY-STATUS   CREATION-TIME          SAGEMAKER-JOB-NAME
50d11d175d2f11ea89ac02f05b3bb36a-001-673da61b   InProgress   Starting           2020-03-03T09:14:11Z   50d11d175d2f11ea89ac02f05b3bb36a-001-673da61b
50d11d175d2f11ea89ac02f05b3bb36a-002-7952d388   InProgress   Downloading        2020-03-03T09:14:11Z   50d11d175d2f11ea89ac02f05b3bb36a-002-7952d388

Navigate to AWS Console > Amazon SageMaker > Hyperparameter tuning jobs

You should see the hyperparameter tuning job in progress.

Use case 4: Hosting an inference endpoint with BYO model

To deploy a model to Amazon SageMaker hosting services, you just need to bring your own model as a compressed tar file. If you want to host a model that you trained on Amazon SageMaker, the output will already be in the required format.

If you ran the examples above, navigate to the Amazon S3 bucket where the training job results were saved: Amazon S3 > YOUR_BUCKET > JOB_NAME > output. There you should find a file called model.tar.gz, which contains the trained model.

Navigate to the directory with the 4th example:

cd kubernetes-sagemaker-demos/3-tf-hyperopt-training-script
ls -1

Output

k8s-sm-inference-host-endpoint.yaml

Open the k8s-sm-inference-host-endpoint.yaml config file in your favorite text editor to take a closer look.

Specify the type of instance for hosting under instanceType, and provide a weight for A/B testing if you’re hosting multiple models. Under modelDataUrl, specify the location of the trained model on Amazon S3.
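
Those fields sit in a config that looks roughly like the sketch below. The model name, execution role, container image, and S3 path are placeholders; check the file in the repository for the exact schema the operator expects:

apiVersion: sagemaker.aws.amazon.com/v1
kind: HostingDeployment
metadata:
  name: k8s-sm-inference-host-endpoint
spec:
  region: us-west-2
  productionVariants:
    - variantName: AllTraffic
      modelName: tf-cifar10-model            # placeholder; must match a model name defined below
      instanceType: ml.c5.xlarge
      initialInstanceCount: 1
      initialVariantWeight: 1                # traffic weight for A/B testing across variants
  models:
    - name: tf-cifar10-model
      executionRoleArn: arn:aws:iam::123456789012:role/YOUR_SAGEMAKER_EXECUTION_ROLE   # placeholder
      containers:
        - containerHostname: tensorflow
          image: <TENSORFLOW_SERVING_CONTAINER_IMAGE>                  # inference image for your region
          modelDataUrl: s3://YOUR_BUCKET/JOB_NAME/output/model.tar.gz  # trained model from the earlier examples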

To deploy a model, run:

kubectl apply -f k8s-sm-inference-host-endpoint.yaml

Output:

hostingdeployment.sagemaker.aws.amazon.com/k8s-sm-inference-host-endpoint created

To view details about the hosting deployment, run:

kubectl get hostingdeployments

Output:

NAME                             STATUS     SAGEMAKER-ENDPOINT-NAME
k8s-sm-inference-host-endpoint   Creating   k8s-sm-inference-host-endpoint-cdbb6db95d3111ea97a20e30c8d9dadc

Navigate to AWS Console > Amazon SageMaker > Endpoints

You should see an Amazon SageMaker endpoint with status InService, ready to accept requests.

Conclusion

In this blog post, I covered how you can use Kubernetes and Amazon SageMaker together to get the best of both when running machine learning workloads.

I presented a quick overview of Amazon SageMaker Operators for Kubernetes and how you can use it to leverage Amazon SageMaker capabilities such as distributed training, hyperparameter optimization, and hosting inference endpoints that can elastically scale. After that, I showed you the step-by-step process of submitting training and deployment requests using the Kubernetes CLI, kubectl.

I’ve made all the config files available on GitHub, so feel free to use them, modify them, and make them your own. Thanks for reading; all the code and examples are available on GitHub here:
https://github.com/shashankprasanna/kubernetes-sagemaker-demos.git

If you have questions, please reach out to me on Twitter (@shshnkp), LinkedIn, or leave a comment below. Enjoy!
