Amazon SageMaker Operators for Kubernetes—examples for distributed training, hyperparameter tuning and model hosting
Learn how to write your own YAML config files to use Amazon SageMaker Operators for Kubernetes
At re:Invent 2019, AWS announced Amazon SageMaker Operators for Kubernetes, which enables Kubernetes users to train machine learning models, optimize hyperparameters, run batch transform jobs, and set up inference endpoints using Amazon SageMaker, all without leaving their Kubernetes cluster. You invoke Amazon SageMaker functionality by writing familiar Kubernetes config files in YAML and applying them to your Kubernetes cluster using the kubectl CLI tool.
This lets you extend the capacity and capabilities of your Kubernetes cluster for machine learning by offloading training and inference workloads to Amazon SageMaker. For a more introductory treatment of Amazon SageMaker Operators for Kubernetes, read the following blog post:
Kubernetes and Amazon SageMaker for machine learning — best of both worlds
In this blog post, I’ll present step-by-step instructions for creating Kubernetes config files that run distributed training jobs, hyperparameter tuning jobs, and scalable model inference endpoints using Amazon SageMaker.
The intended reader for this guide is a developer, researcher, or DevOps professional who has basic familiarity with Kubernetes. Even if you’re new to Kubernetes and Amazon SageMaker, I walk through all the necessary steps required to submit training jobs and host inference endpoints.
All code, config files and demo Jupyter notebooks are available on GitHub: https://github.com/shashankprasanna/kubernetes-sagemaker-demos.git
Amazon SageMaker Operators for Kubernetes and how to use it
Amazon SageMaker Operators for Kubernetes is implemented as a custom resource in Kubernetes and enables Kubernetes to invoke Amazon SageMaker functionality. Below, I’ll provide step-by-step instructions for implementing each of these use cases:
- Use case 1: Distributed training with TensorFlow, PyTorch, MXNet and other frameworks
- Use case 2: Distributed training with a custom container
- Use case 3: Hyperparameter optimization at scale with TensorFlow
- Use case 4: Hosting an inference endpoint with BYO model
To follow along, I assume you have an AWS account and the AWS CLI installed on your host machine.
Setup
Let’s start by spinning up a Kubernetes cluster. With the eksctl CLI tool, all it takes is a single command and about 15 minutes of your time to get a basic cluster with a couple of nodes.
Follow the instructions in the AWS documentation to install the eksctl CLI tool. Then run the following command, and go get a cup of coffee. It launches a single-node Amazon Elastic Kubernetes Service (EKS) cluster, which is sufficient for the examples in this post. Note: you can still run large-scale distributed training and hyperparameter tuning jobs on hundreds of nodes on Amazon SageMaker using the Amazon SageMaker Operators for Kubernetes.
Create a Kubernetes cluster
eksctl create cluster \
--name sm-operator-demo \
--version 1.14 \
--region us-west-2 \
--nodegroup-name test-nodes \
--node-type c5.xlarge \
--nodes 1 \
--node-volume-size 50 \
--node-zones us-west-2a \
--timeout=40m \
--auto-kubeconfig
Install Amazon SageMaker Operators for Kubernetes
Once the cluster is up and running, follow the instructions in the user guide to install Amazon SageMaker Operators for Kubernetes. You can also refer to this helpful blog post to guide your installation process: Introducing Amazon SageMaker Operators for Kubernetes
To verify the installation, run:
kubectl get crd | grep sagemaker
You should get an output that looks something like this:
batchtransformjobs.sagemaker.aws.amazon.com 2020-02-29T21:21:24Z
endpointconfigs.sagemaker.aws.amazon.com 2020-02-29T21:21:24Z
hostingdeployments.sagemaker.aws.amazon.com 2020-02-29T21:21:24Z
hyperparametertuningjobs.sagemaker.aws.amazon.com 2020-02-29T21:21:24Z
models.sagemaker.aws.amazon.com 2020-02-29T21:21:24Z
trainingjobs.sagemaker.aws.amazon.com 2020-02-29T21:21:24Z
These are all the tasks you can perform on Amazon SageMaker using the Amazon SageMaker Operators for Kubernetes. We’ll take a closer look at (1) training jobs, (2) hyperparameter tuning jobs, and (3) hosting deployments.
Download examples from GitHub
Download training scripts, config files and Jupyter notebooks to your host machine.
git clone https://github.com/shashankprasanna/kubernetes-sagemaker-demos.git
Download training dataset and upload to Amazon S3
cd kubernetes-sagemaker-demos/0-upload-dataset-s3
Note: TensorFlow must be installed on the host machine to download the dataset and convert it into the TFRecord format.
Run through upload_dataset_s3.ipynb
to upload the training dataset to Amazon S3.
Use case 1: Distributed training with TensorFlow, PyTorch, MXNet and other frameworks
If you’re new to Amazon SageMaker, one of its nice features when using popular frameworks such as TensorFlow, PyTorch, MXNet, and XGBoost is that you don’t have to worry about building custom containers with your code in them and pushing them to a container registry. Amazon SageMaker can automatically download your training scripts and dependencies into a framework container and run them at scale for you. You only have to version and manage your training scripts, and don’t have to deal with containers at all. With Amazon SageMaker Operators for Kubernetes, you get the same experience.
Navigate to the directory with the 1st example:
cd kubernetes-sagemaker-demos/1-tf-dist-training-training-script/
ls -1
Output:
cifar10-multi-gpu-horovod-sagemaker.py
k8s-sm-dist-training-script.yaml
model_def.py
upload_source_to_s3.ipynb
The two Python files in this directory, cifar10-multi-gpu-horovod-sagemaker.py
and model_def.py,
are TensorFlow training scripts that implement the Horovod API for distributed training.
Run through upload_source_to_s3.ipynb
to create a tar file with the training scripts and upload it to the specified Amazon S3 bucket.
k8s-sm-dist-training-script.yaml
is a config file that, when applied using kubectl,
kicks off a distributed training job. Open it in your favorite text editor to take a closer look.
First, you’ll notice kind: TrainingJob.
This indicates that you’ll be submitting an Amazon SageMaker training job.
Under hyperParameters, specify the hyperparameters that cifar10-multi-gpu-horovod-sagemaker.py
accepts as inputs.
Specify additional parameters for distributed training:
- sagemaker_program: cifar10-multi-gpu-horovod-sagemaker.py, the TensorFlow training script that implements the Horovod API for distributed training
- sagemaker_submit_directory: location on Amazon S3 where the training scripts are located
- sagemaker_mpi_enabled and sagemaker_mpi_custom_mpi_options: enable MPI communication for distributed training
- sagemaker_mpi_num_of_processes_per_host: set to the number of GPUs on the requested instance. For a p3dn.24xlarge instance with 8 GPUs, set this value to 8.
Specify the deep learning framework container by selecting the appropriate container from here:
https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers-images.html
Amazon SageMaker will automatically download the training scripts specified under sagemaker_submit_directory
into the container instantiated from trainingImage.
To track performance you can also specify a metric definition.
Under resourceConfig, specify how many instances, or nodes, you want to run this multi-node training on. The above config file specifies distributed training across 32 GPUs.
Finally, specify the dataset location on Amazon S3. This should be the same bucket you chose when running the upload_dataset_s3.ipynb Jupyter notebook to upload the training dataset.
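Pieced together, an abridged TrainingJob config looks roughly like the sketch below. The role ARN, bucket names, and image URI are placeholders you’d replace with your own values; the structure follows the operator’s TrainingJob custom resource.

```yaml
apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: k8s-sm-dist-training-script
spec:
  roleArn: arn:aws:iam::123456789012:role/AmazonSageMaker-ExecutionRole   # placeholder
  region: us-west-2
  algorithmSpecification:
    # Deep learning container image; pick the right one for your region and framework version
    trainingImage: 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.2-gpu-py36-cu100-ubuntu18.04
    trainingInputMode: File
  hyperParameters:
    - name: learning-rate
      value: "0.001"
    - name: sagemaker_program
      value: cifar10-multi-gpu-horovod-sagemaker.py
    - name: sagemaker_submit_directory
      value: s3://YOUR_BUCKET/training-scripts/sourcedir.tar.gz   # placeholder
    - name: sagemaker_mpi_enabled
      value: "true"
    - name: sagemaker_mpi_num_of_processes_per_host
      value: "8"
  resourceConfig:
    instanceCount: 4               # 4 x p3dn.24xlarge with 8 GPUs each = 32 GPUs
    instanceType: ml.p3dn.24xlarge
    volumeSizeInGB: 100
  stoppingCondition:
    maxRuntimeInSeconds: 86400
  outputDataConfig:
    s3OutputPath: s3://YOUR_BUCKET/jobs   # placeholder
  inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: s3://YOUR_BUCKET/cifar10-dataset/   # placeholder
          s3DataDistributionType: FullyReplicated
```

The full config file in the GitHub repository is the authoritative version; this sketch is just to show how the pieces discussed above fit together.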
To start distributed training, run:
kubectl apply -f k8s-sm-dist-training-script.yaml
Output:
trainingjob.sagemaker.aws.amazon.com/k8s-sm-dist-training-script created
To get the training job information run:
kubectl get trainingjob
Output:
NAME STATUS SECONDARY-STATUS CREATION-TIME SAGEMAKER-JOB-NAME
k8s-sm-dist-training-script InProgress Starting 2020-03-03T08:25:40Z k8s-sm-dist-training-script-91027b685d2811ea97a20e30c8d9dadc
Now navigate to AWS Console > Amazon SageMaker > Training jobs
You’ll see a new training job with the same name as the SAGEMAKER-JOB-NAME shown in the kubectl get trainingjob output.
To view the training logs, click on the training job in the console and click on “View Logs” under the Monitor section. This will take you to CloudWatch where you can view the training logs.
Alternatively, if you installed the smlogs kubectl plugin, you can run the following to view logs using kubectl:
kubectl smlogs trainingjob k8s-sm-dist-training-script
Use case 2: Distributed training with a custom container
If you’re working with custom or proprietary algorithms, then you’ll have to build your own Docker containers. To submit a training job with a custom container, you first build the container image locally and push it to Amazon Elastic Container Registry (ECR). After you push the image to ECR, you update the Kubernetes job config file with your ECR image path rather than the TensorFlow container path used in the previous example.
Navigate to the directory with the 2nd example:
cd kubernetes-sagemaker-demos/2-tf-dist-training-custom-container/docker
ls -1
Output:
build_docker_push_to_ecr.ipynb
cifar10-multi-gpu-horovod-sagemaker.py
Dockerfile
model_def.py
Run through build_docker_push_to_ecr.ipynb
to build the Docker image and push it to Amazon ECR.
Navigate to AWS Console > Amazon ECR.
You should see your newly pushed Docker image here:
Navigate to
cd kubernetes-sagemaker-demos/2-tf-dist-training-custom-container/
Open k8s-sm-dist-custom-container.yaml
config file in your favorite text editor to take a closer look.
The only change you need to make is in the trainingImage section, where you provide the location of the custom container image in your ECR registry.
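For example, the algorithmSpecification section would change along these lines; the account ID, region, repository name, and tag below are placeholders for your own ECR image path:

```yaml
algorithmSpecification:
  # Your custom image pushed to Amazon ECR in the previous step
  trainingImage: 123456789012.dkr.ecr.us-west-2.amazonaws.com/cifar10-training:latest
  trainingInputMode: File
```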
To start distributed training with your custom container, run:
kubectl apply -f k8s-sm-dist-custom-container.yaml
Output:
trainingjob.sagemaker.aws.amazon.com/k8s-sm-dist-custom-container created
Use case 3: Hyperparameter optimization at scale with TensorFlow
Hyperparameters for a machine learning model are options that are not learned during the training phase, but that affect the performance of the model. To submit an Amazon SageMaker hyperparameter tuning job, you’ll create a Kubernetes config file of kind: HyperParameterTuningJob,
instead of TrainingJob
as you did in the previous two examples.
Another difference is that instead of fixed hyperparameters, here you’ll specify ranges of hyperparameters so that Amazon SageMaker can try different options to arrive at the best model.
Navigate to the directory with the 3rd example:
cd kubernetes-sagemaker-demos/3-tf-hyperopt-training-script
ls -1
Output
cifar10-training-script-sagemaker.py
inference.py
k8s-sm-hyperopt-training-script.yaml
requirements.txt
upload_source_to_s3.ipynb
Run through upload_source_to_s3.ipynb
to upload training scripts to Amazon S3.
Open k8s-sm-hyperopt-training-script.yaml
config file in your favorite text editor to take a closer look.
kind: HyperParameterTuningJob
suggests that this is an Amazon SageMaker Model Tuning job.
Under resourceLimits, specify how many training jobs you want the hyperparameter tuner to run in order to explore and find the best set of hyperparameters. maxNumberOfTrainingJobs specifies the total number of jobs to run with different hyperparameter combinations, and maxParallelTrainingJobs specifies how many of those run at any given time. The strategy can be Bayesian or Random.
Hyperparameters can be specified as integerParameterRanges, continuousParameterRanges, or categoricalParameterRanges. In the above example, optimizer and batch size are categorical, which means Amazon SageMaker will pick one of the specified values for each training job. For learning rate and momentum, Amazon SageMaker will pick a continuous value in the specified range.
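The relevant portion of the spec looks roughly like the sketch below. The job counts, metric name, and parameter ranges are illustrative placeholders; the structure mirrors the SageMaker hyperparameter tuning API, which the operator exposes:

```yaml
spec:
  hyperParameterTuningJobConfig:
    strategy: Bayesian
    resourceLimits:
      maxNumberOfTrainingJobs: 32    # total jobs across all hyperparameter combinations
      maxParallelTrainingJobs: 8     # jobs running at any given time
    hyperParameterTuningJobObjective:
      type: Maximize
      metricName: val_acc            # must match a metric your training script emits
    parameterRanges:
      categoricalParameterRanges:
        - name: optimizer
          values: ["adam", "sgd"]
        - name: batch-size
          values: ["128", "256"]
      continuousParameterRanges:
        - name: learning-rate
          minValue: "0.0001"
          maxValue: "0.1"
          scalingType: Logarithmic
        - name: momentum
          minValue: "0.9"
          maxValue: "0.99"
          scalingType: Linear
  trainingJobDefinition:
    # same algorithmSpecification, resourceConfig, and data config
    # as in a TrainingJob spec
```

Refer to k8s-sm-hyperopt-training-script.yaml in the repository for the complete, working version.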
To start a hyperparameter tuning job, run:
kubectl apply -f k8s-sm-hyperopt-training-script.yaml
Output:
hyperparametertuningjob.sagemaker.aws.amazon.com/k8s-sm-hyperopt-training-script created
To get more details about the hyperparameter tuning job, run:
kubectl get hyperparametertuningjob
Output
NAME STATUS CREATION-TIME COMPLETED INPROGRESS ERRORS STOPPED BEST-TRAINING-JOB SAGEMAKER-JOB-NAME
k8s-sm-hyperopt-training-script InProgress 2020-03-03T09:13:58Z 0 2 0 0 50d11d175d2f11ea89ac02f05b3bb36a
The hyperparameter tuning job spawns multiple training jobs, which you can see by asking kubectl for a list of training jobs:
kubectl get trainingjob
Output
NAME STATUS SECONDARY-STATUS CREATION-TIME SAGEMAKER-JOB-NAME
50d11d175d2f11ea89ac02f05b3bb36a-001-673da61b InProgress Starting 2020-03-03T09:14:11Z 50d11d175d2f11ea89ac02f05b3bb36a-001-673da61b
50d11d175d2f11ea89ac02f05b3bb36a-002-7952d388 InProgress Downloading 2020-03-03T09:14:11Z 50d11d175d2f11ea89ac02f05b3bb36a-002-7952d388
Navigate to AWS Console > Amazon SageMaker > Hyperparameter tuning jobs
You should see the hyperparameter tuning job in progress
Use case 4: Hosting an inference endpoint with BYO model
To deploy a model with Amazon SageMaker hosting services, you just need to bring your own model as a compressed tar file. If the model was trained on Amazon SageMaker, the output is already in the required format.
If you ran the examples above, navigate to the Amazon S3 bucket where the training job results were saved: Amazon S3 > YOUR_BUCKET > JOB_NAME > output.
There you should find a file called model.tar.gz,
which contains the trained model.
Navigate to the directory with the 4th example:
cd kubernetes-sagemaker-demos/3-tf-hyperopt-training-script
ls -1
Output
k8s-sm-inference-host-endpoint.yaml
Open k8s-sm-inference-host-endpoint.yaml
config file in your favorite text editor to take a closer look.
Specify the type of hosting instance under instanceType,
and provide a weight for A/B testing if you’re hosting multiple models. Under modelDataUrl,
specify the location of the trained model on Amazon S3.
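An abridged HostingDeployment config might look like the sketch below; the model name, role ARN, serving image, and S3 path are placeholders for your own values:

```yaml
apiVersion: sagemaker.aws.amazon.com/v1
kind: HostingDeployment
metadata:
  name: k8s-sm-inference-host-endpoint
spec:
  region: us-west-2
  productionVariants:
    - variantName: AllTraffic
      modelName: tf-cifar10-model
      initialInstanceCount: 1
      instanceType: ml.c5.xlarge
      initialVariantWeight: 1      # adjust weights when A/B testing multiple variants
  models:
    - name: tf-cifar10-model
      executionRoleArn: arn:aws:iam::123456789012:role/AmazonSageMaker-ExecutionRole   # placeholder
      containers:
        - containerHostname: tensorflow
          # TensorFlow Serving deep learning container for your region and framework version
          image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-inference:1.15.2-cpu
          modelDataUrl: s3://YOUR_BUCKET/JOB_NAME/output/model.tar.gz   # placeholder
```

The k8s-sm-inference-host-endpoint.yaml file in the repository is the complete version to start from.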
To deploy a model, run:
kubectl apply -f k8s-sm-inference-host-endpoint.yaml
Output:
hostingdeployment.sagemaker.aws.amazon.com/k8s-sm-inference-host-endpoint created
To view details about the hosting deployment, run:
kubectl get hostingdeployments
Output:
NAME STATUS SAGEMAKER-ENDPOINT-NAME
k8s-sm-inference-host-endpoint Creating k8s-sm-inference-host-endpoint-cdbb6db95d3111ea97a20e30c8d9dadc
Navigate to AWS Console > Amazon SageMaker > Endpoints
You should see an Amazon SageMaker endpoint InService and ready to accept requests.
Conclusion
In this blog post, I covered how you can use Kubernetes and Amazon SageMaker together to get the best of both when running machine learning workloads.
I presented a quick overview of Amazon SageMaker Operators for Kubernetes and how you can use it to leverage Amazon SageMaker capabilities such as distributed training, hyperparameter optimization, and hosting inference endpoints that can elastically scale. After that, I showed you the step-by-step process of submitting training and deployment requests using the Kubernetes CLI, kubectl.
I’ve made all the config files available on GitHub, so feel free to use them, modify them, and make them your own. Thanks for reading; all the code and examples are available on GitHub here:
https://github.com/shashankprasanna/kubernetes-sagemaker-demos.git
If you have questions, please reach out to me on Twitter (@shshnkp), LinkedIn, or leave a comment below. Enjoy!