Kubernetes and Amazon SageMaker for machine learning — best of both worlds

Use Amazon SageMaker to extend the capacity and capability of your Kubernetes cluster for machine learning workloads

Shashank Prasanna
Towards Data Science


If you’re part of a team that trains and deploys machine learning models frequently, you probably have a cluster set up to help orchestrate and manage your machine learning workloads. Chances are you’re using Kubernetes (and Kubeflow) or Amazon SageMaker.

Until now you had to choose your orchestration system and stick with it. You either (1) provisioned a Kubernetes cluster sized for your data science team’s expected workload or (2) went fully managed and relied on Amazon SageMaker to automatically provision and tear down resources as needed.

Wouldn’t it be nice if you could have the best of both worlds?

  • Use Kubernetes to manage your workflows, and get burst capacity with Amazon SageMaker for large-scale distributed training?
  • Develop algorithms and models with Kubeflow Jupyter notebooks and run hyperparameter experiments at scale using Amazon SageMaker?
  • Train models using Kubeflow and host an inference endpoint on Amazon SageMaker that can elastically scale to millions of users?

With the Amazon SageMaker Operators for Kubernetes you can do exactly that! You can use it to train machine learning models, optimize hyperparameters, run batch transform jobs, and set up inference endpoints using Amazon SageMaker, without ever leaving your Kubernetes cluster.

Use Amazon SageMaker Operators for Kubernetes to run training jobs, model tuning jobs, batch transform jobs, and set up inference endpoints on Amazon SageMaker using Kubernetes config files and kubectl

Amazon SageMaker Operators for Kubernetes is a custom resource in Kubernetes that lets you invoke Amazon SageMaker functionality using the Kubernetes CLI and config files. In fact, many of Kubernetes’s core functionalities are built as custom resources, and this modularity makes Kubernetes very extensible. For Kubernetes users, Amazon SageMaker Operators gives you a single, consistent way of interacting with both Kubernetes and Amazon SageMaker.
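Because the operator is just another custom resource, you can check whether it’s installed the same way you’d check any other Kubernetes add-on. A minimal sketch, assuming the operator was installed with its default settings (the namespace name may differ in your setup):

# List the SageMaker custom resource definitions registered on the cluster
kubectl get crd | grep sagemaker

# Check that the operator's controller pod is running
# (namespace shown is the installer's default; yours may differ)
kubectl get pods -n sagemaker-k8s-operator-system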

In this blog post, I’ll present an overview of Amazon SageMaker Operators for Kubernetes, why it matters, and common usage patterns so you can decide if this is for you. All code, config files and demo Jupyter notebooks referenced in this blog post are available on GitHub:

https://github.com/shashankprasanna/kubernetes-sagemaker-demos.git

For a deep dive on how to implement distributed training, model tuning and model hosting examples with Amazon SageMaker Operators for Kubernetes check out this accompanying post:

Amazon SageMaker Operators for Kubernetes — examples for distributed training, hyperparameter tuning and model hosting

A match made in the cloud

The Kubernetes and Kubeflow projects enjoy a strong user community and are among the fastest-growing open-source projects in machine learning. As long as you have the in-house expertise to set up, manage and troubleshoot Kubernetes clusters, you can get everything you need as a data scientist or machine learning researcher: Jupyter notebooks and support for distributed training with Kubeflow, hyperparameter tuning with Kubeflow and Katib, and easy inference deployment with KFServing. As a Kubernetes user, you have complete flexibility in where you run it (on-prem or in the cloud) and what systems you run it on. This also means you’re responsible for keeping cluster utilization high to reduce operational costs, which can be challenging given the bursty, spiky nature of machine learning workloads.

Amazon SageMaker takes a different approach. For starters, it offers a fully managed suite of services for almost every part of the machine learning workflow: data labeling, a hosted Jupyter notebook development environment, managed training clusters that are automatically provisioned and torn down after use, hyperparameter optimization, managed model hosting services, and more. As an Amazon SageMaker user, you don’t have to focus on things like infrastructure management and cluster utilization.

As a machine learning practitioner, you should be able to leverage the benefits of both. For example, you should be able to pair a continuously running, (more or less) fixed-capacity, self-managed Kubernetes infrastructure with an on-demand, fully managed and elastic Amazon SageMaker infrastructure that is provisioned only when you need it. That’s a powerful idea: data science teams can let their ideas run loose and experiment to their heart’s content, without the constraints imposed by existing Kubernetes setups.

You can already do this today, but not without switching back and forth between the two systems. With Amazon SageMaker Operators for Kubernetes, you can now do it without ever leaving the Kubernetes environment you’re already familiar with.

Scenarios and use cases

With Amazon SageMaker Operators for Kubernetes you can offload workloads such as single-node training, distributed (multi-node) training, large-scale hyperparameter tuning and inference hosting to Amazon SageMaker’s fully managed infrastructure. So the question becomes: when does it make sense to offload to Amazon SageMaker versus run the workload on your Kubernetes cluster?

Let’s explore this with a couple of hypothetical scenarios.

Scenario #1 — excess capacity for large-scale training

Use Amazon SageMaker Operators for Kubernetes to submit training jobs via kubectl. Amazon SageMaker provisions the required capacity and runs the training job.

Let’s say you’re currently running a Kubernetes cluster in your local data center or on AWS using Amazon EKS. When you set it up, you budgeted for and chose the number of CPUs, GPUs and the amount of storage based on the workloads at the time of provisioning. Now your team has grown, or you have more data and need more compute horsepower. You have a tight deadline on a machine learning training experiment that could be completed in one day if you had access to 128 GPUs, but all the GPUs on your Kubernetes cluster are busy with other projects. You just need excess burst capacity for a short period of time.

Your options are:

  1. Extend your existing Kubernetes cluster and add the required resources
  2. Spin up another Kubernetes cluster with the required resources
  3. Use Amazon SageMaker for on-demand provisioning

Options (1) and (2) mean additional infrastructure work you didn’t sign up for. Option (3) is a great choice, but it requires you to leave your familiar Kubernetes environment and isn’t integrated with any CI/CD automation you have set up.

There’s a 4th option: use Amazon SageMaker Operators for Kubernetes to submit Amazon SageMaker jobs via kubectl, just as you’d submit other Kubernetes jobs. Behind the scenes, an Amazon SageMaker managed cluster with the specified number of instances is provisioned automatically for you. The training job then runs on that Amazon SageMaker managed cluster, and once training is done the cluster is automatically torn down, and you’re billed only for the exact duration of the training.

Scenario #2 — hosting scalable inference endpoints

Use Amazon SageMaker Operators for Kubernetes to host inference endpoints via kubectl. Amazon SageMaker provisions the required instances and runs the model servers.

Let’s consider another scenario. You have CI/CD automation set up around Kubernetes for training, validation and deployment. The model you’ve hosted on Kubernetes is consumed by your customers via an endpoint, through a mobile app or a website. The model is hosted on a GPU instance since latency and performance are critical to your customer experience. You want to free up GPU resources for training, and you need the ability to auto scale and perform real-time model monitoring. These capabilities are already available in Amazon SageMaker hosting services, but you want to leverage them without disrupting your existing CI/CD workflow. Using Amazon SageMaker Operators for Kubernetes, you can deploy a trained model right from Kubernetes in the same declarative fashion, with config files in YAML, which integrates easily into your existing setup while still giving you the benefits of Amazon SageMaker hosting.

Let’s now take a look at some common usage patterns for using Kubernetes and Amazon SageMaker together.

Use case #1 — Distributed training with TensorFlow, PyTorch, MXNet and other frameworks

Workflow: User uploads training code to Amazon S3. Amazon SageMaker downloads training code, pulls the specified framework container and runs the training script in it. User doesn’t have to deal with building and pushing containers.

With distributed (a.k.a. multi-node) training, you can dramatically reduce the time to train a model by distributing the workload across multiple GPUs. When you’re low on GPU capacity in your Kubernetes cluster, you can configure a distributed training job to run on an Amazon SageMaker managed cluster instead. In addition to quick access to excess capacity on AWS, you also get additional Amazon SageMaker benefits, such as the ability to use Spot Instances to dramatically lower training costs, monitoring and tracking of training jobs in the AWS console or with the AWS CLI, and the ability to host the trained model with a few clicks.
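For example, once a job has been offloaded, you can track it with the AWS CLI. A minimal sketch (the job name is a placeholder; use the name SageMaker assigned to your job, visible in the console or in kubectl output):

# List training jobs that are currently running
aws sagemaker list-training-jobs --status-equals InProgress

# Inspect one job: instance config, billable seconds, model artifact location
aws sagemaker describe-training-job --training-job-name <job-name>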

If you’re working with widely used frameworks such as TensorFlow, PyTorch, MXNet, XGBoost and others, all you have to do is upload your training script to Amazon S3 as a tar.gz file and submit a training job to Amazon SageMaker via a Kubernetes config file written in YAML. Take a look at the GitHub repository for the example code and config files. Here are the changes you’ll need to make to submit your Amazon SageMaker training job via Kubernetes’ kubectl.
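Getting the script into S3 first takes a couple of commands. A minimal sketch, assuming the AWS CLI is configured and reusing the script name and S3 path from the example config below:

# Package the training script the way SageMaker script mode expects it
tar -czf sourcedir.tar.gz cifar10-multi-gpu-horovod-sagemaker.py

# Upload it to the S3 location referenced by sagemaker_submit_directory
aws s3 cp sourcedir.tar.gz s3://sagemaker-jobs/training-scripts/sourcedir.tar.gz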

Here’s an excerpt from the k8s-sm-dist-training-script.yaml file that you’ll find in the GitHub repository for this blog post.

apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: k8s-sm-dist-training-script
spec:
  hyperParameters:
    - name: learning-rate
      value: "0.001"
    - name: batch-size
      value: "256"
    - name: sagemaker_program
      value: 'cifar10-multi-gpu-horovod-sagemaker.py'
    - name: sagemaker_submit_directory
      value: 's3://sagemaker-jobs/training-scripts/sourcedir.tar.gz'
  ...
  algorithmSpecification:
    trainingImage: 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.2-gpu-py27-cu100-ubuntu18.04
    trainingInputMode: File
  ...
  resourceConfig:
    instanceCount: 128
    instanceType: "ml.p3.2xlarge"
    volumeSizeInGB: 50
  ...

This reads like any other Kubernetes config written in YAML. For training jobs, you’ll notice right at the top that kind is set to TrainingJob.

Here are a few key sections where you specify aspects of your training job:

  • hyperParameters — These are specified in the YAML spec, so you can automate running different experiments by changing them and submitting new training jobs
  • sagemaker_submit_directory — The S3 location where you uploaded your training scripts. This is unique compared to submitting a training job on Kubernetes, since you don’t have to build a custom container! Amazon SageMaker automatically downloads your training script into an existing TensorFlow container and runs the training for you. No messing with Dockerfiles and custom containers.
  • resourceConfig — How many instances of which type you need. This config requests 128 ml.p3.2xlarge instances (one V100 GPU each) to run distributed training on 128 GPUs.
  • trainingImage — Pick from pre-built containers for TensorFlow, PyTorch and MXNet, for training or inference, for Python 2 or Python 3, and for CPU or GPU.

Submit the job just like you would any other Kubernetes config file.

kubectl apply -f k8s-sm-dist-training-script.yaml
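Once submitted, you can monitor the job like any other Kubernetes resource. A rough sketch; the last command assumes the operator’s smlogs kubectl plugin is installed:

# Check the job status reported back from Amazon SageMaker
kubectl get trainingjob k8s-sm-dist-training-script

# Full details, including failure reasons if something went wrong
kubectl describe trainingjob k8s-sm-dist-training-script

# Stream the CloudWatch training logs (requires the operator's smlogs plugin)
kubectl smlogs trainingjob k8s-sm-dist-training-script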

Use case #2 — Distributed training with a custom container

Workflow: User builds a custom container locally and pushes it to Amazon ECR. Amazon SageMaker pulls the custom container and runs it on a fully-managed training cluster.

If you’re working with custom, proprietary algorithms and build your own Docker containers, you’d prefer to specify the container image rather than a TensorFlow, PyTorch or MXNet training script. Unlike in use case #1, you’ll have to go through additional steps: first build the custom Docker container locally, push it to Amazon Elastic Container Registry (ECR), and then specify its URI under trainingImage. If you don’t have custom algorithms that require building custom containers, I recommend following the approach in use case #1.
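The build-and-push step is a standard Docker and ECR workflow, not specific to the operator. A rough sketch, assuming the ECR repository already exists and using the same <ACCOUNT_ID> and <IMAGE> placeholders as the config below:

# Build the custom training image locally
docker build -t <IMAGE>:latest .

# Authenticate Docker with your ECR registry
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin <ACCOUNT_ID>.dkr.ecr.us-west-2.amazonaws.com

# Tag and push; the resulting URI goes under trainingImage
docker tag <IMAGE>:latest <ACCOUNT_ID>.dkr.ecr.us-west-2.amazonaws.com/<IMAGE>:latest
docker push <ACCOUNT_ID>.dkr.ecr.us-west-2.amazonaws.com/<IMAGE>:latest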

apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: k8s-sm-dist-custom-container
spec:
  hyperParameters:
    - name: learning-rate
      value: "0.001"
    - name: weight-decay
      value: "0.0002"
  ...
  algorithmSpecification:
    trainingImage: <ACCOUNT_ID>.dkr.ecr.us-west-2.amazonaws.com/<IMAGE>:latest
    trainingInputMode: File
    metricDefinitions:
      - name: val_acc
        regex: 'val_acc: ([0-9\\.]+)'

The code in the GitHub repository also includes Jupyter notebooks that replicate these steps.

Submit the job:

kubectl apply -f k8s-sm-dist-custom-container.yaml

Use case #3 — Hyperparameter optimization at-scale

Hyperparameters for a machine learning model are options that are not optimized or learned during the training phase. Amazon SageMaker offers a hyperparameter optimization feature that implements both Bayesian and random search. This is not unlike the capability offered by Kubeflow’s Katib project. To run a large-scale hyperparameter tuning job on Amazon SageMaker, create a Kubernetes config file of kind: HyperparameterTuningJob. Here you specify hyperparameter ranges instead of fixed hyperparameters, which instructs Amazon SageMaker to try different combinations and arrive at the best model. maxNumberOfTrainingJobs specifies the total number of training jobs you want to run with different hyperparameter combinations, and maxParallelTrainingJobs specifies how many of those jobs should run in parallel at any given time.

apiVersion: sagemaker.aws.amazon.com/v1
kind: HyperparameterTuningJob
metadata:
  name: k8s-sm-hyperopt-training-script
spec:
  hyperParameterTuningJobConfig:
    resourceLimits:
      maxNumberOfTrainingJobs: 32
      maxParallelTrainingJobs: 8
    strategy: "Bayesian"
    trainingJobEarlyStoppingType: Auto
    hyperParameterTuningJobObjective:
      type: Maximize
      metricName: 'val_acc'
    parameterRanges:
      continuousParameterRanges:
        - name: learning-rate
          minValue: '0.0001'
          maxValue: '0.1'
          scalingType: Logarithmic
      ...
      categoricalParameterRanges:
        - name: optimizer
          values:
            - 'sgd'
            - 'adam'
  ...

Submit the job:

kubectl apply -f k8s-sm-hyperopt-training-script.yaml
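The tuning job can then be tracked from Kubernetes, and the best training job found so far can be queried with the AWS CLI. A minimal sketch (the tuning job name is a placeholder; the name SageMaker assigned appears in the kubectl describe output and in the console):

# Check tuning job status from Kubernetes
kubectl get hyperparametertuningjob k8s-sm-hyperopt-training-script
kubectl describe hyperparametertuningjob k8s-sm-hyperopt-training-script

# Query SageMaker directly for details, including the best training job so far
aws sagemaker describe-hyper-parameter-tuning-job --hyper-parameter-tuning-job-name <tuning-job-name>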

Use case #4 — Hosting an inference endpoint with BYO model

Workflow: User uploads a trained model as a tar.gz file to Amazon S3. If the model was trained using Amazon SageMaker, the model.tar.gz will already be available on Amazon S3. Amazon SageMaker downloads the model file, pulls the serving container and hosts the endpoint on a fully-managed instance.

Once the model is trained, you can host it using Amazon SageMaker hosting rather than on your Kubernetes cluster, and leverage additional capabilities and cost-saving features for inference deployments. To deploy, you need to create a config file of kind: HostingDeployment. Here you specify the type of instance, a weight for A/B testing if you’re hosting multiple models, and the location of the trained model on Amazon S3, as shown below.

apiVersion: sagemaker.aws.amazon.com/v1
kind: HostingDeployment
metadata:
  name: k8s-sm-inference-host-endpoint
spec:
  region: us-west-2
  productionVariants:
    - variantName: AllTraffic
      modelName: tf-cifar10-resnet-model
      initialInstanceCount: 1
      instanceType: ml.c5.large
      initialVariantWeight: 1
  models:
    - name: tf-cifar10-resnet-model
      executionRoleArn: arn:aws:iam::<ACCOUNT_ID>:role/service-role/AmazonSageMaker-ExecutionRole-20190820T113591
      containers:
        - containerHostname: tensorflow
          modelDataUrl: s3://sagemaker-jobs/trained-tf-model/model.tar.gz
          image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-inference:1.15.2-cpu-py36-ubuntu18.04
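After applying this config, you can watch the endpoint come up from Kubernetes and, once it’s InService, invoke it with the AWS CLI. A minimal sketch (the endpoint name and payload file are placeholders; the payload format depends on your model server):

# Check deployment status; the generated endpoint name appears in the output
kubectl get hostingdeployment k8s-sm-inference-host-endpoint
kubectl describe hostingdeployment k8s-sm-inference-host-endpoint

# Invoke the endpoint once it is InService
aws sagemaker-runtime invoke-endpoint --endpoint-name <endpoint-name> --content-type application/json --body fileb://payload.json response.json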

Get ready to implement!

In this post, I gave you a quick overview of Amazon SageMaker Operators for Kubernetes and how you can use them with your existing Kubernetes cluster. I presented 2 scenarios and 4 different use cases for leveraging Amazon SageMaker benefits without ever leaving your Kubernetes environment.

For step-by-step instructions on how to implement the examples presented in this blog post check out this accompanying post:

Amazon SageMaker Operators for Kubernetes — examples for distributed training, hyperparameter tuning and model hosting

To run the examples head over to GitHub:
https://github.com/shashankprasanna/kubernetes-sagemaker-demos.git

If you have questions, please reach out to me on Twitter (@shshnkp), LinkedIn, or leave a comment below.
