
Optimizing the price-performance ratio of a Serverless Inference Service with Amazon SageMaker

Finding optimal settings for inference with AWS Lambda using SageMaker Hyperparameter Tuning jobs and Locust

I have recently published a step-by-step guide to serverless model deployments with Amazon SageMaker Pipelines, Amazon API Gateway, and AWS Lambda.

Photo by Pineapple Supply Co. on Unsplash

With AWS Lambda, you pay only for what you use. Lambda charges based on the number of requests, execution duration, and amount of memory allocated to the function. So how much memory should you allocate to your inference function?

In this post, I will show how you can use SageMaker Hyperparameter Tuning (HPO) jobs and a load-testing tool to automatically optimize the price-performance ratio of your serverless inference service.

We will reuse the XGBoost model binary and Lambda inference container from my previous post. We will specify a Lambda memory range and let SageMaker find the optimal value based on a latency-memory aggregate score. SageMaker will run many jobs, each creating an inference service, load testing it, and returning the score to optimize on. We will use Locust and the Invokust wrapper to run the load tests from within the SageMaker jobs.

Walkthrough overview

We will find the optimal memory allocation for Lambda in 3 steps:

  • We will first create and delete inference services from SageMaker Training jobs using Boto3.
  • Then, we will load test the services with Invokust, and generate an aggregate score for price-performance.
  • Finally, I will show how you can launch a SageMaker HPO job to automatically find the optimal memory value.

Prerequisites

To go through this example, make sure you have the following:

  1. We will reuse the container and the model binary from my previous post for the serverless inference service. Make sure you are familiar with this example before starting.
  2. Be familiar with SageMaker hyperparameter optimization (HPO) jobs.
  3. Have access to a SageMaker environment via Studio, a Notebook Instance, or your laptop.
  4. This GitHub repository cloned into your environment to follow the steps.

Step 1: Creating and deleting inference services from SageMaker Training jobs using Boto3

First, we will use Boto3 to create simple inference services based on API Gateway and AWS Lambda. Each SageMaker Training job will start by creating a service, and then delete it before ending.

Image by author: A view of the serverless inference service

For the Training jobs, we will use the PyTorch Estimator and the scripts from the source_dir folder. Here, we are more interested in the convenience of a pre-built container than in the framework itself.

The container entry point is entry_point.py and you can find the boto3 scripts under the stack folder.

Creating the Lambda function with Boto3

We will use the ApiGateway and LambdaFunction classes to create the service.

Below is the LambdaFunction class with simple create and delete functions:
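Here is a minimal sketch of what such a class could look like with Boto3, assuming a container-image-based function (the environment variable name, timeout, and other details are illustrative and may differ from the repository code):

```python
import boto3


class LambdaFunction:
    """Creates and deletes a container-based Lambda function for inference."""

    def __init__(self, name, container, model_s3_uri, memory, role):
        self.client = boto3.client("lambda")
        self.name = name
        self.container = container        # ECR image URI with the inference code
        self.model_s3_uri = model_s3_uri  # S3 location of the XGBoost model binary
        self.memory = int(memory)         # memory (MB) tuned by the SageMaker HPO job
        self.role = role                  # IAM role assumed by the function

    def create(self):
        # Create the function from the container image; the model location is
        # passed through an environment variable (name is illustrative)
        return self.client.create_function(
            FunctionName=self.name,
            PackageType="Image",
            Code={"ImageUri": self.container},
            Role=self.role,
            MemorySize=self.memory,
            Timeout=30,
            Environment={"Variables": {"MODEL_S3_URI": self.model_s3_uri}},
        )

    def delete(self):
        # Clean up the function before the Training job ends
        return self.client.delete_function(FunctionName=self.name)
```

A similar class handles the API Gateway side of the service.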

The key parameters for the Lambda functions are:

  • container: container image URI with inference code in it.
  • model_s3_uri: model binary location in S3.
  • memory: memory allocated to the function. It varies based on SageMaker HPO inputs.
  • role: the IAM role assumed by the function.

Step 2: Load testing the inference service and giving it a price-performance score

When an inference service is created, our entry_point.py performs a load test to measure its response latency.

For this we use Locust, a developer-friendly, open-source load-testing tool written in Python. Invokust is a wrapper for running Locust from Python itself, without the need for the locust command line, which makes it very easy to run a load test from within a SageMaker Training job.

SageMaker script mode makes it easy to install those dependencies via a requirements.txt file.
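For example, a requirements.txt shipped next to the entry point could be as simple as the following (add version pins as needed):

```
locust
invokust
```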

Creating a locust API user with example payload

You can find below an example user behavior for Locust to use:
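Here is a sketch of what that user class could look like, assuming the service exposes a POST route behind API Gateway (the path and payload below are placeholders to adapt to your model):

```python
from locust import HttpUser, task, between

# Placeholder payload: replace with a feature vector your XGBoost model expects
EXAMPLE_PAYLOAD = {"features": [0.5, 1.2, 3.4, 0.0]}


class APIUser(HttpUser):
    # Small random wait between requests from each simulated user
    wait_time = between(0.1, 1)

    @task
    def predict(self):
        # POST the example payload to the inference route exposed by API Gateway
        self.client.post("/predict", json=EXAMPLE_PAYLOAD)
```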

It will send an example payload to the inference service so it can get predictions from it.

Running the load test and scoring the inference service

We then use Invokust to simulate 1000 users, a spawn rate of 100, and a run time of 1 minute for the load test:
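A sketch of those settings could look like the following (argument names have changed across invokust releases, so check the version you install; the endpoint URL is a placeholder):

```python
import invokust

# API Gateway endpoint created earlier in the job (placeholder URL)
API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com"

# Load-test configuration: 1000 users, spawn rate of 100, 1 minute run time
settings = invokust.create_settings(
    classes=[APIUser],  # the Locust user class defined above
    host=API_URL,
    num_users=1000,
    spawn_rate=100,
    run_time="1m",
)
load_test = invokust.LocustLoadTest(settings)
```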

This will allow us to gather statistics on the inference service. In particular, we use the 95th percentile response time as the key metric for assessing service performance.

Below is the code snippet for load testing from our entry point:
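Continuing from the sketch above, the load-test and scoring step could look like this (how the 95th percentile statistic is read and the "Score:" log format are assumptions that depend on your locust/invokust versions and on the metric regex you configure in Step 3):

```python
import argparse

# The memory value sampled by the HPO job arrives as a script argument
# (SageMaker script mode passes hyperparameters as command-line arguments)
parser = argparse.ArgumentParser()
parser.add_argument("--memory", type=int, default=1024)
args, _ = parser.parse_known_args()
memory = args.memory

# Run the load test against the freshly created inference service
load_test.run()

# 95th percentile response time in milliseconds; if your invokust version
# does not expose the Locust Environment as `env`, read the statistic from
# load_test.stats() instead
p95_ms = load_test.env.stats.total.get_response_time_percentile(0.95)

# Basic price-performance aggregate: Lambda memory (MB) x p95 latency (ms).
# Log it so SageMaker can parse it with a metric regex and minimize it.
score = memory * p95_ms
print(f"Score: {score}")
```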

In the scoring step, we calculate a basic aggregate score for the inference service. For illustrative purposes, we simply multiply the Lambda memory by the 95th percentile response time. This is the number we will ask SageMaker to minimize with the HPO job.

Feel free to use more complex scoring based on latency and cost requirements for your ML project.


Step 3: Launching a SageMaker HPO job to find the optimal price-performance ratio

Now we are ready to launch our SageMaker HPO job! You can find below an example notebook to launch one with our code:
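A minimal sketch of the launch code could look like the following, assuming the "Score:" log line from Step 2 (the role, framework versions, memory range, and job counts are placeholders to adapt):

```python
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

# Estimator wrapping entry_point.py; the PyTorch container is just a
# convenient way to run our Boto3/Locust code (versions are illustrative)
estimator = PyTorch(
    entry_point="entry_point.py",
    source_dir="source_dir",
    role="<your-sagemaker-execution-role>",
    framework_version="1.8.1",
    py_version="py36",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Bayesian search over the Lambda memory allocation (illustrative range, in MB)
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="score",
    objective_type="Minimize",
    metric_definitions=[{"Name": "score", "Regex": "Score: ([0-9\\.]+)"}],
    hyperparameter_ranges={"memory": IntegerParameter(128, 3008)},
    strategy="Bayesian",
    max_jobs=20,
    max_parallel_jobs=2,
)

tuner.fit()
```

SageMaker passes each sampled memory value to the Training job as a hyperparameter, which entry_point.py can read (for example with argparse, as sketched in Step 2) and use when creating the Lambda function.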

The HPO job will use Bayesian search and try to minimize the aggregate score of our inference service, finding an optimal latency-memory (cost) trade-off for our XGBoost model.

Analyzing results of your HPO job

When your HPO job ends, you can navigate to the SageMaker console and find its results. Under the Best training job tab, you can find the lowest score found by SageMaker and the associated Lambda memory value you can use in your inference service.

Image by author: In my case, 300MB seems like a good value to allocate to Lambda

A quick look at the CloudWatch logs for this job shows a 53ms response time for this inference service:

Image by author: A view of the CloudWatch logs with scoring from a job

Conclusion

In this post, I have shown how you can use Amazon SageMaker to optimize the price-performance ratio of a Serverless Inference Service. In my previous post on serverless deployment with SageMaker Pipelines, I allocated 1024MB to the Lambda function. By using SageMaker HPO, we automatically discovered that we can allocate 300MB instead and still get a 53ms response latency from the service. That is more than a 3x difference in memory allocation!

Now, how about using this pattern to optimize other infrastructure stacks? SageMaker endpoints, KFServing, or others? Please feel free to share your ideas.

