
Orchestrating Transient Data Analytics Workflows via AWS Step Functions

AWS Automation Tips & Tricks

Photo by Crystal Kwok on Unsplash

Introduction

AWS Step Functions is a fully managed service designed to coordinate and chain a series of steps together into what is called a state machine for automation tasks. It supports visual workflows, and state machines are defined as JSON structures in the Amazon States Language (ASL). In addition, state machines can be scheduled via an Amazon CloudWatch Events rule with a cron expression.
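To make that concrete, here is a minimal, hypothetical ASL definition: a single Pass state that returns a static result, which shows the required StartAt/States skeleton that the examples below build on.

```json
{
  "Comment": "Minimal sketch of a state machine: one Pass state",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Pass",
      "Result": "Hello from Step Functions!",
      "End": true
    }
  }
}
```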

In this blog, I will walk you through 1.) how to orchestrate data processing jobs via Amazon EMR and 2.) how to apply batch transform on a trained machine learning model to write predictions via Amazon SageMaker. Step Functions can be integrated with a wide variety of AWS services, including AWS Lambda, AWS Fargate, AWS Batch, AWS Glue, Amazon ECS, Amazon SQS, Amazon SNS, Amazon DynamoDB, and more.

Example 1: Orchestrate Data Processing Jobs via Amazon EMR

1a.) Let’s view our input sample dataset (dummy data from my favorite video game) in Amazon S3.

Image by Author

1b.) Next, I will create a state machine that spins up an EMR cluster (a group of EC2 instances) via ASL.
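The screenshot of the definition is not reproduced here, but a task state using the createCluster.sync service integration looks roughly like the sketch below. The cluster name, EMR release, IAM roles, bucket path, and instance settings are placeholder assumptions, not my exact values.

```json
"Create_EMR_Cluster": {
  "Type": "Task",
  "Comment": "Sketch only: name, release, roles, bucket, and instances are placeholders",
  "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
  "Parameters": {
    "Name": "transient-analytics-cluster",
    "ReleaseLabel": "emr-5.30.0",
    "Applications": [{ "Name": "Spark" }],
    "ServiceRole": "EMR_DefaultRole",
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "LogUri": "s3://my-bucket/emr-logs/",
    "Instances": {
      "KeepJobFlowAliveWhenNoSteps": true,
      "InstanceGroups": [
        { "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1 },
        { "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2 }
      ]
    }
  },
  "ResultPath": "$.cluster",
  "Next": "Submit_Spark_Step"
}
```

The .sync suffix makes the state wait until the cluster is fully up, and ResultPath stashes the returned ClusterId for later states.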

1c.) Now we can perform some data processing (a simple partition by a column) by submitting a job to the cluster and terminating the infrastructure upon completion via ASL. Let’s also inspect our output data in S3.
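Sketched in the same spirit (the spark-submit script path and step name are placeholders rather than my actual job), the submit-and-terminate states could look like this:

```json
"Submit_Spark_Step": {
  "Type": "Task",
  "Comment": "Sketch only: the spark-submit arguments are placeholders",
  "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
  "Parameters": {
    "ClusterId.$": "$.cluster.ClusterId",
    "Step": {
      "Name": "partition-by-column",
      "ActionOnFailure": "TERMINATE_CLUSTER",
      "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-bucket/scripts/partition_job.py"]
      }
    }
  },
  "ResultPath": "$.step",
  "Next": "Terminate_EMR_Cluster"
},
"Terminate_EMR_Cluster": {
  "Type": "Task",
  "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
  "Parameters": { "ClusterId.$": "$.cluster.ClusterId" },
  "End": true
}
```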

Image by Author

1d.) Here is the complete JSON ASL structure and a visual workflow screenshot of what is going on.
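Stitched together, the full definition has the following shape (the state bodies are elided here since they were sketched in 1b and 1c):

```json
{
  "Comment": "Transient EMR workflow: create cluster, run step, terminate",
  "StartAt": "Create_EMR_Cluster",
  "States": {
    "Create_EMR_Cluster": { "Comment": "see the sketch in 1b" },
    "Submit_Spark_Step": { "Comment": "see the sketch in 1c" },
    "Terminate_EMR_Cluster": { "Comment": "see the sketch in 1c" }
  }
}
```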

Image by Author

1e.) Finally, let’s schedule the state machine via CloudWatch to execute every 15 minutes with a simple cron expression: cron(0/15 * * * ? *).
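For reference, the rule boils down to a CloudWatch Events PutRule request along the lines of the following sketch (the rule name is a placeholder):

```json
{
  "Name": "run-emr-workflow-every-15-min",
  "ScheduleExpression": "cron(0/15 * * * ? *)",
  "State": "ENABLED"
}
```

A follow-up PutTargets call then points the rule at the state machine’s ARN, together with an IAM role that allows CloudWatch Events to start executions.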

Image by Author

In summary for this example, you can utilize Step Functions to automate and schedule your data processing jobs. The level of sophistication typically depends on data volume and can range from a few simple steps on a single EC2 machine to distributing multiple jobs in parallel on the same cluster or across multiple clusters with different instance types. I recommend including additional EMR tuning configurations for the selected software (e.g. YARN, Spark, Hive, Sqoop) in the JSON ASL to optimize job performance; a sketch of what that looks like follows below. Also, choose the number and type of EC2 instances wisely to save costs and execution time. Your decision should mostly depend on the total data volume that needs to be processed and the job type (CPU or memory constrained).
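Such tuning can be passed through a Configurations block inside the createCluster Parameters. The classifications below are real EMR configuration classifications, but the property values are illustrative assumptions, not recommendations:

```json
"Configurations": [
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "4g",
      "spark.sql.shuffle.partitions": "200"
    }
  },
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.vmem-check-enabled": "false"
    }
  }
]
```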

On to the next example …

Example 2: Apply Batch Transform on Data for Inference with a Trained ML Model via Amazon SageMaker

2a.) For data understanding, let’s view the raw labeled training dataset in S3 (the abalone dataset from the UCI Machine Learning Repository; the dependent variable is the last column, named rings). The trained model predicts the age of an abalone (a type of shellfish) from physical measurements.

https://archive.ics.uci.edu/ml/datasets/abalone

Image by Author

2b.) Next, let’s create the ASL structure that triggers a batch transform job on a raw, unlabeled batch dataset (no rings column) that needs inference via the trained model stored in S3. Please note that you currently must attach an inline policy to the IAM role selected for the state machine so it can create and monitor the transform job.
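A sketch of such a task state, assuming a hypothetical model named abalone-pipeline-model and placeholder bucket paths:

```json
"Batch_Transform": {
  "Type": "Task",
  "Comment": "Sketch only: job name, model name, paths, and instance type are placeholders",
  "Resource": "arn:aws:states:::sagemaker:createTransformJob.sync",
  "Parameters": {
    "TransformJobName": "abalone-batch-transform",
    "ModelName": "abalone-pipeline-model",
    "TransformInput": {
      "ContentType": "text/csv",
      "DataSource": {
        "S3DataSource": {
          "S3DataType": "S3Prefix",
          "S3Uri": "s3://my-bucket/abalone/batch-input/"
        }
      }
    },
    "TransformOutput": {
      "S3OutputPath": "s3://my-bucket/abalone/batch-output/"
    },
    "TransformResources": {
      "InstanceCount": 1,
      "InstanceType": "ml.m5.xlarge"
    }
  },
  "End": true
}
```

Note that transform job names must be unique, so in practice the name is usually injected from the execution input (e.g. "TransformJobName.$": "$.jobName").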

Image by Amazon Web Services

2c.) Lastly, we can view the Step Functions visual workflow and the job’s output results with a prediction score column added. Note that the model is a pipeline model that preprocesses the data (one-hot encoding, scaling, etc.) before sending it to the supervised learning algorithm.

Image by Author

In summary for this example, you can utilize Step Functions to automate and schedule your machine learning jobs (pre-processing, training, tuning, model hosting, self-service & batch inference). The Step Functions integration with SageMaker supports the following APIs: CreateEndpoint, CreateEndpointConfig, CreateHyperParameterTuningJob, CreateLabelingJob, CreateModel, CreateTrainingJob, and CreateTransformJob.

Conclusion

These two examples cover only a small portion of what AWS services can do to scale data engineering and machine learning workflows. Thank you for reading this blog.

