
Introduction
Data processing (i.e. cleaning, preparing, transforming, or curating) is an essential step in developing resilient, reusable, and reliable Machine Learning (ML) pipelines. Amazon Web Services provides specific services for performing many data and ML processing tasks. Amazon SageMaker is an ML service designed to build, train, and deploy ML models across the entire ML lifecycle. Scikit-Learn (a.k.a. SKLearn) is a Python ML library that covers a wide range of data science tasks: statistics, feature engineering, supervised learning, and unsupervised learning. Apache Spark is a distributed big data analytics framework designed to transform, engineer, and process massive amounts of data (think terabytes and petabytes) across a cluster of machines. Spark contains many data engineering and data science interfaces such as Spark SQL, Spark Structured Streaming, and Spark ML. Both SKLearn and Spark are fully supported and integrated within the SageMaker Python SDK, which makes it possible to deploy SKLearn/Spark code via Amazon SageMaker Processing. SageMaker Processing is a platform feature specialized for running various types of data and ML transformations (i.e. pre-processing and post-processing), such as feature engineering and generating training, validation, and test sets for data validation and model evaluation.
Here is the official Amazon SageMaker Processing documentation on all currently available features.
SageMaker fully supports deploying customized data processing jobs for ML pipelines via the SageMaker Python SDK. For building, SageMaker provides pre-built SKLearn/Spark Docker images (fully managed, running on Amazon EC2). For SKLearn scripts, the main SageMaker class is sagemaker.sklearn.processing.SKLearnProcessor. For Spark scripts, the main SageMaker classes are sagemaker.spark.processing.PySparkProcessor (Python) and sagemaker.spark.processing.SparkJarProcessor (Scala).
In this blog, I will walk you through examples of how to deploy a customized data processing and feature engineering script on Amazon SageMaker Processing via 1.) SKLearn, 2.) Spark Python, and 3.) Spark Scala.
The public dataset used for these demonstrations is available here:
https://archive.ics.uci.edu/ml/datasets/abalone
Each script will perform some basic feature engineering techniques. First, as a best practice, the dataset will be split into train and test sets for model evaluation and to prevent feature extraction leakage. For categorical predictor variables, One Hot Encoding will be applied. For numeric predictor variables, Standard Scaler will be applied. It is important to perform both a .fit() and a .transform() on the train set, but only a .transform() on the test set. This ensures the trained model will not include any bias and avoids learning/computing a new mean/variance on the unseen features in the test set. This is a critical step to implement properly, as each slice of the data serves a particular purpose in the pipeline: the train set is used to train the model, the validation set is used to tune, estimate performance, and compare multiple models, and the test set is used to evaluate the predictive strength of the model.
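To make the fit/transform pattern concrete, here is a minimal SKLearn sketch, assuming a tiny illustrative frame with one categorical and one numeric Abalone-style column (the full scripts below operate on the real dataset):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame (stand-in for the Abalone data)
df = pd.DataFrame({
    "sex": ["M", "F", "I", "M", "F", "I"],
    "length": [0.455, 0.35, 0.53, 0.44, 0.33, 0.51],
    "rings": [15, 7, 9, 10, 7, 8],
})
X, y = df.drop("rings", axis=1), df["rings"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex"]),  # categorical predictor
    ("scale", StandardScaler(), ["length"]),                      # numeric predictor
])

train_features = preprocess.fit_transform(X_train)  # fit AND transform on the train set only
test_features = preprocess.transform(X_test)        # transform only on the test set -> no leakage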
Disclaimer: The public dataset and EC2 instance types used in this blog involve very small data volumes and compute sizes; they were chosen for demonstration purposes and cost savings only. Size, configure, and tune your infrastructure and applications accordingly.
Example 1: SKLearn SageMaker Processing
1a.) First, import dependencies and, if desired, set the S3 bucket/prefixes.
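A minimal setup sketch; the bucket name matches the S3 paths shown later in this post, while the input/output prefixes are placeholders:

import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

role = sagemaker.get_execution_role()     # IAM role assumed by the processing job

bucket = "sagemaker-processing-examples"  # bucket used throughout this post
input_prefix = "raw-data"                 # hypothetical prefix holding the raw Abalone CSV
output_prefix = "sklearn-datasets"        # prefix queried in step 1d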
1b.) Next, initialize the appropriate class instance (i.e. SKLearnProcessor) with any additional parameters.
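For example, a sketch of the initialization (the framework version and instance settings are assumptions, not requirements):

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",      # pre-built SKLearn image version (assumed)
    role=role,
    instance_type="ml.m5.xlarge",    # small instance for demo purposes
    instance_count=1,
    base_job_name="sklearn-processing",
)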
1c.) Now, execute the job with the appropriate input(s), output(s), and argument(s). For example, this job reads raw data stored in S3, splits the data into train and test sets, performs feature engineering, and writes the results to storage (copied from the instance's local EBS volume to S3).
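A sketch of the run() call, using hypothetical S3 and container paths (the script itself is shown in step 1e):

sklearn_processor.run(
    code="sklearn-processing.py",
    inputs=[
        ProcessingInput(
            source=f"s3://{bucket}/{input_prefix}/abalone.csv",  # hypothetical raw data location
            destination="/opt/ml/processing/input",              # local path inside the container
        )
    ],
    outputs=[
        ProcessingOutput(
            source="/opt/ml/processing/train",                   # written by the script
            destination=f"s3://{bucket}/{output_prefix}/train",
        ),
        ProcessingOutput(
            source="/opt/ml/processing/test",
            destination=f"s3://{bucket}/{output_prefix}/test",
        ),
    ],
    arguments=["--train-test-split-ratio", "0.2"],               # hypothetical script argument
)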
1d.) Confirm and view the S3 output results (features, labels) via the AWS CLI and an S3 Select query.
aws s3 ls --recursive s3://sagemaker-processing-examples/sklearn-datasets/

SELECT * FROM s3object s LIMIT 5

SELECT * FROM s3object s LIMIT 5

1e.) For reference, here is the complete script (sklearn-processing.py) I developed that is being called by SageMaker in this example.
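Since the original script is embedded separately, here is a condensed sketch of what sklearn-processing.py does under the assumptions above (column names follow the Abalone dataset; the argument name and output file layout are illustrative):

import argparse
import os

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.2)
    args = parser.parse_args()

    # SageMaker Processing mounts the ProcessingInput at this container path
    columns = ["sex", "length", "diameter", "height", "whole_weight",
               "shucked_weight", "viscera_weight", "shell_weight", "rings"]
    df = pd.read_csv("/opt/ml/processing/input/abalone.csv", names=columns)

    # Split before fitting any transformers to prevent feature extraction leakage
    X, y = df.drop("rings", axis=1), df["rings"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=args.train_test_split_ratio, random_state=42
    )

    preprocess = ColumnTransformer(
        [
            ("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
            ("scale", StandardScaler(), columns[1:-1]),
        ],
        sparse_threshold=0,  # force a dense output so it can be written as CSV
    )

    train_features = preprocess.fit_transform(X_train)  # fit + transform on train
    test_features = preprocess.transform(X_test)        # transform only on test

    # Write features and labels where the ProcessingOutputs expect them
    os.makedirs("/opt/ml/processing/train", exist_ok=True)
    os.makedirs("/opt/ml/processing/test", exist_ok=True)
    pd.DataFrame(train_features).to_csv("/opt/ml/processing/train/train_features.csv", index=False)
    y_train.to_csv("/opt/ml/processing/train/train_labels.csv", index=False)
    pd.DataFrame(test_features).to_csv("/opt/ml/processing/test/test_features.csv", index=False)
    y_test.to_csv("/opt/ml/processing/test/test_labels.csv", index=False)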
Example 2: Spark Python SageMaker Processing
2a.) First, import dependencies and, if desired, set the S3 bucket/prefixes.
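A minimal setup sketch mirroring Example 1 (the input prefix is a placeholder):

import sagemaker
from sagemaker.spark.processing import PySparkProcessor

role = sagemaker.get_execution_role()
bucket = "sagemaker-processing-examples"   # bucket used throughout this post
input_prefix = "raw-data"                  # hypothetical prefix holding the raw Abalone CSV
output_prefix = "spark-python-datasets"    # prefix queried in step 2d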
2b.) Next, initialize the appropriate class instance (i.e. PySparkProcessor) with any additional parameters.
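For example (framework version and instance settings are assumptions):

spark_processor = PySparkProcessor(
    framework_version="3.0",           # pre-built Spark image version (assumed)
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=2,                  # Spark distributes the work across the cluster
    base_job_name="spark-python-processing",
)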
2c.) Now, execute the job with appropriate argument(s). For example, this job reads raw data stored in S3, splits data into train and test sets, performs feature engineering, and writes to S3 storage.
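A sketch of the run() call; the S3 locations are passed to the script as arguments (argument names are hypothetical, and the script is shown in step 2e):

spark_processor.run(
    submit_app="spark-python-processing.py",
    arguments=[
        "--s3_input_path", f"s3://{bucket}/{input_prefix}/abalone.csv",
        "--s3_output_path", f"s3://{bucket}/{output_prefix}",
    ],
    spark_event_logs_s3_uri=f"s3://{bucket}/spark-event-logs",  # optional: persists Spark event logs
)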
2d.) Confirm and view the S3 output results (features + label dataframes) via the AWS CLI and an S3 Select query.
aws s3 ls --recursive s3://sagemaker-processing-examples/spark-python-datasets/

SELECT * FROM s3object s LIMIT 1

2e.) For reference, here is the complete script (spark-python-processing.py) I developed that is being called by SageMaker in this example.
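Here is a condensed sketch of what spark-python-processing.py does, using the Spark ML (Spark 3.x) equivalents of the SKLearn transformers; the argument names, schema handling, and Parquet output format are illustrative assumptions:

import argparse

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler
from pyspark.sql import SparkSession

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--s3_input_path")
    parser.add_argument("--s3_output_path")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("spark-python-processing").getOrCreate()

    columns = ["sex", "length", "diameter", "height", "whole_weight",
               "shucked_weight", "viscera_weight", "shell_weight", "rings"]
    df = spark.read.csv(args.s3_input_path, inferSchema=True).toDF(*columns)

    # Split first so the transformers are fit on the train set only
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

    pipeline = Pipeline(stages=[
        StringIndexer(inputCol="sex", outputCol="sex_index"),                    # categorical -> index
        OneHotEncoder(inputCols=["sex_index"], outputCols=["sex_vec"]),          # index -> one-hot vector
        VectorAssembler(inputCols=columns[1:-1], outputCol="numeric_vec"),       # numeric columns -> vector
        StandardScaler(inputCol="numeric_vec", outputCol="scaled_numeric_vec"),  # standardize numeric vector
        VectorAssembler(inputCols=["sex_vec", "scaled_numeric_vec"], outputCol="features"),
    ])

    model = pipeline.fit(train_df)  # fit on train only, then transform both sets
    train_out = model.transform(train_df).select("features", "rings")
    test_out = model.transform(test_df).select("features", "rings")

    train_out.write.mode("overwrite").parquet(f"{args.s3_output_path}/train")
    test_out.write.mode("overwrite").parquet(f"{args.s3_output_path}/test")

if __name__ == "__main__":
    main()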
Example 3: Spark Scala SageMaker Processing
3a.) First, import dependencies and, if desired, set the S3 bucket/prefixes.
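Setup sketch, again with placeholder prefixes:

import sagemaker
from sagemaker.spark.processing import SparkJarProcessor

role = sagemaker.get_execution_role()
bucket = "sagemaker-processing-examples"   # bucket used throughout this post
input_prefix = "raw-data"                  # hypothetical prefix holding the raw Abalone CSV
output_prefix = "spark-scala-datasets"     # prefix queried in step 3d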
3b.) Next, initialize the appropriate class instance (i.e. SparkJarProcessor) with any additional parameters.
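For example (framework version and instance settings are assumptions):

spark_jar_processor = SparkJarProcessor(
    framework_version="3.0",          # pre-built Spark image version (assumed)
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=2,
    base_job_name="spark-scala-processing",
)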
3c.) Now, execute the job with the appropriate argument(s). For example, this job reads raw data stored in S3, splits the data into train and test sets, performs feature engineering, and writes to S3 storage. Please note the Spark Scala application JAR file was compiled via the SBT build tool.
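A sketch of the run() call; the JAR path, main class name, and argument names are hypothetical placeholders for the SBT build output:

spark_jar_processor.run(
    submit_app="target/scala-2.12/spark-scala-processing_2.12-1.0.jar",  # hypothetical sbt package output
    submit_class="com.example.SparkScalaProcessing",                     # hypothetical main class
    arguments=[
        "--s3_input_path", f"s3://{bucket}/{input_prefix}/abalone.csv",
        "--s3_output_path", f"s3://{bucket}/{output_prefix}",
    ],
)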
3d.) Confirm and view the S3 output results (features + label dataframes) via the AWS CLI and an S3 Select query.
aws s3 ls --recursive s3://sagemaker-processing-examples/spark-scala-datasets/

SELECT * FROM s3object s LIMIT 1

3e.) For reference, here is the complete script (spark-scala-processing.scala) I developed that is being called by SageMaker in this example.
Conclusion
This blog covers the essentials of getting started with SageMaker Processing via SKLearn and Spark. SageMaker Processing also supports customized Spark tuning and configuration settings (e.g. spark.executor.cores, spark.executor.memory, etc.) via the configuration parameter, which accepts a list[dict] or dict passed to the run() command. Here is the official Amazon SageMaker Processing Class Documentation.
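For example, here is a sketch of passing Spark properties through the configuration parameter, reusing the PySparkProcessor from Example 2 (the property values are arbitrary examples):

configuration = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.cores": "2",
            "spark.executor.memory": "4g",
        },
    }
]

spark_processor.run(
    submit_app="spark-python-processing.py",
    configuration=configuration,  # list[dict] (or a single dict) of Spark settings
    arguments=[
        "--s3_input_path", f"s3://{bucket}/{input_prefix}/abalone.csv",
        "--s3_output_path", f"s3://{bucket}/{output_prefix}",
    ],
)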
At re:Invent 2020, AWS announced several new SageMaker data preparation features, including Data Wrangler, Feature Store, and Clarify. Here is the official Amazon SageMaker Features Documentation. Thank you for reading this blog.