Cloud processing is now simpler and cheaper!
A *very simple* and *cheap* way to run/distribute your *existing* processing/training code on the cloud
It happened to me, and I’m sure it’s happening to you and to many other data scientists who work on their small/medium size projects out there:
You’ve invested a lot in your own training pipeline (pre-processing -> training -> testing), tried it locally a few times using different parameters, and it seems to be great. But then you realize you need much more RAM, CPU, GPU, or GPU memory (or all of them together) to get the most out of it.
It can happen for many reasons —
- The training takes too much time with your local setup
- You need the batch size to be larger, and it can’t fit in your local GPU memory
- You’d like to tune the hyperparameters, so many training runs are required
- You’d like to move some of the preprocessing steps to be done during training, e.g. to save disk space / loading time, and the CPU / RAM can’t make it
- …
So, theoretically, you have everything you need, and you just need to run it on better hardware… That should be a non-issue today, shouldn’t it?
Existing solutions
Well, there are indeed many solutions out there. Here are a few related technologies / platforms / solutions:
General
- Apache Airflow —”a platform … to programmatically author, schedule and monitor workflows”
- Ray — “fast and simple distributed computing”
Cloud providers’ AI solutions
- Kubeflow — “the machine learning toolkit for kubernetes” (pipelines)
- GCP AI Platform — ”one platform to build, deploy, and manage machine learning models” (training, pipelines, distributed PyTorch, Distributed TensorFlow)
- Azure Machine Learning — “enterprise-grade machine learning service to build and deploy models faster” (training)
- AWS Sagemaker — “Machine learning for every developer and data scientist” (training, Distributed PyTorch, Distributed TensorFlow)
Distributed training frameworks
- RaySGD — “a lightweight library for distributed deep learning”, built on top of Ray
- Horovod — “distributed deep learning training framework”
TensorFlow
- TensorFlow distributed training documentation
- TensorFlow’s own tutorial on distributed GCP training
- TensorFlow Training (TFJob) for Kubernetes (part of Kubeflow)
PyTorch
- PyTorch distributed training documentation
- PyTorch’s own tutorial on AWS distributed training
- TorchElastic Controller for Kubernetes
- Running a distributed PyTorch training on CPUs or GPU (Using Kubeflow pipeline)
…
Comparing the pros and cons of all existing solutions deserves its own post (or even a series of them), and I’m sure I forgot to mention many others :)
But I couldn’t find one that allows you to just run your existing code, with very little or no additional coding, cheaply, and without a lot of prior platform-specific knowledge. I also just wanted to try having my own open source project, including documentation and a fully automated testing and publishing pipeline :)
These are the reasons I came up with simple-sagemaker and this post.
Simple-Sagemaker to the rescue
Simple-sagemaker allows you to take your existing code, as is, and run it on the cloud with little or no code changes.
The remainder of this post shows how to use the library for general purpose processing. Follow-up posts will demonstrate how to use it for more advanced cases, such as distributed PyTorch training.
More comprehensive documentation can be found on the GitHub project, along with a few more examples, including the source code for all the examples in this post.
Requirements
- Python 3.6+
- An AWS account + region and credentials configured for boto3, as explained in the Boto3 docs
Installation
pip install simple-sagemaker
Running a shell command
Now, to get the shell command cat /proc/cpuinfo && nvidia-smi to run on a single ml.p3.2xlarge spot instance, just run the following ssm command (documentation of the ssm CLI is given below):
ssm shell -p ssm-ex -t ex1 -o ./out1 --it ml.p3.2xlarge --cmd_line "cat /proc/cpuinfo && nvidia-smi"
Once the job is completed (a few minutes), the output logs get downloaded to ./out1:
As you may be able to guess, with this single command line, you got:
- A pre-built image to use for running the code is chosen (Sagemaker’s PyTorch framework image is the default).
- An IAM role (with the default name SageMakerIAMRole_ssm-ex) with the AmazonSageMakerFullAccess policy is automatically created for running the task.
- A ml.p3.2xlarge spot instance is launched for you; you pay just for the time you use it! Note: this is a bit more than the execution time, as it includes the time to initialize (e.g. download the image, code, and input) and tear down (save the outputs) etc.
- The shell command gets executed on the instance.
- The shell command’s exit code is 0, so it’s considered to have completed successfully.
- The output logs get saved to CloudWatch, then downloaded from CloudWatch to the ./out1 folder.
Pretty cool #1, isn’t it?
Distributing Python code
Similarly, to run the following ssm_ex2.py on two ml.p3.2xlarge spot instances:
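The original script isn’t reproduced here; as a minimal, hypothetical sketch (not the actual ssm_ex2.py — see the GitHub project for the real source), such a worker might simply report where it’s running, assuming SageMaker’s standard SM_* environment variables:

```python
# ssm_ex2.py (hypothetical sketch) - each instance prints its identity
# and the SageMaker-style environment it sees.
import os
import socket


def describe_instance() -> str:
    """Summarize which host we're on and where the data/output folders are."""
    host = socket.gethostname()
    # SM_CHANNEL_* / SM_OUTPUT_DATA_DIR are standard SageMaker variables;
    # they are unset when running locally, hence the fallbacks.
    data_dir = os.environ.get("SM_CHANNEL_DATA", "<no data channel>")
    out_dir = os.environ.get("SM_OUTPUT_DATA_DIR", "<no output dir>")
    return f"host={host} data={data_dir} output={out_dir}"


if __name__ == "__main__":
    print(describe_instance())
```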
Just run the below ssm command:
ssm run -p ssm-ex -t ex2 -e ssm_ex2.py -o ./out2 --it ml.p3.2xlarge --ic 2
The output is saved to ./out2:
As you may have guessed, here you also get the following:
- The local Python script is copied to a dedicated path ([Bucket name]/[Project name]/[Task name]/[Job Name]/source/sourcedir.tar.gz) on S3 (more details here).
- Two spot instances get launched with the code from that S3 bucket.
Pretty cool #2, isn’t it?
High level flow
Here’s a high-level chart of the flow:
A fully* featured advanced example
And now for an advanced and fully (well, almost :) ) featured version, which is still simple to implement. Note: as we’re customizing the image, the docker engine is needed.
The example is composed of two parts, each of which demonstrates a few features. In addition, the two parts are “chained”: part of the output of the first one is an input to the second one.
In order to exemplify most of the features, the following directory structure is used:
.
|-- code
| |-- internal_dependency
| | `-- lib2.py
| |-- requirements.txt
| `-- ssm_ex3_worker.py
|-- data
| |-- sample_data1.txt
| `-- sample_data2.txt
`-- external_dependency
`-- lib1.py
1. code — the source code folder
- internal_dependency — a dependency that is part of the source code folder
- requirements.txt — a pip requirements file listing the packages to be installed before running the worker; here it contains a single line: transformers==3.0.2
2. data — input data files
3. external_dependency — an additional code dependency
We’re going to use two tasks.
First task
This task gets two input channels:
- Data channel — a local path, ./data, that is distributed among the two instances (due to ShardedByS3Key)
- Persons channel — a public path on S3
The following is demonstrated:
- Naming the project (-p) and the task (-t).
- Using a local data folder as input, distributed among the instances (-i, ShardedByS3Key). That folder is first synchronized into a directory on S3 dedicated to that specific task (according to the project/task names), and then fed to the worker. If you run the same task again, there’s no need to upload the entire data set, just to sync it again.
- Using a public S3 bucket as an additional input (--is). A policy allowing that access is automatically added to the IAM role.
- Building a custom docker image (--df, --repo_name, --aws_repo_name), in order to make the pandas and sklearn libraries available to the worker. The base image is taken automatically (the PyTorch framework image is the default), the image is built locally, then uploaded to ECR to be used by the running instances.
- The task_type hyperparameter. Any extra parameter is considered a hyperparameter. Anything after -- (followed by a space) gets passed as-is to the executed script’s command line.
- Launching two instances (--ic).
- --force_running — to make sure we run the task again (as part of this example).
- Using an on-demand instance (--no_spot).
- Usage of requirements.txt — as it’s part of the source code folder (-e), it gets installed automatically before running the worker.
- The internal_dependency folder is copied as part of the source code folder.
Pretty cool #3, isn’t it?
The worker code:
The worker can access its configuration by using the WorkerConfig object, or the environment variables. For example:
- worker_config.channel_data — the input data
- worker_config.channel_persons — the data from the public S3 bucket
- worker_config.instance_state — the instance state, maintained between executions of the same task
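As a sketch of the environment-variable route (the helper below is hypothetical, assuming SageMaker’s standard SM_CHANNEL_* / SM_MODEL_DIR naming), a worker could “process” the files of an input channel into the model folder along these lines:

```python
# Hypothetical sketch: read an input channel's location from SageMaker's
# standard environment variables and map its files into the model folder.
import os
from pathlib import Path


def process_channel(channel_env: str, model_env: str = "SM_MODEL_DIR") -> list:
    """List the files of an input channel and return the paths they
    would be written to under the model output directory."""
    # Fallbacks let the sketch run locally, where SM_* variables are unset.
    data_dir = Path(os.environ.get(channel_env, "."))
    model_dir = Path(os.environ.get(model_env, "./model"))
    return [model_dir / p.name for p in sorted(data_dir.glob("*")) if p.is_file()]
```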
In this case, the worker “processes” the files from the input channel data into the model output folder, and writes an additional file to the output data folder.
The complete configuration documentation can be found here.
Second task
The second task gets two input channels as well:
- ex3_1_model channel — the model output from the first task
- ex3_1_state channel — the state of the first task
The following additional features are demonstrated:
- Chaining — using outputs from part 1 (--iit) as input for this part. Both the model output and the state are taken.
- Using additional local code dependencies (-d).
- Using the TensorFlow framework as the pre-built image (-f).
- Tagging the jobs (--tag).
- Defining a Sagemaker metric (--md).
Pretty cool #4, isn’t it?
The code can access its input data channels using worker_config.ex3_1_model and worker_config.ex3_1_state.
In addition, the score logs get captured by the "Score=(.*?);" regular expression in the ssm command above, and the metric graphs can then be viewed on the AWS console:
The complete code
We can put the two workers in a single file, using the task_type hyperparameter to distinguish between the two types of execution:
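A minimal sketch of that dispatch (the worker function bodies are hypothetical placeholders; only the task_type hyperparameter name comes from the example):

```python
# A single entry point that routes to one of the two workers based on
# the task_type hyperparameter, passed to the script as a CLI argument.
import argparse


def worker1() -> str:
    return "ran first task"   # placeholder for the first task's logic


def worker2() -> str:
    return "ran second task"  # placeholder for the second task's logic


def run(task_type: int) -> str:
    """Dispatch on the task_type hyperparameter."""
    workers = {1: worker1, 2: worker2}
    return workers[task_type]()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # hyperparameters arrive on the worker's command line
    parser.add_argument("--task_type", type=int, required=True)
    args = parser.parse_args()
    print(run(args.task_type))
```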
Conclusions
I hope I managed to convey the simplicity message, and to convince you to try simple-sagemaker next time you need stronger hardware for your processing script. The examples above, along with the “pretty cool” points that summarize what you get, speak for themselves :).
Let me know if you liked it by applauding below / starring the GitHub project.