Cloud processing is now simpler and cheaper!
A *very simple* and *cheap* way to run/distribute your *existing* processing/training code on the cloud
It happened to me, and I’m sure it’s happening to you and to many other data scientists who work on their small/medium size projects out there:
You’ve invested a lot in your own training pipeline (pre-processing -> training -> testing), tried it locally a few times using different parameters, and it seems to be great. But then you realize you need much more RAM, CPU, GPU, or GPU memory (or all of them together) to get the most out of it.
It can happen for many reasons —
- The training takes too much time with your local setup
- You need the batch size to be larger, and it can’t fit in your local GPU memory
- You’d like to tune the hyperparameters, so many training runs are required
- You’d like to move some of the preprocessing steps to be done during training, e.g. to save disk space / loading time, and the CPU / RAM can’t make it
- …
So, theoretically, you have everything you need, and you just need to run it on better hardware… That should be a non-issue today, shouldn’t it?
Existing solutions
Well, there are indeed many solutions out there. Here are a few related technologies / platforms / solutions:
General
- Apache Airflow —”a platform … to programmatically author, schedule and monitor workflows”
- Ray — “fast and simple distributed computing”
Cloud providers’ AI solutions
- Kubeflow — “the machine learning toolkit for kubernetes” (pipelines)
- GCP AI Platform — ”one platform to build, deploy, and manage machine learning models” (training, pipelines, distributed PyTorch, Distributed TensorFlow)
- Azure Machine Learning — “enterprise-grade machine learning service to build and deploy models faster” (training)
- AWS Sagemaker — “Machine learning for every developer and data scientist” (training, Distributed PyTorch, Distributed TensorFlow)
Distributed training frameworks
- RaySGD — “a lightweight library for distributed deep learning”, built on top of Ray
- Horovod — “distributed deep learning training framework”
TensorFlow
- TensorFlow distributed training documentation
- TensorFlow’s own tutorial on distributed GCP training
- TensorFlow Training (TFJob) for Kubernetes (part of Kubeflow)
PyTorch
- PyTorch distributed training documentation
- PyTorch’s own tutorial on AWS distributed training
- TorchElastic Controller for Kubernetes
- Running a distributed PyTorch training on CPUs or GPU (Using Kubeflow pipeline)
…
Comparing the pros and cons of all existing solutions deserves its own post (or even a series of them), and I’m sure I forgot to mention many others :)
But I couldn’t find one that allows you to just run your existing code, with very little or no additional coding, cheaply, and without a lot of prior platform-specific knowledge. I also just wanted to try having my own open source project, including documentation and a fully automated testing and publishing pipeline :)
These are the reasons I came up with simple-sagemaker and this post.
Simple-Sagemaker to the rescue
Simple-sagemaker allows you to take your existing code, as is, and run it on the cloud with little or no code changes.
The remainder of this post shows how to use the library for general purpose processing. Follow-up posts will demonstrate how to use it for more advanced cases, such as distributed PyTorch training.
More comprehensive documentation can be found on the GitHub project, along with a few more examples, including the source code for all the examples in this post.
Requirements
- Python 3.6+
- An AWS account + region and credentials configured for boto3, as explained in the Boto3 docs
Installation
pip install simple-sagemaker
Running a shell command
Now, to get the shell command cat /proc/cpuinfo && nvidia-smi to run on a single ml.p3.2xlarge spot instance, just run the following ssm command (documentation of the ssm CLI is given below):
ssm shell -p ssm-ex -t ex1 -o ./out1 --it ml.p3.2xlarge --cmd_line "cat /proc/cpuinfo && nvidia-smi"
Once the job is completed (a few minutes), the output logs get downloaded to ./out1:
As you may be able to guess, with this single command line, you got:
- A pre-built image to use for running the code is chosen (Sagemaker’s PyTorch framework image is the default).
- An IAM role (with the default name SageMakerIAMRole_ssm-ex) with the AmazonSageMakerFullAccess policy is automatically created for running the task.
- A ml.p3.2xlarge spot instance is launched for you; you pay just for the time you use it! Note: this is a bit more than the execution time, as it includes the time to initialize (e.g. download the image, code, and input) and tear down (save the outputs) etc.
- The shell command gets executed on the instance.
- The shell command’s exit code is 0, so it’s considered to have completed successfully.
- The output logs get saved to CloudWatch, then downloaded from CloudWatch to the ./out1 folder.
Pretty cool #1, isn’t it?
Distributing Python code
Similarly, to run the following ssm_ex2.py on two ml.p3.2xlarge spot instances:
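The original script isn’t reproduced here; as a minimal, hypothetical sketch (not the actual ssm_ex2.py — see the GitHub project for the real source), such a worker might simply report where it’s running, assuming SageMaker’s standard SM_* environment variables:

```python
# ssm_ex2.py (hypothetical sketch) - each instance prints its identity
# and the SageMaker-style environment it sees.
import os
import socket


def describe_instance() -> str:
    """Summarize which host we're on and where the data/output folders are."""
    host = socket.gethostname()
    # SM_CHANNEL_* / SM_OUTPUT_DATA_DIR are standard SageMaker variables;
    # they are unset when running locally, hence the fallbacks.
    data_dir = os.environ.get("SM_CHANNEL_DATA", "<no data channel>")
    out_dir = os.environ.get("SM_OUTPUT_DATA_DIR", "<no output dir>")
    return f"host={host} data={data_dir} output={out_dir}"


if __name__ == "__main__":
    print(describe_instance())
```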
Just run the below ssm command:
ssm run -p ssm-ex -t ex2 -e ssm_ex2.py -o ./out2 --it ml.p3.2xlarge --ic 2
The output is saved to ./out2:
As you may have guessed, here you also get the following:
- The local Python script is copied to a dedicated path ([Bucket name]/[Project name]/[Task name]/[Job Name]/source/sourcedir.tar.gz) on S3 (more details here).
- Two spot instances get launched with the code from that S3 bucket.
Pretty cool #2, isn’t it?
High level flow
Here’s a high-level chart of the flow:
A fully* featured advanced example
And now for an advanced and fully (well, almost :) ) featured version, which is still simple to implement. Note: as we’re customizing the image, the docker engine is needed.
The example is composed of two parts, each of which demonstrates a few features. In addition, the two parts are “chained”: part of the output of the first one is an input to the second one.
In order to exemplify most of the features, the following directory structure is used:
.
|-- code
| |-- internal_dependency
| | `-- lib2.py
| |-- requirements.txt
| `-- ssm_ex3_worker.py
|-- data
| |-- sample_data1.txt
| `-- sample_data2.txt
`-- external_dependency
`-- lib1.py
1. code — the source code folder
- internal_dependency — a dependency that is part of the source code folder
- requirements.txt — a pip requirements file listing the packages to be installed before running the worker; here it contains a single line: transformers==3.0.2
2. data — input data files
3. external_dependency — an additional code dependency
We’re going to use two tasks.
First task
This task gets two input channels:
- Data channel — a local path, ./data, that is distributed among the two instances (due to ShardedByS3Key)
- Persons channel — a public path on S3
The following is demonstrated:
- Naming the project (-p) and the task (-t).
- Using a local data folder as input, distributed among the instances (-i, ShardedByS3Key). That folder is first synchronized into a directory on S3 dedicated to that specific task (according to the project/task names), and then fed to the worker. If you run the same task again, there’s no need to upload the entire data set, just to sync it again.
- Using a public S3 bucket as an additional input (--is). A policy allowing that access is automatically added to the IAM role.
- Building a custom docker image (--df, --repo_name, --aws_repo_name), in order to make the pandas and sklearn libraries available to the worker. The base image is taken automatically (the PyTorch framework image is the default), the image is built locally, then uploaded to ECR to be used by the running instances.
- The task_type hyperparameter. Any extra parameter is considered a hyperparameter. Anything after -- (followed by a space) gets passed as-is to the executed script’s command line.
- Launching two instances (--ic).
- --force_running — to make sure we run the task again (as part of this example).
- Using an on-demand instance (--no_spot).
- Usage of requirements.txt — as it’s part of the source code folder (-e), it gets installed automatically before running the worker.
- The internal_dependency folder is copied as part of the source code folder.
Pretty cool #3, isn’t it?
The worker code:
The worker can access its configuration by using the WorkerConfig object, or the environment variables. For example:
- worker_config.channel_data — the input data
- worker_config.channel_persons — the data from the public S3 bucket
- worker_config.instance_state — the instance state, maintained between executions of the same task
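As a sketch of the environment-variable route (the helper below is hypothetical, assuming SageMaker’s standard SM_CHANNEL_* / SM_MODEL_DIR naming), a worker could “process” the files of an input channel into the model folder along these lines:

```python
# Hypothetical sketch: read an input channel's location from SageMaker's
# standard environment variables and map its files into the model folder.
import os
from pathlib import Path


def process_channel(channel_env: str, model_env: str = "SM_MODEL_DIR") -> list:
    """List the files of an input channel and return the paths they
    would be written to under the model output directory."""
    # Fallbacks let the sketch run locally, where SM_* variables are unset.
    data_dir = Path(os.environ.get(channel_env, "."))
    model_dir = Path(os.environ.get(model_env, "./model"))
    return [model_dir / p.name for p in sorted(data_dir.glob("*")) if p.is_file()]
```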
In this case, the worker “processes” the files from the input channel data into the model output folder, and writes an additional file to the output data folder.
The complete configuration documentation can be found here.
Second task
The second task gets two input channels as well:
- ex3_1_model channel — the model output from the first task
- ex3_1_state channel — the state of the first task
The following additional features are demonstrated:
- Chaining — using outputs from part 1 (--iit) as input for this part. Both the model output and the state are taken.
- Using additional local code dependencies (-d).
- Using the TensorFlow framework as the pre-built image (-f).
- Tagging the jobs (--tag).
- Defining a Sagemaker metric (--md).
Pretty cool #4, isn’t it?
The code can access its input data channels using worker_config.ex3_1_model and worker_config.ex3_1_state.
In addition, the score logs get captured by the "Score=(.*?);" regular expression in the ssm command above, and the metric graphs can then be viewed on the AWS console:
The complete code
We can put the two workers in a single file, using the task_type hyperparameter to distinguish between the two types of execution:
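A minimal sketch of that dispatch (the worker function bodies are hypothetical placeholders; only the task_type hyperparameter name comes from the example):

```python
# A single entry point that routes to one of the two workers based on
# the task_type hyperparameter, passed to the script as a CLI argument.
import argparse


def worker1() -> str:
    return "ran first task"   # placeholder for the first task's logic


def worker2() -> str:
    return "ran second task"  # placeholder for the second task's logic


def run(task_type: int) -> str:
    """Dispatch on the task_type hyperparameter."""
    workers = {1: worker1, 2: worker2}
    return workers[task_type]()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # hyperparameters arrive on the worker's command line
    parser.add_argument("--task_type", type=int, required=True)
    args = parser.parse_args()
    print(run(args.task_type))
```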
Conclusions
I hope I managed to convey the simplicity message, and to convince you to try simple-sagemaker next time you need stronger hardware for your processing script. The examples above, along with the “pretty cool” points that summarize what you get, speak for themselves :).
Let me know if you liked it by applauding below / starring the GitHub project.