Making Sense of Big Data
Bring your own code to utilize the power of cloud computing in a few steps!

Training a model is tedious, especially when the training scripts consume all of your computer's power and you can do nothing but wait. It happens to me all the time, whether I am starting the model-building process or finalizing the model with parameter tuning. Luckily, the big cloud vendors provide solutions to train and deploy your model in the cloud without having to use your local capacity. AWS, Azure, and GCP all have similar offerings, and I am using AWS SageMaker here to show a way of using your own Docker container to train a model in the cloud.
Cloud training is usually a nightmare for me when it comes to configuration. Different cloud vendors have different structures for storage, instances, and APIs, which means we have to read through manuals and dev guides to make things work. I felt the same way when I started to use SageMaker. But instead of staring at the console and trying to find a solution in the UI, I found the SageMaker SDK pretty powerful. The typical, or at least advertised, usage of SageMaker is through its pre-built algorithms. But unless you are just shopping for a baseline model, you will have to use your own model code. Of course, you can study the manual and learn how to tune or modify their algorithm APIs, yet I believe there are more efficient ways.
So I created this beginner guide to showcase a way of utilizing a SageMaker training instance while retaining the ability to use Docker to train your own code. I hope the solution will help those who need it!
Prerequisite
This exercise needs some prerequisite setup. If you have used the SageMaker SDK previously, you may skip this part. Otherwise, please follow the setup carefully.
- AWS CLI installation & setup
Check this link and follow the instructions to download and install the AWS CLI for your operating system. The example in this tutorial uses macOS.
If you don’t have an account with AWS, you can sign up for one for free (note: it will require your credit card info, so be sure to read the free-tier offer in case any charges are incurred).
Log in to the console and navigate to IAM. Create a user and attach the AmazonSageMakerFullAccess policy. Also, create an access key under Security credentials and download the credentials .csv file.

Once the AWS CLI is installed, you can configure it with the credentials .csv file you just downloaded. In the terminal, type the following:
aws configure
If you are not using MFA with your user, simply fill in the info from the credentials file and you are done. Otherwise, please also add your MFA profile to the config file.
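The prompts look like this (the values shown are placeholders; use the keys from your credentials file):

```
$ aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: <your-secret-access-key>
Default region name [None]: us-east-1
Default output format [None]: json
```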

Check your configuration. If your CLI is set up correctly, you should see the buckets under your account listed.
aws s3 ls
- Pipenv Setup
This step is optional since you may set up your environment another way. However, I will briefly walk through how to set up a Pipenv environment so you can easily follow along.
- Clone my repo here.
- Open up the project in your preferred IDE. I am using VSCode.
- Install pipenv with your pip.
pip install pipenv
- Initialize the pipenv environment with a Python interpreter. I am using version 3.7:
pipenv --python 3.7
- Install the dependencies:
pipenv install
If everything is set up correctly, it should look something like this.

Model Building
This example will use the House Prices: Advanced Regression Techniques dataset, which contains 79 variables for predicting the final price of each home. After comparing a bunch of models, the XGBoost regressor stands out, and our job is to fine-tune the regressor in the training script. RMSE will be the metric for model selection.
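For context, the core of the training script is roughly along these lines. This is a minimal sketch rather than the exact code in the repo; the preprocessing and hyperparameters are simplified placeholders:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Load the Kaggle training data (path is a placeholder)
df = pd.read_csv("train.csv")
X = pd.get_dummies(df.drop(columns=["SalePrice"]))  # naive one-hot encoding
y = df["SalePrice"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameters are illustrative starting points for tuning
model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
model.fit(X_train, y_train)

# RMSE on a hold-out split is the model-selection metric
rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
print(f"Validation RMSE: {rmse:.2f}")
```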
Wrap-up in Docker
If you are not familiar with Docker, check this guide by AWS. What we are going to do here is build our own Docker image and push it to AWS ECR. Elastic Container Registry is a managed service by AWS for storing and managing Docker images. To use it with the SageMaker SDK, we push the Docker image to ECR so that our training instance can pull it from there.
- Dockerfile

The above diagram illustrates how we set up the Dockerfile. It is recommended to test your Docker image locally before pushing it to ECR.
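In short, the Dockerfile installs the dependencies, copies the training script into the image, and defines the entry point that SageMaker invokes when the container starts. A minimal sketch, assuming a single train.py entry script (the base image and dependency list are illustrative, not the exact file in the repo):

```dockerfile
FROM python:3.7-slim

# Install the libraries the training script needs
RUN pip install --no-cache-dir pandas scikit-learn xgboost

# Copy the training script into the image
COPY train.py /opt/program/train.py
WORKDIR /opt/program

# SageMaker starts the container with the "train" argument,
# so the entry point just launches the training script
ENTRYPOINT ["python", "train.py"]
```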
- Local Dockerized Training
This example uses the SageMaker training API to perform local training with the aforementioned Docker image.
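A minimal sketch of that step, assuming SageMaker Python SDK v2 naming (the image tag, role ARN, and data path are placeholders):

```python
from sagemaker.estimator import Estimator

# instance_type="local" runs the container on your own machine via Docker,
# which is a cheap way to verify the image before training in the cloud
estimator = Estimator(
    image_uri="sagemaker-xgboost-example:latest",          # local Docker image tag (placeholder)
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder IAM role
    instance_count=1,
    instance_type="local",
)

# "file://" channels mount local data instead of pulling from S3
estimator.fit({"train": "file://data/train.csv"})
```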
- Push Docker to ECR
AWS ECR provides a console for setting up a Docker repository within a few clicks. Check this link if you want to use the console. Alternatively, you can set up the repository and push the image with the AWS CLI.
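The CLI route is roughly: create the repository, authenticate your Docker client against ECR, tag the image, and push it. The account ID, region, and repository name below are placeholders:

```bash
# Create the repository (one time)
aws ecr create-repository --repository-name sagemaker-xgboost-example

# Authenticate the local Docker client against ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Tag the local image with the repository URI and push it
docker tag sagemaker-xgboost-example:latest \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost-example:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost-example:latest
```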
Now our image is set and ready to use.
Training in SageMaker
Once all the preceding steps are set up properly, the workflow to kick off training in SageMaker is relatively simple. First, we need to store the data in a specified S3 bucket. You can create one in the AWS console and upload the data, or use the SageMaker SDK. Then we utilize sagemaker.estimator to kick off the training.
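Something along these lines works, again assuming SDK v2 naming; the image URI, role ARN, and bucket names are placeholders:

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

# Upload the training data to S3 (bucket and prefix are placeholders)
train_s3 = session.upload_data(path="data/train.csv", key_prefix="house-prices/input")

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost-example:latest",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.c4.xlarge",
    output_path="s3://your-bucket/house-prices/output",  # optional, see the note below
    sagemaker_session=session,
)

# Each key becomes a channel mounted at /opt/ml/input/data/<channel> in the container
estimator.fit({"train": train_s3})
```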
The above script will kick off a training job with the specified configuration. Check the free-tier instance types to avoid charges; I am using "ml.c4.xlarge", which should be covered by the free tier if your account was opened within the last two months. After training is complete, you can view your training job in the SageMaker console:

You can also find your S3 model artifact on the same page under Output. Generally, the estimator will create a new bucket to store the artifacts unless you specify one in output_path. Check the link here for clarity.
One thing to note here: if you would like your model or other output to be exported from the Docker container, simply save the file to /opt/ml/model, and SageMaker will package your desired output into a model.tar.gz and upload it to the output bucket. You can then download and extract the file either through the S3 console or the Python SDK.
For a complete example, check my repo here. Peace! 🤗