
How to Create Reusable R Containers for SageMaker Jobs

A guide to creating reusable containers on SageMaker for R developers

Image sourced from unsplash.com

SageMaker is great at giving you full flexibility to use its services with your own runtime and language of choice. If none of the available runtimes or languages fit your code, you first need to overcome the initial hurdle of creating a compatible Docker container that SageMaker can use.

In this blog, we take a deep dive into how to create such R containers for use in SageMaker, and we try to understand in more depth how SageMaker works. This gives us better clarity on some of the decisions we will make during the container build phase. For an end-to-end example of an ML pipeline that utilises these R containers, check out this GitHub example.

Docker containers in a nutshell

There is a slight chance you’ve landed on this article with no idea what a Docker container is. I will not attempt to explain what Docker or containers are, since there are already about a million articles out there that do it better than I ever could.

In a nutshell, a container is a standard unit of software that packages up code and all its dependencies in a single "object" that can be executed safely and reliably across different systems.

For this blog, you need to be broadly familiar with a few concepts, namely what a Dockerfile, an image, a container registry, and a container are. If you are curious about containers and want to learn more, you can start here.

Why Containers + SageMaker?

SageMaker is built in a modular way that allows us to use our own containers with its services. This gives us the flexibility to use the libraries, programming languages and/or runtimes of our choice while still leveraging the full benefits of its services.

R Container for SageMaker Processing

Creating an R container for Processing jobs is probably the simplest of all the containers we may need on SageMaker. The Dockerfile can be as follows:
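
The original Dockerfile is in the linked repository; here is a minimal sketch, where the base image and the package list are illustrative assumptions:

FROM r-base:4.2.0

# Install the R packages our processing scripts depend on (illustrative list)
RUN R -e "install.packages(c('dplyr', 'readr'), repos='https://cloud.r-project.org')"

# Rscript as the entry point lets SageMaker Processing execute
# whatever R script is passed to the job
ENTRYPOINT ["Rscript"]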

Once the container is built and registered with Amazon ECR (Elastic Container Registry), we can run a processing job. This works just like any other processing job; we simply pass the URI of the newly created image to the job via the image_uri parameter. An example of such a processing job run (as part of a pipeline, no less) can be found at line 33 of pipeline.R in the example shared above. When the processing job runs, SageMaker launches the container with the following command:

docker run [AppSpecification.ImageUri]

Therefore, the entry point command is executed, and the script passed into the code argument of the ScriptProcessor is run. In this case, our entry point is the command Rscript, so this container can be reused for all processing jobs that need to execute some arbitrary R code, assuming, of course, that the necessary package dependencies are available in the image.
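
For illustration, a hedged sketch of launching such a job from R via reticulate and the SageMaker Python SDK (the role, image URI, and script name are hypothetical placeholders):

library(reticulate)
processing <- import("sagemaker.processing")

# Hypothetical placeholders -- replace with your own values
role      <- "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
image_uri <- "123456789012.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-r-processing:latest"

# The container's entry point is Rscript, so we pass it as the command
processor <- processing$ScriptProcessor(
  role           = role,
  image_uri      = image_uri,
  command        = list("Rscript"),
  instance_count = 1L,
  instance_type  = "ml.m5.large"
)

# The script passed via 'code' is executed inside the container
processor$run(code = "preprocessing.R")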

Further customisations are possible, and if you are interested in diving deeper into how SageMaker containers work for Processing jobs specifically, feel free to read the relevant documentation page.

R Container for SageMaker Training and Deployment

Creating an R container for Training jobs, which can also be reused for Deploying a model, involves a couple more steps than the simple, straightforward example above.

A template Dockerfile can be as follows:
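
Again, the base image and package list below are assumptions for illustration; the original Dockerfile is in the linked repository:

FROM r-base:4.2.0

# Packages our model code needs at training and inference time (illustrative)
RUN R -e "install.packages(c('jsonlite'), repos='https://cloud.r-project.org')"

# Copy the two entry point files described below
COPY run.sh /opt/ml/run.sh
COPY entrypoint.R /opt/ml/entrypoint.R
RUN chmod +x /opt/ml/run.sh

# SageMaker invokes the container as 'docker run image train' or
# 'docker run image serve'; run.sh forwards that argument to entrypoint.R
ENTRYPOINT ["/opt/ml/run.sh"]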

You will notice that once we install the necessary packages that our model/code requires, we also copy the run.sh and entrypoint.R files into the image. Let’s see what these files are and why they are needed.

#!/bin/bash
# Forward SageMaker's runtime argument ("train" or "serve") to the R entry point
echo "ready to execute"
Rscript /opt/ml/entrypoint.R $1

The run.sh script is a very simple one: all it does is run the entrypoint.R script, passing along the command line argument found in $1. We do this because SageMaker runs the Docker container for training and serving with the commands:

docker run image train

or

docker run image serve

depending on whether we called the training or the deployment methods. Based on the argument $1, which is either "train" or "serve", we want to differentiate the next step. The bash script is required here to pass this argument down to the Rscript execution, as there is no straightforward way to read the docker run arguments from within R code. If you know of a better/simpler way of doing this, please let me know in the comments!

Let’s now look into the entrypoint.R script:
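
The original script is in the linked repository; the following is a minimal sketch reconstructed from the description below. It assumes the AWS CLI is available in the image for the S3 download, and that the training entry script is named train.R; both are illustrative choices, not SageMaker requirements:

library(jsonlite)

# run.sh forwards SageMaker's runtime argument: "train" or "serve"
args <- commandArgs(trailingOnly = TRUE)

if (args[1] == "train") {
  # hyperparameters.json holds the S3 location of our zipped source code
  hyperparams <- fromJSON("/opt/ml/input/config/hyperparameters.json")
  submit_dir  <- gsub('"', "", hyperparams$sagemaker_submit_directory)

  # Download and extract sourcedir.tar.gz into /opt/ml/code
  dir.create("/opt/ml/code", recursive = TRUE, showWarnings = FALSE)
  system2("aws", c("s3", "cp", submit_dir, "/opt/ml/code/sourcedir.tar.gz"))
  untar("/opt/ml/code/sourcedir.tar.gz", exdir = "/opt/ml/code")

  # Run the training entry script (the name is our own convention)
  source("/opt/ml/code/train.R")
} else if (args[1] == "serve") {
  # SageMaker has already unzipped model.tar.gz under /opt/ml/model,
  # which includes the inference code we bundled during training
  source("/opt/ml/model/code/deploy.R")
}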

This is where things get far more SageMaker-specific, so let’s unpack it! SageMaker has a very well-defined file structure where it saves files and expects to find them, under /opt/ml/. Specifically, what we utilise here is the following:

/opt/ml/
    - input/config/hyperparameters.json
    - code/
    - model/
        - <model artifacts>
        - code/

hyperparameters.json file: When a training estimator is created, we will want to pass in some custom code that defines and trains our model. When this code is passed, SageMaker zips the files (it could be a whole directory of files you need for training) into a single file called "sourcedir.tar.gz" and uploads it to an S3 location. Once we start a training job, SageMaker creates the file hyperparameters.json under /opt/ml/input/config/, which contains any hyperparameters we passed, but also the key "sagemaker_submit_directory" whose value is the S3 location where the "sourcedir.tar.gz" file was uploaded. When in training mode, we need to download and unzip our training code. This is exactly what the first branch of the if statement above does.
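
For a sense of what this looks like, a hyperparameters.json for such a job might resemble the following (the bucket and key names are made up, and note that all values arrive as strings):

{
  "sagemaker_submit_directory": "s3://my-bucket/my-training-job/source/sourcedir.tar.gz",
  "max_depth": "6"
}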

code directory: Following the convention of how SageMaker downloads and unpacks the training code in its built-in algorithms and managed framework containers, we extract the training code into the directory /opt/ml/code/. This is not a requirement, however; it is simply good practice to follow the service’s standards.

model directory: This is the directory where SageMaker automatically downloads the model artefacts and the code relevant to inference. The second branch of the if statement in the snippet above leverages this to source the deploy.R script. It is important to note that this Dockerfile and code sample assume our inference code includes a deploy.R file, which is the one run for deployment. If you follow a different convention for naming this file, feel free to rename it. In this code example, during the training process, once the model is created, its artefacts are saved under the /opt/ml/model folder. We also save the inference code in the code/ subfolder of the same directory. This way, when SageMaker zips the files to create the model.tar.gz file, the archive also includes the code necessary for deployment.
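
In code, the end of a training script following this design could look like the sketch below (the model object and file names are hypothetical):

# Everything under /opt/ml/model is packed into model.tar.gz by SageMaker
saveRDS(model, "/opt/ml/model/model.rds")

# Bundle the inference code with the model so that, at deployment,
# entrypoint.R can source /opt/ml/model/code/deploy.R
dir.create("/opt/ml/model/code", showWarnings = FALSE)
file.copy("/opt/ml/code/deploy.R", "/opt/ml/model/code/deploy.R")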

The above is an architectural/design decision taken to bundle the inference code with the model itself. It may be perfectly valid for your use case to decouple the two and keep the inference code separate from the model artefacts. This is of course possible, and it is up to you to decide which approach to follow.

Please also note that the model artefacts are saved in a single model.tar.gz file on S3; during deployment, however, SageMaker automatically downloads and unzips this file, so we don’t have to do it manually ourselves.

Pro Tip: You may want to have different containers for training and deploying, in which case the above step can be simplified to skip the run.sh script altogether.

Further customisations are possible, and if you are interested in diving deeper into how SageMaker containers work specifically for training and inference jobs, feel free to read the relevant documentation page.

Building the containers

If you are familiar with building containers, you will realise that there is nothing inherently special about the following process. All we need to do is build the containers based on the Dockerfiles provided and register the images with ECR; the SageMaker jobs will then pull the images at runtime. If you already know how to build and register an image to ECR, feel free to skip this section of the post.

For users of RStudio on SageMaker, or anyone not able or willing to have the Docker daemon running on their development environment, I suggest outsourcing the actual building of the container to another AWS service, namely AWS CodeBuild. Luckily for us, we don’t need to interact with that service directly, thanks to the useful SageMaker Docker Build utility that hides all this complexity from us. Install the utility with a command like:

library(reticulate)

# Install the CLI utility from pip via reticulate
py_install("sagemaker-studio-image-build", pip = TRUE)

and we are good to go. Building the container requires a single command:

sm-docker build . --file {Dockerfile-Name} --repository {ECR-Repository-Name:Optional-Tag}
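
For example, with hypothetical file and repository names:

sm-docker build . --file Dockerfile-Processing --repository sagemaker-r-processing:1.0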

Conclusion

SageMaker Processing, Training and Hosting capabilities are really versatile, and by bringing our own containers we can build our models and our applications exactly the way we want.

In this blog we explored how to create our own reusable, R-enabled Docker containers that we can use for our processing, training, and deployment needs.

The complete example of the code used in this post can be found in this GitHub repository.

Reach out to me in the comments, or connect with me on LinkedIn, if you are building your own containers for R on SageMaker and would like to discuss it!

