I recently fell in love with SageMaker, simply because it is so convenient! I really love their approach of hiding all the infrastructure needs from the customer and letting them focus on the more important ML aspects of their solutions. A few clicks and a bit of typing here and there, and voilà, you've got a production-ready model ready to take on thousands (if not millions) of requests a day. If you need a good introduction to SageMaker, see the following video by none other than Amazon!
So what can possibly go wrong?
But trouble can strike when you try to set up and create your own models in your own Docker container to perform custom operations. It's not as straightforward and smooth-flowing as building everything with SageMaker from the beginning.
Why would you need your own custom models?
There can be many reasons why you need your own custom model. You might be:
- Using specific Python library versions instead of the latest (e.g. TensorFlow)
- Using libraries unavailable on SageMaker
Before continuing …
Before going forward, make sure you have the following.
- Docker installed and running in your OS
- Basic knowledge of how Docker works
How do we do this?
Now, with some context behind us, let us plough through the details of getting things set up for SageMaker. The tutorial has three sections.
- Create a docker image with your code
- Testing the docker container locally
- Deploying the image on Amazon ECR (Elastic Container Registry)
Let me flesh these points out here. First, you create a Docker image with the libraries, code, and other requirements (e.g. access to ports). Then you create and run a container from that image. Next, you test the code/models with a small chunk of data in the container. After successful testing, you upload the Docker image to ECR. Then you can specify this image as the ML model and use it for training/prediction through Amazon SageMaker.
Also, I'll be using this tutorial/guide as the frame of reference for this blog. It's a really good tutorial. There are a few reasons I thought of reinventing that blog post:
- It's a good tutorial if you're relying only on scikit-learn. I thought of creating a container with XGBoost, so we'll have to do some tinkering with our Docker container.
- I want Python 3 not Python 2 for obvious reasons.
- I also feel like some details are missing here and there (especially when it comes to testing locally).
And to demonstrate this process, I'll be training an XGBoost classifier on the iris dataset. You can find the GitHub repository with all the code here.
Overview of Docker
You know what else is as amazing as SageMaker? Docker. Docker is extremely powerful, portable, and fast. But this is not the place to discuss why, so let's dive straight into setting things up. When working with Docker you follow a clear set of steps:
- Create a folder with the code/models and a special file called `Dockerfile` that has the recipe to create the Docker image
- Create a Docker image by running `docker build -t <image-tag> .`
- Run the image by running `docker run <image>`
- Push the Docker image to a registry that will store the image (e.g. Dockerhub or an AWS ECR repository) using `docker push <image-tag>`
Overview of SageMaker compatible Docker containers
Note that SageMaker requires the image to have a specific folder structure. The folder structure SageMaker is looking for is as follows. Mainly, there are two parent folders: `/opt/program`, where the code is, and `/opt/ml`, where the artefacts are. Note that I've blurred out some files that you probably won't need to edit (at least for this exercise) and that are outside the scope of this tutorial.
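For reference, the layout inside the container looks roughly like this. This is only a sketch of the pieces discussed in this post (the awslabs example repository contains a few extra helper files that are omitted here):

```
/opt/ml/
├── input/
│   └── data/
│       └── <channel_name>/   # input data files
├── model/                    # trained model artefacts
└── output/                   # failure messages, if any

/opt/program/
├── train                     # training logic
├── serve                     # starts the inference web server
└── predictor.py              # Flask app behind the web server
```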


Let's now discuss each of these entities in detail. First, `/opt/ml` is where all the artefacts are going to be stored. Let's talk about each of its subdirectories now.
Directory: /opt/ml
`input/data/<channel_name>` is the directory where the data for your model is stored. It can be any data-related file (given that your Python code can read the data and the container has the required libraries to do so). Here, `<channel_name>` is the name of some consumable input source that will be used by the model.

`model` is where the model will reside. You can either have the model in the container itself, or you can specify a URL (an S3 bucket location) where the model artefacts reside as a `tar.gz` file. For example, if you have the model artefacts in an Amazon S3 bucket, you can point to that S3 bucket during model setup on SageMaker. These model artefacts will then be copied to the `model` directory when your model is up and running.

Finally, `output` is the directory which will store the reason for the failure of a request/task, if it fails.
Directory: /opt/program
Let's now dive into the cream of our model: the algorithm. This should be available in the `/opt/program` directory of our Docker container. There are three main files that we need to be careful about: `train`, `serve`, and `predictor.py`.
`train` holds the logic for training the model and storing the trained model. If the `train` file runs without failures, it will save a model (i.e. a pickle file) to the `/opt/ml/model` directory.
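To make that concrete, here is a minimal sketch of what such a `train` script could look like for the XGBoost/iris case. It is only an illustration: the channel name `training`, the CSV layout (label in the first column), and the file name `xgboost-model.pkl` are assumptions rather than the repository's exact code.

```python
#!/usr/bin/env python3
# Minimal sketch of a SageMaker-style train script (assumed layout, not the repo's exact code).
import os
import pickle
import sys
import traceback

import pandas as pd
from xgboost import XGBClassifier

PREFIX = '/opt/ml'
TRAIN_DIR = os.path.join(PREFIX, 'input', 'data', 'training')  # <channel_name> assumed to be "training"
MODEL_DIR = os.path.join(PREFIX, 'model')
OUTPUT_DIR = os.path.join(PREFIX, 'output')

if __name__ == '__main__':
    try:
        # Assumes CSV files with the label in the first column and features after it.
        files = [os.path.join(TRAIN_DIR, f) for f in os.listdir(TRAIN_DIR)]
        data = pd.concat([pd.read_csv(f, header=None) for f in files])
        y, X = data.iloc[:, 0], data.iloc[:, 1:]

        model = XGBClassifier()
        model.fit(X, y)

        with open(os.path.join(MODEL_DIR, 'xgboost-model.pkl'), 'wb') as f:
            pickle.dump(model, f)
        print('Training complete.')
    except Exception:
        # Any failure reason is written to /opt/ml/output/failure so SageMaker can report it.
        with open(os.path.join(OUTPUT_DIR, 'failure'), 'w') as f:
            f.write(traceback.format_exc())
        sys.exit(255)
```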
`serve` essentially runs the logic written in `predictor.py` as a web service using Flask, which will listen to any incoming requests, invoke the model, make the predictions, and return a response with the predictions.
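For intuition, a bare-bones version of that Flask app might look like the sketch below. It is a simplified assumption of what `predictor.py` does; the actual file in the repository is more elaborate and sits behind the nginx/gunicorn stack that `serve` starts up.

```python
# Minimal sketch of a predictor.py-style Flask app (assumed, simplified version of the repository's file).
import os
import pickle
from io import StringIO

import flask
import pandas as pd

MODEL_PATH = os.path.join('/opt/ml/model', 'xgboost-model.pkl')
with open(MODEL_PATH, 'rb') as f:
    model = pickle.load(f)

app = flask.Flask(__name__)

@app.route('/ping', methods=['GET'])
def ping():
    # SageMaker calls /ping to check that the container is healthy.
    return flask.Response(response='\n', status=200, mimetype='application/json')

@app.route('/invocations', methods=['POST'])
def invocations():
    # SageMaker sends inference requests to /invocations; here we assume CSV input.
    data = pd.read_csv(StringIO(flask.request.data.decode('utf-8')), header=None)
    predictions = model.predict(data)
    result = '\n'.join(str(p) for p in predictions)
    return flask.Response(response=result, status=200, mimetype='text/csv')
```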
Dockerfile
This is the file that underpins what's going to be available in your Docker container, which means it is of utmost importance. So let's take a peek inside. It's quite straightforward if you're already familiar with how to write a Dockerfile, but let me give you a brief tour anyway.
- The `FROM` instruction specifies a base image. So here we are using an already-built Ubuntu image as our base image.
- Next, using the `RUN` command, we install several packages (including Python 3.5) using `apt-get install`.
- Then, again using the `RUN` command, we install pip and, following that, `numpy`, `scipy`, `scikit-learn`, `pandas`, `flask`, etc.
- Subsequently, we set several environment variables within the Docker container using the `ENV` command. We need to append our `/opt/program` directory to the `PATH` variable so that, when we invoke the container, it will know where our algorithm-related files are.
- Last but not least, we `COPY` the folder containing the algorithm-related files to the `/opt/program` directory and then set that to be the `WORKDIR`.
Creating our own Docker container
First, I'm going to use a modified version (link here) of the amazing package provided in the awslabs GitHub repository. The original repository has all the files we need to run our SageMaker model, so it's a matter of editing the files to fit our requirements. Download the content found in the original link to a folder called `xgboost-aws-container` if you want to start from scratch; otherwise, you can fiddle around with my version of the repository.
Note: If you're a Windows user, and you're one of those unfortunates running the outdated Docker Toolbox, make sure you use a directory within `C:\Users` as your project home folder. Otherwise, you'll run into a very ugly experience mounting the folder to the container.
Changes to the existing files
- Rename the `decision-trees` folder to `xgboost`.
- Edit the `train` file as provided in the repository. What I've essentially done is import `xgboost` and replace the decision tree model with an `XGBClassifier` model. Note that whenever there is an exception, it will be written to the failure file in the `/opt/ml/output` folder, so you are free to include as many descriptive exceptions as you want to make sure you know what went wrong if the program fails.
- Edit the `predictor.py` file as provided in the repository. Essentially, what I've done is similar to the changes made to `train`: I imported `xgboost` and changed the classifier to an `XGBClassifier`.
- Open up your `Dockerfile` and do the following edits.
Instead of `python` we use `python3.5`, and we also add `libgcc-5-dev`, as it is required by `xgboost`.
RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    python3.5 \
    nginx \
    ca-certificates \
    libgcc-5-dev \
    && rm -rf /var/lib/apt/lists/*
We are going to ask for specific versions of `numpy`, `scikit-learn`, `pandas`, and `xgboost` to make sure they are compatible with each other. The other benefit of pinning the versions of the libraries you use is that you know your code won't break just because a new version of some library is no longer compatible with it.
RUN wget https://bootstrap.pypa.io/3.3/get-pip.py && python3.5 get-pip.py && \
    pip3 install numpy==1.14.3 scipy scikit-learn==0.19.1 xgboost==0.72.1 pandas==0.22.0 flask gevent gunicorn && \
    (cd /usr/local/lib/python3.5/dist-packages/scipy/.libs; rm *; ln ../../numpy/.libs/* .) && \
    rm -rf /root/.cache
Then we’re going to change the COPY command to the following
COPY xgboost /opt/program
Building the Docker image
Now open your Docker terminal (if you're on Windows; otherwise, your OS terminal) and head to the parent directory of the package. Then run the following command.
docker build -t xgboost-tut .
This should build the image with everything we need. Make sure the image is built by running,
docker images
You should see something like the following.

Running the Docker container to train the model
Now it's time to run the container, so fire away the following command.
docker run --rm -v $(pwd)/local_test/test_dir:/opt/ml xgboost-tut train
Let’s break this command down.
- `--rm`: means the container will be destroyed when you exit it.
- `-v <host location>:<container location>`: mounts a volume to a desired location in the container. Warning: Windows users, you'll run into trouble if you choose anything other than `C:\Users`.
- `xgboost-tut`: the name of the image.
- `train`: with the start of the container, it will automatically start running the `train` file from the `/opt/program` directory. This is why specifying `/opt/program` as part of the `PATH` variable is important.
Things should run fine and you should see an output similar to the following.
Starting the training.
Training complete.
You should also see the `xgboost-model.pkl` file in your `<project_home>/local_test/test_dir/model` directory. This is because we mounted the `local_test/test_dir` directory to the container's `/opt/ml`, so whatever happens to `/opt/ml` will be reflected in `test_dir`.
Testing the container locally for serving
Next, we're going to see if the serving (inference) logic is functioning properly. Let me warn you here again, in case you missed it above: if you're a Windows user, be careful about mounting the volume correctly. To avoid any unnecessary issues, make sure you choose a folder within `C:\Users` as your project home directory.
docker run --rm --network=host -v $(pwd)/local_test/test_dir:/opt/ml xgboost-tut serve
Let me point out a special option that we specify in this docker run command.
- `--network=host`: means the host's network stack is shared with the container, so it will be like running the service on the local machine. This is needed to check whether the API calls are working fine.
Note: I'm using `--network=host` because `-p <host_ip>:<host_port>:<container_port>` did not work (at least on Windows). I recommend using the `-p` option (if it works), as shown below. Warning: use only one of these commands, not both. But I'm going to assume the `--network=host` option to continue forward.
docker run --rm -p 127.0.0.1:8080:8080 -v $(pwd)/local_test/test_dir:/opt/ml xgboost-tut serve
- `serve`: this is the file which invokes the inference logic.
This should show you an output similar to below.

Now, to test whether we can successfully ping the service, run the following command (in a separate terminal window).
curl http://<docker_ip>:8080/ping
You can find out the Docker machine’s IP by
docker-machine ip default
This ping command should produce two messages, one on the host side and one on the server side, something like below.

If all of this went smoothly up to this point (I dearly hope so), congratulations! You've almost set up a SageMaker-compatible Docker image. There's just one more thing we need to do before taking it live.
Now let's try something more exciting: let's make a prediction through our web service. For this we're going to use the [predict.sh](https://github.com/thushv89/xgboost-aws-container/blob/master/local_test/predict.sh) file located in the `local_test` folder. Note that I've adapted it to suit my requirements, meaning that it's different from the one provided in the original awslabs repository. Specifically, I introduced a new argument that takes in the IP address and the port, in addition to the arguments taken in the original file. We make a call to that modified `predict.sh` file using the following command.
./predict.sh <container_ip>:<port> payload.csv text/csv
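`payload.csv` holds the data point we send to the service. Assuming a single iris example with the four feature columns, it could look something like the following line (the exact values are only illustrative):
5.1,3.5,1.4,0.2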
Here we are making a call to the inference web service using the data in `payload.csv` and telling it that it's a CSV file. It should return something like the following, which says the data point was identified as belonging to the class `setosa`.
* timeout on name lookup is not supported
* Trying <container_ip>...
* TCP_NODELAY set
* Connected to <container_ip> (<container_ip>) port <port> (#0)
> POST /invocations HTTP/1.1
> Host: <container_ip>:<port>
> User-Agent: curl/7.55.0
> Accept: */*
> Content-Type: text/csv
> Content-Length: 23
>
* upload completely sent off: 23 out of 23 bytes
< HTTP/1.1 200 OK
< Server: nginx/1.10.3 (Ubuntu)
< Date: <date and time> GMT
< Content-Type: text/csv; charset=utf-8
< Content-Length: 7
< Connection: keep-alive
<
setosa
* Connection #0 to host <container_ip> left intact
Pushing it up to the ECR
Okay! So the hard work has finally paid off. It's time to push our image to the Amazon Elastic Container Registry (ECR). Before that, make sure you have a repository created in ECR to push the images to. It's quite straightforward if you have an AWS account.
Go to the ECR service from the AWS dashboard and click "Create repository"

Once you create the repository, within the repository, you should be able to see the instruction to complete the push to ECR.
Note: You can also use the build_and_push.sh provided in the repository, but I personally feel more comfortable doing things myself. And it's not really that many steps to push the image.
First, you need to get the credentials to log in to ECR:
aws ecr get-login --no-include-email --region <region>
which should return an output like,
docker login ...
Copy and paste that command, and you should now be logged in to ECR. Next, you need to re-tag your image so you can push it to ECR correctly.
docker tag xgboost-tut:latest <account>.dkr.ecr.<region>.amazonaws.com/xgboost-tut:latest
Now it’s time to push the image to your repository.
docker push <account>.dkr.ecr.<region>.amazonaws.com/xgboost-tut:latest
Now the image should appear in your ECR repository with the tag `latest`. So the hard part is done. Next, you need to create a SageMaker model and point it to the image, which is as straightforward as creating a model with SageMaker itself, so I won't stretch the blog post with those details.
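If you want a rough idea of what that last step looks like in code, here is a minimal sketch using the SageMaker Python SDK. The argument names follow SDK v2, and the role ARN, S3 paths, and instance types are placeholders you would replace with your own:

```python
# Sketch only: training and deploying the custom ECR image via the SageMaker Python SDK (v2-style arguments).
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = 'arn:aws:iam::<account>:role/<sagemaker-execution-role>'  # placeholder
image_uri = '<account>.dkr.ecr.<region>.amazonaws.com/xgboost-tut:latest'

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://<your-bucket>/xgboost-tut/output',  # placeholder
    sagemaker_session=session,
)

# The channel name ("training") must match the folder your train script reads from.
estimator.fit({'training': 's3://<your-bucket>/xgboost-tut/iris.csv'})

# Deploy the trained model behind a real-time endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.t2.medium')
```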
You can find the Github repository with all the code here.
Conclusion
It was a long journey, but a fruitful one (in my opinion). So we did the following in this tutorial.
- First, we understood why we might need to make our own custom models.
- Then we examined the folder structure of the Docker container that SageMaker requires in order to run it.
- We then discussed how to create a Docker image for the container.
- This was followed by how to build the image and run the container.
- Next, we discussed how to test the container on the local computer before pushing it out.
- Finally, we discussed how to push the image to ECR so it is available for consumption through SageMaker.
Special thanks to the contributors who made the original GitHub repository, giving me an awesome starting point! Last but not least, if you enjoyed this article make sure you leave a few claps 🙂
Want to get better at deep networks and TensorFlow?
Checkout my work on the subject.

[1] (Book) TensorFlow 2 in Action – Manning
[2] (Video Course) Machine Translation in Python – DataCamp
[3] (Book) Natural Language processing in TensorFlow 1 — Packt
New! Join me on my new YouTube channel

If you are keen to see my videos on various Machine Learning/deep learning topics make sure to join DeepLearningHero.