Deploying a Custom Deep Learning Algorithm Docker Container on Amazon SageMaker

Bhavesh Singh Bisht
Towards Data Science
9 min read · Jan 6, 2021

In this article, I will cover how to deploy a custom deep learning container algorithm on Amazon SageMaker. SageMaker provides two options: the first is to use the built-in algorithms that SageMaker offers, such as KNN, XGBoost, and Linear Learner; the other is to bring your own Docker container from ECR (Elastic Container Registry). Here, we will see how to deploy our custom Docker container and host it on SageMaker. For data, I have used the famous Titanic data set from Kaggle. You can refer to my GitHub repository to find all the files required to deploy and host the container.

Prerequisites:

Docker:
Docker provides a simple way to package arbitrary code into an image that is totally self-contained. Once you have an image, you can use Docker to run a container based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. Containers are isolated from each other and from the host environment, so the way you set up your program is the way it runs, no matter where you run it.
In some ways, a Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run on the same physical machine or virtual machine instance.
Docker uses a simple file called a Dockerfile to specify how the image is assembled. You can build your Docker images based on Docker images built by yourself or others, which can simplify things quite a bit.

Container Structure:

File structure of the container. (decision_trees is the name of the algorithm; replace it with your own algorithm or use-case name.) (Image from AWS GitHub repository)

Let’s take a look at the files one by one:

Dockerfile:
This file describes how to build your Docker container image. It is a text document that contains all the commands a user could run on the command line to assemble an image, including commands to install packages, set the active working directory, copy directories, and so on.

The above picture shows the dockerfile for our titanic use case (Image by author)

In the Dockerfile, I have installed libraries such as TensorFlow, scikit-learn, and Keras that will be used for preprocessing, training the model on the training data, and running inference on the test data. I have used Ubuntu 18.04 as the base image.

build_and_push.sh is a script that uses the Dockerfile to build your container image and then pushes it to ECR.

The above figure shows the files listed in the container (Image by author)
  1. nginx.conf is the configuration file for the nginx front-end. Generally, we take this file as it is.
  2. predictor.py is the program that actually implements the Flask web server and the Titanic predictions for this app. We will customize this to perform the preprocessing on the test data and run the inference (a minimal skeleton of its structure is sketched after this list).
  3. serve is the program started when the container is started for hosting. It simply launches the gunicorn server which runs multiple instances of the Flask app defined in predictor.py. We don’t need to modify this file.
  4. train.py is the program that is invoked when the container is run for training. This is the file where we have defined the preprocessing steps, model training and finally saving the model in the model directory.
  5. wsgi.py is a small wrapper used to invoke the Flask app. We don’t need to alter this file.
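
Below is a minimal sketch of how predictor.py is structured in the AWS bring-your-own-container sample: a small Flask app exposing a /ping health check and an /invocations endpoint. The predict helper shown here is only a stand-in for the preprocessing and inference code we add in Step 4.

import flask

app = flask.Flask(__name__)

@app.route('/ping', methods=['GET'])
def ping():
    # Health check: SageMaker calls this route to confirm the container is up.
    return flask.Response(response='\n', status=200, mimetype='application/json')

@app.route('/invocations', methods=['POST'])
def invocations():
    # Inference requests arrive here; for our use case the payload is CSV.
    data = flask.request.data.decode('utf-8')
    result = predict(data)  # stand-in for the preprocessing + inference code from Step 4
    return flask.Response(response=result, status=200, mimetype='text/csv')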

The working directory set in the Dockerfile will be /opt/ml/. Following is the folder hierarchy of this directory.

The above figure shows the folder structure in the opt/ml directory (Image from AWS github repository)
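
For reference, here is a rough sketch (not the exact code from the repository) of the path constants that the sample code defines for this layout; the "training" channel name shown is the one used for the Titanic training data:

import os

prefix = '/opt/ml'
input_path = os.path.join(prefix, 'input/data')       # training data, one sub-folder per channel
model_path = os.path.join(prefix, 'model')            # train.py writes the model here; hosting reads it back
output_path = os.path.join(prefix, 'output')          # write a 'failure' file here if training fails
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')
training_path = os.path.join(input_path, 'training')  # /opt/ml/input/data/training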

Let’s begin with the changes that need to be made in the various files:

Step 1: Clone the repository
You can clone my repo from GitHub or clone the official AWS repository.

Step 2: Edit the Dockerfile
Next, we need to make changes in the Dockerfile, where we specify all the libraries that we will require for the entire process of training and inference. For deploying deep learning models you may require libraries such as TensorFlow, Keras, etc.
Note: check that the versions of Keras and TensorFlow you install are supported and compatible with each other.
We don’t specify an entry point for our container in the Dockerfile, so Docker runs train.py at training time and serve at serving time.
We also need to copy the container folder to the /opt/ml directory and set it as the active working directory for the container. Refer to the image below:

The figure shows that the Titanic folder is set as the active working directory (Image by author)

Step 3: Edit the train.py file
Now that our Dockerfile is ready, we need to start editing the training and preprocessing code. Preprocessing for the current data set was light, so I have incorporated it in the train.py file itself. The changes go mainly in the train() function. Refer to the image below:

Preprocessing required on the training data before it is fed to the model for training (Image by author)

Following are the preprocessing steps I have implemented for the Titanic data set (a rough pandas sketch follows the list):
1. Splitting the titles out of the Name column and grouping them into categories such as Rare, Miss, etc.
2. Creating a new feature “Family_members” as the sum of SibSp and Parch.
3. Handling the missing values in the Age column by grouping on Title and taking the mean.
4. Handling the missing values in the Fare column by grouping on Pclass and taking the mean.
5. Handling the missing values in Embarked by replacing them with “S” (the mode).
6. Reversing the values of Pclass into a new column Pclass_Band so that it correlates positively with Fare.
7. One-hot encoding the columns “Sex”, “Embarked” and “Title”.
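
Here is a rough pandas sketch of these steps, assuming the standard Kaggle Titanic column names; the exact title groupings and column handling in my repository may differ slightly:

import pandas as pd

def preprocess(df):
    # 1. Split the title out of the Name column and bucket the rare ones together
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    df['Title'] = df['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                       'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    df['Title'] = df['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

    # 2. Family_members = SibSp + Parch
    df['Family_members'] = df['SibSp'] + df['Parch']

    # 3. & 4. Fill missing Age and Fare with the group means
    df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('mean'))
    df['Fare'] = df['Fare'].fillna(df.groupby('Pclass')['Fare'].transform('mean'))

    # 5. Fill missing Embarked with the mode
    df['Embarked'] = df['Embarked'].fillna('S')

    # 6. Reverse Pclass so it correlates positively with Fare (1st class -> 3, 3rd class -> 1)
    df['Pclass_Band'] = 4 - df['Pclass']

    # 7. One-hot encode the categorical columns
    return pd.get_dummies(df, columns=['Sex', 'Embarked', 'Title'])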

Now we have to define our model architecture.
For the Titanic data set I have defined 2 hidden layers with 8 and 4 neurons respectively, each with a ReLU activation function, and an output layer with 1 neuron and a sigmoid activation function.
For compiling the model, I have used the Adam optimizer, binary_crossentropy loss, and accuracy as the evaluation metric.
We then train this model using the model.fit function for around 200 epochs with a batch size of 8. I have not performed any hyperparameter tuning for this model, so these values can be changed for better accuracy.

The figure shows Keras sequential layers used for training the model (Image by author)
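
A minimal Keras sketch of this architecture, assuming x_train and y_train are the feature matrix and labels produced by the preprocessing above (input_dim simply matches the number of features left after one-hot encoding):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(8, activation='relu', input_dim=x_train.shape[1]),  # hidden layer 1
    Dense(4, activation='relu'),                              # hidden layer 2
    Dense(1, activation='sigmoid'),                           # survival probability
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=200, batch_size=8)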

Finally, we save the trained model file (keras.h5) in the model directory.

The figure shows saving the trained model in the defined model path (Image by author)
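
A short sketch of that step, assuming the model object from the sketch above and SageMaker’s standard model directory:

import os

model_dir = '/opt/ml/model'                      # SageMaker packages this directory as the model artifact
model.save(os.path.join(model_dir, 'keras.h5'))  # architecture + weights in a single HDF5 file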

Step 4: Edit the predictor.py file
Since the test file has the same format as the training set, we have to perform the same preprocessing steps before we can run inference with the trained model. All of these steps go inside the predict function.

The above screenshot shows the preprocessing steps inside the predict function (Image by author)

Once we have defined all the preprocessing steps, it’s time to load the trained model file. For that we need to define the model path to locate the file; the path here is /opt/ml/model/keras.h5. For loading the model I have used the load_model function from keras.models. We also need to get a handle on the TensorFlow default graph, which will be used when taking predictions from our trained neural network.

The above figure shows the loading of the trained model (Image by author)
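
A minimal sketch of the loading code, assuming TensorFlow 1.x with standalone Keras (with TensorFlow 2.x the default-graph handling is not needed):

import os
import tensorflow as tf
from keras.models import load_model

model_path = os.path.join('/opt/ml/model', 'keras.h5')
model = load_model(model_path)

# Capture the graph the model was loaded into so that the Flask worker threads
# can run predictions against it later.
graph = tf.get_default_graph()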

Once the model is loaded and the test data is processed, it’s time to take the predictions. The prediction code is also defined inside the predict function.

The above snapshot shows the command to take predictions from the trained model (Image by author)
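
Continuing the sketch above, where x_test is assumed to be the preprocessed test features and 0.5 is an illustrative decision threshold:

with graph.as_default():
    probabilities = model.predict(x_test)
predictions = (probabilities > 0.5).astype(int).ravel()  # 1 = survived, 0 = did not survive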

Testing the Container Locally:

When we first package an algorithm for Amazon SageMaker, we want to test it ourselves to make sure it works correctly. The directory container/local_test provides a framework for doing this. It includes three shell scripts for running and using the container, plus a directory structure:

The scripts are:

1. train_local.sh:

Run this with the name of the image and it will run training on the local tree. We also need to set up the directory test_dir/input/data/… with the correct channels and data for the algorithm. Therefore, in this directory we place the Titanic training data.

The above screenshot shows train_local.sh script (Image by author)

2. serve_local.sh:

We need to run this file with the name of the image once we have trained the model, and it will serve the model. This file in turn runs the serve program in the container, which starts a Python application that listens on port 8080 by launching nginx and gunicorn.
Nginx is where requests from the internet arrive first. It can handle them very quickly and is usually configured to only let through the requests that really need to reach the web application.
Gunicorn translates the requests it gets from nginx into a format that your web application can handle, and makes sure that your code is executed when needed.

3. predict.sh:

This script is run with the name of a payload file and (optionally) the HTTP content type we want. The content type defaults to text/csv. For example, we can run $ ./predict.sh payload.csv text/csv. Here, we need to copy the Titanic test file into this directory and rename it to payload.csv. We will pass this file as the input while testing the model locally.

The above snapshot depicts the predict script (Image by author)

Steps for local testing:

Step 1: Build the Docker image:
For building the Docker image, we need to move to the container directory and run the following command:
sudo docker build . -t titanic-image

The above screenshot shows the successful build of the titanic-image (Image by author)

Step 2: Train the model:
In this step we need to train the model and save the model file in the model directory. For this, we need to run the train_local.sh script using the commands below.
cd local_test
sudo bash train_local.sh titanic-image

The above snapshot shows successful training of our neural network on the Titanic dataset (Image by author)

Step 3: Run the serve script
The next step is to run the Flask application. For that, we need to run the serve_local.sh script using the following command:
sudo bash serve_local.sh titanic-image

The above image shows the running server with 8 workers (Image by author)

Step 4: Run the predict script
We have to keep the server running in one tab, and in another tab we can run the predict script to get the model inference. The payload.csv file (the Titanic test data set) will be passed as a parameter to the predict.sh script in the bash command. Running this hits the server on port 8080 with the payload file, and the server in turn responds with the predicted values. This forms a request/response paradigm.
sudo bash predict.sh payload.csv

The above snapshot shows the predicted values output by the model (Image by author)

Now it’s finally time to push our container image to Amazon ECR!

Build and Push

Now it’s time to build and push our Docker image to ECR. This final step can be done using a SageMaker notebook instance. Make sure that the instance’s role has SageMaker full access, access to the S3 bucket where the input file and the model file reside, ECR permissions, etc.
First, we need to upload the titanic folder to JupyterLab inside the notebook instance. Next, we build the Docker image locally inside the notebook and push the image to ECR.
The following code can be found inside the build_and_push.sh file or inside the Bring_Your_Own-Creating_Algorithm_and_Model_Package notebook.

The following code pushes the docker image to ECR with the tag “latest” (Image by author)

Once we have our image pushed to ECR, we can train the model. The first step is to upload our training and batch inference data to S3. Then we need to create an estimator and fit the model. Finally, we can run the batch inference job to test the model output, and we can further deploy the model as a SageMaker endpoint.
You can find these additional steps in the jupyter notebook inside the repository.
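
As a rough sketch of those additional steps, here is what they look like with the SageMaker Python SDK (v2); the image URI, bucket names, and instance types below are placeholders, and the notebook in the repository remains the authoritative version:

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# ECR URI printed by build_and_push.sh
image_uri = '<account-id>.dkr.ecr.<region>.amazonaws.com/titanic-image:latest'

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://<your-bucket>/titanic/output',
    sagemaker_session=session,
)

# Training: SageMaker mounts this channel at /opt/ml/input/data/training inside the container.
estimator.fit({'training': 's3://<your-bucket>/titanic/train'})

# Batch inference on the test data.
transformer = estimator.transformer(instance_count=1, instance_type='ml.m5.large')
transformer.transform('s3://<your-bucket>/titanic/test', content_type='text/csv')
transformer.wait()

# Or host the model as a real-time SageMaker endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.large')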

This is how we can deploy our model container on Amazon SageMaker.

