AWS SageMaker is booming and proving to be one of the top services for building ML models and pipelines on the cloud. One of SageMaker's best features is the wide array of built-in algorithms it provides so Data Scientists and Developers can quickly train and deploy their models. For those less experienced with model creation in a certain field, these algorithms do all the work behind the scenes; all you have to do is feed in your data for training. To check out all the algorithms and use cases SageMaker provides, click here. In this article we'll explore one of the Supervised Learning algorithms, Linear Learner, and use it to tackle the popular Boston Housing regression dataset.
NOTE: For those of you new to AWS, make sure to create an account at the following link if you want to follow along. I'll also provide a list of the services we'll be using along with more in-depth definitions. If you're already familiar with these services, feel free to skip ahead to the code demonstration.
Table of Contents (ToC)
- AWS Services
- Creating Notebook Instance & S3 Bucket
- Data Preparation & Importing to S3
- Training & Understanding Linear Learner
- Endpoint Creation & Model Evaluation
- Entire Code & Conclusion
1. AWS Services
AWS SageMaker: Allows for the building, training, and deployment of custom ML models, with support for both Python and R. Also includes various built-in AWS models/algorithms that can be used for specific tasks.
AWS S3: Amazon's primary storage service. We will be using it to store our training data and model artifacts/information.
Boto3: AWS Software Development Kit (SDK) for Python developers. You can use it within your SageMaker notebook to access other services such as S3.
Identity and Access Management (IAM): Lets you manage access to AWS services through permissions and roles. We will be creating a role for your SageMaker instance.
2. Creating Notebook Instance & S3 Bucket
To get started, we will create an S3 bucket. This will store our training data as well as the model artifacts produced after we have trained our Linear Learner model. Go to the S3 service in the AWS Console and click Create Bucket. We will be naming our bucket "linlearner-housingdata". If you choose a different name, make sure it is all lowercase, in accordance with the S3 bucket naming rules. There's no need to adjust any other settings for our case, so go ahead and click Create bucket at the bottom.

After the bucket has been created, move to the SageMaker service and click Notebook Instances. Here we will create our Notebook Instance, "housing-LinearLearner", which serves as an ML compute instance that lets us work with Jupyter Notebooks and the rest of our code. When creating our Notebook Instance, one configuration we need to be aware of is the Instance Type. SageMaker offers different instance types that are compute- or memory-intensive and come at varying prices. For example, if we had a large dataset that required heavy preprocessing, a compute-intensive instance such as c4 or c5 would be recommended. For our problem we have a relatively small dataset, so we'll go ahead with the default ml.t2.medium instance. Most importantly, we need to look at the IAM role we are assigning to our Notebook. We want SageMaker to be able to access S3 for training data and for storing model artifacts, so we need to make sure of this. In the Permissions section, open the IAM role drop-down and choose Create a new role. Select access to any S3 bucket and then create the role.

3. Data Preparation & Importing to S3
Once you've arrived at your Jupyter Notebook setup, click New towards the top right to create a Notebook with the framework and language version you will be using. For this example, we are working with conda_python3. Next we can upload the Boston Housing dataset by clicking the Upload tab. You should now have a Jupyter Notebook and our dataset ready to go in your environment.

Now we can explore our dataset to get a better understanding before we train our model.
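A minimal sketch of what that exploration might look like, assuming the uploaded file is named boston_housing.csv (your actual file name may differ):

```python
import pandas as pd

# Load the Boston Housing CSV uploaded to the notebook environment
# (file name is an assumption -- adjust to whatever you uploaded)
df = pd.read_csv("boston_housing.csv")

# Quick look at shape, first rows, and summary statistics
print(df.shape)
print(df.head())
print(df.describe())
```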

The response variable we are predicting for this dataset is "medv", the median value of owner-occupied homes in thousands of dollars. Normally we would do more EDA and exploration, but for this article we'll head straight to preparing the data for Linear Learner to train on. First we need to split our data into train and test sets and then convert them into numpy arrays. We need the data in this format so that we can later convert it to RecordIO, the format Linear Learner/SageMaker expects for training data.
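A sketch of that split and conversion, assuming the DataFrame from the previous step; the 80/20 split ratio and random seed are arbitrary choices here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Separate the 13 features from the "medv" response variable
X = df.drop("medv", axis=1)
y = df["medv"]

# 80/20 train/test split, then cast to float32 numpy arrays,
# the dtype the RecordIO conversion step expects
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train = np.asarray(X_train, dtype="float32")
X_test = np.asarray(X_test, dtype="float32")
y_train = np.asarray(y_train, dtype="float32")
y_test = np.asarray(y_test, dtype="float32")
```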
Before we can convert our numpy arrays to RecordIO, we need to make sure we can access the S3 bucket and IAM role we created. We will be using these when we call Linear Learner to train on our data.
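Something along these lines, assuming the bucket name from Section 2 and a project prefix of "linear-learner" (the prefix is just an illustrative choice):

```python
import boto3
import sagemaker
from sagemaker import get_execution_role

# IAM role attached to this notebook instance (created in Section 2)
role = get_execution_role()

# S3 bucket we created earlier, plus a prefix to keep this project's files together
bucket = "linlearner-housingdata"
prefix = "linear-learner"

sess = sagemaker.Session()
region = boto3.Session().region_name
```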
Now that we have our data in the proper format, we need to upload it to our S3 bucket. We create sub-folders for our training data, test data, and model artifacts (the output produced once the training job is done).
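A sketch of the conversion and upload, using the write_numpy_to_dense_tensor helper from the SageMaker SDK; the object keys ("train/train.data", "test/test.data", "output") are illustrative names rather than requirements:

```python
import io
import os
import sagemaker.amazon.common as smac

def upload_recordio(x, y, channel, filename):
    """Convert numpy arrays to RecordIO-protobuf and upload them to S3."""
    buf = io.BytesIO()
    smac.write_numpy_to_dense_tensor(buf, x, y)
    buf.seek(0)
    key = os.path.join(prefix, channel, filename)
    boto3.resource("s3").Bucket(bucket).Object(key).upload_fileobj(buf)
    return f"s3://{bucket}/{key}"

s3_train_data = upload_recordio(X_train, y_train, "train", "train.data")
s3_test_data = upload_recordio(X_test, y_test, "test", "test.data")

# Where SageMaker should write the model artifacts after training
output_location = f"s3://{bucket}/{prefix}/output"
```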

You should now have your training and test data stored in your S3 bucket.

4. Training & Understanding Linear Learner
Now that our data is properly configured, we can see the magic of Linear Learner. Linear Learner is one of AWS's Supervised Learning algorithms and can be used for Regression and Classification problems. For classification it supports both Binary Classification and Multi-Class Classification. For our regression problem we also have a number of hyperparameters we can provide to the algorithm, ranging from epochs to batch size. First we need to access the training container for Linear Learner.
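A sketch of retrieving the container image and building the estimator with SageMaker Python SDK v2 (on older v1 versions, get_image_uri from sagemaker.amazon.amazon_estimator plays the same role); the training instance type here is just a reasonable choice for this small dataset:

```python
from sagemaker import image_uris
from sagemaker.estimator import Estimator

# Linear Learner container image for our region
container = image_uris.retrieve("linear-learner", region)

linear = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.c4.xlarge",
    output_path=output_location,   # S3 path for model artifacts
    sagemaker_session=sess,
)
```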
Note that for the SageMaker estimator, the output path is the S3 model-artifacts path we created earlier. Now that we have created our estimator, we can set our hyperparameters. The most important hyperparameter in this case is feature_dim, the number of features in the X set we are using to predict our response. For our dataset, we have 13 features feeding into the prediction of median house prices. For a list of all available hyperparameters, click here.
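For example (feature_dim and predictor_type are the essential settings here; the mini_batch_size and epochs values are assumptions suited to a dataset of roughly 500 rows and worth tuning):

```python
linear.set_hyperparameters(
    feature_dim=13,              # number of predictor columns in our X data
    predictor_type="regressor",  # regression rather than classification
    mini_batch_size=32,
    epochs=15,
)
```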
Now we can train our model by passing in our S3 training data path, and you should see logs with your training progress.
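Kicking off the training job is a single call; the SDK streams the training logs into the notebook as the job runs:

```python
# "train" is the channel name Linear Learner reads its training data from
linear.fit({"train": s3_train_data})
```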

5. Endpoint Creation & Model Evaluation
Now that we have successfully trained our model, let’s deploy it and see how it does on the test data!
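With SDK v2 the serializer and deserializer can be passed straight into deploy (on older SDK versions you would set the content_type and serializer attributes on the returned predictor instead); the endpoint instance type is again just a modest choice:

```python
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

linear_predictor = linear.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
    serializer=CSVSerializer(),       # send test rows to the endpoint as CSV
    deserializer=JSONDeserializer(),  # parse the JSON predictions that come back
)
```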
Note that we've also attached a serializer so we can process our results and compare them to the actual y-test values. After your endpoint has been successfully created, the deployment progress output ends with an exclamation mark, and the endpoint will appear when you check active Endpoints in the SageMaker console.

Let’s now predict on the test data and compare it to our actual y-test values.
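A sketch of the prediction step; for regression, Linear Learner returns JSON of the form {"predictions": [{"score": ...}, ...]}, so we pull out the score field for each row:

```python
result = linear_predictor.predict(X_test)
y_pred = np.array([row["score"] for row in result["predictions"]])

# Eyeball a few predictions against the actual medv values
for pred, actual in zip(y_pred[:5], y_test[:5]):
    print(f"predicted: {pred:.1f}   actual: {actual:.1f}")
```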

We can see our model was pretty accurate overall, but let's use Root Mean Squared Error (RMSE) to evaluate it numerically.
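RMSE is just the square root of the mean squared difference between the predicted and actual values:

```python
rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))
print(f"RMSE: {rmse:.2f}")
```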

6. Entire Code & Conclusion
To access all the code for this demonstration, go to the link posted above. SageMaker's built-in algorithms are extremely useful, especially for those without experience building their own custom ML models or short on time. The key takeaway with any AWS service or model is that understanding the infrastructure and configuring it properly, from basic IAM roles to data formats, is crucial to building a project with AWS. If you are interested in examples of other built-in SageMaker algorithms, I am planning to write a series of articles/demonstrations over the next few months covering all of the available algorithms. For more references, I'm also linking the AWS SageMaker examples, which you can find here.
I hope this article has been useful for anyone trying to work with AWS and SageMaker for their ML projects. Feel free to leave any feedback in the comments or connect with me on LinkedIn if you're interested in chatting about ML & Data Science. Thank you for reading!