SparkML on AWS in 6 Easy Steps

Published in Towards Data Science, Aug 12, 2019 (4 min read)

I recently completed a machine learning project that, due to the size of the dataset and computational complexity of the analysis, required the use of Spark and a distributed computing environment. To complete this project, I elected to use Amazon Web Services (AWS) Elastic MapReduce (EMR). As this was my first attempt to use AWS and EMR, the learning curve was steep. Furthermore, I found the instructions and guidance available online to be only somewhat helpful. To address the lack of clear, concise, and useful instructions for AWS EMR implementation, I provide the following six easy steps to getting up and running with AWS EMR.

By following the instructions below, you will be able to:

Set up an EMR cluster that will run SparkML, and
Create an EMR Notebook (Jupyter) to execute code

Please note, you will need an AWS account in order to use AWS EMR. Instructions for setting up an AWS account can be found here. Also, be aware that there are fees associated with using EMR and other AWS services (e.g., S3 storage and data transfer). Amazon’s fee structure can be found here.

Step 1: Create a Cluster

Begin by logging in to the AWS Console. Once you are logged in, search for EMR.

GIF demonstrating how to search for EMR on AWS Management Console

On the EMR home page, click the “Create cluster” button and then click “Go to advanced options.”

GIF showing the Create Cluster button and Go To Advanced Options link

Step 2: Select Software Configuration

On the software configuration page, you will need to change the default settings. Select only Hadoop, JupyterHub, Spark, and Livy, then click the “Next” button at the bottom of the screen.

GIF demonstrating how to change the default software configuration
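
If you prefer scripting to clicking, the same software selection can be expressed as part of a request for the AWS SDK for Python (boto3). The following is only a sketch: the helper name `software_config` and the release label `emr-5.26.0` are my assumptions, so substitute whichever release the console offers you.

```python
# Applications ticked in Step 2 of the console wizard.
APPLICATIONS = ["Hadoop", "JupyterHub", "Spark", "Livy"]


def software_config(release_label="emr-5.26.0"):
    """Return the software portion of an EMR run_job_flow request.

    The default release label is an assumption, not a recommendation.
    """
    return {
        "ReleaseLabel": release_label,
        "Applications": [{"Name": name} for name in APPLICATIONS],
    }


if __name__ == "__main__":
    print(software_config())
```

The returned dictionary holds two of the keyword arguments that boto3's `run_job_flow` accepts, so it can be merged into a larger request later.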

Step 3: Select Hardware Configuration

Now it is time to select the hardware configuration for your cluster. You can choose the type of master instance as well as the type and number of core and task instances. Each component of the cluster is an Elastic Compute Cloud (EC2) instance. Details on the types of EC2 instances from which you may choose can be found here. In addition to selecting instance types, you can specify the amount of Elastic Block Store (EBS) storage assigned to each instance in the cluster. Once you have selected your desired settings, click the “Next” button at the bottom of the page.

Screenshot of AWS EMR hardware configuration
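
For readers who want to see how these choices map onto the API, here is a rough sketch of the hardware settings in the shape that boto3's `run_job_flow` expects for its `Instances` argument. The helper names, the `m5.xlarge` instance type, the node counts, and the 32 GB `gp2` volume size are all illustrative assumptions, not values from my project.

```python
def instance_group(name, role, instance_type, count, ebs_gb):
    """Describe one group of identical EC2 instances with one EBS volume each."""
    return {
        "Name": name,
        "InstanceRole": role,  # "MASTER", "CORE", or "TASK"
        "InstanceType": instance_type,
        "InstanceCount": count,
        "EbsConfiguration": {
            "EbsBlockDeviceConfigs": [
                {
                    "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": ebs_gb},
                    "VolumesPerInstance": 1,
                }
            ]
        },
    }


def hardware_config(master_type="m5.xlarge", core_type="m5.xlarge",
                    core_count=2, task_count=0, ebs_gb=32):
    """Return one master group, one core group, and an optional task group."""
    groups = [
        instance_group("Master", "MASTER", master_type, 1, ebs_gb),
        instance_group("Core", "CORE", core_type, core_count, ebs_gb),
    ]
    if task_count > 0:
        # Task nodes reuse the core instance type in this sketch.
        groups.append(instance_group("Task", "TASK", core_type, task_count, ebs_gb))
    return {"InstanceGroups": groups}


if __name__ == "__main__":
    for group in hardware_config(core_count=3)["InstanceGroups"]:
        print(group["InstanceRole"], group["InstanceType"], group["InstanceCount"])
```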

Step 4: Select General Options

In the general options settings, assign a name to your cluster and select the Simple Storage Service (S3) bucket where you want logs recorded. If you are unfamiliar with S3, Amazon provides instructions for creating buckets. Alternatively, you can use the default settings which will create a new bucket for you. Once you are ready, click the “Next” button.

Screenshot of AWS EMR general options set-up page

Step 5: Set Security Options and Create the Cluster

The final step for creating a cluster is to set the security options. For personal use, the default settings should be fine. If you plan to use SSH to access the cluster, you will need to assign an EC2 key pair. Instructions for creating a key pair are available via the link in the blue-tinted box in the lower right portion of the page. Once created, the key pair can be assigned to the cluster using the dropdown menu at the top of the page. When ready, click “Create Cluster.”

It will take several minutes for the cluster to start. While you wait, proceed to Step 6 to create a notebook instance.

Screenshot of AWS EMR security options set-up page
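
The general and security options from Steps 4 and 5 can also be sketched in boto3 terms. The code below is a minimal illustration, assuming the default IAM roles that the console uses (`EMR_DefaultRole` and `EMR_EC2_DefaultRole`) already exist in your account. The function names, the bucket, the key pair name, and the region are hypothetical placeholders.

```python
def build_cluster_request(name, log_uri, software, instances, key_name=None):
    """Assemble keyword arguments for boto3's EMR run_job_flow call."""
    if not log_uri.startswith("s3://"):
        raise ValueError("log_uri must be an S3 URI such as s3://my-bucket/logs/")
    instances = dict(instances)  # copy so the caller's dict is not mutated
    # Keep the cluster alive with no steps so a notebook can attach to it.
    instances["KeepJobFlowAliveWhenNoSteps"] = True
    if key_name:
        instances["Ec2KeyName"] = key_name  # only needed for SSH access
    request = {
        "Name": name,
        "LogUri": log_uri,
        "Instances": instances,
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }
    request.update(software)  # e.g. ReleaseLabel and Applications
    return request


def launch_and_wait(request, region_name="us-east-1"):
    """Start the cluster and block until it is running (several minutes)."""
    import boto3  # imported lazily so the builder above works without boto3

    emr = boto3.client("emr", region_name=region_name)
    cluster_id = emr.run_job_flow(**request)["JobFlowId"]
    emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
    return cluster_id


if __name__ == "__main__":
    request = build_cluster_request(
        name="sparkml-demo",
        log_uri="s3://my-bucket/logs/",  # hypothetical bucket
        software={"ReleaseLabel": "emr-5.26.0",
                  "Applications": [{"Name": "Spark"}]},
        instances={"InstanceGroups": []},  # fill in your instance groups
    )
    print(sorted(request))
    # launch_and_wait(request)  # uncomment to create a real, billable cluster
```

Calling `launch_and_wait` creates real resources and incurs fees, which is why it is left commented out here.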

Step 6: Create Notebook

After clicking on “Create Cluster” in Step 5, you will be taken to the screen shown below. Click on “Notebooks” in the menu on the left side of the screen. On the next screen, click the “Create notebook” button.

GIF demonstrating notebook creation steps

Give a name to your notebook and select the cluster you wish to use to run the notebook. When you are finished, click the “Create notebook” button.

GIF demonstrating final notebook creation steps

That’s it! Once the cluster is up and running, you will be able to open the notebook. Type “spark” in the first cell, then run the cell to start a Spark Session. Now you are set to run machine learning algorithms on an AWS EMR cluster using SparkML. When you are finished, remember to terminate the cluster to avoid incurring additional fees.
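
To check that SparkML is working once the session has started, you could run something like the cell below. This is a toy sketch rather than the analysis from my project: it assumes the notebook uses a PySpark kernel (so a `spark` session object already exists), and the rows, column names, and the function `fit_demo_model` are invented for illustration.

```python
# Toy rows of (x1, x2, label); purely illustrative data.
DEMO_ROWS = [
    (0.0, 1.1, 0.0),
    (2.0, 1.0, 1.0),
    (0.1, 1.2, 0.0),
    (2.2, 0.9, 1.0),
]
FEATURE_COLS = ["x1", "x2"]


def fit_demo_model(spark):
    """Fit a small logistic regression pipeline and return its predictions."""
    # Imports live inside the function to keep the sketch self-contained.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    df = spark.createDataFrame(DEMO_ROWS, FEATURE_COLS + ["label"])
    # Spark ML estimators expect the features packed into one vector column.
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
    model = Pipeline(stages=[assembler, lr]).fit(df)
    return model.transform(df).select("label", "prediction")


# In an EMR Notebook with a PySpark kernel, `spark` is already defined:
# fit_demo_model(spark).show()
```

If the cell prints a small table of labels and predictions, the cluster, Livy, and the notebook are talking to each other and you can move on to your real data.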