Intro to AWS for Machine Learning

Speeding up your Jupyter Notebooks and Gridsearches using the AWS Machine Learning Workbench.

Luke Betham
Towards Data Science


Machine Learning Workbench — netCubed

Creating powerful algorithms couldn’t be easier with scikit-learn and TensorFlow; tuning them properly, however, can take time, as almost all of the models have thousands of possible parameter combinations. In addition, a good rule of thumb for improving your model is to increase the amount of data you are training on, whether predictors or observations. So even if you have the latest MacBook Pro, you will quickly find yourself looking at parameter gridsearches which can take hours, if not days. Luckily, it is simple to farm out all of your processing to a computer in an AWS warehouse somewhere in Ohio!
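To get a feel for how quickly the search space grows, here is a minimal sketch (the model and parameter values are arbitrary, chosen purely for illustration):

param_grid = {
    'n_estimators': [100, 200, 500, 1000],
    'max_depth': [None, 5, 10, 20],
    'min_samples_leaf': [1, 3, 5],
    'max_features': ['sqrt', 'log2'],
}

# 4 x 4 x 3 x 2 = 96 combinations, and with 5-fold cross-validation
# that means 480 separate model fits for a single gridsearch.
n_combinations = 1
for values in param_grid.values():
    n_combinations *= len(values)
print(n_combinations)  # 96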

Getting Set up With AWS

The first step is to set up an AWS account and download your key file to a safe place on your computer. Setting up an AWS account is straightforward; however, there is a good setup video here if you get stuck. Bear in mind that you will need to give credit card details even if you are signing up for the 12-month free trial, as not everything is included in the trial.

Cost: For my most recent project, I used a medium-sized instance for multiple days of processing and the total bill came to about 4 dollars. You can turn the instance off when you are not using it, and the hourly rates are cheap. The cost depends entirely on the power of the instance you are renting, which ranges from around 5 cents per hour for an instance with 2 cores and 2 GB of memory to around 30 dollars per hour for a monster supercomputer with 96 cores and 768 GB of memory. If you want an estimate of costs, just go to the bottom of the Machine Learning Workbench page.

Image by Elias Sch. from Pixabay

Opening the Machine Learning Workbench

Once you have your AWS account up and running, you are ready to get started. The Machine Learning Workbench is simply an EC2 instance which comes pre-configured with Jupyter Notebook, scikit-learn, TensorFlow and many of the standard Python packages like pandas and matplotlib. To get it set up, follow this Machine Learning Workbench link.

Once there, follow the steps to continue to subscribe, accept the terms, continue to configuration and then continue to launch. It gives you the option to pay for a year’s subscription, but you can ignore this for now; if you follow through, you can simply rent the instance by the hour. Once you are through to the “Launch this software” screen, I find it more straightforward to set up the instance through EC2, so select this option as below.

From here, you will be given the option to choose the computing power you require. There are many additional configurations you can select at this stage, but the simplest route is to click straight on “Review and Launch” in the bottom right and then click “Launch” again. At this point it will ask which key pair you want to use for this instance. This is required in order to connect to the EC2 instance from your command line. If you already have a key pair you use for AWS, use that; otherwise, create a new key pair.

Once you have downloaded the new key pair, it will allow you to click on “Launch Instance”. You have now created your EC2 machine learning instance, and it will start to configure itself in the background!

From here, the easiest way to track the instance setup is to click directly on the instance ID. However, if you navigate away from the page, you can always get back to it by going to your EC2 instances screen. If you go back to your EC2 instances after launching and don’t see your instance there, double-check that you are in the correct region; you may have set up your workbench in a different one. You can find all of the regions in the top right of the EC2 screen.

Once the instance has a green light in the instance state, you are good to open it. The easiest way to do this is directly from the browser: in a new browser tab, type in ‘https://<IPv4 Public IP>’ and press enter, where <IPv4 Public IP> is the public IP number underlined below. If you hover to the right of the number, you can copy it by clicking on the icon. Once you navigate to this IP address, you are likely to get a warning from your browser letting you know the connection is not private. Depending on which browser you are using, follow the advanced options to ignore the error and proceed to your instance. If you are struggling to get through, try another browser, as they may have different security settings.

Once you are through, it should take you to this login screen. The password to log in is your Instance ID, highlighted above. You can connect directly to the virtual desktop or go straight into Jupyter Notebook to start training your model. It should take a minute or so to log in the first time, but then you will have a Jupyter Notebook session ready to go!

Common Issue — Security Groups

If you got in without any issues, skip this bit! However, if you are struggling to connect to the IP address and it keeps timing out, most likely the security group for the EC2 instance wasn’t automatically set up correctly. This is easy enough to fix; just follow the steps below.

  • Go back to the main EC2 instance screen and, where you see your instance, scroll across until you see the column header “Security Groups”. Click on the link to whichever security group has been assigned to your instance.
  • This will bring you to a page which allows you to edit the security settings. For this to work, you need to edit your inbound rules to include HTTPS, with the source set to 0.0.0.0/0 (this means any IP address). See the example below; a command-line alternative is also sketched after this list.
  • Once this has been changed you can go back to the IP address in the browser and it should stop timing out. Please feel free to comment or let me know if you encounter any other issues around this.
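If you prefer to do this from the command line, the same inbound rule can be added with the AWS CLI. This is just a sketch; the security group ID below is a placeholder for your own:

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 443 --cidr 0.0.0.0/0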

Loading Data and Notebooks

The simplest method for loading data into your instance is by using the handy upload button, see below. This allows you to quickly and easily upload data directly from your laptop to the remote instance. Once your data and notebooks are uploaded, open up the notebook and start running the code!

It should be noted, however, that if you need to upload very large files, it can be expensive and slow to upload this way. Or the files might not be on your computer at all, but stored safely in an S3 bucket. No fear, it is also possible to load files directly from an S3 bucket. If you haven’t encountered this before, an S3 bucket is simply a storage box for data and files, secure and easy to access from anywhere. This tutorial doesn’t cover setting up S3 buckets, but it is easy to do and there are many instructions already online.

If you want to access an S3 bucket directly from your Jupyter Notebook in your Machine Learning Workbench, you will first need to grant the instance access to your S3 bucket. The simplest way to do this is to log directly into the instance from your local computer using the command line. To do this, go back to the EC2 instance screen, select your instance and press connect. It will open up the screen below. Copy the example as highlighted below (yours will be different) into your command line, making sure that you replace the third item with the full filepath for your secret key file and replace the word “root” with “ubuntu”. Press enter and you should log directly into the instance. If it asks whether you want to continue, type “yes”.
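As a rough guide, the command will look something like the sketch below; the key filename and public DNS here are placeholders, and yours will differ:

ssh -i ~/Downloads/New_Key_Pair.pem ubuntu@ec2-12-345-67-89.us-east-2.compute.amazonaws.com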

If you get an error which says that your private key file is unprotected, protect the file by running the command below and then try again:
chmod 400 New_Key_Pair.pem
Replace the last item in the command with the full filepath for your secret key file.

Once you are in, your command line should look like the above. From here, to authorise access to your S3 bucket, run the following command:

aws configure

Then all you need to do is enter your Access Key ID and Secret Access Key and you will be set up and ready to go! This means you will be able to import files into your instance directly from S3. There are many more resources on how to do this on AWS, but below are a few common commands for using S3 from an EC2 instance:

List your S3 buckets:
aws s3 ls

Copy a file into a bucket:
aws s3 cp ~/Documents/data.csv s3://bucket/folder_name/

To copy a whole folder, add the --recursive flag:
aws s3 cp ~/Documents/data_folder s3://bucket/folder_name --recursive

You can copy files the other way round with:
aws s3 cp s3://bucket/file_to_copy file_to_copy

If you wanted instead to give a full destination path, that would look like:
aws s3 cp s3://bucket/file_to_copy ~/folder/subfolder/

Once you are done with configuring the instance, just type exit in the command line and the connection will close.
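Back in the notebook, you can then read files from S3 straight into pandas. Below is a minimal sketch; the bucket and file names are placeholders, and it assumes the s3fs package (which pandas uses behind the scenes for s3:// paths) is installed on the instance:

import pandas as pd

# Reads a CSV directly from S3 using the credentials set up via 'aws configure';
# requires the s3fs package (pip install s3fs).
df = pd.read_csv('s3://bucket/folder_name/data.csv')
print(df.shape)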

Cores and Processing

A common error I have seen when people move their notebooks over to AWS to speed up processing times is that, when they start running the code, they forget to adapt it to utilise all of the new computing power. They copy their notebook over to a more powerful computer with 8 cores and are then surprised that the gridsearching is taking just as long…

For example, sklearn has a parameter in most of its model functions called n_jobs, which allows you to specify how many cores you want to use for training the model. The default for this is None, which means it will only use 1 core. If you don’t change this default, moving to a computer with more cores will make no difference. You can manually set this to the number of cores you want to use, or simply pass in -1, which will use all available cores.
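As an illustration, here is a minimal sketch of a gridsearch that actually uses every core; the model, data and parameter grid are just placeholders:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10],
}

# n_jobs=-1 uses every available core; the default (None) runs on a
# single core no matter how many your instance has.
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)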

Be aware that n_jobs might not be the only limiting factor: some solvers do not support multiprocessing and will ignore your n_jobs input (e.g. the sag solver with sklearn’s LogisticRegression).

Running Scripts in the Background

So you have your data on the workbench and you are ready to start training your model, but what if, even with the additional AWS computing power, it looks like your gridsearch is going to take days? The drawback to logging into your Machine Learning Workbench Jupyter Notebook is that if you quit the notebook, it isn’t going to keep running in the background. If you shut the Jupyter Notebook, it will interrupt the process and you will have to start from scratch. You have three real options here.

  1. Caffeine: The first and simplest option requires your computer to be on for the entire time, so it isn’t ideal, but I will mention it as it is simpler! Basically, option one is to leave your Jupyter Notebook running and make sure that your computer doesn’t go to sleep. To do this, get an app like Caffeine or Amphetamine to keep your computer awake while the instance works away. This becomes slightly more complicated if you need to close your laptop, as MacBooks don’t stay awake when the lid is closed. There are ways around this; apparently an app called InsomniaX will keep your MacBook running even with the screen closed.
  2. No Hang Up: The second option is more robust. For this option, you will need to log into the instance on the command line and run a script, letting the instance know that you want the script to continue running even if you disconnect from the instance. In order to connect to the instance, review the steps from the previous section (Loading Data and Notebooks). Once you are logged in and you have imported the script you want to run, you can run it in the background using the below command line command:
    nohup python script.py &
    This forces the instance to run the program in the background, so it will not ‘hang up’ if you are disconnected, and it will provide you with a process ID. To use this approach, you will have to convert your notebook into a Python (.py) file and make sure that any results from the script are pickled or otherwise saved to disk (see the sketch after this list). Some additional commands which can be helpful to check that the script is running are:
    Check all python running processes
    ps aux | grep python
    Kill a specific process
    kill -9 [insert process number]
  3. Terminal Multiplexer: Your third option, and probably the most reliable, is to set up a terminal multiplexer. A terminal multiplexer allows you to connect to an instance or server with a client window, and ensures that the instance will continue running even if the client is disconnected. You can shut down your computer and just reconnect to the client whenever you start back up, and the instance will still be running. Think of it like a virtual terminal. As such, you will still need to be running scripts rather than using the Jupyter Notebook directly. A commonly used terminal multiplexer is tmux, and you can access a simple guide on how to get set up here.
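As mentioned in option two, the script has to save its own results, because nothing will be displayed once you disconnect. Here is a minimal sketch of what the end of script.py might look like; joblib ships with sklearn, and the filename is arbitrary:

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      {'n_estimators': [100, 300]}, cv=5, n_jobs=-1)
search.fit(X, y)

# Persist the fitted search object so the results survive after the
# script finishes; reload later with joblib.load('gridsearch_results.pkl').
joblib.dump(search, 'gridsearch_results.pkl')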

Shutting down the Instance

Remember to do this! If you forget, you can quite easily rack up a chunky bill, so don’t go on holiday before closing down your instances. You have two options here: stopping the instance and terminating it.

In both cases, the instance will be shut down, you will no longer be charged for instance usage, and you will lose any data saved on the ephemeral storage of the instance. The key difference is that when you stop an instance, the attached bootable EBS volume is not deleted, whereas terminating permanently removes the virtual machine. A stopped instance will still exist in your EC2 instances window, and you can start it up again any time with the same setup and settings as before; note, though, that if you do start it again, the connection path and IP address will change. Also note that there can be charges for keeping an EBS volume in a stopped state (the charges are minimal, and during the free trial much of this is free), but bear in mind that if you plan on leaving it stopped for a long period of time, you will be better off terminating the instance.
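If you are already in the command line, both actions can also be done with the AWS CLI. These are just sketches; the instance ID is a placeholder for your own:

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0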

There you go! The basics to using AWS computing power for your machine learning tasks — a great way to speed up any data science project.


Data Scientist with 6 years’ experience working in an investment bank. Currently a teaching associate at General Assembly.