
Train on Cloud GPUs with Azure Machine Learning SDK for Python.

A getting started guide to running machine learning models on GPU-powered compute instances in the Azure Machine Learning Studio cloud.

Photo by Daniel Páscoa on Unsplash

I recently worked on a project where I needed to train a set of neural networks quickly. This couldn’t be done on my laptop because the training would have been sequential and there just wasn’t enough time. Besides, I can’t stand how sluggish my 3.1 GHz 2017 MacBook Pro gets when training a model with TensorFlow.

There was a fairly substantial barrier to getting started with Azure Machine Learning. Microsoft’s documentation is good but doesn’t provide context, so piecing together the components necessary to get this project running ended up being a bit painful. To save you that pain, here is a getting started guide to running your machine learning models in the Azure cloud with GPU compute instances.

The Data

For this tutorial I am going to use data from Mindy Yang and Gary Thung’s paper "Classification of Trash for Recyclability Status." The images can be downloaded in both their original and downsampled sizes from this Google Drive folder, or in the downsampled sizes from Kaggle.

To get an idea of the data here are a few of the images with their filenames.

Image by author with components from M. Yang and G. Thung under MIT License

I split the data into a training and a validation set. The code for the split isn’t directly related to using Azure so I haven’t included it here; however, all of the code used to create this project is available in the associated GitHub repository.

benbogart/getting_started_with_azure_for_ml

With that, it’s time to set up the Azure environment!

Azure Setup

Request GPU quota

If you do not already have an Azure account you can sign up for one and get $200 in free credits (as of the time of this writing). However, in order to run the following training on a GPU-enabled instance you have to upgrade your account to "pay-as-you-go" and request a quota increase for a GPU compute instance type. Your credit will remain active after you switch to "pay-as-you-go."

The GPU will both speed up the training significantly (e.g. from 18 minutes to 40 seconds per epoch, comparing my laptop to the GPU-enabled compute instance on a model I benchmarked) and save money because of that speed increase.

In order to request a quota increase, click the "?" (question mark) in the menu bar and select "New support request."

  • For Issue type select Service and Subscription Limits (quotas)
  • For Quota type select Machine Learning services

On the following screen

  • Select Enter Details
  • Select your preferred data center location. Remember the location you choose; it will be needed later. I am using East US for this demo.
  • Select NCPromo from the VM series dropdown.
  • Select how many CPU cores you would like to be able to run concurrently for this compute type. In this compute type there are 6 CPU cores per GPU, so for one GPU you would enter 6, for two GPUs enter 12, etc.
  • Click Save and continue
  • Click Next: Review + create >>

NOTE: This notebook can be run on the standard DSv2 virtual machines that come provisioned with a new account; your model will just take considerably longer to train.

Install the AzureML Python SDK

While it is still possible to install the full AzureML SDK, the current recommended best practice is to install only the components you need into your environment.

For this project we only need azureml.core, which can be installed with pip like so:

pip install azureml-core

Setup Azure Environment

There are several bits of setup that need to happen before we can pass our model over to Azure for training. They are:

  1. Create a Workspace
  2. Define a Compute Target
  3. Create a Run Environment
  4. Upload your data to a datastore (optional)
  5. Define the Dataset

I’ll step through these one at a time with a brief explanation of what each is and how it fits into the overall architecture.

Create a Workspace

A Machine Learning Workspace on Azure is like a project container. It sits inside a resource group with any other resources, like storage or compute, that you will use along with your project. The good news is that the Workspace and its Resource Group can be created easily and at once using the AzureML Python SDK.

Like everything in this setup section, the workspace only needs to be created once. Replace [your-subscription-id] with, well… you can figure it out. You can also change the resource group name and workspace name to something creative like test123 if you wish.

This will ask you to authenticate, then store your connection information in the .azureml subdirectory of your present working directory so you don’t have to authenticate again.
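The full gist is in the repo; a minimal sketch of the workspace creation looks roughly like this (the resource group name is an example, and [your-subscription-id] is yours to fill in):

```python
from azureml.core import Workspace

# Example names; choose your own. Replace [your-subscription-id].
ws = Workspace.create(name='azure_trash_demo',
                      subscription_id='[your-subscription-id]',
                      resource_group='azure-trash-demo-rg',
                      create_resource_group=True,
                      location='eastus')  # the region you requested quota in

# Cache the connection info in ./.azureml so later scripts can call
# Workspace.from_config() without prompting for authentication again.
ws.write_config()
```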

Define a Compute Target

The compute target defines the type of virtual machines that will run your training script.

  • compute_name is user defined and can be anything that makes sense to you. This is how you will reference the compute cluster in the future.
  • vm_size is the Azure specific name of the compute type you want to use in your compute cluster. We are using Standard_NC6_Promo which corresponds to the compute type we requested quota for above. It has 6 CPU cores and one Tesla K80 GPU.
  • min_nodes is the number of nodes that will remain active even when there is nothing to process. This can reduce startup time for training, but you will be charged for those nodes whether or not you are using them. I recommend 0 for getting started.
  • max_nodes is the maximum number of nodes that can run at one time in this compute target. max_nodes must be no more than your CPU core quota divided by the number of cores per node, or provisioning will fail.
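The steps above can be sketched as follows (the cluster name gpu-cluster is an example, and max_nodes assumes a 6-core quota, i.e. one GPU):

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

ws = Workspace.from_config()

compute_name = 'gpu-cluster'  # user defined; any name that makes sense

try:
    # Reuse the compute target if it already exists
    compute_target = ComputeTarget(workspace=ws, name=compute_name)
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(
        vm_size='Standard_NC6_Promo',  # 6 CPU cores, 1 Tesla K80 GPU
        min_nodes=0,                   # scale to zero when idle
        max_nodes=1)                   # keep within your core quota
    compute_target = ComputeTarget.create(ws, compute_name, config)
    compute_target.wait_for_completion(show_output=True)
```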

Create a Run Environment

The run environment defines the dependencies of your script and the base Docker image to use. Microsoft provides several curated environments, but ImageDataGenerator requires Pillow, which is not included in the curated environments, so we have to create our own container.

  • name passed into Environment is a user defined name for your environment. We will use it later to access the environment.
  • CondaDependencies defines the conda and pip package dependencies for your script.
  • env.docker.base_image defines the Docker container to use as a base image. This can be from the AzureML-Containers repo. (You can also use a custom container, but that’s beyond the scope of this article.)
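Roughly, assuming an example environment name and illustrative package versions (pin whatever your own script needs):

```python
from azureml.core import Environment, Workspace
from azureml.core.conda_dependencies import CondaDependencies

ws = Workspace.from_config()

env = Environment(name='trash-env')  # user defined name

# Example dependencies; Pillow is the reason we are not using
# a curated environment.
env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=['azureml-defaults', 'tensorflow-gpu==2.2.0', 'pillow'])

# A GPU base image from the AzureML-Containers repo
env.docker.base_image = (
    'mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04')

env.register(workspace=ws)  # register so we can fetch it by name later
```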

Upload your data to a datastore

One important detail to understand in the AzureML Studio architecture is the difference between a datastore and a dataset. A Datastore is storage. It comes in many flavors, but it stores your data.

A Dataset is a reference to data stored either in a Datastore or at any publicly accessible url.

Each workspace comes with a default blob Datastore which is what we will use for this demo.

  • We grab a reference to the default workspace datastore with get_default_datastore()

Then we upload the files by calling upload on our datastore object with:

  • src_dir is the local path to upload from
  • target_path is the path where your data will be uploaded in the Datastore
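Putting those two steps together, and assuming the local split lives in data/ and is uploaded to a trashnet folder (both paths are examples):

```python
from azureml.core import Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Example paths; src_dir is local, target_path is inside the Datastore
datastore.upload(src_dir='data',
                 target_path='trashnet',
                 overwrite=True,
                 show_progress=True)
```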

Define the Dataset

Now that we have our data uploaded, we need to create the dataset definition that points to our data.

  • To create the Dataset call Dataset.File.from_files and pass in a tuple with our datastore object and the path to our data in the Datastore.
  • Then we register the dataset so we can access it later without having to recreate it.
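Continuing the example (the datastore path and registered name trashnet are assumptions carried over from the upload step):

```python
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Point the Dataset at the folder we uploaded to the datastore
dataset = Dataset.File.from_files(path=(datastore, 'trashnet'))

# Register it so later notebooks can fetch it by name
dataset = dataset.register(workspace=ws,
                           name='trashnet',
                           create_new_version=True)
```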

That’s the end of the one-time setup. Now we have a workspace, which we can view at ml.azure.com, containing a compute target and a registered Environment and Dataset.

Train a model

At this point in the GitHub repository I start a new Jupyter notebook. This is to demonstrate how straightforward it is to load in all of the assets we have created. Check it out:
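Loading everything back takes only a few lines (the names below are the example ones used in the setup sketches; substitute whatever you registered):

```python
from azureml.core import Dataset, Environment, Workspace
from azureml.core.compute import ComputeTarget

ws = Workspace.from_config()  # reads the .azureml/ config saved earlier
compute_target = ComputeTarget(workspace=ws, name='gpu-cluster')
env = Environment.get(workspace=ws, name='trash-env')
dataset = Dataset.get_by_name(ws, name='trashnet')
```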

Define an experiment name

Each run of a model lives in an experiment. An experiment is like a folder for your runs that allows you to keep them organized. For example, you might create an experiment for your CNN iterations, and another for your LSTM iterations. For this demo we will create an experiment called "recycling." You can group your runs into experiments in any way that makes sense to you.

Create a Control Script

A control script is a standalone Python file (or files) that is uploaded to Azure and run on your AmlCompute instances to train your model. You send parameters to your script as command line arguments, which we will see below.

Here is the control script I am using for this demo. It does the following:

  • Parses the arguments passed to the script
  • Creates ImageDataGenerators for the data
  • Builds a model
  • Fits a model
  • Saves the model and its history
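The full script is in the repo; here is a minimal sketch of a control script along those lines. The argument names (--data-path, --epochs), directory layout, and toy CNN are illustrative assumptions, and the entry-point call is shown only as a comment:

```python
import argparse
import os
import pickle


def parse_args(argv=None):
    """Parse the command line arguments sent by the run configuration."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-path', type=str,
                        help='mounted path of the image dataset')
    parser.add_argument('--epochs', type=int, default=10)
    return parser.parse_args(argv)


def build_model(num_classes=6):
    """A small example CNN; the real project may use a different one."""
    from tensorflow.keras import layers, models
    model = models.Sequential([
        layers.Conv2D(32, (2, 2), activation='relu'),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model


def main():
    args = parse_args()

    # ImageDataGenerator is why the environment needs Pillow
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    datagen = ImageDataGenerator(rescale=1 / 255)
    train_gen = datagen.flow_from_directory(
        os.path.join(args.data_path, 'train'), target_size=(224, 224))
    val_gen = datagen.flow_from_directory(
        os.path.join(args.data_path, 'validation'), target_size=(224, 224))

    model = build_model(num_classes=train_gen.num_classes)
    history = model.fit(train_gen, validation_data=val_gen,
                        epochs=args.epochs)

    # Anything written to ./outputs is uploaded to the run automatically
    os.makedirs('outputs', exist_ok=True)
    model.save('outputs/model.h5')
    with open('outputs/history.pkl', 'wb') as f:
        pickle.dump(history.history, f)

# In train.py, end the file with:
#   if __name__ == '__main__':
#       main()
```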

Send it to Azure!

And now we send it to Azure to run in the cloud!

One important detail is how the control script gets the data path. We pass the Dataset object with a mount type to the script. It will be automatically mounted for you, and the mount path will take the place of the Dataset in your arguments. This is very convenient and explained nowhere (at least nowhere that I could find).

Arguments are passed in the order you would write them on the command line. The argument names and the values are passed separately but sequentially as a list.
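Putting it together, the submission looks roughly like this (the source directory, script name, and argument names carry over from the control script sketch above and are assumptions; substitute your own):

```python
from azureml.core import (Dataset, Environment, Experiment,
                          ScriptRunConfig, Workspace)
from azureml.core.compute import ComputeTarget

ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, name='trashnet')

# Arguments are a flat list, name then value, in command line order;
# dataset.as_mount() is replaced by the mount path inside the container.
config = ScriptRunConfig(
    source_directory='src',   # folder containing train.py
    script='train.py',
    compute_target=ComputeTarget(workspace=ws, name='gpu-cluster'),
    environment=Environment.get(workspace=ws, name='trash-env'),
    arguments=['--data-path', dataset.as_mount(),
               '--epochs', 10])

run = Experiment(workspace=ws, name='recycling').submit(config)
run.wait_for_completion(show_output=True)
```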

The first time you run a model, Azure will build its compute container. This can take quite some time, but Azure caches the container, so subsequent runs will start much faster.

Check in on the training

To watch the training and monitor the logs you can log in to the Azure Machine Learning Studio.

  1. Go to https://portal.azure.com/ and log in
  2. Search for "Machine Learning" and select "Machine learning"
  3. Select "azure_trash_demo"
  4. Select "Launch studio"
  5. Select Experiments from the left menu
  6. Select recycling

And there is your model running in the cloud.

At this point we didn’t record any metrics, so all you will see here is the runtime. In another article I will dig into how to monitor your run in real time using this dashboard.

Monitor the logs in Machine Learning Studio

To view the logs in realtime as the model is training click on the Run ID, then on Outputs + logs.

You can read through all the TensorFlow console vomit to see that we are indeed training on a GPU. The log will update periodically without interaction from you.

Download your model and logs

The model doesn’t do us much good where it is. The last step is to get that trained model and model history back onto our local computer so we can impress our friends with amazing recycling predictions.

Once the model training has completed, we can download all of the files produced during training, including logs and anything saved to the outputs/ path by our training script, with the following code.
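Something like this, assuming the experiment name recycling and that we want the latest run:

```python
from azureml.core import Experiment, Workspace

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name='recycling')

# get_runs() yields runs newest first; grab the most recent one
# (or fetch a specific run by its Run ID instead)
run = next(experiment.get_runs())

# Download everything the script wrote to outputs/ (model and history)
run.download_files(prefix='outputs', output_directory='.')
```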

Load the model

Now that we have the files we can load in the model with the trained weights.
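Assuming the control script saved the model as outputs/model.h5 (an assumption; match whatever path your script used), loading it and printing the architecture is two lines:

```python
from tensorflow.keras.models import load_model

# Load the trained model downloaded from the run
model = load_model('outputs/model.h5')
model.summary()  # prints the architecture listing below
```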

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, None, None, 32)    416       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, None, None, 32)    0         
_________________________________________________________________
flatten (Flatten)            (None, None)              0         
_________________________________________________________________
dense (Dense)                (None, 6)                 3096774   
=================================================================
Total params: 3,097,190
Trainable params: 3,097,190
Non-trainable params: 0
_________________________________________________________________

Plot the training history

We can plot the training history.
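For example, a small helper for the accuracy curves (the outputs/history.pkl path and the accuracy/val_accuracy keys are assumptions based on how the sketched control script saved the history):

```python
import matplotlib.pyplot as plt


def plot_history(history):
    """Plot training and validation accuracy from a Keras history dict."""
    fig, ax = plt.subplots()
    ax.plot(history['accuracy'], label='train accuracy')
    ax.plot(history['val_accuracy'], label='validation accuracy')
    ax.set_xlabel('epoch')
    ax.set_ylabel('accuracy')
    ax.legend()
    return fig

# Load the pickled history and plot it:
#   import pickle
#   with open('outputs/history.pkl', 'rb') as f:
#       plot_history(pickle.load(f))
#   plt.show()
```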

Or anything else your heart desires.

Wrapping it up

Figuring this all out can be cumbersome, but hopefully this article speeds things up and leads to faster training for you in the future!

Did you know that you can clap for an article on Medium more than once? Go ahead, try it! Give it a few claps like you were clapping for real. I’ll hear it and appreciate the applause.

GitHub repo

You can view the full GitHub repository or jump straight to the first notebook below:

benbogart/getting_started_with_azure_for_ml

Now go do good.

