
I recently worked on a project where I needed to train a set of neural networks quickly. This couldn’t be done on my laptop because the training would have been sequential and there just wasn’t enough time. Besides, I can’t stand how sluggish my 3.1 GHz 2017 MacBook Pro gets when training a model with TensorFlow.
There was a fairly substantial barrier to getting started with Azure Machine Learning. Microsoft’s documentation is good but doesn’t provide context, so piecing together the components necessary to get this project running ended up being a bit painful. To save you that pain, here is a getting started guide to running your machine learning models in the Azure cloud with GPU compute instances.
The Data
For this tutorial I am going to use data from Mindy Yang and Gary Thung‘s paper "Classification of Trash for Recyclability Status." The images can be downloaded in both their original and downsampled sizes from this Google Drive folder, or in the downsampled sizes from Kaggle.
To get an idea of the data, here are a few of the images with their filenames.

I split the data into a training and a validation set. The code for the split isn’t directly related to using Azure so I haven’t included it here; however, all of the code used to create this project is available in the associated github repository.
With that, it’s time to set up the Azure environment!
Azure Setup
Request GPU quota
If you do not already have an Azure account you can sign up for one and get $200 in free credits (as of the time of this writing). However, in order to run the following training on a GPU-enabled instance you have to upgrade your account to "pay-as-you-go" and request a quota increase on a GPU compute instance type. Your credit will remain active after you switch to "pay-as-you-go."
The GPU will both speed up the training significantly (on a model I benchmarked, each epoch went from 18 minutes on my laptop to 40 seconds on the GPU-enabled compute instance) and save money because of that speed increase.
In order to request a quota increase, click the "?" (question mark) in the menu bar and select "New support request."

- For Issue type select Service and Subscription Limits (quotas)
- For Quota type select Machine Learning services

On the following screen:
- Select Enter Details
- Select your preferred data center location. Remember the location you choose; it will be needed later. I am using East US for this demo.
- Select NCPromo from the VM series dropdown.
- Select how many CPU cores you would like to be able to run concurrently for this compute type. In this compute type there are 6 CPU cores per GPU, so for one GPU you would enter 6, for two GPUs enter 12, etc.
- Click Save and continue
- Click Next: Review + create >>

NOTE: This notebook can be run on the standard DSv2 virtual machines that come provisioned with a new account; your model will just take considerably longer to train.
Install the AzureML Python SDK
While it is still possible to install the full AzureML SDK, the current recommended best practice is to install only the components you need into your environment.
For this project we only need azureml.core, which can be installed with pip like so:
pip install azureml.core
Set Up the Azure Environment
There are several bits of setup that need to happen before we can pass our model over to Azure for training. They are:
- Create a Workspace
- Define a Compute Target
- Create a Run Environment
- Upload your data to a datastore (optional)
- Define the Dataset
I’ll step through these one at a time with a brief explanation of what each is and how it fits into the overall architecture.
Create a Workspace
A Machine Learning Workspace on Azure is like a project container. It sits inside a resource group with any other resources like storage or compute that you will use along with your project. The good news is that the Workspace and its Resource Group can be created easily and at once using the azureml python sdk.
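Here is a minimal sketch of that one-time creation call; the resource group name is just an example, and the workspace name matches the one used later in this demo:

```python
from azureml.core import Workspace

# Create the resource group (if needed) and the workspace in a single call
ws = Workspace.create(
    name="azure_trash_demo",               # workspace name used later in this demo
    subscription_id="[your-subscription-id]",
    resource_group="azure_trash_demo_rg",  # example resource group name
    location="eastus",                     # the region you requested quota in
    create_resource_group=True,
)

# Cache the connection info in ./.azureml so later sessions can call Workspace.from_config()
ws.write_config()
```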
Like everything in this setup section, the workspace only needs to be created once. Replace [your-subscription-id] with, well… you can figure it out. You can also change the resource group name and workspace name to something creative like test123 if you wish.
This will ask you to authenticate, then store your connection information in the .azureml subdirectory of your present working directory so you don’t have to authenticate again.
Define a Compute Target
The compute target defines the type of virtual machines that will run your training script.
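A sketch of the cluster definition, assuming a cluster name of gpu-cluster (the parameters are explained below):

```python
from azureml.core.compute import AmlCompute, ComputeTarget

compute_name = "gpu-cluster"  # example name; use whatever makes sense to you

if compute_name not in ws.compute_targets:
    # Provision a new cluster with the GPU VM size we requested quota for
    provisioning_config = AmlCompute.provisioning_configuration(
        vm_size="Standard_NC6_Promo",  # 6 CPU cores, 1 Tesla K80 GPU
        min_nodes=0,                   # scale to zero when idle so you are not billed
        max_nodes=1,                   # must fit within your CPU quota
    )
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    compute_target.wait_for_completion(show_output=True)
else:
    compute_target = ws.compute_targets[compute_name]
```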
- compute_name is user defined and can be anything that makes sense to you. This is how you will reference the compute cluster in the future.
- vm_size is the Azure-specific name of the compute type you want to use in your compute cluster. We are using Standard_NC6_Promo, which corresponds to the compute type we requested quota for above. It has 6 CPU cores and one Tesla K80 GPU.
- min_nodes is the number of nodes that will remain active even when there is nothing to process. This can reduce startup time for training, but you will be charged for those nodes whether or not you are using them. I recommend 0 for getting started.
- max_nodes is the maximum number of nodes that can run at one time in this compute target. max_nodes must be fewer than your CPU quota / number of CPUs per node or provisioning will fail.

Create a Run Environment
The run environment defines the dependencies of your script and the base Docker image to use. Microsoft provides several curated environments, but because ImageDataGenerator requires Pillow, which is not included in the curated environments, we have to create our own.
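A sketch of how that environment might be defined and registered; the environment name, package list, and base image tag here are examples to adapt:

```python
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

env = Environment(name="recycling-env")  # example name; we retrieve it by this name later

# Declare the packages the control script needs
env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["azureml-defaults", "tensorflow-gpu", "pillow", "numpy"]
)

# Base image from the AzureML-Containers repo (pick a GPU image that suits your needs)
env.docker.base_image = "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04"

env.register(workspace=ws)
```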
- name passed into Environment is a user-defined name for your environment. We will use it later to access the environment.
- CondaDependencies defines the pip and conda package dependencies for your script.
- env.docker.base_image defines the Docker container to use as a base image. This can be from the AzureML-Containers repo. (You can also use a custom container, but that’s beyond the scope of this article.)

Upload your data to a datastore
One important detail to understand in the AzureML Studio architecture is the difference between a datastore and a dataset. A Datastore is storage. It comes in many flavors, but it stores your data.
A Dataset is a reference to data stored either in a Datastore or at any publicly accessible URL.
Each workspace comes with a default blob Datastore, which is what we will use for this demo.
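A sketch of the upload; the local folder and target path are examples:

```python
# Grab a reference to the workspace's default blob Datastore
datastore = ws.get_default_datastore()

# Upload the local image folders into the Datastore
datastore.upload(
    src_dir="./data",              # local path to upload from (example)
    target_path="recycling-data",  # path inside the Datastore (example)
    overwrite=True,
)
```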
- We grab a reference to the default workspace Datastore with get_default_datastore().
- Then we upload the files by calling upload on our Datastore object, where:
- src_dir is the local path to upload from
- target_path is the path where your data will be uploaded in the Datastore

Define the Dataset
Now that we have our data uploaded, we need to create the dataset definition that points to our data.
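A sketch of the Dataset definition and registration, reusing the path from the upload above:

```python
from azureml.core import Dataset

# Point a file Dataset at the files we just uploaded to the Datastore
dataset = Dataset.File.from_files(path=(datastore, "recycling-data"))

# Register it so we can retrieve it by name later
dataset = dataset.register(workspace=ws, name="recycling-data", create_new_version=True)
```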
- To create the Dataset, call Dataset.File.from_files and pass in a tuple with our datastore object and the path to our data in the Datastore.
- Then we register the dataset so we can access it later without having to recreate it.
That’s the end of the one-time setup. Now we have a workspace, which we can view at ml.azure.com, that contains a compute_target and a registered Environment and Dataset.
Train a model
At this point in the GitHub repository I start a new Jupyter notebook. This is to demonstrate how straightforward it is to load in all of the assets we have created. Check it out:
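Something like the following does the trick, using the example names registered in the setup section above:

```python
from azureml.core import Workspace, Environment, Dataset
from azureml.core.compute import ComputeTarget

# Reads the config written by ws.write_config() during setup
ws = Workspace.from_config()

compute_target = ComputeTarget(workspace=ws, name="gpu-cluster")
env = Environment.get(workspace=ws, name="recycling-env")
dataset = Dataset.get_by_name(ws, name="recycling-data")
```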
Define an experiment name
Each run of a model lives in an experiment. An experiment is like a folder for your runs that allows you to keep your runs organized. For example, you might create an experiment for your CNN iterations and another for your LSTM iterations. For this demo we will create an experiment called "recycling." You can group your runs in experiments in any way that makes sense to you.
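Creating (or reusing) the experiment is a one-liner:

```python
from azureml.core import Experiment

experiment = Experiment(workspace=ws, name="recycling")
```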
Create a Control Script
A control script is a standalone Python file (or files) that is uploaded to Azure and run on your AmlCompute instances to train your model. You send parameters to your script as command line arguments, which we will see below.
Here is the control script I am using for this demo. It does the following:
- Parses the arguments passed to the script
- Creates ImageDataGenerators for the data
- Builds a model
- Fits the model
- Saves the model and its history
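A simplified sketch of such a control script (say, train.py) is below; the real script lives in the GitHub repository, and the subfolder names, image size, and model here are illustrative:

```python
# train.py -- illustrative control script sketch
import argparse
import os

import numpy as np
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Parse the arguments passed in by the ScriptRunConfig
parser = argparse.ArgumentParser()
parser.add_argument("--data-path", type=str, help="mounted path of the Dataset")
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--batch-size", type=int, default=32)
args = parser.parse_args()

# ImageDataGenerators that read images from the mounted dataset path
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    os.path.join(args.data_path, "train"),
    target_size=(128, 128),
    batch_size=args.batch_size,
)
val_gen = datagen.flow_from_directory(
    os.path.join(args.data_path, "val"),
    target_size=(128, 128),
    batch_size=args.batch_size,
)

# Build and fit a small CNN for the six trash classes
model = keras.Sequential([
    keras.layers.Conv2D(32, (2, 2), activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(train_gen, validation_data=val_gen, epochs=args.epochs)

# Anything written to ./outputs is uploaded to the run automatically
os.makedirs("outputs", exist_ok=True)
model.save("outputs/model")
np.save("outputs/history.npy", history.history)
```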
Send it to Azure!
And now we send it to Azure to run in the cloud!
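A sketch of the submission, reusing the compute target, environment, dataset, and experiment objects from above (the source directory, script name, and hyperparameters are examples):

```python
from azureml.core import ScriptRunConfig

config = ScriptRunConfig(
    source_directory="./src",      # folder containing train.py (example layout)
    script="train.py",
    compute_target=compute_target,
    environment=env,
    arguments=[
        "--data-path", dataset.as_mount(),  # Azure swaps in the mount path at run time
        "--epochs", 20,
        "--batch-size", 32,
    ],
)

run = experiment.submit(config)
print(run.get_portal_url())  # link to the run in Machine Learning Studio
```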
One important detail is how the control script gets the data path. We pass the Dataset object with a mount type to the script. It will be automatically mounted for you, and the mount path will take the place of the Dataset in your arguments. This is very convenient and explained nowhere (at least not that I could find).
Arguments are passed in the order you would write them on the command line. The argument names and the values are passed separately but sequentially as a list.
The first time you run a model, Azure will build its compute container. This can take quite some time, but azure keeps the container cached so subsequent runs will start much faster.
Check in on the training
To see the training and monitor the training logs you can login to the Azure Machine Learning Studio.
- Go to https://portal.azure.com/ and login
- Search for "Machine Learning" and select "Machine learning"

- Select the "azure_trash_demo" workspace

- Select "Launch studio"

- Select Experiments from the left menu
- Select recycling

And there is your model running in the cloud.
At this point we didn’t record any metrics, so all you will see here is the runtime. In another article I will dig into how to monitor your run in real time using this dashboard.

Monitor the logs in Machine Learning Studio
To view the logs in realtime as the model is training click on the Run ID, then on Outputs + logs.

You can read through all the TensorFlow console vomit to see that we are indeed training on a GPU. The log will update periodically without interaction from you.
Download your model and logs
The model doesn’t do us much good where it is. The last step is to get that trained model and model history back onto our local computer so we can impress our friends with amazing recycling predictions.
Once the model training has completed, we can download all the files produced during the training, including logs and anything saved to the outputs/ path by our training script, with the following code.
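A sketch, downloading everything into a local folder (the folder name is an example):

```python
# Download the run's logs and everything written to outputs/ by the training script
run.download_files(output_directory="azure_run_files")
```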
Load the model
Now that we have the files we can load in the model with the trained weights.
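Assuming the model was saved to outputs/model as in the control script sketch above, loading it looks something like this:

```python
from tensorflow import keras

model = keras.models.load_model("azure_run_files/outputs/model")
model.summary()
```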
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, None, None, 32) 416
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, None, None, 32) 0
_________________________________________________________________
flatten (Flatten) (None, None) 0
_________________________________________________________________
dense (Dense) (None, 6) 3096774
=================================================================
Total params: 3,097,190
Trainable params: 3,097,190
Non-trainable params: 0
_________________________________________________________________
Plot the training history
We can plot the training history.
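For example, assuming the history dictionary was saved to outputs/history.npy as in the control script sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

history = np.load("azure_run_files/outputs/history.npy", allow_pickle=True).item()

plt.plot(history["accuracy"], label="training accuracy")
plt.plot(history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```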

Or anything else your heart desires.
Wrapping it up
I hope this leads to faster training for you in the future! Figuring all of this out can be cumbersome, but hopefully this guide saves you some of that pain.
Did you know that you can clap for an article on Medium more than once? Go ahead, try it! Give it a few claps like you were clapping for real. I’ll hear it and appreciate the applause.
Github repo
You can view the full GitHub repository or jump straight to the first notebook below:
Now go do good.