
Computer Vision at Scale with Dask and PyTorch

A tutorial demonstrating how to run batch image classification in parallel with GPU clusters and PyTorch, using dog images

Disclaimer: I’m a Senior Data Scientist at Saturn Cloud – a platform enabling easy-to-use parallelization and scaling for Python with Dask.

Applying Deep Learning strategies to computer vision problems has opened up a world of possibility for data scientists. However, to use these techniques at scale to create business value, substantial computing resources need to be available – and this is just the kind of challenge Saturn Cloud is built to solve!

In this tutorial, you’ll see the steps for conducting image classification inference with the popular ResNet50 deep learning model at scale, using GPU clusters on Saturn Cloud. With the resources Saturn Cloud makes available, we can run the task 40x faster than a non-parallelized approach!

Photo by Victor Grabarczyk on Unsplash

For today’s example, we’ll be classifying photos of dogs!

What you’ll learn here:

  • How to set up and manage a GPU cluster on Saturn Cloud for deep learning inference tasks
  • How to run inference tasks with PyTorch on the GPU cluster
  • How to use batch processing to accelerate your inference tasks with PyTorch on the GPU cluster

Setup

To begin, we need to ensure that our image dataset is available, and that our GPU cluster is running.

In our case, we have stored the data on S3 and use the [s3fs](https://s3fs.readthedocs.io/en/latest/) library to work with it, as you’ll see below. If you would like to use this same dataset, it is the Stanford Dogs dataset, available here: http://vision.stanford.edu/aditya86/ImageNetDogs/

Setting up our Saturn GPU cluster is very straightforward.
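The original setup code isn’t reproduced here, but with dask-saturn and dask.distributed it looks roughly like the sketch below – the worker count, instance sizes, and argument values are illustrative, so check them against your own Saturn Cloud project settings.

from dask_saturn import SaturnCluster
from dask.distributed import Client

# Illustrative cluster spec: four GPU workers; the sizes depend on your Saturn project.
cluster = SaturnCluster(
    n_workers=4,
    scheduler_size='medium',
    worker_size='g4dn8xlarge',
)
client = Client(cluster)
client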

[2020–10–15 18:52:56] INFO – dask-saturn | Cluster is ready

We are not explicitly setting it, but each of our four cluster nodes runs 32 threads, giving us 128 threads in total.

Tip: You may want to reduce the number of threads if your files are very large – too many threads processing large tasks simultaneously might require more memory than your workers have available at one time.

This step may take a moment to complete, because all the AWS instances that we are requesting need to be spun up. Calling client at the end there will monitor the spin-up process and let you know when things are ready to rock!

GPU Capability

At this point, we can confirm that our cluster has GPU capabilities, and make sure we have set everything up correctly.

First, check that the Jupyter instance has GPU capability.

import torch
torch.cuda.is_available()

True

Awesome – now let’s also check each of our four workers.

client.run(lambda: torch.cuda.is_available())

Here then we’ll set the "device" to always be cuda, so we can use those GPUs.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Note: If you need some help establishing how to run a single image classification, I have an expanded code notebook available on GitHub that can give you those instructions as well as the rest of this content.


Inference

Now, we’re ready to start doing some classification! We’re going to use some custom-written functions to do this efficiently and make sure our jobs can take full advantage of the GPU cluster’s parallelism.

Preprocessing

Single Image Processing
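The preprocessing gist isn’t reproduced here, so below is a minimal sketch of what a function like our preprocess might look like, assuming torchvision transforms, PIL, and an s3fs filesystem object passed in as fs – the exact transforms and the way the labels are parsed out of the file path are illustrative.

import dask
import torch
from PIL import Image
from torchvision import transforms

@dask.delayed
def preprocess(path, fs):
    '''Open one image from S3, apply transformations, and return
    its identifier, image tensor, and ground truth label.'''
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])
    with fs.open(path, 'rb') as f:
        img = Image.open(f).convert("RGB")
        tensor = transform(img)

    # The breed name is the folder name, e.g. .../n02086240-Shih-Tzu/n02086240_1082.jpg
    truth = path.split('/')[-2].split('-', 1)[1]
    name = path.split('/')[-1].replace('.jpg', '')
    return [name, tensor, truth]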

This function allows us to process one image, but of course we have a lot of images to work with here! We’re going to use some list comprehension strategies to create our batches and get them ready for our inference.

First, we break the list of images we have from our S3 filepath into chunks that will define the batches.

import toolz, s3fs
s3 = s3fs.S3FileSystem()  # assumes S3 access is already configured in this environment
s3fpath = 's3://dask-datasets/dogs/Images/*/*.jpg'
batch_breaks = [list(batch) for batch in toolz.partition_all(60, s3.glob(s3fpath))]

Next, we’ll process each file into nested lists. Then we’ll reformat this list setup slightly, and we’re ready to go!

image_batches = [[preprocess(x, fs=s3) for x in y] for y in batch_breaks]

Notice that we have used the Dask delayed decorator on all of this – we don’t want it to actually run yet, but to wait until we are doing work in parallel on the GPU cluster!

Format Batches

This little step just makes sure that the batches of images are organized in the way that the model will expect them.
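The gist for this step isn’t shown here either; a sketch of what it does might look like the following, where each batch of [name, tensor, truth] triples from preprocessing is split into three parallel lists (the helper name reformat is mine).

def reformat(batch):
    '''Split a batch of [name, tensor, truth] triples into parallel
    lists of image names, image tensors, and true labels.'''
    names = [item[0] for item in batch]
    images = [item[1] for item in batch]
    truths = [item[2] for item in batch]
    return names, images, truths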


Run the Model

Now we are ready to do the inference task! This is going to have a few steps, all of which are contained in functions described below, but we’ll talk through them so everything is clear.

Our unit of work at this point is batches of 60 images at a time, which we created in the section above. They are all neatly arranged in lists so that we can work with them effectively.

One thing we need to do with the lists is to "stack" the tensors. We could do this earlier in our process, but because we are using the Dask delayed decorator on the preprocessing, our functions actually do not know that they are receiving tensors until later in the process. Therefore, we’re delaying the "stacking" as well by putting it inside this function that comes after the preprocessing.
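The stacking step itself is a single call – here images stands for the list of image tensors produced by the reformatting step above.

# Combine the list of 3-channel image tensors into one 4D tensor
# of shape (batch_size, 3, height, width).
image_stack = torch.stack(images)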

So now we have our tensors stacked so that batches can be passed to the model. We are going to retrieve our model using pretty simple syntax:
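That syntax looks roughly like this (the variable name resnet is mine):

from torchvision import models

# Download ResNet50 with pretrained ImageNet weights, move it to the GPU,
# and switch it to evaluation mode for inference.
resnet = models.resnet50(pretrained=True).to(device)
resnet.eval()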

Conveniently, the torchvision library we loaded contains several useful pretrained models and datasets – that’s where we grab ResNet50 from. Calling the .to(device) method assigns the model object to the GPU resources on our cluster.

Now we are ready to run inference! It is inside the same function, styled this way:
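A sketch of that step, assuming the stacked batch from above is called image_stack:

# Move the batch to the GPU and run it through the model; we skip
# gradient tracking because we are only doing inference.
with torch.no_grad():
    pred_batch = resnet(image_stack.to(device))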

We assign our image stack (just the batch we are working on) to the GPU resources and then run the inference, returning predictions for that batch.


Result Evaluation

The predictions and ground truth we have so far, however, are not really human readable or comparable, so we’ll use the functions that follow to clean them up and give us interpretable results.
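The gist isn’t reproduced here; a minimal sketch of an evaluate_pred_batch function, assuming classes is the list of human readable ImageNet class names and gtruth is the list of true breed labels from preprocessing, might look like this – the original may compute its probabilities slightly differently.

import torch

def evaluate_pred_batch(batch, gtruth, classes):
    '''Turn raw model outputs into (class name, probability) pairs and
    pass the ground truth labels through alongside them.'''
    probs = torch.nn.functional.softmax(batch, dim=1)
    preds = []
    labslist = []
    for i in range(len(batch)):
        top_prob, top_idx = torch.max(probs[i], dim=0)
        preds.append([(classes[top_idx.item()], top_prob.item())])
        labslist.append(gtruth[i])
    return preds, labslist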

This takes our results from the model, and a few other elements, to return nice readable predictions and the probabilities the model assigned.

preds, labslist = evaluate_pred_batch(pred_batch, truelabels, classes)

From here, we’re nearly done! We want to pass our results back to S3 in a tidy, human readable way, so the rest of the function handles that. It iterates over each image, because these steps don’t operate on whole batches. is_match is one of our custom functions, which you can check out below.
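A minimal sketch of is_match – it simply checks whether the ground truth breed name appears anywhere in the human readable prediction string:

import re

def is_match(label, pred):
    '''Return True if the true breed name appears in the prediction string.'''
    pattern = re.escape(label.replace('_', ' '))
    return bool(re.search(pattern, str(pred).replace('_', ' '), re.IGNORECASE))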


Put It All Together

Now, we aren’t going to patch all these functions together by hand; instead, we have assembled them in one single delayed function that will do the work for us. Importantly, we can then map this function across all our batches of images on the cluster!
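Under the assumptions made in the sketches above (the helpers reformat, evaluate_pred_batch, and is_match, plus an illustrative path for the ImageNet class labels and a hypothetical output bucket), the assembled delayed function might look roughly like this:

import json
import dask
import s3fs
import torch
from torchvision import models

@dask.delayed
def run_batch_to_s3(iteritem):
    '''Run one batch of preprocessed images through ResNet50 on a GPU
    worker and write one human readable result file per image to S3.'''
    names, images, truelabels = reformat(iteritem)

    # Human readable ImageNet class labels -- this path is illustrative;
    # any copy of the 1000 ImageNet class names will do.
    s3 = s3fs.S3FileSystem()
    with s3.open('s3://dask-datasets/dogs/imagenet1000_clsidx_to_labels.txt', 'rb') as f:
        classes = [line.strip() for line in f.readlines()]

    # Load the pretrained model and send it to the worker's GPU.
    device = torch.device("cuda")
    resnet = models.resnet50(pretrained=True).to(device)
    resnet.eval()

    # Stack the batch into one tensor and run inference without gradients.
    image_stack = torch.stack(images)
    with torch.no_grad():
        pred_batch = resnet(image_stack.to(device))

    # Make the outputs readable, then write one small record per image.
    preds, labslist = evaluate_pred_batch(pred_batch, truelabels, classes)
    for name, pred, label in zip(names, preds, labslist):
        outcome = {'name': name,
                   'groundtruth': label,
                   'prediction': pred,
                   'evaluation': is_match(label, pred)}
        # The output bucket is hypothetical -- point this at your own bucket.
        with s3.open(f's3://your-output-bucket/dogs/preds/{name}.json', 'wb') as f:
            f.write(json.dumps(outcome, default=str).encode())
    return names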

On the Cluster

We have really done all the hard work already, and can let our functions take it from here. We’ll be using the .map method to distribute our tasks efficiently.
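The cluster submission step looks along these lines, with run_batch_to_s3 standing in for our assembled delayed function:

futures = client.map(run_batch_to_s3, image_batches)
futures_gathered = client.gather(futures)
futures_computed = client.compute(futures_gathered, sync=False)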

With map we ensure all our batches will get the function applied to them. With gather we can collect all the results simultaneously rather than one by one. With compute(sync=False) we return all the futures, ready to be calculated when we want them. This may seem arduous, but these steps are required to allow us to iterate over the futures.

Now we actually run the tasks, and we also have a simple error handling system just in case any of our files are messed up or anything goes haywire.
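A sketch of that loop, collecting results and logging any failures without stopping the whole job:

import logging

results = []
errors = []
for fut in futures_computed:
    try:
        result = fut.result()
    except Exception as e:
        # A corrupt file or failed task shouldn't sink the whole run.
        errors.append(e)
        logging.error(e)
    else:
        results.append(result)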


Evaluate

We want to make sure we have high quality results coming out of this model, of course! First, we can peek at a single result.

{'name': 'n02086240_1082', 'groundtruth': 'Shih-Tzu', 'prediction': [(b"203: 'West Highland white terrier',", 3.0289587812148966e-05)], 'evaluation': False}

While we have a wrong prediction here, we have the sort of results we expect! To do a more thorough review, we would download all the results files, then just check to see how many have evaluation: True.
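Using the hypothetical output location from the sketch above, that check might look like this:

import json
import s3fs

s3 = s3fs.S3FileSystem()
# The output prefix here is the hypothetical one used in run_batch_to_s3.
files = s3.glob('s3://your-output-bucket/dogs/preds/*.json')
correct = 0
for path in files:
    with s3.open(path, 'rb') as f:
        if json.loads(f.read())['evaluation']:
            correct += 1

print(f"Number of dog photos examined: {len(files)}")
print(f"Number of dogs classified correctly: {correct}")
print(f"The percent of dogs classified correctly: {round(correct / len(files) * 100, 3)}%")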

Number of dog photos examined: 20580
Number of dogs classified correctly: 13806
The percent of dogs classified correctly: 67.085%

Not perfect, but good looking results overall!


Comparing Performance

So, we have managed to classify over 20,000 images in about 5 minutes. That sounds good, but what is the alternative?

Technique / Runtime

  • No Cluster with Batching / 3 hours, 21 minutes, 13 seconds
  • GPU Cluster with Batching / 5 minutes, 15 seconds

Adding a GPU cluster makes a HUGE difference!


Conclusion

As this demonstrates, you can certainly do deep learning inference on a single node, such as a laptop, but you’re in for a lot of waiting when you do. GPU clusters offer a way to dramatically accelerate your workflow, making it possible for you to iterate faster and improve business or academic practices. Just think: what if you want to run inference on your images hourly? In that case, a single node won’t work at all, because the job won’t even finish within the hour!

GPUs are often seen as an extreme option for machine learning tasks, but with the speed-ups they make possible, they can often reduce overall costs in both human time and dollars. A CPU instance from AWS isn’t free either, and the cost of three hours of computation on that instance adds up – in many situations it can exceed what you’d spend on five minutes of a GPU cluster for the same results.

At Saturn Cloud, we want to make top quality computing resources available to the data science community, and we want to help everyone become a more effective machine learning practitioner. You can learn more about us and the tools described above (and try our platform for free!) at our website.

