“low angle photography of yellow hot air balloon” by sutirta budiman on Unsplash

Speed Up your Algorithms Part 1 — PyTorch

Speed Up your PyTorch models

Puneet Grover
Towards Data Science
8 min read · Sep 23, 2018


(Edit, 28/11/18): Added the torch.multiprocessing section.

Index:

  1. Introduction
  2. How to check the availability of cuda?
  3. How to get more info on cuda devices?
  4. How to store Tensors and run Models on GPU?
  5. How to select and work on GPU(s) if you have multiple of them?
  6. Data Parallelism
  7. Comparison of Data Parallelism
  8. torch.multiprocessing
  9. References

1. Introduction:

In this post I will show how to check for and initialize GPU devices using torch and pycuda, and how to make your algorithms faster.

PyTorch is a Machine Learning library built on top of torch. It is backed by Facebook’s AI research group. Although it is relatively new, it has gained a lot of popularity because of its simplicity, its dynamic computation graphs, and its pythonic nature. It does not lag behind in speed either, and can even outperform other frameworks in many cases.

pycuda lets you access Nvidia’s CUDA parallel computation API from python.

2. How to check the availability of cuda?

“brown dried leaves on sand” by sydney Rae on Unsplash

To check if you have a cuda device available using torch, you can simply run:
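A one-liner does it:

```python
import torch

torch.cuda.is_available()  # True if at least one CUDA-capable GPU is visible to PyTorch
```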

3. How to get more info on your cuda devices?

“black smartphone” by rawpixel on Unsplash

To get basic info on devices, you can use torch.cuda. But to get more info on your devices you can use pycuda, a python wrapper around the CUDA library. You can use something like:
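For instance, a minimal pycuda sketch that initializes the driver and inspects the visible devices:

```python
import pycuda.driver as cuda

cuda.init()                   # initialize the CUDA driver API
print(cuda.Device.count())    # number of CUDA-capable devices, e.g. 1
print(cuda.Device(0).name())  # name of device 0, e.g. 'Tesla K80'
```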

Or,
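equivalently, using torch alone (device index 0 is just an example):

```python
import torch

torch.cuda.get_device_name(0)        # e.g. 'Tesla K80'
torch.cuda.get_device_properties(0)  # name, total memory, compute capability, ...
```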

I wrote a simple class to get information on your cuda-compatible GPU(s):
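A minimal version of such a class, built on pycuda, could look like this (the class name and output format are illustrative):

```python
import pycuda.driver as cuda

class AboutCudaDevices:
    """Collects basic facts about every cuda-compatible GPU via pycuda."""

    def __init__(self):
        cuda.init()
        self.num_devices = cuda.Device.count()

    def info(self):
        lines = [f"{self.num_devices} device(s) found:"]
        for i in range(self.num_devices):
            dev = cuda.Device(i)
            cc = ".".join(str(c) for c in dev.compute_capability())
            lines.append(f"  {i}) {dev.name()} (compute capability: {cc}, "
                         f"total memory: {dev.total_memory() // (1024 ** 2)} MB)")
        return "\n".join(lines)

print(AboutCudaDevices().info())
```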

To get the current usage of memory you can use PyTorch's functions such as:
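For example (both report the currently selected device):

```python
import torch

torch.cuda.memory_allocated()  # bytes of GPU memory currently occupied by tensors
torch.cuda.memory_cached()     # bytes held by the caching allocator
                               # (renamed to torch.cuda.memory_reserved() in newer releases)
```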

And after you have run your application, you can clear your cache using a simple command:
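That is:

```python
torch.cuda.empty_cache()  # releases cached, currently unused memory back to the GPU driver
```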

However, this command will not free the GPU memory occupied by tensors, so it cannot increase the amount of GPU memory available to PyTorch.

These memory methods are only available for GPUs. And that’s where they are actually needed.

4. How to store Tensors and run Models on GPU?

The .cuda magic.

“five pigeons perching on railing and one pigeon in flight” by Nathan Dumlao on Unsplash

If you want to store something on cpu, you can simply write:
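For example, a small tensor created the default way lives on the cpu:

```python
import torch

a = torch.tensor([1., 2.])  # created on the cpu by default
```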

This tensor is stored on the cpu and any operation you do on it will be done on the cpu. To transfer it to the gpu you just have to call .cuda():
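Like so:

```python
a = a.cuda()  # copies the tensor from the cpu to the default GPU device
```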

Or,
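you can create the tensor directly on the GPU, or move it there with .to() (a minimal sketch):

```python
device = torch.device("cuda")              # or "cuda:0" for an explicit device id
a = torch.tensor([1., 2.], device=device)  # created directly on the GPU
# equivalently: a = torch.tensor([1., 2.]).to(device)
```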

This will place it on the default device, which you can check with the command:
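That is:

```python
torch.cuda.current_device()  # index of the currently selected GPU, e.g. 0
```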

Or, you can also do:
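ask the tensor itself which device it lives on:

```python
a = torch.tensor([1., 2.]).cuda()  # continuing the example above
a.get_device()                     # index of the GPU holding this tensor, e.g. 0
a.device                           # device(type='cuda', index=0)
```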

You can also send a model to the GPU. For example, consider a simple module made with nn.Sequential:
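A toy example (the layer sizes are arbitrary):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 100),
    nn.ReLU(),
    nn.Linear(100, 2),
)
```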

To send this to GPU device, simply do:
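That is:

```python
model.cuda()  # moves all of the model's parameters and buffers to the default GPU
# or, equivalently: model.to(torch.device("cuda:0"))
```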

You can check whether the model is on the GPU by checking whether its parameters are on the GPU, like:
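For instance:

```python
next(model.parameters()).is_cuda  # True if the model's parameters live on a GPU
```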

5. How to select and work on GPU(s) if you have multiple of them?

“selective focus photography of mechanics tool lot” by NeONBRAND on Unsplash

You can select a GPU for your current application/storage which can be different from the GPU you selected for your last application/storage.

As we already saw earlier, we can get all our cuda-compatible devices and their IDs using pycuda, so we will not repeat that here.

Considering you have 3 cuda compatible devices, you can initialize and allocate tensors to a specific device like this:
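A sketch with three devices (device indices 0, 1 and 2 are assumed):

```python
import torch

cuda0 = torch.device('cuda:0')
cuda1 = torch.device('cuda:1')
cuda2 = torch.device('cuda:2')

x = torch.tensor([1., 2.], device=cuda1)  # created directly on GPU 1
y = torch.tensor([3., 4.]).to(cuda2)      # created on the cpu, then moved to GPU 2

with torch.cuda.device(1):                # make GPU 1 the default inside this block
    z = torch.tensor([5., 6.]).cuda()     # .cuda() now targets GPU 1

w = x + z  # computed and stored on GPU 1, the device of its operands
```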

When you do any operation on these Tensors, which you can do irrespective of the selected device, the result will be saved on the same device as the Tensor.

If you have multiple GPUs, you can split your application’s work among them, but that comes with an overhead of communication between them. If your task doesn’t need to relay messages too much, you can give it a go.

Actually, there is one more caveat. In PyTorch all GPU operations are asynchronous by default. PyTorch does perform the necessary synchronization when copying data between the CPU and a GPU or between two GPUs, but if you create your own stream with torch.cuda.Stream(), you will have to look after the synchronization of instructions yourself.

To give an example from PyTorch's documentation, this is incorrect:
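Paraphrasing the snippet from the CUDA semantics notes (reference 2):

```python
import torch

cuda = torch.device('cuda')
s = torch.cuda.Stream()  # create a new stream
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # sum() may start executing before normal_() on the default stream has finished!
    B = torch.sum(A)
```

Because the two streams are not synchronized, sum() can read A before normal_() has filled it; one fix is to make s wait for the default stream, e.g. s.wait_stream(torch.cuda.default_stream()), before launching the sum.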

If you want to use multiple GPUs to their full potential, you can:

  1. use all GPUs for different tasks/applications,
  2. use each GPU for one model in an ensemble or stack, each GPU having a copy of the data (if possible), since most processing is done while fitting the model,
  3. use each GPU with a slice of the input and a copy of the model on each GPU. Each GPU will compute its result separately and send it to a destination GPU where further computation will be done, etc.

6. Data Parallelism:

“photography of tree in forest” by Abigail Keenan on Unsplash

In data parallelism we split the data (a batch) that we get from the data generator into smaller mini-batches, which we then send to multiple GPUs for computation in parallel.

In PyTorch data parallelism is implemented using torch.nn.DataParallel.

But let's look at a simple example to see what is going on under the hood (a sketch follows the list below). To do that, we will have to use a few of the functions of nn.parallel, namely:

  1. replicate: to replicate a Module on multiple devices.
  2. scatter: to distribute the input in the first dimension among those devices.
  3. gather: to gather and concatenate the outputs, in the first dimension, from those devices.
  4. parallel_apply: to apply a set of distributed inputs, which we got from scatter, to the corresponding set of distributed Modules, which we got from replicate.
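A rough sketch of how these pieces fit together, essentially what nn.DataParallel's forward pass does (the function name here is illustrative):

```python
from torch.nn.parallel import replicate, scatter, gather, parallel_apply

def data_parallel(module, inputs, device_ids, output_device=None):
    """Run `module` on `inputs` split across `device_ids`, gathering on `output_device`."""
    if output_device is None:
        output_device = device_ids[0]

    replicas = replicate(module, device_ids)       # one copy of the model per GPU
    scattered = scatter(inputs, device_ids)        # split the batch along dim 0
    replicas = replicas[:len(scattered)]           # drop unused replicas for small batches
    outputs = parallel_apply(replicas, scattered)  # run each replica on its chunk, in parallel
    return gather(outputs, output_device)          # concatenate the results on one device
```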

Or, simply:
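wrap the model in nn.DataParallel and use it as usual (the device ids and sizes here are just an example):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), nn.Linear(100, 2))
model = nn.DataParallel(model, device_ids=[0, 1, 2]).cuda()

out = model(torch.randn(64, 10).cuda())  # the batch of 64 is split across the 3 GPUs;
                                         # the outputs are gathered back on GPU 0
```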

7. Comparison of Data Parallelism:

“silver bell alarm clock” by Icons8 team on Unsplash

I don’t have multiple GPUs, but I was able to find a great post by Ilia Karmanov here, along with his GitHub repo comparing most frameworks on multiple GPUs, here.

His results are in the charts in his post and repo (last updated Jun 19, 2018). The launch of PyTorch 1.0 and TensorFlow 2.0, as well as newer GPUs, might have changed this since then…

So, as you can see, parallel processing definitely helps, even though the GPUs have to communicate with the main device at the beginning and at the end. PyTorch delivers results faster than all the other frameworks, being beaten only by Chainer, and only in the multi-GPU case. PyTorch also keeps it simple, requiring just one call to DataParallel.

8. torch.multiprocessing

Photo by Matthew Hicks on Unsplash

torch.multiprocessing is a wrapper around Python’s multiprocessing module, and its API is 100% compatible with the original module. So you can use Queues, Pipes, Arrays etc. from Python’s multiprocessing module here. In addition, to make it faster, they have added a method share_memory_(), which moves data into a state where any process can use it directly, so passing that data as an argument to different processes won’t make a copy of it.
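For example, sharing a tensor is a single in-place call (the size is arbitrary):

```python
import torch

t = torch.zeros(100)
t.share_memory_()  # moves the underlying storage to shared memory, in place
t.is_shared()      # True: other processes can now access it without a copy
```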

You can share Tensors, model’s parameters, and you can share them on CPU or GPU as you like.

You can use the methods described in the “Pool and Process” section here, and to get more speedup you can use the share_memory_() method to share a Tensor (say) among all processes without it being copied.
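A minimal Hogwild-style sketch along the lines of the PyTorch docs, where several processes train the same shared model (the model, data and hyper-parameters are placeholders):

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def train(model):
    # Each worker updates the *same* shared parameters (Hogwild-style).
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    for _ in range(100):
        x, y = torch.randn(32, 10), torch.randn(32, 2)  # stand-in for real data loading
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

if __name__ == '__main__':
    model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), nn.Linear(100, 2))
    model.share_memory()  # parameters move to shared memory, visible to all workers
    workers = [mp.Process(target=train, args=(model,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```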

You can also work with a cluster of machines. For more info see here.

9. References:

  1. https://documen.tician.de/pycuda/
  2. https://pytorch.org/docs/stable/notes/cuda.html
  3. https://discuss.pytorch.org/t/how-to-check-if-model-is-on-cuda
  4. https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
  5. https://medium.com/@iliakarmanov/multi-gpu-rosetta-stone-d4fa96162986
