
Speed Cubing for Machine Learning

Episode 3: Using Multi-GPU, Dask and CuPy.

Photo by Bryce Barker on Unsplash

Introduction

Previously on "Speed Cubing for Machine Learning"…

In episode 1 [1], we described how to generate 3D data as fast as possible to eventually feed some Generative Adversarial Networks, using CPUs, multithreading and Cloud resources. We reached a rate of 2 billion data points per second.

In episode 2 [2], we used a local GPU, a framework called RAPIDS, and libraries such as CuPy and VisPy. We went even faster, reaching almost 5 billion data points per second!

In this final episode, we will still focus on speed but, this time, we are going to perform actual computation with the help of several GPUs.

Technical Setup

All the experiments will be conducted via a JupyterLab notebook on a virtual machine (VM) instantiated in the Cloud. The VM has 16 x Intel(R) Xeon(R) vCPUs @2.30GHz, 60 GB of RAM and 8 x NVIDIA Tesla K80 GPUs (Figure 1). The VM is pre-configured with all the necessary NVIDIA drivers and CUDA 11.0 toolkit. The operating system is Linux (Debian 4.19 64-bit), with Python 3.7 installed.

Figure 1: NVIDIA Tesla K80, 24GB GDDR5, 4992 CUDA cores (Source: nvidia.com).

We will also need the following extra libraries and packages:

  • Dask: an open source library for parallel computing in Python [3]. Install it with: python -m pip install "dask[complete]".
  • CuPy: an open source array library accelerated with NVIDIA CUDA [4]. It is the GPU equivalent of NumPy. Install it with: pip install cupy-cuda110 (to match our CUDA version).
  • Dask CUDA: an experimental open source library hosted by rapidsai [5], which helps deploy and manage Dask workers on multi-GPU systems. Install it with: pip install dask-cuda.

Once installed, we can import the libraries like this:

import dask.array as da
import cupy
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

Let’s Get Started!

I must admit, it has been quite a journey to successfully run a multi-GPU numerical simulation. As a matter of fact, most of the online material focuses on built-in methods specialized for speeding up the training of deep learning models. When computations are performed outside this framework, however, I found thorough documentation hard to come by. Furthermore, multi-GPU computing is trickier than multi-CPU computing: memory allocation and data management can be difficult to handle. This is where Dask comes in.

Storing the data: the Dask arrays

Dask arrays coordinate many arrays (that are sufficiently "NumPy-like" in API, such as CuPy arrays), arranged into chunks within a grid (Figure 2). Cutting up large arrays into many small arrays lets us perform computations on arrays larger than the available memory, using all of our cores. Dask arrays benefit from multi-dimensional blocked algorithms and task scheduling.

Figure 2: Structure of a Dask array.

1st Experiment: Single CPU

We first generate a two-dimensional Dask array, where the data are random samples drawn from a Gaussian distribution. We set the mean to 10 and the standard deviation to 1. The distribution size is sz=100000 and the chunk size is ch=10000. This leads to an 80 GB Dask array containing 100 chunks, each an 800 MB NumPy array. When you print a Dask array in a notebook, Dask automatically displays a graphical representation of your data along with information about its volume, shape and type (Figure 3). This process is entirely "lazy", meaning that no data are loaded into memory yet. Note that Dask will need to perform 100 tasks to create that Dask array.
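
The article does not list the notebook code, but a minimal sketch consistent with the figures above, using Dask's random-state interface, would be:

import dask.array as da

sz, ch = 100000, 10000
rs = da.random.RandomState()
# 100000 x 100000 float64 samples: ~80 GB in 100 chunks of ~800 MB each
x = rs.normal(10, 1, size=(sz, sz), chunks=(ch, ch))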

Figure 3: Dask array of chunked random numpy arrays.

For our benchmark, we want to perform the following simple calculation: take a subset of the data, increment all values by one, then compute the mean.

Figure 4 shows information about the subset we are going to compute the mean on. We can see that Dask now needs to perform 300 tasks to obtain that subset, which is a 20 GB Dask array with 100 chunks, each a 200 MB NumPy array.

Figure 4: Subset of the original data (type is numpy.ndarray).
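
The exact subset is not shown in the article, but taking every other row and column is an assumption consistent with the 20 GB size and 200 MB chunks reported above. A sketch of the lazy expression:

y = x[::2, ::2]     # assumed subset: 50000 x 50000 -> ~20 GB, 100 chunks of ~200 MB
s = (y + 1).mean()  # increment and mean are still lazy: nothing is computed yet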

Now, the actual computation can be done using the .compute() method. We also set the scheduler parameter to 'single-threaded', so Dask will execute the tasks on a single CPU. The whole process takes 11 minutes and 41 seconds. The final value is, as expected, within 1.0e-5 of the shifted mean of 11 (Figure 5).

Figure 5: Computation on a single CPU: takes 701 seconds.
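
As a sketch, the single-CPU run looks like this:

# force execution on a single thread; blocks until done and returns a scalar
result = s.compute(scheduler='single-threaded')
print(result)  # close to the expected shifted mean of 11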

2nd Experiment: Multi-CPU

Going from a single CPU to many CPUs is pretty straightforward in Dask: all you have to do is change the scheduler parameter to 'threads'. This way, Dask will use all the available cores to perform a parallelized computation (Figure 6). This time, it only takes 55 seconds to complete the task using the 16 cores of our VM. The calculated mean is still within 1.7e-5 of the expected value of 11 (Figure 7).

Figure 6: The 16 cores in action (using the 'htop' command).
Figure 7: Computation on 16 cores: takes 55 seconds.
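
The corresponding sketch, with the threaded scheduler:

# the threaded scheduler fans the tasks out over all available cores
result = s.compute(scheduler='threads')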

3rd Experiment: Single GPU

It’s very easy to switch from CPU to GPU. There is just a little trick to know, and this is where the CuPy library comes in. In episode 2 [2], we moved our NumPy array onto our GPU device (we called this operation a data transfer). Here, we instead pass cupy.random.RandomState when building the Dask random state. This tells Dask to use chunks of CuPy arrays instead of NumPy arrays, ensuring that the computation will be done on the GPU (Figure 8). Once again, this is a "lazy" operation and no data are loaded into the video memory yet.

Figure 8: Dask array of chunked random cupy arrays.
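
A sketch of the GPU-backed array creation; dask.array.random.RandomState accepts an alternative RandomState implementation, which is how CuPy is plugged in:

import cupy
import dask.array as da

# same shape and chunking as before, but each chunk is now a cupy.ndarray on the GPU
rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.normal(10, 1, size=(sz, sz), chunks=(ch, ch))
s = (x[::2, ::2] + 1).mean()  # same lazy expression as before, now on GPU chunks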

The subset is exactly the same; only the data type of the chunks has changed: it is now cupy.ndarray instead of numpy.ndarray (Figure 9).

Figure 9: Subset of the data (type is cupy.ndarray).

For the actual computation, the same scheduler parameter as in the first experiment (single CPU) is used. On a single GPU, it takes nearly 12 seconds, almost 4.6 times faster than the 16 cores. The calculated mean is still within 1.8e-5 of the expected value (Figure 10).

Figure 10: Computation on a single GPU (1x Tesla K80): takes almost 12 seconds.

4th Experiment: Multi-GPU

For this last experiment, we first have to define a GPU-enabled cluster; this is where the Dask CUDA library comes in. We import LocalCUDACluster from the Dask CUDA library and set the argument CUDA_VISIBLE_DEVICES so that we can select specific GPUs. Then, we build a distributed client for that cluster (Figure 11 and the sketch below). To sum up, Dask creates two objects:

  • The Cluster: composed of 8 workers and 8 cores (all our GPUs).
  • The Client: accessible via a local URL (here, http://127.0.0.1:8787/status). You can monitor the whole process using a very rich dashboard.
Figure 11: The local CUDA cluster and its Dask distributed client.
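
A sketch of the cluster setup (the explicit device list is an assumption matching our 8-GPU VM):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# one Dask worker per GPU; CUDA_VISIBLE_DEVICES selects which devices to use
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7")
client = Client(cluster)  # registers the cluster as the default scheduler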

From there, we can execute the same expression as before (no need to specify the scheduler parameter here). The computation on the 8 GPUs is very fast and takes less than 2 seconds, a 6.26x speedup compared to a single GPU. The calculated mean is also within 2.2e-6 of the expected value (Figure 12).

Figure 12: Computation using the full set of GPUs (8x Tesla K80): takes only 2 seconds.
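
Once the client is created, Dask's distributed scheduler becomes the default, so the sketch reduces to a bare call:

result = s.compute()  # dispatched to the 8 GPU workers automatically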

Figure 13 depicts the GPU activity for our full set of GPUs. Note that the Tesla K80 is actually a dual-GPU board: its 24 GB of memory and 4992 CUDA cores are split between two GPU dies, and Google Cloud exposes each die as a separate GPU. This is why each of our GPUs has 12 GB of RAM instead of 24 GB, and "only" 2496 CUDA cores.

Figure 13: Tesla K80s are rockin.

Conclusion

Through simple numerical simulations on Dask arrays, we went from almost 12 minutes of computation time on a single CPU to a lightning speed of about 2 seconds on 8 Tesla K80 GPUs, hence a 360x speedup! These extraordinary gains in computing time come at no expense in accuracy. The experiments are largely inspired by a post from Matthew Rocklin [6], the original author of Dask and currently CEO of Coiled Computing. We wanted to reproduce the speed results for a simple approach before using multi-GPU systems for more complex tasks, such as the very fast training of the generative models we are working on. Multi-GPU computing may seem intimidating at first, but with the right tools it becomes quite easy.

This concludes our trilogy. I hope you enjoyed it, and I will be back soon.

Acknowledgments

I would like to thank Christophe, Matteo, Philippe and Perceval for their support during the realization of this study and for their feedback on the draft of this article.

References

[1] N. Morizet, "Speed Cubing for Machine Learning – Episode 1", Towards Data Science, September 15th, 2020.
[2] N. Morizet, "Speed Cubing for Machine Learning – Episode 2", Towards Data Science, November 20th, 2020.
[3] Dask: an open source library for parallel computing in Python.
[4] CuPy: an open source array library accelerated with NVIDIA CUDA.
[5] Dask CUDA: an experimental open source library helping the deployment and management of Dask workers in multi-GPU systems.
[6] M. Rocklin, "GPU Dask Arrays, first steps throwing Dask and CuPy together", matthewrocklin.com, 2019.

About Us

Advestis is a European Contract Research Organization (CRO) with a deep understanding and practice of statistics, and interpretable Machine Learning techniques. The expertise of Advestis covers the modeling of complex systems and predictive analysis for temporal phenomena. LinkedIn: https://www.linkedin.com/company/advestis/

