CUDA by Numba Examples
Part 2 of 4: Threading the Needle
Introduction
In the first installment of this series we discussed how to run embarrassingly parallel algorithms on the GPU. Embarrassingly parallel tasks are those made up of subtasks that are completely independent from each other, such as summing two arrays or applying any element-wise function.

In This Tutorial
Many tasks, although not embarrassingly parallel, can still benefit from parallelization. In this issue of CUDA by Numba Examples we will cover some common techniques for allowing threads to cooperate on a computation.
Click here to grab the code in Google Colab.
This tutorial is followed by two more parts: Part 3 and Part 4.
Getting Started
Import and load the libraries, and make sure you have a GPU.
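The original setup cell is not reproduced here; a minimal sketch covering what this part needs (NumPy, Numba, and a quick GPU check) could look like this:

```python
import math

import numpy as np
import numba
from numba import cuda

print(np.__version__)
print(numba.__version__)

# List the CUDA-capable devices Numba can see; make sure at least one shows up
cuda.detect()
```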
Thread Cooperation
Simple Parallel Reduction Algorithm
We’ll start this section with a very simple problem: summing all elements of an array. Serially, this algorithm is extremely simple. Without resorting to NumPy, we could implement it as:
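Something along these lines (a sketch; the function name is mine):

```python
def sum_array(a):
    """Sum all elements of a 1D array, one element at a time."""
    s = 0.0
    for i in range(a.size):
        s += a[i]
    return s
```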
I know, this doesn’t look very Pythonic. But it does emphasize that s is keeping track of all elements in the array. How can we parallelize this algorithm if s depends on every element of the array? Well, first, we need to rewrite the algorithm to allow for some parallelization. If there are parts that we cannot parallelize, we should allow threads to communicate with each other.
So far, however, we have not learned how to make threads talk to each other… in fact, we stated previously that threads in different blocks do not communicate. We could consider launching only one block, but remember that a block can have at most 1024 threads on most GPUs!
How do we overcome this? Well, what if we split our array into chunks of 1024 (or an appropriate threads_per_block) and sum each of those chunks separately? Then, at the end, we can just sum the results of each chunk. Figure 2.1 shows a very simple example of this for a 2-chunk split.

How do we do this on the GPU? First we need to split the array into chunks. Each chunk will correspond to a block with a fixed number of threads. Within each block, each thread may sum more than one array element (a grid-stride loop). Then, we must sum these per-thread values over the entire block. This part requires the threads to communicate. We’ll go over how to do that in the next example.
Since we are parallelizing over blocks, the output of the kernel should have one element per block. To finalize the reduction, we copy this array to the CPU and finish the job there.
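A sketch of this naive per-block reduction, using the imports from Getting Started (the kernel name, threads_per_block, and blocks_per_grid are my choices, not necessarily those in the notebook). Note that the final sum within each block is done by a single thread, which is exactly what we improve later:

```python
threads_per_block = 256   # must be a compile-time constant for the shared array
blocks_per_grid = 32 * 20

@cuda.jit
def partial_reduce(array, partial_reduction):
    i_start = cuda.grid(1)
    threads_per_grid = cuda.blockDim.x * cuda.gridDim.x

    # Grid-stride loop: each thread accumulates its own partial sum
    s_thread = 0.0
    for i_arr in range(i_start, array.size, threads_per_grid):
        s_thread += array[i_arr]

    # Shared array: lets threads within the same block communicate
    s_block = cuda.shared.array((threads_per_block,), numba.float32)
    tid = cuda.threadIdx.x
    s_block[tid] = s_thread
    cuda.syncthreads()

    # Naive final step: a single thread sums the whole shared array
    if tid == 0:
        s = 0.0
        for i in range(threads_per_block):
            s += s_block[i]
        partial_reduction[cuda.blockIdx.x] = s


# One output element per block; the last step of the reduction runs on the CPU
N = 10_000_000
a = np.ones(N, dtype=np.float32)
dev_a = cuda.to_device(a)
dev_partial = cuda.device_array((blocks_per_grid,), dtype=np.float32)

partial_reduce[blocks_per_grid, threads_per_block](dev_a, dev_partial)
s = dev_partial.copy_to_host().sum()
```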
WARNING: The shared array must:
- _Be "small". The exact size depends on the compute capability of the GPU, typically between 48 KB and 163 KB. See the item "Maximum amount of shared memory per thread block" in this table._
- _Have a known size at compile time (which is why we size our shared array with threads_per_block and not blockDim.x). It is true that we can always define a factory function to produce shared arrays of any size… but be wary of the compile time of these kernels._
- _Have its dtype specified with a Numba type, not a NumPy type (don’t ask me why!)._

On the Google Colab instance where I ran this, we get a 10x speedup. Pretty nice!
A Better Parallel Reduction Algorithm
You may be wondering now why we called everything "naive". That implies there is a non-naive way of performing the same computation. In fact, there are a bunch of tricks to speed up this kind of code (see the great Optimizing Parallel Reduction in CUDA presentation for benchmarks).
Before we show a better way of doing this, let us recall the last bit of the kernel:
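In the sketch from the previous section, that last bit was the block where thread 0 does all the remaining work:

```python
    # Final step of the naive kernel: a single thread sums the shared array
    if tid == 0:
        s = 0.0
        for i in range(threads_per_block):
            s += s_block[i]
        partial_reduction[cuda.blockIdx.x] = s
```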
We parallelized almost everything, but at the end of the kernel we are making a single thread sum all of the threads_per_block elements of the shared array s_block. Why don’t we parallelize this sum as well?
Sounds good, but how? Figure 2.2 shows how one can achieve this for a threads_per_block size of 16. We start with 8 threads working: the first will sum the values in s_block[0] and s_block[8]; the second, the values in s_block[1] and s_block[9]; and so on, until the last thread, which will sum the values in s_block[7] and s_block[15].
On the next step, only the first 4 threads need to work. The first thread will sum s_block[0] and s_block[4]; the second, s_block[1] and s_block[5]; the third, s_block[2] and s_block[6]; and the fourth and last, s_block[3] and s_block[7].
On the third step, we now only need 2 threads, to take care of the first 4 elements of s_block. The fourth and final step will use one thread to sum 2 elements.
Since the work has been divided between the threads, it is parallelized. Sure, it is not an equal division among the threads, but it is an improvement. Computationally, this algorithm is O(log2(threads_per_block)), whereas the first algorithm is O(threads_per_block). In our example, that’s 1024 operations for the naive algorithm vs. only 10 for the improved algorithm!
There is one final detail. At every step, we need to ensure that all threads have written to the shared array. So we have to call cuda.syncthreads().
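Putting it together, a sketch of the improved kernel (again with my own names, reusing threads_per_block from before) could be:

```python
@cuda.jit
def partial_reduce_better(array, partial_reduction):
    i_start = cuda.grid(1)
    threads_per_grid = cuda.blockDim.x * cuda.gridDim.x

    # Same grid-stride accumulation as before
    s_thread = 0.0
    for i_arr in range(i_start, array.size, threads_per_grid):
        s_thread += array[i_arr]

    s_block = cuda.shared.array((threads_per_block,), numba.float32)
    tid = cuda.threadIdx.x
    s_block[tid] = s_thread
    cuda.syncthreads()

    # Pairwise reduction: halve the number of working threads at each step
    i = cuda.blockDim.x // 2
    while i > 0:
        if tid < i:
            s_block[tid] += s_block[tid + i]
        cuda.syncthreads()  # every thread must reach this call (see warning below)
        i //= 2

    if tid == 0:
        partial_reduction[cuda.blockIdx.x] = s_block[0]
```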

On my machine, this is about 25% faster than the naive method.
WARNING: You may be tempted to move syncthreads inside the if block, since after every step the threads beyond half of the current count are not used. However, doing this makes the CUDA threads that called syncthreads stop and wait for all the others, while all the other ones simply keep going. Consequently, the stopped threads would be forever waiting for threads that will never synchronize. The takeaway is: if you synchronize threads, ensure that cuda.syncthreads() is called in all threads.
    i = cuda.blockDim.x // 2
    while (i > 0):
        if (tid < i):
            s_block[tid] += s_block[tid + i]
            cuda.syncthreads()  # don't put it here...
        cuda.syncthreads()      # ...put it here instead
        i //= 2
Numba Reduce
Because the above reduction algorithm is non-trivial, Numba offers a convenience cuda.reduce decorator that converts a binary function into a reduction. The long and complex algorithm above can then be substituted by:
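For example (the name reduce_sum is my choice):

```python
@cuda.reduce
def reduce_sum(a, b):
    # The binary operation that cuda.reduce turns into a full reduction
    return a + b

dev_a = cuda.to_device(np.ones(10_000_000, dtype=np.float32))
total = reduce_sum(dev_a)  # by default, the result is copied back to the host
```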
Personally, I have found that hand-written reductions are generally much faster (at least 2x), but Numba reductions are dead easy to use. With that said, I would encourage you to read over the reduction code in the Numba source code.
It is also important to note that, by default, the reduction copies the result to the host, which forces a synchronization. To avoid this, you can call the reduction with a device array as output:
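Assuming the res keyword of Numba's Reduce object, which writes the result into the first element of a device array, a sketch:

```python
dev_out = cuda.device_array((1,), dtype=np.float32)
reduce_sum(dev_a, res=dev_out)      # result stays on the GPU, no forced copy to host
total = dev_out.copy_to_host()[0]   # copy explicitly only when you actually need it
```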
2D Reduction Example
The parallel reduction technique is great, but it is not obvious how it can be extended to higher dimensions. While we can always call the Numba reduction with an unraveled array (array2d.ravel()), it is important to understand how we can reduce multidimensional arrays manually.
In this example, we will combine what we learned about 2D kernels with what we learned about 1D reductions to compute a 2D reduction.
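A sketch of such a kernel, combining a 2D grid-stride loop with the 1D block reduction from before (names, block sizes, and launch parameters are mine):

```python
threads_per_block_2d = (16, 16)   # 256 threads per block
shared_len = 16 * 16              # 1D shared array length, a compile-time constant

@cuda.jit
def reduce_2d(array2d, partial_reduction):
    ix, iy = cuda.grid(2)
    threads_per_grid_x, threads_per_grid_y = cuda.gridsize(2)

    # 2D grid-stride loop: each thread accumulates its own partial sum
    s_thread = 0.0
    for i0 in range(iy, array2d.shape[0], threads_per_grid_y):
        for i1 in range(ix, array2d.shape[1], threads_per_grid_x):
            s_thread += array2d[i0, i1]

    # Flatten the 2D thread index into a 1D index of the shared array
    s_block = cuda.shared.array((shared_len,), numba.float32)
    tid = cuda.threadIdx.x + cuda.blockDim.x * cuda.threadIdx.y
    s_block[tid] = s_thread
    cuda.syncthreads()

    # Same pairwise block reduction as in the 1D case
    i = shared_len // 2
    while i > 0:
        if tid < i:
            s_block[tid] += s_block[tid + i]
        cuda.syncthreads()
        i //= 2

    # One output element per (linearly indexed) block
    if tid == 0:
        i_block = cuda.blockIdx.x + cuda.gridDim.x * cuda.blockIdx.y
        partial_reduction[i_block] = s_block[0]


blocks_per_grid_2d = (64, 64)
array2d = cuda.to_device(np.ones((10_000, 10_000), dtype=np.float32))
dev_partial = cuda.device_array((64 * 64,), dtype=np.float32)

reduce_2d[blocks_per_grid_2d, threads_per_block_2d](array2d, dev_partial)
s = dev_partial.copy_to_host().sum()   # finish the reduction on the CPU
```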
Device functions
So far we have only talked about kernels, which are special GPU functions that launch threads. Kernels often rely on smaller functions that are defined on the GPU and only have access to GPU arrays. These are called device functions. Unlike kernels, device functions can return values.
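As a minimal sketch (the names are mine), a device function is declared with device=True and can then be called from a kernel or from another device function:

```python
@cuda.jit(device=True)
def squared_diff(a, b):
    # Unlike a kernel, a device function can return a value
    d = a - b
    return d * d

@cuda.jit
def pairwise_sq_diff(x, y, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = squared_diff(x[i], y[i])
```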
To end this part of the tutorial, we will show an example of using a device function across different kernels. The example will also stress how important it is to synchronize threads when using shared arrays.
INFO: In newer versions of CUDA, it is possible for kernels to launch other kernels. This is called dynamic parallelism and is not yet supported by Numba CUDA.
2D Shared Array Example
In this example, we will create a ripple pattern in a fixed-size array. We first need to declare the number of threads we will use, as this is required by the shared array.
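A sketch along those lines (the ripple device function and the sizes are my own, building on the imports from Getting Started). Each thread writes one element of the shared array and then reads an element written by a different thread, which is why the barrier in the middle matters:

```python
threads_16 = 16  # block edge; the shared array size must be known at compile time

@cuda.jit(device=True)
def ripple(ix, iy):
    # Device function computing a simple ripple amplitude at (ix, iy)
    return math.sin(2 * math.pi * ix / 32) * math.sin(2 * math.pi * iy / 32)

@cuda.jit
def ripple_2d(array2d):
    ix, iy = cuda.grid(2)
    tix, tiy = cuda.threadIdx.x, cuda.threadIdx.y

    # 2D shared array sized with compile-time constants
    shared = cuda.shared.array((threads_16, threads_16), numba.float32)
    shared[tiy, tix] = ripple(ix, iy)

    # Without this barrier, the read below could see values that other
    # threads have not written yet
    cuda.syncthreads()

    # Each thread reads an element written by a *different* thread
    array2d[iy, ix] = shared[threads_16 - 1 - tiy, threads_16 - 1 - tix]


# Launch on an array whose shape is an exact multiple of the block size,
# so no bounds checks are needed
array2d = cuda.device_array((256, 256), dtype=np.float32)
blocks = (array2d.shape[1] // threads_16, array2d.shape[0] // threads_16)
ripple_2d[blocks, (threads_16, threads_16)](array2d)
result = array2d.copy_to_host()
```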

Conclusion
In this tutorial you learned how to develop kernels that use a reduction pattern to process 1D and 2D arrays. In the process, we learned how to exploit shared arrays and device functions.