How does sparse convolution work?

A quite different convolution concept and GPU computation scheme.

Zhiliang Zhou
Towards Data Science


The way to Forstbotanischer Garten Tharandt, 2020. Photo by my friend Yan Wang, used with permission.

1. Introduction and Related Works

Sparse convolution plays an essential role in LiDAR signal processing. This article describes how sparse convolution works, and how its concept and GPU computation scheme differ from those of traditional convolution.

In this article, the theory part is based on the paper “3D Semantic Segmentation with Submanifold Sparse Convolutional Networks” [1]. The implementation part, i.e. SpConv, is based on the paper “SECOND: Sparsely Embedded Convolutional Detection” [2]. In [3], the authors give a more general discussion of sparse convolution.

2. Motivation

Convolutional neural networks (CNNs) have proven very effective for 2D image signal processing. For 3D point cloud signals, however, the extra dimension Z increases the computational cost significantly. At the same time, unlike regular images, most voxels of a 3D point cloud are empty, which makes voxelized point cloud data a sparse signal.

Fig.1 (Left): a 3D grid with sparse voxels, by Xuesong Li [3]. (Right): an image with sparse pixels, where dark gray pixels are all zeros and light gray pixels stand for nonzero data points; the red square stands for the convolution kernel, by Graham et al. [1].

The question is whether we can compute the convolution efficiently using only the sparse data, instead of scanning all image pixels or spatial voxels.

One intuitive idea: regular image signals are stored as matrices or tensors, and the corresponding convolution is computed as dense matrix multiplication. Sparse signals, however, are normally represented as data lists and index lists. We can therefore develop a special convolution schema that exploits this sparse representation.

3. Sparse Convolution Model

In short, traditional convolution uses FFT or im2col [5] to build the computational pipeline, whereas sparse convolution collects all atomic operations with respect to the convolution kernel elements and saves them in a Rulebook, which serves as the instruction set for the computation.

Below is an example that explains how sparse convolution works.

3.1 Input Data Model

To explain the concept of sparse convolution step by step and keep it easy to read, I use 2D sparse image processing as an example. Since sparse signals are represented with a data list and an index list, there is no essential difference between 2D and 3D sparse signals.

Fig.2 an example sparse image, all elements are zero except P1, P2. Image by Author

As Fig.2 illustrates, we have a 5x5 image with 3 channels. All pixels are (0, 0, 0) except two points, P1 and P2. Such nonzero elements are also called active input sites, following [1]. In dense form, the input tensor has shape [1x3x5x5] in NCHW order. In sparse form, the data list is [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]] and the index list is [[1, 2], [2, 3]] in YX order.
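
To make this concrete, below is a minimal NumPy sketch of the conversion from dense to sparse form. The variable names (dense, indices, features) are my own for illustration, not from any particular library.

import numpy as np

# Dense input: batch=1, channels=3, height=5, width=5 (NCHW order)
dense = np.zeros((1, 3, 5, 5), dtype=np.float32)
dense[0, :, 1, 2] = 0.1  # P1 at (y=1, x=2)
dense[0, :, 2, 3] = 0.2  # P2 at (y=2, x=3)

# Sparse form: an index list and a data list (YX order)
ys, xs = np.nonzero(np.abs(dense[0]).sum(axis=0))  # active input sites
indices = np.stack([ys, xs], axis=1)   # [[1, 2], [2, 3]]
features = dense[0, :, ys, xs]         # [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]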

3.2 Convolution Kernel

Fig.3 an example kernel. Image by Author

The convolution kernel of sparse convolution is the same as that of traditional convolution. Fig.3 shows an example with kernel size 3x3; the dark and light colors stand for two filters. In this example, we use the following convolution parameters.

conv2D(kernel_size=3, out_channels=2, stride=1, padding=0)
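
The weights of such a layer form a 4D tensor of shape (out_channels, in_channels, kH, kW), here (2, 3, 3, 3). One useful way to look at it, which section 4 relies on: each kernel element is itself a small matrix. A sketch with hypothetical names and random weights:

import numpy as np

# Weights for conv2D(kernel_size=3, out_channels=2, stride=1, padding=0)
# applied to 3 input channels: shape (out_channels, in_channels, kH, kW).
rng = np.random.default_rng(0)
weights = rng.standard_normal((2, 3, 3, 3)).astype(np.float32)
bias = np.zeros(2, dtype=np.float32)

# Each kernel element (dy, dx) is a (2 x 3) matrix that maps an input
# feature vector to an output contribution, e.g. the element at (-1, -1):
w_elem = weights[:, :, 0, 0]  # shape (2, 3)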

3.3 Output Definitions

The output of sparse convolution is quite different from that of traditional convolution. Sparse convolution has two kinds of output definitions [1]. One is the regular output definition: just like ordinary convolution, an output site is computed whenever the kernel covers an active input site. The other is the submanifold output definition: an output site is computed only when the kernel center covers an active input site.

Fig.4 Two kinds of output definition. Image by Author

Fig.4 illustrates the difference between the two kinds of outputs. A1 stands for the active output sites that result from P1; similarly, A2 stands for the active output sites calculated from P2. A1A2 stands for the active output sites where the outputs from P1 and P2 are summed. Dark and light colors stand for different output channels.

In dense form, the output tensor therefore has shape [1x2x3x3] in NCHW order. In sparse form, the output consists of two lists, a data list and an index list, similar to the input representation.
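
Both output definitions can be computed directly on the index list. Below is a minimal sketch (my own helper function, not library code), assuming stride 1 and padding 0 as above; output sites are given in input coordinates, so the 3x3 output grid occupies positions 1..3.

def active_output_sites(in_sites, h, w, k=3, submanifold=False):
    """Active output sites of a k x k kernel with stride 1, padding 0."""
    if submanifold:
        # Submanifold: output only where the kernel CENTER covers an input.
        return set(in_sites)
    out, r = set(), k // 2
    for (y, x) in in_sites:
        # Regular: every valid output position whose window covers (y, x).
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                oy, ox = y + dy, x + dx
                if r <= oy < h - r and r <= ox < w - r:
                    out.add((oy, ox))
    return out

p_in = [(1, 2), (2, 3)]  # P1, P2 in YX order
print(len(active_output_sites(p_in, 5, 5)))                    # regular: 8
print(len(active_output_sites(p_in, 5, 5, submanifold=True)))  # submanifold: 2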

4. Computation Pipeline

Traditional convolution normally uses im2col [5] to rewrite the convolution as a dense matrix multiplication problem. Sparse convolution [1] instead uses a Rulebook to schedule all atomic operations.

4.1 Build the Hash Tables

The first step is to build the hash tables.

Fig.5 build hash tables for active input sites and active output sites. P_out is the position index. Image by Author

In Fig.5, the input hash table stores all active input sites. Next, we determine all possible active output sites, using either the regular or the submanifold output definition. Finally, the output hash table records all involved active output sites.

Note that my naming convention in Fig.5 is not very clear: the column labeled v acts more like a hash key, and the column labeled key acts more like a hash value. However, neither contains duplicate elements, so which one serves as the key is up to the program.
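
In Python, the two hash tables are naturally dictionaries that map a site's position to its row index in the data list. A minimal sketch with hypothetical names, using the submanifold definition for brevity (so the active output sites coincide with the active input sites):

# Input hash table: position tuple -> row index in the feature list
in_sites = [(1, 2), (2, 3)]  # P1, P2 in YX order
in_table = {pos: i for i, pos in enumerate(in_sites)}

# Output hash table: with the submanifold definition, the active output
# sites are exactly the active input sites.
out_sites = sorted(set(in_sites))
out_table = {pos: i for i, pos in enumerate(out_sites)}
print(in_table)   # {(1, 2): 0, (2, 3): 1}
print(out_table)  # {(1, 2): 0, (2, 3): 1}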

4.2 Build the Rulebook

The second step is to build the Rulebook. This is the key part of sparse convolution. The purpose of the Rulebook is similar to that of im2col [5]: it converts the convolution from its mathematical form into an efficient, programmable form. But unlike im2col, the Rulebook collects all atomic operations involved in the convolution and associates them with the corresponding kernel elements.

Fig.6 build rule book to schedule all atomic operations based on kernel index. Image by Author

Fig.6 shows an example of how to build the Rulebook. Pᵢₙ holds the input site indices; in this example, we have two nonzero data points, at positions (2, 1) and (3, 2). Pₒᵤₜ holds the corresponding output site indices. We then collect the atomic operations from the convolution calculation process, i.e., we treat the convolution as many atomic operations with respect to the kernel elements. Finally, we record all atomic operations in the Rulebook. The table on the right of Fig.6 is an example Rulebook: the first column is the kernel element index, the second column counts and indexes the atomic operations this kernel element is involved in, and the third and fourth columns hold the source and the result of each atomic operation.
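
Continuing the sketch from section 4.1, a Rulebook can be represented as a dictionary that maps each kernel offset to a list of (input row, output row) pairs. This is my own data layout for illustration, not the exact layout used in [2], and I fix a sign convention here (kernel element (dy, dx) sends input (y, x) to output (y - dy, x - dx), the cross-correlation convention), which may differ from the figure's.

from collections import defaultdict

def build_rulebook(in_table, out_table, k=3):
    """Map each kernel offset (dy, dx) to (input row, output row) pairs."""
    rulebook = defaultdict(list)
    r = k // 2
    for (y, x), i in in_table.items():
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                # Kernel element (dy, dx) reads the input at (y, x) and
                # contributes to the output site (y - dy, x - dx), if active.
                j = out_table.get((y - dy, x - dx))
                if j is not None:
                    rulebook[(dy, dx)].append((i, j))
    return rulebook

rb = build_rulebook(in_table, out_table)
# e.g. rb[(0, 0)] == [(0, 0), (1, 1)]: both sites map onto themselves,
# and rb[(-1, -1)] == [(0, 1)]: P1 contributes to the output at P2's site.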

4.3 Computation Pipeline for GPU

The last step is an overview of the calculation pipeline for the GPU.

Fig.7 Calculation of sparse convolution. Image by Author

As Fig.7 illustrates, we do not compute the convolution with a sliding window; instead, we execute all atomic operations according to the Rulebook. In Fig.7, the red and blue arrows indicate two calculation instances. The red arrow processes the first atomic operation for kernel element (-1, -1). From the Rulebook, we know this atomic operation takes its input from P1 at position (2, 1) and writes its output at position (2, 1). The entire atomic operation can be constructed as in Fig.8.

Fig. 8 Atomic Operation. Image by Author

Similarly, the blue arrow indicates another atomic operation that shares the same output site. The results of the red-arrow and blue-arrow instances can be added together.

However, in practice, the neural network computes a sum of linear transformations followed by an activation function, i.e. f(∑ wᵢ xᵢ + b). So we do not really need the hash table for the active output sites during the computation; we just sum all atomic operation results that land on the same output site.

The computations scheduled by the Rulebook can be launched in parallel on the GPU.
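
Tying the previous sketches together, the whole pipeline reduces to one small matrix multiplication per kernel element: gather the input rows listed in the Rulebook, multiply them by that element's (out_channels x in_channels) matrix, and scatter-add the results into the output rows. Again, the names are mine for illustration, not the SpConv Lite API.

import numpy as np

def sparse_conv(features, rulebook, weights, bias, n_out):
    """Execute the Rulebook: gather -> per-element GEMM -> scatter-add."""
    r = weights.shape[2] // 2
    out = np.tile(bias, (n_out, 1))  # start every output site from the bias
    for (dy, dx), ops in rulebook.items():
        in_idx = [i for i, _ in ops]    # gather rows of the input data list
        out_idx = [j for _, j in ops]
        w = weights[:, :, dy + r, dx + r]  # (out_ch, in_ch) kernel element
        np.add.at(out, out_idx, features[in_idx] @ w.T)  # scatter-add
    return out  # the output data list; out_table holds the index list

out_features = sparse_conv(features, rb, weights, bias, len(out_table))

On a GPU, the gathered multiplications run in parallel; concurrent scatter-adds to the same output site are typically handled with atomic additions. The activation f is applied afterwards to the accumulated sums.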

5. Conclusion and Summary

Sparse convolution is quite efficient because we do not need to scan all pixels or spatial voxels; we calculate the convolution only for the nonzero elements. Instead of im2col, we use a Rulebook to rewrite the sparse convolution as a compact linear transformation problem, in other words a compact matrix multiplication problem. The one additional cost is building the Rulebook, but fortunately this step can also be parallelized on the GPU.

Finally, I wrote the SpConv Lite library [4], derived from [2]. It is implemented in CUDA and packaged as a Python library. I also built a real-time LiDAR object detector [6] based on SpConv Lite.

Real-time 3D voxel-based LiDAR object detector [6] with SpConv Lite [4] on the KITTI dataset [7]. Green boxes are cars, red boxes are pedestrians, yellow boxes are cyclists, and blue boxes are vans. Results and video both owned by the author.

I wrote this article while my home city, Dresden, was under lockdown. Merry Christmas 2020 and Happy New Year 2021. I hope everyone stays healthy, and that 2021 will be a better year.

References

[1] Graham, Benjamin, Martin Engelcke, and Laurens van der Maaten. “3D Semantic Segmentation with Submanifold Sparse Convolutional Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[2] Yan, Yan, Yuxing Mao, and Bo Li. “SECOND: Sparsely Embedded Convolutional Detection.” Sensors 18.10 (2018): 3337.
[3] Li, Xuesong, et al. “Three-dimensional Backbone Network for 3D Object Detection in Traffic Scenes.” arXiv preprint arXiv:1901.08373 (2019).
[4] SpConv Lite, by the author. https://github.com/masszhou/spconv_lite
[5] im2col: convolution implemented as matrix multiplication. https://cs231n.github.io/convolutional-networks/
[6] Real-time 3D voxel-based LiDAR object detector, by the author. https://github.com/masszhou/second_lite
[7] Geiger, Andreas, Philip Lenz, and Raquel Urtasun. “Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite.” 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012.
