LiDAR point-cloud based 3D object detection implementation with Colab {Part 1 of 2}

Gopalakrishna Adusumilli
Towards Data Science
6 min read · Sep 12, 2020


In this article, we will understand the concepts required to implement the VoxelNet algorithm for 3D vehicle detection using KITTI LiDAR point-cloud data.

Environmental perception plays an integral role in building autonomous vehicles, self-navigating robots, and other real-world applications. Sensors like cameras, RADAR, and LiDAR are used to perceive a 360° view of the environment. The data obtained from these sensors is interpreted to detect static and dynamic objects such as vehicles, trees, and pedestrians.

The need for 3D object detection

State-of-the-art computer-vision techniques detect objects with high accuracy in real time on 2D data such as images and video (sequences of image frames). But using a camera sensor for tasks like localization, measuring the distance between objects, and calculating depth information may not be effective and can be computationally expensive.

Pic credits: Engin Bozkurt, with the KITTI point cloud viewer

LiDAR is one of the prominent sensors that provides 3D information about objects in the form of a point cloud, which can be used to localize objects and characterize their shapes.

Recently, many state-of-the-art 3D object detectors like VeloFCN, 3DOP, 3D YOLO, PointNet, PointNet++, and many more have been proposed for 3D object detection. In this article, we shall discuss VoxelNet, a 3D object detection algorithm that has outperformed all of the above-mentioned state-of-the-art models*.

VoxelNet: End-to-end learning for point cloud-based 3D object detection.

Authors: Yin Zhou, Oncel Tuzel - Apple Inc

The PDF of the paper can be downloaded from https://arxiv.org/abs/1711.06396. Date of publishing: 17 Nov 2017.

Challenges addressed in VoxelNet

  • Replacing manual feature extraction: In manual feature extraction, the point clouds are projected into a top view, and image-based feature extraction methods are then applied for detection. But these techniques cause an information bottleneck and cannot extract the 3D shape information required for the detection task.

In this paper, the Feature Learning Network (FLN), a machine-learned feature extractor, is introduced to effectively extract 3D shape information.

  • Reducing computation and memory constraints: Voxel grouping and a random sampling technique are used so that a fixed number T of points is processed from voxels containing more than T points.
  • End-to-end 3D detection architecture: The network simultaneously learns the feature representation from raw point-cloud data and predicts accurate 3D bounding boxes in an end-to-end fashion.

VoxelNet in a nutshell: “Divide the point cloud into equally spaced 3D voxels, encode each voxel via stacked VFE layers into a vector, then aggregate (combine) the local voxel features using 3D convolutional layers, transforming the point cloud into a high-dimensional volumetric representation. Finally, a modified RPN takes in the volumetric representation and provides the detection result.”

VoxelNet architecture

The VoxelNet architecture mainly contains three blocks:

  1. Feature learning network
  2. Convolutional middle layers
  3. Region proposal network
VoxelNet architecture: a) the input point cloud is subdivided into equally spaced voxels; b) the feature learning network transforms the group of points in each voxel into a new feature representation, a sparse 4D tensor; c) 3D convolution; d) the RPN draws 3D bounding boxes. Pic credits: https://arxiv.org/abs/1711.06396

Feature learning network

The feature learning network extracts descriptive features from the voxel grid by processing the individual points in each voxel to obtain point-wise features, and then combining these point-wise features with a locally aggregated feature. The feature learning network is applied to all the voxels containing more than T points.

Voxel partition: Subdivide the 3D space into equally spaced voxels

Voxel partition: Partitioning the 3D space with the voxels. Image by author

Here is the code snippet for voxel partition:

Note: The measurements defined for the voxel grid vary depending on the object class.

  • cfg.MAX_POINT_NUMBER is the point threshold T: voxels containing more than 35 points are randomly downsampled to 35 points before processing
  • The X, Y, Z min/max values define the extent of the 3D space in meters
  • VOXEL_X_SIZE (and the corresponding Y and Z sizes) indicates the fixed voxel grid dimensions
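Since the embedded snippet covers the configuration and the bucketing step, here is a minimal NumPy sketch of what they might look like. The numeric ranges are illustrative KITTI-style values for the car class, and the `voxelize` helper is a hypothetical name, not necessarily the author's exact code.

```python
import numpy as np

# Illustrative configuration (typical KITTI car-class values,
# not necessarily the author's exact settings).
class cfg:
    MAX_POINT_NUMBER = 35          # T: max points kept per voxel
    X_MIN, X_MAX = 0.0, 70.4       # 3D space extent in meters (forward)
    Y_MIN, Y_MAX = -40.0, 40.0     # left/right
    Z_MIN, Z_MAX = -3.0, 1.0       # up/down
    VOXEL_X_SIZE = 0.2             # fixed voxel dimensions in meters
    VOXEL_Y_SIZE = 0.2
    VOXEL_Z_SIZE = 0.4

def voxelize(points):
    """Assign each point [x, y, z, r] to an integer voxel coordinate."""
    # Keep only the points that fall inside the defined 3D space.
    mask = ((points[:, 0] >= cfg.X_MIN) & (points[:, 0] < cfg.X_MAX) &
            (points[:, 1] >= cfg.Y_MIN) & (points[:, 1] < cfg.Y_MAX) &
            (points[:, 2] >= cfg.Z_MIN) & (points[:, 2] < cfg.Z_MAX))
    points = points[mask]
    # Integer (i, j, k) voxel index for every remaining point.
    voxel_index = np.floor(
        (points[:, :3] - np.array([cfg.X_MIN, cfg.Y_MIN, cfg.Z_MIN])) /
        np.array([cfg.VOXEL_X_SIZE, cfg.VOXEL_Y_SIZE, cfg.VOXEL_Z_SIZE])
    ).astype(np.int64)
    return points, voxel_index
```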

Random sampling

Point clouds are bucketed into the voxel grids. Processing every point is computationally expensive and increases memory usage, which in turn increases the load on the compute device. To address this challenge, a fixed number T of points is randomly sampled from the voxel grids containing more than T points.

With this strategy (sketched in code below), we achieve:

  1. Computational savings
  2. A reduced imbalance of points between the voxels, which reduces sampling bias and adds more variation to training
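The sampling step itself is only a few lines; `sample_voxel`, continuing the NumPy sketch above, is a hypothetical helper name:

```python
def sample_voxel(voxel_points, T=cfg.MAX_POINT_NUMBER):
    """Randomly keep at most T points from one voxel's point set."""
    if voxel_points.shape[0] > T:
        keep = np.random.choice(voxel_points.shape[0], T, replace=False)
        return voxel_points[keep]
    return voxel_points  # voxels with <= T points are left unchanged
```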

Stacked voxel feature encoding

Feature learning network applied to an individual voxel. Pic credits: https://arxiv.org/abs/1711.06396

Point-wise input to the fully connected NN: We consider the voxel grids containing more than T points. Each point in a voxel is represented by 4 coordinates [x, y, z, r], where x, y, z are the spatial coordinates and r is the reflectance. We compute the local mean as the centroid of all the points within the voxel V. We then augment each point in the voxel with its offset from the local mean, obtaining the 7-dimensional point-wise input feature [x, y, z, r, x − vx, y − vy, z − vz].
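In code, this augmentation amounts to subtracting the voxel centroid and concatenating the offsets; `augment_points` below is a hypothetical helper continuing the NumPy sketch above:

```python
def augment_points(voxel_points):
    """Augment each point in a voxel with its offset from the local mean."""
    centroid = voxel_points[:, :3].mean(axis=0)    # (vx, vy, vz)
    offsets = voxel_points[:, :3] - centroid       # per-point offsets
    # 7-dim point-wise input: [x, y, z, r, x - vx, y - vy, z - vz]
    return np.concatenate([voxel_points, offsets], axis=1)
```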

Fully connected neural net (FCN): The point-wise input feature set is fed to a fully connected neural network so that the point-wise features can be aggregated to encode the shape of the surface contained within the voxel.

The FCN consists of a linear layer, batch normalization, and a ReLU.

Element-wise max pooling: Element-wise max pooling is applied across the point-wise features to obtain the locally aggregated feature.

Finally, point-wise concatenation is used to combine each point-wise feature with the locally aggregated feature.

Here is a VFE layer code snippet defining the point-wise input and the aggregated features, using batch normalization and max pooling:
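Below is a minimal PyTorch sketch of one VFE layer in the spirit of the description above; the class name `VFELayer` and the mask-based handling of zero-padded points are assumptions, not the author's exact code.

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """One voxel feature encoding (VFE) layer:
    linear + BN + ReLU -> element-wise max pool -> point-wise concat."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.units = out_channels // 2         # size of the point-wise feature
        self.fcn = nn.Linear(in_channels, self.units)
        self.bn = nn.BatchNorm1d(self.units)

    def forward(self, x, mask):
        # x:    (K voxels, T points, in_channels) augmented point-wise inputs
        # mask: (K, T, 1) float tensor, 0 for zero-padded entries
        K, T, _ = x.shape
        pointwise = torch.relu(
            self.bn(self.fcn(x.reshape(K * T, -1))).reshape(K, T, self.units))
        pointwise = pointwise * mask           # ignore padded points
        # element-wise max pooling -> locally aggregated feature (K, 1, units)
        aggregated = pointwise.max(dim=1, keepdim=True).values
        # point-wise concatenation -> (K, T, out_channels)
        return torch.cat([pointwise, aggregated.expand(-1, T, -1)], dim=2)
```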

Convolutional middle layers

These layers convert the voxel features into dense 4D feature maps and reduce the feature map size to one-fourth of the original using convolution, batch normalization, and ReLU.

ConvMD(cin, cout, k, s, p) represents an M-dimensional convolution operator, where cin and cout are the numbers of input and output channels, and k, s, and p are the kernel size, stride size, and padding size, respectively.

Code snippet showing the 3D convolutional layers:
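A minimal PyTorch sketch of these layers is shown below, using the three Conv3D block settings reported in the paper; the class and helper names are illustrative.

```python
class MiddleConvLayers(nn.Module):
    """Sketch of the convolutional middle layers: three
    ConvMD(cin, cout, k, s, p) blocks with M = 3, as listed in the paper."""

    def __init__(self):
        super().__init__()

        def conv3d_block(cin, cout, k, s, p):
            # Conv3D + BN + ReLU, matching ConvMD(cin, cout, k, s, p)
            return nn.Sequential(
                nn.Conv3d(cin, cout, k, stride=s, padding=p),
                nn.BatchNorm3d(cout),
                nn.ReLU(inplace=True))

        # The stride-2 steps along the vertical axis shrink the volume so
        # it can later be reshaped into a dense 2D feature map for the RPN.
        self.layers = nn.Sequential(
            conv3d_block(128, 64, 3, (2, 1, 1), (1, 1, 1)),
            conv3d_block(64, 64, 3, (1, 1, 1), (0, 1, 1)),
            conv3d_block(64, 64, 3, (2, 1, 1), (1, 1, 1)))

    def forward(self, x):
        # x: (batch, 128, D, H, W) dense feature volume from the FLN
        return self.layers(x)
```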

Region proposal network

The modified region proposal network has three blocks of fully convolutional layers. The first layer of each block downsamples the feature map by half via a convolution with a stride of 2, followed by a sequence of convolutions with a stride of 1 (×q means q applications of the filter). After each convolution layer, BN and ReLU operations are applied. The output of every block is then upsampled to a fixed size and the results are concatenated to construct a high-resolution feature map.

Finally, this feature map is mapped to the desired learning targets: (1) a probability score map and (2) a regression map.

Code snippet defining convolutional and deconvolution layers:
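Below is a minimal PyTorch sketch of the modified RPN. The block and filter sizes follow the RPN figure in the paper, while the helper names (`conv_block`, `deconv_block`) and the output heads are illustrative assumptions.

```python
class RPN(nn.Module):
    """Sketch of the modified RPN: three conv blocks, each upsampled by a
    deconvolution to a common size, concatenated, then mapped to a
    probability score map and a regression map."""

    def __init__(self, anchors_per_position=2):
        super().__init__()

        def conv_block(cin, cout, q):
            # one stride-2 downsampling conv, then q stride-1 convs
            layers = [nn.Conv2d(cin, cout, 3, 2, 1),
                      nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
            for _ in range(q):
                layers += [nn.Conv2d(cout, cout, 3, 1, 1),
                           nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)

        def deconv_block(cin, cout, k, s, p):
            return nn.Sequential(nn.ConvTranspose2d(cin, cout, k, s, p),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

        self.block1 = conv_block(128, 128, q=3)
        self.block2 = conv_block(128, 128, q=5)
        self.block3 = conv_block(128, 256, q=5)
        # upsample every block's output to the same spatial size
        self.up1 = deconv_block(128, 256, 3, 1, 1)   # keeps size
        self.up2 = deconv_block(128, 256, 2, 2, 0)   # x2 upsampling
        self.up3 = deconv_block(256, 256, 4, 4, 0)   # x4 upsampling
        # 1x1 heads on the 768-channel high-resolution feature map
        self.score_head = nn.Conv2d(768, anchors_per_position, 1)
        self.reg_head = nn.Conv2d(768, 7 * anchors_per_position, 1)

    def forward(self, x):
        # x: (batch, 128, H, W) feature map from the middle layers
        b1 = self.block1(x)            # H/2
        b2 = self.block2(b1)           # H/4
        b3 = self.block3(b2)           # H/8
        hires = torch.cat([self.up1(b1), self.up2(b2), self.up3(b3)], dim=1)
        # sigmoid on the score map yields per-anchor probabilities
        return self.score_head(hires), self.reg_head(hires)
```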

Modified RPN architecture. Pic credits: https://arxiv.org/abs/1711.06396

Thanks for reading!!!

In the next article, we will implement the VoxelNet code for 3D object detection.

Here are some of the results obtained by implementing the model in Colab:

Prediction results on the KITTI validation dataset

Special thanks:

Dr. Uma K Mudenagudi, KLE Technological University, for the project mentorship.

References

  1. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
  2. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
  3. A survey of the deep learning-based object detection
  4. KITTI raw dataset: Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun, “Vision meets Robotics: The KITTI Dataset,” International Journal of Robotics Research (IJRR), 2013
