LiDAR point-cloud based 3D object detection implementation with Colab {Part 1 of 2}

Gopalakrishna Adusumilli
Towards Data Science
6 min read · Sep 12, 2020


In this article, we will understand the concepts required to implement the VoxelNet algorithm for 3D vehicle detection using KITTI LiDAR point-cloud data.

Environmental perception plays an integral role in building autonomous vehicles, self-navigating robots, and other real-world applications. Sensors like cameras, RADAR, and LiDAR are used to perceive a 360° view of the environment. The data obtained from these sensors is interpreted to detect static and dynamic objects such as vehicles, trees, and pedestrians.

The need for 3D object detection

State-of-the-art computer-vision techniques detect objects with high accuracy in real time on 2D data such as images and video (sequences of image frames). But using a camera sensor for tasks like localization, measuring the distance between objects, and calculating depth information may not be effective and can be computationally expensive.

Pic credits: Engin Bozkurt, with the KITTI point cloud viewer

LiDAR is one of the prominent sensors that provides 3D information about objects in the form of a point cloud, which can be used to localize objects and characterize their shapes.

Recently, many state-of-the-art 3D object detectors like VeloFCN, 3DOP, 3D YOLO, PointNet, PointNet++, and many more have been proposed for 3D object detection. In this article, we shall discuss VoxelNet, a 3D object detection algorithm that has outperformed all of the above-mentioned state-of-the-art models*.

VoxelNet: End-to-end learning for point cloud-based 3D object detection.

Authors: Yin Zhou, Oncel Tuzel - Apple Inc

The PDF of the paper can be downloaded from https://arxiv.org/abs/1711.06396. Date of publishing: 17 Nov 2017.

Challenges addressed in VoxelNet

  • Replacing manual feature extraction: In manual feature extraction, the point clouds are projected into a top view, and image-based feature extraction methods are then applied for detection. But these techniques cause an information bottleneck and cannot extract the 3D shape information required for the detection task.

In this paper, the Feature Learning Network (FLN), a machine-learned feature extractor, is introduced to effectively extract 3D shape information.

  • Reducing computation and memory constraints: Voxel grouping and a random sampling technique are used so that a fixed number T of points is processed from voxels containing more than T points.
  • End-to-end 3D detection architecture: The network simultaneously learns the feature representation from raw point-cloud data and predicts accurate 3D bounding boxes in an end-to-end fashion.

VoxelNet in a nutshell: “Divide the point cloud into equally spaced 3D voxels, encode each voxel via stacked VFE layers into a vector, then aggregate (combine) the local voxel features using 3D convolutional layers, transforming the point cloud into a high-dimensional volumetric representation. Finally, a modified RPN takes in the volumetric representation and provides the detection result.”

VoxelNet architecture

The VoxelNet architecture mainly contains three blocks:

  1. Feature learning network
  2. Convolutional middle layers
  3. Region proposal network
VoxelNet architecture: a) the input point cloud is subdivided into equally spaced voxels; b) the feature learning network transforms the group of points in each voxel into a new feature representation, a sparse 4D tensor; c) 3D convolution; d) the RPN draws 3D bounding boxes. Pic credits: https://arxiv.org/abs/1711.06396

Feature learning network

The feature learning network extracts descriptive features from the voxel grid by processing the individual points in each voxel to obtain point-wise features, and then combining these point-wise features with a locally aggregated feature. The feature learning network is applied to all the voxels containing more than T points.

Voxel partition: Subdivide the 3D space into equally spaced voxels

Voxel partition: Partitioning the 3D space with the voxels. Image by author

Here is the code snippet for voxel partition:

Note: The measurements defined for the voxel grid vary depending on the object class.

  • cfg.MAX_POINT_NUMBER is the point threshold T: voxels containing more than 35 points are randomly downsampled to 35 points before processing
  • The X, Y, Z min/max values define the extent of the 3D space in meters
  • VOXEL_X_SIZE (and the corresponding Y and Z sizes) indicates the fixed voxel grid dimensions
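Since the embedded snippet covers the configuration and the bucketing step, here is a minimal NumPy sketch of what they might look like. The numeric ranges are illustrative KITTI-style values for the car class, and the `voxelize` helper is a hypothetical name, not necessarily the author's exact code.

```python
import numpy as np

# Illustrative configuration (typical KITTI car-class values,
# not necessarily the author's exact settings).
class cfg:
    MAX_POINT_NUMBER = 35          # T: max points kept per voxel
    X_MIN, X_MAX = 0.0, 70.4       # 3D space extent in meters (forward)
    Y_MIN, Y_MAX = -40.0, 40.0     # left/right
    Z_MIN, Z_MAX = -3.0, 1.0       # up/down
    VOXEL_X_SIZE = 0.2             # fixed voxel dimensions in meters
    VOXEL_Y_SIZE = 0.2
    VOXEL_Z_SIZE = 0.4

def voxelize(points):
    """Assign each point [x, y, z, r] to an integer voxel coordinate."""
    # Keep only the points that fall inside the defined 3D space.
    mask = ((points[:, 0] >= cfg.X_MIN) & (points[:, 0] < cfg.X_MAX) &
            (points[:, 1] >= cfg.Y_MIN) & (points[:, 1] < cfg.Y_MAX) &
            (points[:, 2] >= cfg.Z_MIN) & (points[:, 2] < cfg.Z_MAX))
    points = points[mask]
    # Integer (i, j, k) voxel index for every remaining point.
    voxel_index = np.floor(
        (points[:, :3] - np.array([cfg.X_MIN, cfg.Y_MIN, cfg.Z_MIN])) /
        np.array([cfg.VOXEL_X_SIZE, cfg.VOXEL_Y_SIZE, cfg.VOXEL_Z_SIZE])
    ).astype(np.int64)
    return points, voxel_index
```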

Random sampling

Point clouds are bucketed into the voxel grids. Processing every point is computationally expensive and increases memory usage, which in turn increases the load on the compute device. To address this challenge, a fixed number T of points is randomly sampled from the voxel grids containing more than T points.

With this strategy (sketched in code below), we achieve:

  1. Computational savings
  2. A reduced imbalance of points between the voxels, which reduces sampling bias and adds more variation to training
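The sampling step itself is only a few lines; `sample_voxel`, continuing the NumPy sketch above, is a hypothetical helper name:

```python
def sample_voxel(voxel_points, T=cfg.MAX_POINT_NUMBER):
    """Randomly keep at most T points from one voxel's point set."""
    if voxel_points.shape[0] > T:
        keep = np.random.choice(voxel_points.shape[0], T, replace=False)
        return voxel_points[keep]
    return voxel_points  # voxels with <= T points are left unchanged
```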

Stacked voxel feature encoding

Feature learning network applied to an individual voxel. Pic credits: https://arxiv.org/abs/1711.06396

Point-wise input to the fully connected NN: We consider the voxel grids containing more than T points. Each point in a voxel is represented by 4 coordinates [x, y, z, r], where x, y, z are the spatial coordinates and r is the reflectance. We compute the local mean as the centroid of all the points within the voxel V. We then augment each point in the voxel with its offset from the local mean, obtaining the 7-dimensional point-wise input feature [x, y, z, r, x − vx, y − vy, z − vz].
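In code, this augmentation amounts to subtracting the voxel centroid and concatenating the offsets; `augment_points` below is a hypothetical helper continuing the NumPy sketch above:

```python
def augment_points(voxel_points):
    """Augment each point in a voxel with its offset from the local mean."""
    centroid = voxel_points[:, :3].mean(axis=0)    # (vx, vy, vz)
    offsets = voxel_points[:, :3] - centroid       # per-point offsets
    # 7-dim point-wise input: [x, y, z, r, x - vx, y - vy, z - vz]
    return np.concatenate([voxel_points, offsets], axis=1)
```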

Fully connected neural net (FCN): The point-wise input feature set is fed to a fully connected neural network so that the point-wise features can be aggregated to encode the shape of the surface contained within the voxel.

The FCN consists of a linear layer, batch normalization, and a ReLU.

Element-wise max pooling: Element-wise max pooling is applied across the point-wise features to obtain the locally aggregated feature.

Finally, point-wise concatenation is used to combine each point-wise feature with the locally aggregated feature.

Here is a VFE layer code snippet defining the point-wise input and the aggregated features, using batch normalization and max pooling:
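Below is a minimal PyTorch sketch of one VFE layer in the spirit of the description above; the class name `VFELayer` and the mask-based handling of zero-padded points are assumptions, not the author's exact code.

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """One voxel feature encoding (VFE) layer:
    linear + BN + ReLU -> element-wise max pool -> point-wise concat."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.units = out_channels // 2         # size of the point-wise feature
        self.fcn = nn.Linear(in_channels, self.units)
        self.bn = nn.BatchNorm1d(self.units)

    def forward(self, x, mask):
        # x:    (K voxels, T points, in_channels) augmented point-wise inputs
        # mask: (K, T, 1) float tensor, 0 for zero-padded entries
        K, T, _ = x.shape
        pointwise = torch.relu(
            self.bn(self.fcn(x.reshape(K * T, -1))).reshape(K, T, self.units))
        pointwise = pointwise * mask           # ignore padded points
        # element-wise max pooling -> locally aggregated feature (K, 1, units)
        aggregated = pointwise.max(dim=1, keepdim=True).values
        # point-wise concatenation -> (K, T, out_channels)
        return torch.cat([pointwise, aggregated.expand(-1, T, -1)], dim=2)
```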

Convolutional middle layers

These layers convert the voxel features into dense 4D feature maps and reduce the feature map size to one-fourth of the original using convolution, batch normalization, and ReLU.

ConvMD(cin, cout, k, s, p) represents an M-dimensional convolution operator, where cin and cout are the numbers of input and output channels, and k, s, and p are the kernel size, stride size, and padding size, respectively.

Code snippet showing the 3D convolutional layers:
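A minimal PyTorch sketch of these layers is shown below, using the three Conv3D block settings reported in the paper; the class and helper names are illustrative.

```python
class MiddleConvLayers(nn.Module):
    """Sketch of the convolutional middle layers: three
    ConvMD(cin, cout, k, s, p) blocks with M = 3, as listed in the paper."""

    def __init__(self):
        super().__init__()

        def conv3d_block(cin, cout, k, s, p):
            # Conv3D + BN + ReLU, matching ConvMD(cin, cout, k, s, p)
            return nn.Sequential(
                nn.Conv3d(cin, cout, k, stride=s, padding=p),
                nn.BatchNorm3d(cout),
                nn.ReLU(inplace=True))

        # The stride-2 steps along the vertical axis shrink the volume so
        # it can later be reshaped into a dense 2D feature map for the RPN.
        self.layers = nn.Sequential(
            conv3d_block(128, 64, 3, (2, 1, 1), (1, 1, 1)),
            conv3d_block(64, 64, 3, (1, 1, 1), (0, 1, 1)),
            conv3d_block(64, 64, 3, (2, 1, 1), (1, 1, 1)))

    def forward(self, x):
        # x: (batch, 128, D, H, W) dense feature volume from the FLN
        return self.layers(x)
```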

Region proposal network

The modified region proposal network has three blocks of fully convolutional layers. The first layer of each block downsamples the feature map by half via a convolution with a stride of 2, followed by a sequence of convolutions with a stride of 1 (×q means q applications of the filter). After each convolution layer, BN and ReLU operations are applied. The output of every block is then upsampled to a fixed size and the results are concatenated to construct a high-resolution feature map.

Finally, this feature map is mapped to the desired learning targets: (1) a probability score map and (2) a regression map.

Code snippet defining convolutional and deconvolution layers:
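Below is a minimal PyTorch sketch of the modified RPN. The block and filter sizes follow the RPN figure in the paper, while the helper names (`conv_block`, `deconv_block`) and the output heads are illustrative assumptions.

```python
class RPN(nn.Module):
    """Sketch of the modified RPN: three conv blocks, each upsampled by a
    deconvolution to a common size, concatenated, then mapped to a
    probability score map and a regression map."""

    def __init__(self, anchors_per_position=2):
        super().__init__()

        def conv_block(cin, cout, q):
            # one stride-2 downsampling conv, then q stride-1 convs
            layers = [nn.Conv2d(cin, cout, 3, 2, 1),
                      nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
            for _ in range(q):
                layers += [nn.Conv2d(cout, cout, 3, 1, 1),
                           nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)

        def deconv_block(cin, cout, k, s, p):
            return nn.Sequential(nn.ConvTranspose2d(cin, cout, k, s, p),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

        self.block1 = conv_block(128, 128, q=3)
        self.block2 = conv_block(128, 128, q=5)
        self.block3 = conv_block(128, 256, q=5)
        # upsample every block's output to the same spatial size
        self.up1 = deconv_block(128, 256, 3, 1, 1)   # keeps size
        self.up2 = deconv_block(128, 256, 2, 2, 0)   # x2 upsampling
        self.up3 = deconv_block(256, 256, 4, 4, 0)   # x4 upsampling
        # 1x1 heads on the 768-channel high-resolution feature map
        self.score_head = nn.Conv2d(768, anchors_per_position, 1)
        self.reg_head = nn.Conv2d(768, 7 * anchors_per_position, 1)

    def forward(self, x):
        # x: (batch, 128, H, W) feature map from the middle layers
        b1 = self.block1(x)            # H/2
        b2 = self.block2(b1)           # H/4
        b3 = self.block3(b2)           # H/8
        hires = torch.cat([self.up1(b1), self.up2(b2), self.up3(b3)], dim=1)
        # sigmoid on the score map yields per-anchor probabilities
        return self.score_head(hires), self.reg_head(hires)
```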

Modified RPN architecture. Pic credits: https://arxiv.org/abs/1711.06396

Thanks for reading!!!

In the next article, we will implement the VoxelNet code for 3D object detection.

Here are some of the results obtained by implementing the model in Colab:

Prediction results on the KITTI validation dataset

Special thanks:

Dr. Uma K Mudenagudi, KLE Technological University, for the project mentorship.

References

  1. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection
  2. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
  3. A survey of the deep learning-based object detection
  4. KITTI raw dataset: Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun, “Vision meets Robotics: The KITTI Dataset,” International Journal of Robotics Research (IJRR), 2013
