The state of 3D object detection

A review of the state of the art based upon the KITTI leaderboard

Lance Martin
Towards Data Science


Preface

An update to this post is available here.

Intro

3D object detection is a fundamental challenge for automated driving. The KITTI vision benchmark provides a standardized dataset for training and evaluating the performance of different 3D object detectors. Here, I use data from KITTI to summarize and highlight trade-offs in 3D detection strategies. These strategies can generally be broken down into those that use LIDAR only and those that fuse LIDAR + image (RGB). I analyze these categories separately.

LIDAR

CNN machinery for 2D object detection and classification is mature. But, 3D object detection for automated driving poses at least two unique challenges:

  • Unlike RGB images, LIDAR point clouds are 3D and unstructured.
  • 3D detection for automated driving must be fast (< ~100ms).

Several 3D detection methods have tackled the first problem by discretizing the LIDAR point cloud into a 3D voxel grid and applying 3D convolutions. However, 3D convolution suffers from greater computational cost, and thus higher latency, than 2D convolution. Alternatively, the point cloud can be projected to a 2D image in the top-down Bird's Eye View (BEV) or the LIDAR's native Range View (RV). The advantage is that the projected image can be processed with cheaper 2D convolutions, yielding lower latency.
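To make the BEV projection concrete, below is a minimal sketch (Python/NumPy) of discretizing a LIDAR point cloud into a top-down pseudo-image with hand-crafted height bins, roughly in the spirit of PIXOR-style inputs. The ranges, resolution, and channel layout are illustrative assumptions rather than the configuration of any particular paper.

```python
import numpy as np

def lidar_to_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                 z_range=(-2.5, 1.0), res=0.1, n_height_bins=8):
    """Project an (N, 4) LIDAR point cloud [x, y, z, intensity] into a
    BEV pseudo-image of shape (H, W, n_height_bins + 1): one occupancy
    channel per height bin plus a reflectance channel."""
    x, y, z, intensity = points[:, 0], points[:, 1], points[:, 2], points[:, 3]

    # Keep only points inside the region of interest.
    mask = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z, intensity = x[mask], y[mask], z[mask], intensity[mask]

    # Discretize x/y into pixel indices and z into height bins.
    H = int((x_range[1] - x_range[0]) / res)
    W = int((y_range[1] - y_range[0]) / res)
    xi = ((x - x_range[0]) / res).astype(np.int32)
    yi = ((y - y_range[0]) / res).astype(np.int32)
    zi = (z - z_range[0]) / (z_range[1] - z_range[0]) * n_height_bins
    zi = np.clip(zi.astype(np.int32), 0, n_height_bins - 1)

    bev = np.zeros((H, W, n_height_bins + 1), dtype=np.float32)
    bev[xi, yi, zi] = 1.0                    # occupancy per height slice
    bev[xi, yi, n_height_bins] = intensity   # reflectance of the last point in each cell
    return bev
```

The resulting (H, W, C) tensor can be fed to an ordinary 2D CNN, which is what gives BEV detectors their low latency.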

I picked a few methods from the KITTI BEV leaderboard to highlight trade-offs between RV, BEV, and methods that operate on voxel features. This plot shows the reported inference latency (ms) versus vehicle AP:

Detector (LIDAR only) latency vs vehicle AP

Key takeaways from the results:

  • BEV projection preserves object size with distance, providing a strong prior for learning. The Z-axis is treated as a feature channel for 2D convolutions. Hand-crafted Z-axis binning (e.g., PIXOR) can be improved using PointNet to consolidate the Z-axis into learned features (e.g., PointPillars). Also, the ground height can be used to flatten points in the Z-axis (e.g., HDNet), mitigating the effect of translation variance due to the slope of the road.
  • BEV with learned (PointNet) features to consolidate the Z-axis achieves strong performance. SECOND does this with a voxel feature encoding layer and sparse convolutions; newer versions of SECOND (v1.5) report better AP (86.6%) and low latency (40ms). PointPillars applies a simplified PointNet to Z-axis pillars, producing a 2D BEV image that is fed into a 2D CNN (a minimal sketch of this pillar encoding follows this list).
  • RV projection suffers from occlusion and object size variation with respect to distance. RV detector performance (e.g., LaserNet) lags that of BEV detectors on KITTI's ~7.5k frame train dataset. But, LaserNet performance on the 1.2M frame ATG4D dataset is at parity with BEV detectors (e.g., HDNet).
  • RV projection has low latency (e.g., LaserNet), likely due to the dense RV representation relative to the sparser BEV. VoxelNet pioneered the use of voxel features, but suffered from high latency due to 3D convolutions. Newer approaches (e.g., SECOND) use the same voxel feature encoding layer, but replace dense 3D convolution with sparse convolutions to reduce latency.
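As a minimal sketch of the learned Z-axis features described above, the snippet below applies a PointPillars-style simplified PointNet: a per-point linear layer followed by a max over each pillar, with the pillar features scattered back onto a BEV canvas. The class name, feature dimensions, and the assumption that points are already grouped into fixed-size pillars are illustrative; the published network adds per-point decorations and a full 2D backbone.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Simplified PointNet over Z-axis pillars: linear -> BatchNorm -> ReLU
    per point, then a max over each pillar, scattered onto a BEV canvas."""

    def __init__(self, in_dim=4, out_dim=64):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, pillars, pillar_xy, bev_hw):
        # pillars:   (P, N, in_dim) points grouped into P pillars of N points
        #            (handling of zero-padded pillars is omitted for brevity)
        # pillar_xy: (P, 2) integer BEV grid coordinates of each pillar
        # bev_hw:    (H, W) size of the output BEV canvas
        feats = self.linear(pillars)                          # (P, N, out_dim)
        feats = torch.relu(self.bn(feats.transpose(1, 2)))    # (P, out_dim, N)
        feats = feats.max(dim=2).values                       # (P, out_dim) per-pillar max
        H, W = bev_hw
        canvas = feats.new_zeros((feats.shape[1], H, W))
        canvas[:, pillar_xy[:, 0], pillar_xy[:, 1]] = feats.t()
        return canvas                                         # (out_dim, H, W) for a 2D CNN

# Example with made-up sizes: 200 pillars of 32 points, 4 features per point.
pillars = torch.randn(200, 32, 4)
pillar_xy = torch.randint(0, 400, (200, 2))
bev_image = PillarFeatureNet()(pillars, pillar_xy, (400, 400))  # -> (64, 400, 400)
```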

LIDAR + RGB

LIDAR + RGB fusion improves 3D detection performance, particularly for smaller objects (e.g., pedestrians) or at long range (>~50m-70m) where LIDAR data is often sparse. A few fusion approaches are summarized below. Proposal-based methods generate object proposals in RGB (e.g., F-Pointnet) or BEV (e.g., MV3D). Dense fusion methods fuse LIDAR and RGB features directly into a common projection, often at multiple levels of resolution.
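Both families rely on associating LIDAR points with image pixels. Below is a minimal sketch, assuming KITTI-style calibration matrices (Tr_velo_to_cam, R0_rect, P2) loaded as NumPy arrays, of projecting LIDAR points into pixel coordinates; the resulting per-point pixels can then be used to crop frustums for proposals or to gather RGB features for fusion. Bounds checking against the image size is left out.

```python
import numpy as np

def project_lidar_to_image(points, Tr_velo_to_cam, R0_rect, P2):
    """Project (N, 3) LIDAR points into image pixel coordinates using
    KITTI-style calibration: Tr_velo_to_cam (3x4), R0_rect (3x3), P2 (3x4).
    Returns (N, 2) pixel coordinates and a mask of points in front of the camera."""
    n = points.shape[0]
    pts_h = np.hstack([points, np.ones((n, 1))])   # (N, 4) homogeneous LIDAR points

    cam = Tr_velo_to_cam @ pts_h.T                 # (3, N) camera frame
    cam = R0_rect @ cam                            # rectified camera frame
    in_front = cam[2] > 0.1                        # keep points ahead of the camera

    cam_h = np.vstack([cam, np.ones((1, n))])      # (4, N)
    img = P2 @ cam_h                               # (3, N) on the image plane
    uv = (img[:2] / img[2]).T                      # (N, 2) pixel coordinates (u, v)
    return uv, in_front
```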

General approaches for LIDAR + RGB fusion. Images are adapted from MV3D (Chen et al., 2016), F-Pointnet (Qi et al., 2017), ContFuse (Liang et al., 2018), and LaserNet (Meyer et al., 2018).

This plot shows the reported inference latency (ms) versus vehicle AP:

Detector (LIDAR+RGB fusion labeled) latency vs vehicle AP

Key takeaways from the results:

  • RV dense fusion has the lowest latency of all the approaches, and proposal-based methods generally have higher latency than dense fusion. RV dense fusion (e.g., LaserNet++) is fast since RGB and LIDAR features are both in RV, so LIDAR features can be directly projected into the image for fusion. In contrast, ContFuse does BEV dense fusion: it generates a BEV feature map from RGB features, which is fused with the LIDAR BEV feature map. This is challenging because not all pixels in BEV are observable in the RV RGB image. Several steps solve this (see the sketch after this list). For an unobserved BEV pixel, K nearby LIDAR points are fetched and the offset between each point and the target BEV pixel is computed. The points are projected to RV to retrieve the corresponding RGB features. The offsets and RGB features are fed to a continuous convolution, which interpolates between the RGB features to generate the unobserved feature at the target BEV pixel. Doing this for all BEV pixels yields a dense interpolated BEV map of RGB features.
  • Fusion methods show the largest gains on small objects and at longer range, where LIDAR is sparse. The AP improvement for LIDAR + RGB feature fusion (LaserNet++) versus LIDAR only (LaserNet) is modest on vehicles (+~1% AP at 0–70m), but more substantial on smaller classes, especially at longer range (+~9% bike AP at 50–70m). LaserNet++ has strong performance on ATG4D, but its KITTI performance is not reported.
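To make the ContFuse interpolation step above concrete, here is an illustrative sketch in which a small MLP stands in for the parametric continuous convolution: it takes the offsets from a target BEV pixel to its K nearest LIDAR points along with those points' image features, and outputs an interpolated feature for that pixel. The class name, dimensions, and value of K are assumptions for illustration only; the neighbor lookup and the image-feature gather are assumed to have happened upstream.

```python
import torch
import torch.nn as nn

class ContinuousFusionSketch(nn.Module):
    """MLP stand-in for a parametric continuous convolution: per BEV pixel,
    consume K (offset, RGB feature) pairs and emit an interpolated feature."""

    def __init__(self, rgb_dim=64, k=3, hidden=64):
        super().__init__()
        # Each neighbor contributes a 3D offset plus an RGB feature vector.
        self.mlp = nn.Sequential(
            nn.Linear(k * (3 + rgb_dim), hidden),
            nn.ReLU(),
            nn.Linear(hidden, rgb_dim),
        )

    def forward(self, offsets, rgb_feats):
        # offsets:   (B, K, 3)       offset from each neighbor LIDAR point to the BEV pixel
        # rgb_feats: (B, K, rgb_dim) RGB features gathered by projecting the neighbors to RV
        b = offsets.shape[0]
        x = torch.cat([offsets, rgb_feats], dim=-1).reshape(b, -1)
        return self.mlp(x)           # (B, rgb_dim) interpolated feature per BEV pixel

# Example with made-up sizes: 1024 BEV pixels, 3 neighbors each, 64-d features.
fuser = ContinuousFusionSketch(rgb_dim=64, k=3)
fused = fuser(torch.randn(1024, 3, 3), torch.randn(1024, 3, 64))  # -> (1024, 64)
```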

Summary

There are trade-offs between BEV and RV projections. BEV preserves metric space, keeping object size consistent with respect to range. In contrast, RV suffers from scale variation with respect to range and from occlusion. As a result, BEV detectors (e.g., PointPillars) achieve superior performance to RV detectors (e.g., LaserNet) on small datasets (e.g., KITTI at ~7.5k frames) at similar latency (e.g., 16ms for PointPillars vs 12ms for LaserNet). However, RV performance is at parity with BEV on larger datasets (e.g., the 1.2M frame ATG4D). Moreover, dense feature fusion is faster in RV than in BEV: LaserNet++ reports impressive latency (38ms) and better performance than dense BEV fusion detectors (e.g., ContFuse at 60ms). These trade-offs are summarized in the figure below. New LIDAR + RGB fusion architectures may find ways to move between projections, taking advantage of the benefits of each one.

Trade-offs between RV and BEV projections
