Drawing a million boxes around objects on the roads of Palo Alto

Building semantic point clouds for the Kaggle Lyft 3D Object Detection for Autonomous Vehicles Competition

Simon Grest
Towards Data Science


Introduction

This post details the approach that Stefano Giomo and I used for our entries into the recent Kaggle Lyft 3D object detection for autonomous vehicles competition (https://www.kaggle.com/c/3d-object-detection-for-autonomous-vehicles).

This competition used data captured by Lyft vehicles equipped with multiple cameras and LIDAR sensors. The vehicles captured hundreds of 20 second scenes on the roads of Palo Alto. The aim of the competition was to place 3D bounding volumes around different classes of objects from these scenes.

We trained a UNet model on a 2D birds eye view representation of the data. The 2D representation was created by a number of preprocessing steps that combined each LIDAR point cloud with semantic information derived from the cameras as well as from a street map of Palo Alto.

To transform the 2D predictions from the model to 3D bounding volumes we performed a number of post-processing steps involving computer vision techniques, and building a terrain map using the LIDAR point clouds.

Along the way to our bronze medal finish, we faced a number of challenges. The data set for this competition was large, approximately 120 GB, and storing and processing the data took a long time. Learning to correctly transform data between different reference coordinate systems was complicated and required care.

Deciding how to combine the LIDAR data and multiple camera images and how to transform the data into inputs to a neural network was also challenging. While tackling these challenges we learnt a lot and developed a couple of interesting approaches that we think are worth sharing.

In this post, we summarize how we performed preprocessing steps to combine the different data types, how we created inputs to and trained our model and how we post-processed our predictions to compensate for some of the limitations of our model.

The data sets

Sensor data from a scene in the Lyft Level 5 data set

The data for the competition is taken from Lyft’s Level 5 data set (https://level5.lyft.com/) which follows the nuScenes data format (https://www.nuscenes.org/). Metadata for each of the entities in the data schema is represented using JSON.

There is an SDK built by Lyft to facilitate working with the data structures: it allows the user to navigate through scenes, manipulate boxes, visualize the sensor data and so on. The SDK's repository is here (https://github.com/lyft/nuscenes-devkit) and it can be installed as a pip package.

Scenes, samples and sensors

The Lyft vehicle and its sensors

The data consisted of 20 second long scenes. Each scene was made up of a number of samples, and each sample in turn consisted of data from a number of sensors that had been captured at the same time. Each scene consisted of about 125 samples. Associated with each sample was information about the position of the vehicle relative to the scene as well as the position of the sensors relative to the vehicle. This information allowed the data points to be combined and placed correctly within the scene. All the vehicles had at least one LIDAR sensor and six cameras.

The training set had annotations consisting of bounding volumes around all objects of interest in each sample. The objects of interest fell into nine classes. As can be seen in the bar chart, the classes were pretty imbalanced (note the log scale). There were more than half a million cars and fewer than 200 emergency vehicles.

The bounding volumes were defined relative to the scene’s coordinate system. They were parallel to the XY plane and were specified by seven values: the three coordinates of the centre; the three dimensions of the volume as well as the angle of rotation (yaw) about a vertical axis.
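
For concreteness, here is a minimal sketch of that seven-value parameterisation as a Python data structure (the field names are ours, not the SDK's):

```python
from dataclasses import dataclass

@dataclass
class BoundingVolume:
    """One annotated object, in the scene's coordinate system."""
    cx: float      # centre x (metres)
    cy: float      # centre y
    cz: float      # centre z
    width: float   # size along the box's local x axis
    length: float  # size along the box's local y axis
    height: float  # size along the vertical axis
    yaw: float     # rotation about the vertical axis (radians)
```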

Sample from the street map

A map derived from OpenStreetMap (https://www.openstreetmap.org/) covering a region that included all the scenes in the data was also provided.

Training and testing data sets

The training and testing data sets together contained more than 350,000 images and nearly 60,000 point clouds. Altogether the data sets used 120 gigabytes of disk space!

The task and the evaluation criterion

The set of test scenes had no annotations, and competitors were required to predict the class and the bounding volume (the seven values) of each object of interest in the scene.

Predictions were scored using a Mean Average Precision metric. The Average Precision of the predictions was calculated for a range of intersection over union (IOU) thresholds, and the metric was computed as the Mean of these Average Precisions.

The calculation of the Average Precision works as follows. Given an IOU threshold of 0.6, for example, all predicted volumes whose IOU with a ground truth volume was at least 0.6 were considered hits, while the others were considered false positives. The Average Precision for that threshold was then calculated as the number of hits divided by the number of predicted volumes. Thresholds of 0.5, 0.55, …, 0.9, 0.95 were used to calculate the Mean of the Average Precisions.
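
A simplified sketch of this calculation (it glosses over the matching of predictions to ground truth objects and assumes we already know each prediction's best IOU):

```python
import numpy as np

def mean_average_precision(best_ious, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Average, over IOU thresholds, of (hits / number of predictions).
    `best_ious` holds each prediction's best IOU with any ground truth volume
    (0.0 for predictions that match nothing)."""
    best_ious = np.asarray(best_ious)
    precisions = []
    for t in thresholds:
        hits = (best_ious >= t).sum()            # predictions counted as correct at this threshold
        precisions.append(hits / len(best_ious)) # hits divided by the number of predicted volumes
    return float(np.mean(precisions))

# Example: three predictions with best IOUs of 0.92, 0.61 and 0.30
print(mean_average_precision([0.92, 0.61, 0.30]))
```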

Segmenting the camera images

Forward facing cameras with semantic segmentation masks

To leverage the information contained in the camera images we created a semantic segmentation of each image to associate semantic classes with points in the LIDAR point cloud. We did this to give our network information about what object each LIDAR point had reflected off.

We used a pre-trained PyTorch network published by MIT's Computer Science and Artificial Intelligence Laboratory (https://github.com/CSAILVision/semantic-segmentation-pytorch). The model was trained on the ADE20K scene parsing data set (http://groups.csail.mit.edu/vision/datasets/ADE20K/), which contains a large number of scenes, including scenes of public roads in the US, along with segmentation ground truth for 150 classes of objects.

We ran inference on each of the images in the test and training sets, saving the results as GIF images to conserve disk space. Of the 150 classes we chose a subset of 24 based on their frequency and relevance to the competition; for example, we excluded classes for household items like sofas and beds. Restricting ourselves to 24 classes reduced the embedding size needed to represent the categorical information.
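
As a rough illustration of this remapping, a lookup table can compress the 150 ADE20K labels down to the reduced set (the class ids below are placeholders, not the actual classes we kept):

```python
import numpy as np

# Hypothetical subset: ADE20K class id -> compact id in [1, 24]; 0 means "other".
KEPT_CLASSES = {4: 1, 6: 2, 12: 3, 20: 4}   # illustrative ids only

def compress_labels(ade20k_mask: np.ndarray) -> np.ndarray:
    """Map a full 150-class segmentation mask to the reduced label set."""
    lut = np.zeros(151, dtype=np.uint8)      # default: every other class -> 0
    for ade_id, compact_id in KEPT_CLASSES.items():
        lut[ade_id] = compact_id
    return lut[ade20k_mask]                  # vectorised lookup

mask = np.random.randint(0, 151, size=(4, 6))
print(compress_labels(mask))
```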

Projecting classes to point clouds

The Lyft SDK provides functionality to visualise the point cloud on the camera image. We took inspiration from this code to do the reverse operation: mapping image and semantic data onto points in the point cloud.

Associating LIDAR points with pixels in a camera image (drawing by Stefano Giomo)

The figure shows how we projected both the camera image and the semantic map onto the relevant points in the point cloud. Take, for example, a point P in the LIDAR point cloud: we projected P into the camera image along the line connecting P to the camera, finding the 2D point Q, and then used Q to sample the semantic segmentation mask and recover the semantic class of P.

A camera image and semantic classes projected onto the corresponding points in the point cloud
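
Below is a minimal numpy sketch of this lookup. It assumes the point cloud has already been transformed into the camera's coordinate frame using the calibration data in the data set; the function names are ours, not the Lyft SDK's.

```python
import numpy as np

def project_points_to_image(points_cam: np.ndarray, intrinsic: np.ndarray):
    """Project points (3, N), already expressed in the camera frame, onto the
    image plane using the 3x3 camera intrinsic matrix."""
    in_front = points_cam[2, :] > 0.5          # ignore points behind or very close to the camera
    pts = intrinsic @ points_cam               # homogeneous image coordinates
    depth = np.clip(pts[2, :], 1e-6, None)     # guard against division by zero
    return pts[:2, :] / depth, in_front

def semantic_class_per_point(points_cam, intrinsic, seg_mask):
    """Assign each LIDAR point the semantic class of the pixel it projects to.
    Points outside the camera's field of view keep class 0 ("not seen")."""
    h, w = seg_mask.shape
    pix, in_front = project_points_to_image(points_cam, intrinsic)
    u = np.round(pix[0, :]).astype(int)        # column in the image
    v = np.round(pix[1, :]).astype(int)        # row in the image
    visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    classes = np.zeros(points_cam.shape[1], dtype=np.uint8)
    classes[visible] = seg_mask[v[visible], u[visible]]
    return classes
```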

Merging semantic and camera data from multiple sensors

We projected the segmentation classes and camera image data onto the point cloud for each image in each sample. Combining all of these projections gave us a semantic point cloud in which each point had an associated semantic class label.

The images from adjacent cameras overlapped at the edges. For LIDAR points in these overlapping regions we defined a rule to select the final label based on the classes' relevance to the domain.
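
A simple sketch of such a priority-based merge, using illustrative class ids and priorities (the actual rule and ordering we used are not shown here):

```python
import numpy as np

def merge_labels(label_a, label_b, priority):
    """Resolve per-point labels from two overlapping cameras:
    the label whose class has the higher priority (lower number) wins."""
    take_a = priority[label_a] <= priority[label_b]
    return np.where(take_a, label_a, label_b)

# Illustrative priorities indexed by class id; 0 = unlabelled gets the lowest priority.
priority = np.array([99, 1, 2, 3])
a = np.array([0, 1, 3, 2])
b = np.array([2, 2, 1, 0])
print(merge_labels(a, b, priority))   # -> [2 1 1 2]
```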

Multiple camera images and segmentation classes projected onto the entire LIDAR cloud

The visualization above shows the entire semantic point cloud expressed in spherical coordinates where the x axis is the azimuth, and the y axis is the elevation of the ray connecting the point with the LIDAR sensor.
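
The conversion to these coordinates is straightforward; a small sketch, assuming the points are given in the LIDAR sensor frame:

```python
import numpy as np

def to_azimuth_elevation(points: np.ndarray):
    """Convert points (3, N) in the sensor frame to the azimuth/elevation
    angles used in the visualization above."""
    x, y, z = points
    azimuth = np.arctan2(y, x)                  # angle in the horizontal plane
    elevation = np.arctan2(z, np.hypot(x, y))   # angle above the horizontal plane
    return azimuth, elevation
```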

Birds Eye View model

We looked at various possible network architectures, including architectures with 3D convolutions and region proposal networks. These models were complex and computationally expensive, so to ensure we could make submissions in time we took the pragmatic decision to use the reference model provided by the Lyft team as a starting point (https://github.com/lyft/nuscenes-devkit/blob/master/notebooks/Reference%20Model.ipynb).

The reference model was based on a top-down (or birds eye view) representation of the LIDAR point cloud. The idea was to voxelize the point cloud, partitioning the Z dimension into three regions to yield a three channel 2D representation. The normalised number of points in each voxel was then used as the pixel intensity, with the first channel assigned to the lowest Z region and the third channel to the highest. The resulting Birds Eye View could be treated like an RGB image.

Original Birds Eye View model input and target data

The model predicted the probability that each pixel of the input image belonged to one of ten classes (the nine object classes plus a background class).

Example predictions for the class ‘car’
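
A rough numpy sketch of how such a three channel Birds Eye View input can be built from a point cloud (the grid size, ranges and normalisation constant below are illustrative, not the reference notebook's exact values):

```python
import numpy as np

def birds_eye_view(points, bev_shape=(336, 336), z_edges=(-3.0, 0.0, 3.0, 6.0),
                   xy_range=((-50.0, 50.0), (-50.0, 50.0))):
    """Bin LIDAR points (3, N) into an H x W x 3 grid: one channel per Z slab,
    normalised point counts as pixel intensity."""
    (x_min, x_max), (y_min, y_max) = xy_range
    h, w = bev_shape
    x, y, z = points
    col = np.floor((x - x_min) / (x_max - x_min) * w).astype(int)
    row = np.floor((y - y_min) / (y_max - y_min) * h).astype(int)
    bev = np.zeros((h, w, 3), dtype=np.float32)
    for c in range(3):                                # one channel per vertical slab
        keep = ((z >= z_edges[c]) & (z < z_edges[c + 1]) &
                (row >= 0) & (row < h) & (col >= 0) & (col < w))
        np.add.at(bev[:, :, c], (row[keep], col[keep]), 1.0)
    return np.clip(bev / 16.0, 0.0, 1.0)              # crude normalisation of the counts
```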

Improving the Birds Eye View model

Our data preprocessing pipeline to create inputs to the neural network (drawing by Stefano Giomo)

Inspired by the PointPillars paper (https://arxiv.org/abs/1812.05784), we extended the reference model by adding further input channels that fed more information to the model.

Specifically, we added information from the street map and from our semantic point cloud. To include the semantic information we used a 5-dimensional embedding, in the style of the Entity Embeddings paper from the Rossmann Store Sales Kaggle competition (https://arxiv.org/pdf/1604.06737.pdf).

We modified the PyTorch UNet implementation used in the reference model and trained it using Jeremy Howard’s fastai library.

Training data for our model
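
As a sketch of the embedding idea, the per-pixel semantic class ids can be embedded and concatenated with the other Birds Eye View channels before being passed to the UNet (only the 24 + 1 classes and the embedding size of 5 come from our setup; the rest of the module is illustrative):

```python
import torch
import torch.nn as nn

class SemanticChannels(nn.Module):
    """Turn per-pixel semantic class ids into extra input channels."""
    def __init__(self, num_classes=25, emb_dim=5):
        super().__init__()
        self.embedding = nn.Embedding(num_classes, emb_dim)

    def forward(self, bev_rgb, semantic_ids):
        # bev_rgb: (B, 3, H, W) point-count channels; semantic_ids: (B, H, W) long tensor
        emb = self.embedding(semantic_ids)          # (B, H, W, emb_dim)
        emb = emb.permute(0, 3, 1, 2)               # (B, emb_dim, H, W)
        return torch.cat([bev_rgb, emb], dim=1)     # (B, 3 + emb_dim, H, W) input to the UNet

x = SemanticChannels()(torch.rand(2, 3, 64, 64), torch.randint(0, 25, (2, 64, 64)))
print(x.shape)   # torch.Size([2, 8, 64, 64])
```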

Post-processing

Generating 2D rectangles

Once we had completed inference on the test set, we followed the approach used in the reference model to translate our predictions into 2D bounding boxes. This approach first thresholded the predictions to obtain a binary mask for each class. An erosion was then performed followed by a dilation (together known as morphological opening), which removed small artefacts in the binary masks. We then found contours for each of the resulting binary masks and the minimum-area bounding rectangles of those contours. These rectangles were the basis for our bounding volumes.

An example of predictions and extracted contours for class ‘car’
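
A condensed OpenCV sketch of this recipe (the threshold and kernel size are illustrative):

```python
import cv2
import numpy as np

def boxes_from_heatmap(prob, threshold=0.5, kernel_size=3):
    """Turn one class's per-pixel probabilities into 2D rotated rectangles:
    threshold -> morphological opening -> contours -> minimum-area rectangles."""
    mask = (prob >= threshold).astype(np.uint8)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # erosion then dilation
    contours, _ = cv2.findContours(opened, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)  # OpenCV 4 signature
    # each rectangle: ((centre_x, centre_y), (width, height), angle_in_degrees)
    return [cv2.minAreaRect(c) for c in contours]
```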

After this process was complete we had five of the seven parameters that specified a bounding volume. We had calculated the x and y location and scale as well as the rotation on the xy plane (or yaw). We still required the z location and scale (i.e. elevation and height).

Height and elevation

The reference model used some simplifying assumptions in order to convert the predicted 2D boxes into 3D volumes. The first assumption was that all objects of a particular class had the same height — specifically, the average height for that class in the training data. The second assumption was that all objects were at the same elevation as the vehicle with the sensors (the ego vehicle).
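
The first of these assumptions amounts to a per-class average over the training annotations; a minimal sketch, assuming the boxes have been loaded into a pandas DataFrame with hypothetical `class_name` and `height` columns:

```python
import pandas as pd

# Toy example of the training annotations (values are made up)
train_boxes = pd.DataFrame({
    "class_name": ["car", "car", "pedestrian"],
    "height":     [1.6,   1.7,   1.8],
})
class_heights = train_boxes.groupby("class_name")["height"].mean()
print(class_heights)   # car 1.65, pedestrian 1.80
```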

We attempted to improve on both of these assumptions using post-processing of the LIDAR point clouds. Here we detail how we improved the elevation assumption.

Building a terrain elevation map

Point cloud of a scene with a hill; the spikes are roadside trees

We reasoned that we should be able to use the minimum elevation in the LIDAR point cloud to find the level of the ground in a scene. We transformed all of the roughly 125 point clouds in each scene into the same reference coordinate system and merged them to build a point cloud view of the entire scene. We used this combined point cloud to build a minimum elevation map, or terrain map, of the scene.

Minimum elevation map for a scene with a hill

One problem we faced was that any objects such as vehicles or pedestrians in the scene would lead to an overestimate of the ground level. We avoided this problem by taking the minimum elevation through space and time. By taking the minimum value in a region of space around an object we found lower points on the ground next to the object. By taking the minimum value over different points in time, we found the ground under any moving objects after they moved away.
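
A minimal sketch of such a terrain map, assuming the merged scene point cloud is an (N, 3) array and using an illustrative grid resolution and extent:

```python
import numpy as np

def terrain_map(points_xyz, cell_size=0.5, xy_range=((-100.0, 100.0), (-100.0, 100.0))):
    """For every grid cell keep the lowest Z ever observed there,
    across the whole merged scene point cloud (N, 3)."""
    (x_min, x_max), (y_min, y_max) = xy_range
    w = int((x_max - x_min) / cell_size)
    h = int((y_max - y_min) / cell_size)
    grid = np.full((h, w), np.inf, dtype=np.float32)
    col = ((points_xyz[:, 0] - x_min) / cell_size).astype(int)
    row = ((points_xyz[:, 1] - y_min) / cell_size).astype(int)
    keep = (row >= 0) & (row < h) & (col >= 0) & (col < w)
    np.minimum.at(grid, (row[keep], col[keep]), points_xyz[keep, 2])
    return grid

# A predicted box can then be placed on the ground, e.g.
# centre_z = grid[row_of(box), col_of(box)] + box_height / 2
# (row_of / col_of are hypothetical helpers mapping metres to grid indices).
```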

The point clouds above were taken from a scene in the training set that contained a hill. The video below shows the cameras and annotations from this scene.

YouTube video showing a scene with a hill.

At the beginning of the scene, the vehicle is travelling down the hill with several vehicles following behind it. The animation below contrasts bounding volumes placed at the same elevation as the ego vehicle with volumes placed on the terrain map we created.

Bounding volumes of a vehicle on a hill placed at ego vehicle elevation vs at terrain elevation

Initially, while the vehicle descends the hill, the bounding volume placed at the ego vehicle's elevation is clearly worse than the bounding volume placed on the terrain map. As the road surface levels out the difference disappears, so this technique improves scores on scenes with hills or other variations in the elevation of the road.

Conclusion

Most of the effort we expended for this competition went into the pre-processing of the input data and the post-processing of the predictions. We made two key improvements to the reference model. First, by projecting the semantic information derived from the camera images onto the point cloud, we passed information about the classes of objects to our model. Secondly, by building a map of the terrain, we improved the elevation accuracy of our predicted volumes.

As with most Kaggle competitions we ran out of time before we ran out of ideas. We tried training our network on multiple successive samples (sweeps) without much success, but given more time we’re sure this route would have given us a boost. Later we realised that we should have tried using higher resolution birds eye view images.

If we’d had much more time we would have liked to try using a more advanced end-to-end approach with a better representation of the point cloud and a region proposal network for better bounding volumes. We would have liked to extend the PointPillars network with semantic points and we also would have liked to train a model on the spherically mapped semantic point cloud.

We will publish our GitHub repository for this project soon. We are also working on several more articles that will go into greater detail on the more technical elements of our preprocessing, post-processing and model training.

Thanks to Stefano Giomo for the great drawings! And thanks to you for reading!

If you’d like to look at more cool videos we generated from the data set, and at our semantic segmentation masks, here’s a YouTube playlist:

Playlist of YouTube videos generated from the Lyft Level 5 data
