What is Data Fusion?
Data fusion, in the abstract sense, refers to combining different sources of information in intelligent and efficient ways such that the system processing the data performs better than it would with any single data source alone.
In this article, we will discuss how and why data fusion is leveraged for a variety of intelligent applications, specifically for self-driving cars. Then, we’ll dive into a case study on "sparse" data fusion to see how data fusion is used in action.
High-Level Idea of Data Fusion: If I have two or more sources of data, and each source of data offers novel and predictive information to help me make better predictions or control decisions for my intelligent system, then I can combine these sources of data to improve the performance of my system.

Note that we aren’t restricted to combining just two types of data sources: in general, we can perform data fusion across any N different data sources, so long as they are co-referenced/calibrated.

What Disciplines Can Data Fusion Be Used In?
Certainly not an exhaustive list, but data fusion is leveraged in fields such as:
- Signal processing (matched filtering, Kalman filtering, state-space control).
- Robotics (perception, visual-inertial odometry (VIO), bundle adjustment [1]).
- Machine learning (semantic segmentation, object detection and classification, embeddings).
More generally, data fusion can be used in any field in which we use data to make predictions or decisions.
What Are Some Example Domains Where Data Fusion is Used?

Again, this list is not exhaustive, but hopefully paints a picture of the landscape of possibilities for data fusion:
- Self-driving cars (discussed in further detail below!)
- Remote sensing: Different forms of sensor data, such as RGB and lidar, can be fused together for tasks such as automated terrestrial surveying.
- Robotic manipulation: Vision-based data, such as RGB or stereo, can be fused with odometry data from manipulators and actuators in order to improve performance in robotic manipulation tasks.
How Can We Effectively Utilize Data Fusion For Intelligent Applications?
Some forms of data fusion may be more effective than others. Here are some considerations to weigh when using data fusion in any intelligent system that relies on data to make predictions or decisions:
- What steps of data fusion can be performed online? What steps can be performed offline? Will data fusion cause significant inference latency, and if so, is an X factor improvement in performance worth this additional latency?
- How much online compute power (CPUs, GPUs, etc.) will preprocessing and postprocessing for my data fusion pipeline require?
- How much additional compute power (CPUs, GPUs, etc.) is needed to perform inference using data from my data fusion pipeline?
- How much additional memory (RAM) is required to process and store data from my data fusion pipeline?
- If I am using sensors for data fusion (e.g. RGB cameras, stereo cameras, lidar sensors, inertial measurement units (IMUs), etc.), is any additional calibration required between my sensors?
- How "orthogonal" are my datasets? How hard would it be to predict one dataset given the other(s), and vice versa? The harder this is, the more aggregate information you can extract from fusing these different sources of data.
These questions can help you to weigh trade-offs for different data fusion methodologies, and allow you to design a data fusion setup for optimal system performance.
Data Fusion for Self-Driving Cars
![Example of data fusion for self-driving cars: fusion of RGB pixels and lidar point clouds. Image source: [2].](https://towardsdatascience.com/wp-content/uploads/2021/02/0u3jDSxhkW1pmUWNt.png)
Now that we’ve introduced data fusion, let’s think about how it can be applied to the self-driving car domain.
Data fusion is a crucial subroutine in perception, object detection, semantic segmentation, and control for autonomous vehicles and robotics. These and other intelligence-demanding tasks rely on combining data sources in meaningful ways that enable self-driving cars to make informed predictions and decisions as quickly as possible.
What Types of Data Sources are used in Data Fusion For Self-Driving Cars?
Some sources of data that are combined for self-driving cars include:
- RGB pixels: These are Red, Green, and Blue pixel intensities from cameras mounted on self-driving cars.
- Stereo point clouds: These are clouds of points denoting the (x, y, z) locations of objects in space, found by estimating depth from the disparity between a pair of stereo cameras.
- Lidar point clouds: Similarly, these are clouds of points denoting the (x, y, z) locations of objects in space, captured by lidar sensors. Each point’s range is computed from the round-trip time of a laser pulse back to the sensor (see the short formula after this list).
- Odometry: Odometry data is typically produced by an inertial measurement unit (IMU), often combined with GPS. It describes the kinematics and dynamics of the car, such as velocity, acceleration, and position.
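For reference, the range of each lidar return follows directly from the measured round-trip time of the laser pulse, where c is the speed of light and Δt is the measured round-trip time:

```latex
% Range of a single lidar return from its time-of-flight measurement,
% where c is the speed of light and \Delta t the round-trip time:
d = \frac{c \, \Delta t}{2}
```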
How Are These Data Sources Combined?
Some examples of how the above sources of data are utilized for data fusion-based self-driving car applications include fusion of:
- Odometry data with point cloud data for localization.
- RGB data with odometry data for solving the bundle adjustment problem [1].
- RGB and point cloud data to create a dense 2D image with four channels (R, G, B, D) – this is sometimes called a Digital Surface Model (DSM). Tasks such as object detection, classification, and semantic segmentation can operate on this DSM (see the dense-fusion sketch after this list).
- RGB and point cloud data to create a sparse 3D point cloud, where each point has (x, y, z, R, G, B) data. Tasks such as object detection, classification, and semantic segmentation can operate on this point cloud with RGB features.
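To make the dense, image-style fusion described in the list above concrete, below is a minimal NumPy sketch of projecting lidar points into a camera image to build a four-channel (R, G, B, D) array. The function name, the pinhole intrinsic matrix K, and the assumption that points are already expressed in the camera frame are simplifications for illustration, not code from any particular library:

```python
import numpy as np

def make_rgbd(image, points_cam, K):
    """Dense-fusion sketch: fuse an RGB image with projected lidar points.

    image:      (H, W, 3) uint8 RGB image.
    points_cam: (N, 3) lidar points already expressed in the camera frame.
    K:          (3, 3) pinhole camera intrinsic matrix.
    Returns an (H, W, 4) float array with channels (R, G, B, depth).
    """
    H, W, _ = image.shape
    rgbd = np.zeros((H, W, 4), dtype=np.float32)
    rgbd[..., :3] = image / 255.0

    # Keep only points in front of the camera.
    pts = points_cam[points_cam[:, 2] > 0]

    # Project to pixel coordinates: u = fx * x / z + cx, v = fy * y / z + cy.
    uvw = (K @ pts.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Drop projections that fall outside the image bounds.
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    rgbd[v[valid], u[valid], 3] = pts[valid, 2]  # depth channel
    return rgbd
```

In practice, the depth channel is only populated where lidar returns project into the image, so dense-fusion pipelines typically interpolate or densify it before feeding it to image-based models.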
Now that we’ve introduced some examples of data sources and applications for/of data fusion for self-driving cars, let’s dive deeper into a case study for data fusion for autonomous vehicles: sparse data fusion between lidar point clouds and RGB pixels for semantic segmentation.
Case Study: Sparse Data Fusion
Rather than fusing RGB and depth information into a dense, 2D array with multiple channels, in this case study we take the opposite approach: we augment an existing point cloud (our depth information) with RGB values from co-referenced 2D pixels.
The figure below illustrates this sparse data fusion process.

Why Sparse Fusion?
A crucial question for every intelligent system: Why does this matter? A few proposed reasons:
- Because we now perform inference over a sparse, 3D point cloud, rather than a dense, 2D image, we do not have to predict on dense, computationally-intensive inputs. Though GPUs and other hardware acceleration technologies have enabled significant speedups in dense convolutional operations over images, performing semantic segmentation inference on large images in real time is still a non-trivial problem.
- In "dense fusion", when we transform our point cloud, a set of 3D points, into a depth image (sometimes known as a Digital Surface Model (DSM)), we inherently lose information about the exact 3D structure of the lidar point cloud. Performing "sparse fusion" preserves this exact 3D structure by retaining a point-based representation of our data.
Note that one trade-off of this "sparse fusion" approach is that we don’t utilize all RGB pixels: only pixels with a co-referenced lidar return contribute to the fused representation.
Problem Statement
With some background information on the type of fusion task to perform, we are now ready to formally define our data fusion-based semantic segmentation task. Mathematically:
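One reasonable way to write this, consistent with the definitions below, is as learning a per-point labeling function over the fused point cloud (C denotes the number of semantic classes):

```latex
% X is the fused point cloud, one row per lidar return:
% X \in \mathbb{R}^{N \times D}, \quad D = 6 \;\; (x, y, z, R, G, B).
% We learn a mapping f to per-point semantic labels over C classes:
f : \mathbb{R}^{N \times D} \rightarrow \{1, \dots, C\}^{N}
```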
Here, N refers to the number of points in our point cloud, and D refers to the number of dimensions for each point; in this case, D = 6 (3 for x, y, z, and 3 for R, G, B).
Diagrammatically, this data fusion-based semantic segmentation problem can be visualized as:

Now that we’ve set up our data fusion problem, let’s introduce the autonomous driving dataset we’ll be utilizing for our case study.
Audi Autonomous Driving Dataset (A2D2)
To test this data fusion approach, we leverage the Audi Autonomous Driving Dataset (A2D2) [2]. This dataset captures a direct mapping between RGB pixels and point cloud returns through calibrated lidar and camera sensors mounted on the authors’ car.
![An example sample from the A2D2 dataset. Left: lidar Point cloud representation of scene. Middle: RGB image representation of scene. Right: Semantic label representation of scene - note that different colors indicate different ground truth classes of objects in the scene, such as vehicles, roads, trees, or traffic signals. Image source [2].](https://towardsdatascience.com/wp-content/uploads/2021/02/1xYNjrMtHOkbLDU4zmdMzjA.png)
Most importantly, we utilize this dataset because the calibration work by the authors of [2] provides a direct mapping between pixels and point cloud returns. It’s worth emphasizing that our ability to perform data fusion depends entirely on this calibration.
Performing Data Fusion
This direct mapping allows for us to concatenate point cloud returns with RGB features, such that each point cloud point is no longer just represented as p = (x, y, z), but rather as p = (x, y, z, R, G, B), where the R, G, B values are taken from the pixel co-referenced with the given lidar point.
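As a concrete illustration, here is a minimal NumPy sketch of this concatenation. It assumes each lidar return comes with the row and column of its co-referenced pixel, as in the A2D2 development kit; the exact key names are assumptions and may differ from your data:

```python
import numpy as np

def fuse_points_with_rgb(lidar, image):
    """Sparse-fusion sketch: attach co-referenced RGB values to lidar returns.

    lidar: dict-like with 'points' (N, 3) plus per-point pixel indices
           'row' and 'col' (key names assumed; adjust to your dataset).
    image: (H, W, 3) uint8 undistorted RGB image aligned with the lidar scan.
    Returns an (N, 6) array of (x, y, z, R, G, B) points.
    """
    rows = np.round(lidar['row']).astype(int)
    cols = np.round(lidar['col']).astype(int)
    rgb = image[rows, cols].astype(np.float32) / 255.0  # (N, 3) color features
    return np.concatenate([lidar['points'], rgb], axis=1)
```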
To perform this data fusion operation, we defined a custom torch Dataset class in Python (wrapped by a DataLoader) that concatenates the (x, y, z) and (R, G, B) features when retrieving items from the dataset via the __getitem__ method. For brevity, only the constructor and sampling method are provided below; you can find the full class in the Appendix, as well as in the linked GitHub repository:
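Below is a sketch along those lines. It is not the exact class from the repository; the file layout, array key names, and the fuse_points_with_rgb helper from the previous snippet are illustrative assumptions:

```python
import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image

class SparseFusionDataset(Dataset):
    """Yields fused (x, y, z, R, G, B) point clouds with per-point labels.

    lidar_files, image_files, and label_files are parallel lists of paths;
    the storage format assumed here is illustrative only.
    """

    def __init__(self, lidar_files, image_files, label_files):
        assert len(lidar_files) == len(image_files) == len(label_files)
        self.lidar_files = lidar_files
        self.image_files = image_files
        self.label_files = label_files

    def __len__(self):
        return len(self.lidar_files)

    def __getitem__(self, idx):
        lidar = np.load(self.lidar_files[idx])                 # .npz with 'points', 'row', 'col'
        image = np.asarray(Image.open(self.image_files[idx]))  # (H, W, 3) RGB
        labels = np.load(self.label_files[idx])                # (N,) per-point class ids

        # Sparse data fusion: same concatenation as the earlier sketch.
        fused = fuse_points_with_rgb(lidar, image)             # (N, 6)

        return torch.from_numpy(fused).float(), torch.from_numpy(labels).long()
```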
You can read more about how the creators of A2D2 produced this mapping in the A2D2 paper [2], which documents their sensor setup and calibration.
Learning Architecture: PointNet++
To learn the desired mapping f for this semantic segmentation problem, we will use PointNet++, a deep neural network designed to operate directly on point sets. Rather than performing convolutions over matrices and tensors corresponding to dense spatial images, this network performs hierarchical, convolution-like set operations directly on the points of the point cloud.
Below, the architecture of PointNet++, as presented in [3], is provided. Since our task focuses on semantic segmentation, we will utilize the segmentation-based variant of this neural network architecture.
![PointNet++ architecture. This architecture improves upon the original PointNet architecture through the use of a hierarchical set of point set convolutional layers [3]. Image Source: [3].](https://towardsdatascience.com/wp-content/uploads/2021/02/0nH4z9phSNzzo9Pu--scaled.jpg)
Applying Sparse Data Fusion to PointNet++
For this data fusion task, we combine our spatial (x, y, z) features and our (R, G, B) features at each point and use these concatenated features as input to the PointNet++ network. Note that the (R, G, B) features are not treated in a spatial manner – i.e. we do not compute "spatial" correlations between pixel values.
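To make this concrete, here is a minimal sketch of how a fused batch might be split into coordinates and per-point features before being handed to a PointNet++-style segmentation model. The model class name and constructor arguments shown in the comments are hypothetical placeholders, not code from the actual repository:

```python
import torch

# A batch of B fused point clouds with N points each,
# channels ordered (x, y, z, R, G, B).
fused = torch.randn(8, 4096, 6)

xyz = fused[..., :3]  # (B, N, 3): spatial coordinates, used for neighborhood grouping
rgb = fused[..., 3:]  # (B, N, 3): RGB treated as plain per-point features, not spatially

# Hypothetical segmentation model: most PointNet++ implementations accept
# coordinates plus extra per-point feature channels in roughly this form.
# model = PointNet2SemSeg(num_classes=6, extra_feature_channels=3)
# logits = model(xyz, rgb)  # (B, N, num_classes) per-point class scores
```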
Our data fusion-based semantic segmentation system is depicted in the diagram below:

For training, the data transformations were performed offline to reduce online computation time. However, in real-time semantic segmentation settings, this sparse data fusion preprocessing step would need to run in real or near-real time.
Experiment Settings and Evaluation
To test the efficacy of this framework, we evaluate the network’s ability to correctly classify the semantic labels of each lidar point cloud return. Specifically, we measure segmentation performance using standard semantic segmentation metrics: the per-class Intersection over Union (IoU), the mean Intersection over Union (mIoU) over all classes, and overall accuracy, evaluated over a set of N points, each assigned a predicted semantic class, with C distinct classes in total.
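In their standard forms, these metrics are:

```latex
% Per-class Intersection over Union, with true positives, false positives,
% and false negatives counted over all N points for class c:
\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c}

% Mean Intersection over Union, averaged over the C classes:
\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c

% Accuracy: the fraction of the N points whose predicted class \hat{y}_i
% matches the ground-truth class y_i:
\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\hat{y}_i = y_i\right]
```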
We evaluate this sparse data fusion approach for:
- Two semantic classes: Segmenting the road vs. non-road features.
- Six classes: These include, in order of frequency, "other", "road", "vehicle", "pedestrian", "road/street signs", and "lanes". Note that "other" is not a true class in the dataset; it aggregates all remaining ground-truth classes into a single category for this experiment.
- 55 classes: This extends our semantic segmentation problem to a much more diverse set of semantic classes.
For reference, the relative proportions of each major class in the six-class semantic segmentation problem are provided below:

Results
Our results from applying the PointNet++ approach with our sparse data fusion technique are given in the table below. Note that as we decrease the number of ground-truth classes, our accuracy and mIoU metrics increase significantly.

Limitations and Future Work
Though this work demonstrates the preliminary viability of this framework, there is still significant additional research that could be done to extend it.
One such proposed continuation of this work aims to quantify the performance differences between dense and sparse data fusion along the following metrics:
- Runtime performance
- Semantic segmentation metrics (e.g. accuracy, mIoU)
- Online pre- and post-processing time
If you have any ideas for a continuation of this work that you would like to share, please leave a response below!
Recap and Summary

In this article, we introduced the concept of data fusion as a means of improving the performance of an intelligent system by fusing distinct sources of data. We discussed some example fields and domains where data fusion is used, focusing particularly on applications for self-driving cars. We then dove deep into a data fusion case study with self-driving cars.
In the case study, we walked through why and how data fusion was performed, as well as what made it possible. We then discussed the role data fusion played in an example semantic segmentation system, and concluded with system performance and future work for this case study.
If you would like to learn more about this case study, please check out my GitHub repository on sparse fusion scene segmentation with A2D2.
I hope you enjoyed this article and thanks for reading 🙂 Please follow me for more articles on reinforcement learning, computer vision, programming, and optimization!
Acknowledgments
Thank you to the A2D2 team for open-sourcing their dataset, as well as to the authors of [3] for open-sourcing the PointNet++ neural network architecture.
References
- Triggs, Bill, et al. "Bundle adjustment – a modern synthesis." International workshop on vision algorithms. Springer, Berlin, Heidelberg, 1999.
- Geyer, Jakob, et al. "A2D2: Audi Autonomous Driving Dataset." arXiv preprint arXiv:2004.06320 (2020).
- Qi, Charles R., Li Yi, Hao Su, and Leonidas J. Guibas. "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space." arXiv preprint arXiv:1706.02413 (2017).