
Nauto is a leading provider of advanced driver assistance systems that improve the safety of commercial fleets today and the autonomous vehicles of tomorrow. To that end, we process terabytes of driving data a month, collected by windshield-mounted devices from vehicles around the world. This data is used to continuously improve the models that power our vehicle safety stack, from the real-time predictive collision alerts deployed to our devices, to the safety analytics that run on the cloud. Data drives everything.
Beyond providing immediate safety value to the drivers, our features play the important role of shaping their own evolution.
If we want to improve the vehicle detection powering Forward Collision Warning (FCW), the first place we will look is the false positives triggered by FCW. The same goes for pedestrians and Pedestrian Collision Warning (PCW). Over time, each feature builds out its own tailor-made dataset of relevant examples and interesting edge cases, which are then fed right back into the underlying model during training. This closed loop forms the basis of a very powerful data engine that drives continuous improvement across all our features.
The Problem
The problem is that multiple features are often powered by a single model. For us, a single multi-class object detector powers Tailgating, FCW, and PCW. While this is awesome from a compute standpoint, it complicates training a bit.
Let’s take FCW as an example. Say it’s been deployed for some time, and has already collected a sizable dataset of risky events involving a lead vehicle. We notice that its false positives mostly stem from failing to localize oddly-shaped vehicles, so we sample these events, annotate them with high-quality vehicle bounding boxes, and build a new batch of training data consisting of 50k images.
The catch is that we can’t simply append this new batch of vehicles to our existing multi-class dataset. Doing so would unfairly penalize the model for correctly detecting, say, a pedestrian or a red light in an example that was only labeled for vehicles. The model is ultimately optimized by a loss function that compares its predictions to the ground truth, and that loss function makes no distinction between an object that doesn’t exist and an object that simply wasn’t labeled in the example. So we can’t just concatenate a bunch of partially-labeled datasets together, as that would lead to unfair, confusing training.
Luckily, we have a few options.
Annotate all classes for all datasets.
The most obvious solution is just to fully-label everything by hand. If our model detects pedestrians, vehicles, and traffic signals, then each new batch of training data must be annotated across the entire class taxonomy – regardless of which feature it was sourced from.
The first problem with this is scalability. At Nauto, our core object detector covers over 20 classes across 5 major tasks – and growing. With this approach, we would need to run that single batch of FCW data not just through the vehicle annotation workflow, but through the pedestrian, traffic signal, lane detection, and every other workflow as well.
The total annotation workload scales multiplicatively, as a function of:
#_annotations = (#_examples) x (#_tasks)
This quickly becomes impractical, especially when you consider what happens when we decide to add a new task, like traffic cone detection. Not only does the new traffic cone data need to be annotated with classes from all 5 of our existing tasks, but every single example from our existing dataset must be back-labeled with traffic cones.

The second problem is irrelevancy. Our new batch of 50k images sourced from FCW will likely contain a lot of interesting examples of oddly-shaped vehicles that should significantly improve vehicle detection. But there’s a good chance it won’t contain many pedestrians. And even if it does, it most certainly won’t contain as many relevant edge cases of, say, weirdly-shaped pedestrians – because these images were sourced from FCW, not PCW. Yet, we must run these 50k images through both annotation workflows all the same.
With finite budget and resources for each labeling task, this brute force approach will quickly chew up our entire labeling bandwidth on examples that are irrelevant for all but one of our detection tasks.
And because the volume of total annotations grows multiplicatively, while the number of relevant, task-specific annotations grows linearly, the overall fraction of irrelevant annotations will only grow as we add more tasks over time.
Leverage Auto-Labeling.
A more scalable solution would be to leverage our existing models to fill in missing labels. In our new batch of FCW data, we would only label the relevant vehicle classes by hand, and call upon our collection of existing detectors to fill in the pedestrian and traffic signals classes.
What’s great is that these "auto-labelers" are not limited by any of the constraints our production model faces on the device. They can run on powerful cloud GPUs, they can run offline without latency constraints, and they can leverage data collected before and after a point in time. They can also be specialists; each task can have its own specialized auto-labeler if that means better performance than a multi-task model.

The advantage over the brute-force approach is clear: humans can focus their bandwidth on producing high-quality labels for only the most relevant examples per task, while machines fill in the remaining, less-relevant data. With a budget to label 100k images this month across vehicles and pedestrians, we could label 50k FCW examples with interesting vehicles and 50k PCW examples with interesting pedestrians – rather than spending it all on the former. Our models will take care of the rest.
The disadvantage is that we introduce noise into our labels. While humans aren’t perfect labelers, models are generally much less consistent. With this approach, our dataset will always contain a subset of noisy, machine-made labels.
What’s more, the ratio of noisy labels to clean labels will increase across all tasks with each new task added.
This is because we aren’t actually reducing the multiplicative complexity of the first approach, but simply shifting the load from human to machine. Substitute "human" for "relevant" and "machine" for "irrelevant" and the relationship becomes clear.
Here’s an example. Say we’d like to merge a new cone detection task into our existing vehicle & pedestrian dataset of 100k images, so we add a new batch of 10k images that are hand-labeled with traffic cones. Now, we must machine-label the missing vehicles and pedestrians in the new 10k batch. But we must also do the same for the unlabeled traffic cones across all 100k images of our original dataset. By the end of it all, we’ve increased the number of human labels in our training set – but not without increasing the volume of machine labels by an order-of-magnitude more.
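A quick tally of the numbers above makes the imbalance explicit; the snippet below just reproduces that arithmetic.

```python
# Back-of-the-envelope tally for the cone-merge example above.
existing_images = 100_000   # already hand-labeled for vehicles & pedestrians
new_cone_images = 10_000    # hand-labeled for traffic cones only

human_label_passes = new_cone_images          # 10k images of new human labels
machine_label_passes = (
    new_cone_images * 2      # fill in vehicles + pedestrians on the new batch
    + existing_images * 1    # fill in traffic cones on the entire existing set
)                            # = 120k

print(machine_label_passes / human_label_passes)   # 12x more machine labeling
```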
Use a Partial-Loss function.
Perhaps the most elegant approach is not to mold partially labeled data to be compatible with our loss function, but to mold our loss function to be compatible with partially labeled data.

Such a loss function would need to distinguish between objects that don’t exist and objects that just didn’t happen to be labeled in each example. Then, it would simply backpropagate losses only for the classes that were labeled. Any correct detection of an object that happened to be unlabeled for a given example would not be unfairly penalized.
The clear advantage of modifying the loss function rather than the data is that we shrink the complexity faced by the first two approaches from multiplicative to additive.
With our new batch of FCW data, we can just label the vehicles and be done! When those examples are sampled during training, the loss function will only penalize the model for its vehicle predictions – nothing else. When a new detection task is added, we don’t need to back-fill our existing dataset with new labels. Our dataset is 100% relevant and 100% noiseless.
The only challenge is how to build such a loss function. Surprisingly little literature covers this topic, and even fewer open-source implementations exist. There are also questions to consider, such as how to normalize the loss – which can now vary widely in magnitude, since some examples are minimally labeled and others are densely labeled.
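To make the idea concrete, here’s a minimal sketch of what a class-masked classification loss could look like in TensorFlow. The tensor shapes, the labeled_mask input, and the normalization scheme are illustrative assumptions, not our production implementation.

```python
import tensorflow as tf

def partial_detection_class_loss(y_true, y_pred, labeled_mask, eps=1e-7):
    """Classification loss that only supervises classes labeled in each example.

    y_true:       [batch, anchors, num_classes] one-hot ground truth
    y_pred:       [batch, anchors, num_classes] predicted class probabilities
    labeled_mask: [batch, num_classes], 1.0 where the class was annotated
                  for that example, 0.0 where it was left unlabeled
    """
    # Per-element binary cross-entropy.
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    bce = -(y_true * tf.math.log(y_pred) + (1.0 - y_true) * tf.math.log(1.0 - y_pred))

    # Broadcast the per-example class mask over anchors so predictions for
    # unlabeled classes contribute zero loss, regardless of what the model says.
    mask = tf.expand_dims(labeled_mask, axis=1)          # [batch, 1, num_classes]
    masked = bce * mask

    # Normalize by the number of supervised terms, not the full tensor size,
    # so sparsely-labeled examples aren't drowned out by densely-labeled ones.
    denom = tf.reduce_sum(mask * tf.ones_like(bce))
    return tf.reduce_sum(masked) / tf.maximum(denom, 1.0)
```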
Building a Data-Unification Framework
With all that in mind, my goal back in April was to build a framework to support the fast-growing complexity of our data. Our team at Nauto has always been aggressive about continuous model improvement, but at that time we were especially active – preparing to integrate 3 brand new detection tasks into our core object detector.
Being the owner of that model, I was just starting to realize how quickly my existing workflow would become unscalable. By then I’d already been leveraging auto-labeling to unify our datasets, and had just put together my first working version of a partial-loss function. But I was still doing everything manually. I had no formal process in place to track these partial datasets, the resulting unified datasets, or the models used as auto-labelers. Every time we needed to merge a new batch of labeled data, I scrambled to figure out what needed to be auto-labeled with what, and manually schedule the inference jobs.
What I really wanted was a flexible, user-friendly framework that:
- Supported both partial-loss & auto-labeling.
- Abstracted away the complexity of the unification process.
- Tracked and maintained our ever-growing collection of task-specific datasets, auto-labelers, and unified datasets.
What I ended up putting together looked something like this:

I’ll briefly go over the core pieces:
Component Datasets
These are the partially-labeled, task-specific datasets that form the building blocks of the unification framework. During unification, we can specify not only which tasks to include, but also which batches of which tasks to include.
When a fresh batch of PCW data rolls off our Pedestrian labeling queue, we just register it under the "Pedestrian" section of the config file. It’s now ready to be unified!
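Our actual config lives in an internal format, but a hypothetical registry entry captures the idea (the batch names and paths below are made up):

```python
# Hypothetical component-dataset registry; names and paths are illustrative.
component_datasets = {
    "Pedestrian": {
        "b1": "gs://datasets/pedestrian/b1",   # earlier PCW batches
        "b2": "gs://datasets/pedestrian/b2",
        "b3": "gs://datasets/pedestrian/b3",   # the fresh batch off the labeling queue
    },
    "Vehicle": {
        "b1": "gs://datasets/vehicle/b1",
        "b2": "gs://datasets/vehicle/b2",
    },
}
```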
Auto-Labelers
If auto-labeling is chosen as the unification method, these are the specialist models available for our use. For each task, we must pick one auto-labeler from a collection of trained models, and either use its set of pre-optimized thresholds or override them with new values. During data unification, these thresholds will determine which of the model’s predictions will be used as ground-truth, and which will be discarded.
Notice that for each task we include from the component datasets, we must also pick a corresponding auto-labeler. For instance, we can’t choose Vehicle:[b6] and Pedestrian:[b1–4, b5] as datasets while only choosing a Vehicle auto-labeler, as Vehicle:b6 will have no way of filling in its missing Pedestrian labels.

By design, the most accurate, recently trained models for each task will always be pushed to the top of the stack. Each model will also list the data that it itself was trained on. Both these measures ensure that we are always maximizing the accuracy of machine-labels at any point in time, and that we have full visibility into the original training data that ultimately generated these machine-labels. Notice that we can even use past production models trained on unified data as auto-labelers for the current iteration.
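To make the thresholding step above concrete, here’s a minimal sketch of how an auto-labeler’s predictions might be filtered into ground truth (the Detection structure and threshold values are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    class_name: str
    score: float
    box: tuple  # (xmin, ymin, xmax, ymax), normalized coordinates

def predictions_to_labels(detections, thresholds):
    """Keep only predictions confident enough to be treated as ground truth."""
    return [
        d for d in detections
        if d.score >= thresholds.get(d.class_name, 1.0)  # unknown class -> drop
    ]

# Per-class thresholds: either the model's pre-optimized values or overrides.
vehicle_thresholds = {"car": 0.60, "truck": 0.55, "bus": 0.55}
```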
Compile Data
To compile a new unified dataset, all one must do is specify the desired component datasets to include, the unification method (partial-loss or auto-labeling), the models to be used as auto-labelers should that option be chosen, and the output path. That’s it! The framework will unify the data and dump it out in the form of TFRecords – ready for immediate production model training.
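The real entry point is internal, but a compile request boils down to something like this hypothetical spec (names and paths are made up):

```python
# Hypothetical compile spec; the real interface differs, but the inputs are the same.
compile_spec = {
    "components": {                      # which batches of which tasks to include
        "Vehicle":    ["b1-4", "b6"],
        "Pedestrian": ["b1-4", "b5"],
    },
    "method": "auto_label",              # or "partial_loss"
    "auto_labelers": {                   # only required when auto-labeling
        "Vehicle":    "vehicle_det_v7",
        "Pedestrian": "ped_det_v3",
    },
    "output_path": "gs://unified/fcw-pcw-merge/",   # TFRecords land here
}
```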
What happens under the hood, though, varies drastically depending on which unification method is chosen.
With partial-loss, the component datasets are pretty much just concatenated and spit out. The only caveat is that for each example, the labeled classes are noted down and registered in the TFRecords. That way, when the training loop later pulls up that particular example, it knows to only backpropagate losses for detections belonging to that set of labeled classes – penalizing nothing else.
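A minimal sketch of what that bookkeeping could look like when serializing an example (the feature names here are assumptions, not our actual schema):

```python
import tensorflow as tf

def serialize_partial_example(image_bytes, boxes, classes, labeled_classes):
    """Write one partially-labeled example, recording which classes were labeled.

    labeled_classes: e.g. [b"car", b"truck"] for an FCW batch labeled only for
    vehicles; the training loop reads this field to mask the loss accordingly.
    """
    feature = {
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "image/object/bbox": tf.train.Feature(
            float_list=tf.train.FloatList(value=boxes)),        # flattened boxes
        "image/object/class": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=classes)),
        # The key addition: which classes were actually annotated in this example.
        "image/labeled_classes": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=labeled_classes)),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()
```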

By contrast, auto-labeling is quite a bit more involved. For each component dataset, we must run inference with the auto-labelers for all but one of the tasks – the task the dataset belongs to. Say we’d like to unify 5 component datasets spanning 4 tasks (and therefore 4 auto-labelers). For each component, 3 separate auto-labelers must run inference to generate the missing labels – for a total of 5 x 3 = 15 inference jobs. More generally, the inference workload scales as:
#_inference_jobs = (#_component_datasets) x (#_tasks - 1)
This should feel very familiar, because it is essentially the complexity equation of the fully human-labeled approach restated. What’s incredible to me is that even for models generating thousands of predictions a minute on server-class GPUs, the sheer scale of this workload can already be felt with just a handful of tasks. My most recent compilation of ~100k examples over 5 tasks took several GPU-hours to complete. Just imagine doing that instead with human power!
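Under the hood, scheduling amounts to pairing every component dataset with the auto-labeler of every task it was not hand-labeled for; a rough sketch (names are illustrative):

```python
def schedule_inference_jobs(component_datasets, auto_labelers):
    """Pair each component dataset with the auto-labelers of every task it
    was NOT hand-labeled for."""
    jobs = []
    for dataset_id, labeled_task in component_datasets:
        for task, model_id in auto_labelers.items():
            if task != labeled_task:
                jobs.append((dataset_id, model_id))
    return jobs

# 5 component datasets spanning 4 tasks -> 5 x (4 - 1) = 15 inference jobs.
```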
Results
The first question I had when I started this project – and I’m sure you have by now as well – is which approach leads to a better model? This is actually quite a complex question because of how many variables are at play: the number of tasks relative to the number of component datasets, the accuracy of the auto-labelers, the thresholds used, the implementation of the loss function, and so on. I haven’t had time to run enough controlled experiments to truly answer it, so I won’t pretend to. But I will share some early findings that I found interesting – and that I hope you will as well.
From a theoretical perspective, I was almost certain that partial-loss would strictly outperform auto-labeling. Partial-loss allows for a 100% noiseless, human-labeled dataset which should train a more accurate model than one tainted by noisy machine labels. From a practical standpoint, it also has the benefit of compiling in minutes rather than hours.
This is why I was so surprised to find out the opposite to be true:

The first thing to notice is that using a partial-loss function yields much better accuracy than the default loss when training on a partially-labeled dataset. This was awesome to see for a few reasons. For one, it validates the central problem that this whole framework was conceived to solve: naively training on a dataset of partial labels is suboptimal, as it unfairly penalizes correct predictions of unlabeled objects. From a practical standpoint, it also validates my implementation of the partial loss function – which to this point I’d only minimally tested.
What was interesting to see was that auto-labeling outperformed partial-loss across all three classes. Intrigued, I dug a little deeper:

Here, I’m plotting the "auto-labeled ratio" of each class – the fraction of its ground-truth labels that came from a machine rather than a human. While auto-labeling outperforms partial-loss across all classes, its lead progressively shrinks as the ratio of machine labels increases. For Pedestrian, the class with the lowest fraction of machine labels (8%), the auto-labeling lead is over 2% F1. But for Other Vehicle, a class with a much higher fraction of machine labels (21%), the gap shrinks to 0.67% F1. One hypothesis is that the greater the fraction of auto-labels, the smaller the benefit of having complete labels over partial labels, due to the increasing noise in the ground truth. And the natural question is whether the lead would flip if the auto-label ratio were pushed high enough.

It turns out that the answer is yes! I was able to dramatically boost the auto-label ratio across all classes by expanding the dataset from 2 tasks to 5. And in the new trial, partial-loss ended up outperforming auto-labeling across all classes.
While this finding does add some support to our hypothesis, it also raises some other questions. For instance, why doesn’t the partial-loss lead increase with higher auto-label ratios? Partial-loss has a 4% lead for the Pedestrian class – which has the lowest percentage of auto-labels – and only a 1% lead on the Other Vehicle class – which has more than double the percentage of auto-labels. If our hypothesis is true, then shouldn’t we also see the benefit of partial-loss become more obvious with increasingly noisy datasets?
One explanation is that we’re overlooking an important variable at play: the auto-labeler itself. The noise in an auto-labeled dataset is a function of both the accuracy of the auto-labeler and the fraction of labels it contributes to the total set of ground truth. With that in mind, our findings make a little more sense when we consider that my Pedestrian auto-labeler was much less accurate than my Vehicle auto-labeler. This is partly because pedestrian detection is a much newer task at Nauto, and partly because pedestrians are just a lot harder to localize than vehicles.
We may also be overlooking another relevant variable: the dataset size. The first experimental dataset was fairly small, with Pedestrian having the smallest share of human labels by far, followed by Vehicle, then Other Vehicle. The second dataset was much larger, and most notably contained a huge infusion of Pedestrian labels. An argument could be made that in the first trial, auto-labeling provided an outsized benefit to the label-starved Pedestrian task – increasing the total number of labels and bringing it closer to convergence. By the second trial, that benefit shrank due to a more substantial base of human labels, and in fact became outweighed by the growing amount of noise introduced by the Pedestrian auto-labeler.
Going Forward
It should be obvious by now that there are still a ton of open research questions to explore. Thankfully, such a unification framework makes these kinds of experiments painless to run and easy to track over time. And the great thing is that because the unification process itself is abstracted away, we can tinker with different optimizations under the hood without disrupting or complicating the user workflow. Here are just a few improvements I’d love to implement going forward:
Soft Auto-Labeling
There’s actually a third unification method. Rather than treating every auto-label the same as a human label, we can weight it by its confidence score. This turns object labels from binary [0 or 1] to continuous [0 to 1]: human labels are always given a weight of 1, while machine labels fall in a distribution below that.
The intuition behind this approach is that not all auto-labels are created equal; some will be more confident and thus accurate, while others will be less confident and thus noisy. While the default "hard" auto-labeling will treat all machine predictions above a threshold the same, the "soft" approach will scale each prediction by its likelihood of being accurate. So during training, a model that misses an auto-labeled object weighted at 0.95 will be penalized almost as if it had missed a human-labeled object. A model that misses an auto-label weighted at 0.45, by contrast, will not be penalized nearly as hard, as there’s a higher chance that the label is unreliable.
We may still implement thresholds to filter out junk predictions, but their selection is no longer as critical, since the labels are no longer binary. We may also implement a normalization function when assigning weights, to account for variations in confidence score distributions across different model architectures.
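Here’s a minimal sketch of how per-label weights could be assigned under this scheme; the exact mapping from confidence to weight (and the floor value) is an assumption for illustration.

```python
import tensorflow as tf

def soft_label_weights(scores, is_human, floor=0.1):
    """Assign a per-object loss weight to each ground-truth label.

    scores:   [num_objects] auto-labeler confidence (ignored for human labels)
    is_human: [num_objects] 1.0 for human labels, 0.0 for machine labels
    """
    # Human labels always get full weight; machine labels are scaled by their
    # confidence, with a small floor so kept labels never vanish entirely.
    machine_weight = tf.maximum(scores, floor)
    return is_human + (1.0 - is_human) * machine_weight

# During training, the per-object loss is multiplied by this weight: missing a
# 0.95-confidence auto-label is penalized almost like missing a human label,
# while missing a 0.45-confidence one costs much less.
```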
This approach feels promising, as it allows us to leverage the high-quality, useful annotations produced by our auto-labelers, while baking in a mechanism to mitigate the impact of noisy, hurtful labels.
Caching
Currently, every time a new unified dataset is compiled, it is done from scratch. For auto-labeling, this can be wasteful as our data is expanded incrementally over time; there is a very good chance that most of the scheduled inference jobs for a new compilation have been run before. The solution is to simply cache each inference job as a building block of fully-annotated data – using the IDs of the auto-labelers and the component dataset as a signature. If that exact same signature is scheduled in the future, we just pull from the cache instead of re-running inference.
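A hypothetical signature could be as simple as hashing the pair of IDs together:

```python
import hashlib

def inference_job_signature(component_dataset_id: str, auto_labeler_id: str) -> str:
    """Deterministic cache key for one inference job (scheme is illustrative)."""
    key = f"{component_dataset_id}::{auto_labeler_id}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# If this exact signature has been computed before, pull the cached annotations
# instead of re-running inference.
sig = inference_job_signature("pedestrian/b5", "vehicle_det_v7")
```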
At the cost of storing a handful of text annotations, such a cache could save us a ton of time – shrinking the turnaround to compile a new auto-labeled dataset from hours to minutes.
Parallelization
Likewise, the auto-labeling process can easily be parallelized across many GPUs – either by batch or by inference job. This should be self-explanatory and simple to implement, so I won’t go into detail. Again, the benefit is shrinking down the compilation time so that training can begin as soon as possible.
Final Thoughts
I initially started this project just to automate a manual task that was quickly becoming unscalable: unifying a growing number of task-specific datasets into something our model could be trained on. Looking back now, there is no way I would have been able to support the launch of 3 new detection tasks without investing in such a framework 6 months ago. In that sense, I already see this investment as having paid itself back.

Looking forward, my hope is for this unification framework to support the Active Learning engine that drives our ability to continuously improve existing features and launch new ones. In this expanded loop, we iterate not only on the data we collect and the models we deploy, but also on the process by which we unify that data, and the models behind the scenes that power that process. What’s key is that this process is robust, self-documenting, and inherently iterative. Nauto must be able to continue expanding its safety offerings and AI capabilities at an aggressive pace – unfettered by scaling and complexity concerns.
In the field of AI, continuous improvement is a necessity, and data is the core of it all. Having an engine that drives both efficient data collection and data curation is crucial. When I started my career, I assumed AI was about building the best models. I then started to realize it’s about building the best data. But I’m now coming to realize that it’s actually about building the best processes. Because once you have a process that makes iteration effortless, everything else follows.