ARTIFICIAL INTELLIGENCE | NEWS

This article is the second part of a 4-part series:
2. Training Data. How Does a Car Learn?
On the 19th of August, Tesla hosted one of the most important AI events of 2021: Tesla AI Day (you can watch the entire thing here).
The company's leading researchers and engineers presented the latest developments in hardware, software, AI, robotics, computing, and self-driving cars. The event was focused on attracting potential candidates to work on current and future projects.
The talk was divided into four big sections. I’ll use the same outline to separate the articles of this series:
- Tesla Autopilot. How to make the car fully autonomous by solving vision, planning, and control.
- Training data generation. How to create the large datasets needed to train the networks: Manual labeling, auto labeling, and simulations.
- Project Dojo and D1 chip. The next generation of AI training computers.
- Tesla bot. The promise of an autonomous humanoid robot that would carry out "dangerous, repetitive, boring tasks," Musk said. "In the future physical work will be a choice."
Let’s go for the second part: Training data. How does a car learn?
Disclaimer: Since I couldn't contact Tesla, I'll link to the relevant visuals directly in the YouTube presentation at the corresponding time frames. I recommend clicking the links as you read to get a better grasp of the explanations.
When people first come into contact with Artificial Intelligence, they tend to focus on algorithms. How they recognize pictures of cats and dogs, learn to play chess, or compose music and write poetry amazes people because it feels like magic. There are many kinds of algorithms, but most newsworthy milestones are generated by just one type, the one the media loves so much: neural networks.
People care about what deep neural networks are capable of, but they forget these "black-box" models are nothing more than empty casings without the large datasets that train them into becoming powerful predictors and classifiers.
Practice makes perfect and it’s no different for deep neural nets. How do they do it? Learning from data. It’s only because people have created, organized, cleaned, labeled, and curated large datasets that deep learning has had the chance to become an AI superstar paradigm.
Andrew Ng, a leading deep learning expert, likes to emphasize the importance of data in contrast to algorithms. He rejects the trend of model-centric machine learning. He proposes we should take a more data-centric approach – developers modifying datasets instead of tuning algorithms.
He argues that this shift in focus could give the AI community a better perspective on the importance of the dataset; the sometimes forgotten engine of modern AI.
The challenge of real-world data
Self-driving cars are one of the clearest examples of the importance of data in AI. Not only do the systems that drive the car need incredibly huge amounts of data, but the data needs to be real-world data. To understand just how difficult it is to predict from real-world data, let's see what happens when ConvNets face the ImageNet challenge.
In 2012, Geoffrey Hinton's team won the challenge with a ConvNet model, obliterating their (non-deep learning) rivals by a margin of more than 10 percentage points and achieving 63.30% top-1 accuracy. This was an amazing result at the time, but deep learning has improved a lot since then.
Today, the best ConvNets achieve over 90% top-1 accuracy on the same benchmark. These models are better than humans at classifying images from the ImageNet dataset. Yet, when these exact models are tested on ObjectNet, the real-world counterpart of ImageNet, they suffer a 40–45% drop in performance. They are simply unable to handle the vast complexity of reality.
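To make the gap concrete, here is a minimal sketch (not from Tesla or the cited papers) of how such a comparison is typically run: the same frozen, ImageNet-pretrained classifier is evaluated on two test folders, and the difference in top-1 accuracy is the "real-world" penalty. The folder paths are placeholders, and I'm assuming both sets are laid out in ImageFolder format with indices matching the model's ImageNet classes (in practice, ObjectNet requires a class mapping).

```python
import torch
from torchvision import models, transforms, datasets
from torch.utils.data import DataLoader

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Frozen ImageNet-pretrained classifier; no fine-tuning on either test set.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()

@torch.no_grad()
def top1_accuracy(image_folder: str) -> float:
    """Top-1 accuracy over an ImageFolder-style dataset."""
    dataset = datasets.ImageFolder(image_folder, transform=preprocess)
    loader = DataLoader(dataset, batch_size=64)
    correct, total = 0, 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Same model, two test sets; the accuracy gap is the real-world penalty.
# print(top1_accuracy("imagenet_val/"), top1_accuracy("objectnet_subset/"))
```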
That’s the challenge Tesla faces when training the visual and planner neural networks. The only viable approach is to train the algorithms with enough real-world data so they end up experiencing almost every possible scenario.
The challenge of unusual scenarios
However, because they decided to change from image-space predictions to vector-space predictions (as I explained in the first part of this series), they need vector-space datasets. How can they acquire datasets that contain enough clean and diverse examples? The only option they have is to create the datasets themselves.
For a first approximation, they established a team of 1,000 people who do manual labeling. Because this labeling process is as slow as it is precise, they decided to use their car fleet for auto labeling. The idea is to have the cars drive around, whether on Autopilot or not, capturing videos that are then sent to a supercomputer that labels them for further training.
But is this enough? No. One of the main reasons why FSD is still under development is that the world offers an unlimited number of situations, which could prevent the cars from ever achieving perfect accuracy. No matter how much data the cars learn from, no one can ensure they'll encounter every one of these possibilities in training.
That's why they built a simulation. They can generate rare scenarios that most likely will never occur but still have a very small probability of happening. The cars can then be trained with this data to prepare for the most unexpected situations.
From here, I’ll divide the article into four parts:
- Manual labeling – From 2D images to 4D vectors
- Auto labeling – Humans out-of-the-loop
- Simulation – A video game for Autopilot
- Insights
1. Manual labeling – From 2D images to 4D vectors
One of the first things Tesla engineers realized was that acquiring data from third-party companies wasn't enough: the quality was dubious and the latency high. They couldn't trust others for such a high-stakes mission. Loyal to their principle of vertical integration, they decided to create a "1,000-person in-house data labeling team."
This team is in charge of manually labeling the initial training datasets in vector space. At first, they labeled images one by one, which looks like this. Now, they label 4D vector-space reconstructions, which look much better. Because the vector space is built from the eight cameras and multiple frames all at once, each unit of labeling effort is now worth 100x what it was previously.
For most problems, manual labeling is the only possible solution. But Tesla engineers knew they'd never get enough data to reliably train the cars through manual labeling alone, not even in vector space. That's why they looked for ways to leverage the power of computers and the advantage of having a distributed fleet. They designed an auto labeling pipeline to complement the team of manual labelers, who would check for noise and errors afterward.
2. Auto labeling – Humans out-of-the-loop
They bet on removing humans from the loop. Although Tesla's auto labeling process is more like "semi-auto labeling," for the most part the cars can train without human intervention. To generate auto-labeled data, they devised a three-step pipeline (sketched in code after the list):
- Capturing clips: They collect a clip that consists of different pieces of data – videos, IMU, GPS, odometry, etc. The cars drive around, capturing the data as they would normally but this time with the intention of using it for further training. Whether Autopilot is enabled or not is largely irrelevant for this step. This allows Tesla to use any of its cars, capturing immense amounts of data.
- Feature extraction: This data is passed through offline NNs that extract different types of intermediate features – depth, segmentation masks, point matching, etc. The advantage of using offline networks (as I'll explain later) is that the networks have access not only to past data but also to future data, which improves the accuracy of the labels.
- Labeling: The intermediate features are passed through other algorithms that generate the labels for the data. Auto labeling shines in three cases: reconstructing the road, occlusions, and kinematics.
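The following is a minimal, hypothetical sketch of those three steps. The names (Clip, extract_features, generate_labels) and the stub "networks" are mine, not Tesla's; the point is only to show how a clip flows from capture to labels.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Clip:
    """One captured clip: synchronized camera video plus vehicle telemetry."""
    videos: dict          # camera name -> list of frames
    imu: Any              # inertial measurements
    gps: Any              # positions
    odometry: Any         # wheel / visual odometry
    features: dict = field(default_factory=dict)
    labels: dict = field(default_factory=dict)

# Stand-ins for the offline networks; in reality these are heavy models that
# can look at past AND future frames of the clip.
def run_depth_net(videos): return {cam: "depth_maps" for cam in videos}
def run_seg_net(videos): return {cam: "segmentation_masks" for cam in videos}
def match_points(videos): return "cross-camera point matches"

def extract_features(clip: Clip) -> Clip:
    """Step 2: offline feature extraction (depth, segmentation, point matching)."""
    clip.features["depth"] = run_depth_net(clip.videos)
    clip.features["segmentation"] = run_seg_net(clip.videos)
    clip.features["point_matches"] = match_points(clip.videos)
    return clip

def generate_labels(clip: Clip) -> Clip:
    """Step 3: turn intermediate features into vector-space training labels."""
    clip.labels["road_surface"] = "reconstructed surface + semantics"
    clip.labels["kinematics"] = "positions / velocities of moving objects"
    return clip

# Step 1 (capture) happens in the cars; the clip is then processed offline.
clip = Clip(videos={"front": [], "left": [], "right": []},
            imu=None, gps=None, odometry=None)
labeled = generate_labels(extract_features(clip))
```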
Let’s take some perspective to understand how auto labeling works and why it’s so important. Let’s imagine that our Tesla car passes by a pedestrian. As it moves forward, the front, lateral, and back cameras capture the person and generate single-cam videos from different angles. These videos can then be passed through a neural network that generates a vector space representation of the environment and the pedestrian.
Because the info is now in vector space, the subsequent processing algorithms have a richer perception of the pedestrian than any single camera or frame can provide. Occlusions, illumination, angles, positions… all potential mislabeling factors are reduced or removed. This allows the networks to generate high-confidence labels without the need for a human checker – most of the time. These labels can then be "backpropagated" to the individual cameras, resulting in a completely automated labeling process, with labels at both the image and vector-space levels, for single and multi-camera views.
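As an illustration of that last step, here is a small pinhole-projection sketch (my own, with made-up camera parameters, not Tesla's calibration): a point labeled once in 3D vector space is projected into a camera image, where the resulting pixel inherits the same semantic label.

```python
import numpy as np

def project_to_camera(points_3d: np.ndarray, K: np.ndarray,
                      R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """(N, 3) points in vector space -> (N, 2) pixel coordinates in one camera."""
    cam = R @ points_3d.T + t.reshape(3, 1)   # vector space -> camera frame
    pix = K @ cam                             # camera frame -> homogeneous pixels
    return (pix[:2] / pix[2]).T               # perspective divide

# Made-up intrinsics/extrinsics for one camera.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)

# Two points of a pedestrian labeled once in vector space; the projected
# pixels carry the "pedestrian" class into this camera's image.
pedestrian = np.array([[1.0, 0.5, 10.0], [1.1, 1.6, 10.0]])
print(project_to_camera(pedestrian, K, R, t))
```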
Auto labeling provides a crucial advantage: no number of human labelers could ever generate such a large set of labels. Creating perfect bounding boxes and semantic classes for thousands of different objects, across frames depicting a wide array of possible scenarios, is incredibly laborious.
By following this novel pipeline, Tesla can generate millions of labels in no time. A 1,000-car fleet with eight cameras each, filming at 36 fps for one hour, would produce the striking amount of ~1 billion frames in total, with thousands of objects all labeled with high confidence. It's a game-changer.
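A quick back-of-the-envelope check of that figure, using only the fleet parameters stated above (which are illustrative, not Tesla's actual numbers):

```python
# Assumed parameters, taken from the paragraph above.
cars, cameras, fps, seconds = 1_000, 8, 36, 3_600  # one hour of driving
frames = cars * cameras * fps * seconds
print(f"{frames:,} frames")  # 1,036,800,000 -> roughly 1 billion
```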
Let’s see the cases in which auto labeling makes a difference.
Reconstructing the road
One of the tasks the auto labeling pipeline can perform is reconstructing the road surface. Doing it by hand is tedious and imprecise, so they decided to take advantage of a very recent technique called NeRF (Neural Radiance Fields). The idea is to have a neural network implicitly represent the road surface. Then, they input x, y coordinates to the network, which outputs both the height (z) and various semantics that define what's at that exact location – lane lines, curbs, crosswalks…
These 3D points can be reprojected into the eight cameras, and because the network has the coordinates and the semantic class, it knows where to locate the objects in each camera view. After doing this process for thousands of queries, each single-cam frame would be entirely populated with points referring to the different semantic classes (line, curb, crosswalk…). Each frame populated with points could then be jointly optimized with the image space segmentation of the same scene to generate precise labels of what’s in the road.
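Here is a minimal sketch of the underlying idea: an implicit surface network queried at ground-plane coordinates. This is my own toy PyTorch version, not Tesla's model, and it omits the positional encodings and multi-camera reprojection losses a real NeRF-style system would use; the class list is illustrative.

```python
import torch
import torch.nn as nn

SEMANTIC_CLASSES = ["road", "lane_line", "curb", "crosswalk"]  # illustrative

class RoadSurfaceField(nn.Module):
    """Implicit field: ground-plane query (x, y) -> height z + semantic logits."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + len(SEMANTIC_CLASSES)),  # z + class logits
        )

    def forward(self, xy: torch.Tensor):
        out = self.mlp(xy)
        return out[..., :1], out[..., 1:]   # height, semantic logits

# Query a dense grid of ground-plane points around the car; the resulting 3D
# points (x, y, z) and their classes are what get reprojected into the eight
# camera views to produce per-camera labels.
field = RoadSurfaceField()
xs, ys = torch.meshgrid(torch.linspace(-10, 10, 50),
                        torch.linspace(0, 50, 50), indexing="ij")
queries = torch.stack([xs.flatten(), ys.flatten()], dim=-1)
heights, class_logits = field(queries)
```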
A car could then sweep many miles, generating data on the go that can be used to label entire areas. Repeating this process with several cars going through the same locations from different angles provides super-precise labels, as shown here. After the labels are generated for the surface, a human can, for the first time in the whole process, go into the loop and check them to correct noise.
Apart from the road, this same technique could be used to reconstruct 3D static objects, such as parked cars, houses, walls, etc.
Occlusions and kinematics
As I mentioned above, one crucial advantage of processing the data with offline neural networks is that they have access not only to past data but also to future data. At test time, the cars need to predict the velocity and acceleration of other vehicles to plan the best route. The auto labeling pipeline lets the network generate perfect labels by cheating and looking ahead.
This same trick can be used for occlusions. Partial or temporary occlusions can be resolved by looking ahead at future frames in which the occluded objects appear again. The generated labels include this info, which eases the learning process for the guiding neural networks.
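A toy sketch of why looking ahead helps (my own illustration, not Tesla's code): given an object's track in vector space, future positions let the labeler bridge short occlusions by interpolation and estimate velocity with central differences, something a causal, in-car estimator cannot do.

```python
import numpy as np

def label_kinematics(positions: np.ndarray, dt: float) -> np.ndarray:
    """positions: (T, 2) track in vector space; NaN rows mark occluded frames."""
    t = np.arange(len(positions))
    filled = positions.copy()
    for dim in range(positions.shape[1]):
        col = positions[:, dim]
        occluded = np.isnan(col)
        # Bridge occluded frames using samples BEFORE and AFTER the gap.
        filled[occluded, dim] = np.interp(t[occluded], t[~occluded], col[~occluded])
    # Central differences use one past and one future sample per frame.
    return np.gradient(filled, dt, axis=0)

track = np.array([[0.0, 0.0], [1.0, 0.1], [np.nan, np.nan], [3.1, 0.3], [4.0, 0.4]])
print(label_kinematics(track, dt=0.1))  # velocity labels, occlusion filled in
```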
A practical example of auto labeling
The auto labeling pipeline provides an extremely rich set of labels that include the road surface, static objects, and the kinematics of moving objects – even those occluded. Here's a visual example of what the labels look like. The datasets aren't just created and labeled more efficiently; they're also more precise and richer in information. Tesla wants to generate 1 million clips of this quality to train their fleet.
They recently put the auto labeling process into practice for the "remove the radar" project. One criticism Tesla has received in recent years is that they should never have dismissed the importance of LiDAR-like technology. However, it seems they've managed to improve the predictions in extreme conditions. For this project, they wanted to capture thousands of clips in extreme conditions in which the front cameras were getting filled with noise from the vehicles ahead. In a week, they got 10k clips labeled and radically improved the performance.
3. Simulation – A video game for Autopilot
Apart from manual and auto labeling, Tesla has developed a simulated environment to train Autopilot in special situations. They wanted to recreate the world as realistically as possible so that Autopilot could smoothly transfer its learnings from the simulation to the real world.
What makes the simulation a unique resource is that it starts from vector space, so it has perfect labels. All the information it needs to detect and recognize objects is already imbued in the computer: bounding boxes, kinematics, surfaces, depth, segmentation. There are two main cases in which the simulation reaches where the other two labeling approaches fail.
- Difficult sources: This case refers to scenes so unlikely to happen that no car in the fleet has ever encountered them before – for example, a moose running alongside the cars on the highway. Autopilot needs to know how to react to the most surreal scenes in case they eventually happen.
- Difficult labeling: In a scene where hundreds of people are walking in different directions at different velocities – for example, a city center at rush hour – humans would take many hours to do the labeling, and auto labeling wouldn't work appropriately, introducing too much error into the training data. In the simulation, because the data is already there, providing the labels is trivial.
Now, what did they need to create such a high-quality representation of the real world? Five elements:
- Accurate sensor simulation: What Autopilot sees in the simulation has to be extremely similar to what it’d encounter in the real world. They had to model the properties of the cameras that the real cars use so that the simulation faithfully recreates the environment.
- Photorealistic rendering: The simulation is built with spatiotemporal anti-aliasing to reduce noise. They also use neural rendering techniques, which allow for reliable and controlled generation of video, combining machine learning and computer graphics knowledge.
- Diverse actors and locations: To prevent the networks from overfitting to a very limited set of assets from the simulation, they need a diverse mix of cars, people, scenes, and objects. The same applies to the specific locations that are simulated. In total, they've simulated 2,000+ miles of road.
- Scalable scenario generation: Most of the data in the simulation is created by algorithms rather than artists, which allows for a procedural, scalable creation of new scenarios (see the sketch after this list). They also apply adversarial techniques to stress-test the networks on their weak points.
- Scenario reconstruction (failures): They can recreate scenarios with the auto labeling process so that Autopilot can face its failures again. They take the videos and recreate the scene in the simulation to make Autopilot generate new outcomes until it improves its performance.
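To illustrate the "scalable scenario generation" point, here is a toy procedural sampler, entirely my own, with made-up actor types and parameter ranges: because every actor's class, pose, and velocity are chosen by the generator, the labels exist by construction.

```python
import random

ACTOR_TYPES = ["car", "truck", "pedestrian", "cyclist", "moose"]  # illustrative
WEATHER = ["clear", "rain", "fog", "snow"]

def sample_scenario(seed: int) -> dict:
    """Procedurally sample one scene; every field doubles as a perfect label."""
    rng = random.Random(seed)
    actors = [
        {
            "type": rng.choice(ACTOR_TYPES),
            "position_m": (rng.uniform(-50, 50), rng.uniform(0, 200)),
            "speed_mps": rng.uniform(0, 30),
        }
        for _ in range(rng.randint(1, 100))
    ]
    return {"weather": rng.choice(WEATHER),
            "time_of_day_h": rng.uniform(0, 24),
            "actors": actors}

# Millions of labeled variations are as cheap as a loop over seeds.
scenarios = [sample_scenario(seed) for seed in range(1_000)]
```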
4. Insights
Tesla going its own way
There are other companies working on FSD cars, like Waymo – Google's self-driving spin-off – and Cruise. But none of them has chosen the same path as Tesla. While Musk dislikes LiDAR and any other type of sensor that could make autonomous cars less affordable, other companies bet it won't be possible to achieve FSD without them. In 2019, auto analyst Brad Templeton criticized Tesla's stubbornness in doing what others had discarded years ago.
Yet, the combination of the three approaches Tesla uses to generate large, reliable, precise, and high-quality real-world datasets gives them a huge advantage over their competitors. Why can Tesla leverage such a powerful mix of techniques? The secret sauce is having a huge car fleet already navigating the world.
At the beginning of 2020, Tesla had delivered just short of 1 million vehicles – 80% of them with the latest Autopilot hardware. Tesla's closest competitor, Waymo, had, as of October 2020, just 600 self-driving cars roaming around a 100-square-mile area in Phoenix. The Tesla fleet has driven over 3 billion miles on Autopilot to date, whereas Waymo's fleet has driven around 20 million miles. That's 150x more data in the hands of Tesla – and the gap keeps growing.
It’s hard to know which approach is better – vision-based or combining other sensors – or even if both could eventually provide satisfactory results. What’s clear is that Tesla has the upper hand in terms of numbers and they’ll exploit this resource to its limit. Whether it’s enough to make the technology work in ways others can only dream of is yet to be proved.
Overfitting the world
The fact that Waymo cars drive only around Phoenix highlights an issue that not even Tesla can escape: overfitting. If the captured data corresponds to a specific area, with particular conditions, people, and environments, it will be inherently biased.
There's no other way of capturing real-world data, so this is an unavoidable problem. Until Tesla's – and its competitors' – cars spread everywhere, the companies won't have enough diversity in their datasets to learn from the extreme variability of the world.
A month ago, a user on Twitter pointed out the issue: "Drove around San Francisco on FSD Beta 9.2. […] It definitely seems to work better in California than it does in Rhode Island." To which Elon Musk responded:
Will Tesla consider flawless autonomy in the SF Bay Area to be sufficient to claim FSD is ready? Will they manage to achieve better-than-human self-driving for every road on the planet? The most likely outcome will lie in between those two possibilities. Yet, finding consensus on whether FSD has been achieved or not will be hard. As with every other technology that emerges in Silicon Valley, it’ll get to the most remote parts of the world considerably later – if ever.