
The silicon brain of tomorrow’s car: how is it made?

Neural Network Accelerators for Autonomous Vehicles – GPU, FPGA, or ASIC?

Notes from Industry

Autonomous driving systems are incredibly complex. They tightly integrate multiple state-of-the-art technologies, like perception and decision making. Only carefully designed hardware can support these particularly resource-hungry tasks. Furthermore, autonomous driving is one of the first embedded applications that relies heavily on machine-learning algorithms. Thus, a massive research effort goes into developing neural network accelerators that meet specific requirements, like redundancy and power efficiency.

Photo by Laura Ockel on Unsplash

The development of autonomous vehicles is undoubtedly one of the most challenging tasks in the current Artificial Intelligence field. An autonomous driving machine has to precisely perceive its environment and plan an adequate set of actions to navigate the road safely. It has to deal with a vast spectrum of situations. Road conditions, weather, complex intersections, pedestrians, and other road users are all sources of uncertainty that complicate scene understanding. Nonetheless, this task is critical: a complete understanding of the car's surroundings is necessary to navigate the world safely and efficiently. To achieve this goal, autonomous vehicles are loaded with sensors that collect large amounts of data. However, raw data is of little use on its own; it has to be analyzed.

Given the complexity of the task, scene understanding requires learning algorithms and, most notably, neural networks. Note that training such algorithms can take up to several days on the most advanced hardware platforms, such as multiple powerful Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs) [1]. This task is obviously not performed while driving. Embedded hardware only has to compute the forward propagation of data through neural networks, also called inference. Still, inference is resource-intensive, especially since a high refresh rate must be sustained to keep perception delays low. Fortunately, forward propagation in most neural networks boils down to dot products, which are heavily parallelizable operations. Central Processing Units (CPUs) are thus clearly not suited, due to their limited number of cores. Nowadays, GPUs are used in many machine learning applications, both for training and inference.
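To see why, consider a minimal NumPy sketch of forward propagation through a toy fully connected network; the layer sizes are arbitrary and chosen only for illustration:

```python
import numpy as np

# Minimal sketch: forward propagation through a small fully connected
# network reduces to a chain of matrix products (dot products) plus
# cheap element-wise activations. Layer sizes here are arbitrary.
rng = np.random.default_rng(0)

x = rng.standard_normal((1, 512))    # one input sample
W1 = rng.standard_normal((512, 256)) # layer 1 weights
W2 = rng.standard_normal((256, 10))  # layer 2 weights

h = np.maximum(x @ W1, 0.0)          # dot products + ReLU
y = h @ W2                           # output logits

# Each `@` is a batch of independent multiply-accumulate operations,
# which is exactly what GPUs and NNAs parallelize across their cores.
print(y.shape)  # (1, 10)
```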

Nonetheless, GPUs still lack the efficiency and reliability required for critical embedded systems. For this reason, the industry is developing alternatives. Neural Network Accelerators (NNAs) are hardware systems specialized in the computation of neural networks. We will mainly discuss two of them: Field Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs) like Tesla's FSD computer.

Requirements for autonomous vehicle embedded computers

Today's state-of-the-art autonomous systems mainly rely on computer vision. Indeed, the main computing task of an autonomous vehicle is the extraction of features from images. The system has to make sense of what it perceives. Convolutional Neural Networks (CNNs) have proven to be extremely powerful at this task. They generally consist of multiple layers of convolutions, activation functions, pooling, and deconvolutions. Data flows through all layers of a network to extract interesting information from images or even videos. The storage and computation costs of such algorithms are high. For example, a classical classifier of 224×224 images such as ResNet-152 takes up to 11.3 billion floating-point operations (11.3 GFLOPs) to perform inference and 400 MB of memory to store model parameters [2]. Furthermore, current self-driving cars have multiple cameras with much higher resolution. Take the Tesla Model 3, for example: it has 8 cameras with a resolution of 1280×960, and all 8 video feeds are analyzed in real time. We can easily imagine the tremendous amount of computational power required. However, convolution operations generally account for more than 98% of the operations performed during a CNN inference pass. ReLU and pooling are cheap logic functions; they account for less than half a percent of operations. Since convolution is based on dot products, the hardware must be designed to improve dot-product efficiency, which eventually translates into many parallel multiply/add operations.
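To make these costs concrete, here is a back-of-the-envelope sketch using the standard MAC-counting formula for a convolution layer; the layer shapes are illustrative, not taken from any production network:

```python
# Back-of-the-envelope cost of a single convolution layer:
# MACs = H_out * W_out * C_out * C_in * K * K, and one MAC counts
# as two floating-point operations (a multiply and an add).
def conv_flops(h_out, w_out, c_in, c_out, k):
    macs = h_out * w_out * c_out * c_in * k * k
    return 2 * macs

# Illustrative layer on a 1280x960 camera frame (stride 1, 3x3 kernel,
# 64 input and 64 output channels) -- shapes chosen for the example.
flops = conv_flops(1280, 960, 64, 64, 3)
print(f"{flops / 1e9:.1f} GFLOPs for one layer")  # ~90.6 GFLOPs

# A ReLU over the same output is only H*W*C comparisons:
relu_ops = 1280 * 960 * 64                        # ~0.08 GOP
print(f"conv/relu ratio: {flops / relu_ops:.0f}x")
```

The three-orders-of-magnitude gap between the convolution and the activation is why dot-product throughput dominates NNA design.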

Embedded systems for autonomous vehicles also have specific constraints associated with security, reliability, and real-time requirements. There are three main challenges to overcome:

  • The processing pipeline must be fast enough to digest the large amount of sensor data collected. The faster the system, the more data it can analyze in a given time frame. The refresh rate of the perception system is critical: it must be high enough to let the system react quickly to unexpected situations, even (and especially) at high speed.
  • Single points of failure must be avoided. The system has to be robust enough to recover from a failed component. Not only does the system have to operate with reduced resources, but it also has to detect malfunctions. This issue is generally tackled through redundancy and comparison of results: two independent computing pipelines run simultaneously, and an error is detected when their results do not match (see the sketch after this list).
  • The system must be as energy-efficient as possible. Autonomous vehicles are generally electric-powered, so energy efficiency is critical to achieving long range. Furthermore, high power consumption implies additional weight and cost in thermal management and power supply.
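The following toy sketch illustrates the dual-pipeline comparison idea; it only mimics the principle in software, whereas real systems implement it with independent hardware units, and the pipeline functions, sizes, and tolerance are placeholders:

```python
import numpy as np

# Toy illustration of redundancy: the same inference runs on two
# independent compute paths and a checker compares the outputs.
rng = np.random.default_rng(42)
WEIGHTS = rng.standard_normal((64, 8))

def pipeline_a(frame):
    return np.tanh(frame @ WEIGHTS)         # computing unit A

def pipeline_b(frame):
    return np.tanh(frame @ WEIGHTS)         # computing unit B

def safe_inference(frame, tol=1e-6):
    a, b = pipeline_a(frame), pipeline_b(frame)
    if not np.allclose(a, b, atol=tol):     # mismatch => fault detected
        raise RuntimeError("redundancy check failed; enter safe state")
    return a

frame = rng.standard_normal((1, 64))
print(safe_inference(frame).shape)          # (1, 8)
```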

The three main types of computation platforms

Graphics processing unit (GPU)

The Nvidia Drive platform currently leads the market of GPU-based embedded systems for autonomous driving. This general-purpose computing solution integrates the Drive software stack, designed to let automakers focus on the software implementation of their autonomous driving solutions. The latest, and most powerful, iteration of the DrivePX architecture features two Tegra X2 SoCs. Each chip contains four ARM A57 CPUs coupled with a Pascal GPU. Both GPUs have dedicated memory and optimized instructions for DNN acceleration. To accommodate the large amount of transferred data, each Tegra is directly connected to its Pascal GPU through a PCI-E Gen2 ×4 bus with a total bandwidth of 4.0 GB/s. The optimized input/output architecture and DNN acceleration allow each Tegra chip to perform 24 TFLOP/s.
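The quoted 4.0 GB/s figure can be reconstructed with standard PCIe arithmetic, assuming Gen2's 5 GT/s per lane and 8b/10b line encoding:

```python
# Reconstructing the quoted PCI-E Gen2 x4 bandwidth figure.
# Gen2 signals at 5 GT/s per lane with 8b/10b encoding, so each lane
# carries 4 Gb/s = 0.5 GB/s of payload per direction.
lane_rate_gts = 5.0              # gigatransfers per second (Gen2)
encoding_efficiency = 8 / 10     # 8b/10b line coding
lanes = 4

per_direction = lane_rate_gts * encoding_efficiency * lanes / 8  # GB/s
total = 2 * per_direction        # full duplex: both directions
print(f"{per_direction:.1f} GB/s per direction, {total:.1f} GB/s total")
# -> 2.0 GB/s per direction, 4.0 GB/s total
```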

However, such a system consumes up to 250 W. That level of power consumption becomes a deal-breaker for embedded systems, so much so that even the GPU specialist is incorporating ASICs in its new platform, available for production in 2022. The Nvidia Drive AGX Orin is announced to deliver 200 TFLOP/s through a combination of six different types of processors: CPU, GPU, Deep Learning Accelerator (DLA), Programmable Vision Accelerator (PVA), Image Signal Processor (ISP), and a stereo/optical-flow accelerator.

Field-programmable gate array (FPGA)

In recent years, FPGAs have become an excellent option for algorithm acceleration. Unlike a CPU or GPU, an FPGA is configured specifically for the targeted algorithm, so the task at hand is executed with much higher efficiency. One can hardly estimate an FPGA's floating-point performance without empirical measurement, but several TFLOP/s for a few tens of watts of consumption is easily achievable. A pure FPGA must work with a host system through PCIe connections to be fed data, in our case images and other sensor outputs. The FPGA is usually used only as a neural network accelerator for inference purposes. The chip is configured according to the neural network structure, and model parameters are stored in memory. FPGA internal memories rarely exceed a few hundred megabits, which is too small to store most CNN parameters, so external memory like DDR SDRAM is needed. The bandwidth and power consumption of such external memory are a bottleneck to the high system performance required.

Still, high-end FPGAs can achieve good performance for our application. Take, for example, the Zynq UltraScale MPSoC from Xilinx. It is designed with autonomous driving tasks in mind and can outperform the Tesla K40 GPU by more than three times in both speed and energy efficiency (14 FPS/W vs. 4 FPS/W) [3] when running CNN inference. It reaches 60 FPS on a live 1080p video stream for object tracking tasks. Several hardware-level techniques are used when designing neural network accelerators on FPGAs to achieve high performance and high efficiency. Computation unit designs are particularly critical. Indeed, the low-level components (gates, flip-flops) available on an FPGA are limited: the smaller each computation unit, the more units fit on the chip and the higher the peak performance. Furthermore, the working frequency can be increased for a carefully designed computation unit array. Guo et al. [3] describe three main techniques to improve performance by optimizing computation unit designs:

  • Low Bit-width Computation Unit: The bit-width of input data directly impacts the size of computation units. The smaller the bit-width, the smaller the computation unit. Most state-of-the-art FPGA designs for neural network applications replace 32-bit floating-point units with fixed-point units. While 16-bit units are widely adopted, good results can be achieved with units as narrow as 8 bits [5]. In general, CNNs, and neural networks at large, are very tolerant of reduced precision [4] (a minimal quantization sketch follows this list).
  • Fast Convolution Methods: The convolution operation can be accelerated through a wide range of algorithms. For example, the Discrete Fourier Transform or the Winograd method can yield massive performance gains (4×) [3] for reasonable kernel sizes.
  • Frequency Optimization Methods: Routing between on-chip SRAM and Digital Signal Processing (DSP) units may limit the peak working frequency. The DSP units can run in a separate, faster clock domain, with neighboring slices used as local RAMs. This technique is implemented as part of the CHaiDNN-v2 project from Xilinx [5], doubling the peak working frequency.
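As a rough illustration of the first technique, here is a minimal sketch of symmetric linear quantization to 8-bit fixed point; real FPGA toolchains calibrate scales per layer and quantize activations as well:

```python
import numpy as np

# Minimal sketch of symmetric linear quantization to 8-bit fixed point,
# the kind of bit-width reduction FPGA designs use to shrink
# computation units.
def quantize(w, bits=8):
    qmax = 2 ** (bits - 1) - 1               # 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
q, s = quantize(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"mean absolute quantization error: {err:.4f}")
```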

The ZynqNet FPGA Accelerator [6] is a fully functional proof-of-concept CNN accelerator that implements these techniques and more. As its name suggests, this framework was developed for Xilinx Zynq boards. It accelerates CNN inference with nested-loop algorithms that minimize the number of arithmetic operations and memory accesses.
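For reference, the loop nest that such accelerators reorder, tile, and unroll looks like the naive convolution below; this is only the reference computation, not ZynqNet's actual HLS code:

```python
import numpy as np

# Naive nested-loop convolution -- the loop nest that FPGA accelerators
# like ZynqNet restructure in hardware.
def conv2d(x, w):
    c_in, h, wd = x.shape
    c_out, _, k, _ = w.shape
    out = np.zeros((c_out, h - k + 1, wd - k + 1), dtype=x.dtype)
    for co in range(c_out):                  # output channels
        for y in range(out.shape[1]):        # output rows
            for xx in range(out.shape[2]):   # output columns
                for ci in range(c_in):       # input channels
                    for ky in range(k):      # kernel rows
                        for kx in range(k):  # kernel columns
                            out[co, y, xx] += x[ci, y + ky, xx + kx] * w[co, ci, ky, kx]
    return out

x = np.random.default_rng(0).standard_normal((3, 8, 8)).astype(np.float32)
w = np.random.default_rng(1).standard_normal((4, 3, 3, 3)).astype(np.float32)
print(conv2d(x, w).shape)  # (4, 6, 6)
```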

Application-specific integrated circuit (ASIC) – Tesla FSD computer

Application-Specific Integrated Circuits allow complete flexibility in hardware implementation. Specific requirements can thus be met to achieve extremely high performance on a specific task. This solution is undoubtedly the way to go for car manufacturers with sufficient resources to develop such complex systems. While it offers the best possible performance for autonomous vehicles, development time and cost are equally massive.

"I asked Elon [Musk] if he was willing to spend all the money it takes to do full custom design, and he asked, ‘are we going to win?’. I said, ‘Yeah, of course!’ so he said, ‘I’m in!’." – Pete Bannon, VP of Silicon Engineering at Tesla during Autonomy Day, 2019.

The Tesla Full Self-Driving (FSD) computer was presented during Autonomy Day on 22 April 2019. It has shipped in all Tesla cars since its release and has proven to work exceptionally well. According to Elon Musk, the FSD computer will eventually power a level-5 autonomous driving system. This ASIC answers the following requirements:

  • The computer must operate under 100 W.
  • It must handle at least 50 TFLOP/s for the neural network models.
  • A modest GPU remains necessary for data pre- and post-processing, although with the advancement of software and artificial intelligence techniques, classical algorithms running on general-purpose hardware might become obsolete.
  • Safety and security are critical in the design. The probability of failure has to be lower than the probability of a human driver losing consciousness at the wheel. This requirement implies complete hardware redundancy: two power supplies and two independent computing units per FSD computer.
  • Each image and set of data is processed independently (batch size of one) to reduce latency.
  • Extensive connectivity to accommodate the car's multiple sensors.

Due to the nature of the application, the FSD computer is heavily geared toward image processing. Its interface features a 2.5 Gpixel/s serial input to accommodate the eight cameras surrounding the car, and an LPDDR4 DRAM interface for other sensors like radar. Besides, an independent image signal processor takes charge of noise reduction and tone mapping (bringing out details in shadows), and an external H.265 video encoder module is used for data export. This unusual module is a critical element of Tesla's software development process. Data is the founding stone of most machine learning algorithms, and Tesla's fleet of cars is a massive source of incoming video data for training the self-driving models. The database Tesla has built over the years is a crucial element of its success. Light data processing runs on a GPU that supports both 32- and 16-bit floating point and runs at 1 GHz to achieve 600 GFLOP/s. Twelve ARM CPUs running at 2.2 GHz handle several minor tasks.

Nonetheless, the primary feature of the FSD computer is its Neural Network Accelerator (NNA). For safety purposes, each computer is equipped with two independent NNAs. Each NNA has 32 MB of SRAM to hold temporary results and model parameters; note that DRAM access is far more energy-costly than SRAM (roughly 100×). During each clock cycle, 256 bytes of activation data and 128 bytes of parameters are combined in a 96 × 96 (9,216-unit) multiply/add array per NNA. This dataflow requires at least 1 TB/s of SRAM bandwidth per accelerator. The multiply/add array performs in-place accumulation, executing over 18,000 operations per cycle (each of the 9,216 units performs a multiply and an add). Hence, each NNA delivers roughly 36 TFLOP/s at 2 GHz. However, NNAs are optimized for dot products; nonlinear operations perform poorly, or not at all, on these units, so dedicated modules are added for ReLU and pooling operations.

The current version of Tesla's software requires 35 GOP per image analyzed. The FSD computer can thus analyze a total of about 1,050 frames per second. While running a beta version of the fully autonomous self-driving software, the FSD computer consumes 72 W, of which 15 W goes to the NNAs.
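These figures are consistent with simple arithmetic on the numbers quoted above; since the two NNAs process the same data redundantly, the effective frame rate equals one accelerator's throughput:

```python
# Cross-checking the FSD throughput figures quoted above.
macs_per_cycle = 96 * 96            # 9,216 multiply-accumulates per NNA
ops_per_cycle = 2 * macs_per_cycle  # each MAC = 1 multiply + 1 add
freq_hz = 2e9                       # 2 GHz clock

tops_per_nna = ops_per_cycle * freq_hz / 1e12
print(f"{tops_per_nna:.1f} TOP/s per NNA")        # ~36.9

ops_per_frame = 35e9                # 35 GOP per analyzed image
fps = tops_per_nna * 1e12 / ops_per_frame
print(f"~{fps:.0f} frames/s")       # ~1053, matching the ~1,050 above
```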

System comparison

The evaluation of embedded systems is becoming increasingly difficult due to their complexity. The most effective way to attest improvement is a standard benchmark suite representative of the workloads used in self-driving applications. Benchmark tools fall into two categories: datasets and workload stress tests. KITTI [7] was the first dataset benchmark targeting autonomous driving. It contains a wealth of perception sensor data, such as monocular/stereo images and 3D lidar point clouds. Ground truth is provided with the data to evaluate algorithm performance in multiple self-driving scenarios like lane detection, odometry, object detection, and tracking. Such a dataset can be used to stress a system and evaluate its peak performance on tasks relevant to self-driving. The second class of benchmarks is designed to evaluate novel hardware architectures through a suite of applications and vision kernels. CAVBench [8] is currently a good starting point for evaluating autonomous driving computing systems. It is a suite of applications that evaluates real-world performance by simulating different scenarios based on datasets to create a virtual environment. Multiple workload evaluation tasks are available: object detection, object tracking, battery diagnostics, speech recognition, edge video analysis, and SLAM. This task discretization allows developers to find performance bottlenecks in a system.

Unfortunately, no benchmark or evaluation process is universally adopted across the autonomous driving edge computing community. Still, Guo et al. [3] have been able to compare multiple state-of-the-art neural network inference accelerators.

Figure 1: Comparison of performance and resource utilization of state-of-the-art neural network accelerator designs [3]

Figure 1 compares the computing power and efficiency of different FPGA-based accelerators and GPU accelerators. Generally speaking, FPGAs achieve slightly better energy efficiency than GPUs, in the range of 10–100 GOP/J. However, GPUs still excel in raw speed. The main challenge in increasing the performance of FPGA-based solutions is scalability. Zhang et al. [9] (referenced [76] in Fig. 1) propose an FPGA-cluster-based solution to reach GPU performance. They combined six Virtex-7 FPGAs from Xilinx (device XC7VX690T) using a 16-bit fixed-point design. While this architecture matches GPU computing power, it also shares its low energy efficiency.
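Note that the GOP/J metric in Figure 1 is simply throughput divided by power; a quick sketch with a hypothetical design in the range the article cites:

```python
# Energy efficiency is throughput per watt: GOP/s divided by W gives
# GOP/J. Hypothetical example: a design delivering 3 TOP/s at 30 W,
# i.e. in the "several TFLOP/s for a few tens of watts" range above.
def gop_per_joule(ops_per_s, watts):
    return ops_per_s / 1e9 / watts

print(f"{gop_per_joule(3e12, 30):.0f} GOP/J")  # 100 GOP/J
```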

However, FPGA-based NNAs are a fast-developing field of research, and heavy work is going into optimizing architectures for ever better energy efficiency and computing power. Indeed, GPU-based solutions have already reached a high level of architectural optimization; their performance is now mainly capped by materials' physical limits and manufacturing methods. On the other hand, hardware-oriented solutions still have much room to grow. Even general-purpose processor specialists such as AMD, Intel, and Nvidia now focus their efforts on hardware accelerators.

ASICs remain the best-performing NNAs. For example, Tesla's FSD computer executes a total of 144 TFLOP/s, where the Nvidia Drive architecture reaches a maximum of 24 TFLOP/s on Tesla's autonomous-driving software stack. FPGAs will hopefully catch up with ASICs, whose development requires a lot of engineering effort: not only does the ASIC itself have to be designed, but the whole software architecture must be adapted, and each ASIC NNA requires a specific compiler to deploy neural networks.

Conclusion

Autonomous vehicles are no ordinary systems. They have critical requirements to ensure the safety of the user and their environment. The challenge for self-driving computers is to offer enough computing power, robustness, and energy efficiency. General-purpose processors are not a viable option due to their lack of parallelization capability for efficiently running neural networks. Indeed, the critical characteristic of any NNA is the optimization of the dot-product operations used in CNNs.

The level of optimization varies with the degree of freedom of the platform. The most flexible NNAs are currently GPU-based; with thousands of cores, they perform exceptionally well on neural networks, but this performance comes at the cost of low energy efficiency. FPGAs offer slightly less flexibility, as they must be configured for the target algorithm, but can then offer better performance than GPUs at a similar power consumption level. However, FPGAs are hardly scalable, and heavy work is going into their use as NNAs. Finally, static hardware systems built from the ground up, ASICs, currently offer the best performance from all points of view: robustness, energy efficiency, and computing power.

While ASICs perform incredibly well, only major companies can afford their development. Similarly, FPGAs are still complex systems that need carefully designed algorithms to run efficiently. For these reasons, GPUs are still widely used in autonomous vehicles, but customizable hardware will soon replace them.

Personal thoughts

Autonomous vehicle development is a young field of study. Developers are still exploring ways to teach a machine to drive, so machine learning solutions constantly evolve. While CNNs are currently the norm, promising new architectures like Transformers [10] are emerging in the field of computer vision. Since these solutions dictate the hardware-level requirements, NNAs can hardly be aggressively optimized yet.

Software and NNAs must be developed concurrently to reach the best performance. At the very least, machine learning models can be optimized to make the most of available hardware platforms. Take the challenge of end-to-end driving algorithms [11], for example. This approach can certainly yield better results, but it requires neural networks too complex for current NNA platforms.

On the other hand, splitting the system into a perception module and a decision module connected by human-engineered feature vectors allows the architecture to be partitioned. Tesla currently adopts this second solution for a good reason: it eases computation. Many other neural network optimizations are available. Nevertheless, they tend to be counter-intuitive from the mathematical point of view on which models are built. For example, machine learning engineers prefer a deeper network over large layers because it reduces computation costs on general-purpose processors or GPUs; this is not necessarily the case on custom hardware NNAs (see the sketch below).
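A quick sketch of that last point: counting multiply-accumulates for a wide-shallow versus a deep-narrow fully connected network, with arbitrary illustrative sizes:

```python
# Illustrative MAC count: one wide hidden layer versus a stack of
# narrow ones. Sizes are arbitrary, chosen only to show how depth
# spreads capacity at a lower multiply-accumulate cost.
def mlp_macs(layer_sizes):
    # MACs of the dense layers connecting consecutive sizes
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

wide = [512, 8192, 10]           # shallow and wide
deep = [512, 512, 512, 512, 10]  # deeper, narrower

print(f"wide: {mlp_macs(wide) / 1e6:.1f} M MACs")  # ~4.3 M
print(f"deep: {mlp_macs(deep) / 1e6:.1f} M MACs")  # ~0.8 M
```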


Thanks for reading. Connect with me on LinkedIn to continue the discussion!

References

[1] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. arXiv:1704.04760.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR. arXiv:1512.03385.

[3] Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang, and Huazhong Yang. A Survey of FPGA-Based Neural Network Accelerator. arXiv:1712.08934.

[4] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep Learning with Limited Numerical Precision. arXiv:1502.02551.

[5] Xilinx CHaiDNN-v2 project. https://github.com/Xilinx/chaidnn Accessed: Mar. 21, 2020.

[6] David Gschwend. ZynqNet: An FPGA-Accelerated Embedded Convolutional Neural Network. arXiv:2005.06892.

[7] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3354–3361

[8] Y. Wang, S. Liu, X. Wu, and W. Shi. CAVBench: A benchmark suite for connected and autonomous vehicles, in Proc. IEEE/ACM Symp. Edge Comput. (SEC), Oct. 2018, pp. 30–42

[9] Chen Zhang, Di Wu, Jiayu Sun, Guangyu Sun, Guojie Luo, and Jason Cong. Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster, In Proceedings of the 2016 International Symposium on Low Power Electronics and Design. ACM, 326–331.

[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929.

[11] Jianyu Chen, Zhuo Xu, and Masayoshi Tomizuka. End-to-end Autonomous Driving Perception with Sequential Latent Representation Learning. arXiv:2003.12464.

