How the hell are GPUs so fast? An HPC walk through Nvidia CUDA GPU architectures, from the beginning to today.

Adrian PD
Towards Data Science
11 min read · Oct 6, 2020


Someone once defined machine learning as the perfect harmony among maths (algorithms), engineering (High Performance Computing) and human ability (experience). Any progress in any of these fields helps machine learning grow. Today it is HPC's turn; specifically, we are talking about advances in GPUs.

Nvidia just announced its GeForce RTX 30 Series (RTX 3090, RTX 3080, RTX 3070), based on the Ampere architecture. Ampere is the latest architecture from our favorite GPU brand, but several generations of CUDA-capable GPUs have been released so far. In the following paragraphs, I will give a global overview of the CUDA architectures from the beginning up to today; let's drive the interesting road from Fermi to Ampere together. Before going into further details, I strongly recommend visiting my previous post about the CUDA execution model if you are not familiar with GPU computing.

Following the natural timeline of Nvidia GPUs, the company's first chip capable of programmable shading arrived in 2001 with the GeForce 3, a derivative of which powered the original Xbox. Prior to the GeForce 3 (codename NV20) there were several others: NV1 (1995), NV3 (1997), NV4 (1998), NV5 (1999), the original GeForce (late 1999) and the GeForce 2 (2000). However, the GeForce 3 was arguably the first widely popular Nvidia GPU.

It is interesting to point out the difference between target categories and architectures in the Nvidia world, which can be confusing for readers. Traditionally, Nvidia has designed a different product line for each target category of customers, giving rise to four different lines: GeForce, Quadro, Tesla and (more recently) Jetson, although the underlying architecture is the same across all four. In Nvidia's words, all four have the same compute capability. The GeForce line is focused on desktops and gamers; Quadro is aimed at workstations and developers who create video content; Tesla is designed for supercomputers and HPC. Finally, the Jetson line embeds GPUs in system-on-chip modules.

As we just saw, Nvidia started its adventure in the 90s with GPUs focused on graphics, but we had to wait until 2007 for the first CUDA architecture: Tesla (yes, you read that right, they later reused the architecture's name for a product line, which is why I said it can be confusing). Tesla is a fairly simple architecture, so I decided to start directly with Fermi, which introduced Error-Correcting Code (ECC) memory and substantially improved context switching, the memory hierarchy and double-precision performance.

Fermi Architecture

Each Fermi Streaming Multiprocessor (SM) is composed of 32 CUDA cores (Streaming Processors), 16 load/store units (LD/ST) to handle memory operations for sixteen threads per clock, four special function units (SFU) to execute transcendental mathematical instructions, a memory hierarchy and warp schedulers.

Fermi Streaming Multiprocessor (Image by author)

The board has six 64-bit memory partitions with a 384-bit memory interface which supports up to 6 GB of GDDR5 DRAM. The CPU is connected to the GPU via a PCIe bus. Each CUDA core has a fully pipelined arithmetic logic unit (ALU) as well as a floating-point unit (FPU). To execute double precision, the 32 CUDA cores can operate as 16 FP64 units. Each SM has two warp schedulers, which enable two warps to be issued and executed concurrently.
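To ground the terminology, here is a minimal sketch of how work maps onto these resources: a vector-add kernel where each thread processes one element, and the hardware groups every 32 consecutive threads of a block into a warp that the SM's warp schedulers issue to its CUDA cores. The kernel name and array size are arbitrary choices for illustration.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element; threads are grouped into warps of 32 by the hardware.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) h_a[i] = 1.0f;

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_a, bytes, cudaMemcpyHostToDevice);

    // 256 threads per block = 8 warps per block.
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // expected: 2.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_c;
    return 0;
}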

A key block of this architecture is the memory hierarchy. It introduces 64 KB of configurable shared memory and L1 cache per SM, which can be split as 16 KB of L1 cache with 48 KB of shared memory, or 48 KB of L1 cache with 16 KB of shared memory. Whereas a CPU L1 cache is designed for both spatial and temporal locality, the GPU L1 is only optimized for spatial locality: frequent accesses to a cached L1 memory location do not increase the probability of hitting the data, but the cache pays off when several threads access adjacent memory locations. The 768 KB L2 cache is unified and shared among all SMs, and it services all operations (load, store and texture). Both caches are used to store data in local and global memory, including register spilling. However, it is necessary to configure whether reads are cached in both L1 and L2, or in L2 only.

This architecture is labeled compute capability 2.x. Compute capability is the Nvidia term describing the hardware version of the GPU; it comprises a major revision number (left digit) and a minor revision number (right digit). Devices with the same major revision number belong to the same core architecture, whereas the minor revision number corresponds to an incremental improvement to the core architecture.

Fermi Memory Hierarchy (Image by author)
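As a concrete illustration, the sketch below (standard CUDA runtime calls, with a dummy kernel just for the example) queries the compute capability of device 0 and requests the 48 KB L1 / 16 KB shared split for that kernel; on GPUs without a configurable split the preference is simply ignored.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() {}   // placeholder kernel for the example

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);

    // Prefer a 48 KB L1 cache / 16 KB shared memory split for this kernel.
    cudaFuncSetCacheConfig(dummyKernel, cudaFuncCachePreferL1);
    dummyKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}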

Kepler Architecture

Kepler includes up to 15 SMs and six 64-bit memory controllers. Each SM has 192 single-precision CUDA cores, 64 double-precision units, 32 SFUs, 32 LD/ST units and 16 texture units.

Kepler Streaming Multiprocessor (Image by author)

Each SM also has four warp schedulers, each with two dispatch units, allowing four warps to be issued and executed concurrently. Kepler also increases the number of registers accessible by each thread, from 63 in Fermi to 255; it introduces the shuffle instructions and improves atomic operations by adding native support for FP64 atomics in global memory. It also introduces CUDA Dynamic Parallelism, the ability to launch kernels from within a kernel. The memory hierarchy is organized similarly to Fermi.
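To illustrate the shuffle instructions, here is a minimal sketch of a warp-level sum reduction that exchanges registers directly between lanes without touching shared memory (shown with the modern __shfl_down_sync spelling introduced in CUDA 9; Kepler-era code used __shfl_down).

#include <cstdio>
#include <cuda_runtime.h>

// One warp sums its 32 input values using register-to-register shuffles.
__global__ void warpSum(const float* in, float* out) {
    float v = in[threadIdx.x];                       // one value per lane
    for (int offset = 16; offset > 0; offset >>= 1)  // tree reduction across 32 lanes
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (threadIdx.x == 0) *out = v;                  // lane 0 ends up with the total
}

int main() {
    float h_in[32], h_out = 0.0f, *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warpSum<<<1, 32>>>(d_in, d_out);                 // a single warp
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_out);                     // expected: 32.0

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}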

Kepler Memory Hierarchy (Image by author)

The 64 KB shared memory/L1 cache is improved by additionally permitting a 32 KB/32 KB split between L1 cache and shared memory. Kepler also increases the shared memory bank width from 32 bits in Fermi to 64 bits, and introduces a 48 KB Read-Only Data cache for data that stays read-only for the lifetime of a kernel. The L2 cache is increased to 1536 KB, doubling Fermi's L2 capacity. Kepler's compute capability is represented by the 3.x code.
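A hedged sketch of how code can take advantage of that read-only path: marking a pointer const __restrict__, or loading through the __ldg intrinsic, tells the compiler the data is read-only for the kernel's lifetime, so loads can be routed through the Read-Only Data cache on Kepler and later GPUs. The kernel below is launched just like the earlier vector-add example.

// Scales an array; the input is read through the read-only data cache.
__global__ void scale(const float* __restrict__ in, float* __restrict__ out,
                      float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * __ldg(&in[i]);   // explicit load through the read-only cache
}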

Maxwell Architecture

Maxwell consists of up to 16 SMs and four memory controllers. Each SM has been reconfigured to improve performance per watt. It contains four warp schedulers, each capable of dispatching two instructions per warp every clock cycle. The SM is partitioned into four processing blocks of 32 CUDA cores each, with eight texture units, eight SFUs and eight LD/ST units per block.

Maxwell Streaming Multiprocessor (Image by author)

Regarding the memory hierarchy, Maxwell features 96 KB of dedicated shared memory (although each thread block can only use up to 48 KB), while the L1 cache is shared with the texture caching function. The L2 cache provides 2048 KB of capacity. Memory bandwidth is also increased, from 192 GB/s in Kepler to 224 GB/s, and native support is introduced for FP32 atomics in shared memory. Maxwell is represented as compute capability 5.x.

Maxwell Memory Hierarchy (Image by author)
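As a sketch of where those shared-memory atomics help, the kernel below accumulates a per-block partial sum with atomicAdd on a shared-memory float before issuing a single global atomic per block. The pattern is valid on any atomicAdd-capable GPU, but the shared-memory part gets noticeably cheaper from Maxwell onwards.

// Each block reduces its inputs into one shared accumulator, then one
// global atomic per block merges the partial sums. The host must
// zero-initialize *out before the launch.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float acc;                       // per-block accumulator
    if (threadIdx.x == 0) acc = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&acc, in[i]);          // shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0) atomicAdd(out, acc);  // one global atomic per block
}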

Pascal Architecture

A Pascal board is composed of up to 60 SMs and eight 512-bit memory controllers. Each SM has 64 CUDA cores and four texture units. It has the same number of registers as Kepler and Maxwell, but provides many more SMs and thus many more registers overall. It has been designed to support many more active warps and thread blocks than previous architectures. The shared memory bandwidth is doubled to execute code more efficiently, load/store instructions can overlap to increase floating-point utilization, and warp scheduling is improved, with each warp scheduler capable of dispatching two warp instructions per clock. CUDA cores are able to process both 16-bit and 32-bit instructions and data, which benefits deep learning programs, and each SM also provides 32 FP64 CUDA cores for numerical programs. Native global memory support is also extended to include FP64 atomics.

Pascal Streaming Multiprocessor (Image by author)
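Two of those features sketched in code, compiled for the right architecture (e.g. nvcc -arch=sm_60 or newer): half2 arithmetic packs a pair of FP16 values per register, and atomicAdd on double in global memory is natively supported from compute capability 6.0.

#include <cuda_fp16.h>

// y = a*x + y on packed FP16 pairs (requires compute capability 5.3+).
__global__ void halfAxpy(const __half2* x, __half2* y, __half2 a, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(a, x[i], y[i]);   // fused multiply-add on two halves at once
}

// Accumulates doubles with the native FP64 global atomic (sm_60+).
// The host must zero-initialize *out before the launch.
__global__ void sumDoubles(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(out, in[i]);
}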

The memory hierarchy configuration also changes. Each memory controller is attached to 512 KB of L2 cache, giving 4096 KB of L2 cache in total, and Pascal introduces HBM2 memory with a bandwidth of 732 GB/s. It provides 64 KB of shared memory per SM, and an L1 cache that can also serve as a texture cache and acts as a coalescing buffer to increase warp data locality. Its compute capability is represented with the 6.x code.

Pascal Memory Hierarchy (Image by author)

Finally, Pascal introduces the NVLink technology. The idea behind it is that 4-GPU and 8-GPU system configurations can work on the same problem. Groups of such multi-GPU systems can even be interconnected with InfiniBand and 100 Gb Ethernet to form much larger and more powerful systems.
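A minimal sketch of a multi-GPU building block that benefits from NVLink: enabling peer access between two devices so that copies (and kernel loads/stores) between their memories go directly over the GPU-to-GPU link when the topology allows it.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can GPU 0 reach GPU 1 directly?
    printf("peer access 0 -> 1: %s\n", canAccess ? "yes" : "no");
    if (!canAccess) return 0;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);            // enable access to device 1 (flags must be 0)
    float* buf0; cudaMalloc(&buf0, 1 << 20);

    cudaSetDevice(1);
    float* buf1; cudaMalloc(&buf1, 1 << 20);

    // Device-to-device copy; with peer access it avoids staging through host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, 1 << 20);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}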

Volta Architecture

A Volta board has up to 84 SMs and eight 512-bit memory controllers. Each SM has 64 FP32 CUDA cores, 64 INT32 CUDA cores, 32 FP64 CUDA cores, 8 Tensor Cores for deep learning matrix arithmetic, 32 LD/ST units and 16 SFUs. Each SM is divided into four processing blocks, each containing a new L0 instruction cache, which provides higher efficiency than the previous instruction buffers, and a warp scheduler with a single dispatch unit, as opposed to Pascal's two-partition setup with two dispatch ports per sub-core warp scheduler. This means that Volta loses the capability to issue a second, independent instruction from a thread within a single clock cycle.

Volta Streaming Multiprocessor (Image by author)

A merged 128 KB L1 data cache/shared memory is introduced, of which up to 96 KB can be used as shared memory. The HBM2 bandwidth is also improved, reaching 900 GB/s. Additionally, the full GPU includes a total of 6144 KB of L2 cache, and its compute capability is represented with the 7.0 code.

Volta Memory Hierarchy (Image by author)
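Because the default per-block limit stays at 48 KB, a kernel has to opt in explicitly to use the larger shared-memory carve-out of that merged array. Here is a minimal sketch using the standard cudaFuncSetAttribute call; the kernel itself is just a placeholder.

#include <cuda_runtime.h>

__global__ void bigSharedKernel(float* out) {
    extern __shared__ float buf[];        // dynamically sized shared memory
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

int main() {
    const int smemBytes = 96 * 1024;      // 96 KB, beyond the default 48 KB cap

    // Opt this kernel into the larger dynamic shared memory limit (Volta and later).
    cudaFuncSetAttribute(bigSharedKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, smemBytes);

    float* d_out;
    cudaMalloc(&d_out, 1024 * sizeof(float));
    bigSharedKernel<<<1, 1024, smemBytes>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}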

However, the biggest change comes from its independent thread scheduling. Previous architectures execute warps in SIMT fashion, where a single program counter is shared among the 32 threads. In the case of divergence, an active mask indicates which threads are active at any given time, leaving some threads inactive and serializing execution across the different branch paths. Volta instead keeps a program counter and call stack per thread. It also introduces a schedule optimizer that determines which threads from the same warp can execute together in SIMT units, giving more flexibility, as threads can now diverge and reconverge at sub-warp granularity.
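One practical consequence: with independent thread scheduling, diverged threads of a warp are no longer guaranteed to reconverge implicitly, so warp-level data exchange should synchronize explicitly with __syncwarp() (or use the *_sync variants of the warp intrinsics). A small sketch:

// Half of the warp produces values, the other half consumes them.
// Launch with a single warp, e.g. warpExchange<<<1, 32>>>(d_out);
__global__ void warpExchange(int* out) {
    __shared__ int buf[32];
    int lane = threadIdx.x % 32;

    if (lane < 16)
        buf[lane] = lane * 10;      // only the first half-warp writes
    __syncwarp();                   // explicit reconvergence + memory ordering

    if (lane >= 16)
        buf[lane] = buf[lane - 16]; // safe to read the other half's writes
    __syncwarp();

    out[threadIdx.x] = buf[lane];
}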

The breakout feature of Volta is the Tensor Core, which makes the GPU up to 12x faster than the previous Pascal P100 accelerator for deep learning applications. Tensor Cores are essentially arrays of mixed-precision FP16/FP32 units. Each of the 640 Tensor Cores operates on a 4x4 matrix, and their associated datapaths are custom-designed to increase floating-point throughput for operations on such matrices. Each Tensor Core performs 64 floating-point fused multiply-add (FMA) operations per clock, delivering up to 125 TFLOPS for training and inference applications.
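From CUDA, the Tensor Cores are programmed through the WMMA (warp matrix multiply-accumulate) API. The sketch below has one warp multiply a 16x16 FP16 tile pair and accumulate the result in FP32; compile with -arch=sm_70 or newer and launch with one warp, e.g. wmma16x16<<<1, 32>>>(dA, dB, dC).

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B for 16x16 tiles, FP16 inputs, FP32 accumulation.
__global__ void wmma16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);            // start from C = 0
    wmma::load_matrix_sync(a, A, 16);        // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);              // runs on the Tensor Cores
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}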

Additionally, the second generation of NVLink delivers higher bandwidth, more links, and improved scalability for multi-GPU system configurations. Volta GV100 supports up to six NVLink links and a total bandwidth of 300 GB/s, compared to four NVLink links and 160 GB/s of total bandwidth on GP100.

Turing (Not a completely new) Architecture

NVIDIA CEO Jen-Hsun Huang gave an interesting reply about the architectural differences between Pascal, Volta and Turing. Basically, he explained that Volta and Turing have different target markets. Volta is intended for large-scale training, with up to eight GPUs that can be connected together, the fastest HBM2, and other features aimed at datacenters. Turing, on the other hand, is designed with three applications in mind: professional visualization, video gaming, and image generation using the Tensor Cores. In fact, Turing has the same major compute capability as Volta, 7.x, which is why I said that Turing is not a completely new architecture.

The most remarkable achievements are the use of GDDR6 memory and the introduction of RT Cores, which enable rendering visually realistic 3D games and complex professional models:

Turing Streaming Multiprocessor. Source: NVIDIA

Like Volta, the Turing SM is divided into four processing blocks, each with a single warp scheduler and dispatch unit. Turing is almost identical to Volta, executing instructions over two cycles but with schedulers that can issue an independent instruction every cycle. Additionally, rather than per warp like Pascal, Volta and Turing have per-thread scheduling resources, with a program counter and stack per thread to track thread state.

Ampere Architecture

The most recent CUDA architecture is called Ampere and delivers the highest GPU performance so far. Someone else has already written a really complete review here, which I strongly recommend.

Each Ampere SM contains four processing blocks, each with its own L0 instruction cache, a warp scheduler, 16 INT32 CUDA cores, 16 FP32 CUDA cores, 8 FP64 CUDA cores, 8 LD/ST units, a Tensor Core for matrix multiplication, and a 16K 32-bit register file. Each SM has 192 KB of combined shared memory and L1 data cache, and at the GPU level there is 40 MB of L2 cache to increase performance (7x larger than the V100's in Volta). The L2 cache is split into two partitions to achieve higher bandwidth.

Ampere Streaming Multiprocessor. Source: NVIDIA.

The GA100 GPU chip is composed of 128 SMs, but mainly for manufacturing (and marketing) reasons, the different Ampere GPUs only enable some of them. For example, the A100 GPU exposes only 108 SMs. In any case, the full GA100 is composed of 8 GPCs, each with 16 SMs, and 6 HBM2 stacks. In the case of the A100 GPU, this translates into 40 GB of HBM2 DRAM at 1555 GB/s.
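A quick way to see how these figures translate to the part actually installed in your machine is to query the runtime; the sketch below prints a few of the properties discussed above for device 0.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("%s: compute capability %d.%d\n", p.name, p.major, p.minor);
    printf("SMs: %d\n", p.multiProcessorCount);
    printf("L2 cache: %d KB\n", p.l2CacheSize / 1024);
    printf("Shared memory per SM: %zu KB\n", p.sharedMemPerMultiprocessor / 1024);
    printf("Memory bus width: %d bits\n", p.memoryBusWidth);
    return 0;
}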

It also introduces the third generation of NVIDIA Tensor Cores (after Volta and Turing), each able to compute an 8×4×8 mixed-precision matrix multiplication per clock (multiplying an 8×4 matrix by a 4×8 matrix). For example, each A100 Tensor Core executes 256 FP16 FMA (fused multiply-add) operations per clock. The Ampere Tensor Cores support many data types, including FP16, BF16, TF32, FP64, INT8, INT4 and binary.

Finally, the third generation of NVLink arrives with Ampere. The A100 GPU has 12 NVLink links with 600 GB/s of total bandwidth for multi-GPU computing. Regarding PCIe, the A100 supports PCIe Gen 4, which provides 31.5 GB/s of bandwidth per direction (for x16 connections), doubling the bandwidth of PCIe Gen 3.

References

Fermi whitepaper

Kepler whitepaper

Maxwell whitepaper

Pascal whitepaper

Volta whitepaper

Turing whitepaper

Ampere whitepaper

Parallel algorithms on a GPU
