
This article is the third part of a 4-part series:
3. Project Dojo: Tesla’s New Supercomputer
The evolution of the GPU
It was back in the ’70s and ’80s that graphics processors – commonly known as GPUs – began to emerge alongside the gaming industry. In the ’90s, with the demand for arcade and console games, companies like Nintendo, Sony, and Fujitsu raced to build better 3D graphics hardware. But it wasn’t until Nvidia popularized the GPU in the early 2000s, and released the GeForce 8 series a few years later, that it became a general-purpose computing device extending beyond gaming.
Today, GPUs are used in many areas, from gaming to linear algebra and image processing to more novel applications like machine learning. In 2009, Andrew Ng and colleagues from Stanford University published a seminal paper proposing graphics processors as a means to overcome computing limitations in machine learning model training: "[M]odern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods."
The GPU was no longer a specific hardware element, subservient to the CPU. Consequently, and with the rise of the deep learning industry, companies began to work on dedicated computing units that would exploit the groundwork laid by GPUs. In 2016, Google pioneered this trend with a new computing unit, the Tensor Processing Unit (TPU), designed specifically for neural network workloads.
Chip manufacturers have been growing in number as demand rises for the high-performance hardware needed to meet the requirements of ever-larger neural networks. SambaNova, founded in 2017, is leading the market for AI-specific chips – even competing with the likes of Nvidia. They bet on what VentureBeat’s Poornima Apte calls "software-driven hardware": they focus on what AI systems need and design from there. Cerebras, another startup on the same path, recently talked with the well-known AI company OpenAI. They want to fuel the next GPT generation with what seems to be "the largest computer chip ever."
Tesla’s shift to in-house chip development
In June 2021, Andrej Karpathy gave a talk on Tesla’s strategy toward full self-driving cars. He detailed the specifications of their latest and largest cluster for neural network training and testing (they have three clusters in total). The cluster consists of 720 nodes with 8 Nvidia A100 GPUs each, adding up to 1.8 EFLOPs at FP16 – in terms of raw FLOPs, it would rank as the 5th supercomputer in the world.
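As a quick sanity check, here’s a minimal back-of-the-envelope sketch of that figure. It assumes the A100’s published peak of roughly 312 TFLOPs at FP16 on dense Tensor Core operations (a spec from Nvidia, not from Tesla’s talk):

```python
# Rough sanity check of the 1.8 EFLOPs figure for Tesla's A100 cluster.
nodes = 720
gpus_per_node = 8
a100_fp16_tflops = 312  # assumed: Nvidia's peak dense FP16 Tensor Core spec per A100

total_gpus = nodes * gpus_per_node                     # 5,760 GPUs
cluster_eflops = total_gpus * a100_fp16_tflops / 1e6   # TFLOPs -> EFLOPs

print(f"{total_gpus} GPUs, ~{cluster_eflops:.1f} EFLOPs at FP16")
# -> 5760 GPUs, ~1.8 EFLOPs at FP16
```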
This setup has worked for Tesla until now. On the one hand, they no longer want to rely on other companies’ chips and, on the other, Nvidia GPUs aren’t specifically designed to handle machine learning training, which makes them relatively inefficient for the task – GPUs were the best option until the AI industry grew so large that it became profitable to build dedicated hardware.
Tesla decided to follow the trend and start building their own chips and, eventually, a supercomputer. Following the principle of vertical integration, they wanted to design and manufacture the hardware in-house, and so Project Dojo was born.
Project Dojo – Tesla’s new supercomputer
Tesla’s goal with Dojo is to "achieve best AI training performance. Enable larger and more complex neural net models. Power efficient and cost-effective compute." This means Dojo isn’t necessarily intended to be more powerful or faster than the GPU cluster they already have. Nor do they want it to compete with the most powerful general-purpose supercomputers out there. The main criterion is to build a computer that’s better at AI training than any other – so they never need to rely on GPUs again.
A common challenge when building a supercomputer is finding a compromise between scaling computing power (easy) while keeping bandwidth high (hard) and latencies low (very hard). They found their answer in a distributed 2D architecture (a plane) composed of robust chips and a unique network fabric – allowing for fast communication, high bandwidth, and low latencies.
Loyal to their principle of vertical integration, they wanted to build the elements themselves at almost every level, from the bottom up: from the training node, the smallest computing element, to the D1 chip, to the training tile – their unit of scale – to the ExaPOD, the cluster that will eventually replace their GPU stack. In the next sections, I’ll explain these components one by one, finishing, as always, with a few insights:
- Training node – The smallest entity of scale
- D1 Chip – Comparable to the best GPUs out there
- Training tile – A magnificent piece of engineering
- ExaPOD – Tesla’s new supercluster
- Insights
Training node – The smallest entity of scale
Like GPUs, Dojo’s chips consist of smaller sets of elements that are copied across the chip. In Dojo, these smaller sets are the training nodes. They hold the different parts needed to do the bulk of the computation – the arithmetic and logic unit, together with the control unit, SRAM memory, and other components.
Ganesh Venkataramanan, Director of Project Dojo, called the training node "the smallest entity of scale." It’s the smallest component that’s scaled further by placing exact copies in every direction. In particular, 354 connected training nodes make a chip, 25 connected chips make a training tile, 12 training tiles make a cabinet, and 10 cabinets make the ExaPOD. By scaling these elements all the way up from the training nodes, it’s possible to reach computing performance at the exaFLOP scale – but some limitations need to be solved to achieve such a feat.
In particular, there’s the question of what size to make the training node. Too small, and it’s fast but too costly to synchronize. Too big, and it’s difficult to implement and can produce "memory bottlenecks." Because they wanted to keep latency low, they designed the training node by measuring the farthest distance a signal can traverse in one cycle at a high clock rate (2+ GHz) – the lowest possible latency – and drew a box around it to define the size of the node. And because they wanted to keep bandwidth high, they filled the box with wires to the brim.
Then they completed the high-performance node with the computing elements, the memory pool, and a programmable control core. This combination gives 1024 GFLOPs of compute at BF16 – which goes down to 64 GFLOPs at FP32 (single precision, the format more commonly used in performance tests). Finally, what makes these training nodes capable of scaling without degrading performance is that they’re designed to be highly modular. That is, they’re connected in such a way that the computing capabilities stay constant and they form a high-throughput communication plane.
D1 Chip – Comparable to the best GPUs out there
Putting together 354 training nodes results in 22.6 TFLOPs of compute at FP32 – for comparison, the Nvidia A100 delivers 19.5 TFLOPs – and an on-chip bandwidth of 10 TBps in each direction. Around the set of nodes, they placed an array of high-speed, low-power lanes to get an off-chip I/O bandwidth of 4 TBps per edge – twice the I/O bandwidth of state-of-the-art network switch chips. All of this together forms Tesla’s D1 chip.
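The chip-level figure follows directly from the per-node numbers in the previous section. Here’s a minimal sketch of that arithmetic (nothing here beyond the figures Tesla quoted):

```python
# Scaling the per-node figures up to one D1 chip.
nodes_per_chip = 354
node_bf16_gflops = 1024   # per training node, as quoted by Tesla
node_fp32_gflops = 64

chip_fp32_tflops = nodes_per_chip * node_fp32_gflops / 1000   # ~22.7 TFLOPs (Tesla quotes 22.6)
chip_bf16_tflops = nodes_per_chip * node_bf16_gflops / 1000   # ~362 TFLOPs

print(f"D1 chip: ~{chip_fp32_tflops:.1f} TFLOPs FP32, ~{chip_bf16_tflops:.0f} TFLOPs BF16")
# -> D1 chip: ~22.7 TFLOPs FP32, ~362 TFLOPs BF16
```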
In contrast with other chips out there, like the Nvidia A100, the D1 chip is entirely dedicated to training machine learning models. Its unique design provides "GPU-level compute, CPU-level flexibility, and twice the network chip-level I/O bandwidth." Tesla showed a comparison (off-chip bandwidth vs. TFLOPs of compute) with state-of-the-art machine learning chips, including Google’s TPU, modern GPUs, and startup chips.
The chips can be connected seamlessly, without glue, scaling computational capacity and communication in every direction while keeping latency between chips minimal. The envisioned compute plane comprises ~500,000 training nodes and 1,500 D1 chips. But how could they integrate the chips to create such a compute plane and connect it with the rest of the high-level components – host systems and interface processors?
Training tile – A magnificent piece of engineering
The answer is the training tile. 25 D1 chips are integrated using a fan-out wafer process so that they preserve the high bandwidth. Additionally, connectors are placed on the edges to preserve the off-chip I/O bandwidth. The resulting component is what they call the training tile, which provides 9 PFLOPs at BF16 and 36 TB/s of off-chip I/O bandwidth. This perhaps makes the training tile the "biggest organic MCM (multi-chip module) in the chip industry."
They designed the training tile to meet the criteria of high bandwidth and low latency across the computing plane, but they soon realized they needed to find new solutions to enable its manufacturing. To feed power into the training tile, they created a custom voltage regulator module that would go directly onto the fan-out wafer. They also integrated the electrical, thermal, and mechanical pieces to create a fully integrated training tile. The power supply and cooling are orthogonal to the compute plane, allowing for high performance, high bandwidth, and low latencies.
The training tile goes against the trends in the industry of "cutting the wafer into pieces," says Chanan Bos of CleanTechnica. "This is completely unprecedented."
ExaPOD – Tesla’s new supercluster
To build the cluster, they just had to put tiles together. A 2×3 tile matrix forms a tray, and two trays form a cabinet. The ExaPOD consists of 10 cabinets. But, keeping in mind the need for high bandwidth, they "broke the cabinets’ walls" and connected the trays one after another, creating a "seamless training mat."
The ExaPOD provides 1.1 EFLOPs at BF16 (120 training tiles, 3000 D1 chips, and over 1M training nodes), which makes Dojo almost as powerful as the GPU cluster Tesla is currently using to train its networks. Thanks to the highly distributed, modular design, it’s possible to use any subset of Dojo – called a DPU, or Dojo Processing Unit – for training purposes.
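Those headline numbers follow from the hierarchy described earlier. Here’s a minimal sketch of the roll-up, using the per-chip figures quoted above (a rough check, since peak numbers rarely scale perfectly):

```python
# Rolling the hierarchy up from the D1 chip to the ExaPOD.
chips_per_tile = 25
tiles_per_cabinet = 12     # a 2x3 tray, two trays per cabinet
cabinets_per_exapod = 10
nodes_per_chip = 354
chip_bf16_tflops = 362

tiles = tiles_per_cabinet * cabinets_per_exapod                # 120 training tiles
chips = tiles * chips_per_tile                                 # 3,000 D1 chips
nodes = chips * nodes_per_chip                                 # 1,062,000 training nodes

tile_bf16_pflops = chips_per_tile * chip_bf16_tflops / 1000    # ~9 PFLOPs per tile
exapod_bf16_eflops = tiles * tile_bf16_pflops / 1000           # ~1.09 EFLOPs

print(tiles, chips, nodes, round(exapod_bf16_eflops, 2))
# -> 120 3000 1062000 1.09
```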
The high-bandwidth, low-latency fabric allows Dojo to perform 4x better than any other AI supercomputer at the same cost, with a footprint five times smaller and 1.3x better performance per watt.
Elon Musk said at the end of the presentation that Dojo could be operational next year. And if that wasn’t enough, Tesla has already sketched a next-generation plan that would allegedly provide a 10x improvement over the first Dojo computer.
At the highest level, there are two key takeaways from the Dojo presentation. First, building all the hardware in-house allows Tesla to achieve unmatched performance for training AI models and enables full vertical integration. Second, designing all the components to be highly modular helps keep the bandwidth very high and the latency very low, two requirements for achieving such performance improvements. Tesla is again promising big; let’s see what they can deliver.
Insights
A fair comparison
The TOP500 project lists the most powerful non-distributed supercomputers in the world twice a year. This year’s June list gives the first spot to Japan’s Fugaku, which achieves 442.01 PFLOPs. If we compared the ExaPOD’s 1.1 EFLOPs with that figure, we’d conclude that Tesla is about to build not only the fastest supercomputer in the world, but one twice as powerful as the current number one.
There are two reasons why Dojo won’t be crowned the fastest supercomputer. First, the high-performance computers (HPCs) considered for the TOP500 list have to be capable of performing many different tasks. Dojo’s specificity prevents it from qualifying as an HPC.
Second, the performance tests of HPCs are conducted in single- or double-precision formats – that is, FP32 or FP64. Dojo achieves 1.1 EFLOPs at BF16 (Brain Floating Point 16; "Brain" comes from Google Brain), a format that uses half the bits of FP32. And it doesn’t support FP64, which is needed for the most demanding scientific computations.
However, for illustration purposes, we can estimate how many computations per second Dojo could do at single precision. Because Tesla disclosed the performance of the D1 chip at both BF16 and FP32, it’s possible to convert and estimate Dojo’s computing capability at FP32. (This procedure isn’t exactly right, because performance doesn’t simply scale linearly from the chip to the cluster, but it serves for a rough comparison.)
The D1 chip gives 22.6 TFLOPs at FP32 and 362 TFLOPs at BF16, and the ExaPOD gives 1.1 EFLOPs at BF16. Doing the math: Dojo at FP32 ≈ 1.1 EFLOPs (BF16) × 22.6 / 362 ≈ 68.7 PFLOPs. If we assume the calculation is accurate enough, Dojo is slightly less powerful than the cluster Tesla currently uses, which provides ~90 PFLOPs.
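The same back-of-envelope conversion as a short sketch, with the same linear-scaling caveat as above:

```python
# Reproducing the rough BF16 -> FP32 conversion from the paragraph above.
exapod_bf16_eflops = 1.1
d1_bf16_tflops = 362
d1_fp32_tflops = 22.6

# Scale the ExaPOD's BF16 figure by the chip-level FP32/BF16 ratio.
exapod_fp32_pflops = exapod_bf16_eflops * 1e6 * (d1_fp32_tflops / d1_bf16_tflops) / 1e3
print(f"~{exapod_fp32_pflops:.1f} PFLOPs at FP32")  # -> ~68.7 PFLOPs at FP32
```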
In any case, Dojo is far more efficient in terms of cost and energy and, when it comes to AI training, no computer will likely beat it for a long time.
A unique design
To create a system perfectly aligned with the needs of AI systems, Tesla’s engineers needed to break some rules and make a few innovations with respect to industry standards. Bos has a very thorough review of this topic at CleanTechnica.
Tesla’s D1 chip is a "system on a chip," or SoC – a chip that integrates cache memory, processor cores, a graphics unit, and other components. Nowadays most chips are designed like this. However, there are a few important differences between the D1 chip and other SoCs, and between the training tile and other MCMs.
The first thing any expert in computing hardware would notice is that Tesla promises a level of performance from the training tile that usually can’t be guaranteed a priori with total certainty. The reason lies in the way chips are usually manufactured and integrated.
Chips aren’t made by placing their components individually. Instead, the elements of many chips are patterned onto a thin circular piece of high-purity silicon called a wafer. The wafer is then cut into the individual pieces that become processors (GPUs, SoCs, and so on).
Because of manufacturing imperfections, some chips on a wafer turn out partially or entirely unusable. That’s why it’s unusual that Tesla can promise flawless performance from the training tile (which, following the industry standard, would be a piece cut from a larger wafer). How can they make sure all 25 D1 chips work perfectly in the training tile when chips sometimes don’t work as intended?
There are two possibilities. Tesla’s engineers may have found a way to ensure a perfectly working 5×5 grid of D1 chips when the piece is extracted from the larger wafer. The other option is that the training tile is itself a whole wafer. In either case, there’s groundbreaking innovation here, because it lets them guarantee the performance of the ExaPOD from the design of the D1 chip.
The second big difference is that computers always have RAM (random access memory) outside the chips – but Dojo doesn’t. There are two types of RAM: SRAM (static RAM – cache memory, for instance, is SRAM) and DRAM (dynamic RAM). The main advantage of SRAM is that it’s faster to access and consumes less energy. On the other hand, DRAM is denser, so it can host more data in the same space. Both are usually necessary, but Tesla has designed Dojo so that it doesn’t need DRAM.
The training nodes have 1.25 MB of SRAM each. Bos argues it’s probably one of the faster types of SRAM, L2 cache – which has a response time of 3–4 ns, in contrast to DRAM’s ~60 ns. With 354 training nodes in each D1 chip, that amounts to 442.5 MB of cache per chip, more than any other chip out there.
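A minimal sketch of that SRAM arithmetic, with one comparison point for scale (the 40 MB of on-chip L2 cache in the A100 is taken from Nvidia’s public specs, not from Tesla’s presentation):

```python
# Total on-chip SRAM implied by the per-node figure above.
nodes_per_chip = 354
sram_per_node_mb = 1.25

chip_sram_mb = nodes_per_chip * sram_per_node_mb
print(f"~{chip_sram_mb} MB of SRAM per D1 chip")  # -> ~442.5 MB of SRAM per D1 chip
# For context (assumed from Nvidia's published specs): the A100 carries 40 MB of on-chip L2 cache.
```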
So what we get here is a D1 chip that has enough SRAM to not need an external DRAM nor a shared cache either. "As strange as the design sounds, the missing components that you would usually expect to find in an SoC might have been unnecessary," Bos says. "This is a very specific system fine-tuned to a very particular task whereas most processors have a wider array of components to be more flexible to fit all kinds of tasks."
To close, I want to highlight the importance of Dojo, not only for its breakthrough specs – both in terms of innovation and performance – but also because Musk said they’d allow other companies to access Dojo to train their neural networks in the future.
When a user asked him if he had thought of Dojo as a machine-learning-training service, Musk’s response was simply "Yes."
And it wouldn’t be just to train networks related to vehicle autonomy, but "pretty much any machine-learning."
Although Tesla hasn’t made any great theoretical breakthroughs like OpenAI or DeepMind – and, for now, no one can use what they’ve invented or designed for research purposes – it’s clear they’re betting hard on applied AI. How it’ll turn out in the end remains to be seen, but it’s worth keeping an eye on them.