Why Cerebras’ announcement is a big deal

Cerebras’ chip can become the de facto chip for Deep Learning

Giuliano Giacaglia
Aug 21, 2019


One of the biggest problems with Deep Learning models is that they are becoming too big to train on a single GPU. Trained on a single GPU, today’s models would simply take too long. To train them in a timely fashion, it is necessary to spread the work across multiple GPUs.

We need to scale training methods to hundreds or even thousands of GPUs. For example, one well-known researcher used hundreds of GPUs to cut ImageNet training time from 2 weeks to 18 minutes, and to train the largest state-of-the-art Transformer-XL in 2 weeks instead of 4 years.

As models become bigger, more processors are needed. When training is scaled across several GPUs, a few bottlenecks can significantly increase training time. Two main ones block development in this area: the speed of the network between processors and the amount of memory each GPU can hold. Let’s cover them:

Network Speed

Network speed becomes a bottleneck because training a neural network requires passing the gradients from every node around the cluster so that the algorithm can figure out how to update its weights. Algorithms like Ring AllReduce are usually used to synchronize this work between the different nodes. The speed of communication between chips is so important that Nvidia spent $6.9 billion buying a company called Mellanox, in order to improve the communication between its GPUs.
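
To make this concrete, here is a minimal simulation of Ring AllReduce in NumPy. It is a sketch of the algorithm’s two phases (reduce-scatter, then all-gather), not any vendor’s implementation; in a real cluster each “worker” would be a separate process exchanging chunks over the network rather than an index into a list.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulated ring all-reduce: every worker ends up with the sum of
    all workers' gradients, exchanging only 1/N-sized chunks per step."""
    n = len(grads)
    # Each worker splits its gradient vector into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Phase 1: reduce-scatter. In step t, worker i sends chunk (i - t) % n
    # to its right neighbor, which accumulates it. After n - 1 steps,
    # worker i holds the fully summed chunk (i + 1) % n.
    for t in range(n - 1):
        sends = [(i, (i - t) % n, chunks[i][(i - t) % n].copy()) for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data

    # Phase 2: all-gather. Each worker forwards completed chunks around
    # the ring until every worker holds every summed chunk.
    for t in range(n - 1):
        sends = [(i, (i + 1 - t) % n, chunks[i][(i + 1 - t) % n].copy()) for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data

    return [np.concatenate(c) for c in chunks]

# Four simulated workers, each with its own gradient vector.
workers = [np.full(8, i + 1.0) for i in range(4)]
result = ring_allreduce(workers)
assert all(np.allclose(r, sum(workers)) for r in result)
```

The appeal of the ring topology is that each worker sends roughly 2x its gradient size in total, regardless of how many workers participate, so per-link traffic stays flat as the cluster grows.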

GPU Memory

The second challenge in scaling models across multiple GPUs is the amount of memory each GPU has. Neural nets can consume gigabytes of memory, while GPUs have only megabytes of on-chip memory. Today, GPUs work around this by storing neural nets in external memory soldered next to the chip. The problem is that external memory is 10 to 100x slower and more power-hungry than on-chip memory.

Large models like Google’s Neural Machine Translation don’t even fit in one GPU’s external memory and often have to be split across tens of GPUs, which increases latency by another 10 to 100x.

Therefore, storing the weights and training data close to the processor matters: the more memory a chip has, the faster you can train a model and the less energy it uses. Ideally, the whole model fits on a single chip.
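
A quick back-of-the-envelope calculation shows the scale of the problem. The sketch below assumes fp32 training with Adam-style optimizer state (two extra buffers per weight) and ignores activations entirely; the numbers are illustrative assumptions, not measurements of any particular chip.

```python
def training_memory_gb(n_params, bytes_per_param=4, optimizer_buffers=2):
    """Rough training footprint: weights + gradients + optimizer state,
    all in fp32, ignoring activations."""
    tensors = 1 + 1 + optimizer_buffers
    return n_params * bytes_per_param * tensors / 1e9

# A 1-billion-parameter model already needs ~16 GB before activations,
# which is at or beyond the HBM capacity of a 2019-era flagship GPU.
print(f"{training_memory_gb(1_000_000_000):.0f} GB")
```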

Cerebras’ integrated approach

It is not a coincidence that the fastest AI chip today is also the largest: more area means more cores and more memory. The problem is that chips today are made with standard lithography machines from ASML, whose reticle limit is around 858 mm², almost exactly the size of an Nvidia V100. Conventional chips are at that limit.

ASML machinery used to produce chips up to the reticle limit of around 858 mm²

Cerebras combined the old process of making chips with a new method: it builds an entire wafer of dies as one “giant chip.” The new “chip” contains a total of 1.2 trillion transistors, more than 50x Nvidia’s state-of-the-art GPU. More importantly, with all those dies merged into one big chip, Cerebras achieves much higher communication speeds between them. It also has much more memory: reportedly 3,000x the on-chip memory of Nvidia’s flagship GPU and 10,000x the memory bandwidth GPUs could achieve before.

Cerebras co-founder Sean Lie with the Cerebras Wafer Scale Engine (WSE), which holds 400,000 cores and 18 GB of memory
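
As a sanity check on the “3,000x” figure: taking the 18 GB from the caption above and assuming roughly 6 MB of on-chip SRAM for a flagship 2019 GPU (its HBM is far larger, but that is exactly the slow external memory discussed earlier), the ratio works out.

```python
# Both figures are 2019-era assumptions: 18 GB of on-chip SRAM on the
# WSE versus ~6 MB of on-chip cache on a flagship GPU such as the V100.
wse_on_chip_mb = 18 * 1024   # 18 GB expressed in MB
gpu_on_chip_mb = 6           # ~6 MB of on-chip SRAM (assumption)
print(wse_on_chip_mb / gpu_on_chip_mb)  # 3072.0, i.e. ~3,000x
```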

In the 1980s, companies such as Gene Amdahl’s Trilogy Systems tried to build a big integrated chip like Cerebras’, but they ran out of funding before they could overcome the many engineering problems involved. Cerebras overcame them.

Cerebras built the largest chip ever, a chip the size of an entire wafer. The big wafer is still etched section by section, using the same method as before for each individual die, but Cerebras worked with TSMC, the manufacturer, to add extra wires so that all the dies work together as a whole instead of as separate chips. To build a wafer-sized chip, Cerebras had to overcome five big challenges:

Challenges

1. Communication across scribe lines

First, the Cerebras team had to handle communication across the “scribe lines,” the boundaries between dies where a wafer would normally be cut apart. Working with TSMC, they not only invented new channels for communication, but also had to write new software to handle a chip with more than a trillion transistors.

2. Chip Yield

The second challenge was yield. With a chip covering an entire silicon wafer, a single imperfection in the etching of that wafer could render the entire chip inoperative. Cerebras approached the problem with redundancy, adding spare cores throughout the chip that serve as backups whenever a defect appears in a core’s neighborhood on the wafer; as the sketch below shows, a wafer-sized die without such redundancy would essentially never yield.
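
The classic Poisson yield model makes the severity concrete: the probability that a die has zero fatal defects falls off exponentially with its area. The defect density used below is an illustrative assumption, not a TSMC figure.

```python
import math

def yield_fraction(area_cm2, defects_per_cm2=0.1):
    """Poisson yield model: P(zero fatal defects) = exp(-D * A)."""
    return math.exp(-defects_per_cm2 * area_cm2)

gpu_die = 8.15       # ~815 mm^2, roughly an Nvidia V100 die
wafer_chip = 462.25  # ~46,225 mm^2, roughly the WSE's silicon area

print(f"GPU-sized die:   {yield_fraction(gpu_die):.1%}")    # ~44%
print(f"wafer-sized die: {yield_fraction(wafer_chip):.2g}") # ~1e-20
```

At any realistic defect density, a monolithic wafer-sized die has essentially zero chance of being defect-free, which is why spare cores and rerouting around bad neighborhoods are mandatory rather than optional.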

3. Thermal Expansion

The third problem they handled was thermal expansion. Chips get extremely hot in operation, and different materials expand at different rates, so the connectors tethering the chip to its board have to expand at precisely the same rate as the silicon. Cerebras invented a material that absorbs some of that mismatch in expansion.

4. Packaging and manufacturing flow

The fourth challenge was integrating with existing server infrastructure. Nobody had tools to handle chips of this size, so Cerebras had to build its own packaging, create a new manufacturing flow, and, on top of that, create software to test it all.

5. Cooling

Finally, all that processing power in one chip requires immense power delivery and cooling, much more than “smaller” chips need. Cerebras essentially approached the problem by turning the chip on its side: cooling is delivered vertically at all points across the chip, instead of only horizontally at its edges.

Road ahead

After overcoming all these challenges, Cerebras still has a long and difficult road ahead. It will have to prove itself before its chips become mainstream. The company has started releasing prototypes to a few customers, and it will take some time before those customers run its chips in production.

Beyond that, Nvidia has built extensive tooling for developers to create and deploy their models across multiple GPUs, and it will take time before developers have comparable tooling for Cerebras’ chip. But this chip represents a big step towards bigger and better neural networks, and a big step towards building AGI.
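
To give a sense of how mature that tooling is, here is a minimal sketch of what multi-GPU data parallelism looks like in PyTorch today. nn.DataParallel is the shortest illustration; DistributedDataParallel is the production-grade variant.

```python
import torch
import torch.nn as nn

# A toy model; DataParallel replicates it across all visible GPUs and
# splits each input batch between them.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()  # one line to go multi-GPU
elif torch.cuda.is_available():
    model = model.cuda()

x = torch.randn(64, 512)
if torch.cuda.is_available():
    x = x.cuda()
print(model(x).shape)  # torch.Size([64, 10]); forward pass split across GPUs
```

Anything comparable for a brand-new architecture has to be built, documented, and battle-tested from scratch.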
