SambaNova’s $676M Funding Round

Dataflow architecture and how it may influence your IT strategy

Dan McCreary
Towards Data Science

SambaNova is an example of a Dataflow Architecture that has many architectural similarities to the human brain. The figure shows my own intuitive similarity scores between various hardware architectures, knowledge graphs, and the human brain. Note that I rank Dataflow as having a 0.9 similarity score with the human brain. Artwork by author.

This week, the AI startup SambaNova landed $676 million in Series D funding, making it the best-funded AI startup in the world. Many of my colleagues who study enterprise knowledge graphs and large-scale machine learning may be a bit confused by the scale of the funding. Indeed, at first, even I was surprised by such large funding for such a small company. However, after studying SambaNova’s 40-billion-transistor dataflow chip architecture, I now understand why their system deserves this level of attention.

This blog will review what dataflow architecture is, how SambaNova implements this architecture, why this funding makes sense, and what you need to do to integrate this new development into your overall IT strategy.

NLP and the EKG

Let’s first recall that over 80% of the knowledge in an organization is tied up in unstructured data. Enterprise Knowledge Graph (EKG) architects need high-quality, custom-built natural language processing (NLP) models like BERT and GPT to:

  1. extract uniform concepts from large documents,
  2. classify the documents for search,
  3. and extract specific facts that we can quickly integrate into our EKGs.

Every incoming e-mail from your customers should go into your EKG in minutes (not days) and directly impact their embedding. This applies not only to e-mail classification but also to product reviews, comments, and even social media posts. For many companies, the cost of building high-quality machine-learning NLP models has been a primary obstacle. Industries like life science and healthcare also need specialized vocabularies that are not covered by the thousands of standard NLP models hosted on Hugging Face.
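
To make the classification step concrete, here is a minimal sketch using the Hugging Face transformers library. The model choice and the candidate labels are purely illustrative assumptions, not a recommendation for any particular EKG pipeline.

```python
# A minimal sketch: route an incoming customer e-mail into EKG-friendly
# categories with a zero-shot classifier. Labels are illustrative only.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",  # a general-purpose NLI model
)

email_text = "The replacement pump you shipped last week is already leaking."
candidate_labels = ["complaint", "product review", "billing question", "praise"]

result = classifier(email_text, candidate_labels)
print(result["labels"][0], round(result["scores"][0], 3))  # top label and its score
```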

With the SambaNova funding announcement last week, these cost barriers might eventually become much less of a concern.

Claiming Order-of-Magnitude Better NLP Model Training Performance

SambaNova figure from their website here. Image used with permission of SambaNova.

Let’s start with SambaNova’s claim: a 1/4 rack of their hardware has an order-of-magnitude performance advantage over 1,024 NVIDIA V100 GPUs for training large language models like BERT and GPT. Granted, the web page they provide is still a bit sketchy on the details. It is not clear whether their claim of “an order of magnitude improvement in performance” refers to the time to train a large NLP model like BERT or GPT, or to some other metric like purchase price or ROI.

Note that there is no mention of price on this page. SambaNova could price their hardware at the $2M level of the Cray Graph Engine, and it could still provide value for some firms that need fast model-training times. Yet I think that price would make their hardware unaffordable for most companies, so let’s hope for better pricing from them.

Let’s assume that this is not just marketing hype and that SambaNova will eventually back up their claims with reproducible benchmarks and pricing verified by an independent third party. And let’s hope they make the hardware affordable for smaller companies.

Now let’s look at how this would be possible. How could a new startup (fewer than 200 employees on LinkedIn) create a system with “GPU-crushing” performance when NVIDIA is the most valuable chipmaker in the world and has almost 20,000 employees? To understand how their claim might be valid, we need to understand how dataflow hardware differs radically from the design of both serial CPUs and SIMD GPUs. Recall that GPUs were originally designed to accelerate video games, not to train on massive deep-learning data sets.

Understanding Asynchronous Dataflow Hardware

Figure 2: Comparison of a globally clocked (synchronous) perceptron circuit on the left with an asynchronous (unclocked) dataflow circuit on the right. Image by the author.

When I was a VLSI circuit designer for Bell Labs in the 1980s, I was fascinated by the different approaches chip designers took to build highly specialized custom chips for high-speed telecommunications systems. One of those approaches was to build circuits that took data inputs and immediately transformed them into outputs without ever needing a clock. These circuits were called asynchronous dataflow architectures because data flowed continuously from input to output without a clock to signal the circuits when they could start their calculations.

In the area of dataflow circuits, we use the term “kernel function” or simply “kernel” to describe an atomic function from a library, such as add, multiply, sum, or average. We perform these functions on scalars, vectors, or multi-dimensional arrays of numbers.

Although dataflow circuits had fantastic performance characteristics, they came with several design challenges:

  1. You could not deterministically predict when the output for a given function (kernel unit) would be valid. Multiplication of two zeros might finish very quickly, but the multiplication of two large numbers would take a lot longer.
  2. Getting localized dataflows to work together with each other and with complex bus logic was a challenge.

As a result, depending on the input data and the timing between kernels, the output might differ for the same input. Testability was also a challenge. Entire conferences were dedicated to attacking these issues. The differences between the two architectures are shown in Figure 2.

Here are a few facts about dataflow architectures:

  1. They don’t have a single program counter; they do not follow the typical von Neumann architecture.
  2. There is often less need for a central “global clock” to synchronize the circuits. Global clocks are still needed to move data on and off communication buses, but they play a smaller overall role.
  3. The output of any block of circuits is determined solely by the availability of its input data. This should remind you of purely functional programming (see the toy sketch after this list).
  4. Knowing exactly when the correct results appear on the output signals is complex. Circuits that detect completion can double the design complexity.
  5. Dataflow circuits can perform parallel data transformations much faster than traditional CPUs if the designers can figure out how to synchronize the logic. Designing a circuit that signals when the results are complete is a non-trivial problem.
  6. Only a small percentage of circuit designers have ever been trained to design asynchronous dataflow logic circuits.
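
To make point 3 concrete, here is a toy Python sketch of a single dataflow node that fires the moment all of its inputs have arrived. This is my own illustration of the idea, not how any real dataflow chip is programmed.

```python
# Toy illustration of an asynchronous dataflow node: it "fires" as soon as
# every input is available, rather than waiting on a global clock.
class DataflowNode:
    def __init__(self, name, kernel, num_inputs, downstream=None):
        self.name = name
        self.kernel = kernel                  # e.g. an add or multiply "kernel function"
        self.inputs = [None] * num_inputs
        self.downstream = downstream or []    # list of (node, input_slot) pairs

    def receive(self, slot, value):
        self.inputs[slot] = value
        # Fire as soon as every input slot has data -- purely data-driven.
        if all(v is not None for v in self.inputs):
            result = self.kernel(*self.inputs)
            print(f"{self.name} fired -> {result}")
            for node, dest_slot in self.downstream:
                node.receive(dest_slot, result)
            self.inputs = [None] * len(self.inputs)

# Wire up a tiny graph that computes (a * b) + c
add = DataflowNode("add", lambda x, y: x + y, num_inputs=2)
mul = DataflowNode("mul", lambda x, y: x * y, num_inputs=2, downstream=[(add, 0)])

mul.receive(0, 3)   # nothing fires yet -- only one input is available
mul.receive(1, 4)   # mul fires (3 * 4 = 12) and pushes 12 into add's slot 0
add.receive(1, 5)   # add now has both inputs and fires (12 + 5 = 17)
```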

Although synchronous circuits have dominated almost all chip designs, dataflow architectures still had their place where fast, and sometimes inexact, calculations were needed. The problem was that they tended to be cost-effective only in specialized settings that required high-speed parallel compute performance. It wasn't easy to design a dataflow chip that could serve as a general-purpose computing system.

Even if you could design a new dataflow chip architecture with 100x the performance of a CPU, the other challenge is training developers to stop writing procedural IF/THEN/ELSE code and start thinking about solving problems with dataflows. Many new startups' business plans don’t fail due to a lack of hardware performance; they fail to recruit enough developers who know how to program their chips. Even today, fewer than 1% of data scientists and Python developers can begin to fathom how to integrate an ultra-fast parallel-processing FPGA into their production code, despite FPGAs being a mature technology that has been around for 30 years.

If you are interested in going deeper, here is a recent podcast in which Kunle Olukotun, one of the founders of SambaNova, talks about dataflow architectures and how much energy is spent simply moving data around in traditional von Neumann “Software 1.0” ways of solving problems.

Looking Inside the SambaNova Architecture

SambaNova chip architecture that places dataflow compute units next to memory units and connects them with high-speed switches. Image used with permission of SambaNova.

I could spend an entire blog post doing a deep dive into how SambaNova partitions their 40-billion-transistor, 7-nanometer, TSMC-built chip into dataflow blocks. In summary, they partition kernel logic into dataflow kernel circuits designed to fit into logic blocks called Reconfigurable Dataflow Units (RDUs). An RDU can be quickly reconfigured using on-chip networking switches.

To summarize, they interleave dataflow kernel circuits (Pattern Compute Units, or PCUs) with memory units and use high-speed on-chip switches to move data between them. They place the kernel functions in the PCUs by analyzing the static structure of a deep-learning neural network. The software then wires the PCUs and memory together to minimize the movement of data around the chip. This all sounds horribly complex to program. Luckily, if you are using PyTorch or TensorFlow, you only need to change a few lines of code.

Entry Points: Hooking into Your Python Code

The primary entry points to the SambaNova system are directly through PyTorch and TensorFlow. Image used by permission of SambaNova.

The key to the SambaNova system is how they integrate all these dataflow kernels to train your neural networks. When you write most deep learning algorithms that train neural networks, you specify the topology of a directed graph of dataflows. SambaNova intercepts this code, builds a data flow graph within its hardware, and produces the same results as you would get using a CPU or GPU. The hard part is done by a Dataflow Graph Analyzer that converts the graph into a format that can be compiled into hardware connections.
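
As a reminder of what that directed graph looks like from the developer’s side, here is an ordinary PyTorch model definition. This is standard PyTorch only, with no SambaNova-specific API (which I have not seen); the layer-by-layer composition in forward() is the dataflow topology that a graph analyzer like the one SambaNova describes would intercept and map onto hardware.

```python
# Plain PyTorch: the forward() method implicitly declares a directed
# dataflow graph (embedding -> linear -> relu -> linear) that a dataflow
# compiler can analyze and map onto hardware kernels.
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, hidden=128, num_classes=4):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, hidden)
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, token_ids, offsets):
        x = self.embed(token_ids, offsets)   # node 1: embedding lookup
        x = torch.relu(self.fc1(x))          # node 2: linear + relu kernels
        return self.fc2(x)                   # node 3: final linear kernel

model = TinyClassifier()
tokens = torch.randint(0, 10_000, (12,))
offsets = torch.tensor([0, 5])               # two "documents" in the batch
logits = model(tokens, offsets)
print(logits.shape)                          # torch.Size([2, 4])
```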

This is similar to the process by which FPGAs are programmed. But unlike an FPGA, where the wiring becomes fixed when the chip powers up and reprogramming takes on the order of two seconds, the SambaNova system routes data dynamically: its switches can reroute RDU inputs and outputs in milliseconds.

Enter Machine Learning on Deep Neural Networks

Despite the dominance of synchronous von Neumann architectures over the last 40 years, a few chip architects still believed that the extreme challenges of designing dataflow architectures could be overcome. Many of these engineers were affiliated with the Stanford Electrical Engineering and Computer Science departments and the Stanford Pervasive Parallelism Lab led by Kunle Olukotun.

Around 2018, it also became clear that there was a strong need for computing centered on training deep neural networks. Companies were spending millions of dollars a year training these networks, and their tools of choice centered on two Python libraries: TensorFlow and PyTorch. This gave dataflow developers a way to easily intercept the Python library calls and use their hardware without rewriting the algorithms from scratch. Now new hardware developers only have to build the “hooks” into a Python library to have their hardware train neural networks.

The Future of AI: Low-Cost Specialized BERT Boxes

When OpenAI announced their 175-billion-parameter GPT-3, it sent a shock wave through the entire AI community. Although they didn’t indicate how much they spent on GPU time to train these massive models, we can estimate somewhere between $5M and $10M. This puts building new deep-learning models like BERT and GPT out of budgetary range for most academic research institutions. It also signaled to the hardware community that there was a growing demand for hardware specialized for training deep neural networks.

Much of the cost of building deep-learning models is now concentrated in the NLP and graph domains. BERT and its hundreds of variants are now the most popular language models that need to scale. Standard BERT is limited by the number of input tokens it can process, and training costs grow roughly quadratically with the number of input tokens, because the self-attention layers compare every token with every other token. Many companies and open-source projects have focused on BERT-large benchmarks and the clever tricks needed to scale these models across distributed GPUs.
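
A rough back-of-the-envelope calculation shows why sequence length matters so much. The sketch below is my own simplification that counts only the attention score and weighting work and ignores the feed-forward layers and many constant factors:

```python
# Back-of-the-envelope scaling of self-attention cost with sequence length.
# Attention forms an (n x n) score matrix, so the work grows roughly as
# n^2 * hidden (a simplification that ignores feed-forward layers).
def attention_flops(seq_len, hidden=768, layers=12):
    per_layer = 2 * seq_len * seq_len * hidden   # QK^T scores + weighted sum of V
    return layers * per_layer

base = attention_flops(512)                      # standard BERT input length
for n in (512, 1024, 2048, 4096):
    print(f"{n:5d} tokens -> {attention_flops(n) / base:5.1f}x the attention cost")
# Doubling the sequence length roughly quadruples the attention cost.
```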

We need to remember that SambaNova is not the first company to claim its hardware is an order of magnitude faster and much lower cost than traditional GPUs. Both Graphcore and Cerebras have been making these claims for a few years now. They openly publish their performance benchmarks on public data sets that you can even try to reproduce (if you can afford to purchase the evaluation hardware). This brings us to our next topic: how do we know if vendors are telling the truth?

The Need for Objective Machine Learning Benchmarks

One of the first rules of any objective procurement process is to take any performance benchmarks published by vendors with a healthy dose of skepticism. I recall opening my ComputerWorld magazine in the 1980s and seeing Oracle and Sybase both claiming, in full-page back-to-back advertisements, that each was ten times faster than the other. All they had to do was adjust a few parameters in the configuration files to slow the opponent down. This taught me early on that vendor benchmarks cannot be trusted.

This problem is made even more difficult when you carefully read the software license agreements of many vendors: they often make it a violation of your license to publish benchmarks without written approval. Some vendors even refuse to give you license keys to run your own internal benchmarks, which I personally consider an unethical practice.

MLCommons and MLPerf

In machine learning, there are now a few small efforts to standardize machine-learning benchmarks, such as the MLCommons project. But even these “objective” efforts are sometimes easily co-opted by supposedly independent consultants funded by the big industry players.

If you do decide to dig into benchmarks like MLPerf, make sure you understand the difference between training and inference benchmarks, the difference between dense and sparse data representations, and the fact that SIMD hardware is a good match for dense matrix problems (like images) but may not be appropriate for sparse problems (like NLP, BERT, and graph convolutions).
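
To get a feel for the dense-versus-sparse distinction, the small sketch below builds the same matrix in sparse and dense form. The 1% density figure is an arbitrary illustration, not a claim about any particular NLP or graph workload.

```python
# Dense vs. sparse representation of the same matrix. SIMD hardware shines
# on the dense case; sparse workloads spend much of their time chasing indices.
import numpy as np
from scipy import sparse

n = 4096
sp = sparse.random(n, n, density=0.01, format="csr", dtype=np.float32)  # 1% non-zeros
dense = sp.toarray()

sparse_bytes = sp.data.nbytes + sp.indices.nbytes + sp.indptr.nbytes
print(f"dense  : {dense.nbytes / 1e6:7.1f} MB")   # ~67 MB
print(f"sparse : {sparse_bytes / 1e6:7.1f} MB")   # ~1.4 MB for the same values
```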

Getting raw execution speeds to train a large deep neural network is only the first step. The next step is to understand the long-term costs and ROI of doing this type of calculation. Many software optimizations can lower both the training cost and inference costs. Knowing how many people in your organization will need this type of hardware in the future is also difficult to project. Although cloud service providers are all happy to sell you GPU time, they may not be working in your best interest to explore innovative new machine-learning hardware.

Take-Home Points

You don’t really need to know how dataflow architectures work. But you do need to know that they may totally disrupt the machine-learning marketplace and the value of the existing GPU hardware. Here are a few take-home points:

  1. Consider these new hardware options if you are spending over $50K/month on training your machine learning models.
  2. If you are still using “R” and SAS to do data science, consider that the new generation of hardware vendors will only support Python libraries like TensorFlow and PyTorch. Now might be a good time to retrain your staff.
  3. If you are just about to purchase a large number of GPUs to do deep learning, consider leasing them so you can upgrade to specialized deep-learning hardware a year from now.
  4. If you currently own a large number of GPUs, consider selling them before they become obsolete. GPUs may soon be one of the fastest depreciating components in your data center.
  5. If you are using a cloud service provider (CSP), ask your CSP if they have a plan in place to help you evaluate low-cost BERT boxes and their kin.
  6. If you want faster and lower-cost inference, consider low-cost hardware options like FPGAs that can be reprogrammed in two seconds. As Occam’s Razor teaches us, sometimes the simplest solution is the best solution.
  7. Don’t blindly trust a vendor’s benchmarks. Do your own benchmarks on your own data in your own data centers (a minimal timing harness is sketched after this list). Take everything, including the cost of network cards and switches, into account when benchmarking.
  8. If you see your vendors making unfair claims, check your license agreement before posting your internal benchmark results on Twitter.
  9. Refuse to work with vendors that don’t allow you to evaluate their claims objectively. Ask them to post the code on public source code repositories like GitHub.
  10. Support open and transparent machine learning benchmarks.
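
For point 7, even a crude throughput harness on your own model gives you a number no vendor slide can argue with. Here is a minimal sketch, assuming a PyTorch workload; the names model, train_loader, and loss_fn are placeholders for whatever you already run in production.

```python
# Minimal throughput harness: time a fixed number of training steps on your
# own model and data. `model`, `train_loader`, and `loss_fn` are placeholders.
import time
import torch

def steps_per_second(model, train_loader, loss_fn, device, num_steps=100):
    model = model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    steps = 0
    start = time.perf_counter()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        steps += 1
        if steps >= num_steps:
            break
    if device.type == "cuda":
        torch.cuda.synchronize()   # wait for queued GPU work before stopping the clock
    return steps / (time.perf_counter() - start)
```

Run the same function on each candidate platform, then fold in that platform's fully loaded cost per hour to get a cost-per-step figure you can compare across vendors.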

Summary

The SambaNova funding announcement does not stand alone. Many new companies are building custom silicon to help lower the cost of computing. From the Raspberry Pi RP2040 to the Intel PIUMA graph accelerator, everyone is getting in on the action, and the semiconductor industry is ripe for disruption. We are looking forward to running objective benchmarks on these new systems!
