Daniel Shapiro, PhD
Towards Data Science
3 min read · Sep 20, 2017


Accelerating Deep Neural Networks

Neural networks are “slow” for many reasons: load/store latency, shuffling data in and out of the GPU, the limited width of the GPU pipeline (as mapped by the compiler), the unnecessary extra precision in most neural network calculations (lots of tiny numbers that make no difference to the outcome), the sparsity of the input data (lots of zeros), and many other factors.
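
To put some rough numbers on the precision and sparsity points, here is a quick numpy sketch. This is a toy illustration of mine, not a benchmark of any real network:

```python
import numpy as np

# A fake 1024x1024 weight matrix, like one layer of a dense network
w32 = np.random.randn(1024, 1024).astype(np.float32)
w16 = w32.astype(np.float16)  # half precision: roughly the same values, half the bytes

print("float32 weights:", w32.nbytes / 1e6, "MB")
print("float16 weights:", w16.nbytes / 1e6, "MB")

# Sparsity: after a relu, roughly half the activations are exactly zero,
# so the multiply-accumulates that consume them downstream are wasted work.
x = np.random.randn(1024, 1024).astype(np.float32)
relu_out = np.maximum(x @ w32, 0.0)
print("fraction of zeros after relu:", np.mean(relu_out == 0.0))
```

Half the bytes means half the memory traffic, and skipping the zeros is exactly what sparsity-aware accelerators try to exploit.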

Neural Networks compile for A LONG TIME. (Credit: IT Experts)

How can we make deep neural network training, testing, and prediction faster? One way is to use faster algorithms, like the relu activation function, which is much cheaper to compute than tanh or sigmoid. Another is to write better compilers that map the neural network onto the hardware. A third approach is what I want to tell you about today: making better hardware, and by better I mean faster. Matrix multiplication and indexing are at the core of deep learning, and they are an “embarrassingly parallel” problem. That’s what gets the hardware folks really interested: the fact that a solution should be “easy”, or at least not impossible. In a recent article, I pointed to a nice comprehensive review of recent progress in accelerating deep neural networks prepared by MIT and Nvidia.
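
To see the “faster algorithms” point in action, here is a throwaway CPU benchmark of mine (numbers will vary by machine and numpy build, so treat it as a sketch):

```python
import numpy as np
import timeit

x = np.random.randn(1000000).astype(np.float32)

# relu is a compare-and-select; tanh and sigmoid each need an exponential per element
relu_t = timeit.timeit(lambda: np.maximum(x, 0.0), number=50)
tanh_t = timeit.timeit(lambda: np.tanh(x), number=50)
sigmoid_t = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-x)), number=50)

print(f"relu:    {relu_t:.3f} s")
print(f"tanh:    {tanh_t:.3f} s")
print(f"sigmoid: {sigmoid_t:.3f} s")
```

The relu line should come out well ahead on most machines, which is part of why it has become the default activation: cheaper math per element means faster layers.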

Some really cool custom hardware solutions have appeared over the past few years, like the Volta GPUs from Nvidia, the TPUs from Google, and a bunch of FPGA accelerators.

Without tooting my own horn too hard, let me tell you about the FPGA stuff, because the TPU and Volta stuff are a lot more “commercial”. Intel (Altera) and Xilinx are pretty much the only ones who benefit from selling FPGAs, so you don’t hear as much about them, but AWS has FPGA instances, and the I/O you can get on an FPGA is pretty nuts.

Way back in 2011, my collaborators and I built custom processors on FPGAs to speed up neural network computations. Back then we were really into Bidirectional Associative Memory and Hopfield associative memory neural networks (unsupervised learning), whereas today we mostly use supervised approaches like DNNs, CNNs, and RNNs. Put simply, you get better results on most problems with supervised learning.
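
If you haven’t met a Hopfield network before, here is a minimal numpy sketch of the idea (the textbook recall algorithm, not the FPGA design from our paper): store a few binary patterns with a Hebbian outer-product rule, then recover one of them from a corrupted cue.

```python
import numpy as np

np.random.seed(0)
n_neurons, n_patterns = 100, 3

# Store a few random +/-1 patterns with the Hebbian outer-product rule
patterns = np.random.choice([-1, 1], size=(n_patterns, n_neurons))
W = sum(np.outer(p, p) for p in patterns) / n_neurons
np.fill_diagonal(W, 0)  # no self-connections

# Corrupt one stored pattern by flipping 20% of its bits
cue = patterns[0].copy()
flip = np.random.choice(n_neurons, size=20, replace=False)
cue[flip] *= -1

# Recall: repeatedly update the state until it settles into a stored attractor
state = cue
for _ in range(10):
    state = np.sign(W @ state)
    state[state == 0] = 1  # break ties

print("recovered pattern 0:", np.array_equal(state, patterns[0]))
```

Each update is just a matrix-vector product followed by a sign, which is exactly the kind of operation that maps nicely onto parallel hardware like an FPGA.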

In another paper, also in 2011, my collaborators and I used lookup tables on an FPGA to speed up the most common calculations encountered by a neural network.
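
The paper covers the FPGA details, but the lookup-table idea itself fits in a few lines of numpy. Here is a toy version of mine, assuming the “common calculation” in question is a sigmoid activation: precompute the function once over a grid, then replace every call with an index into the table.

```python
import numpy as np

# Precompute sigmoid once over a fixed input range
LO, HI, N = -8.0, 8.0, 4096
grid = np.linspace(LO, HI, N)
table = 1.0 / (1.0 + np.exp(-grid))

def sigmoid_lut(x):
    """Approximate sigmoid by nearest-neighbour lookup into the precomputed table."""
    idx = np.clip(((x - LO) / (HI - LO) * (N - 1)).astype(np.int64), 0, N - 1)
    return table[idx]

x = np.random.randn(5).astype(np.float32) * 3
print("exact :", 1.0 / (1.0 + np.exp(-x)))
print("lookup:", sigmoid_lut(x))
```

On an FPGA the table can sit in block RAM, so each activation becomes a single memory read instead of an exponential.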

These approaches still apply today. You can even go embedded: grab a tiny Parallella board for $100 and map a small neural network right into the on-system FPGA. It has a dual-core ARM Cortex-A9, runs Linux, and just for kicks it has a 16-core network-on-chip for coprocessing. The main downside is heat: this credit-card sized supercomputer won’t literally melt and catch fire if you push it too hard, but in many cases the chip will damage itself from the heat, even with a passive heat sink.

OK, maybe I’m an alarmist. Here is a Raspberry Pi heat sink discussion with a lot less flash. One more for good measure. In my work on neural networks, we use P-type GPU instances on AWS for most projects. FPGAs are just too full-custom for the clients I’ve interacted with. And there is no nice and easy way to connect Keras to an FPGA. Oh well.

I will leave you with this video of a brave hacker cooking baloney with his overheated CPU:

If you enjoyed this article on artificial intelligence, then please try out the clap tool. Follow us on Medium. Go for it. I’m also happy to hear your feedback in the comments. What do you think?

Happy Coding!

-Daniel
daniel@lemay.ai ← Say hi.
Lemay.ai
1(855)LEMAY-AI
