The Sparse Future of Deep Learning

Could this be the next quantum leap in AI research?

Michael Klear
Towards Data Science


A new deep learning algorithm has the potential to be a game changer.

In June of 2018, a group of researchers (Mocanu et al.) published “Scalable Training of Artificial Neural Networks with Adaptive Sparse Connectivity inspired by Network Science.” They show that their novel way of training neural networks is competitive with state-of-the-art methods while requiring far fewer compute resources, at least in theory. This should allow future work to scale neural networks up to unprecedented sizes.

Think big.

The paper appears to have gotten little traction in the machine learning community so far, perhaps because it was published in the natural sciences journal Nature Communications. The machine learning community tends to follow a relatively small set of venues. Nature Communications publishes plenty of interesting papers on machine learning applications in the natural sciences, but I fear many ML researchers have overlooked this important paper because it’s not where you typically find novel deep learning algorithms.

The journal name is the only reason I can think of for the little attention this paper has received. Looking over the results, the research appears to open the door to a quadratic increase in the learning capacity of deep neural network models for a given parameter budget.

So what’s the big deal?

Fully Connected Layers

To understand this paper, one must first understand the scalability issues associated with modern deep neural networks.

Deep neural networks use fully-connected layers (FCLs). These use dense matrices to transform input vectors into an output space. An FCL is usually represented by a diagram like this:

Fully-connected layer diagram, taken from Jessica Yung’s blog. Each input node connects with each output node.

This has proven to be a very powerful method. The problem is that the number of connections in each FCL grows quadratically with the number of nodes in the inputs and in the outputs.

An FCL with 16 inputs and 32 outputs has 16 x 32 = 512 connections; an FCL with 32 inputs and 64 outputs has 32 x 64 = 2048 connections. This puts a practical upper limit on the number of inputs and outputs that an FCL can have.
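To make the arithmetic concrete, here is a minimal PyTorch sketch (my own illustration, not from the paper) showing how those connection counts appear as trainable weights:

import torch.nn as nn

# A fully-connected layer stores one weight per input-output pair,
# so its parameter count is simply inputs * outputs (bias omitted).
small = nn.Linear(16, 32, bias=False)
large = nn.Linear(32, 64, bias=False)

print(small.weight.numel())  # 512
print(large.weight.numel())  # 2048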

This is a problem when we deal with very high-dimensional data; simply plugging the data into a neural network with FCLs is computationally intractable.

Datasets in the biology domain, including genetics research, are an important example of high-dimensional data.

Images are another example of very high-dimensional data; an RGB image with 256 x 256 pixels contains 3 x 256 x 256 = 196,608 values. If we hoped to apply an FCL here, we’d end up with tens or even hundreds of millions of parameters in just one layer.

Convolutional layers provide a way around the quadratic FCL problem and led to the breakthroughs in image-processing models of the 2010s. Basically, we can use convolutions to reduce the dimensionality of our input data enough to pipe it into FCLs before producing some useful output.

In the case of convolutions, we take advantage of the fact that adjacent pixels carry related, important information. There is inherent structure to the pixels in image data that gives meaning to the overall image. Convolutions use our prior knowledge of this structure to produce excellent results on image-processing problems while limiting the computational cost of training these models.

Example of convolution layers feeding image data into a fully-connected layer. — Source: https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html
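As a rough sketch (my own toy architecture, not taken from any particular model), a couple of convolution and pooling stages can shrink a 256 x 256 RGB image down to a vector small enough for an FCL to handle:

import torch
import torch.nn as nn

# Convolutions and pooling shrink a 3 x 256 x 256 image
# before any fully-connected layer sees it.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
)

x = torch.randn(1, 3, 256, 256)   # 196,608 input values
h = features(x).flatten(1)        # shape (1, 8192) after pooling

# An FCL over 8,192 features is manageable; one over the raw
# 196,608 pixel values would need tens of millions of weights.
classifier = nn.Linear(h.shape[1], 10)
print(classifier(h).shape)        # torch.Size([1, 10])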

But what if our data isn’t structured in this way? For example, what if we are modeling genetic data with tens or hundreds of thousands of DNA markers as input features? There’s no inherent structure to take advantage of here, and convolution layers cannot save us in such situations.

Sparsely Connected Layers

Is it absolutely necessary that each node in one layer be connected to each node in the next? The answer turns out to be “no.”

We can easily imagine a layer that does not connect every node exhaustively. In the research literature, this sort of layer is often called a Sparsely Connected Layer (SCL).

Diagram comparing FCL to SCL. Taken from Amir Alavi’s blog.

In fact, many networks found in nature behave a lot more like this. Returning to the original inspiration for artificial neural networks (the brain), neurons (analogous to nodes) are only connected to a handful of other neurons.

SCLs have been applied and studied in various projects and publications, and sparsity is a heavily explored topic in the study of neural network pruning.
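A common way to prototype an SCL, and the way I will sketch it here (the class below is my own illustration, not taken from any library), is a dense weight matrix gated by a fixed binary mask, so that only a small fraction of the possible connections actually exist:

import torch
import torch.nn as nn

class MaskedSparseLinear(nn.Module):
    """Sketch of a sparsely connected layer: a dense weight matrix
    gated by a fixed binary mask (names and details are my own)."""

    def __init__(self, in_features, out_features, density=0.1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        # Keep roughly `density` of the possible connections, chosen at random.
        mask = (torch.rand(out_features, in_features) < density).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Masked-out weights contribute nothing and receive no gradient.
        return x @ (self.weight * self.mask).t()

layer = MaskedSparseLinear(1000, 100, density=0.05)
print(int(layer.mask.sum()), "of", layer.mask.numel(), "possible connections")

The SET sketch later in this post builds on this masked-layer idea.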

Which Connections Should We Use?

In some cases, researchers use observed networks found in the source of data (such as protein interaction networks) to construct sparsely-connected layer architectures. But what if we don’t have any such prior knowledge at our disposal?

Finding a way to learn which connections to keep was still an outstanding problem. It now appears to be solved.

Learning a Sparsely Connected Layer Topology

Finding an intelligent way to connect nodes is the subject of the Adaptive Sparse Connectivity paper. The algorithm, called the Sparse Evolutionary Training (SET) procedure, is actually very simple.

The SET algorithm, taken from the original publication.

Basically, we randomly initialize SCLs in our network and start training using backpropagation and other standard-issue deep learning optimization techniques. At the end of each epoch, we remove the connections with the smallest-magnitude weights (the “weakest” connections) and replace them with randomly initialized new ones. Rinse and repeat.
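Here is a sketch of what one rewiring step might look like on the masked layer from earlier. This is my own simplification; the fraction removed and the regrowth rule below are assumptions, not the authors’ exact settings.

import torch

def evolve_connections(weight, mask, zeta=0.3):
    """One SET-style rewiring step: drop the weakest fraction `zeta`
    of existing connections, then regrow the same number at random."""
    with torch.no_grad():
        active = mask.bool()
        n_drop = int(zeta * active.sum().item())
        if n_drop == 0:
            return
        # 1. Remove the connections with the smallest magnitudes.
        magnitudes = weight.abs().masked_fill(~active, float("inf"))
        drop_idx = torch.topk(magnitudes.flatten(), n_drop, largest=False).indices
        mask.view(-1)[drop_idx] = 0.0
        weight.view(-1)[drop_idx] = 0.0
        # 2. Regrow the same number of connections in random empty slots.
        empty = (mask.view(-1) == 0).nonzero().squeeze(1)
        grow_idx = empty[torch.randperm(empty.numel())[:n_drop]]
        mask.view(-1)[grow_idx] = 1.0
        weight.view(-1)[grow_idx] = torch.randn(n_drop) * 0.01

Calling evolve_connections(layer.weight, layer.mask) at the end of each epoch, in between optimizer steps, gives the basic training loop the paper describes.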

SET turns out to be surprisingly robust and stable. Encouragingly, the authors are able to show very similar results to FCL models (sometimes surpassing their performance) with SET models that contain far fewer parameters.

SET-MLP is competitive with a fully-connected multilayer perceptron model, while using quadratically fewer parameters (taken from the original publication).

These results impressed me, but apparently the greater ML community didn’t get very excited.

ML researchers reading SET paper.

Encoding Domain-Specific Information in Learned Connections

Not only do SET networks learn the supervised objective; they also encode input feature importance in the connections formed in each layer.

An easy example of this is looking at the input connections from an SET model trained on MNIST.

Input connections from SET-trained SCL on MNIST. Digits are centered and scaled to be consistently located throughout the dataset; it stands to reason that a well-learned connection map will place more connections on these center pixels (taken from the original publication).

When we look at the evolution of the distribution of input connections in the MNIST problem, we can see the model implicitly learns about the distribution of predictive information in the input data through the connections it keeps. In a domain where we do not already know this distribution, this information can help us “discover” interesting features and relationships in the unstructured input data.
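With a masked layer like the one sketched earlier, reading off this implicit feature importance is just a matter of counting surviving connections per input. The snippet below uses a random mask as a stand-in for a trained layer’s mask, and the 28 x 28 reshape assumes MNIST-style flattened inputs:

import torch

# Stand-in for the (out_features, in_features) mask of a trained
# first layer on MNIST (784 = 28 x 28 input pixels).
mask = (torch.rand(300, 784) < 0.05).float()

# Count how many surviving connections touch each input pixel and
# arrange the counts as an image-shaped heat map.
connections_per_pixel = mask.sum(dim=0).reshape(28, 28)
print(connections_per_pixel.max(), connections_per_pixel.min())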

I Want My SET

I was told there would be scalability.

I read this paper and got really excited. It promises to deliver a new avenue to previously intractable ML problems. The problem is, we’re not ready yet. The researchers perhaps said it best themselves:

Currently, all state-of-the-art deep learning implementations are based on very well-optimized dense matrix multiplications on graphics processing units (GPUs), while sparse matrix multiplications are extremely limited in performance.

In fact, the researchers took advantage of these optimized structures to perform their proof-of-concept experiments. They simply used FCLs with masking to simulate their algorithm, leaving the reader with nothing but hope for the awesome future of sparsely-connected layers learned through SET.

If these software engineering challenges are solved, SET may prove to be the basis for much larger ANNs, perhaps on a billion-node scale, to run in supercomputers.

I’m not a computer scientist, but I want to get the ball rolling. That’s why I decided to implement SET in PyTorch. I built the algorithm to use sparse data structures, but it’s not optimal. It’s pretty slow, to be honest. But I want to get this thing out in the wild so people can start experimenting with it and hacking away at it. Maybe soon we will have a much faster implementation.

For now, anyone reading this is welcome to experiment with SET by using my custom PyTorch layer. I call the project “Synapses” because it’s a cool, fitting name that luckily wasn’t taken on PyPI. To use it, just install with pip:

pip install synapses

❤ pypi

The only dependency is torch!

It is slow, but at least it is truly sparse. No dense layers and masking trickery involved here.
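Concretely, “truly sparse” means the weights live in a sparse tensor and the forward pass is a sparse-dense matrix multiply, roughly like the sketch below (my own minimal illustration, not Synapses’ actual internals):

import torch

# A tiny sparse layer: only three connections exist between
# 5 inputs and 4 outputs, stored in COO format.
indices = torch.tensor([[0, 2, 3],    # output row of each connection
                        [1, 4, 0]])   # input column of each connection
values = torch.randn(3)
weight = torch.sparse_coo_tensor(indices, values, size=(4, 5))

x = torch.randn(5, 1)             # a single input column vector
y = torch.sparse.mm(weight, x)    # no dense weight matrix is ever materialized
print(y.shape)                    # torch.Size([4, 1])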

Take a look at some example code and try playing with it a bit! I already found a method that may speed up optimization significantly by changing the evolution policy: rather than evolving the connections at the end of each epoch, my flavor of SET evolves the connections with some very small probability at each forward pass of the data. It appears to lead to faster convergence, although I need to experiment a bit more to support this claim.
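In pseudocode the change is tiny: instead of an explicit end-of-epoch call, every forward pass during training rolls the dice (the probability below is an arbitrary placeholder, and evolve_connections is the sketch from earlier in this post):

import random

P_EVOLVE = 1e-3   # arbitrary small probability; tune per dataset

def forward_with_evolution(layer, x):
    # Occasionally rewire during the forward pass itself,
    # instead of once per epoch.
    if layer.training and random.random() < P_EVOLVE:
        evolve_connections(layer.weight, layer.mask)
    return layer(x)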

The conclusion of this research is that the algorithm is robust and simple, leaving much room for improvement in future work. I hope Synapses will facilitate that!


Machine learning practitioner and data scientist with a background in astrophysics.