
Binarized Neural Networks: An Overview

Binarized Neural Networks are an intriguing neural network variant that can save memory, time, and energy. Learn how they work.

Photo by Alexander Sinn on Unsplash

One roadblock to using neural networks is the power, memory, and time needed to run the network. This is problematic for mobile and internet-of-things devices that don’t have powerful CPUs or large amounts of memory. Binarized neural networks are a solution to this problem: by using binary values instead of floating-point values, the network can be computed faster and with less memory and power.

Mechanics:

Conceptually, a binarized neural network (BNN) is similar to a regular feedforward neural network (NN). One difference is that the weights and activations in a BNN are constrained to just two values: +1 and -1, hence the name “binarized”. Because of this, only one activation function is used: the sign function. We can’t use regular activation functions like sigmoid or ReLU, because those have continuous outputs, and the activations can only be +1 and -1.

Source: Boolean Masking of an Entire Neural Network
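To make the binarization concrete, here is a minimal sketch of the sign function applied to both weights and activations. This is my own illustration, not code from the papers, and it assumes the common convention that sign(0) maps to +1:

```python
import numpy as np

# Minimal sketch: binarize real values to +1/-1 with the sign function.
# Convention (an assumption here): sign(0) is treated as +1.
def binarize(x):
    return np.where(x >= 0, 1.0, -1.0)

x = np.array([0.7, -1.3, 0.0, 2.5])
print(binarize(x))  # [ 1. -1.  1.  1.]
```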

These restrictions cause some problems when we train the network with gradient descent. First, the gradient of the sign function is 0 everywhere it is defined (the function is flat except for a jump at 0). This is bad because it makes the gradients of all the weights and activations 0 as well, meaning no training would actually take place. We can get around this by ignoring the sign activation functions when doing backpropagation (the so-called straight-through estimator). However, we have another problem: the gradient updates will cause the weights to no longer be either +1 or -1. We solve this by keeping a set of real-valued weights and doing the gradient updates on those weights. The network weights are then the binarization of these real-valued weights.

Our final algorithm is therefore:

  • Forward pass: We have real-valued weights W_r and the input vector x. First, we apply the sign function to all the weights to get W_b = sign(W_r). Then we compute the output of our neural network as usual, using W_b and the sign activation functions.
  • Backward pass: We do backpropagation as usual and calculate the gradients for the weights W_b, except we ignore the sign activation functions. We then update the weights W_r by subtracting these gradients (remember, W_b comes from W_r, so we don’t update W_b directly!). A minimal code sketch of this procedure is shown below.
Source: [2]
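Here is a minimal sketch of one training step for a single binarized layer. This is my own illustration with made-up shapes and a simple squared-error loss, not the exact algorithm from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)
W_r = rng.normal(size=(4, 3)) * 0.1      # real-valued weights we actually update
x = np.array([1.0, -1.0, 1.0, -1.0])     # an already-binarized input
y_target = np.array([1.0, -1.0, 1.0])

def sign(z):
    return np.where(z >= 0, 1.0, -1.0)

# Forward pass: binarize the weights, then apply the sign activation.
W_b = sign(W_r)                          # W_b = sign(W_r)
y = sign(x @ W_b)                        # binarized output activation

# Backward pass (straight-through): ignore the sign functions, i.e. treat
# the network as if y were simply x @ W_b when computing gradients.
grad_y = y - y_target                    # gradient of a squared-error loss w.r.t. y
grad_W = np.outer(x, grad_y)             # gradient w.r.t. W_b, used to update W_r

lr = 0.01
W_r -= lr * grad_W                       # update the real-valued weights;
                                         # W_b is re-binarized on the next forward pass
```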

Performance:

We’ve seen how a binarized neural network works. Now let’s look at the theoretical performance difference between a normal NN and a BNN. A normal NN uses 32-bit floating-point numbers. If we have two 32-bit registers, we can perform one multiplication between two 32-bit numbers with one computer instruction. A binarized network uses 1-bit numbers (we encode +1 as 1 and -1 as 0). To multiply two 1-bit numbers in this representation, we can use the XNOR instruction.

Source: [2]
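As a quick sanity check of this encoding (my own snippet, not from [2]), XNOR on the encoded bits really does reproduce the ±1 product:

```python
# Encode +1 as bit 1 and -1 as bit 0; XNOR of the bits matches the product.
for a in (+1, -1):
    for b in (+1, -1):
        bit_a, bit_b = int(a == 1), int(b == 1)
        xnor = 1 - (bit_a ^ bit_b)            # XNOR = NOT XOR
        assert (a * b == 1) == (xnor == 1)    # product is +1 exactly when the XNOR bit is 1
```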

What we can do is put 32 1-bit numbers into each of our two 32-bit registers, and then run one XNOR instruction to multiply all 32 pairs simultaneously. The upshot is that one instruction performs 32 1-bit multiplications, whereas one instruction performs only a single 32-bit multiplication. Therefore, a BNN has a theoretical 32x speedup over a normal NN.
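Here is a rough sketch of that packing trick in plain Python. It is my own illustration, not the optimized GPU kernel from [1]: pack 32 values of ±1 into one integer, do a single XNOR, and recover the dot product from the popcount.

```python
import random

N = 32                                    # values packed per "register"

def pack(values):
    # +1 -> bit 1, -1 -> bit 0, packed into one integer
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def xnor_dot(word_a, word_b, n=N):
    mask = (1 << n) - 1
    xnor = ~(word_a ^ word_b) & mask      # one bitwise op "multiplies" all n pairs
    ones = bin(xnor).count("1")           # popcount: number of matching positions
    return 2 * ones - n                   # matches minus mismatches = dot product

a = [random.choice([1, -1]) for _ in range(N)]
b = [random.choice([1, -1]) for _ in range(N)]
assert xnor_dot(pack(a), pack(b)) == sum(p * q for p, q in zip(a, b))
```

In a real implementation the same idea runs on 32- or 64-bit hardware registers with a native popcount instruction, which is where the 32x factor above comes from.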

In practice the 32x figure is hard to achieve. Instruction scheduling depends on how the CPU/GPU code is written, and instructions don’t all take the same number of clock cycles to execute. Also, modern CPUs/GPUs and the libraries built for them are optimized for floating-point arithmetic rather than bit-packed code, so care has to be taken in how the code is written. Finally, while multiplication is a large part of the total computation in a neural network, there is also the accumulation/sum that we didn’t account for. Nevertheless, [1] reports a 23x speedup when comparing an optimized BNN to an unoptimized normal NN, demonstrating that a significant speedup is definitely achievable.

Because we are using 1-bit numbers instead of 32-bit numbers, we also expect to use roughly 32x less memory. When we use 32x less memory, we have 32x fewer memory accesses. This should also reduce power usage by roughly 32x. I was unable to find experimental numbers testing this hypothesis, but this conclusion seems sound to me.
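As a quick back-of-the-envelope example (a hypothetical layer size, just to illustrate the factor), here is the memory needed for a single dense layer’s weights in 32-bit floats versus packed 1-bit values:

```python
# Hypothetical 1024x1024 dense layer, just to illustrate the ~32x factor.
in_features, out_features = 1024, 1024
n_weights = in_features * out_features

float32_bytes = n_weights * 4        # 4 bytes (32 bits) per weight
binary_bytes = n_weights // 8        # 1 bit per weight, packed 8 per byte

print(f"float32: {float32_bytes / 1e6:.2f} MB")  # ~4.19 MB
print(f"binary : {binary_bytes / 1e6:.2f} MB")   # ~0.13 MB, about 32x smaller
```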

Finally, it is important to note that all of these BNN advantages only apply during run-time, and not during train-time. This is because, as we mentioned earlier, we keep a set of real-valued weights to train on. This means the gradient is a 32-bit floating point number, and is not subject to the 1-bit advantages we described. Therefore, BNN models should not be trained on memory/power-constrained devices, but they can be run on those devices.

Accuracy:

Of course, it doesn’t matter how much speed, memory, or power improves if the network is unusable because of poor accuracy on the test set. Having binarized activations and weights gives less expressive power than a regular NN, so we expect worse accuracy; the question is how much worse. I wasn’t able to find any theoretical papers answering this question, so for now I think empirical numbers like those below are the best we have. The BNN implementation we described here is in the first section of the table, and the regular NN variants are in the last section.

Source: [1]

We see there definitely is an increased error, especially on the CIFAR-10 dataset (10.15% vs. 7.62%). However, this difference of roughly 2.5 percentage points isn’t the end of the world, considering all the performance benefits of BNNs described earlier. For tasks where every point of accuracy is vital, such as medical X-ray screening, we wouldn’t want to use a BNN. But in situations where accuracy isn’t as critical, or where performance matters more, BNNs are a viable choice.

In this article we’ve taken a brief look at the mechanics of a BNN. We’ve also seen how BNNs may drastically improve speed, memory use, and power use, with the tradeoff of being slightly less accurate. For a more in-depth look at BNNs, take a look at the references below.

[1] M. Courbariaux and I. Hubara, Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1 (2016)

[2] T. Simons and D. Lee, A Review of Binarized Neural Networks (2019)
