Thoughts and Theory

Feedback Alignment Methods

A biologically-motivated alternative to backpropagation

Albert Jimenez
Towards Data Science
Sep 16, 2021


Backpropagation’s simplicity, efficiency, and high accuracy and convergence rates make it the de facto algorithm for training neural networks. However, there is evidence that such an algorithm could not be implemented biologically by the human brain [1]. One of the main reasons is that backpropagation requires synaptic symmetry between the forward and backward paths. Since synapses are unidirectional in the brain, feedforward and feedback connections must be physically distinct. This is known as the weight transport problem.

To overcome this limitation, recent studies of learning algorithms have focused on the intersection between neuroscience and machine learning, proposing more biologically plausible algorithms. One of the main families of methods is known as feedback alignment; these methods employ distinct forward and feedback synaptic weights.

Comparison of the different learning algorithms: Backpropagation, FA, and DFA — Image by Author

In addition to providing more biologically plausible training schemes, it has been shown that alignment methods can increase the robustness of deep learning models against adversarial attacks. Finally, additional interest in feedback alignment is driven by its ability to let forward and feedback weights live locally in application-specific integrated circuits (ASICs), which ultimately allows for time and energy savings.

In this post, I will describe the main feedback alignment methods and show a benchmark comparing their accuracy and robustness with backpropagation.

Introduction to feedback alignment methods

Backpropagation is not biologically plausible

The training of a neural network of N layers using backpropagation alternates between forward passes, to perform inference and compute an error signal, and backward passes, to send the error signal back and update the weights.

We can observe the weight update rule below, where η is the learning rate, f′ the derivative of the activation function, h_i the activations of layer i, and δz_i the error signal at layer i:

δz_i = ( W_(i+1)^T · δz_(i+1) ) ⊙ f′(z_i),    ΔW_i = −η · δz_i · h_(i−1)^T

Backpropagation weight update rule

The weight update of layer i requires knowledge of the forward weight matrix W_(i+1) of the downstream layer. That is biologically implausible because it would require neurons to communicate large numbers of synaptic weight values to one another to perform the backward pass.
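To make this dependency concrete, here is a minimal NumPy sketch of one backpropagation step for a toy two-layer network (layer sizes and variable names are illustrative, not from any particular implementation). Note that computing the layer-1 error signal requires W2.T, the very weights used in the forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network: x -> h1 -> y, ReLU hidden activation, MSE loss.
W1 = rng.normal(size=(4, 3)) * 0.5   # layer-1 forward weights
W2 = rng.normal(size=(2, 4)) * 0.5   # layer-2 forward weights

x = rng.normal(size=(3, 1))
target = rng.normal(size=(2, 1))

# Forward pass
z1 = W1 @ x
h1 = np.maximum(z1, 0.0)             # ReLU
y = W2 @ h1                          # linear output layer

# Backward pass: the layer-1 error signal needs W2.T -- the same
# synapses used in the forward pass (the weight transport problem).
delta2 = y - target                  # dL/dy for L = 0.5 * ||y - target||^2
delta1 = (W2.T @ delta2) * (z1 > 0)  # transported forward weights W2.T

# Weight updates
lr = 0.1
W2 -= lr * delta2 @ h1.T
W1 -= lr * delta1 @ x.T
```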

Weight Transport Problem— Photo by Ian Taylor on Unsplash

Feedback alignment algorithms avoid the weight transport problem

Feedback alignment algorithms propose replacing the transpose of the forward weight matrix W in the backward pass with a separate feedback matrix B, removing the need to transport the weights. The differences among the methods lie in how the backward pass is computed and how the backward weight matrix B is built.

Feedback Alignment (FA): The weight updates are computed in the same fashion as in backpropagation, but the backward weight matrix is a random matrix [2]. Initializing B to have the same distribution and magnitude scale as W helps improve network training convergence. The formula below depicts the weight update rule for FA:

δz_i = ( B_(i+1) · δz_(i+1) ) ⊙ f′(z_i),    ΔW_i = −η · δz_i · h_(i−1)^T

Feedback Alignment weight update formula
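Under the same toy two-layer setup as standard backpropagation, FA changes a single line of the backward pass: the transported matrix W2.T is replaced by a fixed random feedback matrix B2. A minimal NumPy sketch (shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy network as before, but the backward pass uses a FIXED random
# feedback matrix B2 in place of the transported forward weights W2.T.
W1 = rng.normal(size=(4, 3)) * 0.5
W2 = rng.normal(size=(2, 4)) * 0.5
B2 = rng.normal(size=(4, 2)) * 0.5   # fixed random feedback weights,
                                     # drawn at the same scale as W2

x = rng.normal(size=(3, 1))
target = rng.normal(size=(2, 1))

z1 = W1 @ x
h1 = np.maximum(z1, 0.0)
y = W2 @ h1

delta2 = y - target
delta1 = (B2 @ delta2) * (z1 > 0)    # random B2 replaces W2.T

lr = 0.1
W2 -= lr * delta2 @ h1.T
W1 -= lr * delta1 @ x.T              # update driven by the random feedback
```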

Uniform Sign-concordant Feedbacks (uSF): This method’s weight update is similar to FA’s, but it transports the sign of the forward matrices while assuming synaptic weights with unit magnitudes [3]. Therefore, the B matrix is:

B_i = sign( W_i )

Sign symmetry (uSF) formula

Batchwise Random Magnitude Sign-concordant Feedbacks (brSF): Instead of assuming a unit magnitude for the backward weights of the i-th layer, this method re-draws their magnitude |R_i| after each update, such that:

B_i = |R_i| ⊙ sign( W_i )

Batchwise Random Magnitude Sign-concordant Feedbacks formula

Fixed Random Magnitude Sign-concordant Feedbacks (frSF): This is a variation of the brSF method where the magnitude of the weights |R_i| is not re-drawn after each update, but is instead fixed at initialization at the start of training.
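The three sign-concordant variants differ only in how the feedback matrix is built from W. A small NumPy sketch of the three constructions (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))             # forward weights of one layer

# uSF: keep only the sign of W (unit magnitudes).
B_usf = np.sign(W)

# brSF: sign of W with magnitudes |R| re-drawn after every update.
R = rng.normal(size=W.shape)
B_brsf = np.abs(R) * np.sign(W)

# frSF: same construction, but |R| is drawn once at initialization
# and kept fixed for the whole training run.
R_fixed = np.abs(rng.normal(size=W.shape))
B_frsf = R_fixed * np.sign(W)

# All three feedback matrices agree with W in sign.
assert np.array_equal(np.sign(B_usf), np.sign(W))
```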

Direct Feedback Alignment (DFA): While the weight update in FA is computed recursively across layers, it is possible to propagate the error by directly projecting the derivative of the loss at the last layer, δz_N, to every layer [4]. This results in the following update:

δz_i = ( B_i · δz_N ) ⊙ f′(z_i),    ΔW_i = −η · δz_i · h_(i−1)^T

Direct Feedback Alignment weight update formula

where B_i is a fixed random matrix of appropriate shape (i.e., input dimension of layer_i × output dimension of layer_N).
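A minimal NumPy sketch of a DFA backward pass for a toy three-layer network (sizes and names are illustrative). Note that the hidden-layer error signals use only the last-layer error and the fixed matrices B_i; there is no recursion through the forward weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-layer network. DFA projects the last-layer error delta3 DIRECTLY
# to every hidden layer through fixed random matrices B1 and B2.
W1 = rng.normal(size=(5, 3)) * 0.5
W2 = rng.normal(size=(4, 5)) * 0.5
W3 = rng.normal(size=(2, 4)) * 0.5
B1 = rng.normal(size=(5, 2)) * 0.5   # last-layer error -> layer 1
B2 = rng.normal(size=(4, 2)) * 0.5   # last-layer error -> layer 2

x = rng.normal(size=(3, 1))
target = rng.normal(size=(2, 1))

z1 = W1 @ x;  h1 = np.maximum(z1, 0.0)
z2 = W2 @ h1; h2 = np.maximum(z2, 0.0)
y = W3 @ h2

delta3 = y - target                  # loss derivative at the last layer
delta2 = (B2 @ delta3) * (z2 > 0)    # no recursion through W3.T
delta1 = (B1 @ delta3) * (z1 > 0)    # no recursion through W2.T either

lr = 0.1
W3 -= lr * delta3 @ h2.T
W2 -= lr * delta2 @ h1.T
W1 -= lr * delta1 @ x.T
```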

Algorithmic Benchmark

In this section, I will explain the experiments carried out in Benchmarking the Accuracy and Robustness of Feedback Alignment Algorithms [5]. We used the open-source framework BioTorch to carry out the benchmark.

We initialized our layers’ forward and backward weights, W and B, using Xavier uniform initialization. Using this variance-preserving initialization allowed us to keep the weights on the same magnitude scale and improved training under asymmetric conditions. For the uSF method, we scaled the sign of the weights by the standard deviation of the Xavier initialization.
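For reference, here is a NumPy sketch of this initialization scheme (the helper function and the layer shape are my own illustration; BioTorch’s actual API may differ). A Xavier uniform draw is U(−a, a) with a = √(6/(fan_in + fan_out)), and the standard deviation of that distribution is a/√3:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_out, fan_in, rng):
    """Draw a (fan_out, fan_in) matrix from U(-a, a), a = sqrt(6/(fan_in+fan_out))."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_out, fan_in))

# Forward and backward weights drawn from the SAME distribution, so they
# start on the same magnitude scale.
W = xavier_uniform(128, 256, rng)
B = xavier_uniform(128, 256, rng)

# uSF variant: scale sign(W) by the standard deviation of the Xavier
# distribution (std of U(-a, a) is a / sqrt(3)) instead of unit magnitudes.
a = math.sqrt(6.0 / (256 + 128))
B_usf = (a / math.sqrt(3.0)) * np.sign(W)
```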

Our experiments show a significant gap in the classification accuracy of the models depending on the choice of optimizer, especially for FA and DFA, even after tuning their respective learning rates. For this reason, we show the results of our experiments for both the SGD and Adam optimizers.

MNIST & Fashion MNIST

We start our empirical study by benchmarking all the alignment methods for LeNet on the MNIST and Fashion-MNIST datasets. The networks were trained with the SGD optimizer setting a momentum of 0.9, and weight decay of 10^(−3). We trained for 100 epochs, decreasing the initial learning rate by a factor of 2 at the 50th and 75th epoch.

Top-1 error rate (%) for a LeNet network on MNIST and Fashion MNIST

We observe that the performances of FA and DFA are close to BP, and that sign-concordant methods match BP performance on MNIST. If we increase the dataset difficulty (Fashion MNIST), the performance gap between backpropagation and the other methods, especially those not using sign-concordant feedback, also increases.

CIFAR-10

To scale the application of alignment methods to deeper architectures and more challenging tasks, we benchmarked a ResNet-20 and a ResNet-56 on CIFAR-10. Networks using the SGD optimizer were trained with a momentum of 0.9 and a weight decay of 10^(−4). Networks using Adam were trained with the same weight decay and beta parameters of (0.9, 0.999). We trained with a batch size of 128 for 250 epochs, decreasing the initial learning rate by a factor of 10 at the 100th, 150th and 200th epoch. We used a grid search to select the best learning rate for every method.

Using an adaptive, per-parameter optimizer such as Adam outperformed SGD in both network configurations for the FA and DFA methods, as shown below. These are the methods where the asymmetry in the backward pass is largest.

Top-1 Error (%) in CIFAR-10 for a ResNet-20 and ResNet-56 for every method trained with SGD and Adam

The significant improvement brought by Adam is expected, since it maintains per-parameter learning rates adapted using a running estimate of the second moments of the gradients, a mechanism inherited from RMSProp. It therefore performs better under noisy gradients. To confirm this observation, we plot the backward-forward weight norm ratios for the DFA method for both the SGD and Adam optimizers.
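As a reminder of why this helps, here is a sketch of a single Adam step on one parameter tensor (a textbook formulation, not PyTorch’s exact code): the per-coordinate division by the square root of the second-moment estimate normalizes noisy gradient components.

```python
import numpy as np

# One Adam step on a single parameter tensor (textbook formulation): the
# per-coordinate division by sqrt(v_hat) normalizes noisy gradients.
def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                    # first-moment EMA
    v = b2 * v + (1 - b2) * g ** 2               # second-moment EMA (RMSProp)
    m_hat = m / (1 - b1 ** t)                    # bias corrections
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adapted per-parameter step
    return w, m, v

# On the first step the update is ~ -lr * sign(g): large and small gradient
# coordinates move by the same amount, unlike plain SGD.
w = np.zeros(3)
g = np.array([10.0, -0.1, 0.5])
w1, m1, v1 = adam_step(w, g, np.zeros(3), np.zeros(3), t=1)
```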

Weight ratios for the methods FA and DFA when training a ResNet-56 on CIFAR-10 with SGD and Adam

It can be observed that SGD drove the norm weight ratios of the first layers of the network very close to 0, which means that the forward weight matrices W_i were updated to reach much larger values than those in the backward weight matrices B_i. This is due to the direct projection of the error from the last layer to each layer in the DFA method, which sidesteps the small gradient norms produced by the chain rule.

The same observation does not hold for the FA method, where the weight norm ratio of the first layers does not vanish to 0 as shown below. However, we see that Adam achieves smaller backward-forward weight alignment angles compared to SGD.

Matrix alignment (left) and weight ratios (right) for a ResNet-20 trained with FA with SGD (a) and Adam (b)
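Both diagnostics used above are easy to compute from a layer’s W and B. A NumPy sketch (the function names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_ratio(W, B):
    """Backward-forward weight norm ratio ||B|| / ||W||."""
    return np.linalg.norm(B) / np.linalg.norm(W)

def alignment_angle_deg(W, B):
    """Angle between W and B treated as flat vectors (90 deg = no alignment)."""
    w, b = W.ravel(), B.ravel()
    cos = np.dot(w, b) / (np.linalg.norm(w) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Independent random matrices in high dimension are nearly orthogonal
# (angle close to 90 degrees) and share the same scale (ratio near 1).
W = rng.normal(size=(64, 64))
B = rng.normal(size=(64, 64))
ratio = norm_ratio(W, B)
angle = alignment_angle_deg(W, B)
```

Training with FA then drives the angle below 90 degrees as the forward weights align with the fixed feedback weights.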

ImageNet

Finally, we benchmarked all the methods training a ResNet-18 network on ImageNet. We used a batch size of 256 and trained for 75 epochs using SGD with an initial learning rate of 0.1. A scheduler decreased the learning rate by a factor of 10 at the 20th, the 40th and the 60th epoch. We used a weight decay of 10^(−4) and a momentum of 0.9. For DFA we used Adam with an initial learning rate of 0.001. Our results can be seen in the graph below:

Top-1 ImageNet validation error (%) for a ResNet-18 network trained with all the feedback alignment methods

We can observe that none of the alignment methods matches backpropagation in terms of performance. While the sign-concordant feedback methods are closer in accuracy, there is still a small gap. For the FA and DFA methods the gap is much larger, meaning that the forward weights could not align well with the fixed feedback weights during the course of training.

Conclusion

Feedback alignment algorithms are a more biologically plausible alternative to backpropagation, as they avoid the weight transport problem. Throughout the benchmark we have seen that even though their performance is competitive on MNIST and CIFAR-10, they do not scale to difficult tasks such as ImageNet. However, they could be useful for reducing costs in applications using ASICs, since the forward and feedback weights are independent. Furthermore, the study of these methods remains important because, together with insights from neuroscience, it can help us understand more about how the human brain learns.

Thanks for reading, hope you enjoyed the article and learned something new!

References

[1] Timothy P Lillicrap, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton, Backpropagation and the brain (2020), Nature Reviews Neuroscience.

[2] Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman, Random synaptic feedback weights support error backpropagation for deep learning (2016), Nature Communications.

[3] Qianli Liao, Joel Leibo, and Tomaso Poggio, How important is weight symmetry in backpropagation? (2016), Proceedings of the AAAI Conference on Artificial Intelligence.

[4] Arild Nøkland, Direct feedback alignment provides learning in deep neural networks (2016), Neural Information Processing Systems.

[5] Albert Jiménez Sanfiz and Mohamed Akrout, Benchmarking the Accuracy and Robustness of Feedback Alignment Algorithms (2021), arXiv preprint arXiv:2108.13446.
