We present the implementation and results of our reproduction of LambdaNetworks, a novel machine learning architecture developed at Google Brain by Irwan Bello, using the smaller, lower-dimensional CIFAR-10 dataset.
The authors of this reproducibility project are José Ignacio de Alvear Cárdenas and Wesley A.J.G. de Vries, Aerospace Engineering Master students at Delft University of Technology. The source code of our implementation of LambdaNetworks can be found on GitHub. We also have a poster that summarises this article here. If you are interested in other reproductions by TU Delft students, you can find them at this link.
This article is structured as follows:
- Introduction
- Existing LambdaNetworks paper reviews and reproducibility projects
- Goals of this reproducibility project
- Why lambda layers?
- LambdaNetworks explained
  - 5.1 Content lambda
  - 5.2 Position lambda
  - 5.3 Lambdas applied to queries
  - 5.4 Multi-query lambda layer
- Constraints, dataset and parameters
  - 6.1 Constraints
  - 6.2 Dataset
  - 6.3 Parameters
- Implementation details: ResNet-50
  - 7.1 Data pre-processing
  - 7.2 Model preparation
  - 7.3 Training and testing
- Implementation details: LambdaNetworks
  - 8.1 Single lambda layer implementation
  - 8.2 Lambda layer integration within ResNet-50
- Results
  - 9.1 Accuracy and model complexity
  - 9.2 Training time and throughput
  - 9.3 Learning rate sensitivity analysis
- Conclusion
- Recommendations
- View of the authors on the Lambda Networks paper
1. Introduction
Irwan Bello proposes in "LambdaNetworks: Modeling long-range interactions without attention" [1] a method in which long-range interactions are modelled by layers that transform contexts into linear functions called lambdas, avoiding the use of attention maps. The great advantage of lambda layers is that, according to the original paper by Bello, they require much less compute than self-attention mechanisms. This is fantastic because it not only delivers results faster but also saves money and leaves a smaller carbon footprint! However, Bello still uses 32 TPUv3s and the 200 GB ImageNet classification dataset. Therefore, we started this reproducibility project wondering: could lambda layers be scaled down to mainstream computers while keeping their attractive properties?
In 2021 the world not only had to deal with the COVID-19 pandemic but was also struck by chip shortages, caused by increased demand for consumer electronics for working from home, factory shutdowns in China and rising crypto-currency prices. This drove supply to record lows and prices to record highs, resulting in a situation whereby researchers, academics and students (who are all usually on a budget) can no longer quickly build a cluster out of COTS (commercial off-the-shelf) GPUs and instead have to make do with older, scarcer and less efficient hardware.
No official code had been released at the time of starting the project in March 2021. Therefore, in order to answer the aforementioned question, it was up to us to reproduce the paper by Bello as accurately as possible while scaling it down such that it can be run on an average consumer computer. The produced code is made publicly available.
2. Existing LambdaNetworks paper reviews and reproducibility projects
The paper by Bello was published on February 2, 2021, and had been cited 7 times at the time of writing this article. This means that the article is brand new and has therefore not yet been combed through by the academic community.
However, some researchers and members of the machine learning community have already read the article and provided a review of the paper.
- Yannic Kilcher published "LambdaNetworks: Modeling long-range interactions without Attention (Paper Explained)" on YouTube 4 months prior to the publication of the article by Bello. Kilcher goes over a preliminary version of the paper, explains it to viewers and provides recommendations to the author.
- Carlos Ledezma published a video along the same lines as Kilcher's, but takes more time to extensively clarify the distinction between the structure of an attention layer and that of a lambda layer.
- Myeongjun Kim has not only reproduced the lambda layer code, but has also applied it to different ResNet versions and different datasets, and performed an ablation study. The code from the aforementioned data scientists was not used for generating our code. However, we will briefly compare our code to theirs in order to see why they made specific choices for their implementations and to justify ours.
- Phil Wang has published unofficial PyTorch code for the lambda layer.
Luckily, Bello has indicated that he will soon publish the code corresponding to the LambdaNetworks paper. This will most likely enhance everyone's understanding of LambdaNetworks.
3. Goals of this reproducibility project
Originally, the scientific goal of this project was to reproduce two particular results from Table 3 of Bello's paper, which can be seen below: the Conv (He et al., 2016) and Lambda layer entries. This goal was set in order to find out whether we would be able to reproduce these results without prior code implementations.
However, the 1-to-1 reproduction of these results is not possible for the average student and researcher due to the large dataset employed by Bello in his research, namely ImageNet, and the compute resources he exploited to obtain those results in a reasonable time, namely 8–128 TPUv3; resources beyond what we, as students, have at our disposal. Therefore, the scope was shifted slightly. No longer was it the goal to exactly reproduce the aforementioned results without published code, but rather to find out if these results are attainable with a much smaller dataset as well.
The personal goal of this reproducibility project is twofold. Firstly, attention seems to be state of the art at the moment, with a lot of ongoing research. However, attention has some shortcomings that lambda layers aim to fix while, at the same time, slightly increasing the accuracy. Therefore, reproducing the lambda layers can contribute to the early adoption of this potentially superior algorithm by the community. Additionally, its implementation on a smaller, lower dimensional dataset tests the robustness of the algorithm, as well as its potential for deployment on resource-constrained devices (TinyML).
The second reason is the enhancement and broadening of the Deep Learning knowledge and skills of the authors of this reproducibility project, which is part of the Deep Learning course at Delft University of Technology taught by Dr. Jan van Gemert. In order to get a feel for Deep Learning, it is not only important to read from the great number of online resources, but also to get hands-on experience with some groundbreaking papers. After having read the well-received paper "Attention Is All You Need" [2], we were excited by the whole pool of papers that could follow up. From the abstract, lambda layers promised to be a great advancement over Transformers, targeting multiple weaknesses and leading to greater performance and lower computational load. Eager to learn more about attention and thrilled by the potential of lambda layers, this reproducibility project was an obvious choice for us.
4. Why lambda layers?
Lambda layers are closely related to self-attention mechanisms, as both model long-range interactions. However, self-attention mechanisms have a big drawback: they require attention maps to model the relative importance of layer activations, and these maps demand additional compute and are hungry for RAM (Random Access Memory). This makes them less suitable for machine vision applications, which rely heavily on images (grids of pixels), because modelling long-range interactions between all of these pixels quickly exhausts compute and RAM. Solving this problem is therefore key to decreasing the training and inference times of any attention-based vision task.
As Bello says it himself in his article [1]: "We propose lambda layers which model long-range interactions between a query and a structured set of context elements at a reduced memory cost. Lambda layers transform each available context into a linear function, termed a lambda, which is then directly applied to the corresponding query".
Linear attention mechanisms [3] have been proposed as a solution to the high memory usage. However, these methods do not capture positional information between query and context elements (e.g. where pixels are in an image). Lambda layers, in contrast, have a low memory footprint and do capture position information. The latter even results in increased performance: on the ImageNet dataset, lambda layers outperform convolutions, linear attention and local relative self-attention.
5. LambdaNetworks explained
The main advantage of LambdaNetworks over Transformers is that content- and position-based interactions between all components of the context, which can be chosen to be global or local, are computed without producing expensive and memory-intensive attention maps. To achieve this, a lambda layer features three main steps:
- Computation of the content lambda, which encapsulates the content of the context.
- Computation of the position lambda which encapsulates the relative position information between the query and the elements of the context.
- Application of the content and position lambdas to the queries for the computation of the output.
The next figure provides a great overview of the complete process. In the following subsections, we will explain each of the steps in detail, constantly referring back to the presented lambda layer computational graph.
5.1 Content lambda
For the computation of the content lambda, the global context is used. In the case of a single square image, the global context consists of all the pixels. Therefore, if the image shape is d×n×n, where n is the number of pixels along the width and height of the image and d the number of channels (3 in the case of a colour image), then the context is of shape |n|×d, where |n| = n². The computational graph can be misleading regarding the dimensions of the context, since it shows a shape of |m|×d. In contrast with the content lambda, the position lambdas can use local contexts, and the author chose to express the dimensions as required for the computation of the latter. In reality, the content lambda is computed first with the global context and then the position lambdas are computed with local contexts. Unfortunately, this sequential computation of the lambdas and the different matrix dimensions are not clearly reflected in the computational graph.
From the global context, the values and the keys are computed as:
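In the paper's notation (our transcription), with context C ∈ ℝ^(|m|×d), here the global context so that |m| = |n|, and learnt projection matrices W_K and W_V, these are plain linear projections:

K = C W_K ∈ ℝ^(|m|×|k|),   V = C W_V ∈ ℝ^(|m|×|v|)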
Then the keys are normalised with softmax along the |m| dimension, whereas the values are batch normalised. Finally, the content lambda is obtained from the matrix multiplication of the normalised values and keys.
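Written out, with the softmax taken along the context dimension |m| (our transcription of the paper's formula):

K̄ = softmax(K, axis = |m|),   λᶜ = K̄ᵀ V ∈ ℝ^(|k|×|v|)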
The benefit of this approach is that the content lambda encodes how to transform the queries based solely on the context content, independently of each query. As a result, the content lambda is shared among all queries/pixels of the image.
5.2 Position lambda
For the computation of the position lambda, the user can choose between local contexts of size |m|×d or the global context, obtained by making |m| equal to |n|. For the purpose of this reproducibility project, the global context is used for the computation of the position lambda due to the low spatial resolution of the inputs fed to the lambda layers, namely n ∈ {8, 4, 2, 1} pixels per side. This small input dimension is caused by the dataset used, namely CIFAR-10, whose images are much smaller than those of ImageNet. With this reduced dataset, the potential acceleration gained from computing the position lambdas from smaller contexts is insignificant compared to the extra computations required to extract the local context from the input.
As can be observed next, the position lambdas are the result of the product between the value matrix and the position embeddings. The latter are |n| learnt |m|×|k| matrices that encapsulate the positional relation between each of the |n| queries and the elements of the context. As a result, the embeddings form an |n|×|m|×|k| block that generates |n| position lambdas.
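Written out for a single query position n, with Eₙ ∈ ℝ^(|m|×|k|) its learnt position embedding (our transcription):

λᵖₙ = Eₙᵀ V ∈ ℝ^(|k|×|v|)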
Thanks to the separation of the content and the positional information, it is possible to share the position embeddings among lambda layers that receive the same input size, leading to a lower memory footprint. The position lambdas encode how the queries need to be transformed solely based on their relative position with respect to each of the components of its context.
5.3 Lambdas applied to queries
Once the content and position lambdas have been computed, it is possible to compute the final lambda matrices that will be applied to the queries by summing both components.
Furthermore, the queries are computed in a similar way as the keys and values, namely by applying the linear projection matrix to the inputs.
Once the queries and their respective lambdas are computed, the lambdas transform the queries to the output by a simple matrix multiplication between the computed lambda and the query generated from every pixel; each row in the Q matrix.
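Summarised for a single query position n, with xₙ the corresponding input element and W_Q the learnt query projection (our transcription):

λₙ = λᶜ + λᵖₙ,   qₙ = W_Q xₙ ∈ ℝ^|k|,   yₙ = λₙᵀ qₙ ∈ ℝ^|v|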
5.4 Multi-query lambda layer
In the LambdaNetworks paper [1] it was observed that reducing the value dimension |v| greatly reduces the computational cost, as well as the space and time complexities. Therefore, the author decouples these complexities from the output dimension d, so that |v| can be chosen smaller than d.
For that purpose, he proposes using |h| queries for each (pixel) input to which the same lambda is applied. Then the output for a single (pixel) input is the result of the concatenation of each of the h outputs:
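In symbols, with qₙ¹, …, qₙ^|h| the |h| queries of position n (our transcription):

yₙ = concat(λₙᵀ qₙ¹, …, λₙᵀ qₙ^|h|) ∈ ℝᵈ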
Consequently, |v| is now equal to d/|h|, which reduces the complexity by a factor of |h|. The author calls this reduction in the dimensionality of the values a multi-query lambda layer. It is important to note that there is a trade-off between the size of the lambdas, which shrinks to |k|×d/|h|, and the size of the queries, which grows to |h|⋅|k| per query position.
6. Constraints, dataset and parameters
In this section, we will explain how constraints in compute lead to the usage of a different dataset and how and why some parameters were adjusted from the original paper by Bello. We would however like to stress, that most of the parameters are kept exactly the same as the ones used by Bello for a fair comparison.
6.1 Constraints
There are some computational constraints in comparison to Bello's setup. Bello, as a Google researcher, was able to use 32 TPUv3 units in his research project with the ImageNet classification dataset. Unfortunately, we do not have this compute at our disposal. TU Delft was kind enough to provide us with 50 Euros of Google Cloud credit, but that would most likely not have been enough to train the architecture on the roughly 200 GB ImageNet dataset even a few times. On top of ImageNet, Bello also uses 130 million filtered and balanced JFT images with pseudo-labels generated by an EfficientNet-L2 model with 88.4% accuracy. Therefore, a different dataset was chosen, namely CIFAR-10.
The models were trained and tested on Google Colab and on a high-end laptop with a mediocre graphics card from 2020. The local laptop served as a sanity check providing comparable results, since its hardware is fixed, whereas Colab assigns different resources to the client depending on their availability at the time of running.
The primary environment was Google Colab. It is widely available for free and was therefore deemed the most suitable candidate for this project. In this environment, resources are not guaranteed, as they are shared among users of the platform, so it is hard to say whether the obtained results will be consistent in terms of running time: not only the amount of RAM but also the type of available GPU can change between the Nvidia K80, T4, P4 and P100. The platform determines on a case-by-case basis how much RAM will most likely be used and allocates more to heavy users, up to 12 GB. The maximum allowed session time is 12 hours. For the purpose of this project, it was key that the allocated resources do not vary within a session, which makes results obtained in the same session reliable for comparison.
The secondary training environment consisted of a laptop with an Intel i7-10750H at stock speed with adequate cooling, an Nvidia Quadro T1000 Max-Q with 4 GB of GDDR5 memory (stock speed) and 16 GB of DDR4 RAM in a single-channel configuration at a stock speed of 2400 MHz. Because this machine runs fully independently with fixed hardware, its results are reproducible.
6.2 Dataset
It was decided to go for a smaller dataset, namely CIFAR-10. It is a dataset of 60,000 32×32 colour images in 10 classes, with 6,000 images per class, split into a 5:1 train-test ratio (50,000 training and 10,000 test images). An example of what these images look like can be found below.
6.3 Parameters
In order to find out how the author of the paper obtained his results, it is important to look at the initial conditions and hyperparameters that were used. Below, we compile a list of the most important parameters with their corresponding definitions and values. They are split up into three tables:
The first table contains the parameters that can readily be changed by the user. They all can be found in _user_input.py_ within our code. These parameters include among others batch size, input size and initial learning rate.
The second table contains other parameters used within the implementation of the ResNet-50 and the lambda layer. They are defined in _lambda_layer.py, resnet.py and resnet_lambda.py_.
The third table contains more information about the implementation of the algorithms. For more information, we would like to refer the reader to the documentation in the code.
7. Implementation details: ResNet-50
For the baseline implementation, we used the one provided by the Pytorch team. Next, we explain the data preprocessing steps, the model preparation and briefly discuss the training and testing steps.
7.1 Data pre-processing
Before the data can be fed to the algorithms, it is pre-processed: the data is downloaded, augmented and normalised. For that purpose, the original images are padded on all sides with four black pixels and then randomly cropped back to their original size. Additionally, images are randomly flipped horizontally with 50% probability and normalised with means [0.4914, 0.4822, 0.4465] and standard deviations [0.2023, 0.1994, 0.2010], values which correspond to the statistics of the CIFAR-10 data along each of the 3 colour channels. The transformation for the training dataset can be observed in the next code snippet.
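A minimal sketch of what this transform looks like with torchvision (the variable name transform_train is ours):

```python
import torchvision.transforms as transforms

# Pad every side with 4 black pixels, crop back to 32x32, flip half of the
# images horizontally and normalise with the CIFAR-10 channel statistics.
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
```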
After that, the training and test data are fed to iterators (DataLoader) and a sample of training images is plotted for visual verification.
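For completeness, a sketch of how such an iterator can be built (the batch size of 128 is the one listed in the parameters; the dataset root and worker count are our own choices):

```python
import torchvision
from torch.utils.data import DataLoader

trainset = torchvision.datasets.CIFAR10(root="./data", train=True,
                                        download=True, transform=transform_train)
trainloader = DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
```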
7.2 Model preparation
Before executing the training and testing with the pre-processed data, the model needs to be created. For that purpose, the loss function and the optimizer need to be defined, as well as the learning rate scheduler.
For the definition of the loss function, or criterion, the author of the LambdaNetworks paper mentions that label smoothing with a smoothing value of 0.1 was used. However, this is not available by default in the chosen ResNet-50 architecture implementation. As a result, the label smoothing presented by Christian Szegedy in the paper "Rethinking the Inception Architecture for Computer Vision" [5] had to be implemented. For that purpose, we adjusted the label smoothing implementation proposed by Suvojit Manna, which leads to the compact form shown in the code snippet below. As can be seen, the underlying loss is cross-entropy computed on log-softmax activations.
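A compact sketch of this kind of label-smoothed cross-entropy (the class name is ours and the exact formulation in our repository may differ slightly):

```python
import torch.nn as nn
import torch.nn.functional as F

class LabelSmoothingCrossEntropy(nn.Module):
    """Cross-entropy with label smoothing (Szegedy et al., 2016) [5]."""
    def __init__(self, smoothing: float = 0.1):
        super().__init__()
        self.smoothing = smoothing

    def forward(self, logits, target):
        log_probs = F.log_softmax(logits, dim=-1)
        # Negative log-likelihood of the true class ...
        nll = -log_probs.gather(dim=-1, index=target.unsqueeze(1)).squeeze(1)
        # ... blended with a uniform distribution over all classes.
        uniform = -log_probs.mean(dim=-1)
        return ((1.0 - self.smoothing) * nll + self.smoothing * uniform).mean()
```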
Furthermore, the Adam optimizer was used with the hyperparameters (weight decay, initial learning rate, etc.) presented in the Parameters Section. The learning rate scheduler is based on a combination of the linear and cosine learning rate schedulers.
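A sketch of this set-up, assuming a 5-epoch linear warm-up to 5 × initial_lr followed by cosine decay over the remaining epochs, and with model being the network defined above (function and variable names are ours; the weight decay from the parameter tables is omitted here):

```python
import math
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

initial_lr, warmup_epochs, total_epochs = 0.0005, 5, 90

def lr_factor(epoch):
    if epoch < warmup_epochs:
        # Linear warm-up: 1x, 2x, ..., 5x the initial learning rate.
        return float(epoch + 1)
    # Cosine decay from the warm-up peak towards zero.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return warmup_epochs * 0.5 * (1.0 + math.cos(math.pi * progress))

optimizer = optim.Adam(model.parameters(), lr=initial_lr)
scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)  # scheduler.step() once per epoch
```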
In the model preparation, it is further checked whether the model can be run on the host GPU with CUDA and, if the user specifies it, the model parameters are loaded from the user-specified checkpoint of a previous run. As a design choice, the model automatically stores checkpoints every 5 epochs.
7.3 Training and testing
Once the model has been defined and the data pre-processed, training and testing were executed in the same loop for 90 epochs on an Nvidia Quadro T1000 Max-Q 4 GB GDDR5, with the test set evaluated after every training epoch. During this process, two measures were implemented in order to guarantee a smooth analysis and comparison of the results. First, the train and test accuracy, loss and computation time, as well as the learning rate, are stored for every epoch in log text files for later analysis. Additionally, the accuracy and loss information is stored such that it can be plotted in Tensorboard.
8. Implementation details: LambdaNetworks
8.1 Single lambda layer implementation
Even though some code snippets are provided in the paper, they assume that the keys, values and queries have already been computed, no layer parameter initialisation is provided and it is not clear how the position embeddings are defined. For the implementation of the lambda layer and its smooth later integration within the ResNet-50 architecture, a class was created with 3 main methods:
init: initialisation method that receives as input the input size (|n|), the context size (|m|), the value size (|v|), the output size (d), the number of heads (h) and the position embeddings (E). The position embeddings are not created within the class such that they can be easily shared among multiple layers. Outside of the class, the embeddings are created and fed as input to different lambda layers. The instantiation of the position embeddings and their initialisation outside of the lambda layer is defined as follows:
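A sketch of this instantiation, using an 8×8 feature map and a query depth of 16 as example values (variable names are ours, and the standard-normal initialisation shown here is a placeholder):

```python
import torch
import torch.nn as nn

n = m = 8 * 8   # global context: |m| = |n|
k = 16          # query/key depth

# One learnable (|m| x |k|) embedding per query position, i.e. an (n, m, k)
# block that can be shared by every lambda layer operating on this input size.
E = nn.Parameter(torch.randn(n, m, k))
```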
For the computation of the keys, values and queries, single linear transformations are applied without biases. In the case of the queries, the output of the linear transformation would be of size |n|×(kh), as discussed in the Multi-query section. These transformations can be observed in the following code:
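Roughly (attribute names are ours; d is the channel dimension of the layer input and h the number of heads):

```python
# Bias-free linear projections; the query projection produces k*h channels so
# that every head receives its own k-dimensional query (see Section 5.4).
self.to_q = nn.Linear(d, k * h, bias=False)
self.to_k = nn.Linear(d, k, bias=False)
self.to_v = nn.Linear(d, v, bias=False)
```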
Additionally, 1D batch normalisation layers are instantiated for the values and the queries, as well as a softmax function for the keys:
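For instance (again with our attribute names):

```python
# 1D batch normalisation over the value and query channels, and a softmax
# that will later be applied to the keys along the context dimension |m|.
self.bn_v = nn.BatchNorm1d(v)
self.bn_q = nn.BatchNorm1d(k * h)
self.softmax = nn.Softmax(dim=1)
```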
This function ends by calling the "_reset_params_" method, which is explained next.
_reset_params_: initialises the learnt matrices of the lambda layer, namely the key, value and query projection matrices with the same normal distributions mentioned in the paper. The position embeddings are initialised outside of the lambda layer as mentioned before.
forward: function that is run during the forward propagation of the neural network. First, since the lambda layer requires the input image height and width to be compressed into a single dimension of length |n|, the input x is reshaped as follows:
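Something along these lines (variable names are ours):

```python
# x arrives as (batch, d, height, width); flatten the spatial grid into a
# single dimension of length n = height * width, giving (batch, n, d).
b, d, hh, ww = x.shape
x = x.reshape(b, d, hh * ww).permute(0, 2, 1)
```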
Furthermore, since the global context is used for the current reproducibility project, the context is also obtained from resizing the input accordingly:
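In our sketch this step is trivial:

```python
# With a global context, the context is simply the flattened input itself.
context = x  # shape (batch, m, d) with m = n
```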
The next step is to compute the keys, queries and values using the linear transformations and normalisations defined in init:
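Continuing the sketch (note that BatchNorm1d expects the channel dimension second, hence the transposes):

```python
q = self.bn_q(self.to_q(x).transpose(1, 2)).transpose(1, 2)         # (b, n, k*h)
k = self.softmax(self.to_k(context))                                # (b, m, k), softmax over m
v = self.bn_v(self.to_v(context).transpose(1, 2)).transpose(1, 2)   # (b, m, v)
```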
Finally, the lambdas and outputs are computed by using the torch.einsum function and following the equations outlined in the LambdaNetworks explained section. The output is reshaped at the end such that it has the same number of dimensions as the input fed to the lambda layer.
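A sketch of this final step (self.E holds the shared position embeddings, and self.h and self.k the head count and query depth; all attribute names are ours):

```python
# Content lambda, shared by all query positions: (b, k, v).
lam_c = torch.einsum('bmk,bmv->bkv', k, v)
# Position lambdas, one per query position, from the shared embeddings: (b, n, k, v).
lam_p = torch.einsum('nmk,bmv->bnkv', self.E, v)

# Apply both lambdas to the h queries of every position and sum the results.
q = q.reshape(b, hh * ww, self.h, self.k)                 # (b, n, h, k)
out = torch.einsum('bnhk,bkv->bnhv', q, lam_c) \
    + torch.einsum('bnhk,bnkv->bnhv', q, lam_p)           # (b, n, h, v)

# Restore the (batch, d, height, width) layout expected by the ResNet blocks.
out = out.reshape(b, hh * ww, -1).permute(0, 2, 1).reshape(b, -1, hh, ww)
```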
As can be observed, all the computations required to reproduce the lambda layer can be compressed in less than 20 lines of code.
8.2 Lambda layer integration within ResNet-50
When reading the original ResNet paper from Microsoft Research [6], different architectures were proposed whose main difference is the number of layers. As can be observed in the first column, the ResNet architectures contain 5 blocks (conv1, conv2_x, conv3_x, conv4_x and conv5_x), which are referred to as cx stages in the LambdaNetworks paper.
In the LambdaNetworks paper [1] it is stated that the (non-hybrid) LambdaResNets are obtained by replacing the 3×3 convolutions in the bottleneck blocks of the ResNet architectures, namely in conv2_x, conv3_x, conv4_x and conv5_x, by lambda layers. Therefore, the main change to the original ResNet-50 architecture can be found in the definition of the bottleneck layers, as well as in the initialisation of the ResNet, where the position embeddings are created.
The following two code snippets show part of the original ResNet-50 bottleneck layer initialisation and its corresponding lines of code in the ResNet-50 with integrated lambda layers. It clearly shows how the 3×3 CNN layers are exchanged for the lambda layers while the rest of the code remains intact.
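Schematically, the change looks roughly like this (LambdaLayer is the class sketched in Section 8.1; feat_size, shared_E and heads are placeholders for the feature-map size, the shared embeddings and the head count):

```python
# Original torchvision bottleneck (abridged): a (possibly strided) 3x3 convolution.
self.conv2 = conv3x3(width, width, stride)
self.bn2 = norm_layer(width)

# Lambda variant (sketch): the 3x3 convolution becomes a lambda layer, which
# preserves the spatial size, so striding is handled separately (see below).
self.conv2 = LambdaLayer(n=feat_size * feat_size, m=feat_size * feat_size,
                         d=width, v=width // heads, h=heads, E=shared_E)
self.bn2 = norm_layer(width)
if stride > 1:
    self.pool = nn.AvgPool2d(kernel_size=3, stride=stride, padding=1)
```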
Furthermore, as can be seen from the lambda layer architecture, the output of the lambda layer has the same dimensions as its input. However, in the original ResNet-50 architecture, some of the 3×3 CNN layers decreased the size of the input image by using a stride higher than 1. In order to establish a fair comparison between the baseline and the lambda layer implementation, the images also need to be downsized in the lambda layer implementation at those stages of the ResNet-50 architecture where the original 3×3 CNN layers used a stride higher than 1. For that purpose, at those stages of the network, an average 2D pooling layer is introduced with a kernel size of 3×3, (1, 1) padding and the corresponding stride. The benefit of a pooling layer, when compared to a 1×1 convolution, is that no additional parameters need to be learnt by the network.
Whenever the image is downsampled, new position embeddings are generated that will be shared with the next lambda layers until the image is newly downsampled.
Finally, the next figure shows the architecture of a layer within a bottleneck block of the ResNet-50 without and with lambda layers. In the scenario presented, the stride used by the layer is 2. In the original ResNet-50, there already existed a downsampling block made of 1×1 convolutions that was applied to the residual connection and that increased the dimensions of the bottleneck layer input to match its output. Now, the aforementioned average pooling downsampling block is added at the end of the bottleneck layer. The implementation of this downsampling leads to a reduction in required GPU RAM from 5 GB to 2 GB.
9. Results
The main claims of the original paper on lambda layers are their superior performance and higher computational efficiency when compared to their convolutional and attention counterparts. Therefore, the following two sections compare the accuracy of the original ResNet-50 to its modified version with lambda layers, as well as their training times and throughput, respectively. Then, the third section provides a brief sensitivity analysis performed on the initial learning rate in order to tune this hyperparameter to the new architecture-dataset combination.
All the results were obtained with the parameters and values previously defined, except that the initial learning rate used was 0.0005, as it was found to be the best choice.
9.1 Accuracy and model complexity
The following two plots present the accuracy and the loss on the training and test data splits for the ResNet-50 and its lambda layer modified version as a function of the number of epochs. They show the top-1 accuracy, which is the percentage of data points for which the top class (the class with the highest probability after softmax) equals the corresponding target. As can be observed, the training and testing accuracy of the ResNet-50 is higher than that of the version with lambda layers; on the test data set, it is 3.2% higher. As a result, when trained on a lower dimensional dataset such as CIFAR-10, lambda layers do not outperform their convolutional counterparts; however, they still reach competitive results.
On the original ImageNet dataset, Bello reports an accuracy of the convolutional baseline of 76.9%, whereas the lambda layers version acquires an accuracy of 78.4%. When compared to the results obtained in the CIFAR-10 dataset, the relation between both architectures has been reversed. Besides that, it has been observed that the accuracy of both architectures increases on CIFAR-10. This observation alludes to the lower difficulty of classifying an image among 10 classes instead of 1000.
Finally, Bello reported that the baseline and the lambda layer models have 25.6M and 15M trainable parameters, respectively. In our case, they have 23.5M and 12.8M respectively. Since both models miss approximately the same number of parameters, namely 2M, we hypothesize that the missing parameters are from the borrowed ResNet-50 implementation and not from our lambda layer implementation.
9.2 Training time and throughput
The accuracy and loss plots also show the training time per model. For ResNet-50 this was 1 hour 17 minutes 29 seconds, an average of 51.6 seconds per epoch. For ResNet-50 with lambda layers, the total run time was 1 hour 3 minutes 6 seconds, or 42.1 seconds per epoch. Therefore, it may seem that the implementation of the lambda layers pays off: their training and testing time is approximately 18.5% shorter, at the cost of a mere 3.2% drop in accuracy. However, one does not only have access to the final result obtained at epoch 90, but also to intermediate results. The top-1 test accuracy that the ResNet-50 with lambda layers reaches at epoch 90 is already reached around epoch 41 by the baseline ResNet-50. Epoch 41 for the baseline translates to about 35 minutes and 20 seconds, meaning the baseline reaches that accuracy almost twice as fast as its equivalent with lambda layers.
When comparing the throughput results of different architectures, it is necessary that they are run within the same platform. Given that Bello ran his algorithms on multiple TPUv3s, it is not possible to compare the throughput results of this reproducibility project with those of Bello. However, the throughput between the baseline and the lambda layer version on the CIFAR-10 could be compared. In the case of the baseline, a training epoch takes, on average, 50.92 s. Given that there are 50,000 training samples, the throughput of the baseline is 981.93 ex/s. In the case of the lambda layers, a training epoch takes approximately 38.96 s. As a result, the lambda layer model has a throughput of 1283.26 ex/s. From these values, it can be seen that the lambda layer has a higher throughput than the convolutional counterpart, namely 31% higher.
9.3 Learning rate sensitivity analysis
Finally, in order to determine a good learning rate for the architecture-dataset combinations proposed in this reproducibility project, the initial learning rate (_initial_lr_) was varied while the learning rate scheduler was kept constant. As a result, the learning rate of the first 5 epochs is [initial_lr, 2 ⋅ initial_lr, 3 ⋅ initial_lr, 4 ⋅ initial_lr, 5 ⋅ initial_lr] and increases linearly, whereas the learning rates of the later epochs follow a cosine scheduler and decrease.
Top-1 test accuracies were obtained for the baseline and lambda models for the following initial learning rates: 0.01, 0.005, 0.001, 0.0005 and 0.0001. The highest learning rate (0.01) was obtained using the formula proposed by Bello, namely _initial_lr_ = 0.1 ⋅ B/(256 ⋅ 5), with B our batch size of 128 samples. The remaining learning rates were obtained by successively reducing this pre-computed maximum. The maximum accuracy is observed at 0.0005 for both the baseline and the lambda layer variant. The results can be seen in the next figure.
10. Conclusion
LambdaNetworks promise superior performance to convolutional and attention alternatives, as well as a lower memory footprint and a substantial speed-up during training and inference. The greater accuracy is mainly attributed to the combination of extracted position- and content-based interactions, whereas the lower memory usage and higher speed are achieved by bypassing the memory-intensive attention maps, sharing context information among the elements of the batch and sharing the position embeddings among lambda layers. In the original paper, Bello validates all these claims by training on ImageNet a ResNet-50 with lambda layers in place of its standard 3×3 convolutions. This choice of dataset hinders the reproducibility of the paper, since the author uses 32 TPUv3 units for training, a resource far beyond the reach of most students and researchers. Therefore, the work presented here assessed the accuracy and speed of the lambda layers by training on a lower dimensional dataset, namely CIFAR-10. For that purpose, we have not only summarised the original paper on LambdaNetworks and presented our implementation, but we have also discussed those aspects that we considered ambiguous for understanding or reproducing it.
From the reproducibility project, we can highlight 4 main conclusions:
- When trained on a lower dimensional dataset as CIFAR-10, lambda layers do not outperform the convolutional counterparts; however, they still reach competitive results.
- On the ImageNet dataset, Bello reports a baseline accuracy of 76.9% and a lambda layer accuracy of 78.4%. The accuracy of both architectures increases on CIFAR-10. This observation alluded to the lower difficulty of classifying an image among 10 classes versus 1000.
- The lambda layer has a higher throughput than the convolutional counterpart, namely 31% higher. This results in lower training times.
- The best initial learning rate is 0.0005 for both architectures.
The results from this work show that LambdaNetworks can be applied to smaller lower dimensional datasets with performances comparable to those presented in the original paper and within an affordable computational time budget, namely a few hours and not days.
The next step in this work would be the implementation of the position lambda with local contexts, in which |m| ≠ |n|. Previous reproducibility attempts have used 3D convolutions for this purpose. However, the dimensions of the resulting relative position embeddings are then |m| × |k| instead of |n| × |m| × |k|. As a result, they do not compute a relative position embedding for every pixel, but a single one for the complete input.
Finally, we hope that this work will contribute to the discussion and potential adoption of this novel architecture by the research community and that our implementation will accelerate its integration by the industry.
11. Recommendations
In order to reach a more sound conclusion on the time it takes to obtain the train and test accuracy results, these runs would have to be performed multiple times and post-processed by taking their mean or median. The importance of multiple runs increases when performing them in (free) Google Colab, since the compute availability can vary over time and one cannot properly determine which hardware configuration was used at each point in time. The results within a session, however, are comparable. Still, it would be advisable to run the algorithm on a more stable platform in order to obtain more accurate running time results.
The maximum learning rate that Bello used with ImageNet differs substantially from what we used with CIFAR-10. While he used a specific formula to obtain the maximum learning rate, we tuned the learning rate using a coarse 1D grid. It may therefore not be fair to compare accuracies between his results and ours: the behaviour of his trained ResNet-50 and ResNet-50 with lambda layers may differ significantly from ours because we optimised the learning rate, and there could be more potential in his networks than shown. We recommend further research into finding the optimal learning rate. In this light, it could also prove interesting to see whether tuning or removing the learning rate warm-up, or changing the type of scheduler, has a significant influence on the accuracy.
12. View of the authors on the LambdaNetworks paper
The paper on LambdaNetworks promises great advances in terms of accelerating training and inference, as well as reducing the memory footprint, while maintaining or slightly increasing the performance compared to existing attention-based alternatives. Here we would like to present our view on the paper, with some aspects that made it stand out from the existing literature, as well as some points that could be improved.
The good news first. Even though the paper is very long, namely 31 pages, we were glad that the author was very complete and detailed when discussing the algorithm. Bello did not only include compact code snippets with part of the implementation of the lambda layers, but he also included all the information required to exactly reproduce the results (the remaining ambiguities have been discussed in this document), even the initialisation used for the learnable parameters. Additionally, he included an ablation study to support his choice of architecture.
We were also positively surprised by Appendix A of the paper. It presents, in a "Frequently Asked Questions" format, theoretical and practical questions that readers could ask themselves. Some of the entries of this appendix helped us to better understand the paper and implement it. Therefore, we believe that it is a great addition to the paper, as well as a format that the community could consider adopting in future publications. Upon submission to a conference, there are always questions posed by reviewers that the authors respond to in order to have their submission accepted. In the case that those questions do not lead to a change in the final document, they could be included together with the author responses as a "Q&A" appendix.
Although it is true that the paper contains all the information required for its reproduction, we consider that the author presented too many variations of his model, which makes reading it difficult and sometimes confusing. Therefore, we believe that the paper could have been split into multiple papers. For instance, in multiple parts of the paper, Bello mentions the use of ResNet-RS with and without squeeze-and-excitation, an architecture that was published at the same time as the lambda layers paper. This information, scattered throughout the whole paper, only contributes to the disorientation of the reader, even for a specialist in the field of attention, since it has only recently become available. Also, it is not very clear what the difference is between LambdaResNets and the ResNet architecture with lambda layers. We believe that the latter is a specific case of the former, but this is not clearly specified in the paper.
Besides that, we are surprised by the difficulty and complexity of the LambdaNetworks paper when compared to the "Attention Is All You Need" paper [2]. Even though the architectures and the general concept have many points in common, we found that the LambdaNetworks paper was an order of magnitude more difficult to read and to visualize. Given the huge potential of this architecture, we believe that some lessons could be learnt from the original papers that boosted the field of attention in Machine Learning in 2017.
Additionally, we believe Figure 2 in the original paper can be very misleading, since it does not clearly reflect that the global context is available to the content lambda whereas a local context is used for the position lambda. Also, it does not show how the context is obtained from the layer input. As it is the only graphic displaying the concept, ambiguities in this figure can have a detrimental effect on the acceptance and diffusion of the lambda layer by the research community.
To conclude, even though the paper has multiple ambiguities, we have experienced first-hand the huge potential of this framework. Therefore, we hope that the author will base his future work on the lambda layers and will ease the accessibility to this novel method by providing a Google Colab tutorial and a step-by-step explanation of the concept. With the here presented reproducibility project and open-source implementation, we hope to facilitate access to this novel framework for the research community and industry.
References
[1] I. Bello, LambdaNetworks: Modeling long-range Interactions without Attention (2021), International Conference on Learning Representations
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin, Attention is all you need (2017), Advances in Neural Information Processing Systems
[3] R. Li, J. Su, C. Duan and S. Zheng, Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation (2020), arXiv:2007.14902 [cs.CV]
[4] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015), Proceedings of the 32nd International Conference on International Conference on Machine Learning
[5] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, Rethinking the Inception Architecture for Computer Vision (2016), 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
[6] K. He, X. Zhang, S. Ren and J. Sun, Deep residual learning for image recognition (2016), 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)