Introduction
Gone are the days when deeper networks such as Inception, ResNet, and DenseNet ruled the ML space; they are now being replaced by even larger Transformer models. The industry has always moved in the direction of "bigger is better" or "deeper is better", and it did produce groundbreaking results, but at a cost: massive computation, larger memory requirements, and, most important of all, the carbon footprint of training huge models for days on end. Non-Deep Networks (ParNet) [1] arrive as a welcome counterpoint, with an impressive 80% top-1 accuracy on ImageNet, 96% on CIFAR-10, and 81% on CIFAR-100 with just 12 layers! It is quite a feat, and it pushed me to write this paper review.
Background
There was a time when a few tens of layers were considered deep; networks have since grown roughly 100x, to depths of up to 1,000 layers. Of course, the accuracy and performance come with drawbacks in latency and parallelization. Another aspect I have noticed is that modern models are rarely reproducible because of the cost and scale at which they are built. These architectures have dominated the ImageNet benchmarks for a long time, with the average depth of SOTA models constantly increasing. Things are surely starting to change.
ParNet matches the performance of the famed Vision Transformer, with just 12 layers!
There is also the classic result from 1989 showing that a network with just a single hidden layer of sigmoid activations can approximate a function very well. In that setting, scaling meant increasing the width, which blows up the parameter count. The other option is to increase the depth of the network, which approximates the function better without a huge increase in parameters and has consistently outscored "less deep" networks, even at a similar parameter count. Scaling a neural network conventionally involves increasing its depth, resolution, and width. The authors of ParNet have instead chosen parallel sub-architectures as their approach.
Idea & Architecture
The intuition was to keep the depth constant at 12 layers and rely on parallel streams, to the extent that the authors call the architecture "embarrassingly parallel". If we glance at the architecture, we see several streams (branches) built from blocks that are more or less like those of the VGG model [2]. These blocks are called ParNet blocks, and several (simple) things happen inside them. The VGG style was chosen for one particular ability: structural reparameterization. Multiple parallel convolutional branches (e.g., 3×3 and 1×1) can be merged into a single 3×3 convolution at inference time, which effectively reduces inference latency.
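To make the reparameterization idea concrete, here is a minimal sketch (my own illustration, not the paper's code) of how a parallel 1×1 branch can be folded into a 3×3 convolution after training. In the real RepVGG-style block the batch-norm layers are folded into the convolutions first, which I skip here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two parallel branches with matching channels and stride (illustrative sizes).
conv3 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=True)
conv1 = nn.Conv2d(64, 64, kernel_size=1, bias=True)

# Fuse at inference time: zero-pad the 1x1 kernel to 3x3, then add weights and biases.
fused = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=True)
with torch.no_grad():
    fused.weight.copy_(conv3.weight + F.pad(conv1.weight, [1, 1, 1, 1]))
    fused.bias.copy_(conv3.bias + conv1.bias)

# The single fused conv reproduces the sum of both branches.
x = torch.randn(1, 64, 32, 32)
assert torch.allclose(fused(x), conv3(x) + conv1(x), atol=1e-5)
```

Because convolution is linear, the fused layer gives exactly the same output as the two branches summed, at roughly the cost of one branch.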
Each ParNet block consists of three main components, which are merged/fused before the next block:
- a 1×1 convolution,
- a 3×3 convolution, and
- an SSE (Skip-Squeeze-and-Excitation) layer; together these form what the authors call the RepVGG-SSE block
The Skip-Squeeze-Excitation block is depicted in the rightmost part of the architecture figure in the paper. What it essentially does is increase the receptive field without adding depth, as opposed to the conventional Squeeze-and-Excitation [3] implementation. To induce more non-linearity than ReLU can provide at such a shallow depth, the authors chose the more recent SiLU activation [4].
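Putting those three components together, here is a rough PyTorch sketch of what a ParNet (RepVGG-SSE) block might look like. The class names, layer ordering, and normalization placement are my assumptions from reading the paper, not the official code.

```python
import torch
import torch.nn as nn

class SSE(nn.Module):
    """Skip-Squeeze-Excitation sketch: a globally pooled 1x1-conv gate on the skip path."""
    def __init__(self, channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # squeeze spatial dimensions
            nn.Conv2d(channels, channels, 1),  # per-channel excitation weights
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(self.bn(x))

class ParNetBlock(nn.Module):
    """1x1 conv + 3x3 conv + SSE, fused by summation, followed by SiLU."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels))
        self.conv3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        self.sse = SSE(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.conv1(x) + self.conv3(x) + self.sse(x))

# Quick shape check: the block keeps resolution and channel count unchanged.
print(ParNetBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```

At inference, the 1×1 and 3×3 branches of this block are the ones that get folded into a single 3×3 convolution as shown earlier.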
Next come the down-sampling and fusion blocks, which reduce resolution and combine information from multiple streams, respectively. Down-sampling comes with an increase in width, which facilitates multi-scale processing. The block is quite simple: a Squeeze-and-Excitation (SE) layer and average pooling attached to a 1×1 convolutional branch. The fusion block is no different from the down-sampling block except for an extra concatenation layer.
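A minimal sketch of what such a down-sampling block could look like, under my assumptions about the branch layout (a strided 3×3 branch, an average-pool + 1×1 branch, and a global SE-style gate); the fusion block would additionally concatenate the incoming streams before a similar structure.

```python
import torch
import torch.nn as nn

class DownsamplingBlock(nn.Module):
    """Halves the resolution while increasing width; the exact layout is my assumption."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branch3 = nn.Sequential(                           # strided 3x3 branch
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.branch1 = nn.Sequential(                           # avg-pool + 1x1 branch
            nn.AvgPool2d(2),
            nn.Conv2d(in_ch, out_ch, 1),
            nn.BatchNorm2d(out_ch),
        )
        self.se = nn.Sequential(                                # global SE-style gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1),
            nn.Sigmoid(),
        )
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act((self.branch3(x) + self.branch1(x)) * self.se(x))

# Quick shape check: 64 -> 128 channels, 32x32 -> 16x16 resolution.
print(DownsamplingBlock(64, 128)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 16, 16])
```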
For scaling the network, experiments were conducted on the width, resolution, and number of streams, tuned per dataset (CIFAR-10, CIFAR-100, and ImageNet).
Results & Benchmarks
The model was trained on ImageNet for 120 epochs using the SGD optimizer, with a few learning-rate schedules and decay values, and a batch size of 2048 (pretty big). When that batch size did not fit into memory, it was reduced and the learning rate was scaled down in proportion.
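The linear learning-rate scaling rule implied here can be sketched as follows (the reference learning rate of 0.1 is a hypothetical value of mine, not a number from the paper):

```python
# Scale the learning rate linearly with the batch size actually used.
base_lr, base_batch = 0.1, 2048   # hypothetical reference LR; batch size from the paper
actual_batch = 512                # whatever fits in GPU memory
lr = base_lr * actual_batch / base_batch
print(lr)  # 0.025
```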
The results of ParNet are on par with or better than those of ResNets (although the parameter count is higher for the large and extra-large scaled versions of ParNet).
This shows that, with just 12 layers, the base ParNet can outperform a ResNet-34 retrained with the same procedure and augmentation used for ParNet, in both top-1 and top-5 accuracy. ParNet-XL also outperforms the original bottleneck ResNet-101.
A separate section in the paper is dedicated to parallelism and its advantages; do give that a read for more details. I will leave the ablation studies for you to explore to see how the performance was boosted. Further, there are detailed results for ResNets, ParNets, DenseNets [5], etc. on the CIFAR-10 and CIFAR-100 datasets.
The authors were also curious to test ParNet as a backbone in object-detection networks: would it improve the existing performance? It looks like it does; when ParNet replaced Darknet53 [6] in YOLOv4 [7], both accuracy and latency improved.
My Thoughts
ParNet can be implemented (and its results reproduced) without the need for massive infrastructure, which is a welcome change. It opens avenues for researchers to explore this shallow, parallel architecture space and build robust networks that might suit edge-deployment scenarios without a great hit to performance.
Since it has shown promise in both image classification and object detection, I am very eager to see how it fares when transferred to a sensitive domain like healthcare. Will it be able to maintain a comparable level of performance? Only time will tell.
Conclusion
ParNet showed us that going against the tide is rewarding. With a very shallow depth, it did not break a sweat ascending the leaderboard of the ImageNet classification benchmark [8]. It is the first time a network has performed this well on three prominent datasets, CIFAR-10, CIFAR-100 [9], and ImageNet, with just 12 layers. The performance of ParNet increases with the number of streams, resolution, and width while the depth is kept constant. The authors also observed that the current performance is not saturated and can be scaled further.
References
[1] Non-Deep Networks: https://arxiv.org/pdf/2110.07641.pdf
[2] VGG Network: https://arxiv.org/pdf/1409.1556.pdf
[3] Squeeze-and-Excitation Networks: https://arxiv.org/pdf/1709.01507.pdf
[4] SiLU activation: https://arxiv.org/pdf/1702.03118.pdf
[5] DenseNet: https://arxiv.org/pdf/1608.06993v5.pdf
[6] DarkNet53 in YOLOv3: https://arxiv.org/pdf/1804.02767v1.pdf
[7] YOLOv4: https://arxiv.org/pdf/2004.10934v1.pdf
[8] ImageNet benchmark: https://paperswithcode.com/sota/image-classification-on-imagenet
[9] CIFAR-10 & CIFAR-100 datasets: https://www.cs.toronto.edu/~kriz/cifar.html