With progressive learning, our EfficientNetV2 significantly outperforms previous models on ImageNet and CIFAR/Cars/Flowers datasets. By pretraining on the same ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources. Code will be available at https://github.com/google/automl/efficientnetv2.
Source: arXiv
EfficientNets have been the state of the art for fast, high-quality image classification. Released about two years ago, they became popular for their scaling approach, which made them much faster to train than other networks of comparable accuracy. A few days ago, Google released EfficientNetV2, which is a big improvement over EfficientNet in terms of training speed and a decent improvement in terms of accuracy. In this article, we are going to explore how this new EfficientNet improves on the previous one.
The main foundation of efficient networks such as DenseNets and EfficientNets is achieving strong performance with fewer parameters. Decreasing the number of parameters usually brings benefits such as smaller model sizes, which makes models fit into memory more easily, but it also tends to hurt accuracy. So the main challenge is to decrease the number of parameters without lowering performance.
This challenge now mostly falls under Neural Architecture Search (NAS), a field that is becoming more of a hot topic every day. In the ideal scenario, we would give a problem description to a search algorithm and it would output an optimal network architecture for that problem.
I don’t want to go over EfficientNets in detail in this article. But I do want to recap the core concept of EfficientNets so that we can pinpoint the differences in the architecture that actually result in better performance. EfficientNets use NAS to construct a baseline network (B0), then apply "compound scaling" to increase the capacity of the network without greatly increasing the number of parameters. The most important metrics in this case are FLOPs (the number of floating point operations) and, of course, the number of parameters.
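To make the compound scaling idea concrete, here is a minimal sketch using the depth/width/resolution coefficients reported in the original EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15, chosen so that FLOPs roughly double with each increment of the compound coefficient φ). The function and variable names are my own and are not taken from any EfficientNet codebase.

```python
# Illustrative sketch of EfficientNet-style compound scaling.
# Coefficient values are those reported in the original EfficientNet paper;
# the helper names below are illustrative only.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution coefficients

def compound_scale(phi: int, base_depth: float = 1.0, base_width: float = 1.0, base_res: int = 224):
    """Scale depth, width, and resolution together with a single coefficient phi."""
    depth_mult = ALPHA ** phi   # more layers
    width_mult = BETA ** phi    # more channels
    res_mult = GAMMA ** phi     # larger input images
    # FLOPs grow roughly by (alpha * beta^2 * gamma^2)^phi ~= 2^phi
    return depth_mult * base_depth, width_mult * base_width, round(res_mult * base_res)

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution {r}")
```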
1. Progressive Training
EfficientNetV2 uses the concept of progressive learning: training starts with small image sizes, which are then increased progressively as training goes on. This idea stems from the fact that EfficientNets’ training speed starts to suffer at large image sizes.
Progressive learning is not a new concept, though; it has been used before. The issue is that previous work applied the same regularisation regardless of image size. The authors of EfficientNetV2 argue that this limits the network capacity and hurts performance. That’s why they dynamically increase the regularisation along with the image size to remedy this issue.
If you think about it, it makes a lot of sense. A huge regularisation effect on small images would cause underfitting and a small regularisation effect on large images would cause overfitting.
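As a rough illustration of what adaptive regularisation during progressive learning might look like, here is a minimal Python sketch that linearly interpolates image size, dropout rate, and RandAugment magnitude across a handful of training stages. The number of stages, the ranges, and the helper names are illustrative assumptions, not the exact schedule from the paper.

```python
# Illustrative sketch of progressive learning with adaptive regularisation.
# Stage count, value ranges, and helper names are assumptions for this example.
NUM_STAGES = 4
IMG_SIZE_RANGE = (128, 300)    # small images early, large images late
DROPOUT_RANGE = (0.1, 0.3)     # weaker -> stronger regularisation
RANDAUG_MAG_RANGE = (5, 15)    # RandAugment magnitude

def stage_settings(stage: int):
    """Linearly interpolate image size and regularisation strength for a stage."""
    t = stage / (NUM_STAGES - 1)
    size = int(IMG_SIZE_RANGE[0] + t * (IMG_SIZE_RANGE[1] - IMG_SIZE_RANGE[0]))
    dropout = DROPOUT_RANGE[0] + t * (DROPOUT_RANGE[1] - DROPOUT_RANGE[0])
    randaug = RANDAUG_MAG_RANGE[0] + t * (RANDAUG_MAG_RANGE[1] - RANDAUG_MAG_RANGE[0])
    return size, dropout, randaug

for stage in range(NUM_STAGES):
    size, dropout, randaug = stage_settings(stage)
    print(f"stage {stage}: image size {size}, dropout {dropout:.2f}, RandAugment {randaug:.1f}")
    # train_one_stage(model, image_size=size, dropout=dropout, randaug_magnitude=randaug)
```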
With the improved progressive learning, our EfficientNetV2 achieves strong results on ImageNet, CIFAR-10, CIFAR-100, Cars, and Flowers dataset. On ImageNet, we achieve 85.7% top-1 accuracy while training 3x – 9x faster and being up to 6.8x smaller than previous models.
Source: arXiv
2. Fused-MBConv layers over MBConv layers
EfficientNets rely on a layer called the "depthwise convolution". These layers have fewer parameters and FLOPs, but they cannot fully utilize modern accelerators such as GPUs/TPUs [1]. A recent paper, "MobileDets: Searching for Object Detection Architectures for Mobile Accelerators", addresses this with a new layer called the "Fused-MBConv" layer, which EfficientNetV2 now adopts. However, the authors can’t simply replace all of the old MBConv layers with fused ones, because Fused-MBConv layers have more parameters and FLOPs.
This is why they use training-aware NAS to search for the best combination of fused and regular MBConv layers [1]. The NAS results show that replacing some of the MBConv layers with fused ones in the early stages gives better performance with smaller models. They also show that a smaller expansion ratio for the MBConv layers throughout the network is preferable. Finally, they show that smaller 3x3 kernels work better when paired with more layers to compensate for the reduced receptive field.
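To make the difference concrete, below is a simplified PyTorch sketch of the two block types: MBConv expands with a 1x1 convolution, applies a depthwise 3x3 convolution, and projects back with a 1x1 convolution, while Fused-MBConv replaces the expansion and depthwise pair with a single regular 3x3 convolution. This is only a sketch: the squeeze-and-excitation module, stochastic depth, and other details of the official blocks are omitted, and the class and helper names are mine.

```python
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, kernel, stride=1, groups=1):
    """Convolution + batch norm + SiLU activation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride, kernel // 2, groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(),
    )

class MBConv(nn.Module):
    """Expand 1x1 -> depthwise 3x3 -> project 1x1 (SE block omitted for brevity)."""
    def __init__(self, in_ch, out_ch, expand=4, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            conv_bn_act(in_ch, mid, 1),                    # pointwise expansion
            conv_bn_act(mid, mid, 3, stride, groups=mid),  # depthwise convolution
            nn.Conv2d(mid, out_ch, 1, bias=False),         # linear projection
            nn.BatchNorm2d(out_ch),
        )
        self.use_residual = stride == 1 and in_ch == out_ch

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

class FusedMBConv(nn.Module):
    """A single regular 3x3 conv replaces the expansion + depthwise pair."""
    def __init__(self, in_ch, out_ch, expand=4, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            conv_bn_act(in_ch, mid, 3, stride),            # fused 3x3 expansion conv
            nn.Conv2d(mid, out_ch, 1, bias=False),         # linear projection
            nn.BatchNorm2d(out_ch),
        )
        self.use_residual = stride == 1 and in_ch == out_ch

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```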
3. A more dynamic approach to scaling
I think one of the most valuable lessons here is the approach the authors took to improve the network. The best way to sum it up is: first investigate the issues with EfficientNet, then make some of the fixed rules and heuristics more dynamic so they better fit the goals and objectives. We first saw this in progressive learning, where making the regularisation dynamic to match the image size improved performance.
We now see this approach again in how the network is scaled up. EfficientNet scales up all stages equally using a simple compound scaling rule [1]. The authors of EfficientNetV2 point out that this is unnecessary, since not all stages contribute equally to performance. That is why they use a non-uniform scaling strategy that gradually adds more layers at later stages. They also add a scaling rule that restricts the maximum image size, since EfficientNets tend to scale image sizes up aggressively [1], leading to memory issues.
I think the main gist behind this is that the earlier layers don’t really need much scaling because at that stage the network is only extracting low-level features such as edges and textures. However, as we get into the deeper part of the network and start building high-level, more abstract features, we need more layers and capacity to fully digest those details.
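Below is a toy Python sketch contrasting uniform compound depth scaling with a non-uniform rule that favours later stages and caps the image size. The base stage depths, the bias toward later stages, and the 480-pixel cap used here are illustrative assumptions, not the paper’s actual schedule.

```python
# Toy comparison of uniform vs. non-uniform depth scaling; the base stage
# depths, the bias toward later stages, and the 480px cap are illustrative.
BASE_STAGE_DEPTHS = [2, 4, 4, 6, 9, 15]  # layers per stage in a hypothetical baseline
MAX_IMAGE_SIZE = 480                     # cap on training/inference resolution

def uniform_scale(depths, multiplier):
    """EfficientNet-style: every stage grows by the same factor."""
    return [round(d * multiplier) for d in depths]

def non_uniform_scale(depths, multiplier):
    """EfficientNetV2-style idea: later stages receive proportionally more layers."""
    n = len(depths)
    return [round(d * (1 + (multiplier - 1) * i / (n - 1))) for i, d in enumerate(depths)]

def cap_image_size(size):
    return min(size, MAX_IMAGE_SIZE)

print(uniform_scale(BASE_STAGE_DEPTHS, 2.0))      # all stages doubled
print(non_uniform_scale(BASE_STAGE_DEPTHS, 2.0))  # early stages barely grow
print(cap_image_size(600))                        # -> 480
```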
References:
[1] M. Tan and Q. V. Le, "EfficientNetV2: Smaller Models and Faster Training," arXiv:2104.00298, 2021.