EfficientNetV2: Faster, Smaller, and Higher Accuracy than Vision Transformers

A detailed explanation of the EfficientNetV2 models and the development of their architecture and training methods

Arjun Sarkar
Towards Data Science


(Source: Photo by Chris Ried on Unsplash, https://unsplash.com/photos/ieic5Tq8YMk)

EfficientNets are currently among the most powerful convolutional neural network (CNN) models. With the rise of Vision Transformers, which achieved even higher accuracies than EfficientNets, the question arose whether CNNs are now dying. EfficientNetV2 proves this wrong, not just by improving accuracy but also by reducing training time and latency.

In this article, I discuss in detail how these CNNs were developed, how powerful they are, and what this says about the future of CNNs in computer vision.

Contents

  1. Introduction
  2. EfficientNetV2

     2.1 Problems with EfficientNet (Version 1)
     2.2 EfficientNetV2 — changes made to overcome the problems and further improvements
     2.3 Results
  3. Conclusion

1. Introduction

The EfficientNet models are designed using neural architecture search (NAS). Neural architecture search was first proposed in the 2016 paper ‘Neural Architecture Search with Reinforcement Learning’.

The idea is to use a controller (a network such as an RNN) to sample network architectures from a search space with probability ‘p’. Each sampled architecture is evaluated by first training the network and then validating it on a held-out set to get the accuracy ‘R’. The gradient of ‘p’ is computed and scaled by the accuracy ‘R’, and the result (the reward) is fed back to the controller RNN. The controller acts as the agent, the training and evaluation of the sampled network act as the environment, and the accuracy acts as the reward. This is the standard reinforcement learning (RL) loop. The loop runs many times until the controller finds a network architecture that gives a high reward (high validation accuracy). This is shown in Figure 1.

Figure 1. An overview of Neural Architecture Search (Source: Image from Neural architecture search paper)

The controller RNN samples various network architecture parameters — such as the number of filters, filter height, filter width, stride height, and stride width for each layer. These parameters can be different for each layer of the network. Finally, the network with the highest reward is chosen as the final network architecture. This is shown in Figure 2.

Figure 2. All the different parameters that the controller searched for in each layer of the network (Source: Image from Neural architecture search paper)
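To make the search loop concrete, below is a minimal, illustrative sketch (in PyTorch) of a REINFORCE-style controller that samples one filter-size choice per layer. It is not the implementation from the paper; in particular, `train_and_evaluate` is a hypothetical placeholder for training a sampled child network and returning its validation accuracy ‘R’, and the candidate choices, layer count, and hyperparameters are made up for the example.

```python
import torch
import torch.nn as nn

NUM_LAYERS = 4               # number of architecture decisions (one per layer)
FILTER_CHOICES = [3, 5, 7]   # candidate filter sizes per layer (illustrative)

class Controller(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.hidden = hidden
        self.embed = nn.Embedding(len(FILTER_CHOICES) + 1, hidden)  # +1 for a "start" token
        self.cell = nn.LSTMCell(hidden, hidden)
        self.head = nn.Linear(hidden, len(FILTER_CHOICES))

    def sample(self):
        """Sample one architecture; return (choices, summed log-probability)."""
        h = torch.zeros(1, self.hidden)
        c = torch.zeros(1, self.hidden)
        token = torch.tensor([len(FILTER_CHOICES)])   # start token index
        choices, log_probs = [], []
        for _ in range(NUM_LAYERS):
            h, c = self.cell(self.embed(token), (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            action = dist.sample()
            choices.append(FILTER_CHOICES[action.item()])
            log_probs.append(dist.log_prob(action))
            token = action
        return choices, torch.stack(log_probs).sum()

def train_and_evaluate(choices):
    """Hypothetical placeholder: train the sampled child network, return accuracy R."""
    return torch.rand(1).item()

controller = Controller()
optimizer = torch.optim.Adam(controller.parameters(), lr=1e-3)
baseline = 0.0
for step in range(100):
    choices, log_prob = controller.sample()
    reward = train_and_evaluate(choices)        # accuracy R from the "environment"
    baseline = 0.9 * baseline + 0.1 * reward    # moving-average baseline to reduce variance
    loss = -(reward - baseline) * log_prob      # gradient of log p scaled by the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```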

Even though this method worked well, its main drawback was that it required a huge amount of computing power and time.

To overcome this problem, a new method was suggested in the 2017 paper ‘Learning Transferable Architectures for Scalable Image Recognition’.

In this paper, the authors looked at well-known CNN architectures such as VGG and ResNet and observed that these architectures do not use different parameters in every layer. Instead, they define a block of convolutional and pooling layers and repeat that block multiple times throughout the network. The authors used this idea to search for such blocks with the RL controller and simply repeated these blocks N times to create the scalable NASNet architecture.
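Here is a minimal sketch of that repeated-block idea, using a placeholder Conv-BN-ReLU cell in place of the actual searched NASNet cell; the stage count, repeat count, and channel width are arbitrary values chosen just for illustration.

```python
import torch
import torch.nn as nn

def searched_cell(channels):
    """Stand-in for the cell found by the RL controller (not the actual NASNet cell)."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
    )

def build_network(num_stages=3, repeats=4, channels=32):
    layers = [nn.Conv2d(3, channels, 3, stride=2, padding=1)]  # stem
    for _ in range(num_stages):
        layers += [searched_cell(channels) for _ in range(repeats)]  # repeat the searched block N times
        layers.append(nn.MaxPool2d(2))                               # reduction between stages
    return nn.Sequential(*layers)

net = build_network()
print(net(torch.randn(1, 3, 224, 224)).shape)
```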

This was further improved in the ‘MnasNet: Platform-Aware Neural Architecture Search for Mobile’ paper in 2018.

In this network, the authors divided the model into 7 blocks; for each block, a single layer configuration was sampled and then repeated within that block. This is shown in Figure 3.

Figure 3. Parameters sampled in MnasNet architecture. All the contents written in blue are searched using RL (Source: Image from MnasNet paper)

In addition to these parameters, one more very important factor was included in the reward fed to the controller: latency. For MnasNet, the authors therefore considered both accuracy and latency to find the best model architecture, as shown in Figure 4. This kept the architecture small enough to run on mobile or edge devices.

Figure 4. Workflow to find model architecture, considering both accuracy and latency to decide the final reward for the controller (Source: Image from MnasNet paper)
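In the MnasNet paper, this trade-off is expressed as a weighted product of accuracy and latency relative to a target latency T, reward = ACC(m) x (LAT(m)/T)^w, with a small negative exponent w. Below is a tiny sketch of that reward; the target latency and w = -0.07 follow the paper's typical setting, but treat the exact numbers here as illustrative.

```python
def mnasnet_reward(accuracy: float, latency_ms: float,
                   target_ms: float = 75.0, w: float = -0.07) -> float:
    # Weighted product: exceeding the latency target shrinks the reward,
    # staying under it gives a slight bonus (w is negative).
    return accuracy * (latency_ms / target_ms) ** w

# Example: a slightly more accurate but slower model vs. a faster one.
print(mnasnet_reward(0.76, 80.0))   # over the latency target -> small penalty
print(mnasnet_reward(0.75, 70.0))   # under the target -> small bonus
```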

Finally, the EfficientNet architecture was proposed in the 2019 paper ‘EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks’.

The workflow for finding the EfficientNet architecture was very similar to MnasNet, but instead of latency, the reward used FLOPs (the number of floating-point operations). This search gave the authors a base model, which they called EfficientNetB0. Next, they scaled up the base model's depth, width, and image resolution (using a grid search for the scaling coefficients) to create 6 more models, EfficientNetB1 to EfficientNetB7. This scaling is shown in Figure 5.

Figure 5. Scaling the depth, width, and image resolution to create different variations of the EfficientNet model (Source: Image from the EfficientNet paper)
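As a quick illustration of compound scaling, the sketch below scales depth, width, and resolution together with a single coefficient phi, using the base coefficients (alpha = 1.2, beta = 1.1, gamma = 1.15) reported in the EfficientNet paper. The printed resolutions are only indicative; the actual B1 to B7 input sizes were chosen with some rounding and adjustment.

```python
def compound_scaling(phi: float, alpha: float = 1.2, beta: float = 1.1,
                     gamma: float = 1.15):
    depth_mult = alpha ** phi    # more layers per stage
    width_mult = beta ** phi     # more channels per layer
    res_mult = gamma ** phi      # larger input images
    return depth_mult, width_mult, res_mult

# Scaling up from the B0 baseline (224x224 input)
for phi in range(0, 4):
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution {224 * r:.0f}px")
```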

I have written a separate article about Version 1 of EfficientNet that covers this version in detail.

2. EfficientNetV2

Paper- EfficientNetV2: Smaller Models and Faster Training (2021)

EfficientNetV2 goes one step further than EfficientNet to increase training speed and parameter efficiency. The network is generated using a combination of scaling (width, depth, resolution) and neural architecture search, with the main goal of optimizing training speed and parameter efficiency. This time, the search space also includes new convolutional blocks such as Fused-MBConv. In the end, the authors obtained the EfficientNetV2 architecture, which is much faster than previous and newer state-of-the-art models and up to 6.8x smaller. This is shown in Figure 6.

Figure 6(b) clearly shows that EfficientNetV2 has 24 million parameters, while a Vision Transformer (ViT) has 86 million. The V2 version also has nearly half the parameters of the original EfficientNet. While it reduces the parameter count significantly, it maintains similar or higher accuracy than the other models on the ImageNet dataset.

Figure 6. Training and Parameter efficiency of the EfficientNetV2 model compared with other state-of-the-art models (Source: EfficientNetV2 paper)
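For readers who want to try the model rather than re-derive it, the snippet below loads an EfficientNetV2-S and counts its parameters, assuming a recent torchvision (0.13 or later) that ships the EfficientNetV2 variants. The parameter count is computed at runtime rather than hard-coded, and the 384x384 input size is just a common evaluation resolution for this variant.

```python
import torch
import torchvision.models as models

# weights=None gives a randomly initialized model; pass weights="IMAGENET1K_V1"
# to download the pretrained ImageNet weights instead.
model = models.efficientnet_v2_s(weights=None)
num_params = sum(p.numel() for p in model.parameters())
print(f"EfficientNetV2-S parameters: {num_params / 1e6:.1f}M")

x = torch.randn(1, 3, 384, 384)   # a typical evaluation resolution for V2-S
with torch.no_grad():
    logits = model(x)
print(logits.shape)               # torch.Size([1, 1000]) for the default 1000 classes
```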

The authors also use progressive learning, a method that progressively increases the image size along with regularization such as dropout and data augmentation. This method further speeds up training.

2.1 Problems with EfficientNet (Version 1)

The EfficientNet (original version) has the following bottlenecks —

a. EfficientNets are generally faster to train than other large CNN models. But, when large image resolution was used to train the models (B6 or B7 models), the training was slow. This is because larger EfficientNet models require larger image sizes to get optimal results, and when larger images are used, the batch size needs to be lowered to fit these images in the GPU/TPU memory, making the overall process slow.

b. In the early layers of the network architecture, depthwise convolutional layers (MBConv) were slow. Depthwise convolutional layers generally have fewer parameters than regular convolutional layers, but they cannot fully make use of modern accelerators. To overcome this problem, EfficientNetV2 uses a combination of MBConv and Fused-MBConv blocks to make training faster without increasing the number of parameters (discussed later in the article).

c. Equal scaling was applied to the depth, width, and image resolution to create the various EfficientNet models from B0 to B7. Scaling all stages equally is not optimal. For example, if the depth is scaled by 2, every block in the network is scaled up 2 times, making the network very large/deep. It might be more optimal to scale one block two times and another 1.5 times (non-uniform scaling) to reduce the model size while maintaining good accuracy.

2.2 EfficientNetV2 — changes made to overcome the problems and further improvements

a. Adding a combination of MBConv and Fused-MBConv blocks

As mentioned in 2.1(b), MBConv blocks often cannot fully make use of modern accelerators, whereas Fused-MBConv layers can better utilize server/mobile accelerators.

The MBConv layer was first introduced in MobileNetV2. As seen in Figure 7, the difference between the structures of MBConv and Fused-MBConv lies in the first layers of the block. While MBConv uses a 1x1 expansion convolution followed by a 3x3 depthwise convolution, Fused-MBConv replaces/fuses these two layers with a single regular 3x3 convolution.

Fused-MBConv layers can make training faster with only a small increase in the number of parameters, but if too many of these blocks are used, training slows down drastically because many more parameters are added. To overcome this, the authors included both MBConv and Fused-MBConv in the neural architecture search, which automatically finds the best combination of these blocks for performance and training speed.

Figure 7. Structure of MBConv and Fused-MBConv blocks (Source: EfficientNetV2 paper)
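Below is a minimal PyTorch sketch of the two block types, keeping only the layers discussed above; the squeeze-and-excitation module and the residual/skip connection used in the real blocks are omitted for brevity, and the expansion ratio and channel sizes are arbitrary.

```python
import torch
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, kernel, stride=1, groups=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride, padding=kernel // 2,
                  groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(),
    )

class MBConv(nn.Module):
    """1x1 expansion conv -> 3x3 depthwise conv -> 1x1 projection."""
    def __init__(self, in_ch, out_ch, expansion=4, stride=1):
        super().__init__()
        mid = in_ch * expansion
        self.block = nn.Sequential(
            conv_bn_act(in_ch, mid, kernel=1),                          # expand
            conv_bn_act(mid, mid, kernel=3, stride=stride, groups=mid), # depthwise 3x3
            nn.Conv2d(mid, out_ch, 1, bias=False),                      # project
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.block(x)

class FusedMBConv(nn.Module):
    """Expansion and depthwise conv fused into a single regular 3x3 conv."""
    def __init__(self, in_ch, out_ch, expansion=4, stride=1):
        super().__init__()
        mid = in_ch * expansion
        self.block = nn.Sequential(
            conv_bn_act(in_ch, mid, kernel=3, stride=stride),  # fused 3x3 expansion
            nn.Conv2d(mid, out_ch, 1, bias=False),             # project
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.block(x)

x = torch.randn(1, 24, 56, 56)
print(MBConv(24, 24)(x).shape, FusedMBConv(24, 24)(x).shape)
```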

b. NAS search to optimize Accuracy, Parameter Efficiency, and Training Efficiency

The neural architecture search was done to jointly optimize accuracy, parameter efficiency, and training efficiency. The EfficientNet model was used as a backbone, and the search was conducted over varying design choices such as convolutional blocks, number of layers, filter size, expansion ratio, and so on. Nearly 1000 models were sampled, trained for 10 epochs, and their results were compared. The model that best balanced accuracy, training step time, and parameter size was chosen as the base model for EfficientNetV2.

Figure 8. The architecture of EfficientNetV2-S (Source: EfficientNetV2 paper)
Figure 9. The architecture of EfficientNet-B0 (Source: EfficientNet paper)

Figure 8 shows the base model architecture of the EfficientNetV2 model (EfficientNetV2-S). The model contains Fused-MBConv layers in the beginning but later switches to MBConv layers. For comparison, I have also shown the base model architecture for the previous EfficientNet paper in Figure 9. The previous version only has MBConv layers and no Fused-MBConv layers.

EfficientNetV2-S also has a smaller expansion ratio compared to EfficientNet-B0. EfficientNetV2 does not use 5x5 filters, only 3x3 filters.

c. Intelligent Model Scaling

Once the EfficientNetV2-S model was obtained, it was scaled up to obtain the EfficientNetV2-M and EfficientNetV2-L models. A compound scaling method was used, similar to EfficientNet, but with some additional changes to make the models smaller and faster —

i. maximum image size was restricted to 480x480 pixels to reduce GPU/TPU memory usage, hence increasing training speed.

ii. more layers were added to later stages (stages 5 and 6 in Figure 8), to increase network capacity without increasing much runtime overhead.

d. Progressive Learning

Larger image sizes generally tend to give better training results but increase training time. Some papers have previously proposed dynamically changing image size, but it often leads to a loss in training accuracy.

The authors of EfficientNetV2 show that when the image size is changed dynamically during training, the regularization should be changed along with it. Keeping the regularization fixed while changing the image size leads to a loss in accuracy. Furthermore, larger models require more regularization than smaller models.

The authors test their hypothesis using different image sizes and different augmentations. As seen in Figure 10, when the image size is small, weaker augmentations give better results, but when the image size is large, stronger augmentations give better results.

Figure 10. ImageNet top-1 accuracy tested with varying image size and varying augmentation parameters (Source: EfficientNetV2 paper)

Taking this hypothesis into consideration, the authors of EfficientNetV2 used progressive learning with adaptive regularization. The idea is very simple. In the early epochs, the network is trained on small images with weak regularization, which allows it to learn features quickly. The image size is then gradually increased along with the regularization, making the task progressively harder. Overall, this method gives higher accuracy, faster training, and less overfitting.

Figure 11. The algorithm for progressive learning with adaptive regularization (Source: EfficientNetV2 paper)

The initial image size and the regularization parameters are user-defined. Linear interpolation is then used to increase the image size and the regularization over M training stages, as seen in Figure 11. Figure 12 explains this visually: as the number of epochs increases, the image size and the strength of the augmentations are gradually increased. EfficientNetV2 uses three different types of regularization — Dropout, RandAugment, and Mixup.

Figure 12. Progressive learning with adaptive regularization visual explanation (Source: EfficientNetV2 paper)
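The schedule itself is easy to sketch. The snippet below linearly interpolates image size, dropout rate, and RandAugment magnitude over M = 4 stages; the start and end values roughly follow the EfficientNetV2-S settings reported in the paper, and the commented-out `train(...)` call is a hypothetical placeholder for the actual training routine.

```python
def linear_interp(start, end, stage, num_stages):
    # Linearly move from the starting value to the final value across stages.
    t = stage / (num_stages - 1)
    return start + (end - start) * t

NUM_STAGES = 4
EPOCHS_PER_STAGE = 87          # roughly 350 total epochs split into 4 stages
image_size = (128, 300)        # (initial, final) training resolution
dropout = (0.10, 0.30)         # (initial, final) dropout rate
randaug = (5, 15)              # (initial, final) RandAugment magnitude

for stage in range(NUM_STAGES):
    size = int(linear_interp(*image_size, stage, NUM_STAGES))
    drop = linear_interp(*dropout, stage, NUM_STAGES)
    magnitude = linear_interp(*randaug, stage, NUM_STAGES)
    print(f"stage {stage}: image {size}px, dropout {drop:.2f}, randaug {magnitude:.1f}")
    # train(model, image_size=size, dropout=drop, randaug_magnitude=magnitude,
    #       epochs=EPOCHS_PER_STAGE)   # hypothetical training routine
```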

2.3 Results

i. EfficientNetV2-M achieves similar accuracy to EfficientNetB7 (the best previous EfficientNet model), while training nearly 11 times faster.

Figure 13 a. Parameter efficiency comparison with other state-of-the-art models (Source: EfficientNetV2 paper)
Figure 13 b. FLOPs efficiency comparison with other state-of-the-art models (Source: EfficientNetV2 paper)
Figure 13 c. Latency comparison with other state-of-the-art models (Source: EfficientNetV2 paper)

As seen in Figures 13 a, 13 b, and 13 c, the EfficientNetV2 models outperform the other state-of-the-art computer vision models compared, including Vision Transformers, in terms of parameter efficiency, FLOPs, and latency.


Figure 14 shows a detailed comparison against other state-of-the-art CNN and transformer models, with EfficientNetV2 models pretrained on ImageNet21k (13 million images) and some pretrained on ImageNet ILSVRC2012 (1.28 million images). Besides ImageNet, the models were also evaluated on other public datasets such as CIFAR-10, CIFAR-100, Flowers, and Cars, and in each case they showed very high accuracies.

Figure 14. EfficientNetV2 model comparison against other CNNs and transformer models in terms of accuracy, parameters, and FLOPs (Source: EfficientNetV2 paper)

3. Conclusion

EfficientNetV2 models are smaller and faster than most state-of-the-art models. They show that even though Vision Transformers have taken the computer vision world by storm by achieving higher accuracies than earlier CNNs, well-designed CNNs with improved training methods can still be faster and more accurate than transformers, suggesting that CNNs are here to stay.

References —

EfficientNetV2 paper —

Tan, M., & Le, Q. V. (2021). EfficientNetV2: Smaller Models and Faster Training. arXiv preprint arXiv:2104.00298. doi: 10.48550/ARXIV.2104.00298. https://arxiv.org/abs/2104.00298
