Google’s EfficientDet: An Overview

Using networks to optimise networks

--

If you’re like me, you read Google’s EfficientDet paper and thought “what the hell is going on?”. Don’t fret, I’ve reviewed the paper and will try to explain the model as best I can.

This model is built on previous work (like so many other models), more specifically, a key piece of work known as EfficientNet:

  1. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Before we delve into the details of Google’s new object detection model, let’s do the background work on EfficientNets.

EfficientNet

The authors of this paper focus on how to efficiently scale Convolutional Neural Networks (ConvNets) to improve performance. The typical way of scaling ConvNets for improved performance is to increase one of the following: network depth, width or image resolution. Though it is possible to scale two or more of these together it is currently a tedious process and generally leads to models with sub-optimal accuracy and efficiency. Below is an illustration of the different scaling methods, directly lifted from the paper:

Ways of scaling a neural network model. Taken from Tan & Le, 2019

The authors aim to answer the question:

Is there a principled method to scale up ConvNets that can achieve better accuracy and efficiency?

You may have guessed it, but the answer is yes. They discover it is important to balance all dimensions of network width/depth/resolution. What will surprise you is the simplicity of their scaling method: they simply scale each dimension by a constant ratio. That’s it! They term this method compound scaling. As the images used to train ConvNets get larger, compound scaling starts to make sense, because larger images need deeper networks to increase the receptive field and more channels to capture smaller details in the bigger image.

What was quickly discovered is that the performance improvement from model scaling is heavily dependent on the baseline network. The authors take this one step further and use an algorithm called neural architecture search (NAS) to find an optimum baseline network. The NAS algorithm at a high level uses reinforcement learning to determine optimum structures, given an objective function. Using NAS a new family of models was created called EfficientNets. The performance of these models is compared against ConvNets on the ImageNet dataset and the results are shown in the image below:

Performance of the EfficientNet family compared to other classifiers. Taken from Tan & Le, 2019

What is striking from these results is the margin by which these models outperform other state-of-the-art models, both in terms of accuracy and the number of parameters. The authors also showed these models transfer well to other datasets, becoming the top-scoring models on 5 out of 8 publicly available datasets, all while having up to 21x fewer parameters.

Compound Model Scaling

So how does compound scaling actually work? First, let’s define a ConvNet as:

Tan & Le, 2019 definition of a ConvNet

They essentially describe it as a list of layers composed of operations (e.g. convolution) which are applied to input tensors.
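The definition itself appears only as an image in the original post; reconstructed from the paper, a ConvNet N is written as a composition of stages, where stage i repeats the layer operator Fᵢ a total of Lᵢ times on an input of shape ⟨Hᵢ, Wᵢ, Cᵢ⟩:

```latex
% ConvNet definition, reconstructed from Tan & Le, 2019
\mathcal{N} = \bigodot_{i=1 \dots s} \mathcal{F}_i^{\,L_i}\big(X_{\langle H_i, W_i, C_i \rangle}\big)
```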

Compound scaling attempts to expand the network length (Lᵢ), width (Cᵢ) and/or resolution (Hᵢ, Wᵢ), all without changing the predefined Fᵢ in the baseline network. The rationale for fixing Fᵢ was to reduce the design space, but the design space is still rather large considering that Lᵢ, Cᵢ, Hᵢ and Wᵢ can still be explored for each layer. To reduce the design space further, a restriction is applied in that all layers must be scaled uniformly with a constant ratio. The target of compound scaling is to maximise the accuracy of the model given resource constraints, which is formulated as an optimisation problem:

The target of compound scaling is to optimise accuracy under system resource constraints. Taken from Tan & Le, 2019
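The formulation itself is an image in the original post; reconstructed from the paper, the optimisation problem is:

```latex
% Compound scaling as an optimisation problem, reconstructed from Tan & Le, 2019
\max_{d,\,w,\,r} \; \mathrm{Accuracy}\big(\mathcal{N}(d, w, r)\big)
\text{s.t.}\quad
\mathcal{N}(d, w, r) = \bigodot_{i=1 \dots s}
  \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}
  \big(X_{\langle r \cdot \hat{H}_i,\; r \cdot \hat{W}_i,\; w \cdot \hat{C}_i \rangle}\big)
\mathrm{Memory}(\mathcal{N}) \le \text{target memory},
\qquad
\mathrm{FLOPS}(\mathcal{N}) \le \text{target FLOPS}
```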

w, d and r are coefficients for scaling the network width, depth and resolution. The predefined parameters have a ^ above their notation. An example of a predefined network is shown below:

A pre-defined EfficientNet network. Taken from Tan & Le, 2019

Scaling Dimensions

The difficulty in scaling the depth (d), width (w) and resolution (r) of a ConvNet is that all three depend on each other and can change under different resource constraints. As a result, ConvNets are usually only scaled in a single dimension.

Depth: Scaling the depth is the most common way of scaling a ConvNet. The networks are made deeper with the rationale that deeper ConvNets can extract richer and more complex features. On the downside, deeper ConvNets are more difficult to train due to the vanishing gradients problem. This issue has been alleviated with skip connections and batch normalisation, but there are diminishing returns for very deep networks.

Width: Usually used for smaller models. These networks are typically better at capturing fine-grained features in images and are easier to train.

Resolution: By increasing the input resolution of images, ConvNets can capture more fine-grained patterns. Early ConvNets were trained on 224x224 images, whereas more recent ConvNets are trained on resolutions as high as 480x480. The figure below shows the performance of scaling each of these dimensions individually (scaling performed on the pre-defined network shown above):

Performance improvements by scaling each dimension individually. Left is width, the middle is depth and right is resolution. Taken from Tan & Le, 2019

The key observation from these results is that scaling up the depth, width or resolution of a network improves accuracy, but the accuracy gains diminish for deeper and bigger models.

Compound Scaling

So far it is clear that scaling dimensions are not independent. As an example, the authors state that:

high resolution images should require deeper networks, so that larger receptive fields can capture similar features that include more pixels in bigger images.

As a result of increasing the resolution, the network width should also be increased to capture more fine-grained details. These intuitions suggest that models need to be scaled in all of these dimensions and not in a single dimension. To validate their intuitions the authors scaled the width (w) of a network over different combinations of depth (d) and resolution (r) (image below). The example of scaling in the graph below starts with a baseline network (d=1.0, r=1.0) and has 18 convolutional layers with a resolution of 224x224. The last baseline network (d=2.0, r=1.3) results in 36 convolutional layers and a resolution of 299x299.

Scaling the network width. Each line represents a different network scaled by the coefficients defined in the legend. A point on each line shows a different width. The network is scaled from the pre-defined network shown earlier. Taken from Tan & Le, 2019

This result leads the authors to the observation that, to obtain better model accuracy and efficiency, it is important to balance all the dimensions of the network during scaling. From this observation, they propose a compound scaling method. This method uses a compound scaling coefficient Φ to uniformly scale the network dimensions.
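The scaling rule appears only as an image in the original post; reconstructed from the paper, it is:

```latex
% Compound scaling rule, reconstructed from Tan & Le, 2019
\text{depth: } d = \alpha^{\phi}, \qquad
\text{width: } w = \beta^{\phi}, \qquad
\text{resolution: } r = \gamma^{\phi}
\text{s.t.}\quad \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
\qquad \alpha \ge 1,\; \beta \ge 1,\; \gamma \ge 1
```

The constraint α·β²·γ² ≈ 2 means that for any Φ the total FLOPS grow by roughly 2^Φ, since FLOPS scale linearly with depth but quadratically with width and resolution.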

In the equation, α, β and γ are constants which can be determined by a grid search. The user-defined coefficient (Φ) controls how many resources are available for model scaling, while α, β and γ define how to assign these resources to the network depth, width and resolution respectively.

The scaling method is evaluated by scaling existing ConvNets. To truly take advantage of the new method, a new architecture family was created called EfficientNet. They build the baseline network by using NAS to optimise for both accuracy and FLOPS. Their search produces an efficient network called EfficientNet-B0. The table below illustrates the EfficientNet-B0 network (it’s exactly the same as the pre-defined network shown earlier):

EfficientNet-B0 network. Taken from Tan & Le, 2019

Their next step was to scale up the EfficientNet-B0 by applying compound scaling in two steps:

  1. Fix Φ to 1. Perform a grid search over α, β and γ. They find the best values for EfficientNet-B0 are α = 1.2, β = 1.1 and γ = 1.15.
  2. Fix α, β and γ as constants and scale up the baseline network with different Φ to obtain EfficientNet-B1 to B7 (a rough sketch of this step is shown below).
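As a rough illustration of step 2, the sketch below (not the authors’ code) computes the scaling multipliers for a given Φ using the constants found in step 1; the published B1–B7 models also round channel and layer counts, so treat the output as indicative only:

```python
# Compound scaling multipliers for EfficientNet (illustrative sketch)
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # constants found via grid search at phi = 1

def compound_scale(phi: int):
    """Return the depth, width and resolution multipliers for a given phi."""
    depth_mult = ALPHA ** phi   # multiplies the number of layers
    width_mult = BETA ** phi    # multiplies the number of channels
    res_mult = GAMMA ** phi     # multiplies the input image resolution
    return depth_mult, width_mult, res_mult

for phi in range(1, 8):  # roughly corresponds to B1 ... B7
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```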

The performance of these new networks compared to other state-of-the-art networks is shown in the table below:

Performance of the EfficientNet model family compared to other classifiers. Taken from Tan & Le, 2019

The EfficientNets consistently perform better and reduce both the number of parameters and FLOPS compared to existing ConvNets. The table and figure below further reinforce this:

The speed-up achieved with EfficientNets. Taken from Tan & Le, 2019

The CPU latency of EfficientNet-B1 is 5.7x lower than that of ResNet-152, even though they have comparable accuracy. At the top end of the accuracy scale, the GPipe model has a latency of 19.0s for a single image with 84.3% accuracy on the dataset. The largest EfficientNet model (B7) only has a latency of 3.1s, which is a 6.1x speedup. The figure below plots FLOPS against ImageNet Top-1 accuracy.

FLOPS of the EfficientNet model family compared to other classifiers. Taken from Tan & Le, 2019

The figure clearly shows that the EfficientNet family makes far better use of the available resources, with the models achieving better accuracy alongside a significant reduction in the number of FLOPS.

Single Dimension Scaling

To disentangle the contribution of the compound scaling method from the EfficientNet architecture, the B0 architecture was also scaled in only a single dimension. Those results are shown in the figure below:

Scaling the EfficientNet-B0 network with different methods. Taken from Tan & Le, 2019

It is clear to see that compound scaling is far superior to scaling only a single dimension, which quickly runs into diminishing returns. This highlights the importance of compound scaling. To further understand why compound scaling is so effective, the figure below compares the class activation maps from a selection of scaled B0 models:

Class activation maps for different scaling methods. Shows the compound scaling provides more relevant activations. Taken from Tan & Le, 2019

The compound scaling model tends to focus on regions with more relevant object details. The other models show less focus on these details.

Overall, the authors of this paper showed that balancing network depth, width and resolution is critical for optimising performance given a set of resource constraints. The compound scaling method provides an elegant way of effectively scaling a model in all of these dimensions. This now brings us to the next section of this post: Google’s EfficientDet, which builds on top of the work discussed above.

EfficientDet: Scalable and Efficient Object Detection

Over recent years, progress in object detection has been tremendous, resulting in models which are more accurate but at the cost of increased computation. These large computation costs can deter deployment in many real-world applications, such as self-driving cars, where low-latency predictions are a necessity. With such constraints, model efficiency becomes incredibly important for object detection. Detector architectures such as one-stage and anchor-free detectors are more efficient, but usually at the cost of accuracy. It is therefore natural to ask:

Is it possible to build a scalable detection architecture with both higher accuracy and better efficiency?

Google’s EfficientDet aims to tackle this question. To answer it, we first need to understand the challenges of the current design choices for object detectors:

Challenge 1 - Efficient Multi-Scale Feature Fusion: Feature Pyramid Networks (FPN) are widely used for multi-scale feature fusion. Recent works such as PANet and NAS-FPN allow for cross-scale feature fusion. Previous feature fusion methods simply sum the features together; however, these features are at different resolutions and have been observed to contribute to the output fused feature unequally. To get around this issue, a weighted bi-directional feature pyramid network (BiFPN) is proposed. The BiFPN introduces learnable weights to determine the importance of different input features and applies both top-down and bottom-up multi-scale feature fusion.

Challenge 2 - Model Scaling: Model scaling of object detectors usually sacrifices either accuracy or efficiency. Inspired by the work done by the authors of EfficientNets a compound scaling method for object detectors is proposed. Like EfficientNets this scaling method also scales the depth, width and resolution of the network.

The combination of EfficientNet with the proposed BiFPN and compound scaling resulted in the creation of a new family of detectors known as EfficientDet. This family of models consistently achieves higher accuracy and reduces the number of FLOPS by an order of magnitude compared to previous object detectors. The contribution of EfficientDet to the object detection community can be summarised in three points:

  • Introduction of the BiFPN, a weighted bidirectional feature network for easy and fast multi-scale feature fusion
  • Proposal of a scaling method, which scales the backbone, feature network, box/class network and resolution in a principled way
  • Combining the two points above resulted in EfficientDet, a new family of object detectors. These models have significantly better accuracy and efficiency across a wide spectrum of resource constraints

BiFPN

To understand the contribution of the BiFPN we need to first formulate the problem. Multi-scale feature fusion aims to aggregate features at different resolutions. This can be represented as a list of multi-scale features, where each element represents the feature at level lᵢ. The aim is to find a transformation f that can effectively aggregate these features and output a list of new features.
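Both formulas appear only as images in the original post; reconstructed from the paper, the input feature list and the fused output are:

```latex
% Multi-scale feature fusion, reconstructed from Tan et al, 2019
\vec{P}^{\,in} = \big(P^{in}_{l_1}, P^{in}_{l_2}, \dots\big),
\qquad
\vec{P}^{\,out} = f\big(\vec{P}^{\,in}\big)
```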

To understand why this is important let's take a look at the traditional FPN which integrates features at different scales:

The traditional FPN network. Taken from Tan et al, 2019

It takes level 3–7 input features (P₃–P₇), where each input feature represents a feature level with a given resolution. The FPN aggregates these multi-scale features in a top-down manner.
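The aggregation equations appear only as an image in the original post; reconstructed from the paper, they are:

```latex
% Top-down FPN aggregation, reconstructed from Tan et al, 2019
P_7^{out} = \mathrm{Conv}\big(P_7^{in}\big)
P_6^{out} = \mathrm{Conv}\big(P_6^{in} + \mathrm{Resize}(P_7^{out})\big)
\;\;\vdots
P_3^{out} = \mathrm{Conv}\big(P_3^{in} + \mathrm{Resize}(P_4^{out})\big)
```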

Resize is typically an upsampling or downsampling operation for resolution matching, and Conv is a convolution operation for feature processing.

The FPN shown above is inherently limited by the flow of information in one direction. To get around this issue PANet adds a bottom-up path and the NAS-FPN uses neural architecture search to find better cross-scale feature network topology. Both these network designs are shown in the image below:

Three different FPN structures. Taken from Tan et al, 2019.

The authors show that PANet achieves better accuracy than NAS-FPN, but at the cost of more parameters and computations. Several optimisations were proposed:

  1. Remove nodes that only have one input edge. If a node has only one input edge with no feature fusion, then it is likely to contribute less to the feature network objective. This leads to a simplified PANet (above)
  2. An extra edge is added from the original input node to the output node if they are at the same level. This fuses more features together without much additional cost (BiFPN, below)
  3. There is bidirectional information flow (both top-down and bottom-up). Each bidirectional path is treated as its own layer, allowing the layers to be repeated and enabling more high-level feature fusion (BiFPN, below)

With these optimisations, the new feature network is termed bidirectional feature pyramid network (BiFPN). The BiFPN is illustrated in the figure below:

The BiFPN structure. Taken from Tan et al, 2019

Weighted Feature Fusion

As mentioned earlier, the fusion of features at different resolutions typically involves resizing them followed by a sum operation. The drawback of this method is that all features are treated equally. Since these features are at different resolutions they usually contribute to the output feature unequally. To get around this an additional weight for each input feature is calculated to allow the network to learn the importance of each feature. A total of three weighted fusion approaches were tested:

  1. Unbounded fusion: This uses an unbounded learnable weight. However, since it is unbounded it can cause training instability, so it was discarded
  2. Softmax-based fusion: Apply a softmax to the weights, thus limiting each weight to between 0 and 1, but this led to a significant increase in latency.
  3. Fast normalised fusion: Each weight (ωᵢ) is kept greater than or equal to zero via the application of a ReLU, and the weights are normalised by their sum. The ϵ is set to 0.001 to avoid numerical instability. This calculation was shown to be up to 30% faster on GPUs compared to the softmax-based method (a minimal code sketch follows the caption below).
Fast normalised fusion. Taken from Tan et al, 2019
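As a minimal sketch of this idea (assuming PyTorch; FastNormalizedFusion is an illustrative name, not the authors’ implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    """Fuse same-shape feature maps with learnable, non-negative weights."""

    def __init__(self, num_inputs: int, eps: float = 1e-3):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps  # small constant to keep the division stable

    def forward(self, inputs):
        w = F.relu(self.weights)        # ReLU keeps every weight >= 0
        w = w / (w.sum() + self.eps)    # normalise so the weights sum to ~1
        return sum(wi * x for wi, x in zip(w, inputs))

# Usage: fuse two feature maps of the same shape
fuse = FastNormalizedFusion(num_inputs=2)
a, b = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
fused = fuse([a, b])  # shape (1, 64, 32, 32)
```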

The final BiFPN integrates both bidirectional cross-scale connections and the fast normalised fusion method. An example is shown below for level 6 of the BiFPN:

Example of the integration of cross-scale connections in layer P6. Taken from Tan et al, 2019
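The two equations in that figure can be reconstructed from the paper as follows, where P₆ᵗᵈ is the intermediate feature and the w and w′ terms are the learned fusion weights:

```latex
% Level-6 BiFPN fusion, reconstructed from Tan et al, 2019
P_6^{td} = \mathrm{Conv}\!\left(
  \frac{w_1 \cdot P_6^{in} + w_2 \cdot \mathrm{Resize}(P_7^{in})}
       {w_1 + w_2 + \epsilon}\right)
P_6^{out} = \mathrm{Conv}\!\left(
  \frac{w'_1 \cdot P_6^{in} + w'_2 \cdot P_6^{td} + w'_3 \cdot \mathrm{Resize}(P_5^{out})}
       {w'_1 + w'_2 + w'_3 + \epsilon}\right)
```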

The top equation gives the intermediate feature on the top-down pathway and the bottom equation gives the output feature on the bottom-up pathway.

EfficientDet

With the invention of the BiFPN, a new family of detectors has been created called EfficientDet. The architecture of EfficientDet is shown below and uses the EfficientNet as a backbone network.

EfficientDet architecture with both the EfficientNet backbone and the BiFPN structure. Taken from Tan et al, 2019

The BiFPN in this network serves as the feature network. It takes the level 3–7 features from the backbone network and repeatedly applies bidirectional feature fusion. The fused features are fed into a class network and a box network to predict the object class and bounding box.
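A highly simplified sketch of that data flow is shown below; the names backbone, bifpn_layers, class_net and box_net are illustrative placeholders, not the authors’ API:

```python
def efficientdet_forward(image, backbone, bifpn_layers, class_net, box_net):
    """Illustrative forward pass: backbone -> repeated BiFPN -> shared heads."""
    # The EfficientNet backbone yields feature maps at levels P3 ... P7
    features = backbone(image)              # list of five feature maps
    # The BiFPN layers are applied repeatedly as the feature network
    for bifpn in bifpn_layers:
        features = bifpn(features)
    # The same class/box heads are shared across all feature levels
    class_outputs = [class_net(f) for f in features]
    box_outputs = [box_net(f) for f in features]
    return class_outputs, box_outputs
```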

Compound Scaling

Inspired by the compound scaling used in EfficientNets, a new compound scaling method was proposed for object detection. This method uses a single coefficient (Φ) to jointly scale up all dimensions of the backbone network, BiFPN network, class/box network and resolution. The scaling of each network component is described below:

  • Backbone network: Use the same coefficients as defined in B0-B6 so their ImageNet pre-trained weights can be reused
  • BiFPN: The width (channels) is grown exponentially and the depth (layers) is increased linearly, using the first of the equations reconstructed after this list
  • Box/class prediction network: The width is fixed to be the same as in the BiFPN, but the depth is increased linearly (second equation below)
  • Input Image Resolution: The resolution is also increased linearly, since it must be divisible by 2⁷ = 128 (third equation below)
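The three equations appear only as images in the original post; reconstructed from the paper, they are:

```latex
% EfficientDet compound scaling, reconstructed from Tan et al, 2019
W_{bifpn} = 64 \cdot (1.35)^{\phi}, \qquad D_{bifpn} = 3 + \phi
D_{box} = D_{class} = 3 + \lfloor \phi / 3 \rfloor
R_{input} = 512 + \phi \cdot 128
```

For example, with Φ = 3 these rules give roughly 64 · 1.35³ ≈ 157 BiFPN channels (rounded up in the paper’s table), 6 BiFPN layers, a box/class depth of 4 and an input resolution of 896.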

Using the three equations shown above, a family of networks is created, from EfficientDet-D0 (Φ = 0) to D6 (Φ = 6). This is further elaborated upon in the table below:

Summarised architectures of the different EfficientDet models. Taken from Tan et al, 2019

The table also contains a D7 model, which could not fit into memory unless the batch size or other settings were changed. As a result, the D6 model was expanded to D7 by increasing only the input image resolution.

Performance

How do these models compare to other detectors? The table below shows the comparison of the EfficientDet family to other models which are grouped together by accuracy:

Performance of the EfficientDet family of models compared to other detectors. Taken from Tan et al, 2019

The EfficientDet family of models achieves better accuracy and efficiency than previous detectors across a wide range of accuracy levels and resource constraints.

To understand how much the BiFPN contributes to the model performance, the table below compares the impact of the backbone and BiFPN:

Contribution (in terms of mAP) of each component in the EfficientDet model. Taken from Tan et al, 2019

It is clear to see that a strong backbone structure improves performance, but the addition of the BiFPN improves performance further, not only by increasing mAP but also by decreasing the number of parameters and FLOPS. Another key addition is that of weighted connections. How do weighted connections influence performance compared to other FPNs?

Influence of weighted connections on mAP. Taken from Tan et al, 2019

You can see that the vanilla FPN, which is limited by its one-directional flow of information, has the lowest accuracy. The PANet and NAS-FPN models show improved accuracy but require more parameters and FLOPS. Overall, the BiFPN achieves the best accuracy with fewer parameters and FLOPS.

Conclusion

The EfficientDet network is heavily inspired by the work done on the EfficientNet models. Using compound scaling, the performance of the model in terms of both accuracy and efficiency has been improved compared to other modern object detection models. The EfficientDet family of models has been shown to deliver significant speedups on GPUs and CPUs, which is critical for applications that require low latency. I believe we are entering a phase in object detection development where optimising current models is going to become critical. We can always build bigger and deeper models for more accuracy, but can we optimise them to get the most out of them?
