Review: RetinaNet — Focal Loss (Object Detection)

One-Stage Detector: With Focal Loss, RetinaNet Using ResNet+FPN Surpasses the Accuracy of Two-Stage Detectors Such as Faster R-CNN

Sik-Ho Tsang
Towards Data Science

In this story, RetinaNet, by Facebook AI Research (FAIR), is reviewed. It is discovered that there is an extreme foreground-background class imbalance problem in one-stage detectors, and this is believed to be the central cause of the inferior accuracy of one-stage detectors compared with two-stage detectors.

In RetinaNet, a one-stage detector, the focal loss makes “easy” negative samples contribute less to the loss so that training focuses on “hard” samples, which improves prediction accuracy. With ResNet+FPN as the backbone for feature extraction, plus two task-specific subnetworks for classification and bounding box regression, RetinaNet achieves state-of-the-art performance, outperforming Faster R-CNN, the well-known two-stage detector. It is a 2017 ICCV Best Student Paper Award paper with more than 500 citations. (The first author, Tsung-Yi Lin, had become a Research Scientist at Google Brain by the time he presented RetinaNet at ICCV 2017.) (Sik-Ho Tsang @ Medium)

A Demo of RetinaNet on Parking Lot Entrance Video (https://www.youtube.com/watch?v=51ujDJ-01oc)
Another Demo of RetinaNet on Car Camera Video

Outline

  1. Class Imbalance Problem of One-Stage Detector
  2. Focal Loss
  3. RetinaNet Detector
  4. Ablation Study
  5. Comparison with State-of-the-art Approaches

1. Class Imbalance Problem of One-Stage Detector

1.1. Two-Stage Detectors

  • In two-stage detectors such as Faster R-CNN, the first stage, the region proposal network (RPN), narrows down the number of candidate object locations to a small number (e.g. 1–2k), filtering out most background samples.
  • In the second stage, classification is performed for each candidate object location. Sampling heuristics, such as a fixed foreground-to-background ratio (1:3) or online hard example mining (OHEM), are used to select a small set of anchors (e.g., 256) for each minibatch.
  • Thus, there is a manageable class balance between foreground and background.

1.2. One-Stage Detectors

Many negative background examples, Few positive foreground examples
  • A much larger set of candidate object locations is regularly sampled across an image (~100k locations), which densely cover spatial positions, scales and aspect ratios.
  • The training procedure is still dominated by easily classified background examples. This is typically addressed via bootstrapping or hard example mining, but these methods are not efficient enough.

1.3. Number of Boxes Comparison

  • YOLOv1: 98 boxes
  • YOLOv2: ~1k
  • OverFeat: ~1–2k
  • SSD: ~8–26k
  • RetinaNet: ~100k. RetinaNet can afford ~100k boxes because the class imbalance problem is resolved by the focal loss.

2. Focal Loss

2.1. Cross Entropy (CE) Loss

  • The CE loss for binary classification is CE(p, y) = −log(p) if y = 1 and −log(1−p) otherwise, where y ∈ {±1} is the ground-truth class and p ∈ [0,1] is the model’s estimated probability for the class y = 1. It is straightforward to extend it to the multi-class case. For notational convenience, pt is defined as pt = p if y = 1 and pt = 1−p otherwise, so that CE(p, y) = CE(pt) = −log(pt).
  • When summed over a large number of easy examples, these small loss values can overwhelm the rare class. Below is the example:
Example
  • Let’s treat the above figure as an example. Suppose we have 100,000 easy examples (loss ≈ 0.1 each) and 100 hard examples (loss ≈ 2.3 each), and we sum over them to estimate the total CE loss.
  • The loss from easy examples = 100000×0.1 = 10000
  • The loss from hard examples = 100×2.3 = 230
  • 10000 / 230 ≈ 43, i.e. the easy examples contribute about 40× more loss than the hard ones.
  • Thus, CE loss is not a good choice when there is extreme class imbalance.
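
As a quick check of this arithmetic, here is a minimal Python sketch using the per-example loss values from the figure (0.1 is roughly the CE loss at pt ≈ 0.9, and 2.3 is roughly the CE loss at pt ≈ 0.1):

    # Easy examples dominate the summed CE loss even though each contributes little.
    easy_n, easy_ce = 100_000, 0.1   # ~0.1 = -log(0.9), a well-classified example
    hard_n, hard_ce = 100, 2.3       # ~2.3 = -log(0.1), a misclassified example

    easy_total = easy_n * easy_ce    # 10000
    hard_total = hard_n * hard_ce    # 230
    print(easy_total / hard_total)   # ~43, i.e. ~40x more loss from easy examples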

2.2. α-Balanced CE Loss

  • To address the class imbalance, one method is to add a weighting factor α for class 1 and 1 - α for class -1.
  • α may be set by inverse class frequency or treated as a hyperparameter to set by cross validation.
  • As seen in the two-stage detectors, α-balancing is implicitly implemented by selecting a foreground-to-background ratio of 1:3.

2.3. Focal Loss (FL)

  • The loss function is reshaped to down-weight easy examples and thus focus training on hard negatives. A modulating factor (1−pt)^γ is added to the cross entropy loss, giving FL(pt) = −(1−pt)^γ log(pt), where γ is tested in the range [0, 5] in the experiments.
  • There are two properties of the FL:
  1. When an example is misclassified and pt is small, the modulating factor is near 1 and the loss is unaffected. As pt →1, the factor goes to 0 and the loss for well-classified examples is down-weighted.
  2. The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, FL is equivalent to CE. When γ is increased, the effect of the modulating factor is likewise increased. (γ=2 works best in experiment.)
  • For instance, with γ = 2, an example classified with pt = 0.9 would have 100× lower loss compared with CE, and with pt = 0.968 it would have about 1000× lower loss. This in turn increases the relative importance of correcting misclassified examples.
  • The loss is scaled down by at most 4× for pt ≤ 0.5 and γ = 2.
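
The down-weighting factors quoted above follow directly from the modulating term, since FL(pt)/CE(pt) = (1−pt)^γ. A quick check in Python:

    gamma = 2
    for pt in (0.5, 0.9, 0.968):
        factor = (1 - pt) ** gamma        # ratio of FL to CE at this pt
        print(pt, factor, 1 / factor)
    # pt = 0.5   -> factor 0.25     (loss scaled down by at most 4x)
    # pt = 0.9   -> factor 0.01     (~100x lower loss than CE)
    # pt = 0.968 -> factor ~0.001   (~1000x lower loss than CE)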

2.4. α-Balanced Variant of FL

  • In practice, the α-balanced variant FL(pt) = −αt(1−pt)^γ log(pt) is used in the experiments, which yields slightly improved accuracy over the non-α-balanced form. In addition, using the sigmoid activation function for computing p results in greater numerical stability.
  • γ: Focus more on hard examples.
  • α: Offset class imbalance of number of examples.
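
Below is a minimal sketch of the α-balanced focal loss in this binary (sigmoid) formulation, assuming PyTorch; it follows the equations above but is not the authors’ reference implementation. With γ = 0 it reduces to the α-balanced CE loss of Section 2.2.

    import torch
    import torch.nn.functional as F

    def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        # logits:  raw class predictions, shape (N, K)
        # targets: binary ground-truth labels in {0.0, 1.0}, same shape
        p = torch.sigmoid(logits)
        # Standard binary CE, computed from logits for numerical stability
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        # pt: the model's estimated probability of the true class
        pt = p * targets + (1 - p) * (1 - targets)
        # alpha_t: weight alpha for positives, (1 - alpha) for negatives
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        # (1 - pt)^gamma down-weights easy examples; per-element losses are returned
        return alpha_t * (1 - pt) ** gamma * ce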

2.5. Model Initialization

  • A prior π is set for the value of p at the start of training, so that the model’s estimated p for examples of the rare class is low, e.g. 0.01, in order to improve the training stability in the case of heavy class imbalance.
  • It is found that training RetinaNet with the standard CE loss and WITHOUT the prior π for initialization diverges during training and eventually fails.
  • Results are insensitive to the exact value of π, and π = 0.01 is used for all experiments.
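
In the paper this prior is realized by initializing the bias of the final conv layer of the classification subnet to b = −log((1−π)/π), so that the initial sigmoid output is roughly π. A minimal sketch, assuming PyTorch (K, A and C here are the values used later in Section 3.3):

    import math
    import torch.nn as nn

    K, A, C = 80, 9, 256   # classes (COCO), anchors per location, filters
    pi = 0.01              # prior probability for the rare (foreground) class

    cls_output = nn.Conv2d(C, K * A, kernel_size=3, padding=1)
    # sigmoid(bias) ~= pi, so every anchor starts with a low foreground probability
    nn.init.constant_(cls_output.bias, -math.log((1 - pi) / pi))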

3. RetinaNet Detector

RetinaNet Detector Architecture

3.1. (a) and (b) Backbone

  • ResNet is used for deep feature extraction.
  • Feature Pyramid Network (FPN) is used on top of ResNet to construct a rich multi-scale feature pyramid from a single-resolution input image. (Originally, FPN was used in a two-stage detector that achieved state-of-the-art results. Please read my review about FPN if interested.)
  • FPN is multiscale, semantically strong at all scales, and fast to compute.
  • There are some modest changes to the FPN here. The pyramid is generated from P3 to P7: (i) P2 is not used for computational reasons; (ii) P6 is computed by strided convolution instead of downsampling; (iii) P7 is included additionally to improve the accuracy of large object detection.

3.2. Anchors

  • The anchors have areas from 32² to 512² on pyramid levels P3 to P7, respectively.
  • Three aspect ratios {1:2, 1:1, 2:1} are used.
  • For denser scale coverage, anchors of sizes {2⁰, 2^(1/3), 2^(2/3)} are added at each pyramid level.
  • In total, 9 anchors per level.
  • Across levels, scales from 32 to 813 pixels are covered (see the sketch after this list).
  • For each anchor, there is a length-K one-hot vector of classification targets (K: number of classes) and a 4-vector of box regression targets.
  • Anchors are assigned to ground-truth object boxes using an IoU threshold of 0.5, and to background if the IoU is in [0, 0.4). Each anchor is assigned to at most one object box, and the corresponding class entry in its length-K vector is set to one with all other entries set to 0. An anchor is left unassigned if its IoU is in [0.4, 0.5) and is ignored during training.
  • Box regression is computed as the offset between anchor and assigned object box, or omitted if there is no assignment.
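
Below is a minimal, illustrative sketch (plain Python, not the authors’ code) that enumerates the 9 anchor shapes per level from the settings above; the largest anchor comes out at 512 × 2^(2/3) ≈ 813 pixels, matching the scale range quoted in the list.

    # 9 anchors per pyramid level: 3 scale offsets x 3 aspect ratios.
    base_sizes = {"P3": 32, "P4": 64, "P5": 128, "P6": 256, "P7": 512}
    scales = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]
    aspect_ratios = [0.5, 1.0, 2.0]          # height/width for 1:2, 1:1, 2:1

    for level, base in base_sizes.items():
        anchors = []
        for s in scales:
            area = (base * s) ** 2           # anchor area at this scale offset
            for r in aspect_ratios:
                w = (area / r) ** 0.5        # keep the area fixed while varying the ratio
                h = w * r
                anchors.append((round(w), round(h)))
        print(level, anchors)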

3.3. (c) Classification Subnet

  • This classification subnet predicts the probability of object presence at each spatial position for each of the A anchors and K object classes.
  • The subnet is an FCN which applies four 3×3 conv layers, each with C filters and each followed by ReLU activation, followed by a 3×3 conv layer with KA filters. (K classes, A = 9 anchors, and C = 256 filters.)
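
A minimal sketch of this head, assuming PyTorch (with the A = 9 and C = 256 defaults above; K is the number of classes); it mirrors the description rather than the authors’ exact code. The same head is shared across all pyramid levels.

    import torch.nn as nn

    def make_cls_subnet(K, A=9, C=256):
        layers = []
        for _ in range(4):   # four 3x3 convs with C filters, each followed by ReLU
            layers += [nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        # Final 3x3 conv with K*A filters: one sigmoid-based score per class per anchor
        layers.append(nn.Conv2d(C, K * A, kernel_size=3, padding=1))
        return nn.Sequential(*layers)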

3.4. (d) Box Regression Subnet

  • This subnet is an FCN attached to each pyramid level for the purpose of regressing the offset from each anchor box to a nearby ground-truth object, if one exists.
  • It is identical to the classification subnet except that it terminates in 4A linear outputs per spatial location.
  • It is a class-agnostic bounding box regressor, which uses fewer parameters and is found to be equally effective.
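
Under the same assumptions, the box head can be sketched in the same way, differing only in its 4A outputs per spatial location:

    import torch.nn as nn

    def make_box_subnet(A=9, C=256):
        layers = []
        for _ in range(4):   # same trunk as the classification subnet
            layers += [nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        # 4 box-offset values per anchor, shared across classes (class-agnostic)
        layers.append(nn.Conv2d(C, 4 * A, kernel_size=3, padding=1))
        return nn.Sequential(*layers)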

3.5. Inference

  • The network only decodes box predictions from at most 1k top-scoring predictions per FPN level, after thresholding detector confidence at 0.05.
  • The top predictions from all levels are merged and non-maximum suppression (NMS) with a threshold of 0.5 is applied to yield the final detections.
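
A minimal sketch of this post-processing, assuming PyTorch/torchvision and per-level boxes and scores that have already been decoded from the anchors (per-class handling is omitted for brevity):

    import torch
    from torchvision.ops import nms

    def postprocess(level_boxes, level_scores, score_thresh=0.05, topk=1000, nms_thresh=0.5):
        # level_boxes: list of (Ni, 4) tensors; level_scores: list of (Ni,) tensors
        kept_boxes, kept_scores = [], []
        for boxes, scores in zip(level_boxes, level_scores):
            keep = scores > score_thresh                          # confidence threshold 0.05
            boxes, scores = boxes[keep], scores[keep]
            scores, idx = scores.topk(min(topk, scores.numel()))  # at most 1k per level
            kept_boxes.append(boxes[idx])
            kept_scores.append(scores)
        boxes, scores = torch.cat(kept_boxes), torch.cat(kept_scores)
        keep = nms(boxes, scores, nms_thresh)                     # merge levels, NMS at 0.5
        return boxes[keep], scores[keep]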

3.6. Training

  • During training, the total focal loss of an image is computed as the sum of the focal loss over all ~100k anchors, normalized by the number of anchors assigned to a ground-truth box.
  • ImageNet1K pre-trained ResNet-50-FPN and ResNet-101-FPN are used.
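
A minimal sketch of that normalization, assuming the per-anchor focal-loss values and the anchor assignment mask are already available (the names below are illustrative):

    import torch

    def image_cls_loss(per_anchor_fl: torch.Tensor, assigned: torch.Tensor) -> torch.Tensor:
        # per_anchor_fl: focal-loss value for each of the ~100k anchors (1-D tensor)
        # assigned: boolean mask of anchors matched to a ground-truth box
        num_assigned = assigned.sum().clamp(min=1)   # avoid dividing by zero
        return per_anchor_fl.sum() / num_assigned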

4. Ablation Study

  • The COCO dataset is used: the trainval35k split for training and the minival split (5k images) for validation.
α for CE loss (Left), γ for FL (Right)

4.1. α for α-Balanced CE loss

  • ResNet-50 is used.
  • First, α-Balanced CE loss with different α is tested.
  • α = 0.75 gives a gain of 0.9 AP.

4.2. γ for FL

  • γ=0 is α-Balanced CE loss.
  • When γ increases, easy examples contribute less to the loss.
  • γ=2 and α=0.25 yields a 2.9 AP improvement over α-Balanced CE loss (α=0.75).
  • It is observed that lower α’s are selected for higher γ’s.
  • The benefit of changing γ is much larger, and indeed the best α’s ranged in just [0.25, 0.75] (α ∈ [0.01, 0.999] was tested).
Cumulative distribution functions of the normalized loss for positive and negative samples

4.3. Foreground and Background Samples Analysis

Foreground samples

  • The loss is sorted from lowest to highest and its cumulative distribution function (CDF) is plotted for both positive and negative samples and for different settings of γ.
  • Approximately 20% of the hardest positive samples account for roughly half of the positive loss.
  • As γ increases more of the loss gets concentrated in the top 20% of examples, but the effect is minor.

Background samples

  • As γ increases, substantially more weight becomes concentrated on the hard negative examples.
  • The vast majority of the loss comes from a small fraction of samples.
  • FL can effectively discount the effect of easy negatives, focusing all attention on the hard negative examples.

4.4. Anchor Density

Different Number of Scales (#sc) and Aspect Ratios (#ar)
  • Using one square anchor (#sc=1, #ar=1) achieves 30.3% AP which is not bad.
  • AP can be improved by nearly 4 points (34.0) using 3 scales and 3 aspect ratios.
  • Increasing beyond 6–9 anchors did not show further gains.

4.5. FL vs OHEM (Online Hard Example Mining)

FL vs OHEM (Online Hard Example Mining)
  • Here, ResNet-101 is used.
  • In OHEM, each example is scored by its loss, non-maximum suppression (NMS) is then applied, and a minibatch is constructed with the highest-loss examples.
  • Like the focal loss, OHEM puts more emphasis on misclassified examples.
  • But unlike FL, OHEM completely discards easy examples.
  • After applying NMS to all examples, the minibatch is constructed to enforce a 1:3 ratio between positives and negatives.
  • The best setting for OHEM (no 1:3 ratio, batch size 128, NMS of 0.5) achieves 32.8% AP.
  • FL obtains 36.0% AP, i.e. a gap of 3.2 AP, which shows the effectiveness of FL.
  • Note: Authors also tested Hinge Loss, where loss is set to 0 above a certain value of pt. However, training is unstable.

5. Comparison with State-of-the-art Approaches

5.1. Speed versus Accuracy Tradeoff

Speed versus Accuracy
  • RetinaNet-101–600: RetinaNet with ResNet-101-FPN and a 600 pixel image scale, matches the accuracy of the recently published ResNet-101-FPN Faster R-CNN (FPN) while running in 122 ms per image compared to 172 ms (both measured on an Nvidia M40 GPU).
  • Larger backbone networks yield higher accuracy, but also slower inference speeds.
  • Training time ranges from 10 to 35 hours.
  • Using larger scales allows RetinaNet to surpass the accuracy of all two-stage approaches, while still being faster.
  • Except for YOLOv2 (which targets extremely high frame rates), RetinaNet outperforms SSD, DSSD, R-FCN and FPN.
  • For faster runtimes, there is only one operating point (500 pixel input) at which RetinaNet using ResNet-50-FPN improves over the one using ResNet-101-FPN.

5.2. State-of-the-art Accuracy

Object detection single-model results (bounding box AP), vs. state-of-the-art on COCO test-dev
  • RetinaNet Using ResNet-101-FPN: the RetinaNet-101–800 model is trained using scale jitter and for 1.5× longer than the models in Section 5.1.
  • Compared to existing one-stage detectors, it achieves a healthy 5.9 point AP gap (39.1 vs. 33.2) with the closest competitor, DSSD.
  • Compared to recent two-stage methods, RetinaNet achieves a 2.3 point gap above the top-performing Faster R-CNN model based on Inception-ResNet-v2-TDM. (If interested, please read my review about Inception-ResNet-v2 and TDM.)
  • RetinaNet Using ResNeXt-101-FPN: Plugging in ResNeXt-32x8d-101-FPN [38] as the RetinaNet backbone further improves results another 1.7 AP, surpassing 40 AP on COCO. (If interested, please read my review about ResNeXt.)

By using focal loss, the total loss can be balanced adaptively between easy samples and hard samples.
