Review: Residual Attention Network — Attention-Aware Features (Image Classification)

Outperforms Pre-Activation ResNet, WRN, Inception-ResNet, ResNeXt

Sik-Ho Tsang
Towards Data Science

--

In this story, Residual Attention Network, by SenseTime, Tsinghua University, Chinese University of Hong Kong (CUHK), and Beijing University of Posts and Telecommunications, is reviewed. Multiple Attention Modules are stacked to generate attention-aware features, and attention residual learning is used for very deep networks. Finally, this is a 2017 CVPR paper with over 200 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Attention Network
  2. Attention Residual Learning
  3. Soft Mask Branch
  4. Overall Architecture
  5. Ablation Study
  6. Comparison with State-of-the-art Approaches

1. Attention Network

Residual Attention Network
  • where p is the number of pre-processing Residual Units before splitting into the trunk branch and the mask branch.
  • t denotes the number of Residual Units in trunk branch.
  • r denotes the number of Residual Units between adjacent pooling layers in the mask branch.
  • In experiments, unless specified, p=1, t=2, r=1.

1.1. Mask Branch & Trunk Branch

  • There are two branches in each Attention Module: the mask branch and the trunk branch.
  • Trunk branch: It is the upper branch in the Attention Module and performs feature extraction. It can be built from Pre-Activation ResNet blocks or other blocks. With input x, it outputs T(x).
  • Mask branch: It uses a bottom-up top-down structure to learn a mask M(x) of the same size as T(x). This M(x) is used as control gates, similar to Highway Network.
  • Finally, the output of the Attention Module is H_{i,c}(x) = M_{i,c}(x) · T_{i,c}(x), where i ranges over all spatial positions and c is the channel index from 1 to C. (A code sketch is given after this list.)
  • The attention mask can serve as a feature selector during forward inference.
  • During backpropagation, the gradient of the masked output with respect to the trunk parameters is ∂(M(x, θ) · T(x, φ))/∂φ = M(x, θ) · ∂T(x, φ)/∂φ, where θ are the mask branch parameters and φ are the trunk branch parameters.
  • The mask therefore also acts as a gradient update filter during backpropagation.
  • Thus, Attention Modules are robust to noisy labels: the mask branch can prevent wrong gradients (from noisy labels) from updating the trunk parameters.
  • (It is somewhat like Spatial Transformer Network (STN), but with a different objective: STN aims for deformation invariance, while the attention network aims at generating attention-aware features, and it can handle more challenging datasets such as ImageNet, in which images contain cluttered backgrounds, complex scenes, and large appearance variations that need to be modeled.)
An Example of Hot Air Balloon Images
  • As shown above, in hot air balloon images, the blue color features from the bottom layers have a corresponding sky mask that eliminates the background, while the part features from the top layers are refined by a balloon instance mask.
  • Besides, the incremental nature of the stacked network structure can gradually refine attention for complex images.
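To make the structure and the gating formula concrete, here is a minimal PyTorch-style sketch (not the authors' code) of an Attention Module skeleton with p pre-processing Residual Units, a trunk branch of t Residual Units, and a pluggable mask branch. The `residual_unit` factory and the dummy mask branch used in the shape check are illustrative assumptions, and the naive combination H(x) = M(x) · T(x) is used here.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Skeleton of one Attention Module using the naive combination M(x) * T(x)."""
    def __init__(self, channels, mask_branch, residual_unit, p=1, t=2):
        super().__init__()
        # p pre-processing Residual Units before splitting into trunk and mask branches.
        self.pre = nn.Sequential(*[residual_unit(channels) for _ in range(p)])
        # Trunk branch: t Residual Units for feature extraction, outputs T(x).
        self.trunk = nn.Sequential(*[residual_unit(channels) for _ in range(t)])
        # Mask branch: bottom-up top-down structure ending in a sigmoid, outputs M(x).
        self.mask = mask_branch

    def forward(self, x):
        x = self.pre(x)
        T = self.trunk(x)   # T(x), shape (N, C, H, W)
        M = self.mask(x)    # M(x) in [0, 1], same shape as T(x)
        return M * T        # H_{i,c}(x) = M_{i,c}(x) * T_{i,c}(x)

# Shape check with dummy branches (identity residual units, a 1x1 conv + sigmoid mask):
x = torch.randn(1, 64, 32, 32)
module = AttentionModule(
    channels=64,
    mask_branch=nn.Sequential(nn.Conv2d(64, 64, kernel_size=1), nn.Sigmoid()),
    residual_unit=lambda c: nn.Identity(),
)
print(module(x).shape)  # torch.Size([1, 64, 32, 32])
```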

2. Attention Residual Learning

  • However, naively stacking Attention Modules with the combination H(x) = M(x) · T(x), called Naive Attention Learning (NAL), leads to a performance drop.
  • This is because repeatedly multiplying features by a mask whose values range from 0 to 1 degrades the feature values in deep layers.
  • Also, the soft mask can potentially break good properties of the trunk branch, for example the identity mapping of the Residual Unit in Pre-Activation ResNet.
  • A better formulation, called Attention Residual Learning (ARL), is H_{i,c}(x) = (1 + M_{i,c}(x)) · F_{i,c}(x), where F(x) is the original features and M(x) ranges within [0, 1]. (A code sketch follows this list.)
  • Thus, when M(x) approaches 0, H(x) approximates the original features F(x), so ARL can keep the good properties of the original features.
  • Stacked Attention Modules can gradually refine the feature maps, as in the figure above. Features become much clearer as the depth increases.
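A minimal sketch of the attention residual formulation, under the same assumptions as the previous snippet (hypothetical trunk and mask branches passed in by the caller); the only change relative to naive attention is the combination rule (1 + M(x)) · F(x).

```python
import torch.nn as nn

class ResidualAttention(nn.Module):
    """Attention Residual Learning: H(x) = (1 + M(x)) * F(x)."""
    def __init__(self, trunk: nn.Module, mask: nn.Module):
        super().__init__()
        self.trunk = trunk   # produces the original features F(x)
        self.mask = mask     # produces the soft mask M(x) in [0, 1]

    def forward(self, x):
        F = self.trunk(x)
        M = self.mask(x)
        # When M(x) is close to 0, H(x) falls back to F(x), so the good
        # properties of the original trunk features are preserved.
        return (1 + M) * F
```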

3. Soft Mask Branch

Soft Mask Branch
  • A bottom-up top-down fully convolutional structure is used (a simplified sketch is given after this list).
  • Max pooling is performed several times after a small number of Residual Units to increase the receptive field rapidly.
  • Then, the global information is expanded by a symmetrical top-down architecture to guide the input features at each position.
  • Linear interpolation upsamples the output after some Residual Units, so that the output is the same size as the input feature map.
  • Finally, after two 1×1 convolution layers, a sigmoid layer normalizes the output to the range [0, 1].
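The bottom-up top-down mask branch can be sketched roughly as below. This is a simplified illustration that assumes r = 1 Residual Unit per level, uses a generic `residual_unit` factory (identity by default), and omits the skip connections between corresponding down- and up-sampling levels as well as the exact normalization between the two 1×1 convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMaskBranch(nn.Module):
    """Simplified bottom-up top-down mask branch producing M(x) in [0, 1]."""
    def __init__(self, channels, residual_unit=lambda c: nn.Identity(), num_down=2):
        super().__init__()
        self.down = nn.ModuleList(residual_unit(channels) for _ in range(num_down))
        self.up = nn.ModuleList(residual_unit(channels) for _ in range(num_down))
        self.out = nn.Sequential(          # two 1x1 convolutions followed by sigmoid
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        sizes = []
        # Bottom-up: max pooling rapidly enlarges the receptive field.
        for unit in self.down:
            sizes.append(x.shape[-2:])
            x = F.max_pool2d(x, kernel_size=3, stride=2, padding=1)
            x = unit(x)
        # Top-down: interpolation symmetrically restores the resolution.
        for unit, size in zip(self.up, reversed(sizes)):
            x = unit(x)
            x = F.interpolate(x, size=size, mode="bilinear", align_corners=False)
        return self.out(x)  # M(x): same spatial size as the input, values in [0, 1]

# The mask has the same shape as the input features:
m = SoftMaskBranch(64)(torch.randn(1, 64, 32, 32))
print(m.shape, float(m.min()) >= 0.0, float(m.max()) <= 1.0)
```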

4. Overall Architecture

Overall Architecture
  • The network consists of 3 stages and, similar to Pre-Activation ResNet, an equal number of Attention Modules is stacked in each stage.
  • Additionally, two Residual Units are added at each stage.
  • The number of weighted layers in the trunk branch is 36m + 20, where m is the number of Attention Modules in one stage (a quick check of this rule follows below).
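As a quick sanity check of the 36m + 20 naming rule (a small illustrative computation, not from the paper's code):

```python
# Trunk depth = 36*m + 20, where m is the number of Attention Modules per stage.
for m in (1, 2, 3, 4):
    print(f"m = {m} -> Attention-{36 * m + 20}")
# m = 1 -> Attention-56, m = 2 -> Attention-92,
# m = 3 -> Attention-128, m = 4 -> Attention-164
```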

5. Ablation Study

5.1. Activation Function in Soft Mask Branch

Test Error (%) on CIFAR-10 of Attention-56
  • Besides sigmoid, other types of activation functions for the soft mask are tested on CIFAR-10 using Attention-56 (56 weighted layers).
  • As shown above, sigmoid is the best among the three.

5.2. Naive Attention Learning (NAL) vs Attention Residual Learning (ARL)

Test Error on CIFAR-10
  • With m = {1, 2, 3, 4}, this leads to Attention-56 (named by trunk-layer depth), Attention-92, Attention-128 and Attention-164 respectively.
  • ARL consistently outperforms NAL.
  • NAL suffers an obvious degradation as the number of Attention Modules increases.
  • With ARL, the performance increases with the number of Attention Modules.
Mean Absolute Response Value Using Attention-164
  • The mean absolute response value of the output layers of each stage is measured using Attention-164.
  • With NAL, the response quickly vanishes in stage 2 after 4 Attention Modules.
  • ARL can suppress noise while keeping useful information, and it relieves signal attenuation through identity mapping. It gains the benefit of noise reduction without significant information loss.

5.3. Different Mask Structures

Test Error on CIFAR-10
  • Local Convolutions: no encoder-decoder structure in the mask branch, only convolutions.
  • Encoder and Decoder: the error is smaller, since it benefits from multi-scale information.

5.4. Noise Label Robustness

  • A confusion matrix with clean label ratio r is used to inject label noise into the whole dataset (a sketch of one such scheme is given after this list).
  • Different values of r correspond to different levels of label noise injected into the dataset.
Test Error on CIFAR-10 with Label Noises
  • ARL can perform well even when trained with highly noisy labels.
  • When a label is wrong, the soft mask can block the gradient caused by the label error from updating the trunk branch parameters.
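Below is a minimal sketch of one plausible way to inject label noise with clean label ratio r via a confusion matrix; spreading the remaining probability mass uniformly over the other classes is an assumption for illustration and may differ from the paper's exact setup.

```python
import numpy as np

def noisy_labels(labels, r, num_classes=10, seed=0):
    """Keep each label with probability r; otherwise flip it uniformly
    to one of the other classes, following a row-stochastic confusion matrix."""
    rng = np.random.default_rng(seed)
    q = np.full((num_classes, num_classes), (1.0 - r) / (num_classes - 1))
    np.fill_diagonal(q, r)  # row y gives P(noisy label | clean label y)
    return np.array([rng.choice(num_classes, p=q[y]) for y in labels])

# Example: with r = 0.5, about half of the CIFAR-10 labels stay correct.
clean = np.random.randint(0, 10, size=1000)
noisy = noisy_labels(clean, r=0.5)
print(round(float((noisy == clean).mean()), 2))  # roughly 0.5
```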

6. Comparison with State-of-the-art Approaches

6.1. CIFAR-10 & CIFAR-100

  • The CIFAR-10 and CIFAR-100 datasets consist of 60,000 32×32 color images of 10 and 100 classes respectively, with 50,000 training images and 10,000 test images.
Comparisons with State-of-the-art Methods on CIFAR-10/100
  • Attention-452 consists of Attention Modules with the hyper-parameter setting {p = 2, t = 4, r = 3} and 6 Attention Modules per stage.
  • With attention modules, the Residual Attention Network outperforms Pre-Activation ResNet and WRN.
  • Attention-236 outperforms ResNet-1001 using only half of the parameters. This means the Attention Module and the attention residual learning scheme can effectively reduce the number of parameters while improving the classification performance.

6.2. ImageNet

  • ImageNet LSVRC 2012 dataset contains 1,000 classes with 1.2 million training images, 50,000 validation images, and 100,000 test images. The evaluation is measured on the non-blacklist images of the ImageNet LSVRC 2012 validation set.
Single Crop Validation Error on ImageNet
  • The Attention-56 network outperforms ResNet-152 by a large margin, with a 0.4% reduction on top-1 error and a 0.26% reduction on top-5 error, while using only 52% of the parameters and 56% of the FLOPs of ResNet-152.
  • The Residual Attention Network also generalizes well across different basic units: with attention modules, it outperforms the corresponding networks without attention modules.
  • The AttentionNeXt-56 network performs on par with ResNeXt-101, while its parameters and FLOPs are significantly fewer than those of ResNeXt-101.
  • AttentionInception-56 outperforms Inception-ResNet-v1 by a margin of 0.94% on top-1 error and 0.21% on top-5 error.
  • Attention-92 outperforms ResNet-200 by a large margin: the reduction on top-1 error is 0.6%, while ResNet-200 contains 32% more parameters than Attention-92.
  • Also, the attention network reduces training time by nearly half compared with ResNet-200.
