Review: ShuffleNet V1 — Light Weight Model (Image Classification)

With Channel Shuffling, Outperforms MobileNetV1

Sik-Ho Tsang
Towards Data Science

--

ShuffleNet, Light Weight Models for Limited Computational Budget Devices such as Drones (https://unsplash.com/photos/DiTiYQx0mh4)

In this story, ShuffleNet V1, by Megvii Inc. (Face++), is briefly reviewed. ShuffleNet pursues the best accuracy under very limited computational budgets of tens to hundreds of MFLOPs, focusing on common mobile platforms such as drones, robots, and smartphones. By shuffling the channels, ShuffleNet outperforms MobileNetV1. On an ARM device, ShuffleNet achieves a 13× actual speedup over AlexNet while maintaining comparable accuracy. This is a paper in 2018 CVPR with more than 300 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Channel Shuffle for Group Convolutions
  2. ShuffleNet Unit
  3. ShuffleNet Architecture
  4. Ablation Study
  5. Comparison with State-of-the-art Approaches

1. Channel Shuffle for Group Convolutions

(a) Two Stacked Group Convolutions (GConv1 & GConv2), (b) Shuffle the channels before convolution, (c) Equivalent implementation of (b)
  • Group convolutions are used in AlexNet and ResNeXt.
  • (a): There is no channel shuffle; each output channel only relates to the input channels within its group. This property blocks information flow between channel groups and weakens representation.
  • (b): If we allow group convolution to obtain input data from different groups, the input and output channels will be fully related.
  • (c): The operations in (b) can be efficiently and elegantly implemented by a channel shuffle operation. Suppose a convolutional layer with g groups whose output has g×n channels: first reshape the output channel dimension into (g, n), then transpose and flatten it back as the input of the next layer (see the sketch after this list).
  • Channel shuffle is also differentiable, which means it can be embedded into network structures for end-to-end training.
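As a minimal sketch of the reshape-transpose-flatten trick described above (a PyTorch-style illustration of my own, not taken from the paper's code):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Reshape (g*n) channels into (g, n), transpose, and flatten back."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)   # (N, g, n', H, W)
    x = x.transpose(1, 2).contiguous()         # (N, n', g, H, W)
    return x.view(n, c, h, w)                  # back to (N, C, H, W)

# Quick check: with 2 groups of 3 channels, [0, 1, 2, 3, 4, 5] -> [0, 3, 1, 4, 2, 5]
x = torch.arange(6.0).view(1, 6, 1, 1)
print(channel_shuffle(x, 2).flatten().tolist())
```

Because the operation is just a reshape and a transpose, it adds essentially no FLOPs and stays differentiable, which is why it can be dropped into the network and trained end-to-end.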

2. ShuffleNet Unit

(a) bottleneck unit with depthwise convolution (DWConv), (b) ShuffleNet unit with pointwise group convolution (GConv) and channel shuffle, (c) ShuffleNet unit with stride = 2.
  • (a) Bottleneck Unit: This is a standard residual bottleneck unit, but with depthwise convolution used. (Depthwise convolution is used in MobileNetV1.) With 1×1, then 3×3 DW, then 1×1 convolutions, it can also be treated as the bottleneck type of depthwise separable convolution used in MobileNetV2.
  • (b) ShuffleNet Unit: The first and second 1×1 convolutions are replaced by group convolutions. A channel shuffle is applied after the first 1×1 convolution.
  • (c) ShuffleNet Unit with Stride = 2: When stride 2 is applied, a 3×3 average pooling (stride 2) is added on the shortcut path, and the element-wise addition is replaced with channel concatenation, which makes it easy to enlarge the channel dimension with little extra computation cost.
  • Given an input of size c×h×w and m bottleneck channels, a ResNet unit requires hw(2cm + 9m²) FLOPs and a ResNeXt unit requires hw(2cm + 9m²/g) FLOPs, while a ShuffleNet unit only requires hw(2cm/g + 9m) FLOPs, where g is the number of groups (a code sketch and a worked example follow this list).
  • In other words, given a computational budget, ShuffleNet can use wider feature maps. The authors find this critical for small networks, as tiny networks usually have an insufficient number of channels to process the information.
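Below is a hedged PyTorch-style sketch of the (b)/(c) units, again my own illustration rather than the authors' released code. It reuses the channel-shuffle helper inline and assumes all channel counts are divisible by the group number; the paper's special case for the first unit of Stage 2 (whose first 1×1 convolution is not grouped) is omitted for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    n, c, h, w = x.size()
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class ShuffleNetUnit(nn.Module):
    """Sketch of Fig. 2 (b)/(c): 1x1 GConv -> shuffle -> 3x3 DWConv -> 1x1 GConv,
    merged with the shortcut (add for stride 1, avg-pool + concat for stride 2)."""
    def __init__(self, in_ch, out_ch, groups=3, stride=1):
        super().__init__()
        self.stride, self.groups = stride, groups
        # For stride 2 the residual branch outputs out_ch - in_ch channels,
        # so concatenating with the pooled shortcut yields out_ch in total.
        branch_out = out_ch - in_ch if stride == 2 else out_ch
        mid = out_ch // 4                        # bottleneck = 1/4 of output channels
        self.gconv1 = nn.Conv2d(in_ch, mid, 1, groups=groups, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.dwconv = nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                                groups=mid, bias=False)  # depthwise 3x3
        self.bn2 = nn.BatchNorm2d(mid)
        self.gconv2 = nn.Conv2d(mid, branch_out, 1, groups=groups, bias=False)
        self.bn3 = nn.BatchNorm2d(branch_out)

    def forward(self, x):
        out = F.relu(self.bn1(self.gconv1(x)))
        out = channel_shuffle(out, self.groups)   # cross-group information flow
        out = self.bn2(self.dwconv(out))          # no ReLU after the depthwise conv
        out = self.bn3(self.gconv2(out))
        if self.stride == 1:
            return F.relu(x + out)                # element-wise addition
        shortcut = F.avg_pool2d(x, 3, stride=2, padding=1)
        return F.relu(torch.cat([shortcut, out], dim=1))  # concat enlarges channels
```

Plugging illustrative numbers (my own, not from the paper) into the FLOPs formulas above: with h = w = 28, c = 240, m = 60 and g = 3, a ResNet-style unit costs 28·28·(2·240·60 + 9·60²) ≈ 48.0 MFLOPs, while a ShuffleNet unit costs 28·28·(2·240·60/3 + 9·60) ≈ 7.9 MFLOPs, roughly 6× cheaper at the same width.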

3. ShuffleNet Architecture

ShuffleNet Architecture
  • The proposed network is mainly composed of a stack of ShuffleNet units grouped into three stages.
  • The number of bottleneck channels is set to 1/4 of the output channels for each ShuffleNet unit.
  • A scale factor s is applied to the number of channels. The network in the above table is denoted “ShuffleNet 1×”; “ShuffleNet s×” means scaling the number of filters in ShuffleNet 1× by s, so the overall complexity is roughly s² times that of ShuffleNet 1× (a minimal stage-builder sketch follows).
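As a rough illustration of how the stages compose, here is a minimal builder reusing the ShuffleNetUnit sketch above. The channel plan (240/480/960) and repeat counts (3/7/3) are my assumption of the g = 3, 1× configuration from the paper's Table 1, which is not reproduced here, and the first unit of Stage 2 is simplified as noted earlier.

```python
import torch.nn as nn

def make_stage(in_ch, out_ch, repeats, groups=3):
    """One stage: a stride-2 ShuffleNet unit followed by `repeats` stride-1 units."""
    layers = [ShuffleNetUnit(in_ch, out_ch, groups, stride=2)]
    layers += [ShuffleNetUnit(out_ch, out_ch, groups, stride=1) for _ in range(repeats)]
    return nn.Sequential(*layers)

def shufflenet(scale=1.0, groups=3, num_classes=1000):
    # Assumed g = 3 channel plan; "ShuffleNet s×" multiplies every width by s,
    # so FLOPs grow roughly with s^2.
    c2, c3, c4 = int(240 * scale), int(480 * scale), int(960 * scale)
    return nn.Sequential(
        nn.Conv2d(3, 24, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(24), nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2, padding=1),
        make_stage(24, c2, 3, groups),   # Stage 2
        make_stage(c2, c3, 7, groups),   # Stage 3
        make_stage(c3, c4, 3, groups),   # Stage 4
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(c4, num_classes),
    )
```

For example, relative to ShuffleNet 1×, ShuffleNet 0.5× halves every width and therefore has roughly (0.5)² = 1/4 of the complexity.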

4. Ablation Study

  • The ImageNet 2012 classification validation set is used.

4.1. Different Number of Group Convolutions g

Different number of group convolutions g
  • With g = 1, there is no pointwise group convolution.
  • Models with group convolutions (g > 1) consistently perform better than the counterparts without pointwise group convolutions (g = 1).
  • Smaller models tend to benefit more from groups.
  • For example, for ShuffleNet 1× the best entry (g = 8) is 1.2% better than the counterpart, while for ShuffleNet 0.5× and 0.25×, the gaps become 3.5% and 4.4% respectively.
  • For some models (e.g. ShuffleNet 0.5×) when group numbers become relatively large (e.g. g = 8), the classification score saturates or even drops. With an increase in group number (thus wider feature maps), input channels for each convolutional filter become fewer, which may harm representation capability.

4.2. Shuffle vs No Shuffle

Shuffle vs No Shuffle
  • Channel shuffle consistently boosts classification scores for different settings, which shows the importance of cross-group information interchange.

5. Comparison with State-of-the-art Approaches

5.1. Comparison with Other Structure Units

Comparison with Other Structure Units
  • VGGNet, ResNet, Xception, and ResNeXt do not fully explore low-complexity conditions.
  • For a fair comparison, in the above table, the ShuffleNet units in Stages 2–4 are replaced by the other networks’ units, and the number of channels is then adapted to ensure the complexity remains unchanged.
  • ShuffleNet models outperform most others by a significant margin under different complexities.
  • For example, under the complexity of 38 MFLOPs, the output channels of Stage 4 (see Table 1) for the VGG-like, ResNet, ResNeXt, Xception-like, and ShuffleNet models are 50, 192, 192, 288, and 576 respectively, which is consistent with the increase in accuracy.
  • The GoogLeNet/Inception series is not included because it has too many hyperparameters to tune.
  • Another lightweight network structure, PVANET, has a 29.7% classification error with a computational complexity of 557 MFLOPs, while the ShuffleNet 2× model (g = 3) gets 26.3% at 524 MFLOPs.

5.2. Comparison with MobileNetV1

Comparison with MobileNetV1
  • ShuffleNet models are superior to MobileNetV1 for all the complexities.
  • Though the ShuffleNet architecture is specially designed for small models (< 150 MFLOPs), it is still better than MobileNetV1 at higher computation costs, e.g. 3.1% more accurate than MobileNetV1 at a cost of 500 MFLOPs.
  • The simple architecture design also makes it easy to equip ShuffleNets with the latest advances such as Squeeze-and-Excitation (SE) blocks. (Hope I can review SENet in the future.)
  • ShuffleNet with SE modules improves the top-1 error of ShuffleNet 2× to 24.7%, but is usually 25 to 40% slower than the “raw” ShuffleNet on mobile devices, which implies that actual speedup evaluation is critical in low-cost architecture design.

5.3. Comparison with Other Models

Comparison with Other Models

5.4. Generalization Ability

Object detection results on MS COCO
  • Here, MS COCO minival images are used for testing.
  • Faster R-CNN is used as the detection framework.
  • Comparing ShuffleNet 2× with MobileNetV1, whose complexities are comparable (524 vs. 569 MFLOPs), ShuffleNet 2× surpasses MobileNetV1 by a significant margin at both resolutions.
  • ShuffleNet 1× also achieves results comparable to MobileNetV1 at the 600× resolution, but with roughly 4× less complexity.
  • The authors conjecture that this significant gain is partly due to ShuffleNet’s simple architecture design.

5.5. Actual Speedup Evaluation

Actual Speedup Evaluation on ARM device
  • Empirically, g = 3 usually gives a proper trade-off between accuracy and actual inference time.
  • Due to memory access and other overheads, the authors find that every 4× theoretical complexity reduction usually results in only a ~2.6× actual speedup in their implementation.
  • Compared with AlexNet, the ShuffleNet 0.5× model still achieves a ~13× actual speedup under comparable classification accuracy (the theoretical speedup is 18×).

Hope I can review ShuffleNet V2 in the near future. :)
