Review: DPN — Dual Path Networks (Image Classification)

Better than ResNet, DenseNet, PolyNet, and ResNeXt; Winner of the ILSVRC 2017 Object Localization Challenge

Sik-Ho Tsang
Towards Data Science

--

In this story, DPN (Dual Path Network) is briefly reviewed. This is a work by National University of Singapore, Beijing Institute of Technology, National University of Defense Technology, and Qihoo 360 AI Institute. ResNet enables feature re-usage while DenseNet enables the exploration of new features. DPN takes the advantages of both ResNet and DenseNet, and it outperforms ResNet, DenseNet, PolyNet, and ResNeXt on the image classification task. DPN won the ILSVRC 2017 Object Localization Challenge. As a better backbone, it also obtains state-of-the-art results on object detection and semantic segmentation tasks. It was published as a 2017 NIPS paper with more than 100 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. ResNet, DenseNet and DPN
  2. Comparison with State-of-the-art Approaches

1. ResNet, DenseNet and DPN

1.1. DenseNet

(Figure: DenseNet represented as a HORNN)
  • The authors represent both ResNet and DenseNet as a Higher Order Recurrent Neural Network (HORNN) for explanation.
  • When DenseNet is expressed as a HORNN, it can be drawn as shown above.
  • The green arrows represent the weight-sharing convolutions; a minimal sketch of this dense connectivity follows below.
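To make the "new feature exploration" behavior concrete, here is a minimal PyTorch sketch of DenseNet-style connectivity. This is not the authors' code, and the channel sizes are illustrative: each layer appends its newly computed features onto the running state by concatenation.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One DenseNet-style layer: compute new features and concatenate
    them onto the incoming state (growth_rate is illustrative)."""
    def __init__(self, in_channels, growth_rate=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # The state keeps growing: every layer adds growth_rate new channels.
        return torch.cat([x, self.conv(x)], dim=1)
```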

1.2. ResNet

(Figure: ResNet (Left) and DenseNet (Right) in HORNN form)
  • A new path is added to temporarily save the outputs of the green arrow for reuse.
  • The dotted rectangle is in fact the residual path.
  • Residual networks are essentially densely connected networks, but with shared connections.
  • ResNet: feature refinement (feature reuse).
  • DenseNet: keeps exploring new features.
(Figure: Managing a Company)
  • It is just like managing a company:
  • Existing employees need to keep improving their skills (feature refinement).
  • The company also needs to hire fresh graduates (feature exploration).
  • The paper contains lengthy passages and equations for this interpretation of ResNet and DenseNet; if interested, please read the paper. The two mechanisms are contrasted in the sketch below.
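For contrast with the DenseLayer sketch above, here is an equally minimal sketch of a ResNet-style residual block (again illustrative, not the paper's exact bottleneck design). Addition keeps the state width fixed, so existing features are refined rather than grown.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One ResNet-style block: the state width never changes; the
    convolutional branch only refines the existing features."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Addition keeps the channel count constant: feature reuse/refinement.
        return self.relu(x + self.body(x))
```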

1.3. DPN

(Figure: DPN)
  • To enjoy both advantages, the network becomes the one shown above left.
  • After merging the two columns into a single column, DPN looks like the one shown above right; a simplified sketch of such a dual-path block follows below.
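As a rough illustration of the merged single column (the actual DPN blocks use 1×1, grouped 3×3, and 1×1 convolutions; this simplified sketch only shows how one shared output is split between the two paths), assuming the input width equals the current total width of both paths:

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Simplified dual-path block: the shared convolution produces
    res_channels + inc channels; the first part is added to the residual
    path, the rest is concatenated onto the dense path. in_channels must
    equal the current total width of the two paths."""
    def __init__(self, in_channels, res_channels, inc):
        super().__init__()
        self.res_channels = res_channels
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, res_channels + inc, kernel_size=3, padding=1),
        )

    def forward(self, res, dense):
        # The two paths share a single column before the convolution.
        out = self.conv(torch.cat([res, dense], dim=1))
        res = res + out[:, :self.res_channels]                          # ResNet path: refine
        dense = torch.cat([dense, out[:, self.res_channels:]], dim=1)  # DenseNet path: grow
        return res, dense
```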
(Table: Detailed Architecture and Complexity Comparison)
  • DPN is intentionally designed with a considerably smaller model size and fewer FLOPs than ResNeXt.
  • DPN-92 costs about 15% fewer parameters than ResNeXt-101 (32×4d), while DPN-98 costs about 26% fewer parameters than ResNeXt-101 (64×4d).
  • With a 224×224 input, DPN-92 consumes about 19% fewer FLOPs than ResNeXt-101 (32×4d), and DPN-98 consumes about 25% fewer FLOPs than ResNeXt-101 (64×4d).

2. Comparison with State-of-the-art Approaches

2.1. Image Classification

(Table: ImageNet-1k validation set results; +: Mean-Max Pooling)
  • A shallow DPN with a depth of only 92 reduces the top-1 error rate by an absolute 0.5% compared with ResNeXt-101 (32×4d) and by an absolute 1.5% compared with DenseNet-161, while requiring considerably fewer FLOPs.
  • A deeper DPN (DPN-98) surpasses the best residual network, ResNeXt-101 (64×4d), while still enjoying 25% fewer FLOPs and a much smaller model size (236 MB vs. 320 MB).
  • DPN-131 shows superior accuracy over the best single model, Very Deep PolyNet, with a much smaller model size (304 MB vs. 365 MB).
  • While PolyNet adopts numerous tricks such as Stochastic Depth (SD) during training, DPN-131 can be trained with a standard training strategy, and its actual training speed is about 2 times faster than that of PolyNet.
(Figure: Comparison of total actual cost between different models during training)
  • The actual training cost is compared above.
  • DPN-98 is 15% faster and uses 9% less memory than the best-performing ResNeXt, with a considerably lower testing error rate.
  • The deeper DPN-131 costs only about 19% more training time than the best-performing ResNeXt, but achieves state-of-the-art single-model performance.
  • The training speed of PolyNet (537 layers) [23] is about 31 samples per second based on a re-implementation in MXNet, showing that DPN-131 runs about 2 times faster than PolyNet during training.

2.2. Scene Classification

(Table: Places365-Standard dataset validation accuracy)
  • The Places365-Standard dataset is a high-resolution scene understanding dataset with more than 1.8 million images of 365 scene categories.
  • DPN-92 requires a much smaller model size (138 MB vs. 163 MB), which again demonstrates its high parameter efficiency and strong generalization ability.

2.3. Object Detection

(Table: PASCAL VOC 2007 test set results)
  • The model is trained on the union of the VOC 2007 trainval and VOC 2012 trainval sets and evaluated on the VOC 2007 test set, using the Faster R-CNN framework.
  • DPN obtains an mAP of 82.5%, a large improvement: 6.1% absolute over ResNet-101 and 2.4% absolute over ResNeXt-101 (32×4d).

2.4. Semantic Segmentation

(Table: PASCAL VOC 2012 test set results)
  • The segmentation framework is based on DeepLabv2. The 3×3 convolutional layers in conv4 and conv5 are replaced with atrous convolutions, and Atrous Spatial Pyramid Pooling (ASPP) is applied to the final feature maps of conv5 (a minimal sketch of both components follows this list).
  • DPN-92 achieves the highest overall mIoU, improving it by an absolute 1.7% over ResNet-101.
  • Considering that ResNeXt-101 (32×4d) improves the overall mIoU by only an absolute 0.5% over ResNet-101, the proposed DPN-92 gains more than 3 times the improvement of ResNeXt-101 (32×4d) (1.7% vs. 0.5%).
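Below is a minimal sketch of the two components mentioned in the first bullet; the channel counts, class count, and dilation rates are illustrative, not taken from the paper. An atrous (dilated) convolution keeps the spatial resolution while enlarging the receptive field, and a DeepLabv2-style ASPP applies several dilation rates in parallel and sums their score maps.

```python
import torch.nn as nn

# Atrous (dilated) 3x3 convolution: same output resolution as a stride-1
# 3x3 conv, but the receptive field is enlarged by the dilation rate.
atrous = nn.Conv2d(512, 512, kernel_size=3, padding=2, dilation=2)

class ASPP(nn.Module):
    """DeepLabv2-style ASPP sketch: parallel atrous branches over the same
    feature map, with per-branch class scores summed together."""
    def __init__(self, in_channels, num_classes, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, num_classes, kernel_size=3,
                      padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):
        # Summing the branch outputs fuses multiple receptive-field scales.
        return sum(branch(x) for branch in self.branches)
```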

2.5. ILSVRC 2017 Object Localization

(Figure: Visualization of object localization results)

2.6. ILSVRC 2017 Object Detection

(Figure: Visualization of object detection results)
