Review: DenseNet — Dense Convolutional Network (Image Classification)

Sik-Ho Tsang
Towards Data Science
6 min read · Nov 25, 2018


In this story, DenseNet (Dense Convolutional Network) is reviewed. The paper was published at 2017 CVPR, where it won the Best Paper Award, and it has over 2000 citations. It is a joint work by Cornell University, Tsinghua University and Facebook AI Research (FAIR). (Sik-Ho Tsang @ Medium)

With dense connections, DenseNet achieves higher accuracy with fewer parameters compared with ResNet and Pre-Activation ResNet. So, let’s see how it works.

What Is Covered

  1. Dense Block
  2. DenseNet Architecture
  3. Advantages of DenseNet
  4. CIFAR & SVHN Small-Scale Dataset Results
  5. ImageNet Large-Scale Dataset Results
  6. Further Analysis on Feature Reuse

1. Dense Block

Standard ConvNet Concept

In a standard ConvNet, the input image goes through multiple convolutions to obtain high-level features.

ResNet Concept

In ResNet, identity mapping is proposed to promote gradient propagation. Element-wise addition is used. It can be viewed as an algorithm with a state passed from one ResNet module to the next.

One Dense Block in DenseNet

In DenseNet, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers. Concatenation is used. Each layer receives “collective knowledge” from all preceding layers.

Dense Block in DenseNet with Growth Rate k

Since each layer receives feature maps from all preceding layers, the network can be thinner and more compact, i.e. the number of channels can be smaller. The growth rate k is the number of additional channels each layer contributes.

So, it has higher computational efficiency and memory efficiency. The following figure shows the concept of concatenation during forward propagation:

Concatenation during Forward Propagation
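To make the concatenation above concrete, here is a minimal PyTorch-style sketch of a dense block (module and variable names are my own, not from the paper; it uses plain convolutions only, the BN-ReLU parts are covered in Section 2):

```
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Toy dense block: each layer sees the concatenation of all earlier outputs."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            # layer i maps (in_channels + i*k) input channels to k new channels
            nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                      kernel_size=3, padding=1, bias=False)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]                                  # x0
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))     # input: all preceding feature maps
            features.append(out)                        # pass its own k maps to later layers
        return torch.cat(features, dim=1)               # the block's "collective knowledge"
```

For example, with in_channels=16, growth rate k=12 and num_layers=4, the block outputs 16 + 4×12 = 64 channels.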

2. DenseNet Architecture

2.1. Basic DenseNet Composition Layer

Composition Layer

Each composition layer applies Pre-Activation Batch Norm (BN) and ReLU, then a 3×3 Conv, producing output feature maps of k channels, e.g. to transform x0, x1, x2, x3 into x4. This is the idea from Pre-Activation ResNet.
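A hedged sketch of one such composition layer in PyTorch (the helper name composition_layer is mine):

```
import torch.nn as nn

def composition_layer(in_channels, k):
    # Pre-activation ordering: BN and ReLU come before the 3x3 convolution,
    # which outputs the k new feature maps (k = growth rate).
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, k, kernel_size=3, padding=1, bias=False),
    )
```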

2.2. DenseNet-B (Bottleneck Layers)

DenseNet-B

To reduce the model complexity and size, BN-ReLU-1×1 Conv is done before BN-ReLU-3×3 Conv.
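A minimal sketch of the bottleneck version; the 4k intermediate channels follow the paper's choice for DenseNet-B, while the helper name is mine:

```
import torch.nn as nn

def bottleneck_layer(in_channels, k):
    # The 1x1 Conv first squeezes the (possibly very wide) input down to 4k channels,
    # then the 3x3 Conv produces the k new feature maps.
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, 4 * k, kernel_size=1, bias=False),
        nn.BatchNorm2d(4 * k),
        nn.ReLU(inplace=True),
        nn.Conv2d(4 * k, k, kernel_size=3, padding=1, bias=False),
    )
```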

2.3. Multiple Dense Blocks with Transition Layers

Multiple Dense Blocks

A 1×1 Conv followed by 2×2 average pooling is used as the transition layer between two contiguous dense blocks.

Feature-map sizes are the same within a dense block so that the feature maps can be concatenated easily.

At the end of the last dense block, a global average pooling is performed and then a softmax classifier is attached.

2.4. DenseNet-BC (Further Compression)

If a dense block contains m feature maps, the transition layer generates θm output feature maps, where 0<θ≤1 is referred to as the compression factor.

When θ=1, the number of feature maps across transition layers remains unchanged. DenseNet with θ<1 is referred to as DenseNet-C, and θ=0.5 is used in the experiments.

When both the bottleneck layers and the transition layers with θ<1 are used, the model is referred to as DenseNet-BC.

Finally, DenseNets with/without B/C, with different depths L and growth rates k, are trained.
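Putting the transition layer and the compression factor together, a rough sketch (I include the usual BN-ReLU before the 1×1 Conv; treat the exact pre-activation details as an assumption):

```
import torch.nn as nn

def transition_layer(in_channels, theta=0.5):
    out_channels = int(in_channels * theta)     # compression: keep only θ·m feature maps
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
        nn.AvgPool2d(kernel_size=2, stride=2),  # halve the spatial resolution
    )
```

With θ=0.5, a dense block that ends with, say, m=256 feature maps hands only 128 of them to the next block, which is what keeps DenseNet-BC so compact.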

3. Advantages of DenseNet

3.1. Strong Gradient Flow

Implicit “Deep Supervision”

The error signal can be propagated to earlier layers more directly. This is a kind of implicit deep supervision, as earlier layers can get direct supervision from the final classification layer.

3.2. Parameter & Computational Efficiency

Number of Parameters for ResNet and DenseNet

For each layer, the number of parameters in ResNet is directly proportional to C×C, while the number of parameters in DenseNet is directly proportional to l×k×k.

Since k << C, DenseNet is much smaller than ResNet.
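For a rough sense of scale (the numbers below are my own illustration, not from the paper):

```
# Per-layer parameter scaling terms, illustrative values only
C = 256          # channels of a typical ResNet layer
k = 12           # DenseNet growth rate
l = 12           # number of preceding layers feeding the DenseNet layer

resnet_term   = C * C        # proportional to C×C   -> 65,536
densenet_term = l * k * k    # proportional to l×k×k -> 1,728

print(resnet_term / densenet_term)   # roughly 38x fewer parameters for this layer
```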

3.3. More Diversified Features

More Diversified Features in DenseNet

Since each layer in DenseNet receives the feature maps of all preceding layers as input, it learns more diversified features and tends to have richer patterns.

3.4. Maintains Low Complexity Features

Standard ConvNet

In a standard ConvNet, the classifier uses only the most complex features, i.e. those of the last layer.

DenseNet

In DenseNet, the classifier uses features of all complexity levels. This tends to give smoother decision boundaries. It also explains why DenseNet performs well when training data is insufficient.

4. CIFAR & SVHN Small-Scale Dataset Results

4.1. CIFAR-10

CIFAR-10 Results

Pre-Activation ResNet is used for the detailed comparison.

With data augmentation (C10+), test error:

  • Small-size ResNet-110: 6.41%
  • Large-size ResNet-1001 (10.2M parameters): 4.62%
  • State-of-the-art (SOTA): 4.2%
  • Small-size DenseNet-BC (L=100, k=12) (Only 0.8M parameters): 4.5%
  • Large-size DenseNet (L=250, k=24): 3.6%

Without data augmentation (C10), test error:

  • Small-size ResNet-110: 11.26%
  • Large-size ResNet-1001 (10.2M parameters): 10.56%
  • State-of-the-art (SOTA): 7.3%
  • Small-size DenseNet-BC (L=100, k=12) (Only 0.8M parameters): 5.9%
  • Large-size DenseNet (L=250, k=24): 4.2%

Severe overfitting appears in Pre-Activation ResNet, while DenseNet performs well when training data is insufficient, since DenseNet uses features of all complexity levels.

C10+: Different DenseNet Variants (Left), DenseNet vs ResNet (Middle), Training and Testing Curves of DenseNet and ResNet (Right)

Left: DenseNet-BC obtains the best results.

Middle: Pre-Activation ResNet already has fewer parameters than AlexNet and VGGNet, and DenseNet-BC (k=12) has 3× fewer parameters than Pre-Activation ResNet with the same test error.

Right: DenseNet-BC-100 with 0.8M parameters achieves a test error similar to that of Pre-Activation ResNet-1001 with 10.2M parameters.

4.2. CIFAR-100

Similar trends are observed on CIFAR-100, as shown below:

CIFAR-100 Results

4.3. Detailed Results

Detailed Results, + means data augmentation

SVHN is the Street View House Numbers dataset. Blue indicates the best result. On SVHN, DenseNet-BC does not achieve a better result than the basic DenseNet; the authors argue that SVHN is a relatively easy task and extremely deep models may overfit the training set.

5. ImageNet Large-Scale Dataset Results

Different DenseNet Top-1 and Top-5 Error Rates with Single-Crop (10-Crop) Results
ImageNet Validation Set Results Compared with Original ResNet

The original ResNet implementation is used for the detailed comparison.

Left: DenseNet-201 with 20M parameters yields a validation error similar to that of ResNet-101 with more than 40M parameters.

Right: Similar trends hold for the number of computations (GFLOPs).

Bottom: DenseNet-264 (k=48) achieves the best results: 20.27% Top-1 error and 5.17% Top-5 error.

6. Further Analysis on Feature Reuse

Heat map of the average absolute weights, showing how much a target layer (l) reuses each source layer (s)
  • Features extracted by very early layers are directly used by deeper layers throughout the same dense block.
  • Transition layers also spread their weights across all preceding layers.
  • Layers within the second and third dense blocks consistently assign the least weight to the outputs of the transition layers. (The first row)
  • At the final classification layer, the weights seem to concentrate on the final feature maps, suggesting that some high-level features are produced late in the network.

The authors have also published Multi-Scale DenseNet. I hope to cover it later as well.
