Review: MSDNet — Multi-Scale Dense Networks (Image Classification)

Resource Efficient Image Classification Using Multi-Scale DenseNet for Limited Computational Power Devices

Sik-Ho Tsang
Towards Data Science

--

In this story, MSDNet (Multi-Scale Dense Network), by Cornell University, Fudan University, Tsinghua University, and Facebook AI Research (FAIR), is reviewed. The authors invented DenseNet (2017 CVPR Best Paper Award, with over 2900 citations), and in 2018 ICLR they proposed MSDNet, a multi-scale DenseNet, which has gathered tens of citations. (Sik-Ho Tsang @ Medium)

With MSDNet, multiple classifiers with varying resource demands are trained jointly, so that the test-time computation can be adapted on the fly. On a device with ample compute, an image can pass through the entire network for classification. On a resource-limited device, an image can exit the network early for a faster, though possibly less accurate, classification. Let’s see how it works.

Outline

  1. Concepts of Image Classification with Computational Constraints
  2. MSDNet Architecture
  3. Evaluations in Anytime Classification and Budgeted Batch Classification
  4. Network Reduction and Lazy Evaluation
  5. Results

1. Concepts of Image Classification with Computational Constraints

There are two settings for image classification with computational constraints:

1.1. Anytime Classification

  • The network can be forced to output a prediction at any given point in time.
  • For example: Mobile apps on Android devices.

1.2. Budgeted Batch Classification

  • A fixed computational budget is shared across a large set of examples which can be spent unevenly across “easy” and “hard” examples. This is useful for large-scale machine learning applications.
  • For example: Search engines, social media companies, online advertising agencies, all must process large volumes of data on limited hardware resources.
  • As of 2010, Google Image Search had over 10 billion images indexed, a number that has likely grown to over 1 trillion since. Even if a new model to process these images were only 1/10 of a second slower per image, this additional cost would add 3170 years of CPU time.
  • In the budgeted batch classification setting, companies can improve the average accuracy by reducing the amount of computation spent on “easy” cases and saving it up for “hard” cases.
An Illustrative Example of MSDNet
  • As shown above, MSDNet is a Multi-Scale DenseNet: the upper paths operate on the image at full resolution, while the lower paths operate on progressively smaller scales.
  • Say, for example, we want to classify a cat image. Partway through the network, the “cat” prediction may already have a classification confidence of 0.6, which is larger than the threshold, so we can take an early exit. The rest of the network is skipped, saving computational time on such “easy” images.
  • On the other hand, “hard” images may need to go through more of the network until the classification confidence exceeds the threshold.
  • Hence, computational time is saved by rebalancing the effort spent on “easy” and “hard” images, as sketched below.
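Below is a minimal PyTorch-style sketch of this confidence-based early exit. The `stages` and `classifiers` modules and the threshold value are my own illustrative stand-ins for MSDNet’s blocks and intermediate classifiers, not the authors’ code.

```python
import torch.nn.functional as F

def early_exit_predict(x, stages, classifiers, threshold=0.6):
    """Propagate a single image through successive blocks and exit as soon
    as an intermediate classifier is confident enough.

    stages      -- list of nn.Module blocks (a hypothetical split of the network)
    classifiers -- list of nn.Module heads, one attached after each block
    threshold   -- confidence threshold on the maximum softmax probability
    """
    features = x
    prediction, confidence = None, None
    for stage, head in zip(stages, classifiers):
        features = stage(features)                 # run one more block
        probs = F.softmax(head(features), dim=1)
        confidence, prediction = probs.max(dim=1)  # assumes batch size 1
        if confidence.item() >= threshold:         # "easy" image: exit early
            return prediction, confidence
    return prediction, confidence                  # "hard" image: full network used
```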

2. MSDNet Architecture

2.1. Architecture

MSDNet Architecture
  • The first layer (l = 1) is special: its vertical layout can be seen as a miniature “S-layer” convolutional network (S = 3).
  • In this first layer, feature maps at coarser scales are obtained via down-sampling.
  • For subsequent layers (l > 1) at scale s, the feature maps are computed from the concatenation of all previous feature maps at scales s and s−1. Conv(1×1)-BN-ReLU-Conv(3×3)-BN-ReLU operations are used.
  • To be more precise, the below figure and table shows the feature maps used at certain s and l.
The feature maps used at certain s and l
  • At certain locations, there are some intermediate classifiers inserted into the middle of the network.
  • Each classifier has two down-sampling convolutional layers with 128 dimensional 3×3 filters, followed by a 2×2 average pooling layer and a linear layer.
  • During training, a logistic (cross-entropy) loss function L(fk) is used for each classifier fk, and the weighted cumulative loss is minimized:

    (1 / |D|) Σ_(x,y)∈D Σ_k wk · L(fk)

  • where D denotes the training set, and wk ⩾ 0 is the weight of classifier k.
  • Empirically, wk = 1 is used for all classifiers; a rough sketch of this objective follows below.
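As a rough PyTorch sketch of the classifier head and this training objective: the channel counts, the BN/ReLU between the head’s convolutions, the assumption that pooling leaves a 1×1 map, and the helper names are mine; the weights default to wk = 1 as in the paper.

```python
import torch.nn as nn

class IntermediateClassifier(nn.Module):
    """Classifier head as described above: two down-sampling 3x3 convs with
    128 filters, a 2x2 average pooling layer, then a linear layer.
    BN + ReLU between the convolutions is my assumption; the input is assumed
    to be the coarsest-scale feature map (e.g. 8x8 on CIFAR), so that pooling
    leaves a 1x1 spatial map."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=2),
        )
        self.linear = nn.Linear(128, num_classes)

    def forward(self, x):
        return self.linear(self.features(x).flatten(1))

def weighted_cumulative_loss(all_logits, targets, weights=None):
    """Weighted sum over the K classifiers of the cross-entropy loss
    (the logistic loss L(f_k)), averaged over the mini-batch as usual for SGD.
    w_k = 1 for every k by default."""
    criterion = nn.CrossEntropyLoss()
    if weights is None:
        weights = [1.0] * len(all_logits)
    return sum(w * criterion(logits, targets)
               for w, logits in zip(weights, all_logits))
```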

2.2. Evaluation of Intermediate Classifiers Using Different Networks

  • You may ask: why not just insert intermediate classifiers into ResNet or DenseNet? Why do we need MSDNet? The authors evaluated this as well. There are two main reasons.
Evaluation of Intermediate Classifiers Using Different Networks on CIFAR-100 Dataset

2.2.1. First Reason

  • Problem: The lack of coarse-level features. Traditional neural networks learn features of fine scale in early layers and coarse scale in later layers. Early layers lack coarse-level features, and early-exit classifiers attached to these layers will likely yield unsatisfactorily high error rates.
  • The left part of the figure above shows the results when intermediate classifiers are inserted into ResNet and DenseNet as well. The accuracy of a classifier is highly correlated with its position within the network. Particularly in the case of the ResNet (blue line), one can observe a visible “staircase” pattern, with big improvements after the 2nd and 4th classifiers, which are located right after pooling layers.
  • Solution: Multi-scale feature maps. MSDNet maintains a feature representation at multiple scales throughout the network, and all the classifiers only use the coarse-level features.
  • The horizontal connections preserve and progress high-resolution information, which facilitates the construction of high-quality coarse features in later layers. The vertical connections produce coarse features throughout that are amenable to classification.

2.2.2. Second Reason

  • Problem: Early classifiers interfere with later classifiers. The above figure at the right shows the accuracies of the final classifier as a function of the location of a single intermediate classifier, relative to the accuracy of a network without intermediate classifiers.
  • The introduction of an intermediate classifier harms the final ResNet classifier (blue line), reducing its accuracy by up to 7%. This accuracy degradation in the ResNet may be caused by the intermediate classifier influencing the early features to be optimized for the short-term and not for the final layers.
  • Solution: Dense connectivity. By contrast, the DenseNet (red line) suffers much less from this effect. This is because in DenseNet, feature maps are combined by concatenation instead of the addition used in ResNet, so feature maps from earlier layers can be passed directly, via dense connections, to later layers. The final classifier’s performance becomes (more or less) independent of the location of the intermediate classifier.

3. Evaluations in Anytime Classification and Budgeted Batch Classification

3.1. Anytime Classification

  • In anytime classification, there is a finite computational budget B > 0 available for each test example.
  • During testing in the anytime setting, the input propagates through the network until the budget B is exhausted, and the most recent prediction is output, as in the sketch below.
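A small sketch of that anytime evaluation loop follows; the per-block `costs` (e.g. measured FLOPs), the module names, and the stage split are hypothetical, not the authors’ implementation.

```python
import torch.nn.functional as F

def anytime_predict(x, stages, classifiers, costs, budget):
    """Propagate through successive blocks until the budget is exhausted
    and return the most recent intermediate prediction.

    costs  -- hypothetical per-block computational cost (e.g. FLOPs)
    budget -- total computational budget B for this example
    """
    features, latest, spent = x, None, 0.0
    for stage, head, cost in zip(stages, classifiers, costs):
        if spent + cost > budget:      # evaluating the next block would overshoot B
            break
        features = stage(features)
        latest = F.softmax(head(features), dim=1)
        spent += cost
    return latest                      # most recent prediction (None if B is too small)
```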

3.2. Budgeted Batch Classification

  • In budgeted batch classification, the model needs to classify a set of examples Dtest = {x1, …, xM} within a finite computational budget B > 0 that is known in advance.
  • It can potentially do so by spending less than B/M computation on classifying an “easy” example whilst using more than B/M computation on classifying a “difficult” example.
  • Therefore, the budget B considered here is a soft constraint when we have a large batch of testing samples.
  • Then, dynamic evaluation is used to solve this problem:
  • At test time, an example traverses the network and exits after classifier fk if its prediction confidence (the maximum softmax probability is used as the confidence measure) exceeds a predetermined threshold θk.
  • Before training, we compute the computational cost, Ck, required to process the network up to the kth classifier.
  • We denote by 0 < q ≤ 1 a fixed exit probability that a sample that reaches a classifier will obtain a classification with sufficient confidence to exit.
  • The probability that a sample exits at classifier k is then:

    qk = z · q · (1 − q)^(k−1)

  • where z is a normalizing constant ensuring that the qk sum to one: Σk qk = 1.
  • And we need to ensure that the overall cost of classifying all samples in Dtest does not exceed our budget B, which gives rise to the constraint:

    |Dtest| · Σk qk · Ck ≤ B

  • Then, we can solve the above for q and assign the thresholds θk on a hold-out/validation set such that approximately a fraction qk of validation samples exits at the kth classifier; a rough sketch of this procedure follows below.
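The following NumPy sketch shows one way this could look. The bisection on q and the greedy threshold assignment are my illustrative reading of the procedure under the formulas above, not the authors’ exact implementation; all function and variable names are hypothetical.

```python
import numpy as np

def exit_distribution(q, K):
    """q_k proportional to q * (1 - q)^(k - 1) for k = 1..K, normalised
    so that the q_k sum to one (the constant z in the text)."""
    raw = q * (1.0 - q) ** np.arange(K)
    return raw / raw.sum()

def solve_q(costs, budget, num_samples, tol=1e-6):
    """Bisection on q so that the expected total cost
    |Dtest| * sum_k q_k * C_k stays within the budget B.
    Assumes costs are increasing, so larger q -> more early exits -> lower cost."""
    costs = np.asarray(costs, dtype=float)
    lo, hi = tol, 1.0 - tol
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        expected = num_samples * (exit_distribution(mid, len(costs)) * costs).sum()
        if expected > budget:
            lo = mid        # too expensive: push more mass to early exits
        else:
            hi = mid
    return hi

def assign_thresholds(val_confidences, q_k):
    """Pick theta_k so that roughly a fraction q_k of the validation samples
    exits at classifier k (greedy, classifier by classifier).
    val_confidences[k] holds each sample's max softmax at classifier k."""
    n_total = len(val_confidences[0])
    thresholds, remaining = [], np.arange(n_total)
    for k, frac in enumerate(q_k):
        conf = np.asarray(val_confidences[k])[remaining]
        n_exit = min(int(round(frac * n_total)), len(conf))
        if n_exit == 0:
            thresholds.append(np.inf)        # nobody exits here
            continue
        theta = np.sort(conf)[::-1][n_exit - 1]
        thresholds.append(theta)
        remaining = remaining[conf < theta]  # the rest move on to classifier k + 1
    return thresholds
```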

4. Network Reduction and Lazy Evaluation

  • There are two straightforward ways to further reduce the computational requirements of MSDNets.
  • First, it is inefficient to maintain all the finer scales until the last layer of the network. One simple strategy to reduce the size of the network is to split it into S blocks along the depth dimension and keep only the coarsest (S−i+1) scales in the ith block, as shown above (see the small sketch after this list). This reduces computational cost for both training and testing.
  • Second, since a classifier at layer l only uses features from the coarsest scale, the finer feature maps in layer l (and some of the finer feature maps in the previous S-2 layers) do not influence the prediction of that classifier. Therefore, the computation in “diagonal blocks” is grouped such that we only propagate the example along paths that are required for the evaluation of the next classifier. This minimizes unnecessary computations when we need to stop because the computational budget is exhausted. This strategy is called lazy evaluation.
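As a tiny illustration of the first reduction, this is pure bookkeeping; the convention that scale 1 is the finest and scale S the coarsest is my own labelling.

```python
def scales_kept_per_block(S):
    """Block i (1-indexed) keeps only the coarsest S - i + 1 scales,
    i.e. the finest scales are dropped one by one along the depth."""
    return {i: list(range(i, S + 1)) for i in range(1, S + 1)}

# With S = 3 scales (1 = finest, 3 = coarsest):
# {1: [1, 2, 3], 2: [2, 3], 3: [3]}
print(scales_kept_per_block(3))
```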

5. Results

5.1. Datasets

  • CIFAR-10 & CIFAR-100: The two CIFAR datasets contain 50,000 training and 10,000 test images of 32×32 pixels. 5,000 training images are held out as a validation set. The datasets comprise 10 and 100 classes, respectively. Standard data augmentation (random cropping and horizontal flipping) is applied to the training set. The mini-batch size is 64.
  • ImageNet: The ImageNet dataset comprises 1,000 classes, with a total of 1.2 million training images and 50,000 validation images. 50,000 images are held out from the training set to estimate the confidence thresholds for classifiers in MSDNet. Standard data augmentation is applied. At test time, images are resized to 256×256 pixels and the 224×224 centre crop is classified. The mini-batch size is 256.
  • On ImageNet, 4 scales are used, i.e. S=4, respectively producing 16, 32, 64, and 64 feature maps at each layer. The original images are first transformed by a 7×7 convolution and a 3×3 max pooling (both with stride 2), before entering the first layer of MSDNets.
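A sketch of that initial transform in PyTorch; the output channel count, the BN/ReLU, and the padding are my assumptions, since the text only specifies the kernel sizes and strides.

```python
import torch.nn as nn

# Initial transform applied to ImageNet images before the first MSDNet layer:
# a 7x7 convolution and a 3x3 max pooling, both with stride 2.
imagenet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),  # 64 output channels assumed
    nn.BatchNorm2d(64),                                     # BN + ReLU assumed
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
# 224x224 input -> 112x112 after the convolution -> 56x56 after the pooling.
```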

5.2. Ablation Study

Ablation Study on CIFAR-100
  • An MSDNet with six intermediate classifiers is used, and its three main components, multi-scale feature maps, dense connectivity, and intermediate classifiers, are removed one at a time.
  • If all the three components in an MSDNet are removed, a regular VGG-like convolutional network is obtained.
  • To make the comparisons fair, the computational costs of the full networks are kept similar, at around 3.0×10⁸ FLOPs, by adapting the network width, i.e., the number of output channels at each layer.
  • The original MSDNet (Black) has the highest accuracy of course.
  • With dense connectivity removed (Orange), the overall accuracy is hurt drastically.
  • With the multi-scale convolution additionally removed (Light Blue), the accuracy is hurt only in the lower-budget regions. This is consistent with the authors’ motivation that the multi-scale design introduces discriminative features early on.
  • The authors also mention that with all three components removed, the resulting network (Star) performs similarly to MSDNet under one specific budget. (But I cannot find the Star in the figure…)

5.3. Anytime Classification

  • On CIFAR, the MSDNet network has 24 layers.
  • The classifiers operate on the output of the 2×(i+1)th layers, with i=1,…,11.
  • On ImageNet, the ith classifier operates on the (k×i+3)th layer, with i=1,…,5, where k=4, 6 and 7.
Top-1 Accuracy of Anytime Classification on ImageNet (Left), CIFAR-100 (Middle) & CIFAR-10 (Right)
  • ResNetMC: ResNet with MC (Multiple Classifiers), 62 layers, with 10 residual blocks at each spatial resolution (for three resolutions): Early-exit classifiers are on the output of the 4th and 8th residual blocks at each resolution, producing a total of 6 intermediate classifiers (plus the final classification layer).
  • DenseNetMC: DenseNet with MC, 52 layers, with three dense blocks and each of them has 16 layers. The six intermediate classifiers are attached to the 6th and 12th layer in each block, also with dense connections to all previous layers in that block.
  • Both ResNetMC and DenseNetMC require about 1.3×10⁸ FLOPs when fully evaluated.
  • Ensembles of ResNets/DenseNets with varying depths are also evaluated. At test time, the networks are evaluated sequentially (in ascending order of network size) to obtain predictions for the test data. All predictions are averaged over the evaluated classifiers. On ImageNet, the ensembles contain ResNets with depths varying from 10 to 50 layers and DenseNets with depths varying from 36 to 121 layers, respectively.
  • On CIFAR-100, MSDNet substantially outperforms ResNetMC and DenseNetMC across the whole budget range. This is due to the fact that after just a few layers, MSDNets have produced low-resolution feature maps that are much more suitable for classification than the high-resolution feature maps in the early layers of ResNets or DenseNets.
  • In the extremely low-budget regime, ensembles have an advantage because their predictions are performed by the first (small) network, which is optimized exclusively for the low budget. However, the accuracy of ensembles does not increase nearly as fast when the budget is increased.
  • Unlike MSDNets, the ensembles repeat the computation of similar low-level features over and over.
  • Ensemble accuracies saturate rapidly when all networks are shallow.

5.4. Budgeted Batch Classification

  • On CIFAR-10 & CIFAR-100, the MSDNet networks range from 10 to 36 layers. The kth classifier is attached to the (1+…+k)th layer.
  • On ImageNet, same networks are used as the one in anytime classification.
Top-1 Accuracy of Budgeted Batch Classification on ImageNet (Left), CIFAR-100 (Middle) & CIFAR-10 (Right)
  • In budgeted batch classification, the predictive model receives a batch of M instances and a computational budget B for classifying all M instances. Dynamic evaluation is used.
  • On ImageNet, M=128, five DenseNets, five ResNets, one AlexNet and one GoogLeNet are compared.
  • Ensemble of five ResNets: “Easy” images are only propagated through the smallest ResNet-10, whereas “hard” images are classified by all five ResNet models. (predictions are averaged across all evaluated networks in the ensemble).
  • On CIFAR-100, M=256, ResNets, DenseNets of varying sizes, Stochastic Depth Networks, Wide ResNets, FractalNets, ResNetMC and DenseNetMC are compared.
  • As shown in the figure above, three MSDNets with different depths are used so that, together, they cover a large range of computational budgets.
  • On ImageNet, for instance, with an average budget of 1.7×10⁹ FLOPs, MSDNet achieves a top-1 accuracy of ~75%, which is ~6% higher than that achieved by a ResNet with the same number of FLOPs.
  • Compared to the computationally efficient DenseNets, MSDNet uses ~2 to 3× fewer FLOPs to achieve the same classification accuracy.
  • On CIFAR-100, MSDNets consistently outperform all baselines across all budgets.
  • MSDNet performs on par with a 110-layer ResNet using only 1/10th of the computational budget.
  • MSDNet is up to 5 times more efficient than DenseNets, Stochastic Depth Networks, Wide ResNets, and FractalNets.
  • Similar to the results in the anytime-prediction setting, MSDNet substantially outperforms ResNetMC and DenseNetMC with multiple intermediate classifiers, which provides further evidence that the coarse features in MSDNet are important for high performance in earlier layers.

5.5. Visualization

Visualization on Easy and Hard Images on ImageNet
  • Easy images (top row): Exited at the first classifier and correctly classified.
  • Hard images (bottom row): Incorrectly classified when exited at the first classifier, but correctly classified by the last classifier; these tend to be non-typical images.

5.6. More Computational Efficient DenseNets

  • A more efficient DenseNet was found and investigated during the exploration of MSDNet, which the authors consider an interesting finding in its own right.
Anytime Classification (Left) and Budgeted Batch Classification (Right)
  • DenseNet*: The original DenseNets are modified by doubling the growth rate after each transition layer, so that more filters are applied to low resolution feature maps.
  • DenseNet* (Green) significantly outperforms the original DenseNet (Red) in terms of computational efficiency.
  • In anytime classification, an ensemble of DenseNets* of varying depths is just slightly worse than MSDNets.
  • In budgeted batch classification, MSDNets still substantially outperform an ensemble of DenseNets* of varying depths.

For future work, the authors plan to investigate MSDNet beyond classification (e.g. image segmentation), and to combine MSDNets with model compression, spatially adaptive computation, and more efficient convolution operations. To me, this paper has many important facts and concepts, which is why I wrote such a long story.
