Review: NIN — Network In Network (Image Classification)
Using Convolution Layers With 1×1 Convolution Kernels
In this story, Network In Network (NIN), by the Graduate School for Integrative Sciences and Engineering and National University of Singapore, is briefly reviewed. Micro neural networks with more complex structures are built to abstract the data within the receptive field. This is a 2014 ICLR paper with more than 2300 citations. (Sik-Ho Tsang @ Medium)
Outline
- Linear Convolutional Layer VS mlpconv Layer
- Fully Connected Layer VS Global Average Pooling Layer
- Overall Structure of Network In Network (NIN)
- Results
1. Linear Convolutional Layer VS mlpconv Layer
1.1. Linear Convolutional Layer
- In the feature map of a linear convolutional layer, (i, j) is the pixel index, x_ij stands for the input patch centered at location (i, j), and k is used to index the channels of the feature map.
- However, representations that achieve good abstraction are generally highly nonlinear functions of the input data.
- The authors argue that it would be beneficial to do a better abstraction on each local patch before combining the patches into higher-level concepts.
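The formula that these symbols belong to did not survive in the text. As a reconstruction from the definitions above, assuming a ReLU activation and writing w_k for the weight vector of the k-th filter (a notation assumed here), the feature map of a linear convolutional layer can be written as:

```latex
f_{i,j,k} = \max\left( w_k^{T} x_{i,j},\ 0 \right)
```

The only nonlinearity is the max applied to a single linear projection of the patch, which is the limitation the mlpconv layer is meant to address.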
1.2. mlpconv Layer
- An mlpconv layer replaces the linear filter with a multilayer perceptron that slides over the input. n is the number of layers in the multilayer perceptron, and the rectified linear unit is used as its activation function.
- This structure allows complex and learnable interactions of cross-channel information.
- Each perceptron layer applied across channels is equivalent to a convolution layer with a 1×1 convolution kernel.
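The equivalence with 1×1 convolution can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it covers only the cross-channel (1×1) part of an mlpconv layer, and the channel counts, map size, and helper names (`conv1x1`, `per_pixel_dense`) are made up for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, w, b):
    """1x1 convolution. x: (C_in, H, W), w: (C_out, C_in), b: (C_out,)."""
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

def per_pixel_dense(x, w, b):
    """Apply the same dense layer to the channel vector at every pixel."""
    c_in, h, wd = x.shape
    pixels = x.reshape(c_in, h * wd).T      # (H*W, C_in): one row per pixel
    out = pixels @ w.T + b                  # (H*W, C_out)
    return out.T.reshape(-1, h, wd)         # back to (C_out, H, W)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))          # hypothetical: 8 channels, 5x5 map
w1, b1 = rng.standard_normal((16, 8)), rng.standard_normal(16)
w2, b2 = rng.standard_normal((4, 16)), rng.standard_normal(4)

# Two stacked 1x1 convolutions, each followed by ReLU.
as_conv = relu(conv1x1(relu(conv1x1(x, w1, b1)), w2, b2))
# The same weights used as a two-layer perceptron at each pixel.
as_mlp = relu(per_pixel_dense(relu(per_pixel_dense(x, w1, b1)), w2, b2))

same = np.allclose(as_conv, as_mlp)
print(same)
```

Because the perceptron weights are shared across all spatial positions, the two computations differ only in how the arrays are reshaped.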
2. Fully Connected Layer VS Global Average Pooling Layer
2.1. Fully Connected Layer
- Usually, fully connected layers are used at the end of the network.
- However, they are prone to overfitting.
2.2. Global Average Pooling Layer
- Here, global average pooling is introduced.
- The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the softmax layer.
- One advantage is that it is more native to the convolution structure by enforcing correspondences between feature maps and categories.
- Another advantage is that global average pooling has no parameters to optimize, so overfitting is avoided at this layer.
- Furthermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input.
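The points above can be sketched in a few lines of NumPy. This is an illustrative example with made-up sizes rather than the paper's configuration. The circular shift is an idealized translation: real translations change what falls inside the image border, so the robustness is approximate in practice.

```python
import numpy as np

def global_average_pool(feature_maps):
    """(K, H, W) feature maps -> (K,) vector, one average per map."""
    return feature_maps.mean(axis=(1, 2))

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
num_classes, h, w = 10, 6, 6              # hypothetical sizes
maps = rng.standard_normal((num_classes, h, w))  # one map per category

probs = softmax(global_average_pool(maps))

# A circular shift moves activations around but leaves each average unchanged.
shifted = np.roll(maps, shift=(2, 1), axis=(1, 2))
shift_invariant = np.allclose(global_average_pool(maps),
                              global_average_pool(shifted))

# For comparison: parameters of a fully connected layer on the flattened maps.
fc_weights = (num_classes * h * w) * num_classes + num_classes
gap_weights = 0
```

With these toy sizes a fully connected classifier would already need 3610 parameters, while global average pooling needs none.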
3. Overall Structure of Network In Network (NIN)
- Overall, NIN is a stack of mlpconv layers with global average pooling at the end.
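A forward pass through such a stack might look like the sketch below. Everything specific in it is an assumption made for illustration: three mlpconv blocks, the kernel sizes and channel counts, 2×2 max pooling between blocks, and the helper names. Bias terms and dropout are omitted to keep it short.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def relu(x):
    return np.maximum(x, 0.0)

def conv2d(x, w):
    """Valid convolution, stride 1. x: (C, H, W), w: (O, C, kh, kw)."""
    windows = sliding_window_view(x, w.shape[2:], axis=(1, 2))  # (C, H', W', kh, kw)
    return np.einsum("chwij,ocij->ohw", windows, w)

def mlpconv(x, weights):
    """One k x k convolution followed by 1x1 convolutions, ReLU after each."""
    for w in weights:
        x = relu(conv2d(x, w))
    return x

def max_pool2(x):
    """2x2 max pooling with stride 2 (a trailing odd row/column is dropped)."""
    c, h, w = x.shape
    x = x[:, : h // 2 * 2, : w // 2 * 2]
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def init_block(rng, c_in, c_out, k):
    """Random weights for one mlpconv block (hypothetical shapes)."""
    shapes = [(c_out, c_in, k, k), (c_out, c_out, 1, 1), (c_out, c_out, 1, 1)]
    return [0.1 * rng.standard_normal(s) for s in shapes]

rng = np.random.default_rng(0)
num_classes = 10
blocks = [
    init_block(rng, 3, 16, 5),
    init_block(rng, 16, 16, 3),
    init_block(rng, 16, num_classes, 3),  # last block: one map per class
]

x = rng.standard_normal((3, 32, 32))      # a CIFAR-sized RGB input
x = max_pool2(mlpconv(x, blocks[0]))      # 32 -> 28 -> 14
x = max_pool2(mlpconv(x, blocks[1]))      # 14 -> 12 -> 6
x = mlpconv(x, blocks[2])                 # 6 -> 4
logits = x.mean(axis=(1, 2))              # global average pooling
```

The last block outputs as many feature maps as there are classes, so the pooled vector can go straight into a softmax without any fully connected layer.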
4. Results
4.1. CIFAR-10
- NIN + Dropout achieves an error rate of only 10.41%, which is better than Maxout + Dropout.
- With data augmentation (translation and horizontal flipping), NIN reaches an even lower error rate of 8.81%.
- (If interested, there is a very short introduction of Maxout in NoC.)
- As shown above, introducing dropout layers in between the mlpconv layers reduced the test error by more than 20%.
4.2. CIFAR-100
- Similarly, NIN + Dropout achieves an error rate of only 35.68%, which is better than Maxout + Dropout.
4.3. Street View House Numbers (SVHN)
- On SVHN, however, NIN + Dropout gets a 2.35% error rate, which is worse than DropConnect.
4.4. MNIST
- On MNIST, NIN + Dropout gets a 0.47% error rate, which is slightly worse than Maxout + Dropout.
4.5. Global Average Pooling as a Regularizer
- With global average pooling, NIN gets a 10.41% error rate, which is better than the 10.88% of fully connected + dropout.
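To put the gap between the two numbers in perspective, the relative error reduction can be computed directly from them. The helper name below is made up for this note.

```python
def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    return (baseline_error - new_error) / baseline_error

fc_dropout_error = 10.88   # fully connected + dropout, % test error on CIFAR-10
gap_error = 10.41          # global average pooling, % test error on CIFAR-10

absolute_gain = fc_dropout_error - gap_error   # in percentage points
relative_gain = relative_reduction(fc_dropout_error, gap_error)
print(f"{absolute_gain:.2f} points, {relative_gain:.1%} relative")
# prints "0.47 points, 4.3% relative"
```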
In NIN, the 1×1 convolutions introduce more non-linearity, which lowers the error rate.
Reference
[2014 ICLR] [NIN]
Network In Network
My Previous Reviews
Image Classification
[LeNet] [AlexNet] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [MSDNet]
Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]
Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3]
Biomedical Image Segmentation
[CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA]
Instance Segmentation
[SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]
Super Resolution
[SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet]
Human Pose Estimation
[DeepPose] [Tompson NIPS’14] [Tompson CVPR’15]