Review: NIN — Network In Network (Image Classification)

Using Convolution Layers With 1×1 Convolution Kernels

Published in Towards Data Science, Apr 25, 2019

**A few example images from the CIFAR10 dataset.**

In this story, Network In Network (NIN), by the Graduate School for Integrative Sciences and Engineering and the National University of Singapore, is briefly reviewed. Micro neural networks with more complex structures are built to abstract the data within the receptive field. This is a 2014 ICLR paper with more than 2300 citations. (Sik-Ho Tsang @ Medium)

Outline

1. Linear Convolutional Layer VS mlpconv Layer
2. Fully Connected Layer VS Global Average Pooling Layer
3. Overall Structure of Network In Network (NIN)
4. Results

1. Linear Convolutional Layer VS mlpconv Layer

1.1. Linear Convolutional Layer
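
For reference, the feature map of a conventional linear convolutional layer with a rectifier can be written as below. This follows the notation of the NIN paper as I recall it. The symbol w_k (the filter weights for channel k) is not defined elsewhere in this story, so treat it as an assumption of this sketch.

```latex
f_{i,j,k} = \max\left( w_k^{T} x_{i,j},\ 0 \right)
```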

Here (i, j) is the pixel index in the feature map, x_ij stands for the input patch centered at location (i, j), and k is used to index the channels of the feature map.
However, representations that achieve good abstraction are generally highly nonlinear functions of the input data.
The authors argue that it would be beneficial to perform a better abstraction on each local patch before combining the patches into higher-level concepts.

1.2. mlpconv Layer
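
For reference, the computation performed by an mlpconv layer can be written as a chain of rectified linear stages, one per layer of the multilayer perceptron. This again follows the paper's notation as I recall it. The weights w and biases b are symbols assumed here, and the superscript indexes the layer of the perceptron.

```latex
\begin{aligned}
f^{1}_{i,j,k_1} &= \max\left( {w^{1}_{k_1}}^{T} x_{i,j} + b_{k_1},\ 0 \right) \\
&\ \ \vdots \\
f^{n}_{i,j,k_n} &= \max\left( {w^{n}_{k_n}}^{T} f^{n-1}_{i,j} + b_{k_n},\ 0 \right)
\end{aligned}
```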

n is the number of layers in the multilayer perceptron. The rectified linear unit (ReLU) is used as the activation function in the multilayer perceptron.
The above structure allows complex and learnable interactions of cross-channel information.
Each perceptron layer applied across channels at a single location is equivalent to a convolution layer with a 1×1 convolution kernel.
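
The equivalence with 1×1 convolution can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes the input feature map is already the output of the first spatial convolution, and the shapes, channel sizes, and function names (`mlp_per_pixel`, `conv1x1_stack`) are hypothetical choices made for illustration.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def mlp_per_pixel(feature_map, weights, biases):
    """Run the same small MLP on the channel vector at every pixel.

    feature_map: (H, W, C_in); weights[l]: (C_prev, C_next); biases[l]: (C_next,)
    """
    h, w, _ = feature_map.shape
    out = np.empty((h, w, weights[-1].shape[1]))
    for i in range(h):
        for j in range(w):
            v = feature_map[i, j]
            for w_l, b_l in zip(weights, biases):
                v = relu(v @ w_l + b_l)
            out[i, j] = v
    return out


def conv1x1_stack(feature_map, weights, biases):
    """The same computation written as stacked 1x1 convolutions with ReLU."""
    x = feature_map
    for w_l, b_l in zip(weights, biases):
        # A 1x1 convolution only mixes channels at each spatial location.
        x = relu(np.einsum("hwc,cd->hwd", x, w_l) + b_l)
    return x


rng = np.random.default_rng(0)
fmap = rng.standard_normal((4, 4, 8))   # hypothetical 4x4 map with 8 channels
channel_sizes = [8, 16, 10]             # hypothetical MLP widths
weights = [rng.standard_normal((a, b))
           for a, b in zip(channel_sizes[:-1], channel_sizes[1:])]
biases = [rng.standard_normal(b) for b in channel_sizes[1:]]

same = np.allclose(mlp_per_pixel(fmap, weights, biases),
                   conv1x1_stack(fmap, weights, biases))
print(same)
```

The per-pixel loop and the 1×1 convolution stack share the same weights at every location, which is why the two outputs agree.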

2. Fully Connected Layer VS Global Average Pooling Layer

**An Example of Fully Connected Layer VS Global Average Pooling Layer**

2.1. Fully Connected Layer

Usually, fully connected layers are used at the end of the network.
However, they are prone to overfitting.

2.2. Global Average Pooling Layer

Here, global average pooling is introduced.
The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the softmax layer.
One advantage is that it is more native to the convolution structure by enforcing correspondences between feature maps and categories.
Another advantage is that global average pooling has no parameters to optimize, so overfitting is avoided at this layer.
Furthermore, global average pooling sums out the spatial information, so it is more robust to spatial translations of the input.
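
The points above can be illustrated with a small NumPy sketch. It is only an illustration under assumed sizes (10 categories, 8×8 feature maps), not the paper's code. The circular shift via `np.roll` is a simplification of a real image translation, and the fully connected parameter count assumes a single dense layer placed directly on the flattened feature maps.

```python
import numpy as np


def global_average_pooling(feature_maps):
    """Average each feature map: (num_classes, H, W) -> (num_classes,)."""
    return feature_maps.mean(axis=(1, 2))


def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


num_classes, height, width = 10, 8, 8    # hypothetical sizes
rng = np.random.default_rng(0)
maps = rng.standard_normal((num_classes, height, width))

# One score per category, fed directly into softmax.
scores = global_average_pooling(maps)
probs = softmax(scores)

# A circular shift of the feature maps leaves the pooled scores unchanged.
shifted = np.roll(maps, shift=2, axis=2)
shift_invariant = np.allclose(global_average_pooling(shifted), scores)

# Parameters needed by a dense layer on the flattened maps vs. by pooling.
fc_params = (num_classes * height * width) * num_classes + num_classes
gap_params = 0
print(shift_invariant, fc_params, gap_params)
```

With these assumed sizes the dense layer would need 6410 parameters, while global average pooling needs none.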

3. Overall Structure of Network In Network (NIN)

Thus, the overall structure of NIN is a stack of mlpconv layers, with a global average pooling layer at the end.

4. Results

4.1. CIFAR-10

NIN + Dropout achieves an error rate of only 10.41%, which is better than Maxout + Dropout.
With data augmentation (translation and horizontal flipping), NIN reaches an error rate of 8.81%.
(If interested, there is a very short introduction of Maxout in NoC.)