OverFeat Review [1312.6229]

A theoretical summary of OverFeat

Sanchit Tanwar
Towards Data Science


Credits: https://en.wikipedia.org/wiki/Object_detection

I plan to read the major object detection papers (although I have already skimmed most of them, I will now read each one in enough detail to write a blog post about it). The papers are all about deep-learning-based object detection. Feel free to give suggestions or ask questions; I will try my best to help everyone. I will list the arXiv code of each paper below and link to the corresponding blog post (I will keep updating the links as I write) along with the paper. Anyone starting out in the field can skip a lot of these papers. I will also note the priority/importance of each paper (according to how necessary it is for understanding the topic) once I have read them all.
I have written this blog with readers like me in mind, who are still learning. If you find any mistake (I will try to minimize them by studying the paper in depth from various sources, including blogs, code, and videos), feel free to highlight it or leave a comment on the blog. The list of papers I will be covering is at the end of the blog.

Let’s get started :)

The OverFeat paper explores three computer vision tasks, classification, localization, and detection, in increasing order of difficulty. Each task is a sub-task of the next. All three are addressed using a single framework and a shared feature-learning base.

Classification

The architecture for classification is similar to AlexNet, with some improvements. The authors prepared two different architectures: a fast one and an accurate one. The differences from the AlexNet architecture include no contrast normalization, non-overlapping pooling regions, and larger 1st- and 2nd-layer feature maps thanks to a smaller stride. I will add the table showing both the fast and accurate architectures. The execution of the training and inference steps is different; I will explain the training step here and the inference step later. The classifier is trained at a single scale of 221*221: images are first resized to 256 (smallest dimension), and then 5 random 221*221 crops and their horizontal flips are extracted. I will not go into exact training details such as learning rate, weight decay, etc. The architecture can be seen in Fig 1.

Fig 1. Overfeat architecture
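
To make the layer layout concrete, here is a rough PyTorch sketch of the fast model as I read it from the paper's architecture table. The filter counts, strides, padding, and pooling sizes are my assumptions and should be checked against Table 1 of the paper; the point is the overall AlexNet-like structure with non-overlapping 2*2 pooling (a `LazyLinear` is used so the sketch runs for any crop size).

```python
import torch
import torch.nn as nn

# Approximate sketch of OverFeat's "fast" classification model (assumed sizes).
overfeat_fast = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),      # layer 1
    nn.MaxPool2d(kernel_size=2, stride=2),                                  # non-overlapping pooling
    nn.Conv2d(96, 256, kernel_size=5), nn.ReLU(inplace=True),               # layer 2
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),   # layer 3
    nn.Conv2d(512, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # layer 4
    nn.Conv2d(1024, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True), # layer 5
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.LazyLinear(3072), nn.ReLU(inplace=True),    # layer 6 ("FC"); lazy so any crop size works
    nn.Linear(3072, 4096), nn.ReLU(inplace=True),  # layer 7
    nn.Linear(4096, 1000),                         # 1000-way ImageNet output
)

logits = overfeat_fast(torch.randn(1, 3, 221, 221))
print(logits.shape)  # torch.Size([1, 1000])
```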

ConvNets and Sliding Window Efficiency(Applying sliding window internally in convnets)

We know the sliding window approach can improve the results of classification and localization, but it increases computation many-fold. ConvNets are inherently efficient at applying the sliding window technique because they share the computation that is common to overlapping regions (and most of each window overlaps with its neighbours). I will explain how this sliding window works internally in CNNs. Take the following figure for reference.

Fig 2.

I hope you are aware of receptive fields (even if you are not, read up on them after this blog; it is an important concept). The above figure is divided into two subparts; consider the first part for now. After the first (5*5) convolution is applied, the 14*14 input becomes a 10*10 output (look up how to calculate the output dimension of a convolution if you are unsure), followed by a 2*2 max pool layer, and similarly in the later layers, giving a final output of 1*1. Intuitively, we can say that this final 1*1 output encodes information from the whole 14*14 input (this is basically its receptive field). Now jump to the second part: there is no change in the filter sizes, the only change is in the input size (16*16). When the first convolution is applied followed by max pooling, the output size becomes 6*6 (it was 5*5 in the previous case), and when the convolution with a 5*5 filter is then applied, the output size is 2*2 instead of 1*1. If you look at the blue point in the 2*2 output, it encodes all the information from the blue portion of the input and never sees the yellow region. Similarly, the second point (location (0,1) of the output matrix) of the 2*2 output never sees the first two columns and the bottom two rows of the input. The receptive field of the model is still 14*14, so the increase in input size has simply increased the output size linearly: each output location looks at a 14*14 window shifted by a couple of pixels.
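
Here is a minimal PyTorch sketch of the toy network in Fig 2 (a 5*5 convolution, a 2*2 max pool, and a 5*5 "classifier" convolution; sizes are taken from the figure, not from the real OverFeat model). It confirms that growing the input from 14*14 to 16*16 grows the output from 1*1 to 2*2:

```python
import torch
import torch.nn as nn

# Toy network from Fig 2: conv 5x5 -> max pool 2x2 -> conv 5x5 ("classifier").
net = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=5),   # 14x14 -> 10x10   (16x16 -> 12x12)
    nn.MaxPool2d(kernel_size=2),      # 10x10 -> 5x5     (12x12 -> 6x6)
    nn.Conv2d(1, 1, kernel_size=5),   # 5x5   -> 1x1     (6x6   -> 2x2)
)

print(net(torch.randn(1, 1, 14, 14)).shape)  # torch.Size([1, 1, 1, 1])
print(net(torch.randn(1, 1, 16, 16)).shape)  # torch.Size([1, 1, 2, 2])
```

Each extra output location corresponds to the 14*14 window shifted by 2 pixels (the total subsampling of this toy net), and costs only the convolutions over the newly added pixels rather than a full forward pass per window.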

Multi-Scale Classification

At inference time, one way to boost classification accuracy is multi-view voting: using a policy similar to the one used in training, 10 different views are generated (5 crops and their horizontal flips) and the predictions are averaged. But this ignores many regions of the image (unlike a sliding window approach, we use only 5 crop locations per image). The authors therefore use a better approach, which is explained below:

They use 6 different scales of input, which result in layer-5 feature maps of different sizes (refer to the table below for these sizes).

Fig 3. Spatial Dimension for multiscale approach

Here we can see that the change in the input size between scales is 36 or 72 pixels. The value 36 (the subsampling ratio of the network) is analogous to the 2-pixel shift we observed in the last section. The output of layer 5 (pre-pool in the table above) is different for each scale, and a 3*3 max pooling is applied to it. Consider the first scale, where the pre-pool output is 17*17. If a 3*3 max pooling is applied without padding, the output is 5*5, and this output receives no input from the last two rows and columns of the pre-pooled feature map. Because CNNs have local connectivity, those two rows and columns correspond to around 30 pixels of the original image. To counter this problem, max pooling is applied a total of 9 times, starting at pixel offsets (x, y) with x in {0,1,2} and y in {0,1,2}; the 9 poolings form a 3*3 grid of offsets (the "9 times" is in 2D, over both rows and columns). This is explained by the authors in Fig 4.

Fig 4.

Continuing from the paragraph above: the total pooled output therefore has shape (5*5)*(3*3). The classifier (the fully connected layers viewed as a 5*5 convolution) is then applied to get the final classifier map, which has shape (1*1)*(3*3)*C, where C is the number of classes, since this is the final step where we get predictions. For the other scales, the map generated after max pooling is different, but the procedure is the same, and you should now be able to work out those output dimensions yourself. The final output is computed by first taking the spatial max at each scale (the max score over the 3*3*C map, or 6*9*C for a larger scale, and so on). The per-class scores from all scales are then averaged to get a single class prediction for each image.
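
Here is a minimal PyTorch sketch of this offset-pooling scheme for the first scale. I assume a 1024-channel layer-5 feature map and collapse the whole classifier head into a single hypothetical 5*5 convolution with C outputs; the real model has several layers there, so this is only meant to show the shapes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C = 1000                                        # number of classes
classifier = nn.Conv2d(1024, C, kernel_size=5)  # hypothetical stand-in for the FC head
feat = torch.randn(1, 1024, 17, 17)             # layer-5 pre-pool map at scale 1 (assumed 1024 channels)

scores = torch.zeros(3, 3, C)
for dy in range(3):                             # pooling offset along y
    for dx in range(3):                         # pooling offset along x
        pooled = F.max_pool2d(feat[:, :, dy:, dx:], kernel_size=3, stride=3)  # -> 5x5 per offset
        scores[dy, dx] = classifier(pooled)[0, :, 0, 0]                       # -> 1x1xC per offset

# The 3x3 grid of offsets is the (1x1)x(3x3)xC classifier map for this scale.
scale_scores = scores.flatten(0, 1).max(dim=0).values  # spatial max -> C scores for this scale
print(scale_scores.shape)                               # torch.Size([1000])
```

Averaging these per-scale score vectors over all 6 scales gives the final prediction for the image.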

We are done with classification; our main task, object detection, has not even started yet.

Don't worry, it is not that long from here. We have already covered most of the theory.

Localization

For localization, the classification layers of the trained network are replaced by a regression network, which is trained to predict the bounding box at each spatial location and scale. The regression network takes the pooled feature maps from layer 5, followed by two FC layers of size 4096 and 1024; the final output has 4 units (see Fig 5). The weights of the feature-extraction layers (the first 5 layers) are kept fixed, and the model is trained using an L2 loss. Training is done at multiple scales (unlike the classification network, which was trained at a single scale and only evaluated at multiple scales). This makes predictions match correctly across scales and increases the confidence of the merged predictions.

Fig 5
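
A minimal sketch of this regression head, assuming (as with the classifier) that the FC layers are applied convolutionally over the pooled layer-5 maps, with the first layer looking at a 5*5 neighbourhood; the channel counts follow the text, while the convolutional view and the 1024 input channels are my assumptions:

```python
import torch
import torch.nn as nn

# Regression head on top of the frozen, pooled layer-5 features (assumed 1024 channels).
regressor = nn.Sequential(
    nn.Conv2d(1024, 4096, kernel_size=5), nn.ReLU(inplace=True),  # "FC" layer of size 4096
    nn.Conv2d(4096, 1024, kernel_size=1), nn.ReLU(inplace=True),  # "FC" layer of size 1024
    nn.Conv2d(1024, 4, kernel_size=1),                            # 4 outputs: bounding box coordinates
)

pooled = torch.randn(1, 1024, 6, 7)   # pooled layer-5 map at some scale (made-up size)
boxes = regressor(pooled)             # one 4-vector per remaining spatial location
print(boxes.shape)                    # torch.Size([1, 4, 2, 3])
```

Training then minimizes the L2 loss between these per-location predictions and the ground-truth box.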

To generate object bounding box predictions, we simultaneously run the classifier and regressor networks across all locations and scales. Since these share the same feature extraction layers, only the final regression layers need to be recomputed after computing the classification network.

Since the model predicts multiple boxes (while there is only a single box in the localization task), we need some strategy to eliminate the bad predictions. A greedy merge strategy is applied, shown in Fig 6 (you can skip it, since I haven't seen any other paper use this strategy and NMS, which will be discussed in future blogs, is used more often).

Fig 6

Here, match_score is calculated using the sum of the distance between the centers of b1 and b2 and the intersection area of the two boxes, and box_merge computes the average of the bounding box coordinates.
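
A small Python sketch of this greedy merging, using only the center distance as the match score (the intersection-area term is omitted for brevity) and a made-up stopping threshold t:

```python
import numpy as np

def match_score(b1, b2):
    # Distance between box centers (boxes are (x1, y1, x2, y2)).
    c1 = np.array([(b1[0] + b1[2]) / 2, (b1[1] + b1[3]) / 2])
    c2 = np.array([(b2[0] + b2[2]) / 2, (b2[1] + b2[3]) / 2])
    return float(np.linalg.norm(c1 - c2))

def box_merge(b1, b2):
    # Average of the two boxes' coordinates.
    return tuple((a + b) / 2 for a, b in zip(b1, b2))

def greedy_merge(boxes, t=50.0):
    # Repeatedly merge the best-matching pair until no pair scores below t.
    boxes = list(boxes)
    while len(boxes) > 1:
        score, i, j = min((match_score(b1, b2), i, j)
                          for i, b1 in enumerate(boxes)
                          for j, b2 in enumerate(boxes) if i < j)
        if score > t:
            break
        merged = box_merge(boxes[i], boxes[j])
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)] + [merged]
    return boxes

# The three overlapping boxes collapse into one; the far-away box survives.
print(greedy_merge([(10, 10, 100, 100), (14, 12, 104, 98), (12, 8, 98, 102), (300, 300, 400, 400)]))
```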

Detection

In the case of detection, the main difference from the localization task is the need to predict a background class when no object is present. Boxes from the localization network whose predicted class is now background can therefore be removed, and we keep only the predictions whose class is predicted confidently.

Here comes the end of the OverFeat paper summary. Please highlight anything I have gotten wrong and ask questions in the comments.

References:

  1. https://arxiv.org/pdf/1312.6229
  2. https://medium.com/coinmonks/review-of-overfeat-winner-of-ilsvrc-2013-localization-task-object-detection-a6f8b9044754
  3. https://www.youtube.com/watch?v=JKTzkcaWfuk
  4. https://www.youtube.com/watch?v=3U-bZgKFS7g&t=70s
  5. AlexNet
  6. Receptive fields: https://medium.com/mlreview/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807

List of Papers:

  1. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. ← You just completed this one.
  2. Rich feature hierarchies for accurate object detection and semantic segmentation(RCNN). [Link to blog]
  3. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition (SPPNet). [Link to blog]
  4. Fast R-CNN [Link to blog]
  5. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. [Link to blog]
  6. You Only Look Once: Unified, Real-Time Object Detection. [Link to blog]
  7. SSD: Single Shot MultiBox Detector. [Link to blog]
  8. R-FCN: Object Detection via Region-based Fully Convolutional Networks. [Link to blog]
  9. Feature Pyramid Networks for Object Detection. [Link to blog]
  10. DSSD: Deconvolutional Single Shot Detector. [Link to blog]
  11. Focal Loss for Dense Object Detection(Retina net). [Link to blog]
  12. YOLOv3: An Incremental Improvement. [Link to blog]
  13. SNIPER: Efficient Multi-Scale Training. [Link to blog]
  14. High-Resolution Representations for Labeling Pixels and Regions. [Link to blog]
  15. FCOS: Fully Convolutional One-Stage Object Detection. [Link to blog]
  16. Objects as Points. [Link to blog]
  17. CornerNet-Lite: Efficient Keypoint Based Object Detection. [Link to blog]
  18. CenterNet: Keypoint Triplets for Object Detection. [Link to blog]
  19. Training-Time-Friendly Network for Real-Time Object Detection. [Link to blog]
  20. CBNet: A Novel Composite Backbone Network Architecture for Object Detection. [Link to blog]
  21. EfficientDet: Scalable and Efficient Object Detection. [Link to blog]

Peace…
