Analysis and Review of SPPNet

Understanding SPPNet for Object Classification and Detection

SPPNet allows variable-sized input images to a CNN and can be used for classification and object detection

Parth Rajesh Dedhia
Towards Data Science
Nov 1, 2020


In the paper “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition”, a technique called the Spatial Pyramid Pooling layer was introduced, which makes a CNN model agnostic of the input image size. SPPNet was the 1st runner-up in the object detection challenge and the 2nd runner-up in the classification challenge at ILSVRC 2014, and hence is worth a read.

In this post, I have explained the SPP layer, followed by a review of the entire paper. The blog is structured as a conversation between a student and a teacher.

Student

I wish to learn more about object detection. Last time, you explained the R-CNN network for object detection. Has there been any other research after R-CNN in the domain of object detection?

Teacher

Soon after R-CNN, SPPNet was introduced. SPPNet makes the model agnostic of the input image size and drastically improves bounding-box prediction speed compared to R-CNN, without compromising on mAP. I will first explain how the authors made the model agnostic of the input image size.

The fixed-size constraint of a CNN does not come from the convolution layers but from the fully connected (FC) layers. A convolution layer, or a stack of them, takes an image and produces a feature map whose size is proportional to the input image size (the proportion is called the sub-sampling ratio, explained in a while). A fully connected layer, however, requires a fixed-length input vector. To overcome this issue, the authors replaced the last pooling layer (the one just before the FC layer) with a Spatial Pyramid Pooling (SPP) layer.

Note: The SPP approach is inspired by the Bag of Words approach. An in-depth understanding of the bag of words is not needed, but knowing about it will aid the understanding of the concept.
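To see the constraint concretely, here is a tiny PyTorch check (an illustration, not from the paper): the convolution output grows with the input, while an FC layer downstream would demand a fixed-length vector.

```python
import torch
import torch.nn as nn

# A convolution happily accepts any spatial input size...
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
for size in (224, 180):
    x = torch.randn(1, 3, size, size)
    y = conv(x)
    # ...but its output size tracks the input size, so flattening it
    # gives vectors of different lengths, which an FC layer cannot take.
    print(y.shape, "->", y.flatten(1).shape)
```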

Student

The pooling layers that I have seen have a fixed window size of, let’s say, 2 x 2 and a stride of 2 in both directions. In such cases, the output remains proportional to the input. How can a pooling layer solve the problem?

Teacher

In the pooling layer you described, you fixed the stride (2) and the window size ([2 x 2]); because of this, the output is always proportional to the input. If you instead make the pooling window and stride proportional to the input size, you always get a fixed-sized output. Moreover, the SPP layer does not apply just one pooling operation; it applies several pooling operations with different output sizes (that’s where the name comes from: Spatial Pyramid Pooling) and combines the results before sending them to the next layer.

Spatial Pyramid Pooling (SPP) layer — paper

In the above image, the authors use three pooling operations. One outputs a single number for each map, another gives a 2 x 2 grid output for each map, and the last gives a 4 x 4 output. Each operation is applied to every feature map (256 maps in the above case) produced by the previous convolution layer. The SPP layer output is flattened and sent to the FC layer.

Now to calculate the window size and stride size, let’s declare some variables and formulas.

  • Let the feature map input to the SPP layer be of size [a x a], e.g. [13 x 13]
  • Let the required output size for a single map be [n x n], e.g. [4 x 4]
  • window = ceiling(a/n) = ceiling(13/4) = 4
  • stride = floor(a/n) = floor(13/4) = 3

Applying the above window ([4 x 4]) with stride 3 on a [13 x 13] map, we get a [4 x 4] output. The same operation is applied to all the feature maps; for 256 feature maps, we get an output of [4 x 4 x 256]. So we can change the [n x n] grid size to get any desired output size.
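Here is a minimal PyTorch sketch of the SPP layer built from these formulas (an illustration, not the authors’ original implementation): the window and stride are computed from the runtime feature-map size, so any input size yields the same output length.

```python
import math
import torch
import torch.nn.functional as F

def spp_layer(feature_map, levels=(4, 2, 1)):
    """Pool a [N, C, H, W] feature map into a fixed-length vector,
    one max-pool per pyramid level (here the {4x4, 2x2, 1x1} pyramid)."""
    n, c, h, w = feature_map.shape
    outputs = []
    for out_size in levels:
        # window = ceiling(a/n), stride = floor(a/n), as in the paper
        win = (math.ceil(h / out_size), math.ceil(w / out_size))
        stride = (h // out_size, w // out_size)
        pooled = F.max_pool2d(feature_map, kernel_size=win, stride=stride)
        outputs.append(pooled.reshape(n, -1))  # flatten each level
    return torch.cat(outputs, dim=1)           # fixed-length output

# A [13 x 13] map with 256 channels -> 256 * (16 + 4 + 1) = 5376 features
x = torch.randn(1, 256, 13, 13)
print(spp_layer(x).shape)  # torch.Size([1, 5376])
```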

Student

The SPP layer is an intuitive design, as it makes the CNN model produce a fixed-size result irrespective of the input image size. So, based on my understanding, I can apply this layer to any CNN just before the features are sent to the fully connected layer. Which models did the authors use in this paper? Also, since a completely new layer is introduced before the FC layer, did the authors use a pre-trained architecture, or did they train a model from scratch for the classification task?

Teacher

One could apply the SPP layer to any CNN architecture. But we are talking about 2014, and not many models existed then. The authors used ZF-Net, AlexNet, and OverFeat (5- and 7-layered) architectures. However, these networks were modified by changing the padding to get the desired output feature map.

CNN model architecture used in SPPNet — paper

The authors trained the models on the ImageNet 2012 dataset, provided a detailed analysis of the training details, and compared the results with contemporary models that do not use the SPP layer. However, let’s first make some definitions clear before diving into the nitty-gritty details.

Multi-size/Multi-scale image: changing the input image size.

Multi-view: using image augmentation, i.e., taking crops from the input image and flipping them.

First, the authors train the model with a single-size input ([224 x 224]); they then train with variable input sizes ([224 x 224] and [180 x 180]). Training is performed on fixed-size images to take advantage of the GPU implementation; for multi-size training, they use one image size for a full epoch and then alternate to the other size in the next epoch, as sketched below. The pyramid used for both of the above trainings is {6 x 6, 3 x 3, 2 x 2, 1 x 1}.
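A toy sketch of that alternating schedule (the actual training step is omitted); the SPP layer is what makes the same weights valid at both sizes:

```python
# Alternate the input size each full epoch; the network weights are
# shared across sizes thanks to the SPP layer.
sizes = [224, 180]
for epoch in range(6):
    size = sizes[epoch % len(sizes)]   # 224, 180, 224, 180, ...
    print(f"epoch {epoch}: train one full epoch on {size} x {size} inputs")
```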

Note: For training, the image is resized such that min(height, width) = 256 and the other side is adjusted to preserve the aspect ratio. Then, 224 x 224 crops are taken from the center and the four corners, giving five 224 x 224 images. Each crop is also flipped horizontally, giving a total of 10 images per input image.
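This augmentation maps directly onto torchvision’s standard transforms; a small sketch, assuming a local image file named example.jpg:

```python
from PIL import Image
from torchvision import transforms

# Resize so min(height, width) = 256, then take the center and four
# corner 224 x 224 crops plus their horizontal flips: 10 views total.
ten_crop = transforms.Compose([
    transforms.Resize(256),          # shorter edge -> 256, aspect kept
    transforms.TenCrop(224),         # 5 crops + 5 horizontal flips
])

image = Image.open("example.jpg")    # hypothetical input image
views = ten_crop(image)
print(len(views))                    # 10
```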

ImageNet 12 validation set analysis — paper

The authors compared the single-size-trained SPP model with its no-SPP counterpart and observed a reduction in error. To show that this reduction is due to the SPP layer and not due to the increased parameter count, they changed the SPP pyramid to {4 x 4, 3 x 3, 2 x 2, 1 x 1}, which considerably reduces the number of parameters, yet the error increases only marginally.

Validation on full-size and cropped images — paper

The authors also tested the model with and without 224 x 224 cropping and analyzed the results. A reduction in error was observed on full-size images, as shown in the image above. From this reduction, the authors concluded:

This shows the importance of maintaining the complete content.

Student

From the above numbers, we can say that the novel layer introduced in the SPPNet paper did reduce the error for the classification task. I would also like to know how the authors extended the model to the object detection task.

Teacher

The authors can use this model as-is for object detection, adding a bounding-box detector to boost the mAP. However, to understand how this works, we first need to understand the concept of the sub-sampling ratio (S).

Understanding sub-sampling ratio — Image by author

The sub-sampling ratio can be used to determine the output shape of the CNN feature map from the input image shape: simply dividing the input image dimensions by the sub-sampling ratio (S) gives the feature map size. We will use this concept in a while.

Note: Padding does affect the sub-sampling ratio, as we may need to add/subtract a constant based on the padding. Details are provided in the appendix of the paper.
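Ignoring padding, the sub-sampling ratio is just the product of all layer strides up to the layer in question. A quick sketch with a ZF-Net-like stride layout (assumed here for illustration):

```python
# Strides up to conv5 in a ZF-Net-style stack: conv1 and pool1 have
# stride 2, conv2 and pool2 have stride 2 (later convs use stride 1).
strides = [2, 2, 2, 2]

S = 1
for s in strides:
    S *= s
print(S)  # 16: moving 16 pixels in the image moves 1 cell on conv5
```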

The SPPNet paper shows a detection analysis for the ZF-Net architecture, which has a sub-sampling ratio of 16. For object detection, they still use the Selective Search algorithm to generate region proposals (~2k regions per image). But unlike R-CNN, they don’t send each proposed region through the CNN; instead, they map these regions onto the feature map output by the last convolution layer (conv5).

Mapping of an object from image to feature map — screenshot from video

In the above image, an object from the image is mapped to the feature map. Let’s look at the math to see how that is done.

  • Let the image size be [img_height x img_width] = [688 x 920]
  • Let an object present in the image have its center at [x, y] = [340, 450]
  • Let the height and width of the object be [obj_height x obj_width] = [320 x 128]
  • To map these to the corresponding spatial locations on the feature map, we simply divide them by the sub-sampling ratio S = 16
  • The feature map will be of size [img_height / S x img_width / S] ≈ [43 x 58]
  • The object center on the feature map will be at [x / S, y / S] ≈ [21, 28]
  • The object on the feature map will have height and width [obj_height / S x obj_width / S] = [20 x 8]

In this way, we can map any object from the input image onto the output feature map. The object’s coordinates are projected onto the feature map, and only those regions are then sent to the SPP layer for feature extraction and on to the FC layer. The figure below, and the small code sketch after it, illustrate this projection.

Mapping of object window to feature map — paper
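A simplified sketch of this projection (the paper’s appendix adds small padding corrections omitted here, and the helper name is hypothetical):

```python
def to_feature_map(center, size, S=16):
    """Map an object's center (x, y) and (height, width) from image
    space to conv5 feature-map space by dividing by the sub-sampling
    ratio S. Rounding is simplified to integer division."""
    (x, y), (h, w) = center, size
    return (x // S, y // S), (h // S, w // S)

# The worked example above: a 320 x 128 object centered at (340, 450).
print(to_feature_map((340, 450), (320, 128)))  # ((21, 28), (20, 8))
```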

Student

Projecting the regions from the input image onto the feature map to avoid redundant convolution operations on the same parts of the image is a smart way to reduce computation. Has this mechanism affected the detection mAP compared to R-CNN?

Teacher

Before comparing performance, there are some more details we need to look at. The authors used the single-size ([224 x 224] only) ImageNet pre-trained ZF-5 network. For detection, they used two different prediction methods: single-scale, where min(height, width) = 688, and multi-scale, where min(height, width) ∈ {480, 576, 688, 864, 1200}, with the other side resized to preserve the aspect ratio.

In the case of multi-scale detection, the authors used a novel strategy: for each proposed region, choose the scale at which the scaled region’s pixel count is closest to 224 x 224 = 50,176, and extract that region’s features at that single scale, as sketched below.
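A sketch of that selection rule (illustrative only; the function name and bookkeeping are assumptions, not the paper’s code):

```python
def best_scale(box_h, box_w, orig_min_side,
               scales=(480, 576, 688, 864, 1200)):
    """For each candidate scale s, the image's shorter side is resized
    to s, so a region grows by r = s / orig_min_side; keep the scale
    whose scaled region area is closest to 224 * 224 = 50,176 pixels."""
    target = 224 * 224
    def scaled_area(s):
        r = s / orig_min_side
        return box_h * r * box_w * r
    return min(scales, key=lambda s: abs(scaled_area(s) - target))

# A 320 x 128 proposal in an image whose shorter side is 688 pixels
print(best_scale(320, 128, orig_min_side=688))  # -> 688
```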

In comparison with R-CNN, the performance degrades by 0.5% at a single scale, but in the multi-scale case it improves by 0.7%. Though this increase in mAP is not large, there is a considerable increase in speed. The authors also compared the model with and without fine-tuning the FC layers, and compared the same with R-CNN; the results are shown below.

Note: The convolution layers were not fine-tuned, only the FC-layers.

Comparison with R-CNN on Pascal VOC 07 — paper

Note: We have discussed the object detection results on the Pascal VOC 2007 dataset.

Most of the key concepts from the paper have been discussed. However, I would recommend reading the paper itself, as it provides a more detailed analysis in both the classification and detection sections.

Understanding the SPP layer makes it easier to learn the two-stage models built after it. I will soon be writing about the Fast R-CNN paper. Stay tuned!

References

K. He, X. Zhang, S. Ren, and J. Sun, “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition,” ECCV, 2014.
