Review: STN — Spatial Transformer Network (Image Classification)

With STN, Data Is Spatially Transformed Within the Network to Learn Invariance to Translation, Scale, Rotation, and More Generic Warping.

Sik-Ho Tsang
Towards Data Science

--

In this story, Spatial Transformer Network (STN), by Google DeepMind, is briefly reviewed. STN helps to crop out and scale-normalize the appropriate region, which can simplify the subsequent classification task and lead to better classification performance, as below:

(a) Input Image with Random Translation, Scale, Rotation, and Clutter, (b) STN Applied to Input Image, (c) Output of STN, (d) Classification Prediction

It was published in 2015 NIPS with more than 1300 citations. Spatial transformations such as affine transformation and homography registration have been studied for decades, but in this paper, spatial transformation is handled by a neural network. With learning-based spatial transformation, the transformation is applied conditioned on the input image or feature map. It is also highly related to another paper called “Deformable Convolutional Networks” (2017 ICCV). Thus, I decided to read this one first. (Sik-Ho Tsang @ Medium)

Outline

  1. Quick Review on Spatial Transformation Matrices
  2. Spatial Transformer Network (STN)
  3. Sampling Kernel
  4. Experimental Results
  5. Some Other Tasks

1. Quick Review on Spatial Transformation Matrices

There are mainly 3 transformations learnt by STN in the paper. Indeed, more sophisticated transformations can also be applied.

1.1 Affine Transformation

Affine Transform
  • Depending on the values in the matrix, we can transform (X1, Y1) to (X2, Y2) with different effects, as follows (a code sketch is given after this list):
Translation, Scaling, Rotation, and Shearing
  • If interested, please Google “Registration”, “Homography Matrix”, or “Affine Transform”.
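To make this concrete, below is a minimal NumPy sketch (my own illustration, not code from the paper) that applies a 2×3 affine matrix to points in homogeneous coordinates; the function name and example matrices are just for illustration:

    import numpy as np

    def affine_transform(points, theta):
        """Apply a 2x3 affine matrix theta to an (N, 2) array of points."""
        # Append a 1 to each point: homogeneous coordinates (x, y, 1).
        homo = np.hstack([points, np.ones((len(points), 1))])
        return homo @ theta.T  # (N, 2)

    pts = np.array([[1.0, 1.0]])

    # Rotation by 45 degrees (no translation).
    c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
    rotate = np.array([[c, -s, 0.0],
                       [s,  c, 0.0]])

    # Scaling by 2 plus a translation of (3, -1).
    scale_translate = np.array([[2.0, 0.0,  3.0],
                                [0.0, 2.0, -1.0]])

    print(affine_transform(pts, rotate))           # [[0. 1.41421356]]
    print(affine_transform(pts, scale_translate))  # [[5. 1.]]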

1.2 Projective Transformation

  • Projective transformation can also be learnt in STN, as below (a sketch follows the figure):
Projective Transformation
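A projective transformation (homography) adds a non-trivial last row to the matrix, so each mapped point must be divided by its third homogeneous coordinate. A minimal NumPy sketch (again my own illustration):

    import numpy as np

    def projective_transform(points, H):
        """Apply a 3x3 homography H to an (N, 2) array of points."""
        homo = np.hstack([points, np.ones((len(points), 1))])  # (x, y, 1)
        mapped = homo @ H.T                                    # (x', y', w')
        return mapped[:, :2] / mapped[:, 2:3]                  # divide by w'

    # Unlike an affine transform, parallel lines are not preserved.
    H = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.5, 0.0, 1.0]])
    print(projective_transform(np.array([[2.0, 2.0]]), H))  # [[1. 1.]]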

1.3. Thin Plate Spline (TPS) Transformation

Thin Plate Spline (TPS) Transformation
An example
  • The TPS transformation is more complicated than the previous two transformations. (I have learnt affine and projective mapping before, but I haven’t touched TPS; if there are mistakes, please tell me.)
  • To be brief, suppose we have a point (x, y) at a location other than the input control points (xi, yi). We use the equations on the right to transform the point based on a bias, a weighted sum of x and y, and a function of the distance between (x, y) and each (xi, yi). (Here, a radial basis function, RBF.)
  • Therefore, if we use TPS, the network needs to learn a0, a1, a2, b0, b1, b2, Fi, and Gi, which is 6+2N parameters in total.
  • As we can see, a more flexible deformation, with a higher degree of freedom, can be achieved by TPS. (A code sketch follows this list.)
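Below is a minimal NumPy sketch of the forward TPS mapping described above, assuming the 6+2K parameters (affine parts plus RBF weights for K control points) have already been learnt; the names are mine, and the RBF form U(r) = r² log(r²) follows the standard TPS formulation rather than code from the paper:

    import numpy as np

    def tps_rbf(r2):
        """TPS radial basis U(r) = r^2 * log(r^2), with U(0) = 0."""
        return np.where(r2 == 0, 0.0, r2 * np.log(np.maximum(r2, 1e-12)))

    def tps_transform(points, ctrl, a, b, F, G):
        """Map (N, 2) points given K control points ctrl and learned
        parameters: affine parts a = (a0, a1, a2), b = (b0, b1, b2),
        and RBF weights F, G of length K (6 + 2K parameters in total)."""
        # Squared distance from every point to every control point: (N, K)
        r2 = ((points[:, None, :] - ctrl[None, :, :]) ** 2).sum(-1)
        U = tps_rbf(r2)
        x, y = points[:, 0], points[:, 1]
        x_new = a[0] + a[1] * x + a[2] * y + U @ F
        y_new = b[0] + b[1] * x + b[2] * y + U @ G
        return np.stack([x_new, y_new], axis=1)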

2. Spatial Transformer Network (STN)

Affine Transformation
  • STN is composed of Localisation Net, Grid Generator and Sampler.

2.1. Localisation Net

  • With an input feature map U, of width W, height H, and C channels, the output is θ, the parameters of the transformation Tθ. Tθ can be learnt as the affine transform above, or be more constrained, such as the one used for attention, which only contains scaling and translation, as below (a network sketch follows the figure):
Only scaling and translation
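As a sketch of what such a localisation net can look like (a small PyTorch network of my own design, sized for 28×28 inputs as in MNIST; the paper does not prescribe a specific architecture):

    import torch
    import torch.nn as nn

    class LocalisationNet(nn.Module):
        """Regress the 6 affine parameters theta from the input feature map."""
        def __init__(self, in_channels):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
                nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            )
            self.fc = nn.Sequential(
                nn.Linear(10 * 3 * 3, 32), nn.ReLU(),  # 3x3 holds for 28x28 inputs
                nn.Linear(32, 6),  # the 6 entries of the 2x3 affine matrix
            )
            # Initialise to the identity transform so that training starts
            # from an untransformed feature map.
            self.fc[-1].weight.data.zero_()
            self.fc[-1].bias.data.copy_(
                torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

        def forward(self, U):
            theta = self.fc(self.features(U).flatten(1))
            return theta.view(-1, 2, 3)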

2.2. Grid Generator

  • Suppose we have a regular grid G, this G is a set of points with target coordinates (xt_i, yt_i).
  • Then we apply the transformation Tθ on G, i.e. Tθ(G).
  • After Tθ(G), a set of points with source coordinates (xs_i, ys_i) is output; these tell us where to sample the input for each target position. The mapping can be translation, scale, rotation, or more generic warping, depending on how we set θ, as mentioned above. (A sketch of this step follows.)
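In PyTorch, this step corresponds to F.affine_grid, which computes the normalised source coordinates for every target pixel (a sketch under that assumption; the paper itself predates this API):

    import torch
    import torch.nn.functional as F

    # theta: (N, 2, 3) affine parameters from the localisation net.
    theta = torch.tensor([[[0.5, 0.0, 0.0],    # scale x by 0.5 (zoom in 2x)
                           [0.0, 0.5, 0.0]]])  # scale y by 0.5

    # For each target pixel of a 1x1x8x8 output, compute the source
    # coordinates (xs_i, ys_i) = T_theta(xt_i, yt_i), normalised to [-1, 1].
    grid = F.affine_grid(theta, size=(1, 1, 8, 8), align_corners=False)
    print(grid.shape)  # torch.Size([1, 8, 8, 2])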

2.3. Sampler

(a) Identity Transformation, (b) Affine Transformation
  • Based on the set of source coordinates (xs_i, ys_i), we sample the input U to generate the transformed output feature map V. Depending on θ, this V can be translated, scaled, rotated, warped, or projectively transformed. (See the sketch after this list.)
  • It is noted that STN can be applied to not only input image, but also intermediate feature maps.
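Putting the grid generator and sampler together, a complete STN forward pass can be sketched with PyTorch's built-in grid_sample (bilinear sampling at the source coordinates); the helper name is mine:

    import torch
    import torch.nn.functional as F

    def spatial_transformer(U, theta, out_size):
        """Grid generator + bilinear sampler.
        U: (N, C, H, W) feature map, theta: (N, 2, 3) affine parameters."""
        grid = F.affine_grid(theta, size=out_size, align_corners=False)
        # Bilinearly sample U at the source coordinates to produce V.
        return F.grid_sample(U, grid, mode='bilinear', align_corners=False)

    U = torch.randn(1, 1, 28, 28)
    theta = torch.tensor([[[1.0, 0.0, 0.25],     # identity plus an
                           [0.0, 1.0, 0.0]]])    # x-translation
    V = spatial_transformer(U, theta, out_size=(1, 1, 28, 28))
    print(V.shape)  # torch.Size([1, 1, 28, 28])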

3. Sampling Kernel

  • As we can see in the example above, sampling at a transformed grid poses a sampling problem: how we sample those sub-pixel positions depends on which sampling kernel we use.
  • General Form: V_i^c = Σ_n Σ_m U_nm^c · k(xs_i − m; Φx) · k(ys_i − n; Φy), where k is the sampling kernel and (xs_i, ys_i) are the source coordinates.
  • Integer Sampling Kernel (rounding to the nearest integer): V_i^c = Σ_n Σ_m U_nm^c · δ(⌊xs_i + 0.5⌋ − m) · δ(⌊ys_i + 0.5⌋ − n).
  • Bilinear Sampling Kernel: V_i^c = Σ_n Σ_m U_nm^c · max(0, 1 − |xs_i − m|) · max(0, 1 − |ys_i − n|).
  • The bilinear kernel is a (sub-)differentiable sampling mechanism with respect to both U and the sampling coordinates, so it is convenient for backpropagation (see the sketch below).
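A naive NumPy version of the bilinear kernel (one output pixel, one channel) makes it easy to see why gradients flow to the coordinates: the weights max(0, 1 − |xs − m|) are piecewise linear in xs, so they are differentiable almost everywhere. This is an illustration of mine, not the paper's code:

    import numpy as np

    def bilinear_sample(U, xs, ys):
        """V = sum_n sum_m U[n, m] * max(0, 1-|xs-m|) * max(0, 1-|ys-n|).
        Only the 4 neighbouring pixels get non-zero weights."""
        H, W = U.shape
        wx = np.maximum(0.0, 1.0 - np.abs(xs - np.arange(W)))  # (W,)
        wy = np.maximum(0.0, 1.0 - np.abs(ys - np.arange(H)))  # (H,)
        return wy @ U @ wx

    U = np.array([[0.0, 1.0],
                  [2.0, 3.0]])
    print(bilinear_sample(U, 0.5, 0.5))  # 1.5, the average of all 4 pixels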

4. Experimental Results

4.1. Distorted MNIST

Errors on the distorted MNIST datasets (left); some examples that fail with CNN but are successfully classified with ST-CNN (right)
  • Distortion applied: TC: translated and cluttered, R: rotated, RTS: rotated, translated, and scaled, P: projective distortion, E: elastic distortion.
  • Spatial transformers: Aff: Affine Transformation, Proj: Projective Transformation, TPS: Thin Plate Spline Transformation.
  • FCN: FCN here means Fully Connected Network without convolutions. (It is NOT a Fully Convolutional Network here.)
  • As we can see, ST-FCN outperforms FCN and ST-CNN outperforms CNN.
  • And ST-CNN is consistently better than ST-FCN in all settings.

4.2. SVHN (Street View House Number)

Errors on the SVHN dataset (left); some examples used in ST-CNN (right)
  • ST-CNN Single: only one ST at the beginning of the network.
  • ST-CNN Multi: one ST before each conv layer.
  • Affine transformation is used here.
  • Similarly, ST-CNN outperforms Maxout and CNN. (I have a very brief introduction to Maxout in my NoC review; please read it if interested.)
  • And ST-CNN Multi slightly outperforms ST-CNN Single.

4.3. Fine-Grained Classification

Fine-Grained Bird Classification: Accuracy (left), 2×ST-CNN (top right row), 4×ST-CNN (bottom right row)
  • Here, an ImageNet-pretrained Inception-v2 is used as the baseline for classifying 200 bird species, which achieves 82.3% accuracy.
  • 2/4×ST-CNN: 2 or 4 parallel STs, with higher accuracy.
  • It is interesting that one ST (red) has learnt to be a head detector, while the other 3 STs (green) learn the central part of the body of a bird.

5. Some Other Tasks

5.1. MNIST Addition

MNIST Addition
  • 2×ST-CNN: It is interesting that each ST learns to transform one of the digits, even though each ST receives both input digits.

5.2. Co-localisation

Co-localisation
  • Triplet loss: A hinge loss is used to enforce that the distance between the two outputs of the STs is less than the distance to a random crop, which encourages the spatial transformers to localise the common object. (A sketch is given below.)
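A sketch of such a hinge loss (my own formulation, assuming we compare encodings e1, e2 of the two ST outputs against an encoding e_rand of a random crop):

    import torch

    def co_localisation_hinge_loss(e1, e2, e_rand, margin=1.0):
        """Push the distance between the two ST outputs below the
        distance to a random crop by at least `margin`."""
        d_pos = (e1 - e2).pow(2).sum(-1)
        d_neg = (e1 - e_rand).pow(2).sum(-1)
        return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()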

5.3. Higher Dimensional Transformers

  • STN can also be extended to 3D affine transformations.

There are different network architectures and settings for different datasets; it is better to visit the paper if you want to know the details. Next, I will probably review Deformable Convolutional Networks.
