Scale-Equivariant CNNs

How to design a network for scale-equivariance

Mark Lukacs
Towards Data Science


Note: Since it’s not possible to have inline equations, we made the formulas in a hacky way. However, we’ve noticed that on some devices the rendering fails. If the exponents of “a⁻², a⁻¹, a⁰, a¹, a²” do not read as −2 to 2, then we recommend checking out our draft, made in LaTeX.

Convolutional neural networks (CNNs) have proven to be a dominant force in computer vision tasks and have a broad range of applications, e.g. in image and video recognition, image classification, and image segmentation. CNNs are designed to automatically and adaptively learn spatial hierarchies of features through backpropagation, using multiple building blocks such as convolutional layers, pooling layers, and fully connected layers. An interesting property of CNNs is that, by design, a CNN is translation equivariant, i.e. a translation of the input features results in an equivalent translation of the outputs. As an intuitive example of what this means, if a CNN has been trained to detect the presence of a cat in an image, then, due to its translation equivariance, it will be able to recognize the cat regardless of where the cat appears in the image. This is achieved by weight sharing across the receptive fields of the neurons in the convolutional layers. Thus, in the figure below, if a cat image is shifted, a CNN will still be able to detect the cat.

Example of the translation equivariance property of CNNs; the ability to recognize the cat, regardless of how the image is shifted. [Image by authors (cat by Antonio Gravante — Shutterstock.com)]

Now, what happens when the same objects appear at different scales in images? For example, what would happen if cats appeared at different scales? Would a regular CNN still be able to recognize the cats? Unfortunately, although CNNs have very powerful and interesting properties, they are not designed to be equivariant to either rotations or scale changes of the input. This is a problem, right? Because in real-life applications these types of input transformations happen all the time (think of images processed by an autonomous vehicle), it is desirable to detect pedestrians regardless of the scale at which they appear.

Example of scaled objects; CNNs are not inherently designed to recognize the same objects at different scales. [Image by authors (cat by Antonio Gravante — Shutterstock.com)]

So… how can this problem be tackled? In 2019, a very interesting solution was proposed by Ivan Sosnovik, Michał Szmaja, and Arnold Smeulders. They published a paper titled “Scale-Equivariant Steerable Networks” that addresses this issue. Not only did they address it, they also managed to do so at a computational cost comparable to ‘vanilla’ CNNs, obtaining state-of-the-art results on the MNIST and STL-10 datasets. As we found the paper very interesting and exciting, in the following sections we analyze their contributions, provide some intuition behind their method, and, last but not least, present our attempt at replicating their results.

What is scale-equivariance?

As previously mentioned, one of the most important reasons why CNNs excel in computer vision tasks is that convolutional layers are translation equivariant. This means that if we shift an input image by (x’, y’) pixels, the output of the layer also shifts by (x’, y’) pixels.

Example of the translation equivariance property of CNNs; the ability to recognize the cat, regardless of how the image is shifted. [Image by authors, inspired by source, cat from STL-10]

A special case of equivariance is invariance. Invariance means that no matter how we transform the input, the output remains the same. In CNNs, the transition from equivariance to invariance happens in the pooling layers. For example, if the biggest value in a 3x3 pooling block is in its center, an input shift of 1 pixel doesn’t change the output of that block. An important remark has to be made, though: pooling is only quasi-invariant, and equivariance is limited by edge effects in CNNs.
Now, let’s imagine the same for scaling an image: if the input image is scaled up/down, the output should also be scaled up/down. As previously mentioned, a convolutional layer doesn’t have this property by default. To tackle this issue, scale-equivariant layers have to be defined; such layers respond to scale changes of the input in the same way that a regular convolutional layer responds to input shifts. Scale-equivariance is derived from a more general mathematical concept: group-equivariance. Roughly speaking, a group-equivariant layer has the property that if its input is transformed by a group element g, its output is transformed by g as well. Here g can come from any transformation group, for example translations, rotations, scalings, or combinations of these. By designing G-equivariant layers, we can further increase the weight sharing in a meaningful way.
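To make this concrete, here is a minimal PyTorch check (our own sketch, not code from the paper) showing that a randomly initialized convolutional layer commutes with translations, but not with rescaling of the input:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Circular padding makes the layer exactly equivariant to cyclic shifts.
conv = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)
image = torch.rand(1, 1, 32, 32)

# Translation: shifting then convolving equals convolving then shifting.
shift = lambda x: torch.roll(x, shifts=(3, 5), dims=(-2, -1))
diff_shift = (conv(shift(image)) - shift(conv(image))).abs().max()
print(diff_shift.item())  # ~1e-7, i.e. equivariant up to floating-point error

# Scaling: downscaling then convolving differs from convolving then downscaling.
scale = lambda x: F.interpolate(x, scale_factor=0.5, mode='bilinear', align_corners=False)
diff_scale = (conv(scale(image)) - scale(conv(image))).abs().max()
print(diff_scale.item())  # clearly non-zero: a plain conv layer is not scale-equivariant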

How should we approach this problem?

Mathematical details

Reading the paper for the first time, we see that the GitHub repository is public; we are happy that we can steal (I mean, reuse) the code and have a free lunch… Well, before doing that, let’s understand what’s happening on a mathematical and algorithmic level.
The authors of the paper define everything for 1-dimensional signals, followed by the infamous “Generalization to higher-dimensional cases is straightforward”. We think that an intuitive explanation isn’t necessarily one-dimensional, but we promise to keep the number of dimensions low.

For starters, let’s understand what a steerable filter is: a steerable filter is a type of kernel whose scale can be changed easily through a parameter. The mathematical definition provided by the paper (for a filter ψ and scale σ) is:

ψ_σ(x) = σ⁻¹ ψ(σ⁻¹x)

The inner σ⁻¹ scales the filter, while the outer σ⁻¹ normalizes it. In this way we can rescale an arbitrary filter ψ(x) by the parameter σ. The next figure provides some intuitive visualization:

The same ψ(x, y) 2D Gaussian filter with σ=0.5 on the left and σ=1 on the right. [Image by authors]
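As a small numerical illustration (our own sketch) of ψ_σ(x) = σ⁻¹ψ(σ⁻¹x): the inner σ⁻¹ stretches or shrinks the filter, while the outer σ⁻¹ keeps its total mass constant. For a 1D Gaussian base filter:

import torch

def psi(x):
    # An arbitrary base filter: a standard 1D Gaussian.
    return torch.exp(-0.5 * x**2) / (2 * torch.pi) ** 0.5

def psi_sigma(x, sigma):
    # Rescaled filter: the inner 1/sigma scales it, the outer 1/sigma normalizes it.
    return psi(x / sigma) / sigma

x = torch.linspace(-10, 10, 2001)
dx = (x[1] - x[0]).item()
for sigma in (0.5, 1.0, 2.0):
    mass = (psi_sigma(x, sigma) * dx).sum().item()
    print(f"sigma={sigma}: integral ≈ {mass:.4f}")  # stays ≈ 1 for every sigma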

The second thing we should understand is the scale-translation group. The group is denoted by H and defined as a scaling operation followed by a translation operation. It is composed of two sub-groups, namely S and T.

The scaling group is denoted by S. The group operation is scaling, represented as multiplication by s; the inverse of an element s is s⁻¹ = 1/s, and the unit element is 1, since s · (1/s) = 1.
However, S is defined as a discrete scale group (to make it more manageable mathematically in the Haar integral), consisting of the elements […, a⁻², a⁻¹, 1, a¹, a², …], where a ∈ ℝ is a parameter of the model. In that case the inverse of aⁿ is a⁻ⁿ.

The translation group is denoted by T. The group operation is, obviously, translation, represented as addition of t. The inverse of an element t is −t, and the unit element is 0, since t + (−t) = 0. The translation group is kept continuous, instead of being discretized to multiples of 1 pixel, because the continuous convolution on T is mathematically well defined (and it can be discretized afterwards).

To apply a group element to an input function, we transform the function’s argument via the group operation (i.e. multiply or add to x, respectively). A nice property of these two groups is that their semidirect product can easily be defined. A direct product is the group-theory equivalent of a Cartesian product; a semidirect product is a generalization of the direct product. This operation can be imagined as an outer product of 2 vectors. The formal definition is H = {(s, t) ∣ s ∈ S, t ∈ T}, which means that a transformation h is a scaling s followed by a translation t. The group operation is (s₂, t₂)⋅(s₁, t₁) = (s₂s₁, s₂t₁ + t₂). Rearranging (s₂, t₂)⁻¹⋅(s₁, t₁) = (s₂⁻¹s₁, s₂⁻¹(t₁ − t₂)), we can find the inverse element, which is (s, t)⁻¹ = (s⁻¹, −s⁻¹t). The unit element is (1, 0).
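To convince ourselves that these formulas are consistent, here is a tiny, purely illustrative sketch (our own) of the group operation, the inverse, and the unit element of H:

def compose(h2, h1):
    # Group operation on H: (s2, t2) . (s1, t1) = (s2*s1, s2*t1 + t2)
    (s2, t2), (s1, t1) = h2, h1
    return (s2 * s1, s2 * t1 + t2)

def inverse(h):
    # Inverse element: (s, t)^-1 = (1/s, -t/s)
    s, t = h
    return (1 / s, -t / s)

h = (2.0, 3.0)                   # scale by 2, then translate by 3
print(compose(h, inverse(h)))    # (1.0, 0.0) -- the unit element
print(compose(inverse(h), h))    # (1.0, 0.0)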

By defining the group H, we transformed the problem of finding a scale-equivariant convolution into that of finding a group-equivariant convolution. We went from a specific problem to a general one, in the hope that a solution exists for the general case. Luckily, we followed the citations of the paper and found the definition of the group-equivariant convolution:

[f ⋆ ψ](g) = ∫_G f(g’) ψ(g⁻¹g’) dμ(g’)

The mathematical details get quite involved: multidimensional calculus and group theory are the minimum needed to get an idea of what is going on. Therefore, we wouldn’t recommend looking them up, but, if you choose to do so, we found this, this, and this to be great additional sources.
In the group-equivariant convolution, f(g’) denotes our input signal, which corresponds to the image, or to the scale-equivariant input at later layers.
L_g[ψ](g’) (or ψ(g⁻¹g’) after the transformation) is the filter, and μ(g’) denotes the Haar measure. After a bunch of mathematical transformations, we arrive at:

One thing we haven’t talked about yet is channels. Regular convolutional layers sum over the input channels and have a different filter for each input-output channel pair. So let’s make our formula do exactly the same. In the equation below, Cᵢₙ and Cₒᵤₜ are the number of input and output channels, respectively.
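The two equations referenced above were rendered as images in the original post. Reconstructed from the definitions above (our own write-up; the paper may normalize or absorb the Haar measure factor differently), they would read roughly as follows, using g = (s, t), g’ = (s’, t’), and g⁻¹g’ = (s⁻¹s’, s⁻¹(t’ − t)):

\begin{aligned}
% Group convolution specialized to H = S \ltimes T
[f \star_H \psi](s, t) &= \sum_{s' \in S} \int_{\mathbb{R}} f(s', t')\,
  \psi\!\left(s^{-1}s',\, s^{-1}(t' - t)\right) d\mu(t') \\[4pt]
% With channels: sum over input channels, one filter per (out, in) channel pair
[f \star_H \psi]_{c_{\mathrm{out}}}(s, t) &= \sum_{c_{\mathrm{in}}=1}^{C_{\mathrm{in}}}
  \sum_{s' \in S} \int_{\mathbb{R}} f_{c_{\mathrm{in}}}(s', t')\,
  \psi_{c_{\mathrm{out}},\, c_{\mathrm{in}}}\!\left(s^{-1}s',\, s^{-1}(t' - t)\right) d\mu(t')
\end{aligned}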

Algorithm

The aforementioned equation can’t be implemented directly in code. Firstly, in the filter ψ the S-group is infinite; it needs to be limited. Let Nₛ (later also denoted as S) be this limit, so that the group S becomes the finite set [a⁰, a¹, …, a^Nₛ]. Secondly, because the input image is defined on ℤ² instead of ℝ², the group T has to be discretized.

Following the authors’ choice, instead of defining each pixel in the kernel as a weight, each filter is composed as a linear combination of a complete basis. When constructing functions this way, the dimension of the basis is often infinite (think of Taylor or Fourier series), so we limit the algorithm to an Nb-dimensional basis. This way, we can set the weights of the network to be the coefficients of the linear combination. Mathematically speaking: ψ = ∑ᵢ wᵢψᵢ, where wᵢ is a weight and ψᵢ is a basis function. After this trick we can implement the magical equivariant convolution (a toy sketch of such a basis follows the notes below).

Note 1: According to the paper, a basis of 2D Hermite polynomials with a 2D Gaussian envelope works well enough.
Note 2: This way, especially at larger filter sizes, the weight sharing is intensified. For example, a 7 by 7 convolutional kernel with 4th-order Hermite polynomials takes only 10 parameters instead of the regular 49. While we can’t rigorously argue whether this extra weight sharing makes sense, our intuition tells us that it does; it could be another source of the increased accuracy, alongside the scale-equivariant layers.
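As a rough illustration of such a basis (our own toy sketch: we use products of low-order Hermite polynomials under a Gaussian envelope, which only approximates the spirit of the paper’s construction), here is how a prescaled basis tensor of shape [Nb, S, V, V] could be built:

import torch

def make_basis(V=7, scales=(1.0, 1.26, 1.59), order=2):
    # Toy multi-scale filter basis of shape [Nb, S, V, V].
    # Basis functions are products of 1D Hermite polynomials (total order <= `order`)
    # under a Gaussian envelope, evaluated on a V x V grid for each scale.
    r = torch.arange(V, dtype=torch.float32) - (V - 1) / 2
    y, x = torch.meshgrid(r, r, indexing="ij")
    hermite = [lambda z: torch.ones_like(z), lambda z: 2 * z, lambda z: 4 * z**2 - 2]
    basis = []
    for i in range(order + 1):
        for j in range(order + 1 - i):
            per_scale = []
            for s in scales:
                env = torch.exp(-(x**2 + y**2) / (2 * s**2))
                per_scale.append(hermite[i](x / s) * hermite[j](y / s) * env / s)
            basis.append(torch.stack(per_scale))   # [S, V, V]
    return torch.stack(basis)                      # [Nb, S, V, V]

bases = make_basis()
print(bases.shape)  # torch.Size([6, 3, 7, 7]) -> Nb=6, S=3, V=7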

Now, talking about the implementation: for each filter, we have a weight vector of length Nb (the coefficients of the linear combination) and the prescaled filter basis of shape [Nb, S, V, V], where Nb is the number of basis functions, S is the number of scalings, and V is the spatial (x, y) size of the filter.

To cover every (Cₒᵤₜ, Cᵢₙ) pair, this can be implemented efficiently by storing the weights as a tensor of shape [Cₒᵤₜ, Cᵢₙ, Nb] and multiplying it with the precalculated bases of shape [Nb, S, V, V], summing over the Nb dimension. For example, by using the function:

torch.einsum('ijk,klmn->ijlmn', weights, bases)

As the output of this operation, we obtain the filters in a tensor of shape [Cₒᵤₜ, Cᵢₙ, S, V, V], denoted as κ. This is visualized in the next figure:

Visualization of the filter basis for a single Cₒᵤₜ→Cᵢₙ channel pair. [Image adapted from the paper]
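Putting this together (our sketch, with made-up channel counts and random placeholder tensors), the weights and the prescaled basis are combined into κ like this:

import torch

C_out, C_in, Nb, S, V = 8, 3, 6, 3, 7
weights = torch.randn(C_out, C_in, Nb)   # learnable coefficients of the linear combination
bases = torch.randn(Nb, S, V, V)         # precomputed, fixed filter basis (see sketch above)

# Linear combination over the Nb dimension:
kappa = torch.einsum('ijk,klmn->ijlmn', weights, bases)
print(kappa.shape)  # torch.Size([8, 3, 3, 7, 7]) == [C_out, C_in, S, V, V]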

Using the scale-translation equivariant convolution’s equation, we can define 2 scenarios. We tried to follow the authors’ notation: we refer to TH for cases where the input of the layer has a scale dimension of 1 (also known as an “image”) and the filter has multiple scale components, and to HH for cases where both the input of the layer and the filter have multiple scale components.

For the TH case, the paper mentions that “the summation over S degenerates”. We felt this line to be a little too ambiguous, so we provide an alternative explanation: for each s in the scale dimension, we perform a regular convolution between κ[:,:,s,:,:] and the input image. The results of these convolutions are stored as an array of images, thereby producing an operation leading from group T to H. To leverage the already optimized PyTorch routines, TH can be implemented in the following form:

convTH(f, w, ψ) = squeeze(conv2d(f, expand(w × ψ)))

In this case, the κ filters are expanded from the shape [Cₒᵤₜ, Cᵢₙ, S, V, V] to the shape [CₒᵤₜS, Cᵢₙ, V, V]. After this, the expanded filter bank is convolved with the input image (which has the shape [Cᵢₙ, U, U], where U is the size of the image). This convolution yields a tensor of shape [CₒᵤₜS, U, U], which is then “squeezed” (reshaped) into [Cₒᵤₜ, S, U, U]. While we have implemented this layer from scratch, we have not used it; due to time constraints, we were forced to choose another approach in order to meet the project’s deadline.

Note 1: For the shape of the expanded κ, the paper defines the size [Cₒᵤₜ, CᵢₙS, V, V] instead of [CₒᵤₜS, Cᵢₙ, V, V]. We believe the authors made quite a painful typo here.
Note 2: The input image has a size of [U, U] because the datasets used for benchmarking consist of square images. Nothing prevents it from having a shape of [U₁, U₂], but we’ve decided to follow the authors’ notation.
Note 3: When applying the convolution, the output doesn’t always have the size [U, U]; it is affected by padding, kernel size, and stride in the usual way.

Visualization of convolution TH. Spatial components are hidden for simplicity. [Image adapted from the paper]
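To make the TH case concrete, here is a minimal PyTorch sketch of the expand/conv2d/squeeze steps (our own code, assuming a batch dimension and “same” padding; not the authors’ implementation):

import torch
import torch.nn.functional as F

def conv_TH(image, kappa):
    # image: [B, C_in, U, U]        -- input with a single scale, i.e. a plain image
    # kappa: [C_out, C_in, S, V, V] -- filters obtained from the basis expansion
    # returns: [B, C_out, S, U, U]
    C_out, C_in, S, V, _ = kappa.shape
    # "expand": fold the scale axis into the output-channel axis
    flat = kappa.permute(0, 2, 1, 3, 4).reshape(C_out * S, C_in, V, V)
    out = F.conv2d(image, flat, padding=V // 2)   # [B, C_out*S, U, U]
    # "squeeze": unfold the scale axis again
    B, _, U1, U2 = out.shape
    return out.reshape(B, C_out, S, U1, U2)

image = torch.randn(2, 3, 32, 32)
kappa = torch.randn(8, 3, 3, 7, 7)
print(conv_TH(image, kappa).shape)  # torch.Size([2, 8, 3, 32, 32])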

The HH case can be imagined as doing a convolution in the directions of x, y, and s, where the s direction refers to inter-scale interaction. However, as this would extend the mathematical presentation even further, we would like to think that we have provided enough intuition for TH for one to be able to understand HH on their own.

Data preprocessing and augmentation

For the hands-on experiments, we had to preprocess and augment the data for two datasets, namely MNIST and STL-10. For each, we followed the specific steps mentioned in the paper.

MNIST

We rescaled the MNIST images using a uniformly sampled factor between 0.3 and 1, and padded the images with zeros to retain the resolution of the initial images.

Example of rescaling MNIST image. [Image by authors]
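A minimal sketch of this rescale-and-pad augmentation (our own code; the exact interpolation mode and digit placement may differ from the authors’ script):

import torch
import torch.nn.functional as F

def random_rescale(img, min_scale=0.3, max_scale=1.0):
    # img: [1, 28, 28] MNIST digit -> rescaled digit, zero-padded back to 28x28
    scale = torch.empty(1).uniform_(min_scale, max_scale).item()
    new_size = max(1, int(round(28 * scale)))
    small = F.interpolate(img.unsqueeze(0), size=new_size, mode='bilinear',
                          align_corners=False).squeeze(0)          # [1, new, new]
    pad = 28 - new_size
    left, top = pad // 2, pad // 2
    return F.pad(small, (left, pad - left, top, pad - top))        # [1, 28, 28]

digit = torch.rand(1, 28, 28)
print(random_rescale(digit).shape)  # torch.Size([1, 28, 28])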

Furthermore, we generated 6 realizations of this dataset, each with 10,000 images for training, 2,000 for evaluation, and 48,000 for testing.

STL-10

Similarly, for the STL-10 dataset, we followed the paper’s instructions for getting the data ready for the experiments. We normalized the images by subtracting the per-channel mean and dividing by the per-channel standard deviation, and augmented them by applying 12-pixel zero padding followed by random cropping back to the 96x96 px dimensions. We also used random horizontal flips (50% probability) and cutout with 1 hole of 32 pixels.

Data processing on an STL-10 image. [Image by authors]
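A sketch of this pipeline with torchvision (our own code; the per-channel statistics below are placeholders for the values computed on the training set, and the Cutout transform is a small hand-rolled stand-in):

import torch
from torchvision import transforms

class Cutout:
    # Zero out one square hole of `size` x `size` pixels at a random location.
    def __init__(self, size=32):
        self.size = size
    def __call__(self, img):                        # img: [C, H, W] tensor
        _, h, w = img.shape
        cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
        y0, y1 = max(0, cy - self.size // 2), min(h, cy + self.size // 2)
        x0, x1 = max(0, cx - self.size // 2), min(w, cx + self.size // 2)
        img = img.clone()
        img[:, y0:y1, x0:x1] = 0.0
        return img

MEAN, STD = (0.447, 0.440, 0.407), (0.260, 0.257, 0.271)   # placeholder STL-10 stats

train_transform = transforms.Compose([
    transforms.RandomCrop(96, padding=12),          # 12 px zero padding + random crop
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
    Cutout(size=32),                                # 1 hole of 32 px
])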

Running the code, we were able to replicate the experiments for the MNIST dataset; after recreating the modified MNIST dataset (presented in the previous section), we reproduced results that are similar to the authors’, with small differences. Although the code for STL-10 was available, we were unable to replicate those results, because we did not have enough computational resources to run the experiments.

For the MNIST dataset, we ran the experiments on the following models:
mnist_ses_scalar_28
mnist_ses_scalar_56
mnist_ses_vector_28
mnist_ses_vector_56
mnist_ses_scalar_28p
mnist_ses_scalar_56p
mnist_ses_vector_28p
mnist_ses_vector_56p

For each of the models listed above, 2 experiments were conducted: one with the scaling factor set to 1 and one with the scaling factor set to 0.5. Therefore, for each model we obtained 12 results (one per realization for each scaling factor). The results were stored in “results.yml”, which we further processed using a Python script we wrote. The results, similar to the authors’, are displayed in the table below.

Replicated results for the MNIST dataset. The ‘+’ denotes scaling data augmentation.

The STL-10 models defined by the authors had 11M parameters. We tried to run these with a reduced batch size (to fit into our GPU limit of 12 GB), but training was too slow to produce meaningful results. Consequently, these were not included.

Conclusion

We hope that by reading this you now have a better understanding of what scale-equivariance is and why designing scale-equivariant CNNs is valuable. Although, at first glance, the paper looked really “mathy”, we can say that after spending countless hours on Wikipedia articles trying to extend our knowledge of group theory, we kind of understood the authors’ intent. We believe that G-equivariant convolutions (such as scale-equivariant ones) are a promising direction for CNNs, as their applicability is undeniable. In conclusion, for this project we reproduced the data augmentation steps, re-ran the MNIST experiments, and obtained quasi-equivalent results. We tried to run the STL-10 experiments, but the lack of computational resources got the better of us. We also tried to implement the layers themselves; although we weren’t able to test it, we believe that we successfully implemented the T→H layer, but ran out of time for the rest.

Acknowledgments

This blog was created in the context of TU Delft’s CS4240 Deep Learning course, as a group project.
We would like to thank Nergis Tömen and Tomasz Motyka for their advice and helpful insights.

Useful links:
Link to the original paper
Link to this project’s website
Link to our GitHub repository

Authors:
Mark Erik Lukacs
GitHub
Stefan Petrescu
GitHub

If you liked this article, don’t forget to share it!
