
Convolutional neural networks (CNNs) are widespread these days. Despite their success, convolutions are inefficient. The sliding window requires many computations and limits the size of the kernel. At the same time, a small kernel, typically between [3,3] and [7,7], limits the receptive field, and many layers are required to capture the global context of an input tensor (e.g. 2D images). The larger the image, the worse the impact of small filters becomes. That’s why you will have a hard time finding CNNs that take high-resolution images as input.
What if I told you there is a way to scale your kernel size to [1024,1024] and beyond, that you can increase the kernel size for a given input resolution with almost no impact on inference time, and that there is a way to drastically decrease the spatial dimension of a feature map without losing almost any information, unlike methods such as max-pooling?
All these promises are based upon a simple mathematical property: the convolution theorem (the cross-correlation theorem, to be accurate) of the Fourier transform, and I will show you how to exploit it the right way!
Note: Along with this article I published a notebook on my GitHub with all the code
Outline
- The Flaws of Convolutions
- 2D Discrete Fourier Transformation to the Rescue
- Discovering 2D DFT Spectra
- Implementation in TensorFlow
- Conclusion
- Further Readings and Links
The Flaws of Convolutions
Let’s recap some basics. A convolution is a mathematical operation applied to two functions. Let’s start with the one-dimensional case:
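Written out, the one-dimensional convolution of a signal x with a filter h reads (with τ as the integration variable):

(x \ast h)(t) = \int_{-\infty}^{+\infty} x(\tau) \, h(t - \tau) \, d\tau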


In other words: take two signals, leave one as it is and flip the other around the coordinate axis. Shift the flipped signal across the fixed signal from minus infinity to plus infinity (or until all non-zero parts of the signals have overlapped). For each step, calculate the element-wise product and sum all values. The sum is the result of the convolution at this particular step. The convolution can also be applied to two-dimensional signals (e.g. images):
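For images, the same operation runs over two spatial variables:

(x \ast h)(u, v) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} x(\alpha, \beta) \, h(u - \alpha, v - \beta) \, d\alpha \, d\beta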


But why did I mention the cross-correlation before? Well, that’s because the convolution and the cross-correlation are actually computed in the same way, the only difference being that the filter is flipped. This is indicated by the different sign:
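For real-valued signals the cross-correlation reads as follows; compare the argument of h with the convolution above — the filter is simply not flipped:

(x \star h)(t) = \int_{-\infty}^{+\infty} x(\tau) \, h(\tau - t) \, d\tau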


TensorFlow and PyTorch actually calculate the cross-correlation of an input signal and a learnable filter rather than the convolution. Does it matter? Not really! Since the filter is learned by the network, it does not matter whether the filter is flipped or not. The network will figure out itself what is best. Not flipping the filter even saves some computation. There is a difference, though, if the filter is fixed: if you load a trained model, you should know whether it was trained using cross-correlation or convolution and flip the filter’s weights if necessary.
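You can convince yourself of this with a minimal sketch (my own example, not from the article’s notebook): flipping the kernel along both spatial axes turns TensorFlow’s cross-correlation into a true convolution.

import tensorflow as tf

# Random single-channel image in NHWC layout and a 3x3 kernel in HWIO layout.
image = tf.random.normal([1, 8, 8, 1])
kernel = tf.random.normal([3, 3, 1, 1])

# tf.nn.conv2d slides the kernel as-is, i.e. it computes a cross-correlation.
cross_correlation = tf.nn.conv2d(image, kernel, strides=1, padding="SAME")

# Reversing the kernel along both spatial axes yields a true convolution.
flipped_kernel = tf.reverse(kernel, axis=[0, 1])
convolution = tf.nn.conv2d(image, flipped_kernel, strides=1, padding="SAME")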
Two things you should remember from this:
- A lot of computations are required to calculate a single point in the output sequence.
- The larger the input signal (i.e. the higher the resolution of an image), the more often the filter must be shifted, hence more calculations are required. The same applies to a larger filter.
More computations mean more memory and a larger latency until the result is available. The consequences for CNNs are obvious: lower input resolutions and smaller filters. The issue: fewer pixels mean fewer details, and smaller filters result in smaller receptive fields. Networks need multiple consecutive convolutional layers to increase the receptive field. Networks become deeper, which again brings new challenges during training.

2D Discrete Fourier Transformation to the Rescue
Mathematically speaking, the Fourier transform of a real or complex function x(t) of the time variable t is a complex function X(f) of the real frequency variable f:
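Using j as the imaginary unit, it is defined as:

X(f) = \int_{-\infty}^{+\infty} x(t) \, e^{-j 2 \pi f t} \, dt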

You could also say we project a signal from the time domain into the frequency domain. By doing so, we can benefit from the Fourier transform’s special properties, i.e. the convolution theorem and the correlation theorem.
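With X(f) and H(f) denoting the Fourier transforms of x and h, and H*(f) the complex conjugate, they read (for real-valued signals):

\mathcal{F}\{x \ast h\} = X(f) \cdot H(f) \qquad \mathcal{F}\{x \star h\} = X(f) \cdot H^{*}(f)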


These properties are very important and are the foundation of this article: a convolution/correlation in the time domain corresponds to a simple element-wise multiplication in the frequency domain. But what is so thrilling about this? As mentioned before, a convolution requires many computations, especially for large images and filters. Its complexity scales quadratically with the sequence length, i.e. O(N²). Following the convolution theorem, we only need to perform an element-wise multiplication of the transformed input and the transformed filter. There are efficient algorithms to calculate the Fourier transform, i.e. the fast Fourier transform (FFT), which reduces the complexity down to O(N log(N)). And the best part: as long as the filter is smaller than the input signal, the computational requirements are constant. It doesn’t matter if our filter has a kernel size of [3,3] or [1024,1024]. Details follow in the section on the implementation in TensorFlow.
The Fourier transform also exists for real or complex discrete signals x[k], assigning a complex discrete signal X[n] of the discrete frequency variable n:
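For a discrete signal of length N it is defined as:

X[n] = \sum_{k=0}^{N-1} x[k] \, e^{-j 2 \pi \frac{k n}{N}}, \quad n = 0, \dots, N-1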


The discrete Fourier transform (DFT) is predestined for digital signal processing since a computer stores signals in discrete values.
Note: A discrete signal is time-discrete and value-discrete. Time-discrete because it is sampled at certain time intervals, and value-discrete because each value is represented by a certain number of bits, e.g. 32 bits for INT32.
There are some implications that we need to keep in mind when working with the DFT:
- The input signal is assumed to be periodic and a complete period is assumed to be sampled
- The resulting spectrum is periodic
Note: An image can be interpreted as a spatial signal rather than a time signal. On a computer, an image is space-discrete, because its values are stored in pixels that are sampled from an image sensor with spatially distributed cells or are digitized.
The two-dimensional DFT (as well as the 2D continuous Fourier transform) can be separated into sequential 1D DFTs, where row and columns can be calculated separately.
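For an image with K rows and L columns, the inner sum below is a 1D DFT along the rows and the outer sum a 1D DFT along the resulting columns:

X[m, n] = \sum_{k=0}^{K-1} \left( \sum_{l=0}^{L-1} x[k, l] \, e^{-j 2 \pi \frac{l n}{L}} \right) e^{-j 2 \pi \frac{k m}{K}}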

This has at least two advantages: first, one can reuse the algorithm of the 1D DFT and second, it helps to build an intuition for the 2D DFT, since rows and columns can be interpreted individually.
But of course, there is one minor detail with the discrete Fourier Transform: The convolution theorem does not hold for the DFT.
The multiplication of two signals’ DFTs corresponds to their circular convolution denoted by the operator ⊛, not their linear convolution.
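In formula form:

\mathrm{DFT}\{x \circledast h\} = X[n] \cdot H[n]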

The circular convolution is a periodic signal that repeats with the signal length N, whereas the linear convolution has a length of (N+F-1), where F is the length of the filter signal. So if you blindly take the product in the frequency domain, you squeeze a signal of length (N+F-1) into length N. This can be seen as aliasing in the time domain, which creates undesired artefacts in your end result. Fortunately, the circular and the linear convolution share some of the values, namely (N-F+1) of them. The remaining (F-1) values are wrapped around and interfere with other values of the signal.
Here is an idea: what if the values the wrapped values interfere with were zero, meaning there is nothing to interfere with? Then we could reconstruct the linear convolution from the circular convolution. When the signal is padded with at least (F-1) zeros, the wrapped values do not interfere with real values. We can then circularly shift the wrapped values back to their positions and crop the padded values. Details will be shown in the implementation section.
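Here is a minimal 1D sketch of that idea (my own example, not from the article’s notebook): padding both signals to N+F-1 before the FFT makes the frequency-domain product equal to the linear convolution.

import tensorflow as tf

x = tf.constant([1., 2., 3., 4., 5., 6.])  # signal, N = 6
h = tf.constant([1., 0., -1.])             # filter, F = 3

# Transform both signals with an FFT length of N + F - 1 to avoid time-domain aliasing.
fft_length = [x.shape[0] + h.shape[0] - 1]
X = tf.signal.rfft(x, fft_length=fft_length)
H = tf.signal.rfft(h, fft_length=fft_length)

# Element-wise product in the frequency domain, transformed back to the time domain.
linear_conv = tf.signal.irfft(X * H, fft_length=fft_length)

# Reference: direct "full" linear convolution via tf.nn.conv1d with a flipped filter.
x_padded = tf.pad(x, [[h.shape[0] - 1, h.shape[0] - 1]])
reference = tf.nn.conv1d(tf.reshape(x_padded, [1, -1, 1]),
                         tf.reshape(h[::-1], [-1, 1, 1]),
                         stride=1, padding="VALID")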
I will spare you the mathematical details and link to further resources at the end of the article.
Discovering 2D DFT Spectra
Now that we have covered the theory, let’s look at some 2D spectra and strengthen our intuition for the 2D Fourier transform.
Fundamental Test Signals and the Implications for CNNs
Consider an image whose pixel intensities follow a diagonal sine wave. What amplitude spectrum would you expect? As mentioned before, the 2D Fourier transform can be calculated by separating it into multiple 1D Fourier transforms along each axis of the image. If you imagine walking along the horizontal axis, you will encounter a repetitive pattern. The same is true if you imagine walking along the vertical axis. So it’s natural to expect a high value in the spectrum within the fourth quadrant (bottom-right), the quadrant with positive-valued frequency components.

Note: the 2D amplitude spectrum is usually scaled with a log function when plotted, since images have a high offset regardless of their content, because they are usually represented as unsigned integers, which can only represent positive values.
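As a rough sketch of how such a plot can be generated (my own helper, assuming a single-channel 2D float image), the centered log-amplitude spectrum can be computed like this:

import tensorflow as tf

def log_amplitude_spectrum(image):
    # 2D FFT of the (complex-cast) image.
    spectrum = tf.signal.fft2d(tf.cast(image, tf.complex64))
    # Shift the zero-frequency (offset) component to the center of the spectrum.
    spectrum = tf.signal.fftshift(spectrum)
    # Log-scale the magnitude; the +1 avoids log(0).
    return tf.math.log(tf.abs(spectrum) + 1.0)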
Now, let’s consider an input image of a rectangle with different side lengths. If you again imagine walking along each axis, you will encounter a rectangle with a short pulse width on the horizontal axis and a rectangle with a wider pulse width along the vertical axis. If you are familiar with signal theory, you would immediately expect the spectrum to contain some sort of sinc function, where sinc(x)=sin(x)/x.

If you expected a sinc function, you were totally right. The spectrum consists of sinc functions along both axes. Here you can make a fundamental observation: the horizontal axis has higher frequency components than the vertical axis, and the zero crossings are more spread out along the horizontal axis. This observation has two implications:
- Narrow spatial features in the input image have high frequency components in the amplitude spectrum, hence they have a high bandwidth. High bandwidth filters are prone to noise.
- The spectrum scales inversely with the spatial length of features in the input image. Narrow features result in wide spectra and wide features result in narrow spectra.
What does this imply for our CNNs with small filters of, let’s say, 3×3 pixels? According to the observation above, CNNs with small filters act as high bandwidth filters and are therefore prone to input noise. The larger the filter size, the lower the bandwidth of the filter and the more selective it is.
2D DFTs of Images and Filtering in Frequency Domain
Now that we have talked about some fundamental signals, let us investigate 2D DFTs of real images.

The center of the spectrum represents a frequency of zero, also called the offset. The farther away from the center, the higher the frequency components of the input. Having this in mind, you can easily derive a high pass filter and a low pass filter.
A high pass filter suppresses low frequencies and keeps high frequency components. This behavior can be realized by a filter that has zeros close to the center and ones for values further away from the center. The filter is applied by multiplying the filter with the spectrum and then calculating the inverse Fourier transform.
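A minimal sketch of such a high pass filter (my own helper, assuming a single-channel 2D image with static shape, not the article’s exact code) could look like this:

import tensorflow as tf

def highpass_filter(image, cutoff_radius):
    h, w = image.shape
    # Centered spectrum of the image.
    spectrum = tf.signal.fftshift(tf.signal.fft2d(tf.cast(image, tf.complex64)))
    # Radial mask: 0 inside the cutoff radius (low frequencies), 1 outside (high frequencies).
    y, x = tf.meshgrid(tf.range(h, dtype=tf.float32),
                       tf.range(w, dtype=tf.float32), indexing="ij")
    radius = tf.sqrt((y - h / 2) ** 2 + (x - w / 2) ** 2)
    mask = tf.cast(radius > cutoff_radius, tf.float32)
    # Note: 1.0 - mask would give the complementary low pass filter.
    filtered = tf.signal.ifft2d(tf.signal.ifftshift(spectrum * tf.cast(mask, tf.complex64)))
    return tf.math.real(filtered)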

As you can see in the figure above, high pass filters can be used as edge detectors. An edge in an image is characterized by an abrupt change of pixel values, hence a high gradient. The higher the gradient, the higher the involved frequencies.
A low pass filter on the other hand suppresses high frequency components and keeps low frequency components. A low pass filter can be implemented by a mask with ones in the center region and zeros in the outer region.

A lowpass-filtered image appears blurry and loses its sharpness. A typical filter used in computer vision is the Gaussian filter. It is also a lowpass filter, but instead of an abrupt cut-off frequency, its attenuation increases gradually for higher frequencies. It smoothens the image. The image below is filtered with a Gaussian filter whose variance (sigma squared) is equal to the cutoff frequency of the circular lowpass filter used before.

The image is blurry but has less distortions. It appears smoother.
A quite interesting filter for machine learning applications is the rectangular filter. Convolutional neural networks often gradually decrease the spatial width and increase the number of channels. Pooling, such as max-pooling or average-pooling, is often used to decrease the spatial width. What if we pooled in the frequency domain instead?

By applying a rectangular filter in the frequency domain, we can drastically remove frequency components without having a big impact on the image quality in the spatial domain.
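One way to realize such frequency-domain pooling (a sketch under my own assumptions, not the article’s exact code) is to crop the centered spectrum to the target resolution and transform back:

import tensorflow as tf

def spectral_pool(image, out_h, out_w):
    h, w = image.shape
    # Centered spectrum of the single-channel input image.
    spectrum = tf.signal.fftshift(tf.signal.fft2d(tf.cast(image, tf.complex64)))
    # Keep only the central (low-frequency) region, i.e. apply a rectangular filter.
    top, left = (h - out_h) // 2, (w - out_w) // 2
    cropped = spectrum[top:top + out_h, left:left + out_w]
    pooled = tf.signal.ifft2d(tf.signal.ifftshift(cropped))
    # Rescale so the mean intensity survives the change of FFT length.
    return tf.math.real(pooled) * (out_h * out_w) / (h * w)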
The DFT has an interesting property for real inputs: it is conjugate symmetric with respect to the origin. This symmetry implies that the spectrum contains redundancies that can be omitted during calculation to further speed up the process. The figure below shows this transformation and its inverse, reconstructed from the reduced spectrum.
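In TensorFlow this corresponds to using rfft2d/irfft2d instead of fft2d/ifft2d; a small sketch shows the reduced spectrum size:

import tensorflow as tf

image = tf.random.normal([128, 128])

full_spectrum = tf.signal.fft2d(tf.cast(image, tf.complex64))  # shape [128, 128]
half_spectrum = tf.signal.rfft2d(image)                        # shape [128, 65], redundancy omitted

# The inverse real FFT reconstructs the full spatial resolution from the half spectrum.
reconstructed = tf.signal.irfft2d(half_spectrum, fft_length=[128, 128])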

Take away from this section:
- Low frequency components are in the center of the 2D spectrum, while high frequency components are farther away from the center
- You can take advantage of the symmetry property of the DFT to reduce the required computational resources and only calculate the rDFT.
- Filters of small size are prone to noise due to their high bandwidth.
Here is the notebook I’ve created to generate the plots:
ML_Notebooks/2D_FFTs.ipynb at main · sascha-kirch/ML_Notebooks
Implementation in TensorFlow
We have all we need to implement a linear convolution using the discrete Fourier transform.
On a high level, we need to implement the following six steps:
- Pad input image to avoid aliasing in the time domain
- Pad filter to image size to prepare element-wise multiplication
- Calculate 2D rFFT of input image and filter
- Element-wise multiplication of transformed input and transformed filter
- Calculate 2D inverse rFFT of filtered input to obtain circular convolution
- Reconstruct linear convolution from circular convolution
Step 1 – Pad Input Image
To avoid aliasing effects in the time domain we need to pad the image with at least (F-1) zeros, where F is the side length of the filter. Furthermore, the FFT algorithm that calculates the DFT is especially efficient for signal lengths that are a power of 2 (e.g. 128, 512, 1024).
There are at least two options to pad the input image: first, we manually pad the image; second, we set the sequence length of the FFT to the length of the padded signal. I prefer the latter.
Below is the code to manually pad the image.
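A minimal version of that manual padding (my own sketch, assuming the [b,c,h,w] layout used throughout this article and a filter side length F) could look like this:

import tensorflow as tf

image = tf.random.normal([1, 3, 64, 64])  # [b, c, h, w], example values
F = 3                                     # spatial side length of the filter
padding = (F - 1) // 2                    # pad (F - 1) zeros in total per spatial axis

# Only height and width are padded; batch and channel dimensions stay untouched.
padded_image = tf.pad(image,
                      paddings=[[0, 0], [0, 0],
                                [padding, padding],
                                [padding, padding]])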
And here is my preferred way to specify a higher sequence length when calculating the FFT:
# image is of shape [b,c,h,w]
padding = GetImagePadding(filter_shape[0])
image_shape = (input_shape[0],
               input_shape[1],
               input_shape[2] + 2 * padding,
               input_shape[3] + 2 * padding)
image_shape = FillImageShapeToPower2(image_shape)
F_image = tf.signal.rfft2d(image, fft_length=[image_shape[-2], image_shape[-1]])
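The two helper functions are defined in the notebook; a plausible minimal version (an assumption about their behavior, not the notebook’s exact code) could look like this:

import math

def GetImagePadding(filter_size):
    # Assumption: half the filter size, so that 2*padding covers the required (F - 1) zeros.
    return (filter_size - 1) // 2

def FillImageShapeToPower2(shape):
    # Assumption: round height and width up to the next power of two for an efficient FFT.
    b, c, h, w = shape
    return (b, c, 2 ** math.ceil(math.log2(h)), 2 ** math.ceil(math.log2(w)))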
Step 2 – Pad Filter to Image Size
Since we need to calculate the element-wise product of the transformed image with the transformed filter, we need to pad the filter to the size of the padded image before we calculate the Fourier transform. The filter is padded with zeros. Again, I would suggest padding the filter by setting the fft_length parameter of the FFT calculation properly, i.e.
F_filter = tf.signal.rfft2d(filter, fft_length=[image_shape[-2],image_shape[-1]])
Step 3 – Calculate 2D rFFTs
As we have prepared our input signals, we can now calculate the FFTs for the padded image and the padded filter:
# Image shape [b,c,h,w], Filter shape [out, in , k, k]
F_image = tf.signal.rfft2d(image, fft_length=[image_shape[-2],image_shape[-1]])
F_filter = tf.signal.rfft2d(filter, fft_length=[image_shape[-2],image_shape[-1]])
We take advantage of the conjugate symmetry for real inputs and only calculate the real FFT for 2D signals using rfft2d. To be specific, we input the unpadded signals and set fft_length to a value larger than the length of the input. This automatically pads the signal with zeros.
Important: TensorFlow‘s implementation of rfft2d calculates the FFT over the last two dimensions of the input. Unlike in numpy’s implementation, you cannot change the dimensions via a parameter. For that reason, the shape of the image is [batch, channel, height, width] and the shape of the kernel is [output_filter, input_filter, kernel_height, kernel_width].
If we were implementing a convolution in the frequency domain, we would be done. Since TensorFlow actually implements the cross-correlation, we need to conjugate the transformed filter to get consistent results:
F_filter = tf.math.conj(F_filter)
Step 4 – Multiplication of Transformed Image and Transformed Filter
If we only had a single image and a single filter, the element-wise multiplication would simply be F_image * F_filter. In a real scenario we usually have multiple images in the form of a batch and we apply multiple filters in parallel. We need to rearrange the dimensions of the input signals and take advantage of array broadcasting to perform this operation without any loops involved.
TensorFlow’s einsum() function can be used to easily reshape the dimensions. The characters on the left-hand side of the arrow describe the input shape and the characters on the right-hand side the output shape. The dimensions of the image and the filter are realigned in such a way that all batches and all output filters are broadcast when calculating the element-wise product. After the multiplication, the initial shape is restored by reshaping and reducing the input filter dimension.
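A compact way to express that step (my own sketch; the notebook contains the full code) is a single einsum that broadcasts batches against output filters, multiplies element-wise over height and width, and sums over the input channels:

# F_image:  [batch, in_channels, height, width_rfft]
# F_filter: [out_channels, in_channels, height, width_rfft]
filtered_image = tf.einsum("bihw,oihw->bohw", F_image, F_filter)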
Step 5 – Calculate inverse 2D rFFT
The inverse FFT is simply taken from the filtered signal with the same fft_length parameter as for the FFT:
out = tf.signal.irfft2d(filtered_image, fft_length=[image_shape[-2], image_shape[-1]])
Step 6 – Reconstruct Linear Convolution from Circular Convolution
Remember, to reconstruct the linear convolution from the circular convolution, we need to perform two steps: first, we need to circularly shift the result by the amount of padding. Second, we need to truncate to our region of interest, the initial shape of the image for all batches and the new number of channels.
# Circular shift
out = tf.roll(out, shift=[2 * padding, 2 * padding], axis=[-2, -1])
# Truncation
out = tf.slice(out,
               begin=[0, 0, padding, padding],
               size=[input_shape[0],
                     filter_shape[-1],
                     input_shape[2],
                     input_shape[3]])
And that’s all! The code below shows the entire implementation of steps 1 to 6:
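Putting the snippets together, a condensed sketch of the whole pipeline (my own assembly under the assumptions above, not the exact notebook or DeepSaki code) could look like this:

import math
import tensorflow as tf

def conv2d_via_rfft(image, kernel):
    """Cross-correlation of image [b, c, h, w] with kernel [out, in, k, k] via the 2D rFFT."""
    b, c, h, w = image.shape
    out_channels, in_channels, k, _ = kernel.shape

    # Step 1: FFT length that covers the padded image and is a power of two.
    padding = (k - 1) // 2
    fft_h = 2 ** math.ceil(math.log2(h + 2 * padding))
    fft_w = 2 ** math.ceil(math.log2(w + 2 * padding))

    # Steps 2 + 3: rFFT of image and kernel; an fft_length larger than the input zero-pads automatically.
    F_image = tf.signal.rfft2d(image, fft_length=[fft_h, fft_w])
    F_kernel = tf.signal.rfft2d(kernel, fft_length=[fft_h, fft_w])

    # Conjugate the kernel spectrum to obtain a cross-correlation like tf.nn.conv2d.
    F_kernel = tf.math.conj(F_kernel)

    # Step 4: element-wise product in the frequency domain, summed over the input channels.
    F_out = tf.einsum("bihw,oihw->bohw", F_image, F_kernel)

    # Step 5: back to the spatial domain; this is the circular convolution.
    out = tf.signal.irfft2d(F_out, fft_length=[fft_h, fft_w])

    # Step 6: undo the circular wrap-around and crop to the original resolution.
    out = tf.roll(out, shift=[2 * padding, 2 * padding], axis=[-2, -1])
    return tf.slice(out, begin=[0, 0, padding, padding], size=[b, out_channels, h, w])

# Example usage with hypothetical shapes:
image = tf.random.normal([2, 3, 64, 64])   # [batch, channels, height, width]
kernel = tf.random.normal([8, 3, 5, 5])    # [out, in, k, k]
result = conv2d_via_rfft(image, kernel)    # -> [2, 8, 64, 64]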
Note: The convolution in the Fourier Domain is also implemented in form of a TensorFlow layer as part of my DeepSaki package and can be found on GitHub or downloaded from PyPi.
Validation
You might ask yourself: But does it really work? Let’s find out! All the validation steps are included in the notebook for this article.
First, we will look at the execution time in seconds over the size of the kernel for both functions: tf.nn.conv2d() and our implementation.

As you might have expected, the execution time of the 2D convolution keeps growing with increasing kernel size. The 2D DFT convolution, on the other hand, is constant in execution time regardless of the filter size. This comes from the fact that the filter is padded to the size of the image: if the filter is larger, fewer values are padded.
Now let’s investigate the difference in the results. For this we apply a kernel of size 3×3 with 8 filters to an image of 720×720 pixels. We run it through both algorithms and calculate the mean and the standard deviation of the absolute difference.
convResult = CalcConv(image, filter)
dftResult = CalcDFT2D(image, filter)
error = tf.math.abs(convResult-dftResult)
mean = tf.math.reduce_mean(error)
std = tf.math.reduce_std(error)
# Mean Absolute Error: 0.001560982083901763
# Standard deviation: 0.0015975720016285777
Both the mean and the standard deviation are pretty low. The difference comes from numerical inaccuracies. When observing the filtered images and their corresponding amplitude spectra, we can see that they are indistinguishable.


Conclusion
We have seen the mathematical foundation behind convolutions and the DFT, gained some intuition by observing different spectra, reviewed the code in TensorFlow and finally validated that the results are correct.
Designing layers that work in the frequency domain instead of the spatial domain brings new opportunities, especially for large input images and large filter sizes. Taking the detour through the frequency domain seems counterintuitive, but it actually speeds up the computation.
Further Readings and Links
ML_Notebooks/2D_FFTs.ipynb at main · sascha-kirch/ML_Notebooks
GitHub – sascha-kirch/DeepSaki: Collection of reusable machine learning code including models…