Image registration problem overview
Image registration is an important problem in computer vision and image processing. Given two images of the same object, it can be formulated as follows: how can we determine the coordinate system transformation that aligns the object shown in both images?
There are multiple approaches available. Correlation-based methods match entire images or sub-images by comparing their pixel values. Feature-based methods detect and match control points that represent the same part of the object in both pictures. This article presents a GPU implementation of a correlation method operating in the frequency domain after the Fast Fourier Transform, proposed in paper [1]. By using recent advances in GPU hardware and a custom, highly optimized FFT library [2], it was possible to reduce the time taken by a match from minutes to a few milliseconds, which allows for real-time usage of the algorithm.
How does FFT-based image registration work?
Operating in the Fourier domain lets us exploit many of its useful properties. For example, let's define two images, where one is a simple translation of the other by (x₀, y₀):

f₂(x, y) = f₁(x − x₀, y − y₀)
In the Fourier domain this relation looks like:

F₂(ξ, η) = e^(−j2π(ξx₀ + ηy₀)) F₁(ξ, η)
This fact is known as the Fourier shift theorem and allows us to calculate the cross-power spectrum of the two images, defined at each point as:

R(ξ, η) = F₂(ξ, η) F₁*(ξ, η) / |F₂(ξ, η) F₁*(ξ, η)| = e^(−j2π(ξx₀ + ηy₀))
Here * denotes complex conjugation. If we take the inverse FFT of this spectrum, we obtain a single peak at the location (x₀, y₀), which can easily be found by a maximizer routine. The approach described above is also known as the phase correlation method.
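To make the phase correlation step concrete, here is a minimal CPU sketch in Python/NumPy (a stand-in for the GPU implementation, not the actual Vulkan code); the function name and the row/column ordering are choices of this sketch:

```python
import numpy as np

def phase_correlate(a, b):
    """Phase correlation of two equally sized grayscale images.

    Returns the (row, col) shift of b relative to a and the peak value,
    which equals 1.0 when an image is matched against itself."""
    A = np.fft.fft2(a)
    B = np.fft.fft2(b)
    r = B * np.conj(A)                # cross-power spectrum
    r /= np.abs(r) + 1e-12            # keep only the phase
    corr = np.real(np.fft.ifft2(r))   # single peak at the shift
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    return peak, float(corr[peak])
```

Note that shifts larger than half the image size wrap around: a peak at index k along an axis of length N corresponds to a shift of k − N.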
Paper [1] proposes an extension of this method that covers rotation and scaling transformations. It is based on another property of the Fourier transform: its magnitude is translation invariant:

|F₂(ξ, η)| = |F₁(ξ, η)|
Let's now define two images, where one is a combination of translation by (x₀, y₀), rotation by θ₀ and scaling by a factor s of the other image:

f₂(x, y) = f₁(s(x cos θ₀ + y sin θ₀) − x₀, s(−x sin θ₀ + y cos θ₀) − y₀)
If we now switch to polar coordinates:

ξ = ρ cos θ, η = ρ sin θ
the magnitudes M₁ = |F₁| and M₂ = |F₂| of the images' Fourier transforms can then be rewritten as:

M₂(ρ, θ) = (1/s²) M₁(ρ/s, θ − θ₀)
If we apply the log transformation to ρ:

λ = log ρ
we get the following representation of the magnitude in the log-polar coordinate system:

M₂(λ, θ) = (1/s²) M₁(λ − log s, θ − θ₀)
As can be seen, rotation by θ₀ and scaling by a factor s become a simple translation in this coordinate system, so by applying the phase correlation procedure defined above we can determine both of these parameters. It can also be advantageous to switch from the magnitude to the logarithm of the magnitude beforehand.
If we then rotate and scale the higher-resolution image accordingly, we can determine the translation with another step of phase correlation. One note: this method cannot distinguish the angle θ₀ from 180° + θ₀. To resolve this, we simply perform another translation match with the angle 180° + θ₀ and select the angle with the higher peak value.
The current limitations of this approach are that scaling must be uniform along both axes and that the images must be square (or padded to a square).
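To illustrate the log-polar conversion, below is a minimal NumPy sketch (again a CPU stand-in for the GPU shader). The output size, the axis ranges and the nearest-neighbour sampling are simplifications of this sketch; the actual implementation uses bilinear interpolation and per-channel processing:

```python
import numpy as np

def log_polar_magnitude(img, out_size=512):
    """Resample the log-magnitude of the 2D FFT onto a log-polar grid.

    Rows correspond to the angle theta in [0, 180) degrees, columns to
    log(rho), so rotation and scaling of the input image become simple
    translations along the rows and columns, respectively."""
    n = img.shape[0]                                   # assumes a square image
    mag = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(img))))
    theta = np.linspace(0.0, np.pi, out_size, endpoint=False)
    rho = np.exp(np.linspace(0.0, np.log(n / 2), out_size, endpoint=False))
    t, r = np.meshgrid(theta, rho, indexing="ij")      # rows: theta, cols: log-rho
    x = np.clip(np.round(n / 2 + r * np.cos(t)), 0, n - 1).astype(int)
    y = np.clip(np.round(n / 2 + r * np.sin(t)), 0, n - 1).astype(int)
    return mag[y, x]
```

Phase-correlating the log-polar magnitudes of two images then yields a row shift proportional to θ₀ and a column shift proportional to log s.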
What can GPUs bring to the table?
Recent advances in GPU computational power, data transfer speed and available memory make them extremely appealing for general-purpose use, well beyond their original rendering and visualization tasks. The single instruction, multiple threads (SIMT) design used by GPUs has proven extremely effective for data-parallel tasks. One such task is the multidimensional FFT, which is the core of the phase correlation method.
This project is implemented with the Vulkan API (as opposed to Nvidia's CUDA, which is typically used in data science). It heavily utilizes the VkFFT library (also developed by the author). VkFFT is an open-source, cross-platform Fast Fourier Transform library for Vulkan with better performance than Nvidia's proprietary cuFFT library.
By having no dependency on proprietary black-box solutions, it was possible to optimize every small part of the application layout, as discussed in the next section.
GPU implementation of the algorithm
The full algorithm for translation, rotation and scaling detection can be summarized as the following sequence of steps:
1. Forward FFT of both images
2. Conversion of the log-magnitudes to the log-polar coordinate system
3. Forward FFT of the log-polar representation of the log-magnitudes of both images
4. Cross-power spectrum calculation
5. Inverse FFT of the log-polar cross-power spectrum
6. Maximizer reduce routine to determine the angle and scaling factor
7. Rotation and scaling of the higher-resolution image
8. FFT of the rotated image
9. Cross-power spectrum calculation
10. Inverse FFT of the cross-power spectrum
11. Maximizer reduce routine to determine the translation
12. Repeat of steps 7–11 with the angle 180° + θ₀ to resolve the 180° ambiguity in the angle
13. Final rotation, scaling and translation of the input image
The total operation count is: 6 FFTs at the image size and 3 FFTs at the log-polar system size, 3 cross-power spectrum calculations, 3 maximizer reduce launches, and 3 rotation/scaling/translation transformations merged into a single operation. All operations are performed on three color channels. It is also possible to use a different size for the log-polar coordinate system.
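Putting the pieces together, here is a rough single-channel CPU sketch of steps 1–11 (without the 180° check and the final resampling). It reuses phase_correlate() and log_polar_magnitude() from the earlier sketches and assumes a hypothetical rotate_scale(img, angle_deg, scale) warp that keeps the image size fixed; all names and sign conventions are choices of this sketch, not of the actual implementation:

```python
import numpy as np

def register(reference, image, lp_size=512):
    """Recover rotation angle, scale and translation of `image` w.r.t. `reference`."""
    n = reference.shape[0]
    # steps 1-6: rotation and scale from the log-polar log-magnitude spectra
    (dt, dr), rspeak = phase_correlate(log_polar_magnitude(reference, lp_size),
                                       log_polar_magnitude(image, lp_size))
    dt = dt - lp_size if dt > lp_size // 2 else dt      # undo FFT wrap-around
    dr = dr - lp_size if dr > lp_size // 2 else dr
    theta0 = dt * 180.0 / lp_size                       # theta axis spans [0, 180) degrees
    scale = np.exp(dr * np.log(n / 2) / lp_size)        # log-rho axis spans [0, log(n/2))
    # steps 7-11: undo the detected rotation/scale, then find the translation
    undone = rotate_scale(image, -theta0, 1.0 / scale)  # hypothetical size-preserving warp
    (y0, x0), tpeak = phase_correlate(reference, undone)
    return theta0, scale, (y0, x0), rspeak, tpeak
```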
Let's test the algorithm's performance. The test consists of four parts. First, we test translation, rotation and scaling separately to determine the range of correct detection. Then we proceed with all of the transformations combined, with additional external filters applied to the input image.
The tests are performed on an Nvidia GTX 1660 Ti GPU. The log-polar system size is set to 512x512px. The reference image is a 1024x1024px part of a screenshot from the game Cyberpunk 2077:

Translation transformation
This test checks how far the image can be shifted diagonally and still be matched against the reference. rspeak and tpeak are the peak values detected by the maximizer reduce routine for the rotation/scaling match and the translation match, respectively; both are equal to 1.0 when the reference image is matched against itself. As can be seen, FFT-based image registration matches translated images exactly until the overlap is as small as 30% of the image size.

The time taken by the GPU to perform all 13 steps of the match is ~3.3ms. The deviation is mostly caused by the rotation/scaling/translation routine: missing data is simply set to black, which requires no additional memory transfers, so the amount of work varies with the shift.
FFT on a GPU is a bandwidth-limited problem, which makes all optimizations that reduce the amount of memory transferred between GPU memory and the chip very important. The 6 image-sized FFTs are performed using the real-to-complex optimization implemented in VkFFT, which halves the memory required to store an image. The FFT is done in single precision. Each 2D FFT still requires two uploads to the chip and two downloads back to GPU memory, as there are two axes, and the three color channels are processed separately. The total amount of memory transferred between the chip and VRAM per image FFT is equal to 50MB. The log-polar FFTs have a smaller resolution and require 24MB of transfers. The total memory transferred is 6×50 + 3×24 = 372MB. The theoretical bandwidth of the 1660 Ti is 288GB/s, so it will take at least 1.26ms just to move this data at that rate. Indeed, if we run only the FFTs with VkFFT (without the other parts of the algorithm), the total time taken is ~1.70ms, which is close to this theoretical value.
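As a quick sanity check, the lower bound above can be reproduced from the per-FFT transfer sizes quoted in this section (the inputs are rounded, so the result lands near, not exactly on, the 1.26ms figure):

```python
def fft_transfer_lower_bound_ms(img_fft_mb=50, lp_fft_mb=24, bandwidth_gb_s=288):
    """Minimum time to move the FFT data at the theoretical bandwidth:
    6 image-sized FFTs plus 3 log-polar-sized FFTs."""
    total_mb = 6 * img_fft_mb + 3 * lp_fft_mb         # 372 MB in total
    return total_mb / (bandwidth_gb_s * 1000) * 1000  # ~1.3 ms

print(fft_transfer_lower_bound_ms())
```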
The reduction is done efficiently using the subgroup functionality in Vulkan. The amount of memory transferred for a reduction is close to the system size: we reduce by comparing 1024 numbers held by 1024 threads, so each launch of the routine reduces the number of compared values by a factor of 1024. The total amount of transferred memory is ~32MB, which takes 0.11ms at the theoretical bandwidth. The measured timings show that it takes 0.14ms to compute all reductions in the algorithm.
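The multi-pass reduction scheme can be mimicked on the CPU as follows (a conceptual NumPy sketch, not the Vulkan subgroup shader): each pass keeps one candidate per group of 1024 values, so a 1024x1024 system needs only two passes:

```python
import numpy as np

def hierarchical_argmax(values, group=1024):
    """Multi-pass argmax: every pass reduces the number of candidates by
    a factor of `group`, mirroring one GPU launch per pass."""
    vals = values.ravel().astype(np.float32)
    idx = np.arange(vals.size)
    while vals.size > 1:
        pad = (-vals.size) % group                           # pad to a full group
        vals = np.pad(vals, (0, pad), constant_values=-np.inf)
        idx = np.pad(idx, (0, pad), constant_values=-1)
        winners = np.argmax(vals.reshape(-1, group), axis=1)
        rows = np.arange(winners.size)
        idx = idx.reshape(-1, group)[rows, winners]
        vals = vals.reshape(-1, group)[rows, winners]
    return int(idx[0]), float(vals[0])
```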
Computing one cross-power spectrum requires two uploads of a system (with colors) and one download from the chip (without colors). The algorithm performs two such computations to determine the translation and one for the log-polar spectrum. This amounts to 75MB in total and can be done in 0.25ms. The measured timings show that it takes 0.36ms to compute the cross-power spectra in the algorithm.
The conversion to the log-polar coordinate system also takes two uploads of a system (with colors) and two downloads (with colors). This amounts to 50MB in total and can be done in 0.17ms. The measured timings show that it takes 0.22ms to compute the conversion in the algorithm.
The rotation/scaling/translation routine takes the rest of the time (0.5–0.9ms) and uses bilinear interpolation. This routine is not yet optimized for memory transfers and coalescing, so its timings can vary widely between different angles and scales, as can be seen in the next tests.
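For reference, a size-preserving rotation/scaling/translation warp with bilinear interpolation can be sketched as follows (CPU NumPy version for a single channel; it could also play the role of the hypothetical rotate_scale() used in the earlier register() sketch, with tx = ty = 0). Pixels that map outside the source image are set to black, matching the behaviour described above:

```python
import numpy as np

def warp_rst(img, theta_deg, scale, tx=0.0, ty=0.0):
    """Inverse-map each output pixel through rotation/scale about the image
    centre plus translation, and sample the input with bilinear interpolation."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float64)
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    t = np.deg2rad(theta_deg)
    # source coordinates for every output pixel
    xs = ((xx - tx - cx) * np.cos(t) + (yy - ty - cy) * np.sin(t)) / scale + cx
    ys = (-(xx - tx - cx) * np.sin(t) + (yy - ty - cy) * np.cos(t)) / scale + cy
    x0, y0 = np.floor(xs).astype(int), np.floor(ys).astype(int)
    fx, fy = xs - x0, ys - y0
    inside = (x0 >= 0) & (x0 < w - 1) & (y0 >= 0) & (y0 < h - 1)
    x0c, y0c = np.clip(x0, 0, w - 2), np.clip(y0, 0, h - 2)
    out = (img[y0c, x0c] * (1 - fx) * (1 - fy) + img[y0c, x0c + 1] * fx * (1 - fy) +
           img[y0c + 1, x0c] * (1 - fx) * fy + img[y0c + 1, x0c + 1] * fx * fy)
    return np.where(inside, out, 0.0)   # black outside the source image
```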
Overall, the ~3ms range on a mid-range GPU is suitable for real-life image processing and is expected to scale linearly with the bandwidth of the GPU. Another note concerns the large L3 cache of the latest-generation AMD GPUs: the full system can be stored in it during execution, fully utilizing its speed (this was verified to be the case in the VkFFT benchmark).
Rotation transformation
This test checks whether the algorithm can detect all angles in the range from 0° to 360°. As can be seen, FFT-based image registration can detect rotations across the whole range. Here, the (not yet optimized) bilinear interpolation takes even more time than the FFTs, which shows how important optimization is in GPU programming.

Scaling transformation
This test checks the range of scaling transformations the algorithm can detect. As can be seen, FFT-based image registration can detect scaling over a wide range, from 0.4x to 3.2x of the reference image. The timing discrepancy is once again explained by the fact that setting missing data to black doesn't require additional memory transfers.

Translation+Rotation+Scaling+Filter
This test pushes the algorithm to its limits by applying all transformations supported for detection, in addition to external filters that distort the image. It is used to determine the validity conditions for rspeak and tpeak.

First of all, without filters, the image has undergone the following transformations: translation by 200px diagonally (towards the top-left), then rotation around the center by 30° clockwise, and scaling by a factor of 1.6.
Then the following tests were applied:
- Gaussian noise – works for up to 50% of the image
- Gaussian blur – works up to a 4px blur
- Dust and scratches filter (a different blur filter) – works up to a 5px blur
- Ripple filter (distorts the image in a wave-like way) – works up to 200%
- Old image filter (changes the color palette)
As can be seen, FFT-based image registration provides very good noise and blur robustness out of the box. The tests also show that a good measure of whether the algorithm has succeeded or failed is the comparison of tpeak with the tpeak of the 180°-rotated candidate used to remove the 180° ambiguity in the angle. In all successful tests, the correct tpeak was at least two times larger than the other value (usually an order of magnitude larger), while in all failed tests the two were comparable in value.
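This criterion can be expressed as a simple check (the factor of two is the empirical threshold from these tests; the function name is an assumption of this sketch):

```python
def match_is_valid(tpeak, tpeak_180, ratio=2.0):
    """Accept the match only if the winning translation peak clearly
    dominates the peak obtained with the 180-degree rotated candidate."""
    best, other = max(tpeak, tpeak_180), min(tpeak, tpeak_180)
    return best >= ratio * other
```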
Conclusion
This article presents a GPU implementation of the FFT-based image registration algorithm (first proposed in paper [1]), which can match translated, rotated and scaled images. The algorithm is robust to noise and blur and can match two 1024x1024px images in ~3ms on a mid-range GPU, which allows for real-time usage. The implementation is written with the Vulkan API and has no dependencies on proprietary software. It uses the VkFFT library [2] as its core module. Vulkan supports a wide range of platforms, which makes the implementation cross-platform.
The registration accuracy has been tested on many different transformations and filters. As of now, the implementation supports translations with an overlap as small as 30% of the image size, the full 360° range of rotations, and scaling from 0.4x to 3.2x of the reference image size.
A new method of determining the validity of the result, based on the tpeak comparison, has been tested on a wide range of filters.
The accuracy of FFT-based image registration can be further improved by applying frequency-domain filters, windowing, subpixel detection techniques, different interpolation techniques, and further experimentation with the log-polar conversion.
References
[1] B. Srinivasa Reddy and B. N. Chatterji, An FFT-Based Technique for Translation, Rotation and Scale-Invariant Image Registration (1996), IEEE Transactions on Image Processing. 5 (8): 1266–1271
[2] D. Tolmachev, VkFFT – Vulkan Fast Fourier Transform library (2020), GitHub