
Computer Vision: Intuition behind Panorama Stitching

There are 4 main parts to understand behind how panorama stitching works.

View from top of Lookout Mountain, Tennessee (1864). Source: https://en.wikipedia.org/wiki/Panoramic_photography

There are 4 main parts behind how panorama stitching works. In this article, I will give a brief overview that (hopefully) builds sufficient intuition for how image stitching works. This also means that I will mostly skip the mathematical concepts and calculations involved.

  1. Detecting interest points
  2. Describing these interest points
  3. Matching these descriptors of our interest points
  4. Performing homography to finish the stitching

1. Detecting interest points

There are several characteristics we look for in interest points in an image. They are:

a) Repeatable interest points

We want to be able to find features in an image that can ultimately tell us where to match between different images of the same scene (from nearby viewpoints).

b) Distinctiveness of the interest points

We want to be able to reliably determine which interest point in one image matches the corresponding interest point in the other image.

c) Invariance to scale and rotation

We want to be able to find the same interest points even if an image has been rotated, scaled or translated.

d) Locality

A local feature will enable our detection of interest points to be more robust to clutter and occlusion.

Figure 1. Screenshots from my Mac default background.

Looking at the 2 pictures above (Figure 1), it is easy to see that if we were to pick the ocean as an interest point, it would be hard to match a patch of ocean from the left image to a specific location in the right image (the ocean looks the same over a large region of space). We also notice that choosing areas such as the peak of the small rock (orange circle) provides much more valuable information for us to match later.

It turns out that corners are actually a very good feature to have as interest points! An algorithm called the Harris Corner Detector helps us find such corners as interest points in an image.

Harris Corner Detector

Note that the Harris Corner Detector is just one of the many algorithms that help us find these interest points. There are also other methods such as SIFT, which uses the Difference of Gaussians (DoG) to detect interest points at different scales. I personally find this article to be a good read on SIFT. It is important to understand SIFT for the later parts, as we will be using the SIFT descriptor to describe the interest points we find.

Essentially, the Harris Corner algorithm computes a corner score from the gradients of the image (using a second moment matrix H) and labels values above a set threshold as corners, before keeping only the points that are local maxima (Non-Maximum Suppression). Harris Corner is invariant to rotation (since the eigenvalues of the H matrix remain the same after rotation), translation and additive changes in intensity. However, it is not invariant to scaling of intensity or to scale. In order for Harris Corner to be scale invariant, we need an additional step of Automatic Scale Selection to find the scale that gives a local maximum of our corner score.
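If you want to play with this yourself, here is a minimal sketch of Harris corner detection using OpenCV. This is not the author's code: the file name, the 1% threshold and the blockSize/ksize/k parameters are just illustrative choices, and the simple thresholding below skips the Non-Maximum Suppression step described above.

```python
import cv2
import numpy as np

# Load an image and convert to grayscale ("sample.jpg" is a placeholder path)
img = cv2.imread("sample.jpg")
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# Corner response from the second moment matrix:
# blockSize = neighbourhood size, ksize = Sobel aperture, k = Harris constant
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# Keep responses above a threshold (here 1% of the max, an arbitrary choice)
corners = np.argwhere(response > 0.01 * response.max())
print(f"Found {len(corners)} candidate corner pixels")

# Mark the corner pixels in red for visual inspection
img[response > 0.01 * response.max()] = [0, 0, 255]
cv2.imwrite("corners.jpg", img)
```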

To summarise this first part of detecting our interest points from an image: corners are a good representation of interest points and can be found using the Harris Corner Detector.

2. Describing our interest points

Descriptors are basically vector representations that mathematically characterise a region in the image. Descriptors should be invariant to rotation, scaling and translation too. I will dive right into the SIFT descriptor, which we can use on the interest points found by our Harris Corner Detector.

Scale Invariant Feature Transform (SIFT) was published by David Lowe in 2004. If his paper is too much to read, you can refer here for a much quicker read on SIFT. SIFT is actually an algorithm that both detects interest points and describes them. However, in this case, I will just be focusing on the descriptor itself.

Figure 2 (shows a downscaled version). Credits: Richard Szeliski

What the SIFT descriptor basically does is take a 16×16 window around the detected interest point (Figure 2) and partition it into a 4×4 grid of cells. Gradient orientations and magnitudes are computed at every pixel in the window, weighted by a Gaussian function. A weighted gradient orientation histogram with 8 bins is then computed in each cell (each orientation weighted by its magnitude). At the end, we collapse these histograms into a vector of 128 (16 × 8) dimensions (there are 16 cells of 8 bins each).

The partitioning into cells gives the descriptor a sense of spatial layout, while shifting the histogram bins by the dominant orientation makes the descriptor rotation invariant. We can normalise the final vector to unit length, clamp values at a threshold and re-normalise to make it relatively more robust to changes in illumination.

All in all, SIFT is highly robust for a descriptor. It is invariant to scale and rotation, can handle changes in viewpoint (up to 60 degrees out of plane rotation) and can handle significant changes in illumination.
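In practice we rarely build this 128-dimensional vector by hand. As a small sketch (not the author's code), recent opencv-python releases expose SIFT directly; the image path below is a placeholder, and for simplicity I let SIFT detect its own keypoints rather than feeding in Harris corners.

```python
import cv2

# Load a grayscale image ("sample.jpg" is a placeholder path)
gray = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)

# Create the SIFT detector/descriptor
sift = cv2.SIFT_create()

# detectAndCompute returns the keypoints and a (num_keypoints x 128) array of
# descriptors, i.e. the 16 cells x 8 orientation bins described above
keypoints, descriptors = sift.detectAndCompute(gray, None)

print(f"{len(keypoints)} keypoints, descriptor shape: {descriptors.shape}")
```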

We have now found our interest points with the Harris Corner Detector and described the regions around them with SIFT descriptors. What is left is to match these descriptors between images and to perform homography to stitch the different images together into a panorama!

3. Matching the descriptors of our interest points

To form a match between the descriptors of 2 images, we use a ratio distance approach: the distance to the best match divided by the distance to the 2nd best match.

Figure 3. Picture from Unsplash. Author: Ishan Wazalwar

In Figure 3, if we were to only use the absolute distance between the target and its best match (distance here refers to how similar the descriptors are: if they are very similar, the distance is small, and vice versa), it is easy to see that there are multiple ambiguous matches available (there are many similar fence posts as interest points in the image, so we cannot be sure that the best match is the correct feature to match).

In order to reject these ambiguous matches, we use the ratio method: a value close to 1 suggests that the match is ambiguous, as the distance to the best match is very close to the distance to the second best match. We then set a threshold (normally around 0.5–0.7) and accept matches under this threshold. In other words, we reject matches that contain high uncertainty; in the case of Figure 3, there is high uncertainty as many wooden posts look similar to the target.
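A minimal sketch of this ratio test using OpenCV's brute-force matcher is shown below; the 0.7 threshold and the image paths are illustrative assumptions rather than values from this article.

```python
import cv2

# Load two overlapping images in grayscale (placeholder paths)
gray1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
gray2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

# SIFT keypoints + descriptors for both images
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(gray1, None)
kp2, des2 = sift.detectAndCompute(gray2, None)

# For each descriptor in image 1, find its 2 nearest neighbours in image 2
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)

# Ratio test: keep a match only if the best distance is well below the
# second best distance (0.7 is a typical, somewhat arbitrary threshold)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
print(f"{len(good)} matches survived the ratio test")
```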

4. Performing Homography

Wikipedia’s explanation of homography is as such: In the field of computer vision, any two images of the same planar surface in space are related by a homography.

It is easy to understand homography once we are able to visualise what we are trying to do when we aim to stitch the different images together. Imagine 2 scenarios.

(1) – Take a step to the left each time you capture a photo, while holding the camera still.

(2) – Standing on a fixed position, you rotate your body while holding a camera and capture the different photos while you rotate.

If we want to stitch the images together in (1), we can simply overlay a photo on top of another in the sequence and have a good result. However in (2), if we want to stitch the images together by simply overlaying the images on top of one another in sequence, we will realise that the result of the stitching is bad (regions will be missed out due to the different planes of the captured images). As such, we will need homography to project an image onto the same plane of the other image before stitching them together.

Next, you can think of a homography as basically a matrix: a matrix that transforms points from one image to another image of the same plane. So the next question is how do we solve for the homography matrix H?

We use the Direct Linear Transform (DLT) and solve for H by computing the Singular Value Decomposition (SVD), with a minimum of 4 correspondences needed for this calculation to work. I will skip the math, but feel free to look it up if you are interested.
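For the curious, here is a rough numpy sketch of the DLT idea: stack two linear equations per correspondence into a matrix A, and take the right singular vector with the smallest singular value as the flattened H. This is an illustrative, unnormalised version under my own assumptions (a careful implementation would normalise the points first), and the correspondences at the bottom are made up purely to show the call.

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate H mapping src points to dst points (both Nx2 arrays, N >= 4)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear equations in the 9 entries of H
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows)

    # The solution is the right singular vector with the smallest singular value
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # scale so the bottom-right entry is 1

# 4 made-up correspondences, just to demonstrate the call
src = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
dst = np.array([[10, 10], [20, 12], [22, 22], [8, 20]], dtype=float)
print(dlt_homography(src, dst))
```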

However, DLT can produce bad results in noisy environments, as it is a linear least squares estimation that takes bad outliers in our matches into account. (Note that even if we set a threshold in the previous step when matching our descriptors, it is still possible to have incorrect matches.) We can produce much more robust results with Random Sample Consensus (RANSAC), in which we only include inliers (approximately correct matches) in our calculation of the H matrix. We then utilise both DLT and RANSAC together to give us much better results.

Once we solve for the homography matrix, we can map points from image A to image B using the matrix and warp the image nicely to finish up our panorama stitching!
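Putting the last two steps together, here is a hedged end-to-end sketch with OpenCV: estimate H from the ratio-test matches using RANSAC, then warp one image onto the other's plane. The image paths, the 5-pixel reprojection threshold and the doubled canvas width are my own illustrative assumptions, not values from this article.

```python
import cv2
import numpy as np

# Redo the ratio-test matching from the previous sketch (placeholder paths)
img1 = cv2.imread("left.jpg")
img2 = cv2.imread("right.jpg")
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY), None)
kp2, des2 = sift.detectAndCompute(cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY), None)
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]

# Matched keypoint coordinates, shaped (N, 1, 2) as OpenCV expects
src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# RANSAC-robust homography; mask marks which matches were kept as inliers
H, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)
print(f"{int(mask.sum())} of {len(good)} matches are inliers")

# Warp image 1 onto image 2's plane; the canvas width here is a rough guess
h, w = img2.shape[:2]
panorama = cv2.warpPerspective(img1, H, (w * 2, h))
panorama[:h, :w] = img2  # paste image 2 over the overlap to finish the stitch
cv2.imwrite("panorama.jpg", panorama)
```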

Conclusion

Here, I have given a brief overview of how panorama stitching works. Specifically, I have talked about using the Harris Corner Detector to detect corners as interest points, SIFT descriptors to describe the region around each interest point, how we match these descriptors, and how we calculate the homography to stitch the different images together. Note that since this is a brief overview, I have intentionally left out the mathematical details and tried not to dive too deep into each of the specific algorithms/concepts, as my aim is to build intuition for the entire pipeline!

Also, this is actually my first article on Medium! I hope you enjoyed it and have learnt something from the article. Thanks!


