An Evolution in Single Image Super Resolution using Deep Learning

From classical interpolation to deep learning methods with Generative Adversarial Networks

Beeren Sahu
Towards Data Science


Reconstructing a high-resolution, photo-realistic image from its low-resolution counterpart has long been a challenging task in the computer vision community. The task becomes even more difficult when all you have is a single low-resolution image as input from which to recreate the high-resolution image.

What is super resolution? The estimation of a high-resolution (HR) image from a single low-resolution (LR) counterpart is referred to as super-resolution (SR). In other words, LR is the single image input, HR is the ground truth, and SR is the predicted high-resolution image. When applying ML/DL solutions, the LR images are generally down-sampled HR images with some blurring and noise added to them.
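As a concrete illustration, here is a minimal sketch of how such an LR training input might be synthesized from an HR image, assuming OpenCV and NumPy; the blur sigma and noise level are illustrative choices, not values from any particular paper.

```python
import cv2
import numpy as np

def make_lr(hr, scale=4, blur_sigma=1.0, noise_std=2.0):
    """Synthesize a low-resolution training input from an HR image."""
    # Slightly blur the HR image to mimic the acquisition/downsampling kernel.
    blurred = cv2.GaussianBlur(hr, ksize=(0, 0), sigmaX=blur_sigma)
    # Down-sample by the chosen scale factor.
    h, w = hr.shape[:2]
    lr = cv2.resize(blurred, (w // scale, h // scale),
                    interpolation=cv2.INTER_CUBIC)
    # Add a small amount of Gaussian noise.
    noise = np.random.normal(0.0, noise_std, lr.shape)
    return np.clip(lr.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```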


Generation #0 of Evolution: Interpolation

To start with, a very early solution was interpolation, from classical image processing. Here, the low-resolution image is resized by a factor of 2x or 4x using an interpolation method such as nearest-neighbor, bilinear, or bicubic interpolation, as in the sketch below.
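For example, with OpenCV (the input file name is a placeholder):

```python
import cv2

lr = cv2.imread("input_lr.png")  # placeholder file name
h, w = lr.shape[:2]

# Upscale 4x with three classical interpolation methods.
nearest  = cv2.resize(lr, (w * 4, h * 4), interpolation=cv2.INTER_NEAREST)
bilinear = cv2.resize(lr, (w * 4, h * 4), interpolation=cv2.INTER_LINEAR)
bicubic  = cv2.resize(lr, (w * 4, h * 4), interpolation=cv2.INTER_CUBIC)
```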

“Interpolation works by using known data to estimate values at unknown points. Image interpolation works in two directions, and tries to achieve a best approximation of a pixel’s intensity based on the values at surrounding pixels.” — ebook on Digital Image Processing from University of Tartu.

Figure: Effect of interpolation (source)

As the illustration above makes clear, the resulting image is blurred and unrealistic.

Generation #1 of Evolution: SRCNN

With the success of the fully convolutional neural network (FCNN) in solving semantic segmentation, its popularity spread rapidly to other fields of computer vision. An FCNN is a CNN without any dense connections (fully connected layers) at its rear. Every CNN has two main functional blocks: i) a feature extractor and ii) a classifier. The dense connections at the rear of a CNN form the classifier, whose task is to map the extracted features to class probabilities. I consider the FCNN a basic design pattern in DL for generating/predicting an output map from an input image. The output map can be a semantic segmentation map, a style-transferred image, or even a super-resolution image. In other words, an FCNN is an image-to-image mapping engine. One such early application of the FCNN to super resolution is SRCNN.

In SRCNN, the image is first upsampled using bicubic interpolation and then fed to a simple FCNN. It is important to note that no pooling operations are involved, so the output has the same spatial size as the upsampled input image. Finally, we compute the MSE loss between the output and the target HR image.

Figure: SRCNN Model (source)
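A minimal PyTorch sketch of such a three-layer FCNN, following the 9-1-5 layer sizes reported for SRCNN (the padding is my addition so that the output matches the input spatially):

```python
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-layer fully convolutional network in the spirit of SRCNN."""
    def __init__(self, channels=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):
        # x is the bicubic-upsampled LR image; no pooling anywhere.
        return self.body(x)
```

Training then simply amounts to minimizing nn.MSELoss() between the network output and the HR target.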

Generation #2 of Evolution: SRResNet and Sub-pixel convolution for upsampling

The success of SRCNN at single-image super resolution inspired others to improve the architecture further. It is well known that ResNets (CNNs with skip connections) train better than conventional CNNs, so SRResNet replaced the simple convolution blocks with residual blocks. As a result, accuracy increased significantly.

Figure: SRResNet Model (source)
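A sketch of one such residual block in PyTorch, assuming the batch-norm/PReLU layout described in the SRGAN paper; the channel count is illustrative:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block of the kind used in SRResNet: two 3x3 convolutions
    with batch normalization, plus an identity skip connection."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.act = nn.PReLU()
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return x + out  # skip connection: gradients flow around the block
```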

Many deep learning models also incorporated transposed convolution for upsampling. Bilinear and bicubic upsampling are not learnable, which means they can only be used before or after the deep learning architecture, not in between. Learnable upsampling also brings advantages in speed and accuracy.
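For instance, a minimal PyTorch illustration of a learnable 2x upsampling layer that can sit anywhere in the network (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)             # N x C x H x W feature map

# A learnable 2x upsampling layer.
up = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)
print(up(x).shape)                          # torch.Size([1, 64, 64, 64])
```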

But as one may have observed, transposed convolution as implemented above (essentially the gradient of a strided convolution) inserts zero values to upscale the image, values that later have to be filled in with meaningful ones. Maybe even worse, these zero values carry no gradient information that can be back-propagated through.

“To cope with that problem, Shi et al. proposed what we argue to be one of the most useful recent convnet tricks (at least in my opinion as a generative model researcher!) They proposed a subpixel convolutional neural network layer for upscaling. This layer essentially uses regular convolutional layers followed by a specific type of image reshaping called a phase shift. In other words, instead of putting zeros in between pixels and having to do extra computation, they calculate more convolutions in lower resolution and resize the resulting map into an upscaled image. This way, no meaningless zeros are necessary.” — [https://github.com/atriumlts/subpixel]

Image reshaping using phase shift is also called “pixel shuffle”: it rearranges the elements of an H × W × C · r² tensor to form an rH × rW × C tensor, as shown below.

Figure: Sub-pixel convolution operation (source)
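PyTorch ships this operation as nn.PixelShuffle; note that PyTorch tensors are channels-first, so the C·r² channels come before the spatial dimensions:

```python
import torch
import torch.nn as nn

r = 2                                       # upscale factor
x = torch.randn(1, 3 * r**2, 32, 32)        # C·r² channels at low resolution

# Convolutions are computed at low resolution, then PixelShuffle
# rearranges the (C·r²) x H x W tensor into a C x rH x rW tensor.
shuffle = nn.PixelShuffle(upscale_factor=r)
print(shuffle(x).shape)                     # torch.Size([1, 3, 64, 64])
```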

Generation #3 of Evolution: Perceptual Loss

The main disadvantage of using MSE, or MSE-like error measures, as the loss function in applications like super resolution is that it is computed pixel-wise: it only measures the change between two corresponding pixels in the predicted and target images. This encourages finding pixel-wise averages of plausible solutions, which are typically overly smooth and thus have poor perceptual quality. The same argument holds for not relying on PSNR alone as a quality index, since it too is computed pixel-wise. Therefore, I would advise against checking PSNR alone when comparing the performance of any two methods on such tasks.

Perceptual loss is computed by comparing two images based on high-level representations from a pre-trained CNN model. The function is used to compare high-level differences, like content and style discrepancies, between images.

In other words, both the target and the predicted image are passed through a pre-trained network, and the Euclidean distance is computed between the two resulting feature maps (taken at the same stage). The perceptual loss sums the squared errors between all elements of the feature maps and takes the mean. This is in contrast to a per-pixel loss function, which compares the raw pixel values of the two images directly.
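A minimal PyTorch sketch of such a loss, assuming a VGG19 feature extractor from torchvision; the cutoff layer (here, the deep conv5-stage activations) is a common choice but an assumption on my part:

```python
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

class PerceptualLoss(nn.Module):
    """MSE between VGG19 feature maps of the prediction and the target.
    (ImageNet normalization of the inputs is omitted here for brevity.)"""
    def __init__(self):
        super().__init__()
        features = vgg19(weights=VGG19_Weights.DEFAULT).features[:36].eval()
        for p in features.parameters():
            p.requires_grad = False        # the feature extractor stays frozen
        self.features = features
        self.mse = nn.MSELoss()

    def forward(self, sr, hr):
        return self.mse(self.features(sr), self.features(hr))
```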

Generation #4 of Evolution: SRGAN

Generative adversarial networks (GANs) provide a powerful framework for generating plausible-looking natural images with high perceptual quality. The GAN procedure encourages the reconstructions to move towards regions of the search space with high probability of containing photo-realistic images and thus closer to the natural image manifold.

SRGAN is a GAN-based network in which the generator (G) learns to generate SR images from LR inputs that are as close as possible to HR, while the discriminator (D) learns to distinguish generated SR images from real ones. The generator takes advantage of residual blocks and sub-pixel convolution for upsampling, and its loss combines perceptual loss with a generative (adversarial) loss.

Figure: Architecture of Generator and Discriminator Network in SRGAN. (source)

Loss

Equation of modified perceptual loss in SRGAN. (source)
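In the paper, this modified perceptual loss is the weighted sum of a content loss (a VGG-based loss like the one above, or plain MSE) and an adversarial loss, with the adversarial term scaled by 10⁻³. A hedged sketch of the generator objective, reusing the PerceptualLoss class from earlier (the epsilon is mine, for numerical stability):

```python
import torch

def generator_loss(disc_out_sr, sr, hr, content_loss):
    """SRGAN-style generator objective: content loss + 1e-3 x adversarial loss.
    `disc_out_sr` holds the discriminator's probabilities for the generated
    images; `content_loss` is e.g. the PerceptualLoss defined earlier."""
    adversarial = -torch.log(disc_out_sr + 1e-8).mean()  # -log D(G(LR))
    return content_loss(sr, hr) + 1e-3 * adversarial
```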

Conclusion

Studying the evolution of single-image super resolution with deep learning, it is evident that a ResNet-based GAN that combines perceptual loss with generative (adversarial) loss and applies sub-pixel convolution for upsampling can generate more photo-realistic super-resolved images.

Notable Credit

I credit Katarzyna Kańska for her inspiring presentation on “Single Image Super Resolution”: YouTube video — Can you enhance that? Single Image Super Resolution — Katarzyna Kańska.
