
What is ‘Image Super Resolution’, and why do we need it?

An introduction to the field, its applications, and current issues

Have you ever seen old monochrome pictures (most often grayscale) containing several artefacts that are then colorised and made to look as if they were taken only recently with a modern camera? That is an example of image restoration, which can be defined more generally as the process of retrieving the underlying high-quality original image given a corrupted image.

Example of image restoration, involving colorisation and removal of artefacts such as noise and blurring. Image by Wilfredor from Wikimedia

A number of factors can affect the quality of an image, with the most common being suboptimal photographing conditions (e.g. due to motion blur, poor lighting conditions), lens properties (e.g. noise, blur, lens flare), and post-processing artefacts (e.g. lossy compression schemes, which are methods that perform compression in such a way that it cannot be reversed and thus leads to a loss of information).

Another factor that can affect image quality is resolution. More specifically, low-resolution (LR) images contain a low number of pixels representing an object of interest, making it hard to make out the details. This can be either because the image itself is small, or because an object is far away from the camera, thereby causing it to occupy a small area within the image. Super-Resolution (SR) is a branch of Artificial Intelligence (AI) that aims to tackle this problem, whereby a given LR image can be upscaled to retrieve an image with higher resolution and thus more discernible details that can then be used in downstream tasks such as object classification, face recognition, and so on. Sources of LR images include cameras that may output low-quality images, such as mobile phones and surveillance cameras.

A meme highlighting the irony of being able to capture high quality images of planets that are billions of kilometres away while then having pictures taken right here on Earth where objects just a few metres away from the camera are practically indiscernible. Image by author, inspired by Sarcasm on Facebook. Picture of Neptune by NASA on Unsplash, picture of person holding gun by Vitaliy Izonin on Pexels.

In the rest of this article, the following will be discussed:

  1. So why not use better cameras?
  2. Don’t there already exist tools that can upscale an image?
  3. Deep Learning-based Image Super-Resolution
  4. How do we check if Super-Resolution methods are any good?
  5. How can we get the low-resolution images used for training and evaluating Super-Resolution Methods?
  6. Applications of Super-Resolution
  7. Ethical Considerations
  8. What can be done to counteract these concerns?
  9. Conclusion

So why not use better cameras?

At this point, you may be asking yourself "so why don’t we just use better quality cameras, instead of going through the trouble of developing algorithms that can give us the same result anyway?"

That’s a good point, but there are practical considerations. For example, whilst modern mobile phone cameras do capture fairly good quality images, they still exhibit several imperfections, caused primarily by the need to use lenses and image sensors that are compact enough to fit on a phone without making it too bulky, while also being relatively cheap.

In the case of CCTV, the cost of a camera generally increases with quality, while higher-quality footage also needs more storage space (unless aggressive compression schemes are used, which significantly degrade the picture quality and thus defeat the purpose of using high-quality cameras), leading to additional costs. This is particularly relevant for buildings that need tens, hundreds, or even thousands of cameras; at such scales, any costs are multiplied considerably and may make such systems unattractive to the point where they are not installed at all, compromising security.

Lastly – and perhaps most obviously – there already exist numerous images that have been captured with low quality cameras and which may contain important information. Hence, at the very least, we need a way to improve the quality of these existing images.

Don’t there already exist tools that can upscale an image?

You may also be asking yourself "but can’t we already increase the size of an image using even basic image editing software, like Microsoft Paint?"

Indeed, interpolation methods such as bilinear interpolation and bicubic interpolation are often the tools of choice in many applications, including web browsers, which generally use bilinear interpolation. However, these are relatively simple algorithms that are fast – thereby satisfying users’ demands for responsive software – but incapable of producing high-fidelity images. In fact, the resultant images tend to contain a lot of pixelation, and do not actually make it much easier to discern details.

Increasing the resolution means increasing the number of pixels, which also means that the missing information somehow needs to be inferred. This is perhaps the main reason why simple techniques like interpolation do not yield satisfactory results – because they do not leverage any knowledge garnered from looking at other similar samples to learn how to infer the missing data and create high quality images, as SR approaches are designed to do.
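As a concrete point of reference, below is a minimal sketch of what such a basic upscale looks like in Python using the Pillow library; the file names and the 4× scale factor are placeholders chosen purely for illustration.

```python
from PIL import Image

# Load a low-resolution image (placeholder file name)
lr_image = Image.open("input_low_res.png")

# Upscale by a factor of 4 using bicubic interpolation.
# This only interpolates between existing pixel values; no new detail is inferred,
# which is why the result tends to look blurry or pixelated.
scale = 4
hr_size = (lr_image.width * scale, lr_image.height * scale)
upscaled = lr_image.resize(hr_size, resample=Image.BICUBIC)
upscaled.save("upscaled_bicubic.png")
```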

Overview of the super-resolution task: given low-resolution images, a super-resolution network is then tasked with improving the quality of the images to yield super-resolved images. The images can depict anything from buildings, to faces, to satellite imagery, and so on. The original high-resolution images are shown for comparison, with the photo on top by Sorasak on [Unsplash](https://unsplash.com/?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText) and the bottom photo by Lars Bo Nielsen on Unsplash. Low-resolution and super-resolved images created by the author.

The creation of new information also means that the SR task is a non-trivial ill-posed problem, since there are multiple ways in which to super-resolve an image. In other words, several plausible images may exist for any given LR image. The problem is compounded in the presence of the other factors mentioned above that further degrade the image quality, with sensor noise, blur, and compression being the most common. Despite this, most methods only output a single super-resolved image, although a fairly recent stream of research is also exploring ways to enable the generation of multiple plausible images for a given LR image.

The problem of SR: Many HR images can be downsampled to a single LR image…
…and, conversely, a single LR image can be super-resolved to multiple HR images

Deep Learning-based Image Super-Resolution

Researchers have long been developing methods that allow the retrieval of the underlying good quality images using a variety of techniques such as sparse representation-based methods. However, it was the advent of deep learning and convolutional neural networks that arguably brought about the most significant leaps forward, with the seminal work being the Super-Resolution Convolutional Neural Network (SRCNN) proposed by Dong et al. in 2014. Much work has been done since then, not only on the design and structure of the neural networks but also on the data used to train and evaluate these networks.
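To give a flavour of how simple these early networks were, here is a minimal, illustrative PyTorch sketch of an SRCNN-style model: three convolutional layers applied to a bicubically pre-upscaled input, loosely following the 9-1-5 kernel configuration of the original paper. Treat it as a sketch of the general idea rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Minimal SRCNN-style network. The low-resolution input is assumed to have
    already been upscaled with bicubic interpolation; the network then learns
    to restore finer details."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Quick sanity check on a dummy (bicubically upscaled) image tensor
model = SRCNN()
dummy = torch.rand(1, 3, 128, 128)
print(model(dummy).shape)  # torch.Size([1, 3, 128, 128])
```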

Examples of the results produced by various SR methods, including the simple bicubic interpolation and what is considered to be the first deep convolutional neural network-based SR method, SRCNN. Image available in the paper of Wang et al. (2018) where ESRGAN was proposed.

It is well-known that deep learning-based methods need substantial amounts of data for robust training, typically requiring a ground truth in a supervised setting so that the model knows what the end goal is. In the case of SR, this means having both the original high-resolution image AND the low-resolution image; however, it is hard to acquire such pairs of images in real-world scenarios – usually a photo is either high-resolution or low-resolution, and the two cannot be captured simultaneously.

Whilst one could capture two images using different camera settings or different cameras (one capturing the good quality image and the other capturing a low quality image), in practice this is not an easy task since images ideally need to have identical content. This avoids adding an additional layer of complexity that may make a network harder to train. For example, there may be differences in the scene between one capture and another, such as a moving vehicle or bird, while it is unlikely that the images would be perfectly aligned with each other.

How do we check if Super-Resolution methods are any good?

Image quality whilst training and evaluating SR methods is typically judged using objective metrics such as the Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity index (SSIM), which are known as Full-Reference Image Quality Assessment (FR-IQA) metrics.

Whilst a detailed explanation of such methods is beyond the scope of this article, suffice it to say that FR-IQA algorithms generally attempt to produce quality ratings that correlate with human subjective perceptions of quality by computing differences between the pixels in the high-resolution image and the corresponding pixels in the image being evaluated. They thus assume that the images are perfectly aligned; even the smallest shift in the vertical or horizontal direction can wreak havoc and cause the metrics to indicate that the image under evaluation is of poor quality, even if it is actually identical to the high-resolution image. This is the main reason why it is important that the images to be compared are perfectly aligned with each other.
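As an example, the snippet below computes PSNR and SSIM between a high-resolution reference and a super-resolved candidate using scikit-image; the file names are placeholders, and the two images are assumed to be of identical size and perfectly aligned, for the reasons just discussed.

```python
import numpy as np
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder file names; both images must have identical dimensions
hr = io.imread("high_res.png").astype(np.float64)
sr = io.imread("super_resolved.png").astype(np.float64)

psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
ssim = structural_similarity(hr, sr, channel_axis=-1, data_range=255)

print(f"PSNR: {psnr:.2f} dB")  # higher is better
print(f"SSIM: {ssim:.4f}")     # 1.0 indicates identical structure
```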

Whilst there do exist No-Reference (NR-IQA) metrics that predict subjective quality using solely the image to be evaluated (i.e. without comparing it to the high-resolution counterpart), such as BRISQUE and NIQE, these tend to be non-differentiable and thus cannot be used during the training of neural networks. However, such metrics are increasingly being used when evaluating SR methods.

How can we get the low-resolution images used for training and evaluating Super-Resolution Methods?

To counteract the above issues, high-resolution images are typically degraded synthetically using a degradation model, which defines the type and magnitude of the artefacts to be applied to the images in a dataset in order to yield the corresponding synthetic low-resolution images. Since each degraded image is derived directly from a high-resolution image, the pairs are perfectly aligned and can be used for training and evaluation in combination with FR-IQA metrics.
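As an illustration, a very simple degradation pipeline is sketched below using OpenCV, combining blur, bicubic downsampling, sensor noise, and JPEG compression. Real degradation models vary considerably; the parameter values and file names here are arbitrary placeholders.

```python
import cv2
import numpy as np

def degrade(hr: np.ndarray, scale: int = 4, noise_sigma: float = 5.0,
            jpeg_quality: int = 75) -> np.ndarray:
    """Synthetically degrade a high-resolution image into a low-resolution one."""
    # 1. Blur (roughly simulating out-of-focus or lens blur)
    img = cv2.GaussianBlur(hr, (7, 7), 1.5)
    # 2. Bicubic downsampling by the chosen scale factor
    h, w = img.shape[:2]
    img = cv2.resize(img, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    # 3. Additive Gaussian noise (roughly simulating sensor noise)
    noise = np.random.normal(0, noise_sigma, img.shape)
    img = np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)
    # 4. JPEG compression artefacts
    _, buffer = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])
    return cv2.imdecode(buffer, cv2.IMREAD_UNCHANGED)

hr_image = cv2.imread("high_res.png")   # placeholder file name
lr_image = degrade(hr_image)            # perfectly aligned LR counterpart
cv2.imwrite("low_res.png", lr_image)
```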

However, such images may not be truly representative of real-world images, meaning that models trained on synthetic images tend to break down in real-world applications. Thus, there have also been efforts, such as the "Zero-Shot" SR (ZSSR) method, to avoid the use of low/high-resolution image pairs in a self-supervised setting, by exploiting the observation that images tend to have recurring patterns within them at different scales and locations. This allows an algorithm to be trained only on low-resolution images. However, it has been argued that not all images conform to this assumption (e.g. images with a lot of diverse content or, conversely, a lot of uniform areas), making this type of method unsuitable for such images.

Applications of Super-Resolution

Super-resolution methods can be applied to virtually any type of image content, be it natural scenes, buildings, or even anime images. Super-resolution methods designed for these content types are typically aimed more at the entertainment industry, for instance to enhance the end-user experience by improving the image quality, which can in turn make the viewing experience more pleasing.

There also exist applications targeted more towards security and law enforcement, such as the super-resolution of face images, vehicle licence plates, and multispectral satellite imagery, among others.

Super-resolution of a real multi-exposure sequence of 9 SkySat images; original low-resolution images with varying exposures shown on top, with results using five different methods shown in the middle row. The last row shows a closer look at a particular region of the images. Image acquired from Nguyen et al. (2022)

So far, the methods described above have been assumed to operate on one image at a time, and are thus labelled Single Image SR (SISR) methods. However, it is also possible to perform SR on images in a video sequence (where they are called ‘frames’), allowing the exploitation of temporal information (i.e. information across time) in addition to the spatial information (i.e. the image content). This additional information should enable superior results compared to the use of a single image only.

All in all, given a low quality image – whatever the source – methods capable of improving the image quality can be designed and implemented so as to yield clearer details and more usable information.

Ethical Considerations

As in most applications, ethical concerns need to be considered. This is particularly relevant in image super-resolution, where an entire image is essentially being generated by a machine.

Without constraints, a super-resolution model could theoretically generate images that bear absolutely no relationship to the original image. While this may sound a bit far-fetched, it is actually a problem for some methods, especially those based on Generative Adversarial Networks (GANs).

GAN-based methods are typically trained to prioritise the generation of perceptually pleasing images by employing a discriminator that is trained to tell apart ‘real’ images from ‘fake’ images. Its aim is to drive the generator (the part of the network responsible for creating the images) to produce more realistic-looking images capable of fooling the discriminator into thinking they are real photographs. However, this may come at the expense of losing some fidelity to the original image content, especially if the image to be super-resolved is of particularly poor quality.

This trait is actually desirable for some applications, such as image synthesis using semantic segmentation masks or text descriptions, as done by the popular DALL·E 2 system.

However, some applications require much more care and attention, such as security and law enforcement. When super-resolving images or videos that contain potentially useful information to identify perpetrators, it would be unacceptable to yield information that was not present within the original image. For instance, the super-resolution of a vehicle licence plate cannot yield the information of a vehicle that wasn’t actually present anywhere near the crime scene area.

This is perhaps even more crucial when it comes to people. Specifically, an algorithm cannot output the face of a person who was not in the scene. In these cases, it is imperative that any method employed is able to preserve the identity of the subject, to ensure that suspects are brought to justice while avoiding a change in identity, lest an innocent person be wrongly accused and convicted – and the actual criminal remain at large.

The process in law courts is also such that a case may be jeopardised by the use of super-resolved images where extraneous information could have been inferred. Even if the right criminal is put on trial, any doubts about the evidence garnered against them could well lead to their release. The robustness of such methods is thus paramount, not only to ensure fair trials but also to avoid accusing the wrong people and basing investigations on erroneous information.

The way super-resolved images appear – the balance between creating images that are faithful to the original content at the expense of potentially not being nice to look at, and creating images that are pleasing to look at but which may contain details that differ from what was depicted in the original image – is largely controlled by what are known as loss functions.

Loss functions essentially evaluate the outputs of neural networks whilst they are being trained, and more than one may be used at a time. Indeed, the choice of loss functions is part of the problem in GANs, where one function tends to measure the ‘real-ness’ of the generated image. In more technical terms, given that a GAN essentially recreates probability distributions, loss functions determine the differences between the generated distribution and the distribution of the real data being modelled.

However, this kind of loss function cannot be used by itself, due to the risk that a neural network may take the easy way out and generate images that look very nice but bear no resemblance to the original image. Hence, functions that measure pixel-level differences are also employed (typically the Mean Squared Error (MSE) and the closely related Peak Signal-to-Noise Ratio (PSNR)), in an attempt to constrain the outputs of the models to at least bear a resemblance to the original image.
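To make this concrete, a combined generator loss might look something like the following PyTorch sketch; the adversarial weighting factor is an illustrative value rather than a prescribed setting.

```python
import torch
import torch.nn as nn

pixel_loss_fn = nn.L1Loss()                   # pixel-level fidelity term (MSE is also common)
adversarial_loss_fn = nn.BCEWithLogitsLoss()  # 'real-ness' term driven by the discriminator

def generator_loss(sr, hr, discriminator_logits, adv_weight: float = 1e-3):
    """Combine a pixel loss (fidelity to the ground truth) with an adversarial
    loss (perceptual realism); adv_weight balances the two objectives."""
    pixel_loss = pixel_loss_fn(sr, hr)
    # The generator wants the discriminator to label its outputs as 'real' (1)
    real_labels = torch.ones_like(discriminator_logits)
    adv_loss = adversarial_loss_fn(discriminator_logits, real_labels)
    return pixel_loss + adv_weight * adv_loss
```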

It is then up to the designers of the models to determine how the loss terms are weighted, depending on the intended application. Thus, it may still be possible that more importance is given to perceptually pleasing content rather than to faithful recreations of the original content.

Most super-resolution methods employ at least one pixel-based loss, to prevent obvious attributes such as the skin or hair colours from being changed drastically. However, smaller details such as the eye colour are easily lost in the degradation process and it is then up to the super-resolution algorithm to determine the colour to use. Hence, this will be highly influenced by the distribution of eye colours of the subjects in the training set.

However, research has also been conducted on designing super-resolution models that not only process a low-resolution image, but also exploit the attributes of the person depicted in the image. This has the potential of improving the robustness of super-resolution algorithms by removing some of the guess-work which they need to employ in the absence of any information.

Law enforcement officers already ask eyewitnesses for descriptions of attributes such as the person’s gender, age, presence of facial hair, and so on. These are then sometimes used to create forensic sketches which can be disseminated to the public so that anyone recognising the suspect depicted in the sketch can come forward with information leading to an arrest.

Methods have also been created to automatically determine the identity of subjects depicted in sketches by comparing them with real-world photographs, for a variety of sketch types such as those obtained using software programs and those hand-drawn by forensic artists for use in real-world investigations.

These attributes can also be used as supplementary information to help guide a super-resolution model in yielding images that are even more representative of the actual high-resolution image.

But what happens if the wrong attributes are supplied? The effect, at least in the work by Yu et al. (2018), is generally quite subtle, with the change in attributes serving more to fine-tune the end result. That said, some attributes such as gender can lead to noticeably different results, as shown in Figure 4 of Yu et al. (2018). Hence, care must be taken to ensure that the wrong attributes are not supplied, or at least to bear in mind, when interpreting the super-resolved images, that the attributes may be incorrect and the end result negatively influenced.

However, the ability to adjust attributes allows the creation of multiple images (in contrast to most super-resolution methods, which generate only one image), and could thus actually provide greater robustness when the exact attributes are unknown or uncertain, by yielding multiple plausible examples rather than just a single image.

What about the data used to train the models? How does it affect the performance of super-resolution models? It is well-known that deep learning-based methods require vast amounts of data to avoid effects such as over-fitting, where a model tunes its parameters in such a way that it works very well on the training data but then falls apart when tested on unseen data. Hence, the more data used to train these models, the lower such risks become.

Example of over-fitting, with the training error shown in blue and the validation set error shown in red. Both errors are a function of the number of training cycles. When the error on the validation set starts increasing while the training set error continues to decrease, it is likely that the model is being over-fit. Image by Dake~commonswiki on Wikimedia Commons.

Another concern is in how the various [hyper-parameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) are chosen. Hyper-parameters control how a network is trained, so it is important to tune them carefully. Specifically, data in a dataset is typically divided into three components: a training set, a validation set, and a testing set.

The training set, as the name implies, is used to train the model, while the testing set is used simply to evaluate the performance of the model on data which it has never seen. This helps determine the generalisability of the model, and whether phenomena such as over-fitting have occurred.

Meanwhile, the validation set contains data that is used to evaluate the performance of the model whilst training. Hence, it can be viewed as representing a test set whilst training a model. Parameters should be tuned on this set to prevent the performance being biased to the data in the training set (which might not make the model generalisable), and to also prevent the performance being tuned on the test data (the test data gives us an indication of the real-world performance on unseen data, so it is critical that it is used for evaluation purposes only; otherwise, performance in real-world applications might be worse than expected).
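A simple way of carving a dataset into these three subsets is sketched below; the helper function and its 80/10/10 proportions are illustrative choices rather than a standard recipe.

```python
import random

def split_dataset(image_paths, train_frac=0.8, val_frac=0.1, seed=42):
    """Randomly split a list of image paths into train/validation/test subsets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * train_frac)
    n_val = int(len(paths) * val_frac)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]  # the remainder is kept strictly for final evaluation
    return train, val, test
```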

Another major concern regards differences in how well the various properties and groups in the data are represented (in terms of quantity), more commonly known as class imbalance. In the case of face super-resolution, these properties can range from the number of images per subject to the number of subjects belonging to specific demographic groups. For instance, many datasets tend to contain more ‘white’ people than any other demographic, which could cause a model to perform well on these images but poorly on under-represented classes.

While these effects are not very well-studied in the field of super-resolution (especially in terms of objective performance metrics), some observations can be made when studying the super-resolved images in more detail. The potential difference in eye colour, as mentioned above, is one such observation. It has also been observed that men’s lips may appear to have lipstick on them if the model is trained on a dataset containing many female subjects with lipstick applied, and the model determines that the subject is female. The age of the subject may also be affected by the proportion of young subjects to older subjects in the training dataset; for example, a model may tend to make subjects look younger if the dataset contains predominantly young subjects (which may not always be a bad thing…).

Work to counteract issues arising from class imbalance in a given dataset has also been done in other domains. For instance, _data augmentation_ is often used as a means to artificially increase the amount of data used. Moreover, the functions used to determine the robustness of a model can also be tuned to cater for any class imbalances. ‘Balanced datasets’ such as FairFace and Balanced Faces in the Wild (BFW), where demographic groups are represented by an equal number of subjects, have also been created and could thus help ensure that any biases arising from differences in class representations are virtually eliminated.
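As one possible mitigation, the sketch below combines simple data augmentation with PyTorch’s WeightedRandomSampler so that samples from under-represented groups are drawn more frequently during training; the group labels and weights are purely hypothetical and only serve to illustrate the idea.

```python
import torch
from torch.utils.data import WeightedRandomSampler
from torchvision import transforms

# Simple augmentations that artificially increase data variety
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
])

# Hypothetical per-sample group labels (e.g. demographic group indices)
group_labels = [0, 0, 0, 0, 1, 2, 1, 0]
counts = torch.bincount(torch.tensor(group_labels)).float()
weights = 1.0 / counts[group_labels]  # rarer groups receive larger sampling weights

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# loader = DataLoader(dataset, batch_size=16, sampler=sampler)  # 'dataset' assumed to be defined
```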

What can be done to counteract these concerns?

There are several approaches that can be employed to avoid the generation of erroneous information for applications that are of a particularly sensitive nature. A few potential solutions have already been given in the previous section, but some other approaches will now also be considered.

Perhaps one of the most obvious ways to prevent the generation of wrong data is to avoid the use of GANs in such applications. While it is true that they tend to yield images that look quite realistic, these would be of no use if the information within them is incorrect. More traditional CNN-based methods could thus be more suitable. However, care must be taken if perceptual losses – which are also aimed at making an image look more appealing – are used in such architectures.

Another rather obvious approach is to use models that are specifically trained and developed for the task at hand. For example, if the task is to super-resolve face images, then the use of a model trained on generic scenes cannot be expected to yield reliable results.

As mentioned above, the choice of losses/loss functions is also critical. Losses measuring the pixel-level differences should be employed to help ensure that the super-resolved result bears a substantial resemblance to the content in the original high-resolution image, in terms of texture, structure, and colour.

Another approach that could be considered is the use of more training data. It is well-known that deep learning-based models tend to require vast amounts of data to avoid issues such as overfitting, but the use of more diverse data may also help the network learn more possibilities for how to super-resolve an image, making it more robust when operating on images outside of its training set (i.e. on images which it has not encountered).

As previously mentioned, a recent stream of research is exploring ways to develop super-resolution methods that are able to predict the space of plausible super-resolution images, given that the loss of information actually means that multiple images could have been degraded to yield the same low-quality image. This means that rather than generating just a single image (as is normally done), a number of images can be output instead. The variations between images would mitigate the risk of basing any conclusions on just one potentially erroneous result.

Models that are able to use supplementary information can also be used to yield a range of images with varying details (in lieu of just a single image), such as in the work of Yu et al. (2018) mentioned above that can use attributes describing a person for face super-resolution.

Attributes do not need to be related solely to the image content, but they may also describe the degradations in the image. For example, approaches have been designed to incorporate this information into existing deep learning-based methods to help guide the super-resolution models in reversing the degradations afflicting the image and yield better images.

Degradation information can be input into existing deep learning-based image super-resolution models, for example using the meta-attention block

Lastly, the level of quality of the original image could be considered. For example, if the quality is too low (which could be determined either subjectively or objectively using IQA metrics as mentioned above), then the image could be discarded outright. However, given the potentially limited amount of data to work with, it may be undesirable to throw away what limited data is available. In this case, the image could still be super-resolved, but then it would need to be kept in mind that the results may not be entirely trustworthy.

Conclusion

I hope this article has given you a flavour of the super-resolution field. It has become a vast field, with a substantial number of approaches that can be applied to practically any kind of low-quality image (such as face images, vehicle licence plates, satellite imagery, and old photos) and numerous applications in important domains such as security and law enforcement.

However, SR is not yet a solved problem (contrary to what you may be led to believe in movies and TV series), and there remain many challenges to be tackled. In the next article, I will delve deeper into the various techniques that are being used in modern SR methods, along with remaining challenges and directions for future work.


Do you have any thoughts about this article? Please feel free to post a note, comment, or message me directly on LinkedIn!

Also, make sure to **Follow** me to ensure that you’re notified upon publication of future articles.

The author is currently a post-doctoral researcher at the University of Malta in the Deep-FIR project, which is being done in collaboration with Ascent Software and is financed by the Malta Council for Science & Technology (MCST), for and on behalf of the Foundation for Science & Technology, through the FUSION: R&I Technology Development Programme.

