SRGAN, a TensorFlow Implementation

Sam Maddrell-Mander
Towards Data Science
10 min read · Oct 27, 2018


(Find the code to follow this post here.)

We've all seen that moment in a crime thriller where the hero asks the tech guy to zoom and enhance an image: number plates become readable, pixelated faces become clear, and whatever evidence is needed to solve the case is found.

And, we’ve all scoffed, laughed and muttered something under our breath about how lost information can’t be restored.

Not any more. Well, sort of. It turns out the information is only partly lost. Just as we humans can infer the detail of a blurry image from what we know about the world, we can now apply the same logic to images to recover 'photorealistic' details lost to resolution effects.

This is the essence of Super Resolution: unlocking information at the sub-pixel scale by learning the complicated mapping from low- to high-resolution images.

The CSI cliché aside, the real-life applications of super resolution are numerous and incredibly lucrative.

Old family photos lacking in detail can be restored and enhanced until people's faces are visible, the camera on your phone can capture images like an SLR, and the same ideas scale all the way up to sensor data for medical imaging or autonomous vehicles.

Then there's the business side to it: data is the new oil. Regardless of how stale that cliché may be, what's certainly true is that high-quality data is expensive, and people will pay through the nose for it. High-quality data can mean the difference between coal-fired and rocket-fuelled as far as data science projects go. So the idea that it would be possible to simply 'enhance' the image sets companies already have? That's an incredibly tempting proposition.

At the rate camera technology has improved over the last ten years, we now expect pixel-perfect, rich images on everything we see. It might sound funny, but an early adopter of this kind of tech has been a user-curated recipe website with images dating back over a decade. By enhancing old images they hope to preserve the value of older recipes. (here)

Turning a smartphone into an SLR hints at a subtler detail of the super resolution method: what's being learnt is a mapping from one space into another, in our case from low resolution to high. But there's nothing that says that's all it can do, so why not include style transfer as well? Enhance the image to a high resolution and, while we're at it, tweak the exposure and contrast, add some depth, and maybe open people's eyes? These are all examples of the same methodology. For an excellent post covering more examples of 'vanilla' style transfer, look here.

And in many ways the most interesting example is in sensor technology. Enormous amounts of time and money are spent on developing sensors for medical imaging, safety and surveillance, which are then often deployed in challenging conditions without the budget to take advantage of cutting-edge hardware.

This culminates in work like a recent paper applying 3D SRGANs to MRI data (here), or their use for microscopy in the lab (here). In the future, could a hospital or lab, rather than spending the money required to get one state-of-the-art machine, buy several less expensive models, employ additional staff and see more patients with the same outcomes?

Regardless of the application, super resolution is here to stay, but the reality is it has been on the fringes for a long time.

Various methods have existed almost as long as image processing itself (bicubic, linear interpolation, etc.), culminating recently in some very promising neural network approaches that describe the complicated multidimensional mapping from the LR space to the HR space.

However, all these methods have fallen at the most important stumbling block: they fail to consistently produce images that look natural to the human eye.

A breakthrough was made in 2017 by a group from Twitter (here). Rather than doing anything radically different architecturally from their peers, they turned their attention to the humble loss function.

They implemented something called a perceptual loss function, which better tuned the network to produce images pleasing to the human eye.

They did this in part by using a clever representation trick, where a pre-trained state-of-the-art CNN model (VGG, from the group at Oxford) was used to calculate a loss based on the feature mapping of generated images compared to their high-resolution ground truths.

These improvements yielded staggering results. The details are not always perfect but unlike most attempts the details are there, and the feel of the image as a whole is excellent.

From the SRGAN paper, the proposed image is almost identical to the original even with a four times downsampling factor. Look at the details around the eyes or the whiskers.

It's not immediately intuitive why generating realistic images is any harder when there is a reference image to start with than in pure generation, so to explore this idea a little let's return to our forger and expert and consider what this new paradigm would mean for them.

Instead of the gallery selling any old pieces of extremely valuable art, they are hosting several well known pieces expected to be sold for record sums.

Our forger has learnt very little about the art world (they like to keep work and home life separate) and has no idea what these paintings are supposed to look like. However, just before they sit down to paint their submission, they see a small image on a flyer showing the paintings that are up for auction. Great, now they have a reference image!

The only problem is the flyer is tiny and the real painting is huge, and they know the expert will be looking incredibly closely. They have to get the details right.

So now it's not just good enough to paint a great painting; it has to be that specific painting. Knowing nothing about the detail doesn't deter the forger, though.

Initially the forger decides the best approach is to vary the detail so that, on average over the size of a pixel in the flyer, the forgery matches the colour of that pixel. As an attempt at realism they try to avoid sharp discontinuities of colour and make sure the brush strokes (pixels) run together smoothly. They ask a friend to sneak into the auction house and check the individual brushstrokes against the real image for them, one by one.

This first approach is akin to MSE (Mean Square Error, with artistic licence.)

This works well for the forger initially, but they hit a stumbling block: the expert can't quite put their finger on it, but something seems off about these images. There's a fuzziness, or lack of sharpness, that doesn't match the canvas size, and these images are mostly rejected as fakes. By matching brushstrokes the forger has lost the sense of the whole image: the technique in each brush stroke is perfect, but as a collection the strokes don't fully capture the style, and generalising from image to image is hard work.

So the forger takes a different approach. Rather than worrying exclusively about the individual brush strokes matching the image, they want the painting to resemble real objects in the world. So they ask another friend to describe the painting they produce; this friend is not an artist and doesn't care about technique, but meticulously describes the objects in the painting.

They then get the same friend to go to the auction house and take notes on the painting the forger is trying to replicate. Then the forger compares the notes: does the forgery match the description of the real image? This is the VGG perceptual loss.

So finally, by combining the feedback from these two friends, the forger learns the details of the images in such exquisite detail that they can produce replicas of the masterpieces from nothing more than the small image on the flyer and a couple of insiders to feed back information, and by doing so makes them all a great deal of money.

Let's think about this more technically for a minute. In my previous article we generated digits from the MNIST dataset using a conditional network, i.e. specifying the class of the image produced. This is a constraint, restricting the image produced to lie within a certain region of the learnt distribution. However, by choosing a specific image to fill in, we restrict the freedom of the generator much more significantly.

We now require continuity over long ranges, and detail reconstructed in such a way that it looks convincing even when so much of the original information has been lost.

The breakthrough comes with the advent of the perceptual loss function. This is the second method used by the forger above.

This surprisingly simple idea just combines the content loss (VGG) with an appropriately weighted adversarial loss, at a ratio of 1000:1. That is enough to encourage the generator to find solutions that lie within the PDF of natural images, without overly conditioning the network to reproduce rather than generate.

While it's important to reproduce the correct pixels, learning this representation through MSE lacks context. The idea of using the VGG network is that it has an excellent 'feel' for features in general, and the spatial relation of pixels to each other carries more weight. By comparing the latent feature representations at several layers of the VGG network, both high- and low-level features are encouraged in a realistic way and can guide the style of the generated image.
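As a rough illustration of that combination, here is a minimal sketch in TensorFlow (not the exact code from the repo linked above); `vgg_features` stands in for an assumed feature-extractor function, one possible version of which is sketched further down.

```python
import tensorflow as tf

def perceptual_loss(hr_images, sr_images, disc_logits_fake, vgg_features):
    """Content (VGG feature) loss plus a lightly weighted adversarial term (~1000:1)."""
    # Content loss: MSE between the VGG feature maps of real and generated images.
    content_loss = tf.reduce_mean(
        tf.square(vgg_features(hr_images) - vgg_features(sr_images)))

    # Adversarial loss: push the generator towards outputs the discriminator calls real.
    adversarial_loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(
            labels=tf.ones_like(disc_logits_fake), logits=disc_logits_fake))

    return content_loss + 1e-3 * adversarial_loss
```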

It's worth taking a minute to look at the maths behind these loss functions to understand the implementation, but those not interested can skip ahead to the results section.

The first equation shows the standard min/max game played by the discriminator and generator. This is the standard way to train GANs, relying on an equilibrium being found, and trusting the discriminator to be the guiding force on the generator.
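Roughly as it appears in the SRGAN paper:

$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{I^{HR} \sim p_{\text{train}}(I^{HR})}\left[\log D_{\theta_D}(I^{HR})\right] + \mathbb{E}_{I^{LR} \sim p_{G}(I^{LR})}\left[\log\left(1 - D_{\theta_D}\left(G_{\theta_G}(I^{LR})\right)\right)\right]$$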

The expectation values are minimised with respect to the generator parameters, and maximised w.r.t. the discriminator parameters, until an equilibrium is reached

The perceptual loss is described in the second equation. This is the key original contribution of the paper: a content loss (MSE or VGG in this case) is paired with a standard generator loss that tries to fool the discriminator.
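In the paper's notation, roughly:

$$l^{SR} = l^{SR}_{X} + 10^{-3}\, l^{SR}_{Gen}$$

with $l^{SR}_{X}$ the content loss (MSE or VGG based) and $l^{SR}_{Gen}$ the adversarial generator loss.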

The Super Resolution loss is a sum of the content loss (either MSE or VGG based) and the standard generator loss (to best fool the discriminator).

In the case of the MSE loss (third equation) it's just a squared-difference sum over the generated image and the target image. This is clearly minimised when the generated image is close to the target, but it makes generalisation hard as there's nothing to explicitly encourage contextually aware generation.
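Roughly, with $r$ the downsampling factor (four here):

$$l^{SR}_{MSE} = \frac{1}{r^2 W H}\sum_{x=1}^{rW}\sum_{y=1}^{rH}\left(I^{HR}_{x,y} - G_{\theta_G}(I^{LR})_{x,y}\right)^2$$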

The MSE loss is summed over the width (W) and height (H) of the images; it is minimised by perfectly matching the pixels of the generated and original images.

The fourth equation shows the breakthrough in the SRGAN paper: by taking a difference sum over the feature space from the VGG network instead of the pixels, features are matched instead. This makes the generator much more capable of producing natural-looking images than pure pixel matching alone.
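Roughly, with $\phi_{i,j}$ the feature map from the $j$-th convolution before the $i$-th max-pooling layer of the VGG network:

$$l^{SR}_{VGG/i,j} = \frac{1}{W_{i,j} H_{i,j}}\sum_{x=1}^{W_{i,j}}\sum_{y=1}^{H_{i,j}}\left(\phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}\left(G_{\theta_G}(I^{LR})\right)_{x,y}\right)^2$$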

The VGG loss is similar to the MSE loss, but instead of summing over the image pixels, it sums over the feature mapping of the image from the VGG network.

The structure of the network is similar to a typical GAN in most respects. The discriminator network is just a standard CNN binary classifier with a single dense layer at the end; the generator is a little less standard, with deconvolutional layers (conv2d_transpose) and the addition of skip connections to produce the 4x upscaling.
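Before looking at the generator in more detail, here is a minimal tf.keras sketch of that style of discriminator (the filter counts and input size are illustrative assumptions, not the exact values used in this repo):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(hr_shape=(96, 96, 3)):
    """Standard CNN binary classifier: strided conv blocks, then a single dense logit."""
    model = tf.keras.Sequential(name='discriminator')
    model.add(layers.InputLayer(input_shape=hr_shape))
    for i, filters in enumerate([64, 128, 256, 512]):
        model.add(layers.Conv2D(filters, kernel_size=3, strides=2, padding='same'))
        if i > 0:  # batch normalisation on every block except the first, as is typical
            model.add(layers.BatchNormalization())
        model.add(layers.LeakyReLU(0.2))
    model.add(layers.Flatten())
    model.add(layers.Dense(1))  # single dense layer producing the real/fake logit
    return model
```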

Skip connections are a regular feature of residual blocks in networks; essentially all it means is that the state of the matrix is saved at the start of a block and added to the result at the end of the block. This happens for each of the first five blocks, as well as via a skip connection that bypasses the entire first five blocks. The output from the final layer, deconv5, has the desired image dimensions. Note that the generator consists of some 15 deconvolutional layers, each with batch normalisation (except the first layer, as is typical) and ReLU activations throughout.
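To make the skip connection idea concrete, here is a minimal sketch of one such block and the long skip around the stack of them (illustrative layer sizes, shapes and output activation, not the exact generator used here):

```python
import tensorflow as tf
from tensorflow.keras import layers

def skip_block(x, filters=64):
    """Save the state at the start of the block and add it back on at the end."""
    shortcut = x                                      # state saved at the start of the block
    x = layers.Conv2DTranspose(filters, 3, strides=1, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2DTranspose(filters, 3, strides=1, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.Add()([shortcut, x])                # skip connection: added back at the end

def build_generator(lr_shape=(24, 24, 3), n_blocks=5):
    inputs = layers.Input(shape=lr_shape)
    # First layer: no batch normalisation, as noted above.
    x = layers.Conv2DTranspose(64, 3, strides=1, padding='same', activation='relu')(inputs)
    long_skip = x                                     # saved for the skip bypassing all five blocks
    for _ in range(n_blocks):
        x = skip_block(x)
    x = layers.Add()([long_skip, x])                  # skip connection around the whole stack
    # Two strided transpose convolutions give the overall 4x upscaling.
    x = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)
    x = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)
    # Final layer produces an output at the desired image dimensions.
    outputs = layers.Conv2DTranspose(3, 3, strides=1, padding='same', activation='tanh')(x)
    return tf.keras.Model(inputs, outputs, name='generator')
```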

The VGG network is also 15 convolutional layers deep (with three dense layers) but is otherwise fairly standard; the only addition is extracting the state of the matrix at various stages through the layers to be fed into the perceptual loss.
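One way to pull out those intermediate feature maps is with a pre-trained Keras VGG19, used purely as a frozen feature extractor (the layer chosen below is just an example; the network described above and the layers it taps may differ):

```python
import tensorflow as tf

# Pre-trained VGG with frozen ImageNet weights, used only as a feature extractor.
vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
vgg.trainable = False

# Expose an intermediate convolutional layer as the output to feed the perceptual loss.
feature_extractor = tf.keras.Model(
    inputs=vgg.input,
    outputs=vgg.get_layer('block5_conv4').output)   # example layer, deep in the network

def vgg_content_loss(hr_images, sr_images):
    """MSE between VGG feature maps of the real and generated images."""
    preprocess = tf.keras.applications.vgg19.preprocess_input
    hr_features = feature_extractor(preprocess(hr_images))
    sr_features = feature_extractor(preprocess(sr_images))
    return tf.reduce_mean(tf.square(hr_features - sr_features))
```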

Details on the implementation can be found here.

As can be seen, training proceeded rapidly, and after only a few batches realistic-looking images start to appear; however, the long tail of the graph shows that finding the photorealistic details is a slow process. This network clearly isn't producing state-of-the-art results, but for the training time (a few hours of CPU time) the results are striking. The three GIFs below show the process as the images are honed and details emerge.

Example 1. From initial haze a clear face starts to emerge quickly, then fluctuates in exact tone before settling on a stable solution. Impressive detail reconstruction can be seen in the glasses frame, largely invisible in the input image. (Although less well recovered off the face.)
Example 2. The detail here is subtle, lines around the eyes and shapes of features. Features like mouth shape are close, bags under the eyes come out appropriately, and the appearance becomes much sharper.
Example 3. Details like the dark makeup and clean smile come out fairly well, but finer-grain details like teeth and eyebrows are less well recovered. This may reflect the training data: this face is more of a portrait and closer up than is typical, and it shows the limitations of the implementation here. Despite this, the result is still a significant improvement.
