Pix2Pix

Connor Shorten
Towards Data Science
5 min read · Jan 29, 2019


Shocking result of Edges-to-Photo Image-to-Image translation using the Pix2Pix GAN Algorithm

This article explains the fundamental mechanisms of Pix2Pix, a popular paper on Image-to-Image translation with Conditional GANs ("Image-to-Image Translation with Conditional Adversarial Networks" by Isola et al.). A link to the paper follows: https://arxiv.org/abs/1611.07004

Article Outline

I. Introduction

II. Dual Objective Function with Adversarial and L1 Loss

III. U-Net Generator

IV. PatchGAN Discriminator

V. Evaluation

VI. Conclusions

Introduction

Image-to-Image translation is another example of a task that Generative Adversarial Networks (GANs) are perfectly suited for. These are tasks for which it is nearly impossible to hand-design a loss function. Most studies on GANs are concerned with novel image synthesis, translating a random vector z into an image. Image-to-Image translation instead converts one image into another, such as the edges of the bag above into a photo. Another interesting example of this is shown below:

Interesting idea of translating from satellite imagery into a Google Maps-style street view

Image-to-Image translation is also useful in applications such as colorization and super-resolution. However, many of the implementation ideas specific to the pix2pix algorithm are also relevant for those studying novel image synthesis.

A very high-level view of the Image-to-Image translation architecture in this paper is depicted above. Similar to many image synthesis models, it uses a Conditional-GAN framework: the conditioning image x is given as input to both the generator and the discriminator.

Dual Objective Function with Adversarial and L1 Loss

A naive way to do Image-to-Image translation would be to discard the adversarial framework altogether: a source image would simply be passed through a parametric function, and the difference between the resulting image and the ground-truth output would be used to update the weights of the network. However, designing this loss function with standard distance measures such as L1 and L2 fails to capture many of the important distinguishing characteristics of these images, typically producing blurry outputs. That said, the authors do find value in the L1 loss as a weighted sidekick to the adversarial loss.

The Conditional-Adversarial Loss (Generator versus Discriminator) is commonly written as follows, with generator G, discriminator D, input image x, target image y, and noise vector z:
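
\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]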

The L1 loss function previously mentioned is shown below:
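
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\,\lVert y - G(x, z) \rVert_1\,]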

Combining these functions results in:
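
G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G)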

In the experiments, the authors report that they found the most success with the lambda parameter equal to 100.
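
To make the combined objective concrete, here is a minimal PyTorch-style sketch of a generator update that mixes the adversarial term with the weighted L1 term. The generator and discriminator modules, their call signatures, and the use of a binary cross-entropy adversarial loss are illustrative assumptions, not the authors' exact training code.

```python
import torch
import torch.nn as nn

# Loss terms: adversarial (BCE on discriminator logits) plus weighted L1 reconstruction.
adversarial_loss = nn.BCEWithLogitsLoss()
l1_loss = nn.L1Loss()
lambda_l1 = 100.0  # the weighting the authors report working best

def generator_step(generator, discriminator, x, y):
    """Compute the generator loss for one batch of (input x, target y) image pairs."""
    fake_y = generator(x)                     # G(x)
    pred_fake = discriminator(x, fake_y)      # D(x, G(x)) -- conditioned on the input image
    real_labels = torch.ones_like(pred_fake)  # the generator wants D to output "real"

    loss_gan = adversarial_loss(pred_fake, real_labels)
    loss_rec = l1_loss(fake_y, y)
    return loss_gan + lambda_l1 * loss_rec
```

The discriminator is trained in the usual alternating fashion, on real pairs (x, y) and fake pairs (x, G(x)).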

U-Net Generator

The U-Net architecture used in the Generator of the GAN was a very interesting component of this paper. Image synthesis architectures typically take in a random vector of size 100x1, project it into a much higher-dimensional vector with a fully connected layer, reshape it, and then apply a series of deconvolution (transposed convolution) operations until the desired spatial resolution is achieved. In contrast, the Generator in pix2pix resembles an auto-encoder.

The Skip Connections in the U-Net differentiate it from a standard Encoder-decoder architecture

The Generator takes in the image to be translated and compresses it into a low-dimensional “bottleneck” vector representation. The Generator then learns how to upsample this into the output image. As illustrated in the image above, it is interesting to consider the differences between the standard Encoder-Decoder structure and the U-Net. The U-Net is similar to ResNets in that information from earlier layers is integrated into later layers (although the U-Net concatenates feature maps where a ResNet adds them). The U-Net skip connections are also convenient because they do not require any resizing or projections, since the spatial resolutions of the layers being connected already match each other.
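
To make the skip connections concrete, below is a heavily simplified PyTorch sketch of a U-Net-style generator with just two resolution levels. The layer widths, kernel sizes, and module names are illustrative assumptions; the actual pix2pix generator is much deeper.

```python
import torch
import torch.nn as nn

class TinyUNetGenerator(nn.Module):
    """Illustrative two-level U-Net: encoder features are concatenated into the
    decoder at the matching spatial resolution (the skip connections)."""
    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2))
        self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1), nn.LeakyReLU(0.2))
        self.up1 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.ReLU())
        # Twice the input channels here, because the skip connection is concatenated.
        self.up2 = nn.Sequential(nn.ConvTranspose2d(base * 2, out_ch, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        d1 = self.down1(x)               # H/2 resolution
        d2 = self.down2(d1)              # H/4 resolution (the "bottleneck" in this toy model)
        u1 = self.up1(d2)                # back to H/2
        u1 = torch.cat([u1, d1], dim=1)  # skip connection: no resizing or projection needed
        return self.up2(u1)              # back to the original H x W

out = TinyUNetGenerator()(torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 3, 256, 256])
```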

PatchGAN Discriminator

The PatchGAN discriminator used in pix2pix is another unique component of this design. The PatchGAN / Markovian discriminator works by classifying individual (N x N) patches in the image as “real vs. fake”, as opposed to classifying the entire image as “real vs. fake”. The authors reason that this enforces more constraints that encourage sharp high-frequency detail. Additionally, the PatchGAN has fewer parameters and runs faster than a discriminator that classifies the entire image. The image below depicts the results of experimenting with the size of N for the N x N patches to be classified:

The 70 x 70 Patch is found to produce the best results
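
A rough PyTorch sketch of the idea is shown below: the discriminator is fully convolutional and emits a grid of real/fake logits, each of which sees only a patch-sized region of the concatenated (input, output) image pair. The layer counts and channel widths are illustrative assumptions, not the exact configuration that yields a 70 x 70 receptive field.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully convolutional discriminator: outputs a grid of real/fake logits,
    each covering only a local patch of the conditioned image pair."""
    def __init__(self, in_ch=6, base=64):  # 6 channels = input image + candidate output, concatenated
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, 1, 4, 1, 1),  # 1-channel grid of patch logits
        )

    def forward(self, x, y):
        # Condition on the input image by concatenating it with the candidate output.
        return self.net(torch.cat([x, y], dim=1))

scores = PatchDiscriminator()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(scores.shape)  # torch.Size([1, 1, 63, 63]) -- a grid of patch decisions, not one scalar
```

Each element of the output grid is trained against the same real/fake target, so averaging the per-patch losses effectively averages the discriminator's judgement over all patches in the image.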

Evaluation

Evaluating GAN outputs is difficult, and there are many different ways of doing it. The authors of pix2pix use two different strategies to evaluate their results.

The first strategy is human scoring: real images and images created with pix2pix are randomly mixed together, and human scorers label each image as real or fake after seeing it for 1 second. This is done on the Amazon Mechanical Turk platform.

Another evaluation strategy, which I found very interesting, is the use of a semantic segmentation network on the synthetically generated images: if the outputs are realistic, an off-the-shelf segmentation network should be able to recover the label maps they were generated from. This is analogous to another very popular quantitative evaluation metric for GAN outputs known as the “Inception Score”, where the quality of synthesized images is rated based on a pre-trained Inception model’s ability to classify them.

Far Left: Semantic Segmentation Label, Second: Ground Truth Image, Third: L1 Distance Used, Fourth: cGAN used, Far Right: L1 Distance + cGAN
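
The following is a minimal sketch of that second idea, assuming a pretrained segmentation model and label maps in a compatible format are already available; the paper itself uses an FCN-8s model trained on Cityscapes, so the function name and tensor shapes here are assumptions for illustration only.

```python
import torch

def segmentation_score(seg_model, generated_images, label_maps):
    """Per-pixel accuracy of a pretrained segmentation model on generated images.
    If the generated images are realistic, the model should recover the labels
    they were generated from."""
    seg_model.eval()
    with torch.no_grad():
        logits = seg_model(generated_images)  # assumed shape: (B, num_classes, H, W)
        predictions = logits.argmax(dim=1)    # (B, H, W) predicted class per pixel
    correct = (predictions == label_maps).float()
    return correct.mean().item()              # higher means more "parseable" outputs
```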

Conclusions

Pix2Pix is a very interesting strategy for Image-to-Image translation, combining an L1 distance loss with an adversarial loss and adding novelties in the design of the Generator and Discriminator. Thanks for reading; please check out the paper for more implementation details and explanations of the experimental results!
