Image-to-Image Translation using CycleGAN Model

An unsupervised approach for Image-to-Image translation.

Rahil Vijay
Towards Data Science
14 min read · Nov 15, 2019


If you are completely new to GANs, I recommend you check out my previous article before beginning this post.

Even if you are not new to GANs, I still recommend going through that article, as we will need all of the basic concepts covered there to understand this one. That being said, let's begin.

Cycle Generative Adversarial Network (CycleGAN) is an approach to training deep convolutional networks for image-to-image translation tasks. Unlike other GAN models for image translation, CycleGAN learns a mapping between one image domain and another using an unsupervised approach. For example, if we are interested in translating an image of a horse into an image of a zebra, we do not need a training dataset of horses physically converted into zebras. Instead, CycleGAN trains generator networks to map an image from domain X into an image that looks like it came from domain Y (and vice versa).

You will get a deeper understanding of how it’s done as you walk through this post. So let’s begin…

CycleGAN

CycleGAN architecture. Image from https://modelzoo.co/model/mnist-svhn-transfer

We will be referring to the above image for intuition.

For a paired set of images, we can directly train a GAN to learn a mapping from x to y with the help of Pix2Pix. You can read more about Pix2Pix networks here.

But preparing paired sets of data is time-intensive and difficult. By a paired set, I mean that we would need an image of a zebra in the same position as the horse, or with the same background, in order to learn the mapping.

To solve this problem, the CycleGAN architecture was developed. CycleGANs enable learning a mapping from one domain X to another domain Y without having to find perfectly matched training pairs! Let's look at how CycleGAN does it.

Suppose we have a set of images from domain X and an unpaired set of images from domain Y. We want to be able to translate an image from one set into the other. To do this, we define a mapping G (G: X->Y) that does its best to map X to Y. But with unpaired data, we no longer have the ability to look at real and fake pairs of data; all we can do is push our model to produce an output that belongs to the target domain.

So when we push in an image of a horse (domain X), we can train a generator to produce realistic-looking images of zebras (domain Y). The problem is that we can't force the output of the generator to correspond to its input (in the above image, only the first transformation is a correct image-to-image translation). This leads to a problem called mode collapse, in which a model maps multiple inputs from domain X to the same output in domain Y. In such cases, given an input horse (domain X), all we know is that the output should look like a zebra (domain Y). To get a mapping that also corresponds to the input, we introduce an additional inverse mapping G' (G': Y->X), which tries to map Y back to X. This is called the cycle-consistency constraint.

Think of it like this: if we translate an image of a horse (domain X) into an image of a zebra (domain Y), and then translate back from the zebra (domain Y) to a horse (domain X), we should arrive back at the same image of the horse we started with.

A complete translation cycle should bring you back to the same image you started with. In the case of translation from domain X to domain Y, if the following condition is met, we say that the transformation of an image from domain X to domain Y was correct.

Condition 1: G'(G(x)) ≈ x

With the help of the cycle-consistency constraint, CycleGAN makes sure that the model learns a correct mapping from domain X to domain Y.
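
In code terms, the constraint simply says that a round trip through both generators should approximately reproduce the input. Here is a purely illustrative PyTorch-style sketch, using the generator names G_XtoY and G_YtoX that we define later in this post:

# x is a batch of images from domain X (horses)
y_hat = G_XtoY(x)                 # translate X -> Y (horse -> zebra)
x_reconstructed = G_YtoX(y_hat)   # translate back Y -> X

# cycle-consistency penalty: mean absolute difference between the original
# and the reconstructed image (we formalize this as a loss further below)
cycle_loss = torch.mean(torch.abs(x - x_reconstructed))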

Image-to-Image Translation task

The task is broken down into a series of smaller steps, from loading and visualizing the data to training the models.

Visualizing Dataset

Specifically, we’ll look at a set of images of Yosemite National Park taken either during the summer or winter. The seasons are our two domains!

Images from the summer season domain.
Images from the winter season domain.

In general, you can see that the summer images are brighter and greener than the winter images, which tend to contain snow and cloudy skies. Our main objective will be to train generators that learn to transform an image from summer into winter and vice versa. These images do not come with labels and are referred to as unpaired training data. But by using CycleGAN, we can learn a mapping from one image domain to the other with an unsupervised approach.

You can download the data by clicking here.
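
If you want to load the two domains yourself, a minimal sketch with torchvision could look like the following. The folder names summer and winter, the image size, and the batch size are assumptions here; adjust them to match how you unpack the download:

import torch
from torchvision import datasets, transforms

# resize images and convert them to tensors in [0, 1]; we scale to [-1, 1]
# later, right before feeding them into the networks
transform = transforms.Compose([
    transforms.Resize(128),
    transforms.ToTensor(),
])

# assumed layout: summer/<subfolder>/*.jpg and winter/<subfolder>/*.jpg
dataloader_X = torch.utils.data.DataLoader(
    datasets.ImageFolder('summer', transform), batch_size=16, shuffle=True)
dataloader_Y = torch.utils.data.DataLoader(
    datasets.ImageFolder('winter', transform), batch_size=16, shuffle=True)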

Defining Models

A CycleGAN comprises two discriminators (D_x and D_y) and two generators (G_xtoy and G_ytox).

  • D_x — Identifies training images from domain X as real and images translated from domain Y into domain X as fake.
  • D_y — Identifies training images from domain Y as real and images translated from domain X into domain Y as fake.
  • G_xtoy — Translates images from domain X to domain Y.
  • G_ytox — Translates images from domain Y to domain X.

Discriminators

The discriminators D_x and D_y, in this CycleGAN, are convolutional neural networks that see an image and attempt to classify it as real or fake. In this case, real is indicated by an output close to 1 and fake as close to 0. The discriminators have the following architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

# helper conv function
def conv(in_channels, out_channels, kernel_size, stride=2, padding=1, batch_norm=True):
    """Creates a convolutional layer, with optional batch normalization."""
    layers = []
    conv_layer = nn.Conv2d(in_channels=in_channels, out_channels=out_channels,
                           kernel_size=kernel_size, stride=stride, padding=padding, bias=False)
    layers.append(conv_layer)

    if batch_norm:
        layers.append(nn.BatchNorm2d(out_channels))
    return nn.Sequential(*layers)


class Discriminator(nn.Module):

    def __init__(self, conv_dim=64):
        super(Discriminator, self).__init__()

        # Define all convolutional layers
        # Should accept an RGB image as input and output a single value
        self.layer_1 = conv(3, conv_dim, 4, batch_norm=False)
        self.layer_2 = conv(conv_dim, conv_dim*2, 4)
        self.layer_3 = conv(conv_dim*2, conv_dim*4, 4)
        self.layer_4 = conv(conv_dim*4, conv_dim*8, 4)
        self.layer_5 = conv(conv_dim*8, 1, 4, 1, batch_norm=False)

    def forward(self, x):
        # define feedforward behavior
        x = F.relu(self.layer_1(x))
        x = F.relu(self.layer_2(x))
        x = F.relu(self.layer_3(x))
        x = F.relu(self.layer_4(x))

        x = self.layer_5(x)
        return x

Explanation

  • The architecture consists of five convolutional layers that output a single logit, which defines whether the image is real or fake. There are no fully connected layers in this architecture.
  • All convolutional layers, except the first and the last, are followed by batch normalization (defined in the conv helper function).
  • The ReLU activation function is used for the hidden units.
  • The number of feature maps after each convolution is based on the parameter conv_dim (in my implementation, conv_dim = 64).

Both D_x and D_y have the same architecture, so we only need to define one class, and later instantiate two discriminators.
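
For example, here is a quick, illustrative instantiation and shape check; the variable names D_X and D_Y are the same ones used in the optimizer and training code later on, and the 128x128 input size is an assumption:

D_X = Discriminator(conv_dim=64)  # judges images from domain X (summer)
D_Y = Discriminator(conv_dim=64)  # judges images from domain Y (winter)

# pass a random batch of RGB images through one discriminator
sample = torch.randn(1, 3, 128, 128)
print(D_X(sample).shape)  # prints the shape of the output logits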

Residual Blocks and Residual Function

While defining the generator architecture, we will use something called a ResNet block together with a residual function. The idea behind them is the following:

Residual Block

A residual block connects the encoder and the decoder. The motivation behind this architecture is as follows: deep neural networks can be very difficult to train, as they are more likely to have exploding or vanishing gradients and, therefore, trouble reaching convergence; batch normalization helps with this a bit.

One solution to this problem is to use Resnet blocks that allow us to learn so-called residual functions as they are applied to layer inputs.

Residual Function

When we create a deep learning model, the model (several layers with activations applied) is responsible for learning a mapping, M, from an input x to an output y.

M(x) = y

Instead of learning a direct mapping from x to y, we can instead define a residual function.

F(x) = M(x)-x

This looks at the difference between a mapping applied to x and the original input, x. F(x) is, typically, two convolutional layers plus a normalization layer, with a ReLU in between. These convolutional layers should have the same number of inputs as outputs. The mapping can then be written as a function of the residual function and the input x:

M(x) = F(x) + x

You can read more about deep residual learning here. Here is the code snippet for implementing Residual Block.

class ResidualBlock(nn.Module):
    """Defines a residual block.
    This adds an input x to a convolutional layer (applied to x) with the same size input and output.
    These blocks allow a model to learn an effective transformation from one domain to another.
    """
    def __init__(self, conv_dim):
        super(ResidualBlock, self).__init__()
        # conv_dim = number of inputs

        # define two convolutional layers + batch normalization that will act as our residual function, F(x)
        # layers should have the same shape input as output; I suggest a kernel_size of 3
        self.layer_1 = conv(conv_dim, conv_dim, 3, 1, 1, batch_norm=True)
        self.layer_2 = conv(conv_dim, conv_dim, 3, 1, 1, batch_norm=True)

    def forward(self, x):
        # apply a ReLU activation to the outputs of the first layer
        # return a summed output, x + resnet_block(x)
        out_1 = F.relu(self.layer_1(x))
        out_2 = x + self.layer_2(out_1)

        return out_2
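
A quick sanity check, just to illustrate that the block preserves the shape of its input (the numbers below are arbitrary):

block = ResidualBlock(conv_dim=256)
x = torch.randn(1, 256, 32, 32)
print(block(x).shape)  # torch.Size([1, 256, 32, 32]), same shape in, same shape out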

Generator

The generators G_xtoy and G_ytox are composed of an encoder, a conv net that turns an image into a small feature representation, and a decoder, a transpose conv net that is responsible for turning that feature representation into a transformed image. Here is the code snippet for implementing the generator.

def deconv(in_channels, out_channels, kernel_size, stride=2, padding=1, batch_norm=True):
    """Creates a transpose convolutional layer, with optional batch normalization."""
    layers = []
    # append transpose conv layer
    layers.append(nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride, padding, bias=False))
    # optional batch norm layer
    if batch_norm:
        layers.append(nn.BatchNorm2d(out_channels))
    return nn.Sequential(*layers)


class CycleGenerator(nn.Module):

    def __init__(self, conv_dim=64, n_res_blocks=6):
        super(CycleGenerator, self).__init__()

        # 1. Define the encoder part of the generator
        self.layer_1 = conv(3, conv_dim, 4)
        self.layer_2 = conv(conv_dim, conv_dim*2, 4)
        self.layer_3 = conv(conv_dim*2, conv_dim*4, 4)
        # 2. Define the resnet part of the generator
        layers = []
        for n in range(n_res_blocks):
            layers.append(ResidualBlock(conv_dim*4))
        self.res_blocks = nn.Sequential(*layers)
        # 3. Define the decoder part of the generator
        self.layer_4 = deconv(conv_dim*4, conv_dim*2, 4)
        self.layer_5 = deconv(conv_dim*2, conv_dim, 4)
        self.layer_6 = deconv(conv_dim, 3, 4, batch_norm=False)

    def forward(self, x):
        """Given an image x, returns a transformed image."""
        # define feedforward behavior, applying activations as necessary

        out = F.relu(self.layer_1(x))
        out = F.relu(self.layer_2(out))
        out = F.relu(self.layer_3(out))

        out = self.res_blocks(out)

        out = F.relu(self.layer_4(out))
        out = F.relu(self.layer_5(out))
        out = torch.tanh(self.layer_6(out))  # torch.tanh replaces the deprecated F.tanh

        return out

Explanation

  • The architecture consists of three convolutional layers for the encoder and three transpose convolutional layers for the decoder, connected by a series of residual blocks (in our case, 6).
  • All convolutional layers are followed by batch normalization.
  • All transpose convolutional layers, except the last one, are followed by batch normalization.
  • The ReLU activation function is used for the hidden units, except for the last layer, which uses a tanh activation function based on our discussion in the previous article (tips for training DCGANs).
  • The number of feature maps after each convolution in the encoder and decoder is based on the parameter conv_dim.

Both G_xtoy and G_ytox have the same architecture, so we only need to define one class, and later instantiate two Generators.
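
A small helper that builds all four networks and moves them to the GPU when one is available might look like this; it is a sketch, not necessarily the exact code from my repo, and the name create_model is my own choice:

def create_model(conv_dim=64, n_res_blocks=6):
    """Builds the two generators and two discriminators of the CycleGAN."""
    G_XtoY = CycleGenerator(conv_dim=conv_dim, n_res_blocks=n_res_blocks)
    G_YtoX = CycleGenerator(conv_dim=conv_dim, n_res_blocks=n_res_blocks)
    D_X = Discriminator(conv_dim=conv_dim)
    D_Y = Discriminator(conv_dim=conv_dim)

    # move everything to the GPU, if one is available
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    for model in (G_XtoY, G_YtoX, D_X, D_Y):
        model.to(device)

    return G_XtoY, G_YtoX, D_X, D_Y

G_XtoY, G_YtoX, D_X, D_Y = create_model()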

Training Process

The training process comprises defining the loss functions, selecting the optimizers, and finally training the model.

Discriminator And Generator Loss

We've seen that regular GANs treat the discriminator as a classifier with a sigmoid cross-entropy loss function. However, this loss function may lead to the vanishing gradient problem during learning. To overcome this, we'll use a least squares loss function for the discriminator. This structure is often referred to as a Least Squares GAN (LSGAN); you can read more about it in the original LSGAN paper.

Discriminator Loss

The discriminator losses will be the mean squared errors between the output of the discriminator, given an image, and the target value, 0 or 1, depending on whether it should classify that image as fake or real. For example, for a real image x, we can train D_x by looking at how close it comes to recognizing the image x as real, using the mean squared error:

out = D_x(x)

real_error = torch.mean((out - 1)**2) (in PyTorch)

Generator Loss

Here, we will generate fake images that look like they belong to domain X but are based on images from domain Y, and vice versa. We then compute the real loss on those generated images by looking at the output of the discriminator as it is applied to these fake images.

In addition to the adversarial loss, the generator loss term will include a cycle consistency loss. This loss measures how good a reconstructed image is compared to the original image. For example, if we have a generated image x^ produced from a real image y, we can reconstruct y^ from x^ with the help of G_xtoy (G_xtoy(x^) = y^). The cycle consistency loss is then the absolute difference between the original image y and the reconstructed image y^.

Cycle consistency loss. Image from https://ssnl.github.io/better_cycles/report.pdf

Here is the code snippet for defining losses.

def real_mse_loss(D_out):
    # how close is the produced output from being "real"?
    return torch.mean((D_out - 1)**2)


def fake_mse_loss(D_out):
    # how close is the produced output from being "fake"?
    return torch.mean(D_out**2)


def cycle_consistency_loss(real_im, reconstructed_im, lambda_weight):
    # calculate the reconstruction loss
    # return the weighted loss
    loss = torch.mean(torch.abs(real_im - reconstructed_im))
    return loss * lambda_weight

In the cycle consistency loss, the lambda term is a weight parameter that scales the mean absolute error over a batch. It's recommended that you look at the original CycleGAN paper to get a starting value for lambda_weight.
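
A quick way to convince yourself that these losses behave as intended is to feed them some dummy tensors. This is illustrative only; the tensor shapes are arbitrary and lambda_weight=10 matches the value used in the training loop below:

fake_logits = torch.zeros(4, 1, 7, 7)  # a discriminator that is certain its input is fake
real_logits = torch.ones(4, 1, 7, 7)   # a discriminator that is certain its input is real

print(real_mse_loss(real_logits))  # tensor(0.), no penalty when real images are called real
print(real_mse_loss(fake_logits))  # tensor(1.), full penalty otherwise

img = torch.rand(4, 3, 128, 128) * 2 - 1
print(cycle_consistency_loss(img, img, lambda_weight=10))  # tensor(0.) for a perfect reconstruction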

Optimizer

For CycleGAN we define three optimizers: one for both generators (G_xtoy and G_ytox), one for D_x, and one for D_y. We use Adam for all of them. All hyperparameter values are chosen from the original CycleGAN paper.

import torch.optim as optim

# hyperparams for Adam optimizers
lr = 0.0002
beta1 = 0.5
beta2 = 0.999

g_params = list(G_XtoY.parameters()) + list(G_YtoX.parameters())  # get generator parameters

# Create optimizers for the generators and discriminators
g_optimizer = optim.Adam(g_params, lr, [beta1, beta2])
d_x_optimizer = optim.Adam(D_X.parameters(), lr, [beta1, beta2])
d_y_optimizer = optim.Adam(D_Y.parameters(), lr, [beta1, beta2])

Training

When a CycleGAN trains and sees one batch of real images from sets X and Y, it trains by performing the following steps:

For Discriminator:

  • Compute the discriminator D_x loss on real images.
  • Generate fake images with G_ytox using images from set Y, then calculate the fake loss for D_x.
  • Compute the total loss, and perform backpropagation and optimization. Do the same for D_y, with the domains switched.

For Generator:

  • Generate fake images that look like domain X based on real images in domain Y, then compute the generator loss based on how D_x responds to the fake X images.
  • Generate reconstructed Y^ images based on the fake X images from step 1.
  • Compute the cycle consistency loss on the reconstructed and real Y images.
  • Repeat steps 1 to 3 with the domains swapped, add up all the generator losses, and perform backpropagation and optimization.

Here is the code snippet for doing so.
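
One note before the loop: it relies on a small scale helper that maps image tensors from the loader's [0, 1] range into the [-1, 1] range expected by the tanh output of the generators, and on a save_samples helper for writing out sample translations, which I omit here. A minimal sketch of what scale might look like (the exact helper is in my repo):

def scale(x, feature_range=(-1, 1)):
    """Rescales a tensor of images from the range [0, 1] to feature_range."""
    low, high = feature_range
    return x * (high - low) + low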

def training_loop(dataloader_X, dataloader_Y, test_dataloader_X, test_dataloader_Y,
                  n_epochs=1000):

    print_every = 10

    # keep track of losses over time
    losses = []

    test_iter_X = iter(test_dataloader_X)
    test_iter_Y = iter(test_dataloader_Y)

    # Get some fixed data from domains X and Y for sampling. These are images that are held
    # constant throughout training and allow us to inspect the model's performance.
    fixed_X = next(test_iter_X)[0]
    fixed_Y = next(test_iter_Y)[0]
    fixed_X = scale(fixed_X)  # make sure to scale to a range -1 to 1
    fixed_Y = scale(fixed_Y)

    # batches per epoch
    iter_X = iter(dataloader_X)
    iter_Y = iter(dataloader_Y)
    batches_per_epoch = min(len(iter_X), len(iter_Y))

    for epoch in range(1, n_epochs+1):

        # Reset iterators for each epoch
        if epoch % batches_per_epoch == 0:
            iter_X = iter(dataloader_X)
            iter_Y = iter(dataloader_Y)

        images_X, _ = next(iter_X)
        images_X = scale(images_X)  # make sure to scale to a range -1 to 1

        images_Y, _ = next(iter_Y)
        images_Y = scale(images_Y)

        # move images to GPU if available (otherwise stay on CPU)
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        images_X = images_X.to(device)
        images_Y = images_Y.to(device)


        # ============================================
        #         TRAIN THE DISCRIMINATORS
        # ============================================

        ## First: D_X, real and fake loss components ##

        # 1. Compute the discriminator loss on real images
        d_x_optimizer.zero_grad()
        real_D_loss = real_mse_loss(D_X(images_X))
        # 2. Compute the fake loss for D_X
        fake_D_loss = fake_mse_loss(D_X(G_YtoX(images_Y)))
        # 3. Compute the total loss and perform backprop
        d_x_loss = real_D_loss + fake_D_loss
        d_x_loss.backward()
        d_x_optimizer.step()

        ## Second: D_Y, real and fake loss components ##
        d_y_optimizer.zero_grad()
        real_D_y_loss = real_mse_loss(D_Y(images_Y))

        fake_D_y_loss = fake_mse_loss(D_Y(G_XtoY(images_X)))

        d_y_loss = real_D_y_loss + fake_D_y_loss
        d_y_loss.backward()
        d_y_optimizer.step()


        # =========================================
        #          TRAIN THE GENERATORS
        # =========================================

        ## First: generate fake X images and reconstructed Y images ##
        g_optimizer.zero_grad()
        # 1. Generate fake images that look like domain X based on real images in domain Y
        out_1 = G_YtoX(images_Y)
        # 2. Compute the generator loss based on domain X
        loss_1 = real_mse_loss(D_X(out_1))
        # 3. Create a reconstructed y
        out_2 = G_XtoY(out_1)
        # 4. Compute the cycle consistency loss (the reconstruction loss)
        loss_2 = cycle_consistency_loss(real_im=images_Y, reconstructed_im=out_2, lambda_weight=10)

        ## Second: generate fake Y images and reconstructed X images ##
        out_3 = G_XtoY(images_X)
        loss_3 = real_mse_loss(D_Y(out_3))
        out_4 = G_YtoX(out_3)
        loss_4 = cycle_consistency_loss(real_im=images_X, reconstructed_im=out_4, lambda_weight=10)

        # 5. Add up all generator and reconstruction losses and perform backprop
        g_total_loss = loss_1 + loss_2 + loss_3 + loss_4
        g_total_loss.backward()
        g_optimizer.step()

        # Print the log info
        if epoch % print_every == 0:
            # append real and fake discriminator losses and the generator loss
            losses.append((d_x_loss.item(), d_y_loss.item(), g_total_loss.item()))
            print('Epoch [{:5d}/{:5d}] | d_X_loss: {:6.4f} | d_Y_loss: {:6.4f} | g_total_loss: {:6.4f}'.format(
                epoch, n_epochs, d_x_loss.item(), d_y_loss.item(), g_total_loss.item()))


        sample_every = 100
        # Save the generated samples
        if epoch % sample_every == 0:
            G_YtoX.eval()  # set generators to eval mode for sample generation
            G_XtoY.eval()
            save_samples(epoch, fixed_Y, fixed_X, G_YtoX, G_XtoY, batch_size=16)
            G_YtoX.train()
            G_XtoY.train()

        # uncomment these lines, if you want to save your model
        # checkpoint_every = 1000
        # # Save the model parameters
        # if epoch % checkpoint_every == 0:
        #     checkpoint(epoch, G_XtoY, G_YtoX, D_X, D_Y)

    return losses

The training was performed over 5000 epochs using a GPU, which is why I had to move my models and inputs from the CPU to the GPU.

Results

  • The following is a plot of the training losses for the generators and discriminators, recorded during training.

We can observe that the generators start with a very high error, but over time they start to produce decent image translations, which helps bring the error down.

Both discriminator errors show very little fluctuation. But by the end of 5000 epochs, we can see that both have decreased, forcing the generators to produce more realistic image translations.

  • Visualizing samples.

After 100 iterations —

Translation from X to Y after 100 iterations
Translation from Y to X after 100 iterations

After 5000 iterations —

Translation from X to Y after 5000 iterations
Translation from Y to X after 5000 iterations

We can observe that CycleGAN models produce low-resolution images; high-resolution translation is an ongoing area of research, and you can read more about a high-resolution formulation that uses multiple generators by clicking here.

This model also struggles to match colors exactly. This is because, if G_xtoy and G_ytox change the tint of an image, the cycle consistency loss may not be affected and can still be small. You could choose to introduce a new, color-based loss term that compares G_ytox(y) with y, and G_xtoy(x) with x, but then this becomes a supervised learning approach. That being said, CycleGAN was able to produce satisfactory translations.
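
If you did want to experiment with such a term, here is a hedged sketch of what it could look like; the function name color_loss and the lambda_color weight are my own illustrative choices, not values from my experiments:

def color_loss(input_im, translated_im, lambda_color=5):
    """Penalizes a generator for drifting in overall color/tint by comparing a
    translated image to the image it was produced from, e.g. G_YtoX(images_Y) vs images_Y.
    """
    return lambda_color * torch.mean(torch.abs(input_im - translated_im))

# this would be added to g_total_loss alongside the adversarial and cycle-consistency terms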

If you want to stay connected, you can find me on LinkedIn.

References

Check out my GitHub repo for the full code behind this post.
