Colorizing black & white images with U-Net and conditional GAN — A Tutorial

Moein Shariatnia
Towards Data Science
18 min read · Nov 18, 2020


Left: Input black & white images from the test set | Right: the colorized outputs of the final model in this tutorial | Image by author

One of the most exciting applications of deep learning is colorizing black and white images. A few years ago this task needed a lot of human input and hand-coding, but now the whole process can be done end-to-end with the power of AI and deep learning. You might think you need huge amounts of data or long training times to train a model from scratch for this task, but over the last few weeks I worked on this, tried many different model architectures, loss functions, training strategies, and so on, and finally developed an efficient way to train such a model, using the latest advances in deep learning, on a rather small dataset and with really short training times. In this article, I'm going to explain what I did to make this happen, including the code, and which strategies helped and which did not. Before that, I will explain the colorization problem and give you a short review of what has been done in recent years.

My whole project on image colorization is now available as a Jupyter Notebook on my GitHub repository. You can also open it directly in Google Colab and run the code to better understand it and to colorize your own images! I've also provided the weights of the final model, which you can download via the notebook. For the rest of the article, I'll assume you have basic knowledge of deep learning, GANs, and the PyTorch library. Let's get started!

Introduction to colorization problem

About a month ago, I didn't know much about the problem of image colorization, so I started to study deep learning papers related to this task. At the beginning it seemed really difficult, but by doing a lot of Google searches, asking people on different forums, and studying many more papers on the problem, I gradually felt more comfortable with the colorization problem and the different solutions for it. Here I'm going to give you the basic knowledge you may need to understand what the models in the following code are doing.

RGB vs L*a*b

As you might know, when we load an image, we get a rank-3 (height, width, color) array with the last axis containing the color data of the image. These data represent color in RGB color space: there are 3 numbers for each pixel indicating how much red, green, and blue that pixel contains. In the following image you can see that the left part of the "main image" (the leftmost image) is blue, so that part has higher values in the blue channel of the image and has turned dark there.

Red, Green, and Blue channels of an image | Image by author (the leftmost image is by Lucas Benjamin from Unsplash)

In L*a*b color space, we have again three numbers for each pixel but these numbers have different meanings. The first number (channel), L, encodes the Lightness of each pixel and when we visualize this channel (the second image in the row below) it appears as a black and white image. The *a and *b channels encode how much green-red and yellow-blue each pixel is, respectively. In the following image you can see each channel of L*a*b color space separately.

Lightness, *a, and *b channels of Lab color space for an image | Image by author

In all the papers I studied and all the code I checked out on GitHub about colorization, people use the L*a*b color space instead of RGB to train their models. There are a couple of reasons for this choice, but I'll give you an intuition of why we make it. To train a model for colorization, we should give it a grayscale image and hope that it will make it colorful. When using L*a*b, we can give the L channel to the model (which is the grayscale image) and ask it to predict the other two channels (*a, *b); after its prediction, we concatenate all the channels and get our colorful image. But if you use RGB, you have to first convert your image to grayscale, feed the grayscale image to the model, and hope it will predict 3 numbers for you, which is a far more difficult and unstable task due to the many more possible combinations of 3 numbers compared to two. If we assume we have 256 choices for each number (in an 8-bit unsigned integer image this is the actual number of choices), predicting the three numbers for each pixel means choosing among 256³ combinations, which is more than 16 million choices, while predicting two numbers means choosing among only 256² ≈ 65,000 choices. (Of course, we are not going to wildly choose these numbers as in a classification task; I just wrote them to give you an intuition.)
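To make this concrete, here is a tiny sketch of loading an RGB image, converting it to L*a*b, and splitting it into the model's input (L) and targets (*a, *b). I'm assuming scikit-image for the color conversion, and the file path is just a placeholder:

```python
import numpy as np
from PIL import Image
from skimage.color import rgb2lab

# "photo.jpg" is a placeholder path
img = np.array(Image.open("photo.jpg").convert("RGB"))

lab = rgb2lab(img).astype("float32")  # shape (H, W, 3)
L = lab[..., 0]    # Lightness channel, roughly in [0, 100] -> model input
ab = lab[..., 1:]  # *a and *b channels, roughly in [-110, 110] -> model targets

print(L.shape, ab.shape)  # (H, W) and (H, W, 2)
```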

How to solve the problem

During the last few years, many different solutions have been proposed for colorizing images with deep learning. The Colorful Image Colorization paper approached the problem as a classification task and also accounted for the inherent uncertainty of the problem (e.g. a car in an image can take on many different, equally valid colors, so we cannot be sure about any single color for it); another paper approached the problem as a regression task (with some more tweaks!). There are pros and cons to each approach, but in this article we are going to use a different strategy.

The strategy we are going to use

The Image-to-Image Translation with Conditional Adversarial Networks paper, which you may know by the name pix2pix, proposed a general solution to many image-to-image tasks in deep learning, one of which was colorization. In this approach two losses are used: an L1 loss, which makes it a regression task, and an adversarial (GAN) loss, which helps to solve the problem in an unsupervised manner (by assigning the outputs a number indicating how "real" they look!).

In this article, I will first implement what the authors did in the paper and then introduce a whole new generator model and some tweaks to the training strategy which significantly reduce the size of the dataset needed while still producing amazing results. So stay tuned :)

A deeper dive into the GAN world

As mentioned earlier, we are going to build a GAN (a conditional GAN to be specific) and use an extra loss function, L1 loss. Let’s start with the GAN.

As you might know, in a GAN we have a generator and a discriminator model that learn to solve a problem together. In our setting, the generator takes a grayscale image (a 1-channel image) and produces a 2-channel image, one channel for *a and another for *b. The discriminator takes these two produced channels, concatenates them with the input grayscale image, and decides whether this new 3-channel image is fake or real. Of course, the discriminator also needs to see some real images (again 3-channel images in Lab color space) that are not produced by the generator and learn that they are real.

So what about the "condition" we mentioned? Well, that grayscale image which both the generator and discriminator see is the condition that we provide to both models in our GAN and expect them to take into consideration.

Let’s take a look at the math. Consider x as the grayscale image, z as the input noise for the generator, and y as the 2-channel output we want from the generator (it can also represent the 2 color channels of a real image). Also, G is the generator model and D is the discriminator. Then the loss for our conditional GAN will be:

conditional GAN loss function | Image from this paper
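Written out, this objective is:

```latex
\mathcal{L}_{cGAN}(G, D) =
  \mathbb{E}_{x,y}\big[\log D(x, y)\big] +
  \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]
```

where the generator G tries to minimize this loss while the discriminator D tries to maximize it.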

Notice that x is given to both models, which is the condition we introduce to both players of this game. Actually, we are not going to feed an n-dimensional vector of random noise to the generator as you might expect; instead, the noise is introduced in the form of dropout layers in the generator architecture (there is something cool about this which you will read about in the last section of the article).

Loss function we optimize

The earlier loss function helps produce good-looking, colorful images that seem real, but to further help the models and introduce some supervision into our task, we combine it with the L1 loss (which you might know as mean absolute error) between the predicted colors and the actual colors:

L1 loss | Image from this paper
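Written out:

```latex
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_{1}\big]
```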

If we use the L1 loss alone, the model still learns to colorize images, but it will be conservative and most of the time will use colors like "gray" or "brown": when it is unsure which color is best, it takes the average and falls back on these colors to reduce the L1 loss as much as possible (similar to the blurring effect of L1 or L2 loss in the super resolution task). Also, the L1 loss is preferred over the L2 loss (mean squared error) because it reduces this effect of producing gray-ish images. So, our combined loss function will be:

combined loss function we’ll optimize | Image from this paper
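In other words, the generator we end up with is:

```latex
G^{*} = \arg\min_{G}\max_{D}\;\mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G)
```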

where λ is a coefficient to balance the contribution of the two losses to the final loss (of course the discriminator loss does not involve the L1 loss).

Okay, I think that's enough theory! Let's get our hands dirty with the code! In the following section, I first present the code that implements the paper, and in the section after that I introduce a better strategy to get really amazing results within one or two hours of training and without needing huge amounts of data!

As a reminder, the code for both sections is provided as a Jupyter notebook on my GitHub repo, which you can also open directly in Colab by clicking here.

I highly recommend that you follow this article with the code provided in the notebook on my GitHub or Colab to fully understand what every line of code is doing. Check the shapes of the input and output tensors and investigate every function or class to better understand what is happening. I try my best to explain the most important parts here, but obviously I cannot explain every line of code without the article getting too long, so make sure to do this yourself.

1 — Implementing the paper — Our Baseline

1.1- Loading Image Paths

Loading image file names and dividing them into training and validation sets

The paper uses the whole ImageNet dataset (1.3 million images!) but here I'm using only 8,000 images from the COCO dataset for training, which I had available on my device. So our training set size is 0.6% of what was used in the paper! The dataset is provided in the notebook on Colab.

You can use almost any dataset for this task as long as it contains many different scenes and locations that you hope the model will learn to colorize. You can use ImageNet, for example, but you will only need about 8,000 of its images for this project.
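For reference, a minimal sketch of this step could look like the following. The folder path is a placeholder, and I'm assuming a pool of 10,000 images split into 8,000 for training and 2,000 for validation; the notebook's version may differ in details:

```python
import glob
import numpy as np

coco_path = "coco_sample"  # placeholder: folder containing your .jpg images
paths = glob.glob(coco_path + "/*.jpg")

np.random.seed(123)  # make the split reproducible
chosen = np.random.choice(paths, 10_000, replace=False)
rand_idxs = np.random.permutation(10_000)
train_paths = [chosen[i] for i in rand_idxs[:8000]]  # 8,000 images for training
val_paths = [chosen[i] for i in rand_idxs[8000:]]    # 2,000 images for validation
print(len(train_paths), len(val_paths))
```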

1.2- Making Datasets and DataLoaders

Dataset and DataLoader

I hope the code is self-explanatory. I resize the images and flip them horizontally (flipping only for the training set), then read each RGB image, convert it to Lab color space, and separate the first (grayscale) channel and the two color channels as the inputs and targets for the models, respectively. Then I make the data loaders.
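Here is a sketch of such a Dataset and the corresponding DataLoaders. The scaling constants are a design choice I'm assuming here: L is divided by 50 and shifted so it lands in about [-1, 1], and *a/*b are divided by 110 for the same reason; the batch size of 16 is just a reasonable default:

```python
import numpy as np
from PIL import Image
from skimage.color import rgb2lab
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

SIZE = 256

class ColorizationDataset(Dataset):
    def __init__(self, paths, split='train'):
        if split == 'train':
            self.transforms = transforms.Compose([
                transforms.Resize((SIZE, SIZE)),
                transforms.RandomHorizontalFlip(),  # flip only for the training set
            ])
        else:
            self.transforms = transforms.Resize((SIZE, SIZE))
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        img = np.array(self.transforms(img))
        lab = rgb2lab(img).astype("float32")  # (H, W, 3) in Lab space
        lab = transforms.ToTensor()(lab)      # -> (3, H, W) tensor
        L = lab[[0], ...] / 50. - 1.          # L channel, scaled to ~[-1, 1]
        ab = lab[[1, 2], ...] / 110.          # *a, *b channels, scaled to ~[-1, 1]
        return {'L': L, 'ab': ab}

train_dl = DataLoader(ColorizationDataset(train_paths, 'train'), batch_size=16, shuffle=True)
val_dl = DataLoader(ColorizationDataset(val_paths, 'val'), batch_size=16)
```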

1.3- Generator proposed by the paper

U-Net architecture for the generator model

This one is a little complicated and needs some explanation. This code implements a U-Net to be used as the generator of our GAN. The details of the code are beyond the scope of this article, but the important thing to understand is that it builds the U-Net starting from the middle module (at the bottom of the U shape) and, at every iteration, adds a down-sampling module to its left and an up-sampling module to its right, until it reaches the input and output modules. Look at the following image, which I made from one of the figures in the paper, to give you a better sense of what is happening in the code:

how the U-Net is built | Image from this paper with some modifications

The blue rectangles show the order in which the related modules are built by the code. The U-Net we will build has more layers than are depicted in this image, but it suffices to give you the idea. Also notice in the code that we go 8 layers down, so if we start with a 256 by 256 image, in the middle of the U-Net we get a 1 by 1 (256 / 2⁸) feature map, which then gets up-sampled back to a 256 by 256 image (with two channels). This code snippet is really exciting and I highly recommend playing with it to fully grasp what every line is doing.
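Here is a sketch of this recursive construction, following the pattern of the pix2pix U-Net generator; the exact filter counts and dropout placement are the usual pix2pix defaults I'm assuming here:

```python
import torch
from torch import nn

class UnetBlock(nn.Module):
    """One 'shell' of the U-Net: a down-sampling step, an inner submodule, and an up-sampling step."""
    def __init__(self, nf, ni, submodule=None, input_c=None, dropout=False,
                 innermost=False, outermost=False):
        super().__init__()
        self.outermost = outermost
        if input_c is None:
            input_c = nf
        downconv = nn.Conv2d(input_c, ni, kernel_size=4, stride=2, padding=1, bias=False)
        downrelu = nn.LeakyReLU(0.2, True)
        downnorm = nn.BatchNorm2d(ni)
        uprelu = nn.ReLU(True)
        upnorm = nn.BatchNorm2d(nf)

        if outermost:
            upconv = nn.ConvTranspose2d(ni * 2, nf, kernel_size=4, stride=2, padding=1)
            model = [downconv] + [submodule] + [uprelu, upconv, nn.Tanh()]
        elif innermost:
            upconv = nn.ConvTranspose2d(ni, nf, kernel_size=4, stride=2, padding=1, bias=False)
            model = [downrelu, downconv] + [uprelu, upconv, upnorm]
        else:
            upconv = nn.ConvTranspose2d(ni * 2, nf, kernel_size=4, stride=2, padding=1, bias=False)
            up = [uprelu, upconv, upnorm]
            if dropout:
                up += [nn.Dropout(0.5)]  # the paper's source of "noise"
            model = [downrelu, downconv, downnorm] + [submodule] + up
        self.model = nn.Sequential(*model)

    def forward(self, x):
        if self.outermost:
            return self.model(x)
        return torch.cat([x, self.model(x)], 1)  # skip connection

class Unet(nn.Module):
    def __init__(self, input_c=1, output_c=2, n_down=8, num_filters=64):
        super().__init__()
        # start from the innermost block (bottom of the U) and wrap outward
        block = UnetBlock(num_filters * 8, num_filters * 8, innermost=True)
        for _ in range(n_down - 5):
            block = UnetBlock(num_filters * 8, num_filters * 8, submodule=block, dropout=True)
        out_filters = num_filters * 8
        for _ in range(3):
            block = UnetBlock(out_filters // 2, out_filters, submodule=block)
            out_filters //= 2
        self.model = UnetBlock(output_c, out_filters, input_c=input_c,
                               submodule=block, outermost=True)

    def forward(self, x):
        return self.model(x)

# Sanity check: a 1-channel 256x256 input gives a 2-channel 256x256 output
print(Unet()(torch.randn(1, 1, 256, 256)).shape)  # torch.Size([1, 2, 256, 256])
```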

1.4- Discriminator

Patch Discriminator architecture

The architecture of our discriminator is rather straightforward. This code builds a model by stacking Conv-BatchNorm-LeakyReLU blocks to decide whether the input image is fake or real. Notice that the first and last blocks do not use normalization and the last block has no activation function (it is embedded in the loss function we will use). Let's take a look at its blocks:

Discriminator architecture

And the shape of its output:

The output shape of the discriminator

We are using a "patch" discriminator here. Okay, what is that?! A vanilla discriminator outputs a single number (a scalar) representing how real (or fake) the model thinks the whole input image is. A patch discriminator instead outputs one number for every patch of, say, 70 by 70 pixels of the input image and decides for each of them separately whether it is fake or not. Using such a model for colorization seems reasonable to me because the local changes the model needs to make are really important, and deciding on the whole image, as a vanilla discriminator does, may not capture the subtleties of this task. Here, the model's output shape is 30 by 30, but that does not mean our patches are 30 by 30: the actual patch size is obtained by computing the receptive field of each of these 900 (30 times 30) output numbers, which in our case is 70 by 70.
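A sketch of such a patch discriminator; the filter counts follow the usual pix2pix 70×70 PatchGAN configuration, which is what I'm assuming here:

```python
import torch
from torch import nn

class PatchDiscriminator(nn.Module):
    def __init__(self, input_c=3, num_filters=64, n_down=3):
        super().__init__()
        model = [self.get_layers(input_c, num_filters, norm=False)]  # no norm in the first block
        model += [self.get_layers(num_filters * 2 ** i, num_filters * 2 ** (i + 1),
                                  s=1 if i == (n_down - 1) else 2)
                  for i in range(n_down)]
        # last block: 1-channel output, no norm, no activation (it lives in the loss)
        model += [self.get_layers(num_filters * 2 ** n_down, 1, s=1, norm=False, act=False)]
        self.model = nn.Sequential(*model)

    def get_layers(self, ni, nf, k=4, s=2, p=1, norm=True, act=True):
        layers = [nn.Conv2d(ni, nf, k, s, p, bias=not norm)]
        if norm:
            layers += [nn.BatchNorm2d(nf)]
        if act:
            layers += [nn.LeakyReLU(0.2, True)]
        return nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)

# Sanity check: a batch of 3-channel Lab images -> a 30x30 grid of patch scores
disc = PatchDiscriminator()
print(disc(torch.randn(16, 3, 256, 256)).shape)  # torch.Size([16, 1, 30, 30])
```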

1.5- GAN Loss

GAN Loss

This is a handy class we can use to calculate the GAN loss of our final model. In __init__ we decide which kind of loss to use (it will be "vanilla" in our project) and register some constant tensors as the "real" and "fake" labels. Then, when we call this module, it makes an appropriate tensor full of zeros or ones (according to what we need at that stage) and computes the loss.
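A sketch of this class might look like the following; BCEWithLogitsLoss handles the "vanilla" mode since the discriminator's last block has no activation:

```python
import torch
from torch import nn

class GANLoss(nn.Module):
    def __init__(self, gan_mode='vanilla', real_label=1.0, fake_label=0.0):
        super().__init__()
        # constant target tensors, registered as buffers so they move with the module
        self.register_buffer('real_label', torch.tensor(real_label))
        self.register_buffer('fake_label', torch.tensor(fake_label))
        if gan_mode == 'vanilla':
            self.loss = nn.BCEWithLogitsLoss()  # sigmoid is folded into the loss
        elif gan_mode == 'lsgan':
            self.loss = nn.MSELoss()

    def get_labels(self, preds, target_is_real):
        labels = self.real_label if target_is_real else self.fake_label
        return labels.expand_as(preds)

    def __call__(self, preds, target_is_real):
        labels = self.get_labels(preds, target_is_real)
        return self.loss(preds, labels)
```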

1.6- Putting everything together

putting everything together

This class brings together all the previous parts and implements a few methods to take care of training our complete model. Let’s investigate it.

In __init__ we define our generator and discriminator using the functions and classes defined earlier, and we also initialize them with the init_model function, which I don't explain here, but you can refer to my GitHub repository to see how it works. Then we define our two loss functions and the optimizers of the generator and discriminator.

The whole work is done in the optimize method of this class. First, and only once per iteration (i.e., per batch of the training set), we call the module's forward method and store the outputs in the fake_color attribute of the class.

Then we first train the discriminator using the backward_D method: we feed the fake images produced by the generator to the discriminator (making sure to detach them from the generator's graph so that they act as constants to the discriminator, like normal images) and label them as fake. Then we feed a batch of real images from the training set to the discriminator and label them as real. We add up the two losses for fake and real, take the average, and then call backward on the final loss.

Now we can train the generator. In the backward_G method we feed the fake images to the discriminator and try to fool it by assigning real labels to them and calculating the adversarial loss. As I mentioned earlier, we use the L1 loss as well: we compute the distance between the predicted two channels and the target two channels and multiply this loss by a coefficient (100 in our case) to balance the two losses, then add it to the adversarial loss. Finally, we call the backward method of the loss.
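Putting the pieces above together, a condensed sketch of such a class could look like this. It reuses the Unet, PatchDiscriminator, and GANLoss sketches from earlier; the learning rates and betas are the usual pix2pix defaults I'm assuming, and the init_model weight initialization mentioned above is omitted:

```python
import torch
from torch import nn, optim

class MainModel(nn.Module):
    def __init__(self, net_G=None, lr_G=2e-4, lr_D=2e-4, beta1=0.5, beta2=0.999, lambda_L1=100.):
        super().__init__()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.lambda_L1 = lambda_L1
        self.net_G = (net_G if net_G is not None else Unet(input_c=1, output_c=2)).to(self.device)
        self.net_D = PatchDiscriminator(input_c=3).to(self.device)
        self.GANcriterion = GANLoss(gan_mode='vanilla').to(self.device)
        self.L1criterion = nn.L1Loss()
        self.opt_G = optim.Adam(self.net_G.parameters(), lr=lr_G, betas=(beta1, beta2))
        self.opt_D = optim.Adam(self.net_D.parameters(), lr=lr_D, betas=(beta1, beta2))

    def set_requires_grad(self, model, requires_grad=True):
        for p in model.parameters():
            p.requires_grad = requires_grad

    def setup_input(self, data):
        self.L = data['L'].to(self.device)
        self.ab = data['ab'].to(self.device)

    def forward(self):
        self.fake_color = self.net_G(self.L)

    def backward_D(self):
        fake_image = torch.cat([self.L, self.fake_color], dim=1)
        fake_preds = self.net_D(fake_image.detach())  # detach: don't backprop into G here
        self.loss_D_fake = self.GANcriterion(fake_preds, False)
        real_image = torch.cat([self.L, self.ab], dim=1)
        real_preds = self.net_D(real_image)
        self.loss_D_real = self.GANcriterion(real_preds, True)
        self.loss_D = (self.loss_D_fake + self.loss_D_real) * 0.5
        self.loss_D.backward()

    def backward_G(self):
        fake_image = torch.cat([self.L, self.fake_color], dim=1)
        fake_preds = self.net_D(fake_image)
        self.loss_G_GAN = self.GANcriterion(fake_preds, True)  # try to fool the discriminator
        self.loss_G_L1 = self.L1criterion(self.fake_color, self.ab) * self.lambda_L1
        self.loss_G = self.loss_G_GAN + self.loss_G_L1
        self.loss_G.backward()

    def optimize(self):
        self.forward()
        # discriminator step
        self.set_requires_grad(self.net_D, True)
        self.opt_D.zero_grad()
        self.backward_D()
        self.opt_D.step()
        # generator step
        self.set_requires_grad(self.net_D, False)
        self.opt_G.zero_grad()
        self.backward_G()
        self.opt_G.step()
```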

Okay great! We just covered almost all of the training procedure. The training function is now a trivial one:

1.7- Training function

training function

I hope this code is self-explanatory. There are some simple helper functions used in this code which you can check out in my GitHub repo. Every epoch takes about 3 to 4 minutes on Colab. After about 20 epochs you should see some reasonable results.
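A bare-bones version of that training function, without the loss meters and visualization helpers used in the notebook, might look like this:

```python
from tqdm import tqdm

def train_model(model, train_dl, epochs):
    for e in range(epochs):
        for data in tqdm(train_dl):
            model.setup_input(data)  # move L and ab to the right device
            model.optimize()         # one discriminator step + one generator step
        print(f"Epoch {e + 1}/{epochs} | "
              f"loss_D: {model.loss_D.item():.4f} | loss_G: {model.loss_G.item():.4f}")

model = MainModel()
train_model(model, train_dl, 100)
```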

Okay. I let the model train a while longer (about 100 epochs). Here are the results of our baseline model:

Baseline model's output | Image by author

As you can see, although this baseline model has a basic understanding of the most common objects in images like sky, trees, etc., its output is far from appealing and it cannot decide on the colors of rarer objects. It also displays some color spillover and circle-shaped blobs of color (see the center of the first image in the second row), which is not good at all. So it seems that with this small dataset we cannot get good results with this strategy. Therefore, we change our strategy!

2- A new strategy — the final model

Here is the focus of this article, where I explain what I did to overcome the problem just mentioned. Inspired by an idea from the super resolution literature, I decided to pretrain the generator separately in a supervised and deterministic manner to avoid the problem of "the blind leading the blind" in the GAN game, where neither the generator nor the discriminator knows anything about the task at the beginning of training.

Actually, I use pretraining in two stages: (1) the backbone of the generator (the down-sampling path) is a model pretrained for classification on ImageNet; (2) the whole generator is then pretrained on the colorization task itself with L1 loss.

In practice, I'm going to use a pretrained ResNet18 as the backbone of my U-Net, and to accomplish the second stage of pretraining, we train this U-Net on our training set with only the L1 loss. Then we move on to the combined adversarial and L1 loss, as we did in the previous section.

2.1- Using a new generator

Building a U-Net with a ResNet backbone is not trivial, so I'll use the fastai library's DynamicUnet module to easily build one. You can simply install fastai with pip or conda. Here's the link to the documentation.

Building a U-Net with ResNet18 backbone

That's it! With just these few lines of code you can easily build such a complex model. The create_body function loads the pretrained weights of the ResNet18 architecture and cuts the model to remove the last two layers (global average pooling and the linear layer for the ImageNet classification task). Then DynamicUnet uses this backbone to build a U-Net with the required number of output channels (2 in our case) and an input size of 256.
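Roughly, the construction looks like this; note that the exact create_body signature has changed across fastai versions, and some versions expect an already instantiated model (e.g. resnet18()) instead of the architecture function:

```python
import torch
from fastai.vision.learner import create_body
from fastai.vision.models.unet import DynamicUnet
from torchvision.models.resnet import resnet18

def build_res_unet(n_input=1, n_output=2, size=256):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # keep ResNet18's convolutional body (cut off the pooling + classification head)
    # and adapt its first conv to a single grayscale input channel
    body = create_body(resnet18, pretrained=True, n_in=n_input, cut=-2)
    return DynamicUnet(body, n_output, (size, size)).to(device)

net_G = build_res_unet(n_input=1, n_output=2, size=256)
```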

2.2 Pretraining the generator for colorization task

pretraining the generator

With this simple function, we pretrain the generator for 20 epochs and then save its weights. Again, every epoch takes about 3 to 4 minutes, so the whole pretraining is done in about an hour. In the following section, we will use this model as the generator of our GAN and train the whole network as before.
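A sketch of this pretraining loop, assuming the net_G built above and the train_dl from section 1.2 (the learning rate and the checkpoint file name are just the values I'd pick here):

```python
import torch
from torch import nn, optim
from tqdm import tqdm

def pretrain_generator(net_G, train_dl, opt, criterion, epochs):
    device = next(net_G.parameters()).device
    for e in range(epochs):
        running_loss, n = 0.0, 0
        for data in tqdm(train_dl):
            L, ab = data['L'].to(device), data['ab'].to(device)
            preds = net_G(L)
            loss = criterion(preds, ab)  # plain L1 loss, no adversarial term yet
            opt.zero_grad()
            loss.backward()
            opt.step()
            running_loss += loss.item() * L.size(0)
            n += L.size(0)
        print(f"Epoch {e + 1}/{epochs} | L1 loss: {running_loss / n:.5f}")

opt = optim.Adam(net_G.parameters(), lr=1e-4)
pretrain_generator(net_G, train_dl, opt, nn.L1Loss(), 20)
torch.save(net_G.state_dict(), "res18-unet.pt")  # save the pretrained weights
```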

2.3 Putting everything together, again!

training the whole model using the pretrained generator

Here I first load the saved weights for the generator and then use this model as the generator in our MainModel class, which prevents it from randomly initializing the generator. Then we train the model for 10 to 20 epochs (compare that to the 100 epochs of the previous section, where we didn't use pretraining). Every epoch takes about 3 to 4 minutes on Colab, which is really great!
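Sketching that step with the helpers defined above (the checkpoint name matches the one saved in the pretraining sketch):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net_G = build_res_unet(n_input=1, n_output=2, size=256)
net_G.load_state_dict(torch.load("res18-unet.pt", map_location=device))
model = MainModel(net_G=net_G)     # passing a generator skips the random initialization
train_model(model, train_dl, 20)   # 10 to 20 epochs is enough after pretraining
```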

I've provided the weights of the final model I trained in the notebook (they are downloaded from my Google Drive). You can check it out on my GitHub or directly in Colab.

2.4 The fun part! Looking at the final results

Here I show the results of this final model on the test set (black and white images it never saw during training), including the title image at the very beginning of this article:

output number 1 | Image by author
output number 2 | Image by author
output number 3 | Image by author

Just amazing! I personally did not expect this much improvement over the results of the last section; when I first saw these, I was really shocked and thought I had mistakenly visualized the actual colorful images instead of the model's predictions! It was one of those moments when I felt I was "creating" something, building something that actually works. That gives me a really great feeling.

An accidental finding: You can safely remove Dropout!

Remember that when I was explaining the theory of conditional GANs at the beginning of this article, I said that the source of noise in the generator architecture proposed by the authors of the paper was the dropout layers. However, when I investigated the U-Net we built with the help of fastai, I did not find any dropout layers in it! I had actually already trained the final model and gotten the results before I investigated the generator and found this out.

So, was the adversarial training useless? If there is no noise, how can the generator possibly have a creative effect on the output? Is it possible that the input grayscale image to the generator plays the role of the noise as well? These were my exact questions at the time.

Therefore, I decided to email Dr. Phillip Isola, the first author of the paper we implemented here, and he kindly answered these questions. According to him, this conditional GAN can still work without dropout, but the outputs will be more deterministic because of the lack of that noise; however, there is still enough information in the input grayscale image to enable the generator to produce compelling outputs.

And in practice I saw that the adversarial training was indeed helpful. In the next and last section, I'm going to compare the results of the pretrained U-Net without adversarial training against the final outputs we got with adversarial training.

Comparing the results of the pretrained U-Net with and without adversarial training

One of the cool things I found in my experiments was that the U-Net we built with the ResNet18 backbone is already quite good at colorizing images after pretraining with the L1 loss only (the step before the final adversarial training). But the model is still conservative and tends to use gray-ish colors when it is not sure what an object is or what color it should be. It does, however, perform really well on common scene elements like sky, trees, grass, etc.

Here I show you the outputs of the U-Net without adversarial training and the U-Net with adversarial training to better depict the significant difference that the adversarial training makes in our case:

Left: pretrained U-Net without adversarial training | Right: pretrained U-Net with adversarial training | Image by author

You can also see the GIF below to observe the difference between the images better:

better visualization of the significant difference that adversarial training makes | GIF by author

As you can see, although the pretrained U-Net is already good, it cannot correctly choose colors in many cases and instead fills those regions with brown/gray. For example, in the second column of the third row of the GIF above you can see that without adversarial training the U-Net is not able to colorize the man's jacket, and in the third column of the last row it fails to colorize the bus, while the model with adversarial training does a great job. This makes sense: those gray-ish outputs are far from real to the discriminator of our GAN, so it pushes the U-Net to make those colors more natural, which the U-Net hopefully learns to do.

Final words

This project was full of important lessons for me. I spent a lot of time over the last month implementing lots of different papers, each with a different strategy, and it was only after quite a while and A LOT of failures that I came up with this method of training. Now you can see how pretraining the generator significantly helped the model and improved the results.

I also learned that some observations, although they may at first feel like a bad mistake on your part, are worth paying attention to and investigating further, like the case of the missing dropout in this project. Thanks to the helpful deep learning and AI community, you can easily ask experts and get the answers you need, and become more confident in what you were just guessing.

I also want to thank the authors of this wonderful paper for their awesome work, and also for the paper's great GitHub repository, from which I borrowed some of the code (with modifications and simplifications). I truly love the computer science and AI community and all the hard work they do to improve the field and make their contributions available to everyone. I'm happy to be a tiny part of this community.

Don't forget to leave your questions, comments, suggestions, etc. below. I'd be happy to hear from you.

About me: I'm a medical student. I love deep learning and the cool things we are able to build with it to improve our quality of life. I spend a lot of time studying deep learning alongside my medical courses, and I really enjoy being in these two worlds at the same time :)
