The code for this project can be found at: https://github.com/evanhu1/pytorch-CelebA-faCeGAN
If you like this content, follow me on X: https://x.com/evanhuuu. I regularly write about AI, art, and founding a Y Combinator startup at 21.
GANs (Generative Adversarial Networks) are a subset of unsupervised learning models that use two networks and adversarial training to output "novel" data resembling the input data. More specifically, GANs typically involve "a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G" [1].

Conditional GANs are a modification of the original GAN model, proposed shortly afterward by Mehdi Mirza and Simon Osindero in the paper "Conditional Generative Adversarial Nets" (2014). In a cGAN (conditional GAN), the discriminator is given data/label pairs instead of just data, and the generator is given a label in addition to the noise vector, indicating which class the output should belong to. The labels force the generator to learn separate representations for the different training data classes, which makes it possible to explicitly control the generator's output. During training, the label is usually combined with the data sample for both the generator and the discriminator; the code to accomplish this is given below.
In this article, I’ll describe a PyTorch implementation of a conditional deep convolutional GAN (DCGAN) that uses English text as labels instead of single numbers. As a DCGAN, our model uses deep convolutional layers in its architecture rather than the fully connected layers of the original GAN paper. We train on the CelebA dataset of celebrity faces, with images cropped to 64×64. The architecture contains five convolutional/transpose-convolutional layers with Batch Normalization and Leaky ReLU activations, plus a sigmoid activation on the discriminator's output layer and a tanh activation on the generator's output layer. We use the Adam optimizer and Binary Cross Entropy loss.
Parameters and data-cleaning steps are guided by "conventional wisdom" on GAN training, gathered from sources such as GANHacks and the original paper. These include a learning rate of 0.0002 (beta1 = 0.5) for the Adam optimizer, strided convolutions instead of down/upsampling, custom weight initialization using a Gaussian distribution (mean 0.0, std 0.02), and scaling real images from [0, 1] to [-1, 1] to match the tanh output of the fake images.
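As a rough sketch of how these conventions translate to PyTorch (function names here are illustrative, not the exact code from the repo):

```python
import torch
import torch.nn as nn

# Custom weight initialization from the DCGAN paper: conv weights drawn
# from a Gaussian with mean 0.0 and std 0.02; batch-norm scales from a
# Gaussian around 1.0 with the same std.
def weights_init(m):
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm") != -1:
        nn.init.normal_(m.weight.data, 1.0, 0.02)
        nn.init.constant_(m.bias.data, 0.0)

def scale_images(x):
    """Scale real images from [0, 1] to [-1, 1] to match the generator's tanh output."""
    return x * 2.0 - 1.0

# Adam with the recommended hyperparameters, applied to both networks:
# opt = torch.optim.Adam(net.parameters(), lr=0.0002, betas=(0.5, 0.999))
```

Both networks call `net.apply(weights_init)` once after construction, so every conv and batch-norm layer starts from the recommended distribution.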

After settling on the design, we wrote our model in PyTorch.
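A minimal sketch of the conditional generator is below; the layer widths and class names are illustrative rather than the exact code from the repo, but the structure matches the design described in this article: the noise vector and the label each pass through their own transpose-convolutional layer before being concatenated, then four more strided transpose convolutions upsample to a 64×64 RGB image (the discriminator mirrors this with strided `Conv2d` layers and a sigmoid output).

```python
import torch
import torch.nn as nn

NZ, N_LABELS, NGF = 100, 2, 64  # noise dim, label count, base feature maps

class Generator(nn.Module):
    """Conditional DCGAN generator: noise and label are processed by
    separate input branches, concatenated along the channel axis, then
    upsampled through strided transpose convolutions to 64x64 RGB."""
    def __init__(self):
        super().__init__()
        # Separate input branches; both inputs are N x C x 1 x 1.
        self.noise_branch = nn.Sequential(
            nn.ConvTranspose2d(NZ, NGF * 4, 4, 1, 0, bias=False),
            nn.BatchNorm2d(NGF * 4),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.label_branch = nn.Sequential(
            nn.ConvTranspose2d(N_LABELS, NGF * 4, 4, 1, 0, bias=False),
            nn.BatchNorm2d(NGF * 4),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.main = nn.Sequential(
            nn.ConvTranspose2d(NGF * 8, NGF * 4, 4, 2, 1, bias=False),  # 4 -> 8
            nn.BatchNorm2d(NGF * 4),
            nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(NGF * 4, NGF * 2, 4, 2, 1, bias=False),  # 8 -> 16
            nn.BatchNorm2d(NGF * 2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(NGF * 2, NGF, 4, 2, 1, bias=False),      # 16 -> 32
            nn.BatchNorm2d(NGF),
            nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(NGF, 3, 4, 2, 1, bias=False),            # 32 -> 64
            nn.Tanh(),  # output in [-1, 1], matching the scaled real images
        )

    def forward(self, z, labels):
        # Concatenate the two branches along the channel dimension.
        h = torch.cat([self.noise_branch(z), self.label_branch(labels)], dim=1)
        return self.main(h)
```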
As for the dataset, CelebA comes with 40 binary attribute annotations per image, as you can see in the examples below.

Using these binary attributes as our labels, or conditions, we can train our cGAN to generate faces with specific features that we control. Furthermore, we can train on multiple attributes at once by simply concatenating them into multidimensional labels. Given enough attributes, we can generate a wide variety of faces with different human features. For example, our preliminary training runs showed that the model could learn the binary attributes [Male, Young], allowing us to generate 4 different combinations of face types (young male, "not young" male, young female, "not young" female).

In our code, we built a custom PyTorch Dataset with a parameter specifying which of the 40 binary attributes to include, since training on all 40 would be far too difficult.
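A Dataset along these lines can be sketched as follows; the class name, constructor parameters, and attribute names here are hypothetical stand-ins for the repo's actual API:

```python
import torch
from torch.utils.data import Dataset

class CelebAAttrDataset(Dataset):
    """Sketch of a custom Dataset that wraps pre-loaded image tensors and
    the full 40-column attribute matrix, keeping only the attribute
    columns selected by name."""

    def __init__(self, images, attrs, attr_names, selected_attrs):
        # images: (N, 3, 64, 64) float tensor scaled to [-1, 1]
        # attrs:  (N, 40) tensor of 0/1 attribute flags
        # attr_names: list of the 40 CelebA attribute names, in column order
        cols = [attr_names.index(name) for name in selected_attrs]
        self.images = images
        self.labels = attrs[:, cols].float()

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # Each sample is an (image, multi-hot label) pair.
        return self.images[idx], self.labels[idx]
```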
For the actual input to our models, we created multi-hot encoded tensors with one channel per selected attribute. Specifically, assuming we train on two binary attributes with 64×64 images and a batch size of 32, the generator receives a 32×2×1×1 multi-hot encoded tensor as the label for each training sample (batch size, number of labels, and two "fake" image dimensions for the deconvolution layer to act on). The discriminator, meanwhile, receives a 32×2×64×64 multi-hot encoded tensor, filled with zeros and ones based on the binary attributes of the training sample.
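Producing both shapes from one (batch, n_labels) multi-hot matrix takes only an unsqueeze and a broadcasted expand; the helper name below is illustrative:

```python
import torch

def make_label_tensors(labels, image_size=64):
    """Expand a (batch, n_labels) multi-hot matrix into the two shapes the
    networks expect: (batch, n_labels, 1, 1) for the generator and
    (batch, n_labels, image_size, image_size) for the discriminator."""
    g_labels = labels[:, :, None, None]                         # e.g. 32 x 2 x 1 x 1
    d_labels = g_labels.expand(-1, -1, image_size, image_size)  # e.g. 32 x 2 x 64 x 64
    return g_labels, d_labels
```

`expand` broadcasts without copying memory, so filling the 64×64 label planes with each attribute's 0 or 1 is essentially free.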
During training, these labels are concatenated with the data sample via torch.cat(), after each has been passed through its own convolutional/deconvolutional layer. The code that creates and reshapes the labels lives in the training loop, which is essentially a normal GAN training loop with the additional step of feeding in labels alongside the data samples.
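One iteration of that loop can be sketched as below. This is a generic cGAN training step, not the repo's exact code: it assumes `G` and `D` each take (input, label) and concatenate internally, as described above.

```python
import torch
import torch.nn as nn

def train_step(G, D, opt_g, opt_d, real_images, labels, nz=100):
    """One cGAN training step: identical to a standard GAN step except
    that every forward pass also receives the multi-hot labels."""
    criterion = nn.BCELoss()
    bs = real_images.size(0)
    g_labels = labels[:, :, None, None]
    d_labels = g_labels.expand(-1, -1, real_images.size(2), real_images.size(3))
    real_t, fake_t = torch.ones(bs, 1), torch.zeros(bs, 1)

    # --- Discriminator: real image/label pairs vs. fake pairs ---
    opt_d.zero_grad()
    loss_real = criterion(D(real_images, d_labels), real_t)
    z = torch.randn(bs, nz, 1, 1)
    fake = G(z, g_labels)
    loss_fake = criterion(D(fake.detach(), d_labels), fake_t)
    (loss_real + loss_fake).backward()
    opt_d.step()

    # --- Generator: fool D while conditioned on the same labels ---
    opt_g.zero_grad()
    loss_g = criterion(D(fake, d_labels), real_t)
    loss_g.backward()
    opt_g.step()
    return loss_g.item()
```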
On Google Colaboratory GPUs, training took around 30 minutes per epoch and typically produced interesting results by epoch 5. We experimented with steadily increasing the number of attributes used to describe the face in the training data to see how much the cGAN was capable of learning.
With 5 binary attributes per image:

10 attributes:
![['Bald', 'Black_Hair', 'Blond_Hair', 'Brown_Hair', 'Gray_Hair', 'Male', 'No_Beard', 'Receding_Hairline', 'Straight_Hair', 'Wavy_Hair']. Image by Author](https://towardsdatascience.com/wp-content/uploads/2021/06/1YdZUHHRdPDMEKBQ06LFmeA.png)
As you might be able to see, results suffer as the number of attributes increases. Given our limited compute resources and time, we did not push far past 10 attributes, but it would be interesting to see just how far it is possible to go, and whether a large enough model could learn all 40 of the CelebA binary attributes.
As additional work, we also experimented with applying rotation transformations to the CelebA samples to see if the cGAN could learn not only the binary attributes but also augmentations such as rotation. Taking inspiration from the paper "Unsupervised Representation Learning by Predicting Image Rotations," we generated samples rotated by 90, 180, and 270 degrees for training.
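A sketch of that augmentation, assuming the rotation is encoded as four extra one-hot label channels appended to the attribute label (the function name and encoding are illustrative):

```python
import torch

def add_rotation_labels(images, labels):
    """For each sample, emit the 0/90/180/270-degree rotations and append
    a one-hot rotation code to the existing multi-hot attribute label."""
    rotated, new_labels = [], []
    for k in range(4):  # k quarter-turns counterclockwise
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        one_hot = torch.zeros(images.size(0), 4)
        one_hot[:, k] = 1.0
        new_labels.append(torch.cat([labels, one_hot], dim=1))
    return torch.cat(rotated), torch.cat(new_labels)
```

Since rotation is just another condition, the generator can then be asked for, say, a blond, 180-degree-rotated face at sampling time.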

Results demonstrate that, surprisingly, the cGAN model is indeed capable of learning both facial attributes and image augmentations.
In conclusion, this was a fascinating look into the explicit capacity of GANs to learn from data, as well as the impressive flexibility of cGANs. We hope you found it interesting, as we certainly did.
Acknowledgements
This project was a collaborative effort along with Jake Austin, Aryan Jain, and Brian Liu from Machine Learning @ Berkeley.
References
[1] Generative Adversarial Networks: https://arxiv.org/abs/1406.2661