Text to Image

Connor Shorten
Towards Data Science
8 min read · Jan 25, 2019


This article will explain the experiments and theory behind an interesting paper that converts natural language text descriptions such as "A small bird has a short, pointy orange beak and white belly" into 64x64 RGB images. The paper is "Generative Adversarial Text to Image Synthesis" by Reed et al. [1].

Article Outline

  1. Introduction
  2. Architecture Used
  3. Constructing a Text Embedding for Visual Attributes
  4. Manifold Interpolation
  5. Results / Conclusion

Introduction

Converting natural language text descriptions into images is an amazing demonstration of Deep Learning. Text classification tasks such as sentiment analysis have been successful with Deep Recurrent Neural Networks that are able to learn discriminative vector representations from text. In another domain, Deep Convolutional GANs are able to synthesize images such as interiors of bedrooms from a random noise vector sampled from a normal distribution. The focus of Reed et al. [1] is to connect advances in Deep RNN text embeddings and image synthesis with DCGANs, inspired by the idea of Conditional-GANs.

Conditional-GANs work by feeding a one-hot class label vector to the generator and discriminator in addition to the randomly sampled noise vector. This results in higher training stability, more visually appealing results, and controllable generator outputs. The difference between traditional Conditional-GANs and the Text-to-Image model presented here lies in the conditioning input. Instead of conditioning the GAN on a hand-constructed sparse visual attribute descriptor, it is conditioned on a text embedding learned with a deep neural network. A sparse visual attribute descriptor might describe "a small bird with an orange beak" as something like:

 [ 0 0 0 1 . . . 0 0 . . . 1 . . . 0 0 0 . . . 0 0 1 . . .0 0 0]

The ones in the vector would answer attribute questions such as: orange (1/0)? small (1/0)? bird (1/0)? Such descriptions are difficult to collect and do not work well in practice.
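
To make the contrast concrete, here is a minimal PyTorch sketch of the two kinds of conditioning vectors. The dimensions, attribute indices, and the toy LSTM encoder are illustrative assumptions of mine, not details from the paper (apart from the 1024-d embedding size quoted later in this article):

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
num_attributes = 300   # size of a hand-designed attribute vocabulary (assumption)
embedding_dim = 1024   # size of the learned text embedding used later in the article

# Sparse visual attribute descriptor: a mostly-zero binary vector answering
# questions like "orange?", "small?", "bird?" for one caption.
attribute_vector = torch.zeros(num_attributes)
attribute_vector[[3, 17, 42]] = 1.0   # made-up attribute indices

# Learned text embedding: a dense vector produced by a text encoder that is
# trained (separately, see the embedding section below) to reflect visual content.
char_features = torch.randn(1, 50, 128)   # stand-in for an encoded 50-character caption
text_encoder = nn.LSTM(input_size=128, hidden_size=embedding_dim, batch_first=True)
_, (h_n, _) = text_encoder(char_features)
text_embedding = h_n[-1]                   # shape: (1, 1024), dense rather than sparse
```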

Word embeddings have been the hero of natural language processing through concepts such as Word2Vec. Word2Vec forms embeddings by learning to predict the context of a given word. Unfortunately, Word2Vec does not translate well to text-to-image synthesis, since a word's linguistic context does not capture visual properties as well as an embedding explicitly trained to do so. Reed et al. [1] present a novel symmetric structured joint embedding of images and text descriptions to overcome this challenge, which is described in further detail later in this article.

In addition to constructing good text embeddings, translating from text to images is highly multi-modal. The term 'multi-modal' is an important one to become familiar with in Deep Learning research. It refers to the fact that there are many different images of birds that correspond to the text description "bird". Another example, from speech, is that many different accents produce different sounds that all correspond to the text "bird". Multi-modal learning is also present in image captioning (image-to-text). However, captioning is greatly facilitated by the sequential structure of text: the model can predict the next word conditioned on the image as well as the previously predicted words. Multi-modal learning is traditionally very difficult, but it is made much easier with GANs (Generative Adversarial Networks), whose adversarial framework provides an adaptive loss function that is well suited for multi-modal tasks such as text-to-image.

Architecture Used

The architecture diagram from the paper shows how Reed et al. trained this text-to-image GAN model. The most noteworthy takeaway from the diagram is how the text embedding fits into the sequential processing of the model. In the generator network, the text embedding is passed through a fully connected layer and concatenated with the random noise vector z. In this case, the text embedding is compressed from a 1024x1 vector to 128x1 and concatenated with the 100x1 random noise vector z. On the discriminator side, the text embedding is also compressed through a fully connected layer into a 128x1 vector, then spatially replicated over a 4x4 grid and depth-wise concatenated with the image representation. This image representation is produced after the input image has been convolved over multiple times, reducing the spatial resolution and extracting information. This embedding strategy for the discriminator differs from the Conditional-GAN model, in which the embedding is concatenated into the original image matrix and then convolved over.
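
A minimal PyTorch sketch of this wiring, using the dimensions quoted above; the module names, activation choice, and the 512-channel feature map are my own assumptions rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

class TextProjection(nn.Module):
    """Compress the 1024-d text embedding to 128-d with a fully connected layer."""
    def __init__(self, embed_dim=1024, proj_dim=128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(embed_dim, proj_dim), nn.LeakyReLU(0.2))

    def forward(self, t):
        return self.fc(t)

proj_g, proj_d = TextProjection(), TextProjection()

batch = 16
text_embedding = torch.randn(batch, 1024)   # from the pretrained text encoder
z = torch.randn(batch, 100)                 # random noise vector

# Generator input: 128-d projected embedding concatenated with 100-d noise.
g_input = torch.cat([proj_g(text_embedding), z], dim=1)        # shape: (16, 228)

# Discriminator side: project to 128-d, replicate spatially over 4x4, and
# depth-concatenate with the 4x4 feature map produced by D's conv stack.
image_features = torch.randn(batch, 512, 4, 4)                 # stand-in for conv output
t_rep = proj_d(text_embedding).view(batch, 128, 1, 1).expand(-1, -1, 4, 4)
d_input = torch.cat([image_features, t_rep], dim=1)            # shape: (16, 640, 4, 4)
```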

One general point to take away from the architecture diagram is how the DCGAN upsamples a vector or low-resolution input into a high-resolution image. Each de-convolutional layer increases the spatial resolution of the image, while the depth of the feature maps decreases per layer. Conversely, the convolutional layers in the discriminator network decrease the spatial resolution and increase the depth of the feature maps as they process the image.
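
For intuition, here is a rough DCGAN-style generator body in PyTorch that upsamples the conditioned input to a 64x64 RGB image. The intermediate channel counts and layer choices are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Each transposed convolution doubles the spatial resolution while the
# number of feature maps shrinks, going from the conditioned input (228-d,
# treated as a 1x1 "image") up to a 64x64x3 output.
generator = nn.Sequential(
    nn.ConvTranspose2d(228, 512, kernel_size=4, stride=1, padding=0),  # 1x1   -> 4x4
    nn.BatchNorm2d(512), nn.ReLU(),
    nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),  # 4x4   -> 8x8
    nn.BatchNorm2d(256), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # 8x8   -> 16x16
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 16x16 -> 32x32
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),     # 32x32 -> 64x64
    nn.Tanh(),
)

g_input = torch.randn(16, 228, 1, 1)   # projected text embedding + noise, reshaped to 1x1
fake_images = generator(g_input)       # shape: (16, 3, 64, 64)
```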

An interesting aspect of this training process is that it is difficult to separate loss caused by the generated image not looking realistic from loss caused by the generated image not matching the text description. The authors describe the training dynamics as follows: initially, the discriminator pays no attention to the text embedding, since the images created by the generator do not look real at all. Once G can generate images that at least pass the real vs. fake criterion, the text embedding is factored in as well.

The authors smooth out these training dynamics by adding a third type of training pair: real images with incorrect text descriptions, which are labeled as 'fake'. The discriminator thus stays focused on the single binary task of real versus fake, but it can no longer judge the image in isolation from the text, since a real image paired with the wrong description must still be scored as fake. This is in contrast to an approach such as AC-GAN with one-hot encoded class labels. The AC-GAN discriminator outputs real vs. fake and uses an auxiliary classifier, sharing the intermediate features, to predict the class label of the image.
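
A sketch of this matching-aware discriminator objective (the paper calls this variant GAN-CLS) might look like the following. Here D is assumed to take an image batch and a text-embedding batch and return match probabilities in (0, 1), and the 0.5 weighting of the two "fake" terms follows my reading of the paper's training algorithm; this is not the authors' code:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real_images, fake_images, matching_text, mismatched_text):
    """Three scores, one binary objective:
    real image + matching text     -> labeled real,
    real image + mismatched text   -> labeled fake,
    generated image + matching text -> labeled fake."""
    s_real  = D(real_images, matching_text)     # real image, right text
    s_wrong = D(real_images, mismatched_text)   # real image, wrong text
    s_fake  = D(fake_images, matching_text)     # generated image, right text

    ones, zeros = torch.ones_like(s_real), torch.zeros_like(s_real)
    loss_real  = F.binary_cross_entropy(s_real, ones)
    loss_wrong = F.binary_cross_entropy(s_wrong, zeros)
    loss_fake  = F.binary_cross_entropy(s_fake, zeros)
    return loss_real + 0.5 * (loss_wrong + loss_fake)
```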

Constructing a Text Embedding for Visual Attributes

The most interesting component of this paper is how they construct a unique text embedding that contains visual attributes of the image to be represented. This vector is constructed through the following process:

The loss function noted as equation (2) in the paper represents the overall objective: it sums the misclassification losses of an image classifier and a text classifier, which are defined in equations (3) and (4). The paper describes the intuition for this process as "A text encoding should have a higher compatibility score with images of the corresponding class compared to any other class and vice-versa". The two classifiers are built from a shared compatibility score between an image encoder and a text encoder. The image encoder is taken from the GoogLeNet image classification model and compresses each image into a 1024x1 vector. The objective thus pushes the image representation from GoogLeNet and the text representation from a character-level CNN or LSTM to be close for matching image-text pairs. Essentially, the pretrained image encodings are used to guide the text encodings toward capturing visual similarity.
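
For reference, this is my transcription of equations (2)-(4) from the paper; the notation may differ slightly from the original. Here Δ is the 0-1 loss, φ is the image encoder, ψ is the text encoder, and T(y) and V(y) are the sets of captions and images of class y:

```latex
% Structured loss over N (image, text, label) triplets -- equation (2)
\frac{1}{N} \sum_{n=1}^{N} \Delta\big(y_n, f_v(v_n)\big) + \Delta\big(y_n, f_t(t_n)\big)

% Classifiers induced by the image/text compatibility score -- equations (3) and (4)
f_v(v) = \arg\max_{y \in \mathcal{Y}} \; \mathbb{E}_{t \sim \mathcal{T}(y)}\big[\phi(v)^{\top} \psi(t)\big]
f_t(t) = \arg\max_{y \in \mathcal{Y}} \; \mathbb{E}_{v \sim \mathcal{V}(y)}\big[\phi(v)^{\top} \psi(t)\big]
```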

The details of this are expanded on in the companion paper, "Learning Deep Representations of Fine-Grained Visual Descriptions", also from Reed et al. [2].

Note the term 'fine-grained': it distinguishes tasks that discriminate between closely related categories, such as different species of birds or flowers, from tasks that separate completely different objects such as cats, airplanes, boats, mountains, and dogs, as in the ImageNet challenges.

Manifold Interpolation

One of the interesting characteristics of Generative Adversarial Networks is that the latent vector z can be manipulated to synthesize new instances, commonly demonstrated through interpolation and "latent space arithmetic". A classic example is "man with glasses" - "man without glasses" + "woman without glasses", which yields a woman with glasses. In this paper, the authors instead interpolate between text embeddings. This is done by adding an interpolation term to the generator objective.
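
As best I can transcribe it from the paper, the added term is:

```latex
\mathbb{E}_{t_1, t_2 \sim p_{\text{data}}}\Big[\log\big(1 - D\big(G(z,\ \beta t_1 + (1 - \beta)\, t_2)\big)\big)\Big]
```

where t1 and t2 are the embeddings of two training captions and β mixes them; the paper reports that simply fixing β = 0.5 works well.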

The discriminator has been trained to predict whether image and text pairs match or not, so images generated from interpolated text embeddings can fill in gaps in the data manifold that were not covered during training. Using this as a regularization of the training data space is important to the success of the model presented in this paper. It is effectively a form of data augmentation, since the interpolated text embeddings expand the set of conditioning inputs used to train the text-to-image GAN.

Results / Conclusion

The experiments are conducted on three datasets: the CUB dataset, containing 11,788 bird images from 200 categories; the Oxford-102 flowers dataset, containing 8,189 images from 102 categories; and the MS-COCO dataset, used to demonstrate the generalizability of the algorithm.

Each image in CUB and Oxford-102 is paired with 5 text captions.

Results on the CUB birds dataset, Oxford-102, and MS-COCO (example images are shown in the paper).

All of these results are on the zero-shot learning task, meaning that the model had never seen those text descriptions during training. The generated images are fairly low-resolution at 64x64x3. Nevertheless, it is very encouraging to see this algorithm having some success on the very difficult multi-modal task of text-to-image synthesis. Thanks for reading this article; I highly recommend checking out the paper to learn more!

References

[1] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee. Generative Adversarial Text to Image Synthesis. ICML, 2016.

[2] Scott Reed, Zeynep Akata, Bernt Schiele, Honglak Lee. Learning Deep Representations of Fine-Grained Visual Descriptions. CVPR, 2016.
