Generating Images from Prompts using CLIP and StyleGAN

Victor Perez
Towards Data Science
4 min read · Feb 6, 2021


After reading this article you will know…

  • … how CLIP works at a glance.
  • … how StyleGAN works at a glance.
  • … how they can be combined to generate faces from prompts.

You can find the code used in this post here.

CLIP

Contrastive Language–Image Pre-training (CLIP), presented in Learning Transferable Visual Models From Natural Language Supervision, was published by OpenAI in January 2021. The approach leverages natural language supervision to improve the generality and robustness of deep learning models for image classification. Impressively, it reaches state-of-the-art performance on several benchmarks in a zero-shot setting.

The main idea behind CLIP is to pre-train a language model and an image classification model jointly, using vast amounts of images collected from the Internet together with their captions. In the following image, the “Text Encoder” is the language model and the “Image Encoder” is the image classification model.

Image from the original blog from OpenAI.

The goal is to build a matrix in which each value is the similarity score between a prompt-image pair (computed as I·T in the image) and to train the language and visual models so that they maximize the values at the positions of the correct pairs, i.e. the diagonal of the matrix. For example, if the text at position 0 is “pepper the aussie pup” and the image at position 0 shows exactly that, CLIP trains both models to produce representations that maximize their similarity.
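To make this more concrete, here is a minimal PyTorch sketch of that contrastive objective, loosely following the pseudocode in the CLIP paper. The batch of matching embeddings and the temperature value are assumptions for illustration, not the actual training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching (image, caption) pairs.

    image_features, text_features: (batch, dim) embeddings produced by the
    image encoder and text encoder for the same batch of pairs.
    """
    # L2-normalize so the dot product becomes a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix: entry [i, j] compares image i to text j
    logits = image_features @ text_features.t() / temperature

    # Correct pairs sit on the diagonal, so the target for row i is i
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Cross-entropy over rows (image -> text) and columns (text -> image)
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```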

Once this pre-training is done, the visual model can generate a representation of any input image, which is then compared with text embeddings obtained from a set of prompts using the language model. The text representation with the highest similarity is chosen as the one that best describes the content of the image. With this technique, you can phrase the thousand ImageNet classes as sentences and solve the classification task in a zero-shot setting.
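If you want to try zero-shot classification yourself, OpenAI's open-source clip package makes it a few lines of code. The image path and class names below are placeholders:

```python
import torch
import clip  # OpenAI's CLIP package: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Class names phrased as sentences, as the paper suggests
class_names = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {name}" for name in class_names]
text = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# The prompt with the highest similarity is the predicted class
print(class_names[probs.argmax().item()])
```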

The following are some examples of how the model successfully classifies an image given this set of prompts.

Image from the original blog from OpenAI.
Image from the original blog from OpenAI.

StyleGAN

StyleGAN is a well-known generative adversarial network capable of producing hyper-realistic, high-resolution images. The following is the structure of the model.

Image from the original StyleGAN paper.

The main takeaway is that, given a latent vector z, the mapping network produces another latent vector w, which is then fed into the synthesis network to produce the final image. For our purpose, we will use a StyleGAN trained to generate faces.
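As a rough sketch, generating an image with a pre-trained generator looks like this. It assumes a generator G loaded from NVIDIA's stylegan2-ada-pytorch implementation, which exposes the mapping and synthesis networks as submodules; attribute names may differ in other implementations.

```python
import torch

# Assumes `G` is a pre-trained face generator loaded from NVIDIA's
# stylegan2-ada-pytorch repo (e.g. an FFHQ checkpoint); attribute names
# follow that implementation and may differ in others.
device = "cuda" if torch.cuda.is_available() else "cpu"

z = torch.randn(1, G.z_dim, device=device)  # random latent vector z
w = G.mapping(z, None)                      # mapping network: z -> w (one copy per synthesis layer)
image = G.synthesis(w)                      # synthesis network: w -> image tensor, roughly in [-1, 1]
```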

Generating images from prompts

The following represents the architecture that I have used to generate faces from prompts using CLIP and StyleGAN.

Image by Author.

The idea is simple: we start with random values for StyleGAN’s w latent vector and generate an image. The result is passed to CLIP together with an arbitrary prompt, and CLIP produces a score measuring how well the image matches the content of the prompt. This score is used to update w, which generates another image, and the cycle repeats until we decide the generated image sufficiently resembles the prompt.

Edit: the values of w will be updated using gradient descent and backpropagation as if they were weights in a neural network.
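Below is a minimal sketch of this optimization loop, again assuming a stylegan2-ada-pytorch-style generator G and OpenAI's clip package. The prompt, learning rate, and number of steps are arbitrary, and CLIP's exact input preprocessing is omitted for brevity, so treat this as an illustration of the idea rather than the exact code behind the results below.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI's CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical prompt; any sentence works
text = clip.tokenize(["An image with the face of a woman with blonde hair"]).to(device)
with torch.no_grad():
    text_features = clip_model.encode_text(text)
    # Start from a random w produced by the mapping network of `G`
    w = G.mapping(torch.randn(1, G.z_dim, device=device), None)

# Treat w as a trainable parameter and optimize it directly
w = w.detach().clone().requires_grad_(True)
optimizer = torch.optim.Adam([w], lr=0.01)

for step in range(200):
    image = G.synthesis(w)                                   # w -> generated image
    image = F.interpolate(image, size=224, mode="bilinear")  # CLIP's input resolution
    # NOTE: rescaling to [0, 1] and CLIP's color normalization are omitted for brevity
    image_features = clip_model.encode_image(image)

    # Loss: negative cosine similarity between image and prompt embeddings
    loss = -F.cosine_similarity(image_features, text_features).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```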

Results

The following are some results obtained using this simple method.

“An image with the face of a woman with blonde hair and purple eyes”

Image by Author.

“An image with the face of Elon Musk with blonde hair”

Image by Author.

And finally, just to have some fun with the model:

“An image with the face of a stoner”

Image by Author.

Interestingly, the final generation for that last prompt ended up as the following image; the strain was probably good!

Image by Author.

You can find more samples on my Twitter account.

Conclusions

The power of language models is not limited to textual problems. Their ability to abstract concepts from vast amounts of data extends to other fields and can produce astonishing results like the ones presented in this article, with quite simple approaches!

It is exciting to think about what is to come in the field of computer vision after seeing models like GPT-3.
