There is a growing trend of using text phrases to direct Generative Adversarial Networks (GANs) to generate new images. When OpenAI introduced their CLIP [1] system for comparing text and images, it was a natural choice for text-to-image generation. Two popular GANs are StyleGAN 2 from NVIDIA [2] and VQGAN by Esser et al. [3]. There are open-source projects that direct both of these GANs to generate images: StyleCLIP by O. Patashnik et al. [4] and VQGAN+CLIP by Katherine Crowson work with StyleGAN 2 and VQGAN, respectively. After experimenting with both systems, I found that they each have their pros and cons. But using both GANs together provides the best results.
This article will show how CLIP can direct both StyleGAN 2 and VQGAN to create scary faces, such as witches, vampires, and monsters. I will then show how to use the images to create "nightmare" videos. I call the system SpookyGAN.
SpookyGAN Overview
Here is a diagram that shows the major components of SpookyGAN. You can read the details of the components in the sections that follow.

The creation process starts by generating an image using StyleGAN 2 directed by StyleCLIP. You can enter a text prompt like "old scary witch long pointy nose green screen," and the system will generate an image after a number of iterations (about 40). The StyleGAN 2 system uses a model that was trained on faces of real people from Flickr [5], so the output doesn’t look very scary.
The next step is to take the output image and modify it using VQGAN as directed by the Adam optimizer using CLIP. You can either use the same prompt used to generate the initial image or modify it as needed. Only about ten iterations are required to "spookify" the image.
The final step is to use VQGAN and CLIP again to generate a video. The system will modify the scary image 300 times using the simple text prompt "nightmare" and write the images out to a sequence of files. The FFmpeg tool then compresses the images into an MP4 movie for viewing.
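If you prefer to see the flow as code, here is a rough sketch of the pipeline in Python. The three helper functions are hypothetical placeholders for the Colab steps described below, not functions from any library.

# A high-level sketch of the SpookyGAN pipeline. The three helpers are
# hypothetical placeholders for the Colab steps described below; they are
# not part of any library.
prompt = "old scary witch long pointy nose green screen"

# Step 1: StyleCLIP steers StyleGAN 2 toward the prompt (about 40 iterations).
face = generate_with_styleclip(prompt, iterations=40, learning_rate=0.1)

# Step 2: CLIP-guided VQGAN "spookifies" the face (about 10 iterations).
spooky = refine_with_vqgan_clip(face, prompt, iterations=10)

# Step 3: VQGAN+CLIP drifts the image with the prompt "nightmare", writing
# 300 frames that FFmpeg later packs into a 10-second movie.
render_nightmare_frames(spooky, prompt="nightmare", frames=300,
                        learning_rate=0.01, out_dir="/content/steps")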
System Details
The following sections describe each of the components in more detail.
StyleGAN 2
NVIDIA’s second iteration of StyleGAN was written up in their paper, "Analyzing and Improving the Image Quality of StyleGAN" [2]. They say…
…our improved model redefines the state of the art in unconditional image modeling, both in terms of existing distribution quality metrics as well as perceived image quality. – T. Karras, et al.
In general, GANs are composed of a generator network and a discriminator network. During training, the generator tries to create realistic images, and the discriminator tries to discern which images are real and which are fake. The GAN I am using for the initial spooky image was trained on images from the Flickr-Faces-HQ dataset from NVIDIA.
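As a rough illustration of this adversarial setup (not the actual StyleGAN 2 training code, which uses style-based layers and extra regularizers), a basic GAN training step in PyTorch looks something like this:

import torch
import torch.nn.functional as F

# One simplified GAN training step (illustrative only).
def gan_step(generator, discriminator, real_images, opt_g, opt_d, latent_dim=512):
    z = torch.randn(real_images.size(0), latent_dim, device=real_images.device)

    # Train the discriminator: real images should score high, fakes low.
    fake_images = generator(z).detach()
    real_logits = discriminator(real_images)
    fake_logits = discriminator(fake_images)
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator: try to make the discriminator score its fakes as real.
    gen_logits = discriminator(generator(z))
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()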

CLIP
Contrastive Language-Image Pre-training (CLIP) is a pair of encoders from OpenAI described in their paper, "Learning Transferable Visual Models From Natural Language Supervision" [1]. In the paper, the authors…
… demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn [state-of-the-art] image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. – Alec Radford, et al.
After training, the CLIP image encoder can transform an image into a list of 512 numbers called an image embedding, which captures the "features" of the image. The CLIP text encoder can transform a text phrase into a similar list of numbers that captures the features of the text.

If an image and a phrase describe roughly the same thing, then their embeddings will be similar when compared mathematically. Used together, the two encoders can incrementally steer a GAN to generate images from text.
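Here is a small example of comparing an image and a phrase with OpenAI's clip package. The ViT-B/32 checkpoint and the file name are my choices for illustration; any CLIP model exposes the same encode_image and encode_text calls.

import torch
import clip
from PIL import Image

# Load CLIP's ViT-B/32 encoders (checkpoint choice is an assumption).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("witch.png")).unsqueeze(0).to(device)
text = clip.tokenize(["old scary witch long pointy nose"]).to(device)

with torch.no_grad():
    image_embedding = model.encode_image(image)   # shape (1, 512)
    text_embedding = model.encode_text(text)      # shape (1, 512)

# Cosine similarity: higher means the image and text describe similar things.
similarity = torch.cosine_similarity(image_embedding, text_embedding)
print(similarity.item())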
StyleCLIP
As you may be able to guess from the name, StyleCLIP is a system that uses the CLIP encoders to steer StyleGAN 2 to generate images from a text phrase. In their paper, "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery," the authors…
… introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt. – Or Patashnik, et al.
StyleCLIP works by iteratively changing the input to StyleGAN 2 to steer the generated image toward the text prompt, as measured by CLIP.

The system starts with the average image defined by the StyleGAN 2 model, which is sent into the CLIP image encoder. The text prompt is sent into the CLIP text encoder, and the two embeddings are compared. The StyleCLIP optimizer then steers the generator to create images that get successively closer to the features described in the prompt. StyleCLIP does this by minimizing the difference between the image embedding and the text embedding with each iteration. The learning rate is set to 10% (0.1) at this stage, so the change in each iteration can be substantial. After about 40 passes, the image will start to look like what the prompt describes.
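Here is a simplified sketch of that optimization loop, not the actual StyleCLIP code. The names generator, mean_latent, clip_model, and clip_resize are hypothetical stand-ins for the pretrained StyleGAN 2 network, its average latent vector, the CLIP encoders, and an image-resizing helper.

import torch
import clip

device = "cuda"
prompt = "old scary witch long pointy nose green screen"

# Encode the prompt once with the CLIP text encoder.
text_embedding = clip_model.encode_text(clip.tokenize([prompt]).to(device))

latent = mean_latent.clone().requires_grad_(True)   # start from the average latent
optimizer = torch.optim.Adam([latent], lr=0.1)      # the 10% learning rate

for step in range(40):
    image = generator(latent)                                      # synthesize an image
    image_embedding = clip_model.encode_image(clip_resize(image))  # encode it with CLIP
    # Minimize the difference (maximize similarity) between the embeddings.
    loss = 1 - torch.cosine_similarity(image_embedding, text_embedding).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()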
As mentioned in the overview, the StyleGAN 2 model was trained using faces of real people posted on Flickr. The output will generally look like a real person and not a Halloween baddie, like a witch or a monster. But VQGAN can get us there.
VQGAN
Most GANs use a Convolutional Neural Network (CNN) for the discriminator and an inverted CNN, built from transposed convolutions, for the generator. However, a new hybrid transformer-GAN called the Vector Quantised Generative Adversarial Network (VQGAN) is described in the paper "Taming Transformers for High-Resolution Image Synthesis" [3]. The authors…
… demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. – Patrick Esser et al.
The VQGAN model I am using was trained on tens of thousands of images from WikiArt.org. Here is a diagram that shows the main components of VQGAN.

VQGAN conceptually works like a codec in that it is trained to encode images into an embedding space, where a transformer is used to define a "codebook" of sub-image parts. The decoder then creates an output image that closely resembles the original image.
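To make the "codebook" idea concrete, here is a minimal vector-quantization sketch in PyTorch. The codebook size and latent dimensions are made up for illustration and are not taken from the actual VQGAN model.

import torch

# Minimal vector-quantization sketch: each encoder output vector is replaced
# by its nearest entry in a learned codebook (shapes are illustrative).
codebook = torch.randn(1024, 256)            # 1024 codebook entries, 256 dims each
z = torch.randn(16 * 16, 256)                # encoder output for a 16x16 grid of patches

# Squared distances from every latent vector to every codebook entry.
distances = torch.cdist(z, codebook)         # shape (256, 1024)
indices = distances.argmin(dim=1)            # nearest codebook index per position
z_quantized = codebook[indices]              # quantized latents fed to the decoder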
During training, the VQGAN discriminator looks at sub-images in a grid and assesses each section as feedback for improvement. This allows the model to be quite flexible and create many types of images. However, one of its shortcomings is that VQGAN needs help in generating new images from scratch. That’s where CLIP comes in.
VQGAN+CLIP
Similar to the way StyleGAN 2 can be directed to create images from a text prompt, VQGAN can be controlled by CLIP to modify an input image to more closely resemble what is described by the phrase. Katherine Crowson, a developer known on GitHub as crowsonkb, created some Python code for doing exactly this. Here is a diagram that shows how VQGAN and CLIP can work together to modify an image iteratively.

The system starts with the output image created by StyleGAN 2, which is sent into the CLIP image encoder. A process similar to StyleCLIP is run for about ten passes, and the image will look even more like what the prompt describes.
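A simplified sketch of this "spookify" pass might look like the following; Katherine Crowson's notebook differs in detail, and vqgan.encode and vqgan.decode are hypothetical stand-ins for the VQGAN encoder and decoder. The other names are reused from the StyleCLIP sketch above.

# Encode the StyleGAN 2 output into VQGAN's latent space, then optimize
# that latent against the CLIP text embedding (illustrative sketch only).
z = vqgan.encode(styleclip_image).clone().requires_grad_(True)
optimizer = torch.optim.Adam([z], lr=0.1)

for step in range(10):
    image = vqgan.decode(z)
    image_embedding = clip_model.encode_image(clip_resize(image))
    loss = 1 - torch.cosine_similarity(image_embedding, text_embedding).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()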
Creating a Movie
I found that I could create videos from the images with VQGAN and CLIP by simply using a generic text prompt like "nightmare" and reducing the learning rate of the optimizer. I turned the learning rate down to 1% (0.01), so the frames change slowly. Here is a diagram that shows the components for making a movie.

This time the system starts with the modified image created by VQGAN, which is sent into the CLIP image encoder. The prompt is simply "nightmare." The system runs for 300 frames, which generates 10 seconds of video at 30 frames per second.
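Here is a sketch of that frame loop, reusing the hypothetical helpers from the earlier sketches. The zero-padded file naming is my assumption, chosen to match the ffmpeg pattern below.

import torch
import clip
from torchvision.transforms.functional import to_pil_image

# Drift the image toward "nightmare" with a small learning rate, saving
# every step as a numbered PNG for FFmpeg to stitch together (sketch only).
nightmare_embedding = clip_model.encode_text(clip.tokenize(["nightmare"]).to(device))
optimizer = torch.optim.Adam([z], lr=0.01)

for frame in range(300):                     # 300 frames = 10 s at 30 fps
    image = vqgan.decode(z)
    loss = 1 - torch.cosine_similarity(
        clip_model.encode_image(clip_resize(image)), nightmare_embedding).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    # File naming is an assumption: zero-padded frame numbers.
    to_pil_image(image.squeeze(0).clamp(0, 1).cpu()).save(f"/content/steps/{frame:04d}.png")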
The FFmpeg tool is used to generate an MP4 movie file. Here is the command to create the movie.
ffmpeg -r 30 -i /content/steps/%04d.png -c:v libx264 witch.mp4
The -r option specifies the frame rate of 30 fps. The -i option indicates where the images are, and the -c:v option specifies the H.264 video codec.
Here’s the video.
You can see how the witch’s face morphs into what looks like a skull with teeth in the eyes. It sure looks like a nightmare to me! 😲
SpookyGAN Results
Below are more results from SpookyGAN. The images are from StyleGAN 2 on the left and VQGAN on the right. The text prompts are shown in the captions. Be sure to check out the appendix to see more videos.
Spooky Witch


Spooky Vampire


Spooky Frankenstein’s Monster


Spooky Werewolf


Source Code
The Google Colabs for using SpookyGAN are here.
Acknowledgments
I want to thank Jennifer Lim and Oliver Strimpel for their help with this project.
References
[1] CLIP by A. Radford, et al., Learning Transferable Visual Models From Natural Language Supervision (2021)
[2] StyleGAN 2 by T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, Analyzing and Improving the Image Quality of StyleGAN (2020)
[3] VQGAN by P. Esser, R. Rombach, and B. Ommer, Taming Transformers for High-Resolution Image Synthesis (2020)
[4] StyleCLIP by O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery (2021)
[5] FFHQ by T. Karras, J. Hellsten, Flickr-Faces-HQ (2018)
Appendix
Here are three videos created with the prompt "nightmare" and the learning rate set to 1% for the VQGAN+CLIP optimizer.