
E-DALL-E: Creating Digital Art with Varying Aspect Ratios

How to expand images generated with DALL-E Mini by using VQGAN and CLIP to inpaint the sides

Images from DALL-E Mini (left), Expanded to 16:9 with Repeated Edge Pixels (center), and E-DALL-E (right), Images by Author

You may have seen some images generated from text using DALL-E 2 [1] from OpenAI. Although the system is impressive and the results are incredible, it’s currently only available in a closed beta with a waitlist. However, you can access and run an independent text-to-image system called DALL-E Mini [2], led by developers Boris Dayma and Pedro Cuenca. Although the results are not as spectacular as the images generated by DALL-E 2, they are still excellent, and the source code and trained models are free and open-sourced. You can try it out in their ad-supported demo.

You may have noticed that the images generated by both DALL-E models use a 1:1 aspect ratio; the images are always dead square. The systems will not produce images in landscape or portrait formats, limiting their usefulness.

However, I noticed that the image generator for DALL-E Mini uses the VQGAN model [4], which I know very well from several articles I wrote on image generation. I also know that VQGAN can render images with varying aspect ratios. So I wrote a little code to take the output from the DALL-E models, or any image, and expand the aspect ratio using VQGAN guided by CLIP from OpenAI [5]. I call the system Expand-DALL-E, or E-DALL-E for short. You can run it on the Colab here. And be sure to check out the image gallery in the appendix.

System Overview

Here is a diagram for the E-DALL-E system with a brief description of the processes. A full description of the system components follows.

E-DALL-E Components, Diagram by Author

The system has two main components: DALL-E Mini for image generation and E-DALL-E for expanding the aspect ratio.

The image creation process starts with a text prompt from the user, like "graffiti wall painting on brick of a pug wearing sunglasses at the beach." The DALL-E Mini system, trained on 15 million image/caption pairs, converts the text into an internal representation. It then decodes and samples the results into latent vectors to be used as inputs into the VQGAN Image Generator, which the authors trained using the ImageNet dataset [6]. VQGAN renders 256×256 images for each sample, and the user chooses one.

First, the user supplies the desired aspect ratio to E-DALL-E, like 16:9. Then, it fills out the selected image by repeating the original edge pixels as an initial image for further iterations. The system feeds the text prompt into the CLIP text encoder to be used as a guide. Then it starts iterating through the generation process for a set number of steps N, say 100. The system uses the Adam Optimizer [7] to change the VQGAN vectors to create an image that best matches the text prompt. After each step, however, the system copies the vectors for the center part of the original image back in, so it will only update the edges of the image.

After completing the iterations, the system displays the expanded image with newly added details on the sides.

Original 1:1 Image from DALL-E Mini, Initial 16:9 Image with Repeated Edge Pixels, Expanded 16:9 Image using E-DALL-E, Images by Author

Looking closely, you can see that the center part of the expanded image on the right changed slightly from the original image on the left. This is an intentional aspect of how VQGAN renders images. I’ll describe this effect in the section on splicing images below.

Component Details

DALL-E

In January of 2021, OpenAI released a paper and a demo of their DALL-E text-to-image system [8]. The name is a play on words. It’s a cross between the surname of painter Salvador Dalí and Disney’s animated movie, WALL-E.

The DALL-E model is a massive transformer. It has 12 billion parameters and was trained using 250 million image/caption pairs from the internet. Here are the results from the queries, "a tapir made of accordion. a tapir with the texture of an accordion." and "an illustration of a baby hedgehog in a christmas sweater walking a dog."

Output from DALL-E from OpenAI’s paper, Zero-Shot Text-to-Image Generation

The results of DALL-E are excellent. The system does a great job interpreting the text and rendering images in a creative but realistic way. One limitation is that the output is relatively small, at 256×256 pixels. And OpenAI didn’t provide access to the DALL-E source code or the trained model. But that changed 14 months later.

DALL-E 2

On April 13, 2022, OpenAI released an updated version called DALL-E 2 [1]. It uses a variant of their CLIP model for the text encoder and a variant of their GLIDE [9] image decoder model with 3.5 billion parameters. OpenAI trained the DALL-E 2 system using 650 million images. The system renders images in high resolution at 1024×1024 pixels. Here are some sample images from their paper.

Output from DALL-E 2 from OpenAI’s paper, Hierarchical Text-Conditional Image Generation with CLIP Latents

These results are excellent! As I mentioned at the top of this article, DALL-E 2 is only available as a closed beta.

DALL-E Mini

Developers Boris Dayma and Pedro Cuenca led a team to create a free, open-source text-to-image generator called DALL-E Mini [2]. The system uses a pair of trained BERT Transformers [10] to convert text into latent vectors that can render images using VQGAN. As the name implies, the DALL-E Mini model is relatively small. It "only" has 400 million parameters and was trained on "only" 15 million images.

Here is the diagram of the inferencing pipeline for DALL-E Mini.

Inference Pipeline of DALL-E Mini, Source: DALL-E Mini Explained

The text prompt is fed into BART and then decoded to produce samples of latent vectors for VQGAN, which renders the candidate images. Note that the system does not generate the images iteratively. Instead, it transforms the text directly into latent vectors that VQGAN uses to render the images. A complete explanation of how DALL-E Mini works is here.

Here are some sample images for the prompts, "a painting of rolling farmland," "an abstract painting with orange triangles," and "a still life painting of a bowl of fruit."

Sample Output from DALL-E Mini, Images by Author

Although they’re not as good as the images from DALL-E 2, the output from DALL-E Mini is decent. There is a good amount of variety in the images, and each composition is generally sound, albeit lacking in fine details.

Next, I will show you how to expand the output to have different aspect ratios.

E-DALL-E

The E-DALL-E system starts with an aspect ratio specified by the user. For example, if we have a 256×256 image and want to stretch it to, say, 16:9, we need to add 96 pixels to the left side and 96 pixels to the right side. This will make the resultant image 448×256 which is roughly 16:9.
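
To make that arithmetic concrete, here is a small helper (a hypothetical function, not part of the project code) that computes the per-side padding. Snapping the new width down to a multiple of 16 is my assumption, based on VQGAN's 16×16 block size described later.

```python
def horizontal_padding(width, height, target_w=16, target_h=9):
    """Pixels to add to each side of an image to reach roughly target_w:target_h."""
    new_width = round(height * target_w / target_h)  # 256 * 16 / 9 ≈ 455
    new_width -= new_width % 16                      # snap down to a 16-pixel multiple: 448
    return (new_width - width) // 2                  # 96 pixels per side

print(horizontal_padding(256, 256))  # 96, for a final size of 448×256
```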

Next, the system fills in the edges of the image with repeated pixels. This will help the VQGAN model get a head start in generating the sides of the expanded image. Repeating pixels is a simple programming trick that doesn’t require any Machine Learning.

For each row in the image, the system averages the 8 pixels at each end and replicates that color to make sub-images of constant horizontal color. It also performs a 32-pixel linear blend back into the original to mask the transition, so the left and right sub-images must be 96+32 pixels wide.
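
Here is a minimal NumPy sketch of that trick. It follows the description above (8-pixel edge averages, a 32-pixel linear blend, 96+32-pixel strips), but it is my reconstruction, not the project's implementation, which is linked below.

```python
import numpy as np

def repeat_edges(img, pad=96, sample=8, blend=32):
    """Stretch an H x W x 3 image horizontally by repeating its edge colors.

    Each new strip is pad + blend pixels wide: a constant strip of the averaged
    edge color, plus a linear blend back into the original image.
    """
    img = img.astype(np.float32)
    left_color = img[:, :sample].mean(axis=1, keepdims=True)    # (H, 1, 3)
    right_color = img[:, -sample:].mean(axis=1, keepdims=True)
    left = np.repeat(left_color, pad + blend, axis=1)
    right = np.repeat(right_color, pad + blend, axis=1)
    ramp = np.linspace(0.0, 1.0, blend)[None, :, None]           # 0 -> 1 over 32 columns
    left[:, -blend:] = (1 - ramp) * left[:, -blend:] + ramp * img[:, :blend]
    right[:, :blend] = (1 - ramp) * img[:, -blend:] + ramp * right[:, :blend]
    out = np.concatenate([left, img[:, blend:-blend], right], axis=1)
    return out.clip(0, 255).astype(np.uint8)

# A 256x256 image becomes (96 + 32) + (256 - 64) + (96 + 32) = 448 pixels wide.
```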

Here’s what the three images look like in 16:9 format with repeated edge pixels.

The Output of DALL-E Mini with Repeated Edge Pixels, Images by Author

OK, they’re not passable as finished images, but they’re a good starting point for the next step in the process. The source code for repeating edge pixels is here.

Splicing Images with VQGAN

Before I get into rendering the sides for the expanded images, I will discuss an interesting aspect of VQGAN that hasn’t been written up (as far as I know). Although the authors of VQGAN created it to generate new images, it does an excellent job of splicing similar images together.

VQGAN is essentially a codec. The encoder takes an image and generates latent vectors. The decoder takes latent vectors and generates an image. During training, the system updates the model’s parameters to make the output images match the input images as closely as possible.

Unlike the latent vectors in most GANs, there is a spatial aspect to the latent vectors used by VQGAN. For every 16×16 block of RGB pixels in the input, the encoder creates 256 floating-point numbers. So a 256×256 RGB image has a latent size of 16×16×256. Each group of 256 numbers corresponds to an entry in what the VQGAN paper [4] calls the "codebook."

When I edit images in the latent vector space, the system will smooth out the results as it decodes the resultant image.

For example, here are two portraits from the artists Gustav Klimt and Amedeo Modigliani. The top row shows the results of splicing the RGB images right down the middle. The left half is from Klimt, and the right half is from Modigliani. The bottom row shows the results of splicing the VQGAN latent vectors and decoding the resultant image.

Top Row: "Amalie Zuckerkandl" by Gustav Klimt from [WikiArt](https://www.wikiart.org/en/amedeo-modigliani/marie-daughter-of-the-people-1918), Blend with Pixel Splicing by Author, "Marie, Daughter of the People" by Amedeo Modigliani from WikiArt, Bottom Row: "Amalie Zuckerkandl" via VQGAN, Blend with Vector Splicing, "Marie, Daughter of the People" via VQGAN, Images by Author

There are a couple of things to notice here. First, the VQGAN model is not a perfect codec. The images passed through the model do not match the original images exactly. For example, the eyes in the top-left and bottom-left portraits look different. However, you can see how splicing the images in the latent vector space blends the image parts quite naturally. Although the face in the center-bottom image is clearly made of two different halves, it’s hard to find the seam between the two. This is because each VQGAN codebook entry will not only render its spatial area when decoded, it will influence its neighbors to create a cohesive image based on the training data.
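
To make the splicing step concrete, here is a sketch of how the two latent grids can be combined. The `vqgan_encode` and `vqgan_decode` helpers are placeholders for a wrapper around the VQGAN model (not the actual taming-transformers API), and the latents are assumed to be shaped (batch, 256, 16, 16).

```python
import torch

# Hypothetical helpers wrapping the VQGAN model:
#   vqgan_encode: (1, 3, 256, 256) image tensor -> (1, 256, 16, 16) latent grid
#   vqgan_decode: (1, 256, 16, 16) latent grid  -> (1, 3, 256, 256) image tensor

def splice_latents(z_left, z_right):
    """Take the left half of one latent grid and the right half of another.

    Each of the 16x16 spatial positions holds a 256-dim code, so splitting at
    latent column 8 corresponds to splitting the 256-pixel-wide image at column 128.
    """
    return torch.cat([z_left[..., :8], z_right[..., 8:]], dim=-1)

# Usage with the hypothetical helpers above:
# z_klimt      = vqgan_encode(klimt_image)
# z_modigliani = vqgan_encode(modigliani_image)
# blended      = vqgan_decode(splice_latents(z_klimt, z_modigliani))
```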

I use this feature of VQGAN when rendering the sides of expanded images.

Expanding Images with E-DALL-E

Here is the component diagram again, showing only the processes used for expanding the image.

E-DALL-E Component Details, Diagram by Author

The process starts with an input image and a specified aspect ratio. The system repeats the edge pixels to stretch the image, as described above. It then encodes the prompt using CLIP to guide the optimization process. For each iteration, the system encodes the image with CLIP. It uses the difference between the text encoding and the image encoding to modify the VQGAN latent vectors using the Adam Optimizer. After each iteration, the center part of the modified image vectors is replaced with the center of the vectors from the input image to keep the middle part relatively constant. Only the sides will change a lot. After N iterations, the system displays the final image with the new aspect ratio.
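
Here is a compressed PyTorch sketch of that loop. The CLIP calls follow OpenAI's clip package, but `vqgan_decode`, `make_cutouts`, `z_init`, `keep_center`, and `prompt` are placeholders standing in for the project's wrappers and setup, so treat this as an outline of the technique rather than the working implementation.

```python
import torch
import clip  # OpenAI's CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

# Placeholders (not the project's actual names):
#   z_init: VQGAN latents for the edge-padded image, e.g. (1, 256, 16, 28) for 448x256
#   vqgan_decode: renders latents to an image tensor with values in [0, 1]
#   make_cutouts: resizes/crops the image to 224x224 views for CLIP
#   keep_center: boolean mask over the latent columns marking the original center region
#   prompt: the original text prompt string
z = z_init.clone().requires_grad_(True)
z_orig = z_init.clone()
optimizer = torch.optim.Adam([z], lr=0.1)

with torch.no_grad():
    text_features = clip_model.encode_text(clip.tokenize([prompt]).to(device))

for step in range(100):
    optimizer.zero_grad()
    image = vqgan_decode(z)                                      # (1, 3, 256, 448)
    image_features = clip_model.encode_image(make_cutouts(image))
    # Move the image toward the prompt by minimizing cosine distance in CLIP space
    loss = 1 - torch.cosine_similarity(image_features, text_features, dim=-1).mean()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        # Copy the original latents back into the center so only the sides change
        z[:, :, :, keep_center] = z_orig[:, :, :, keep_center]
```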

Here are some results from using E-DALL-E.

Results from E-DALL-E, Images by Author

Pretty good! The system seems to do a good job making up new details in the expanded areas. And it’s hard to see the seams between the original image and the newly generated parts. But you can also see how VQGAN sometimes changes the central part of the image to make the new pieces fit in better. For example, the sky’s hue in the top-right image seems to have changed to a deeper blue in the central part. And the texture of the orange triangles in the center of the bottom-right image appears to match the surface of the newly generated triangles on the sides.

Cultural Biases

Like most neural network models trained on large amounts of data found on the Internet, the models used in this project have inherent cultural biases picked up from that data.

The CLIP authors studied the cultural biases that are present in their system.

CLIP is trained on text paired with images on the internet. These image-text pairs are unfiltered and uncurated and result in CLIP models learning many social biases. … For example, we find that … labels such as ‘nanny’ and ‘housekeeper’ start appearing for women whereas labels such as ‘prisoner’ and ‘mobster’ start appearing for men. … Additionally, CLIP attached some labels that described high status occupations disproportionately more often to men such as ‘executive’ and ‘doctor’. – Alec Radford, et al.

And the authors of DALL-E Mini found the following.

Overall it is difficult to investigate in much detail the model biases due to the low quality of generated people and faces, but it is nevertheless clear that biases are present. Occupations demonstrating higher levels of education (such as engineers, doctors or scientists) or high physical labor (such as in the construction industry) are mostly represented by white men. In contrast, nurses, secretaries or assistants are typically women, often white as well. Most of the people generated are white. It’s only on specific examples such as athletes that we will see different races, though most of them still under-represented. The dataset is limited to pictures with English descriptions, preventing text and images from non-English speaking cultures to be represented. – Boris Dayma and Pedro Cuenca, et al.

E-DALL-E will likely perpetuate these biases when rendering the new parts of expanded images.

Final Thoughts and Future Steps

The results of the DALL-E Mini text-to-image model appear to be very good. And VQGAN is a versatile image rendering model with many unexpected uses, like expanding images to change the aspect ratio.

I may try using VQGAN as a general inpainting tool in a future project. Although the system will quantize the masked area to 16×16 pixel blocks, it may be possible to blend back in both the original latent vectors and the original pixels outside the masked regions. And using CLIP as a guiding text encoder, it might be possible to perform a function like "draw a party hat on the pug." 😀
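
As a rough sketch of that idea (a conceptual illustration only, assuming the same latent layout as above), the copy-back step would generalize from "keep the center columns" to "keep every 16×16 block outside a mask":

```python
import torch

def keep_unmasked_blocks(z, z_orig, block_mask):
    """Restore original latents wherever the 16x16-block mask is False.

    z, z_orig: VQGAN latents of shape (1, 256, H/16, W/16)
    block_mask: boolean tensor of shape (H/16, W/16); True marks blocks to repaint
    """
    with torch.no_grad():
        keep = ~block_mask
        z[:, :, keep] = z_orig[:, :, keep]
    return z
```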

Source Code

The source code for this project is available on GitHub. I am releasing the sources under the CC BY-SA license. You can create and expand your own images using the Google Colabs.


Acknowledgments

I want to thank Jennifer Lim for her help with this article.

References

[1] A. Ramesh et al., Hierarchical Text-Conditional Image Generation with CLIP Latents (2022)

[2] B. Dayma and P. Cuenca, DALL·E mini – Generate Images from Any Text Prompt (2021)

[4] P. Esser, R. Rombach, and B. Ommer, Taming Transformers for High-Resolution Image Synthesis (2020)

[5] A. Radford et al., Learning Transferable Visual Models From Natural Language Supervision (2021)

[6] J. Deng et al., ImageNet: A Large-Scale Hierarchical Image Database (2009)

[7] D. P. Kingma and J. Lei Ba, Adam: A Method for Stochastic Optimization (2015), International Conference on Learning Representations

[8] A. Ramesh et al., Zero-Shot Text-to-Image Generation (2021)

[9] A. Nichol et al., GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models (2022)

[10] J. Devlin, M.W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)

Appendix

Here are the results from DALL-E Mini on the left and E-DALL-E on the right for the indicated prompts.

a colorful abstract painting with swirls of blue, green, and purple

"a colorful abstract painting with swirls of blue, green, and purple," Images by Author
"a colorful abstract painting with swirls of blue, green, and purple," Images by Author

a painting of a chihuahua surfing

"a painting of a chihuahua surfing," Images by Author
"a painting of a chihuahua surfing," Images by Author

a painting of the Eiffel Tower in the style of Van Gogh’s starry night

"a painting of the Eiffel Tower in the style of Van Gogh's starry night," Images by Author
"a painting of the Eiffel Tower in the style of Van Gogh’s starry night," Images by Author

painting of a New England landscape with colorful foliage

"painting of a New England landscape with colorful foliage," Images by Author
"painting of a New England landscape with colorful foliage," Images by Author
