DALL-E 2, Explained

How Does the Picasso of AI Work?

Daniel Fein
Towards Data Science


Photo by davisuko on Unsplash

A few days ago, OpenAI released what I find to be the most striking display of the creative power of AI: DALL-E 2. On the most basic level, DALL-E 2 is a function that maps text to images with remarkable accuracy, producing high-quality, vibrant output images. But how does the technology work, really? In this post, I will try to explain as deeply as possible while keeping the explanation accessible to as many readers as possible.

As stated above, DALL-E 2 takes text as input and produces images as its output. But this isn’t all done in one go; rather, DALL-E 2 represents the culmination of a variety of techniques that have been built on and improved bit by bit over the past few years. These techniques are stacked on each other like Lego blocks in order to compose the full model. To understand DALL-E 2, we first have to understand what these pieces are and what they do. Only then can we dive into the inner workings and training processes of those underlying components.

CLIP

Among the most important building blocks in the DALL-E 2 architecture is CLIP. CLIP stands for Contrastive Language-Image Pre-training, and it’s essential to DALL-E 2 because it functions as the main bridge between text and images. Broadly, CLIP represents the idea that language can be a vehicle for teaching computers how different images relate to one another. Formally, it prescribes a simple way to carry out this teaching.

The best way to understand CLIP is to first see the shortcomings of previous computer vision systems. Until CLIP, neural methods for computer vision involved aggregating large datasets of images and then hand-labeling them into a set of categories. Though today’s models are exceptional at this task, the approach is inherently limited by the need for pre-selected categories. For example, imagine taking a picture of the street and asking such a system to describe it; it could tell you how many cars and signs there were, but it couldn’t give you a feel for the scene overall. What’s more, anything without enough images to produce a category would simply not be classified by the model.

The insight that makes CLIP so powerful is the idea to train models not just to identify which category (from a pre-defined list of options) an image belongs to, but to pick out each image’s true caption from a list of random captions. This allows the model to use language to more precisely understand the difference between “a shiba inu” and “a shiba inu wearing a beret,” rather than having a human labeler dictate in advance whether or not these belong in the same category.

In accomplishing the ‘pre-training task’ described above, CLIP is able to create a vector space whose dimensions represent both features of images and features of language. This shared vector space functionally provides models with an image-text dictionary of sorts, allowing them to translate (at least semantically) between one and the other.
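To make this a little more concrete, here is a minimal sketch of what CLIP-style contrastive training looks like in PyTorch. This is not OpenAI’s code; the encoders are stood in for by random tensors, and names like `clip_contrastive_loss` and the temperature value are placeholders. The point is the symmetric loss that pulls matching image-caption pairs together and pushes mismatched pairs apart.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize embeddings so the dot product becomes a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity between every image and every caption in the batch.
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th caption; every other pair is a negative.
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Symmetric loss: match images to captions and captions to images.
    loss_img_to_text = F.cross_entropy(logits, targets)
    loss_text_to_img = F.cross_entropy(logits.t(), targets)
    return (loss_img_to_text + loss_text_to_img) / 2

# Toy usage with random "embeddings" standing in for real encoder outputs.
batch_size, embed_dim = 8, 512
image_emb = torch.randn(batch_size, embed_dim)
text_emb = torch.randn(batch_size, embed_dim)
print(clip_contrastive_loss(image_emb, text_emb))
```

Because the loss only cares about which caption goes with which image, the two encoders are free to organize their outputs into the shared vector space described above.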

Still, this shared understanding does not represent all of DALL-E 2. Even with a way to translate English words into Spanish words, one would need to learn Spanish pronunciation and grammar before speaking. In the case of our model, CLIP enables us to take textual phrases and understand how they map onto images. But we still need a way to generate those images in a way that’s true to our understanding of the text. Enter our second Lego building block: the diffusion model.

Diffusion Models

Imagine you have a Rubik’s Cube that is perfectly solved, with each side having only blocks of a single color. Now, you randomly choose a side and twist it. Then another side, and another, and another, and so on until you consider the Rubik’s Cube ‘scrambled.’ How would you go about solving it? Suppose you forgot all the twists you performed: could there be a general way to solve the cube without that information? Well, you could just take it one step at a time, twisting whichever face gets you closer to having all the same color on the same side until it’s solved. More practically, you could train a neural network to go from a state of disorder to a state of less disorder.

This is the crux of how diffusion models are created. You take an image and gradually add random noise to it until you’re left with pure noise, then train a model to reverse the process, removing the noise step by step until you’re back at the original image (or something that resembles it). This leaves us with a model that can effectively generate information — an image — from randomness! And by showing the model new, random samples to start from, we can even get new images.
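A rough sketch of this training loop, again not OpenAI’s actual implementation, looks something like the following. The noise schedule, the helper names (`add_noise`, `training_step`), and the `model` (assumed to be some denoising network such as a U-Net) are all placeholders; the idea is simply to noise a clean image to a random step and ask the network to predict the noise that was added.

```python
import torch
import torch.nn as nn

# A simple linear noise schedule; real models use more carefully tuned schedules.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Forward process: blend a clean image x0 with Gaussian noise at step t."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    return x_t, noise

def training_step(model, x0, optimizer):
    """One training step: have the model predict the noise that was added."""
    t = torch.randint(0, T, (x0.shape[0],))
    x_t, noise = add_noise(x0, t)
    predicted_noise = model(x_t, t)
    loss = nn.functional.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At generation time, the trained model runs in reverse: start from pure noise and repeatedly subtract the predicted noise, one small step at a time, until an image emerges.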

To infuse these creations with the semantic meaning of our input text, we ‘condition’ the diffusion model on the CLIP embeddings. This simply means that we pass the vectors from our joint CLIP space described above into the diffusion model, which calculates what pixels to change at each ‘step’ of the generation process, allowing it to base those changes on the text’s meaning.
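One simple way to picture this conditioning (the real model is considerably more elaborate, using a U-Net with attention rather than the toy network below) is to hand the CLIP embedding to the denoising network alongside the noisy image and the timestep, so that every denoising step can consult the text’s meaning. The class name and dimensions here are hypothetical.

```python
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Toy denoiser that mixes a CLIP embedding into each step's prediction.
    A real model would be a U-Net with cross-attention, not a small MLP."""
    def __init__(self, image_dim, clip_dim, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + clip_dim + 1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, image_dim),
        )

    def forward(self, noisy_image, t, clip_embedding):
        # Concatenate the flattened noisy image, the timestep, and the CLIP
        # embedding, so the predicted noise depends on the text's meaning.
        t = t.float().unsqueeze(-1) / 1000.0
        x = torch.cat([noisy_image, clip_embedding, t], dim=-1)
        return self.net(x)
```

Because the CLIP embedding is present at every denoising step, the image that emerges from the noise is steered, step by step, toward the meaning of the caption.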

Of course, the technical details of the math and implementation are more complicated than I have presented here. But at its core, DALL-E 2 is the optimization and refinement of these two technologies.

It has been said that “art is the window to [wo]man’s soul.” For this reason, some people find generative images to be boring or uninteresting. To me though, generative art is almost more interesting: it is the window to the computer’s soul. In producing art, the model reveals its understandings, misunderstandings, and — most of all — its creativity.


I’m an undergrad at Stanford trying to learn more about AI and Venture Capital. I record my most interesting thoughts on Medium. On Twitter @DanielFein7