OpenAI’s DALL-E and CLIP 101: a brief introduction

After GPT-3, OpenAI returns with two models that combine text and images. Will DALL-E be the protagonist of 2021 in the field of AI?

David Pereira
Towards Data Science


Photo taken by David Pereira at the Dalí museum in Figueres.

While the community is still discussing one of 2020's big AI announcements, GPT-3, whose paper was published on July 22nd, 2020, 2021 has just begun and we already have two impressive new neural networks from OpenAI: CLIP and DALL-E.

Both CLIP and DALL-E are multimodal neural networks, and their creators claim them to be “a step toward systems with deeper understanding of the world”.

Multimodal neural networks: what are they?

Our experiences as humans are multimodal, meaning that we receive inputs from the world surrounding us in different formats (sound, image, odors, textures, etc.) that we combine using different senses (touch, sight, hearing, smell and taste) to produce learnings and to retain information.

So how can we define what a modality is and when a problem is multimodal? Let me refer here to a good definition that can be found in a well-known multimodal machine learning paper, called “Multimodal Machine Learning: A Survey and Taxonomy”:

Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities.

In Deep Learning, it is very common to train models on only one data format (a single modality). As an example, DeepMind's AlphaFold 2 tackles the protein folding problem by transforming the information about how amino acids interact with each other within a given one-dimensional sequence into a correlation matrix, which can be represented as an image. Likewise, AlphaFold's output is another image, this time representing the distances between amino acids within the protein, which are closely related to its 3D structure. DeepMind therefore turned protein folding into an image-to-image machine learning problem.

Overall view of the AlphaFold solution for predicting protein folding. Photo credit: Nature. Source: https://www.nature.com/articles/s41586-019-1923-7
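To make that image-to-image framing a bit more tangible, here is a tiny sketch (my own illustration, not DeepMind's pipeline) of how a chain of residues in 3D can be flattened into a pairwise distance matrix, which can then be treated exactly like a single-channel image. The coordinates are made up.

```python
import numpy as np

# Hypothetical 3D coordinates for a short chain of residues (one row per residue).
coords = np.array([
    [0.0, 0.0, 0.0],
    [1.5, 0.2, 0.1],
    [2.9, 1.1, 0.3],
    [3.4, 2.6, 0.2],
])

# Pairwise Euclidean distances between residues: an (N x N) matrix that can be
# rendered and processed like a single-channel image.
diff = coords[:, None, :] - coords[None, :, :]
distance_map = np.sqrt((diff ** 2).sum(axis=-1))

print(distance_map.round(2))
```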

So, finally, how can we define multimodality in Machine Learning? I quite like the introduction and definition from the video below:

“Multimodality (in machine learning) occurs when two or more heterogeneous inputs are processed by the same machine learning model, only if these inputs cannot be mapped unambiguously into one another by an algorithm”. Source: https://www.youtube.com/channel/UCobqgqE4i5Kf7wrxRxhToQA

So now that we know what a multimodal neural network is, let us explore what OpenAI’s new multimodal neural networks are trained to do.

DALL-E: creating images from captions expressed in natural language

So, the first of OpenAI's two new neural networks, DALL-E (inspired by the famous surrealist artist Salvador Dalí) is a 12-billion-parameter version of GPT-3, trained to generate images from a text description. It uses the same transformer architecture. As its creators say in their introductory blog post, "It receives both the text and the image as a single stream of data containing up to 1280 tokens and is trained using maximum likelihood to generate all of the tokens, one after another".
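DALL-E's code had not been released when this article was written, so the following is only my own rough sketch of what "a single stream of tokens trained with maximum likelihood" means in practice: text tokens and image tokens are concatenated into one sequence, and a small causal transformer learns to predict each next token. All sizes, names and the random tokens below are illustrative.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary and model sizes; DALL-E itself works on a stream of up to 1280 tokens.
TEXT_VOCAB, IMAGE_VOCAB, D_MODEL = 1000, 512, 64


class TinyTextToImageLM(torch.nn.Module):
    """A drastically simplified stand-in for a decoder-only transformer
    (positional encodings and many other details omitted for brevity)."""

    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D_MODEL)
        layer = torch.nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = torch.nn.TransformerEncoder(layer, num_layers=2)
        self.to_logits = torch.nn.Linear(D_MODEL, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Causal mask: each position may only attend to earlier positions.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.backbone(self.embed(tokens), mask=mask)
        return self.to_logits(hidden)


# One fake training example: text tokens followed by image tokens, as a single stream.
text_tokens = torch.randint(0, TEXT_VOCAB, (1, 16))
image_tokens = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (1, 32))
stream = torch.cat([text_tokens, image_tokens], dim=1)

model = TinyTextToImageLM()
logits = model(stream[:, :-1])  # predict token t+1 from tokens up to t
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), stream[:, 1:].reshape(-1))
```

At generation time, the same idea runs in reverse: the text prompt is fed in first and the image tokens are then sampled one after another.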

Some of the demo examples available on OpenAI's DALL-E website are amazing (you can change some of the text input parameters to see how the output is affected). Look at the example below, in which you can change all the underlined parameters:

OpenAI DALL-E demo, source: https://openai.com/blog/dall-e/

In their blog, OpenAI lists some of the capabilities of the new neural network:

  • Controlling attributes (for example, changing colors, shapes, number of repetitions of the image in the output)
  • Drawing multiple objects (which requires the algorithm to handle relative positioning and stacking, among other things)
  • Perspective and three-dimensionality
  • Contextual detail inference
  • Visualizing internal and external structure (OpenAI’s blog shows a cross section of a walnut as an example).
  • Applying the preceding capabilities to new domains
  • Combining unrelated elements (describing both real and imaginary concepts)
  • Zero-shot visual reasoning (which the authors admit they did not anticipate)
  • Geographic knowledge
  • Temporal knowledge

Cautions

From a technical perspective, there are still many doubts about this new model. As Dr. Gary Marcus points out, we are watching a very powerful preview, but as happened with GPT-3, at the time this article was written there was no paper or truly open demo environment that would allow a deep analysis of the solution.

From a social impact perspective, and besides the obvious effects DALL-E can have on some professions and processes (e.g. those related to stock photography), OpenAI mentions in their blog that they “plan to analyze how models like DALL·E relate to societal issues […], the potential for bias in the model outputs, and the longer-term ethical challenges implied by this technology”. As the saying goes, an image is worth a thousand words, and we should take very seriously how tools like this could affect the spread of misinformation in the future, among other problems such as how to recognize the value of the training data behind this kind of algorithm, as shown in the tweet below.

CLIP

The second of OpenAI’s new multimodal neural networks is called CLIP (Contrastive Language-Image Pre-training). As OpenAI mentions, current computer vision approaches present two challenges:

  • Datasets are labor-intensive and costly to create
  • Standard computer vision models require significant effort to adapt to new tasks

In order to solve these challenges, CLIP is trained not on labeled image datasets but on images and their descriptions (captions) taken from the internet. Using capabilities similar to GPT-3’s zero-shot approach, CLIP can be instructed in natural language to perform classification benchmarks. For OpenAI, that is the key, as mentioned in their introductory blog post (see the quote below):

“By not directly optimizing for the benchmark, we show that it becomes much more representative: our system closes this “robustness gap” by up to 75% while matching the performance of the original ResNet-50 on ImageNet zero-shot without using any of the original 1.28M labeled examples.”
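OpenAI has open-sourced a CLIP implementation and pre-trained weights on GitHub. Assuming that clip package is installed, zero-shot classification boils down to comparing an image against a set of natural-language captions; in the minimal sketch below, the candidate captions and the image path are placeholders of my own.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# The "labels" are just natural-language captions; swap in any classes you like.
captions = ["a photo of a dog", "a photo of a cat", "a photo of a painting"]
text = clip.tokenize(captions).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.2%}")
```

No fine-tuning happens here: changing the task is as simple as changing the list of captions.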

OpenAI has shown with CLIP that a simple pre-training task is sufficient for the model to perform very well on a wide range of datasets. That pre-training consists of predicting which caption, out of a set of 32,768 randomly sampled text snippets, is associated with a given sample image. To do that, OpenAI used supervision from text paired with images found on the internet.
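Conceptually, that contrastive pre-training objective can be sketched as follows. Random tensors stand in for the image and text encoders here, and the batch size is a toy one (this is only an illustration, not OpenAI's implementation): every image in a batch should end up most similar to its own caption, and a symmetric cross-entropy loss penalizes any other pairing.

```python
import torch
import torch.nn.functional as F

batch, dim = 8, 512      # toy batch; the article mentions 32,768 candidate text snippets
temperature = 0.07       # illustrative value

# Stand-ins for encoder outputs: image_features[i] and text_features[i] are a true pair.
image_features = F.normalize(torch.randn(batch, dim), dim=-1)
text_features = F.normalize(torch.randn(batch, dim), dim=-1)

# Cosine similarity between every image and every caption in the batch.
logits = image_features @ text_features.t() / temperature

# The correct caption for image i is caption i, so the targets are the diagonal.
targets = torch.arange(batch)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```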

To show how CLIP can generalize, following a zero-shot approach, OpenAI has shared the following graph. As they mention in their introductory blog post, “The best CLIP model outperforms the best publicly available ImageNet model, the Noisy Student EfficientNet-L2, on 20 out of 26 different transfer datasets we tested.”

source: https://openai.com/blog/clip/

Conclusions

2021 starts with big announcements from OpenAI in the field of AI. As 2020 was the year of GPT-3, will this be the year of DALL-E? Thinking about the potential implications of this particular model, I don’t expect OpenAI to roll out public access to it any time soon, but it will be interesting to watch the different applications developers create, as happened with GPT-3.

On the other hand, even if GPT-3 has no idea what it is talking about and DALL-E has no idea what it is drawing, both models show how far Deep Learning has progressed in the last few years. It is amazing to think that AlexNet is only 8 years old, or how much DALL-E improves on the images created by neural networks only 5 years ago.

Future implications for creative jobs are still to be unveiled, and one might argue whether some of DALL-E’s creations can be called art, because neither we nor the model itself knows what is going on. But let us be honest for a second: can we say that we know what was going on in Dalí’s mind?

“Triple autoportrait de Salvador Dali” by tempslink1 is marked with CC0 1.0

If you enjoyed reading this piece, please consider becoming a Medium member to get full access to every story while supporting me and other writers on Medium.
