Unimodal models are designed to work with data from a single modality, such as text or images. These models specialize in understanding and generating content specific to their modality. For example, GPT models are excellent at generating human-like text and have been used for tasks like language translation, text generation, and question answering. Convolutional Neural Networks (CNNs) are examples of image models that excel at tasks like image classification, object detection, and image generation. However, many interesting tasks, such as Visual Question Answering (VQA) and image-text retrieval, require multimodal capabilities. Is it possible to combine text and image processing? We can! CLIP stands out as one of the first highly successful image-text models, demonstrating proficiency in both image recognition and text comprehension.
We will divide this article into the following sections:
- Introduction
- Architecture
- Training process and Contrastive loss
- Zero-shot capability
- CuPL
- Conclusions
Introduction
The CLIP model is an impressive zero-shot predictor, enabling predictions on tasks it hasn’t explicitly been trained for. As we will see in more detail in the next sections, by using natural language prompts to query images, CLIP can perform image classification without requiring task-specific training data. Nevertheless, its performance can be significantly enhanced with a few tricks. In this series of articles, we will explore methods that leverage additional prompts generated by Large Language Models (LLMs) or a few-shot training examples, without involving any parameter training. These approaches offer a distinct advantage as they are computationally less demanding and do not require fine-tuning additional parameters.
Architecture
CLIP is a dual encoder model with two separate encoders for the visual and textual modalities that encode images and texts independently. This architecture differs from a fusion encoder, which enables interaction between the visual and textual modalities through cross-attention. Cross-attention involves learning attention weights that help the model focus on specific regions of an image and the corresponding parts of the text when processing both modalities. The idea is similar to self-attention, which allows each token to attend to other tokens within the same modality; cross-attention extends this concept by allowing tokens in one modality (e.g., patches representing image features) to attend to tokens in another modality (e.g., tokens representing textual descriptions). The idea of dual and fusion encoders can be summarized as follows:

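To make the distinction concrete, here is a minimal PyTorch sketch with random tensors standing in for real features, so the shapes and names are purely illustrative: the dual encoder only compares two pooled vectors, while the fusion encoder lets the modalities interact through cross-attention before any score is computed.
import torch
import torch.nn as nn

embed_dim = 512
image_tokens = torch.randn(1, 50, embed_dim)  # e.g. ViT patch embeddings
text_tokens = torch.randn(1, 12, embed_dim)   # token embeddings of a caption

# Dual encoder (CLIP-style): pool each modality independently,
# then compare the two global vectors with cosine similarity.
image_vec = image_tokens.mean(dim=1)
text_vec = text_tokens.mean(dim=1)
dual_score = torch.cosine_similarity(image_vec, text_vec)

# Fusion encoder: text tokens attend to image tokens via cross-attention,
# so the modalities are mixed inside the encoder itself.
cross_attention = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
fused_text, _ = cross_attention(query=text_tokens, key=image_tokens, value=image_tokens)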
Encoders
Text Encoder: Responsible for processing the input text, the text encoder transforms it into a vector representation. Within CLIP, the text encoder is a standard Transformer that we thoroughly explored in this article. It produces an embedding for the provided text, encapsulating the semantic information associated with the input.
Image Encoder: The image encoder processes images to derive their vector representations. The visual encoder can be either a Convolutional Neural Network, like a ResNet, or a Vision Transformer (ViT; see here to refresh your knowledge) that produces the image vector representations.
These two vectors share the same dimensionality, enabling the computation of a similarity between a given text and image. If you have always worked with a single modality, you may wonder how it is even possible to compare image and text embeddings. The key lies in the training process and the loss function, which empower CLIP to learn a unified image-text space, facilitating the comparison of vectors from different modalities.
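As a quick sanity check, here is a small sketch (using the openai/clip-vit-base-patch32 checkpoint that we also use later, and a placeholder blank image) showing that both encoders project into the same 512-dimensional space, so a cosine similarity between an image and a caption is well defined:
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder image; use a real one in practice
inputs = processor(text=["a photo of a dog"], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    text_emb = model.get_text_features(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

print(text_emb.shape, image_emb.shape)  # both torch.Size([1, 512])
similarity = torch.cosine_similarity(image_emb, text_emb)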
Training process and Contrastive loss
CLIP was trained with a large-scale multimodal objective on an extensive dataset of image-text pairs. When I say large-scale, I mean a LOT of data – approximately 400 million image-text pairs. These were collected from publicly available sources on the internet and automatically filtered to ensure high quality.
Once the image-text pairs are collected, the model is trained with a contrastive loss. The contrastive loss enables the model to learn a shared image-text space by aligning the representations of images and text: it maximizes the similarity between the embeddings of matching pairs and minimizes the similarity between non-matching pairs. The process is presented in the image below:

The image embedding I_i corresponds to the text embedding T_i (i.e., on the diagonal), forming the matching pair, while all other texts T_j (j ≠ i) (off-diagonal) are considered non-matching pairs. Similarly, for T_i only I_i is regarded as the matching image, and all other images I_j (j ≠ i) are not considered descriptions of T_i. This assumption can be limiting, as there could be other texts in the batch that effectively describe an image and vice versa. Mining hard negative examples is a potential solution to this challenge; CLIP nevertheless manages to overcome this limitation thanks to its substantial batch size of 32,768.
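In code, the objective is compact. Below is a minimal PyTorch sketch of the symmetric contrastive (InfoNCE-style) loss, in the spirit of the pseudocode from the CLIP paper; it assumes the image and text embeddings have already been projected into the shared space, and it uses a fixed temperature where CLIP actually learns one.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch_size, dim); row i of each forms a matching pair
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # pairwise cosine similarities, scaled by the temperature
    logits = image_emb @ text_emb.t() / temperature

    # matching pairs sit on the diagonal
    labels = torch.arange(logits.size(0), device=logits.device)

    # symmetric cross-entropy: image -> text (rows) and text -> image (columns)
    loss_images = F.cross_entropy(logits, labels)
    loss_texts = F.cross_entropy(logits.t(), labels)
    return (loss_images + loss_texts) / 2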
After pre-training on this diverse dataset, CLIP’s learned embeddings can be used for many downstream applications – one of them that is truly impressive is the zero-shot image classification.
Zero-shot capability
What does zero-shot mean in the first place? As mentioned in the introduction, zero-shot classification refers to the ability of a model to correctly classify unseen classes without the need for any specific examples or training data for those particular classes. CLIP was trained on a large dataset and has learned to generalize across a broad range of concepts, which enables it to recognize and classify classes based on their semantic relationships. Let’s see how this is done in practice:
Assume we only know the class names for a particular dataset, e.g. ["dog", "cat", "horse"]. Since CLIP was trained to match images and texts, we can compute the cosine similarity between a given test image and the prompt "Picture of a {class name}", which in our case becomes: "Picture of a dog", "Picture of a cat", "Picture of a horse".
The prompt with the highest cosine similarity represents the predicted class.
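Sketched in code (with a hypothetical local image file standing in for a real test image), the whole procedure reduces to picking the prompt with the highest image-text score:
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["dog", "cat", "horse"]
prompts = [f"Picture of a {name}" for name in class_names]

image = Image.open("test_image.jpg")  # hypothetical test image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
logits_per_image = model(**inputs).logits_per_image  # scaled cosine similarities
predicted_class = class_names[logits_per_image.argmax(dim=1).item()]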
Improving Zero-Shot CLIP with Customized Prompts via Language Models (CuPL)
Zero-shot CLIP already achieves quite impressive performance; however, we can squeeze some more out of it with a few simple tricks. CLIP’s zero-shot performance is very sensitive to the textual prompts it is fed. This is why, for datasets such as ImageNet, people have come up with hand-crafted prompts such as "an origami {class name}", "a {class name} in a video game", and similar. These manually designed prompts work better than a simple "Picture of a {class name}", but they still have some major limitations:
- Hand-written prompts require a lot of human effort
- Hand-written prompts must be general – we cannot use a template like "a photo of a {platypus}, a type of aquatic mammal" as it would apply only to aquatic mammals and not to other categories. This is limiting, as descriptive details are useful for fine-grained classification.
- Writing high-performing prompt templates requires prior information about the contents of the dataset. So in the case of ImageNet, we must know in advance that the dataset of interest contains origami, video game images, and so on.
So what can we do? We can simply ask a Large Language Model (LLM) to generate such prompts for us, an approach that easily scales to any number of classes and datasets. For example, we can ask an LLM the following questions:
- Describe what a/the {class name} looks like:
- Describe a/the {class name}:
- What are the identifying characteristics of a/the {class name}?
Why should this be better than simple prompts? The hypothesis is that the prompts an LLM generates will contain very detailed descriptions of the given classes, enabling CLIP to place more importance on the image regions that are most relevant for correct classification.
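As an illustration, here is a hedged sketch of how such prompts could be collected with the openai Python client; the model name and sampling settings are placeholders (the CuPL paper used GPT-3 and sampled several completions per question):
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

questions = [
    "Describe what a tree frog looks like:",
    "Describe a tree frog:",
    "What are the identifying characteristics of a tree frog?",
]

prompts = []
for question in questions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": question}],
        n=10,                 # several completions per question, as in CuPL
        temperature=0.9,
    )
    prompts.extend(choice.message.content for choice in response.choices)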
Let’s now jump into coding and see what we can get. We are going to use CLIP from Hugging Face’s Transformers library. So let’s import the model – we are going to use a ViT backbone with patch size 32 – and the processing pipeline that tokenizes text and pre-processes images:
from transformers import CLIPProcessor, CLIPModel
import torch
import requests
from PIL import Image
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
Next, we are going to download an image of a "tree frog", a class contained in the ImageNet dataset, from the freeimages website, which provides an open license for the image below. Then, we are going to predict whether it is a "tree frog" or a "tailed frog" (they are visually similar and mainly differ in the size of their eyes) using CLIP with the simple prompt "A photo of a {class}":

url = "https://images.freeimages.com/images/large-previews/342/green-tree-frog2-1616738.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a tree frog", "a photo of a tailed frog"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
print(probs)
"""
Output:
tensor([[0.3164, 0.6836]], grad_fn=<SoftmaxBackward0>)
"""
The model makes an incorrect prediction by selecting "tailed frog" with a probability of 0.68.
Let’s now ask an LLM (e.g. ChatGPT) to generate the prompts for us:
prompts = {"tree frog": [
"A tree frog is a small frog that typically has greenish coloration.",
"A tree frog is a small frog that typically has bright colors, long toes that help it climb, and suction cups on its feet.",
"A tree frog is small, typically green, frog that lives in trees.",
"A tree frog looks like a frog with special adaptations for living in trees.",
"A tree frog is a small, typically green frog with large adhesive pads on its feet that allow it to climb smooth surfaces like glass and plastic.",
"Tree frogs are small amphibians with big toes that help them climb.",
" Most tree frogs have bright colors.",
"Tree frogs are small frogs that live in trees.",
"A tree frog is a small frog that has large toe pads that help it climb trees.",
"A tree frog typically has green skin, although some species can be brown, gray, or yellow.",
"A tree frog is a small frog that lives in trees.",
"A tree frog is a small, tailless amphibian with large, powerful hind legs and webbed feet.",
"A tree frog is a amphibian that has well-developed hind legs which enable it to climb trees and other structures.",
"Avatar of the forest, the tree frog is a small amphibian with big eyes bulging out of its head.",
"A tree frog is a small amphibian that typically has bright green skin and lives in trees.",
"A tree frog is a small, slim frog that typically has brightly colored skin.",
"A tree frog is a small frog that typically has a bright green body and lives in trees.",
"A tree frog is a small, tailless amphibian that typically has bright green or yellowish skin and lives in trees or near bodies of water.",
"A tree frog is a small amphibian that typically has a green body and large eyes.",
"A tree frog is a small frog that typically has bright colors.",
"The identifying characteristics of a tree frog varies depending on the species, but some common features include large adhesive toes, protruding eyes, and bright colours.",
"Some identifying characteristics of a tree frog are that they have large toe pads, which help them grip onto tree branches, and their bodies are slim so that they can fit into small spaces.",
"Tree frogs are small frogs that live in trees and other high places.",
"The identifying characteristics of a tree frog are its long hind legs, which it uses to jump, and its adhesive pads, which it uses to stick to surfaces.",
"The identifying characteristics of a tree frog are that they have long, sticky toes that help them climb trees, and they have wrinkled skin that helps them absorb water.",
"Tree frogs have long, sticky toes that help them climb trees.",
"Tree frogs are small frogs that can climb trees.",
"Tree frogs have long hind legs that they use to jump.",
"There are over 6,300 species of tree frogs, so it is difficult to give one answer to this question.",
"The identifying characteristics of a tree frog are its long, sticky toes that help it climb trees, and its dark green or brown coloration that helps it blend in with leaves."
],
"tailed frog": [
" short, stout body; webbed hind feet with large, adhesive discs on the toes; long, muscular tail; small eyes located on top of the head; smooth or warty skin; and a small mouth.",
"A tailed frog has a long, skinny body and a long tail.",
"A tailed frog has a long, slender body with a tail that is about as long as its body.",
"A tailed frog is a frog with a long tail.",
"A tailed frog has a long tail and four legs.",
"A tailed frog is a small frog that has a long tail.",
"A tailed frog has a long tail that is often as long as its body.",
"A tailed frog has a long tail and webbed feet.",
"A tailed frog is a frog with a long tail.",
"A tailed frog is a small frog with a long tail.",
"A tailed frog (Asteriscus species) is a species of frog in the Asteriscidae family.",
"A tailed frog is a type of frog that has a long tail.",
"A tailed frog is a frog with a long tail, typically over 10 cm in length.",
"A tailed frog is a species of frog that has a long tail.",
"Tailed frogs are a type of frog that have a long tail.",
"A tailed frog is a type of frog that has a long, tail-like structure protruding from its back.",
"A tailed frog is a frog that has a long tail.",
"A tailed frog is a frog that has a long tail.",
"A tailed frog has a long, thin body with short legs.",
"A tailed frog is a small amphibian that has a long tail.",
"Some identifying characteristics of a tailed frog are that they have a long tail, they are good swimmers, and they live near water.",
"There are over 60 species of tailed frogs, so it is difficult to give a definitive answer.",
"There are over 100 species of tailed frogs, so it is difficult to give a general answer to this question.",
"Some identifying characteristics of a tailed frog are that they have a long tail, they are small, and they have webbed feet.",
"There are over 60 species of tailed frog, so identifying characteristics can vary.",
"Tailed frogs are a species of frog that are native to the western United States and northern Mexico.",
"The identifying characteristics of a tailed frog are its tail, which is used for swimming, and its webbed feet.",
"Tailed frogs are small, dark-colored frogs with long, slender hind legs and a long, thin tail.",
"Some tailed frogs have a tail that is about one-third the length of their body.",
"The identifying characteristics of a tailed frog are its long tail and its smooth, moist skin."
]}
These prompts are more informative about the classes which should guide the model in identifying the correct class. For example, many prompts for "tree frog" emphasize that it has "large eyes", something that simple prompts like "Picture of a tree frog" do not capture.
Using the above prompts, and after some manipulation, we form the final text embedding for each class as the average of its prompt embeddings:
"""
First of all, we can verify that
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
probs
is the same as:
image_features = model.visual_projection(model.vision_model(inputs['pixel_values']).pooler_output)
text_features = model.text_projection(model.text_model(inputs['input_ids']).pooler_output)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
# cosine similarity as logits
logit_scale = model.logit_scale.exp()
logits_per_image = logit_scale * image_features @ text_features.t()
probs = logits_per_image.softmax(dim=1)
probs
"""
image_features = model.visual_projection(model.vision_model(inputs['pixel_values']).pooler_output)
tree_frog_vector = model.text_model(processor(prompts['tree frog'], return_tensors="pt", padding=True)['input_ids']).pooler_output
# take the mean prompt embedding
tree_frog_vector = tree_frog_vector.mean(dim=0, keepdims=True)
# final projection
tree_frog_vector = model.text_projection(tree_frog_vector)
tailed_frog_vector = model.text_model(processor(prompts['tailed frog'], return_tensors="pt", padding=True)['input_ids']).pooler_output
# take the mean prompt embedding
tailed_frog_vector = tailed_frog_vector.mean(dim=0, keepdims=True)
# final projection
tailed_frog_vector = model.text_projection(tailed_frog_vector)
# concatenate
text_features = torch.cat([tree_frog_vector, tailed_frog_vector], dim=0)
# normalize features
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
# cosine similarity as logits
logit_scale = model.logit_scale.exp()
logits_per_image = logit_scale * image_features @ text_features.t()
probs = logits_per_image.softmax(dim=1)
print(probs)
"""
Output:
tensor([[0.6512, 0.3488]], grad_fn=<SoftmaxBackward0>)
"""
Using the prompts generated by the LLM gives us the correct classification – "tree frog".
Conclusions
In this article, we have seen how easily we can improve CLIP’s zero-shot predictions using an LLM. The advantage of this solution is not only higher accuracy but also its scalability, as we do not need any human effort to write the prompts. In the next articles, we are going to explore other methods to improve CLIP’s zero-shot learning as well as training-free few-shot learning methods.
References
[1] Learning Transferable Visual Models From Natural Language Supervision, arXiv:2103.00020
[2] What does a platypus look like? Generating customized prompts for zero-shot image classification, arXiv:2209.03320
[3] CLIP, Hugging Face Transformers documentation (huggingface.co)