
Simple Implementation of OpenAI CLIP model: A Tutorial

A tutorial on a simple implementation of the CLIP model in PyTorch.

Summary of CLIP model's approach, from Learning Transferable Visual Models From Natural Language Supervision paper

Introduction

In January 2021, OpenAI announced two new models, DALL-E and CLIP, both multi-modal models that connect text and images in some way. In this article we are going to implement the CLIP model from scratch in PyTorch. OpenAI has open-sourced some of the code relating to CLIP, but I found it intimidating and far from short and simple. I also came across a good tutorial inspired by CLIP in the Keras code examples, and I translated some parts of it into PyTorch to build this tutorial entirely with our beloved PyTorch!

I’ve made all the code available as a notebook on Google Colab and Kaggle and I’ve also put it on my GitHub.

What does CLIP do? Why is it fun?

In the paper Learning Transferable Visual Models From Natural Language Supervision, OpenAI introduces a new model called CLIP, for Contrastive Language-Image Pre-training. In a nutshell, this model learns the relationship between a whole sentence and the image it describes, in the sense that once the model is trained, given an input sentence it can retrieve the images most related to that sentence. The important thing here is that it is trained on full sentences instead of single classes like car, dog, etc. The intuition is that when trained on whole sentences, the model can learn a lot more and find patterns between images and texts.

They also show that when this model is trained on a huge dataset of images and their corresponding texts, it can act as a classifier too. I encourage you to study the paper to learn more about this exciting model and its astonishing results on benchmark datasets. To mention just one, the paper reports that CLIP, used zero-shot, matches the ImageNet accuracy of the original ResNet-50 without using any of the labeled ImageNet training examples that model was trained on!

As a teaser (!), let’s see what the final model that we will build in this article from scratch is capable of: given a query like "a boy jumping with skateboard" or "a girl jumping from swing", the model will retrieve the most relevant images:

Final model's output for above mentioned text queries | Image by author

Stay Tuned 🙂

Getting Started

Okay. Let’s go straight to the PyTorch implementation. First of all, we need a dataset containing images and some text describing them. Frankly, there are lots of them available online. We are going to use the Flickr 8k dataset (you can use the 30k version, which is bigger, and the final model will perform better), which is mostly used for image captioning. But there is no limitation, and we can use it to train a CLIP model as well.

If you are using the Kaggle notebook that I’ve written, you do not need to download anything! The data is already there in ../input.

But if you are using Colab or want to download it to your local machine, the download code in the notebooks fetches the 8k dataset (or the 30k one if you un-comment the last lines) and unzips it. You need to enter your Kaggle username and key in the specified strings (simply create a Kaggle account if you don’t have one already!).

One thing to note about this dataset is that for each image there are 5 captions. I’ll talk about this later when writing the loss function!

Dataset

As you can see in the title image of this article, we need to encode both images and the texts describing them. So, the dataset needs to return both images and texts. Of course, we are not going to feed raw text to our text encoder! We will use the DistilBERT model (which is smaller than BERT but performs nearly as well) from the HuggingFace library as our text encoder; so, we need to tokenize the sentences (captions) with the DistilBERT tokenizer and then feed the token ids (input_ids) and the attention masks to DistilBERT. Therefore, the dataset needs to take care of the tokenization as well. The most important things happening in the dataset’s code are explained below.

A note on config and CFG: I wrote the code as Python scripts and then converted it into a Jupyter notebook. So, in the Python scripts, config is a normal Python file where I put all the hyperparameters, and in the Jupyter notebook, it’s a class (CFG) defined at the beginning of the notebook to keep all the hyperparameters. Check out the GitHub repo or the notebooks to see all the hyperparameters.
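As a rough idea of what such a configuration class might hold, here is a minimal sketch. Only the two embedding sizes, projection_dim and temperature are stated in this article; every other field name and value below is an illustrative assumption and may differ from what the notebooks actually use.

```python
class CFG:
    # Values stated in this article
    image_embedding = 2048   # size of the ResNet50 output vector
    text_embedding = 768     # size of the DistilBERT hidden state
    projection_dim = 256     # size of the shared embedding space
    temperature = 1.0        # scaling applied to the logits

    # Assumptions, for illustration only
    model_name = "resnet50"                          # image model identifier
    text_encoder_model = "distilbert-base-uncased"   # text checkpoint name
    max_length = 200                                 # caption length in tokens
    batch_size = 32
    dropout = 0.1


print(CFG.projection_dim, CFG.temperature)
```

Keeping everything on one class means the rest of the code can read hyperparameters as CFG.projection_dim and so on, whether it runs as a script or in a notebook.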

In __init__ we receive a tokenizer object, which is actually a HuggingFace tokenizer; this tokenizer will be loaded when running the model. We pad and truncate the captions to a specified max_length. In __getitem__ we first load an encoded caption, which is a dictionary with the keys input_ids and attention_mask, and make tensors out of its values. After that we load the corresponding image, transform and augment it (if there are any augmentations!), make it a tensor, and put it in the dictionary with "image" as the key. Finally, we put the raw text of the caption in the dictionary with the key "caption", only for visualization purposes.
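The steps described above can be sketched roughly as follows. This is not the notebook’s exact code: the load_image argument is a hypothetical callable (filename in, H x W x 3 array of floats out) that stands in for reading the file from disk and applying the transforms, and the default max_length is an assumption.

```python
import torch
from torch.utils.data import Dataset


class CLIPDataset(Dataset):
    """Pairs each image with one tokenized caption."""

    def __init__(self, image_filenames, captions, tokenizer, load_image,
                 max_length=200):
        self.image_filenames = list(image_filenames)
        self.captions = list(captions)
        # Tokenize all captions once, padding/truncating to max_length.
        self.encoded_captions = tokenizer(
            self.captions,
            padding="max_length",
            truncation=True,
            max_length=max_length,
        )
        # Hypothetical helper: filename -> (H, W, 3) float array.
        self.load_image = load_image

    def __getitem__(self, idx):
        # input_ids and attention_mask for this caption, as tensors.
        item = {
            key: torch.tensor(values[idx])
            for key, values in self.encoded_captions.items()
        }
        image = self.load_image(self.image_filenames[idx])
        # (H, W, C) -> (C, H, W), the layout PyTorch image models expect.
        item["image"] = torch.as_tensor(image, dtype=torch.float32).permute(2, 0, 1)
        # Raw text, kept only for visualization.
        item["caption"] = self.captions[idx]
        return item

    def __len__(self):
        return len(self.captions)
```

Because each image has several captions, image_filenames is expected to repeat a filename once per caption, so that the two lists line up index by index.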

I did not use additional data augmentations but you can add them if you want to improve the model’s performance.

Image Encoder

The image encoder code is straightforward. I’m using the PyTorch Image Models library (timm) here, which makes a lot of different image models available, from ResNets to EfficientNets and many more. Here we will use a ResNet50 as our image encoder. You can easily use the torchvision library for ResNets if you don’t want to install a new library.

The code encodes each image to a fixed size vector with the size of the model’s output channels (in the case of ResNet50 the vector size will be 2048). This is the output after the nn.AdaptiveAvgPool2d() layer.
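To make the pooling step concrete, here is a small sketch of an encoder that turns a feature map into one vector per image. To keep it light and free of downloads, it uses a tiny stand-in backbone with 16 output channels instead of a ResNet50; the class layout and the trainable flag are assumptions, not the notebook’s code.

```python
import torch
from torch import nn


class ImageEncoder(nn.Module):
    """Encodes a batch of images into one fixed-size vector per image."""

    def __init__(self, backbone, trainable=True):
        super().__init__()
        self.backbone = backbone              # (B, 3, H, W) -> (B, C, h, w)
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average -> (B, C, 1, 1)
        for param in self.backbone.parameters():
            param.requires_grad = trainable

    def forward(self, images):
        features = self.backbone(images)
        return self.pool(features).flatten(1)  # (B, C)


# Tiny stand-in backbone with 16 output channels, NOT a ResNet50.
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)
encoder = ImageEncoder(toy_backbone)
vectors = encoder(torch.randn(4, 3, 64, 64))
print(vectors.shape)  # torch.Size([4, 16])
```

With timm, a call such as timm.create_model("resnet50", pretrained=True, num_classes=0, global_pool="avg") should give a model that already returns the pooled 2048-dimensional vector, so the explicit pooling layer is only needed when the backbone returns a feature map.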

Text Encoder

As I mentioned before, I’ll use DistilBERT as the text encoder. Like its bigger brother BERT, two special tokens will be added to the actual input tokens: CLS and SEP, which mark the start and end of a sentence. To grab a representation of the whole sentence (as the related BERT and DistilBERT papers point out), we use the final representation of the CLS token, and we hope that this representation captures the overall meaning of the sentence (caption). Thought of this way, it is similar to what we did with the images when we converted them into fixed size vectors.

In the case of DistilBERT (and also BERT) the output hidden representation for each token is a vector with size 768. So, the whole caption will be encoded in the CLS token representation whose size is 768.
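The CLS selection is just a slice over the token axis. The sketch below shows it on dummy hidden states so that it runs without downloading a model; the function name and target_token_idx are names chosen here for illustration.

```python
import torch


def sentence_embedding(last_hidden_state, target_token_idx=0):
    """Pick one token's final hidden state as the sentence vector.

    last_hidden_state: (batch_size, sequence_length, hidden_size)
    target_token_idx:  position of the CLS token, which comes first
    """
    return last_hidden_state[:, target_token_idx, :]


# Dummy hidden states standing in for DistilBERT's output:
# 2 captions, 12 tokens each, hidden size 768.
last_hidden_state = torch.randn(2, 12, 768)
caption_vectors = sentence_embedding(last_hidden_state)
print(caption_vectors.shape)  # torch.Size([2, 768])
```

In the real encoder, the hidden states would come from the HuggingFace model, for example the last_hidden_state attribute of the output of a DistilBertModel called with input_ids and attention_mask.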

Projection Head

I based the PyTorch projection head on the implementation in the Keras code example.

Now that we have encoded both our images and texts into fixed size vectors (2048 for images and 768 for texts), we need to bring (project) them into a new world (!) with the same dimensions for both images and texts, so that we can compare them, push apart the non-matching images and texts, and pull together those that match. So, the projection head brings the 2048 and 768 dimensional vectors into a 256 (projection_dim) dimensional world, where we can compare them.
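One possible layout for such a projection head is sketched below: a linear projection into the shared space followed by a small residual block. The specific layers (GELU, dropout, layer normalization) and the dropout rate are assumptions made for illustration; only the input and output sizes come from this article.

```python
import torch
from torch import nn


class ProjectionHead(nn.Module):
    """Maps an encoder output to the shared projection_dim-sized space."""

    def __init__(self, embedding_dim, projection_dim=256, dropout=0.1):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)
        self.gelu = nn.GELU()
        self.fc = nn.Linear(projection_dim, projection_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(projection_dim)

    def forward(self, x):
        projected = self.projection(x)   # (B, embedding_dim) -> (B, projection_dim)
        x = self.gelu(projected)
        x = self.fc(x)
        x = self.dropout(x)
        x = x + projected                # residual connection
        return self.layer_norm(x)


image_projection = ProjectionHead(embedding_dim=2048)
text_projection = ProjectionHead(embedding_dim=768)
print(image_projection(torch.randn(4, 2048)).shape)  # torch.Size([4, 256])
print(text_projection(torch.randn(4, 768)).shape)    # torch.Size([4, 256])
```

Two separate heads are needed because the image and text encoders produce vectors of different sizes; after projection both live in the same 256-dimensional space.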

"embedding_dim" is the size of the input vector (2048 for images and 768 for texts) and "projection_dim" is the the size of the output vector which will be 256 for our case. For understanding the details of this part you can refer to the CLIP paper.

CLIP Model

This part is where all the fun happens! I’ll also talk about the loss function here. I translated some of the code from the Keras code examples into PyTorch for this part. The explanation below walks through the code.

Here we will use the previous modules that we built to implement the main model. The __init__ function is self-explanatory. In the forward function, we first encode the images and texts separately into fixed size vectors (with different dimensionalities). After that, using separate projection modules, we project them into the shared world (space) that I talked about previously. Here the encodings end up with the same shape (256 in our case). After that we compute the loss. Again, I recommend reading the CLIP paper to understand it better, but I’ll try my best to explain this part.

In linear algebra, one common way to measure whether two vectors have similar characteristics (are like each other) is to calculate their dot product (multiply the matching entries and take the sum of the products); if the final number is big, they are alike, and if it is small, they are not (relatively speaking)!
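A tiny worked example of this idea, with made-up vectors chosen only for illustration:

```python
def dot(u, v):
    """Multiply matching entries and sum the products."""
    return sum(a * b for a, b in zip(u, v))


a = [1.0, 2.0, 3.0]
b = [1.0, 2.0, 2.5]    # points in nearly the same direction as a
c = [-1.0, 0.5, -2.0]  # points in a rather different direction

print(dot(a, b))  # 1*1 + 2*2 + 3*2.5 = 12.5
print(dot(a, c))  # -1 + 1 - 6 = -6.0
```

The pair (a, b) gets a much larger score than (a, c), which is the "relatively speaking" comparison the loss function relies on.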

Okay! What I just said is the most important thing to keep in mind to understand this loss function. Let’s continue. We talked about two vectors, but what do we have here? We have image_embeddings, a matrix with shape (batch_size, 256), and text_embeddings with shape (batch_size, 256). Easy enough! It means we have two groups of vectors instead of two single vectors. How do we measure how similar two groups of vectors (two matrices) are to each other? Again, with the dot product (the @ operator in PyTorch does the dot product, or matrix multiplication in this case). To be able to multiply these two matrices together, we transpose the second one. Okay, we get a matrix with shape (batch_size, batch_size), which we will call logits. (temperature is equal to 1.0 in our case, so it does not make a difference. You can play with it and see what difference it makes. Also look at the paper to see why it is here!)
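The shapes are easy to check with random embeddings. In this sketch the embeddings are random stand-ins for the projected encoder outputs, and entry (i, j) of logits is the dot product of caption i with image j.

```python
import torch

torch.manual_seed(0)
batch_size, projection_dim = 4, 256
temperature = 1.0

# Random stand-ins for the projected image and text vectors.
image_embeddings = torch.randn(batch_size, projection_dim)
text_embeddings = torch.randn(batch_size, projection_dim)

# (batch_size, 256) @ (256, batch_size) -> (batch_size, batch_size)
logits = (text_embeddings @ image_embeddings.T) / temperature
print(logits.shape)  # torch.Size([4, 4])
```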

I hope you are still with me! If not, it’s okay, just review the code and check the shapes. Now that we have our logits, we need targets. I should say that there is a more straightforward way to obtain targets, but I had to do it this way for our case (I’ll explain why in a later paragraph).

Let’s consider what we hope that this model learns: we want it to learn "similar representations (vectors)" for a given image and the caption describing it. Meaning that whether we give it an image or the text describing it, we want it to produce the same 256-sized vector for both.

So, in the best case scenario, the text_embeddings and image_embeddings matrices should be the same because they are describing similar things. Let’s think now: if this happens, what would the logits matrix look like? Let’s see with a simple example!
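Here is one such example as a minimal sketch, using a single random matrix to play the role of both text_embeddings and image_embeddings:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, dim = 4, 256

# Best case: the text and image embeddings are identical.
embeddings = torch.randn(batch_size, dim)
logits = embeddings @ embeddings.T

# Each row's largest entry is the vector's dot product with itself,
# so the softmax puts almost all of the mass on the diagonal.
probabilities = F.softmax(logits, dim=-1)
print(probabilities)
```

The printed matrix should be very close to the 4 x 4 identity matrix, because a random 256-dimensional vector has a far larger dot product with itself than with the other vectors in the batch.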

So logits, in the best case, will be a matrix that, if we take its softmax, will have 1.0s on the diagonal (an identity matrix, to call it with fancy words!). As the loss function’s job is to make the model’s predictions similar to the targets (at least in most cases!), we want such a matrix as our target. That’s the reason why we calculate the images_similarity and texts_similarity matrices in the code.

Now that we’ve got our targets matrix, we will use simple cross entropy to calculate the actual loss. I’ve written the full matrix form of cross entropy as a function, which you can see at the bottom of the code block. Okay! We are done! Wasn’t it simple?! Alright, you can ignore the next paragraph, but if you are curious, there is an important note in it.
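Putting the pieces together, the loss described above could be computed roughly as follows. This is a sketch under assumptions: averaging the two similarity matrices to build the targets, multiplying them by temperature, and averaging the text-side and image-side losses are choices made here to match the description, and the names clip_loss and cross_entropy are illustrative.

```python
import torch
import torch.nn.functional as F


def cross_entropy(preds, targets):
    """Cross entropy against soft targets, one loss value per row."""
    return (-targets * F.log_softmax(preds, dim=-1)).sum(dim=1)


def clip_loss(image_embeddings, text_embeddings, temperature=1.0):
    logits = (text_embeddings @ image_embeddings.T) / temperature

    # Soft targets: near-identity, but duplicated images or captions
    # in the batch share the probability mass instead of competing.
    images_similarity = image_embeddings @ image_embeddings.T
    texts_similarity = text_embeddings @ text_embeddings.T
    targets = F.softmax(
        (images_similarity + texts_similarity) / 2 * temperature, dim=-1
    )

    texts_loss = cross_entropy(logits, targets)        # each caption vs. all images
    images_loss = cross_entropy(logits.T, targets.T)   # each image vs. all captions
    return ((images_loss + texts_loss) / 2.0).mean()


torch.manual_seed(0)
image_embeddings = torch.randn(8, 256)
text_embeddings = torch.randn(8, 256)
print(clip_loss(image_embeddings, text_embeddings).item())
```

The loss is small when each caption’s embedding lines up with its own image and large when the captions are matched to the wrong images, which is the behavior training relies on.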

Here’s why I didn’t use a simpler approach: I need to admit that there’s a simpler way to calculate this loss in PyTorch, by doing this: nn.CrossEntropyLoss()(logits, torch.arange(batch_size)). Why did I not use it here? For two reasons. First, the dataset we are using has multiple captions for a single image, so it is possible for the same image to appear twice in a batch, each time with one of its similar captions (it is rare, but it can happen). Taking the loss with the easier method ignores this possibility, and the model learns to pull apart two representations (assuming they are different) that are actually the same. Obviously, we don’t want this to happen, so I calculated the whole target matrix in a way that takes care of these edge cases. Second, doing it the way I did gave me a better understanding of what is happening in this loss function, so I thought it would give you a better intuition as well!
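The relation between the two approaches can be checked directly: when the target matrix is exactly the identity (no duplicates in the batch), the matrix form of cross entropy gives the same value as the one-liner. A small sketch with random logits:

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size = 4
logits = torch.randn(batch_size, batch_size)

# Simple approach: caption i should match image i.
simple = nn.CrossEntropyLoss()(logits, torch.arange(batch_size))

# Matrix form with an identity target matrix.
targets = torch.eye(batch_size)
matrix_form = (-targets * F.log_softmax(logits, dim=-1)).sum(dim=1).mean()

print(torch.allclose(simple, matrix_form))  # True
```

The two only diverge when the targets are not one-hot, for instance when the same image appears twice in a batch and the target mass for its captions is shared between both copies.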

Train

Here’s a handy function to train our model. There’s not much happening here; just loading the batches, feeding them to the model and stepping the optimizer and lr_scheduler.
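A minimal sketch of such a training function is shown below. It assumes that the model’s forward pass takes the batch dictionary and returns the scalar loss, and that the scheduler (if any) is one that is stepped once per batch; the function name and its arguments are illustrative.

```python
def train_epoch(model, train_loader, optimizer, lr_scheduler=None, device="cpu"):
    """Run one pass over train_loader and return the average loss per sample."""
    model.train()
    loss_sum, sample_count = 0.0, 0
    for batch in train_loader:
        # Move tensors to the device; the raw "caption" strings are dropped.
        batch = {k: v.to(device) for k, v in batch.items() if k != "caption"}
        loss = model(batch)  # assumption: forward() returns the scalar loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if lr_scheduler is not None:
            lr_scheduler.step()  # assumption: stepped once per batch

        count = batch["image"].size(0)
        loss_sum += loss.item() * count
        sample_count += count
    return loss_sum / sample_count
```

A validation pass would look the same without the backward and optimizer steps, wrapped in torch.no_grad() and with model.eval() instead of model.train().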

There are some more utility functions and classes (like AvgMeter and get_lr) that you can find in the Colab and Kaggle notebooks or in my GitHub repo.

Okay! We are done with training the model. Now we need to do inference, which in our case means giving the model a piece of text and having it retrieve the most relevant images from an unseen validation (or test) set.

Getting Image Embeddings

In this function, we load the model that we saved after training, feed it the images in the validation set, and return the image_embeddings with shape (valid_set_size, 256) along with the model itself.
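The embedding step might be sketched as follows. For simplicity this version takes an already loaded model rather than loading the checkpoint itself, and it assumes the model exposes its parts under the attribute names image_encoder and image_projection; both are assumptions made for illustration.

```python
import torch


@torch.no_grad()
def get_image_embeddings(model, valid_loader, device="cpu"):
    """Embed every validation image into the shared 256-dim space."""
    model.eval()
    chunks = []
    for batch in valid_loader:
        # Assumed attribute names for the encoder and its projection head.
        features = model.image_encoder(batch["image"].to(device))
        chunks.append(model.image_projection(features))
    return torch.cat(chunks)  # (valid_set_size, projection_dim)
```

Restoring the trained weights beforehand would typically be done with model.load_state_dict(torch.load(path, map_location=device)), where path points to the checkpoint saved during training.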

Finding Matches

This function does the final task that we wished our model would be capable of: it gets the model, image_embeddings, and a text query. It will display the most relevant images from the validation set! Isn’t it amazing? Let’s see how it performs after all!
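The retrieval step itself comes down to a similarity search. The sketch below assumes the text query has already been tokenized, encoded, and projected into a (1, 256) text_embedding; normalizing both sides (so the score is a cosine similarity) and returning filenames instead of plotting the images are simplifications chosen here.

```python
import torch
import torch.nn.functional as F


def find_matches(text_embedding, image_embeddings, image_filenames, n=9):
    """Return the filenames of the n images most similar to the query."""
    # L2-normalize so the dot product becomes a cosine similarity.
    images_n = F.normalize(image_embeddings, p=2, dim=-1)
    text_n = F.normalize(text_embedding, p=2, dim=-1)

    similarity = text_n @ images_n.T          # (1, num_images)
    k = min(n, len(image_filenames))
    _, indices = torch.topk(similarity.squeeze(0), k=k)
    return [image_filenames[i] for i in indices.tolist()]


# Toy check with 2-dimensional embeddings.
images = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = torch.tensor([[0.0, 2.0]])
print(find_matches(query, images, ["a.jpg", "b.jpg", "c.jpg"], n=2))
# ['b.jpg', 'c.jpg']
```

Since every image appears once per caption in this dataset, a real implementation may also want to skip duplicate filenames so that the same picture is not shown several times.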

Let’s see some examples! At this point, when I saw the outputs, I screamed out of happiness and shock that the model is actually learning this relationship between images and texts! The feeling was just incredible.

This is how we use this function. Aaaannnndddd the results:

Image by author

I was just like: Wow! This model knows something! Of course it is not perfect because there are two dogs in some of the pictures but considering the small training set and short training time, I think it’s wonderful!

Let’s see some other outputs. The queries are written at the top of each image.

Image by author

See! It is also capable of counting! Compare this to the previous one. The model knows the meaning of "two" and brings images that have two dogs in them, in contrast to the previous query! At this moment I screamed out of shock a second time 🙂

Our outputs from the beginning of the article:

Image by author
Image by author

For the next one, there are some mistakes the model is making but overall, it obviously has a good understanding of both texts and images.

Image by author

Final words

I hope you have enjoyed this article. Implementing this paper was a really interesting experience for me. I want to thank Khalid Salama for the great Keras code example he provided which inspired me to write something similar in PyTorch.

As mentioned in the article, all the code and results are available in my GitHub repo and also as Jupyter notebooks on Kaggle and Colab.

