🚀 Sascha’s Paper Club
Today we will dive into a paper that builds upon the great success of CLIP in language-image pre-training and extends it to the task of object detection: GLIP – Grounded Language-Image Pre-training. We will cover the key concepts and findings of the paper and make them easy to understand by providing further context and adding annotations to images and experiment results. Let’s go!

Paper: Grounded Language-Image Pre-training by Liunian Harold Li et al., 7 Dec. 2021
Resources: GitHub
Category: representation learning, object detection, phrase-grounding, foundation models
Other Walkthroughs: [BYOL] – [CLIP] – [Depth Anything] – [Segment Anything] – [DINO] – [DDPM]
Outline
- Context & Background
- Claimed Contributions
- Method
- Experiments
- Further Readings & Resources
Context & Background
GLIP (Grounded Language-Image Pre-training) is a multi-modal language-image model. Similar to [CLIP](https://towardsdatascience.com/the-clip-foundation-model-7770858b487d?source=friends_link&sk=a7b10ba1d0c3a20ecd4adb8200a48500) (Contrastive Language-Image Pre-Training), it performs contrastive pre-training to learn semantically rich representations and aligns them across its modalities. While CLIP learns these representations on an image level, meaning one sentence describes the entire image, GLIP aims to extend this approach to object-level representations, meaning one sentence might correspond to multiple objects within the image. The task of identifying correspondences between single tokens in a text prompt and objects or regions in an image is called phrase grounding. Hence the word "Grounded" in GLIP.
Therefore, GLIP aims to:
- Unify phrase grounding and object detection for large-scale pre-training.
- Provide a flexible framework for zero-shot object detection, where flexible means it is not restricted to a fixed set of classes.
- Build one pre-trained model that seamlessly transfers to various tasks and domains, in a zero-shot or few-shot manner.
What can you do with such a model? You could use text prompts to find objects or regions of interest within a given input image. And the best part: you are not restricted to pre-defined classes.
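As a rough illustration of this prompt-driven workflow, below is a usage sketch adapted from memory of the official demo notebook; the module paths, config and weight file names, and exact function signatures are assumptions and should be checked against the repository.

```python
# Usage sketch (adapted from memory of the official GLIP demo); paths and
# signatures are assumptions to verify against the official repository.
import numpy as np
from PIL import Image

from maskrcnn_benchmark.config import cfg
from maskrcnn_benchmark.engine.predictor_glip import GLIPDemo

config_file = "configs/pretrain/glip_Swin_T_O365_GoldG.yaml"   # assumed config path
weight_file = "MODEL/glip_tiny_model_o365_goldg_cc_sbu.pth"    # assumed checkpoint path

cfg.local_rank = 0
cfg.num_gpus = 1
cfg.merge_from_file(config_file)
cfg.merge_from_list(["MODEL.WEIGHT", weight_file, "MODEL.DEVICE", "cuda"])

glip_demo = GLIPDemo(cfg, min_image_size=800, confidence_threshold=0.7)

# Load an image (the demo expects a BGR numpy array) and query it with a
# free-form prompt instead of a fixed class list.
image = np.array(Image.open("street_scene.jpg").convert("RGB"))[:, :, [2, 1, 0]]
caption = "person. bicycle. traffic light. pothole."
result_image, top_predictions = glip_demo.run_on_web_image(image, caption, 0.5)
```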

You could further process these detections (e.g. feed them into a tracking system) or create a custom dataset with certain classes of interest and use it to train your own supervised detection system. Not only could you cover rare or very specific classes, but you could also save a lot of time and money on manual labeling. As we will see later, the authors of GLIP had a similar idea to boost performance even further by introducing a teacher-student framework.
GLIP has been adopted by many other projects and domains in deep learning. For example, GLIGEN (Grounded-Language-to-Image-Generation) uses GLIP to condition the image generation of a latent diffusion model and increase controllability. Additionally, GLIP has been combined with other models such as DINO (DETR with Improved deNoising anchOr boxes) and SAM (Segment Anything Model) to form GroundingDINO and Grounded-Segment-Anything, respectively. GLIPv2 extends the initial GLIP model with vision-language understanding to not only improve phrase grounding but also enable visual question answering tasks.
Claimed Contributions
- Large-scale pre-training for combined phrase grounding and object detection
- Providing a unified view on object detection and phrase grounding
- Deep cross-modality fusion to learn high-quality language-aware visual representations and to achieve superior transfer learning performance.
- Showing that prompt tuning is more effective in deeply fused vision-language models (e.g. GLIP) than in shallowly fused ones (e.g. CLIP)
Method
Having a rough idea of what can be done with GLIP, let’s have a closer look into the details of the paper.
Architectural Overview
On a high level, GLIP’s architecture is quite similar to CLIP‘s in the sense that it also consists of a text encoder, an image encoder and some sort of contrastive learning on the similarity of text and image features. The architecture of GLIP is shown in Fig. 2.

GLIP adds a language-image aware deep fusion module after the text and image encoder. This module performs cross-modal attention and extracts further features. A cosine similarity is calculated over the resulting region features and word features. During training, the similarity of matching pairs is maximized, while it is minimized for incorrect pairs. In contrast to CLIP, where the matching pairs lie on the diagonal of the similarity matrix, in GLIP the matching is not performed on sentence level but on (sub)word level, so matching pairs usually end up at off-diagonal positions.
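To make the shape of this matching concrete, here is a minimal toy sketch of a cross-modal attention step followed by the region-word similarity matrix; the dimensions, module choices and single fusion layer are my own simplifications, not the authors’ implementation:

```python
# Toy sketch of deep fusion + region-word alignment (simplified, not GLIP's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 256
num_regions, num_words = 100, 16                      # N region proposals, M (sub)word tokens

region_feats = torch.randn(1, num_regions, d_model)   # from the image encoder / detection head
word_feats = torch.randn(1, num_words, d_model)       # from the text encoder (e.g. BERT)

# "Deep fusion": cross-modal attention lets each modality attend to the other
# before similarities are computed (GLIP stacks several such layers).
img_to_txt = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
txt_to_img = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

fused_regions, _ = img_to_txt(region_feats, word_feats, word_feats)   # regions attend to words
fused_words, _ = txt_to_img(word_feats, region_feats, region_feats)   # words attend to regions

# Region-word alignment: an N x M similarity matrix instead of CLIP's
# image x text matrix; matching pairs are usually off-diagonal.
alignment = torch.einsum(
    "bnd,bmd->bnm",
    F.normalize(fused_regions, dim=-1),
    F.normalize(fused_words, dim=-1),
)
print(alignment.shape)  # torch.Size([1, 100, 16])
```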
Phrase Grounding Formulated as Object Detection Problem
The authors noted that the problem of phrase grounding (= associating words with objects/regions in an image) can be formulated as an object detection objective, where the standard loss is:

L = L_cls + L_loc
The localization loss is concerned with the quality of the predicted bounding box, which, depending on the box format, might be its size and location. The classification loss is the key part of the unification: by computing the classification logits from the similarity scores between region features and word features, instead of from a fixed-class image classifier head, the same loss objective can be used for training.
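To illustrate the unification, here is a short sketch loosely following the paper’s notation (O for region features, W for a classic classifier weight matrix, P for word features); the concrete dimensions and the binary cross-entropy loss are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

N, M, C, d = 100, 16, 80, 256        # regions, (sub)words, fixed classes, feature dim
O = torch.randn(N, d)                # region/box features
W = torch.randn(C, d)                # classic detector: one weight vector per class
P = torch.randn(M, d)                # GLIP: word features computed from the prompt

logits_detector = O @ W.T            # (N, C): scores over a fixed class vocabulary
logits_grounding = O @ P.T           # (N, M): scores over the (sub)words of the prompt

# In both cases the classification loss has the same form: a loss between the
# logits and a target matrix marking which region matches which class / word.
T = torch.zeros(N, M)
T[0, 3] = 1.0                        # toy target: region 0 is grounded to word 3
loss_cls = F.binary_cross_entropy_with_logits(logits_grounding, T)

# The localization loss (box regression, e.g. L1 / GIoU on box coordinates) is
# unchanged, so the total objective stays L = L_cls + L_loc.
```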

Different Model Variants
Five different models are trained to show the effect of the authors’ design choices and model scale:

Teacher-Student Pre-Training
To boost the performance of GLIP, the authors train the GLIP-T (C) model (see Fig. 3) on human-annotated data, called GoldG, and use it to generate grounding data from image-text pairs collected from the internet. They call this model the teacher and subsequently train a student model, feeding it the data used to train the teacher plus the data the teacher generated. See Fig. 4 for an illustration.
Note: Even though the terms teacher and student are used, it is not the same process as in knowledge distillation, where a smaller student model is trained to match the output of a larger teacher model.

Interestingly, as we will see in the experiments, the student surpasses the teacher on many (but not all) datasets for both zero-shot and few-shot detection. Why is that? The paper hypothesizes that even though the teacher makes a prediction with low confidence (they call it an "educated guess"), that prediction becomes ground truth (they call it a "supervised signal") in the generated dataset consumed by the student, so the student learns from a clean, confident label where the teacher could only guess.
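Conceptually, the data flow looks roughly like the toy sketch below; the helper function, caption and confidence values are made up for illustration and are not the authors’ code:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]

def teacher_ground(image_path: str, caption: str) -> List[Tuple[Box, str, float]]:
    """Stand-in for the trained teacher (GLIP-T (C)): returns (box, phrase, confidence)."""
    return [((10.0, 20.0, 120.0, 200.0), "a small vial", 0.32)]   # dummy "educated guess"

# Placeholder web image-text pairs; in the paper these are Cap4M / Cap24M.
web_pairs = [("img_001.jpg", "a nurse holds a small vial and a syringe")]

generated_grounding_data = []
for image_path, caption in web_pairs:
    for box, phrase, confidence in teacher_ground(image_path, caption):
        # The teacher's guess, however uncertain, is kept as ground truth for the student.
        generated_grounding_data.append(
            {"image": image_path, "caption": caption, "box": box, "phrase": phrase}
        )

# Student training data = the teacher's training data (e.g. GoldG) + generated_grounding_data
```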
Experiments
The GLIP paper presents various experiments and ablation studies, mainly concerned with:
- Zero-Shot Domain Transfer
- Data Efficiency
- Prompt Engineering
I have some doubts about some of the results and the way they are presented, and I will point them out in the annotations. I don’t want to diminish the achievements of GLIP, but rather view them with a critical eye.
Now let’s jump into the details!
Zero-Shot Domain Transfer
First, we will have a look at the results from the zero-shot domain transfer. In this task, the objective is to analyze how well the pre-trained GLIP models perform on datasets (i.e. COCO and LVIS) different from those used during pre-training, and to compare them against a baseline of models that have been trained in a supervised fashion. Then, the pre-trained GLIP is further fine-tuned and evaluated on the dataset under test.
In Fig. 5 we see the results from the zero-shot domain transfer on COCO. We see that all GLIP models have a better zero-shot performance than a supervised Faster R-CNN. We are also presented with the result that GLIP-L outperforms the previous SOTA (at the time of the paper’s release), and that the larger student GLIP-L outperforms the teacher model GLIP-T (C).

In the following, I list my doubts about these results and the claims made in the paper, namely that GLIP-L surpasses the best supervised model, SoftTeacher.
- The model that has better metrics than SoftTeacher is GLIP-L, and it is better by only 0.2 points. This small margin might not be the result of GLIP’s new method but might be due to differences in training hyperparameters.
- GLIP-L does not even use the data (Cap4M or Cap24M) generated by the teacher model, which the authors presented as a good solution.
- GLIP-L has been trained on a much larger corpus of training data than SoftTeacher.
In my opinion, the results comparing the different GLIP models and the DyHead-T they trained themselves are completely fine; I just have general doubts when different methods and models are compared under unclear or different constraints.
In Fig. 6, we see the zero-shot domain transfer performance on the LVIS dataset. We can see that the largest GLIP model, GLIP-L, outperforms all other presented supervised models.

Finally, GLIP’s phrase grounding performance on the Flickr30K Entities dataset is compared against MDETR (see Fig. 7). Both student models, GLIP-T and GLIP-L, surpass the MDETR baselines.

Data Efficiency
Another experiment is concerned with data efficiency. It aims to show how the performance (in terms of average precision) changes when a pre-trained model is fine-tuned on a certain amount of task-specific data. In Fig. 8, the models are evaluated on 13 different datasets and their performance is reported as the average precision averaged over those 13 datasets. Results are reported for 0-shot, 1-shot, 3-shot, 5-shot, 10-shot and "all"-shot (I doubt that’s an official term for complete fine-tuning, but I guess you get the point 😅).

Prompt Engineering
As with CLIP, the authors report a correlation between the model’s performance and the formulation of the input text prompt. They propose two techniques to improve the performance of a pre-trained model without the need to retrain its weights:
- Manual prompt tuning
- Prompt Tuning
The idea of manual prompt tuning is to provide further context in the form of additional descriptive words, see Fig. 9:

Manual prompt tuning can always be used to improve performance; it does not matter whether the model is fully fine-tuned or used in a zero-shot or few-shot scenario.
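For illustration, here is a made-up example of what such an enriched prompt could look like (the class names and descriptions are my own, not taken from the paper):

```python
# Made-up illustration of manual prompt tuning: adding descriptive attributes
# to the prompt to help with rare classes; no weights are retrained.
bare_prompt = "stingray. puffin. jellyfish."
enriched_prompt = (
    "stingray, which is a flat and round fish. "
    "puffin, which is a black and white seabird with an orange beak. "
    "jellyfish."
)
# Both prompts go through the unchanged text encoder; only the wording differs.
```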
The second approach, prompt tuning, requires access to ground-truth labels of a downstream task and is especially suitable for scenarios where each detection task has a single prompt (e.g. "Detect car"). In that scenario, the prompt is first translated into a feature embedding using the text encoder. Then, the image encoder and the deep fusion module are frozen and only the prompt embedding is optimized using the ground-truth labels. The optimized embedding then serves as input to the model, and the text encoder can be removed.
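Here is a toy sketch of that idea; all modules and dimensions are small stand-ins of my own, not GLIP’s actual components:

```python
# Toy sketch of prompt tuning: everything is frozen except the prompt embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, num_words, num_regions = 256, 4, 100

# Frozen parts: fixed region features for one image and a frozen fusion/scoring layer.
region_feats = torch.randn(num_regions, d)
frozen_fusion = nn.Linear(d, d)
for p in frozen_fusion.parameters():
    p.requires_grad_(False)

# Tunable prompt embedding; in practice it would be initialized from the text
# encoder's output for the task prompt (e.g. "Detect car"), which can then be discarded.
prompt_embedding = nn.Parameter(torch.randn(num_words, d))

target = torch.zeros(num_regions, num_words)
target[5, 1] = 1.0                                    # toy ground truth: region 5 <-> word 1

optimizer = torch.optim.AdamW([prompt_embedding], lr=1e-2)
for step in range(100):
    scores = region_feats @ frozen_fusion(prompt_embedding).T   # (regions, words)
    loss = F.binary_cross_entropy_with_logits(scores, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```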
Fig.10 shows the result of this prompt tuning for various GLIP models. When applied to models that have a deep fusion module, prompt tuning achieves almost the same performance as fine-tuning the model’s weights.

Further Readings & Resources
As mentioned at the beginning of this article, GLIP has been widely adopted by other projects.
Here is a list of papers that build upon GLIP:
- GLIPv2: Unifying Localization and Vision-Language Understanding
- GLIGEN: Open-Set Grounded Text-to-Image Generation
- Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Here is a list of repositories if you want to dive into the implementation of GLIP and other interesting projects that built upon GLIP:
- Official implementation of GLIP
- Python Notebook to play around with GLIP
- GroundingDINO: combining GLIP and DINO
- Grounded-Segment-Anything: combining GroundingDINO and SAM
Here is one of my articles about the CLIP foundation model, following the same summary approach as this article: