
GLIP: Introducing Language-Image Pre-Training to Object Detection

Grounded Language-Image Pre-training by L. H. Li et al.

🚀 Sascha’s Paper Club

Today we will dive into a paper that builds upon the great success of CLIP in language-image pre-training and extends it to the task of object detection: GLIP – Grounded Language-Image Pre-training. We will cover the key concepts and findings of the paper and make them easy to understand by providing further context and adding annotations to images and experiment results. Let’s go!

Image created from publication by Sascha Kirch

Paper: Grounded Language-Image Pre-training by Liunian Harold Li et al., 7 Dec. 2021

Resources: GitHub

Category: representation learning, object detection, phrase-grounding, foundation models

Other Walkthroughs: [BYOL] – [CLIP] – [Depth Anything] – [Segment Anything] – [DINO] – [DDPM]

Outline

  1. Context & Background
  2. Claimed Contributions
  3. Method
  4. Experiments
  5. Further Readings & Resources

Context & Background

GLIP (Grounded Language-Image Pre-training) is a multi-modal language-image model. Similar to [CLIP](https://towardsdatascience.com/the-clip-foundation-model-7770858b487d?source=friends_link&sk=a7b10ba1d0c3a20ecd4adb8200a48500) (Contrastive Language-Image Pre-Training), it performs contrastive pre-training to learn semantically rich representations and aligns them across its modalities. While CLIP learns these representations on an image level, meaning one sentence describes the entire image, GLIP extends this approach to object-level representations, where one sentence might correspond to multiple objects within the image. The task of identifying correspondences between single tokens in a text prompt and objects or regions in an image is called phrase grounding. Hence the word "Grounded" in GLIP.

Therefore, GLIP aims to:

  1. Unify phrase grounding and object detection for large-scale pre-training.
  2. Provide a flexible framework for zero-shot object detection, where flexible means it is not restricted to a fixed set of classes.
  3. Build one pre-trained model that seamlessly transfers to various tasks and domains, in a zero-shot or few-shot manner.

What can you do with such a model? You could use text prompts to find objects or regions of interest within a given input image. And the best part: you are not restricted to pre-defined classes.

Fig. 1: Output of GLIP for different images and prompt formats. Image source + annotations by Sascha Kirch

You could further process these detections (e.g. feed them into a tracking system) or create a custom dataset with certain classes of interest and use it to train your own supervised detection system. Not only could you cover rare or very specific classes, you could also save a lot of time and money on the creation of manual labels. As we will see later, the authors of GLIP had a similar idea and boost the performance even further by introducing a teacher-student framework.

GLIP has been adopted by many other projects and domains in Deep Learning. For example, GLIGEN (Grounded-Language-to-Image-Generation) uses GLIP to condition the image generation of a latent diffusion model and increase its controllability. Additionally, GLIP has been combined with other models such as DINO (DETR with Improved deNoising anchOr boxes) and SAM (Segment Anything Model), resulting in GroundingDINO and Grounded-Segment-Anything, respectively. GLIPv2 extends the initial GLIP model with vision-language understanding to not only improve phrase grounding but also enable visual question answering tasks.

Paper Walkthroughs by Sascha Kirch

Claimed Contributions

  1. Large-scale pre-training for combined phrase grounding and object detection
  2. Providing a unified view on object detection and phrase grounding
  3. Deep cross-modality fusion to learn high-quality language-aware visual representations and to achieve superior transfer learning performance
  4. Showing that prompt tuning is more effective in deeply fused vision-language models (e.g. GLIP) than in shallowly fused networks (e.g. CLIP)

Method

Having a rough idea of what can be done with GLIP, let’s have a closer look into the details of the paper.

Architectural Overview

On a high level, GLIP’s architecture is quite similar to CLIP‘s in the sense that it also consists of a text encoder, an image encoder, and some sort of contrastive learning on the similarity of text and image features. The architecture of GLIP is shown in Fig. 2.

Fig. 2: Framework architecture. Image source + annotations by Sascha Kirch

GLIP adds a language-image aware deep fusion module after the text and image encoders. This module performs cross-modal attention and extracts further features. A cosine similarity is calculated over the resulting region features and word features. During training, the similarity of matching pairs is maximized, while it is minimized for incorrect pairs. In contrast to CLIP, where the matching pairs are located on the diagonal of the similarity matrix, in GLIP the matching is not performed on sentence level but on (sub)word level, so matches usually end up in off-diagonal positions.
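To make the deep fusion idea more concrete, here is a minimal sketch of one bidirectional cross-attention step in PyTorch. This is my own simplified illustration of the concept, not the official implementation (which fuses features across several encoder layers), and all names are chosen for readability.

```python
import torch
from torch import nn

class CrossModalFusionLayer(nn.Module):
    """Simplified sketch of one language-image fusion step (not the official GLIP code)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.img_attends_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_attends_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, region_feats: torch.Tensor, word_feats: torch.Tensor):
        # region_feats: (batch, num_regions, dim), word_feats: (batch, num_words, dim)
        # Regions attend to words: visual features become language-aware.
        img_update, _ = self.img_attends_txt(region_feats, word_feats, word_feats)
        # Words attend to regions: language features become image-aware.
        txt_update, _ = self.txt_attends_img(word_feats, region_feats, region_feats)
        return region_feats + img_update, word_feats + txt_update

# Example usage with random features
fusion = CrossModalFusionLayer(dim=256)
regions, words = torch.randn(1, 100, 256), torch.randn(1, 16, 256)
regions, words = fusion(regions, words)
```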

Phrase Grounding Formulated as Object Detection Problem

The authors note that the problem of phrase grounding (i.e. associating words with objects/regions in an image) can be formulated as an object detection objective, where the standard loss is

L = L_cls + L_loc

The localization loss L_loc is concerned with the quality of the predicted bounding box, which, depending on the box format, might be its size and location. The classification loss L_cls is the key part of the unification: instead of computing the classification logits with a fixed image classifier, they are computed from the similarity scores between region features and word features, so the same loss objective can be used for training.
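As a rough sketch of this unification: each region feature is compared with each word feature, and the resulting alignment scores play the role of the classification logits. The snippet below is my own illustration under simplifying assumptions (cosine similarity, a plain binary cross-entropy term, a made-up temperature `scale`); the actual implementation in the paper uses focal-loss-style objectives and additional details.

```python
import torch
import torch.nn.functional as F

def grounding_logits(region_feats: torch.Tensor, word_feats: torch.Tensor, scale: float = 10.0):
    """Word-region alignment scores: one logit per (region, word) pair."""
    O = F.normalize(region_feats, dim=-1)   # (num_regions, dim)
    P = F.normalize(word_feats, dim=-1)     # (num_words, dim)
    return scale * O @ P.T                  # (num_regions, num_words)

def grounding_classification_loss(logits: torch.Tensor, targets: torch.Tensor):
    """targets[i, j] = 1.0 if region i is described by word j, else 0.0.
    Matching pairs may sit anywhere in the matrix, not just on the diagonal."""
    return F.binary_cross_entropy_with_logits(logits, targets)

# The full training objective then adds a standard box regression term, roughly:
#   loss = grounding_classification_loss(...) + localization_loss(pred_boxes, gt_boxes)
```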

Different Model Variants

Five different models are trained to show the effect of the authors’ design choices and model scale:

Fig. 3: Model variants. Image source + annotations by Sascha Kirch

Teacher-Student Pre-Training

To boost the performance of GLIP, the authors first train the GLIP-T (C) model (see Fig. 3) on human-annotated data, called GoldG, and then use it to generate grounding data from image-text pairs collected from the internet. They call this model the teacher model and subsequently train a student model, feeding it with the data used to train the teacher plus the data the teacher generated. See Fig. 4 for an illustration.

Note: Even though the terms teacher and student are used, it is not the same process as in knowledge distillation, where a smaller student model is trained to match the output of a larger teacher model.

Fig. 4. Teacher-Student Pre-Training. Image by Sascha Kirch

Interestingly, as we will see in the experiments, the student surpasses the teacher on many (but not all) datasets for both zero-shot and few-shot detection. Why is that? The paper hypothesizes that even though the teacher provides a prediction with low confidence (they call it an "educated guess"), that prediction becomes the ground truth (they call it a "supervised signal") in the generated dataset consumed by the student.
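Conceptually, the pseudo-labeling step looks roughly like the sketch below. The helper `teacher.predict` and the confidence filtering are my own placeholders for illustration, not the paper's code.

```python
# Sketch of generating grounding pseudo-labels with the teacher model.
def generate_pseudo_grounding_data(teacher, web_image_text_pairs, min_score=0.5):
    pseudo_data = []
    for image, caption in web_image_text_pairs:
        # The teacher's "educated guess": boxes aligned to phrases in the caption.
        boxes, phrases, scores = teacher.predict(image, caption)  # hypothetical call
        kept = [(b, p) for b, p, s in zip(boxes, phrases, scores) if s >= min_score]
        if kept:
            # The guesses become hard ground truth ("supervised signal") for the student.
            pseudo_data.append({"image": image, "caption": caption, "annotations": kept})
    return pseudo_data

# The student is then trained on the gold grounding data (GoldG) plus this generated data.
```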


Experiments

The GLIP paper presents various experiments and ablation studies, mainly concerned with:

  1. Zero-Shot Domain Transfer
  2. Data Efficiency
  3. Prompt Engineering

I have some doubts about some of the results and the way they are presented, and I will point them out in the annotations. I don’t want to diminish the achievements of GLIP, but rather view them with a critical eye.

Now let’s jump into the details!

Zero-Shot Domain Transfer

First, we will have a look at the results of the zero-shot domain transfer. In this task, the objective is to analyze how well the pre-trained GLIP models perform on datasets (i.e. COCO and LVIS) different from those used during pre-training, and to compare them against a baseline of models that have been trained in a supervised fashion. Then, the pre-trained GLIP is further fine-tuned and evaluated on the dataset under test.

In Fig. 5 we see the results of the zero-shot domain transfer on COCO. We see that all GLIP models have a better zero-shot performance than the supervised Faster R-CNN. We are also presented with the result that GLIP-L outperforms the previous SOTA (at the time of the paper’s release). And we see that the larger student GLIP-L outperforms the teacher model GLIP-T (C).

Fig. 5: Zero-shot domain transfer and fine-tuning on COCO. Image source + annotations by Sascha Kirch

In the following, I list my doubts when reading these results and the claims made in the paper, where it is said that GLIP-L surpasses the best supervised model, SoftTeacher.

  1. The model with better metrics than SoftTeacher is GLIP-L, and it is better by only 0.2 points. This small margin might not be the result of GLIP’s new method but might be due to differences in training hyperparameters.
  2. GLIP-L does not even use the data (Cap4M or Cap24M) generated by the teacher model, which was presented as a good solution.
  3. GLIP-L has been trained on a much larger corpus of training data than SoftTeacher.

In my opinion, the results comparing the different GLIP models and the DyHead-T they trained themselves are completely fine; I just have my doubts in general when different methods and models are compared under unclear or different constraints.

In Fig. 6, we see the zero-shot domain transfer performance on the LVIS dataset. We can see that the largest GLIP model, GLIP-L, outperforms all other presented supervised models.

Fig. 6: Zero-shot domain transfer to LVIS. Image source + annotations by Sascha Kirch

Finally, GLIP’s phrase grounding performance on Flickr30K entities has been compared against MDETR (see Fig. 7). Both student models, GLIP-T and GLIP-L, surpass the MDETR baselines.

Fig. 7: Phrase grounding performance on Flickr30K entities. Image source + annotations by Sascha Kirch

Data Efficiency

Another experiment is concerned with data efficiency. It aims to show how the performance (in terms of average precision) changes when fine-tuning a pre-trained model on a certain amount of task-specific data. In Fig. 8, the models are evaluated on 13 different datasets and their performance is reported as the average precision averaged over those 13 datasets. Results are reported for 0-shot, 1-shot, 3-shot, 5-shot, 10-shot and "all"-shot (I doubt that’s an official term for complete fine-tuning, but I guess you get the point 😅).

Fig. 8: Data Efficiency. Image source + annotations by Sascha Kirch

Prompt Engineering

As in CLIP, the authors report a correlation between the model’s performance and the formulation of the input text prompt. They propose two techniques to improve the performance of a pre-trained model without the need to retrain the model’s weights:

  1. Manual prompt tuning
  2. Prompt tuning

The idea of manual prompt tuning is to provide further context in the form of additional descriptive words, see Fig. 9:

Fig. 9: Manual prompt tuning example. Image source + annotations by Sascha Kirch

Manual prompt tuning can always be used to improve the performance, i.e. it does not matter whether the model is fully fine-tuned or used in a zero-shot or few-shot scenario.
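As a small illustration of what such prompt enrichment can look like (the example wordings below are mine, not taken from the paper):

```python
# Plain class names vs. manually enriched prompts (illustrative wordings, not from the paper).
plain_prompt    = "stingray. lobster claw."
enriched_prompt = "stingray, which is a flat and round fish. lobster claw, which is a large pincer."

# GLIP-style detection prompts typically join all classes of interest into one caption,
# so the added descriptions simply become part of that caption.
```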

The second approach, prompt tuning, requires access to ground-truth labels of a downstream task and is especially suitable for scenarios where each detection task has a single prompt (e.g. "Detect car"). In that scenario, the prompt is first translated into a feature embedding using the text encoder. Then, the image encoder and the deep fusion module are frozen and only the input embedding is optimized using the ground-truth labels. The optimized embedding then serves as input to the model, and the text encoder can be removed.
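The sketch below shows this idea in PyTorch: everything except the prompt embedding is frozen, and only the embedding is optimized with the usual detection loss. `glip_detector` is a stand-in for the frozen image encoder, deep fusion module and detection head; it and its call signature are assumptions for illustration, not the official API.

```python
import torch

def prompt_tune(glip_detector, initial_prompt_embedding, dataloader, steps=500, lr=5e-3):
    """Optimize only the prompt embedding; all model weights stay frozen."""
    prompt_embedding = torch.nn.Parameter(initial_prompt_embedding.clone())
    optimizer = torch.optim.AdamW([prompt_embedding], lr=lr)

    for param in glip_detector.parameters():
        param.requires_grad_(False)  # freeze image encoder, fusion module and heads

    for _, (images, targets) in zip(range(steps), dataloader):
        # Assumed to return the detection loss (classification + localization).
        loss = glip_detector(images, prompt_embedding, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # The optimized embedding replaces the text encoder output at inference time.
    return prompt_embedding.detach()
```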

Fig. 10 shows the result of this prompt tuning for various GLIP models. When applied to models that have a deep fusion module, prompt tuning achieves almost the same performance as fine-tuning the model’s weights.

Fig. 10: Effectiveness of prompt tuning. Image source + annotations by Sascha Kirch

Further Readings & Resources

As mentioned at the beginning of this article, GLIP has been widely adopted by a vast number of projects.

Following is a list of papers that build upon GLIP:

  1. GLIPv2: Unifying Localization and Vision-Language Understanding
  2. GLIGEN: Open-Set Grounded Text-to-Image Generation
  3. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Here is a list of repositories if you want to dive into the implementation of GLIP and other interesting projects that built upon GLIP:


Here is one of my articles about the CLIP foundation model, which follows the same summary approach as this article:

The CLIP Foundation Model

