Exploring “Small” Vision-Language Models with TinyGPT-V

TinyGPT-V is a “small” vision-language model that can run on a single GPU

Scott Campit, Ph.D.
Towards Data Science

--

Summary

AI technologies continue to become embedded in our everyday lives. One direction AI is heading is multi-modality, such as integrating language models with vision models. These vision-language models can be applied to tasks such as video captioning, semantic search, and many other problems.

This week, I’m going to shine a spotlight on a recent vision-language model called TinyGPT-V (Arxiv | GitHub). What makes this multimodal language model interesting is that it is very “small” for a large language model, and that it can be deployed on a single GPU, requiring as little as 8GB of GPU or CPU memory for inference. This is significant for improving the speed, efficiency, and cost of running AI models in the wild.

I would like to note that I’m not an author or in any way affiliated with the authors of the model. However, as a researcher and practitioner, I thought it was an intriguing development in AI that is worth examining, especially since more efficient models will unlock many more applications. Let’s dive in!

The Problem: Vision-Language Models are Useful But Resource Intensive

Photo by Jp Valery on Unsplash

Multi-modal models, such as vision-language models, are achieving record performance in human-aligned responses. As these models continue to improve, we could see companies begin to apply these technologies in real-world scenarios and applications.

However, many AI models, especially multi-modal models, require substantial computational resources for both model training and inference. This physical constraint of time, hardware resources, and capital is a bottleneck for researchers and practitioners.

Further, these constraints currently prevent multi-modal models from being deployed in certain application settings, such as edge devices. Research and development towards quantized (smaller), high-performance models is needed to address these challenges.

TinyGPT-V: A “Small” Vision-Language Model

Photo by Céline Haeberly on Unsplash

TinyGPT-V is a 2.8B-parameter vision-language model that can be trained on a 24GB GPU and requires only 8GB of GPU or CPU memory for inference. This is significant because other state-of-the-art “smaller” vision-language models, such as LLaVA1.5, are still relatively “big” (7B and 13B parameters).

When benchmarked against larger vision-language models, TinyGPT-V achieves similar performance on multiple tasks. Together, this work contributes to a movement to make AI models more efficient by reducing their computational needs while retaining performance. Balancing these two objectives will enable vision-language models to be served directly on devices, offering better user experiences such as reduced latency and greater robustness.

Related Work and Adjacent Technologies Applied in the TinyGPT-V Architecture

Not-So-Large Foundation Vision-Language Models (VLMs)

VLMs learn the relationship between images/videos and text, which can be applied to many common tasks such as searching for objects within a photo (semantic search), asking questions and receiving answers about videos (VQA), and many more. LLaVA1.5 and MiniGPT-4 are two multi-modal large language models that are state-of-the-art as of January 2024 and are relatively smaller than similar VL foundation models. However, these VLMs still require significant GPU usage and training hours. For example, the authors describe the training resources for the LLaVA-v1.5 13B-parameter model, which uses eight A100 GPUs with 80GB of RAM for 25.5 hours of training. This is a barrier for individuals and institutions that wish to study, develop, and apply these models in the wild.

TinyGPT-V is one of the latest VLMs that aims to address this issue. It uses two separate foundation models for the vision and language components: the EVA encoder serves as the vision component, while Phi-2 serves as the language model. Briefly, EVA is a vision transformer scaled up to 1B parameters that is pre-trained to reconstruct masked image-text features. Phi-2 is a 2.7B-parameter language model trained on curated synthetic and web datasets. The authors merged these two models and quantized them, for a total parameter size of 2.8B.
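
The pre-print does not spell out the quantization recipe in detail, but as a rough illustration of how weight quantization lowers inference memory, here is a minimal sketch of loading the Phi-2 backbone with 8-bit weights using Hugging Face transformers and bitsandbytes. This is generic library usage, not the authors’ implementation, and the checkpoint name refers to the standalone Phi-2 model rather than TinyGPT-V.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the Phi-2 language backbone with 8-bit weights to reduce memory.
# (Illustrative only; TinyGPT-V's own quantization details may differ.)
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
```

Very roughly, 8-bit weights halve the memory footprint of FP16 weights (about 5.6GB down to about 2.8GB for 2.8B parameters), which helps a model of this size fit within the 8GB inference budget mentioned above.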

Shown below is the performance of TinyGPT-V compared to other VLMs on various visual-language tasks. Notably, TinyGPT-V performs similarly to BLIP-2, likely due to the pre-trained Q-Former module that was taken from BLIP-2. Further, it appears that InstructBLIP achieved better performance than TinyGPT-V, although it should be noted that the smallest InstructBLIP model has 4B parameters. Depending on the application, this trade-off may be worth it to a practitioner, and additional analyses would be needed to explain this difference.

The datasets the model is trained on include:

  • GQA: Real-world visual reasoning and compositional QA
  • VSR: text-image pairs in English with spatial relationships
  • IconQA: visual understanding and reasoning with icon images
  • VizWiz: visual queries derived from a photo taken by a visually impaired individual with a smartphone and supplemented with 10 answers.
  • HM: a multimodal collection designed to detect hateful content in memes.

TinyGPT-V benchmark performance against similar state-of-the-art “smaller” vision-language models (adapted from Figure 1 of Yuan et al., 2023). Note that the figure labels the authors’ model as “TinyGPT-4”, which we assume refers to TinyGPT-V. Its performance is comparable to BLIP-2, which is ~3.1B parameters. InstructBLIP performs better across different tasks, but is notably ~4B parameters, considerably larger than TinyGPT-V’s ~2.8B parameters.

Cross-modal alignment of visual and language features

VLM training optimizes several objective functions to a) expand the utility of VLMs, b) increase general VLM performance, and c) mitigate the risk of catastrophic forgetting. In addition to the different objective functions, there are several model architectures or methods for learning and merging the joint representation of vision and language features. Here, we will discuss the layers relevant to training TinyGPT-V, which are shown below as blocks.

TinyGPT-V training schemes, adapted from Figure 2 (Yuan et al., 2023). Stage 1 is a warm-up pre-training stage. The second stage pre-trains the LoRA module. The third stage instruction-tunes the model. Finally, the fourth stage fine-tunes the model for various multi-modal tasks.

The Q-Former described in the BLIP-2 paper was used to learn the joint representation from the aligned image-text data. The Q-Former method optimizes three objectives to learn the vision-language representation (a sketch of the contrastive objective follows the list):

  1. Image-Text Matching: Learn fine-grained alignment between the image and text representation
  2. Image-Text Contrastive Learning: Align the image and text representation to maximize the mutual information gained
  3. Image-Grounded Text Generation: Train the model to generate text, given input images
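
To make the second objective concrete, below is a minimal sketch of a CLIP-style image-text contrastive loss. BLIP-2’s actual ITC objective is computed with the Q-Former’s query tokens and additional details, so treat this as an illustration of the idea rather than the exact formulation.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    # Normalize so the dot product is cosine similarity.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature            # (batch, batch) similarities
    targets = torch.arange(img_emb.shape[0], device=img_emb.device)
    # Each image should match its own caption, and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```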

Following the Q-Former layer, they employed a pre-trained linear projection layer from MiniGPT-4 (Vicuna 7B) to accelerate learning. They then applied a second linear projection layer to embed these features into the Phi-2 language model.
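
As a rough sketch of this projection path, the Q-Former’s output tokens pass through the reused MiniGPT-4 projection and then a new projection into Phi-2’s hidden size. The dimensions below are the published defaults for each component (768 for the BLIP-2 Q-Former, 4096 for Vicuna-7B, 2560 for Phi-2); the exact wiring in TinyGPT-V may differ.

```python
import torch
import torch.nn as nn

QFORMER_DIM = 768   # BLIP-2 Q-Former output width
VICUNA_DIM = 4096   # hidden size of Vicuna-7B (MiniGPT-4's language model)
PHI2_DIM = 2560     # hidden size of Phi-2

minigpt4_proj = nn.Linear(QFORMER_DIM, VICUNA_DIM)  # reused pre-trained projection
phi2_proj = nn.Linear(VICUNA_DIM, PHI2_DIM)         # new projection into Phi-2

def project_visual_tokens(qformer_out: torch.Tensor) -> torch.Tensor:
    # qformer_out: (batch, num_query_tokens, QFORMER_DIM)
    # Returns visual "soft prompt" tokens in Phi-2's embedding space.
    return phi2_proj(minigpt4_proj(qformer_out))
```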

Normalization

Training smaller language models across multiple modalities presented significant challenges. During training, the authors found that the model outputs were susceptible to NaN or INF values. Much of this was attributed to the vanishing gradient problem, since the model has a limited number of trainable parameters. To address these issues, they applied several normalization procedures to the Phi-2 model to ensure the data is in an adequate representation for training.

Three normalization techniques are applied throughout the Phi-2 model, with minor adjustments from their vanilla implementations. They updated the LayerNorm mechanism applied within each hidden layer to include a small constant for numerical stability. Further, they implemented RMSNorm as a post-normalization procedure after each multi-head attention layer. Finally, they incorporated a Query-Key Normalization procedure, which they determined to be important in low-resource learning scenarios.
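
Below is a minimal sketch of the latter two techniques, using standard RMSNorm and query-key normalization formulations; the authors’ exact placement and hyperparameters may differ, and the fixed scale in the attention sketch stands in for what is often a learnable temperature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization with a learnable scale."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps                               # small constant for numerical stability
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

def qk_normalized_attention(q, k, v, scale: float = 10.0):
    """Scaled dot-product attention with L2-normalized queries and keys."""
    # Normalizing q and k bounds the attention logits, which helps prevent
    # NaN/INF values when training with few trainable parameters.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v
```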

Parameter Efficient Fine-Tuning

Fine-tuning models is essential for achieving better performance on downstream tasks or in domain areas that are not covered during pre-training, and it can provide substantial performance gains compared to out-of-the-box foundation models.

One intuitive way to fine-tune a model is to update all pre-trained parameters with the new task or domain in mind. However, this approach has issues for large language models, as it requires a full copy of the fine-tuned model for each task. Parameter-Efficient Fine-Tuning (PEFT) is an active area of research in the AI community, where a smaller number of task-specific parameters are updated while most of the foundation model’s parameters remain frozen.

Low-Rank Adaptation (LoRA) is the specific PEFT method used to fine-tune TinyGPT-V. At a high level, LoRA freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of a transformer, which reduces the number of trainable parameters for downstream tasks. Shown below is how the LoRA module was applied to the TinyGPT-V model.

Adapted from Figure 3 (Yuan et al., 2023). Low-Rank Adaptation (LoRA) was applied to fine-tune TinyGPT-V. Panel c) shows how LoRA was implemented in TinyGPT-V. Panel d) shows the query-key normalization method described in the previous section.
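
To make the mechanism concrete, here is a minimal sketch of the LoRA idea applied to a single linear layer in PyTorch. The rank, scaling, and choice of which layers to wrap are illustrative assumptions, not the specific configuration used in TinyGPT-V.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

def count_trainable(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts to show how little LoRA trains."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total
```

Wrapping, for example, the attention projections of each transformer block this way leaves the overwhelming majority of parameters frozen, so only the small A and B matrices need to be trained and stored per task.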

Conclusions and parting thoughts

Photo by Mourizal Zativa on Unsplash

TinyGPT-V contributes to a body of research on making multi-modal large language models more efficient. Innovations in multiple areas, such as PEFT, quantization methods, and model architectures, will be essential to getting models as small as possible without sacrificing too much performance. As observed in the pre-print, TinyGPT-V achieves performance similar to other small VLMs: it matches BLIP-2 (whose smallest model is 3.1B parameters), and while it falls short of InstructBLIP on similar benchmarks, it is smaller in size (TinyGPT-V’s 2.8B parameters versus InstructBLIP’s 4B).

For future directions, there are certainly aspects that could be explored to improve TinyGPT-V’s performance. For instance, other PEFT methods could have been applied during fine-tuning. From the pre-print, it is unclear whether these architecture decisions were based purely on empirical performance or on implementation convenience; this should be studied further.

Finally, at the time of this writing, the pre-trained model and the model fine-tuned for instruction learning are available, while the multi-task model is currently a test version on GitHub. As developers and users adopt the model, further work could shed light on additional strengths and weaknesses of TinyGPT-V. Altogether, I thought this was a useful study for designing more efficient VLMs.

I hope you found this breakdown of TinyGPT-V useful for your own applications! If you want to chat more about AI or if you’re in the Bay Area and just want to grab some coffee, please feel free to reach out on LinkedIn. Otherwise, you can also catch me on torchstack.ai, where we offer custom AI solutions to customers and businesses.

--

I’m a data scientist and AI researcher who builds AI in biotech, healthcare, and sports tech.