LLaVA: An open-source alternative to GPT-4V(ision)

Running LLaVA on the Web, locally, and on Google Colab

Yann-Aël Le Borgne
Towards Data Science


Curious where this picture was taken? Ask LLaVA! (Image by Guy Rey-Bellet from Pixabay).

LLaVA (an acronym for Large Language and Vision Assistant) is a promising open-source generative AI model that replicates some of the capabilities of OpenAI's GPT-4 in conversing with images. Users can add images to LLaVA chat conversations, which makes it possible to discuss the content of these images, but also to use them to describe ideas, contexts or situations in a visual way.

The most compelling feature of LLaVA is its ability to improve upon other open-source solutions while using a simpler model architecture and orders of magnitude less training data. These characteristics make LLaVA not only faster and cheaper to train, but also better suited to inference on consumer hardware.

This post gives an overview of LLaVA, and more specifically aims to

  • show how to experiment with it from a web interface, and how it can be installed on your computer or laptop
  • explain its main technical characteristics
  • illustrate how to program with it, using as an example a simple chatbot application built with HuggingFace libraries (Transformers and Gradio) on Google Colab.

Using LLaVA online

If you have not yet tried it, the simplest way to use LLaVA is the Web interface provided by its authors. The screenshot below illustrates how the interface operates: a user asks for ideas about what meals to prepare, given a picture of the contents of their fridge. Images can be loaded using the widget on the left, and the chat interface lets you ask questions and obtain answers as text.

LLaVA Web interface

In this example, LLaVA correctly identifies ingredients present in the fridge, such as blueberries, strawberries, carrots, yoghurt and milk, and suggests relevant ideas such as fruit salads, smoothies and cakes.

Other examples of conversations with LLaVA are given on the project website, which illustrate that LLaVA is capable of not just describing images but also making inferences and reasoning based on the elements within them (identifying a movie or a person using clues from a picture, coding a website from a drawing, explaining humorous situations, and so on).

Running LLaVA locally

LLaVA can also be installed on a local machine using Ollama or a Mozilla ‘llamafile’. These tools run on most consumer-grade, CPU-only machines, as the model only requires 8GB of RAM and 4GB of free disk space, and has even been shown to run successfully on a Raspberry Pi. Among the tools and interfaces developed around the Ollama project, a notable initiative is Ollama-WebUI (illustrated below), which reproduces the look and feel of the OpenAI ChatGPT user interface.

Ollama Web user interface — inspired by OpenAI ChatGPT
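Once the llava model has been pulled with Ollama, it can also be queried programmatically rather than through a graphical interface. Below is a minimal sketch assuming the ollama Python client (installed with pip install ollama); the image path is a placeholder to adapt to your setup.

import ollama

# Ask the locally served LLaVA model about an image on disk
# (assumes `ollama pull llava` has been run; the file path is a placeholder)
response = ollama.chat(
    model="llava",
    messages=[{
        "role": "user",
        "content": "What is in this picture?",
        "images": ["path/to/your/image.jpg"],
    }],
)
print(response["message"]["content"])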

Brief overview of LLaVA’s main features

LLaVA was designed by researchers from the University of Wisconsin-Madison, Microsoft Research and Columbia University, and was recently showcased at NeurIPS 2023. The project’s code and technical specifications can be accessed on its GitHub repository, which also offers various interfaces for interacting with the assistant.

As the authors summarize in their paper’s abstract:

[LLaVA] achieves state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.

The benchmark results, reported in the paper as the radar chart below, illustrate the improvements compared to other state-of-the-art models.

Radar chart of LLaVA’s benchmark results (image from paper)

Inner workings

LLaVA’s data processing workflow is conceptually simple. The model essentially works as a standard causal language model, taking language instructions (a user text prompt) as input and returning a language response. The language model’s ability to handle images comes from a separate vision encoder model that converts images into tokens in the language model’s embedding space; these visual tokens are silently added to the user text prompt (acting as a kind of soft prompt). The LLaVA process is illustrated below.

LLaVA network architecture (image from paper)

LLaVA’s language model and vision encoder rely on two reference models, Vicuna and CLIP, respectively. Vicuna is a pretrained large language model based on LLaMA-2 (designed by Meta) that boasts performance competitive with medium-sized LLMs (see the model cards for the 7B and 13B versions on HuggingFace). CLIP is an image encoder designed by OpenAI, pretrained to encode images and text in a shared embedding space using contrastive language-image pretraining (hence ‘CLIP’). The model used in LLaVA is the vision transformer variant CLIP-ViT-L/14 (see its model card on HuggingFace).

To match the dimensions of the vision encoder’s outputs with those of the language model’s token embeddings, a projection module (W in the image above) is applied. It is a simple linear projection in the original LLaVA, and a two-layer perceptron in LLaVA 1.5.
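To make this concrete, here is a minimal PyTorch sketch of the idea: image features produced by the vision encoder are projected into the language model’s embedding space and concatenated with the embedded text prompt. The dimensions and module definitions below are illustrative assumptions, not LLaVA’s actual implementation.

import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not LLaVA's actual configuration)
vision_dim = 1024   # size of the vision encoder's patch embeddings
llm_dim = 4096      # embedding dimension of the language model
n_patches = 576     # number of visual tokens produced for one image

# Projection module W: a single linear layer in the original LLaVA,
# a two-layer MLP in LLaVA 1.5 (sketched here)
projection = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

# Stand-ins for the vision encoder output and the embedded text prompt
image_features = torch.randn(1, n_patches, vision_dim)
text_embeddings = torch.randn(1, 32, llm_dim)

# Project the visual features into the language model's embedding space
image_tokens = projection(image_features)             # (1, 576, 4096)

# The visual 'soft prompt' is concatenated with the text tokens and fed
# to the causal language model, which generates the response as usual
inputs_embeds = torch.cat([image_tokens, text_embeddings], dim=1)
print(inputs_embeds.shape)                             # torch.Size([1, 608, 4096])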

Training process

The training process of LLaVA consists of two relatively simple stages.

The first stage aims solely at tuning the projection module W; the weights of the vision encoder and of the LLM are kept frozen. Training is performed on a subset of around 600k image/caption pairs from the CC3M Conceptual Captions dataset, which is available on HuggingFace in this repository.

In the second stage, the projection module weights W are fine-tuned together with the LLM weights (while keeping the vision encoder’s weights frozen), using a dataset of 158K language-image instruction-following examples. The data was generated with GPT-4, features examples of conversations, detailed descriptions and complex reasoning, and is available on HuggingFace in this repository.

The whole training takes around a day using eight A100 GPUs.
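As an illustration of how such a staged setup is typically expressed in PyTorch, the sketch below freezes the vision encoder and the LLM while leaving the projection module trainable (stage 1), then unfreezes the LLM for stage 2. The module objects are small placeholders, not LLaVA’s actual components.

import torch.nn as nn

# Small stand-ins for the three components (placeholders, not the real models)
vision_encoder = nn.Linear(1024, 1024)
language_model = nn.Linear(4096, 4096)
projection = nn.Linear(1024, 4096)

# Stage 1: only the projection module W is trainable
for module, trainable in [(vision_encoder, False), (language_model, False), (projection, True)]:
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 2: fine-tune W together with the LLM; the vision encoder stays frozen
for p in language_model.parameters():
    p.requires_grad = True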

Programming with LLaVA: How to get started

The code is available in the related Colab notebook.

The LLaVA model is integrated in the Transformers library and can be loaded using the standard pipeline object. The 7B and 13B variants of the model are available in the llava-hf organization on the HuggingFace Hub, and may be loaded in 4 or 8 bits to save GPU memory. We illustrate below how to load and run the model using code that can be executed on Colab with a T4 GPU (15GB of GPU RAM).

Below is the code snippet to load the 7B variant of LLaVA 1.5 in 4 bits:

from transformers import pipeline, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model_id = "llava-hf/llava-1.5-7b-hf"

pipe = pipeline("image-to-text", model=model_id, model_kwargs={"quantization_config": quantization_config})
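Alternatively, the same pipeline can be loaded in 8 bits, which uses more GPU memory than the 4-bit version but may preserve slightly more accuracy. The rest of this post uses the 4-bit pipeline.

# 8-bit loading variant
quantization_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

pipe_8bit = pipeline(
    "image-to-text",
    model=model_id,
    model_kwargs={"quantization_config": quantization_config_8bit}
)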

Let us then load this picture

We use the standard PIL library for loading the picture:

import requests
from PIL import Image

image_url = "https://cdn.pixabay.com/photo/2018/01/29/14/13/italy-3116211_960_720.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
image

Let us finally query the LLaVA model with the image, with a prompt asking to describe the picture.

Note: The format for the prompt follows

“USER: <image>\n<prompt>\nASSISTANT:”

prompt = "USER: <image>\nDescribe this picture\nASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs[0]['generated_text'])

Which returns the following answer:

USER: Describe this picture
ASSISTANT: The image features a large, empty amphitheater with a stunning view of the ocean in the background. The amphitheater is surrounded by a lush green hillside, and a majestic mountain can be seen in the distance. The scene is serene and picturesque, with the sun shining brightly over the landscape.

LLaVA chatbot

Let us finally create a simple chatbot relying on the LLaVA model. We will use the Gradio library, which provides a fast and easy way to create machine learning web interfaces.

The core of the interface consists of a row with an image uploader (a Gradio Image object) and a chat interface (a Gradio ChatInterface object).

import gradio as gr

with gr.Blocks() as demo:

    with gr.Row():
        image = gr.Image(type='pil', interactive=True)

        gr.ChatInterface(
            update_conversation, additional_inputs=[image]
        )

The chat interface connects to the update_conversation function, which keeps track of the conversation history and calls the LLaVA model for a response whenever the user sends a message.

def update_conversation(new_message, history, image):

    if image is None:
        return "Please upload an image first using the widget on the left"

    # Discard the 'Please upload an image' turns from the history
    conversation_starting_from_image = [
        [user, assistant] for [user, assistant] in history
        if not assistant.startswith('Please')
    ]

    # Rebuild the full prompt from the conversation history
    prompt = "USER: <image>\n"

    for user, assistant in conversation_starting_from_image:
        prompt += user + 'ASSISTANT: ' + assistant + "USER: "

    prompt = prompt + new_message + 'ASSISTANT: '

    outputs = pipe(
        image,
        prompt=prompt,
        generate_kwargs={"max_new_tokens": 200, "do_sample": True, "temperature": 0.7}
    )[0]['generated_text']

    # Remove the echoed prompt from the output and return only the assistant's answer
    return outputs[len(prompt)-6:]

The interface is launched by calling the launch method.

demo.launch(debug=True)

After a few seconds, the chatbot Web interface will appear:

Congratulations, your LLaVA chatbot is now up and running!

Useful links

Note: Unless otherwise noted, all images are by the author.
