Visual Question Answering with Frozen Large Language Models

Talking with LLMs about images, without training LLMs on images.

Published in Towards Data Science · 18 min read · Oct 9, 2023

“Bridging modalities”, made with MidJourney. All images by the author unless otherwise stated.

In this article we’ll use a Q-Former, a technique for bridging computer vision and natural language models, to create a visual question answering system. We’ll go over the necessary theory, following the BLIP-2 paper, then implement a system which can be…

Written by Daniel Warfield