Visual Question Answering with Frozen Large Language Models

Talking with LLMs about images, without training LLMs on images.

Daniel Warfield
Towards Data Science
18 min read · Oct 9, 2023


“Bridging modalities”, made with MidJourney. All images by the author unless otherwise stated.

In this article we’ll use a Q-Former, a technique for bridging computer vision and natural language models, to create a visual question answering system. We’ll go over the necessary theory, following the BLIP-2 paper, then implement a system which can be…
