In this article we’ll use a Q-Former, a technique for bridging computer vision and natural language models, to create a visual question answering system. We’ll go over the necessary theory, following the BLIP-2 paper, then implement a system which can be used to talk with a large language model about an image.
Who is this useful for? Data scientists interested in computer vision, natural language processing, and multimodal modeling.
How advanced is this post? Intermediate. You might struggle if you don’t have some experience in both computer vision and natural language processing.
Prerequisites: High-level familiarity with transformers, embeddings, and encoder-decoders. All of these topics are covered in the following article:
Visual language modeling really took off in 2015 with the paper VQA: Visual Question Answering, which formally posed the following class of problem:
Given an image and a natural language question about the image, the task is to provide an accurate natural language answer — VQA: Visual Question Answering
In 2016, when VQA was popularized, a typical approach looked something like this:
In the early days of VQA it was appropriate to train the vision and language components from scratch, pass the outputs to a dense network, and pick one of n possible outputs as a response.
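To make that pipeline concrete, here is a minimal PyTorch sketch of the early VQA recipe: a small CNN and an LSTM trained from scratch, their features fused and passed to a dense head that classifies over a fixed set of candidate answers. The class name, layer sizes, vocabulary size, and answer count below are illustrative assumptions, not code from the VQA paper.

```python
# A minimal sketch (not from the paper) of the 2016-era VQA recipe:
# a small CNN encodes the image, an LSTM encodes the question, the two
# feature vectors are fused, and a dense head picks one of N canned answers.
import torch
import torch.nn as nn


class EarlyVQAModel(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=300,
                 hidden_dim=512, num_answers=1000):
        super().__init__()
        # Vision branch: a toy CNN trained from scratch (in practice a
        # VGG- or ResNet-style backbone was common).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        # Language branch: embed question tokens and run an LSTM, keeping
        # the final hidden state as the question representation.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Fusion + classification: concatenate both features and score
        # a fixed list of `num_answers` possible answers.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, image, question_tokens):
        img_feat = self.cnn(image)                        # (B, hidden_dim)
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        txt_feat = h_n[-1]                                # (B, hidden_dim)
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        return self.classifier(fused)                     # answer logits


# Example forward pass with random data.
model = EarlyVQAModel()
logits = model(torch.randn(2, 3, 224, 224),
               torch.randint(0, 10_000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])
```

Note that the answer is selected from a fixed list via classification rather than generated as free-form text, which is exactly the limitation that later visual language models address.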
As vision and language models became more powerful, Visual Question Answering gave way to Visual Language Modeling (VLM), which can generally be considered an expansion of visual question answering.