In this article we’ll use a Q-Former, a technique for bridging computer vision and natural language models, to create a visual question answering system. We’ll go over the necessary theory, following the BLIP-2 paper, then implement a system which can be used to talk with a large language model about an image.
Who is this useful for? Data scientists interested in computer vision, natural language processing, and multimodal modeling.
How advanced is this post? Intermediate. You might struggle if you don’t have some experience in both computer vision and natural language processing.
Prerequisites: High-level familiarity with transformers, embeddings, and encoder-decoders. All of these topics are covered in the following article:
Visual language modeling really took off in 2015 with the paper VQA: Visual Question Answering, which formally posed the following class of problem:
Given an image and a natural language question about the image, the task is to provide an accurate natural language answer — VQA: Visual Question Answering
In 2016, when VQA was popularized, a typical approach looked something like this:
In the early days of VQA it was appropriate to train the vision and language components from scratch, pass the outputs to a dense network, and pick one of n possible outputs as a response.
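To make that pipeline concrete, here is a minimal PyTorch sketch of the early VQA recipe: a small CNN and an LSTM trained from scratch, their features fused and passed to a dense head that classifies over a fixed set of candidate answers. The class name, layer sizes, vocabulary size, and answer count below are illustrative assumptions, not code from the VQA paper.

```python
# A minimal sketch (not from the paper) of the 2016-era VQA recipe:
# a small CNN encodes the image, an LSTM encodes the question, the two
# feature vectors are fused, and a dense head picks one of N canned answers.
import torch
import torch.nn as nn


class EarlyVQAModel(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=300,
                 hidden_dim=512, num_answers=1000):
        super().__init__()
        # Vision branch: a toy CNN trained from scratch (in practice a
        # VGG- or ResNet-style backbone was common).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        # Language branch: embed question tokens and run an LSTM, keeping
        # the final hidden state as the question representation.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Fusion + classification: concatenate both features and score
        # a fixed list of `num_answers` possible answers.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, image, question_tokens):
        img_feat = self.cnn(image)                        # (B, hidden_dim)
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        txt_feat = h_n[-1]                                # (B, hidden_dim)
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        return self.classifier(fused)                     # answer logits


# Example forward pass with random data.
model = EarlyVQAModel()
logits = model(torch.randn(2, 3, 224, 224),
               torch.randint(0, 10_000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])
```

Note that the answer is selected from a fixed list via classification rather than generated as free-form text, which is exactly the limitation that later visual language models address.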
As vision and language models became more powerful, Visual Question Answering gave way to Visual Language Modeling (VLM), which can generally be considered an expansion of visual question answering.