Run Visual Question Answering Models on Your Images Without Code
Welcome to week two of Ten Weeks of Plugins. During these ten weeks, we will be building a FiftyOne Plugin (or multiple!) each week and sharing the lessons learned!
If you’re new to them, FiftyOne Plugins provide a flexible mechanism for anyone to extend the functionality of their FiftyOne App. You may find the following resources helpful:
- FiftyOne Plugins Repo
- FiftyOne Plugin Docs
- Plugins Channel in the FiftyOne Community Slack
- Week 1 Plugins: AI Art Gallery & Twilio Automation
Ok, let’s dive into this week’s FiftyOne Plugin!
Visual Question Answering 🖼️❓🗨️
Imagine a world where your dataset could talk to you. A world where you could ask any question — specific or open-ended — and get a meaningful answer. How much more dynamic would your data exploration be? Welcome to Visual Question Answering (VQA), a machine learning task squarely situated at the intersection of computer vision and natural language processing.
In recent years, transformer models like Salesforce’s BLIPv2 have taken this open-ended data exploration to a new level. This plugin brings the power of VQA models directly to your image dataset. Now you can start asking your images those burning questions — all without a single line of code!
Plugin Overview & Functionality
For the second week of 10 Weeks of Plugins, I built a Visual Question Answering (VQA) Plugin. This plugin allows you to ask open-ended questions to your images — effectively chatting with your data — within the FiftyOne App.
Out of the box, this plugin supports two models (and two types of usage):
- A Vision-Language Transformer (fine-tuned on the VQAv2 dataset), which is the default VQA model in the Visual Question Answering pipeline from Hugging Face’s Transformers library. This model is run locally.
- BLIP2 from Salesforce, which is accessed via a Replicate inference endpoint.
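If you’re curious what the local option looks like on its own, here is a minimal standalone sketch of Hugging Face’s visual-question-answering pipeline, which defaults to the ViLT model mentioned above (the image path and question are placeholders; the plugin wraps this kind of call for you):

from transformers import pipeline

# The default checkpoint for this pipeline is a ViLT model fine-tuned on VQAv2
vqa = pipeline("visual-question-answering")

# Placeholder inputs; the pipeline accepts an image path (or PIL image) and a question
result = vqa(image="/path/to/image.jpg", question="What color is the car?")
print(result[0]["answer"])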
After you install the plugin, when you open the operators list (pressing the “`” key in the FiftyOne App) and click into the answer_visual_question operator, you can choose which of these models to use.
Enter your question in the question box, and the answer will be displayed in the operator’s output.
No data is added to the underlying dataset.
Installing the Plugin
If you haven’t already done so, install FiftyOne:
pip install fiftyone
Then you can download this plugin from the command line with:
fiftyone plugins download https://github.com/jacobmarks/vqa-plugin
Refresh the FiftyOne App, and you should see the answer_visual_question operator in your operators list when you press the “`” key.
To use the Vision-Language Transformer (ViLT), install Hugging Face’s transformers library:
pip install transformers
To use BLIPv2, set up an account with Replicate and install the Replicate Python library:
pip install replicate
And add your Replicate API Token to your environment variables:
export REPLICATE_API_TOKEN=...
You do not need both to use the plugin: the operator checks your environment and only offers the models that are accessible via the corresponding APIs.
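As an illustration, that availability check could look something like the following sketch (the function name and model labels here are assumptions for illustration, not the plugin’s exact code):

import os
from importlib.util import find_spec

def get_available_models():
    # Only offer models whose dependencies and credentials are present
    models = []
    if find_spec("transformers") is not None:
        models.append("ViLT (local)")
    if os.environ.get("REPLICATE_API_TOKEN"):
        models.append("BLIP2 (Replicate)")
    return models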
If you want to use a different VQA model (or fine-tune your own version of one of these!), locally or via API, it should be easy to extend this code.
Lessons Learned
The Visual Question Answering plugin is a Python plugin consisting of four files:
- __init__.py: defines the operator
- fiftyone.yml: registers the plugin for download and installation
- README.md: describes the plugin
- requirements.txt: lists the requirements. Both transformers and replicate are commented out by default because neither is strictly required.
Using Selected Samples
Visual question answering models like BLIPv2 typically answer questions about one image at a time. As a result, it only makes sense for the answer_visual_question operator to likewise act on a single image. But how does the operator know which image to answer a question about?
Just like the FiftyOne App, whose session has a selected attribute (see Selecting samples), the plugin’s context, ctx, has a selected attribute. In direct analogy with the session, ctx.selected is a list of the sample IDs that are currently selected in the FiftyOne App.
The VQA plugin looks at the number of selected samples in the resolve_input() method:
num_selected = len(ctx.selected)
And only allows the user to enter a question if exactly one sample is selected.
💡Note: to use ctx.selected when you expect the selected sample(s) to change, you must pass dynamic=True into the operator’s configuration. In this case, the operator config was:
@property
def config(self):
    return foo.OperatorConfig(
        name="answer_visual_question",
        label="VQA: Answer question about selected image",
        dynamic=True,
    )
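With dynamic=True set, a resolve_input() along these lines can gate the question box on the current selection. This is a minimal sketch built only from the pieces described above, not the plugin’s exact code:

def resolve_input(self, ctx):
    inputs = types.Object()

    # ctx.selected holds the IDs of the samples currently selected in the App
    if len(ctx.selected) != 1:
        # Show nothing to fill in unless exactly one sample is selected
        return types.Property(inputs)

    inputs.str("question", label="Question", required=True)
    return types.Property(inputs)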
Returning Informative Outputs
The VQA plugin doesn’t write anything onto the samples themselves, but we still need a way to see the results of the model’s run: the “answer”. In this plugin, I return the model’s answer as output, using the resolve_output() method.
Output in a Python plugin works in much the same way as input. In resolve_input(), we create an inputs object, inputs = types.Object(), add elements to it, e.g. inputs.str("question", label="Question", required=True), and then return these inputs via types.Property(inputs, view=...). In resolve_output(), we create an outputs object, outputs = types.Object(), add elements to it, e.g. outputs.str("question", label="Question"), and return these outputs via types.Property(outputs, view=...).
The main difference between inputs and outputs is that in resolve_input(), the input values come from the user. How are variables communicated to resolve_output()? You can return them as a dictionary from execute(), and then use the values, referencing them by key. In this plugin, I pass the question and answer from execute():
return {"question": question, "answer": answer}
Then in resolve_output(), I access these values:
outputs.str("question", label="Question") outputs.str("answer", label="Answer")
This works for a variety of data types, not just strings!
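Putting the pieces together, a sketch of the execute()/resolve_output() pair might look like the following. Only the local ViLT path is shown (BLIP2 via Replicate is omitted for brevity), and this is an illustrative sketch rather than the plugin’s actual implementation:

import fiftyone.operators.types as types
from transformers import pipeline

vqa_model = pipeline("visual-question-answering")  # ViLT fine-tuned on VQAv2

def execute(self, ctx):
    # Look up the single selected sample by its ID
    sample = ctx.dataset[ctx.selected[0]]
    question = ctx.params["question"]

    # Ask the model about this image
    answer = vqa_model(image=sample.filepath, question=question)[0]["answer"]

    # Anything returned here is available in resolve_output() by key
    return {"question": question, "answer": answer}

def resolve_output(self, ctx):
    outputs = types.Object()
    outputs.str("question", label="Question")
    outputs.str("answer", label="Answer")
    return types.Property(outputs)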
Conclusion
If you want to chat with your entire dataset, VoxelGPT is a great option. VoxelGPT is another example of a FiftyOne Plugin, which we launched earlier this year. It translates your natural language prompts into actions that organize and explore your data. On the other hand, if you want to ask open-ended questions about specific images in your dataset — without departing from your existing workflows — then this Visual Question Answering plugin is for you!
Stay tuned over the remaining weeks in the Ten Weeks of FiftyOne Plugins while we continue to pump out a killer lineup of plugins! You can track our journey in our ten-weeks-of-plugins repo — and I encourage you to fork the repo and join me on this journey!
Week 2 Community Plugins
🚀Check out this awesome line2d plugin 📉 by wayofsamu for visualizing (x,y) points as a line chart!