
NeurIPS 2023 Survival Guide

10 Multimodal ML Papers You Won’t Want to Miss

The biggest conference in machine learning is underway. The Thirty-seventh Conference on Neural Information Processing Systems, or NeurIPS as it is colloquially known, is taking place December 10th-16th, 2023, in New Orleans, LA. Thousands of machine learning researchers from around the world have converged on the Big Easy to eat beignets and discuss everything from Bayesian optimization to adversarial attacks and in-context learning.

With 3,584 accepted papers, 14 tutorials, and 58 workshops, it’s nigh impossible to absorb all that the conference contains. Lucky for you, we’ve scraped, scoured, and synthesized the data to bring you the cream of the crop from NeurIPS 2023!

Heading to the event? Come visit the Voxel51 team at NeurIPS booth #427! We’d love to meet you, show you how invaluable open source FiftyOne will be to your data-centric AI workflows, and send you home with some epic swag. Also, be sure to check out our demo on Sunday’s agenda!

Without further ado, here are ten of our favorite multimodal machine learning advances from NeurIPS in alphabetical order:

  1. Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
  2. Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
  3. DataComp: In search of the next generation of multimodal datasets
  4. Holistic Evaluation of Text-to-Image Models
  5. ImageReward: Learning and Evaluating Human Preferences for Text-to-image Generation
  6. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
  7. LAMM: Multi-Modal Large Language Models and Applications as AI Agents
  8. MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing
  9. OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
  10. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

For a more comprehensive presentation of NeurIPS papers you won’t want to miss, across multiple domains, as well as meta-analysis of NeurIPS paper data, check out the Awesome NeurIPS 2023 repo — your one-stop-shop for all things NeurIPS 2023.

Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

Examples of Chameleon with GPT-4 on ScienceQA. Original source: Chameleon paper.

  • Links: (Arxiv | Code | Project Page)
  • Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Jianfeng Gao

TL;DR: Flexible training-free technique for extending pure language models to challenging multimodal tasks

Released back in April but still as important as ever, Chameleon is a prompting infrastructure, a reasoning engine, and a tool orchestrator. The technique leverages the general reasoning capabilities of large language models in conjunction with tools for web search, mathematical analysis, and visual understanding to tackle complex questions that involve multi-step or multimodal reasoning.

While similar in spirit to HuggingGPT (NeurIPS 2023), ViperGPT (ICCV 2023), and VisProg (CVPR 2023), Chameleon stands out in its flexibility and adaptability. It is truly plug-and-play — it works with tools for image understanding, web browsing, and tabular data processing. GPT-4 augmented with Chameleon scores 86.54% on the challenging ScienceQA benchmark, outperforming GPT-4 with chain-of-thought (CoT) prompting, but falling short of fine-tuned vision-language models like Multimodal Chain of Thought and LaVIN.
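To make the plug-and-play idea concrete, here is a minimal Python sketch of LLM-driven tool orchestration. The tool names, registry, and `call_llm_planner` stub are hypothetical stand-ins for illustration, not the actual Chameleon modules or API.

```python
# Minimal sketch of Chameleon-style tool orchestration.
# The tool registry and `call_llm_planner` are hypothetical stand-ins,
# not the actual Chameleon codebase.

from typing import Callable, Dict, List

# Each "tool" takes the running context dict and returns an updated context.
TOOLS: Dict[str, Callable[[dict], dict]] = {
    "image_captioner": lambda ctx: {**ctx, "caption": "a diagram of a food web"},
    "web_search": lambda ctx: {**ctx, "snippets": ["producers convert sunlight..."]},
    "solution_generator": lambda ctx: {**ctx, "answer": "B"},
}

def call_llm_planner(question: str, available_tools: List[str]) -> List[str]:
    """Stub for an LLM call that plans a tool sequence.

    In Chameleon, a planner prompt asks the LLM to emit a program
    (an ordered list of module names) for the given question.
    Here we just return a fixed plan for illustration.
    """
    return ["image_captioner", "web_search", "solution_generator"]

def run_program(question: str) -> dict:
    context = {"question": question}
    plan = call_llm_planner(question, list(TOOLS))
    for tool_name in plan:  # execute the planned modules in order
        context = TOOLS[tool_name](context)
    return context

if __name__ == "__main__":
    print(run_program("Which organism is a producer in this food web?"))
```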

💡If you like Chameleon, you should check out:

  • HuggingGPT: LLM dispatcher for CV tasks
  • ViperGPT: Visual inference via Python execution
  • VisProg: Visual reasoning without training (CVPR 2023 best paper)
  • VoxelGPT: Text-to-query for CV datasets

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

Overview of LaVIN Architecture. Original source: LaVIN paper.

  • Links: (Arxiv | Code | Project Page)
  • Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji

TL;DR: Single-stage training of multi-modality adapters 👉 SOTA parameter-efficient approach for fine-tuning vision-language models

Adapting large language models for multimodal applications can be a costly and time-consuming process. It can involve multiple training stages and a large number of trained parameters — for instance, LLaVA-13B retrains the entire 13B-parameter language model. These realities can be prohibitive.

In Cheap and Quick, researchers introduce a new technique, Mixture-of-Modality Adaptation (MMA), for fine-tuning vision-language models that drastically reduces these barriers to entry. MMA swaps out the multistage process for a single stage in which a vision adapter, a vision-to-language adapter, and a new mixture-of-modality adapter are trained simultaneously.

With this approach, lightweight adapters totaling just a few million parameters are sufficient to produce remarkably strong multimodal models. Applying MMA to LLaMA and training for just 1.4 hours, the resulting model, dubbed LaVIN (large vision-language instructed), achieves 89.41% on ScienceQA, and a 13B version of the model scores 90.83%.
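Here is a rough PyTorch sketch of the general idea behind a mixture-of-modality adapter: lightweight bottleneck adapters attached to a frozen backbone, with a tiny router that softly weights a text-oriented path against a vision-oriented path. The class names and dimensions are our own simplification for illustration, not the LaVIN implementation.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class MixtureOfModalityAdapter(nn.Module):
    """Sketch of the mixture idea: a tiny router softly selects between a
    text-oriented and a vision-oriented adapter path. This is a hypothetical
    simplification, not the LaVIN code."""
    def __init__(self, dim: int):
        super().__init__()
        self.text_adapter = ModalityAdapter(dim)
        self.vision_adapter = ModalityAdapter(dim)
        self.router = nn.Linear(dim, 2)

    def forward(self, x):
        # x: (batch, seq, dim) hidden states from a frozen LLM block
        weights = torch.softmax(self.router(x.mean(dim=1)), dim=-1)  # (batch, 2)
        out_text = self.text_adapter(x)
        out_vision = self.vision_adapter(x)
        w_t = weights[:, 0].view(-1, 1, 1)
        w_v = weights[:, 1].view(-1, 1, 1)
        return w_t * out_text + w_v * out_vision

hidden = torch.randn(2, 10, 512)     # fake hidden states
adapter = MixtureOfModalityAdapter(512)
print(adapter(hidden).shape)         # torch.Size([2, 10, 512])
```

Because only the adapters and router carry gradients, the number of trainable parameters stays tiny relative to the frozen language model.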

DataComp: In search of the next generation of multimodal datasets

High-level journey of participants in the DataComp competition. Original source: DataComp paper.

  • Links: (Arxiv | Code | Project Page)
  • Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, Ludwig Schmidt

TL;DR: Competition and benchmark for evaluating novel filtering strategies for constructing multimodal datasets, and a SOTA 1B sample multimodal dataset 

Multimodal models like OpenAI’s CLIP (contrastive language-image pretraining) have enabled a wide range of multimodal applications. Over the past few years, the pace of development in the architectures of these models has substantially picked up, but far less research has gone into the data used to train these models.

DataComp flips the traditional ML benchmark on its head, fixing the training budget, model architecture, and evaluation code, and pinning the focus on the data. The DataComp competition tasks participants with finding the dataset that results in a trained CLIP model with the best downstream performance on 38 multimodal classification and retrieval tasks.

The competition consists of two parallel tracks. In the first track, all participants start with a common pool of initial data and design strategies to filter the data. In the second track, participants bring their own data. As a baseline, the DataComp team curates a high-quality 1B sample dataset, and uses this to train a CLIP ViT-L/14 model with 79.2% zero-shot accuracy on ImageNet.
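To give a flavor of what a filtering-track submission might look like, here is a minimal sketch of a CLIP-score filter: keep only the image-text pairs whose embedding similarity clears a threshold. The threshold value and the use of precomputed, randomly generated embeddings are illustrative assumptions, not the paper’s exact baseline.

```python
import numpy as np

def clip_score_filter(image_embs: np.ndarray,
                      text_embs: np.ndarray,
                      threshold: float = 0.28) -> np.ndarray:
    """Keep pairs whose cosine similarity between (precomputed) CLIP image
    and text embeddings exceeds a threshold. This mirrors the common
    CLIP-score style of filtering baseline; the threshold here is
    illustrative, not a value from the paper."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = (image_embs * text_embs).sum(axis=1)  # cosine similarity per pair
    return sims >= threshold

# Toy usage with random stand-in "embeddings"
rng = np.random.default_rng(0)
img = rng.normal(size=(5, 512))
txt = rng.normal(size=(5, 512))
print(clip_score_filter(img, txt))
```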

💡If you’re interested in DataComp, and want to dive deeper into filtering strategies, check out:

  • CLIPA: An Inverse Scaling Law for CLIP Training (NeurIPS 2023)
  • MetaCLIP: Demystifying CLIP Data

Holistic Evaluation of Text-to-Image Models

Overview of aspects evaluated as part of the HEIM benchmark. Original source: HEIM paper.

  • Links: (Arxiv | Code | Project Page)

TL;DR: The first benchmark for comprehensively comparing text-to-image models

If you’ve played around with text-to-image (T2I) models like DALL-E, Stable Diffusion, or Midjourney, you’ve probably noticed major differences in style, accuracy, and maybe even biases from one model to another. Probing deeper, you may have noticed that some models are better than others at reasoning about spatial relationships, incorporating historical knowledge, and handling inputs from disparate languages.

If you have, you’re not alone! As with large language models, comparing text-to-image models is not so straightforward — single-number metrics like FID (image quality) and CLIPScore (image-text alignment) tell only part of the story.

Taking inspiration from HELM, researchers propose the first holistic evaluation benchmark for text-to-image models (HEIM). The benchmark encapsulates 12 “aspects” of performance, including toxicity, originality, and efficiency, incorporating computational metrics and crowd-sourced human evaluations. Using the benchmark to evaluate 26 popular T2I models, they find that no single model comes out on top across the board. 
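The practical upshot of aspect-based evaluation is that comparisons happen per aspect rather than on a single leaderboard number. Here is a toy sketch, with made-up model names and scores, of how such a per-aspect comparison might be tabulated:

```python
# Toy per-aspect comparison in the spirit of HEIM: each model gets a score
# per aspect, and we report the best model per aspect rather than crowning a
# single overall winner. Model names and numbers are made up.
scores = {
    "model_a": {"alignment": 0.81, "toxicity": 0.12, "efficiency": 0.90},
    "model_b": {"alignment": 0.77, "toxicity": 0.05, "efficiency": 0.60},
}

# For "toxicity" lower is better; for the other aspects higher is better.
lower_is_better = {"toxicity"}

for aspect in scores["model_a"]:
    pick = min if aspect in lower_is_better else max
    best = pick(scores, key=lambda m: scores[m][aspect])
    print(f"{aspect}: best model = {best}")
```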

ImageReward: Learning and Evaluating Human Preferences for Text-to-image Generation

Comparison of images generated with ImageReward and ReFL to top images from base text-to-image models. Original source: ImageReward paper.

  • Links: (Arxiv | Code | Hugging Face)
  • Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, Yuxiao Dong

TL;DR: First general-purpose reward model for human preferences on T2I models, and an approach for aligning T2I models

Reinforcement Learning from Human Feedback, or RLHF, was one of the biggest breakthroughs that made ChatGPT such a hit. The main idea behind RLHF is that by having a model produce multiple responses for the same input and asking humans to rank those responses, you can better align the model itself (via fine-tuning) with human preferences. Intuitively, one would love to apply a similar approach to other generative models like text-to-image (T2I) models in order to remedy body disfigurations, reduce toxicity, and enhance image-text alignment.

In this work, researchers present ImageReward — the first general purpose reward model for human preferences on T2I generated images. The model is built from a BLIP backbone, and is trained on a newly constructed ImageRewardDB dataset of 137k comparison image pairs across 8,878 prompts.

On top of ImageReward, a technique called Reward Feedback Learning (ReFL) is applied to tune T2I models for alignment with human preferences. Whereas RLHF steers a language model by increasing the likelihood of high-reward generations, no analogous, tractable notion of generation likelihood exists for latent diffusion models. As an alternative, the ImageReward score of the image obtained after the later denoising steps is used as feedback. In human evaluations, the resulting ReFL-tuned model wins head-to-head comparisons against Stable Diffusion 59% of the time.
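A schematic PyTorch sketch of that training signal is shown below: backpropagate the (negative) reward of the decoded image into the generator. The modules here are stubs standing in for a latent diffusion model, its decoder, and the ImageReward model; this illustrates the idea under our own naming, and is not the authors’ training code.

```python
import torch
import torch.nn as nn

# Stand-in modules; in practice these would be a latent diffusion model,
# its decoder, and the trained ImageReward model. Names are illustrative.
reward_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))
decoder = nn.Identity()                                  # latents -> image (stub)
denoiser = nn.Conv2d(3, 3, kernel_size=3, padding=1)     # one "denoising step" (stub)

# The reward model stays frozen; only the generator is updated.
for p in reward_model.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-5)

def refl_step(latents: torch.Tensor) -> float:
    """One Reward Feedback Learning update (schematic).

    Instead of an RLHF-style likelihood objective, the reward score of the
    image obtained after late-stage denoising is used directly as the
    training signal: maximizing reward == minimizing its negative."""
    denoised = denoiser(latents)      # pretend this is a late denoising step
    image = decoder(denoised)
    reward = reward_model(image).mean()
    loss = -reward                    # gradient ascent on the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(refl_step(torch.randn(4, 3, 32, 32)))
```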

💡Browse a sanitized version of the ImageRewardDB dataset for free!

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

 InstructBLIP architecture overview. Original source: InstructBLIP paper.

  • Links: (Arxiv | Code | Hugging Face)
  • Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi

TL;DR: Adding instruction-aware feature extraction to vision encoder and applying visual instruction tuning to LLMs 👉SOTA zero-shot and task-specific performance

Instruction tuning is a technique for fine-tuning large language models by training them on (instruction, output) pairs. For instance, instead of including few-shot examples in a prompt to nudge the model to respond to multiple-choice questions with A, B, C, or D, you train the model on multiple-choice questions with the instruction “answer with A, B, C, or D” and the desired answers as outputs.

With InstructBLIP, a team from Salesforce applied the instruction tuning approach to vision-language models in full force. The researchers took 26 benchmark datasets spanning 11 multimodal tasks and hand-crafted task-specific instruction templates. Here are two example templates:

  • Image Captioning: <Image>Write a short description for the image.
  • Visual Question Answering: <Image>The question “{Question}” can be answered using the image. A short answer is

Starting from pretrained language models (FlanT5 and Vicuna), the researchers added a Query Transformer (Q-Former) that makes the visual features passed to the language model instruction-aware. The models were then fine-tuned on instruction data, resulting in a family of state-of-the-art (SOTA) zero-shot models — InstructBLIP (Bootstrapped Language-Image Pretraining). Further fine-tuning InstructBLIP models on specific tasks also leads to SOTA performance on those tasks!
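As a small illustration of how instruction data can be assembled from an existing dataset, here is a sketch that fills a VQA-style record into a template like the ones quoted above. The helper function and dataclass are ours, not part of the InstructBLIP codebase.

```python
from dataclasses import dataclass

# Instruction templates in the spirit of those quoted above; the exact set
# used by InstructBLIP is larger and task-specific.
TEMPLATES = {
    "captioning": "<Image>Write a short description for the image.",
    "vqa": '<Image>The question "{question}" can be answered using the image. A short answer is',
}

@dataclass
class InstructionSample:
    image_path: str
    instruction: str
    output: str

def vqa_to_instruction(image_path: str, question: str, answer: str) -> InstructionSample:
    """Convert an (image, question, answer) VQA record into an
    (instruction, output) pair for instruction tuning."""
    instruction = TEMPLATES["vqa"].format(question=question)
    return InstructionSample(image_path, instruction, answer)

sample = vqa_to_instruction("scene.jpg", "What color is the bus?", "red")
print(sample.instruction)
print(sample.output)
```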

LAMM: Multi-Modal Large Language Models and Applications as AI Agents

Overview of the Language-Assisted Multi-Modal (LAMM) Benchmark. Original source: LAMM paper.

  • Links: (Arxiv | Code | Project Page)
  • Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang

TL;DR: Dataset and benchmark for evaluating 2D and 3D visual reasoning in multimodal large language models

With InstructBLIP, the first step was adapting existing multimodal datasets for instruction tuning. In LAMM, researchers construct a multimodal instruction tuning dataset from scratch. This new purpose-built dataset forms the basis for the Language-Assisted Multi-Modal (LAMM) benchmark, which offers a unified way to evaluate the multimodal capabilities of vision-language models on visual tasks.

To construct the dataset, the LAMM team took images (for 2D tasks like keypoint detection and optical character recognition) and point clouds (for 3D tasks like 3D object detection) from publicly available datasets. They then augmented the visual data with instruction-response pairs generated by GPT — a technique referred to as self-instruction — resulting in 186,098 sets of image, instruction, and response as well as 10,262 sets of point cloud, instruction and response.
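A minimal sketch of the self-instruction step might look like the following: pack an image’s existing annotations into a prompt and ask a language model to write an instruction-response pair about it. The prompt wording and function names here are hypothetical, not the prompts used by the LAMM team.

```python
from typing import List

def build_self_instruct_prompt(captions: List[str], boxes: List[dict]) -> str:
    """Assemble an image's existing annotations into a prompt that asks a
    language model to write an (instruction, response) pair about the image.
    Schematic of the self-instruction idea, not the LAMM prompts."""
    context = "Captions: " + "; ".join(captions) + "\n"
    context += "Objects: " + ", ".join(
        f"{b['label']} at {b['bbox']}" for b in boxes
    )
    return (
        "You are given annotations for an image.\n"
        f"{context}\n"
        "Write one instruction a user might give about this image, "
        "and a detailed response grounded in the annotations."
    )

def call_gpt(prompt: str) -> str:
    """Stub for the GPT API call used to generate the pair."""
    return "Instruction: Describe the scene.\nResponse: A dog sits on a bench..."

prompt = build_self_instruct_prompt(
    captions=["a dog on a park bench"],
    boxes=[{"label": "dog", "bbox": [48, 60, 210, 300]}],
)
print(call_gpt(prompt))
```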

💡Check out the LAMM leaderboard — you’ll see lots of familiar faces!

MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing

Example image-editing sequences from the MagicBrush dataset, including single-turn and multi-turn instructions. Original source: MagicBrush paper.

TL;DR: High-quality, hand-crafted image editing dataset 👉 improved performance on single-turn and multi-turn image editing tasks

If you’ve ever photoshopped someone into or out of a photo, you know how frustrating and tedious image editing can be. Instruction-guided image editing models like InstructPix2Pix (CVPR 2023 Highlight) aim to perform these types of edits from instructions given in natural language. However, the datasets used to train prior image editing models were either synthetically generated (using text-to-image models), closed-source, or single-turn only — so model performance degrades rapidly as edits are iteratively applied.

The MagicBrush dataset brings an unprecedented level of quality and robustness to instruction-guided image editing. The dataset consists of 10,388 (source image, instruction, target image) triplets, including both single-turn and multi-turn edits. Source images were taken from MS COCO, and balanced by object class for data diversity. The types of edits were also varied, including actions, color changes, object removals and more. 

Human annotators who passed a training process proposed text-based instructions and masks on the images (denoting the region of the image to inpaint), and DALL-E 2 was used to generate edited images. The human annotators then selected the most faithful and realistic image as the target image in a MagicBrush triplet. Fine-tuning both InstructPix2Pix and HIVE on MagicBrush resulted in significant improvements in both single-turn and multi-turn editing!
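To show how single-turn and multi-turn edits can live in one format, here is a tiny sketch of representing an editing session as a chain of (source image, instruction, target image) triplets, where each turn’s output becomes the next turn’s input. The file names and field names are made up for illustration, not the MagicBrush schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EditTurn:
    source_image: str   # path to the image before the edit
    instruction: str    # natural-language edit instruction
    target_image: str   # path to the human-selected edited image

# A multi-turn session chains triplets: each turn's target becomes the
# next turn's source. File names here are invented for illustration.
session: List[EditTurn] = [
    EditTurn("coco_000001.jpg", "add a red scarf to the dog", "edit_1.jpg"),
    EditTurn("edit_1.jpg", "make the background snowy", "edit_2.jpg"),
]

def as_training_pairs(turns: List[EditTurn]):
    """Flatten a session into (source, instruction, target) training examples,
    so both single-turn and multi-turn edits are covered."""
    return [(t.source_image, t.instruction, t.target_image) for t in turns]

for pair in as_training_pairs(session):
    print(pair)
```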

💡Browse the MagicBrush dataset for free!

OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

 Illustration of the difference between isolated image-text pairs (left) and interleaved multimodal documents (right) such as those in the OBELICS dataset. Original source: OBELICS paper.

  • Links: (Arxiv | Code | Hugging Face)
  • Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh

TL;DR: Rich dataset of 141M multimodal documents containing both images and text

Most multimodal datasets are designed with specific tasks in mind. Image captioning datasets like COCO Captions, for instance, pair an image with a string of text (the caption) in a structured and constrained fashion. This can be useful for training models to excel at one particular task, but can leave a lot on the table in terms of universal multimodal reasoning. The controlled laboratory environment of such a structured dataset is not representative of the broader spectrum of multimodal interactions we’d like a sufficiently general model to have.

The OBELICS dataset from Hugging Face is the first web-scale dataset of natural multimodal documents with interleaved images and text. The dataset features 141 million web pages from the Common Crawl, including 353 million images and 115 billion tokens of textual data. The researchers demonstrate the utility of OBELICS by training a new 80B parameter model on the dataset. This model, called IDEFICS (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS), is shown to be competitive with Flamingo!
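The difference from a captioning dataset is easiest to see in the data structure itself. Below is a small sketch of an interleaved document record, with image and text blocks kept in their original order; the class and field names are ours, not the OBELICS schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Block:
    """One element of an interleaved document: either text or an image."""
    text: Optional[str] = None
    image_url: Optional[str] = None

@dataclass
class InterleavedDocument:
    url: str
    blocks: List[Block]   # images and text appear in their original order

# Unlike a single (image, caption) pair, a web document interleaves several
# images with the surrounding prose. URLs here are placeholders.
doc = InterleavedDocument(
    url="https://example.com/recipe",
    blocks=[
        Block(text="Start by preparing the dough."),
        Block(image_url="https://example.com/dough.jpg"),
        Block(text="Once it has risen, shape it into rolls."),
        Block(image_url="https://example.com/rolls.jpg"),
    ],
)
print(sum(b.image_url is not None for b in doc.blocks), "images in document")
```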

Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

Example pairs of non-preferred (left) and preferred (right) images in the Pick-a-Pic dataset. Original source: Pick-a-Pic paper.

  • Links: (Arxiv | Code | Hugging Face)
  • Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, Omer Levy

TL;DR: Large-scale dataset of prompts, generated images, and human preferences

In ImageReward, we saw that human preferences could be used to better align text-to-image (T2I) models. As is often the case, the authors of Pick-a-Pic identified the availability of high-quality, accessible data as a limiting factor in applying these techniques.

In response to this need, they created a web application to collect user preferences. Users engaged with the application by crafting their own prompts and ranking the images generated by T2I models such as Stable Diffusion and SDXL. At the time the paper’s experiments were performed, the authors had collected 968,965 rankings across 66,798 prompts.

The dataset was used to train a model, PickScore, which scores images based on their alignment with prompts. When evaluated on the Pick-a-Pic test set, PickScore surpasses ImageReward and even trumps expert human annotators (who are unaware of the original user’s intention) at predicting human preferences.
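Here is a compact PyTorch sketch of the kind of pairwise preference objective such a scorer can be trained with: a CLIP-style score for a (prompt, image) pair, pushed to rank the preferred image above the rejected one via a logistic loss. The toy scorer and random embeddings are stand-ins of our own; this is not the PickScore training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceScorer(nn.Module):
    """Toy CLIP-style scorer: score = dot product of projected prompt and
    image embeddings. A stand-in for the real PickScore model."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.image_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, text_emb, image_emb):
        t = F.normalize(self.text_proj(text_emb), dim=-1)
        i = F.normalize(self.image_proj(image_emb), dim=-1)
        return (t * i).sum(dim=-1)   # higher = better aligned

scorer = PreferenceScorer()
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

# Fake embeddings standing in for (prompt, preferred image, rejected image).
text = torch.randn(8, 64)
preferred = torch.randn(8, 64)
rejected = torch.randn(8, 64)

# Pairwise preference loss: the preferred image should get the higher score
# (a Bradley-Terry / pairwise logistic objective).
s_pos = scorer(text, preferred)
s_neg = scorer(text, rejected)
loss = -F.logsigmoid(s_pos - s_neg).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```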

⚠️Watch out — while the dataset has a lot of awesome examples, it also has some very toxic content.

If you enjoyed this post and want to geek out on data-centric AI and the future of the field, reach out to me on LinkedIn and come by our booth at the conference — Voxel51 is a platinum sponsor!