10 Multimodal ML Papers You Won’t Want to Miss
The biggest conference in machine learning is underway. The Thirty-Seventh Conference on Neural Information Processing Systems, or NeurIPS as it is colloquially called, is taking place December 10th-16th, 2023, in New Orleans, LA. Thousands of machine learning researchers from around the world have converged on the Big Easy to eat beignets and discuss everything from Bayesian optimization to adversarial attacks and in-context learning.
With 3,584 accepted papers, 14 tutorials, and 58 workshops, it’s nigh impossible to absorb all that the conference contains. Lucky for you, we’ve scraped, scoured, and synthesized the data to bring you the cream of the crop from NeurIPS 2023!
Heading to the event? Come visit the Voxel51 team at NeurIPS booth #427! We’d love to meet you, show you how invaluable open source FiftyOne will be to your data-centric AI workflows, and send you home with some epic swag. Also, be sure to check out our demo on Sunday’s agenda!
Without further ado, here are ten of our favorite multimodal machine learning advances from NeurIPS 2023, in alphabetical order:
- Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
- DataComp: In search of the next generation of multimodal datasets
- Holistic Evaluation of Text-to-Image Models
- ImageReward: Learning and Evaluating Human Preferences for Text-to-image Generation
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
- LAMM: Multi-Modal Large Language Models and Applications as AI Agents
- MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing
- OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
- Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation
For a more comprehensive presentation of NeurIPS papers you won’t want to miss, across multiple domains, as well as meta-analysis of NeurIPS paper data, check out the Awesome NeurIPS 2023 repo — your one-stop-shop for all things NeurIPS 2023.
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Examples of Chameleon with GPT-4 on ScienceQA. Original source: Chameleon paper.
- Links: (Arxiv | Code | Project Page)
- Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Jianfeng Gao
TL;DR: Flexible training-free technique for extending pure language models to challenging multimodal tasks
Released back in April but still as important as ever, Chameleon is a prompting infrastructure, a reasoning engine, and a tool orchestrator. The technique leverages the general reasoning capabilities of large language models in conjunction with tools for web search, mathematical analysis, and visual understanding to tackle complex questions that involve multi-step or multimodal reasoning.
While similar in spirit to HuggingGPT (NeurIPS 2023), ViperGPT (ICCV 2023), and VisProg (CVPR 2023), Chameleon stands out for its flexibility and adaptability. It is truly plug-and-play — it works with tools for image understanding, web browsing, and tabular data processing. GPT-4 augmented with Chameleon scores 86.54% on the challenging ScienceQA benchmark, outperforming GPT-4 with chain-of-thought (CoT) prompting, but falling short of fine-tuned vision-language models like Multimodal Chain of Thought and LaVIN.
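To make the plug-and-play idea concrete, here is a minimal, purely illustrative sketch of tool orchestration: a registry of interchangeable tools and a "program" (tool sequence) that a planner LLM would normally produce. All function and tool names here are hypothetical stand-ins, not Chameleon's actual modules.

```python
from typing import Callable, Dict, List

def image_captioner(state: dict) -> dict:
    # Stand-in for a vision tool (e.g., an image captioning model).
    state["caption"] = f"a diagram related to: {state['question']}"
    return state

def web_search(state: dict) -> dict:
    # Stand-in for a search tool that returns supporting context.
    state["context"] = f"background facts about: {state['question']}"
    return state

def answer_generator(state: dict) -> dict:
    # Stand-in for the final LLM call that composes the answer.
    state["answer"] = f"answer derived from '{state['caption']}' and '{state['context']}'"
    return state

# The registry is what makes the approach plug-and-play: tools can be added
# or swapped without touching the execution loop.
TOOLS: Dict[str, Callable[[dict], dict]] = {
    "image_captioner": image_captioner,
    "web_search": web_search,
    "answer_generator": answer_generator,
}

def run_program(question: str, plan: List[str]) -> dict:
    """Execute a tool sequence (the 'program') chosen by the planner."""
    state = {"question": question}
    for tool_name in plan:
        state = TOOLS[tool_name](state)
    return state

# In Chameleon, an LLM planner decides this sequence; here it is hardcoded.
plan = ["image_captioner", "web_search", "answer_generator"]
print(run_program("Which organism is the producer in this food web?", plan)["answer"])
```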
💡If you like Chameleon, you should check out:
- HuggingGPT: LLM dispatcher for CV tasks
- ViperGPT: Visual inference via Python execution
- VisProg: Visual reasoning without training (CVPR 2023 best paper)
- VoxelGPT: Text-to-query for CV datasets
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
Overview of LaVIN Architecture. Original source: LaVIN paper.
- Links: (Arxiv | Code | Project Page)
- Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji
TL;DR: Single-stage training of multi-modality adapters 👉 SOTA parameter-efficient approach for fine-tuning vision-language models
Adapting large language models for multimodal applications can be a costly and time-consuming process. It can involve multiple training stages and many trained parameters — LLaVA-13B, for instance, retrains the entire 13B-parameter language model. These costs can be prohibitive.
In Cheap and Quick, researchers introduce a new technique Mixture-of-Modality Adaptation (MMA) for fine-tuning vision-language models which drastically reduces these barriers to entry. MMA swaps out the multistage process for a single stage in which a vision adapter, vision-to-language adapter, and new mixture-of-modality adapter are trained simultaneously.
With this approach, lightweight adapters totaling just a few million parameters are sufficient to produce remarkably strong multimodal models. Applying MMA to LLaMA and training for just 1.4 hours, the resulting model, dubbed LaVIN (large vision-language instructed), achieves 89.41% on ScienceQA, and a 13B version of the model scores 90.83%.
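For intuition, here is a toy PyTorch sketch of the adapter idea: a small bottleneck module with a learned gate that softly routes between a text path and a multimodal path, added residually on top of frozen LLM activations. This is an illustrative approximation of mixture-of-modality adaptation, not the LaVIN authors' implementation, and the dimensions are made up.

```python
import torch
import torch.nn as nn

class MixtureOfModalityAdapter(nn.Module):
    """Toy adapter: a gate softly routes between a text path and a multimodal path."""

    def __init__(self, dim: int = 512, bottleneck: int = 8):
        super().__init__()
        self.text_path = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        self.multimodal_path = nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        self.gate = nn.Linear(dim, 2)  # produces routing weights for the two paths

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, tokens, dim) activations from a frozen LLM layer
        weights = torch.softmax(self.gate(hidden.mean(dim=1)), dim=-1)  # (batch, 2)
        routed = (
            weights[:, 0, None, None] * self.text_path(hidden)
            + weights[:, 1, None, None] * self.multimodal_path(hidden)
        )
        return hidden + routed  # residual connection keeps the frozen backbone intact

adapter = MixtureOfModalityAdapter()
dummy = torch.randn(2, 16, 512)  # (batch, tokens, hidden dim)
out = adapter(dummy)
print(out.shape, sum(p.numel() for p in adapter.parameters()), "trainable adapter params")
```

Printing the parameter count makes the point: the trainable adapter is tiny relative to the frozen backbone.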
DataComp: In search of the next generation of multimodal datasets
High-level journey of participants in the DataComp competition. Original source: DataComp paper.
- Links: (Arxiv | Code | Project Page)
- Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, Ludwig Schmidt
TL;DR: Competition and benchmark for evaluating novel filtering strategies for constructing multimodal datasets, and a SOTA 1B sample multimodal dataset
Multimodal models like OpenAI’s CLIP (contrastive language-image pretraining) have enabled a wide range of multimodal applications. Over the past few years, the pace of development in the architectures of these models has substantially picked up, but far less research has gone into the data used to train these models.
DataComp flips the traditional ML benchmark on its head, fixing the training budget, model architecture, and evaluation code, and pinning the focus on the data. The DataComp competition tasks participants with finding the dataset that yields a trained CLIP model with the best downstream performance on 38 multimodal classification and retrieval tasks.
The competition consists of two parallel tracks. In the first track, all participants start with a common pool of initial data and design strategies to filter the data. In the second track, participants bring their own data. As a baseline, the DataComp team curates a high-quality 1B sample dataset, and uses this to train a CLIP ViT-L/14 model with 79.2% zero-shot accuracy on ImageNet.
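For a flavor of what a filtering strategy looks like in practice, here is a hedged sketch of simple CLIP-score filtering: keep only the image-text pairs whose CLIP similarity clears a threshold. The checkpoint and threshold are illustrative choices, not DataComp's official baseline settings.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; DataComp's baselines use their own CLIP training setup.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())

def filter_pool(pairs, threshold: float = 0.28):
    """Keep only (PIL image, caption) pairs whose similarity clears the threshold."""
    return [(img, cap) for img, cap in pairs if clip_similarity(img, cap) > threshold]
```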
💡If you’re interested in DataComp and want to dive deeper into filtering strategies, check out:
Holistic Evaluation of Text-to-Image Models
Overview of aspects evaluated as part of the HEIM benchmark. Original source: HEIM paper.
- Links: (Arxiv | Code | Project Page)
TL;DR: The first benchmark for comprehensively comparing text-to-image models
If you’ve played around with text-to-image (T2I) models like DALL-E, Stable Diffusion, or Midjourney, you’ve probably noticed major differences in style, accuracy, and maybe even biases from one model to another. Probing deeper, you may have noticed that some models are better than others at reasoning about spatial relationships, incorporating historical knowledge, and handling inputs from disparate languages.
If you have, you’re not alone! As with large language models, comparing text-to-image models is not so straightforward — single-number metrics like FID (image quality) and CLIPScore (image-text alignment) only tell part of the story.
Taking inspiration from HELM, researchers propose the first holistic evaluation benchmark for text-to-image models (HEIM). The benchmark encapsulates 12 “aspects” of performance, including toxicity, originality, and efficiency, incorporating computational metrics and crowd-sourced human evaluations. Using the benchmark to evaluate 26 popular T2I models, they find that no single model comes out on top across the board.
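The core design idea, evaluating every model on every aspect and reporting the full table rather than a single score, is easy to sketch. The snippet below is a toy illustration with placeholder metric functions and fake numbers; HEIM's real pipeline combines computational metrics with crowd-sourced human ratings.

```python
# Placeholder metric functions; HEIM's real aspects include image-text alignment,
# quality, aesthetics, originality, bias, toxicity, robustness, and more.
def alignment_metric(results: dict) -> float:    # e.g., CLIPScore over a prompt set
    return results["alignment"]

def aesthetics_metric(results: dict) -> float:   # e.g., a learned aesthetics predictor
    return results["aesthetics"]

def efficiency_metric(results: dict) -> float:   # e.g., seconds per generated image
    return results["latency_s"]

ASPECTS = {
    "alignment": alignment_metric,
    "aesthetics": aesthetics_metric,
    "efficiency": efficiency_metric,
}

# Fake per-model numbers, purely for illustration.
models = {
    "model_a": {"alignment": 0.31, "aesthetics": 5.2, "latency_s": 3.9},
    "model_b": {"alignment": 0.28, "aesthetics": 5.8, "latency_s": 1.7},
}

# Report the full per-aspect table instead of collapsing to one number.
for name, results in models.items():
    print(name, {aspect: fn(results) for aspect, fn in ASPECTS.items()})
```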
ImageReward: Learning and Evaluating Human Preferences for Text-to-image Generation
Comparison of images generated with ImageReward and ReFL to top images from base text-to-image models. Original source: ImageReward paper.
- Links: (Arxiv | Code | Hugging Face)
- Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, Yuxiao Dong
TL;DR: First general-purpose reward model for human preferences on T2I models, and an approach for aligning T2I models
Reinforcement Learning from Human Feedback (RLHF) was one of the biggest breakthroughs that made ChatGPT such a hit. The main idea behind RLHF is that by having a model output multiple responses for the same input and asking humans to rank those responses, you can better align the model itself (via fine-tuning) with human preferences. Intuitively, one would love to apply a similar approach to other generative models like text-to-image (T2I) models in order to remedy body disfigurations, reduce toxicity, and enhance image-text alignment.
In this work, researchers present ImageReward — the first general purpose reward model for human preferences on T2I generated images. The model is built from a BLIP backbone, and is trained on a newly constructed ImageRewardDB dataset of 137k comparison image pairs across 8,878 prompts.
On top of ImageReward, a technique called Reward Feedback Learning (ReFL) is applied to tune T2I models for alignment with human preferences. In RLHF, the language model’s likelihood over its generations provides the handle for steering it toward higher reward; latent diffusion models offer no analogous likelihood. As an alternative, the ImageReward score of the image after denoising steps is used directly as feedback. In human evaluations, the resulting ReFL-tuned model wins head-to-head against Stable Diffusion 59% of the time.
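Even without fine-tuning, a reward model like this enables simple best-of-n reranking: generate several candidates and keep the highest-scoring one. The sketch below uses the ImageReward package's load()/score() interface as described in the project's README; treat the exact package and function names as assumptions and check the repo before running.

```python
import ImageReward as RM  # pip install image-reward (package name is an assumption)

prompt = "a watercolor painting of a lighthouse at dawn"
candidates = ["candidate_0.png", "candidate_1.png", "candidate_2.png"]  # pre-generated images

reward_model = RM.load("ImageReward-v1.0")       # downloads the pretrained reward model
scores = reward_model.score(prompt, candidates)  # one human-preference score per image

best = candidates[max(range(len(candidates)), key=lambda i: scores[i])]
print(f"Best candidate according to ImageReward: {best}")
```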
💡Browse a sanitized version of the ImageRewardDB dataset for free!
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
InstructBLIP architecture overview. Original source: InstructBLIP paper.
- Links: (Arxiv | Code | Hugging Face)
- Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
TL;DR: Adding instruction-aware feature extraction to the vision encoder and applying visual instruction tuning to LLMs 👉SOTA zero-shot and task-specific performance
Instruction tuning is a technique for fine-tuning large language models by training them on (instruction, output) pairs. For instance, instead of including few-shot examples in a prompt to nudge the model to respond to multiple-choice questions with A, B, C, or D, you train the model on multiple-choice questions with the instruction “answer with A, B, C, or D” and the desired answers as outputs.
With InstructBLIP, a team from Salesforce applied the instruction tuning approach to vision-language models in full force. The researchers took 26 benchmark datasets across 11 multimodal tasks and hand-crafted task-specific instruction templates. Here are a few exemplary templates, followed by a minimal template-instantiation sketch:
- Image Captioning: <Image>Write a short description for the image.
- Visual Question Answering: <Image>The question “{Question}” can be answered using the image. A short answer is
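Here is that minimal sketch: mapping a raw VQA record to an (instruction, output) pair with a hand-crafted template. The template wording mirrors the VQA example above; the record and its field names are hypothetical.

```python
# Template wording mirrors the VQA example above; field names are hypothetical.
VQA_TEMPLATE = '<Image>The question "{question}" can be answered using the image. A short answer is'

def to_instruction_pair(example: dict) -> dict:
    """Map a raw VQA record to an (instruction, output) record for tuning."""
    return {
        "image": example["image"],                                         # image path or tensor
        "instruction": VQA_TEMPLATE.format(question=example["question"]),  # filled-in template
        "output": example["answer"],                                       # desired short answer
    }

raw = {"image": "kitchen.jpg", "question": "What color is the kettle?", "answer": "red"}
print(to_instruction_pair(raw)["instruction"])
```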
Starting from pretrained language models (FlanT5 and Vicuna), the researchers added a Query Transformer, which adds a layer of instruction-awareness to the visual encoding passed to the language model. The models were then fine-tuned on instruction data, resulting in a family of state-of-the-art (SOTA) zero-shot models — InstructBLIP (Bootstrapped Language-Image Pretraining). Further tuning InstructBLIP models for specific tasks also leads to SOTA on those tasks!
LAMM: Multi-Modal Large Language Models and Applications as AI Agents
Overview of the Language-Assisted Multi-Modal (LAMM) Benchmark. Original source: LAMM paper.
- Links: (Arxiv | Code | Project Page)
- Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang
TL;DR: Dataset and benchmark for evaluating 2D and 3D visual reasoning in multimodal large language models
With InstructBLIP, the first step was adapting existing multimodal datasets for instruction tuning. In LAMM, researchers construct a multimodal instruction tuning dataset from scratch. This new purpose-built dataset forms the basis for the Language-Assisted Multi-Modal (LAMM) benchmark, which offers a unified way to evaluate the multimodal capabilities of vision-language models on visual tasks.
To construct the dataset, the LAMM team took images (for 2D tasks like keypoint detection and optical character recognition) and point clouds (for 3D tasks like 3D object detection) from publicly available datasets. They then augmented the visual data with instruction-response pairs generated by GPT — a technique referred to as self-instruction — resulting in 186,098 sets of image, instruction, and response as well as 10,262 sets of point cloud, instruction and response.
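Self-instruction is straightforward to sketch: feed an image's existing annotations to a GPT model and ask it to write an instruction-response pair grounded in them. The prompt wording, model choice, and field names below are illustrative assumptions, not LAMM's exact pipeline; the OpenAI client usage follows the current openai-python SDK.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def self_instruct(image_metadata: dict, model: str = "gpt-4") -> dict:
    """Ask a GPT model to write an instruction-response pair grounded in image annotations."""
    prompt = (
        "You are annotating a vision-language dataset. Given these image "
        f"annotations: {image_metadata}, write one instruction a user might give "
        "about the image and a detailed response, as JSON with keys "
        "'instruction' and 'response'."
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return {"image": image_metadata["file"], "gpt_output": completion.choices[0].message.content}

record = self_instruct({"file": "000001.jpg", "objects": ["dog", "frisbee"], "scene": "park"})
print(record["gpt_output"])
```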
💡Check out the LAMM leaderboard — you’ll see lots of familiar faces!
MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing
Example image-editing sequences from the MagicBrush dataset, including single-turn and multi-turn instructions. Original source: MagicBrush paper.
- Links: (Arxiv | Code | Hugging Face | Project Page)
- Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, Yu Su
TL;DR: High quality hand-crafted image editing dataset 👉improved performance on single and multi-turn image editing tasks
If you’ve ever photoshopped someone into or out of a photo, you know how frustrating and tedious image editing can be. Instruction-guided image editing models like InstructPix2Pix (CVPR 2023 Highlight) aim to perform these types of edits from instructions given in natural language. However, the datasets used to train prior image editing models were either synthetically generated (using text-to-image models), closed-source, or single-turn only — so model performance degrades rapidly as edits are iteratively applied.
The MagicBrush dataset brings an unprecedented level of quality and robustness to instruction-guided image editing. The dataset consists of 10,388 (source image, instruction, target image) triplets, including both single-turn and multi-turn edits. Source images were taken from MS COCO, and balanced by object class for data diversity. The types of edits were also varied, including actions, color changes, object removals and more.
Human annotators who passed a training process proposed text-based instructions and masks on the images (denoting the region of the image to inpaint), and DALL-E 2 was used to generate edited images. The human annotators then selected the most faithful and realistic image as the target image in a MagicBrush triplet. Fine-tuning both InstructPix2Pix and HIVE on MagicBrush resulted in significant improvements in both single and multi-turn editing!
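If you want to poke at the triplets yourself, a few lines with the Hugging Face datasets library will do, as sketched below. The dataset ID and field name are assumptions based on the project's Hugging Face page; verify them against the MagicBrush repo.

```python
from datasets import load_dataset

# Dataset ID and field names are assumptions based on the Hugging Face page.
ds = load_dataset("osunlp/MagicBrush", split="train")

example = ds[0]
print(example.keys())          # inspect which fields each triplet actually exposes
print(example["instruction"])  # the natural-language edit instruction (assumed field name)
```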
💡Browse the MagicBrush dataset for free!
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Illustration of the difference between isolated image-text pairs (left) and interleaved multimodal documents (right) such as those in the OBELICS dataset. Original source: OBELICS paper.
- Links: (Arxiv | Code | Hugging Face)
- Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh
TL;DR: Rich dataset of 141M multimodal documents containing both images and text
Most multimodal datasets are designed with specific tasks in mind. Image captioning datasets like COCO Captions, for instance, pair an image with a string of text (the caption) in a structured and constrained fashion. This can be useful for training models to excel at one particular task, but can leave a lot on the table in terms of universal multimodal reasoning. The controlled laboratory environment of such a structured dataset is not representative of the broader spectrum of multimodal interactions we’d like a sufficiently general model to have.
The OBELICS dataset from Hugging Face is the first web-scale dataset of natural multimodal documents with interleaved images and text. The dataset features 141 million web pages from the Common Crawl, including 353 million images and 115 billion tokens of textual data. The researchers demonstrate the utility of OBELICS by training a new 80B-parameter model on the dataset. This model, called IDEFICS (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS), is shown to be competitive with Flamingo!
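Because the corpus is web-scale, streaming is the practical way to explore it. Here is a hedged sketch using the Hugging Face datasets library; the hub ID and the assumption that each record interleaves text and image fields come from the dataset card, so double-check them.

```python
from itertools import islice
from datasets import load_dataset

# Streaming avoids downloading the full web-scale corpus up front; the hub ID
# is an assumption based on the dataset card.
obelics = load_dataset("HuggingFaceM4/OBELICS", split="train", streaming=True)

for doc in islice(obelics, 3):
    # Each document is expected to interleave text passages and image references.
    print(doc.keys())
```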
Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation
Example pairs of non-preferred (left) and preferred (right) images in the Pick-a-Pic dataset. Original source: Pick-a-Pic paper.
- Links: (Arxiv | Code | Hugging Face)
- Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, Omer Levy
TL;DR: Large-scale dataset of prompts, generated images, and human preferences
In ImageReward, we saw that human preferences could be used to better align text-to-image (T2I) models. As is often the case, the authors of Pick-a-Pic identified the availability of high-quality, accessible data as a limiting factor in applying these techniques.
In response to this need, they created a web application to collect user preferences. Users engaged with the application by crafting their own prompts and ranking the images generated by T2I models Stable Diffusion and SDXL. At the time the paper’s experiments were performed, the authors had collected 968,965 rankings across 66,798 prompts.
The dataset was used to train a model PickScore which scores images based on their alignment with prompts. When evaluated on the Pick-a-Pic test set, PickScore surpasses ImageReward and even trumps expert human annotators (who are unaware of the original user’s intention) at predicting human preferences.
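A PickScore-style model can be used directly as a preference predictor: embed the prompt and the candidate images CLIP-style, then pick the image with the highest similarity. The model and processor IDs below follow the PickScore release and should be treated as assumptions; the scoring itself is standard CLIP similarity.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Model and processor IDs follow the PickScore release; treat them as assumptions.
processor = AutoProcessor.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
model = AutoModel.from_pretrained("yuvalkirstain/PickScore_v1").eval()

def pick(prompt: str, images: list) -> int:
    """Return the index of the image the model predicts a user would prefer."""
    image_inputs = processor(images=images, return_tensors="pt")
    text_inputs = processor(text=prompt, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        image_embs = model.get_image_features(**image_inputs)
        text_embs = model.get_text_features(**text_inputs)
        image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
        text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
        scores = (text_embs @ image_embs.T).squeeze(0)  # one similarity per image
    return int(scores.argmax())

# Usage: pick("a corgi wearing sunglasses", [Image.open("a.png"), Image.open("b.png")])
```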
⚠️Watch out — while the dataset has a lot of awesome examples, it also has some very toxic content.
If you enjoyed this post and want to geek out on data-centric AI and the future of the field, reach out to me on LinkedIn and come by our booth at the conference — Voxel51 is a platinum sponsor!