The NeurIPS 2024 Preshow: NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples



In recent years, the development of Vision Language Models (VLMs) has marked a significant breakthrough in combining visual perception with linguistic understanding.  

These models have demonstrated impressive capabilities across tasks like visual question answering (VQA). Nevertheless, questions are emerging about the accuracy of the benchmarks used to evaluate these models, leading us to ask: are they as capable as current benchmarks suggest? A new paper accepted to NeurIPS 2024, “NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples,” challenges the status quo of VLM evaluation and reveals a substantial gap between perceived and actual capabilities.

Today, we discuss these challenges with Zhiqiu Lin, co-author of the NaturalBench paper.


NeurIPS 2024 Paper: NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Author: Zhiqiu Lin, co-author of the NaturalBench paper, is a doctoral student at Carnegie Mellon University (Robotics Institute).

The Problem with Existing Benchmarks

The central question posed by this paper is whether VLMs are as capable as current benchmarks suggest.  

According to the authors, even the most advanced models struggle with simple questions about natural images, exposing a gap between perceived performance and actual capabilities. Existing benchmarks often fail to properly assess a VLM’s ability to understand visual content: many of their questions can be answered correctly without looking at the image at all. This is due to:

  • Questions that rely heavily on common sense or general knowledge. For example, “What is the capital of Massachusetts?” or “Is the African elephant the smallest or largest land animal in the world?” can be answered without visual input.
  • An imbalance in answer distribution, creating exploitable biases. For instance, “yes” is a significantly more frequent correct answer in some benchmarks.

As a result, even “blind” language models like ChatGPT, which lack any visual processing capabilities, can achieve surprisingly high scores on these benchmarks. 

This raises serious concerns about whether these evaluations measure a model’s visual understanding.
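To see how exploitable such an imbalance is, consider a “blind” baseline that simply outputs the most common answer. The following is a minimal sketch with hypothetical numbers, not figures from the paper:

```python
# A tiny illustration (hypothetical numbers) of why an imbalanced answer
# distribution is exploitable: a "blind" baseline that always outputs the
# most common answer scores well above chance without ever seeing an image.
from collections import Counter

def blind_majority_accuracy(gold_answers):
    """Accuracy of always predicting the single most frequent gold answer."""
    counts = Counter(gold_answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    return majority_answer, majority_count / len(gold_answers)

# Hypothetical benchmark where "yes" dominates the yes/no questions.
gold = ["yes"] * 70 + ["no"] * 30
answer, acc = blind_majority_accuracy(gold)
print(answer, acc)  # yes 0.7 -- far above the 0.5 chance level, with no vision at all
```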

NaturalBench: A New Standard

To address these limitations, the paper introduces NaturalBench. This benchmark emphasizes vision-centric evaluation and is designed to provide a more accurate assessment by forcing models to depend on visual input.  

The intent is to measure genuine understanding of the image, which in turn exposes biases in existing models and underscores the need for further research into more robust VLMs. Here’s how it works:

  1. Identifying Confusable Image-Text Pairs: The researchers use existing image-caption datasets and powerful VLMs like CLIP to find pairs of images and captions that are easily mismatched. These pairs tend to be visually and semantically similar, making them challenging for models to differentiate (a minimal retrieval sketch follows this list).
  2. Generating Challenging Questions: ChatGPT creates questions designed to highlight the subtle differences between the confusable image pairs. Each question is paired with two images, and the correct answers for those two images must differ. This forces the model to rely on visual information to arrive at the right answer (a hypothetical prompt sketch appears a little further below).
  3. Human Verification: To ensure quality and eliminate ambiguities, human annotators review all generated questions and answers, discarding those deemed incorrect or irrelevant.
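To make the first step concrete, here is a minimal sketch, assuming a Hugging Face CLIP checkpoint and two hypothetical, index-aligned lists `images` and `captions` drawn from an existing image-caption dataset. It illustrates the retrieval idea rather than reproducing the authors’ pipeline:

```python
# A minimal sketch (not the authors' code) of step 1: using CLIP to surface
# image-caption pairs that a model finds hard to tell apart.
# Assumes `images` (PIL images) and `captions` (str) are index-aligned, i.e.
# captions[i] is the caption of images[i].
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_scores(images, captions):
    """Return the image-text cosine-similarity matrix (images x captions)."""
    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return image_emb @ text_emb.T

def find_confusable_pairs(images, captions, margin=0.02):
    """Keep (i, j) pairs where each image scores almost as high on the *other*
    caption as on its own -- i.e. CLIP struggles to match them correctly."""
    sims = clip_scores(images, captions)
    pairs = []
    for i in range(len(images)):
        for j in range(i + 1, len(images)):
            cross = max(sims[i, j], sims[j, i])  # image matched to the wrong caption
            own = min(sims[i, i], sims[j, j])    # image matched to its own caption
            if cross > own - margin:             # the mismatch is nearly as plausible
                pairs.append((i, j))
    return pairs
```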

Pairing two images with two questions that require opposite answers ensures that any blind language model relying solely on language rather than visual perception will perform no better than random guessing.  Interestingly, this method echoes the VQAv2 benchmark developed nearly a decade ago but utilizes contemporary foundation models like CLIP to streamline the process.
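For the question-generation step (step 2 above), the following is a purely hypothetical sketch of the kind of prompt involved; the authors’ actual prompts and filtering are described in the paper, and `gpt-4o-mini` here is only a stand-in for ChatGPT:

```python
# A hypothetical sketch of step 2: asking an LLM to write a question that
# distinguishes two confusable captions, with the opposite answers the
# benchmark requires. Illustrative only; not the authors' prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """You are given two image captions that describe similar scenes.
Caption A: {caption_a}
Caption B: {caption_b}
Write one yes/no question about the visual content such that the correct
answer is "yes" for the image behind Caption A and "no" for the image behind
Caption B. Reply with the question only."""

def generate_contrastive_question(caption_a: str, caption_b: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model; the paper used ChatGPT
        messages=[{"role": "user", "content": PROMPT.format(caption_a=caption_a,
                                                            caption_b=caption_b)}],
    )
    return response.choices[0].message.content.strip()
```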

NaturalBench reveals a stark reality: even state-of-the-art VLMs struggle with tasks humans find trivial. The researchers tested over 50 VLMs, including both open-source and closed-source models. The results are eye-opening:

  • Most models performed only slightly better than random chance, highlighting their limitations in truly understanding visual content.
  • Even powerful closed-source models like GPT-4o significantly lagged behind human performance. This indicates that there’s still much room for improvement in VLM development.
  • The benchmark exposed significant biases in VLMs, particularly a tendency to favor certain answers regardless of the image. Addressing these biases is crucial for developing more robust and reliable VLMs.

Why is NaturalBench So Challenging?

The difficulty of NaturalBench stems from two primary factors:

  1. Compositionality: Solving these questions often requires a combination of visual and linguistic skills, including object recognition, attribute binding, relational understanding, and advanced reasoning abilities like logic, comparison, and counting. Many models currently lack the sophistication to effectively combine these skills.
  2. Biases: Existing VLMs exhibit strong biases towards certain answers. This suggests that they may be relying on language priors and shortcuts rather than genuinely understanding the visual information.

A unique aspect of NaturalBench is its use of manually reviewed datasets to maintain high quality. By screening curated samples with human annotators, the benchmark ensures the questions remain simple for humans yet challenging for models.

The assessment includes a strict scoring system that penalizes models not accounting for visual cues, ensuring that any biases in the models, such as consistently answering “yes” or “no,” are appropriately highlighted.
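Based on the pairing described above, one way such a strict, group-level score can be implemented is sketched below. The metric names and the aggregation are assumptions for illustration; the official evaluation code may differ:

```python
# A minimal sketch (assumptions, not the official evaluation code) of the
# paired scoring idea: each group has two images and two questions, and a
# model only earns full group credit when it answers all four (image,
# question) combinations correctly -- an image-independent answerer cannot,
# because the two images require opposite answers.

def score_group(pred, gold):
    """`pred` and `gold` map (image_id, question_id) -> answer string
    for one group of 2 images x 2 questions."""
    images = {img for img, _ in gold}
    questions = {q for _, q in gold}
    correct = {k: pred[k].strip().lower() == gold[k].strip().lower() for k in gold}

    # Credit a question only if it is answered correctly on both images,
    # credit an image only if both of its questions are answered correctly,
    # and credit the group only if all four answers are correct.
    q_acc = sum(all(correct[(img, q)] for img in images) for q in questions) / len(questions)
    i_acc = sum(all(correct[(img, q)] for q in questions) for img in images) / len(images)
    g_acc = float(all(correct.values()))
    return {"question_acc": q_acc, "image_acc": i_acc, "group_acc": g_acc}

# Example: a blind model that always answers "yes" gets a group score of 0.
gold = {("img1", "q1"): "yes", ("img2", "q1"): "no",
        ("img1", "q2"): "no",  ("img2", "q2"): "yes"}
pred = {k: "yes" for k in gold}
print(score_group(pred, gold))  # {'question_acc': 0.0, 'image_acc': 0.0, 'group_acc': 0.0}
```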

Addressing Spatial and Logical Biases

One critical area where VLMs exhibit deficiencies is spatial reasoning. For instance, determining if individuals are facing the same direction proves difficult for these models. Until more targeted datasets improve such capabilities, practitioners should exercise caution when deploying these models in real-world applications requiring spatial assessments.

Concluding Thoughts

Zhiqiu encourages the community to revisit and balance visual language benchmarks thoughtfully. With NaturalBench, the data curation process is relatively lightweight and automatable, primarily involving the innovative application of current foundation models for data pairing.

This work underscores the need to critically re-evaluate existing VQA benchmarks and adopt new approaches like NaturalBench to ensure we are accurately measuring the true progress of VLMs. Only then can we confidently push the boundaries of visual understanding and create models that are genuinely capable of comprehending and reasoning about the visual world.

NaturalBench exemplifies the shifts needed in VLM development and evaluation. 

By centering evaluation on visual input, it moves the field toward more accurate, unbiased assessments of VLM capabilities. As researchers and developers continue to refine these models, benchmarks like NaturalBench pave the way toward a future of more reliable and versatile vision language technologies.