Rethinking How We Evaluate Multimodal AI
Jun 12, 2025
25 min read

CVPR 2025 reveals why spatial reasoning and subjective ‘vibes’ are redefining how we benchmark AI systems

At CVPR 2025, I spent my first day attending talks on benchmarking and evaluation of multimodal AI systems. Despite the impressive capabilities showcased throughout the conference, these sessions revealed critical gaps between how we evaluate these models and their actual performance in real-world scenarios.
What emerged was a fascinating narrative about the disconnect between public perception and technical reality. Our current multimodal models — despite their seemingly magical abilities — struggle with spatial reasoning tasks that toddlers master effortlessly. Meanwhile, our evaluation systems often reward the wrong things: verbose responses over accurate ones, language shortcuts over visual understanding, and single metrics over nuanced capabilities.
Three speakers particularly stood out, each touching on different dimensions of evaluation:
  • Andre Araujo gave a comprehensive overview of the significant progress in multimodal AI, highlighting both its “magical” capabilities and critical limitations, while proposing innovative solutions for spatial awareness, effective tool use, and fine-grain understanding.
  • Saining Xie addressed the challenges of benchmarking real intelligence, emphasizing the importance of visual-spatial intelligence in conjunction with linguistic skills. He noted advances in self-supervised learning (SSL) in the multimodal era, particularly in optical character recognition (OCR). Xie also introduced VSI-Bench, a benchmark for evaluating a model's spatial understanding from videos, highlighting the limitations of current models in spatial reasoning compared to humans.
  • Lisa Dunlap discussed the challenges of evaluating Large Multimodal Models (LMMs) beyond traditional metrics. She presented Chatbot Arena, a platform that gathers real user conversations and votes to power a live leaderboard. Emphasizing the need for more detailed quality scores, she focused on subjective aspects like reasoning, tone, and style, and introduced the concept of a “vibe check” to assess these properties, advocating for customizable evaluation interfaces that offer tailored model recommendations based on user preferences rather than generalized leaderboards.
As you read through my takeaways from each talk, a common thread becomes clear: we’re entering an era where benchmarking must evolve beyond sterile leaderboards toward more human-centred, spatially-aware, and personalized evaluation frameworks.
The future of AI depends on it.

Andre Araujo: Multimodal AI is Amazing... Yet Deeply Flawed

Multimodal AI models are showing mind-blowing capabilities that seem almost magical.
These systems can seamlessly process combinations of text, images, video, and audio to handle complex user requests. They excel at retrieving information from massive data collections, directly analyzing visual content without external help, and can even leverage specialized tools to expand their capabilities. The architectural innovation behind these systems involves complex fusion mechanisms that align representations across modalities, enabling them to understand relationships between visual elements and textual descriptions in ways previously impossible.
Current models are just scratching the surface of true multimodal intelligence.

The Embarrassing Reality Check

Multimodal AI systems fail spectacularly at tasks that even toddlers can handle.
These supposedly advanced systems struggle with basic spatial reasoning, often failing to count objects correctly or determine the direction of bird flight in an image. They demonstrate poor fine-grain visual understanding, regularly misidentifying specific bird species or artworks despite having access to specialized recognition tools. The fundamental issue lies in their representation learning — current visual encoders excel at global image understanding but lack the dense prediction capabilities needed for precise spatial reasoning and object localization, creating a significant gap between human-like perception and machine interpretation.
The gap between flashy demos and real-world reliability remains frustratingly wide.

HAMMR: Multimodal ReAct

HAMMR introduces a paradigm shift in how AI systems leverage tools to solve complex problems.
Unlike traditional approaches that ineffectively cram dozens of tools and examples into a single prompt, HAMMR implements a hierarchical, modular architecture of specialized agents. Each agent manages a carefully curated subset of tools and can dynamically call other specialized agents when needed, creating a compositional problem-solving approach. This architecture extends the ReAct (Reasoning and Acting) framework beyond text to support truly multimodal variables — including images, video, and audio — as both inputs and outputs, with each observation step providing explicit descriptions of variable assignments.
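To make the loop concrete, here is a minimal sketch of how a hierarchical thought, action, observation cycle with explicit variable assignments might be wired up. The class and method names (Agent, planner, and so on) are hypothetical placeholders rather than HAMMR's actual API, and the planner stands in for an LLM call.

```python
# Illustrative sketch of a hierarchical, ReAct-style thought/action/observation
# loop in the spirit of HAMMR. Names are assumptions for this example, not the
# paper's implementation; the planner would normally be an LLM call.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Agent:
    name: str
    planner: Callable[[str, Dict[str, Any]], tuple]        # -> (thought, action, args)
    tools: Dict[str, Callable] = field(default_factory=dict)
    sub_agents: Dict[str, "Agent"] = field(default_factory=dict)

    def run(self, question: str, variables: Dict[str, Any], max_steps: int = 8):
        for _ in range(max_steps):
            thought, action, args = self.planner(question, variables)
            if action == "FINISH":
                return variables.get(args)      # args names the variable holding the answer
            if action in self.tools:            # call one of this agent's own tools
                result = self.tools[action](**{k: variables[v] for k, v in args.items()})
            else:                               # or delegate to a specialised sub-agent
                result = self.sub_agents[action].run(question, variables)
            # Observation step: bind the (possibly non-text) result to a named variable,
            # so images, video, and audio can flow between steps just like strings.
            variables[f"var_{len(variables)}"] = result
        return None
```

Because every observation is an explicit variable assignment, intermediate images or audio clips can be handed from one specialized agent to another without being squeezed through text.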
HAMMR’s iterative “thought, action, observation” loop enables unprecedented flexibility and error recovery.

TIPS: Engineering Spatial Understanding

TIPS fundamentally reimagines how visual foundation models encode spatial information.
The approach addresses a critical limitation of existing multimodal encoders by introducing a dual-CLS-token architecture in vision transformers: one token aligns with noisy web captions for main-object recognition, while a second token aligns with synthetic, spatially rich captions that describe backgrounds and spatial relationships. This innovation is complemented by self-supervised masked image modeling in the style of the iBOT loss, which incentivizes the model to learn location-sensitive features by reconstructing masked image patches, effectively teaching the system to understand “where” in addition to “what.”
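As a rough illustration of how those pieces could fit together, the sketch below combines two contrastive terms (one per CLS token) with an iBOT-style masked-prediction term. The shapes, weighting, and function names are assumptions for the example, not the released TIPS training code.

```python
# Hedged sketch of a TIPS-style objective: two CLS tokens, each aligned to a
# different caption, plus an iBOT-style masked-image-modelling term.
import torch
import torch.nn.functional as F


def contrastive(img_emb, txt_emb, temperature=0.07):
    """Standard CLIP-style symmetric InfoNCE over a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


def tips_style_loss(cls_global, cls_spatial, web_caption_emb, synthetic_caption_emb,
                    student_patch_logits, teacher_patch_targets, mask, w_mim=1.0):
    # CLS #1: aligned with the noisy web caption (global recognition, the "what").
    loss_web = contrastive(cls_global, web_caption_emb)
    # CLS #2: aligned with the synthetic, spatially rich caption (the "where").
    loss_spatial = contrastive(cls_spatial, synthetic_caption_emb)
    # iBOT-style masked image modelling: the student predicts the teacher's
    # distribution on masked patches, pushing features to be location-sensitive.
    mim = F.cross_entropy(
        student_patch_logits[mask], teacher_patch_targets[mask].softmax(dim=-1)
    )
    return loss_web + loss_spatial + w_mim * mim
```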
This architectural breakthrough enables robust performance across both global and dense vision tasks.

UDON: Mastering Fine-Grain Visual Understanding

UDON addresses the persistent challenge of multi-domain, fine-grained visual recognition.
The technique employs a multi-teacher distillation approach that first trains separate “teacher” models, specialized for distinct domains such as landmarks, food, or artwork, each capturing domain-specific visual cues and taxonomies. It then consolidates this specialized knowledge into a universal “student” model through dynamic sampling strategies that balance domains with vastly different class distributions. The resulting unified model maintains domain expertise without forcing contradictory visual cues to compete, enabling unprecedented accuracy in identifying subtle visual distinctions across diverse categories.
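A minimal sketch of the distillation step, assuming hypothetical model objects and a uniform-over-domains sampler; the actual UDON loss and sampling schedule may differ.

```python
# Illustrative multi-teacher -> single-student distillation in the spirit of UDON.
import random
import torch
import torch.nn.functional as F


def distill_step(student, teachers, batches_by_domain, optimizer, tau=2.0):
    # Domain-balanced sampling: pick a domain uniformly rather than in
    # proportion to its (possibly enormous) number of classes or images.
    domain = random.choice(list(batches_by_domain))
    images = batches_by_domain[domain]

    with torch.no_grad():
        teacher_logits = teachers[domain](images)      # frozen domain expert
    student_logits = student(images, domain=domain)    # shared backbone, per-domain head

    # KL distillation: the student mimics only its own domain's teacher, so
    # contradictory visual cues from other domains never compete on one sample.
    loss = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```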
This elegant solution demonstrates how carefully designed knowledge transfer can overcome fundamental limitations in visual representation learning.

Benchmarking is an Important Reality Check

Rigorous benchmarking exposes the real capabilities and limitations of multimodal AI.
Effective evaluation requires testing both global understanding (like image-text retrieval) and dense prediction tasks (such as semantic segmentation and depth estimation). Challenging benchmarks push models to demonstrate fine-grain visual understanding through “single-hop” and “two-hop” questions that require identifying visual elements and then retrieving related factual knowledge. Performance gaps on these benchmarks reveal fundamental limitations in current architectures — HAMMR shows nearly 20% improvement over standard tool-use approaches, yet still falls short on complex spatial reasoning tasks that require integrated understanding across modalities.
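As a toy illustration of why a single headline number hides these gaps, the snippet below splits benchmark accuracy by question type, separating “single-hop” recognition from “two-hop” questions that also require retrieving a fact. The records and exact-match rule are invented for the example.

```python
# Toy sketch: report accuracy separately for single-hop and two-hop questions.
from collections import defaultdict


def accuracy_by_hops(records):
    """records: iterable of dicts with 'hops', 'prediction', and 'answer' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["hops"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["hops"]] += 1
    return {h: correct[h] / total[h] for h in total}


example = [
    {"hops": 1, "prediction": "atlantic puffin", "answer": "Atlantic puffin"},
    # Identified the bird but missed the associated fact: a two-hop failure.
    {"hops": 2, "prediction": "about 60 cm", "answer": "about 55 cm"},
]
print(accuracy_by_hops(example))  # {1: 1.0, 2: 0.0}
```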
Only through systematic and multifaceted evaluation can we drive the architectural innovations necessary for truly robust multimodal intelligence.

Saining Xie: Language Shortcuts Undermine Visual Intelligence

Current multimodal models are cheating with language instead of truly understanding what they see.
Language intelligence is a powerful shortcut that masks significant gaps in visual understanding. These models often just associate visual symbols with pre-existing knowledge rather than developing genuine visual intelligence. The resulting systems might perform well on benchmarks but fail catastrophically when deployed in real-world scenarios that require robust visual reasoning.
We need benchmarks that force models to develop actual visual intelligence, not just leverage language priors.

Self-Supervised Learning Makes a Comeback

Self-supervised learning models are finally closing the gap with language-supervised approaches, such as CLIP.
Previous performance differences weren’t due to inherent weaknesses in SSL methodology but simply to insufficient scale. By training on billion-scale web data and scaling parameters beyond 1 billion, SSL models now outperform CLIP on average across visual perception benchmarks. The most dramatic improvements show up in challenging domains such as OCR, suggesting that SSL’s true potential was previously underestimated.
SSL’s remarkable responsiveness to data distribution makes it an incredibly promising path forward.

Visual Search Is Non-Negotiable

Deliberate visual processing isn’t optional — it’s a fundamental capability that all multimodal models must possess.
The V* model demonstrates how integrating methodical visual search enables AI to focus on crucial details when answering complex, high-resolution questions. This approach mirrors human cognition, where we allocate additional processing power to difficult visual tasks rather than relying on obvious but potentially misleading cues. OpenAI’s “thinking with images” feature validates this direction by achieving near-perfect scores on the V* benchmark.
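A hedged sketch of what such a deliberate search loop can look like: the model answers from a cheap global view when it is confident, and otherwise proposes a region to crop at full resolution and looks again. The vlm.answer and vlm.propose_region calls are hypothetical placeholders, not the V* implementation.

```python
# Sketch of an iterative visual-search loop over a high-resolution image.
from PIL import Image


def visual_search_answer(vlm, image: Image.Image, question: str, max_steps: int = 4):
    context = [image.resize((448, 448))]          # start from a cheap global view
    for _ in range(max_steps):
        answer, confidence = vlm.answer(question, context)   # hypothetical API
        if confidence > 0.8:                      # confident enough: stop searching
            return answer
        # Ask the model where the missing detail likely is, then look closer.
        left, top, right, bottom = vlm.propose_region(question, context, image.size)
        context.append(image.crop((left, top, right, bottom)))
    return vlm.answer(question, context)[0]
```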
This isn’t just an engineering trick — it’s a core cognitive mechanism that future models cannot succeed without.

Video Benchmarks Miss the Point

Most current video understanding benchmarks are fundamentally broken and test the wrong things.
These benchmarks inadvertently reward knowledge retrieval and linguistic understanding instead of genuine visual-spatial reasoning. Questions like “Why are objects flying?” in scientifically incorrect videos or queries about astronaut equipment don’t test visual understanding — they’re glorified trivia contests. This misalignment creates models with impressive benchmark scores but crippling real-world limitations.
We’re heading down a dangerous path if we continue optimizing for these flawed metrics.

VSI-Bench Forces Spatial Thinking

VSI-Bench represents a new approach that makes models genuinely think in three-dimensional space.
Unlike traditional benchmarks focused on recognition or storytelling, VSI-Bench evaluates mental imagery and spatial manipulation abilities. The tasks — from counting objects and determining relative directions to planning routes and estimating dimensions — require models to track spatial relationships across extended video sequences. By repurposing existing 3D datasets to automatically generate high-quality video-question pairs, this approach becomes both rigorous and practical.
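As a toy example of how spatial questions can be generated automatically from 3D annotations (the real benchmark’s templates and source data are richer), simple geometry over object centroids yields a relative-direction question and its ground-truth answer:

```python
# Toy illustration: derive a relative-direction QA pair from object centroids.
import math


def relative_direction(observer, target, facing):
    """Return 'left', 'right', or 'ahead' for an observer facing the unit vector `facing`."""
    dx, dy = target[0] - observer[0], target[1] - observer[1]
    # Signed angle between the facing direction and the observer->target vector.
    angle = math.degrees(math.atan2(facing[0] * dy - facing[1] * dx,
                                    facing[0] * dx + facing[1] * dy))
    if abs(angle) < 30:
        return "ahead"
    return "left" if angle > 0 else "right"


# Centroids (x, y) taken from a hypothetical 3D scan's object annotations.
scene = {"sofa": (1.0, 2.0), "tv": (3.0, 2.0), "lamp": (1.0, 4.0)}
question = "Standing at the sofa and facing the TV, is the lamp to your left or right?"
facing = (1.0, 0.0)  # normalised sofa -> tv direction
print(question, "->", relative_direction(scene["sofa"], scene["lamp"], facing))  # -> left
```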
The massive performance gap between humans and state-of-the-art models like Gemini 1.5 Pro on VSI-Bench should serve as a wake-up call.

Models Fail at Spatial Logic

AI models recognize objects perfectly but can’t figure out where they are in relation to each other.
Analysis shows that 71% of errors on VSI-Bench stem from spatial reasoning failures rather than visual perception or language understanding problems. Surprisingly, common linguistic reasoning techniques like chain-of-thought prompting actually degrade performance on spatial tasks. Current models can handle objects appearing together in single frames but collapse when tracking relationships across time.
We need fundamentally new mechanisms for spatial reasoning, not just more data or parameter scaling.

Spatial Supersensing Is the Future

The ultimate goal is for AI to understand physical space as effortlessly as humans navigate the world.
This vision extends far beyond current chatbot interactions to encompass always-on spatial intelligence, integrated into devices such as AI glasses. Achieving this capability requires abandoning brute-force encoding of every pixel and frame in favour of more efficient mechanisms that can process unlimited visual information. Current models remain “definitely worse than cats” at this crucial capability despite their impressive performance in other domains.
This is a massively exciting, open frontier that demands novel approaches to truly ground AI in the real world.

Lisa Dunlap: The Problem with Single-Number Leaderboards

Traditional LLM leaderboards are completely missing the point.
They reduce complex AI systems to a single number, which is like rating a chef solely on how fast they cook. Generative AI quality isn’t just about correctness — it’s about tone, style, explanation approach, and countless subjective properties that users actually care about. These nuanced characteristics (or “vibes”) are what make people prefer one model over another in real-world usage.
Single metrics just don’t cut it anymore.

The Chatbot Arena Revolution

Chatbot Arena (now known as LMArena) is changing the evaluation game entirely.
It collects millions of real user conversations and pairwise preferences, allowing people to directly compare anonymous models side by side. With over 100 million queries and 3 million votes, it’s become the go-to platform for understanding how models perform “in the wild” across countless languages and tasks. The battle mode brilliantly forces users to make explicit choices about which response they prefer.
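To show the mechanics in miniature, here is a sketch of turning pairwise votes into a ranking. Arena’s published methodology fits a Bradley-Terry model with confidence intervals; the simple Elo-style update below only illustrates the core idea.

```python
# Minimal sketch: convert pairwise "battle" votes into a model ranking.
from collections import defaultdict


def update_ratings(battles, k=4.0, base=400.0, init=1000.0):
    """battles: list of (model_a, model_b, winner) with winner in {'a', 'b'}."""
    ratings = defaultdict(lambda: init)
    for a, b, winner in battles:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / base))
        score_a = 1.0 if winner == "a" else 0.0
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)


votes = [("model_x", "model_y", "a"), ("model_y", "model_z", "a"),
         ("model_x", "model_z", "a")]
print(sorted(update_ratings(votes).items(), key=lambda kv: -kv[1]))
```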
This is the evaluation that actually matters.

Style Matters More Than We Thought

Users are being tricked by verbose models.
Longer responses consistently win user preferences, even when they’re not better answers. Models like GPT-4 sometimes exploit this by generating unnecessarily lengthy responses (like writing paragraphs to answer “what’s the scientific name for octopus?”). When researchers control for style factors, some models’ rankings change dramatically, revealing they’ve been “style hacking” rather than providing genuinely better answers.
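One common way to “control for style” is to add style covariates, such as the difference in response length, to the preference model so that model effects are estimated with length held constant. The sketch below is a simplified stand-in for Arena’s actual style-control methodology, with invented toy data.

```python
# Sketch: preference regression with a length covariate to separate "better" from "longer".
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["model_x", "model_y", "model_z"]


def featurize(battle):
    """battle: (model_a, model_b, len_a, len_b) -> model indicators + length difference."""
    a, b, len_a, len_b = battle
    x = np.zeros(len(models) + 1)
    x[models.index(a)] += 1.0              # +1 for the model shown as A
    x[models.index(b)] -= 1.0              # -1 for the model shown as B
    x[-1] = (len_a - len_b) / 1000.0       # style covariate: response-length difference
    return x


# (model_a, model_b, len_a, len_b); label 1 if A won, 0 if B won. Toy data only.
battles = [("model_x", "model_y", 900, 300), ("model_y", "model_z", 250, 800),
           ("model_x", "model_z", 1200, 400), ("model_y", "model_x", 200, 950)]
labels = [1, 0, 1, 0]

X = np.stack([featurize(b) for b in battles])
clf = LogisticRegression().fit(X, labels)
print(dict(zip(models + ["length_diff"], clf.coef_[0].round(2))))
```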
Style is the hidden influencer of user preference.

The “Vibe Check” Methodology

Manually identifying all the subjective qualities that matter to users is impossible, so we desperately need to automate “vibe” discovery.
The traditional approach of hand-defining categories for evaluation is limiting: there are numerous qualitative differences in how models respond that we may not think to look for beforehand. This is where the “vibe check” comes in. It leverages LLMs themselves to automatically discover and quantify these subjective properties (“vibes”) directly from model outputs, scoring responses on qualities like friendliness, humor, or formality, and surfacing properties users truly care about that would be hard to anticipate by hand.
A “good vibe” in this context is defined by three criteria:
  • User aligned: It’s a property relevant to the user.
  • Well-defined: Different judges (human or LLM) would consistently agree on its presence (e.g., if a response is “friendly”).
  • Differentiating: Models being evaluated should show clear differences in this property.
The method operates through a four-step process:
  1. Discovering Vibes: An LLM is prompted to identify differences between pairwise model responses from a subset of data (like Chatbot Arena battles), compiling a list of frequently appearing “candidate vibes”.
  2. Scoring Outputs: A panel of smaller LLM judges then scores each model output for the presence of these discovered vibes (e.g., “which output is more friendly?”), providing fine-grained subjective analysis.
  3. Filtering Properties: Vibes are filtered out if there is low agreement among judges (meaning it’s not well-defined) or if the vibe is perceived equally across all model outputs (meaning it does not differentiate).
  4. Quantifying Utility: Finally, logistic regression is used to predict which model generated an output or which output a user would prefer, based solely on these identified “vibes.” This reveals the impact each vibe has on a model’s identity and on user preference (see the sketch after this list).
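Here is a minimal sketch of that final quantification step, with invented vibe names and toy judge scores; the coefficients of the fitted regression indicate how strongly each vibe pushes user preference.

```python
# Sketch of step 4: regress user preference on judge-scored vibe differences.
import numpy as np
from sklearn.linear_model import LogisticRegression

vibes = ["friendly", "funny", "interactive", "formal"]

# Each row: judge-scored difference (response A minus response B) for each vibe,
# in [-1, 1]; the label is 1 if the user preferred A, else 0. Toy numbers.
X = np.array([
    [ 0.8,  0.5,  0.6, -0.4],
    [-0.6, -0.2, -0.5,  0.7],
    [ 0.4,  0.7,  0.3, -0.1],
    [-0.3, -0.6, -0.4,  0.5],
])
y = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
for vibe, coef in zip(vibes, clf.coef_[0]):
    print(f"{vibe:12s} weight on preference: {coef:+.2f}")
```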
The “vibe check” has proven incredibly useful, particularly in explaining why models like Llama 3 ranked highly on preference benchmarks like Chatbot Arena despite potentially lower performance on traditional objective metrics. It revealed that Llama 3 was perceived as more friendly, funny, and interactive, which positively correlated with user preference, unlike GPT-4 and Claude, which came across as more formal or ethics-focused. The method is not tied to Chatbot Arena; it can be applied to any benchmark to uncover distinct, context-specific differences between models that traditional metrics might miss. It can even inform how to adjust a model’s behavior to influence its perceived quality, as shown by re-prompting Gemini to include subjective interpretations, which improved its preference among judges.
Vibes are the missing piece of AI evaluation.

The Future is Personalized Evaluation

One-size-fits-all rankings need to die.
The future of LLM evaluation is undeniably moving towards personalization, recognizing that a “one-size-fits-all” leaderboard with a single numerical score is insufficient for real-world applications.
Benchmarks ultimately aim to evaluate general-purpose agents that cater to the diverse personal needs of every user, and doing so generates a vast amount of information that is currently difficult to interpret.
While platforms like Chatbot Arena and HELM provide extensive data and decompose quality beyond a single number, they still struggle to make this information genuinely useful and actionable for individual users.
  • HELM, despite its comprehensive array of benchmarks and aggregated results, can present a “Lite” leaderboard that is still difficult for a user to interpret when deciding which model to use.
  • Chatbot Arena’s category leaderboards, while offering more detail, are still “somewhat reductive” and complex to understand, especially when considering the many nuances within specific tasks.
  • Even with methods like “vibe checks” that offer personalized insights into subjective properties, the challenge remains in scaling and easily understanding these highly personalized results.
The key to unlocking truly useful evaluation lies in providing interfaces for benchmarks that can customize results for each user. This doesn’t mean creating entirely new evaluations for every person, but rather intelligently presenting existing benchmark data to offer a customized recommendation experience.
This customization can occur in two main ways:
Customizing to a Specific Task:
  • The idea is to allow users to input a specific task they care about and receive a predicted leaderboard tailored to that task.
  • An example of preliminary work in this area is “Prompt to Leaderboard”, a model trained on Arena battles that predicts what the leaderboard will look like based on a user’s specific prompt input.
Customizing to User Preferences:
  • This focuses on the more subjective aspects of model preference.
  • The “vibe check” method, as discussed previously, is a prime example, automatically discovering and quantifying subjective properties that influence user preference.
  • Other preliminary work includes papers like “Report Cards” and “Inverse Constitutional AI”, which aim to generate natural language descriptions of models’ properties.
Despite these advancements, a method that seamlessly combines both task-specific customization and user-preference customization is still an area with significant work to be done.

Why Personalization is the Evolution of Evaluation

Leaderboards exist to guide users and developers in choosing the best model or checkpoint for their needs. However, the definition of “best” is highly individual:
  • For a developer, it’s about selecting the right model for production.
  • For a user, it’s about deciding which subscription to buy.
  • Crucially, what model to use depends fundamentally on “what you’re using it for and who you are as a person”.
The biggest question in evaluation, therefore, is how to present metrics so that two people with the same task but different preferences (or vice versa) can each find the optimal model for them.
To facilitate this personalized future, model providers should consider:
  • Enhanced Customization: Implementing ideas like “memory” (e.g., based on concepts like MemGPT), which learns specific user context (like being a PhD student in AI) through conversations to better customize responses.
  • Utilizing Implicit Feedback: There are many points in multi-turn conversations where users implicitly signal their preferences or what they truly desire. Finding better ways to leverage this feedback signal in the training process or personalization techniques will be incredibly important.
Personalization is the endgame of AI evaluation.

The Path Forward

As Day 1 of CVPR 2025 made abundantly clear, the gap between impressive demos and fundamental limitations can no longer be ignored.
Araujo’s architectural innovations, Xie’s spatial reasoning benchmarks, and Dunlap’s subjective “vibe checks” collectively point to a new evaluation paradigm — one that prizes human-like spatial understanding and user-relevant qualities over misleading metrics. The next generation of truly capable multimodal systems won’t emerge from chasing leaderboard positions, but from confronting these uncomfortable truths about what our models still can’t do.
The question isn’t just “which model ranks highest?” but “which model thinks spatially, understands deeply, and resonates personally?”
That’s the benchmark that matters.