The 2025 ICCV conference is packed with innovation, but our focus is simple: highlight ICCV research papers that matter. We want to show how computer vision is moving beyond benchmarks and into the messy, complex environments where people actually use AI—retail stores, navigation systems, urban planning, and product recommendations. Follow along on our four-part blog and
virtual meetup series.
This is the second post in our four-part series, and it focuses on ICCV papers that highlight vision language models. We hope they inspire new ideas and open doors for conversations, collaboration, and opportunities to lift this fantastic community.
MINDCUBE: Teaching vision language models to imagine space they can't see [1]
ICCV paper authors: Baiqiao Yin, Qineng Wang, Pingyue Zhang, and team
Institutions: Northwestern University, Stanford University, New York University, University of Washington
What it’s about
Humans can walk into a room, see it from one angle, and immediately understand what's behind them. We build mental maps of spaces even when we can't see the whole picture. Can vision language models do the same?
This paper introduces MINDCUBE, a benchmark with 21,154 questions across 3,268 images that tests whether VLMs can form spatial mental models from limited viewpoints. The benchmark evaluates three core capabilities: representing positions (cognitive mapping), understanding orientations (perspective-taking), and simulating dynamics (mental "what-if" movements). Current state-of-the-art VLMs perform only marginally better than random guessing on these tasks.
Why it matters
For AI to operate in the real world—whether in robotics,
autonomous vehicles, or augmented reality—it needs to reason about space beyond what's immediately visible. A robot navigating a warehouse needs to remember where objects are even when they're out of view. An AR navigation system needs to understand spatial relationships to provide useful directions. MINDCUBE exposes a critical gap: VLMs excel at recognizing objects but struggle with spatial reasoning across multiple views.
How it works
- Benchmark design: Three camera movement patterns (rotation, around, among) with questions requiring reasoning about occluded objects and cross-view consistency
- Evaluation framework: Tests spatial reasoning through questions about object locations, perspective shifts, and hypothetical movements
- Three scaffolding approaches: View interpolation (adding intermediate frames), free-form reasoning chains, and cognitive maps (structured 2D representations)
- Training strategy: "Map-then-reason" approach where models first generate cognitive maps, then reason over them using both supervised fine-tuning and reinforcement learning (a minimal prompting sketch follows this list)
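To make the map-then-reason idea concrete, here is a minimal inference-time sketch, assuming a generic multi-image VLM backend. The `query_vlm` callable, the prompt wording, and the JSON map format are illustrative assumptions, not the paper's exact protocol.

```python
from typing import Callable, List

def map_then_reason(
    query_vlm: Callable[[List[str], str], str],  # (image paths, prompt) -> model text
    images: List[str],
    question: str,
) -> str:
    """Two-stage "map-then-reason" prompting sketch.

    `query_vlm` is whatever multi-image VLM call is available to you;
    the prompts and map format below are illustrative assumptions.
    """
    # Stage 1: ask the model to externalize a cognitive map as structured text.
    map_prompt = (
        "You see several views of the same scene. "
        "Output a JSON 2D map {object_name: [x, y]} on a 10x10 grid, "
        "including objects that are occluded in some of the views."
    )
    cognitive_map = query_vlm(images, map_prompt)

    # Stage 2: reason over the generated map to answer the spatial question.
    reason_prompt = (
        f"Here is your map of the scene:\n{cognitive_map}\n"
        f"Using this map, reason step by step and answer: {question}"
    )
    return query_vlm(images, reason_prompt)
```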
Key result
Starting from a baseline of 37.8% accuracy, the "map-then-reason" approach with supervised fine-tuning boosted performance to 60.8%. Adding reinforcement learning pushed it further to 70.7%. Critically, vision language models that actively generated and reasoned over internal cognitive maps dramatically outperformed those given maps as input, demonstrating the importance of learning to construct spatial representations rather than simply consuming them.
Broader impact
MINDCUBE establishes a new standard for evaluating spatial reasoning in vision language models. The findings suggest that current VLMs lack fundamental spatial cognition capabilities that humans take for granted. By providing both a rigorous benchmark and effective training approaches, this work opens pathways toward VLMs that can truly understand and navigate 3D space—a critical requirement for embodied AI applications in
robotics, autonomous systems, and spatial computing.
SGBD: Can AI recommend products that actually match your needs? [2]
ICCV paper title: SGBD: Sharpness-Aware Mirror Gradient with BLIP-Based Denoising for Robust Multimodal Product Recommendation
ICCV paper authors: Sarthak Srivastava, Kathy Wu
Institution: Amazon
What it's about
Online shopping depends on recommendation systems that combine product images, descriptions, and user behavior to suggest relevant items. But what happens when product images are low-quality, descriptions are cluttered with irrelevant keywords, or merchants constantly update their listings? These real-world challenges create noise that degrades recommendation accuracy. This paper introduces SGBD (Sharpness-Aware Mirror Gradient with BLIP-Based Denoising), a vision language model training strategy that makes multimodal recommender systems more robust.
Why it matters
Most recommendation systems struggle when real-world data gets messy. A product photo taken during a sale event might include promotional banners that confuse the model. Product descriptions often contain irrelevant metadata or SEO keywords that dilute the actual product information. When these systems fail, customers see irrelevant recommendations, leading to poor shopping experiences and lost sales. SGBD addresses these challenges head-on, making vision language model recommendations more reliable in production environments.
How it works
- BLIP-based denoising: Uses a vision-language model to clean noisy product images and text descriptions before they enter the recommendation pipeline, filtering out promotional overlays, irrelevant text, and low-quality visual features
- Sharpness-aware optimization: Combines Sharpness-Aware Minimization with Mirror Gradient to guide the vision language model toward flat local minima in the loss landscape, making it less sensitive to small changes in input data (a simplified update step is sketched after this list)
- Theoretical grounding: The method introduces a curvature-dependent regularization term that dynamically adjusts optimization, penalizing sharp minima while encouraging convergence to flatter, more generalizable solutions
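For intuition, here is a minimal sketch of a generic sharpness-aware (SAM-style) training step in PyTorch: ascend to a nearby high-loss point, then update with the gradient computed there. This is not the paper's full SGBD algorithm (which also incorporates Mirror Gradient and BLIP-based denoising); the function name and the `rho` default are assumptions.

```python
import torch

def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    """One generic sharpness-aware update: perturb weights toward higher loss,
    then step with the gradient computed at the perturbed point."""
    inputs, targets = batch

    # First forward/backward pass: gradient at the current weights.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Ascent step: epsilon = rho * grad / ||grad||, using the global gradient norm.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2) + 1e-12
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                perturbations.append(None)
                continue
            eps = rho * p.grad / grad_norm
            p.add_(eps)
            perturbations.append(eps)
    optimizer.zero_grad()

    # Second forward/backward pass: "sharpness-aware" gradient at the perturbed point.
    loss_fn(model(inputs), targets).backward()

    # Restore the original weights, then apply the optimizer step with that gradient.
    with torch.no_grad():
        for p, eps in zip(model.parameters(), perturbations):
            if eps is not None:
                p.sub_(eps)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```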
Key result
Across four Amazon product datasets (Baby, Sports, Electronics, Clothing), SGBD achieved an average performance improvement of 24.5% across key metrics (recall, precision, MAP, and NDCG). When tested with noisy inputs, the vision language model method demonstrated superior robustness compared to baseline approaches, maintaining stable performance even when product features were perturbed.
Broader impact
SGBD offers a practical solution for real-world recommender systems operating at scale. By addressing both data quality issues and model robustness simultaneously, it enables more reliable product recommendations in dynamic retail environments where data is constantly changing. The vision language model’s compatibility with existing recommendation architectures makes it particularly valuable for deployment in production systems.
Sari Sandbox: Can AI agents actually shop like humans? [3]
ICCV paper title: Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents
ICCV paper authors: Janika Deborah Gajo, Emmanuel G. Maminta, and team
Institution: University of the Philippines Diliman
What it’s about
Most embodied AI research focuses on household environments—kitchens, bedrooms, living rooms. But what about retail spaces? Shopping involves complex behaviors: navigating stores, comparing products, reading labels, and making decisions based on multiple criteria.
Sari Sandbox is a high-fidelity 3D retail store simulator designed specifically for training and evaluating embodied AI agents in shopping scenarios. Built on Unity's Universal Render Pipeline, it features over 250 photorealistic grocery products across three distinct store layouts, a functional self-checkout system, and support for both VR-based human interaction and API-controlled AI agents. The system includes SariBench, a dataset of human demonstrations across varied task difficulties.
Why it matters
Existing embodied AI simulators like Habitat and AI2-THOR excel at household tasks but don't address
retail-specific challenges. Retail environments require agents to read product labels, compare nutritional information, handle varied packaging types, and navigate crowded shelves—capabilities that aren't tested in typical domestic settings. As retailers increasingly explore automation and digital twins for store optimization, having a vision language model platform to develop and benchmark these capabilities becomes essential.
How it works
- Photorealistic products: 250 3D models based on real grocery items, categorized into 11 food types, with dynamic expiration dates and barcodes that update procedurally
- Three store layouts: Different configurations based on real convenience store surveys, each with unique arrangements and traffic patterns
- VR integration: Supports human benchmarking through Meta Quest 2, with hand interaction, haptic feedback, and tunneling vignette to reduce cybersickness
- Python API: Enables programmatic control of avatar movement, hand manipulation, object interaction, and environment data retrieval (a hypothetical usage sketch follows this list)
- Benchmark tasks: Easy (find and pick up items), Average (scan items at checkout), Difficult (compare products and make decisions)
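Here is a hypothetical sketch of what driving an agent through the Python API could look like for an "easy" task. The `SandboxClient` interface, its method names, and the observation fields are invented for illustration; the actual Sari Sandbox API may differ.

```python
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class Observation:
    rgb: bytes                 # current camera frame from the simulator
    visible_items: List[str]   # product labels the agent can currently read

class SandboxClient(Protocol):
    """Assumed interface for a simulator connection; names are illustrative."""
    def reset(self, layout: str) -> Observation: ...
    def move(self, dx: float, dz: float) -> Observation: ...
    def pick(self, item_id: str) -> Observation: ...

def easy_task(client: SandboxClient, target: str, max_steps: int = 200) -> bool:
    """'Find and pick up an item': walk the store until the target label
    is visible, then pick it up."""
    obs = client.reset(layout="store_1")
    for _ in range(max_steps):
        if target in obs.visible_items:
            client.pick(target)
            return True
        obs = client.move(dx=0.0, dz=0.5)  # step forward; a real agent would plan
    return False
```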
Key result
Human participants completed easy tasks in an average of 47-73 seconds with near-100% success rates. In contrast, a VLM-powered embodied agent took 420-780 seconds (up to 16× longer) with success rates under 70%. The performance gap highlights significant challenges in VLM-based navigation, decision-making, and task planning for retail scenarios, despite the agent having full access to the simulator's APIs.
Broader impact
Sari Sandbox fills a critical gap in embodied AI and vision language model research by providing a retail-focused simulation environment with human baseline data. Beyond academic research, the platform has practical applications for retailers exploring automation, store layout optimization, and digital twin technologies. The stark human-agent performance gap also reveals important limitations in current VLM-based approaches, suggesting areas where architectural improvements or training strategies could yield significant gains.
Can sky images predict the air you breathe? [4]
ICCV paper title: Forecasting and Visualizing Air Quality from Sky Images with Vision-Language Models
ICCV paper authors: Mohammad Saleh Vahdatpour, Maryam Eyvazi, Yanqing Zhang
Institutions: Georgia State University, Savannah College of Art and Design
What it’s about
Traditional air quality monitoring relies on expensive sensor networks with limited coverage. What if we could estimate pollution levels just by looking at the sky? And what if AI could show you what cleaner—or more polluted—air would look like?
This paper presents a vision-language model framework that predicts air quality levels directly from photographs of the sky and generates visual simulations showing how the sky would appear under different pollution conditions. The system combines statistical texture analysis (Gabor filters) with supervised learning for pollution classification, then uses a vision language model (BLIP-2) to guide generative synthesis of counterfactual sky scenes corresponding to different AQI (Air Quality Index) categories.
Why it matters
Over 99% of the global population lives in areas where air pollution exceeds WHO safety guidelines. Yet sensor-based monitoring systems are expensive and often unavailable in low-resource regions. This vision-language approach offers a scalable, low-cost alternative. More importantly, by generating realistic visualizations of different pollution levels, the system makes abstract AQI numbers tangible and understandable—helping people make informed decisions about outdoor activities, health precautions, and environmental advocacy.
How it works
- Feature extraction: Multi-orientation Gabor filters capture directional texture patterns in sky images (haze, cloud density, chromatic shifts) that correlate with particulate matter concentrations
- Classification pipeline: Random forest classifier on extracted features achieves competitive performance, benchmarked against CNN baselines (a minimal sketch of this pipeline follows the list)
- VLM-guided generation: BLIP-2 acts as semantic controller, translating predicted AQI grades into descriptive prompts (e.g., "thick smog with reddish haze") that condition a diffusion-based image generator
- User interface: Interactive application allows real-time uploads, displays predictions with EPA color codes, and generates side-by-side comparisons across pollution scenarios
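As a rough illustration of the classical half of the pipeline, the sketch below extracts multi-orientation Gabor statistics from a sky image and trains a random forest on them. Kernel parameters, feature statistics, and function names are illustrative assumptions rather than the paper's exact configuration.

```python
import cv2
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def gabor_features(image_bgr: np.ndarray, orientations: int = 8) -> np.ndarray:
    """Summarize sky texture with a small multi-orientation Gabor filter bank."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    feats = []
    for i in range(orientations):
        theta = np.pi * i / orientations  # filter orientation in radians
        kernel = cv2.getGaborKernel(
            ksize=(21, 21), sigma=4.0, theta=theta,
            lambd=10.0, gamma=0.5, psi=0.0,
        )
        response = cv2.filter2D(gray, cv2.CV_32F, kernel)
        feats.extend([response.mean(), response.std()])  # per-orientation statistics
    return np.array(feats)

def train_classifier(train_images, train_labels):
    """Fit a random forest on Gabor features; labels are AQI categories per image."""
    X = np.stack([gabor_features(img) for img in train_images])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, train_labels)
    return clf
```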
Key result
The CNN variant achieved 89.4% accuracy and 88.1% macro F1-score on pollution classification. Generated synthetic images maintained 89.6% classification accuracy when reclassified, demonstrating semantic consistency. Visual realism metrics showed strong performance: SSIM of 0.823 and FID score of 18.6. In user studies, the VLM-augmented interface achieved interpretability scores of 4.3/5 compared to 3.5/5 for text-only outputs.
Broader impact
This work demonstrates how vision language models can bridge the gap between technical predictions and public understanding. By providing both numerical forecasts and visual counterfactuals, the system makes environmental data more accessible and actionable. The approach has practical applications in
public health communication, urban planning, and environmental advocacy. Future work will incorporate energy-efficient green CNN architectures with FPGA-based incremental learning for sustainable edge deployment.
Wrapping up day 2
The four papers we've explored today share a common thread: they address the messy reality of deploying vision language models in production environments. Whether it's noisy product data in e-commerce, limited viewpoints in spatial reasoning, retail automation challenges, or making environmental data understandable, these researchers are tackling problems that matter beyond academic benchmarks.
Together, they signal an important shift:
- From clean benchmarks to noisy reality: Robust training strategies that handle real-world data quality issues
- From passive perception to active spatial understanding: Teaching models to build internal representations of space
- From general simulators to domain-specific environments: Purpose-built platforms for retail and embodied AI
- From numbers to narratives: Visualization that makes AI predictions interpretable and actionable
ICCV 2025 isn't just about seeing better—it's about building vision systems that work reliably in the complex, dynamic environments where people need them most.
Continue exploring the breakthroughs from The Best of ICCV 2025 series, and register for our virtual meetup to connect and discover how AI transparency and multimodal AI are shaping the next era of computer vision.
References
[1] Yin, B., Wang, Q., Zhang, P., Zhang, J., Wang, K., Wang, Z., Zhang, J., Chandrasegaran, K., Liu, H., Krishna, R., Xie, S., Li, M., Wu, J., and Fei-Fei, L. "MINDCUBE: Spatial Mental Modeling from Limited Views," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025. arXiv:2506.21458.
[2] Srivastava, S., and Wu, K. "SGBD: Sharpness-Aware Mirror Gradient with BLIP-Based Denoising for Robust Multimodal Product Recommendation," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025.
[3] Gajo, J.D., Merales, G.P., Escarcha, J., Molina, B.A., Nartea, G., Maminta, E.G., Roldan, J.C., and Atienza, R.O. "Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025.
[4] Vahdatpour, M.S., Eyvazi, M., and Zhang, Y. "Forecasting and Visualizing Air Quality from Sky Images with Vision-Language Models," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025. arXiv:2509.15076.