Day 4 of our Best of ICCV series explores a fascinating theme: pushing vision AI beyond conventional boundaries and toward the future of AI. Today's CV research papers tackle questions that challenge our assumptions about what cameras can see, how models should learn, and what it means to truly understand motion. From hyperspectral imaging that reveals material properties invisible to the human eye, to exposing fundamental flaws in how we evaluate foundation models, these researchers are rethinking the future of AI.
We're highlighting computer vision research that matters—work that addresses real limitations and proposes thoughtful solutions.
UnMix-NeRF: What if cameras could see what materials are made of? [1]
Paper title: UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields
Paper authors: Fabian Perez, Sara Rojas, Carlos Hinojosa, Hoover Rueda-Chacón, Bernard Ghanem
Institutions: Universidad Industrial de Santander, KAUST
What it’s about
RGB cameras capture what things look like, but not what they're made of. Two objects can appear identical under certain lighting despite being completely different materials—a phenomenon called metamerism. For robots, AR systems, and simulations that need to interact with the physical world, understanding materials matters as much as
recognizing objects.
UnMix-NeRF integrates spectral unmixing, a technique from hyperspectral imaging, into Neural Radiance Fields to enable simultaneous hyperspectral novel view synthesis and unsupervised material segmentation. The method models spectral reflectance using learned "endmembers" (pure material signatures) and per-point "abundances" (the mixture of materials at each 3D location). By decomposing scenes into diffuse and specular components, UnMix-NeRF captures material properties across dozens of spectral bands, not just the three RGB channels.
Why this CV research matters
Standard NeRF methods operate on RGB data, which fundamentally limits material understanding. Many materials exhibit distinctive behaviors outside the visible spectrum: near-infrared reveals vegetation health, ultraviolet shows mineral fluorescence. Without this information, vision systems can't reliably distinguish materials that happen to look similar in RGB. For robots grasping different objects, AR systems rendering realistic materials, or industrial inspection lines identifying defects, accurate material perception isn't optional; it's part of the future of AI.
How it works
- Global endmember dictionary: Learns K pure material spectral signatures (endmembers) optimized across the entire scene during training, initialized using classical Vertex Component Analysis
- Per-point abundances: Predicts the fractional mixture of materials at each 3D location using a dedicated MLP with softmax activation to enforce physical constraints (non-negativity, sum-to-one); the mixing step is sketched after this list
- Diffuse-specular decomposition: Models diffuse reflectance through scaled endmember combinations and predicts view-dependent specular highlights separately using a dichromatic reflection model
- Unsupervised segmentation: Clusters materials by computing normalized inner products between predicted spectral signatures and learned endmembers, enabling material separation without labels
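To make the mixing model concrete, here is a minimal PyTorch sketch of the linear unmixing head described above, assuming a learned endmember matrix of shape (K, B) and a small MLP that predicts per-point abundance logits. The module and names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralMixer(nn.Module):
    """Toy linear spectral mixing head: per-point abundances over K
    learned endmembers produce a B-band diffuse reflectance."""

    def __init__(self, feat_dim: int, num_endmembers: int = 8, num_bands: int = 64):
        super().__init__()
        # Global dictionary of K "pure material" spectra (K x B),
        # learned jointly with the rest of the model.
        self.endmembers = nn.Parameter(torch.rand(num_endmembers, num_bands))
        # Small MLP mapping a per-point feature to abundance logits.
        self.abundance_mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, num_endmembers)
        )

    def forward(self, point_features: torch.Tensor) -> torch.Tensor:
        # Softmax enforces the physical constraints mentioned above:
        # non-negative abundances that sum to one at every 3D point.
        abundances = F.softmax(self.abundance_mlp(point_features), dim=-1)  # (N, K)
        # Diffuse spectrum is a convex combination of endmember spectra.
        return abundances @ self.endmembers  # (N, B)

# Example: 1024 sampled points with 32-D features -> 64-band spectra.
mixer = SpectralMixer(feat_dim=32)
spectra = mixer(torch.randn(1024, 32))
print(spectra.shape)  # torch.Size([1024, 64])
```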
Key research results
On the NeSpoF dataset, UnMix-NeRF achieved an average PSNR of 33.2 while reducing computation time to 44 minutes. On the BaySpec dataset, the method improved over HyperGS by +3.15 PSNR on the Caladium scene while achieving the lowest Spectral Angle Mapping (SAM) and RMSE values across all scenes (the SAM metric is recapped in the snippet below). For unsupervised material segmentation, this CV research approach achieved an F1 score of 0.41 and mIoU of 0.28, successfully distinguishing materials without any segmentation labels.
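For readers unfamiliar with the metric, SAM measures the angle between a predicted spectrum and a reference spectrum, so it rewards matching spectral shape independently of overall brightness. A minimal NumPy sketch of the standard formulation (not the authors' evaluation code) looks like this:

```python
import numpy as np

def spectral_angle(pred: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-pixel spectral angle (in degrees) between predicted and reference spectra.

    pred, ref: arrays of shape (..., B), one B-band spectrum per pixel.
    Smaller angles mean the predicted spectral shape is closer to the reference.
    """
    dot = np.sum(pred * ref, axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(ref, axis=-1) + eps
    cos = np.clip(dot / norms, -1.0, 1.0)
    return np.degrees(np.arccos(cos))

# Example: mean SAM over a 4x4 image with 64 spectral bands.
pred = np.random.rand(4, 4, 64)
ref = np.random.rand(4, 4, 64)
print(spectral_angle(pred, ref).mean())
```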
Broader research impacts
UnMix-NeRF’s CV research opens pathways for vision systems that understand material composition, not just appearance. By leveraging hyperspectral data within neural scene representations, it enables applications in robotic manipulation (grasping objects based on material properties), augmented reality (rendering materials that respond correctly to lighting), industrial quality control (detecting material defects invisible to RGB cameras), and scientific imaging. The framework also supports intuitive scene editing by modifying material signatures, such as changing wood to metal or adjusting vegetation health, through direct manipulation of the learned endmember dictionary.
Limitations of few-shot CLIP benchmarks: Are we actually testing what we think we’re testing? [2]
Paper title: Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting
Paper authors: Alexey Kravets, Da Chen, Vinay P. Namboodiri
Institution: University of Bath
What it's about
CLIP-based few-shot learning methods claim impressive results: adapting foundation models to new tasks with just a handful of examples. But there's a problem: the benchmarks used to evaluate these methods aren't actually testing what we think they are. Most evaluation datasets were likely seen during CLIP's training, turning what should be a test of generalization into a partially memorized exam.
This CV research paper exposes a fundamental flaw in how few-shot CLIP methods are evaluated and proposes a solution using machine unlearning. The authors demonstrate that standard benchmarks create a "partially transductive" setting where CLIP has already seen the test classes during pre-training. By using Selective Synaptic Dampening to unlearn target datasets from CLIP before evaluation, they create true inductive benchmarks. The CV research results are sobering: methods that appeared to work well show massive performance drops—on average -55%—when evaluated properly.
Why this CV research matters
Few-shot learning is crucial for practical AI deployment: you can't always collect thousands of labeled examples for every new task. Methods that claim to adapt CLIP to novel classes with just a few examples would be incredibly valuable. But if those methods only work because CLIP already knows about the test classes, they're not actually solving the few-shot problem. Genuine few-shot capability matters for real applications in the future of AI, like identifying new types of harmful content, emerging diseases, or any scenario where you genuinely encounter classes the model has never seen.
How it works
- Unlearning pipeline: Adapts Selective Synaptic Dampening (SSD) to remove knowledge about specific datasets from CLIP by selectively updating parameters important for "forget" classes but not for "retain" classes (a minimal sketch follows this list)
- Oracle validation: Trains CLIP from scratch on ImageNet subsets to verify that unlearned models behave similarly to models that never saw the target classes, confirmed through UMAP visualizations and performance comparisons
- Comprehensive evaluation: Tests 13 baseline CLIP few-shot methods across 7 datasets with varying shot counts (1, 2, 4, 8, 16) and unlearning levels—5,880 total experiments
- SEPRES method: Proposes Self-Enhanced Prompt Tuning with Residual Textual Features, combining internal prompt fusion with learnable residual parameters to enable true learning from few-shot examples
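As a rough illustration of the dampening idea, the sketch below assumes per-parameter importance estimates (mean squared gradients) computed separately on a "forget" split and a "retain" split; the helper names and the alpha/lambda constants are illustrative, not the paper's exact procedure or hyperparameters.

```python
import torch

def importance(model, loader, loss_fn):
    """Diagonal importance estimate: mean squared gradient per parameter."""
    imp = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                imp[n] += p.grad.detach() ** 2
    return {n: v / max(len(loader), 1) for n, v in imp.items()}

def ssd_style_dampen(model, forget_imp, retain_imp, alpha=10.0, lam=1.0):
    """Shrink weights that matter far more for the forget classes than
    for the retain classes; all other weights are left untouched."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            f, r = forget_imp[name], retain_imp[name]
            mask = f > alpha * r                                 # disproportionately "forget" weights
            beta = torch.clamp(lam * r / (f + 1e-12), max=1.0)   # dampening factor, never amplifies
            param[mask] *= beta[mask]
```

In practice the importance estimates would be computed on CLIP with the forget set being the target benchmark's classes, after which the dampened model is used for the few-shot evaluation.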
Key research results
When evaluated in the new inductive setting, existing methods showed dramatic performance drops: CoOp fell from 71.4% to 15.3% (-56.1%), PromptSRC from 74.7% to 20.1% (-53.6%), and CLIPLora from 74.9% to 19.2% (-55.7%). In contrast, the proposed SEPRES method dropped only from 75.4% to 56.0% (-19.4%), consistently outperforming all 13 baselines across datasets and shot configurations in the true inductive setting.
Broader research impacts
This work fundamentally changes how we should evaluate few-shot learning on foundation models, with real consequences for the future of AI. The proposed unlearning-based benchmarking pipeline can be applied to any model with an effective unlearning method, providing a general framework for creating true inductive benchmarks. The findings reveal that many published few-shot learning improvements may be illusory, relying on memorization rather than genuine generalization. This has immediate implications for deploying few-shot systems in production environments and highlights the need for more rigorous evaluation standards in the foundation model era.
DuoLoRA: Teaching AI to mix any subject with any style [3]
Paper title: DuoLoRA: Cycle-consistent and Rank-disentangled Content-Style Personalization
Paper authors: Aniket Roy, Shubhankar Borse, Shreya Kadambi, Debasmit Das, Shweta Mahajan, Risheek Garrepalli, Hyojin Park, Ankita Nayak, Rama Chellappa, Munawar Hayat, Fatih Porikli
Institutions: Johns Hopkins University, Qualcomm AI Research
What it’s about
Want to generate images of your dog in the style of Van Gogh's Starry Night, or blend multiple objects with a specific artistic aesthetic? Text-to-image models can do this, but effectively merging content (what to generate) with style (how it should look) from just a few reference images remains surprisingly challenging.
DuoLoRA reframes content-style personalization as a Low-Rank Adapter (LoRA) merging problem. Rather than treating content and style as independent concepts that can be linearly combined, the CV research paper recognizes they're intertwined and proposes three innovations: learning masks in the rank dimension (rather than output dimension) for adaptive flexibility, incorporating SDXL layer priors that recognize certain layers control content while others control style, and introducing "Constyle loss" that leverages cycle-consistency between content and style domains.
Why this CV research matters
Creative applications need fine-grained control over both what appears in generated images and how it looks. Existing approaches like ZipLoRA treat content and style independently, requiring extensive fine-tuning during inference and using fixed ranks for both adapters despite their different representational needs. DuoLoRA’s CV research achieves better content-style blending with 19× fewer trainable parameters (0.07M vs 1.33M), faster training, and more adaptive control—making personalized generation more practical and accessible.
How it works
- ZipRank masking: Learns masks within the rank dimension instead of the output dimension, enabling adaptive rank adjustment where content-heavy layers can have different ranks than style-heavy layers (a toy merge is sketched after this list)
- Layer-specific priors: Recognizes that SDXL's low-resolution layers (up_block.2, down_block.2, mid_block with resolution <32) primarily influence content, while high-resolution layers affect style, enforcing this through nuclear norm minimization under sparsity constraints
- Cycle-consistency loss: Inspired by CycleGAN, adds style to content then removes it (and vice versa), minimizing reconstruction error to ensure balanced merging that respects the interdependent nature of content and style
- Multi-concept extension: Enables styling multiple objects simultaneously through directional prompting and LoRA composition
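To show what masking in the rank dimension means in practice, here is a toy PyTorch sketch that merges a content LoRA and a style LoRA for a single linear layer using learnable per-rank masks. Shapes and names are illustrative; this is a sketch of the idea, not the DuoLoRA implementation.

```python
import torch

def merge_rank_masked_loras(A_c, B_c, A_s, B_s, m_c, m_s):
    """Toy rank-masked merge of a content LoRA and a style LoRA.

    A_*: (r, d_in) down-projections, B_*: (d_out, r) up-projections,
    m_*: learnable masks over the rank dimension, shape (r,).
    Masking ranks (rather than output channels) lets each layer keep a
    different effective rank for content vs. style.
    """
    delta_content = B_c @ torch.diag(m_c) @ A_c
    delta_style = B_s @ torch.diag(m_s) @ A_s
    return delta_content + delta_style  # added onto the frozen base weight

# Example: a 64x64 layer with rank-8 content and style adapters.
r, d = 8, 64
A_c, B_c = torch.randn(r, d), torch.randn(d, r)
A_s, B_s = torch.randn(r, d), torch.randn(d, r)
m_c = torch.sigmoid(torch.randn(r))  # learnable parameters in practice
m_s = torch.sigmoid(torch.randn(r))
delta_W = merge_rank_masked_loras(A_c, B_c, A_s, B_s, m_c, m_s)
print(delta_W.shape)  # torch.Size([64, 64])
```

Because only the two (r,)-shaped masks per layer are trained during merging, the parameter count stays tiny compared with re-tuning the adapters themselves, which is the efficiency argument made above.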
Key research results
Across multiple benchmarks, DuoLoRA’s CV research outperformed state-of-the-art methods: DINO similarity of 0.56 (content preservation), CLIP-I of 0.69 (style adherence), and CSD-s of 0.48 on the Dreambooth+StyleDrop benchmark. In user studies with 50 participants completing 1,000 evaluations, DuoLoRA was preferred 50% of the time—more than double the next-best method. The approach also demonstrated strong multi-concept stylization, successfully blending 2-4 objects with consistent artistic styles.
Broader research impacts
DuoLoRA makes personalized image generation more practical by reducing both computational requirements and parameter overhead. The ability to precisely control content-style blending enables applications in digital art creation, advertising (generating branded content in specific styles), game development (creating consistent visual assets), and creative tools that democratize artistic expression. The framework's efficiency and flexibility also make it suitable for on-device deployment and real-time creative workflows, pointing toward the future of AI creative tooling.
VLM4D: Can vision models actually understand motion through time? [4]
Paper title: VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
Paper authors: Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, Achuta Kadambi
Institutions: UCLA, Microsoft, UCSC, USC
What it’s about
Humans watch a car drive past and instantly know: it's moving right, turning left, receding into the distance. We intuitively reason about motion in 4D (3D space plus time). Vision-language models process videos, but can they actually reason about spatiotemporal dynamics? Or are they just aggregating 2D features and guessing?
VLM4D is the first CV research benchmark specifically designed to evaluate spatiotemporal reasoning in
vision-language models. Comprising 1,000 videos with over 1,800 carefully curated question-answer pairs, the benchmark tests translational movement, rotational motion, perspective awareness, and temporal continuity. The dataset combines real-world exocentric (third-person) and egocentric (first-person) videos with synthetic scenes, all requiring models to reason about motion dynamics rather than static scene understanding.
Why this CV research matters
For embodied AI, robotics, and autonomous vehicles, understanding how objects move through space and time is fundamental. A robot needs to predict where a moving object will be, not just what it is. An autonomous vehicle must reason about other cars' trajectories from changing perspectives. Current VLMs excel at image understanding, but that impressive performance doesn't translate to robust 4D reasoning. This benchmark CV research reveals a critical gap: even state-of-the-art models perform far below human performance (98.8% accuracy) on tasks that require integrating spatial, temporal, and perspective information.
How it works
- Diverse video sources: 37.5% exocentric (DAVIS, YouTube-VOS), 37.5% egocentric (Ego4D), 25% synthetic (Cosmos world model), all temporally segmented to 3-8 seconds focusing on key events
- Four evaluation categories: Translational movement (55%), rotational movement (19%), spatiotemporal counting (17%), and false positives (9%) to test critical reasoning (a toy per-category scoring loop is sketched after this list)
- Human-in-the-loop annotation: Direct human annotations followed by LLM augmentation for multiple-choice options, with three-round cross-validation to eliminate ambiguities
- Two solution approaches tested: Supervised fine-tuning on spatiotemporal-rich data, and 4D feature field reconstruction that lifts 2D video features into temporally coherent 4D representations
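As a rough idea of how a multiple-choice benchmark like this gets scored per category, here is a minimal Python sketch. The JSON field names and the `model_answer_fn` wrapper are hypothetical, not the released VLM4D tooling; the wrapper is whatever you use to query a VLM and map its response to one of the provided options.

```python
import json
from collections import defaultdict

def evaluate(model_answer_fn, qa_path: str) -> None:
    """Toy scorer for a multiple-choice video QA benchmark.

    Assumes each record has a video path, question, list of options,
    the ground-truth option, and a category label (e.g. "translational").
    """
    per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    with open(qa_path) as f:
        records = json.load(f)
    for rec in records:
        pred = model_answer_fn(rec["video"], rec["question"], rec["options"])
        per_category[rec["category"]][0] += int(pred == rec["answer"])
        per_category[rec["category"]][1] += 1
    for cat, (c, n) in per_category.items():
        print(f"{cat}: {100.0 * c / n:.1f}% ({c}/{n})")
    total_c = sum(c for c, _ in per_category.values())
    total_n = sum(n for _, n in per_category.values())
    print(f"overall: {100.0 * total_c / total_n:.1f}%")
```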
Key research results
The best-performing model, Gemini-2.5-Pro, achieved only 62.0% accuracy compared to human performance of 98.8%. Most open-source models scored below 55%, with significant variation across categories. Notably, chain-of-thought reasoning provided minimal improvement over direct answers, suggesting fundamental deficiencies in spatiotemporal understanding rather than just reasoning capability. Fine-tuning on spatiotemporal data improved Qwen2.5-VL from 43.4% to 56.3%, while 4D feature field reconstruction boosted InternVideo2 accuracy from 36.0% to 37.4% on direct output.
Broader research impacts
VLM4D’s CV research exposes a critical limitation in current vision-language models: they process videos but don't truly understand motion through spacetime. This matters for any application requiring physical reasoning, including robotics, autonomous driving, video understanding, sports analysis, and interactive AI. The benchmark provides a rigorous evaluation framework and demonstrates that targeted improvements (spatiotemporal SFT, 4D reconstruction) show promise, though significant gaps remain. By quantifying these limitations precisely, the work guides the future of AI research toward VLMs with the genuine spatiotemporal cognition essential for embodied AI.
Can we predict how spinning objects will move? [5]
Paper title: Forecasting Continuous Non-Conservative Dynamical Systems in SO(3)
Paper authors: Lennart Bastian, Mohammad Rashed, Nassir Navab, Tolga Birdal
Institutions: Technical University of Munich, Munich Center of Machine Learning, Imperial College London
What it’s about
Tracking rotating objects is fundamental in computer vision and robotics, whether it's a drone tumbling through the air, a tool spinning in a robot's gripper, or a spacecraft maneuvering in orbit. But predicting how these objects will continue rotating is remarkably difficult. Unknown quantities such as the moment of inertia complicate the dynamics, sensor measurements are noisy and sometimes missing, and real-world forces like friction violate the energy conservation assumptions most methods rely on.
This CV research paper proposes modeling rotational dynamics on the manifold of 3D rotations (SO(3)) using Neural Controlled Differential Equations guided by Savitzky-Golay filtering. Unlike existing methods that assume energy conservation or constant velocity, the approach learns a general latent dynamical system that can handle non-conservative forces (friction, damping, external torques) while remaining robust to noisy, sparse observations. This CV research method treats rotation prediction as a continuous-time problem with geometric awareness of SO(3)'s structure.
Why this CV research matters
Most rotation forecasting methods make simplifying assumptions that break in real-world scenarios. Energy isn't always conserved—friction dissipates it, external forces add it. Constant velocity assumptions fail for objects under active control or external influences. Meanwhile, sensors drop frames, add noise, and provide irregular measurements. For autonomous systems tracking moving objects, robots planning dynamic grasps, or spacecraft estimating tumbling debris, you need predictions that work despite these challenges, not methods that only work when everything is ideal.
How it works
- Savitzky-Golay control paths: Constructs smooth, geometrically valid paths directly on SO(3) by regressing polynomials in the tangent space (Lie algebra) and mapping back to the manifold, providing robust denoising and numerically stable integration (the tangent-space idea is sketched after this list)
- Neural controlled differential equations: Learns a latent representation of the underlying physical dynamics by integrating with respect to the filtered control path, capturing unknown moment of inertia and external forces
- Second-order formulation: Incorporates both first and second derivatives of the control path to capture angular acceleration, preventing temporal phase shifts in predictions
- Weighted filtering: Learns adaptive weights for the Savitzky-Golay regression that optimize for extrapolation rather than just interpolation, addressing boundary artifacts
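To give a feel for the tangent-space filtering step, here is a minimal SciPy sketch that smooths a rotation sequence by fitting a low-degree polynomial to rotation vectors in the Lie algebra around each sample and mapping the result back onto SO(3). It is unweighted and ignores the second-order terms and the learned neural CDE described above; it is an assumption-laden sketch, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def so3_savgol_smooth(rotations: R, window: int = 7, degree: int = 2) -> R:
    """Toy Savitzky-Golay-style smoothing of a rotation sequence on SO(3)."""
    n, half = len(rotations), window // 2
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        center = rotations[i]
        # Express neighbours as rotation vectors in the tangent space at `center`.
        local = (center.inv() * rotations[lo:hi]).as_rotvec()   # (m, 3)
        ts = np.arange(lo, hi) - i                               # window time stamps
        # Fit one polynomial per tangent coordinate; its value at t = 0
        # is the constant term (last row of the coefficient array).
        coeffs = np.polyfit(ts, local, deg=min(degree, len(ts) - 1))
        fitted = coeffs[-1]
        # Exponential map back to the manifold.
        smoothed.append((center * R.from_rotvec(fitted)).as_quat())
    return R.from_quat(np.array(smoothed))

# Example: smooth a jittery sequence of 50 rotations.
noisy = R.from_rotvec(np.cumsum(0.05 * np.random.randn(50, 3), axis=0))
clean = so3_savgol_smooth(noisy)
print(len(clean))  # 50
```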
Key research results
On simulated non-conservative systems with external torques and damping, this CV research method achieved 0.42-0.89 degree error at 0.8 second horizons across five scenarios, outperforming SO(3)-GRU (0.78-1.81°), conservation-based methods (2.14-2.72°), and LEAP (4.12-11.45°). On real-world Oxford Motion Dataset trajectories, it achieved 2.18-2.32 degree error compared to 2.83-3.04° for SO(3)-GRU. The method also demonstrated order-of-magnitude fewer function evaluations during training, indicating more efficient numerical integration.
Broader research impacts
This CV research enables robust rotation forecasting in realistic scenarios where objects experience friction, air resistance, or active control forces, conditions that are ubiquitous in robotics and autonomous systems. The ability to predict rotational motion from noisy, sparse sensor data has immediate applications in object tracking (compensating for missed detections), robotic manipulation (dynamic grasping of spinning objects), aerospace (attitude prediction for satellites), and sensor fusion (combining irregular measurements from multiple sources). By learning dynamics in simulation and generalizing to real-world sensor noise, this approach offers a practical path toward deploying rotation forecasting in production systems.
Wrapping up day 4
The five papers we've explored today challenge us to think differently about computer vision research:
- From limited sensing to rich observation: Hyperspectral imaging captures material properties invisible to RGB cameras
- From misleading benchmarks to honest evaluation: Creating tests that actually measure what we claim to measure
- From brute-force to structured solutions: Leveraging problem structure and domain knowledge for more efficient learning
- From static to dynamic understanding: Building models that truly comprehend motion through spacetime, not just frame sequences
As computer vision matures and we look towards the future of AI, we need more than incremental performance gains on established benchmarks. We need deeper sensing modalities, more rigorous evaluation that doesn't reward memorization, architectures that respect the structure of visual problems, and genuine understanding of the physical dynamics that govern the real world. Day 4 shows us CV researchers willing to question assumptions, expose limitations honestly, and build toward an AI future of more capable and trustworthy vision systems.
Thanks for following along. If you haven’t already, explore the other exciting breakthroughs from our Best of ICCV 2025 series. Register for our virtual meetup to learn more about the future of AI and discover more groundbreaking CV research.
References
[1] Perez, F., Rojas, S., Hinojosa, C., Rueda-Chacón, H., and Ghanem, B. "UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025. arXiv:2506.21884.
[2] Kravets, A., Chen, D., and Namboodiri, V.P. "Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025. arXiv:2507.20834.
[3] Roy, A., Borse, S., Kadambi, S., Das, D., Mahajan, S., Garrepalli, R., Park, H., Nayak, A., Chellappa, R., Hayat, M., and Porikli, F. "DuoLoRA: Cycle-consistent and Rank-disentangled Content-Style Personalization," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025. arXiv:2504.13206.
[4] Zhou, S., Vilesov, A., He, X., Wan, Z., Zhang, S., Nagachandra, A., Chang, D., Chen, D., Wang, X.E., and Kadambi, A. "VLM4D: Towards Spatiotemporal Awareness in Vision Language Models," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025. arXiv:2508.02095.
[5] Bastian, L., Rashed, M., Navab, N., and Birdal, T. "Forecasting Continuous Non-Conservative Dynamical Systems in SO(3)," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025. arXiv:2508.07775.