Best of WACV 2026 - April 30, 2026
Apr 30, 2026
9AM - 11AM PT
Online. Register for the Zoom!
About this event
Welcome to the Best of WACV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year’s conference, streamed live from the authors to you.
Schedule
Zero-Shot Coreset Selection via Iterative Subspace Sampling
Deep learning's reliance on massive datasets leads to significant costs in storage, annotation, and training. Although coreset selection aims to mitigate these costs by finding performant data subsets, state-of-the-art methods typically require expensive ground-truth labels and dataset-specific training. To overcome these scalability issues, ZCore introduces a zero-shot approach that functions without labels or prior training on candidate data. Instead, ZCore uses foundation models to generate a zero-shot embedding space for unlabeled data, then quantifies the relative importance of each example based on overall coverage and redundancy within the embedding distribution. On ImageNet, ZCore outperforms previous label-based methods at a 90% prune rate while eliminating the need to annotate over one million images.
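As a rough illustration of the coverage-versus-redundancy idea (a simplified sketch, not ZCore's actual algorithm), one can score each unlabeled example in a foundation-model embedding space and keep the top-scoring fraction; the probe-based coverage term and the scoring weights here are illustrative assumptions:

```python
import numpy as np

def zero_shot_coreset(embeddings, keep_frac=0.1, rng=None):
    """Hypothetical sketch: score examples by embedding-space coverage
    minus redundancy, then keep the top-scoring fraction.
    This simplifies the idea described above; it is not ZCore itself."""
    rng = np.random.default_rng(rng)
    # Work with unit-normalized embeddings so dot products are cosines.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = len(X)
    # Coverage proxy: cosine distance to the nearest of a few randomly
    # sampled probe examples -- points in sparsely covered regions score higher.
    probes = X[rng.choice(n, size=min(64, n), replace=False)]
    coverage = np.min(1.0 - X @ probes.T, axis=1)
    # Redundancy proxy: cosine similarity to the nearest other example --
    # near-duplicates score lower.
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)
    redundancy = sims.max(axis=1)
    score = coverage - redundancy
    k = max(1, int(keep_frac * n))
    return np.argsort(score)[::-1][:k]
```

Note that nothing here requires labels or training on the candidate data, which is the scalability point the talk emphasizes.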
ENCORE: A Neural Collapse Perspective on Out-of-Distribution Detection in Deep Neural Networks
We present ENCORE, a post-hoc out-of-distribution (OOD) detection method grounded in the geometric properties of neural collapse in deep neural networks. By leveraging the observation that in-distribution features align with class means while OOD features tend to be misaligned or orthogonal, ENCORE modifies inference through cosine-based scoring and adaptive feature scaling to enhance separation between known and unknown inputs. The method approximates neural collapse behavior at test time without requiring retraining, enabling more reliable uncertainty estimation. It is lightweight, memory-efficient, and compatible with a wide range of architectures, including convolutional networks and vision transformers. Extensive experiments on standard benchmarks demonstrate consistent improvements over existing OOD detection approaches in both near- and far-distribution shifts.
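The core geometric intuition can be sketched in a few lines (a hedged simplification: ENCORE additionally applies adaptive feature scaling, which is omitted here): score a test feature by its maximum cosine similarity to the per-class feature means, so misaligned or orthogonal features receive low scores.

```python
import numpy as np

def ood_score(feature, class_means):
    """Sketch of the neural-collapse intuition: in-distribution features
    align with some class mean, OOD features do not. Returns the max
    cosine similarity to any class mean; lower => more likely OOD.
    This is an illustrative simplification, not the full ENCORE method."""
    f = feature / (np.linalg.norm(feature) + 1e-12)
    M = class_means / (np.linalg.norm(class_means, axis=1, keepdims=True) + 1e-12)
    return float(np.max(M @ f))
```

Because the score is computed post hoc from existing features and class means, no retraining of the network is needed.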
Synthesizing Compositional Videos from Text Description
Existing pre-trained text-to-video diffusion models can generate high-quality videos but often struggle with misalignment between the generated content and the input text, particularly when composing scenes with multiple objects. To tackle this issue, we propose a straightforward, training-free approach for compositional video generation from text. We introduce Video-ASTAR for test-time aggregation and segregation of attention with a novel centroid loss to enhance alignment, enabling the generation of multiple objects in a scene along with their actions and interactions. Additionally, we extend our approach to the Multi-Action video generation setting, where only the specified action should vary across a sequence of prompts. To ensure coherent action transitions, we introduce a novel token-swapping and latent-interpolation strategy.
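The token-swapping and latent-interpolation idea can be sketched abstractly (the function name, the token-level swap, and the linear blend are illustrative assumptions, not the paper's exact procedure): swap the action token in the prompt while blending between the two latents so the rest of the scene stays coherent as the action changes.

```python
import numpy as np

def swap_and_interpolate(latent_a, latent_b, prompt_tokens,
                         old_action, new_action, steps=4):
    """Hedged sketch of the two-part idea: replace the action token in
    the prompt, and linearly interpolate between the latents of the two
    actions to smooth the transition. Illustrative only."""
    # Token swap: only the specified action changes across prompts.
    swapped = [new_action if t == old_action else t for t in prompt_tokens]
    # Latent interpolation: a simple linear blend between endpoints.
    alphas = np.linspace(0.0, 1.0, steps)
    latents = [(1 - a) * latent_a + a * latent_b for a in alphas]
    return swapped, latents
```

In a real diffusion pipeline the interpolated latents would each be denoised conditioned on the swapped prompt; that machinery is omitted here.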
The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs
Multimodal large language models can answer impressively complex visual questions, but do they truly understand what they see? We present The Perceptual Observatory, a framework for characterizing robustness and grounding in MLLMs beyond standard leaderboard scores. We evaluate models on interpretable tasks such as image matching, grid pointing game, and attribute localization across pixel-level corruptions and diffusion-based stylized illusions. Our analysis reveals that scaling the language model alone does not guarantee better perceptual grounding, uncovering systematic weaknesses in robustness, spatial invariance, fairness, and reasoning-based perception. The Perceptual Observatory offers a more principled way to study multimodal perception and provides actionable insights for building future MLLMs that are reliable and truly grounded in visual evidence.
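The style of evaluation described above can be sketched as a simple robustness harness (a hypothetical illustration; `model_fn` and `corrupt_fn` are stand-ins, not part of the actual benchmark): compare a model's accuracy on clean inputs against the same inputs under pixel-level corruption, where a large gap signals weak perceptual grounding.

```python
import numpy as np

def robustness_gap(model_fn, images, labels, corrupt_fn):
    """Hypothetical harness in the spirit of the framework: accuracy on
    clean inputs minus accuracy on corrupted inputs. A large positive
    gap means predictions are not robust to the corruption."""
    clean_acc = np.mean([model_fn(x) == y for x, y in zip(images, labels)])
    corrupt_acc = np.mean([model_fn(corrupt_fn(x)) == y
                           for x, y in zip(images, labels)])
    return float(clean_acc - corrupt_acc)
```

The same skeleton extends to the other probes mentioned in the abstract, such as stylized illusions or spatial perturbations, by swapping in a different `corrupt_fn`.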