Best of WACV 2026 - May 1, 2026

This event has ended, but you can still catch up! Watch the on-demand recordings and register for our future events.

May 01, 2026

9 AM - 11 AM Pacific

Online (Zoom)

About this event

Welcome to the Best of WACV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year’s conference. The talks are presented live by the authors.

Schedule

Perceptually Guided 3DGS Streaming and Rendering for Mixed Reality

Recent advances in 3D Gaussian Splatting (3DGS) enable high-quality rendering but fall short of mixed reality's demanding requirements for high refresh rates, stereo viewing, and limited compute budgets. We propose a perception-guided, continuous level-of-detail framework that exploits human visual system limitations through a lightweight, gaze-contingent model to predict and adaptively modulate rendering quality across the visual field, maximizing perceived quality under compute constraints.

Combined with an edge-cloud collaborative rendering framework for untethered MR devices, our method achieves superior computational efficiency with minimal perceptual quality loss compared to vanilla and foveated baselines, validated through objective metrics and user studies.
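As a rough illustration of gaze-contingent, continuous level of detail, the sketch below maps the angle between the gaze direction and a scene point to a detail fraction. It is not the authors' perceptual model: the function names, the reciprocal falloff, and the `slope` and `floor` constants are placeholder assumptions chosen only to show the idea of quality that decays smoothly away from the gaze point.

```python
import math


def eccentricity_deg(gaze_dir, point_dir):
    """Angle in degrees between the gaze direction and the direction to a point."""
    dot = sum(g * p for g, p in zip(gaze_dir, point_dir))
    norm = math.hypot(*gaze_dir) * math.hypot(*point_dir)
    cos_angle = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))


def detail_fraction(ecc_deg, slope=0.05, floor=0.1):
    """Continuous level of detail in [floor, 1].

    Returns 1.0 at the gaze point and decays as eccentricity grows.
    slope and floor are hypothetical constants, not values from the talk.
    """
    return max(floor, 1.0 / (1.0 + slope * ecc_deg))


# Example: a point 90 degrees off the gaze direction gets reduced detail.
gaze = (0.0, 0.0, 1.0)
side = (1.0, 0.0, 0.0)
fraction = detail_fraction(eccentricity_deg(gaze, side))
```

A renderer could use such a fraction to decide how many Gaussians to keep or stream for a region, but how the talk's framework actually modulates quality is not specified here.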

SAVIOR: Sample-efficient Adaptation of Vision-Language Models for OCR Representation

OCR pipelines and vision-language models systematically underperform on document patterns critical to financial workflows, such as vertical text, logo-embedded vendor names, degraded scans, and complex multi-column layouts. While underrepresented in public datasets, these patterns constitute a substantial portion of real-world failure cases.

We introduce SAVIOR, a sample-efficient data curation methodology that targets such high-impact failure scenarios to adapt vision-language models for robust financial OCR, and PaIRS, a structure-aware evaluation metric that measures layout fidelity by comparing pairwise spatial relationships between tokens. When fine-tuned with SAVIOR-Train, Qwen2.5-VL-Instruct demonstrates robust financial OCR performance, outperforming both open and closed-source baselines including GPT-4o, Mistral-OCR, PaddleOCR-VL, and DeepSeek-OCR.
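To make the idea of comparing pairwise spatial relationships concrete, here is a minimal sketch of a layout-agreement score. It is not the PaIRS definition, which is not given in this abstract. It assumes each token has a single (x, y) position, reduces each pair to a coarse left/right and above/below relation, and ignores duplicate tokens; the function names are hypothetical.

```python
from itertools import combinations


def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def relation(a, b):
    """Coarse spatial relation of point b relative to point a."""
    return sign(b[0] - a[0]), sign(b[1] - a[1])


def pairwise_layout_agreement(pred, ref):
    """Fraction of shared token pairs whose coarse relation matches.

    pred and ref map token text to an (x, y) position. This is an
    illustrative stand-in, not the metric presented in the talk.
    """
    shared = sorted(set(pred) & set(ref))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    agree = sum(
        relation(pred[s], pred[t]) == relation(ref[s], ref[t]) for s, t in pairs
    )
    return agree / len(pairs)


# Example: the amount is read from the wrong row, so only one of three pairs agrees.
ref = {"Invoice": (0, 0), "Total": (0, 5), "42.00": (3, 5)}
pred = {"Invoice": (0, 0), "Total": (0, 5), "42.00": (3, 0)}
score = pairwise_layout_agreement(pred, ref)
```

A score like this penalizes output that contains the right tokens in the wrong arrangement, which plain text-similarity metrics can miss.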

SynthForm: Towards a DLA-free E2E Form understanding model

We present SynthForm-3k, the first large-scale publicly available dataset of synthetically perturbed forms, comprising 3,417 samples across six domains: taxation, immigration, finance, healthcare, dental, and insurance. Ground-truth Markdown is constructed via an intermediate HTML representation generated by GPT-5 under high-reasoning inference, followed by deterministic HTML-to-Markdown conversion and scan-like perturbations (dust, scan lines, blur, rotation) that simulate real-world faxed and scanned documents.

We further introduce SynthForm-VL, a family of 2B, 4B, and 8B models obtained via full-parameter supervised fine-tuning of Qwen3-VL on this dataset. All three variants outperform their respective baselines, yielding ANLS improvements of +5.8, +9.3, and +10.3. The fine-tuned 2B model exceeds the performance of the 4× larger Qwen3-VL-8B baseline, demonstrating that targeted domain adaptation on perturbation-robust data offers a more favorable cost–performance tradeoff than scale alone for structured form understanding.
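For readers unfamiliar with ANLS (Average Normalized Levenshtein Similarity), the sketch below shows the commonly used form of the metric: each prediction is scored by one minus its normalized edit distance to the closest reference, scores at or above a distance threshold are zeroed, and the results are averaged. The threshold of 0.5, the lowercasing, and the whitespace stripping are assumptions; the talk's exact evaluation protocol is not stated here.

```python
def levenshtein(a, b):
    """Edit distance between two strings using a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """Mean over samples of the best thresholded similarity to any reference.

    predictions: list of strings; references: list of lists of strings.
    tau and the text normalization are assumed, not taken from the talk.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            longest = max(len(p), len(r))
            distance = levenshtein(p, r) / longest if longest else 0.0
            score = 1.0 - distance if distance < tau else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


# Example: one near miss (one wrong character out of four) and one complete miss.
example = anls(["abcd", "abcd"], [["abcf"], ["wxyz"]])
```

Because the threshold zeroes out badly wrong answers, ANLS rewards outputs that are nearly correct while giving no credit to unrelated text.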

Beyond Pixels: Type-Aware Contrastive Learning for Global Urban Similarity

Standard visual models often fail to distinguish between superficial appearances and meaningful structural variations in urban environments. We present a type-aware contrastive learning framework that measures city similarity by explicitly modeling infrastructure elements like intersections and bus lanes. Our framework integrates a type-conditioned Vision Transformer that actively fuses visual features with CLIP-derived semantic embeddings via a novel adaptive per-type contrastive loss. This allows the model to dynamically prioritize the most discriminative infrastructure categories while down-weighting less informative visual noise. We demonstrate that this method significantly improves clustering quality and generalizes to unseen cities, providing a scalable, interpretable foundation for urban analysis.
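One way to picture a per-type contrastive loss is to compute a standard InfoNCE loss separately for each infrastructure type and then combine the per-type losses with normalized weights. The sketch below does exactly that and nothing more. It is an assumption-laden illustration, not the authors' loss: the softmax weighting, the temperature, and all function names are hypothetical, and in training the type logits would be learned parameters rather than fixed numbers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax score of the positive."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


def softmax_weights(type_logits):
    """Turn per-type logits into weights that sum to 1."""
    peak = max(type_logits.values())
    exps = {t: math.exp(v - peak) for t, v in type_logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def per_type_loss(losses, type_logits):
    """Weighted sum of per-type losses (hypothetical combination rule)."""
    weights = softmax_weights(type_logits)
    return sum(weights[t] * losses[t] for t in losses)


# Example: two infrastructure types with equal logits get equal weight.
losses = {"intersection": 1.0, "bus_lane": 3.0}
combined = per_type_loss(losses, {"intersection": 0.0, "bus_lane": 0.0})
```

Raising one type's logit shifts weight toward that type's loss, which is the basic mechanism by which a model could prioritize more discriminative categories.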