The Best of CVPR is a three-day virtual meetup series featuring researchers presenting their accepted papers from the 2026 Conference on Computer Vision and Pattern Recognition (CVPR).
👉 Register for this session to get access to all three days of the Best of CVPR.
Each session features a curated lineup of speakers sharing cutting-edge research across computer vision, deep learning, and multimodal AI — straight from papers accepted at one of the field’s top conferences.
Whether you’re a researcher, engineer, or practitioner, you’ll leave with a sharper view of where the field is heading.
Schedule
Advancing Generative Quality and Reasoning in Multimodal AI
This talk exposes hidden limitations of frontier multimodal models across reasoning and visual generation, demonstrates the inherent brittleness of VLMs and audio-visual MLLMs, and introduces simple yet effective techniques to build robustness. It also covers human-centric metrics for perceptually accurate evaluation of generative media.
HyperRealm: Hyperbolic Vision Language Models for Real-World Hierarchical Multimodal Understanding
Real-world multimodal data naturally exhibits hierarchical structure, yet standard VLMs like CLIP align images and text in Euclidean space, which cannot preserve tree-like hierarchies. HyperRealm embeds images and text in a Poincaré ball to encode hierarchical relationships, introducing an adaptive entropy-driven entailment loss. Evaluated on 18 zero-shot classification benchmarks, it shows consistent improvements over Euclidean CLIP baselines.
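For attendees unfamiliar with hyperbolic embeddings, the sketch below shows the standard textbook distance function on the unit Poincaré ball. It is not HyperRealm's implementation, and the function name `poincare_distance` is only illustrative. It shows why the geometry suits hierarchies: the same Euclidean gap costs far more distance near the boundary than near the origin, which leaves room for the many leaves of a tree.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball.

    Standard formula: d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    Illustrative helper only; not taken from the HyperRealm paper.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Two pairs of points with the same Euclidean gap (0.05) at different radii.
near_origin = poincare_distance([0.0, 0.0], [0.05, 0.0])
near_boundary = poincare_distance([0.90, 0.0], [0.95, 0.0])
```

Here `near_boundary` comes out several times larger than `near_origin`, even though both pairs are 0.05 apart in Euclidean terms.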
Cross-Modal Domain Adaptation using Semantic Parametric Mapping
XD-MAP is a framework that transfers semantic knowledge from image datasets to LiDAR by constructing semantic parametric maps from monocular detections and geometric priors. Unlike previous approaches, XD-MAP does not require overlapping sensor views and enables scalable 360° supervision for LiDAR perception without manual annotation.
WalkGPT: Pixel-Grounded Navigation Guidance for Pedestrians
Pedestrian navigation requires more than generic scene description; users need to understand walkable areas, obstacles, and the distance to surrounding objects. In this talk, I will present WalkGPT, a grounded vision-language model for accessibility-aware pedestrian navigation. WalkGPT connects language reasoning with segmentation masks and object-level distance estimates to generate grounded navigation guidance from pedestrian-view images. I will also introduce PAVE, a 41k-sample benchmark for depth-aware accessibility reasoning in real pedestrian environments. The talk will highlight how grounded multimodal AI can support safer and more interpretable pedestrian assistance.