Best of CVPR - July 10, 2026

Start: 2026-07-10
End: 2026-07-10

This event has ended, but you can still catch up! Watch the on-demand recordings and register for our future events.

Jul 10, 2026

9 AM - 11 AM PT

Online. Register for Zoom!

About this event

The Best of CVPR is a three-day virtual meetup series featuring researchers presenting their accepted papers from the 2026 Conference on Computer Vision and Pattern Recognition (CVPR).

👉 Register for this session to get access to all three days of the Best of CVPR.

Each session features a curated lineup of speakers sharing cutting-edge research across computer vision, deep learning, and multimodal AI — straight from papers accepted at one of the field’s top conferences.

Whether you’re a researcher, engineer, or practitioner, you’ll leave with a sharper view of where the field is heading.

Schedule

HyperRealm: Hyperbolic Vision Language Models for Real-World Hierarchical Multimodal Understanding

This work was honored with the Industry Innovation Award at CVPR 2026.

Real-world multimodal data is naturally hierarchical, yet standard VLMs such as CLIP align images and text in Euclidean space, which cannot preserve tree-like hierarchies. HyperRealm instead embeds images and text in a Poincaré ball to encode hierarchical relationships, and it introduces an adaptive entropy-driven entailment loss. Evaluated on 18 zero-shot classification benchmarks, it shows consistent improvements over Euclidean CLIP baselines.
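For readers new to hyperbolic embeddings, the sketch below shows why a Poincaré ball suits tree-like data. It is not HyperRealm's code or loss. It only implements the standard Poincaré-ball distance formula, and the function names and the toy "root" and "leaf" points are assumptions made for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball.

    Standard formula: d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def euclidean_distance(u, v):
    """Ordinary straight-line distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy hierarchy: a generic concept near the origin and two
# specific concepts near the boundary of the ball.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

print("Euclidean leaf-to-leaf: ", euclidean_distance(leaf_a, leaf_b))
print("Hyperbolic leaf-to-leaf:", poincare_distance(leaf_a, leaf_b))
print("Hyperbolic via the root:", poincare_distance(leaf_a, root) + poincare_distance(root, leaf_b))
```

In this toy setup the hyperbolic distance between the two leaves is close to the length of the path through the root, which is how distances behave in a tree. The Euclidean distance between the same points is much shorter than the Euclidean path through the root.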

Resources

Paper

Advancing Generative Quality and Reasoning in Multimodal AI

This talk exposes hidden limitations of frontier multimodal models across reasoning and visual generation, demonstrates the inherent brittleness of VLMs and audio-visual MLLMs, and introduces simple yet effective techniques to build robustness. It also covers human-centric metrics for perceptually accurate evaluation of generative media.

Cross-Modal Domain Adaptation using Semantic Parametric Mapping

XD-MAP is a framework that transfers semantic knowledge from image datasets to LiDAR by constructing semantic parametric maps from monocular detections and geometric priors. Unlike previous approaches, XD-MAP does not require overlapping sensor views and enables scalable 360° supervision for LiDAR perception without manual annotation.

WalkGPT: Pixel-Grounded Navigation Guidance for Pedestrians

Pedestrian navigation requires more than generic scene description; users need to understand walkable areas, obstacles, and the distance to surrounding objects. In this talk, I will present WalkGPT, a grounded vision-language model for accessibility-aware pedestrian navigation. WalkGPT connects language reasoning with segmentation masks and object-level distance estimates to generate grounded navigation guidance from pedestrian-view images. I will also introduce PAVE, a 41k-sample benchmark for depth-aware accessibility reasoning in real pedestrian environments. The talk will highlight how grounded multimodal AI can support safer and more interpretable pedestrian assistance.

Resources

Paper