Best of CVPR - July 8, 2026

Jul 08, 2026
9 AM - 11 AM PT
Online. Register for Zoom!
About this event
The Best of CVPR is a three-day virtual meetup series featuring researchers presenting their accepted papers from the 2026 Conference on Computer Vision and Pattern Recognition (CVPR).
👉 Register for this session to get access to all three days of the Best of CVPR.
Each session features a curated lineup of speakers sharing cutting-edge research across computer vision, deep learning, and multimodal AI — straight from papers accepted at one of the field’s top conferences.
Whether you’re a researcher, engineer, or practitioner, you’ll leave with a sharper view of where the field is heading.
Schedule
CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation
This paper presents CylinderDepth, a self-supervised surround depth estimation method leveraging cylindrical spatial attention for multi-view consistency across camera rigs.
Your ViT is Secretly Also a Video Segmentation Model
Existing online video segmentation models typically combine a per-frame segmentation module with complex, specialized tracking modules. This work shows that a plain Vision Transformer encoder with a lightweight temporal module can match the performance of those models. The resulting model, VidEoMT, is 5–10x faster and runs at up to 160 FPS with a ViT-L encoder.
LinkedOut: Linking World Knowledge Out of Video LLMs for Next-Generation Video Recommendation
This CVPR 2026 work extracts structured world-knowledge representations from Video LLMs for next-generation video recommendation. The talk covers how large vision-language models can provide rich semantic priors for video understanding while addressing efficiency and deployment challenges in real recommendation systems.
Some Modalities Are More Equal Than Others: Understanding and Improving Multimodal Integration in MLLMs
Multimodal large language models can process vision, audio, and text, but it remains unclear whether they truly integrate these modalities or rely on shortcut cues. In this talk, I will present our recent work, “Some Modalities Are More Equal Than Others,” where we introduce MMA-Bench, a benchmark designed to probe MLLMs under controlled audio–visual conflict, misleading text, and modality-specific queries. Through black-box evaluation and white-box attention analysis, we show that current MLLMs often struggle when modalities disagree, exhibit model-specific modality biases, and can be distracted by irrelevant textual context. We further propose an alignment-aware tuning strategy that trains models to answer based on the queried modality, improving robustness and multimodal grounding. This talk will highlight both the failure modes of current MLLMs and practical directions toward more reliable cross-modal reasoning.