Best of CVPR - July 8, 2026

Start: 2026-07-08
End: 2026-07-08

This event has ended, but you can still catch up! Watch the on-demand recordings and register for our future events.

Jul 08, 2026

9 AM - 11 AM PT

Online, via Zoom.

About this event

The Best of CVPR is a three-day virtual meetup series featuring researchers presenting their accepted papers from the 2026 Conference on Computer Vision and Pattern Recognition (CVPR).

👉 Register for this session to get access to all three days of the Best of CVPR.

Each session features a curated lineup of speakers sharing cutting-edge research across computer vision, deep learning, and multimodal AI — straight from papers accepted at one of the field’s top conferences.

Whether you’re a researcher, engineer, or practitioner, you’ll leave with a sharper view of where the field is heading.

Schedule

CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

This paper presents CylinderDepth, a self-supervised surround depth estimation method leveraging cylindrical spatial attention for multi-view consistency across camera rigs.

Your ViT is Secretly Also a Video Segmentation Model

Existing online video segmentation models typically combine a per-frame segmentation module with complex, specialized tracking modules. This work shows that a plain Vision Transformer encoder with a lightweight temporal module can match that performance. The resulting model, VidEoMT, is 5–10x faster and runs at up to 160 FPS with a ViT-L encoder.

LinkedOut: Linking World Knowledge Out of Video LLMs for Next-Generation Video Recommendation

This CVPR 2026 work extracts structured world-knowledge representations from Video LLMs for next-generation video recommendation. It covers how large vision-language models can provide rich semantic priors for video understanding while addressing efficiency and deployment challenges in real recommendation systems.

Some Modalities Are More Equal Than Others: Understanding and Improving Multimodal Integration in MLLMs

Multimodal large language models can process vision, audio, and text, but it remains unclear whether they truly integrate these modalities or rely on shortcut cues. In this talk, I will present our recent work, “Some Modalities Are More Equal Than Others,” where we introduce MMA-Bench, a benchmark designed to probe MLLMs under controlled audio–visual conflict, misleading text, and modality-specific queries. Through black-box evaluation and white-box attention analysis, we show that current MLLMs often struggle when modalities disagree, exhibit model-specific modality biases, and can be distracted by irrelevant textual context. We further propose an alignment-aware tuning strategy that trains models to answer based on the queried modality, improving robustness and multimodal grounding. This talk will highlight both the failure modes of current MLLMs and practical directions toward more reliable cross-modal reasoning.
