Best of CVPR - July 9, 2026

Name: Best of CVPR - July 9, 2026
Start: 2026-07-09
End: 2026-07-09

This event has ended, but you can still catch up! Watch the on-demand recordings and register for our future events.

Jul 09, 2026

9 AM - 11 AM PT

Online. Register for Zoom!

About this event

The Best of CVPR is a three-day virtual meetup series featuring researchers presenting their accepted papers from the 2026 Conference on Computer Vision and Pattern Recognition (CVPR).

👉 Register for this session to get access to all three days of the Best of CVPR.

Each session features a curated lineup of speakers sharing cutting-edge research across computer vision, deep learning, and multimodal AI — straight from papers accepted at one of the field’s top conferences.

Whether you’re a researcher, engineer, or practitioner, you’ll leave with a sharper view of where the field is heading.

Schedule

Efficient Representation and Coding of Dynamic Light Fields

This talk presents a data-driven approach that integrates aperture and pixel-wise exposure coding with Dynamic Mode Decomposition (DMD) to achieve a compact representation of dynamic light fields. By modeling dynamic light fields as dynamical systems, the framework captures coherent structures across all dimensions and delivers scalable compression, bitrate savings, and high-quality reconstructions.
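For readers unfamiliar with DMD, here is a minimal sketch of the generic "exact DMD" procedure on toy data. It is not the speakers' pipeline: the coded-aperture and exposure-coding stages are omitted, and the function names (`dmd`, `reconstruct`), the toy 6-dimensional signal, and the chosen rank are all assumptions made for illustration. It only shows why a sequence governed by a few coherent modes can be stored compactly as eigenvalues, modes, and amplitudes.

```python
import numpy as np

def dmd(snapshots, rank):
    """Exact DMD of a matrix whose columns are consecutive states."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]        # model: Y ≈ A X
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank, :]    # truncate to `rank`
    YVS = (Y @ Vh.conj().T) / s                       # Y V S^{-1}
    A_tilde = U.conj().T @ YVS                        # A projected on leading modes
    eigvals, W = np.linalg.eig(A_tilde)
    modes = YVS @ W                                   # DMD modes in full space
    x0 = snapshots[:, 0].astype(complex)
    amplitudes = np.linalg.lstsq(modes, x0, rcond=None)[0]
    return eigvals, modes, amplitudes

def reconstruct(eigvals, modes, amplitudes, n_steps):
    """Rebuild n_steps states from the compact DMD description."""
    powers = eigvals[:, None] ** np.arange(n_steps)[None, :]
    return (modes @ (amplitudes[:, None] * powers)).real

# Toy stand-in for a dynamic signal: a decaying 2D rotation lifted to 6D.
rng = np.random.default_rng(0)
theta = 0.3
A = 0.9 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
lift = rng.standard_normal((6, 2))
z = np.array([1.0, 0.0])
states = []
for _ in range(20):
    states.append(lift @ z)
    z = A @ z
snapshots = np.stack(states, axis=1)                  # shape (6, 20)

eigvals, modes, amplitudes = dmd(snapshots, rank=2)
approx = reconstruct(eigvals, modes, amplitudes, snapshots.shape[1])
```

In this toy case the 6 x 20 snapshot matrix is described by two eigenvalues, two modes, and two amplitudes, which is the sense in which a low-rank dynamical model acts as a compact representation.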

Resources

Paper

PHANTOM: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Recent video generation models can produce visually striking results, but they often fail to capture the physical dynamics that govern how real-world scenes evolve. In this talk, I will present PHANTOM, a physics-infused video generation model that jointly predicts visual content and latent physical dynamics. PHANTOM uses a physics-aware video representation to guide generation toward videos that are both visually realistic and physically consistent, without requiring explicit simulator-based physical specifications. I will discuss the model design, key results on standard and physics-aware video generation benchmarks, and how this work supports broader progress toward multimodal world models for physical AI and embodied reasoning.

Resources

Paper

LoST: Level of Semantics Tokenization for 3D Shapes

Tokenization is fundamental to generative modeling and especially important for autoregressive 3D generation. However, current 3D shape tokenizers rely on geometric level-of-detail hierarchies that are token-inefficient and poorly aligned with semantic structure. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience so early tokens produce complete, plausible shapes and later tokens refine detailed geometry and semantics. LoST is trained with Relational Inter-Distance Alignment (RIDA), a semantic alignment loss that matches relationships in 3D shape latent space to those in DINO feature space. Experiments show that LoST achieves state-of-the-art reconstruction and efficient high-quality autoregressive (AR) 3D generation while using only 0.1%–10% of the tokens required by prior methods.
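The abstract does not give the RIDA formula, so the following is only a generic sketch of what "matching relationships between two feature spaces" can look like: it compares scale-normalized pairwise distance matrices. The function names, the normalization, and the random arrays standing in for DINO features and shape latents are all assumptions, not the paper's definition.

```python
import numpy as np

def pairwise_distances(features):
    """Euclidean distance between every pair of rows."""
    diff = features[:, None, :] - features[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def relational_alignment_loss(latents, reference):
    """Mean squared gap between two normalized distance matrices.

    The two inputs may have different feature dimensions, because only
    the relationships between samples are compared.
    """
    n = len(latents)
    off_diag = ~np.eye(n, dtype=bool)
    d_lat = pairwise_distances(latents)
    d_ref = pairwise_distances(reference)
    # Divide by the mean distance so the loss ignores overall scale.
    d_lat = d_lat / d_lat[off_diag].mean()
    d_ref = d_ref / d_ref[off_diag].mean()
    return float(np.mean((d_lat - d_ref)[off_diag] ** 2))

rng = np.random.default_rng(0)
reference = rng.standard_normal((5, 8))          # stand-in for DINO features
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
aligned = 3.0 * reference @ Q                    # same geometry, rotated and scaled
unrelated = rng.standard_normal((5, 4))          # unrelated latent space

loss_aligned = relational_alignment_loss(aligned, reference)
loss_unrelated = relational_alignment_loss(unrelated, reference)
```

A latent space that preserves the reference geometry up to rotation and scale scores near zero, while an unrelated one scores higher, which is the behavior a relational alignment objective rewards.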

Resources

Paper

3D Reconstruction Improves Weakly-Supervised Semantic Segmentation

Semantic segmentation typically requires expensive, dense annotations, making large-scale training a significant bottleneck. We address this by introducing a framework that leverages recent advances in feed-forward 3D reconstruction to improve weakly supervised semantic segmentation on 2D images, using only sparse labels such as points, scribbles, or coarse masks. Our core insight is that 3D geometric structure recovered directly from casual 2D video sequences provides powerful cross-view consistency constraints that can propagate sparse annotations across entire scenes. A dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, injecting geometric supervision into the learning process while keeping inference purely 2D. Our solution achieves state-of-the-art performance, outperforming existing methods by 2–7% across a range of datasets and annotation types, without requiring additional labels or inference overhead.
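To make the cross-view idea concrete, here is a toy sketch of propagating sparse labels through shared 3D points. It is a simplified majority-vote scheme, not the dual student-teacher architecture described in the talk, and it assumes a hypothetical pixel-to-point mapping has already been produced by some 3D reconstruction step. All names and data below are illustrative.

```python
from collections import Counter, defaultdict

def propagate_labels(pixel_to_point, sparse_labels):
    """Spread sparse pixel labels to every pixel that sees the same 3D point.

    pixel_to_point: {(view, pixel): point_id} from a reconstruction (assumed given)
    sparse_labels:  {(view, pixel): class_name} for the few annotated pixels
    """
    votes = defaultdict(Counter)
    for key, label in sparse_labels.items():
        point = pixel_to_point.get(key)
        if point is not None:
            votes[point][label] += 1          # each annotation votes for its point

    dense = {}
    for key, point in pixel_to_point.items():
        if point in votes:
            # Majority label among annotations that landed on this point.
            dense[key] = votes[point].most_common(1)[0][0]
    return dense

# Two views of one scene; only view "a" carries point annotations.
pixel_to_point = {
    ("a", 0): 10, ("a", 1): 11,
    ("b", 5): 10, ("b", 6): 11, ("b", 7): 12,
}
sparse_labels = {("a", 0): "road", ("a", 1): "car"}

dense_labels = propagate_labels(pixel_to_point, sparse_labels)
```

Pixels in view "b" that observe annotated 3D points inherit labels without any extra annotation, while the pixel mapped to the unannotated point 12 stays unlabeled.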

Resources

Paper