Welcome to the Best of 3DV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year’s conference, live-streamed from the authors to you.
Schedule
Precise lighting control in diffusion models by drawing shadows
Diffusion models can now serve as powerful neural rendering engines for realistically inserting virtual objects into images. However, unlike traditional 3D rendering engines (e.g., Blender), they lack precise control over lighting, an essential requirement in artistic workflows. We demonstrate that fine-grained lighting control can be achieved for object relighting simply by specifying the object’s desired shadow and injecting it into the diffusion denoising process. The model then produces a realistic relighting of the object consistent with the input shadow direction.
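To make the idea concrete, here is a minimal sketch of shadow-guided sampling under simplified assumptions: a DDPM-style noise predictor `eps_model`, a user-drawn binary shadow mask, and a hypothetical guidance term that darkens the predicted clean image under the mask at each denoising step. The names, noise schedule, and injection rule are illustrative, not the authors’ implementation.

```python
import torch

def shadow_guided_denoise(eps_model, shadow_mask, steps=50, guidance=0.3,
                          shape=(1, 3, 64, 64), device="cpu"):
    """shadow_mask: (1,1,H,W) tensor, 1 where the drawn shadow should appear."""
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    x = torch.randn(shape, device=device)  # start from pure noise
    for t in reversed(range(steps)):
        a_bar = alphas_bar[t]
        eps = eps_model(x, t)
        # Predicted clean image at this step (standard DDPM x0 estimate).
        x0_pred = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        # Hypothetical injection: nudge pixels under the drawn shadow toward
        # darker values, so sampling converges to an image with that shadow.
        x0_pred = x0_pred - guidance * shadow_mask * x0_pred.clamp(min=0)
        # Re-noise to level t-1 (DDIM-style deterministic step).
        a_prev = alphas_bar[t - 1] if t > 0 else torch.tensor(1.0, device=device)
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x

# Runs end-to-end with a stub predictor; swap in a real denoiser in practice.
mask = torch.zeros(1, 1, 64, 64)
mask[..., 40:60, 20:40] = 1.0
img = shadow_guided_denoise(lambda x, t: torch.zeros_like(x), mask)
```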
SmokeSeer: 3D Gaussian Splatting for Smoke Removal and Scene Reconstruction
Smoke in real-world scenes can severely degrade image quality and hamper visibility. Recent image restoration methods either rely on data-driven priors that are susceptible to hallucinations, or are limited to static, low-density smoke. We introduce SmokeSeer, a method for simultaneous 3D scene reconstruction and smoke removal from multi-view video sequences. Our method uses thermal and RGB images, leveraging the reduced scattering in thermal images to see through smoke. We build upon 3D Gaussian splatting to fuse information from the two image modalities, and decompose the scene into smoke and non-smoke components. Unlike prior work, SmokeSeer handles a broad range of smoke densities and adapts to temporally varying smoke. We validate our method on both synthetic data and a new real-world smoke dataset with RGB and thermal images.
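As a rough illustration of the decomposition, the sketch below supervises two separate Gaussian sets with RGB and thermal views, under the simplifying assumptions that the two components composite additively and that smoke contributes only weakly to the thermal channel. The `render` callable and the weight value are placeholders for whatever differentiable 3DGS rasterizer is used; this is not the authors’ code.

```python
import torch
import torch.nn.functional as F

def decomposition_loss(scene_gs, smoke_gs, camera, rgb_gt, thermal_gt,
                       render, smoke_thermal_weight=0.1):
    # RGB observes both components: smoke scatters visible light heavily.
    rgb_pred = (render(scene_gs, camera, channels=3)
                + render(smoke_gs, camera, channels=3))
    # Thermal sees "through" smoke: smoke contributes only weakly there.
    thermal_pred = (render(scene_gs, camera, channels=1)
                    + smoke_thermal_weight * render(smoke_gs, camera, channels=1))
    # Joint photometric loss over both modalities.
    return F.l1_loss(rgb_pred, rgb_gt) + F.l1_loss(thermal_pred, thermal_gt)
```

Because the thermal term barely penalizes the smoke component, structure visible in both modalities is pushed into the scene Gaussians, leaving the smoke set to explain the haze seen only in RGB.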
Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption
Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch processing, which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely caching latent features during inference and masking frames during training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device.
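The caching idea can be sketched as follows: extract features once per incoming frame, append them to a fixed-length cache, and run temporal attention over the cached features instead of re-encoding a whole clip. Module names and shapes are illustrative, not the oVDA code; the bounded deque is what keeps VRAM flat regardless of video length.

```python
from collections import deque
import torch

class StreamingDepth:
    def __init__(self, encoder, temporal_head, cache_len=8):
        self.encoder = encoder                # per-frame feature extractor
        self.temporal_head = temporal_head    # attends over cached features
        self.cache = deque(maxlen=cache_len)  # fixed size bounds memory use

    @torch.no_grad()
    def step(self, frame):
        """frame: (1,3,H,W) tensor; returns a depth map for this frame only."""
        feat = self.encoder(frame)   # computed once, then reused from the cache
        self.cache.append(feat)
        # Temporal attention over past + current features, with no
        # re-encoding of old frames -- this enables online inference.
        context = torch.stack(list(self.cache), dim=1)  # (1, T, C, h, w)
        return self.temporal_head(feat, context)
```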