Best of CVPR - July 9, 2025

Virtual

Americas

Jul 9, 2025

9 AM Pacific

Online via Zoom (registration required).

About this event

Welcome to the Best of CVPR series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year’s conference, streamed live by the authors themselves.

Schedule

What Foundation Models really need to be capable of for Autonomous Driving - The Drive4C Benchmark

Foundation models have the potential to generalize across the driving task and to support language-based interaction in autonomous driving. However, they still struggle with specific reasoning tasks that are essential for robotic navigation. Current benchmarks typically report only aggregate performance scores, which makes it difficult to tell whether a model has the underlying capabilities the task requires. Drive4C addresses this gap with a closed-loop benchmark that evaluates semantic, spatial, temporal, and physical understanding, enabling more targeted improvements to foundation models for autonomous driving.
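To make the contrast with aggregate scores concrete, here is a minimal Python sketch, assuming a hypothetical list of per-item pass/fail records, each tagged with one of the four capabilities named above. The records are toy values, not Drive4C results, and the function names are illustrative rather than part of the benchmark.

```python
from collections import defaultdict

# Hypothetical per-item results as (capability, passed) pairs.
# The capability names follow the abstract; the record format is an assumption.
RESULTS = [
    ("semantic", True), ("semantic", True),
    ("spatial", True), ("spatial", False),
    ("temporal", False), ("temporal", False),
    ("physical", True), ("physical", False),
]


def aggregate_score(results):
    """Single pass rate over all items: the kind of number a typical benchmark reports."""
    return sum(passed for _, passed in results) / len(results)


def per_capability_scores(results):
    """Pass rate broken down by capability, exposing weaknesses the aggregate hides."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for capability, passed in results:
        totals[capability] += 1
        passes[capability] += passed  # bool counts as 0 or 1
    return {cap: passes[cap] / totals[cap] for cap in totals}


if __name__ == "__main__":
    print(f"aggregate: {aggregate_score(RESULTS):.2f}")
    for cap, score in per_capability_scores(RESULTS).items():
        print(f"{cap:>9}: {score:.2f}")
```

With these toy records the aggregate is 0.50, while the breakdown shows that every temporal item failed, which is exactly the kind of gap a single score conceals.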

Human Motion Prediction – Enhanced Realism via Nonisotropic Gaussian Diffusion

Predicting future human motion is a key challenge in generative AI and computer vision, because generated motions must be realistic and diverse at the same time. This talk presents an approach that builds on top-performing latent generative diffusion models and replaces their standard isotropic noise with a new paradigm: nonisotropic Gaussian diffusion. This change yields better performance, fewer parameters, and faster training at no additional computational cost. We will also discuss how similar benefits can be obtained in other application domains.
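The difference between isotropic and nonisotropic noise can be illustrated with the standard closed-form forward noising step x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, where ε is drawn from N(0, Σ) with a full covariance Σ instead of N(0, I). The Python sketch below is an assumption-laden illustration, not the authors' model: it operates on a raw 3-value vector rather than a learned latent, and the covariance SIGMA (neighbouring joints in a 3-joint chain share noise) is purely hypothetical. Correlated noise is sampled as ε = L·z, where L is the Cholesky factor of Σ and z ~ N(0, I).

```python
import math
import random


def cholesky(sigma):
    """Lower-triangular L with L @ L.T == sigma (sigma must be symmetric positive definite)."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L


def nonisotropic_noise(L, rng):
    """Draw eps ~ N(0, Sigma) as eps = L z with z ~ N(0, I)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(L))]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(L))]


def forward_noise(x0, alpha_bar_t, L, rng):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, Sigma)."""
    eps = nonisotropic_noise(L, rng)
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]


# Hypothetical covariance for a 3-joint chain: adjacent joints receive correlated noise.
SIGMA = [[1.0, 0.5, 0.0],
         [0.5, 1.0, 0.5],
         [0.0, 0.5, 1.0]]

if __name__ == "__main__":
    rng = random.Random(0)
    L = cholesky(SIGMA)
    print(forward_noise([0.2, -0.1, 0.4], alpha_bar_t=0.5, L=L, rng=rng))
```

Setting SIGMA to the identity matrix recovers ordinary isotropic diffusion; since the only change is a fixed linear map applied to the noise, this is one way to read the claim that the nonisotropic variant adds no extra computational cost.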

Efficient Few-Shot Adaptation of Open-Set Detection Models

We propose an efficient few-shot adaptation method for the Grounding-DINO open-set object detector, designed to improve performance on specialized domains such as agriculture, where extensive annotation is costly. The method sidesteps manual text prompt engineering by removing the standard text encoder and introducing randomly initialized, trainable text embeddings in its place. These embeddings are optimized directly from a few labeled images, so the model can quickly adapt to new domains and object classes with minimal data. The approach outperforms zero-shot methods and compares favorably with other few-shot techniques, offering a promising route to rapid model specialization.
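The core idea, learning one embedding per class from a handful of labeled examples while the rest of the model stays fixed, can be sketched in plain Python. This is a toy illustration under strong assumptions, not Grounding-DINO: the 2-D vectors in FEATURES stand in for frozen region features, a region is scored against each class by a dot product with that class's embedding, and only the randomly initialized embeddings are updated with softmax cross-entropy and gradient descent. The real method trains the embeddings through the full detector and its detection losses; all names here are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_class_embeddings(features, labels, num_classes, dim, steps=200, lr=0.5, seed=0):
    """Learn one embedding per class from a few labeled (frozen) region features."""
    rng = random.Random(seed)
    # Randomly initialized, trainable "text" embeddings: the only parameters updated.
    emb = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(num_classes)]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(num_classes)]
        for f, y in zip(features, labels):
            probs = softmax([sum(fi * ei for fi, ei in zip(f, e)) for e in emb])
            for c in range(num_classes):
                # d(cross-entropy)/d(emb[c]) = (p_c - 1[c == y]) * f
                coeff = probs[c] - (1.0 if c == y else 0.0)
                for d in range(dim):
                    grads[c][d] += coeff * f[d]
        for c in range(num_classes):
            for d in range(dim):
                emb[c][d] -= lr * grads[c][d] / len(features)
    return emb


def predict(emb, f):
    """Index of the class embedding with the highest dot-product score."""
    scores = [sum(fi * ei for fi, ei in zip(f, e)) for e in emb]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical few-shot data: four labeled region features for two new classes.
FEATURES = [[1.0, 0.1], [0.9, -0.1], [0.1, 1.0], [-0.1, 0.8]]
LABELS = [0, 0, 1, 1]

if __name__ == "__main__":
    emb = train_class_embeddings(FEATURES, LABELS, num_classes=2, dim=2)
    print([predict(emb, f) for f in FEATURES])
```

Because the class embeddings are learned directly from labeled examples, no one has to hand-craft a text prompt for a domain-specific class name, which is the prompt-engineering problem the abstract describes.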

OpticalNet: An Optical Imaging Dataset and Benchmark Beyond the Diffraction Limit

Optical imaging capable of resolving nanoscale features would revolutionize scientific research and engineering applications across biomedicine, smart manufacturing, and semiconductor quality control. However, because of diffraction, optical resolution is limited to approximately half the wavelength of light, which prevents the observation of subwavelength objects such as native-state coronaviruses, typically smaller than 200 nm. Deep learning methods have shown remarkable potential for uncovering underlying patterns in data and promise to overcome the diffraction limit by learning the mapping between diffraction images and their corresponding ground-truth object images. However, the absence of suitable datasets has hindered progress in this field: collecting high-quality optical data of subwavelength objects is very difficult because these objects are inherently invisible under conventional microscopy, which makes standard visual calibration and drift correction impossible. We therefore provide the first general optical imaging dataset based on a “building block” concept for challenging the diffraction limit. By analogy with modular construction, we build a comprehensive optical imaging dataset of subwavelength fundamental elements, i.e., small square units that can be assembled into larger and more complex objects. We then frame the problem as image-to-image translation and evaluate various vision methods. Experimental results validate the “building block” concept: models trained on basic square units generalize effectively to realistic, more complex unseen objects. Most importantly, by highlighting this underexplored AI-for-science area and its potential, we hope to advance optical science by fostering collaboration with the vision and machine learning communities.
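The “approximately half the wavelength” figure follows from the standard Abbe diffraction limit, d = λ / (2·NA), where NA is the numerical aperture of the objective (below 1 in air, up to roughly 1.4 with oil immersion). The short Python sketch below simply evaluates this textbook formula; it is not taken from the OpticalNet work.

```python
def abbe_limit_nm(wavelength_nm, numerical_aperture=1.0):
    """Smallest resolvable separation d = wavelength / (2 * NA), in nanometres."""
    return wavelength_nm / (2.0 * numerical_aperture)


if __name__ == "__main__":
    # Visible light spans roughly 400 to 700 nm.
    for wl in (400, 550, 700):
        print(f"{wl} nm light, NA = 1.0: d = {abbe_limit_nm(wl):.0f} nm")
```

Across the visible range this gives about 200 to 350 nm at NA = 1.0, so an object smaller than 200 nm sits at or below what conventional visible-light microscopy can resolve.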
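The “building block” idea can be pictured as composing a ground-truth object image from unit squares placed on a grid. The Python sketch below is a hypothetical illustration of that relationship only: the layout notation ('#' for a square unit, '.' for empty space), the unit_px scale, and compose_object are assumptions, not the OpticalNet data format. In the image-to-image framing, masks like these would serve as target images, paired with the corresponding diffraction images as inputs.

```python
def compose_object(layout, unit_px):
    """Render a binary object mask from a grid layout of square units.

    layout: list of equal-length strings; '#' marks a square unit, '.' empty space.
    unit_px: side length of one square unit in pixels (hypothetical scale).
    """
    rows = []
    for line in layout:
        row = []
        for cell in line:
            row.extend([1 if cell == "#" else 0] * unit_px)
        # Repeat the pixel row unit_px times so each unit becomes a full square.
        rows.extend([list(row) for _ in range(unit_px)])
    return rows


# Hypothetical L-shaped object assembled from three square units.
L_SHAPE = ["#.",
           "##"]

if __name__ == "__main__":
    for row in compose_object(L_SHAPE, unit_px=2):
        print("".join("█" if v else "·" for v in row))
```

Any larger shape is just a different layout of the same unit, which is why a model trained on single squares could plausibly transfer to more complex assemblies, the generalization the abstract reports.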