The Best of CVPR 2025 Series – Day 1
May 29, 2025 – Written by Paula Ramos
Building Smarter, Safer, and More Grounded Vision AI
CVPR brings together some of the most exciting research in computer vision, but sometimes it’s hard to keep up, especially when great ideas are presented quickly or buried in dense papers. That’s why we created “The Best of CVPR” virtual meetup and blog series. We want to give more visibility to the people behind the work and help everyone see these projects’ potential beyond the conference.
The goal is simple: to help more people understand how this research connects to real-life problems — things like farming, healthcare, driving, and imaging — and to highlight its potential to impact our communities positively. We want to show who’s doing this work, what they’re building, and the promising future that might come next.
This blog is written in a clear and relaxed tone. We’re keeping things easy to follow, with highlights from four excellent papers, short summaries, and ample space to explore how we can learn from each other and work together. We invite you to be part of this collaborative journey.
This is the first in a three-part series. We hope it inspires new ideas and opens the door for more conversations, collaboration, and opportunities to lift up this fantastic community. Please consider how your unique perspective and expertise could contribute to these exciting developments in computer vision.

Teaching AI to See the Unseeable — OpticalNet [1]
Paper title: OpticalNet: An Optical Imaging Dataset and Benchmark Beyond the Diffraction Limit. Paper Authors: Benquan Wang, Ruyi An, Jin-Kyu So, Sergei Kurdiumov, Eng Aik Chan, Giorgio Adamo, Yuhan Peng, Yewen Li, Bo An. Institutions: Nanyang Technological University, Skywork AI, Singapore, University of Southampton, The University of Texas at Austin.
What if we could image nanoscale structures invisible to traditional optics — without dyes, electron beams, or damage? OpticalNet delivers a transformative AI benchmark for breaking the diffraction limit, using modular deep learning and a first-of-its-kind dataset.

What It’s About
OpticalNet is the first AI benchmark designed to reconstruct ultra-tiny objects — smaller than light can typically resolve — from blurry diffraction images. By combining experimental and simulated data, it trains deep learning models to translate invisible light patterns into clear, interpretable images.
Why It Matters
Due to light’s wave nature, optical resolution is traditionally capped at ~200nm. This limits real-time, noninvasive imaging of nanoscale biological structures (like viruses) and nanomaterials. OpticalNet enables conventional microscopes to break that limit using AI without expensive, invasive add-ons, potentially revolutionizing biomedicine, materials science, and manufacturing imaging.
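The ~200 nm figure comes from the Abbe diffraction limit, d = λ / (2·NA), where λ is the wavelength of light and NA is the objective’s numerical aperture. A quick back-of-the-envelope check (the wavelength and aperture values below are typical for visible-light microscopy, not taken from the paper):

```python
# Abbe diffraction limit: d = wavelength / (2 * NA).
# The values below are typical for visible-light microscopy, not from the paper.
def abbe_limit_nm(wavelength_nm: float, numerical_aperture: float) -> float:
    """Smallest feature a conventional optical microscope can resolve, in nm."""
    return wavelength_nm / (2.0 * numerical_aperture)

# Green light (~550 nm) through a high-NA oil-immersion objective (NA ~ 1.4)
print(f"{abbe_limit_nm(550, 1.4):.0f} nm")  # ~196 nm, i.e. roughly the 200 nm cap
```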
How It Works
- Data Collection: Real subwavelength objects are fabricated with Focused Ion Beam (FIB) on gold films, then scanned with a high-precision optical microscope to generate diffraction images.
- Simulation Framework: A Python-based tool mimics optical wave propagation to generate synthetic data for model training and proof-of-concept validation.
- Learning Task: Formulated as an image-to-image translation problem, where models learn to convert diffraction images into binary object images (see the code sketch after this list).
- Evaluation: Predictions are stitched together to reconstruct full objects and tested on synthetic “Light” (curved shape) and Siemens Star (rotation benchmark) datasets.
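To make the learning task concrete, here is a minimal PyTorch-style sketch of the image-to-image formulation described above: a network maps a diffraction image to a binary object image and is trained with a pixel-wise loss. The tiny CNN and the random tensors are placeholders for illustration only; the paper benchmarks much stronger CNN and transformer architectures on the actual OpticalNet data.

```python
import torch
import torch.nn as nn

# Minimal sketch of the image-to-image translation task: diffraction image in,
# binary object image out. Architecture and data are placeholders; OpticalNet
# benchmarks real CNN and transformer models on experimental and simulated data.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),          # logits for the binary object mask
)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in batch: 8 diffraction images and their binary ground-truth objects.
diffraction = torch.rand(8, 1, 64, 64)
target_mask = (torch.rand(8, 1, 64, 64) > 0.5).float()

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(diffraction), target_mask)
    loss.backward()
    optimizer.step()
```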
Key Result
Transformer-based vision models outperform CNNs, successfully reconstructing complex subwavelength structures from diffraction images, even under experimental noise. The models trained on simple square blocks generalized well to unseen complex shapes, validating the “building block” approach.
Broader Impact
OpticalNet lays the groundwork for AI-powered subwavelength imaging using existing hardware, enabling affordable, non-invasive diagnostics and quality control in fields ranging from virology to semiconductor inspection. It bridges optical physics and computer vision, inviting interdisciplinary collaboration to push the boundaries of what light-based imaging can achieve.
Human Motion Prediction That Respects the Skeleton — SkeletonDiffusion [2]
Paper title: Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction. Paper Authors: Cecilia Curreli, Dominik Muhle, Abhishek Saroha, Zhenzhang Ye, Riccardo Marin, Daniel Cremers. Institutions: Technical University of Munich, Munich Center for Machine Learning.
What if a generative model could predict how people will move, producing realistic motion without stretched or jittery limbs, by building the body’s structure directly into the diffusion process? This paper introduces a model that does just that.

What It’s About
The paper presents SkeletonDiffusion, a novel latent diffusion model for probabilistic human motion prediction. Unlike prior approaches, which often generate implausible poses (like stretched or jittery limbs), SkeletonDiffusion introduces a nonisotropic Gaussian diffusion that better reflects the structure and relationships between human body parts.
Why It Matters
Predicting human motion accurately and realistically has critical implications for autonomous driving, robotics, virtual reality, healthcare, and human-computer interaction. This method improves realism in forecasted human poses and addresses a significant shortcoming of previous models: inconsistent or anatomically incorrect body movements.
How It Works
- SkeletonDiffusion learns to generate human motion in a latent space, using a nonisotropic diffusion process tailored to the skeleton’s kinematic structure (a rough sketch of the idea follows this list).
- The model emphasizes bone-aware motion synthesis, enforcing realism and diversity through architectural bias and improved training strategies.
- The authors also critique commonly used diversity metrics, highlighting how some models gain higher diversity scores at the cost of anatomical accuracy.
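As a rough illustration of the nonisotropic idea (this is not the authors’ exact formulation), the noise injected into different joints can be correlated according to the skeleton’s kinematic adjacency, so that connected joints receive related perturbations instead of independent ones:

```python
import numpy as np

# Toy illustration of nonisotropic Gaussian noise over a skeleton (not the
# paper's exact formulation): build a covariance from the kinematic adjacency
# so that connected joints receive correlated noise instead of i.i.d. noise.
num_joints = 4
adjacency = np.array([  # toy kinematic chain: joint 0 - 1 - 2 - 3
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# I + 0.5 * A is symmetric positive definite for this chain, so it is a valid
# covariance that couples neighbouring joints.
cov = np.eye(num_joints) + 0.5 * adjacency
L = np.linalg.cholesky(cov)

samples = L @ np.random.randn(num_joints, 10_000)  # many correlated noise draws
print(np.round(np.cov(samples), 2))                # empirical covariance ~ cov
```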
Key Result
SkeletonDiffusion outperforms isotropic diffusion baselines across multiple benchmarks, including three real-world datasets, producing more plausible and diverse predictions. It sets a new state-of-the-art in balancing realism and diversity without compromising physical consistency.
Broader Impact
By incorporating structural awareness into generative modeling, SkeletonDiffusion enables safer, more accurate motion forecasting for critical applications such as assistive robotics, virtual avatars, and surveillance systems. It also opens the door for redefining evaluation standards in generative human motion modeling.
Teaching AI to See the Farm with Just a Handful of Images — Few-Shot Grounding DINO [3]
Paper title: Few-Shot Adaptation of Grounding DINO for Agricultural Domain. Paper Authors: Rajhans Singh, Rafael Bidese Puhl, Kshitiz Dhakal, Sudhir Sornapudi. Institutions: Corteva Agriscience, Indianapolis, USA
Why rely on expensive, labor-intensive annotations when AI can learn crop detection from just a few photos? This paper turns Grounding-DINO into a fast, prompt-free few-shot learner for agriculture.

What It’s About
The paper introduces a lightweight, few-shot adaptation of the Grounding-DINO open-set object detection model, tailored explicitly for agricultural applications. The method eliminates the text encoder (BERT) and uses randomly initialized trainable embeddings instead of hand-crafted text prompts, enabling accurate detection from minimal annotated data.
Why It Matters
High-performing agricultural AI often demands large, diverse annotated datasets, which are expensive and time-consuming. This method rapidly adapts a powerful foundation model to diverse agricultural tasks using only a few images, reducing costs and accelerating model deployment in farming and phenotyping scenarios.
How It Works
- Grounding-DINO typically uses a vision-language architecture (image and text encoders).
- This adaptation removes the BERT-based text encoder and replaces it with randomly initialized embeddings that are fine-tuned on a handful of training images (see the sketch after this list).
- Only these new embeddings are trained, while the rest of the model remains frozen.
- This simplified design avoids the complexities of manual text prompt engineering and dramatically reduces training overhead.
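Conceptually, the adaptation looks something like the sketch below: the text-prompt features are replaced by a small set of learnable vectors, the detector is frozen, and only those vectors are optimized. The `DummyDetector`, its call signature, and all dimensions are hypothetical stand-ins for illustration, not the real Grounding DINO API or the authors’ code.

```python
import torch
import torch.nn as nn

# Conceptual sketch only: DummyDetector is a trivial stand-in for a frozen
# vision-language detector; names, signatures, and shapes are hypothetical.
class DummyDetector(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.backbone = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

    def forward(self, images, text_features):
        feats = self.backbone(images).flatten(2).mean(-1)   # (B, D) image features
        return feats @ text_features.t()                    # (B, num_classes) scores

class FewShotAdapter(nn.Module):
    def __init__(self, detector: nn.Module, num_classes: int, embed_dim: int = 256):
        super().__init__()
        self.detector = detector
        # Randomly initialized, trainable embeddings replace the BERT text prompts.
        self.class_embeddings = nn.Parameter(torch.randn(num_classes, embed_dim) * 0.02)
        for p in self.detector.parameters():                # freeze the whole detector
            p.requires_grad = False

    def forward(self, images):
        return self.detector(images, text_features=self.class_embeddings)

adapter = FewShotAdapter(DummyDetector(), num_classes=3)
# Only the new embeddings are optimized, so a handful of labeled images suffices.
optimizer = torch.optim.AdamW([adapter.class_embeddings], lr=1e-4)
scores = adapter(torch.rand(4, 3, 64, 64))                  # 4 images -> (4, 3) scores
```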
Key Result
Across eight agricultural datasets, including PhenoBench, Crop-Weed, BUP20, and DIOR, the few-shot method consistently outperforms:
- Zero-shot Grounding-DINO, particularly in cluttered or occluded scenarios.
- YOLOv11, by up to 24% higher mAP with just four training images.
- Prior state-of-the-art few-shot detectors in remote sensing benchmarks.
Broader Impact
This work presents a scalable and cost-effective way to deploy deep learning in agriculture, even with limited data. It demonstrates how foundation models can be tailored to real-world domains like plant counting, insect detection, fruit recognition, and remote sensing, making AI more accessible and valuable for sustainable and efficient farming practices.
Can Your AI Really Drive? — Drive4C Breaks It Down [4]
Paper title: Drive4C: A Closed-Loop Benchmark on What Foundation Models Really Need to Be Capable of for Language-Guided Autonomous Driving. Paper Authors: Tin Stribor Sohn, Maximilian Dillitzer, Johannes Bach, Jason J. Corso, Tim Brühl, Robin Schwager, Tim Dieter Eberhardt, Eric Sax. Institutions: Dr. Ing. h.c. F. Porsche AG, University of Applied Science Esslingen, University of Michigan, Voxel51 Inc., Karlsruhe Institute of Technology
As language-guided autonomous driving becomes more common, one big question remains: What exactly should foundation models understand to drive safely? Drive4C answers that.
What It’s About
The paper introduces Drive4C, a closed-loop benchmark that systematically evaluates multimodal large language models (MLLMs) for language-guided autonomous driving (E2E-AD). It isolates and tests four essential capabilities: semantic, spatial, temporal, and physical understanding, along with scenario anticipation and language-guided motion (LGM).
Why It Matters
Language-guided driving is an emerging AI paradigm, but existing evaluations miss critical skills needed for safe autonomy. Drive4C allows for fine-grained performance breakdowns, helping researchers understand and fix weaknesses in modern MLLMs — an essential step for making autonomous vehicles robust and trustworthy.
How It Works
Drive4C is built on the CARLA simulator and:
- Separates evaluation into two stages: (1) QA-based scenario understanding and (2) instruction-based driving execution.
- Covers 380 scenarios with 165K QA pairs and 87 question templates.
- Evaluates models using multiple-choice and free-form questions, scored with correctness and GPT-based similarity (a toy scoring sketch follows this list).
- Adds driving performance scores based on compliance with natural language instructions (LGM).
- Supports multimodal input (e.g., video, LiDAR, radar, GPS) and is compatible with real-world sensor setups.
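As a toy example of the capability-level breakdown (the record layout and capability names below are illustrative, not the benchmark’s actual data format), per-capability accuracy over the QA stage could be tallied like this:

```python
from collections import defaultdict

# Hypothetical scoring sketch for a Drive4C-style capability breakdown; the
# record layout and capability names are illustrative, not the benchmark's API.
qa_results = [
    {"capability": "semantic", "predicted": "B", "answer": "B"},
    {"capability": "spatial",  "predicted": "A", "answer": "C"},
    {"capability": "temporal", "predicted": "D", "answer": "D"},
    {"capability": "physical", "predicted": "A", "answer": "B"},
]

correct, total = defaultdict(int), defaultdict(int)
for record in qa_results:
    total[record["capability"]] += 1
    correct[record["capability"]] += int(record["predicted"] == record["answer"])

# Per-capability accuracy shows where a model is weak (e.g. spatial or physical
# understanding) even if its overall average looks reasonable.
for capability in total:
    print(f"{capability:10s} {correct[capability] / total[capability]:.2f}")
```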
Key Result
All evaluated models (including GPT-4o, SmolVLM, Llama-3.2, DriveMM, and Dolphins) perform well in semantic understanding and scenario anticipation, but struggle significantly with spatial, temporal, and physical understanding, and fail at complex driving maneuvers in LGM. GPT-4o had the best overall score (0.3012), but none of the models is ready for real-world use.
Broader Impact
Drive4C exposes the core weaknesses in current AI driving agents and provides a capability-driven framework to guide future model improvements. It advocates for structured inductive biases (like physical laws and spatial layout models) to move toward safe and generalizable autonomous systems. The benchmark is open-source and aims to become a standard for evaluating foundation models in autonomous driving.
Why These Papers Matter — A New Era in Computer Vision
These four CVPR 2025 papers, which cover fields as diverse as microscopy, farming, autonomous driving, and human motion prediction, share a common thread: interpretability over black boxes, low-data solutions over brute-force scale, and domain-specific realism over generic accuracy.
Together, they signal a shift in the computer vision landscape:
- From scaling up to smart modularity
- From end-to-end pipelines to compositional transparency
- From curated datasets to real-world applications
As AI moves deeper into high-stakes domains, we’ll need systems that can reason, adapt, and explain themselves. The ‘real-world applications’ here refer to the use of AI in fields such as healthcare, agriculture, and autonomous driving, where those abilities are crucial.
CVPR 2025 is not just about seeing better. It’s about thinking better with vision. Stay tuned for Days 2 and 3 of Voxel51’s Best of CVPR series.
What’s next?
If you’re interested in following along as I dive deeper into the world of AI and continue to grow professionally, feel free to connect or follow me on LinkedIn. Let’s inspire each other to embrace change and reach new heights!
You can find me at some Voxel51 events (https://voxel51.com/computer-vision-events/), or if you want to join this fantastic team, it’s worth taking a look at this page: https://voxel51.com/jobs/

References
[1] B. Wang, R. An, J.-K. So, S. Kurdiumov, E. A. Chan, G. Adamo, Y. Peng, Y. Li, and B. An, “OpticalNet: An Optical Imaging Dataset and Benchmark Beyond the Diffraction Limit,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025. Temporal link: https://deep-see.github.io/OpticalNet/assets/paper.pdf
[2] C. Curreli, Z. Ye, D. Muhle, R. Marin, A. Saroha, and D. Cremers, “Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025. Temporal link: https://arxiv.org/abs/2501.06035
[3] R. Singh, R. B. Puhl, K. Dhakal, and S. Sornapudi, “Few-Shot Adaptation of Grounding DINO for Agricultural Domain,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025. Temporal link: https://arxiv.org/abs/2504.07252v1
[4] T. S. Sohn, M. Dillitzer, J. Bach, J. J. Corso, T. Brühl, R. Schwager, T. D. Eberhardt, and E. Sax, “Drive4C: A Closed-Loop Benchmark on What Foundation Models Really Need to Be Capable of for Language-Guided Autonomous Driving,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025.