The Best of CVPR 2025 Series – Day 2
May 29, 2025 – Written by Paula Ramos
Elevating Research Voices in Vision AI
Computer vision is evolving fast, but too often, powerful research sits unread in conference proceedings. That’s why we created the Best of CVPR virtual meetup series: to bring researchers to the forefront, spotlight how their work can solve real-world problems, and open doors for future collaboration. Research needs room to breathe, to be explained, and to be celebrated.
This blog is the second in a three-part series highlighting papers that go beyond technical novelty — they address safety, trust, fairness, and usability across industries. From smart homes and clinical decision support to expressive avatars and factory inspection, Day 2 shows how vision AI is growing more aware, robust, and practical.
Let’s dive into the impact of the algorithms.

Teaching AI What Matters at Home [1]
Paper title: SmartHome-Bench: A Comprehensive Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal Large Language Models. Paper Authors: Xinyi Zhao, Congjing Zhang, Pei Guo, Wei Li, Lin Chen, Chaoyue Zhao, Shuai Huang. Institutions: University of Washington, Wyze Labs
Can we build AI that understands anomalies in smart homes, not just detecting strange events but explaining them in ways humans trust?

What It’s About:
This paper introduces SmartHome-Bench, the first benchmark specifically designed for video anomaly detection (VAD) in smart home environments. The benchmark includes a dataset of 1,203 smart home videos annotated with anomaly types, detailed descriptions, and reasoning, and it evaluates the performance of multi-modal large language models (MLLMs) using various prompting strategies.
Why It Matters:
Previous benchmarks for VAD were geared toward public settings and lacked relevance for private environments like homes. Smart home anomalies — such as pet escapes, elder falls, or child safety issues — are unique and sensitive. This work fills a gap by enabling the evaluation of VAD in personal, safety-critical domains, emphasizing trust, transparency, and reasoning.
How It Works:
- The authors propose a taxonomy of 7 smart home event categories (e.g., wildlife, baby monitoring, senior care).
- They annotate videos with binary anomaly labels, rich textual descriptions, and reasoning explanations (e.g., why a behavior is abnormal).
- The paper evaluates six MLLMs (e.g., Gemini-1.5, GPT-4o, Claude-3.5-sonnet, VILA-13b) under multiple prompting setups: zero-shot, chain-of-thought (CoT), few-shot CoT, and in-context learning (ICL).
- They introduce a new method, the Taxonomy-Driven Reflective LLM Chain (TRLC): a multi-step process combining rule generation, initial prediction, and self-reflection (sketched in code right after this list).
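To make the TRLC idea concrete, here is a minimal Python sketch of what such a chain could look like. Everything in it is an illustrative assumption rather than the authors' code: `call_mllm` stands in for whatever multi-modal LLM client you use, the prompts are paraphrased, and only the taxonomy categories named in this post are listed.

```python
# Hypothetical TRLC-style chain: rule generation -> initial prediction -> self-reflection.
# `call_mllm` is a placeholder for a real multi-modal LLM client (Gemini, GPT-4o, Claude, ...).

# The paper defines 7 smart-home categories; only the ones named in this post are shown here.
TAXONOMY = ["wildlife", "baby monitoring", "senior care"]


def call_mllm(prompt: str, video=None) -> str:
    """Placeholder for a multi-modal LLM API call (client code not shown)."""
    raise NotImplementedError


def trlc_predict(video) -> str:
    # Step 1: rule generation -- derive per-category anomaly rules before looking at the clip.
    rules = call_mllm(
        "For each smart-home category below, write rules describing what would "
        f"count as an anomaly: {', '.join(TAXONOMY)}."
    )
    # Step 2: initial prediction conditioned on the generated rules.
    draft = call_mllm(
        f"Rules:\n{rules}\n\n"
        "Watch the clip and decide whether it is Normal or Abnormal. Explain your reasoning.",
        video=video,
    )
    # Step 3: self-reflection -- re-check the draft against the rules and commit to a final label.
    return call_mllm(
        f"Rules:\n{rules}\n\nDraft answer:\n{draft}\n\n"
        "Reflect on whether the draft is consistent with the rules, then answer "
        "with exactly one word: Normal or Abnormal.",
        video=video,
    )
```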
Key Result:
The proposed TRLC framework achieves a notable 11.62% improvement in anomaly detection accuracy over zero-shot prompting. It delivers the highest accuracy of 79.05% with Claude-3.5-sonnet and improves model performance across ambiguous (vague abnormal) and standard scenarios.
Broader Impact:
SmartHome-Bench paves the way for trustworthy, explainable AI in home monitoring. It highlights the limitations of current MLLMs in capturing nuanced private-space anomalies and introduces strategies for improvement. The benchmark and TRLC methodology could influence safety tech, elder care systems, and smart home automation while offering tools to improve human-AI alignment in real-world video understanding.
Teaching AI to Listen: Doctor-in-the-Loop Diagnosis with Concept Reasoning [2]
Paper title: Interactive Medical Image Analysis with Concept-based Similarity Reasoning. Paper Authors: Ta Duc Huy, Sen Kim Tran, Phan Nguyen, Nguyen Hoang Tran, Tran Bao Sam, Anton van den Hengel, Zhibin Liao, Johan W. Verjans, Minh-Son To, Vu Minh Hieu Phan. Institutions: Australian Institute for Machine Learning — University of Adelaide, Flinders University

What if doctors could interact directly with a diagnostic AI model — not only to view its reasoning, but to correct it in real time and teach it what truly matters in medical images?
What It’s About:
This paper introduces CSR (Concept-based Similarity Reasoning), an interpretable and interactive model for medical image analysis. CSR addresses limitations in current concept-based and prototype-based methods by offering patch-level interpretability, localized concept grounding, and real-time doctor interaction to refine predictions without relying on post-hoc analysis.
Why It Matters:
Existing AI models often act as black boxes, making them unreliable in safety-critical domains like healthcare. CSR improves trustworthiness by offering transparency, letting doctors understand the why behind model predictions and intervene during both training and test time. This enhances both accuracy and adoption of AI tools in clinical workflows.
How It Works:
CSR combines:
- Concept-based reasoning: It computes similarity between input image patches and learned concept prototypes.
- Prototype learning: Uses contrastive learning to ensure semantic consistency and compactness across concepts.
- Doctor-in-the-loop interaction: Doctors can reject irrelevant concepts or guide focus to specific regions using bounding boxes (spatial and concept-level feedback).
- Patch-level explanations: For each concept, CSR returns visual maps indicating where it found supporting evidence (a minimal sketch of this patch-concept reasoning follows the list).
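The patch-level reasoning can be pictured with a small PyTorch sketch. The function names, the cosine similarity, and the max-pooling below are assumptions made for illustration, not the paper's implementation; the point is that every concept gets a spatial evidence map that a doctor can veto or restrict.

```python
import torch
import torch.nn.functional as F


def concept_similarity_maps(patch_feats: torch.Tensor,
                            prototypes: torch.Tensor) -> torch.Tensor:
    """patch_feats: (B, H, W, D) patch embeddings from an image backbone.
    prototypes:  (C, D) learned concept prototypes.
    Returns (B, C, H, W): one spatial evidence map per concept."""
    feats = F.normalize(patch_feats, dim=-1)             # unit-norm patch embeddings
    protos = F.normalize(prototypes, dim=-1)             # unit-norm concept prototypes
    return torch.einsum("bhwd,cd->bchw", feats, protos)  # cosine similarity per patch/concept


def apply_doctor_feedback(sims: torch.Tensor, rejected_concepts, roi_mask=None):
    """Concept-level feedback: zero out concepts the clinician rejects.
    Spatial feedback: optionally keep evidence only inside a (H, W) binary mask."""
    sims = sims.clone()
    sims[:, rejected_concepts] = 0.0
    if roi_mask is not None:
        sims = sims * roi_mask[None, None]
    return sims


# Image-level concept scores could then be read off by max-pooling each map over space:
# scores = concept_similarity_maps(feats, protos).amax(dim=(2, 3))
```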
Key Result:
CSR outperforms existing interpretable methods across three biomedical datasets (TBX11K, VinDr-CXR, ISIC), achieving up to 94.4% F1-score. It also demonstrates higher trustworthiness, with a Pointing Game hit rate of 79.5% when refined by doctors, surpassing previous models like ProtoPNet and CBM.
Broader Impact:
CSR offers a trust-enhancing framework for medical AI by combining interpretability with interactivity. It empowers clinicians to inspect, correct, and guide AI systems, addressing key concerns around explainability and clinical safety. The proposed method moves beyond performance alone, emphasizing collaborative intelligence between human and machine in medical diagnostics.
OFER: Filling in the Blanks in Occluded Faces [3]
Paper title: OFER: Occluded Face Expression Reconstruction. Paper Authors: Pratheba Selvaraju, Victoria F. Abrevaya, Timo Bolkart, Rick Akkerman, Tianyu Ding, Faezeh Amjadi, Ilya Zharkov. Institutions: University of Massachusetts Amherst, MPI-IS, Google Research, University of Amsterdam, Microsoft Research
How do we reconstruct 3D facial expressions from a single image when part of the face is covered by a hand, mask, or hair, yet still produce multiple plausible, expressive results?

What It’s About:
The paper introduces OFER, a novel method for reconstructing 3D faces with diverse expressions from single occluded images. OFER is the first to combine two conditional diffusion models for generating shape and expression coefficients of a face model (FLAME), and introduces a ranking mechanism to select the most accurate facial identity among multiple generated hypotheses.
Why It Matters:
Single-image 3D face reconstruction is already challenging; occlusions introduce ambiguity and variability. Traditional methods often fail under occlusion or generate unrealistic results. OFER offers a plausible, diverse, and structurally consistent alternative, which is crucial for applications in telepresence, AR/VR, medical imaging, and biometric systems.
How It Works:
- IdGen: A conditional diffusion model generates multiple candidate neutral 3D face shapes.
- IdRank: A novel ranking network scores and selects the best-fitting shape sample based on the visible parts of the face.
- ExpGen: A second diffusion model generates multiple expression variations for the selected shape.
- The result is a set of expressive 3D face reconstructions that preserve identity while varying the expression plausibly (the full pipeline is outlined in the sketch below).
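Conceptually, the pipeline chains three components. The sketch below is a hypothetical outline only: `id_gen`, `id_rank`, `exp_gen`, and `flame_decode` are placeholder callables rather than the authors' API, and the sample counts are arbitrary.

```python
# Hypothetical outline of the OFER pipeline (shape sampling -> ranking -> expression sampling).

def reconstruct_expressions(image, id_gen, id_rank, exp_gen, flame_decode,
                            num_shapes=10, num_expressions=5):
    # IdGen: sample several FLAME shape-coefficient hypotheses from a conditional
    # diffusion model; occlusion makes the identity ambiguous, so we keep many.
    shape_candidates = [id_gen(image) for _ in range(num_shapes)]

    # IdRank: score each hypothesis against the visible parts of the face
    # and keep the best-fitting identity.
    scores = [id_rank(image, shape) for shape in shape_candidates]
    best_shape = shape_candidates[scores.index(max(scores))]

    # ExpGen: a second conditional diffusion model proposes diverse
    # expression coefficients for the chosen identity.
    expressions = [exp_gen(image, best_shape) for _ in range(num_expressions)]

    # Decode FLAME meshes: one identity, several plausible expressions.
    return [flame_decode(best_shape, expr) for expr in expressions]
```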
Key Result:
OFER outperforms state-of-the-art methods (e.g., Diverse3D, EMOCA) in both quality and diversity of expression under occlusion. It achieves lower reconstruction errors on the NoW benchmark and improved results on the new CO-545 dataset introduced by the authors.
Broader Impact:
OFER sets a new standard for multi-hypothesis 3D face reconstruction under occlusion, unlocking real-world applications in challenging environments. The proposed ranking mechanism also opens new research directions for sample selection in diffusion models, and the CO-545 dataset provides a valuable benchmark for future research.
Multi-Flow: Smarter Eyes on the Factory Floor [4]
Paper title: Multi-Flow: Multi-View-Enriched Normalizing Flows for Industrial Anomaly Detection. Paper Authors: Mathis Kruse, Bodo Rosenhahn. Institutions: Institute for Information Processing, L3S — Leibniz University Hannover
Traditional visual inspection models can miss defects by relying on just one camera angle. What if AI could reason across multiple views to spot anomalies — no matter where they hide?

What It’s About:
The paper introduces Multi-Flow, a novel multi-view industrial anomaly detection architecture using normalizing flows. It enhances anomaly detection by integrating multiple views of an object and estimating the exact likelihood across them. The method is evaluated on the challenging Real-IAD dataset, which features five fixed views per object.
Why It Matters:
Most existing methods assume that a single image is enough to detect product anomalies, which is unrealistic in industrial scenarios where defects may only be visible from certain angles. Multi-Flow addresses this limitation by combining cross-view reasoning with a robust statistical modeling framework, offering better reliability in real-world settings where quality assurance is critical.
How It Works:
- Uses a RealNVP-based normalizing flow to model the distribution of normal (non-defective) features extracted from images.
- Introduces a multi-view coupling block that allows cross-view message passing using 2D convolutions between adjacent and top-view camera images.
- Employs foreground-background segmentation (MVANet) to exclude background noise and focus learning on the object itself.
- Regularizes training via noise-conditioned data augmentation, using a modified variant of the SoftFlow and SimpleNet conditioning strategies.
- Trained in a semi-supervised setup, using only defect-free training images, and evaluated both image-wise and sample-wise (see the scoring sketch after this list).
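At inference time, a normalizing flow gives an exact log-likelihood, and its negative can serve as an anomaly score. The sketch below assumes a flow that returns `(z, log_det)` for a batch of per-view feature maps and uses a simple max over views for the sample-wise score; the paper's multi-view coupling models the views jointly, so treat this purely as an illustration of the scoring idea.

```python
import torch


@torch.no_grad()
def multi_view_anomaly_scores(flow, view_features):
    """flow:          trained normalizing flow; assumed to return (z, log_det)
                      for a batch of feature maps (this interface is an assumption).
    view_features: (V, C, H, W) backbone features, one entry per camera view.
    Higher score = lower likelihood under the defect-free model = more anomalous."""
    z, log_det = flow(view_features)

    # Log-likelihood under a standard-normal base distribution
    # (constant log(2*pi) term dropped; it does not affect ranking).
    log_pz = -0.5 * (z ** 2).flatten(1).sum(dim=1)
    log_likelihood = log_pz + log_det

    image_scores = -log_likelihood     # image-wise: one score per view
    sample_score = image_scores.max()  # sample-wise: a defect seen in any view flags the part
    return image_scores, sample_score
```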
Key Result:
Achieves state-of-the-art detection on the Real-IAD dataset:
- Sample-wise AUROC: 95.85 (up from SimpleNet’s 94.9)
- Image-wise AUROC: 90.27
Outperforms all prior baselines in detecting defects aggregated across multiple views.
Ablation studies show the importance of cross-view connections and background removal for boosting performance.
Broader Impact:
Multi-Flow contributes to building more trustworthy, robust AI systems for visual quality inspection in manufacturing. Its architecture applies to real-world production lines where multi-view camera setups are standard. The method reduces reliance on idealized single-view assumptions and paves the way for industry-scale, view-agnostic anomaly detection systems. The authors provide open-source code to facilitate further adoption and research.
Why These Papers Matter — A New Era in Computer Vision
The research we explored today gives us a glimpse into how vision AI is starting to show up in our daily lives, in homes, hospitals, and industrial settings. Each project brings something meaningful to the table: tools that help doctors understand AI decisions, systems that watch over smart homes with care, models that handle real-world imperfections like occlusions, and methods that make quality control smarter and more reliable.
What stands out is the intention behind these efforts. The focus isn’t just on making things work — it’s on making them work in clear, usable, and trustworthy ways.
We’re excited to continue sharing the stories behind this work. We’ll see you soon on Day 3.
What’s next?
If you’re interested in following along as I dive deeper into the world of AI and continue to grow professionally, feel free to connect or follow me on LinkedIn. Let’s inspire each other to embrace change and reach new heights!
You can find me at some Voxel51 events (https://voxel51.com/computer-vision-events/), or if you want to join this fantastic team, it’s worth taking a look at this page: https://voxel51.com/jobs/

References
[1] Xinyi Zhao, Congjing Zhang, Pei Guo, Wei Li, Lin Chen, Chaoyue Zhao, and Shuai Huang, “SmartHome-Bench: A Comprehensive Benchmark for Video Anomaly Detection in Smart Homes Using Multi-Modal Large Language Models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[2] Ta Duc Huy, Sen Kim Tran, Phan Nguyen, Nguyen Hoang Tran, Tran Bao Sam, Anton van den Hengel, Zhibin Liao, Johan W. Verjans, Minh-Son To, and Vu Minh Hieu Phan, “Interactive Medical Image Analysis with Concept-based Similarity Reasoning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv: https://arxiv.org/abs/2503.06873
[3] Pratheba Selvaraju, Victoria Fernandez Abrevaya, Timo Bolkart, Rick Akkerman, Tianyu Ding, Faezeh Amjadi, and Ilya Zharkov, “OFER: Occluded Face Expression Reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv: https://arxiv.org/abs/2410.21629
[4] Mathis Kruse and Bodo Rosenhahn, “Multi-Flow: Multi-View-Enriched Normalizing Flows for Industrial Anomaly Detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv: https://arxiv.org/pdf/2504.03306