Welcome to our coverage of the International Conference on Computer Vision (ICCV) 2025! ICCV showcases some of the brightest work in computer vision, but it can feel overwhelming when groundbreaking ideas are presented quickly or buried in technical papers.
That's the reason behind "The Best of ICCV 2025," a four-part
virtual meetup and blog series designed to shine a spotlight on the researchers and their work, helping the broader community understand the real-world impact of these projects long after the conference ends.
Day 1 brings an exciting lineup of papers tackling diverse multimodal AI challenges, from wildlife conservation and fashion design to medical AI transparency and unsupervised learning. These papers are united by a core theme: creating AI systems that are transparent, explainable, and adaptable to the complexities of the real world.
ProtoMedX: A multimodal AI prototype for explainable bone health diagnosis [1]
Paper title: ProtoMedX: Towards Explainable Multi-Modal Prototype Learning for Bone Health Classification
Paper authors: Alvaro Lopez Pellicer, André Mariucci, Plamen Angelov, Marwan Bukhari, Jemma G. Kerns
Institution: School of Computing and Communications, Lancaster Medical School, Lancaster University
What it's about
Osteoporosis affects 500 million people worldwide, yet diagnosis remains challenging. Existing bone health classification methods suffer from three limitations: binary classification (normal vs. osteoporosis) that oversimplifies and overlooks the clinically important osteopenia stage, vision-only approaches that ignore clinical context, and lack of explainability. ProtoMedX reimagines bone health classification through case-based reasoning using prototype learning.
Why it matters
In medical settings where decisions directly affect patient outcomes, models must be built with transparency at their core, especially under the new EU AI Act requirements for high-risk applications. Clinicians need to understand why an AI model makes predictions, not just what it predicts. ProtoMedX mirrors how clinicians actually think: comparing patients to archetypal cases rather than learning abstract features. This provides inherent AI transparency while achieving state-of-the-art performance.
How it works
ProtoMedX identifies representative prototypes for each diagnostic category and classifies patients based on similarity to learned examples. The architecture uses dual prototype spaces, with separate prototypes for visual features (DEXA scans via a frozen CrossViT) and clinical features (patient records), later fused through multimodal attention. Multi-task learning jointly optimizes classification and T-score regression, forcing the model to understand bone density as continuous rather than discrete.
Each prototype represents an actual patient case, periodically projected back to training examples to maintain AI transparency. The model learns 6 prototypes per class, empirically determined to capture intra-class diversity without redundancy.
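To make the case-based reasoning concrete, here is a minimal PyTorch sketch of prototype-style classification. The class count (3) and the 6 prototypes per class follow the paper, but the embedding dimension, the k-NN vote size, and every function and variable name are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: 3 diagnostic classes, 6 prototypes per class (as in the paper),
# and a fused image + clinical embedding of dimension 256 (the dimension is an assumption).
NUM_CLASSES, PROTOS_PER_CLASS, EMBED_DIM = 3, 6, 256

# Learned prototype bank, one row per prototype; each is periodically projected back
# onto a real training case so explanations point at actual patients.
prototypes = torch.randn(NUM_CLASSES * PROTOS_PER_CLASS, EMBED_DIM)
proto_labels = torch.arange(NUM_CLASSES).repeat_interleave(PROTOS_PER_CLASS)

def classify_by_prototypes(fused_embedding: torch.Tensor, k: int = 5):
    """Prototype / k-NN style classification: score a patient embedding by its
    similarity to the learned prototypes and vote among the k closest."""
    sims = F.cosine_similarity(fused_embedding.unsqueeze(0), prototypes, dim=-1)
    topk = sims.topk(k)
    votes = torch.bincount(proto_labels[topk.indices], minlength=NUM_CLASSES).float()
    confidence = votes / k                      # fraction of nearest prototypes per class
    predicted = int(confidence.argmax())
    # The nearest prototype indices double as the explanation:
    # "this patient looks like these archetypal cases".
    return predicted, confidence, topk.indices

patient = torch.randn(EMBED_DIM)               # stand-in for a fused DEXA + clinical embedding
pred, conf, nearest = classify_by_prototypes(patient)
print(pred, conf.tolist(), nearest.tolist())
```

Because the prototypes are anchored to real training cases, the returned indices give a clinician something inspectable: the archetypal patients the model found most similar, not just a class score.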
Key results
Trained on 4,160 real NHS patients, ProtoMedX achieves 89.8% accuracy on three-class classification, a 14-27% improvement over prior methods. It demonstrates 91.2% sensitivity for Normal vs. Abnormal detection. Critically, these results come with inherent explanations: correct predictions average 85.3% confidence vs. 48.9% for errors, with most mistakes occurring at diagnostic boundaries where AI transparency is most valuable.
Broader impact
ProtoMedX demonstrates that explainability and performance aren't mutually exclusive. The multimodal AI model provides multiple levels of transparency: classification confidence from k-NN voting, prototype-based reasoning that traces back to similar patients, feature-level analysis highlighting atypical values, and misclassification analysis flagging uncertain cases. For the broader CV community, this shows how to build multimodal AI models that maintain transparency at scale, with lessons extending beyond medical imaging to any domain requiring trustworthy AI.
LOTS of Fashion: Multi-conditioning for fashion image generation [2]
Paper title: LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing
Paper authors: Federico Girella, Davide Talon, Ziyue Liu, Zanxi Ruan, Yiming Wang, and Marco Cristani
Institution: University of Verona, Fondazione Bruno Kessler, Polytechnic Institute of Turin, University of Reykjavik
What it's about
Fashion design is inherently multimodal—designers express ideas through sketches (spatial structure and silhouettes) and text (materials, textures, styles). Yet most text-to-image models struggle with fine-grained localized control, often suffering from "attribute confusion" where properties bleed between garments. LOTS (LOcalized Text and Sketch) introduces a novel multimodal AI approach that pairs localized sketches with corresponding text descriptions for each garment item.
Why it matters
Tell a model to generate "a floral t-shirt with checkered shorts," and you might get florals on the shorts or checks on the shirt. This lack of precise control limits AI's utility for fashion designers who need to specify exactly which attributes go where. LOTS provides unprecedented control over outfit generation, enabling rapid iteration on design concepts with precise specification of each garment's appearance. At ICCV 2025, this work highlights how multimodal AI enables practical creative collaboration between humans and machines.
How it works
First, the Modularized Pair-Centric representation encodes each sketch-text pair independently to prevent information leakage between garments. Then, the Pair-Former merges text and sketch information within each pair. Finally, Diffusion Pair Guidance feeds the localized representations, alongside the global text, into the diffusion model's cross-attention across multiple denoising steps. This defers the merging to the diffusion model itself, allowing progressive integration throughout generation.
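The sketch below illustrates the general idea of pair-centric conditioning in PyTorch: each garment's sketch and text tokens are fused in isolation, and the resulting localized tokens only meet the global text embedding when they reach the denoiser's cross-attention. All module names, token counts, and dimensions are illustrative assumptions; the paper's Pair-Former and diffusion backbone are considerably more involved.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; the paper's encoders, Pair-Former, and diffusion
# backbone are stand-ins here.
D = 512                                   # shared embedding width (assumption)

class PairFormerSketch(nn.Module):
    """Merges one garment's sketch tokens with its text tokens, independently of
    the other garments, so attributes cannot leak between pairs."""
    def __init__(self, dim: int = D, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, sketch_tokens, text_tokens):
        # Text queries attend only to this pair's sketch tokens.
        fused, _ = self.attn(text_tokens, sketch_tokens, sketch_tokens)
        return fused.mean(dim=1)          # one localized token per garment pair

pair_former = PairFormerSketch()

# Two garments, each with its own sketch patches and text tokens (random stand-ins).
garments = [(torch.randn(1, 64, D), torch.randn(1, 12, D)) for _ in range(2)]
pair_tokens = torch.cat([pair_former(s, t) for s, t in garments], dim=0)  # (2, D)

# At each denoising step, the UNet's cross-attention would consume these localized
# tokens alongside the global text embedding; a simple stand-in for that merge:
global_text = torch.randn(1, D)
conditioning = torch.cat([global_text, pair_tokens], dim=0).unsqueeze(0)  # (1, 3, D)
print(conditioning.shape)
```

The point of keeping the pairs separate until the cross-attention stage is exactly the attribute-confusion problem described above: the floral pattern can only be retrieved from the t-shirt's token, not the shorts'.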
For model training and evaluation, the team built Sketchy: a new multimodal AI dataset based on Fashionpedia with hierarchical garment annotations paired with LLM-generated descriptions. They used an image-to-sketch model with careful masking to isolate individual garments.
Key results
LOTS achieves state-of-the-art performance: 0.679 GlobalCLIP, 0.813 LocalCLIP (measuring garment-level alignment), and 0.749 VQAScore (compositional semantic alignment). Human evaluation confirms superiority with 0.722 F1 score for attribute localization—a 10.8% improvement over the next-best method. Qualitatively, LOTS successfully generates complex multi-garment outfits where each item maintains its specified attributes.
Broader impact
Beyond fashion design, the multimodal AI approach to multi-condition fusion offers insights for any domain requiring coordinated generation of multiple objects. For e-commerce, automated generation of diverse product combinations could revolutionize online fashion retail. The step-based merging strategy represents a new paradigm for controllable image generation that could benefit numerous applications requiring precise spatial and semantic control.
AnimalClue: The first large-scale dataset for recognizing animals from indirect evidence [3]
Paper title: AnimalClue: Recognizing Animals by their Traces
Paper authors: Risa Shinoda, Nakamasa Inoue, Iro Laina, Christian Rupprecht, Hirokatsu Kataoka
Institution: The University of Osaka, Kyoto University, Tokyo Institute of Technology, National Institute of Advanced Industrial Science and Technology (AIST), Visual Geometry Group - University of Oxford
What it's about
While computer vision has made tremendous progress in direct animal detection from camera traps, a critical gap remains: identifying species from indirect evidence like footprints, feces, bones, eggs, and feathers. AnimalClue is the first large-scale multimodal AI dataset designed specifically for this task, containing 159,605 bounding boxes across five trace categories, covering 968 species, 200 families, and 65 orders.
Why it matters
Wildlife conservation depends on effective monitoring, but many species are nocturnal, elusive, or exhibit camouflage behaviors that make direct observation difficult. Indirect evidence is often more abundant and provides insights into animal presence, behavior, and health over longer timescales. This non-invasive approach is essential for comprehensive wildlife monitoring without disturbing natural behaviors.
How it works
The team curated research-grade images from iNaturalist, verified by multiple citizen scientists. They then organized the data hierarchically and provided annotation types suited to each trace: bounding boxes for footprints, and pixel-level segmentation masks for feces, eggs, bones, and feathers. Uniquely, each species is annotated with 22 fine-grained traits including habitat, diet, activity patterns, and behavioral characteristics.
Four benchmark tasks were established: classification, detection, instance segmentation, and trait prediction. Experiments with state-of-the-art models (Swin Transformers, YOLO variants, DINO) reveal interesting patterns: feathers achieve the highest accuracy despite having the most species, likely due to distinctive colors and patterns, while bones prove most challenging due to variation across body parts.
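As a rough illustration of the annotation structure described above, the sketch below models a single trace record with its taxonomic hierarchy and trait metadata. The field names and types are assumptions for exposition, not the dataset's actual schema (Python 3.9+ for the built-in generic types).

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative schema only; the released dataset defines its own format.
@dataclass
class TraceAnnotation:
    image_id: str
    trace_type: str                                   # "footprint", "feces", "bone", "egg", or "feather"
    bbox: tuple[float, float, float, float]           # footprints use bounding boxes...
    mask_rle: Optional[str] = None                    # ...other trace types add segmentation masks
    # Taxonomic hierarchy used for the order/family/species-level benchmarks.
    order: str = ""
    family: str = ""
    species: str = ""
    # The 22 fine-grained traits (habitat, diet, activity pattern, ...) as key/value pairs.
    traits: dict[str, str] = field(default_factory=dict)

ann = TraceAnnotation(
    image_id="inat_000123",
    trace_type="footprint",
    bbox=(34.0, 58.0, 210.0, 190.0),
    order="Carnivora",
    family="Canidae",
    species="Vulpes vulpes",
    traits={"habitat": "forest", "activity": "nocturnal"},
)
```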
Key results
The benchmark results show that there is still room for improvement. While order-level classification achieves respectable accuracy (up to 81.8% with Swin-B on feathers), species-level identification remains challenging (65.3% at best). Detection tasks prove even more difficult—the best mAP for order-level detection is just 0.57. Rare species are particularly difficult, with models struggling to generalize beyond frequent categories.
Broader impact
AnimalClue opens exciting possibilities for scalable, non-invasive wildlife monitoring. The multimodal AI dataset enables automated processing of field data at a scale impossible with manual inspection. For conservation practitioners, this means faster surveys, better population estimates, and deeper insights into ecological relationships. The open release of data and code makes AnimalClue’s AI model transparent and accessible for researchers to advance this important application area.
CLASP: Unsupervised image segmentation [4]
Paper title: CLASP: Adaptive Spectral Clustering for Unsupervised Per-Image Segmentation
Paper authors: Max Curie, Paulo da Costa
Institution: Integral Ad Science
What it's about
Supervised segmentation requires expensive pixel-level annotations. While self-supervised vision transformers like DINO produce rich embeddings without labels, applying them to segmentation faces challenges: most methods require a fixed cluster count and manual hyperparameter tuning, and struggle with boundary refinement. CLASP (CLustering via Adaptive Spectral Processing) is a training-free framework that automatically discovers meaningful segments without any labeled data.
Why it matters
Industry applications often lack labeled data. Digital advertising workflows need to process vast amounts of content for brand safety screening, creative asset curation, and social media moderation—all at scale without manual annotation. CLASP's zero-training paradigm offers immediate deployment without data collection overhead, perfect for large unannotated corpora.
How it works
CLASP's elegance lies in its simplicity. It extracts per-patch features using a pretrained DINO ViT, builds a cosine-similarity graph, and performs spectral clustering. The key innovation is automatic cluster selection via an eigengap-silhouette search: eigendecomposition finds the "elbow" in the eigenvalue spectrum (the point where the eigenvalues decay sharply, suggesting K clusters), then a search over a bandwidth around this estimate validates candidates with the silhouette score. Finally, DenseCRF sharpens boundaries by integrating spatial and appearance cues.
The method uses the affinity matrix directly rather than the normalized Laplacian, preserving the natural clustering geometry of the DINO embeddings and avoiding the randomness and instability of k-means initialization.
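A minimal scikit-learn sketch of the overall recipe, i.e. a cosine affinity over patch features, an eigengap estimate of K, and a silhouette-validated search around that estimate, is shown below. Note that scikit-learn's SpectralClustering normalizes the graph Laplacian internally, whereas the paper reports working on the affinity matrix directly, and the DenseCRF refinement step is omitted; the function name and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score

def clasp_style_segment(patch_feats: np.ndarray, k_max: int = 10, band: int = 2):
    """Cluster L2-normalized patch features with an automatically chosen K:
    an eigengap estimate on the cosine-affinity matrix, refined by silhouette score."""
    feats = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    affinity = np.clip(feats @ feats.T, 0.0, 1.0)        # cosine-similarity graph

    # Eigengap heuristic: the sharpest drop in the leading eigenvalues suggests K.
    eigvals = np.linalg.eigvalsh(affinity)[::-1][:k_max]  # descending order
    k_est = max(int(np.argmax(-np.diff(eigvals))) + 1, 2)

    # Search a small bandwidth around the estimate and keep the best silhouette score.
    best_k, best_score, best_labels = None, -1.0, None
    for k in range(max(2, k_est - band), min(k_max, k_est + band) + 1):
        labels = SpectralClustering(
            n_clusters=k, affinity="precomputed",
            assign_labels="discretize", random_state=0,
        ).fit_predict(affinity)
        score = silhouette_score(feats, labels, metric="cosine")
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels   # per-patch labels; a DenseCRF pass would sharpen boundaries

# Example: 196 DINO patch tokens (a 14x14 grid) of dimension 384, as random stand-ins.
patches = np.random.randn(196, 384).astype(np.float32)
k, labels = clasp_style_segment(patches)
print(k, labels.shape)
```

Picking K per image, rather than fixing it globally, is what lets the same pipeline handle a sparse product shot and a cluttered street scene without retuning.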
Key results
Despite being training-free, CLASP achieves competitive performance: 36% mIoU on COCO-Stuff with 64.4% pixel accuracy (vs. 30.2% mIoU for U2Seg which requires training), and 35.4% mIoU on ADE20K with 65.3% pixel accuracy. Ablations confirm that eigengap + silhouette search outperforms both fixed K and simple averaging, and that using the affinity matrix directly improves results.
Broader impact
CLASP provides a strong, reproducible baseline for unsupervised segmentation. Its training-free nature eliminates convergence concerns and enables immediate cross-domain deployment. For practitioners, this means faster time-to-value for applications where labeled data is scarce. For researchers, the simple architecture makes it easy to understand, implement, and extend. The upcoming code release will enable the community to build on this foundation for diverse applications beyond digital advertising.
Wrapping up day 1
Day 1 of ICCV 2025 has showcased the breadth of modern computer vision research, spanning wildlife monitoring, fashion design, medical diagnostics, and unsupervised learning. What connects these diverse works is their shared commitment to practical solutions: domain-aware approaches outperform generic models for specialized use cases, and AI transparency isn't optional, especially for high-stakes applications like healthcare.
Together, they point to a clear shift in how we approach computer vision:
- From direct observation to indirect reasoning
- From single-modal brute force to coordinated multimodal AI model fusion
- From black-box predictions to case-based explainability
- From annotation-heavy pipelines to training-free adaptability
As computer vision expands into conservation, healthcare, creative industries, and content moderation, all domains where mistakes have real consequences, we need transparent AI systems that can explain their decisions and adapt to new contexts. To achieve this, teams need to work with the right data, not just more data.
ICCV 2025 isn't just about seeing more accurately. It's about building vision systems we can understand, trust, and deploy responsibly.
Continue exploring the breakthroughs from The Best of ICCV 2025 Series:
- ICCV Day 2: Advancing Vision Language Models
- ICCV Day 3: Achieving Next Level AI Accuracy
- ICCV Day 4: Pushing the Boundaries of Computer Vision
Register for our
virtual meetup to connect and discover how AI transparency and multimodal AI are shaping the next era of computer vision.
References
[1] Lopez Pellicer, A., Mariucci, A., Angelov, P., Bukhari, M., and Kerns, J.G. "ProtoMedX: Towards Explainable Multi-Modal Prototype Learning for Bone Health Classification," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025.
[2] Girella, F., Talon, D., Liu, Z., Ruan, Z., Wang, Y., and Cristani, M. "LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025. arXiv:2507.22627v2.
[3] Shinoda, R., Inoue, N., Laina, I., Rupprecht, C., and Kataoka, H. "AnimalClue: Recognizing Animals by their Traces," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025. arXiv:2507.20240v1.
[4] Curie, M., and da Costa, P. "CLASP: Adaptive Spectral Clustering for Unsupervised Per-Image Segmentation," in Proceedings of the IEEE/CVF Int. Conf. Computer Vision (ICCV), 2025. arXiv:2509.25016v1.