Best of ICCV - November 19, 2025
Nov 19, 2025
9 AM Pacific
Online. Register for the Zoom!
About this event
Welcome to the Best of ICCV series, your virtual pass to some of the groundbreaking research, insights, and innovations that defined this year's conference, streamed live from the authors to you.
Schedule
AnimalClue: Recognizing Animals by their Traces
Wildlife observation plays an important role in biodiversity conservation, necessitating robust methodologies for monitoring wildlife populations and interspecies interactions. Recent advances in computer vision have significantly contributed to automating fundamental wildlife observation tasks, such as animal detection and species identification. However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance for wildlife monitoring. To bridge this gap, we introduce AnimalClue, the first large-scale dataset for species identification from images of indirect evidence. Our dataset consists of 159,605 bounding boxes encompassing five categories of indirect clues: footprints, feces, eggs, bones, and feathers. It covers 968 species, 200 families, and 65 orders. Each image is annotated with species-level labels, bounding boxes or segmentation masks, and fine-grained trait information, including activity patterns and habitat preferences. Unlike existing datasets primarily focused on direct visual features (e.g., animal appearances), AnimalClue presents unique challenges for classification, detection, and instance segmentation tasks due to the need to recognize more detailed and subtle visual features. In our experiments, we extensively evaluate representative vision models and identify key challenges in identifying animals from their traces.
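To make the dataset's structure concrete, here is a minimal sketch of how a species-from-traces annotation could be represented and filtered in Python. The field and function names are illustrative assumptions, not the official AnimalClue schema or API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical clue categories following the abstract's five types.
CLUE_TYPES = {"footprint", "feces", "egg", "bone", "feather"}

@dataclass
class ClueAnnotation:
    """One indirect-evidence annotation (field names are illustrative, not the official schema)."""
    image_path: str
    clue_type: str                           # one of CLUE_TYPES
    bbox_xywh: List[float]                   # [x, y, width, height] in pixels
    species: str                             # species-level label, e.g. "Vulpes vulpes"
    family: str
    order: str
    activity_pattern: Optional[str] = None   # e.g. "nocturnal"
    habitat: Optional[str] = None            # e.g. "temperate forest"

def filter_by_clue(annotations: List[ClueAnnotation], clue_type: str) -> List[ClueAnnotation]:
    """Select annotations of a single clue category, e.g. to train a footprint-only detector."""
    if clue_type not in CLUE_TYPES:
        raise ValueError(f"unknown clue type: {clue_type}")
    return [a for a in annotations if a.clue_type == clue_type]

if __name__ == "__main__":
    demo = [
        ClueAnnotation("img_001.jpg", "footprint", [120.0, 80.0, 64.0, 48.0],
                       "Vulpes vulpes", "Canidae", "Carnivora", "nocturnal", "temperate forest"),
        ClueAnnotation("img_002.jpg", "feather", [10.0, 20.0, 200.0, 90.0],
                       "Buteo buteo", "Accipitridae", "Accipitriformes"),
    ]
    print(len(filter_by_clue(demo, "footprint")), "footprint annotations")
```

The released annotations may well follow a COCO-style JSON layout instead; the point is only to show the kind of per-clue, per-species record the benchmark's classification, detection, and segmentation tasks operate on.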
LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing
Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, which capture material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text-based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation.

First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model’s multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.
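As a rough illustration of the two ingredients described above, the sketch below encodes each localized (sketch, text) pair into a shared latent token and then merges one global condition with the pair tokens via cross-attention inside a denoising update. It uses toy PyTorch modules with made-up dimensions in place of LOTS's actual backbones and merging rule, so treat it as a schematic rather than the authors' implementation.

```python
import torch
import torch.nn as nn

D = 256  # shared latent width (illustrative)

class PairEncoder(nn.Module):
    """Encode one localized (sketch, text) pair into a shared latent space,
    keeping each pair as its own token so localized features stay independent."""
    def __init__(self, sketch_dim=512, text_dim=768, d=D):
        super().__init__()
        self.sketch_proj = nn.Linear(sketch_dim, d)
        self.text_proj = nn.Linear(text_dim, d)
        self.mix = nn.Linear(2 * d, d)

    def forward(self, sketch_feat, text_feat):
        s = self.sketch_proj(sketch_feat)
        t = self.text_proj(text_feat)
        return self.mix(torch.cat([s, t], dim=-1))  # (..., d)

class PairGuidance(nn.Module):
    """Attention-based merging of a global condition and localized pair tokens
    into one denoising step (a stand-in for the diffusion model's update)."""
    def __init__(self, d=D, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, latent_tokens, global_cond, pair_conds):
        # Conditions: 1 global token + N localized pair tokens.
        cond = torch.cat([global_cond.unsqueeze(1), pair_conds], dim=1)
        guided, _ = self.attn(latent_tokens, cond, cond)
        return latent_tokens + guided  # residual update of the denoised latent

if __name__ == "__main__":
    B, N_pairs, N_latent = 2, 3, 16
    enc, guide = PairEncoder(), PairGuidance()
    pair_tokens = enc(torch.randn(B, N_pairs, 512), torch.randn(B, N_pairs, 768))
    out = guide(torch.randn(B, N_latent, D), torch.randn(B, D), pair_tokens)
    print(out.shape)  # torch.Size([2, 16, 256])
```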
ProtoMedX: Explainable Multi-Modal Prototype Learning for Bone Health Assessment
Early detection of osteoporosis and osteopenia is critical, yet most AI models for bone health rely solely on imaging and offer little transparency into their decisions. In this talk, I will present ProtoMedX, the first prototype-based framework that combines lumbar spine DEXA scans with patient clinical records to deliver accurate and inherently explainable predictions. Unlike black-box deep networks, ProtoMedX classifies patients by comparing them to learned case-based prototypes, mirroring how clinicians reason in practice. Our method not only achieves state-of-the-art accuracy on a real NHS dataset of 4,160 patients but also provides clear, interpretable explanations aligned with the upcoming EU AI Act requirements for high-risk medical AI. Beyond bone health, this work illustrates how prototype learning can make multi-modal AI both powerful and transparent, offering a blueprint for other safety-critical domains.
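The case-based reasoning at the heart of prototype learning can be sketched generically: fuse an image embedding with tabular clinical features, then score each class by similarity to its nearest learned prototype, which doubles as the explanation for the decision. The PyTorch snippet below is such a generic sketch with illustrative dimensions and fusion rule; it is not the ProtoMedX architecture.

```python
import torch
import torch.nn as nn

class PrototypeClassifier(nn.Module):
    """Generic prototype-based classifier over fused image + tabular features
    (illustrative; not the ProtoMedX model)."""
    def __init__(self, img_dim=512, tab_dim=16, d=128, protos_per_class=5, n_classes=3):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(img_dim + tab_dim, d), nn.ReLU(), nn.Linear(d, d))
        # One bank of learnable prototypes per class (e.g. normal / osteopenia / osteoporosis).
        self.prototypes = nn.Parameter(torch.randn(n_classes, protos_per_class, d))

    def forward(self, img_emb, tab_feats):
        z = self.fuse(torch.cat([img_emb, tab_feats], dim=-1))      # (B, d)
        # Distance from each sample to every prototype.
        dists = torch.cdist(z, self.prototypes.flatten(0, 1))        # (B, classes * protos)
        dists = dists.view(z.size(0), self.prototypes.size(0), -1)   # (B, classes, protos)
        # Class score = similarity to the nearest prototype of that class;
        # the index of that prototype is the "case" supporting the decision.
        nearest, idx = dists.min(dim=-1)
        logits = -nearest
        return logits, idx

if __name__ == "__main__":
    model = PrototypeClassifier()
    logits, explaining = model(torch.randn(4, 512), torch.randn(4, 16))
    print(logits.shape, explaining.shape)  # torch.Size([4, 3]) torch.Size([4, 3])
```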
CLASP: Adaptive Spectral Clustering for Unsupervised Per-Image Segmentation
We introduce CLASP (Clustering via Adaptive Spectral Processing), a lightweight framework for unsupervised image segmentation that operates without any labeled data or fine-tuning. CLASP first extracts per-patch features using a self-supervised ViT encoder (DINO); then, it builds an affinity matrix and applies spectral clustering. To avoid manual tuning, we select the segment count automatically with an eigengap-silhouette search, and we sharpen the boundaries with a fully connected DenseCRF. Despite its simplicity and training-free nature, CLASP attains competitive mIoU and pixel accuracy on COCO-Stuff and ADE20K, matching recent unsupervised baselines. The zero-training design makes CLASP a strong, easily reproducible baseline for large unannotated corpora, which are especially common in digital advertising and marketing workflows such as brand-safety screening, creative asset curation, and social-media content moderation.
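Because the pipeline is fully procedural, it is easy to sketch end to end. The snippet below follows the same shape (per-patch features, affinity matrix, eigengap-based choice of the segment count, spectral clustering) using random vectors in place of DINO patch tokens and omitting the DenseCRF refinement; the eigengap rule shown is a common heuristic and not necessarily CLASP's exact selection criterion.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import SpectralClustering

def cosine_affinity(feats: np.ndarray) -> np.ndarray:
    """Symmetric, non-negative affinity matrix from patch features."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return np.clip(normed @ normed.T, 0.0, None)

def pick_k_by_eigengap(affinity: np.ndarray, k_max: int = 10) -> int:
    """Choose the cluster count from the largest gap in the normalized-Laplacian spectrum."""
    deg = affinity.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg + 1e-12))
    lap = np.eye(len(affinity)) - d_inv_sqrt @ affinity @ d_inv_sqrt
    eigvals = eigh(lap, eigvals_only=True)[:k_max + 1]   # ascending eigenvalues
    gaps = np.diff(eigvals)
    return int(np.argmax(gaps[1:]) + 2)                  # skip the trivial eigenvalue, k >= 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch_feats = rng.normal(size=(196, 384))    # stand-in for 14x14 DINO ViT-S patch tokens
    affinity = cosine_affinity(patch_feats)
    k = pick_k_by_eigengap(affinity)
    labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                                random_state=0).fit_predict(affinity)
    print(f"k={k}, label counts: {np.bincount(labels)}")
```

In the real pipeline the cluster labels would be reshaped back onto the patch grid, upsampled to pixel resolution, and refined with a DenseCRF before computing mIoU and pixel accuracy.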