CVPR 2024 Might Spark Interest in Agglomerative Vision Models
The conference floor was packed with the usual suspects — minor tweaks to transformers, yet another CLIP variant, and countless papers promising fractional gains on ImageNet. But this poster was different. The visualizations, showing consistent feature maps across resolutions, stopped me in my tracks, especially when contrasted with the bizarre mode-switching behavior of previous approaches. After a brief chat with the authors about their multi-resolution training strategy, I knew this was something I needed to investigate further once I got home.
Sometimes the most important advances don’t come with the flashiest presentations, but they fundamentally change how we approach problems.
The Rise of Agglomerative Vision Models
Foundation models have transformed computer vision, but the real power lies in combining their strengths.
The vision landscape is cluttered with specialized models:
- CLIP excels at connecting images to text
- DINO captures semantic structure
- SAM delivers precise segmentation masks
Each model is impressive on its own, but each also has significant limitations when applied outside its comfort zone. Agglomerative models like RADIOv2.5 solve this problem by distilling knowledge from multiple teacher models into a single, versatile student.
“For the student model to be consistently accurate across resolutions, it is sufficient to match all teachers at all resolutions, and to train at two resolutions simultaneously in the final training stage.”
The era of single-purpose vision models is ending.
What Exactly Is “Agglomerative” Modeling?
Agglomerative vision models represent a fundamental shift from how we’ve traditionally built computer vision systems.
Most practitioners are familiar with two approaches: using a single pre-trained model (like CLIP or ResNet) fine-tuned for a specific task, or creating ensembles that average predictions from multiple models at inference time. Agglomerative models take a radically different approach — they use knowledge distillation to transfer the learned representations from multiple “teacher” models into a single “student” model during training. This isn’t simple ensembling; the student learns to produce features that simultaneously match those of CLIP, DINO, and SAM for the same input image, effectively compressing multiple specialized models into one.
It’s knowledge distillation on steroids, creating a single backbone that inherits the superpowers of all its teachers.
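To make the matching idea concrete, here is a minimal sketch of multi-teacher feature distillation. The student backbone, the frozen teachers, and the linear adaptor heads are placeholders I am assuming for illustration, not the actual RADIO training code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTeacherDistiller(nn.Module):
    """Toy sketch: one trainable student matches several frozen teachers
    through lightweight per-teacher adaptor heads."""

    def __init__(self, student, teachers, student_dim, teacher_dims):
        super().__init__()
        self.student = student          # trainable backbone (placeholder)
        self.teachers = teachers        # dict of frozen teacher models (placeholder)
        # One linear adaptor per teacher maps student features into that
        # teacher's feature space.
        self.adaptors = nn.ModuleDict({
            name: nn.Linear(student_dim, dim)
            for name, dim in teacher_dims.items()
        })

    def forward(self, images):
        feats = self.student(images)    # (B, student_dim) summary features
        losses = {}
        for name, teacher in self.teachers.items():
            with torch.no_grad():
                target = teacher(images)            # (B, teacher_dim)
            pred = self.adaptors[name](feats)
            # Cosine distance keeps differently scaled teachers comparable
            losses[name] = 1 - F.cosine_similarity(pred, target, dim=-1).mean()
        return sum(losses.values()), losses
```

The point is that a single backbone is optimized against several feature targets at once during training, rather than averaging predictions from separate models at inference time.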
What Makes RADIOv2.5 Different
RADIOv2.5 is the first agglomerative model that truly works across all input resolutions.
Previous agglomerative models like AM-RADIO suffered from “mode switching” — they’d produce DINO-like features at low resolutions but SAM-like features at high resolutions. This inconsistency made them unreliable for production use. RADIOv2.5 solves this through multi-resolution training, carefully balancing teacher influence, and applying PHI Standardization to normalize feature distributions across models.
Resolution robustness is the killer feature you didn’t know you needed.
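The quote above about training at two resolutions simultaneously translates into a surprisingly small change to the training loop. Here is a rough sketch reusing the toy distiller from earlier; the specific resolutions and the bilinear resize are my assumptions:

```python
import torch.nn.functional as F

def two_resolution_step(distiller, images, lo_res=256, hi_res=768):
    """Match all teachers at a low and a high resolution in the same step,
    so the student cannot switch modes between resolution regimes."""
    total = 0.0
    for size in (lo_res, hi_res):
        resized = F.interpolate(images, size=(size, size),
                                mode="bilinear", align_corners=False)
        loss, _ = distiller(resized)
        total = total + loss
    return total
```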
The Token Compression Breakthrough
“Token Merging is very effective at retaining the most diverse information under high compression ratios.”
Token compression is the secret to efficiently integrating high-resolution vision with language models.
Traditional approaches like pixel unshuffling blindly compress visual tokens without considering information content, treating every region of an image as equally important. RADIOv2.5 introduces a bipartite matching approach that intelligently merges similar tokens, preserving detail in information-rich regions while aggressively compressing homogeneous areas. This selective compression means OCR, document understanding, and fine-grained tasks all perform dramatically better.
Your LLM doesn’t need 4,096 tokens to understand a simple image.
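For intuition, here is a simplified bipartite merge in the spirit of Token Merging (ToMe). It is not RADIO's implementation, just a sketch of the mechanism: alternate tokens form two sets, and the most redundant tokens from one set are averaged into their nearest neighbors in the other:

```python
import torch
import torch.nn.functional as F

def bipartite_token_merge(tokens, r):
    """Simplified sketch: tokens (B, N, C) -> (B, N - r, C).
    Token ordering and multi-way merge collisions are ignored here,
    unlike a production implementation."""
    B, N, C = tokens.shape
    a, b = tokens[:, ::2], tokens[:, 1::2]          # split into two sets
    scores = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)
    best_val, best_idx = scores.max(dim=-1)         # best partner in B for each A token
    order = best_val.argsort(dim=-1, descending=True)
    merge_src, keep_src = order[:, :r], order[:, r:]

    out_b = b.clone()
    for i in range(B):                              # loop for clarity, not speed
        dst = best_idx[i, merge_src[i]]             # B tokens receiving merges
        out_b[i, dst] = (out_b[i, dst] + a[i, merge_src[i]]) / 2
    kept_a = torch.gather(a, 1, keep_src.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([kept_a, out_b], dim=1)
```

Similar tokens (think blank sky or uniform background) collapse into each other, while distinctive tokens carrying text or fine detail survive the compression.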
Practical Performance Advantages
On semantic segmentation, RADIOv2.5-B scores 48.94 mIoU on ADE20k, surpassing DINOv2-g’s 48.79 despite being one-tenth the size. For vision-language tasks, RADIOv2.5-H reaches 68.9% on TextVQA versus SigLIP’s 67.6%, and 53.9% on DocVQA versus SigLIP’s 57.1%. Most impressively, RADIOv2.5 maintains consistent performance as resolution increases, while competitors degrade.
Better features translate directly to better downstream performance.
Real-World Applications
RADIOv2.5 shines in complex real-world scenarios that demand multi-faceted understanding.
Document AI applications benefit enormously from RADIOv2.5’s ability to understand both high-level document structure and fine text details simultaneously. Robotics applications leverage its combination of SAM-like spatial awareness and DINO’s semantic understanding for more robust perception. In medical imaging, the model’s resolution flexibility means it can process everything from whole-slide pathology scans to focused ROIs without quality degradation.
The next generation of vision applications will be built on agglomerative models.
Getting Started with RADIO in FiftyOne
The fastest way to experiment with RADIO models is through the FiftyOne computer vision platform’s integration.
FiftyOne makes it dead simple to leverage RADIO’s powerful embeddings for similarity search, clustering, and data curation. The integration provides access to all model variants (B/L/H/g) with both summary and spatial feature outputs. Setting up is straightforward — just register the model source, load your preferred variant, and start extracting features. The real power comes from combining RADIO’s rich embeddings with FiftyOne’s Brain workflows for duplicate detection, outlier discovery, and representative sample selection.
Five minutes of setup saves days of custom implementation work.
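Here is roughly what that setup looks like. The zoo source URL and model name below are placeholders (check the actual integration for the real identifiers), but the FiftyOne calls themselves are the standard zoo and Brain APIs:

```python
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

# Placeholder source URL and model name -- substitute the real RADIO
# zoo source and the variant you want (B/L/H/g)
foz.register_zoo_model_source(
    "https://github.com/your-org/radio-fiftyone-models", overwrite=True
)
model = foz.load_zoo_model("nv-radio-v2.5-b")

dataset = foz.load_zoo_dataset("quickstart")   # any FiftyOne dataset works

# Summary embeddings for similarity search, clustering, and curation
dataset.compute_embeddings(model, embeddings_field="radio_emb")
fob.compute_similarity(dataset, embeddings="radio_emb", brain_key="radio_sim")
fob.compute_visualization(dataset, embeddings="radio_emb", brain_key="radio_viz")

session = fo.launch_app(dataset)
```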
Advanced FiftyOne Workflows
RADIO’s dual-output capability opens up powerful workflows that most models can’t support.
The spatial features option is particularly valuable for understanding what your model is focusing on — something CLIP simply can’t provide. By loading a spatial model variant with `output_type="spatial"`, you can generate attention heatmaps that reveal which image regions are driving your model’s decisions. This is invaluable for debugging, bias detection, and understanding failure cases. Combine this with the global embeddings for a complete understanding of your dataset’s structure and individual sample characteristics.
The ability to extract both global and spatial features from a single model is a workflow accelerator.
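A minimal sketch of the spatial workflow, assuming the wrapper exposes an output_type option and continuing with the dataset loaded above:

```python
import fiftyone.zoo as foz

# Placeholder model name; output_type is assumed to be a wrapper option
spatial_model = foz.load_zoo_model("nv-radio-v2.5-b", output_type="spatial")

# Store per-sample heatmaps alongside the images for inspection in the App
dataset.apply_model(spatial_model, label_field="radio_heatmap")
```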
Lessons from Implementing RADIO in FiftyOne
I spent an entire day implementing NVIDIA’s RADIO model as a FiftyOne zoo model. What started as a “simple” model wrapper turned into a fascinating journey through the complexities of modern vision transformers.
Here are my key takeaways:
Work WITH the Model, Not Against It
The biggest lesson: Don’t fight the model’s native output format. I spent hours trying to reshape patch tokens back to spatial grids, making assumptions about patch ordering that were wrong. The breakthrough came when I realized RADIO already provides spatial structure in NCHW format — I just needed to use it correctly.
- Wrong approach: `sqrt(num_patches)` → guess spatial dimensions → reshape and hope
- Right approach: trust RADIO’s spatial layout → apply PCA across channels → preserve native structure (sketched below)
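Here is the kind of PCA-over-channels step I ended up with, sketched with NumPy and scikit-learn rather than the exact wrapper code:

```python
import numpy as np
from sklearn.decomposition import PCA

def spatial_pca_map(features):
    """Collapse NCHW spatial features (1, C, H, W), given as a NumPy array,
    into a single H x W map by projecting each location's C-dim vector onto
    the first principal component -- no reshaping guesswork required."""
    _, C, H, W = features.shape
    flat = features[0].reshape(C, H * W).T          # (H*W, C): one vector per location
    pc1 = PCA(n_components=1).fit_transform(flat)   # (H*W, 1)
    heat = pc1.reshape(H, W)
    # Normalize to [0, 1] for visualization
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
```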
The Devil is in the Data Types
Storage efficiency matters more than you think. High-resolution spatial features can easily exceed MongoDB’s 16MB document limit. Converting from `float32` to `uint8` gave us a 75% size reduction with negligible quality loss. Sometimes the “obvious” solution (store raw floats) isn’t practical at scale.
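A simple min/max quantization scheme along these lines is enough; the helper names here are mine:

```python
import numpy as np

def quantize_uint8(features):
    """Store float32 features as uint8 plus the min/max needed to restore
    them -- roughly a 4x (75%) reduction in storage."""
    fmin, fmax = float(features.min()), float(features.max())
    scale = (fmax - fmin) or 1.0
    q = np.round((features - fmin) / scale * 255).astype(np.uint8)
    return q, fmin, fmax

def dequantize(q, fmin, fmax):
    return q.astype(np.float32) / 255 * (fmax - fmin) + fmin
```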
Dual Output Strategies Are Powerful
Rather than forcing everything into one output type, I implemented separate pathways:
- Summary embeddings → 1D vectors for similarity/search
- Spatial features → 2D heatmaps for interpretability
This gives users the best of both worlds and avoids awkward compromises.
Modern Models Need Modern Optimization
Mixed precision (`bfloat16`) and GPU compatibility checking aren’t just nice-to-haves anymore. However, graceful fallbacks are crucial: not every user has an RTX 4090. Always provide a path that works on older hardware.
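Something along these lines covers both cases; the model call is a placeholder:

```python
import torch
from contextlib import nullcontext

def inference_context(device):
    """Prefer bfloat16 autocast where the GPU supports it, fall back to
    float16 on older CUDA cards, and run plain float32 on CPU."""
    if device.type == "cuda":
        dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
        return torch.autocast(device_type="cuda", dtype=dtype)
    return nullcontext()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
with inference_context(device), torch.no_grad():
    # outputs = model(images.to(device))   # model and images are placeholders
    pass
```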
Iteration Speed Matters
I went through probably 15+ different approaches to the spatial heatmap problem. Having fast iteration cycles — clear error messages, good debugging output, easy rollbacks — was essential. The solution that worked was actually quite simple, but finding it required lots of experimentation.
Domain Knowledge is Irreplaceable
Understanding the difference between RADIO’s NCHW and NCL output format was the key insight that unlocked everything. No amount of clever engineering could substitute for understanding what the model actually outputs. Read the papers, understand the architecture, don't just treat models as black boxes.
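For reference, going from flat tokens back to a spatial grid only works cleanly when you already know the input resolution and patch size; this sketch assumes a 16-pixel patch, which you should replace with your backbone's actual value:

```python
import torch

def to_nchw(tokens, image_hw, patch_size=16):
    """Rebuild the patch grid from the known input size instead of guessing
    dimensions from sqrt(num_patches)."""
    H, W = image_hw[0] // patch_size, image_hw[1] // patch_size
    if tokens.dim() == 3 and tokens.shape[1] == H * W:   # (N, L, C) layout
        tokens = tokens.permute(0, 2, 1)                 # -> (N, C, L)
    N, C, L = tokens.shape
    assert L == H * W, "token count must match the patch grid"
    return tokens.reshape(N, C, H, W)
```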
The Final Touch Makes All the Difference
The solution that finally worked included one small detail that transformed everything: `gaussian_filter(attention_1d, sigma=1.0)`. This tiny addition turned noisy, hard-to-interpret attention maps into smooth, professional-looking visualizations. Never underestimate the importance of that final 10% of polish.
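In context, the smoothing step looked roughly like this; the function name and the normalization around the call are my own packaging:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_heatmap(attention_map, sigma=1.0):
    """Light Gaussian blur turns blocky per-patch attention into a smooth,
    readable overlay, then rescales it to 0-255 for display."""
    smoothed = gaussian_filter(attention_map.astype(np.float32), sigma=sigma)
    smoothed = (smoothed - smoothed.min()) / (smoothed.max() - smoothed.min() + 1e-8)
    return (smoothed * 255).astype(np.uint8)
```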
"The best technical solutions often look obvious in retrospect, but the path to get there rarely is. Today was a perfect reminder that building good developer tools is as much about understanding the problem domain as it is about writing code."
Embracing the Agglomerative Future
The vision field stands at an inflection point where single-purpose models are giving way to agglomerative approaches that combine the best of multiple worlds.
RADIOv2.5 exemplifies this shift with its multi-resolution training that eliminates mode switching, token compression that preserves critical information, and teacher balancing that creates truly unified representations. The practical benefits are clear across benchmarks: outperforming DINOv2-g on segmentation with one-tenth the parameters, beating SigLIP on document understanding tasks, and maintaining consistency across resolutions where other models falter.
Implementing this model in production environments like FiftyOne unlocks these capabilities while teaching us valuable lessons about modern vision architectures — from respecting native tensor formats to understanding the importance of dual output pathways and optimization techniques.
What makes these models truly special isn’t just their theoretical elegance but their practical utility across a stunning range of applications: document understanding, medical imaging, robotics, and more. As specialized teachers continue to emerge in the vision landscape, the agglomerative approach will only become more powerful, creating a virtuous cycle where progress in specialized models accelerates the capabilities of generalist backbones.
The future belongs to models that don’t just excel at one thing but combine multiple specialized strengths into a cohesive, adaptable whole — and RADIOv2.5 is showing us the way.