C-RADIOv4: A Distilled Vision Foundation Model for FiftyOne
Feb 6, 2026
8 min read

Introduction

The field of computer vision has seen remarkable progress with vision foundation models that excel at specific tasks—CLIP for vision-language alignment, DINOv3 for dense perception, SAM3 for segmentation. But what if you could distill the best of these foundation models into a single, unified architecture? That's exactly what C-RADIOv4 achieves.
C-RADIOv4 is the latest release in NVIDIA Labs' agglomerative vision foundation model family. Through innovative multi-teacher distillation, it creates a unified student model that retains and improves upon the distinct capabilities of multiple state-of-the-art teachers: SigLIP2, DINOv3, and SAM3.
In this blog post, we'll explore what makes C-RADIOv4 special, dive into its technical innovations, and show you how to leverage its powerful capabilities in FiftyOne for computer vision workflows.

What are foundation models?

Foundation models are large, pretrained models that learn broad, reusable representations from massive datasets and can be adapted to many downstream tasks with little or no task-specific training.

What is C-RADIOv4?

C-RADIOv4 is an agglomerative foundation model that learns from multiple teacher models simultaneously. Instead of training on labeled data, it distills foundation model knowledge from three cutting-edge vision models:
  • SigLIP2: A frontier text-image encoder that provides enhanced zero-shot classification capabilities
  • DINOv3: A self-supervised learning powerhouse that delivers strong dense perception and semantic segmentation
  • SAM3: The latest iteration of Segment Anything Model, enabling powerful segmentation capabilities
The result? A single vision foundation model that combines text-image alignment, dense perception, and segmentation, all while maintaining competitive performance at a fraction of the parameter count of its teachers.

C-RADIOv4: Vision foundation model variants

C-RADIOv4 comes in two variants: C-RADIOv4-H (a ViT-H backbone with 631M parameters) and C-RADIOv4-SO400M (412M parameters).
Both variants achieve impressive results:
  • Zero-shot accuracy: 83.09% (H) and 82.01% (SO400M) on ImageNet-1K
  • kNN accuracy: 86.59% (H) and 85.76% (SO400M) on ImageNet-1K
  • Semantic segmentation: 55.20 mIoU (H) and 55.14 mIoU (SO400M) on ADE20k
Remarkably, C-RADIOv4-H achieves performance competitive with DINOv3-7B (which has 6.7B parameters) on dense tasks, despite having an order of magnitude fewer parameters.

C-RADIOv4: Key technical innovations for vision foundation models

1. Stochastic resolution training

Previous RADIO vision foundation models were trained at just two resolutions, which could lead to inconsistent behavior at inference time, a phenomenon called "mode switching." C-RADIOv4 solves this by training across a wide range of resolutions:
  • Low-resolution partition: {128, 192, 224, 256, 384, 432} pixels
  • High-resolution partition: {512, 768, 1024, 1152} pixels
This enables smooth resolution scaling and substantial quality improvements at low resolutions. The model demonstrates strong robustness even at resolutions higher than those it was trained on, achieving 57.72 mIoU at 1536px on ADE20k.
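As a rough illustration, per-iteration resolution sampling might look like the sketch below; the 50/50 split between partitions is an assumption for illustration, not the paper's actual schedule.

```python
import random

# Resolution partitions from C-RADIOv4 (square crops, in pixels)
LOW_RES = [128, 192, 224, 256, 384, 432]
HIGH_RES = [512, 768, 1024, 1152]

def sample_training_resolution(p_high=0.5):
    """Pick a resolution for the next training batch.

    Sampling a fresh resolution each iteration, rather than alternating
    between two fixed resolutions, is what smooths out "mode switching".
    The 50/50 partition split is illustrative, not the paper's schedule.
    """
    partition = HIGH_RES if random.random() < p_high else LOW_RES
    return random.choice(partition)
```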

2. Shift equivariance

One challenge in foundation model distillation is that students can learn not just useful features, but also fixed-pattern noise from teachers. C-RADIOv4 addresses this with two forms of shift equivariance:
  • Shift Equivariant Loss: Randomly shifts crops for both student and teachers independently, preventing the model from learning position-dependent artifacts
  • Shift Equivariant MESA: Applies exponential moving average matching with different crops, further reducing fixed-pattern noise
These techniques ensure the model learns robust, semantically meaningful features rather than spurious patterns.

3. ViTDet mode for high-resolution efficiency

For high-resolution inference, C-RADIOv4 supports "ViTDet mode," which uses windowed attention instead of full global attention. This dramatically reduces inference time:
  • The SO400M variant with ViTDet window size ≤ 12 is faster than SAM3's encoder
  • ViT-H with window size 8 is nearly as fast as SAM3
  • Latency scales much more favorably with resolution compared to full attention
This makes C-RADIOv4 practical for real-world applications requiring high-resolution processing.

4. Balanced summary loss

Different teachers produce summary features with different angular dispersions. Without normalization, teachers with larger dispersion would dominate the loss. C-RADIOv4 uses an angular loss formulation that normalizes for each teacher's cone radius, ensuring balanced learning from all teachers.
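A minimal sketch of the idea, assuming the normalizer is each teacher's measured cone radius (its mean angular dispersion); the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def balanced_summary_loss(student_proj, teacher_summary, cone_radius):
    """Angular distance between the student's projected summary and a
    teacher's summary features, divided by that teacher's cone radius so
    that teachers with wider angular dispersion don't dominate the loss.

    Treating cone_radius as the teacher's mean angular dispersion is an
    assumption of this sketch, not necessarily the paper's definition.
    """
    cos = F.cosine_similarity(student_proj, teacher_summary, dim=-1)
    angle = torch.arccos(cos.clamp(-1 + 1e-6, 1 - 1e-6))  # radians
    return (angle / cone_radius).mean()
```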

Getting started with C-RADIOv4 in FiftyOne

FiftyOne provides seamless integration with C-RADIOv4, making it easy to leverage these powerful capabilities in your computer vision workflows.

Installation

First, install FiftyOne and register the C-RADIOv4 model source:
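The source URL below is a placeholder; point it at the actual C-RADIOv4 remote model source repository.

```python
# Install FiftyOne first:
#   pip install -U fiftyone

import fiftyone.zoo as foz

# Placeholder URL -- replace with the actual C-RADIOv4 model source repo
foz.register_zoo_model_source(
    "https://github.com/<org>/c-radiov4",
    overwrite=True,
)
```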

Quick start: Computing embeddings

The most common use case is extracting image embeddings for similarity search, clustering, and visualization:
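Here's a minimal example; the zoo model name is an assumption, so use whatever name the registered C-RADIOv4 source actually exposes:

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Load a small sample dataset to experiment with
dataset = foz.load_zoo_dataset("quickstart")

# Model name is an assumption; check the registered source for the exact name
model = foz.load_zoo_model("nvidia/C-RADIOv4-H")

# Store one embedding vector per sample on the dataset
dataset.compute_embeddings(model, embeddings_field="cradio_embeddings")
```

The cradio_embeddings field computed here powers every workflow in the rest of this post.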

Model configuration options

C-RADIOv4 offers flexible configuration for different use cases:
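For example, you can choose a variant at load time; the model names and keyword arguments below are assumptions to verify against the model source's documentation:

```python
import fiftyone.zoo as foz

# Larger ViT-H variant: best quality (name is an assumption)
model_h = foz.load_zoo_model("nvidia/C-RADIOv4-H")

# Smaller 412M-parameter variant: faster inference
model_so = foz.load_zoo_model(
    "nvidia/C-RADIOv4-SO400M",
    # Additional kwargs (e.g., input resolution, summary vs. spatial
    # outputs) depend on what the model source exposes
)
```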

C-RADIOv4: Practical use cases for vision foundation models

1. Embedding visualization with UMAP

Create beautiful 2D visualizations of your image embeddings to understand dataset structure:
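For example, using the embeddings computed earlier (UMAP requires the umap-learn package):

```python
import fiftyone as fo
import fiftyone.brain as fob

# Reduce the embeddings to 2D with UMAP (pip install umap-learn)
fob.compute_visualization(
    dataset,
    embeddings="cradio_embeddings",
    method="umap",
    brain_key="cradio_umap",
)

# Explore the embedding plot interactively alongside your samples
session = fo.launch_app(dataset)
```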
The visualization reveals clusters of similar images, making it easy to identify patterns and outliers in your dataset.

2. Similarity search

Build powerful similarity search capabilities:
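For example, indexing the embeddings computed earlier:

```python
import fiftyone.brain as fob

# Build a similarity index over the embeddings
fob.compute_similarity(
    dataset,
    embeddings="cradio_embeddings",
    brain_key="cradio_sim",
)

# Sort the dataset by similarity to a query sample
query_id = dataset.first().id
view = dataset.sort_by_similarity(query_id, k=25, brain_key="cradio_sim")
```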
This is incredibly useful for:
  • Finding duplicate or near-duplicate images
  • Discovering similar content for data augmentation
  • Building recommendation systems
  • Quality control and anomaly detection

3. Spatial attention heatmaps

Visualize what regions of an image the model focuses on using spatial attention features:
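The sketch below assumes C-RADIOv4 exposes per-patch spatial features via the RADIO family's torch.hub interface (the version string is a placeholder) and that the patch grid is square; it projects patch features onto their first principal component and stores the result as a FiftyOne heatmap:

```python
import numpy as np
import torch
from PIL import Image
from sklearn.decomposition import PCA
import fiftyone as fo

# Loading via torch.hub follows the RADIO family convention; the version
# string for C-RADIOv4 is a placeholder assumption
model = torch.hub.load("NVlabs/RADIO", "radio_model", version="c-radio_v4-h")
model.eval()

sample = dataset.first()

# RADIO models expect channels-first float tensors in [0, 1] at a
# resolution divisible by the patch size
image = Image.open(sample.filepath).convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(image)).permute(2, 0, 1).float() / 255.0

with torch.no_grad():
    summary, spatial = model(x.unsqueeze(0))  # spatial: (1, n_patches, dim)

# Project the patch features onto their first principal component
feats = spatial[0].cpu().numpy()
pc1 = PCA(n_components=1).fit_transform(feats)[:, 0]
pc1 = (pc1 - pc1.min()) / (np.ptp(pc1) + 1e-8)

# Reshape to the patch grid; a square grid is assumed here
side = int(np.sqrt(len(pc1)))
heatmap = pc1[: side * side].reshape(side, side)

sample["cradio_attention"] = fo.Heatmap(map=heatmap.astype(np.float32))
sample.save()
```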
The heatmaps use PCA visualization (as described in the C-RADIOv4 paper) to show object boundaries and regions of interest. This is particularly useful for:
  • Understanding model behavior
  • Debugging and interpretability
  • Identifying important image regions
  • Quality assessment

4. Dataset curation: Uniqueness and representativeness

Use embeddings to score how unique or representative each sample is:
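For example, reusing the embeddings computed earlier:

```python
import fiftyone.brain as fob

# Uniqueness: higher scores mean fewer similar samples in the dataset
fob.compute_uniqueness(dataset, embeddings="cradio_embeddings")

# Representativeness: higher scores mean the sample typifies a common mode
fob.compute_representativeness(dataset, embeddings="cradio_embeddings")

# Surface the most unique samples first
unique_view = dataset.sort_by("uniqueness", reverse=True)
```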
These scores help with:
  • Active learning: Select diverse samples for labeling
  • Dataset balancing: Identify underrepresented regions
  • Data cleaning: Find outliers and edge cases
  • Efficient annotation: Prioritize representative samples

5. Duplicate detection

Automatically find and remove near-duplicate images:
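For example (the distance threshold is a starting point to tune for your data):

```python
import fiftyone.brain as fob

# Flag samples whose embeddings lie within a distance threshold of another
index = fob.compute_near_duplicates(
    dataset,
    embeddings="cradio_embeddings",
    threshold=0.1,
)

print(f"Found {len(index.duplicate_ids)} near-duplicate samples")

# Review the duplicates before tagging or deleting them
dup_view = index.duplicates_view()
```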
This is essential for:
  • Cleaning datasets before training
  • Reducing storage costs
  • Improving dataset quality
  • Preventing data leakage in train/test splits

Advanced workflows for vision foundation models

Comprehensive analysis pipeline

Combine multiple C-RADIOv4 outputs for thorough dataset analysis:
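A sketch of such a pipeline, reusing the assumed model name from earlier:

```python
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

model = foz.load_zoo_model("nvidia/C-RADIOv4-H")  # name assumed as above

# 1. Embeddings power everything downstream
dataset.compute_embeddings(model, embeddings_field="cradio_embeddings")

# 2. Similarity index for search and duplicate detection
fob.compute_similarity(
    dataset, embeddings="cradio_embeddings", brain_key="cradio_sim"
)

# 3. 2D visualization for interactive exploration
fob.compute_visualization(
    dataset, embeddings="cradio_embeddings", brain_key="cradio_umap"
)

# 4. Curation scores
fob.compute_uniqueness(dataset, embeddings="cradio_embeddings")
fob.compute_representativeness(dataset, embeddings="cradio_embeddings")

# 5. Explore everything interactively
session = fo.launch_app(dataset)
```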
This pipeline gives you:
  • Similarity search for finding related images
  • Attention visualization for understanding model focus
  • Uniqueness scores for identifying outliers
  • Representativeness scores for dataset balancing
  • Interactive exploration in the FiftyOne App

Performance optimization for vision foundation models

Batch processing

For optimal performance on large datasets, use batching with parallel data loading:
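For example:

```python
# batch_size and num_workers are starting points; tune them for your hardware
dataset.compute_embeddings(
    model,
    embeddings_field="cradio_embeddings",
    batch_size=32,
    num_workers=8,
)
```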
Recommended batch sizes by GPU:
  • RTX 3090/4090 (24GB): batch_size=16-32
  • A100 (40GB/80GB): batch_size=32-64
  • Smaller GPUs: batch_size=4-8

GPU acceleration

C-RADIOv4 uses GPU acceleration with mixed precision (bfloat16) on compatible hardware (Ampere and newer). The vision foundation model automatically handles:
  • Device placement (GPU/CPU)
  • Mixed precision inference
  • Efficient memory management
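You can quickly verify that your hardware supports bfloat16 autocast:

```python
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # True on Ampere (compute capability 8.0) and newer GPUs
    print("bfloat16 supported:", torch.cuda.is_bf16_supported())
```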

High-resolution processing

For high-resolution images, consider using the SO400M variant with ViTDet mode enabled (when available through the API) for faster inference:
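(The model name below is an assumption, and the ViTDet toggle is shown commented out because the exact keyword depends on what the model source exposes.)

```python
import fiftyone.zoo as foz

model = foz.load_zoo_model(
    "nvidia/C-RADIOv4-SO400M",
    # vitdet_window_size=12,  # hypothetical kwarg: enables windowed attention
)
```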

Real-world applications for vision foundation models

C-RADIOv4 has been successfully deployed in various domains:
  • Autonomous Vehicles: Dense perception for scene understanding
  • Robotics: Visual navigation and manipulation
  • Document Processing: OCR and document parsing (Nemotron Parse)
  • Open Vocabulary Segmentation: RADSeg uses RADIO for state-of-the-art segmentation
  • Vision-Language Models: Used as vision encoder in Nemotron Nano V2 VL
The vision foundation model's permissive license (NVIDIA Open Model License Agreement) makes it suitable for both research and commercial applications.

Why C-RADIOv4?

Unified capabilities

Instead of maintaining separate models for different tasks, C-RADIOv4 provides:
  • Zero-shot classification (from SigLIP2)
  • Dense perception (from DINOv3)
  • Segmentation capabilities (from SAM3)
All in a single, efficient model.

Efficiency

With 631M parameters (H variant) or 412M parameters (SO400M variant), C-RADIOv4 achieves performance competitive with models having 10x more parameters, making it practical for deployment.

Flexibility

  • Works across a wide range of resolutions (128px to 1152px+)
  • Supports both global embeddings and spatial features
  • Can replace SAM3's vision encoder for segmentation tasks
  • ViTDet mode enables efficient high-resolution processing

Quality

The vision foundation model demonstrates:
  • Cleaner object boundaries (as shown in PCA visualizations)
  • Strong resolution scaling properties
  • Robust performance across diverse benchmarks
  • Reduced fixed-pattern noise compared to individual teachers

Foundation model distillation and the road ahead for vision models

C-RADIOv4 represents a significant advancement in agglomerative vision foundation models. By combining the strengths of SigLIP2, DINOv3, and SAM3 through innovative foundation model distillation techniques, it delivers a unified model that's both powerful and efficient.
The integration with FiftyOne makes it easy to leverage these capabilities for:
  • Dataset exploration and visualization
  • Similarity search and duplicate detection
  • Attention visualization and interpretability
  • Dataset curation and active learning
  • Quality control and anomaly detection
Whether you're working on autonomous vehicles, robotics, document processing, or any computer vision application, C-RADIOv4 provides a solid foundation for your workflows.

Resources

Citation

If you use C-RADIOv4 in your research, please cite the C-RADIOv4 paper.
Ready to get started? Install FiftyOne, register the C-RADIOv4 model source, and begin exploring your datasets with this powerful unified vision foundation model!
