Choosing the right embedding model can make or break your computer vision pipeline. While the AI community has produced dozens of powerful embedding architectures, practical guidance on which models perform best for specific tasks remains surprisingly sparse.
Today, we're sharing benchmark results from our comprehensive evaluation of embedding models for image classification across natural domain datasets. Our research compared the six most popular embedding approaches—including ResNet, CLIP, and DINOv2 variants—across three diverse datasets totaling over 6 million images and 10,000+ classes.
The standout finding? DINOv2-ViT-B14 consistently outperformed all other embeddings, achieving up to 93% classification accuracy while requiring no fine-tuning.
The embedding model challenge
Foundation models have revolutionized computer vision, but selecting the optimal embedding for your classification task involves critical tradeoffs. Larger models like ResNet-101 offer richer representations but require more compute. Vision-language models like CLIP promise better generalization but may sacrifice task-specific performance. Self-supervised approaches like DINOv2 claim superior feature learning, but do they deliver in practice?
To answer these questions definitively, we designed a rigorous benchmarking study.
Our benchmarking methodology
We evaluated six embedding models across three diverse natural image datasets. In computer vision, "natural domain" refers to real-world photographs, which are representative of the datasets that many of our customers work with. The datasets were:
- iNaturalist-2021: 3.8M samples, 10,000 species classes (highly challenging)
- Places: 2.2M samples, 365 scene categories (moderate complexity)
- Food-101: 101K samples, 101 food categories (balanced, smaller scale)
Embedding models tested:
The six models were drawn from three families: supervised ResNets (including ResNet-18 and ResNet-101), vision-language CLIP models (including CLIP-ViT-Base32), and self-supervised DINOv2 models (including DINOv2-ViT-B14). The results below report these four representative variants.
Evaluation methodology:
We extracted embeddings for each dataset, then trained simple classifiers on top of these frozen representations, with no fine-tuning of the pretrained backbones (a linear-probe setup; see the sketch after the list below). This approach provides a fair comparison of the representational quality of each embedding space: differences in downstream performance reflect how effectively each embedding captures the underlying structure and patterns in the data.

We measured performance using three complementary approaches:
- Classification metrics (accuracy, precision, recall, F1) on validation sets
- UMAP visualization analysis with clustering quality metrics (V-measure, Davies-Bouldin scores)
- RER (Reconstruction Error Ratio) mistakenness scoring to identify problematic samples
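To make the linear-probe setup concrete, here is a minimal sketch of the frozen-embedding evaluation. It is not the exact code behind our benchmarks: the DINOv2 backbone is loaded from torch.hub, the torchvision Food101 wrapper and logistic-regression probe stand in for our internal pipeline, and the hyperparameters are illustrative.

```python
# Minimal linear-probe sketch: extract frozen DINOv2 embeddings, then train a
# simple classifier on top. Dataset wrapper and hyperparameters are illustrative.
import numpy as np
import torch
import torchvision.transforms as T
from torchvision.datasets import Food101
from torch.utils.data import DataLoader
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen DINOv2-ViT-B/14 backbone from torch.hub (no fine-tuning)
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to(device).eval()

transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_embeddings(split):
    ds = Food101(root="./data", split=split, transform=transform, download=True)
    loader = DataLoader(ds, batch_size=64, num_workers=4)
    feats, labels = [], []
    with torch.no_grad():
        for imgs, ys in loader:
            feats.append(backbone(imgs.to(device)).cpu().numpy())  # (B, 768) features
            labels.append(ys.numpy())
    return np.concatenate(feats), np.concatenate(labels)

X_train, y_train = extract_embeddings("train")
X_test, y_test = extract_embeddings("test")

# Simple classifier on frozen representations (the linear probe)
clf = LogisticRegression(max_iter=2000).fit(X_train, y_train)
preds = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print(classification_report(y_test, preds, digits=3))
```

Swapping the backbone (e.g., a ResNet or CLIP image encoder) while keeping the probe fixed is what makes the comparison about the embeddings rather than the classifier.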
DINOv2 dominates across the board
The results were striking. DINOv2-ViT-B14 emerged as the clear winner across all datasets and evaluation methods.
Performance on Food-101
On the balanced Food-101 dataset, DINOv2 achieved 93% accuracy—a full 28 percentage points higher than ResNet-18 (65%) and 5 points better than CLIP (88%).
Performance on Places
For scene classification with 365 categories, DINOv2 reached 53% accuracy, outperforming CLIP (51%), ResNet-101 (48%), and ResNet-18 (41%). While absolute accuracy was lower due to the complexity of distinguishing similar scene types, DINOv2's relative advantage remained consistent.
Performance on iNaturalist-2021
The most challenging dataset—10,000 fine-grained species classes—revealed DINOv2's true strength. It achieved 70% accuracy compared to just 15% for CLIP and 12-14% for ResNet variants. This nearly 5× performance advantage demonstrates DINOv2's superior ability to learn discriminative features for visually similar categories.
Why DINOv2 wins: better learned representations
DINOv2's self-supervised training approach produces embeddings that inherently organize visual information more effectively for classification tasks.
UMAP dimensionality reduction revealed why DINOv2 excels. Across all datasets, DINOv2 embeddings showed:
- Tighter intra-class clustering - Images belonging to the same category are encoded to nearby points in the embedding space (lowest mean centroid distance)
- Greater inter-class separation - Different categories occupy distinct regions of the embedding space with minimal overlap (highest V-measure scores, lowest Davies-Bouldin scores)
- More discriminative feature boundaries - The decision boundaries between classes are clearer and more consistent
On iNaturalist-2021, DINOv2 achieved a V-measure score of 0.908 versus 0.719 for CLIP and 0.708 for ResNet-18. This metric quantifies how well the embedding space reflects the true class structure - higher scores indicate that the learned representations naturally preserve categorical relationships.
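For readers who want to run this kind of embedding-space analysis themselves, here is a minimal sketch assuming the umap-learn and scikit-learn packages and precomputed embedding/label arrays; the exact clustering setup in our study may differ.

```python
# Minimal sketch of embedding-space quality analysis: project embeddings with
# UMAP, cluster the projection, and score how well clusters match the labels.
import numpy as np
import umap  # pip install umap-learn
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score, davies_bouldin_score

def embedding_space_metrics(embeddings: np.ndarray, labels: np.ndarray, seed: int = 51):
    # 2D UMAP projection of the high-dimensional embeddings
    proj = umap.UMAP(n_components=2, random_state=seed).fit_transform(embeddings)

    # Cluster the projection with as many clusters as there are classes
    n_classes = len(np.unique(labels))
    cluster_ids = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit_predict(proj)

    return {
        # Higher is better: clusters align with the ground-truth classes
        "v_measure": v_measure_score(labels, cluster_ids),
        # Lower is better: tight, well-separated clusters
        "davies_bouldin": davies_bouldin_score(proj, cluster_ids),
        # Mean distance of points to their own class centroid (intra-class tightness)
        "mean_centroid_dist": float(np.mean([
            np.linalg.norm(proj[labels == c] - proj[labels == c].mean(axis=0), axis=1).mean()
            for c in np.unique(labels)
        ])),
    }
```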
The practical implication: DINOv2's training process - which learns visual representations by finding similarities and differences across massive unlabeled image datasets - produces embeddings where semantically similar images are already positioned close together in high-dimensional space. This makes downstream classification easier because the hard work of learning discriminative features has already been done by the embedding model itself.
Mistakenness analysis: finding difficult samples
To further understand how different embeddings impact classification performance, we used class-wise autoencoders to compute the Reconstruction Error Ratio (RER). A class-wise autoencoder (AE) is trained on the embeddings of a single class, so it reconstructs samples from its own class with lower error than samples from other classes. For a dataset with ground-truth annotations, RER is the ratio of the reconstruction error from the ground-truth class's autoencoder to the minimum reconstruction error across the autoencoders of all other classes. The expectation is that more performant embeddings yield more robust class-wise autoencoders, and therefore fewer samples with an RER score greater than 1.
Using RER-based mistakenness scoring, where

RER = reconstruction_error(gt_label) / min_{label ≠ gt_label} reconstruction_error(label)

we identified samples whose reconstruction error ratios suggested potential misclassification. The autoencoders trained on DINOv2 embeddings produced the fewest samples with RER > 1, indicating that the ground-truth labels are indeed correct for the majority of samples.
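As a rough illustration of the computation, here is a minimal sketch that uses per-class PCA as a simple linear-autoencoder stand-in; the autoencoders in our study are learned models, and the component count here is an arbitrary choice.

```python
# Minimal RER sketch: fit one "autoencoder" (here, PCA) per class on that
# class's embeddings, then score each sample by the ratio of its ground-truth
# reconstruction error to the best reconstruction error among other classes.
import numpy as np
from sklearn.decomposition import PCA

def fit_class_autoencoders(embeddings, labels, n_components=32):
    # Assumes every class has more than n_components samples
    return {
        c: PCA(n_components=n_components).fit(embeddings[labels == c])
        for c in np.unique(labels)
    }

def rer_scores(embeddings, labels, autoencoders):
    scores = []
    for x, y in zip(embeddings, labels):
        errs = {}
        for c, ae in autoencoders.items():
            recon = ae.inverse_transform(ae.transform(x[None, :]))[0]
            errs[c] = np.linalg.norm(x - recon)
        gt_err = errs[y]
        best_other = min(e for c, e in errs.items() if c != y)
        scores.append(gt_err / best_other)  # RER > 1 flags potentially mislabeled/hard samples
    return np.array(scores)
```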
Food-101 validation set:
- ResNet-18: 13,481 samples (53%) with mistakenness > 1.0
- CLIP: 5,623 samples (22%) with mistakenness > 1.0
- DINOv2: 3,555 samples (14%) with mistakenness > 1.0
Engineering implications: model selection matters
Beyond raw accuracy, our experiments revealed practical considerations:
- Computational efficiency: ResNet-18 embeddings are smallest (512-dim) and fastest to compute, making them suitable for resource-constrained environments despite lower accuracy.
- Balanced performance: CLIP offers a middle ground—better than ResNets, more efficient than DINOv2—for applications where 88% accuracy suffices.
- Maximum accuracy: When performance is paramount and compute allows, DINOv2-ViT-B14 is the clear choice.
Recommendations for practitioners
Based on our benchmarking, we recommend:
- For fine-grained classification tasks (species identification, product categorization, medical imaging): Use DINOv2-ViT-B14. Its superior ability to distinguish visually similar classes justifies the computational cost.
- For general-purpose classification (scene recognition, object categorization): DINOv2 remains optimal, but CLIP-ViT-Base32 offers acceptable performance with lower compute requirements.
- For resource-constrained edge deployments: Consider ResNet-18 as a baseline. While accuracy suffers, the smaller embedding size and faster inference may be necessary tradeoffs.
- For complex, multi-modal tasks: CLIP's vision-language training provides flexibility for zero-shot classification and text-guided retrieval that pure vision models lack.
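As an illustration of that zero-shot flexibility, here is a minimal sketch using the Hugging Face transformers CLIP implementation; the checkpoint name, image path, and prompts are illustrative and not part of our benchmark.

```python
# Minimal CLIP zero-shot classification sketch: score an image against
# free-text class prompts without any training.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"  # assumed checkpoint name
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder path
class_names = ["pizza", "sushi", "ramen", "tacos"]  # illustrative labels
prompts = [f"a photo of {c}" for c in class_names]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity logits, softmaxed into per-class probabilities
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for name, p in zip(class_names, probs.tolist()):
    print(f"{name}: {p:.3f}")
```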
What's next: expanding our evaluation
Our current research focused on natural domain images with clean labels. We're expanding this work to:
- Domain-specific datasets (medical, satellite, industrial inspection)
- Noisy label scenarios where embedding robustness becomes critical
- Few-shot learning to evaluate sample efficiency
- Embedding fine-tuning to measure adaptation potential
We're also integrating these findings into FiftyOne's embedding visualization and model evaluation tools, making it easier for teams to benchmark embeddings on their own datasets.
Try it yourself
All embedding models evaluated in this research are available in FiftyOne's Model Zoo. You can reproduce our benchmarks or test these embeddings on your own datasets using our open-source framework.
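As a starting point, here is a minimal sketch using the FiftyOne Model Zoo and FiftyOne Brain. The quickstart dataset and zoo model name are illustrative; substitute your own dataset and preferred embedding model (availability may vary by FiftyOne version).

```python
# Minimal sketch: compute embeddings with a FiftyOne Zoo model and visualize
# them with UMAP via the FiftyOne Brain (requires umap-learn installed).
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

# Any labeled image dataset works; quickstart is a small built-in example
dataset = foz.load_zoo_dataset("quickstart")

# Load an embedding model from the Model Zoo (e.g., a CLIP variant)
model = foz.load_zoo_model("clip-vit-base32-torch")

# Compute embeddings for every sample
embeddings = dataset.compute_embeddings(model)

# UMAP projection of the embedding space, explorable in the FiftyOne App
fob.compute_visualization(
    dataset,
    embeddings=embeddings,
    method="umap",
    brain_key="embeddings_umap",
)

session = fo.launch_app(dataset)
```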
"Choosing the right embedding can improve classification accuracy by 5×—but only if you benchmark systematically. Don't assume the most popular model is best for your task." —Dr. Jason Corso, Chief Science Officer at Voxel51
Ready to find the optimal embedding for your computer vision pipeline? Start benchmarking with FiftyOne today.