Choosing the right embedding model can make or break your computer vision pipeline. While the AI community has produced dozens of powerful embedding architectures, practical guidance on which models perform best for specific tasks remains surprisingly sparse.
Today, we're sharing benchmark results from our comprehensive evaluation of embedding models for image classification across natural domain datasets. Our research compared the six most popular embedding approaches—including ResNet, CLIP, and DINOv2 variants—across three diverse datasets totaling over 6 million images and 10,000+ classes.
The standout finding? DINOv2-ViT-B14 consistently outperformed all other embeddings, achieving up to 93% classification accuracy while requiring no fine-tuning.
The embedding model challenge
Foundation models have revolutionized computer vision, but selecting the optimal embedding for your classification task involves critical tradeoffs. Larger models like ResNet-101 offer richer representations but require more compute. Vision-language models like CLIP promise better generalization but may sacrifice task-specific performance. Self-supervised approaches like DINOv2 claim superior feature learning, but do they deliver in practice?
To answer these questions definitively, we designed a rigorous benchmarking study.
Our benchmarking methodology
We evaluated six embedding models across three diverse natural image datasets. In computer vision, "natural domain" refers to real-world photographs, which are representative of the datasets that many of our customers work with. The datasets were:
- iNaturalist-2021: 3.8M samples, 10,000 species classes (highly challenging)
- Places: 2.2M samples, 365 scene categories (moderate complexity)
- Food-101: 101K samples, 101 food categories (balanced, smaller scale)
Embedding models tested:
The six models were drawn from three families: supervised ResNets (including ResNet-18 and ResNet-101), vision-language CLIP models (including CLIP-ViT-Base32), and self-supervised DINOv2 models (including DINOv2-ViT-B14). The results below report these four representative variants.
Evaluation methodology:
We extracted embeddings for each dataset, then trained simple classifiers on top of these frozen representations, with no fine-tuning of the pretrained backbones (a linear-probe setup; see the sketch after the list below). This approach provides a fair comparison of the representational quality of each embedding space: differences in downstream performance reflect how effectively each embedding captures the underlying structure and patterns in the data.

We measured performance using three complementary approaches:
- Classification metrics (accuracy, precision, recall, F1) on validation sets
- UMAP visualization analysis with clustering quality metrics (V-measure, Davies-Bouldin scores)
- RER (Reconstruction Error Ratio) mistakenness scoring to identify problematic samples
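To make the linear-probe setup concrete, here is a minimal sketch of the frozen-embedding evaluation. It is not the exact code behind our benchmarks: the DINOv2 backbone is loaded from torch.hub, the torchvision Food101 wrapper and logistic-regression probe stand in for our internal pipeline, and the hyperparameters are illustrative.

```python
# Minimal linear-probe sketch: extract frozen DINOv2 embeddings, then train a
# simple classifier on top. Dataset wrapper and hyperparameters are illustrative.
import numpy as np
import torch
import torchvision.transforms as T
from torchvision.datasets import Food101
from torch.utils.data import DataLoader
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen DINOv2-ViT-B/14 backbone from torch.hub (no fine-tuning)
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").to(device).eval()

transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_embeddings(split):
    ds = Food101(root="./data", split=split, transform=transform, download=True)
    loader = DataLoader(ds, batch_size=64, num_workers=4)
    feats, labels = [], []
    with torch.no_grad():
        for imgs, ys in loader:
            feats.append(backbone(imgs.to(device)).cpu().numpy())  # (B, 768) features
            labels.append(ys.numpy())
    return np.concatenate(feats), np.concatenate(labels)

X_train, y_train = extract_embeddings("train")
X_test, y_test = extract_embeddings("test")

# Simple classifier on frozen representations (the linear probe)
clf = LogisticRegression(max_iter=2000).fit(X_train, y_train)
preds = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print(classification_report(y_test, preds, digits=3))
```

Swapping the backbone (e.g., a ResNet or CLIP image encoder) while keeping the probe fixed is what makes the comparison about the embeddings rather than the classifier.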
DINOv2 dominates across the board
The results were striking. DINOv2-ViT-B14 emerged as the clear winner across all datasets and evaluation methods.
Performance on Food-101
On the balanced Food-101 dataset, DINOv2 achieved 93% accuracy—a full 28 percentage points higher than ResNet-18 (65%) and 5 points better than CLIP (88%).
Performance on Places
For scene classification with 365 categories, DINOv2 reached 53% accuracy, outperforming CLIP (51%), ResNet-101 (48%), and ResNet-18 (41%). While absolute accuracy was lower due to the complexity of distinguishing similar scene types, DINOv2's relative advantage remained consistent.
Performance on iNaturalist-2021
The most challenging dataset—10,000 fine-grained species classes—revealed DINOv2's true strength. It achieved 70% accuracy compared to just 15% for CLIP and 12-14% for ResNet variants. This nearly 5× performance advantage demonstrates DINOv2's superior ability to learn discriminative features for visually similar categories.
Why DINOv2 wins: better learned representations
DINOv2's self-supervised training approach produces embeddings that inherently organize visual information more effectively for classification tasks.
UMAP dimensionality reduction revealed why DINOv2 excels. Across all datasets, DINOv2 embeddings showed:
- Tighter intra-class clustering - Images belonging to the same category are encoded to nearby points in the embedding space (lowest mean centroid distance)
- Greater inter-class separation - Different categories occupy distinct regions of the embedding space with minimal overlap (highest V-measure scores, lowest Davies-Bouldin scores)
- More discriminative feature boundaries - The decision boundaries between classes are clearer and more consistent
On iNaturalist-2021, DINOv2 achieved a V-measure score of 0.908 versus 0.719 for CLIP and 0.708 for ResNet-18. This metric quantifies how well the embedding space reflects the true class structure - higher scores indicate that the learned representations naturally preserve categorical relationships.
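For readers who want to run this kind of embedding-space analysis themselves, here is a minimal sketch assuming the umap-learn and scikit-learn packages and precomputed embedding/label arrays; the exact clustering setup in our study may differ.

```python
# Minimal sketch of embedding-space quality analysis: project embeddings with
# UMAP, cluster the projection, and score how well clusters match the labels.
import numpy as np
import umap  # pip install umap-learn
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score, davies_bouldin_score

def embedding_space_metrics(embeddings: np.ndarray, labels: np.ndarray, seed: int = 51):
    # 2D UMAP projection of the high-dimensional embeddings
    proj = umap.UMAP(n_components=2, random_state=seed).fit_transform(embeddings)

    # Cluster the projection with as many clusters as there are classes
    n_classes = len(np.unique(labels))
    cluster_ids = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit_predict(proj)

    return {
        # Higher is better: clusters align with the ground-truth classes
        "v_measure": v_measure_score(labels, cluster_ids),
        # Lower is better: tight, well-separated clusters
        "davies_bouldin": davies_bouldin_score(proj, cluster_ids),
        # Mean distance of points to their own class centroid (intra-class tightness)
        "mean_centroid_dist": float(np.mean([
            np.linalg.norm(proj[labels == c] - proj[labels == c].mean(axis=0), axis=1).mean()
            for c in np.unique(labels)
        ])),
    }
```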
The practical implication: DINOv2's training process - which learns visual representations by finding similarities and differences across massive unlabeled image datasets - produces embeddings where semantically similar images are already positioned close together in high-dimensional space. This makes downstream classification easier because the hard work of learning discriminative features has already been done by the embedding model itself.
Mistakenness analysis: finding difficult samples
To further understand how different embeddings impact classification performance, we used class-wise autoencoders to compute the Reconstruction Error Ratio (RER). A class-wise autoencoder (AE) is trained on the embeddings of a single class, so it reconstructs samples from its own class with lower error than samples from other classes. For a dataset with ground-truth annotations, RER is the ratio of the reconstruction error from the ground-truth class's autoencoder to the minimum reconstruction error across the autoencoders of all other classes. The expectation is that more performant embeddings yield more robust class-wise autoencoders, and therefore fewer samples with an RER score greater than 1.
Using RER-based mistakenness scoring, where

RER = reconstruction_error(gt_label) / min_{label ≠ gt_label} reconstruction_error(label)

we identified samples whose reconstruction error ratios suggested potential misclassification. The autoencoders trained on DINOv2 embeddings produced the fewest samples with RER > 1, indicating that the ground-truth labels are indeed correct for the majority of samples.
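As a rough illustration of the computation, here is a minimal sketch that uses per-class PCA as a simple linear-autoencoder stand-in; the autoencoders in our study are learned models, and the component count here is an arbitrary choice.

```python
# Minimal RER sketch: fit one "autoencoder" (here, PCA) per class on that
# class's embeddings, then score each sample by the ratio of its ground-truth
# reconstruction error to the best reconstruction error among other classes.
import numpy as np
from sklearn.decomposition import PCA

def fit_class_autoencoders(embeddings, labels, n_components=32):
    # Assumes every class has more than n_components samples
    return {
        c: PCA(n_components=n_components).fit(embeddings[labels == c])
        for c in np.unique(labels)
    }

def rer_scores(embeddings, labels, autoencoders):
    scores = []
    for x, y in zip(embeddings, labels):
        errs = {}
        for c, ae in autoencoders.items():
            recon = ae.inverse_transform(ae.transform(x[None, :]))[0]
            errs[c] = np.linalg.norm(x - recon)
        gt_err = errs[y]
        best_other = min(e for c, e in errs.items() if c != y)
        scores.append(gt_err / best_other)  # RER > 1 flags potentially mislabeled/hard samples
    return np.array(scores)
```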
Food-101 validation set:
- ResNet-18: 13,481 samples (53%) with mistakenness > 1.0
- CLIP: 5,623 samples (22%) with mistakenness > 1.0
- DINOv2: 3,555 samples (14%) with mistakenness > 1.0
Engineering implications: model selection matters
Beyond raw accuracy, our experiments revealed practical considerations:
- Computational efficiency: ResNet-18 embeddings are smallest (512-dim) and fastest to compute, making them suitable for resource-constrained environments despite lower accuracy.
- Balanced performance: CLIP offers a middle ground—better than ResNets, more efficient than DINOv2—for applications where 88% accuracy suffices.
- Maximum accuracy: When performance is paramount and compute allows, DINOv2-ViT-B14 is the clear choice.
Recommendations for practitioners
Based on our benchmarking, we recommend:
- For fine-grained classification tasks (species identification, product categorization, medical imaging): Use DINOv2-ViT-B14. Its superior ability to distinguish visually similar classes justifies the computational cost.
- For general-purpose classification (scene recognition, object categorization): DINOv2 remains optimal, but CLIP-ViT-Base32 offers acceptable performance with lower compute requirements.
- For resource-constrained edge deployments: Consider ResNet-18 as a baseline. While accuracy suffers, the smaller embedding size and faster inference may be necessary tradeoffs.
- For complex, multi-modal tasks: CLIP's vision-language training provides flexibility for zero-shot classification and text-guided retrieval that pure vision models lack.
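As an illustration of that zero-shot flexibility, here is a minimal sketch using the Hugging Face transformers CLIP implementation; the checkpoint name, image path, and prompts are illustrative and not part of our benchmark.

```python
# Minimal CLIP zero-shot classification sketch: score an image against
# free-text class prompts without any training.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"  # assumed checkpoint name
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder path
class_names = ["pizza", "sushi", "ramen", "tacos"]  # illustrative labels
prompts = [f"a photo of {c}" for c in class_names]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity logits, softmaxed into per-class probabilities
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for name, p in zip(class_names, probs.tolist()):
    print(f"{name}: {p:.3f}")
```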
What's next: expanding our evaluation
Our current research focused on natural domain images with clean labels. We're expanding this work to:
- Domain-specific datasets (medical, satellite, industrial inspection)
- Noisy label scenarios where embedding robustness becomes critical
- Few-shot learning to evaluate sample efficiency
- Embedding fine-tuning to measure adaptation potential
We're also integrating these findings into FiftyOne's embedding visualization and model evaluation tools, making it easier for teams to benchmark embeddings on their own datasets.
Try it yourself
All embedding models evaluated in this research are available in FiftyOne's Model Zoo. You can reproduce our benchmarks or test these embeddings on your own datasets using our open-source framework.
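As a starting point, here is a minimal sketch using the FiftyOne Model Zoo and FiftyOne Brain. The quickstart dataset and zoo model name are illustrative; substitute your own dataset and preferred embedding model (availability may vary by FiftyOne version).

```python
# Minimal sketch: compute embeddings with a FiftyOne Zoo model and visualize
# them with UMAP via the FiftyOne Brain (requires umap-learn installed).
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

# Any labeled image dataset works; quickstart is a small built-in example
dataset = foz.load_zoo_dataset("quickstart")

# Load an embedding model from the Model Zoo (e.g., a CLIP variant)
model = foz.load_zoo_model("clip-vit-base32-torch")

# Compute embeddings for every sample
embeddings = dataset.compute_embeddings(model)

# UMAP projection of the embedding space, explorable in the FiftyOne App
fob.compute_visualization(
    dataset,
    embeddings=embeddings,
    method="umap",
    brain_key="embeddings_umap",
)

session = fo.launch_app(dataset)
```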
"Choosing the right embedding can improve classification accuracy by 5×—but only if you benchmark systematically. Don't assume the most popular model is best for your task." —Dr. Jason Corso, Chief Science Officer at Voxel51
Ready to find the optimal embedding for your computer vision pipeline? Start benchmarking with FiftyOne today.