Zero-shot auto-labeling rivals human performance
Jun 4, 2025
12 min read
One of the biggest bottlenecks in deploying visual AI and computer vision is annotation — the costly, time-consuming process of manually labeling images to train machine learning models. For years, AI hype reinforced the idea that more labels meant better models, enabling annotation providers like Scale AI to become billion-dollar giants.
That model no longer fits today’s AI pipelines.
Today, we’re introducing Verified Auto Labeling, a new approach to AI-assisted annotation that combines Voxel51’s expertise in data curation with automated labeling and QA workflows. Our research paper, Auto-Labeling Data for Object Detection, establishes new benchmarks showing that Verified Auto Labeling achieves up to 95% of human-level performance while cutting labeling costs by up to 100,000×.
Yes, you read that right—100,000×.

How far can zero-shot auto labeling take us in the quest for labeled datasets?

As foundation models become increasingly sophisticated, it's widely believed that auto-labeling will significantly reduce the need for human annotation. But how effective are today's zero-shot models in practice?
To find out, we benchmarked leading vision-language models (VLMs) as foundation models, including YOLOE, YOLO-World, and Grounding DINO, across four widely used datasets: Berkeley DeepDrive (BDD, autonomous driving), Microsoft Common Objects in Context (COCO), Large Vocabulary Instance Segmentation (LVIS, high complexity), and PASCAL Visual Object Classes (VOC, general imagery). These datasets span everything from basic object categories to challenging, long-tail distributions.
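To make the setup concrete, here is roughly what zero-shot prompting of one of these open-vocabulary detectors looks like. This is a minimal sketch using the Ultralytics YOLO-World interface; the weights file, class list, and image path are illustrative placeholders, not our exact benchmark configuration.

```python
# Minimal sketch: zero-shot detection with YOLO-World via Ultralytics.
# Weights, classes, and image path are illustrative, not the benchmark setup.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-worldv2.pt")  # open-vocabulary weights, downloaded on first use

# Prompt the detector with the target vocabulary -- no training, no seed labels
model.set_classes(["person", "car", "dog", "bicycle", "bottle"])

# Run zero-shot inference; each detection carries a box, class id, and confidence
results = model.predict("example.jpg", conf=0.25)
for box in results[0].boxes:
    print(int(box.cls), float(box.conf), box.xyxy.tolist())
```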
While AI-assisted annotation continues to improve, most auto-labeling methods still rely on small human-labeled seed sets (typically 1–10%). Intrigued by this limitation, we sought to evaluate how effectively auto-labeling systems could perform without any initial human labels. Zero. Zip. Zilch. Through this investigation, we also gained critical insights into setup and configuration that unlock auto-labeling performance.
We started by measuring F1 scores, which provide a direct assessment of auto-label quality by balancing precision (how accurate labels are) and recall (how many true objects are correctly labeled). High F1 scores indicate auto-labels closely approximate human annotation accuracy.
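For readers who want the metric spelled out, here is a simplified single-image version of the computation: predictions are greedily matched to ground-truth boxes by IoU, and F1 is the harmonic mean of the resulting precision and recall. The full benchmarks do this per class across the whole dataset; this sketch only captures the core idea.

```python
# Simplified sketch of precision/recall/F1 for auto-labels on one image (single class).
# Real benchmarks match predictions to ground truth per class across the full dataset.
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def f1_score(preds, gts, iou_thresh=0.5):
    matched, tp = set(), 0
    for p in preds:  # greedily match each prediction to an unused ground-truth box
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= iou_thresh:
                matched.add(i)
                tp += 1
                break
    precision = tp / max(len(preds), 1)  # how accurate the labels are
    recall = tp / max(len(gts), 1)       # how many true objects were labeled
    return 2 * precision * recall / max(precision + recall, 1e-9)
```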
On simpler datasets like VOC, YOLO-World achieved an impressive F1 score of 0.785, meaning it produced labels nearly as accurate as humans for straightforward object categories. However, performance decreased with dataset complexity: on COCO, the top models achieved approximately 0.640, and on the highly challenging LVIS dataset, scores dropped to 0.215, underscoring the model’s difficulty in accurately labeling rare classes.
For specialized or complex classes, we recommend a hybrid approach combining Verified Auto Labeling with targeted human annotation. Given the efficiency of Verified Auto Labeling, this hybrid method still delivers substantial cost savings.
We can also integrate proprietary models into Verified Auto Labeling to further improve accuracy on specialized datasets.

Verified Auto Labeling delivers up to 95% model performance on downstream inference

Evaluations based purely on metrics like precision and recall tell only part of the story. To measure the effectiveness of Verified Auto Labeling, we conducted a more practical test: we trained lightweight models (the kind you'd actually deploy on edge devices) directly from the auto labels, without using any pre-trained weights or human-labeled data.
This approach allowed us to see whether auto-labels alone could produce high-performing models in real-world scenarios. Using mean Average Precision (mAP), the standard metric for object detection accuracy, we found that models trained solely on auto-labels performed nearly as well as, and in some cases better than, models trained on traditional human labels.
On VOC, auto-labeled models achieved mAP50 scores of 0.768, closely matching the 0.817 achieved with human-labeled data. On COCO, auto-labeled models reached mAP50 of 0.538 compared to 0.588 for human-labeled counterparts, demonstrating competitive real-world performance.
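As a rough illustration of this experimental setup, the sketch below builds a small detector from a config file (so no pretrained weights), trains it on a hypothetical auto-labeled dataset, and reads off mAP50. The model choice, dataset config path, and training settings are assumptions for illustration, not the exact recipe from the paper.

```python
# Sketch: train a lightweight detector from scratch on auto-labels, then read mAP50.
# "autolabels.yaml" is a hypothetical dataset config pointing at images plus
# auto-generated YOLO-format label files; model and settings are illustrative.
from ultralytics import YOLO

model = YOLO("yolo11n.yaml")                 # build from config => no pretrained weights
model.train(data="autolabels.yaml", epochs=100, imgsz=640)

metrics = model.val(data="autolabels.yaml")  # evaluate on the held-out split
print(f"mAP50: {metrics.box.map50:.3f}")     # compare against the human-label baseline
```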
Interestingly, in certain cases, such as detecting rare classes in COCO or VOC, auto-label-trained models occasionally outperformed those trained on human labels. This may occur because foundation models, trained on massive datasets, can generalize better across diverse objects or label challenging edge cases more consistently. In contrast, human annotators might occasionally mislabel or overlook subtle object instances, particularly when working at scale, as illustrated by our donut example below.
Performance was notably weaker on highly complex or specialized datasets such as LVIS and BDD, which contain numerous nuanced or domain-specific classes. On LVIS, for instance, auto-label-trained models yielded very low mAP scores (less than 0.10), highlighting the significant challenges foundation models face when handling rare, ambiguous, or highly specialized object definitions. Foundation models often perform poorly in these scenarios because they're not specifically trained to distinguish extremely rare or specialized classes.
These results indicate that while auto-labels achieve about 90–95% of the performance of human labeling in many practical scenarios, careful consideration of dataset complexity and class definitions remains essential. For specialized or particularly challenging categories, teams should adopt hybrid annotation strategies, combining auto-labeling’s scalability with targeted human expertise. We’ll have more on that topic soon.
"With Verified Auto Labeling, teams can bootstrap an entire detection dataset with no human-provided seed labels and train edge-friendly detectors that nearly match fully human-supervised results. All at six orders of magnitude lower cost." –Dr. Jason Corso, Chief Science Officer at Voxel51

Verified Auto Labeling reduces annotation costs by 100,000×

While previous research qualitatively claimed auto-labeling reduces annotation costs, our study provides concrete figures:
  • Labeling 3.4 million objects on a single NVIDIA L40S GPU cost $1.18 and took just over an hour.
  • Manually labeling the same dataset via AWS SageMaker, one of the least expensive annotation options available, would cost roughly $124,092 and take nearly 7,000 hours.
Verified Auto Labeling is 100,000× cheaper and 5,000× faster than traditional annotation. These dramatic savings fundamentally alter the economics of bringing computer vision to production, freeing budget for quality assurance, edge-case analysis, and strategic dataset expansion.
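The arithmetic behind those headline multipliers is straightforward. The short sketch below reproduces it from the figures above, assuming "just over an hour" means roughly 1.2 GPU-hours.

```python
# Back-of-the-envelope arithmetic behind the headline numbers (cost and time
# figures taken from the bullets above; 1.2 GPU-hours is an assumption).
auto_cost, human_cost = 1.18, 124_092    # USD for ~3.4M objects
auto_hours, human_hours = 1.2, 7_000     # GPU-hours vs. human annotation hours

print(f"cost reduction: ~{human_cost / auto_cost:,.0f}x")    # ~105,163x -> "100,000x"
print(f"speedup:        ~{human_hours / auto_hours:,.0f}x")  # ~5,833x   -> "5,000x"
```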

Clean labels aren’t always better: how confidence thresholds impact model performance

Confidence thresholds determine how sure a model must be before a prediction is accepted as a label. In auto-labeling, each detected object receives a confidence score (0–1) reflecting how certain the model is about the detection, and practitioners typically set a threshold (e.g., 0.5) to filter out lower-confidence predictions.
Choosing the right threshold seems straightforward: higher confidence should mean cleaner labels and better downstream results.
Our benchmarks revealed a surprising insight, however: high-confidence thresholds (0.8–0.9), while producing cleaner-looking labels, consistently hurt downstream performance because they sacrifice recall. Optimal downstream performance (mAP) occurred at moderate confidence thresholds (0.2–0.5), which balance precision and recall effectively. And although the best result for every model and dataset we tested fell within this range, the exact optimal threshold varied across the study.
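In practice, this means the threshold is best treated as a hyperparameter tuned on downstream mAP rather than on how clean the labels look. The sketch below shows the shape of such a sweep; `autolabel`, `train_detector`, and `evaluate_map` are hypothetical stand-ins for your own pipeline.

```python
# Sketch of a confidence-threshold sweep. The helpers are hypothetical placeholders;
# the point is that the threshold is chosen by downstream mAP, not label "cleanliness".
def sweep_thresholds(images, thresholds=(0.1, 0.2, 0.3, 0.5, 0.8, 0.9)):
    detections = autolabel(images)  # run the zero-shot model once, keep all candidates
    results = {}
    for t in thresholds:
        labels = [d for d in detections if d.confidence >= t]  # higher t => cleaner but sparser labels
        model = train_detector(images, labels)                  # train the downstream detector
        results[t] = evaluate_map(model)                        # judge by downstream mAP
    best = max(results, key=results.get)
    return best, results
```

Because the best threshold varied across models and datasets in our study, it is worth re-running this kind of sweep for each new model and dataset pairing.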
Understanding this balance enables better tuning of auto-labeling pipelines, prioritizing overall model effectiveness over superficial label cleanliness.

Engineering tradeoffs at scale: why model selection matters in auto-labeling

Performance and practical usability vary significantly among foundation models, making careful selection critical for auto-labeling workflows. Our experiments revealed substantial trade-offs beyond just accuracy:
  • While YOLO-World labeled large-scale datasets in minutes (~3 min for VOC), Grounding DINO was significantly slower (~38 min for VOC) due to the computational cost of handling complex text prompts.
  • Models like Grounding DINO encountered memory limitations with datasets containing verbose class descriptions (e.g., LVIS), requiring specialized adaptations and increasing labeling time dramatically.
Understanding these real-world differences matters because model choice directly impacts operational efficiency, scalability, and overall deployment costs—critical considerations for any AI project.
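As one illustration of the kind of adaptation that large vocabularies force, a common mitigation for prompt-length and memory limits with text-prompted detectors is to split the class list into chunks and run one pass per chunk. The sketch below uses the Hugging Face transformers interface to Grounding DINO; the checkpoint, chunk size, and default thresholds are illustrative assumptions, and this is a general workaround rather than the exact adaptation used in our experiments.

```python
# Sketch: chunking a large vocabulary (e.g., LVIS classes) into prompt-sized pieces
# for a text-prompted detector. Checkpoint and chunk size are illustrative.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

def detect_large_vocab(image: Image.Image, class_names, chunk_size=40):
    """Run Grounding DINO over a large vocabulary in prompt-sized chunks."""
    per_chunk_results = []
    for i in range(0, len(class_names), chunk_size):
        chunk = class_names[i:i + chunk_size]
        # Grounding DINO expects lowercase, period-separated phrases: "cat. dog. ..."
        text = ". ".join(c.lower() for c in chunk) + "."
        inputs = processor(images=image, text=text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        per_chunk_results.append(
            processor.post_process_grounded_object_detection(
                outputs,
                input_ids=inputs.input_ids,
                target_sizes=[image.size[::-1]],  # PIL size is (W, H); targets are (H, W)
            )[0]  # default score thresholds; tune these in a real pipeline
        )
    # A real pipeline would merge and deduplicate detections across chunks
    return per_chunk_results
```

Chunking keeps each forward pass within memory limits, but it multiplies the number of passes per image, which is exactly why labeling time climbed so sharply on vocabulary-heavy datasets like LVIS.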

What is the best auto-labeling tool?

Not all auto-labeling tools are created equal. As our research demonstrates, achieving optimal results requires careful selection of foundation models, confidence thresholds, and other parameters tailored to your use case.
Developed by our world-class ML team, FiftyOne’s Verified Auto Labeling builds directly on this research, integrating automated labeling with streamlined QA workflows. Traditional auto-labeling solutions typically produce raw, noisy outputs that require extensive human cleanup. While initial costs may seem low, hidden costs quickly pile up from intensive manual QA, multiple review cycles, and costly re-annotation.
Verified Auto Labeling uses confidence scoring to automatically highlight labels that are most likely to need human attention, helping annotators efficiently prioritize their QA efforts. Annotators can still review any of the labels generated—even those that the system considers lower priority—providing flexibility to further refine labels. This approach streamlines the annotation workflow, reduces overall annotation costs, and enhances dataset quality and downstream model performance.
  • One‑click QA: Accept or reject auto‑labels instantly with RER‑backed confidence scoring, drastically cutting down unnecessary human oversight.
  • Intelligent ranking for human review: Direct human effort to the exact samples that most impact model quality.
  • Strategic data selection: Leverage FiftyOne’s powerful data curation capabilities to identify samples most likely to boost model performance — and reduce overall annotation volume and costs.
  • Difficulty scoring: Quantify labeling uncertainty to highlight exactly where human input adds the greatest value.
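To give a feel for the general idea, here is a generic sketch of confidence-driven triage. This is not the Verified Auto Labeling API itself; it simply illustrates routing labels in an uncertain confidence band to reviewers first, while the rest are auto-accepted or auto-rejected but remain available for optional review.

```python
# Generic sketch of confidence-driven QA triage (illustrative only; not the
# Verified Auto Labeling API). Band boundaries are arbitrary example values.
def triage(labels, accept_at=0.8, reject_at=0.2):
    review_queue, auto_accepted, auto_rejected = [], [], []
    for label in labels:
        if label["confidence"] >= accept_at:
            auto_accepted.append(label)      # likely correct; reviewable on demand
        elif label["confidence"] < reject_at:
            auto_rejected.append(label)      # likely spurious; reviewable on demand
        else:
            review_queue.append(label)       # uncertain; send to humans first
    # Surface the most ambiguous labels (confidence closest to 0.5) to reviewers first
    review_queue.sort(key=lambda l: abs(l["confidence"] - 0.5))
    return review_queue, auto_accepted, auto_rejected
```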
Verified Auto Labeling is currently in beta, rolling out to existing FiftyOne Enterprise customers.
Join our upcoming workshop on June 24 to learn more about Verified Auto Labeling. Alternatively, add yourself to our beta waitlist here.

Cite this post

To cite this post in your research:
Brent Griffin, Jacob Sela, Manushree Gangwar, Jason Corso. (June 4, 2025). Zero-shot auto-labeling rivals human performance. https://arxiv.org/abs/2506.02359
