The Ty Cobb Problem in Your Training Data

Every annotation manager feels the pressure to drive down the unit cost of labeling, and tracking throughput feels like monitoring progress.

But driving down the unit cost of annotation creates an illusion of efficiency while masking degraded model performance. When you pay vendors strictly for high throughput, they cut corners. They gloss over edge cases. Your data quietly degrades. You can drive your cost per label to zero and still ship a broken model, because annotation is fundamentally a quality problem. Throughput was never what was broken.

This structural refusal to prioritize data integrity mirrors a famous accounting error in baseball history. For decades, official record books listed Ty Cobb's career mark at 4,191 hits as a settled, historical fact.

Years later, statisticians auditing the original 1910 logs discovered a clerical duplicate: a single afternoon had been counted twice, inflating his record by two phantom hits. Major League Baseball knew the data was wrong, but refused to correct it, ruling that the records were simply too old to revisit.

That is the exact condition of most machine learning training data in production today: wrong numbers, known to be wrong, left in the pipeline because fixing them is unglamorous. The day you choose to optimize strictly for a cost-per-label metric is the exact day your models start inheriting these phantom hits.

To build models that survive production, teams must stop counting raw labels per hour and start measuring how efficiently their curation spend improves model accuracy.

Key Takeaways

Cost per label is a vanity metric. Driving down the unit cost of annotation creates an illusion of progress while masking degraded model performance.
Annotation is a quality problem. Fixing underperforming models requires shifting focus to two things: selecting the right data to label, and ensuring strict label correctness.
Optimize for model improvement per dollar. Teams must stop counting raw labels per hour and start measuring how efficiently their curation spend improves model accuracy.

The Cost of Cheap Labels

This misalignment breaks production pipelines.

We surveyed 709 ML practitioners for the 2026 State of Visual and Physical AI report. The baseline finding is a wake-up call: 100 percent of teams reported shipping underperforming models.

When asked why those models failed in the real world, practitioners pointed squarely at data quality:

57% lacked training data for specific edge cases.
48% cited data artifacts and occlusion.
32% blamed outright annotation errors.
Only 19% blamed model architecture.

That 19 percent figure is the tell. We spend endless cycles agonizing over parameter counts and model selection, yet fewer than one in five teams traced their failures to architecture. And no one on that list failed because they labeled too slowly.

Teams fail because optimizing for cheap throughput completely ignores what data they choose to label, and how correct those labels actually are.

The Label Error Epidemic

Most teams assume their labels are mostly correct. The data says otherwise.

In a NeurIPS 2021 study, researchers audited the test sets of ten heavily used machine learning datasets. The findings are sobering. They estimated an average label error rate of at least 3.3%. The ImageNet validation set alone carried roughly 6% errors. Across its 50,000 images, that translates to nearly 3,000 wrong labels hiding inside the historical gold standard of computer vision.

Let's be honest. Whatever your team labeled last quarter under a tight deadline is not cleaner than ImageNet.
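The NeurIPS study found its candidate errors with confident learning and then had humans verify them. Here is a heavily simplified sketch of that idea, not the authors' implementation. It assumes you already have out-of-sample predicted probabilities for each example, and the function name and toy numbers are hypothetical.

```python
def flag_suspect_labels(pred_probs, labels):
    """Return indices of examples whose annotated label looks doubtful.

    pred_probs: list of per-class probability lists, one per example,
                ideally predicted by a model that did not train on that example.
    labels:     list of integer class ids as annotated.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the model's average confidence in class k
    # over the examples that annotators labeled as class k.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        # A class with no annotated examples can never be confidently predicted.
        thresholds.append(sum(own) / len(own) if own else float("inf"))

    suspects = []
    for i, (probs, given) in enumerate(zip(pred_probs, labels)):
        # Classes the model predicts at or above their own threshold.
        confident = [k for k in range(n_classes) if probs[k] >= thresholds[k]]
        if not confident:
            continue  # the model is unsure, so there is no evidence either way
        best = max(confident, key=lambda k: probs[k])
        if best != given:
            suspects.append(i)  # model confidently disagrees with the annotator
    return suspects


# Toy data: five examples, two classes. Example 4 is labeled 0,
# but the model is 95% sure it is class 1.
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.05, 0.95]]
labels = [0, 0, 1, 1, 0]
suspects = flag_suspect_labels(pred_probs, labels)
print(suspects)  # -> [4]
```

Treat the output as a review queue for a human, not as an automatic relabeling step. The model can be confidently wrong too.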

These hidden errors fundamentally break how we evaluate models. Look at what happened at the ECCV 2024 conference. Researchers went back and corrected the ground truth labels in the widely used COCO object detection benchmark. When they reran the evaluations, the leaderboard rankings actually shifted.

Bad labels can make a worse model rank above a better one. A higher-capacity model can faithfully learn the systematic errors in your data and get rewarded for reproducing them on a test set that shares those errors, while a model that ignores them gets penalized.

A wrong label in your test set is a wrong answer to the only question you are asking.
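To see how a handful of wrong test labels can reorder a leaderboard, here is a toy sketch. The ten-image test set, the two models, and their predictions are all made up for illustration; none of these numbers come from the COCO study.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical corrected ground truth for a ten-image test set.
corrected = ["cat", "dog", "cat", "dog", "cat", "dog", "cat", "dog", "cat", "dog"]

# The original labels contain two annotation errors, at positions 2 and 7.
original = list(corrected)
original[2] = "dog"
original[7] = "cat"

# Model A gets the two mislabeled images right and makes one genuine mistake.
model_a = list(corrected)
model_a[0] = "dog"

# Model B has learned to reproduce the annotation errors exactly.
model_b = list(original)

for name, preds in [("A", model_a), ("B", model_b)]:
    print(name, accuracy(preds, original), accuracy(preds, corrected))
# Scored on the original labels:  A = 0.7, B = 1.0 (B looks better)
# Scored on the corrected labels: A = 0.9, B = 0.8 (A is better)
```

Two bad labels out of ten were enough to flip the ranking. Real benchmarks are larger and the effect is smaller, but the mechanism is the same.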

The Data Curation Shift: Why "Right" Beats "More"

You cannot fix this by simply labeling more data. The thousandth photo of a car in clear daylight teaches an object detector almost nothing new. Labeling everything at the same priority wastes your budget confirming what the model already knows.

True value lives in the rare, hard, failure-causing samples. You need the night scene, the strange pose, and the unexpected edge case.
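One simple way to find those samples is to score each unlabeled image by how far its embedding sits from everything you have already labeled, then spend the budget on the farthest ones. The sketch below is a minimal illustration of that heuristic, not a production selection strategy. The two-dimensional embeddings and sample names are hypothetical stand-ins for vectors you would get from a vision model.

```python
import math


def nearest_distance(candidate, labeled_embeddings):
    """Distance from a candidate embedding to its closest labeled neighbor."""
    return min(math.dist(candidate, ref) for ref in labeled_embeddings)


def pick_most_novel(unlabeled, labeled_embeddings, budget):
    """Return the names of the `budget` unlabeled samples farthest from labeled data."""
    ranked = sorted(
        unlabeled.items(),
        key=lambda item: nearest_distance(item[1], labeled_embeddings),
        reverse=True,  # most novel first
    )
    return [name for name, _ in ranked[:budget]]


# Hypothetical embeddings: the labeled set is all daylight cars.
labeled_embeddings = [(0.0, 0.0), (0.1, 0.0)]
unlabeled = {
    "daylight_car_1001": (0.05, 0.0),  # nearly identical to labeled data
    "night_scene": (3.0, 4.0),         # far from anything labeled
    "odd_pose": (1.0, 1.0),            # moderately unusual
}

to_label = pick_most_novel(unlabeled, labeled_embeddings, budget=2)
print(to_label)  # -> ['night_scene', 'odd_pose']
```

The thousandth daylight car scores near zero and never reaches the annotation queue. Distance alone will also surface corrupt or irrelevant images, so a quick human pass over the top of the ranking is still worth the time.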

This waste is measurable. In our 2026 report, 36% of teams admitted that less than half of their annotated data ever reaches a production training set. Take physical AI as an example. A sensor-fused 3D cuboid (a label aligning camera and lidar data) might cost $1.50 to create. At production volumes of several hundred thousand cuboids, throwing away half of those annotations burns hundreds of thousands of dollars.
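The arithmetic is worth running against your own numbers. In this sketch the $1.50 unit cost comes from the example above, while the annotation volume and the exact unused fraction are assumptions chosen for illustration.

```python
def wasted_spend(n_labels, cost_per_label, unused_fraction):
    """Dollars spent on labels that never reach a production training set."""
    return n_labels * cost_per_label * unused_fraction


COST_PER_CUBOID = 1.50       # from the example above
ANNOTATED_CUBOIDS = 500_000  # assumed project volume, not a survey figure
UNUSED_FRACTION = 0.5        # assumed: half the annotations go unused

wasted = wasted_spend(ANNOTATED_CUBOIDS, COST_PER_CUBOID, UNUSED_FRACTION)
print(f"${wasted:,.0f}")  # -> $375,000
```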

Peer-reviewed research shows that curation works. Look at the DataComp study presented at NeurIPS 2023. Researchers held the model architecture and training compute fixed and changed only how the dataset was curated. They used that data to train a Contrastive Language-Image Pretraining (CLIP) model that beat OpenAI's version by 3.7 points of zero-shot accuracy on ImageNet.

We need to be clear about what that means. That is not a marginal gain. In the world of foundation models, a 3.7-point jump is an absolute chasm.

Better-chosen data creates measurably better models. This is the exact discipline of data curation.

Your Data is Your Moat

Why do known label errors survive in production datasets? Because fixing data is unglamorous.

Look at the CHI 2021 study out of Google Research. They found that 92% of the AI practitioners they interviewed had experienced compounding downstream failures, which the paper calls data cascades, caused by undervalued data work. The paper's title nails the cultural dysfunction perfectly. "Everyone wants to do the model work, not the data work."

It is a sharp critique. But it understates the business risk.

Taking the data work seriously gives you a massive strategic advantage. Your curated, labeled data is the one part of your stack a competitor cannot clone. Shipping raw data to a cheap third-party vendor just to hit a throughput metric hands over your primary competitive advantage.

This is exactly why we built FiftyOne to run wherever your data already lives. Curating and annotating should never require losing control of your intellectual property.

The New Playbook: Model Improvement Per Dollar

If you accept that annotation is fundamentally a curation and quality problem, your priorities must change. You need to abandon the cost-per-label playbook entirely.

Compare and contrast the cost-per-label playbook with the data-quality playbook.

The Cost-Per-Label Playbook	The Data-Quality Playbook
Add annotators or move to a cheaper vendor	Decide what to label upstream to target edge cases
Buy a tool with faster bounding-box drawing	Hunt for errors using embedding-based outlier detection
Negotiate the unit price of a single label	Close the feedback loop between evaluation and annotation
Result: Lowers a number on the dashboard	Result: Raises model improvement per dollar

Stop counting labels per hour. Start measuring model improvement per dollar.
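Here is one way that metric could be computed. It is a sketch under stated assumptions: the function name, the choice of evaluation metric, and every number in the two labeling rounds are hypothetical, and the metric should be measured on a held-out test set you trust.

```python
def improvement_per_dollar(metric_before, metric_after, spend):
    """Change in an evaluation metric per dollar of annotation spend."""
    if spend <= 0:
        raise ValueError("spend must be positive")
    return (metric_after - metric_before) / spend


# Hypothetical round 1: bulk labeling at a low unit price.
bulk_labels, bulk_spend = 100_000, 5_000.0
bulk = improvement_per_dollar(0.600, 0.605, bulk_spend)

# Hypothetical round 2: fewer, curated edge cases at a higher unit price.
curated_labels, curated_spend = 5_000, 2_000.0
curated = improvement_per_dollar(0.600, 0.630, curated_spend)

print(f"cost per label:  bulk ${bulk_spend / bulk_labels:.2f}, "
      f"curated ${curated_spend / curated_labels:.2f}")
print(f"gain per $1,000: bulk {bulk * 1000:.4f}, curated {curated * 1000:.4f}")
```

In this made-up comparison the bulk round wins on cost per label and loses badly on improvement per dollar. The two metrics can point in opposite directions, which is the whole argument for switching.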

Ask yourself if you are labeling the right data. Ask yourself if those labels are perfectly accurate. Everything else is just a phantom hit. It is a number on a dashboard that looks like a win, but fails in the real world.

Frequently Asked Questions

Is annotation a data-quality problem or a labeling problem?

Annotation is fundamentally a data-quality problem. Quality relies on selecting the exact right data to label and labeling it flawlessly. Treating annotation merely as a high-throughput labeling task ignores these requirements. It causes teams to ship underperforming models built on cheap, fast labels.

How common are label errors in real ML datasets?

Label errors are incredibly common. They appear even in gold-standard benchmarks. A NeurIPS 2021 study estimated at least 3.3% errors on average across ten widely used test sets. This includes roughly a 6% error rate in ImageNet's validation set. Correcting these errors can change which models actually rank as state-of-the-art.

Why is cost per label a bad metric for AI teams?

Cost per label is a vanity metric. It disconnects annotation spend from actual model performance. It creates the illusion of progress when unit costs drop. But it incentivizes vendors to cut corners, quietly eroding data quality. Teams should optimize for model improvement per dollar instead.

Does labeling more data automatically improve model performance?

No. Adding redundant data yields diminishing returns. Most standard data simply reinforces what the model already knows. Studies consistently show that rigorous data curation produces significantly better models than simply scaling up raw data volume. Curation requires selectively targeting hard edge cases.

Sources

2026 State of Visual and Physical AI Report, Voxel51 and Dimensional Research, survey of 709 practitioners: https://voxel51.com/whitepapers/state-of-physical-ai-2026
C. G. Northcutt, A. Athalye, J. Mueller. Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. NeurIPS 2021: https://arxiv.org/abs/2103.14749 (error gallery: https://labelerrors.com)
S. Singh et al. Benchmarking Object Detectors with COCO: A New Path Forward (COCO-ReM). ECCV 2024: https://arxiv.org/abs/2403.18819
S. Y. Gadre et al. DataComp: In Search of the Next Generation of Multimodal Datasets. NeurIPS 2023: https://arxiv.org/abs/2304.14108
N. Sambasivan et al. "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. CHI 2021: https://dl.acm.org/doi/10.1145/3411764.3445518
B. Sorscher et al. Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning. NeurIPS 2022: https://arxiv.org/abs/2206.14486
A. Abbas et al. SemDeDup: Data-efficient Learning at Web-scale through Semantic Deduplication. 2023: https://arxiv.org/abs/2303.09540