Medical data annotation has never been straightforward. Two radiologists can review the same CT scan and reach entirely different conclusions. One identifies a finding as benign; the other feels uncertain.
For years, ML teams averaged this disagreement away. They ran a majority vote or applied a label fusion algorithm like STAPLE, then moved forward with a single best-guess label. This approach treated inter-annotator disagreement as a problem to solve rather than a signal to keep.
While imperfect, this method worked when teams trained models from scratch on hundreds of thousands of images, because random label noise was diluted across a massive dataset. In the foundation-model era, however, that math no longer applies.
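To make the information loss concrete, here is a minimal sketch of majority-vote fusion in plain Python. The label values and the `majority_vote` helper are illustrative assumptions, not part of any specific pipeline.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label; ties resolve to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical reads from three radiologists for two different scans
unanimous = ["benign", "benign", "benign"]
contested = ["benign", "benign", "suspicious"]

fused_unanimous = majority_vote(unanimous)
fused_contested = majority_vote(contested)

# Both scans receive the same training label, so the 2-1 split
# on the second scan is invisible to everything downstream.
print(fused_unanimous, fused_contested)
```

After fusion, the unanimous scan and the contested scan are indistinguishable, which is exactly the signal the rest of this post argues for keeping.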
The ground truth problem no one fixed
Most teams now fine-tune foundation models like UNI2 and MedSAM2 on small, institution-specific datasets—often from a single hospital, geographic region, or patient population. These datasets typically contain only a few hundred to a few thousand samples.
At this scale, every disputed annotation exerts an outsized influence on model behavior. The noise left behind by fused labels is no longer diluted across a massive dataset. It concentrates in a few hundred samples, and the model amplifies it.
This noise is also hard to catch. A 2025 MIT study benchmarked radiologists' uncertainty language (words like "likely," "possible," or "consistent with") against ground-truth CT labels. The study found that the expressed uncertainty was measurably miscalibrated across both radiologists and pathology types.
Annotators cannot reliably self-report when their own labels are wrong. A label that looks like confident ground truth may hide genuine uncertainty that went unrecorded. When that label becomes one of just a few hundred training samples, your model learns an uncertain call as if it were a definitive signal.
A better approach: Treat disagreement as a first-class signal
Expert disagreement in medical data annotation is not noise. It is critical information about the hardest cases in your dataset. These are the exact edge cases most likely to cause model failures and threaten downstream clinical reliability. You can build a practical workflow that treats expert disagreement as a high-value signal using these six steps:
1. Represent disagreement explicitly
Three radiologists independently label the same DICOM, preserving individual annotations before any consensus merge.
Store each annotator's label in its own field instead of immediately fusing them. This step is the foundation of a regulatory-grade medical image annotation workflow. You cannot reconstruct individual clinical judgments from a fused output. Without individual judgments, you lose the ability to identify which cases were genuinely contested.
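As a minimal sketch of this idea, the plain-Python structure below keeps one label per annotator and derives the contested cases from them. The `AnnotatedScan` class, the annotator ids, and the label values are hypothetical; in practice the same layout maps onto separate label fields in whatever dataset tool you use.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedScan:
    """One scan with every reader's label kept in its own slot."""
    scan_id: str
    labels: dict = field(default_factory=dict)  # annotator id -> label

    def is_contested(self):
        # Contested means at least two distinct labels were recorded
        return len(set(self.labels.values())) > 1

    def vote_split(self):
        # Number of readers per label, e.g. {"benign": 2, "suspicious": 1}
        split = {}
        for label in self.labels.values():
            split[label] = split.get(label, 0) + 1
        return split


scans = [
    AnnotatedScan("ct_001", {"rad_a": "benign", "rad_b": "benign", "rad_c": "benign"}),
    AnnotatedScan("ct_002", {"rad_a": "benign", "rad_b": "suspicious", "rad_c": "benign"}),
]

contested_ids = [s.scan_id for s in scans if s.is_contested()]
print(contested_ids)
```

A consensus label can still be computed later from `labels`, but the reverse is not possible, which is why the individual reads are stored first.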
2. Surface the disagreement with embeddings
Filter for direct label conflicts between different annotators, then map where those disagreements land in embedding space.
Once the disagreement is recorded explicitly, you can locate it. In FiftyOne, you can filter directly for cases where annotators disagreed. You can also compute embeddings with a model of your choice to identify clusters of samples where labels diverged.
Using embeddings to surface an outlier cluster (bottom right) composed of blurry images.
Embedding-based exploration also surfaces outlier clusters where annotation quality problems concentrate. These cases are often visually unusual, which explains why annotators disagreed in the first place. Data quality metrics like brightness, blurriness, and entropy provide additional dimensions to help you understand these divergence patterns.
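One way to quantify "where disagreements land in embedding space" is to score each sample by how many of its nearest neighbors are contested. The sketch below uses plain Python and hypothetical 2-D points standing in for projected image embeddings; the function name and the choice of `k` are assumptions for illustration, not a FiftyOne API.

```python
import math


def neighborhood_disagreement(embeddings, contested, k=2):
    """For each sample, the fraction of its k nearest neighbors that are contested.

    Assumes k is smaller than the number of samples.
    """
    scores = []
    for i, point in enumerate(embeddings):
        # Distances to every other sample, nearest first
        others = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        )
        nearest = [j for _, j in others[:k]]
        scores.append(sum(contested[j] for j in nearest) / k)
    return scores


# Hypothetical 2-D projections: two tight clusters of three scans each
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
# Whether the annotators disagreed on each scan
contested = [False, False, False, True, True, False]

scores = neighborhood_disagreement(embeddings, contested, k=2)
print(scores)
```

In this toy data the last scan was labeled unanimously, yet it sits in a neighborhood where every neighbor was contested. That kind of sample is a reasonable candidate for a second look even though no conflict was recorded on it.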
3. Curate with disagreement in mind
When you can see exactly where your annotators diverged, you can make informed curation decisions. You can flag cases for re-labeling, exclude outliers, and surface systematic bias in how certain annotators approach findings. Instead of treating disagreement as a data-cleaning step, you actively build a dataset with a fully documented quality profile.
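A simple way to start looking for systematic annotator bias is to measure how often each reader departs from the majority. This is a rough sketch with hypothetical annotator ids and labels. A high rate does not prove a reader is wrong; it only indicates where to look.

```python
from collections import Counter, defaultdict


def deviation_rates(annotations):
    """Fraction of scans on which each annotator departs from the majority label."""
    deviations = defaultdict(int)
    totals = defaultdict(int)
    for labels in annotations.values():
        majority, count = Counter(labels.values()).most_common(1)[0]
        if count <= len(labels) / 2:
            continue  # no strict majority on this scan, so skip it
        for annotator, label in labels.items():
            totals[annotator] += 1
            deviations[annotator] += label != majority
    return {a: deviations[a] / totals[a] for a in totals}


# Hypothetical reads: scan id -> {annotator id -> label}
annotations = {
    "ct_001": {"rad_a": "benign", "rad_b": "benign", "rad_c": "suspicious"},
    "ct_002": {"rad_a": "benign", "rad_b": "benign", "rad_c": "suspicious"},
    "ct_003": {"rad_a": "suspicious", "rad_b": "suspicious", "rad_c": "suspicious"},
    "ct_004": {"rad_a": "benign", "rad_b": "suspicious", "rad_c": "benign"},
}

rates = deviation_rates(annotations)
print(rates)
```

Here `rad_c` departs from the majority on half of the scans, always in the direction of "suspicious". Whether that reflects over-calling or a sharper eye is a clinical question, but the pattern is now visible and documented.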
4. Record fine-tuning lineage with version history
When you fine-tune a model, record the exact subset of samples you used and link it directly to the model version. FiftyOne's built-in data versioning supports this. The resulting record is the kind of change-control documentation that regulatory frameworks demand, and it gives you the trail you need when debugging a production failure.
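Independent of any particular versioning feature, the core of a lineage record is small: a model version, the exact sample ids, and a hash that makes the subset easy to compare. The sketch below is a generic illustration in plain Python. The `build_manifest` helper, the field names, and the version string are assumptions, not FiftyOne's versioning API.

```python
import hashlib
import json


def build_manifest(model_version, sample_ids):
    """Tie a model version to the exact set of training samples via a content hash."""
    ordered = sorted(set(sample_ids))  # order-independent, duplicates removed
    digest = hashlib.sha256("\n".join(ordered).encode("utf-8")).hexdigest()
    return {
        "model_version": model_version,
        "num_samples": len(ordered),
        "sample_ids": ordered,
        "subset_sha256": digest,
    }


manifest = build_manifest("finetune-v3", ["ct_002", "ct_001", "ct_007"])
same_subset = build_manifest("finetune-v3", ["ct_007", "ct_001", "ct_002"])

print(json.dumps(manifest, indent=2))
```

Because the ids are sorted before hashing, two runs trained on the same subset produce the same `subset_sha256` regardless of load order, which makes "did these two models see the same data?" a one-line check.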
5. Evaluate by slice, not just aggregate
Evaluation shouldn't just measure your final model's performance—it should audit your data quality. Because medical ground truth is rarely 100% certain, you can use model predictions to surface hidden labeling errors. When a model strongly disagrees with a label, it is often pointing out a manual annotation mistake.
The goal is to build a continuous feedback loop: look for areas of strong disagreement between your model and your labels, find out why they don't match, and fix the underlying data.
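The first pass of that feedback loop can be sketched as a ranking: collect the samples where a confident prediction contradicts the stored label, then review them from most to least confident. The record layout and the 0.9 threshold below are illustrative assumptions. A confident model can also be confidently wrong, so the output is a review queue, not a list of corrections.

```python
def rank_suspect_labels(samples, min_confidence=0.9):
    """Ids where a confident prediction contradicts the label, most confident first."""
    suspects = [
        s for s in samples
        if s["predicted"] != s["label"] and s["confidence"] >= min_confidence
    ]
    suspects.sort(key=lambda s: s["confidence"], reverse=True)
    return [s["id"] for s in suspects]


# Hypothetical labels and model outputs
samples = [
    {"id": "ct_001", "label": "benign", "predicted": "benign", "confidence": 0.98},
    {"id": "ct_002", "label": "benign", "predicted": "suspicious", "confidence": 0.97},
    {"id": "ct_003", "label": "suspicious", "predicted": "benign", "confidence": 0.55},
    {"id": "ct_004", "label": "suspicious", "predicted": "benign", "confidence": 0.93},
]

review_queue = rank_suspect_labels(samples)
print(review_queue)
```

The low-confidence mismatch (`ct_003`) is left out of the queue on purpose: it is more likely a genuinely hard case than a labeling slip.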
To structure this evaluation workflow, start by catching glaring mistakes early with auto-labeling. Run an off-the-shelf foundation model against your dataset before fine-tuning. Where zero-shot predictions clash with your ground truth, you have immediate labeling inconsistencies worth investigating.
The interactive confusion matrix lets you click into specific error cells — here surfacing 66 cases where the model predicted "yes" but the ground truth label was "suspicious."
Then, target these discrepancies with an interactive confusion matrix. Click into error cells (for example, where the ground truth says "suspicious" but the model predicts "benign") to visually inspect and flag broken labels.
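Underneath the interactive view, a confusion matrix cell is just a group of sample ids keyed by a (ground truth, prediction) pair. A minimal plain-Python sketch, with hypothetical records, shows how one error cell turns into a concrete list of scans to inspect:

```python
from collections import defaultdict


def confusion_cells(records):
    """Group sample ids by (ground truth, prediction) so any cell can be inspected."""
    cells = defaultdict(list)
    for sample_id, truth, predicted in records:
        cells[(truth, predicted)].append(sample_id)
    return dict(cells)


# Hypothetical (sample id, ground truth, prediction) triples
records = [
    ("ct_001", "benign", "benign"),
    ("ct_002", "suspicious", "benign"),
    ("ct_003", "suspicious", "suspicious"),
    ("ct_004", "suspicious", "benign"),
]

cells = confusion_cells(records)
to_review = cells.get(("suspicious", "benign"), [])
print(to_review)
```

Keeping ids rather than counts in each cell is the design choice that matters: a count tells you an error exists, while the ids let you open the images and decide whether the model or the label is at fault.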
From there, you can isolate deeper clinical blind spots with scenario analysis. Because aggregate scores often mask localized failures, slicing your data by specific patient metadata (such as patient weight, demographics, or scanner types) ensures you catch hidden vulnerabilities that high-level overviews miss.
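The effect of slicing is easy to demonstrate with a toy example. The sketch below computes accuracy overall and per scanner type. The `scanner` field, the vendor names, and the records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_slice(records, key):
    """Accuracy computed separately for each value of a metadata field."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += r["label"] == r["predicted"]
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical evaluation records with one metadata field
records = [
    {"scanner": "vendor_a", "label": "benign", "predicted": "benign"},
    {"scanner": "vendor_a", "label": "suspicious", "predicted": "suspicious"},
    {"scanner": "vendor_a", "label": "benign", "predicted": "benign"},
    {"scanner": "vendor_a", "label": "benign", "predicted": "benign"},
    {"scanner": "vendor_b", "label": "suspicious", "predicted": "benign"},
    {"scanner": "vendor_b", "label": "benign", "predicted": "benign"},
]

overall = sum(r["label"] == r["predicted"] for r in records) / len(records)
per_scanner = accuracy_by_slice(records, "scanner")
print(round(overall, 3), per_scanner)
```

The aggregate score of five correct out of six looks healthy, while the second scanner sits at 50 percent. With real data the slices would need enough samples for the per-group numbers to be meaningful.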
6. Drive the workflow with an AI agent
FiftyOne Data Agent surfaces label distribution issues in a dataset of 800 DICOM radiology images.
You can drive the entire workflow—embedding computation, uniqueness scoring, and mistakenness detection—in natural language with FiftyOne Data Agent. This allows you to maintain analytical depth without the operational complexity.
The regulatory dimension: Annotation quality as a compliance requirement
This workflow also happens to solve a problem most teams don't think about until it's too late: regulatory compliance. If you fuse your labels early and discard individual expert input, you permanently lose the ability to answer standard compliance questions: Which cases were disputed? How was the consensus reached? How does the model perform specifically on highly contested data? Once that data is flattened, you can't go back and recreate it for an auditor.
Meeting global standards
The EU AI Act places medical AI into a high-risk tier with strict documentation requirements. Medical image annotation, labeling, cleaning, updating, enrichment, and aggregation must be documented. In practice, meeting that bar typically means using multiple independent annotators and recording measured inter-annotator agreement. Every annotation decision, correction, and quality check should be time-stamped and traceable so that a regulator can reconstruct the history of any data point.
Similarly, FDA frameworks increasingly expect teams to document how they constructed ground truth, demonstrate performance across patient subgroups, and maintain clear provenance trails. You must prove who labeled what, when, and under what conditions.
Teams that treat annotation as a first-class engineering priority from day one build compliance-ready data pipelines, saving themselves months of regulatory friction later.
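The inter-annotator agreement mentioned in this section is only measurable if individual labels were kept. One widely used statistic for two annotators is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a plain-Python illustration with hypothetical labels. It is one common choice of metric, not one that this post claims any regulation prescribes.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical reads from two radiologists on the same eight scans
rad_a = ["benign", "benign", "suspicious", "suspicious",
         "benign", "benign", "suspicious", "benign"]
rad_b = ["benign", "benign", "suspicious", "benign",
         "benign", "suspicious", "suspicious", "benign"]

kappa = cohens_kappa(rad_a, rad_b)
print(round(kappa, 3))
```

The two readers agree on 75 percent of scans, but because both call "benign" most of the time, much of that agreement is expected by chance and kappa comes out well below the raw figure. For three or more annotators, a multi-rater statistic such as Fleiss' kappa is the usual extension.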
What this means for your medical data annotation today
Look closely at your current pipeline: where exactly does it discard disagreement, and what would you discover if you kept it?
Most teams discard disagreement at the label fusion step, before fine-tuning or evaluation even begin. This default resolves the hardest cases in a dataset by averaging them, then treats the fused labels as if they were as reliable as the easiest cases. Consequently, the fine-tuned model learns from flawed data, behaves unpredictably in production, and leaves engineering teams with no data trail to diagnose the failure.
With modern tooling, none of this requires prohibitive engineering effort. Storing multiple label fields, computing embeddings, running quality metrics, and recording version history are all routine operations. The harder shift is treating data curation as an ongoing discipline rather than a one-time chore. The annotation decisions you make today will determine the reliability of every model you build on top of them.
This post stems from our June 2026 webinar, Visual AI in Healthcare: Ground Truth in the Foundation-Model Era. In the session, we walked through this six-step pipeline live using the FiftyOne platform, showcasing embedding visualizations, data quality analysis, subgroup model evaluation, and agentic workflows.
If you want to go deeper into building and managing ground truth datasets with medical foundation models, our FiftyOne Medical Imaging Guide offers a complete starting point for workflows involving DICOM, CT scans, and volumetric data.