How Good Is Your Ground Truth? Auditing a Wildfire Smoke Dataset with FiftyOne

Jun 26, 2026
16 min read
Every object detection benchmark rests on an unspoken assumption: that the ground truth annotations humans created are correct. We rarely test that assumption. Yet a model can only ever be as good as the annotations it learns from and is measured against, and a single mislabeled validation set can quietly reward the wrong model for years.
A ground truth audit is the systematic process of testing whether your annotations are correct before training or evaluating a model against them. It treats your labels as a hypothesis to verify rather than a fact to assume.
With that in mind, we explored the Pyro-SDIS dataset, a public wildfire smoke detection dataset published by Pyronear, a non-profit building open early warning systems for forest fires. The dataset is well-built with an important practical use-case, which makes it a good test of the question we actually care about: when a dataset already looks clean, what does a disciplined quality check still find, and how do you tell a real defect from a flag that only looks like one?
We ran the audit entirely in FiftyOne, an open-source platform for dataset curation and model evaluation, which keeps the process non-destructive, reproducible, and reviewable at every step.
The short answer is that the labels turned out to be geometrically clean, the scariest-looking problem dissolved under scrutiny, and the most useful findings came not from any single tool but from making independent methods agree. The longer answer is nuanced and very interesting.
FiftyOne App grid view of nine Pyro-SDIS wildfire camera frames with ground truth smoke bounding box annotations, illustrating the challenge of labeling faint, semi-transparent smoke at long range.
Preview of the Pyro-SDIS Ground Truth: A range of ground truth annotations demonstrates the challenges of annotating wildfire smoke.

Key findings:

  • The Pyro-SDIS annotations are geometrically clean, with no degenerate boxes and no genuine out-of-bounds coordinate errors.
  • A 95% train/validation near-duplicate rate is not data leakage. It is shared-camera familiarity: zero exact duplicates crossed the split boundary, with near-twin pairs coming from the same fixed cameras photographing the same locations days or weeks apart.
  • The most actionable label errors are a pool of 85 candidate spurious annotations, each unsupported by two independent models, making them the highest-confidence queue for human review.
  • The dataset's primary structural problem is fixed-camera redundancy. The correct fix is deduplication and camera-aware splits, not random frame splitting.

The Pyro-SDIS Dataset

Pyro-SDIS contains 33,636 images (29,537 train, 4,099 validation), each a uniform 1280x720 frame from a fixed wildfire-watch camera, with a single object class, "smoke". About 32,109 smoke boxes are spread across 28,137 images; the remaining 5,499 frames are background with no smoke. The imagery comes from 40 cameras operated by three French fire-service camera networks, and the collection is dominated by two of them. Every image carries useful provenance: which camera network it came from, the specific camera, and the capture date. That metadata becomes important later.
The targets here are challenging. Wildfire smoke at long range is faint, semi-transparent, and shapeless. Roughly 38% of the boxes cover less than a tenth of a percent of the frame, which is exactly what you would expect when the goal is to catch a plume early, while it is still small and may be located far away. This combination, a hard target and a single class, shapes everything that follows.

Three Principles for Any Dataset Audit

Three best practices are worth settling on before touching the data.
Work non-destructively. Run all analysis on a working clone of the dataset, never the original. FiftyOne clones reference the underlying images rather than copying them, so this costs almost nothing. Record every finding as a tag or a saved view to review later. An audit you cannot undo only makes more work for yourself.
Go from cheap to expensive. Model-free checks → Embeddings → Model Predictions. Each cheap layer can help narrow where the expensive layers need to look.
Triangulate. Treat any single flag as a hypothesis, not a verdict. A finding is only clear when independent signals agree on it. A flag raised by one method alone is a candidate for review, not a conclusion. This principle is the spine of the whole audit, and it’s what separated the real errors from the false alarms.
The audit follows a layered sequence:
  1. Model-free structural checks — pixel metadata, box geometry, exact duplicates
  2. Embedding-based analysis — near-duplicates, redundancy, train/val split leakage
  3. Model-assisted error detection — domain model predictions vs. ground truth labels
  4. Independent model corroboration — second-opinion model to confirm candidates
  5. Concentrated human review — prioritized subset of flagged samples

Model-Free Structural Checks

The first pass used only the pixels and the box coordinates, no model required. FiftyOne's metadata computation can surface corrupt files and resolution outliers, and a built-in FiftyOne Brain method flags images that are exact duplicates of one another. The geometric checks on the boxes themselves, things like zero or negative size or corners that spill past the image edge, were custom expressions over the normalized coordinates, made simple by FiftyOne's filtering and label-matching API; a separate FiftyOne utility measured box-on-box overlap to catch duplicate boxes stacked on the same object.
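As a rough illustration of the geometric checks described above, the sketch below works on plain Python lists rather than a FiftyOne dataset. It assumes boxes are stored as normalized [x, y, width, height] with the origin at the top-left corner. The helper names and the 0.9 overlap threshold are hypothetical choices for illustration, not values taken from the audit.

```python
from itertools import combinations

def is_degenerate(box):
    """True if a normalized [x, y, w, h] box has zero or negative size."""
    _, _, w, h = box
    return w <= 0 or h <= 0

def iou(a, b):
    """Intersection-over-union of two normalized [x, y, w, h] boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def stacked_pairs(boxes, threshold=0.9):
    """Index pairs of boxes in one image whose overlap meets the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(boxes), 2)
        if iou(a, b) >= threshold
    ]

# Toy boxes for a single image (made-up values)
boxes = [
    [0.10, 0.20, 0.05, 0.04],
    [0.10, 0.20, 0.05, 0.04],  # the same box annotated twice
    [0.60, 0.30, 0.00, 0.02],  # zero width
]
degenerate = [i for i, b in enumerate(boxes) if is_degenerate(b)]
duplicates = stacked_pairs(boxes)
```

Keeping the two checks separate means a degenerate box never pollutes the overlap results, since its area is zero and its IoU with anything is zero.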
This is also where the audit produced its first lesson in humility. An early flag reported 287 "out-of-bounds" boxes whose corners sat outside the normalized image frame. On inspection, those corners overshot the edge by as little as 5×10⁻⁷ relative to the image, the kind of rounding artifact that appears when box formats undergo coordinate conversions. The model's data loader explicitly accepts bounding box corners within ±0.01 of the frame, and all 287 flagged boxes were well below that tolerance. That means there were no bounding box errors at all. The flag was real; the defect was not.
The rest of the structural pass came back remarkably clean: zero degenerate boxes, zero duplicate or heavily overlapping boxes, and an abundance of tiny boxes that mostly turned out to be genuine distant smoke rather than noise. The annotations, geometrically speaking, were in good shape.
A related point is worth making about image quality. We ran a FiftyOne community plugin that scores images for issues like blur, over- and under-exposure, and sensor noise, but treated those scores as a separate axis from annotation correctness. Rather than flagging label errors, they let a reviewer sort the dataset by image quality and surface the blurriest or darkest frames as candidates for a closer annotation look or for removal. Domain context still shapes how these are read. High blurriness may correlate with nearby smoke that interferes with a camera's focus. Low brightness may simply reflect the time of day an image was captured, and wildfire smoke has to be detected regardless of the time of day, so those samples may need extra scrutiny to ensure accurate ground truth annotations.
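The distinction between a flag and a defect can be made explicit by measuring how far a box leaves the frame and comparing that to the data loader's tolerance. This is a minimal sketch, assuming normalized [x, y, width, height] boxes; the function names are hypothetical, and only the 0.01 tolerance comes from the text above.

```python
def overshoot(box):
    """Largest distance by which a normalized [x, y, w, h] box leaves [0, 1]."""
    x, y, w, h = box
    return max(0.0, -x, -y, x + w - 1.0, y + h - 1.0)

def classify(box, tolerance=0.01):
    """Label a box as in-bounds, a rounding artifact, or a real defect."""
    excess = overshoot(box)
    if excess == 0.0:
        return "in_bounds"
    return "rounding_artifact" if excess <= tolerance else "out_of_bounds"

# Made-up boxes: one inside the frame, one overshooting the right edge
# by roughly 5e-7, and one overshooting by roughly 0.1
labels = [
    classify([0.1, 0.1, 0.2, 0.2]),
    classify([0.5, 0.5, 0.5000005, 0.2]),
    classify([0.9, 0.5, 0.2, 0.2]),
]
```

A strict `excess > 0` test would have reported the second box as an error, which is the kind of false alarm the 287 flagged boxes turned out to be.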
FiftyOne App grid view of nine low-brightness Pyro-SDIS frames sorted by image quality score, showing dusk and nighttime wildfire camera scenes flagged during structural analysis.
Dark & Blurry: A selection of images which were flagged as being dark and/or blurry, challenging but not necessarily problematic.

Train/Validation Leakage: The Scare That Wasn't

Next we computed image embeddings with FiftyOne, using DINOv2, a strong self-supervised visual backbone, to create a feature vector for each image. Those vectors fed the near-duplicate, redundancy, and leakage analyses.
The image embeddings immediately revealed the dataset's defining visual structure. These are sequential frames from fixed cameras, so the 33,636 images collapse to only about 885 visually distinct scenes. There is enormous redundancy, including 1,042 exact duplicate images, which were flagged using FiftyOne's file-hash matching utility. With surveillance imagery, this usually points to frames drawn from similar blocks of time.
Then came the alarming part. Using the embeddings and FiftyOne Brain's compute_leaky_splits method, we found that 95% of validation images (3,899/4,099) had a near-twin in the training set, and 64% had one that was extremely similar. Taken at face value, that is a catastrophic train/validation leak, the kind that invalidates every reported metric.
The triangulation principle said: do not believe it yet, test it. So we asked what those "twins" actually were. The answer reframed the whole finding. There were zero exact duplicate images shared across the training and validation splits. Of the near-duplicate pairs, 98-100% came from the same camera, the median gap between capture times was about ten days, and not a single pair was taken within ten minutes of each other. This was not the same moment leaking across the split. It was the same camera photographing the same hillside on different days. The cameras appear to be fixed in place (we could not confirm whether any are pan-tilt-zoom units), so two images from the same site months apart still look alike in feature vector space, even though they are genuinely different observations.
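Exact-duplicate detection by file hash is simple enough to sketch in a few lines. The example below is an illustration of the general technique, not FiftyOne's implementation: the choice of SHA-256, the function names, and the in-memory stand-ins for image files are all assumptions.

```python
import hashlib
from collections import defaultdict

def content_hash(data: bytes) -> str:
    """Hash raw file bytes; identical files always share a digest."""
    return hashlib.sha256(data).hexdigest()

def exact_duplicate_groups(files):
    """files maps a path to its raw bytes. Returns groups of identical files."""
    by_hash = defaultdict(list)
    for path, data in files.items():
        by_hash[content_hash(data)].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]

# In-memory stand-ins for image files (made-up bytes)
files = {
    "train/a.jpg": b"\xff\xd8 frame one",
    "train/b.jpg": b"\xff\xd8 frame two",
    "train/c.jpg": b"\xff\xd8 frame one",  # byte-for-byte copy of a.jpg
}
groups = exact_duplicate_groups(files)
```

Hash matching only catches byte-identical files. A re-encoded or resized copy produces a different digest, which is why the embedding-based near-duplicate pass is still needed.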
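To make the headline number concrete, here is a brute-force sketch of how a near-twin rate can be computed from embeddings. It is not the FiftyOne Brain implementation: the 0.9 similarity threshold, the function names, and the two-dimensional toy vectors are assumptions chosen for readability.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def near_twin_rate(val_embeddings, train_embeddings, threshold=0.9):
    """Fraction of validation vectors whose closest training vector is at
    least `threshold` similar, plus the indices of those vectors."""
    flagged = [
        i
        for i, v in enumerate(val_embeddings)
        if max(cosine_similarity(v, t) for t in train_embeddings) >= threshold
    ]
    return len(flagged) / len(val_embeddings), flagged

# Toy 2-D embeddings (made-up values)
train = [[1.0, 0.0], [0.0, 1.0]]
val = [[0.99, 0.05], [0.7, 0.7], [-1.0, 0.0]]
rate, flagged = near_twin_rate(val, train)
```

The comparison is quadratic in dataset size, so it only suits small experiments. The point is what the number means: a high rate says validation images resemble training images, not why they do.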
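The follow-up questions above (same camera? how far apart in time? any near-simultaneous frames?) reduce to a short summary over the flagged pairs. This sketch assumes each image carries a camera identifier and a capture timestamp; the dictionary keys, the function name, and the sample pairs are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median

def triage_pairs(pairs, burst_window=timedelta(minutes=10)):
    """Summarize cross-split near-duplicate pairs using capture metadata.

    Each pair is (val_meta, train_meta), where each item is a dict with the
    hypothetical keys "camera" and "captured_at" (a datetime).
    """
    gaps = [abs(v["captured_at"] - t["captured_at"]) for v, t in pairs]
    same_camera = sum(v["camera"] == t["camera"] for v, t in pairs)
    return {
        "same_camera_fraction": same_camera / len(pairs),
        "median_gap_days": median(g.total_seconds() for g in gaps) / 86400,
        "pairs_within_burst_window": sum(g <= burst_window for g in gaps),
    }

# Made-up pairs for illustration only
pairs = [
    ({"camera": "cam_a", "captured_at": datetime(2023, 7, 11, 14, 0)},
     {"camera": "cam_a", "captured_at": datetime(2023, 7, 1, 14, 0)}),
    ({"camera": "cam_b", "captured_at": datetime(2023, 8, 4, 9, 30)},
     {"camera": "cam_b", "captured_at": datetime(2023, 8, 1, 9, 30)}),
    ({"camera": "cam_a", "captured_at": datetime(2023, 6, 1, 18, 0)},
     {"camera": "cam_a", "captured_at": datetime(2023, 6, 21, 18, 0)}),
]
summary = triage_pairs(pairs)
```

A high same-camera fraction combined with zero pairs inside the burst window is the signature of shared-camera familiarity. Many pairs inside the window would instead suggest the same event split across train and validation.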
That distinction matters enormously. It is not severe leakage that inflates a score; it’s shared-camera familiarity, a milder and very common issue. The correct response is not to sound an alarm but to build evaluation splits that separate cameras and sites rather than splitting frames at random, so the validation set measures generalization to new locations. A scary headline number, tested rather than trusted, became a concrete and actionable recommendation.
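A camera-aware split can be sketched as a grouping step followed by assigning whole cameras to one side. This is one possible approach under stated assumptions: the "camera" key, the function name, the 20% validation share, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict

def camera_aware_split(samples, val_fraction=0.2, seed=0):
    """Assign whole cameras to train or val so no camera appears in both.

    samples is a list of dicts with a hypothetical "camera" key. Note that
    val_fraction is a share of cameras, not a share of frames.
    """
    by_camera = defaultdict(list)
    for sample in samples:
        by_camera[sample["camera"]].append(sample)
    cameras = sorted(by_camera)
    random.Random(seed).shuffle(cameras)
    n_val = max(1, round(len(cameras) * val_fraction))
    val = [s for c in cameras[:n_val] for s in by_camera[c]]
    train = [s for c in cameras[n_val:] for s in by_camera[c]]
    return train, val

# Five made-up cameras with four frames each
samples = [
    {"id": f"cam{c}_frame{i}", "camera": f"cam{c}"}
    for c in range(5)
    for i in range(4)
]
train, val = camera_aware_split(samples)
```

Because cameras contribute very different numbers of frames in practice, the resulting frame counts can be lopsided. Splitting by site or camera network is a stricter variant of the same idea.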
FiftyOne App grid view of fifteen visually similar Pyro-SDIS frames from fixed wildfire cameras, flagged as near-duplicates using FiftyOne Brain embedding analysis, illustrating the dataset's heavy fixed-camera redundancy.
High Similarity Samples: A preview of how many samples were flagged as near or exact duplicates. A strong indicator that care is needed when creating splits for training and validation.

Model-Assisted Annotation Error Detection

Structural and embedding checks find redundancy and annotation geometry problems, but they cannot tell you whether a box is on the right thing. For that you need predictions to disagree with the labels. We applied Pyronear's own published smoke detection model against our FiftyOne dataset, then used FiftyOne's evaluation and label-mistake tools to localize where model and ground truth parted ways.
Two honest caveats frame what this can and cannot prove. First, the detector comes from Pyronear itself and may have been trained on a corrected or newer Pyronear dataset drawn from the same camera networks. We could not confirm that it trained on these exact images, but the dataset was published more than a year before the model was uploaded, so it plausibly saw the same data or close relatives of it. Either way, it is not an independent referee, which makes it useful for surfacing candidate errors but not for confirming them. Second, we hit a concrete tooling subtlety: the "possible missing" signal from FiftyOne Brain's compute_mistakenness utility only fires for predictions above 0.95 confidence, and this model peaked at 0.92, so it flagged nothing. Rather than conclude there were no missing annotations, we built a queue from the model's confident false positives (≥ 0.5), the frames where it predicted smoke that the ground truth labels did not contain.
The model-assisted pass produced four pools of suspicion, each an independent route to a different kind of error. Three came from the detection model: smoke the ground truth may have missed, ground truth boxes the model made no prediction for (possible spurious labels), and ground truth boxes that may not correctly bound the smoke. The fourth came from a separate signal: we used DINOv2 to compute patch embeddings, one embedding for the cropped image region of each ground truth bounding box. Each patch embedding was then scored by the mean cosine distance to its ten nearest neighbors, and we flagged the patches in the top 2% of the distance score, the statistical outliers in embedding space, as candidate mislabels. Each pool was sorted worst-first and saved as a view to review. None of them were conclusions yet; all were hypotheses awaiting a second opinion.
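The fallback queue described above amounts to keeping confident predictions that match no ground truth box. The sketch below shows that logic for a single image. The 0.5 confidence floor comes from the text; the 0.5 IoU matching threshold, the function names, and the toy boxes are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two normalized [x, y, w, h] boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def missing_label_queue(predictions, gt_boxes, min_conf=0.5, iou_thresh=0.5):
    """Confident predictions that match no ground truth box, worst-first."""
    unmatched = [
        p
        for p in predictions
        if p["confidence"] >= min_conf
        and all(iou(p["box"], g) < iou_thresh for g in gt_boxes)
    ]
    return sorted(unmatched, key=lambda p: p["confidence"], reverse=True)

# Made-up labels and predictions for one image
gt_boxes = [[0.10, 0.10, 0.20, 0.20]]
predictions = [
    {"box": [0.11, 0.10, 0.20, 0.20], "confidence": 0.80},  # overlaps the label
    {"box": [0.60, 0.60, 0.10, 0.10], "confidence": 0.70},  # no label nearby
    {"box": [0.40, 0.80, 0.05, 0.05], "confidence": 0.30},  # below the floor
]
queue = missing_label_queue(predictions, gt_boxes)
```

Everything in the queue is a candidate, since a confident false positive from the model looks identical to a missed label at this stage.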
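The patch-embedding outlier score can be written out directly from its description: mean cosine distance to the ten nearest neighbors, then flag the top 2%. This is a brute-force sketch with hypothetical function names and synthetic two-dimensional vectors standing in for DINOv2 embeddings.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def knn_outlier_scores(embeddings, k=10):
    """Mean cosine distance from each vector to its k nearest neighbors."""
    scores = []
    for i, u in enumerate(embeddings):
        dists = sorted(
            cosine_distance(u, v) for j, v in enumerate(embeddings) if j != i
        )
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    return scores

def flag_top_fraction(scores, fraction=0.02):
    """Indices of the highest-scoring items, at least one, worst-first."""
    n = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

# 49 tightly clustered synthetic vectors plus one that points elsewhere
cluster = [[math.cos(0.01 * i), math.sin(0.01 * i)] for i in range(49)]
embeddings = cluster + [[0.0, 1.0]]
scores = knn_outlier_scores(embeddings)
flagged = flag_top_fraction(scores)
```

A fixed top-2% cut always flags something, even in a perfectly labeled dataset, which is one reason these flags need corroboration before they count as errors.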

Asking for a genuinely independent second opinion

To turn hypotheses into findings, we needed a model that had never seen this dataset. This turned out to be the hardest part of the whole project, and it is where most of the surprises lived.
The first candidate was YOLO-World, a popular open-vocabulary detector that finds objects from a text prompt. The appeal is obvious: just prompt it with "smoke" and apply the model against the samples. It did not work, and we made certain of that before giving up on it. Across an adversarial sweep of more than thirty prompts, several natural-language descriptions, prompt combinations, and seven input resolutions, the model never once detected realistic, early-stage smoke, and never reached a usable confidence on the smoke region. Control prompts like "sky" and "trees" did fire on these same images, which proved the model worked and the domain failure for this model was real. The likely reason is fundamental: open-vocabulary detectors are trained to recognize discrete, object-like things, and diffuse, textureless, boundary-less smoke sits outside that world. The lesson generalizes well beyond this dataset: always validate a second-opinion model on known-positive examples before you trust its silence to mean anything.
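The validation step in that lesson can be reduced to a small gate: measure how often a candidate model fires on images already known to contain the target, and only treat its silence as evidence if that rate is acceptable. The function names, the 0.5 minimum rate, and the stub detectors below are hypothetical stand-ins, not real model calls.

```python
def detection_rate(detect, known_positives):
    """Fraction of known-positive images on which `detect` returns any box."""
    hits = sum(1 for image in known_positives if detect(image))
    return hits / len(known_positives)

def can_trust_silence(detect, known_positives, min_rate=0.5):
    """Treat missing detections as evidence only if the model fires often
    enough on images known to contain the target."""
    return detection_rate(detect, known_positives) >= min_rate

# Image identifiers stand in for frames with confirmed smoke
known_positives = ["frame_001", "frame_002", "frame_003", "frame_004"]

def silent_detector(image):
    """Stub for a model that never detects the target."""
    return []

def working_detector(image):
    """Stub for a model that misses one of the four known positives."""
    return [] if image == "frame_004" else [[0.4, 0.4, 0.1, 0.1]]

silent_ok = can_trust_silence(silent_detector, known_positives)
working_ok = can_trust_silence(working_detector, known_positives)
```

Running control prompts, as the audit did with "sky" and "trees", complements this gate by separating a broken pipeline from a genuine domain failure.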
The second candidate, NVIDIA's LocateAnything, a grounding vision-language model available via the FiftyOne Model Zoo as a remote model, fared better than YOLO-World but presented its own difficulties. Its smoke bounding boxes were not precise, possibly because frames of mountainous regions show fog or low clouds that are easily mistaken for smoke. So we used the model only to corroborate that smoke was present, not to refine box localization or sizing. The general principle held again: a second model is only worth its independence if you first characterize what it can and cannot do.
FiftyOne sample view of a mountain landscape with overlapping smoke bounding boxes from three sources: pink ground truth labels, grey Pyronear model predictions, and a red LocateAnything prediction, used to corroborate annotation accuracy across independent models.
Corroborating Challenging Annotations: The red bounding box is from LocateAnything, the grey boxes are from the Pyronear pretrained model, and the pink boxes are the Pyro-SDIS ground truth. This demonstrates how certain domains can be extremely difficult to annotate, even for humans, as distinguishing between fog and smoke is not simple given a single image frame.

Building a Targeted Review Subset

Hand-reviewing 33,636 images is not realistic, and the value of the audit is in directing scarce human attention to where it pays off. So we built a small, dense evaluation set: the 100 most suspicious images from each of the four error pools (100 being an arbitrary cutoff), unioned into a single 400-image subset (the pools happened not to contain overlapping samples), with each image tagged by the pool it came from. Cloning just this subset let us run both detectors over it affordably and compare them side by side in the FiftyOne App model-evaluation panel.
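The subset construction is a top-k selection per pool followed by a union that remembers where each sample came from. The sketch below assumes each pool is a list of (sample id, suspicion score) pairs; the function name, pool names, and scores are hypothetical, and the example uses two samples per pool instead of 100 to stay readable.

```python
def build_review_subset(pools, per_pool=100):
    """Union the most suspicious samples from each pool.

    pools maps a pool name to a list of (sample_id, suspicion_score).
    Returns a dict of sample_id -> list of pools that nominated it.
    """
    subset = {}
    for pool_name, scored in pools.items():
        worst_first = sorted(scored, key=lambda item: item[1], reverse=True)
        for sample_id, _ in worst_first[:per_pool]:
            subset.setdefault(sample_id, []).append(pool_name)
    return subset

# Made-up pools and scores
pools = {
    "possible_missing": [("a", 0.9), ("b", 0.4), ("c", 0.7)],
    "possible_spurious": [("d", 0.8), ("a", 0.6), ("e", 0.1)],
}
subset = build_review_subset(pools, per_pool=2)
```

Storing the pool names per sample handles overlap gracefully: a sample nominated by two pools appears once and carries both tags, which is itself a useful prioritization signal.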
This subset is deliberately adversarial, the hardest frames in the dataset rather than a randomized or representative sample, so its raw scores are not meant to reflect the dataset as a whole. Its purpose is differential: where two independent models agree, a flag becomes credible; where there’s only a single signal, it stays a candidate.
FiftyOne App model evaluation panel showing a side-by-side comparison of the Pyronear smoke detector and NVIDIA LocateAnything on the 400-image concentrated review subset, with per-class precision, recall, and F1 metrics.
Model Evaluation: the FiftyOne model-evaluation panel comparing the two detectors on the concentrated subset side by side. Note that LocateAnything doesn’t output a confidence value for its predictions.

Reading the results: clear versus uncertain

The audit's conclusions divide cleanly into what the evidence supports and what it only suggests.
Clear. The annotations are geometrically clean, with no degenerate or duplicated boxes and no genuine out-of-bounds boxes. There is no severe image-level leakage between splits; the real structural story is heavy fixed-camera redundancy, which calls for deduplication and a camera-aware split rather than a fully randomized one. And there is one high-precision pool of real label errors: among the 100 most suspect "possible spurious" boxes, 85 had no support from either model, neither the domain detector nor the independent one identified smoke as present. That is the queue a human should review first.
Uncertain. The other pools were weaker than they first appeared, and saying so is part of the rigor. The "missing annotation" pool was overstated: among the top candidates, only a small fraction drew corroboration from the independent model, which suggests many were the domain model's own false positives rather than genuine missed smoke. The "localization error" pool was mostly model imprecision rather than misplaced labels, with only about 13/100 corroborated. And the statistically "atypical" boxes were largely unconfirmed, a reminder that unusual is not the same as wrong, since a rare-but-real smoke plume looks atypical too.
It is worth being honest about why even the clear pool carries a ceiling rather than a certainty. Both models can fail together on the same faint, long-range smoke, so "neither model sees it" is strong evidence but not proof. The independent model's modest recall on these hard frames means a human reviewer, not the tooling, makes the final call. The audit's job was never to deliver verdicts; it was to rank the dataset by how likely each region is to be wrong, so that limited review time lands where it matters most.
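The "neither model sees it" test can be sketched as a support check across models. Because the independent model's boxes were only reliable for presence, the example uses a deliberately loose overlap requirement. The 0.1 IoU threshold, the function names, and the toy boxes are assumptions, not values from the audit.

```python
def iou(a, b):
    """Intersection-over-union of two normalized [x, y, w, h] boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def has_support(gt_box, predicted_boxes, min_iou=0.1):
    """True if any predicted box overlaps the label at least loosely."""
    return any(iou(gt_box, p) >= min_iou for p in predicted_boxes)

def unsupported_by_all(gt_boxes, predictions_by_model, min_iou=0.1):
    """Indices of labels that no model's predictions overlap."""
    return [
        i
        for i, g in enumerate(gt_boxes)
        if not any(
            has_support(g, preds, min_iou)
            for preds in predictions_by_model.values()
        )
    ]

# Made-up labels and predictions for one image
gt_boxes = [[0.10, 0.10, 0.10, 0.10], [0.70, 0.70, 0.05, 0.05]]
predictions_by_model = {
    "domain_detector": [[0.12, 0.10, 0.10, 0.10]],
    "independent_model": [[0.05, 0.05, 0.30, 0.30]],
}
review_first = unsupported_by_all(gt_boxes, predictions_by_model)
```

The output is a review queue rather than a list of confirmed errors, for the reason given above: both models can miss the same faint plume.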

Key Takeaways for Any Dataset Audit

Three things carry over to any dataset, not just this one.
First, a flag is a hypothesis, not a finding. The out-of-bounds boxes, the 95% "leakage," and the inflated missing-annotation count were all real signals that meant something other than what they first appeared to. Every one of them was resolved by following up with additional investigation rather than acting on the first answer.
Second, triangulation is what converts signals into conclusions. No single tool was trustworthy on its own, including the domain model that was likely trained on this data. The durable findings were the ones two independent methods agreed on, and the honest "uncertain" label went on everything else.
Third, domain knowledge sets the rules. Knowing that smoke is faint and shapeless, that imagery is replicated across fixed cameras, and what bounding box tolerance the data loader accepts determined which flags were defects, which were signal, and which off-the-shelf models could even play. A generic checklist would have flagged the wrong things and trusted the wrong models. The domain also dictates how to evaluate a model and which label corrections to make. For early wildfire detection, a high false positive rate is likely preferable, because a false negative (late or no detection) could be catastrophic or life-threatening.
Pyro-SDIS came out of this audit looking like what it is: a carefully built, genuinely useful dataset whose main opportunities are structural (deduplicate, and split by camera for a more honest evaluation) with a small, well-localized set of label errors worth correcting. That is a good outcome. The larger point is that we could only say so with confidence because we tested every alarming number instead of believing it, and because the whole process stayed non-destructive and reproducible from the first check to the last.
FiftyOne sample view of a wildfire camera frame showing two Pyronear model smoke detections with confidence scores and no corresponding ground truth annotations, flagging them as candidate missing labels.
Latest Pyronear Detections: These were detected by the Pyronear model, but not included in the ground truth of the dataset, an obvious opportunity for correction.

FAQ

What is a ground truth audit?

A ground truth audit is the systematic process of testing whether your annotations are correct before training or evaluating a model against them, treating your labels as a hypothesis to verify rather than a fact to assume. The process moves from cheap, model-free structural checks through embedding-based analysis to model-assisted error detection, with human review concentrated on samples flagged by multiple independent signals.

How do you detect train/validation leakage in a dataset?

Use image embeddings to compute similarity between training and validation splits. FiftyOne Brain's compute_leaky_splits method does this directly. But don't stop at the flag: investigate the nature of the similarity. In the Pyro-SDIS dataset, 95% of validation images had a near-twin in training, which looked like catastrophic leakage. Closer inspection revealed zero exact duplicate images across splits. The near-duplicate pairs came from the same fixed cameras photographing the same locations days or weeks apart. That's shared-camera familiarity, not data leakage. The correct response was recommending camera-aware splits, not discarding the dataset.

How do you find annotation errors in object detection without a clean reference model?

No single method is sufficient. The approach that works is triangulation across independent signals. Structural checks catch geometric errors like degenerate or out-of-bounds boxes. Patch embeddings surface statistically atypical annotations: compute a feature vector for each bounding box crop, then flag the top 2% by mean cosine distance to their ten nearest neighbors. Domain model predictions surface potential missed labels (confident false positives) and spurious ones (ground truth boxes the model consistently ignores). A finding only becomes a credible error when at least two independent signals agree on it.

How do you prioritize which annotations to review manually?

Build a targeted review subset rather than reviewing the full dataset. Identify your error pools (missing annotations, spurious labels, localization errors, embedding outliers), take the top 100 most suspicious samples from each, and union them into a single prioritized subset. In this audit, four pools of 100 produced 400 non-overlapping images. Tag each sample by its source pool, then use a model evaluation panel to compare predictions side by side. Focus human review first on samples where two independent models agree, as those carry the highest precision. Samples flagged by only one signal stay candidates, not conclusions.

Why do open-vocabulary detectors like YOLO-World fail on smoke detection?

Open-vocabulary detectors are trained to find discrete, object-like things with clear boundaries. Wildfire smoke, especially early-stage long-range plumes, is diffuse, textureless, and boundary-less, which sits outside the distribution these models were built on. In this audit, YOLO-World was tested across more than thirty prompts, multiple natural-language descriptions, and seven input resolutions without a single realistic smoke detection. Control prompts like "sky" and "trees" did fire on the same images, confirming the model itself worked. The failure was domain-specific. The broader lesson: always validate a second-opinion model on known-positive examples before trusting its silence.