
Finding Label Mistakes with FiftyOne

Annotation mistakes create an artificial ceiling on the performance of your models. However, finding these mistakes by hand is at least as arduous as the original annotation work! Enter FiftyOne.

This tutorial shows how FiftyOne can help you find and correct label mistakes in your datasets, enabling you to curate higher quality datasets and, ultimately, train better models!


In this walkthrough, we explore how FiftyOne can be used to help you find mistakes in your annotations.

We’ll cover the following concepts:

  • Loading your existing dataset in FiftyOne

  • Adding predictions from your model to your FiftyOne dataset

  • Computing insights into your dataset relating to possible mistakes

  • Visualizing the mistakes in the FiftyOne App


This tutorial requires PyTorch to be installed:

# Modify as necessary (e.g., GPU install). See the PyTorch installation instructions for options
!pip install torch
!pip install torchvision

We’ll also need to download a pretrained CIFAR-10 PyTorch model (a ResNet-50) from the web:

# Download the software
!git clone

# Download the pretrained model (90MB)
!eta gdrive download --public \
    1dGfpeFK_QG0kV-U6QDHMX2EOGXPqaNzu \
Cloning into 'PyTorch_CIFAR10'...
remote: Enumerating objects: 551, done.
remote: Total 551 (delta 0), reused 0 (delta 0), pack-reused 551
Receiving objects: 100% (551/551), 6.54 MiB | 3.20 MiB/s, done.
Resolving deltas: 100% (182/182), done.
Downloading '1dGfpeFK_QG0kV-U6QDHMX2EOGXPqaNzu' to 'PyTorch_CIFAR10/cifar10_models/state_dicts/'
 100% |████|  719.8Mb/719.8Mb [36.2s elapsed, 0s remaining, 24.4Mb/s]

Manipulating the data

For this walkthrough, we will artificially introduce label mistakes into an existing dataset. Of course, in your normal workflow, you would not add labeling mistakes; this is only for the sake of the walkthrough.

The code block below loads the test split of the CIFAR-10 dataset into FiftyOne and randomly breaks 10% (1000 samples) of the labels:

import random

import fiftyone as fo
import fiftyone.zoo as foz

# Load the CIFAR-10 test split
# Downloads the dataset from the web if necessary
dataset = foz.load_zoo_dataset("cifar10", split="test")

# Get the CIFAR-10 classes list
info = foz.load_zoo_dataset_info("cifar10")
classes = info.classes

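# Artificially corrupt 10% of the labels
_num_mistakes = int(0.1 * len(dataset))
for sample in dataset.view().take(_num_mistakes):
    # Redraw until the random class differs from the current label
    mistake = random.randint(0, 9)
    while classes[mistake] == sample.ground_truth.label:
        mistake = random.randint(0, 9)

    # Tag the sample so the corrupted labels can be counted later
    sample.tags.append("mistake")
    sample.ground_truth = fo.Classification(label=classes[mistake])
    sample.save()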
Split 'test' already downloaded
Loading 'cifar10' split 'test'
 100% |█████████████████████████| 10000/10000 [2.3s elapsed, 0s remaining, 4.2K samples/s]
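
The loop above redraws a random class until it differs from the current label. The same effect can be had by sampling directly from the other nine classes. The sketch below illustrates this in plain Python, without FiftyOne; the helper name `corrupt_label`, the hard-coded class list, and the randomly generated starting labels are assumptions made for illustration:

```python
import random

# CIFAR-10 class names, hard-coded so the sketch runs without FiftyOne
CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]


def corrupt_label(label, classes=CLASSES, rng=random):
    """Return a class name drawn uniformly from all classes except `label`."""
    candidates = [c for c in classes if c != label]
    return rng.choice(candidates)


rng = random.Random(51)  # fixed seed so the run is reproducible
labels = [rng.choice(CLASSES) for _ in range(1000)]  # stand-in labels

# Pick 10% of the indices without replacement and corrupt those labels
mistake_ids = set(rng.sample(range(len(labels)), k=len(labels) // 10))
corrupted = [
    corrupt_label(label, rng=rng) if i in mistake_ids else label
    for i, label in enumerate(labels)
]

num_changed = sum(a != b for a, b in zip(labels, corrupted))
print("%d of %d labels were changed" % (num_changed, len(labels)))
```

Sampling from the remaining classes needs exactly one draw per label, whereas the rejection loop occasionally needs more; for ten classes the difference is negligible, so either form works.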

Let’s print some information about the dataset to verify the operation that we performed:

# Verify that the `mistake` tag is now in the dataset's schema
print(dataset)
Name:           cifar10-test
Persistent:     False
Num samples:    10000
Tags:           ['mistake', 'test']
Sample fields:
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
# Count the number of samples with the `mistake` tag
num_mistakes = len(dataset.view().match_tag("mistake"))
print("%d ground truth labels are now mistakes" % num_mistakes)
1000 ground truth labels are now mistakes

Add predictions to the dataset

Using an off-the-shelf model, let’s now add predictions to the dataset. We need these predictions to estimate which ground truth labels are likely to be mistakes.

The code block below adds model predictions to another randomly chosen 10% (1000 samples) of the dataset:

import sys

import numpy as np
import torch
import torchvision
from torch.utils.data import DataLoader

import fiftyone.utils.torch as fout

sys.path.insert(1, "PyTorch_CIFAR10")
from cifar10_models import *

def make_cifar10_data_loader(image_paths, sample_ids, batch_size):
    # Per-channel normalization statistics for CIFAR-10
    mean = [0.4914, 0.4822, 0.4465]
    std = [0.2023, 0.1994, 0.2010]
    transforms = torchvision.transforms.Compose(
        [
            # Convert images to tensors before normalizing
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Normalize(mean, std),
        ]
    )
    dataset = fout.TorchImageDataset(
        image_paths, sample_ids=sample_ids, transform=transforms
    )
    return DataLoader(dataset, batch_size=batch_size, num_workers=4)

def predict(model, imgs):
    logits = model(imgs).detach().cpu().numpy()
    predictions = np.argmax(logits, axis=1)
    odds = np.exp(logits)
    confidences = np.max(odds, axis=1) / np.sum(odds, axis=1)
    return predictions, confidences, logits

# Load a model
# Model performance numbers are available at:

model = resnet50(pretrained=True)
model_name = "resnet50"

# Extract a few images to process
# (some of these will have been manipulated above)

num_samples = 1000
batch_size = 20
view = dataset.view().take(num_samples)
image_paths, sample_ids = zip(
    *[(s.filepath, s.id) for s in view.iter_samples()]
)
data_loader = make_cifar10_data_loader(image_paths, sample_ids, batch_size)

# Perform prediction and store results in dataset

for imgs, sample_ids in data_loader:
    predictions, _, logits_ = predict(model, imgs)

    # Add predictions to your FiftyOne dataset
    for sample_id, prediction, logits in zip(sample_ids, predictions, logits_):
        sample = dataset[sample_id]
        # Tag the sample so the processed subset can be selected later
        sample.tags.append("processed")
        sample[model_name] = fo.Classification(
            label=classes[prediction], logits=logits,
        )
        sample.save()
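
The `predict` function above turns logits into confidences with a softmax. Exponentiating logits directly can overflow when they are large; a common remedy is to subtract the maximum logit first, which leaves the softmax unchanged. Here is a minimal sketch in plain Python (the helper name `softmax_confidence` is hypothetical and not part of this tutorial's code):

```python
import math


def softmax_confidence(logits):
    """Return (argmax index, softmax probability of the argmax)."""
    max_logit = max(logits)
    # Shifting by the max keeps every exponent <= 0, so exp() cannot overflow
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    best = max(range(len(logits)), key=lambda i: logits[i])
    return best, exps[best] / total


# Logits copied from the first sample printed later in this tutorial
logits = [-0.83586901, -1.28598607, 1.54965878, -0.49650264, -0.40103185,
          -0.18043809, -1.0332154, 5.05314684, -1.21831954, -1.15143788]
index, confidence = softmax_confidence(logits)
print(index, round(confidence, 3))

# math.exp(1000.0) would overflow, but the shifted version handles it
big_index, big_confidence = softmax_confidence([1000.0, 999.0])
print(big_index, round(big_confidence, 3))
```

The logits in this walkthrough are small, so the direct form in `predict` is fine here; the shift only matters for models that emit logits of large magnitude.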

Let’s print some information about the predictions that were generated and how many of them correspond to samples whose ground truth labels were corrupted:

# Count the number of samples with the `processed` tag
num_processed = len(dataset.view().match_tag("processed"))

# Count the number of samples with both `processed` and `mistake` tags
num_corrupted = len(dataset.view().match_tag("processed").match_tag("mistake"))

print("Added predictions to %d samples" % num_processed)
print("%d of these samples have label mistakes" % num_corrupted)
Added predictions to 1000 samples
94 of these samples have label mistakes
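
As a sanity check on this count: the 1000 processed samples were drawn at random from 10000 samples, 1000 of which carry corrupted labels, so the overlap should follow a hypergeometric distribution. The sketch below computes the expected overlap and its spread using only the numbers from this walkthrough (the variable names are illustrative):

```python
import math

N = 10000  # samples in the CIFAR-10 test split
K = 1000   # samples whose labels were corrupted
n = 1000   # samples that received predictions

# Mean and standard deviation of a hypergeometric distribution
mean = n * K / N
variance = n * (K / N) * (1 - K / N) * (N - n) / (N - 1)
std = math.sqrt(variance)

observed = 94
z = (observed - mean) / std
print("expected %.1f +/- %.1f, observed %d (z = %.2f)" % (mean, std, observed, z))
```

The observed count of 94 lies within one standard deviation of the expected 100, which is consistent with two independent random 10% draws.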

Find the mistakes

Now we can run a method from FiftyOne that estimates the mistakenness of the ground truth labels for the samples for which we generated predictions:

import fiftyone.brain as fob

# Get samples for which we added predictions
h_view = dataset.view().match_tag("processed")

# Compute mistakenness
fob.compute_mistakenness(h_view, model_name, label_field="ground_truth")
Computing mistakenness for 1000 samples...
 100% |███████████████████████████| 1000/1000 [1.3s elapsed, 0s remaining, 808.1 samples/s]
Mistakenness computation complete
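
This tutorial does not spell out how the score is computed, but the intuition is that a label is suspicious when a confident prediction disagrees with it. The sketch below is a hypothetical heuristic in that spirit, not FiftyOne's implementation: confidence is taken as one minus the normalized entropy of the softmax distribution, and the score moves toward 1 on disagreement and toward 0 on agreement. The function name and the formula are assumptions made for illustration.

```python
import math


def toy_mistakenness(logits, predicted_index, ground_truth_index):
    """Hypothetical score in [0, 1]; NOT FiftyOne's actual formula.

    Assumes at least two classes.
    """
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Normalized entropy: 0 for a one-hot distribution, 1 for a uniform one
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    confidence = 1.0 - entropy / math.log(len(probs))

    if predicted_index == ground_truth_index:
        return 0.5 * (1.0 - confidence)  # confident agreement -> near 0
    return 0.5 * (1.0 + confidence)      # confident disagreement -> near 1


confident = [0.0, 0.0, 8.0, 0.0]  # model strongly prefers class 2
uncertain = [0.1, 0.0, 0.2, 0.1]  # model has no clear preference

score_disagree = toy_mistakenness(confident, 2, 0)
score_agree = toy_mistakenness(confident, 2, 2)
score_unsure = toy_mistakenness(uncertain, 2, 0)
print(score_disagree, score_agree, score_unsure)
```

Under this toy scoring, an unsure model yields scores near 0.5 regardless of agreement, so only confident disagreements rise to the top of a ranking.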

The above method added a mistakenness field to all samples for which we added predictions. We can easily sort by likelihood of mistakenness from code:

# Sort by likelihood of mistake (most likely first)
mistake_view = (dataset.view()
    .match_tag("processed")
    .sort_by("mistakenness", reverse=True)
)

# Print some information about the view
print(mistake_view)
Dataset:        cifar10-test
Num samples:    1000
Tags:           ['test', 'processed', 'mistake']
Sample fields:
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    resnet50:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Classification)
    mistakenness: fiftyone.core.fields.FloatField
Pipeline stages:
    1. <fiftyone.core.stages.MatchTag object at 0x7f9cb80dbc50>
    2. <fiftyone.core.stages.SortBy object at 0x7f9d4bbd14e0>
# Inspect the first few samples
for sample in mistake_view.head(3):
    print(sample)
<Sample: {
    'dataset_name': 'cifar10-test',
    'id': '5ef384e36696dbdeabc6a88e',
    'filepath': '/home/voxel51/fiftyone/cifar10/test/data/00107.jpg',
    'tags': BaseList(['test', 'processed']),
    'ground_truth': <Classification: {'label': 'deer'}>,
    'resnet50': <Classification: {
        'label': 'horse',
        'logits': array([-0.83586901, -1.28598607,  1.54965878, -0.49650264, -0.40103185,
               -0.18043809, -1.0332154 ,  5.05314684, -1.21831954, -1.15143788]),
    }>,
    'mistakenness': 1.0,
}>
<Sample: {
    'dataset_name': 'cifar10-test',
    'id': '5ef384e36696dbdeabc6a86f',
    'filepath': '/home/voxel51/fiftyone/cifar10/test/data/00076.jpg',
    'tags': BaseList(['test', 'processed']),
    'ground_truth': <Classification: {'label': 'bird'}>,
    'resnet50': <Classification: {
        'label': 'deer',
        'logits': array([-0.72157425, -0.94043797, -0.32308894, -0.19049911,  4.82478857,
               -0.35608411, -0.35027471, -0.25426134, -0.77823019, -0.91033494]),
    }>,
    'mistakenness': 1.0,
}>
<Sample: {
    'dataset_name': 'cifar10-test',
    'id': '5ef384e36696dbdeabc6a838',
    'filepath': '/home/voxel51/fiftyone/cifar10/test/data/00021.jpg',
    'tags': BaseList(['test', 'mistake', 'processed']),
    'ground_truth': <Classification: {'label': 'frog'}>,
    'resnet50': <Classification: {
        'label': 'deer',
        'logits': array([-0.77428126, -1.11018133,  1.21526551, -0.23978873,  3.74053574,
               -0.37081209,  0.20087151, -0.54353052, -1.05138922, -1.06668639]),
    }>,
    'mistakenness': 1.0,
}>
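
Because the mistakes in this walkthrough were injected deliberately, a ranking like this one can be scored against the known `mistake` tags. The sketch below shows precision-at-k in plain Python; the `(mistakenness, is_mistake)` pairs are made-up illustrative values, not results from this tutorial, and `precision_at_k` is a hypothetical helper:

```python
def precision_at_k(scored, k):
    """Fraction of the k highest-scored items that are true mistakes.

    `scored` is a non-empty list of (mistakenness, is_mistake) pairs.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(1 for _, is_mistake in top if is_mistake) / len(top)


# Made-up scores for illustration only
scored = [
    (0.99, True), (0.97, False), (0.95, True), (0.90, True),
    (0.40, False), (0.30, False), (0.10, False), (0.05, False),
]

p_at_2 = precision_at_k(scored, 2)
p_at_4 = precision_at_k(scored, 4)
print(p_at_2, p_at_4)
```

Highly ranked samples without the `mistake` tag, like the first two printed above, are not necessarily false alarms: they may be model errors, or they may be labeling problems that were already present in the dataset.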

Let’s use the App to visually inspect the results:

# Launch the FiftyOne App
session = fo.launch_app()

# Open your dataset in the App
session.dataset = dataset
App launched


# Show only the samples that were processed
session.view = dataset.view().match_tag("processed")


# Show only the samples for which we added label mistakes
session.view = dataset.view().match_tag("mistake")


# Show the processed samples ranked by mistakenness (most likely mistakes first)
session.view = mistake_view