
Finding and Correcting Mistakes – FiftyOne Tips and Tricks – Aug 18, 2023

Welcome to our weekly FiftyOne tips and tricks blog where we cover interesting workflows and features of FiftyOne!

Wait, what’s FiftyOne?

FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.

Ok, let’s dive into this week’s tips and tricks! Also feel free to follow along in our notebook or on YouTube!

Finding and Removing Duplicate Images

Typical image datasets can contain tens of thousands, if not millions, of images, and it is not uncommon to find duplicates hidden among the masses. These duplicated images can harm the training of any model on the data and can have serious consequences if not corrected.

Leveraging FiftyOne, we can use built-in functionality, courtesy of the FiftyOne Brain, to find these duplicate images and remove them from our dataset!

import fiftyone as fo
import fiftyone.zoo as foz

# Load the CIFAR-10 test split
# Downloads the dataset from the web if necessary
dataset = foz.load_zoo_dataset("cifar10", split="test")

session = fo.launch_app(dataset)

We are able to load in our data and take a look at it in the FiftyOne App. At first glance, the data looks normal. But with 10,000 images to look through, finding a duplicate can quickly turn into an all-day affair. Luckily, with the FiftyOne Brain, we are able to find duplicates in an instant with compute_uniqueness()!

import fiftyone.brain as fob

fob.compute_uniqueness(dataset)

# Sort in increasing order of uniqueness (least unique first)
dups_view = dataset.sort_by("uniqueness")

# Open view in the App
session.view = dups_view

Here we are able to see duplicate images in CIFAR-10!

Next, we can click on each of these images individually and select them using the checkbox in the top-left corner of the sample. Once all of your duplicate images are selected, they can be tagged with the following code. Make sure to leave one original copy of each image unselected!

# Get currently selected images from App
dup_ids = session.selected

# Mark as duplicates
dups_view = dataset.select(dup_ids)
dups_view.tag_samples("dups")

# Visualize duplicates-only in App
session.view = dups_view

After executing the code above, we are able to tell that the duplicates have been properly tagged. Once you are confident that there are no more duplicates in your dataset, you can create a clean view that is ready for training. You can even export this view as a new and improved version of your dataset for future use.

from fiftyone import ViewField as F

clean_view = dataset.sort_by("uniqueness").match_tags("dups", bool=False)

export_dir = "/path/for/image-classification-dir-tree"

label_field = "ground_truth"  # for example

# Export the dataset
clean_view.export(
    export_dir=export_dir,
    dataset_type=fo.types.ImageClassificationDirectoryTree,
    label_field=label_field,
)

Finding Classification Mistakes

Another prevalent form of annotation mistake is an incorrect classification label on a sample. It can happen when the picture of your dog is labeled cat, or vice versa, and it can really impact the learning capabilities of your models. FiftyOne has built-in functionality, powered by the FiftyOne Brain, to catch these mistakes and help you fix them.

In the following example, we will take a look at CIFAR-10 again and purposely corrupt our dataset with incorrect labels. For a quick way to start this example, follow along in the linked notebook or check out the docs; a rough sketch of the label-corruption step is also shown below.
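This is only a minimal illustration (the ~5% corruption rate and the seed are arbitrary choices, and the notebook covers training a model and adding its predictions with logits):

import random

import fiftyone.zoo as foz

# Load CIFAR-10 and grab the list of classes
dataset = foz.load_zoo_dataset("cifar10", split="test")
classes = dataset.distinct("ground_truth.label")

# Flip the ground truth label on a random ~5% of the samples
num_mistakes = int(0.05 * len(dataset))
for sample in dataset.take(num_mistakes, seed=51):
    wrong = random.choice([c for c in classes if c != sample.ground_truth.label])
    sample.tags.append("mistake")  # remember which samples we corrupted
    sample.ground_truth.label = wrong
    sample.save()

Once you have corrupted your labels and had a trained model add its predictions (and logits) to your dataset, we can begin. Let's kick it off by using the FiftyOne Brain compute_mistakenness()!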

import fiftyone.brain as fob

# Get samples for which we added predictions
h_view = dataset.match_tags("processed")

# Compute mistakenness, where model_name is the name of the field
# in which the model's predictions (and logits) were stored
fob.compute_mistakenness(h_view, model_name, label_field="ground_truth", use_logits=True)

# Sort by likelihood of mistake (most likely first)
mistake_view = (dataset
    .match_tags("processed")
    .sort_by("mistakenness", reverse=True)
)

# Inspect the samples most likely to contain label mistakes in the App
session.view = mistake_view

After running our Brain function, we are able to surface the mislabeled images in our dataset. With the mistakes found, we can tag and remove them as we did previously, as sketched below. Alternatively, the tagged files can be deleted from disk or sent back for reannotation.
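As a quick illustration (the "mistake" tag name here is an arbitrary choice), tagging the samples selected in the App and then excluding or deleting them might look like this:

# Tag the samples currently selected in the App
mistake_ids = session.selected
dataset.select(mistake_ids).tag_samples("mistake")

# Option 1: build a training view that excludes the tagged samples
train_view = dataset.match_tags("mistake", bool=False)

# Option 2: remove the tagged samples from the dataset entirely
# dataset.delete_samples(dataset.match_tags("mistake"))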

Finding and Correcting Detection Mistakes

Detection mistakes can be especially difficult to find by hand. The problem can rapidly expand from checking 1,000 images to checking 10,000 labels to make sure every box is perfect. Misplaced or duplicate boxes are another heavy detriment to the training of detection models. FiftyOne can step in and take a load off when finding these mistakes in your dataset. Take a look below or in the docs to see how:

dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",
    max_samples=1000,
    overwrite=True,
    dataset_name="Find Mistakes",
)

import fiftyone.utils.iou as foui

# Compute the maximum IoU between each box and other boxes of the same class
foui.compute_max_ious(dataset, "ground_truth", iou_attr="max_iou", classwise=True)

print("Max IoU range: (%f, %f)" % dataset.bounds("ground_truth.detections.max_iou"))

# Retrieve detections that overlap above a chosen threshold
dups_view = dataset.filter_labels("ground_truth", F("max_iou") > 0.75)

session.view = dups_view

In a few lines, we were able to take 1,000 samples and whittle them down to 7 potential candidates for mistakes. Some are just two very close boxes, like our first image of two baseball players; others are true mistakes, as is the case with the man at the beach. To fix a mistake, we can open up the sample and tag the incorrect bounding box as a duplicate, or tag the duplicates programmatically as sketched below.
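If clicking through samples is impractical, the same IoU utilities also provide a find_duplicates() helper; a rough sketch of tagging the duplicate boxes programmatically (the 0.75 threshold is just an illustrative choice) might look like this:

# Find the IDs of duplicate boxes (same-class pairs with IoU above the threshold)
dup_label_ids = foui.find_duplicates(
    dataset, "ground_truth", iou_thresh=0.75, classwise=True
)

# Tag those labels so they can be reviewed, deleted, or sent back for annotation
dataset.select_labels(ids=dup_label_ids).tag_labels("dups")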

After a label has been tagged as a duplicate, it can be removed from the sample entirely with dataset.delete_labels(tags="dups"), fixing the mistake. Alternatively, you could send the tagged samples for reannotation with one of FiftyOne's native annotation integrations. One such integration is CVAT, and kicking off an annotation run takes only a couple of lines:

anno_key = "remove_dups"

dups_view.annotate(anno_key, label_field="ground_truth", launch_editor=True)
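
As a rough sketch of the rest of the round trip (assuming a CVAT backend is configured as described in the annotation docs), the corrected labels can be merged back into the dataset once annotation is finished:

# After editing in CVAT, merge the corrected labels back into the dataset
dataset.load_annotations(anno_key)

# Optionally clean up the record of the annotation run when you are done
dataset.delete_annotation_run(anno_key)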

Hopefully these tips help you find and correct mistakes in your data so you can create the best models possible. Good luck!

To learn more about fields, samples, and other FiftyOne features, head over to our User Guide!

Join the FiftyOne Community!

Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!