How to Curate, Annotate, and Improve Computer Vision Datasets with FiftyOne and Labelbox

A guide to using the integration between FiftyOne and Labelbox to build high-quality image and video datasets

Modern computer vision projects in the deep learning age always start with the same thing: lots of data! For just about any task, you can find countless models with open-source code ready for you to train. The only thing you need for your specific task is a sufficiently large, labeled dataset.

In this post, we show how to use the integration between FiftyOne, the open-source dataset curation and model analysis tool, and Labelbox, the widely popular annotation tool, to build a high-quality dataset for computer vision.

Follow along in Colab

You can follow along with the examples in this post directly in your browser through this Google Colab notebook!

Setup

To start, you need to install FiftyOne and Labelbox.

pip install fiftyone labelbox

You also need to set up a Labelbox account. FiftyOne supports both standard Labelbox cloud accounts as well as Labelbox enterprise on-premises solutions.

The easiest way to get started is to use the default Labelbox server, which simply requires creating an account and then providing your API key as shown below.

export FIFTYONE_LABELBOX_API_KEY=...

Alternatively, for a more permanent solution, you can store your credentials in your FiftyOne annotation config located at ~/.fiftyone/annotation_config.json:

{
    "backends": {
        "labelbox": {
            "api_key": "..."
        }
    }
}

Raw Data

First, you need to gather raw image or video data relevant to your task. The internet offers plenty of places to find free data. Assuming you have your raw data downloaded locally, you can easily load it into FiftyOne.

import fiftyone as fo

dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/dir",
    dataset_type=fo.types.ImageDirectory,
)
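
If your raw data consists of videos rather than images, the same pattern applies; a minimal sketch, with a placeholder path:

dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/videos",
    dataset_type=fo.types.VideoDirectory,
)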

Another option is to use publicly available datasets that may be relevant to your task. For example, the Open Images dataset contains millions of images available for public use and can be accessed directly through the FiftyOne Dataset Zoo.

import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "open-images-v6",
    split="validation",
    classes="Person",
    max_samples=500,
)

Either way, once your data is in FiftyOne, you can visualize it in the FiftyOne App.

import fiftyone as fo

session = fo.launch_app(dataset)
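
If you are running this code in a script rather than a notebook, you can block execution so the App stays open; a minimal sketch:

# Keep the App open until you close it (useful outside of notebooks)
session.wait()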

FiftyOne provides a variety of methods that can help you understand the quality of your dataset and pick the best samples to annotate. For example, the compute_similarity() method can be used to find both the most similar and the most unique samples, ensuring that your dataset contains an even distribution of data.

import fiftyone.brain as fob

results = fob.compute_similarity(dataset, brain_key="img_sim")
results.find_unique(10)
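
The same brain results can also power similarity search. As a rough sketch, you could sort the dataset by similarity to a query sample (here, simply the first sample in the dataset, for illustration):

# Sort the dataset by visual similarity to a query sample
query_id = dataset.first().id
similar_view = dataset.sort_by_similarity(query_id, k=25, brain_key="img_sim")
session.view = similar_view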

Now we can select only the slice of the dataset that contains the 10 most unique samples and visualize it in the App.

unique_view = dataset.select(results.unique_ids)
session.view = unique_view

Annotation

The integration between FiftyOne and Labelbox allows you to begin annotating your image or video data by calling a single method!

anno_key = "annotation_run_1"
classes = ["vehicle", "animal", "plant"]
unique_view.annotate(
    anno_key,
    backend="labelbox",
    label_field="detections",
    classes=classes,
    label_type="detections",
)

The annotations can then be loaded back into FiftyOne in just one more line.

unique_view.load_annotations(anno_key)

This API provides advanced customization options for your annotation tasks. For example, we can construct a sophisticated schema to define the annotations we want and even directly assign the annotators:

anno_key = "labelbox_assign_users"

members = [
    ("fiftyone_labelbox_user1@gmail.com", "LABELER"),
    ("fiftyone_labelbox_user2@gmail.com", "REVIEWER"),
    ("fiftyone_labelbox_user3@gmail.com", "TEAM_MANAGER"),
]

# Set up the Labelbox editor to annotate a new
# "detections_new" field using the classes from the
# existing "detections" labels, plus a new "keypoints" field
label_schema = {
    "detections_new": {
        "type": "detections",
        "classes": dataset.distinct("detections.detections.label"),
    },
    "keypoints": {
        "type": "keypoints",
        "classes": ["Person"],
    }
}

unique_view.annotate(
    anno_key,
    backend="labelbox",
    label_schema=label_schema,
    members=members,
    launch_editor=True,
)
# Annotate in Labelbox
# Download results and clean the run from FiftyOne and Labelbox
unique_view.load_annotations(anno_key, cleanup=True)

Next Steps

Now that you have a labeled dataset, you can go ahead and start training a model. FiftyOne lets you export your data to disk in a variety of formats (e.g., COCO, YOLO) expected by most training pipelines. It also provides workflows for using popular model training libraries like PyTorch, PyTorch Lightning Flash, and TensorFlow.
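
For example, here is a minimal sketch of exporting the labeled samples in COCO format (the export directory is a placeholder, and the label field is assumed to be the "detections" field annotated above):

# Export the dataset in COCO format for use in a training pipeline
dataset.export(
    export_dir="/path/to/export",
    dataset_type=fo.types.COCODetectionDataset,
    label_field="detections",
)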

Once the model is trained, the model predictions can be loaded back into FiftyOne. These predictions can then be evaluated against the ground truth annotations to find where the model is performing well, and where it is performing poorly. This provides insight into the type of samples that need to be added to the training set, as well as any annotation errors that may exist.

# Load an existing dataset with predictions
dataset = foz.load_zoo_dataset("quickstart")

# Evaluate model predictions
dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval",
)
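
If you are working with your own dataset rather than the quickstart example, one way to populate a predictions field is to apply a model from the FiftyOne Model Zoo; a minimal sketch (the model name here is just an illustrative choice):

# Apply a zoo detection model and store its outputs in a "predictions" field
model = foz.load_zoo_model("faster-rcnn-resnet50-fpn-coco-torch")
dataset.apply_model(model, label_field="predictions")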

We can use the powerful querying capabilities of the FiftyOne API to create a view that filters these model results for high-confidence false positives, which generally indicate an error in the ground truth annotation.

from fiftyone import ViewField as F

fp_view = dataset.filter_labels(
    "predictions",
    (F("confidence") > 0.8) & (F("eval") == "fp"),
)
session = fo.launch_app(view=fp_view)

One of these samples appears to be missing a ground truth annotation of skis. Let’s tag it in FiftyOne and send it to Labelbox for reannotation.
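
Tags can be added in the App or programmatically; a minimal sketch that tags the first sample in the false positive view:

# Tag a sample so it can be pulled into a reannotation run
sample = fp_view.first()
sample.tags.append("reannotate")
sample.save()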

view = dataset.match_tags("reannotate")
anno_key = "fix_labels"
label_schema = {
    "ground_truth_edits": {
        "type": "detections",
        "classes": dataset.distinct("ground_truth.detections.label"),
    }
}
view.annotate(
    anno_key,
    label_schema=label_schema,
    backend="labelbox",
)
view.load_annotations(anno_key, cleanup=True)
view.merge_labels("ground_truth_edits", "ground_truth")

Iterating over this process of training a model, evaluating its failure modes, and improving the dataset is the most surefire way to produce high-quality datasets and subsequently high-performing models.

Additional Utilities

You can perform additional Labelbox-specific operations to monitor the progress of an annotation project initiated through this integration with FiftyOne.

For example, you can view the status of an existing project:

results = dataset.load_annotation_results(anno_key)
results.print_status()
Project: FiftyOne_quickstart
        ID: cktixtv70e8zm0yba501v0ltz
        Created at: 2021-09-13 17:46:21+00:00
        Updated at: 2021-09-13 17:46:24+00:00
        Members:

                User: user1
                    Role: Admin
                    ID: ckl137jfiss1c07320dacd81l
                    Nickname: user1
                    Email: USER1_EMAIL@email.com

                User: user2
                    Role: Labeler
                    Name: FIRSTNAME LASTNAME
                    ID: ckl137jfiss1c07320dacd82y
                    Email: USER2_EMAIL@email.com

        Reviews:
                Positive: 2
                Zero: 0
                Negative: 1

You can also delete projects associated with an annotation run directly through the FiftyOne API.

results = dataset.load_annotation_results(anno_key)
api = results.connect_to_api()

print(results.project_id)
# "bktes8fl60p4s0yba11npdjwm"

api.delete_project(results.project_id, delete_datasets=True)

# OR

api.delete_projects([results.project_id], delete_datasets=True)

# List all projects and datasets associated with your Labelbox account
project_ids = api.list_projects()
dataset_ids = api.list_datasets()

# Delete all projects and datasets from your Labelbox account
api.delete_projects(project_ids)
api.delete_datasets(dataset_ids)

Summary

No matter what computer vision project you are working on, you will need a dataset. FiftyOne makes it easy to curate and dig into your dataset to understand all aspects of it, including what needs to be annotated or reannotated. In turn, the integration with Labelbox makes the annotation process a breeze, resulting in datasets that lead to higher-quality models.