A guide to using the integration between FiftyOne and Labelbox to build high-quality image and video datasets
Modern computer vision projects in the deep learning age always start with the same thing. LOTS OF DATA! For just about any task, you can find countless models with open-source code ready for you to train. The only thing you need for your specific task is a sufficiently large, labeled dataset.
In this post, we show how to use the integration between FiftyOne, the open-source dataset curation and model analysis tool, and Labelbox, the widely popular annotation tool, to build a high-quality dataset for computer vision.
Follow along in Colab
You can follow along with the examples in this post directly in your browser through this Google Colab notebook!
Setup
To start, you need to install FiftyOne and Labelbox.
pip install fiftyone labelbox
You also need to set up a Labelbox account. FiftyOne supports both standard Labelbox cloud accounts and Labelbox enterprise on-premises solutions.
The easiest way to get started is to use the default Labelbox server, which simply requires creating an account and then providing your API key as shown below.
export FIFTYONE_LABELBOX_API_KEY=...
Alternatively, for a more permanent solution, you can store your credentials in your FiftyOne annotation config, located at ~/.fiftyone/annotation_config.json:
{ "backends": { "labelbox": { "api_key": ..., } } }
Raw Data
To start, you need to gather raw image or video data relevant to your task. The internet has a lot of places to look for free data. Assuming you have your raw data downloaded locally, you can easily load it into FiftyOne.
import fiftyone as fo

dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/dir",
    dataset_type=fo.types.ImageDirectory,
)
Another method is to use publicly available datasets that may be relevant. For example, the Open Images dataset contains millions of images available for public use and can be accessed directly through the FiftyOne Dataset Zoo.
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "open-images-v6",
    split="validation",
    classes="Person",
    max_samples=500,
)
Either way, once your data is in FiftyOne, you can visualize it in the FiftyOne App.
import fiftyone as fo

session = fo.launch_app(dataset)
FiftyOne provides a variety of methods that can help you understand the quality of your dataset and pick the best samples to annotate. For example, the compute_similarity() method can be used to find both the most similar and the most unique samples, ensuring that your dataset contains an even distribution of data.
import fiftyone.brain as fob

results = fob.compute_similarity(dataset, brain_key="img_sim")
results.find_unique(10)
Now, let's select only the slice of our dataset that contains the 10 most unique samples.
unique_view = dataset.select(results.unique_ids)
session.view = unique_view
Annotation
The integration between FiftyOne and Labelbox allows you to begin annotating your image or video data by calling a single method!
anno_key = "annotation_run_1" classes = ["vehicle", "animal", "plant"] unique_view.annotate( anno_key, backend="labelbox", label_field="detections", classes=classes, label_type="detections", )
The annotations can then be loaded back into FiftyOne in just one more line.
unique_view.load_annotations(anno_key)
This API provides advanced customization options for your annotation tasks. For example, we can construct a sophisticated schema to define the annotations we want and even directly assign the annotators:
anno_key = "labelbox_assign_users" members = [ ("fiftyone_labelbox_user1@gmail.com", "LABELER"), ("fiftyone_labelbox_user2@gmail.com", "REVIEWER"), ("fiftyone_labelbox_user3@gmail.com", "TEAM_MANAGER"), ] # Set up the Labelbox editor to reannotate # existing "detections" labels and # a new "keypoints" field label_schema = { "detections_new": { "type": "detections", "classes": dataset.distinct("detections.detections.label"), }, "keypoints": { "type": "keypoints", "classes": ["Person"], } } unique_view.annotate( anno_key, backend="labelbox", label_schema=label_schema, members=members, launch_editor=True, ) # Annotate in Labelbox # Download results and clean the run from FiftyOne and Labelbox unique_view.load_annotations(anno_key, cleanup=True)
Next Steps
Now that you have a labeled dataset, you can go ahead and start training a model. FiftyOne lets you export your data to disk in a variety of formats (e.g., COCO, YOLO, etc.) expected by most training pipelines. It also provides workflows for using popular model training libraries like PyTorch, PyTorch Lightning Flash, and TensorFlow.
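For example, a COCO export is a single call. Below is a minimal sketch, assuming your labels are stored in a "detections" field; the export directory is a placeholder:

# Export images and their "detections" labels in COCO format
dataset.export(
    export_dir="/path/to/export",
    dataset_type=fo.types.COCODetectionDataset,
    label_field="detections",
)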
Once the model is trained, its predictions can be loaded back into FiftyOne. These predictions can then be evaluated against the ground truth annotations to find where the model performs well and where it performs poorly. This provides insight into the types of samples that need to be added to the training set, as well as any annotation errors that may exist.
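One way to add predictions is with apply_model(). The sketch below uses a pretrained detector from the FiftyOne Model Zoo as a stand-in for your own trained model:

# Sketch: populate a "predictions" field using a pretrained zoo model
model = foz.load_zoo_model("faster-rcnn-resnet50-fpn-coco-torch")
dataset.apply_model(model, label_field="predictions")

For this post, we will simply load the quickstart dataset from the zoo, which already contains predictions: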
# Load an existing dataset with predictions
dataset = foz.load_zoo_dataset("quickstart")

# Evaluate model predictions
dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval",
)
We can use the powerful querying capabilities of the FiftyOne API to create a view containing only high-confidence false positives, which often indicate errors in the ground truth annotations.
from fiftyone import ViewField as F

# High-confidence false positives often indicate missing ground truth
fp_view = dataset.filter_labels(
    "predictions",
    (F("confidence") > 0.8) & (F("eval") == "fp"),
)

session = fo.launch_app(view=fp_view)
This sample appears to be missing a ground truth annotation of skis. Let's tag it in FiftyOne and send it to Labelbox for reannotation.
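You can tag samples directly in the FiftyOne App, or programmatically. A minimal sketch of tagging the first sample in the false positive view from above:

# Tag the sample so it can be gathered up for reannotation
sample = fp_view.first()
sample.tags.append("reannotate")
sample.save()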
view = dataset.match_tags("reannotate")

anno_key = "fix_labels"
label_schema = {
    "ground_truth_edits": {
        "type": "detections",
        "classes": dataset.distinct("ground_truth.detections.label"),
    }
}

view.annotate(
    anno_key,
    label_schema=label_schema,
    backend="labelbox",
)
view.load_annotations(anno_key, cleanup=True)
view.merge_labels("ground_truth_edits", "ground_truth")
Iterating over this process of training a model, evaluating its failure modes, and improving the dataset is the most surefire way to produce high-quality datasets and subsequently high-performing models.
Additional Utilities
You can perform additional Labelbox-specific operations to monitor the progress of an annotation project initiated through this integration with FiftyOne.
For example, you can view the status of an existing project:
results = dataset.load_annotation_results(anno_key)
results.print_status()
Project: FiftyOne_quickstart
ID: cktixtv70e8zm0yba501v0ltz
Created at: 2021-09-13 17:46:21+00:00
Updated at: 2021-09-13 17:46:24+00:00

Members:
    User: user1
    Role: Admin
    ID: ckl137jfiss1c07320dacd81l
    Nickname: user1
    Email: USER1_EMAIL@email.com

    User: user2
    Role: Labeler
    Name: FIRSTNAME LASTNAME
    ID: ckl137jfiss1c07320dacd82y
    Email: USER2_EMAIL@email.com

Reviews:
    Positive: 2
    Zero: 0
    Negative: 1
You can also delete projects associated with an annotation run directly through the FiftyOne API.
results = dataset.load_annotation_results(anno_key)
api = results.connect_to_api()

print(results.project_id)
# "bktes8fl60p4s0yba11npdjwm"

api.delete_project(results.project_id, delete_datasets=True)

# OR

api.delete_projects([results.project_id], delete_datasets=True)

# List all projects or datasets associated with your Labelbox account
project_ids = api.list_projects()
dataset_ids = api.list_datasets()

# Delete all projects and datasets from your Labelbox account
api.delete_projects(project_ids)
api.delete_datasets(dataset_ids)
Summary
No matter what computer vision project you are working on, you will need a dataset. FiftyOne makes it easy to curate and dig into your dataset to understand all aspects of it, including what needs to be annotated or reannotated. In turn, the integration with Labelbox makes the annotation process a breeze, resulting in datasets that lead to higher-quality models.