Welcome to our weekly FiftyOne tips and tricks blog, where we recap interesting questions and answers that have recently popped up on Slack, GitHub, Stack Overflow, and Reddit.
Wait, what’s FiftyOne?
FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.
Ok, let’s dive into this week’s tips and tricks!
Isolating spurious or missing objects
Community Slack member George Pearse asked,
“Is there a way to just get bounding boxes around the possibly missing and possibly spurious objects in my dataset?”
Here, George is asking how to isolate potential mistakes in the ground truth labels of a dataset. When working with a new dataset, it is always important to validate the quality of the ground truth annotations. Even highly regarded and well-cited datasets can contain a plethora of errors.
Two common types of errors in object detection labels are:
- A ground truth label was spuriously added to the data, and does not correspond to an object in the allowed object classes
- An object is not annotated, so the ground truth detection is missing
Fortunately, the FiftyOne Brain provides a built-in method that identifies possible spurious and missing detections. These are stored at both the sample level and the detection level.
With FiftyOne’s filtering capabilities, it is easy to create a view containing only the detections that are possibly spurious, possibly missing, or both. In these cases, you might also find it helpful to convert the filtered view to a `PatchesView` so you can view each potential error on its own. Here is some code to get you started:
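As a minimal sketch, assume a detection dataset whose ground truth lives in `ground_truth` and whose model predictions live in `predictions`. Both field names are assumptions, as is the helper name `build_error_patches`. Where the `possible_missing` flag ends up can depend on your FiftyOne Brain version and its `copy_missing` option, so the sketch lets you choose which field to filter:

```python
def build_error_patches(
    dataset,
    pred_field="predictions",
    gt_field="ground_truth",
    missing_in=None,
):
    """Return (spurious_patches, missing_patches) for a detection dataset."""
    # Imported lazily so this sketch can be loaded without FiftyOne installed
    import fiftyone.brain as fob
    from fiftyone import ViewField as F

    # Flags possibly spurious and possibly missing objects
    fob.compute_mistakenness(dataset, pred_field, label_field=gt_field)

    # Ground truth objects flagged as possibly spurious
    spurious_view = dataset.filter_labels(
        gt_field, F("possible_spurious") == True
    )

    # Assumption: missing objects are flagged on the predictions field unless
    # they were copied into the ground truth field; pass `missing_in` to override
    missing_field = missing_in or pred_field
    missing_view = dataset.filter_labels(
        missing_field, F("possible_missing") == True
    )

    # One patch per flagged object
    return (
        spurious_view.to_patches(gt_field),
        missing_view.to_patches(missing_field),
    )
```

Either returned view can be passed to `fo.launch_app(view=...)` for inspection.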
We can then view these in the FiftyOne App and inspect, for instance, the possibly spurious detection patches.
Filtering by ID
Community Slack member Sylvia Schmitt asked,
“I am storing related sample IDs as `StringField` objects in a separate field on my data and I want to use them to match sample IDs that are stored as `ObjectIdField` objects. How do I do this?”
If you were comparing the values in two `StringField`s, you could use a `ViewField` expression as follows:
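A minimal sketch, assuming the two string fields are passed in by name (the helper `match_equal_strings` and any field names you give it are hypothetical):

```python
def match_equal_strings(dataset, field_a, field_b):
    """Return a view of the samples whose two string fields hold equal values."""
    # Imported lazily so this sketch can be loaded without FiftyOne installed
    from fiftyone import ViewField as F

    # Compare the two fields sample by sample
    return dataset.match(F(field_a) == F(field_b))
```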
However, sample IDs are represented as `ObjectIdField` objects. They are stored under an `_id` key in the underlying database, and need to be referenced with this same syntax, prepending an underscore. Additionally, the object needs to be converted to a string for the comparison.
Here is what such a matching operation might look like:
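The sketch below assumes the related IDs are stored in a string field called `related_id` (a hypothetical name, as is the helper) and relies on the `to_string()` method of view expressions to convert the `ObjectId`:

```python
def match_related_to_own_id(dataset, related_field="related_id"):
    """Return samples whose stored related ID equals their own sample ID."""
    # Imported lazily so this sketch can be loaded without FiftyOne installed
    from fiftyone import ViewField as F

    # `_id` is the database key for the sample ID; convert it to a string
    # before comparing it against the StringField
    return dataset.match(F(related_field) == F("_id").to_string())
```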
Merging datasets with shared media files
Community Slack member Joy Timmermans asked,
“I have three datasets, and some of my samples are in multiple datasets. I’d like to combine all of these datasets into one dataset for export, retaining each copy of each of the samples. How do I do this?”
If your datasets were created independently, then even samples that share media files (located at the same file paths) will have different sample IDs. In this case, you can create a combined dataset with the `add_collection()` method without passing in any optional arguments.
If, on the other hand, your datasets have samples with the same sample IDs, then applying `add_collection()` without options will result in a “combined” dataset that has only a single copy of each media file. Fortunately, you can bypass this by passing `new_ids=True` to `add_collection()`. In your case, combining three datasets would look like:
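A minimal sketch, where the helper name `combine_keeping_copies` is our own and `combined` is assumed to be an empty destination dataset:

```python
def combine_keeping_copies(combined, datasets):
    """Add every sample of each dataset to `combined`, keeping all copies."""
    for dataset in datasets:
        # new_ids=True generates fresh sample IDs, so samples that share an
        # ID across the source datasets are all retained
        combined.add_collection(dataset, new_ids=True)

    return combined
```

With three existing datasets, this could be called as `combine_keeping_copies(fo.Dataset("combined"), [ds1, ds2, ds3])`, where the dataset names are placeholders.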
Exporting GeoJSON annotations
Community Slack member Kais Bedioui asked,
“I am logging some of my production data in GeoJSON format, and I want to save it in the database in that same format. Is there a way to include the `ground_truth` label in the `labels.json` file so that when I reload the GeoJSON dataset, it comes with its annotations?”
To do this, you can use the optional `property_makers` argument of the GeoJSON exporter to include additional properties directly in GeoJSON format. For example:
Alternatively, if you want to save the annotations but do not need to save everything in GeoJSON format, you can export it as a `FiftyOneDataset`:
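A minimal sketch, where the helper name and the export directory you pass in are placeholders:

```python
def export_as_fiftyone_dataset(dataset, export_dir):
    """Write the dataset, including all of its label fields, to `export_dir`."""
    # Imported lazily so this sketch can be loaded without FiftyOne installed
    import fiftyone as fo

    dataset.export(
        export_dir=export_dir,
        dataset_type=fo.types.FiftyOneDataset,
    )
```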
When you take this approach, all you have to do to load the dataset back in is use FiftyOne’s `Dataset.from_dir()` method:
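For example, assuming `export_dir` is the directory that was written during export (the helper name is ours):

```python
def load_fiftyone_dataset(export_dir, name=None):
    """Load a dataset previously exported in FiftyOneDataset format."""
    # Imported lazily so this sketch can be loaded without FiftyOne installed
    import fiftyone as fo

    return fo.Dataset.from_dir(
        dataset_dir=export_dir,
        dataset_type=fo.types.FiftyOneDataset,
        name=name,
    )
```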
Picking random frames from videos
Community Slack member Joy Timmermans asked,
“Is there an equivalent of `take()` for frames in a video dataset so that I can randomly select a subset of frames for each sample?”
One way to accomplish this would be to use a Python library for random sampling in conjunction with the `select_frames()` method. First, you can use random sampling without replacement to pick a set of frame numbers for each video. Then, you can get the frame `id` for each of these frames. Finally, you can pass this list of `id`s into `select_frames()`.
Here’s one implementation using numpy’s random choice method:
If you’d like, at this point you can also convert the videos to frames:
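A minimal sketch of that conversion, assuming `video_view` is a view produced by `select_frames()`. Here `sample_frames=True` asks FiftyOne to extract frame images to disk, and `sparse=True` is intended to limit this to frames that exist in the view. Check both options against the `to_frames()` documentation for your version:

```python
def frames_from_selection(video_view):
    """Convert a video view into a view with one sample per frame."""
    return video_view.to_frames(sample_frames=True, sparse=True)
```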
Join the FiftyOne community!
Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!