Welcome to our weekly FiftyOne tips and tricks blog where we recap interesting questions and answers that have recently popped up on Slack, GitHub, Stack Overflow, and Reddit.
Wait, what’s FiftyOne?
FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.
- If you like what you see on GitHub, give the project a star
- Get started! We’ve made it easy to get up and running in a few minutes
- Join the FiftyOne Slack community, we’re always happy to help
Ok, let’s dive into this week’s tips and tricks!
Using sample tags to get a subset of images by class
Community Slack member Gaurav Savlani asked,
“How do I do something like view.groupby([groupbycolumn]).apply(somefunction) like I would in pandas? For example, if I have 100 classes and I want to subset 10 images from each class, is there functionality for this?”
Although there is a group_by() view stage, it doesn’t currently support filtering the elements of each group. We’d recommend using sample tags to encode the sampling of each group. For example:
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("cifar10", split="test")

# Tag 10 random samples from each class
for label in dataset.distinct("ground_truth.label"):
    view = dataset.match(F("ground_truth.label") == label).take(10)
    view.tag_samples("sample")

view = dataset.match_tags("sample")
print(view.count_values("ground_truth.label"))
# {'airplane': 10, 'ship': 10, ..., 'cat': 10}
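The tag-based loop above implements a per-class random sample. If it helps to see the underlying logic outside of FiftyOne, here is a plain-Python sketch of the same idea (the take_per_class helper and sample data are hypothetical, for illustration only):

```python
import random

def take_per_class(items, label_key, n, seed=0):
    """Randomly keep at most `n` items per class, like match(...).take(n)."""
    rng = random.Random(seed)
    by_class = {}
    for item in items:
        by_class.setdefault(item[label_key], []).append(item)

    sampled = []
    for label, group in by_class.items():
        rng.shuffle(group)
        sampled.extend(group[:n])  # classes with < n items keep them all

    return sampled

items = [{"label": "cat"}] * 15 + [{"label": "dog"}] * 4
subset = take_per_class(items, "label", 10)
counts = {}
for item in subset:
    counts[item["label"]] = counts.get(item["label"], 0) + 1

print(counts)  # {'cat': 10, 'dog': 4}
```

Note that, just like take(10), classes with fewer than 10 samples simply contribute everything they have.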
Persisting datasets and loading them directly
Community Slack member Sidney Guaro asked,
“Whenever I start a new program, do I need to process the dataset again, or can I directly load it using fo.load_dataset()?”
Once you make your dataset persistent by setting dataset.persistent = True, you can load it directly with fo.load_dataset(). Note that dataset.save() is only required when you edit a property like dataset.info in-place. For example:
dataset.info["new_field"] = "new-value"
dataset.save()  # required in order for changes to save

# Saves occur automatically in these cases
dataset.info = {"new-field": "new-value"}
dataset.persistent = True
Learn more about saving changes to your dataset in the FiftyOne Docs.
Exploring and filtering video datasets by a specific label
Community Slack member Adian Loy asked,
“I have a dataset consisting of videos and frame-by-frame classification labels. I want to explore this dataset by filtering on snippets of a specific label. Is there a way for FiftyOne to do that automatically or on the fly?”
Yes, there is! Here are the steps:
Step 1: Load your videos into a dataset and add your frame-level classifications:
# Pseudocode for what this looks like; parsing depends on your raw data format
samples = []
for video_filepath, frame_classifications in my_raw_data:
    sample = fo.Sample(filepath=video_filepath)

    # Add frame classifications
    for frame_number, frame_classification in frame_classifications:
        classification = fo.Classification(label=frame_classification)
        sample.frames[frame_number]["ground_truth"] = fo.Classifications(
            classifications=[classification]
        )

    samples.append(sample)

dataset = fo.Dataset("my-dataset")
dataset.add_samples(samples)
Step 2: Filter the frame labels to create a view for the specific label of interest:
from fiftyone import ViewField as F

filtered_view = dataset.filter_labels(
    "frames.ground_truth", F("label") == "your-label"
)
Step 3: You can then turn this filtered video view into a temporary clips view where every contiguous set of frames that has the classification label you filtered by gets turned into its own clip on the fly:
clips_view = filtered_view.to_clips("frames.ground_truth")
When you visualize clips_view in the App, each clip is shown separately in the grid, and when you click on one, you only see the frames associated with that clip.
session = fo.launch_app(clips_view)
The clip media are only actually extracted from the source videos if and when you export the clips view to disk.
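Under the hood, to_clips() groups the matching frames of each video into contiguous ranges, one clip per range. As a rough illustration of that grouping logic (a toy sketch, not FiftyOne's actual implementation), consider:

```python
def contiguous_clips(frame_numbers):
    """Group frame numbers into (first, last) ranges of contiguous frames."""
    clips = []
    for frame in sorted(frame_numbers):
        if clips and frame == clips[-1][1] + 1:
            clips[-1] = (clips[-1][0], frame)  # extend the current clip
        else:
            clips.append((frame, frame))  # start a new clip

    return clips

# Suppose frames 1-3 and 7-8 of a video contain the label of interest
print(contiguous_clips([1, 2, 3, 7, 8]))  # [(1, 3), (7, 8)]
```

Each resulting range would become its own sample in the clips view, which is why filtering by label first (Step 2) determines exactly which snippets you see.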
Speeding up dataset load times with FiftyOne Brain
Community Slack member Oğuz Hanoğlu asked,
“find_duplicates() takes about 5 min to run with 180k samples. Is it possible to save its results (neighbors_map...) and load them the next time I open the notebook?”
We’d recommend taking advantage of the FiftyOne Brain in this scenario. To start, these are the attributes that are set by find_duplicates():
results._thresh = thresh
results._unique_ids = unique_ids
results._duplicate_ids = duplicate_ids
results._neighbors_map = neighbors_map
So, in your case you’ll want to make use of a brain_key:
# Load results from a previous call to `find_duplicates()`
results = dataset.load_brain_results(brain_key)
results._thresh = thresh
results._unique_ids = unique_ids
results._duplicate_ids = duplicate_ids
results._neighbors_map = neighbors_map
plot = results.visualize_duplicates(...)
plot.show()
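For intuition about what is being cached here: conceptually, find_duplicates() flags samples whose embeddings lie within thresh of a neighbor, and neighbors_map records which kept sample each duplicate was near. A toy pure-Python sketch of that idea (not the Brain's actual implementation, which uses efficient nearest-neighbor indexes) looks like:

```python
import math

def find_duplicates(embeddings, thresh):
    """Naive O(n^2) duplicate scan: a sample within `thresh` of an earlier
    sample is marked as a duplicate of that earlier (kept) sample."""
    duplicate_ids = set()
    neighbors_map = {}
    ids = list(embeddings)
    for i, kept_id in enumerate(ids):
        for other_id in ids[i + 1:]:
            dist = math.dist(embeddings[kept_id], embeddings[other_id])
            if dist <= thresh and other_id not in duplicate_ids:
                duplicate_ids.add(other_id)
                neighbors_map.setdefault(kept_id, []).append((other_id, dist))

    return duplicate_ids, neighbors_map

embeddings = {"s1": [0.0, 0.0], "s2": [0.1, 0.0], "s3": [5.0, 5.0]}
dupes, neighbors = find_duplicates(embeddings, thresh=0.5)
print(dupes)  # {'s2'}
```

This is exactly the kind of expensive pairwise work you avoid recomputing by persisting the results under a brain_key.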
Importing and exporting datasets to the cloud
Community Slack member George Pearse asked,
“Do dataset exports currently support loading to the cloud? e.g. like model checkpoints that just allow you to put a cloud path in there?”
This capability is a feature of the FiftyOne Teams product. The Teams Python SDK fully supports cloud paths everywhere that you would use local paths with the OSS library (for example, when importing and exporting datasets). FiftyOne Teams enables multiple users to securely collaborate on the same datasets and models, either on-premises or in the cloud, all built on top of the open source FiftyOne workflows that you’re already relying on.
Learn more about FiftyOne Teams.
What’s next?
- If you like what you see on GitHub, give the project a star
- Get started! We’ve made it easy to get up and running in a few minutes
- Join the FiftyOne Slack community, we’re always happy to help