Welcome to our weekly FiftyOne tips and tricks blog where we give practical pointers for using FiftyOne on topics inspired by discussions in the open source community. This week we’ll cover adding and merging data.
Wait, What’s FiftyOne?
FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.
- If you like what you see on GitHub, give the project a star.
- Get started! We’ve made it easy to get up and running in a few minutes.
- Join the FiftyOne Slack community, we’re always happy to help.
Ok, let’s dive into this week’s tips and tricks!
A primer on adding and merging
Datasets are the core data structure in FiftyOne, allowing you to represent your raw data, labels, and associated metadata. Samples are the atomic elements of a
Dataset that store all the information related to a given piece of data. When you query and manipulate a Dataset object using dataset views, a DatasetView object is returned, which represents a filtered view into a subset of the underlying dataset’s contents.
Many computer vision workflows involve operations that merge data from multiple sources, such as adding new samples to an existing dataset, or merging a model’s predictions into a dataset that contains ground truth labels. In FiftyOne, Dataset and DatasetView objects come with a variety of methods that make these add and merge operations easy.
Continue reading for some tips and tricks to help you master adding and merging data in FiftyOne!
Encountering a sample multiple times
If you want to add a completely new collection of samples, samples, to a dataset, dataset, then you can use the add_samples() and add_collection() methods for the most part interchangeably. However, if there are samples that appear multiple times in your workflows, due to sources of randomness, for instance, then these two methods have different consequences.
When add_samples() encounters samples that are already present in the dataset to which the method is applied, it generates a new sample with a new id and adds it to the dataset. On the other hand, add_collection() ignores the duplicate sample and moves on.
In the code block below, when applied to the Quickstart Dataset with a random collection from the dataset as input,
add_collection() leaves the dataset unchanged, whereas
add_samples() increases the size of the dataset:
```python
import fiftyone as fo
import fiftyone.zoo as foz

# 200 samples
dataset = foz.load_zoo_dataset("quickstart")

# randomly select 50 samples
samples = dataset.take(50)

# doesn’t change dataset size
dataset.add_collection(samples)

# 200 samples → 250 samples
dataset.add_samples(samples)
```
Learn more about FiftyOne’s random utils in the FiftyOne Docs.
Add samples by directory
FiftyOne supports a variety of common computer vision data formats, making it easy to load your data into FiftyOne and accelerating your computer vision workflows. FiftyOne’s
DatasetImporter classes allow you to import data in various formats without needing to write your own loops and I/O scripts.
If you have VOC-style data stored in a single directory on disk, for instance, you can create a dataset from this data using the from_dir() method:
```python
import fiftyone as fo

name = "my-dataset"
data_path = "/path/to/images"
labels_path = "/path/to/voc-labels"

# Import dataset by explicitly providing paths to the source media and labels
dataset = fo.Dataset.from_dir(
    dataset_type=fo.types.VOCDetectionDataset,
    data_path=data_path,
    labels_path=labels_path,
    name=name,
)
```
Using the add_dir() method, you can extend the logic of any existing DatasetImporter to data that is stored in multiple directories. To add train and val data in YOLOv5 format to a single dataset, you can run the following:
```python
import fiftyone as fo

name = "my-dataset"
dataset_dir = "/path/to/yolov5-dataset"

# The splits to load
splits = ["train", "val"]

# Load the dataset, using tags to mark the samples in each split
dataset = fo.Dataset(name)
for split in splits:
    dataset.add_dir(
        dataset_dir=dataset_dir,
        dataset_type=fo.types.YOLOv5Dataset,
        split=split,
        tags=split,
    )
```
This allows you to add the contents of each directory directly to the final dataset without having to instantiate temporary datasets. The merge_dir() method can be similarly useful!
Learn more about loading data into FiftyOne in the FiftyOne Docs.
Add from archive
On a related note, if you have data in a common archived format, such as .tar.gz, stored on disk, you can use the add_archive() and merge_archive() methods to add this data to your dataset. If the archived data has not been unpacked yet, FiftyOne will handle the extraction for you!
Add model predictions
In machine learning workflows, it is common practice to withhold ground truth information at inference time. To accomplish this, it is often beneficial to separate the various fields of your dataset so that only certain subsets of information are available at different steps.
When it comes to evaluating model performance at the end of the day, however, we would like to merge ground truth labels and predictions into a common dataset. In FiftyOne, this is possible with the
merge_samples() method. If we have a
predictions_view containing only predictions, and a
dataset containing all other information, we can merge the predictions into our base dataset as follows:
```python
import fiftyone as fo
import fiftyone.zoo as foz

dataset1 = foz.load_zoo_dataset("quickstart")

# Create a dataset containing only ground truth objects
dataset = dataset1.exclude_fields("predictions").clone()

# Example predictions view
predictions_view = dataset1.select_fields("predictions")

# Merge the predictions
dataset.merge_samples(predictions_view)
```
Export multiple labels with merge_labels()
If you have multiple
Label fields and you want to export your data using a common format, you can use the
merge_labels() method to merge all of these label fields into one field for export.
For instance, if you have three label fields, ground_truth, model1_predictions, and model2_predictions, you can merge all of these labels as follows:
```python
import fiftyone as fo

dataset = fo.load_dataset(...)

# Clone label fields into temporary fields
dataset.clone_sample_field("ground_truth", "tmp")
dataset.clone_sample_field("model1_predictions", "tmp1")
dataset.clone_sample_field("model2_predictions", "tmp2")

# Merge model1 predictions into ground truth
dataset.merge_labels("tmp1", "tmp")

# Merge model2 predictions into ground truth
dataset.merge_labels("tmp2", "tmp")

# Export the merged labels field
dataset.export(..., label_field="tmp")

# Clean up
dataset.delete_sample_fields(["tmp", "tmp1", "tmp2"])
```
If you want to export the data just so that you can import it at a later time, however, then you can avoid all of this and instead make your dataset persistent!
```python
dataset.persistent = True
```
Join the FiftyOne community!
Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!
- 1,350+ FiftyOne Slack members
- 2,500+ stars on GitHub
- 3,100+ Meetup members
- Used by 246+ repositories
- 56+ contributors