Welcome to our weekly FiftyOne tips and tricks blog where we give practical pointers for using FiftyOne on topics inspired by discussions in the open source community. This week we’ll cover adding and merging data.
Wait, What’s FiftyOne?
FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.
- If you like what you see on GitHub, give the project a star.
- Get started! We’ve made it easy to get up and running in a few minutes.
- Join the FiftyOne Slack community, we’re always happy to help.
Ok, let’s dive into this week’s tips and tricks!
A primer on adding and merging
Datasets are the core data structure in FiftyOne, allowing you to represent your raw data, labels, and associated metadata. Samples are the atomic elements of a
Dataset that store all the information related to a given piece of data. When you query and manipulate a Dataset object using dataset views, a DatasetView object is returned, which represents a filtered view into a subset of the underlying dataset’s contents.
Many computer vision workflows involve operations that merge data from multiple sources, such as adding new samples to an existing dataset, or merging a model’s predictions into a dataset that contains ground truth labels. In FiftyOne, Dataset and DatasetView objects come with a variety of methods that make these add and merge operations easy.
Continue reading for some tips and tricks to help you master adding and merging data in FiftyOne!
Encountering a sample multiple times
If you want to add a completely new collection of samples, samples, to a dataset, dataset, then you can use the add_samples() and add_collection() methods for the most part interchangeably. However, if there are samples that appear multiple times in your workflows, due to sources of randomness, for instance, then these two methods have different consequences.
When add_samples() encounters samples that are already present in the dataset to which the method is applied, it generates a new sample with a new id and adds it to the dataset. On the other hand, add_collection() ignores the duplicate sample and moves on.
In the code block below, when applied to the Quickstart Dataset with a random collection from the dataset as input,
add_collection() leaves the dataset unchanged, whereas
add_samples() increases the size of the dataset:
```python
import fiftyone as fo
import fiftyone.zoo as foz

# 200 samples
dataset = foz.load_zoo_dataset("quickstart")

# randomly select 50 samples
samples = dataset.take(50)

# doesn’t change dataset size
dataset.add_collection(samples)

# 200 samples → 250 samples
dataset.add_samples(samples)
```
Learn more about FiftyOne’s random utils in the FiftyOne Docs.
Add samples by directory
FiftyOne supports a variety of common computer vision data formats, making it easy to load your data into FiftyOne and accelerating your computer vision workflows. FiftyOne’s
DatasetImporter classes allow you to import data in various formats without needing to write your own loops and I/O scripts.
If you have VOC-style data stored in a single directory on disk, for instance, you can create a dataset from this data using the from_dir() method:
```python
import fiftyone as fo

name = "my-dataset"
data_path = "/path/to/images"
labels_path = "/path/to/voc-labels"

# Import dataset by explicitly providing paths to the source media and labels
dataset = fo.Dataset.from_dir(
    dataset_type=fo.types.VOCDetectionDataset,
    data_path=data_path,
    labels_path=labels_path,
    name=name,
)
```
Using the add_dir() method, you can extend the logic of any existing DatasetImporter to data that is stored in multiple directories. To add train and val data in YOLOv5 format to a single dataset, you can run the following:
```python
import fiftyone as fo

name = "my-dataset"
dataset_dir = "/path/to/yolov5-dataset"

# The splits to load
splits = ["train", "val"]

# Load the dataset, using tags to mark the samples in each split
dataset = fo.Dataset(name)
for split in splits:
    dataset.add_dir(
        dataset_dir=dataset_dir,
        dataset_type=fo.types.YOLOv5Dataset,
        split=split,
        tags=split,
    )
```
This allows you to add the contents of each directory directly to the final dataset without having to instantiate temporary datasets. The merge_dir() method can be similarly useful!
Learn more about loading data into FiftyOne in the FiftyOne Docs.
Add from archive
On a related note, if you have data in a common archived format, such as .tar.gz, stored on disk, you can use the add_archive() and merge_archive() methods to add this data to your dataset. If the archived data has not been unpacked yet, FiftyOne will handle the extraction for you!
Add model predictions
In machine learning workflows, it is common practice to withhold ground truth information at inference time. To accomplish this, it is often beneficial to separate the various fields of your dataset so that only certain subsets of information are available at different steps.
When it comes to evaluating model performance at the end of the day, however, we would like to merge ground truth labels and predictions into a common dataset. In FiftyOne, this is possible with the
merge_samples() method. If we have a
predictions_view containing only predictions, and a
dataset containing all other information, we can merge the predictions into our base dataset as follows:
```python
import fiftyone as fo
import fiftyone.zoo as foz

dataset1 = foz.load_zoo_dataset("quickstart")

# Create a dataset containing only ground truth objects
dataset = dataset1.exclude_fields("predictions").clone()

# Example predictions view
predictions_view = dataset1.select_fields("predictions")

# Merge the predictions
dataset.merge_samples(predictions_view)
```
Export multiple labels with merge_labels()
If you have multiple
Label fields and you want to export your data using a common format, you can use the
merge_labels() method to merge all of these label fields into one field for export.
For instance, if you have three label fields, ground_truth, model1_predictions, and model2_predictions, you can merge all of these labels as follows:
```python
import fiftyone as fo

dataset = fo.load_dataset(...)

# Clone label fields into temporary fields
dataset.clone_sample_field("ground_truth", "tmp")
dataset.clone_sample_field("model1_predictions", "tmp1")
dataset.clone_sample_field("model2_predictions", "tmp2")

# Merge model1 predictions into ground truth
dataset.merge_labels("tmp1", "tmp")

# Merge model2 predictions into ground truth
dataset.merge_labels("tmp2", "tmp")

# Export the merged labels field
dataset.export(..., label_field="tmp")

# Clean up
dataset.delete_sample_fields(["tmp", "tmp1", "tmp2"])
```
If you want to export the data just so that you can import it at a later time, however, then you can avoid all of this and instead make your dataset persistent!
```python
dataset.persistent = True
```
Join the FiftyOne community!
Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!
- 1,350+ FiftyOne Slack members
- 2,500+ stars on GitHub
- 3,100+ Meetup members
- Used by 246+ repositories
- 56+ contributors