Welcome to our weekly FiftyOne tips and tricks blog where we give practical pointers for using FiftyOne on topics inspired by discussions in the open source community. This week we’ll cover adding and merging data.

Wait, What’s FiftyOne?

FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.

If you like what you see on GitHub, give the project a star.
Get started! We’ve made it easy to get up and running in a few minutes.
Join the FiftyOne Slack community. We’re always happy to help.

Ok, let’s dive into this week’s tips and tricks!

A primer on adding and merging

Datasets are the core data structure in FiftyOne, allowing you to represent your raw data, labels, and associated metadata. Samples are the atomic elements of a Dataset that store all the information related to a given piece of data. When you query and manipulate a Dataset object using dataset views, a DatasetView object is returned, which represents a filtered view into a subset of the underlying dataset’s contents.

Many computer vision workflows involve operations that merge data from multiple sources, such as adding new samples to an existing dataset, or merging a model’s predictions into a dataset which contains ground truth labels. In FiftyOne, Dataset and DatasetView objects come with a variety of methods that make performing these add and merge operations easy.

Continue reading for some tips and tricks to help you master adding and merging data in FiftyOne!

Encountering a sample multiple times

If you want to add a completely new collection of samples (call it samples) to a dataset (call it dataset), then you can use the add_samples() and add_collection() methods more or less interchangeably. However, if some samples appear more than once in your workflow, for instance because of a random sampling step, then the two methods behave differently.

When add_samples() encounters samples that are already present in the dataset to which the method is applied, it generates a new sample with a new id, and adds this to the dataset. On the other hand, add_collection() ignores the sample and moves on.

In the code block below, when applied to the Quickstart Dataset with a random collection from the dataset as input, add_collection() leaves the dataset unchanged, whereas add_samples() increases the size of the dataset:
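A minimal sketch of that comparison. The helper names compare_add_methods and run_quickstart_demo are hypothetical, and the demo works on a clone so that the stored Quickstart dataset is not modified; the FiftyOne import is kept inside the demo function so the helper can be read on its own.

```python
def compare_add_methods(dataset, samples):
    """Return the dataset size before, after add_collection(), and after add_samples()."""
    initial = len(dataset)

    # Samples whose ids already exist in the dataset are skipped
    dataset.add_collection(samples)
    after_collection = len(dataset)

    # Samples already in the dataset are copied and added under new ids
    dataset.add_samples(samples)
    after_samples = len(dataset)

    return initial, after_collection, after_samples


def run_quickstart_demo(num_samples=10):
    # Imported lazily so the helper above does not require the dataset zoo
    import fiftyone.zoo as foz

    # Work on a clone so the stored Quickstart dataset is left untouched
    dataset = foz.load_zoo_dataset("quickstart").clone()
    samples = dataset.take(num_samples)  # random view into the dataset
    return compare_add_methods(dataset, samples)
```

If the two methods behave as described above, calling run_quickstart_demo() on the 200-sample Quickstart dataset should report that the size is unchanged after add_collection() and larger by num_samples after add_samples().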

Learn more about FiftyOne’s random utils in the FiftyOne Docs.

Add samples by directory

FiftyOne supports a variety of common computer vision data formats, making it easy to load your data into FiftyOne and accelerating your computer vision workflows. FiftyOne’s DatasetImporter classes allow you to import data in various formats without needing to write your own loops and I/O scripts.

If you have VOC-style data stored in a single directory on disk, for instance, you can create a dataset from this data using the from_dir() method:
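As a rough sketch, assuming the directory follows the layout that FiftyOne’s VOCDetectionDataset type expects. The helper name load_voc_dataset is hypothetical, and any path passed to it is a placeholder for your own data.

```python
def load_voc_dataset(dataset_dir, name=None):
    """Create a new dataset from a single directory of VOC-style data."""
    # Imported inside the helper so the sketch stays self-contained
    import fiftyone as fo

    return fo.Dataset.from_dir(
        dataset_dir=dataset_dir,
        dataset_type=fo.types.VOCDetectionDataset,
        name=name,
    )
```

For example, load_voc_dataset("/path/to/voc-dataset", name="voc-demo") would return a dataset with one sample per image, where both the path and the name are illustrative.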

With the add_dir() method, you can extend the logic of any existing DatasetImporter to data that is stored in multiple directories. To add train and val data in YOLOv5 format to a single dataset, you can run the following:
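One way this could look, assuming a YOLOv5-format directory whose splits are named train and val. The helper name load_yolo_splits is hypothetical, and tagging each sample with its split name is an optional convenience rather than a requirement.

```python
def load_yolo_splits(dataset_dir, name=None, splits=("train", "val")):
    """Add every split of a YOLOv5-format directory to one dataset."""
    # Imported inside the helper so the sketch stays self-contained
    import fiftyone as fo

    dataset = fo.Dataset(name)
    for split in splits:
        # Each call appends the samples of one split to the same dataset
        dataset.add_dir(
            dataset_dir=dataset_dir,
            dataset_type=fo.types.YOLOv5Dataset,
            split=split,
            tags=split,  # tag samples so the splits stay distinguishable
        )
    return dataset
```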

This allows you to add the contents of each directory directly to the final dataset without having to instantiate temporary datasets. The merge_dir() method can be similarly useful!

Learn more about loading data into FiftyOne in the FiftyOne Docs.

Add from archive

On a related note, if you have data in a common archived format, such as .zip, .tar, or .tar.gz, stored on disk, you can use the add_archive() or merge_archive() methods to add this data to your dataset. If the archive has not been unpacked yet, FiftyOne will handle the extraction for you!
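A short sketch of adding several archives to an existing dataset. The helper name add_archives is hypothetical, and the dataset type is passed in by the caller because the right choice depends on how the archived data is formatted.

```python
def add_archives(dataset, archive_paths, dataset_type):
    """Add the contents of each archive to dataset and return the new size."""
    for archive_path in archive_paths:
        # Archives that have not been unpacked yet are extracted first
        dataset.add_archive(archive_path, dataset_type=dataset_type)
    return len(dataset)
```

A call might look like add_archives(dataset, ["/path/to/part1.zip", "/path/to/part2.tar.gz"], fo.types.COCODetectionDataset), where the paths and the COCO format are purely illustrative.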

Learn more about from_archive(), add_archive(), and merge_archive() in the FiftyOne Docs.

Add model predictions

In machine learning workflows, it is common practice to withhold ground truth information at inference time. To accomplish this, it is often beneficial to separate the various fields of your dataset so that only certain subsets of information are available at different steps.

When it comes to evaluating model performance at the end of the day, however, we would like to merge ground truth labels and predictions into a common dataset. In FiftyOne, this is possible with the merge_samples() method. If we have a predictions_view only containing predictions, and a dataset with all other information, we can merge the predictions into our base dataset as follows:
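A minimal sketch of that merge. The helper name merge_predictions is hypothetical, and it assumes that samples in predictions_view and dataset can be matched by their filepath, which is the default key used by merge_samples().

```python
def merge_predictions(dataset, predictions_view, key_field="filepath"):
    """Merge the fields of predictions_view into the matching samples of dataset."""
    # Samples are paired up by key_field; fields found only in
    # predictions_view are added to the matching samples in dataset
    dataset.merge_samples(predictions_view, key_field=key_field)
    return dataset
```

Matching on filepath is a reasonable choice when the predictions were generated from the same media files. If both collections share sample ids, passing key_field="id" may be a better fit.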

Learn more about selecting and excluding fields in the FiftyOne Docs.

Export multiple labels with `merge_labels()`

If you have multiple Label fields and you want to export your data using a common format, you can use the merge_labels() method to merge all of these label fields into one field for export.

For instance, if you have three labels, ground_truth, model1_predictions, and model2_predictions, you can merge all of these labels as follows:
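As a sketch, merge_labels(in_field, out_field) can be called once per input field. The helper name merge_label_fields is hypothetical, and the choice of ground_truth as the destination in the usage note is an assumption for illustration.

```python
def merge_label_fields(dataset, in_fields, out_field):
    """Merge the labels of every field in in_fields into out_field."""
    for in_field in in_fields:
        # Labels from in_field are appended to those already in out_field
        dataset.merge_labels(in_field, out_field)
    return out_field
```

For the three fields above, this could be called as merge_label_fields(dataset.clone(), ["model1_predictions", "model2_predictions"], "ground_truth"). Working on a clone is a cautious choice here because, as documented, merge_labels() removes the input field when it is applied to a dataset. The merged field can then be used as the label field when exporting.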

If you want to export the data just so that you can import it at a later time, however, then you can avoid all of this and instead make your dataset persistent!
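Persistence is controlled by a single property on the dataset. In the sketch below, the helper name make_persistent is hypothetical.

```python
def make_persistent(dataset):
    """Mark dataset as persistent and return its name for later reloading."""
    # Persistent datasets are kept in the database after the session ends
    dataset.persistent = True
    return dataset.name
```

In a later session, the dataset can be reloaded by passing the returned name to fo.load_dataset().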

Learn more about labels and dataset persistence in the FiftyOne Docs.

Join the FiftyOne community!

Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!

1,350+ FiftyOne Slack members
2,500+ stars on GitHub
3,100+ Meetup members
Used by 246+ repositories
56+ contributors
