Understanding Grouped Datasets – FiftyOne Tips and Tricks – Sep 1, 2023

Welcome to our weekly FiftyOne tips and tricks blog, where we cover interesting workflows and features of FiftyOne! This week is part one of a two-part series exploring FiftyOne’s grouped datasets. We will cover the basics: creating your first grouped dataset, working with it, and building powerful views on top of it.

Wait, what’s FiftyOne?

FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.

Ok, let’s dive into this week’s tips and tricks! Also feel free to follow along in our notebook or on YouTube!

What Is a Grouped Dataset?

Before diving into today’s topic, let’s first understand what a grouped dataset is and why we would want to use one. A grouped dataset is a collection of samples, possibly of different modalities (image, video, or point cloud), that are organized into groups, with each sample occupying a named slice of its group. Multiview datasets are a typical example.

Samples in the same group are related. For example, grouped datasets can be used to represent multiview scenes, where data for multiple perspectives of the same scene can be stored, visualized, and queried in ways that respect the relationships between the slices of data.

Kickstarting Your First Grouped Dataset

import fiftyone as fo

dataset = fo.Dataset("first-group-dataset")
dataset.add_group_field("group", default="center")

To get started, create a dataset and add a group field. Every grouped dataset must have a group field, which records the group and slice that each sample belongs to. The optional `default` parameter sets the default slice: the slice that is returned when you interact with the dataset via Python, and the slice that is shown when you first launch a session of the FiftyOne App. This can be changed easily, as detailed later. For now, let’s add some images to our dataset:

import fiftyone.utils.random as four
import fiftyone.zoo as foz

groups = ["left", "center", "right"]

d = foz.load_zoo_dataset("quickstart")

four.random_split(d, {g: 1 / len(groups) for g in groups})

filepaths = [d.match_tags(g).values("filepath") for g in groups]
filepaths = [dict(zip(groups, fps)) for fps in zip(*filepaths)]

Preparing the data for your grouped dataset is easy. Define your slice names, then build one dictionary per group that maps each slice name to the filepath of the sample in that slice.
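To make the reshaping step concrete, here is a minimal plain-Python sketch of what the last two lines of the snippet above do. The file names are made up for illustration and are not files from the quickstart dataset. Note that `zip` stops at the shortest list, so if the slices end up with unequal sizes, the leftover samples are not placed in any group.

```python
groups = ["left", "center", "right"]

# Hypothetical per-slice filepath lists; "right" is one item shorter
per_slice = [
    ["l0.jpg", "l1.jpg", "l2.jpg"],  # left
    ["c0.jpg", "c1.jpg", "c2.jpg"],  # center
    ["r0.jpg", "r1.jpg"],            # right
]

# zip(*per_slice) takes the i-th path from every slice; pairing each tuple
# with the slice names yields one {slice_name: filepath} dict per group
per_group = [dict(zip(groups, fps)) for fps in zip(*per_slice)]

for fps in per_group:
    print(fps)
```

Only two dictionaries are produced here, because the third `left` and `center` paths have no `right` counterpart.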

With our data ready, it is time to throw it into our grouped dataset.

samples = []
for fps in filepaths:
    group = fo.Group()
    for name, filepath in fps.items():
        sample = fo.Sample(filepath=filepath, group=group.element(name))
        samples.append(sample)

dataset.add_samples(samples)
print(dataset)
Name:        first-group-dataset
Media type:  group
Group slice: center
Num groups:  145
Persistent:  False
Tags:        []
Sample fields:
    id:       fiftyone.core.fields.ObjectIdField
    filepath: fiftyone.core.fields.StringField
    tags:     fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.Metadata)
    group:    fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.groups.Group)

To add your data to the dataset, iterate through your groups and create a sample for each image, assigning it to its slice of the group. Once all the samples have been created, we can add them all at once with `dataset.add_samples(samples)`. Congrats! That’s all it takes to create a grouped dataset. We can visualize our first dataset with:

session = fo.launch_app(dataset)

Working with Grouped Datasets

Great, so now that we have our grouped dataset, what can we do with it? If this is your first time using FiftyOne, or you need a refresher on creating views and working with FiftyOne datasets, I recommend brushing up with the Views Guide or some of our previous Tips and Tricks blogs. If you are here for grouped datasets and grouped datasets only, no worries!

For the most part, the Python syntax for interacting with grouped datasets is identical to that of non-grouped datasets. We can start by getting some basic information about our dataset and then use it to access or filter the data to suit our needs.

What Are the Groups in My Dataset?

print(dataset.group_slices)
print(dataset.group_media_types)
['left', 'center', 'right']
{'left': 'image', 'center': 'image', 'right': 'image'}

Here we can see what our group slices are and what kind of media each one contains. Remember that only one slice is active at a time. We set the default to `center`, so any call that retrieves samples or statistics operates on the center slice.

sample = dataset.shuffle().first()

print(sample)
<SampleView: {
    'id': '64f00a64e099d7515a34fca7',
    'media_type': 'image',
    'filepath': '/home/dan/fiftyone/quickstart/data/002748.jpg',
    'tags': [],
    'metadata': None,
    'group': <Group: {'id': '64f00a64e099d7515a34fb8b', 'name': 'center'}>,
}>

The `group` field at the bottom shows that this sample comes from the center slice. To change the active slice, all we need to do is:

dataset.group_slice = "left"
sample = dataset.shuffle().first()

print(sample)
<SampleView: {
    'id': '64f00a64e099d7515a34fbf5',
    'media_type': 'image',
    'filepath': '/home/dan/fiftyone/quickstart/data/001078.jpg',
    'tags': [],
    'metadata': None,
    'group': <Group: {'id': '64f00a64e099d7515a34fb50', 'name': 'left'}>,
}>

Changing the active group slice in Python changes it in the App as well!

The next natural question in your head is, “What if I want the entire group and not just one sample?” No problem! Just grab the group id and pull like this:

sample = dataset.shuffle().first()
group_id = sample.group.id
group = dataset.get_group(group_id)

print(group)
{'left': <Sample: {
    'id': '64efe3b8b6010b6185483e4f',
    'media_type': 'image',
    'filepath': '/home/dan/fiftyone/quickstart/data/003662.jpg',
    'tags': [],
    'metadata': None,
    'group': <Group: {'id': '64efe3b8b6010b6185483dd3', 'name': 'left'}>,
}>, 'center': <Sample: {
    'id': '64efe3b8b6010b6185483e50',
    'media_type': 'image',
    'filepath': '/home/dan/fiftyone/quickstart/data/004978.jpg',
    'tags': [],
    'metadata': None,
    'group': <Group: {'id': '64efe3b8b6010b6185483dd3', 'name': 'center'}>,
}>, 'right': <Sample: {
    'id': '64efe3b8b6010b6185483e51',
    'media_type': 'image',
    'filepath': '/home/dan/fiftyone/quickstart/data/004304.jpg',
    'tags': [],
    'metadata': None,
    'group': <Group: {'id': '64efe3b8b6010b6185483dd3', 'name': 'right'}>,
}>}

Now we can access each piece of media in the group for the sample we are looking for.

Iterating Through Your Grouped Dataset

There are two suggested ways to iterate through your grouped dataset: iterating through your active slice or iterating through each group. Choose whichever best fits your use case. To iterate through just your active slice, you can use:

print(dataset.group_slice)

# center

for sample in dataset:
    pass

Remember, you can always change the active slice by assigning a slice name to `dataset.group_slice`!

To iterate over your groups, you can use the iter_groups() function:

for group in dataset.iter_groups():
    pass
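
As the printed group earlier shows, a group is a dict that maps slice names to samples. The difference between the two loops can be sketched in plain Python, with a hypothetical list of such dicts standing in for a dataset. The data and the `active_slice` variable below are illustrative assumptions, not FiftyOne objects:

```python
# Toy stand-in for a grouped dataset: each group maps slice name -> filepath
toy_groups = [
    {"left": "l0.jpg", "center": "c0.jpg", "right": "r0.jpg"},
    {"left": "l1.jpg", "center": "c1.jpg", "right": "r1.jpg"},
]
active_slice = "center"

# Analogous to `for sample in dataset`: one item per group, active slice only
active_samples = [group[active_slice] for group in toy_groups]

# Analogous to `for group in dataset.iter_groups()`: every slice of every group
all_samples = [
    (name, path) for group in toy_groups for name, path in group.items()
]

print(active_samples)
print(len(all_samples))
```

Iterating the active slice visits one sample per group, while iterating groups gives access to every slice of each group in turn.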

Creating Views in Your Grouped Dataset

One of the best parts about grouped datasets is that you have the entire dataset view language at your disposal to sort, slice, and search through your data. Iterating, grabbing samples, and the other basic operations on grouped datasets all carry over to views. There are tons of possible views or subsets of your dataset you can make, so I will highlight just a few:

Filter based on class:

from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart-groups")

print(dataset.group_slice)
# left

# Filters based on the content in the 'left' slice
view = (
    dataset
    .match_tags("train")
    .filter_labels("ground_truth", F("label") == "Pedestrian")
)

We can even filter on multiple slices at once, using the computed metadata of the samples!

from fiftyone import ViewField as F

dataset.compute_metadata()

# Match groups whose `left` image has a height of at least 640 pixels and
# whose `right` image has a height of at most 480 pixels

left_cond = F("groups.left.metadata.height") >= 640
right_cond = F("groups.right.metadata.height") <= 480
view = dataset.match(left_cond & right_cond)


print(view)

Create views of joined group slices:

print(dataset.count())  # 200
print(dataset.count("ground_truth.detections"))  # 1438

view3 = dataset.select_group_slices(["left", "right"])

print(view3.count())  # 400
print(view3.count("ground_truth.detections"))  # 2876

If you want a view of just some of your slices, you can select multiple slices at once. The resulting view can become a new dataset or just the next step of your filtering or sorting process.
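
The doubled counts above follow from the selected-slices view being flat: it contains every sample from the selected slices rather than one sample per group. A toy plain-Python illustration, with a hypothetical list of dicts standing in for the groups:

```python
# Toy stand-in for a grouped dataset: each group maps slice name -> filepath
toy_groups = [
    {"left": "l0.jpg", "center": "c0.jpg", "right": "r0.jpg"},
    {"left": "l1.jpg", "center": "c1.jpg", "right": "r1.jpg"},
]
selected = ["left", "right"]

# Flatten: one entry per (group, selected slice) pair
flat = [group[name] for group in toy_groups for name in selected]

print(len(toy_groups), len(flat))
```

Two groups with two selected slices each give four samples, just as 200 groups give 400 samples in the view above.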

Likewise, we can exclude individual groups from our view as well!

# Exclude two groups at random
view = dataset.take(2)

group_ids = view.values("group.id")
other_groups = dataset.exclude_groups(group_ids)
assert len(set(group_ids) & set(other_groups.values("group.id"))) == 0

Aggregations:

# Expression that computes the area of a bounding box, in pixels
bbox_width = F("bounding_box")[2] * F("$metadata.width")
bbox_height = F("bounding_box")[3] * F("$metadata.height")
bbox_area = bbox_width * bbox_height

print(dataset.group_slice)
# left

print(dataset.count("ground_truth.detections"))
# 1379

print(dataset.mean("ground_truth.detections[]", expr=bbox_area))
# 9291.53

Aggregations work on grouped datasets just as they do elsewhere: they run on the active slice, and the same expressions can be reused when building views.
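
For intuition about the `bbox_area` expression, here is a plain-Python sketch of the same arithmetic for a single box. It relies on FiftyOne’s convention that `bounding_box` holds `[top-left-x, top-left-y, width, height]` as fractions of the image size. The helper name, the box, and the image dimensions below are made up for illustration.

```python
def bbox_area_pixels(bounding_box, image_width, image_height):
    """Return the area in pixels of a relative [x, y, w, h] bounding box."""
    _, _, rel_w, rel_h = bounding_box
    return (rel_w * image_width) * (rel_h * image_height)


# Hypothetical box covering 25% of the width and 50% of the height
area = bbox_area_pixels([0.1, 0.2, 0.25, 0.5], image_width=1280, image_height=720)
print(area)
```

This is why the expression needs `$metadata.width` and `$metadata.height`, and why `compute_metadata()` must have been run before the aggregation.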

Putting it all together, you can create complex views that suit exactly what you need!

dataset = foz.load_zoo_dataset("quickstart-groups")

dataset.compute_metadata()

bbox_width = F("bounding_box")[2]
bbox_height = F("bounding_box")[3]
bbox_area = bbox_width * bbox_height

view = dataset.filter_labels("ground_truth", (0.05 <= bbox_area) & (bbox_area < 0.5))
print(view)
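
Because bounding boxes are stored in relative coordinates, `bbox_area` in this last snippet is the fraction of the image that a box covers, so the filter keeps objects occupying at least 5% and less than 50% of the frame. A plain-Python sketch of that predicate, applied to made-up boxes (the `keeps_box` helper is hypothetical):

```python
def keeps_box(bounding_box, min_area=0.05, max_area=0.5):
    """Mirror of the view's filter: min_area <= relative area < max_area."""
    _, _, rel_w, rel_h = bounding_box
    return min_area <= rel_w * rel_h < max_area


# Hypothetical [x, y, w, h] boxes in relative coordinates
boxes = {
    "tiny": [0.0, 0.0, 0.1, 0.1],    # about 1% of the image
    "medium": [0.2, 0.2, 0.5, 0.4],  # about 20% of the image
    "huge": [0.0, 0.0, 0.9, 0.9],    # about 81% of the image
}
kept = [name for name, box in boxes.items() if keeps_box(box)]
print(kept)
```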

Conclusion

I hope this quick walkthrough has helped you understand grouped datasets better! There is so much that can be accomplished with them. Next week we will cover dynamic grouped datasets and dive into just how much you can customize your FiftyOne datasets. Stay tuned, and for more conversation or help with grouped datasets, hop into our community Slack channel, where everyone is eager to help with your FiftyOne experience!

Join the FiftyOne Community!

Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!