Using Aggregations¶
The FiftyOne Dataset is the core data structure in
FiftyOne, allowing you to represent your raw data, labels, and associated
metadata. When you query and manipulate a Dataset
object using
dataset views, a DatasetView
object is returned, which
represents a filtered view into a subset of the underlying dataset’s contents.
Complementary to this data model, one is often interested in computing
aggregate statistics about datasets, such as label counts, distributions, and
ranges, where each Sample
is reduced to a single quantity in the aggregate
results.
The fiftyone.core.aggregations
module offers a declarative and
highly efficient approach to computing summary statistics about your datasets
and views.
Overview¶
All built-in aggregations are subclasses of the Aggregation
class, each
encapsulating the computation of a different statistic about your data.
Aggregations are conveniently exposed as methods on all Dataset
and
DatasetView
objects:
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")

    # List available aggregations
    print(dataset.list_aggregations())
    # ['bounds', 'count', 'count_values', 'distinct', ..., 'sum']
Think of aggregations as more efficient, concise alternatives to writing explicit loops over your dataset to compute a statistic:
    from collections import defaultdict

    # Compute label histogram manually
    manual_counts = defaultdict(int)
    for sample in dataset:
        for detection in sample.ground_truth.detections:
            manual_counts[detection.label] += 1

    # Compute via aggregation
    counts = dataset.count_values("ground_truth.detections.label")
    print(counts)  # same as `manual_counts` above
You can even aggregate on expressions that transform the data in arbitrarily complex ways:
    from fiftyone import ViewField as F

    num_objects = F("detections").length()

    # The `(min, max)` number of predictions per sample
    print(dataset.bounds("predictions", expr=num_objects))

    # The average number of predictions per sample
    print(dataset.mean("predictions", expr=num_objects))
The sections below discuss the available aggregations in more detail. You can
also refer to the fiftyone.core.aggregations
module documentation for
detailed examples of using each aggregation.
Note
All aggregations can operate on embedded sample fields using the
embedded.field.name
syntax.
Aggregation fields can also include array fields. Most array fields are
automatically unwound, but you can always manually unwind an array using
the embedded.array[].field
syntax. See
this section for more details.
Compute bounds¶
You can use the
bounds()
aggregation to compute the [min, max]
range of a numeric field of a
dataset:
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")

    # Compute the bounds of the `uniqueness` field
    bounds = dataset.bounds("uniqueness")
    print(bounds)
    # (0.15001302256126986, 1.0)

    # Compute the bounds of the detection confidences in the `predictions` field
    bounds = dataset.bounds("predictions.detections.confidence")
    print(bounds)
    # (0.05003104358911514, 0.9999035596847534)
Count items¶
You can use the
count()
aggregation
to compute the number of non-None
field values in a collection:
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")

    # Compute the number of samples in the dataset
    count = dataset.count()
    print(count)
    # 200

    # Compute the number of samples with `predictions`
    count = dataset.count("predictions")
    print(count)
    # 200

    # Compute the number of detections in the `predictions` field
    count = dataset.count("predictions.detections")
    print(count)
    # 5620
Count values¶
You can use the
count_values()
aggregation to compute the occurrences of field values in a collection:
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")

    # Count the occurrences of each sample tag in the dataset
    counts = dataset.count_values("tags")
    print(counts)
    # {'validation': 200}

    # Compute a histogram of the predicted labels in the `predictions` field
    counts = dataset.count_values("predictions.detections.label")
    print(counts)
    # {'bicycle': 13, 'hot dog': 8, ..., 'skis': 52}
Distinct values¶
You can use the
distinct()
aggregation to compute the distinct values of a field in a collection:
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")

    # Get the distinct tags on the dataset
    values = dataset.distinct("tags")
    print(values)
    # ['validation']

    # Get the distinct labels in the `predictions` field
    values = dataset.distinct("predictions.detections.label")
    print(values)
    # ['airplane', 'apple', 'backpack', ..., 'wine glass', 'zebra']
Histogram values¶
You can use the
histogram_values()
aggregation to compute the histograms of numeric fields of a collection:
    import numpy as np
    import matplotlib.pyplot as plt

    import fiftyone.zoo as foz

    def plot_hist(counts, edges):
        counts = np.asarray(counts)
        edges = np.asarray(edges)
        left_edges = edges[:-1]
        widths = edges[1:] - edges[:-1]
        plt.bar(left_edges, counts, width=widths, align="edge")

    dataset = foz.load_zoo_dataset("quickstart")

    #
    # Compute a histogram of the `uniqueness` field
    #

    counts, edges, other = dataset.histogram_values("uniqueness", bins=50)

    plot_hist(counts, edges)
    plt.show(block=False)
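To make the three return values concrete, the following is a minimal pure-Python sketch of an equal-width histogram. It is not FiftyOne's implementation: the `histogram` helper, the made-up input values, the right-inclusive last bin, and the reading of `other` as the number of values that land in no bin (here, `None`) are all assumptions for illustration.

```python
def histogram(values, bins):
    """Hypothetical helper: equal-width histogram over the non-None values.

    Assumes `values` contains at least two distinct numbers.
    """
    present = [v for v in values if v is not None]
    other = len(values) - len(present)  # values that fall in no bin

    lo, hi = min(present), max(present)
    width = (hi - lo) / bins

    # `bins` bins are delimited by `bins + 1` edges
    edges = [lo + i * width for i in range(bins)] + [hi]

    counts = [0] * bins
    for v in present:
        # Clamp so that the maximum value lands in the last bin
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1

    return counts, edges, other


# Made-up values, including one missing entry
counts, edges, other = histogram([0.0, 0.1, 0.5, 0.9, 1.0, None], bins=2)
print(counts)  # [2, 3]
print(edges)   # [0.0, 0.5, 1.0]
print(other)   # 1
```

Because there is always one more edge than there are bins, the `plot_hist()` helper above pairs `edges[:-1]` with `counts`.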

Sum values¶
You can use the
sum()
aggregation to
compute the sum of the (non-None
) values of a field in a collection:
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")

    # Compute average confidence of detections in the `predictions` field
    print(
        dataset.sum("predictions.detections.confidence") /
        dataset.count("predictions.detections.confidence")
    )
    # 0.34994137249820706
Mean values¶
You can use the
mean()
aggregation to
compute the arithmetic mean of the (non-None
) values of a field in a
collection:
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")

    # Compute average confidence of detections in the `predictions` field
    print(dataset.mean("predictions.detections.confidence"))
    # 0.34994137249820706
Standard deviation¶
You can use the
std()
aggregation to
compute the standard deviation of the (non-None
) values of a field in a
collection:
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")

    # Compute standard deviation of the confidence of detections in the
    # `predictions` field
    print(dataset.std("predictions.detections.confidence"))
    # 0.3184061813934825
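The sum, count, mean, and standard deviation aggregations are related in the usual way, and all of them skip `None` values. The plain-Python sketch below shows those relationships on made-up confidence values. This section does not say whether `std()` reports the population or the sample standard deviation, so the sketch computes both instead of assuming one.

```python
import math
import statistics

# Made-up confidences; `None` stands for a missing value
confidences = [0.9, None, 0.5, 0.1, None, 0.7]

# Aggregations ignore `None` values
present = [c for c in confidences if c is not None]

total = sum(present)   # analogous to sum()
count = len(present)   # analogous to count()
mean = total / count   # analogous to mean()

# Two common definitions of standard deviation
population_std = statistics.pstdev(present)  # divides by N
sample_std = statistics.stdev(present)       # divides by N - 1

print(count)
print(mean)
print(population_std, sample_std)
```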
Values¶
You can use the
values()
aggregation to extract a list containing the values of a field across all
samples in a collection:
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")

    # Extract the `uniqueness` values for all samples
    uniqueness = dataset.values("uniqueness")
    print(len(uniqueness))  # 200

    # Extract the labels for all predictions
    labels = dataset.values("predictions.detections.label")
    print(len(labels))  # 200
    print(labels[0])  # ['bird', ..., 'bear', 'sheep']
Note
Unlike other aggregations,
values()
does
not automatically unwind list fields, which ensures that the returned
values match the potentially-nested structure of the documents.
You can opt-in to unwinding specific list fields using the []
syntax, or you can pass the optional unwind=True
parameter to unwind
all supported list fields. See Aggregating list fields for more
information.
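The difference between nested and unwound results can be pictured with plain Python lists. This is only an illustration on made-up dictionaries, not FiftyOne code. The nested form keeps one entry per sample, and the unwound form flattens all entries into a single list.

```python
# Made-up stand-ins for samples with a list-valued field
samples = [
    {"labels": ["bird", "bear"]},
    {"labels": []},
    {"labels": ["sheep"]},
]

# Nested: one list per sample, so the result lines up with the samples
nested = [sample["labels"] for sample in samples]

# Unwound: a single flat list of every element
flat = [label for sample in samples for label in sample["labels"]]

print(nested)  # [['bird', 'bear'], [], ['sheep']]
print(flat)    # ['bird', 'bear', 'sheep']
```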
Advanced usage¶
Aggregating list fields¶
Aggregations that operate on scalar fields can also be applied to the elements
of list fields by appending []
to the list component of the field path.
The example below demonstrates this capability:
    import fiftyone as fo

    dataset = fo.Dataset()
    dataset.add_samples(
        [
            fo.Sample(
                filepath="/path/to/image1.png",
                keypoints=fo.Keypoint(points=[(0, 0), (1, 1)]),
                classes=fo.Classification(
                    label="cat", confidence=0.9, friends=["dog", "squirrel"]
                ),
            ),
            fo.Sample(
                filepath="/path/to/image2.png",
                keypoints=fo.Keypoint(points=[(0, 0), (0.5, 0.5), (1, 1)]),
                classes=fo.Classification(
                    label="dog", confidence=0.8, friends=["rabbit", "squirrel"],
                ),
            ),
        ]
    )

    # Count the number of keypoints in the dataset
    count = dataset.count("keypoints.points[]")
    print(count)
    # 5

    # Count the values in the custom `friends` attribute of the `classes` field
    counts = dataset.count_values("classes.friends[]")
    print(counts)
    # {'dog': 1, 'squirrel': 2, 'rabbit': 1}
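As a rough mental model of what the `[]` suffix does, the sketch below resolves dotted paths against plain dictionaries that mirror the two samples above. The `extract`, `count`, and `count_values` functions are hypothetical pure-Python helpers written for this illustration; they are not part of FiftyOne and say nothing about how it evaluates paths internally.

```python
from collections import Counter


def extract(value, path):
    """Yield the values reached by a dotted path; a `[]` suffix unwinds a list."""
    if not path:
        yield value
        return

    head, _, rest = path.partition(".")
    unwind = head.endswith("[]")
    key = head[:-2] if unwind else head

    child = value.get(key) if isinstance(value, dict) else None
    if child is None:
        return

    if unwind:
        # One result per list element
        for item in child:
            yield from extract(item, rest)
    else:
        # The list (or scalar) is passed along as a single value
        yield from extract(child, rest)


def count(samples, path):
    return sum(1 for s in samples for v in extract(s, path) if v is not None)


def count_values(samples, path):
    return dict(Counter(v for s in samples for v in extract(s, path)))


# Plain-dict stand-ins for the two samples above
samples = [
    {
        "keypoints": {"points": [(0, 0), (1, 1)]},
        "classes": {"label": "cat", "friends": ["dog", "squirrel"]},
    },
    {
        "keypoints": {"points": [(0, 0), (0.5, 0.5), (1, 1)]},
        "classes": {"label": "dog", "friends": ["rabbit", "squirrel"]},
    },
]

print(count(samples, "keypoints.points[]"))        # 5
print(count(samples, "keypoints.points"))          # 2
print(count_values(samples, "classes.friends[]"))  # {'dog': 1, 'squirrel': 2, 'rabbit': 1}
```

In this model, a path without `[]` counts each list once, while a path with `[]` counts each element.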
Note
There are four cases where FiftyOne will automatically unwind array fields
without requiring you to explicitly specify this via the []
syntax:
Top-level lists: When you write an aggregation that refers to a
top-level list field of a dataset; i.e., list_field
is automatically
coerced to list_field[]
, if necessary.
Label lists: When you write an aggregation that refers to the list
field of a Label
class, such as the
Detections.detections
attribute; i.e., ground_truth.detections.label
is automatically
coerced to ground_truth.detections[].label
, if necessary.
Frame fields: When you write an aggregation that refers to a
frame-level field of a video dataset; i.e.,
frames.classification.label
is automatically coerced to
frames[].classification.label
, if necessary.
Tags fields: When you write an aggregation that refers to the tags
attribute of a Sample
or Label
object; i.e., classification.tags
is
automatically coerced to classification.tags[]
, if necessary.
Aggregating expressions¶
Aggregations also support performing more complex computations on fields via
the optional expr
argument,
which is supported by all aggregations and allows you to specify a
ViewExpression
defining an arbitrary transformation of the field you’re
operating on prior to aggregating.
The following examples demonstrate the power of aggregating with expressions:
The code sample below computes some statistics about the number of predicted objects in a dataset:
    import fiftyone as fo
    import fiftyone.zoo as foz
    from fiftyone import ViewField as F

    dataset = foz.load_zoo_dataset("quickstart")

    # Expression that computes the number of objects in a `Detections` field
    num_objects = F("detections").length()

    # The `(min, max)` number of predictions per sample
    print(dataset.bounds("predictions", expr=num_objects))

    # The average number of predictions per sample
    print(dataset.mean("predictions", expr=num_objects))

    # Two equivalent ways of computing the total number of predictions
    print(dataset.sum("predictions", expr=num_objects))
    print(dataset.count("predictions.detections"))
The code sample below computes some statistics about predicted object labels after doing some normalization:
    import fiftyone as fo
    import fiftyone.zoo as foz
    from fiftyone import ViewField as F

    dataset = foz.load_zoo_dataset("quickstart")

    ANIMALS = [
        "bear", "bird", "cat", "cow", "dog", "elephant", "giraffe",
        "horse", "sheep", "zebra"
    ]

    # Expression that replaces all animal labels with "animal" and then
    # capitalizes all labels
    normed_labels = F("label").map_values({a: "animal" for a in ANIMALS}).upper()

    # A histogram of normalized predicted labels
    print(dataset.count_values("predictions.detections[]", expr=normed_labels))
The code sample below computes some statistics about the sizes of ground truth and predicted bounding boxes in a dataset, in pixels:
    import fiftyone as fo
    import fiftyone.zoo as foz
    from fiftyone import ViewField as F

    dataset = foz.load_zoo_dataset("quickstart")
    dataset.compute_metadata()

    # Expression that computes the area of a bounding box, in pixels
    # Bboxes are in [top-left-x, top-left-y, width, height] format
    bbox_width = F("bounding_box")[2] * F("$metadata.width")
    bbox_height = F("bounding_box")[3] * F("$metadata.height")
    bbox_area = bbox_width * bbox_height

    # Compute the (min, max) and mean areas of the ground truth bounding boxes
    print(dataset.bounds("ground_truth.detections[]", expr=bbox_area))
    print(dataset.mean("ground_truth.detections[]", expr=bbox_area))

    # Compute the same statistics for the predictions
    print(dataset.bounds("predictions.detections[]", expr=bbox_area))
    print(dataset.mean("predictions.detections[]", expr=bbox_area))
Note
When aggregating with expressions, field names may contain list fields, and such field paths are handled as explained above.
However, there is one important exception when expressions are involved:
field paths that end in array fields are not automatically unwound;
you must specify that they should be unwound by appending []
. This
change in default behavior allows for the possibility that the
ViewExpression
you provide is intended to operate on the array as a
whole.
    import fiftyone as fo
    import fiftyone.zoo as foz
    from fiftyone import ViewField as F

    dataset = foz.load_zoo_dataset("quickstart")

    # Counts the number of predicted objects
    # Here, `predictions.detections` is treated as `predictions.detections[]`
    print(dataset.count("predictions.detections"))

    # Counts the number of predicted objects with confidence > 0.9
    # Here, `predictions.detections` is not automatically unwound
    print(
        dataset.sum(
            "predictions.detections",
            expr=F().filter(F("confidence") > 0.9).length()
        )
    )
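The distinction between operating on an array as a whole and on its unwound elements can be sketched in plain Python. The dictionaries and the `high_conf_count` helper below are made up for illustration and do not use FiftyOne. The helper plays the role of an expression that receives the entire list for each sample.

```python
# Made-up stand-ins for samples with a list of detections
samples = [
    {"detections": [{"confidence": 0.95}, {"confidence": 0.4}]},
    {"detections": [{"confidence": 0.92}, {"confidence": 0.99}, {"confidence": 0.1}]},
    {"detections": []},
]


def high_conf_count(detections):
    """Whole-array computation: filter the list, then take its length."""
    return len([d for d in detections if d["confidence"] > 0.9])


# Array treated as a whole: one number per sample, then summed
total_high_conf = sum(high_conf_count(s["detections"]) for s in samples)

# Array unwound: every detection contributes one item to the count
total_objects = sum(1 for s in samples for _ in s["detections"])

print(total_high_conf)  # 3
print(total_objects)    # 5
```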
Batching aggregations¶
Rather than computing a single aggregation by invoking methods on a Dataset
or DatasetView
object, you can also instantiate an Aggregation
object
directly. In this case, the aggregation is not tied to any dataset or view,
only to the parameters, such as the field name, that define it.
    import fiftyone as fo

    # will count the number of samples in a dataset
    sample_count = fo.Count()

    # will count the labels in a `ground_truth` detections field
    count_values = fo.CountValues("ground_truth.detections.label")

    # will compute a histogram of the `uniqueness` field
    histogram_values = fo.HistogramValues("uniqueness", bins=50)
Instantiating aggregations in this way allows you to execute multiple
aggregations on a dataset or view efficiently in a batch via
aggregate()
:
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")

    results = dataset.aggregate([sample_count, count_values, histogram_values])

    print(results[0])
    # 200

    print(results[1])
    # {'bowl': 15, 'scissors': 1, 'cup': 21, ..., 'vase': 1, 'sports ball': 3}

    print(results[2][0])  # counts
    # [0, 0, 0, ..., 15, 12, ..., 0, 0]

    print(results[2][1])  # edges
    # [0.0, 0.02, 0.04, ..., 0.98, 1.0]
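One way to picture why batching can help is that several aggregation objects can share a single pass over the data and return their results in the order they were given. The sketch below is a conceptual model only. The `CountSketch` and `CountValuesSketch` classes and the `aggregate` function are hypothetical, and this section does not describe how FiftyOne actually executes a batch.

```python
from collections import Counter


class CountSketch:
    """Hypothetical aggregation that counts the samples it sees."""

    def __init__(self):
        self.total = 0

    def update(self, sample):
        self.total += 1

    def result(self):
        return self.total


class CountValuesSketch:
    """Hypothetical aggregation that counts the values of one field."""

    def __init__(self, field):
        self.field = field
        self.counts = Counter()

    def update(self, sample):
        self.counts[sample[self.field]] += 1

    def result(self):
        return dict(self.counts)


def aggregate(samples, aggregations):
    """Feed every sample to every aggregation in a single pass."""
    for sample in samples:
        for aggregation in aggregations:
            aggregation.update(sample)

    # Results come back in the same order as the aggregations
    return [aggregation.result() for aggregation in aggregations]


# Made-up data
samples = [{"label": "cat"}, {"label": "dog"}, {"label": "cat"}]

results = aggregate(samples, [CountSketch(), CountValuesSketch("label")])
print(results[0])  # 3
print(results[1])  # {'cat': 2, 'dog': 1}
```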
Transforming data before aggregating¶
You can use view stages like
map_labels()
in concert with aggregations to efficiently compute statistics on your
datasets.
For example, suppose you would like to compute the histogram of the labels in
a dataset with certain labels grouped into a single category. You can use
map_labels()
+
count_values()
to succinctly express this:
    import fiftyone as fo
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")

    # Map `cat` and `dog` to `pet`
    labels_map = {"cat": "pet", "dog": "pet"}

    counts = (
        dataset
        .map_labels("ground_truth", labels_map)
        .count_values("ground_truth.detections.label")
    )

    print(counts)
    # {'toothbrush': 2, 'train': 5, ..., 'pet': 31, ..., 'cow': 22}
Or, suppose you would like to compute the average confidence of a model’s
predictions, ignoring any values less than 0.5. You can use
filter_labels()
+
mean()
to succinctly express this:
    import fiftyone as fo
    import fiftyone.zoo as foz
    from fiftyone import ViewField as F

    dataset = foz.load_zoo_dataset("quickstart")

    avg_conf = (
        dataset
        .filter_labels("predictions", F("confidence") >= 0.5)
        .mean("predictions.detections.confidence")
    )

    print(avg_conf)
    # 0.8170506501060617
Aggregating frame labels¶
You can compute aggregations on the frame labels of a video dataset by adding
the frames
prefix to the relevant frame field name:
    import fiftyone as fo
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart-video")

    # Count the number of video frames
    count = dataset.count("frames")
    print(count)
    # 1279

    # Compute a histogram of per-frame object labels
    counts = dataset.count_values(
        "frames.ground_truth_detections.detections.label"
    )
    print(counts)
    # {'person': 1108, 'vehicle': 7511, 'road sign': 2726}