
FiftyOne Computer Vision Model Evaluation Tips and Tricks – Feb 03, 2023

Welcome to our weekly FiftyOne tips and tricks blog where we give practical pointers for using FiftyOne on topics inspired by discussions in the open source community. This week we’ll cover model evaluation.

Wait, what’s FiftyOne?

FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.

[GIF: FiftyOne quick overview]

Ok, let’s dive into this week’s tips and tricks!

A primer on model evaluations

FiftyOne provides a variety of built-in methods for evaluating your model predictions across tasks, including regressions, classifications, detections, polygons, and instance and semantic segmentations, on both image and video datasets.

When you evaluate a model in FiftyOne, you get access to the standard aggregate metrics such as classification reports, confusion matrices, and PR curves for your model. In addition, FiftyOne can also record fine-grained statistics like accuracy and false positive counts at the sample-level, which you can leverage via dataset views and the FiftyOne App to interactively explore the strengths and weaknesses of your models on individual data samples.

FiftyOne’s model evaluation methods are conveniently exposed as methods on all Dataset and DatasetView objects, which means that you can evaluate entire datasets or specific views into them via the same syntax.
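For instance, here's a minimal sketch of that workflow on the quickstart dataset, storing sample-level statistics under the eval_key "eval":

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

## evaluate detections, storing sample-level stats under "eval"
results = dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval"
)

## aggregate metrics: per-class precision, recall, and F1
results.print_report()

## sample-level stats: browse the samples with the most false positives
session = fo.launch_app(dataset.sort_by("eval_fp", reverse=True))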

Continue reading for some tips and tricks to help you master evaluations in FiftyOne!

Task-specific evaluation methods

In FiftyOne, the Evaluation API supports common computer vision tasks like object detection and classification with default evaluation methods that implement some of the standard routines in the field. For standard object detection, for instance, the default evaluation style is MS COCO. In most other cases, the default evaluation style is denoted "simple". If the default style for a given task is what you are looking for, then there is no need to specify the method argument.

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

results = dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth"
)

Alternatively, you can explicitly specify a method to use for model evaluation:

dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    method="open-images"
)

Each evaluation method has an associated evaluation config, which specifies the arguments that can be passed into the evaluation routine when using that style of evaluation. For ActivityNet-style evaluation, for example, you can pass an iou argument specifying the IoU threshold to use, and compute_mAP=True to tell the method to compute the mean average precision.
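The default COCO-style detection evaluation accepts the same arguments; here's a quick sketch on the quickstart dataset:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

results = dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    method="coco",
    iou=0.75,          ## IoU threshold for matching predictions to ground truth
    compute_mAP=True   ## also compute COCO-style mean average precision
)

print(results.mAP())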

To see which label types are available for a dataset, check out the section detailing that dataset in the FiftyOne Dataset Zoo documentation.

Learn more about evaluating object detections in the FiftyOne Docs.

Evaluations on views

All methods in FiftyOne’s Evaluation API that are applicable to Dataset instances are also exposed on DatasetView instances. This means that you can compute evaluations on subsets of your dataset obtained through filtering, matching, and chaining together any number of view stages.

As an example, we can evaluate detections only on samples that are highly unique in our dataset, and which have fewer than 10 predicted detections:

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

## compute uniqueness of each sample
fob.compute_uniqueness(dataset)

## create a DatasetView with the 50 most unique images
unique_view = dataset.sort_by(
    "uniqueness",
    reverse=True
).limit(50)

## get only the unique images with fewer than 10 predicted detections
few_pred_unique_view = unique_view.match(
    F("predictions.detections").length() < 10
)

## evaluate detections for this view
few_pred_unique_view.evaluate_detections(
    "predictions", 
    gt_field="ground_truth",
    eval_key="eval_few_unique"
)

Learn more about the FiftyOne Brain in the FiftyOne Docs.

Plotting interactive confusion matrices

For classification and detection evaluations, FiftyOne’s evaluation routines generate confusion matrices. You can plot these confusion matrices in FiftyOne with the plot_confusion_matrix() method.

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

## generate evaluation results
results = dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth"
)

## plot confusion matrix
classes = ["person", "kite", "car", "bird"]
plot = results.plot_confusion_matrix(classes=classes)
plot.show()

Because the confusion matrix is implemented in Plotly, it is interactive! To interact visually with your data via the confusion matrix, attach the plot to a session launched with the dataset:

## create a session and attach plot
session = fo.launch_app(dataset)
session.plots.attach(plot)

Clicking into a cell in the confusion matrix then changes which samples appear in the sample grid in the FiftyOne App.

Learn more about interactive plotting in the FiftyOne Docs.

Evaluating frames of a video

All of the evaluation methods in FiftyOne’s Evaluation API can be applied to frame-level labels in addition to sample-level labels. This means that you can evaluate video samples without needing to convert the frames of a video sample to standalone image samples. 

Applying FiftyOne evaluation methods to video frames also has the added benefit that useful statistics are computed at both the frame and sample levels. For instance, the following code populates the fields eval_tp, eval_fp, and eval_fn as summary statistics on the sample level, containing the total number of true positives, false positives, and false negatives across all frames in the sample. Additionally, on each frame, the evaluation populates an eval field for each detection with a value of either tp, fp, or fn, as well as an eval_iou field where appropriate.

import random

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "quickstart-video", 
    dataset_name="video-eval-demo"
)

## Create some test predictions 
classes = dataset.distinct("frames.detections.detections.label")

## randomly swap ~10% of labels to simulate classification mistakes
def jitter(val):
    if random.random() < 0.10:
        return random.choice(classes)

    return val

## copy the ground truth boxes into a predictions field, with jittered
## labels and random confidences
predictions = []
for sample_gts in dataset.values("frames.detections"):
    sample_predictions = []
    for frame_gts in sample_gts:
        sample_predictions.append(
            fo.Detections(
                detections=[
                    fo.Detection(
                        label=jitter(gt.label),
                        bounding_box=gt.bounding_box,
                        confidence=random.random(),
                    )
                    for gt in frame_gts.detections
                ]
            )
        )

    predictions.append(sample_predictions)

dataset.set_values("frames.predictions", predictions)

dataset.evaluate_detections(
    "frames.predictions",
    gt_field="frames.detections",
    eval_key="eval",
)

Note that the only difference in practice is the "frames." prefix used to specify the predictions field and the ground truth field.
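To dig into these fields, here's a quick sketch (assuming the eval_key "eval" used above):

from fiftyone import ViewField as F

## total true positives across all frames in the dataset
print(dataset.sum("eval_tp"))

## view containing only the predicted detections marked as false positives
fp_view = dataset.filter_labels("frames.predictions", F("eval") == "fp")

session = fo.launch_app(fp_view)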

Learn more about video views and evaluating videos in the FiftyOne Docs.

Managing multiple evaluations

With all of the flexibility the Evaluation API provides, you’d be well within reason to wonder what evaluation you should perform. Fortunately, FiftyOne makes it easy to perform, manage, and store the results from multiple evaluations!

The results from each evaluation can be stored and accessed via an evaluation key, specified by the eval_key argument. This allows you to compare different evaluation methods on the same data,

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    method="coco",
    eval_key="coco_eval"
)
dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    method="open-images",
    eval_key="oi_eval"
)

evaluate predictions generated by multiple models,

dataset.evaluate_detections(
    "model1_predictions",
    gt_field="ground_truth",
    eval_key="model1_eval"
)
dataset.evaluate_detections(
    "model2_predictions",
    gt_field="ground_truth",
    eval_key="model2_eval"
)

or compare evaluations on different subsets or views of your data, such as a view containing only small bounding boxes and a view containing only large bounding boxes:

from fiftyone import ViewField as F
## bounding boxes are stored in relative coordinates, so this is relative area
bbox_area = (
    F("bounding_box")[2] *
    F("bounding_box")[3]
)

large_boxes = bbox_area > 0.7
small_boxes = bbox_area < 0.3

# Create a view that contains only small-sized objects
small_view = (
    dataset
    .filter_labels(
        "ground_truth", 
        small_boxes
    )
)

# Create a view that contains only large-sized objects
large_view = (
    dataset
    .filter_labels(
        "ground_truth", 
        large_boxes
    )
)

small_view.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval_small",
)

large_view.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval_large",
)
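Each run is tracked on the dataset by its eval_key, so you can list, reload, and clean up evaluations later. Here's a quick sketch using the keys from above:

## list all evaluation runs stored on the dataset
print(dataset.list_evaluations())

## reload the results of a previous run
results = dataset.load_evaluation_results("coco_eval")
results.print_report()

## delete a run and its associated fields when it is no longer needed
dataset.delete_evaluation("oi_eval")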

Learn more about managing model evaluations in the FiftyOne Docs.

Join the FiftyOne community!

Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!