Welcome to our weekly FiftyOne tips and tricks blog where we give practical pointers for using FiftyOne on topics inspired by discussions in the open source community. This week we’ll cover model evaluation.
Wait, what’s FiftyOne?
FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.
- If you like what you see on GitHub, give the project a star.
- Get started! We’ve made it easy to get up and running in a few minutes.
- Join the FiftyOne Slack community; we’re always happy to help.
Ok, let’s dive into this week’s tips and tricks!
A primer on model evaluations
FiftyOne provides a variety of built-in methods for evaluating your model predictions, including regressions, classifications, detections, polygons, and instance and semantic segmentations, on both image and video datasets.
When you evaluate a model in FiftyOne, you get access to standard aggregate metrics such as classification reports, confusion matrices, and PR curves for your model. In addition, FiftyOne can also record fine-grained statistics like accuracy and false positive counts at the sample level, which you can leverage via dataset views and the FiftyOne App to interactively explore the strengths and weaknesses of your models on individual data samples.
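For example, here is a minimal sketch (assuming the quickstart dataset and an eval_key of "eval") that uses the sample-level false positive counts populated by a detection evaluation to surface the most problematic samples in the App:

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

## store per-sample statistics under the "eval" key
dataset.evaluate_detections(
    "predictions", gt_field="ground_truth", eval_key="eval"
)

## sort by the per-sample false positive count to surface the worst samples
worst_fp_view = dataset.sort_by("eval_fp", reverse=True)

## browse the sorted view in the App
session = fo.launch_app(worst_fp_view)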
FiftyOne’s model evaluation methods are conveniently exposed as methods on all Dataset and DatasetView objects, which means that you can evaluate entire datasets or specific views into them via the same syntax.
Continue reading for some tips and tricks to help you master evaluations in FiftyOne!
Task-specific evaluation methods
In FiftyOne, the Evaluation API supports common computer vision tasks like object detection and classification with default evaluation methods that implement some of the standard routines in the field. For standard object detection, for instance, the default evaluation style is MS COCO. In most other cases, the default evaluation style is denoted "simple". If the default style for a given task is what you are looking for, then there is no need to specify the method argument.
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

results = dataset.evaluate_detections(
    "predictions", gt_field="ground_truth"
)
Alternatively, you can explicitly specify a method to use for model evaluation:
dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    method="open-images",
)
Each evaluation method has an associated evaluation config, which specifies what arguments can be passed into the evaluation routine when using that style of evaluation. For ActivityNet-style evaluation, for example, you can pass in an iou argument specifying the IoU threshold to use, and you can pass in compute_mAP=True to tell the method to compute the mean average precision.
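As a minimal sketch (assuming the quickstart dataset loaded above), the COCO-style config accepts analogous arguments, so you could pass a custom IoU threshold and request mAP computation like this:

## pass config arguments supported by the COCO-style evaluation
results = dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    method="coco",
    iou=0.75,          # IoU threshold used to match predictions to ground truth
    compute_mAP=True,  # also compute mean average precision
)

print(results.mAP())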
To see which label types are available for a dataset, check out the section detailing that dataset in the FiftyOne Dataset Zoo documentation.
Learn more about evaluating object detections in the FiftyOne Docs.
Evaluations on views
All methods in FiftyOne’s Evaluation API that are applicable to Dataset instances are also exposed on DatasetView instances. This means that you can compute evaluations on subsets of your dataset obtained through filtering, matching, and chaining together any number of view stages.
As an example, we can evaluate detections only on samples that are highly unique in our dataset, and which have fewer than 10 predicted detections:
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

## compute uniqueness of each sample
fob.compute_uniqueness(dataset)

## create DatasetView with the 50 most unique images
unique_view = dataset.sort_by(
    "uniqueness", reverse=True
).limit(50)

## keep only the unique images with fewer than 10 predicted detections
few_pred_unique_view = unique_view.match(
    F("predictions.detections").length() < 10
)

## evaluate detections for this view
few_pred_unique_view.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval_few_unique",
)
Learn more about the FiftyOne Brain in the FiftyOne Docs.
Plotting interactive confusion matrices
For classification and detection evaluations, FiftyOne’s evaluation routines generate confusion matrices, which you can plot by calling the plot_confusion_matrix() method on the returned results object.
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

## generate evaluation results
results = dataset.evaluate_detections(
    "predictions", gt_field="ground_truth"
)

## plot confusion matrix
classes = ["person", "kite", "car", "bird"]
plot = results.plot_confusion_matrix(classes=classes)
plot.show()
Because the confusion matrix is implemented with Plotly, it is interactive! To interact visually with your data via the confusion matrix, attach the plot to a session launched with the dataset:
## create a session and attach the plot
session = fo.launch_app(dataset)
session.plots.attach(plot)
Clicking into a cell in the confusion matrix then changes which samples appear in the sample grid in the FiftyOne App.
Learn more about interactive plotting in the FiftyOne Docs.
Evaluating frames of a video
All of the evaluation methods in FiftyOne’s Evaluation API can be applied to frame-level labels in addition to sample-level labels. This means that you can evaluate video samples without needing to convert each video’s frames into standalone image samples.
Applying FiftyOne evaluation methods to video frames also has the added benefit that useful statistics are computed at both the frame and sample levels. For instance, the following code populates the fields eval_tp, eval_fp, and eval_fn as summary statistics at the sample level, containing the total number of true positives, false positives, and false negatives across all frames in the sample. Additionally, on each frame, the evaluation populates an eval field for each detection with a value of either tp, fp, or fn, as well as an eval_iou field where appropriate.
import random

import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "quickstart-video", dataset_name="video-eval-demo"
)

## Create some test predictions
classes = dataset.distinct("frames.detections.detections.label")

def jitter(val):
    if random.random() < 0.10:
        return random.choice(classes)
    return val

predictions = []
for sample_gts in dataset.values("frames.detections"):
    sample_predictions = []
    for frame_gts in sample_gts:
        sample_predictions.append(
            fo.Detections(
                detections=[
                    fo.Detection(
                        label=jitter(gt.label),
                        bounding_box=gt.bounding_box,
                        confidence=random.random(),
                    )
                    for gt in frame_gts.detections
                ]
            )
        )
    predictions.append(sample_predictions)

dataset.set_values("frames.predictions", predictions)

dataset.evaluate_detections(
    "frames.predictions",
    gt_field="frames.detections",
    eval_key="eval",
)
Note that, in practice, the only difference is the "frames." prefix used to specify the predictions field and the ground truth field.
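As a quick sketch building on the evaluation above (eval_key="eval"), you can then use the sample-level summaries to rank videos and the frame-level fields to isolate individual false positive detections:

import fiftyone as fo
from fiftyone import ViewField as F

## rank videos by their total false positive count across all frames
worst_videos = dataset.sort_by("eval_fp", reverse=True)
print(worst_videos.values("eval_fp")[:5])

## keep only the predicted detections that were marked as false positives
fp_view = dataset.filter_labels("frames.predictions", F("eval") == "fp")

## inspect the false positives in the App
session = fo.launch_app(fp_view)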
Learn more about video views and evaluating videos in the FiftyOne Docs.
Managing multiple evaluations
With all of the flexibility the Evaluation API provides, you’d be well within reason to wonder which evaluation you should perform. Fortunately, FiftyOne makes it easy to perform multiple evaluations and to manage and store their results!
The results from each evaluation can be stored and accessed via an evaluation key, specified by the eval_key argument. This allows you to compare different evaluation methods on the same data,
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    method="coco",
    eval_key="coco_eval",
)

dataset.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    method="open-images",
    eval_key="oi_eval",
)
evaluate predictions generated by multiple models,
dataset.evaluate_detections(
    "model1_predictions",
    gt_field="ground_truth",
    eval_key="model1_eval",
)

dataset.evaluate_detections(
    "model2_predictions",
    gt_field="ground_truth",
    eval_key="model2_eval",
)
or compare evaluations on different subsets or views of your data, such as a view containing only small bounding boxes and a view containing only large bounding boxes:
from fiftyone import ViewField as F

bbox_area = F("bounding_box")[2] * F("bounding_box")[3]
large_boxes = bbox_area > 0.7
small_boxes = bbox_area < 0.3

# Create a view that contains only small-sized objects
small_view = dataset.filter_labels("ground_truth", small_boxes)

# Create a view that contains only large-sized objects
large_view = dataset.filter_labels("ground_truth", large_boxes)

small_view.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval_small",
)

large_view.evaluate_detections(
    "predictions",
    gt_field="ground_truth",
    eval_key="eval_large",
)
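Once you have run several evaluations, the eval keys can be listed, reloaded, and cleaned up later. Here is a minimal sketch using the dataset-level evaluation management methods (assuming the coco_eval and oi_eval runs above):

## list the evaluations that have been run on the dataset
print(dataset.list_evaluations())  # e.g. ['coco_eval', 'oi_eval', ...]

## reload the stored results for a previous evaluation
coco_results = dataset.load_evaluation_results("coco_eval")
coco_results.print_report()

## delete an evaluation and the fields it populated
dataset.delete_evaluation("oi_eval")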
Learn more about managing model evaluations in the FiftyOne Docs.
Join the FiftyOne community!
Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!
- 1,300+ FiftyOne Slack members
- 2,500+ stars on GitHub
- 2,900+ Meetup members
- Used by 241+ repositories
- 55+ contributors