Unified Model Insights with FiftyOne Model Evaluation Workflows
April 11, 2025 – Written by Nick Lotz
TL;DR: This post introduces FiftyOne’s Model Evaluation workflow, a unified interface to compute, visualize, and compare model metrics. Identify error cases, filter samples, and compare multiple model runs side by side, all within FiftyOne Enterprise.

Evaluating model performance remains an ongoing challenge in visual AI. Existing workflows often use a patchwork of scripts and custom tooling to measure performance and find errors. This friction often leads to outdated insights with unclear links to training data or operating conditions.
The result? Slower model improvement cycles and uncertainty about when a model is truly production-ready.
You need confidence that your models behave as expected. That confidence comes from evaluating performance, understanding the results, and identifying failure modes before they impact operations.
Today, we’re excited to announce FiftyOne’s Model Evaluation workflow. This feature introduces a unified interface for understanding model strengths and weaknesses. You can compute and visualize performance metrics, drill into specific error cases, and compare multiple models and runs side by side.
Running Model Evaluations Within FiftyOne
Model evaluation is baked into the same FiftyOne App where you explore and analyze your visual datasets. With just a few clicks, you can provide FiftyOne with your model’s predictions, the dataset’s ground truth labels, and a chosen evaluation method. FiftyOne will then run its evaluation API to compute industry-standard metrics like precision, recall, F1-score, and more.
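For reference, here is a minimal sketch of what this step looks like via FiftyOne’s Python evaluation API; the dataset name and the "predictions" and "ground_truth" field names are placeholders for your own data:

```python
import fiftyone as fo

# Placeholder dataset name; substitute one of your own datasets
dataset = fo.load_dataset("my-detection-dataset")

results = dataset.evaluate_detections(
    "predictions",            # field containing model predictions
    gt_field="ground_truth",  # field containing ground truth labels
    eval_key="eval",          # key under which per-object results are stored
    compute_mAP=True,
)

# Per-class precision, recall, and F1-score, plus overall mAP
results.print_report()
print("mAP:", results.mAP())
```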

FiftyOne provides a powerful Python interface to expose operations, workflows, and dashboards right in the App, and evaluation metrics can be customized to meet your needs. FiftyOne natively supports evaluating regression models, classifications, object detections, and semantic segmentations. You can also choose to define custom evaluation metrics for FiftyOne to calculate and report.
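As a rough sketch, each of these label types has its own evaluation method in the Python SDK (object detection was shown above); the dataset and field names below are purely illustrative:

```python
import fiftyone as fo

dataset = fo.load_dataset("my-dataset")  # placeholder dataset name

# Regression models: compare numeric predictions against ground truth values
reg_results = dataset.evaluate_regressions(
    "reg_predictions", gt_field="reg_ground_truth", eval_key="eval_reg"
)

# Classifications
clf_results = dataset.evaluate_classifications(
    "clf_predictions", gt_field="clf_ground_truth", eval_key="eval_clf"
)

# Semantic segmentations
seg_results = dataset.evaluate_segmentations(
    "seg_predictions", gt_field="seg_ground_truth", eval_key="eval_seg"
)
```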
When running the evaluation computation, you can choose to take advantage of FiftyOne’s built-in compute to schedule the task as a delegated operation. The job will run in the background while you continue your existing workstream uninterrupted.
Exploring Model Evaluation Results
FiftyOne also includes a dynamic interface to explore, compare, and troubleshoot evaluation results. The Model Evaluation panel leverages FiftyOne’s Spaces framework to report critical metrics juxtaposed with the samples contributing to those metrics.
Dynamic Filtering of Predictions and Results
Each metric can be visualized down to the class level when analyzing values like confidence threshold, precision, recall, and F1-score. The metrics are interactive and linked to your dataset. Click on any histogram bar, table entry, or confusion matrix cell to instantly filter to the relevant subset of samples. Teams can zero in on patterns without writing a single line of new code.
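The same drill-down can also be reproduced programmatically. For instance, assuming a detection evaluation was run with eval_key="eval", a sketch like the following filters to false-positive predictions; the dataset and field names are illustrative:

```python
import fiftyone as fo
from fiftyone import ViewField as F

dataset = fo.load_dataset("my-detection-dataset")  # placeholder name

# After evaluate_detections(..., eval_key="eval"), each predicted object is
# tagged "tp" or "fp"; clicking a cell in the panel corresponds to building
# a filtered view like this one
fp_view = dataset.filter_labels("predictions", F("eval") == "fp")

# Narrow further to a single class, e.g. false-positive "car" detections
fp_cars = dataset.filter_labels(
    "predictions", (F("eval") == "fp") & (F("label") == "car")
)

session = fo.launch_app(fp_cars)
```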

Side-By-Side Sample and Error Views
Understanding errors is much easier when you see what the model saw. Immediately identify discrepancies by viewing each sample’s predicted labels alongside its ground truth. False detections, missed detections, and classification errors show up clearly when you compare the expected label against the model’s label on the same sample.

Rather than knowing “we had 50 false negatives”, your team can step through those samples to see what went wrong in each case – be it an obscured object, odd camera angle, or incorrect annotation. Technical and non-technical stakeholders alike can quickly grasp the model’s mistakes – and successes – by visualizing ground truth and predictions together.
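As a hypothetical example, those false negatives can be pulled up as a view and browsed in the App; the dataset and field names below are placeholders:

```python
import fiftyone as fo
from fiftyone import ViewField as F

dataset = fo.load_dataset("my-detection-dataset")  # placeholder name

# Unmatched ground truth objects are marked "fn" under the eval key, so the
# "50 false negatives" become a concrete view you can step through sample by
# sample with the model's predictions overlaid
fn_view = dataset.filter_labels("ground_truth", F("eval") == "fn")

print(f"Samples with missed detections: {len(fn_view)}")
session = fo.launch_app(fn_view)
```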
Multi-Run Comparisons
It’s common for teams to train many model versions, and then decide which one to deploy. The Model Evaluation workflow lets you compare multiple model runs side by side in the UI. Select any two evaluation results (e.g. last week’s model vs. this week’s upgrade, or two entirely different model architectures), and instantly compare and contrast their metrics.

This comparison extends to data exploration. For example, if Model A outperforms Model B on detecting vehicles but not pedestrians, you can drill down to see the specific images Model B got wrong, and vice versa. Teams can understand the strengths and weaknesses of each model and collaborate on next steps, such as merging the best aspects of both models or choosing the one that performs better on the more critical class.
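Here is a sketch of how such a comparison might be set up programmatically, assuming both models’ predictions are stored on the dataset (the field, key, and class names are illustrative):

```python
import fiftyone as fo

dataset = fo.load_dataset("my-detection-dataset")  # placeholder name

# Evaluate both models' prediction fields under separate eval keys so their
# results can be compared side by side
results_a = dataset.evaluate_detections(
    "model_a_predictions", gt_field="ground_truth",
    eval_key="eval_a", compute_mAP=True,
)
results_b = dataset.evaluate_detections(
    "model_b_predictions", gt_field="ground_truth",
    eval_key="eval_b", compute_mAP=True,
)

print("Model A mAP:", results_a.mAP())
print("Model B mAP:", results_b.mAP())

# Per-class breakdown, e.g. to check the vehicles-vs-pedestrians tradeoff
results_a.print_report(classes=["vehicle", "pedestrian"])
results_b.print_report(classes=["vehicle", "pedestrian"])
```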
Efficient Team Collaboration
Users and groups with access to the dataset can see a common, consolidated view of model evaluations and results. Each evaluation has a toggleable review status to keep the team updated. And during or after the review process, each evaluation’s markdown notes section offers a means to summarize findings for your teammates and future self.

Get Started with the Model Evaluation Workflow
Model evaluation with FiftyOne helps you more quickly understand how your models perform – and why – so you can deploy improvements faster while maintaining visibility and control. The Model Evaluation workflow is available now with FiftyOne Enterprise. Check out our documentation for easy steps to get started.
Already an enterprise user? Upgrade to FiftyOne Enterprise 2.7.1 and give it a try!
Happy modeling! 🚀