Editor's note – This is the second post in the three-part series:
- Part 1 – Generate, load, and visualize YOLOv8 model predictions
- Part 2 – Evaluate YOLOv8 model predictions (this article)
- Part 3 – Fine-tune YOLOv8 models for custom computer vision applications
Evaluate YOLOv8 model predictions
Welcome to the second part in our three-part series on YOLOv8! In this series, we’ll show you how to work with YOLOv8, from downloading the off-the-shelf models to fine-tuning them for specific use cases, and everything in between.
Throughout the series, we will be using two libraries: FiftyOne, the open source computer vision toolkit, and Ultralytics, the library that will give us access to YOLOv8.
In Part 1, we generated, loaded, and visualized YOLOv8 model predictions. Here in Part 2, we’ll delve deeper into evaluating the quality of the YOLOv8n detection model’s predictions, from one-number metrics to class-wise performance and identifying edge cases.
This post is organized as follows:
- Part 1 recap
- Printing performance metrics
- Viewing concerning classes
- Finding poorly performing samples
Continue reading to learn how you can leverage FiftyOne to take a deeper look into YOLOv8’s predictions!
Part 1 recap
In Part 1, after importing the necessary modules,
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F
we loaded the validation split of the COCO 2017 dataset into FiftyOne with ground truth object detections. We then generated predictions with the YOLOv8n detection model from the Ultralytics YOLOv8 GitHub repository and added them to our dataset in the yolov8n label field on our samples.
When we left off, we had launched the FiftyOne App to visualize our images and predictions.
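For convenience, here is a condensed sketch of those Part 1 steps (the conversion of Ultralytics predictions into FiftyOne Detections is covered in detail in Part 1 and omitted here):

# Load the COCO 2017 validation split, including ground truth detections
dataset = foz.load_zoo_dataset("coco-2017", split="validation")

# ...generate YOLOv8n predictions and store them in a "yolov8n" label field
# on each sample (see Part 1 for the full conversion code)...

# Launch the App to visualize images, ground truth, and predictions
session = fo.launch_app(dataset)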
Printing YOLOv8 model performance metrics
Now that we have YOLOv8 predictions loaded onto the images in our dataset from Part 1, we can evaluate the quality of these predictions using FiftyOne’s Evaluation API.
To evaluate the object detections in the yolov8n field relative to the ground_truth detections field, we can run:
detection_results = dataset.evaluate_detections(
    "yolov8n",
    eval_key="eval",
    compute_mAP=True,
    gt_field="ground_truth",
)
We can then get the mean average precision (mAP) of the model’s predictions:
mAP = detection_results.mAP()
print("mAP = {}".format(mAP))
mAP = 0.3121319189417518
We can also look at the model’s performance on the 20 most common object classes in the dataset, where it has seen the most examples and the statistics are therefore most meaningful:
counts = dataset.count_values("ground_truth.detections.label")
top20_classes = sorted(counts, key=counts.get, reverse=True)[:20]
detection_results.print_report(classes=top20_classes)
               precision    recall  f1-score   support

       person       0.85      0.68      0.76     11573
          car       0.71      0.52      0.60      1971
        chair       0.62      0.34      0.44      1806
         book       0.61      0.12      0.20      1182
       bottle       0.68      0.39      0.50      1051
          cup       0.61      0.44      0.51       907
 dining table       0.54      0.42      0.47       697
traffic light       0.66      0.36      0.46       638
         bowl       0.63      0.49      0.55       636
      handbag       0.48      0.12      0.19       540
         bird       0.79      0.39      0.52       451
         boat       0.58      0.29      0.39       430
        truck       0.57      0.35      0.44       415
        bench       0.58      0.27      0.37       413
     umbrella       0.65      0.52      0.58       423
          cow       0.81      0.61      0.70       397
       banana       0.68      0.34      0.45       397
       carrot       0.56      0.29      0.38       384
   motorcycle       0.77      0.58      0.66       379
     backpack       0.51      0.16      0.24       371

    micro avg       0.76      0.52      0.61     25061
    macro avg       0.64      0.38      0.47     25061
 weighted avg       0.74      0.52      0.60     25061
Viewing concerning classes
From the output of the print_report() call above, we can see that this model performs decently well, but it certainly has its limitations. While its precision is relatively good on average, its recall is lacking. This is especially pronounced for certain classes like the book class.
Fortunately, we can dig deeper into these results with FiftyOne. Using the FiftyOne App, we can, for instance, filter by class for both ground truth and predicted detections so that only book detections appear in the samples.
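For reference, here is a minimal sketch of how the same filtered view could be built programmatically with filter_labels; only_matches=False keeps samples whose ground truth contains books even if the model predicted none, so missed detections remain visible:

book_view = (
    dataset
    .filter_labels("ground_truth", F("label") == "book")
    .filter_labels("yolov8n", F("label") == "book", only_matches=False)
)
session.view = book_view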
Scrolling through the samples in the sample grid, we can see that a lot of the time, COCO’s purported ground truth labels for the book class appear to be imperfect. Sometimes individual books are bounded, other times rows or whole bookshelves are encompassed in a single box, and yet other times books are entirely unlabeled. Unless our desired computer vision application specifically requires good book detection, this should probably not be a point of concern when we are assessing the quality of the model. After all, the quality of a model is limited by the quality of the data it is trained on – this is why data-centric approaches to computer vision are so important!
For other classes like bird, however, there do appear to be real challenges. One way to see this is to filter for bird ground truth detections and then convert to an EvaluationPatchesView. Some of these recall errors appear to be related to small objects, where the resolution is poor.
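Roughly, that looks like the following (a sketch; to_evaluation_patches() uses the eval_key from the evaluation above and yields one patch per true positive, false positive, and false negative):

bird_eval_patches = (
    dataset
    .filter_labels("ground_truth", F("label") == "bird")
    .to_evaluation_patches("eval")
)
session.view = bird_eval_patches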
In other cases though, quick inspection confirms that the object is clearly a bird. This means that there is likely room for improvement.
Finding poorly performing samples
Beyond summary statistics, visualizing model predictions in FiftyOne allows for deeper exploration, such as looking at samples that contain the highest number of false negative detections:
fn_view = dataset.sort_by("eval_fn", reverse=True)
session = fo.launch_app(fn_view)
Looking at these samples, it is immediately evident that they are crowded scenes, so there are many detections that can potentially be missed in prediction. Instead, it might be more useful to sort by something like precision, which does not have the same dependence on the total number of objects in the image. We can compute the precision for each image using its true positive and false positive counts:
import numpy as np

non_empty_view = dataset.match(F("eval_tp") > 0)

## get true and false positive counts by image
tp_vals = np.array(non_empty_view.values("eval_tp"))
fp_vals = np.array(non_empty_view.values("eval_fp"))

## compute precision by image
precision_vals = tp_vals / (tp_vals + fp_vals)

## set precision values
non_empty_view.set_values("precision", precision_vals)

## get lowest precision images
low_precision_view = non_empty_view.sort_by("precision")
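We can then inspect the lowest-precision samples in the App, for example by reusing the session from earlier:

session.view = low_precision_view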
Conclusion
In this article, we demonstrated how to evaluate a YOLOv8 model’s performance on your data. One-number metrics like mAP provide a good starting point for evaluating model quality, but they fail to give a complete picture of the model’s effectiveness on a class-by-class or image-by-image basis. By looking at individual samples, we can build an intuition for why a model fails or succeeds.
In Part 3, we’ll push the story forward and fine-tune our YOLOv8 detection model to detect birds.
Finish with Part 3!