Editor's note – This is the second post in a three-part series on YOLOv8.
Evaluate YOLOv8 model predictions
Welcome to the second part in our three-part series on YOLOv8! In this series, we'll show you how to work with YOLOv8, from downloading the off-the-shelf models, to fine-tuning these models for specific use cases, and everything in between.
Throughout the series, we will be using two libraries:
FiftyOne, the open source computer vision toolkit, and
Ultralytics, the library that will give us access to YOLOv8.
In Part 1, we generated, loaded, and visualized YOLOv8 model predictions. Here in Part 2, we'll delve deeper into evaluating the quality of the YOLOv8n detection model's predictions, from one-number metrics to class-wise performance and identifying edge cases.
This post is organized as follows: we start with a recap of Part 1, then print YOLOv8 model performance metrics, view concerning classes, find poorly performing samples, and wrap up with a conclusion.
Continue reading to learn how you can leverage FiftyOne to take a deeper look into YOLOv8’s predictions!
Part 1 recap
In Part 1, after importing the necessary modules, we loaded the validation split of the COCO 2017 dataset into FiftyOne with ground truth object detections. We then generated predictions with the YOLOv8n detection model from the Ultralytics YOLOv8 GitHub and added them to our dataset in the yolov8n label field on our samples.

When we left off, we had launched the FiftyOne App to visualize our images and predictions.
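For context, a minimal sketch of that setup is shown below. It assumes you are working in Python with fiftyone installed; the exact Part 1 code, including the Ultralytics inference that produced the yolov8n predictions, is not repeated here.

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Load the COCO 2017 validation split with ground truth detections
dataset = foz.load_zoo_dataset("coco-2017", split="validation")

# In Part 1, YOLOv8n predictions were added to each sample in the
# `yolov8n` label field (see Part 1 for the Ultralytics inference code)

# Launch the FiftyOne App to browse images and predictions
session = fo.launch_app(dataset)
```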
Printing YOLOv8 model performance metrics
Now that we have YOLOv8 predictions loaded onto the images in our dataset from Part 1, we can evaluate the quality of these predictions using FiftyOne's Evaluation API.

To evaluate the object detections in the yolov8n field relative to the ground_truth detections field, we can run:
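A minimal sketch of that call, using FiftyOne's evaluate_detections() method and assuming the predictions live in the yolov8n field from Part 1, looks like the following; the eval_key determines where per-sample results are stored.

```python
# Match YOLOv8n predictions to COCO ground truth boxes and compute mAP
results = dataset.evaluate_detections(
    "yolov8n",               # predictions field from Part 1
    gt_field="ground_truth", # ground truth detections field
    eval_key="eval",         # per-sample TP/FP/FN counts are stored under this key
    compute_mAP=True,
)

print(f"mAP = {results.mAP()}")
```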
mAP = 0.3121319189417518
We can also look at the model's performance on the 20 most common object classes in the dataset, where it has seen the most examples, so the statistics are most meaningful:
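One way to do this (a sketch, assuming the results object from the evaluation above) is to count the ground truth labels, take the 20 most common classes, and pass them to print_report():

```python
# Count ground truth labels and keep the 20 most common classes
counts = dataset.count_values("ground_truth.detections.label")
top20_classes = sorted(counts, key=counts.get, reverse=True)[:20]

# Per-class precision/recall/f1 for those classes
results.print_report(classes=top20_classes)
```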
precision recall f1-score support
person 0.85 0.68 0.76 11573
car 0.71 0.52 0.60 1971
chair 0.62 0.34 0.44 1806
book 0.61 0.12 0.20 1182
bottle 0.68 0.39 0.50 1051
cup 0.61 0.44 0.51 907
dining table 0.54 0.42 0.47 697
traffic light 0.66 0.36 0.46 638
bowl 0.63 0.49 0.55 636
handbag 0.48 0.12 0.19 540
bird 0.79 0.39 0.52 451
boat 0.58 0.29 0.39 430
truck 0.57 0.35 0.44 415
bench 0.58 0.27 0.37 413
umbrella 0.65 0.52 0.58 423
cow 0.81 0.61 0.70 397
banana 0.68 0.34 0.45 397
carrot 0.56 0.29 0.38 384
motorcycle 0.77 0.58 0.66 379
backpack 0.51 0.16 0.24 371
micro avg 0.76 0.52 0.61 25061
macro avg 0.64 0.38 0.47 25061
weighted avg 0.74 0.52 0.60 25061
Viewing concerning classes
From the output of the print_report() call above, we can see that this model performs decently well, but certainly has its limitations. While its precision is relatively good on average, it is lacking when it comes to recall. This is especially pronounced for certain classes like the book class.
Fortunately, we can dig deeper into these results with FiftyOne. Using the FiftyOne App, we can, for instance, filter by class for both ground truth and predicted detections so that only book detections appear in the samples.
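The same filtering can also be expressed programmatically as a view; a minimal sketch, assuming the dataset and session from the earlier snippets, is:

```python
from fiftyone import ViewField as F

# Keep only `book` ground truth and predicted detections
book_view = dataset.filter_labels(
    "ground_truth", F("label") == "book"
).filter_labels(
    "yolov8n", F("label") == "book", only_matches=False
)

session.view = book_view
```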
Scrolling through the samples in the sample grid, we can see that a lot of the time, COCO's purported ground truth labels for the book class appear to be imperfect. Sometimes, individual books are bounded, other times rows or whole bookshelves are encompassed in a single box, and yet other times books are entirely unlabeled. Unless our desired computer vision application specifically requires good book detection, this should probably not be a point of concern when we are assessing the quality of the model. After all, the quality of a model is limited by the quality of the data it is trained on - this is why data-centric approaches to computer vision are so important!
For other classes like the bird class, however, there appear to be challenges. One way to see this is to filter for bird ground truth detections and then convert to an EvaluationPatchesView. Some of these recall errors appear to be related to small features, where the resolution is poor.
In other cases though, quick inspection confirms that the object is clearly a bird. This means that there is likely room for improvement.
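A sketch of that filter-and-convert workflow, assuming the eval key from the evaluation run above, might look like:

```python
from fiftyone import ViewField as F

# Restrict to `bird` ground truth detections
bird_view = dataset.filter_labels("ground_truth", F("label") == "bird")

# Convert to an evaluation patches view of TP/FP/FN patches for inspection
bird_patches = bird_view.to_evaluation_patches("eval")

session.view = bird_patches
```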
Finding poorly performing samples
Beyond summary statistics, visualizing model predictions in FiftyOne allows for deeper exploration, such as looking at samples that contain the highest number of false negative detections:
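Since we passed an eval_key when evaluating, each sample carries eval_tp, eval_fp, and eval_fn counts, so a sketch of this sort looks like:

```python
# Samples with the most missed (false negative) detections first
fn_view = dataset.sort_by("eval_fn", reverse=True)

session.view = fn_view
```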
Looking at these samples, it is immediately evident that they are crowded scenes, so there are many detections that could potentially be missed in prediction. Instead, it might be more useful to sort by something like precision, which does not have the same dependence on the total number of objects in the image. We can compute the precision for each image from its numbers of true positive and false positive detections:
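A sketch of this, again leaning on the per-sample eval_tp and eval_fp counts, is to sort by a precision expression directly (skipping images with no predicted objects to avoid dividing by zero):

```python
from fiftyone import ViewField as F

# Per-image precision = TP / (TP + FP)
precision_expr = F("eval_tp") / (F("eval_tp") + F("eval_fp"))

# Surface the lowest-precision samples first
precision_view = (
    dataset
    .match(F("eval_tp") + F("eval_fp") > 0)
    .sort_by(precision_expr)
)

session.view = precision_view
```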
Conclusion
In this article, we demonstrated how to evaluate a YOLOv8 model's performance on your data. One-number metrics like mAP provide a good starting point for evaluating model quality, but fail to give a complete picture of the model's effectiveness on a class-by-class or image-by-image basis. By looking at individual samples, we can build an intuition for why a model fails or succeeds.
In Part 3, we'll push the story forward and fine-tune our YOLOv8 detection model to detect birds.
Join the FiftyOne community!
Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!