Editor's note – This is the second post in the three-part series:
- Part 1 – Generate, load, and visualize YOLOv8 model predictions
- Part 2 – Evaluate YOLOv8 model predictions (this article)
- Part 3 – Fine-tune YOLOv8 models for custom computer vision applications
Evaluate YOLOv8 model predictions
Welcome to the second part in our three-part series on YOLOv8! In this series, we’ll show you how to work with YOLOv8, from downloading the off-the-shelf models to fine-tuning them for specific use cases, and everything in between.
Throughout the series, we will be using two libraries: FiftyOne, the open source computer vision toolkit, and Ultralytics, the library that will give us access to YOLOv8.
In Part 1, we generated, loaded, and visualized YOLOv8 model predictions. Here in Part 2, we’ll delve deeper into evaluating the quality of the YOLOv8n detection model’s predictions, from one-number metrics to class-wise performance and identifying edge cases.
This post is organized as follows:
- Part 1 recap
- Printing performance metrics
- Viewing concerning classes
- Finding poorly performing samples
Continue reading to learn how you can leverage FiftyOne to take a deeper look into YOLOv8’s predictions!
Part 1 recap
In Part 1, after importing the necessary modules,
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F
we loaded the validation split of the COCO 2017 dataset into FiftyOne with ground truth object detections. We then generated predictions with the YOLOv8n detection model from the Ultralytics YOLOv8 GitHub repository and added them to our dataset in the yolov8n label field on our samples.
When we left off, we had launched the FiftyOne App to visualize our images and predictions.
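For convenience, here is a condensed sketch of those Part 1 steps (the conversion of Ultralytics predictions into FiftyOne Detections is covered in detail in Part 1 and omitted here):

# Load the COCO 2017 validation split, including ground truth detections
dataset = foz.load_zoo_dataset("coco-2017", split="validation")

# ...generate YOLOv8n predictions and store them in a "yolov8n" label field
# on each sample (see Part 1 for the full conversion code)...

# Launch the App to visualize images, ground truth, and predictions
session = fo.launch_app(dataset)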
Printing YOLOv8 model performance metrics
Now that we have YOLOv8 predictions loaded onto the images in our dataset from Part 1, we can evaluate the quality of these predictions using FiftyOne’s Evaluation API.
To evaluate the object detections in the yolov8n field relative to the ground_truth detections field, we can run:
detection_results = dataset.evaluate_detections(
    "yolov8n",
    eval_key="eval",
    compute_mAP=True,
    gt_field="ground_truth",
)
We can then get the mean average precision (mAP) of the model’s predictions:
mAP = detection_results.mAP()
print("mAP = {}".format(mAP))
mAP = 0.3121319189417518
We can also look at the model’s performance on the 20 most common object classes in the dataset, where it has seen the most examples and the statistics are therefore most meaningful:
counts = dataset.count_values("ground_truth.detections.label")
top20_classes = sorted(counts, key=counts.get, reverse=True)[:20]
detection_results.print_report(classes=top20_classes)
               precision    recall  f1-score   support

       person       0.85      0.68      0.76     11573
          car       0.71      0.52      0.60      1971
        chair       0.62      0.34      0.44      1806
         book       0.61      0.12      0.20      1182
       bottle       0.68      0.39      0.50      1051
          cup       0.61      0.44      0.51       907
 dining table       0.54      0.42      0.47       697
traffic light       0.66      0.36      0.46       638
         bowl       0.63      0.49      0.55       636
      handbag       0.48      0.12      0.19       540
         bird       0.79      0.39      0.52       451
         boat       0.58      0.29      0.39       430
        truck       0.57      0.35      0.44       415
        bench       0.58      0.27      0.37       413
     umbrella       0.65      0.52      0.58       423
          cow       0.81      0.61      0.70       397
       banana       0.68      0.34      0.45       397
       carrot       0.56      0.29      0.38       384
   motorcycle       0.77      0.58      0.66       379
     backpack       0.51      0.16      0.24       371

    micro avg       0.76      0.52      0.61     25061
    macro avg       0.64      0.38      0.47     25061
 weighted avg       0.74      0.52      0.60     25061
Viewing concerning classes
From the output of the print_report() call above, we can see that this model performs decently well, but it certainly has its limitations. While its precision is relatively good on average, its recall is lacking. This is especially pronounced for certain classes like the book class.
Fortunately, we can dig deeper into these results with FiftyOne. Using the FiftyOne App, we can, for instance, filter by class for both ground truth and predicted detections so that only book detections appear in the samples.
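For reference, here is a minimal sketch of how the same filtered view could be built programmatically with filter_labels; only_matches=False keeps samples whose ground truth contains books even if the model predicted none, so missed detections remain visible:

book_view = (
    dataset
    .filter_labels("ground_truth", F("label") == "book")
    .filter_labels("yolov8n", F("label") == "book", only_matches=False)
)
session.view = book_view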
Scrolling through the samples in the sample grid, we can see that a lot of the time, COCO’s purported ground truth labels for the book class appear to be imperfect. Sometimes individual books are bounded, other times rows or whole bookshelves are encompassed in a single box, and yet other times books are entirely unlabeled. Unless our desired computer vision application specifically requires good book detection, this should probably not be a point of concern when we are assessing the quality of the model. After all, the quality of a model is limited by the quality of the data it is trained on – this is why data-centric approaches to computer vision are so important!
For other classes like bird, however, there do appear to be real challenges. One way to see this is to filter for bird ground truth detections and then convert to an EvaluationPatchesView. Some of these recall errors appear to be related to small objects, where the resolution is poor.
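Roughly, that looks like the following (a sketch; to_evaluation_patches() uses the eval_key from the evaluation above and yields one patch per true positive, false positive, and false negative):

bird_eval_patches = (
    dataset
    .filter_labels("ground_truth", F("label") == "bird")
    .to_evaluation_patches("eval")
)
session.view = bird_eval_patches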
In other cases though, quick inspection confirms that the object is clearly a bird. This means that there is likely room for improvement.
Finding poorly performing samples
Beyond summary statistics, visualizing model predictions in FiftyOne allows for deeper exploration, such as looking at samples that contain the highest number of false negative detections:
fn_view = dataset.sort_by("eval_fn", reverse=True)
session = fo.launch_app(fn_view)
Looking at these samples, it is immediately evident that they are crowded scenes, so there are many detections that can potentially be missed in prediction. Instead, it might be more useful to sort by something like precision, which does not have the same dependence on the total number of objects in the image. We can compute the precision for each image using its true positive and false positive counts:
import numpy as np

non_empty_view = dataset.match(F("eval_tp") > 0)

## get true and false positive counts by image
tp_vals = np.array(non_empty_view.values("eval_tp"))
fp_vals = np.array(non_empty_view.values("eval_fp"))

## compute precision by image
precision_vals = tp_vals / (tp_vals + fp_vals)

## set precision values
non_empty_view.set_values("precision", precision_vals)

## get lowest precision images
low_precision_view = non_empty_view.sort_by("precision")
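We can then inspect the lowest-precision samples in the App, for example by reusing the session from earlier:

session.view = low_precision_view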
Conclusion
In this article, we demonstrated how to evaluate a YOLOv8 model’s performance on your data. One-number metrics like mAP provide a good starting point for evaluating model quality, but they fail to give a complete picture of the model’s effectiveness on a class-by-class or image-by-image basis. By looking at individual samples, we can build an intuition for why a model fails or succeeds.
In Part 3, we’ll push the story forward and fine-tune our YOLOv8 detection model to detect birds.
Finish with Part 3!