Giving YOLOv8 a Second Look (Part 2)

February 21, 2023 – Written by Jacob Marks


Editor’s note: This is the second post in a three-part YOLOv8 tutorial series.

Evaluate YOLOv8 model predictions

Welcome to the second part in our three-part series on YOLOv8! In this series, we’ll show you how to work with YOLOv8, from downloading the off-the-shelf models, to fine-tuning these models for specific use cases, and everything in between.

Throughout the series, we will be using two libraries: FiftyOne, the open source computer vision toolkit, and Ultralytics, the library that will give us access to YOLOv8.

In Part 1, we generated, loaded, and visualized YOLOv8 model predictions. Here in Part 2, we’ll delve deeper into evaluating the quality of the YOLOv8n detection model’s predictions, from one-number metrics to class-wise performance and identifying edge cases.

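This post covers a recap of Part 1, printing YOLOv8 model performance metrics, viewing concerning classes, and finding poorly performing samples.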

Continue reading to learn how you can leverage FiftyOne to take a deeper look into YOLOv8’s predictions!

Part 1 recap

In Part 1, after importing the necessary modules,

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

we loaded the validation split of the COCO 2017 dataset into FiftyOne with ground truth object detections. We then generated predictions with the YOLOv8n detection model from the Ultralytics YOLOv8 GitHub repository and added them to our dataset in the yolov8n label field on our samples.

When we left off, we had launched the FiftyOne App to visualize our images and predictions.

High confidence YOLOv8n predictions for COCO validation images.

Printing YOLOv8 model performance metrics

Now that we have YOLOv8 predictions loaded onto the images in our dataset from Part 1, we can evaluate the quality of these predictions using FiftyOne’s Evaluation API.

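To evaluate the object detections in the yolov8n field relative to the ground_truth detections field, we can run the following, which stores per-sample results under the eval key and computes mAP:

detection_results = dataset.evaluate_detections(
    "yolov8n", gt_field="ground_truth", eval_key="eval", compute_mAP=True
)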
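For intuition about what a detection evaluation does under the hood, here is a minimal pure-Python sketch of matching predictions to ground truth by intersection over union (IoU). It is not FiftyOne's implementation: the function names, the greedy strategy, and the 0.5 threshold are illustrative assumptions, and COCO-style evaluation additionally handles details such as crowd annotations. Boxes are assumed to be in [x, y, width, height] format.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def match_detections(gt, preds, iou_thresh=0.5):
    """Greedy matching for one image; returns (tp, fp, fn) counts.

    gt:    list of (label, box) pairs
    preds: list of (label, box, confidence) triples
    """
    unmatched = list(range(len(gt)))
    tp = fp = 0
    # Visit predictions from most to least confident
    for label, box, _conf in sorted(preds, key=lambda p: p[2], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx in unmatched:
            gt_label, gt_box = gt[idx]
            if gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap >= iou_thresh and overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is None:
            fp += 1  # no same-class ground truth object overlaps enough
        else:
            tp += 1
            unmatched.remove(best_idx)
    # Ground truth objects left unmatched are false negatives
    return tp, fp, len(unmatched)


# Hypothetical toy image with two ground truth birds and two predictions
gt = [("bird", [0.10, 0.10, 0.20, 0.20]), ("bird", [0.60, 0.60, 0.10, 0.10])]
preds = [
    ("bird", [0.11, 0.10, 0.20, 0.20], 0.9),  # overlaps the first bird
    ("bird", [0.30, 0.30, 0.10, 0.10], 0.4),  # overlaps nothing
]
tp, fp, fn = match_detections(gt, preds)
print(tp, fp, fn)
```

In this toy image, one prediction matches a bird (true positive), one matches nothing (false positive), and one bird is missed (false negative). Per-image counts of this kind are what the later sections sort and filter on.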

We can then get the mean average precision (mAP) of the model’s predictions:

mAP = detection_results.mAP()
print("mAP = {}".format(mAP))

mAP = 0.3121319189417518
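To make the mAP number less opaque, the sketch below computes average precision for a single class as the area under an interpolated precision-recall curve, then averages over classes. This is a simplified illustration, not the exact COCO protocol: COCO-style mAP also averages over IoU thresholds from 0.50 to 0.95 and samples the curve at fixed recall levels. The function names and toy data are hypothetical.

```python
def average_precision(scored_matches, num_gt):
    """Area under the interpolated precision-recall curve for one class.

    scored_matches: list of (confidence, is_true_positive) pairs
    num_gt:         number of ground truth objects of this class
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_matches, key=lambda m: m[0], reverse=True)
    tp = fp = 0
    points = []  # (recall, precision) after each ranked prediction
    for _conf, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))

    # Interpolate: precision at a recall level becomes the best precision
    # achieved at that recall level or any higher one
    interpolated = []
    running_max = 0.0
    for recall, precision in reversed(points):
        running_max = max(running_max, precision)
        interpolated.append((recall, running_max))
    interpolated.reverse()

    # Sum the rectangles under the interpolated curve
    ap, prev_recall = 0.0, 0.0
    for recall, precision in interpolated:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class maps label -> (scored_matches, num_gt)."""
    aps = [average_precision(m, n) for m, n in per_class.values()]
    return sum(aps) / len(aps)


# Hypothetical example: 4 predictions for a class with 3 ground truth objects
matches = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
ap = average_precision(matches, num_gt=3)
print(round(ap, 4))
```

Because one of the three objects is never detected, recall tops out at 2/3, which caps the area under the curve no matter how confident the correct predictions are.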

We can also look at the model’s performance on the 20 most common object classes in the dataset, where it has seen the most examples so the statistics are most meaningful:

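counts = dataset.count_values("ground_truth.detections.label")

# the 20 classes with the most ground truth objects
top20_classes = sorted(counts, key=counts.get, reverse=True)[:20]

detection_results.print_report(classes=top20_classes)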

               precision    recall  f1-score   support

       person       0.85      0.68      0.76     11573
          car       0.71      0.52      0.60      1971
        chair       0.62      0.34      0.44      1806
         book       0.61      0.12      0.20      1182
       bottle       0.68      0.39      0.50      1051
          cup       0.61      0.44      0.51       907
 dining table       0.54      0.42      0.47       697
traffic light       0.66      0.36      0.46       638
         bowl       0.63      0.49      0.55       636
      handbag       0.48      0.12      0.19       540
         bird       0.79      0.39      0.52       451
         boat       0.58      0.29      0.39       430
        truck       0.57      0.35      0.44       415
        bench       0.58      0.27      0.37       413
     umbrella       0.65      0.52      0.58       423
          cow       0.81      0.61      0.70       397
       banana       0.68      0.34      0.45       397
       carrot       0.56      0.29      0.38       384
   motorcycle       0.77      0.58      0.66       379
     backpack       0.51      0.16      0.24       371

    micro avg       0.76      0.52      0.61     25061
    macro avg       0.64      0.38      0.47     25061
 weighted avg       0.74      0.52      0.60     25061
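As a reading aid for the report, here is a small sketch of how precision, recall, and F1 follow from true positive, false positive, and false negative counts, and how micro and macro averages differ. The per-class counts are hypothetical, chosen only for illustration; the final two lines check the F1 formula against the person and book rows of the report.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical (tp, fp, fn) counts; support is tp + fn, the number of
# ground truth objects in each class
counts = {"person": (80, 20, 20), "book": (2, 2, 18)}
per_class = {label: prf(*c) for label, c in counts.items()}

# Macro average: every class counts equally
macro_recall = sum(r for _p, r, _f in per_class.values()) / len(per_class)

# Micro average: pool the counts first, so large classes dominate
pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
micro_precision, micro_recall, micro_f1 = prf(*pooled)

print(round(macro_recall, 2), round(micro_recall, 2))

# Sanity check against the report: F1 from the listed precision and recall
print(round(f1_from(0.85, 0.68), 2))  # person row
print(round(f1_from(0.61, 0.12), 2))  # book row
```

This is why the micro average in the report sits close to the person row, which accounts for nearly half of the support, while the macro average is pulled down by low-recall classes such as book and handbag.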

Viewing concerning classes

From the output of the print_report() call above, we can see that this model performs decently well, but certainly has its limitations. While its precision is relatively good on average, it is lacking when it comes to recall. This is especially pronounced for certain classes like the book class, where recall is only 0.12.

Fortunately, we can dig deeper into these results with FiftyOne. Using the FiftyOne App, we can, for instance, filter by class for both ground truth and predicted detections so that only book detections appear in the samples.

Modal with an image from the COCO validation split showing ground truth labels and YOLOv8n predictions for only books.

Scrolling through the samples in the sample grid, we can see that a lot of the time, COCO’s purported ground truth labels for the book class appear to be imperfect. Sometimes, individual books are bounded, other times rows or whole bookshelves are encompassed in a single box, and yet other times books are entirely unlabeled. Unless our desired computer vision application specifically requires good book detection, this should probably not be a point of concern when we are assessing the quality of the model. After all, the quality of a model is limited by the quality of the data it is trained on – this is why data-centric approaches to computer vision are so important!

For other classes like the bird class, however, the low recall (0.39) appears to reflect genuine model errors rather than labeling problems. One way to see this is to filter for bird ground truth detections and then convert to an EvaluationPatchesView. Some of these recall errors appear to be related to small features, where the resolution is poor.

In other cases though, quick inspection confirms that the object is clearly a bird. This means that there is likely room for improvement.
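The same filtering can also be done in code rather than in the App. The sketch below wraps FiftyOne's filter_labels() and to_evaluation_patches() view methods in a hypothetical helper, class_eval_patches. The eval_key default of "eval" is an assumption that matches the eval_tp and eval_fp fields used later in this post; adjust it if your evaluation was stored under a different key.

```python
def class_eval_patches(dataset, label, gt_field="ground_truth", eval_key="eval"):
    """Return evaluation patches for samples filtered to one ground truth class.

    Hypothetical helper; assumes evaluate_detections() has already been
    run on the dataset with the given eval_key.
    """
    # Imported lazily so the helper can be defined without FiftyOne installed
    from fiftyone import ViewField as F

    # Keep only ground truth objects of the requested class
    view = dataset.filter_labels(gt_field, F("label") == label)

    # One patch per true positive, false positive, or false negative
    return view.to_evaluation_patches(eval_key)


# Example usage (requires the dataset and session from earlier in the post):
# bird_patches = class_eval_patches(dataset, "bird")
# print(bird_patches.count_values("type"))  # counts of tp / fp / fn patches
# session.view = bird_patches
```

Since only the ground truth field is filtered here, predictions of other classes can still show up as false positive patches; filtering the prediction field in the same way would restrict the view to bird predictions as well.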

Evaluation patches for YOLOv8n predictions on ground truth bird detections in the COCO validation split.

Finding poorly performing samples

Beyond summary statistics, visualizing model predictions in FiftyOne allows for deeper exploration, such as looking at samples that contain the highest number of false negative detections:

fn_view = dataset.sort_by("eval_fn", reverse=True)
session = fo.launch_app(fn_view)

COCO validation images that have the highest number of false negative detections by YOLOv8n object detection model.

Looking at these samples, it is immediately evident that they are crowded, so there are many objects that the model could miss. It might therefore be more useful to sort by something like precision, which does not have the same dependence on the total number of objects in the image. We can compute the per-image precision from the number of true positives and false positives in each image:

import numpy as np

# keep only images with at least one true positive
non_empty_view = dataset.match(F("eval_tp") > 0)

# get true and false positive counts by image
tp_vals = np.array(non_empty_view.values("eval_tp"))
fp_vals = np.array(non_empty_view.values("eval_fp"))

# compute precision by image
precision_vals = tp_vals / (tp_vals + fp_vals)

# store the precision values on the samples as plain Python floats
non_empty_view.set_values("precision", precision_vals.tolist())

# get lowest precision images
low_precision_view = non_empty_view.sort_by("precision")

COCO validation set images sorted according to the precision of YOLOv8n detection predictions, with lowest precision images first.
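To see why the two orderings surface different images, consider this toy sketch with made-up per-image counts (the file names and numbers are hypothetical). Sorting by false negative count puts the crowded image first, while sorting by precision puts the image where most predictions are wrong first.

```python
# Hypothetical per-image evaluation counts
images = {
    "crowded_street.jpg": {"tp": 30, "fp": 3, "fn": 12},
    "two_birds.jpg": {"tp": 1, "fp": 3, "fn": 1},
    "kitchen.jpg": {"tp": 4, "fp": 1, "fn": 2},
}


def precision(c):
    """Fraction of an image's predictions that are correct."""
    return c["tp"] / (c["tp"] + c["fp"])


# Most false negatives first: dominated by how many objects the image has
by_fn = sorted(images, key=lambda name: images[name]["fn"], reverse=True)

# Lowest precision first: normalized by the number of predictions
by_precision = sorted(images, key=lambda name: precision(images[name]))

print(by_fn)
print(by_precision)
```

The crowded image tops the false negative ranking even though about 91% of its predictions are correct, whereas the precision ranking leads with the image where three of four predictions are wrong.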


Conclusion

In this article, we demonstrated how to evaluate a YOLOv8 model’s performance on your data. One-number metrics like mAP provide a good starting point for evaluating model quality, but fail to give a complete picture of the model’s effectiveness on a class-by-class or image-by-image basis. By looking at individual samples, we can build an intuition for why a model fails or succeeds.

In Part 3, we’ll push the story forward and fine-tune our YOLOv8 detection model to detect birds.

Finish with Part 3!

Join the FiftyOne community!

Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!