Giving YOLOv8 a Second Look (Part 3)
Feb 22, 2023
15 min read
Editor’s note – This is the third and final article in our three-part series on YOLOv8.

Fine-tune YOLOv8 models for custom computer vision applications

Welcome to the third and final installment in our three-part series on YOLOv8! In this series, we’ll show you how to work with YOLOv8, from downloading the off-the-shelf models, to fine-tuning these models for specific use cases, and everything in between.
Throughout the series, we will be using two libraries: FiftyOne, the open source computer vision toolkit, and Ultralytics, the library that will give us access to YOLOv8.
Here in Part 3, we’ll demonstrate how to fine-tune a YOLOv8 model for your specific use case.
This post is organized as follows:
  • Parts 1 and 2 recap
  • Detecting birds with YOLOv8
  • Choosing the training data for your YOLOv8 model
  • Fine-tuning YOLOv8 for a custom use case
  • Assessing YOLOv8 model performance improvement
  • Conclusion
Continue reading to learn how you can incorporate YOLOv8 models into your computer vision workflows!

Parts 1 and 2 recap

In Part 1, we loaded the validation split of the COCO 2017 dataset into FiftyOne with ground truth object detections. We then generated predictions with the YOLOv8n detection model from the Ultralytics YOLOv8 GitHub repository and added them to our dataset in the yolov8n label field on our samples.
In Part 2, we used FiftyOne’s Evaluation API to evaluate the quality of these detections. We found that this base model had relatively good precision, but struggled with recall for certain classes. When we dug into the details, we saw that for certain classes like the book class, imperfections in the COCO ground truth labels were at least part of the problem, but for other classes like bird, the model could benefit from fine-tuning on a broader set of examples.

Detecting birds with YOLOv8

               precision    recall  f1-score   support

       person       0.85      0.68      0.76     11573
          car       0.71      0.52      0.60      1971
        chair       0.62      0.34      0.44      1806
         book       0.61      0.12      0.20      1182
       bottle       0.68      0.39      0.50      1051
          cup       0.61      0.44      0.51       907
 dining table       0.54      0.42      0.47       697
traffic light       0.66      0.36      0.46       638
         bowl       0.63      0.49      0.55       636
      handbag       0.48      0.12      0.19       540
         bird       0.79      0.39      0.52       451
         boat       0.58      0.29      0.39       430
        truck       0.57      0.35      0.44       415
        bench       0.58      0.27      0.37       413
     umbrella       0.65      0.52      0.58       423
          cow       0.81      0.61      0.70       397
       banana       0.68      0.34      0.45       397
       carrot       0.56      0.29      0.38       384
   motorcycle       0.77      0.58      0.66       379
     backpack       0.51      0.16      0.24       371

    micro avg       0.76      0.52      0.61     25061
    macro avg       0.64      0.38      0.47     25061
 weighted avg       0.74      0.52      0.60     25061
As we saw in the previous section, while YOLOv8 has decent performance out of the box, it may not be suitable for specific use cases without some modification. For the bird class, for example, the base YOLOv8n model only achieved 39% recall.
Suppose you’re working for a bird conservancy group, putting computer vision models in the field to track and protect endangered species. Your goal is to detect, in real time, as many birds as possible.
Given its inference speed, the YOLOv8 architecture seems like the obvious choice. However, you are not satisfied with the 39% recall in the evaluation report on the COCO validation data. Because this is a high-stakes application, you want to squeeze every last bit of performance you can out of this real-time detection architecture.
If you wanted to, you could train a new YOLOv8 detection model from scratch, as illustrated in the YOLOv8 Quickstart guide, but ideally you would like to leverage the pretrained model’s existing knowledge. Fortunately, it is pretty straightforward to fine-tune an existing YOLOv8 model.
Before continuing, let’s pare down our task. The flexible query language built into FiftyOne makes it easy to slice and dice your datasets to find interesting views in just a line of code or the click of a button in the App.
At this point, we have a FiftyOne Dataset with our COCO validation images, ground truth detections, and YOLOv8n predictions in a yolov8n label field on each sample. Given that in our use case we are only concerned with detecting birds, let’s create a test set by filtering out all non-bird ground truth detections using filter_labels(). We will also filter out the non-bird predictions, but will pass only_matches=False to filter_labels() so that we keep images that have ground truth bird detections even when they have no YOLOv8n bird predictions.
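Here is a minimal sketch of that filtering, assuming the dataset and the yolov8n prediction field from Parts 1 and 2:

from fiftyone import ViewField as F

# Keep only bird ground truth boxes, dropping images with no bird ground truth
test_view = dataset.filter_labels("ground_truth", F("label") == "bird")

# Keep only bird predictions, but retain images whose ground truth birds
# went undetected by YOLOv8n
test_view = test_view.filter_labels(
    "yolov8n", F("label") == "bird", only_matches=False
)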
We then give the dataset a name, make it persistent, and save it to the underlying database. This test set has only 125 images, which we can visualize in the FiftyOne App.
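Something like the following does the trick, where the dataset name birds_test is our own choice:

import fiftyone as fo

# Clone the filtered view into its own persistent dataset
test_dataset = test_view.clone("birds_test")
test_dataset.persistent = True

# Visualize the 125-image test set in the FiftyOne App
session = fo.launch_app(test_dataset)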
We can also run evaluate_detections() on this data to evaluate the YOLOv8n model’s performance on images with ground truth bird detections. We will store the results under the base evaluation key:
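A sketch of that evaluation; we pass compute_mAP=True so we can also compare mAP values later on:

# Evaluate YOLOv8n's bird predictions against the ground truth
base_results = test_dataset.evaluate_detections(
    "yolov8n",
    gt_field="ground_truth",
    eval_key="base",
    compute_mAP=True,
)

base_results.print_report(classes=["bird"])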
               precision    recall  f1-score   support

         bird       0.87      0.39      0.54       451
We note that while the recall is the same as in the initial evaluation report over the entire COCO validation split, the precision is higher (0.87 versus 0.79). This is because there are images elsewhere in the validation split that have YOLOv8n bird predictions but no ground truth bird detections; those false positives fall outside our bird-only test set, which boosts the precision measured here.
The final step in preparing this test set is exporting the data into YOLOv8 format so we can run inference on just these samples with our fine-tuned model when we are done training. We will do so using the export_yolo_data() function we defined in Part 1.
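Assuming export_yolo_data() keeps the signature from Part 1, taking a sample collection, an export directory, and a list of classes (the directory name birds_test is our own choice), the export looks something like:

# Export images and ground truth labels in YOLO format so we can later
# run inference on exactly these samples with the fine-tuned model
export_yolo_data(test_dataset, "birds_test", classes=["bird"])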

Choosing the training data for your YOLOv8 model

The most important component in fine-tuning a model is the data on which the model is trained. If we want our model to exhibit high performance on a specific subset of data, then our goal should be to generate a high-quality training dataset whose examples cover all expected scenarios in that subset.
This is both an art and a science. It can involve pulling in data from other datasets, annotating more data that you’ve already collected with ground truth labels, augmenting your data with tools like Albumentations, or generating synthetic data with diffusion models or GANs.
In this article, we’ll take the first approach and incorporate existing high-quality data from Google’s Open Images dataset. For a thorough tutorial on how to work with Open Images data, see Loading Open Images V6 and custom datasets with FiftyOne.
The COCO training data on which YOLOv8 was trained contains 3237 images with bird detections. Open Images is more expansive, with the train, test, and validation splits together housing 20k+ images with Bird detections.
Let’s create our training dataset. First, we’ll create a dataset, train_dataset, by loading the bird detection labels from the COCO train split using the FiftyOne Dataset Zoo, and cloning this into a new Dataset object:
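A sketch of that step, filtering the COCO train split down to just its bird labels before cloning (the dataset name birds_train is our own choice):

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

# Load only COCO train images that contain at least one bird detection
coco_birds = foz.load_zoo_dataset(
    "coco-2017",
    split="train",
    classes=["bird"],
    label_types=["detections"],
)

# Keep only the bird labels, then clone the view into a new Dataset
train_dataset = coco_birds.filter_labels(
    "ground_truth", F("label") == "bird"
).clone("birds_train")
train_dataset.persistent = True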
Then, we’ll load Open Images samples with Bird detection labels, passing in only_matching=True to only load the Bird labels. We then map these labels into COCO label format by changing Bird into bird.
We can add these new samples into our training dataset with merge_samples():
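One way to do this, assuming the Open Images detections also land in a ground_truth label field:

# Load Open Images samples (all splits) that contain Bird detections,
# keeping only the Bird labels themselves via only_matching=True
oi_birds = foz.load_zoo_dataset(
    "open-images-v6",
    splits=["train", "test", "validation"],
    classes=["Bird"],
    label_types=["detections"],
    only_matching=True,
)

# Map the Open Images "Bird" label to COCO's "bird", then merge
oi_birds = oi_birds.map_labels("ground_truth", {"Bird": "bird"})
train_dataset.merge_samples(oi_birds)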
This dataset contains 24,226 samples with bird labels, more than seven times as many bird images as the base YOLOv8n model was trained on. In the next section, we’ll demonstrate how to fine-tune the model on this data using the YOLO command line interface.

Fine-tuning YOLOv8 for a custom use case

The final step in preparing our data is splitting it into training and validation sets and exporting it into YOLO format. We will use an 80-20 train-val split, which we will select randomly using FiftyOne’s random utils.
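Here is one way to do this; we assume export_yolo_data() accepts a split argument for exporting multiple splits, and the export directory birds_train is our own choice:

import fiftyone.utils.random as four

# Randomly tag 80% of the samples "train" and 20% "val"
four.random_split(train_dataset, {"train": 0.8, "val": 0.2})

# Export both splits in YOLO format using the helper from Part 1
export_yolo_data(
    train_dataset, "birds_train", classes=["bird"], split=["train", "val"]
)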
Now all that is left is to do the fine-tuning! We will use the same YOLO command line syntax, but instead of setting mode=predict, we will set mode=train. We will specify the initial weights as the starting point for training, the number of epochs, image size, and batch size.
For my fine-tuning, I used an NVIDIA TITAN GPU. I set the training to run for 100 epochs, but it had basically converged after 60 epochs, so I stopped there. You may find that your specific use case requires fewer or potentially more iterations to reach your desired performance.
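The training command looks something like the following, where the data path assumes the export above produced a birds_train/dataset.yaml file:

yolo task=detect mode=train model=yolov8n.pt data=birds_train/dataset.yaml epochs=100 imgsz=640 batch=16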
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/train
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100      6.65G      1.392      1.627      1.345         22        640
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all       4845      12487      0.677      0.524      0.581      0.339

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      2/100      9.58G      1.446      1.407      1.395         30        640
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all       4845      12487      0.669       0.47       0.54      0.316

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      3/100      9.58G       1.54      1.493      1.462         29        640
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all       4845      12487      0.529      0.329      0.349      0.188

...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     58/100      9.59G      1.263     0.9489      1.277         47        640
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all       4845      12487      0.751      0.631      0.708      0.446

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     59/100      9.59G      1.264     0.9476      1.277         29        640
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all       4845      12487      0.752      0.631      0.708      0.446

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     60/100      9.59G      1.257     0.9456      1.274         41        640
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all       4845      12487      0.752      0.631      0.709      0.446
With fine-tuning complete, we can generate predictions on our test data with the “best” weights found during the training process, which are stored at runs/detect/train/weights/best.pt:
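The prediction command might look something like this, where the source path is an assumption about how the test set was exported, and save_txt/save_conf write YOLO-format prediction files we can load back into FiftyOne:

yolo task=detect mode=predict model=runs/detect/train/weights/best.pt source=birds_test/images/val save_txt=True save_conf=True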
We can then load these predictions onto our data and visualize them in the FiftyOne App:
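Part 1 defined a helper for turning YOLO-format prediction files into FiftyOne labels; a self-contained sketch of that conversion follows, where the prediction field name yolov8n_bird and the predictions directory are assumptions:

import os

import numpy as np
import fiftyone as fo

def add_yolo_predictions(dataset, label_field, predictions_dir, classes):
    """Parse YOLO-format prediction files (class cx cy w h conf) and add
    them to the corresponding samples as FiftyOne Detections."""
    for sample in dataset.iter_samples(progress=True):
        stem = os.path.splitext(os.path.basename(sample.filepath))[0]
        txt_path = os.path.join(predictions_dir, stem + ".txt")

        detections = []
        if os.path.exists(txt_path):
            for row in np.loadtxt(txt_path, ndmin=2):
                if row.size != 6:
                    continue
                cls, cx, cy, w, h, conf = row
                detections.append(
                    fo.Detection(
                        label=classes[int(cls)],
                        # YOLO boxes are (center x, center y, w, h); FiftyOne
                        # expects [top-left x, top-left y, w, h], all relative
                        bounding_box=[cx - w / 2, cy - h / 2, w, h],
                        confidence=float(conf),
                    )
                )

        sample[label_field] = fo.Detections(detections=detections)
        sample.save()

# Add the fine-tuned model's predictions and open the App
add_yolo_predictions(
    test_dataset, "yolov8n_bird", "runs/detect/predict/labels", classes=["bird"]
)
session = fo.launch_app(test_dataset)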

Assessing YOLOv8 model performance improvement

On a holistic level, we can compare the performance of the fine-tuned model to the original, pretrained model by stacking their standard metrics against each other.
The easiest way to get these metrics is with FiftyOne’s Evaluation API:
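A sketch of the comparison, assuming the fine-tuned predictions live in the yolov8n_bird field added above and base_results is the evaluation we ran earlier:

# Evaluate the fine-tuned model's bird detections
finetune_results = test_dataset.evaluate_detections(
    "yolov8n_bird",
    gt_field="ground_truth",
    eval_key="finetune",
    compute_mAP=True,
)

# Compare mAP for the pretrained and fine-tuned models
print("yolov8n mAP: {}".format(base_results.mAP()))
print("fine-tuned mAP: {}".format(finetune_results.mAP()))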
From this, we can immediately see improvement in the mean average precision (mAP):
yolov8n mAP: 0.24897924786479841
fine-tuned mAP: 0.31339033693212076
Printing out a report, we can see that the recall has improved from 0.39 to 0.56. This major improvement offsets a minor dip in precision, giving an overall higher F1 score (0.67 compared to 0.54).
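The per-class report for the fine-tuned model comes from the same results object:

finetune_results.print_report(classes=["bird"])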
               precision    recall  f1-score   support

         bird       0.81      0.56      0.67       506
We can also look more closely at individual images to see where the fine-tuned model is having trouble. When we do so, we can see that the model struggles to correctly handle small features. This is true for both false positives and false negatives.
This poor performance could be in part due to quality of the data, as many of these features are grainy. It could also be due to the training parameters, as both the pretraining and fine-tuning for this model used an image size of 640 pixels, which might not allow for fine-grained details to be captured.
To further improve the model’s performance, we could try a variety of approaches, including:
  • Using image augmentation to increase the proportion of images with small birds
  • Gathering and annotating more images with small birds
  • Increasing the image size during fine-tuning

Conclusion

The fine-tuning presented in the previous section is only for the purpose of illustration.
While YOLOv8 represents a step forward for real-time object detection and segmentation models, out-of-the-box it’s aimed at general purpose uses. Before deploying the model, it is essential to understand how it performs on your data. Only then can you effectively fine-tune the YOLOv8 architecture to suit your specific needs.
In this series, we have shown you how you can use FiftyOne to visualize, evaluate, and better understand YOLOv8 model predictions.
While YOLO may only look once, a conscientious computer vision engineer or researcher certainly looks twice (or more)!

Join the FiftyOne community!

Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!