Fine-tune YOLOv8 models for custom computer vision applications
Welcome to the third and final installment in our three-part series on YOLOv8! In this series, we show you how to work with YOLOv8, from downloading the off-the-shelf models, to fine-tuning these models for specific use cases, and everything in between.
Throughout the series, we will be using two libraries: FiftyOne, the open source computer vision toolkit, and Ultralytics, the library that will give us access to YOLOv8.
Here in Part 3, we’ll demonstrate how to fine-tune a YOLOv8 model for your specific use case.
Continue reading to learn how you can incorporate YOLOv8 models into your computer vision workflows!
Parts 1 and 2 recap
In Part 1, we loaded the validation split of the COCO 2017 dataset into FiftyOne with ground truth object detections. We then generated predictions with the YOLOv8n detection model from the Ultralytics YOLOv8 GitHub repository and added them to our dataset in the yolov8n label field on our samples.
In Part 2, we used FiftyOne’s Evaluation API to evaluate the quality of these detections. We found that this base model had relatively good precision, but struggled with recall for certain classes. When we dug into the details, we saw that for certain classes like the book class, imperfections in the COCO ground truth labels were at least part of the problem, but for other classes like bird, the model could benefit from fine-tuning on a broader set of examples.
As we saw in Part 2, while YOLOv8 has decent performance out of the box, it may not be suitable for specific use cases without some modification. For the bird class, for example, the base YOLOv8n model only achieved 39% recall.
Suppose you’re working for a bird conservancy group, putting computer vision models in the field to track and protect endangered species. Your goal is to detect, in real time, as many birds as possible.
Given its inference speed, the YOLOv8 architecture seems like the obvious choice. However, you are not satisfied with the 39% recall in the evaluation report on the COCO validation data. Because this is a high-stakes application, you want to squeeze every last bit of performance you can out of this real-time detection architecture.
If you wanted to, you could train a new YOLOv8 detection model from scratch, as illustrated in the YOLOv8 Quickstart guide, but ideally you would like to leverage the pretrained model’s existing knowledge. Fortunately, it is pretty straightforward to fine-tune an existing YOLOv8 model.
Before continuing, let’s pare down our task. The flexible query language built into FiftyOne makes it easy to slice and dice your datasets to find interesting views in just a line of code or the click of a button in the App.
At this point, we have a FiftyOne Dataset with our COCO validation images, ground truth detections, and YOLOv8n predictions in a yolov8n label field on each sample. Because our use case is only concerned with detecting birds, let’s create a test set by filtering out all non-bird ground truth detections using filter_labels(). We will also filter out the non-bird predictions, but will pass only_matches=False into filter_labels() so that we keep images that have ground truth bird detections but no YOLOv8n bird predictions.
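To make the role of only_matches concrete, here is a small pure-Python sketch of the filtering logic. It is a toy model, not FiftyOne’s implementation: samples are plain dictionaries, and the helper name filter_labels_toy and the sample data are made up for illustration.

```python
def filter_labels_toy(samples, field, keep_label, only_matches=True):
    """Toy stand-in for label filtering on a list of sample dicts.

    Keeps only labels equal to `keep_label` in `field`. When `only_matches`
    is True, samples left with no labels in `field` are dropped.
    """
    filtered = []
    for sample in samples:
        kept = [label for label in sample[field] if label == keep_label]
        if only_matches and not kept:
            continue
        new_sample = dict(sample)
        new_sample[field] = kept
        filtered.append(new_sample)
    return filtered


# Hypothetical samples; labels are reduced to class names for brevity
samples = [
    {"id": "a", "ground_truth": ["bird", "dog"], "yolov8n": ["bird"]},
    {"id": "b", "ground_truth": ["bird"], "yolov8n": ["kite"]},  # missed bird
    {"id": "c", "ground_truth": ["cat"], "yolov8n": ["bird"]},  # no true bird
]

# Step 1: keep only samples that have ground truth birds
view = filter_labels_toy(samples, "ground_truth", "bird")
# Step 2: keep only bird predictions, but do not drop samples without any
view = filter_labels_toy(view, "yolov8n", "bird", only_matches=False)

print([s["id"] for s in view])
```

Sample b survives with an empty prediction list, so its missed bird still counts against recall. Sample c is dropped because it has no ground truth bird.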
We then give the dataset a name, make it persistent, and save it to the underlying database. This test set has only 125 images, which we can visualize in the FiftyOne App.
We can also run evaluate_detections() on this data to evaluate the YOLOv8n model’s performance on images with ground truth bird detections. We will store the results under the base evaluation key:
        precision    recall  f1-score   support
bird         0.87      0.39      0.54       451
Recall is the same as in the initial evaluation report over the entire COCO validation split, but precision is higher. This means the full validation split contains images that have YOLOv8n bird predictions but no ground truth bird detections; those false positives are excluded from this test set.
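The F1 score in the report is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded values in the table. The function name is our own, and because the inputs are rounded the result is only approximate.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall for the bird class from the report
base_f1 = f1_score(0.87, 0.39)
print(f"{base_f1:.2f}")
```

Because the harmonic mean is pulled toward the smaller value, the low recall dominates the score even though precision is high.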
The final step in preparing this test set is exporting the data into YOLOv8 format so we can run inference on just these samples with our fine-tuned model when we are done training. We will do so using the export_yolo_data() function we defined in Part 1.
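For reference, the core of such an export is a coordinate conversion. FiftyOne stores each bounding box as [top-left-x, top-left-y, width, height] in relative coordinates, while YOLO label files expect a class index followed by a relative center x, center y, width, and height. The sketch below shows that conversion only. It is not the export_yolo_data() function from Part 1, and the helper name and example box are hypothetical.

```python
def to_yolo_line(class_index, bounding_box):
    """Convert a relative [x_top_left, y_top_left, w, h] box to a YOLO label line."""
    x, y, w, h = bounding_box
    x_center = x + w / 2
    y_center = y + h / 2
    return f"{class_index} {x_center:.6f} {y_center:.6f} {w:.6f} {h:.6f}"


# With a single "bird" class, we assume the class index is 0
line = to_yolo_line(0, [0.25, 0.125, 0.5, 0.25])
print(line)
```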
Choosing the training data for your YOLOv8 model
The most important component in fine-tuning a model is the data on which the model is trained. If we want our model to exhibit high performance on a specific subset of data, then our goal should be to generate a high-quality training dataset whose examples cover all expected scenarios in that subset.
This is both an art and a science. It can involve pulling in data from other datasets, annotating more data that you’ve already collected with ground truth labels, augmenting your data with tools like Albumentations, or generating synthetic data with diffusion models or GANs.
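Whichever augmentation tool you use, the labels have to be transformed along with the pixels. As a minimal illustration (plain Python rather than Albumentations, with a hypothetical helper name), a horizontal flip mirrors each relative [top-left-x, top-left-y, width, height] box like this:

```python
def hflip_box(bounding_box):
    """Mirror a relative [x_top_left, y_top_left, w, h] box across the vertical axis."""
    x, y, w, h = bounding_box
    # The right edge of the box (x + w) becomes the new left edge after flipping
    return [1.0 - x - w, y, w, h]


flipped = hflip_box([0.125, 0.5, 0.25, 0.25])
print(flipped)
```

Flipping twice returns the original box, which is a handy sanity check for any label transform.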
The COCO training data on which YOLOv8 was trained contains 3237 images with bird detections. Open Images is more expansive, with the train, test, and validation splits together housing 20k+ images with Bird detections.
Let’s create our training dataset. First, we’ll create a dataset, train_dataset, by loading the bird detection labels from the COCO train split using the FiftyOne Dataset Zoo, and cloning this into a new Dataset object:
Then, we’ll load Open Images samples with Bird detection labels, passing in only_matching=True to only load the Bird labels. We then map these labels into COCO label format by changing Bird into bird.
We can add these new samples into our training dataset with merge_samples():
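Conceptually, these two steps rename a label and then combine two collections of samples. The toy sketch below mimics that with plain dictionaries matched on file path, which is an assumption we make here for illustration. It is not FiftyOne code, and the helper names and file paths are invented.

```python
def map_labels_toy(samples, mapping):
    """Return copies of the samples with label names renamed via `mapping`."""
    return [
        {**s, "labels": [mapping.get(label, label) for label in s["labels"]]}
        for s in samples
    ]


def merge_samples_toy(base, new):
    """Merge two sample lists, combining labels for samples sharing a filepath."""
    merged = {s["filepath"]: {**s, "labels": list(s["labels"])} for s in base}
    for s in new:
        if s["filepath"] in merged:
            merged[s["filepath"]]["labels"].extend(s["labels"])
        else:
            merged[s["filepath"]] = {**s, "labels": list(s["labels"])}
    return list(merged.values())


# Hypothetical samples from the two sources
coco_samples = [{"filepath": "coco/0001.jpg", "labels": ["bird"]}]
oi_samples = [{"filepath": "open-images/0001.jpg", "labels": ["Bird", "Bird"]}]

# Rename Open Images labels to the COCO convention, then merge
oi_samples = map_labels_toy(oi_samples, {"Bird": "bird"})
train_samples = merge_samples_toy(coco_samples, oi_samples)

print(len(train_samples), sorted({l for s in train_samples for l in s["labels"]}))
```

Renaming before merging matters: otherwise the combined dataset would contain two separate classes, Bird and bird, for the same object.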
This dataset contains 24,226 samples with bird labels, more than seven times as many bird images as the 3,237 in the COCO training data on which the base YOLOv8n model was trained. In the next section, we’ll demonstrate how to fine-tune the model on this data.
Fine-tuning YOLOv8 for a custom use case
The final step in preparing our data is splitting it into training and validation sets and exporting it into YOLO format. We will use an 80-20 train-val split, which we will select randomly using FiftyOne’s random utils.
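As a rough picture of what an 80-20 random split does, the sketch below shuffles a list of sample identifiers with a fixed seed and cuts it in two. It uses Python’s random module rather than FiftyOne’s utilities, and the function name, seed, and identifiers are placeholders.

```python
import random


def train_val_split(items, val_fraction=0.2, seed=51):
    """Shuffle items reproducibly and return (train, val) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    num_val = round(len(shuffled) * val_fraction)
    return shuffled[num_val:], shuffled[:num_val]


# Placeholder identifiers, one per sample in the 24,226-sample training dataset
sample_ids = [f"sample_{i}" for i in range(24226)]
train_ids, val_ids = train_val_split(sample_ids)

print(len(train_ids), len(val_ids))
```

With 24,226 samples, this gives 19,381 training and 4,845 validation samples. Fixing the seed keeps the split reproducible across runs.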
Now all that is left is to do the fine-tuning! We will use the same YOLO command line syntax, but instead of setting mode=predict, we will set mode=train. We will specify the initial weights as the starting point for training, the number of epochs, image size, and batch size.
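To show how these settings fit together, the sketch below assembles a training command in the key=value style of the Ultralytics CLI and prints it without running it. The epoch count and image size come from the run described in this post. The dataset YAML path and the batch size of 16 are placeholder assumptions, so substitute your own.

```python
import shlex


def build_yolo_command(mode, **options):
    """Build a `yolo` CLI invocation as a list of key=value arguments."""
    args = ["yolo", "task=detect", f"mode={mode}"]
    args += [f"{key}={value}" for key, value in options.items()]
    return args


train_cmd = build_yolo_command(
    "train",
    model="yolov8n.pt",               # pretrained weights used as the starting point
    data="birds_train/dataset.yaml",  # hypothetical path to the exported dataset config
    epochs=100,
    imgsz=640,
    batch=16,                         # assumed value; not stated in this post
)

print(shlex.join(train_cmd))
```

Once Ultralytics is installed and the data has been exported, the same list could be passed to subprocess.run(train_cmd, check=True), or the printed command pasted into a terminal.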
For my fine-tuning, I used an NVIDIA TITAN GPU. I set the training to run for 100 epochs, but it had basically converged after 60 epochs, so I stopped there. You may find that your specific use case requires fewer or potentially more iterations to reach your desired performance.
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/train
Starting training for 100 epochs...

Epoch    GPU_mem  box_loss  cls_loss  dfl_loss  Instances  Size
1/100      6.65G     1.392     1.627     1.345         22  640: 1
Class  Images  Instances  Box(P      R  mAP50      m
all      4845      12487  0.677  0.524  0.581  0.339

Epoch    GPU_mem  box_loss  cls_loss  dfl_loss  Instances  Size
2/100      9.58G     1.446     1.407     1.395         30  640: 1
Class  Images  Instances  Box(P      R  mAP50      m
all      4845      12487  0.669   0.47   0.54  0.316

Epoch    GPU_mem  box_loss  cls_loss  dfl_loss  Instances  Size
3/100      9.58G      1.54     1.493     1.462         29  640: 1
Class  Images  Instances  Box(P      R  mAP50      m
all      4845      12487  0.529  0.329  0.349  0.188

......

Epoch    GPU_mem  box_loss  cls_loss  dfl_loss  Instances  Size
58/100     9.59G     1.263    0.9489     1.277         47  640: 1
Class  Images  Instances  Box(P      R  mAP50      m
all      4845      12487  0.751  0.631  0.708  0.446

Epoch    GPU_mem  box_loss  cls_loss  dfl_loss  Instances  Size
59/100     9.59G     1.264    0.9476     1.277         29  640: 1
Class  Images  Instances  Box(P      R  mAP50      m
all      4845      12487  0.752  0.631  0.708  0.446

Epoch    GPU_mem  box_loss  cls_loss  dfl_loss  Instances  Size
60/100     9.59G     1.257    0.9456     1.274         41  640: 1
Class  Images  Instances  Box(P      R  mAP50      m
all      4845      12487  0.752  0.631  0.709  0.446
With fine-tuning complete, we can generate predictions on our test data with the “best” weights found during the training process, which are stored at runs/detect/train/weights/best.pt:
We can then load these predictions onto our data and visualize them in the FiftyOne App:
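Loading predictions back reverses the coordinate convention used for export. Assuming the predictions were written as YOLO text files with a trailing confidence value (class index, center x, center y, width, height, confidence, all relative), each line can be turned into a label, a top-left-based box, and a score. The helper name, class list, and example line below are illustrative.

```python
def parse_yolo_prediction(line, classes):
    """Parse one YOLO prediction line into a label, relative box, and confidence."""
    fields = line.split()
    class_index = int(fields[0])
    x_center, y_center, w, h = (float(v) for v in fields[1:5])
    # The confidence column is optional, so fall back to None if it is absent
    confidence = float(fields[5]) if len(fields) > 5 else None
    return {
        "label": classes[class_index],
        # Convert from a box center to a top-left corner
        "bounding_box": [x_center - w / 2, y_center - h / 2, w, h],
        "confidence": confidence,
    }


pred = parse_yolo_prediction("0 0.5 0.5 0.25 0.5 0.9", classes=["bird"])
print(pred)
```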
Assessing YOLOv8 model performance improvement
On a holistic level, we can compare the performance of the fine-tuned model to the original, pretrained model by stacking their standard metrics against each other.
The easiest way to get these metrics is with FiftyOne’s Evaluation API:
From this, we can immediately see improvement in the mean average precision (mAP):
Printing out a report, we can see that the recall has improved from 0.39 to 0.56. This major improvement offsets a minor dip in precision, giving an overall higher F1 score (0.67 compared to 0.54).
        precision    recall  f1-score   support
bird         0.81      0.56      0.67       506
We can also look more closely at individual images to see where the fine-tuned model is having trouble. When we do so, we can see that the model struggles to correctly handle small features. This is true for both false positives and false negatives.
This poor performance could be in part due to quality of the data, as many of these features are grainy. It could also be due to the training parameters, as both the pretraining and fine-tuning for this model used an image size of 640 pixels, which might not allow for fine-grained details to be captured.
To further improve the model’s performance, we could try a variety of approaches, including:
Using image augmentation to increase the proportion of images with small birds
Gathering and annotating more images with small birds
Increasing the image size during fine-tuning
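To see why image size matters for small objects, consider how many pixels a bird occupies after resizing. The sketch below assumes the image is scaled so that its longest side equals the training image size. The original resolution and bird width are made-up numbers.

```python
def resized_extent(object_pixels, image_longest_side, imgsz):
    """Approximate object size in pixels after scaling the longest side to imgsz."""
    return object_pixels * imgsz / image_longest_side


# Hypothetical case: a bird 40 px wide in an image whose longest side is 4000 px
for imgsz in (640, 1280):
    width = resized_extent(object_pixels=40, image_longest_side=4000, imgsz=imgsz)
    print(f"imgsz={imgsz}: bird is about {width:.1f} px wide")
```

Under these assumptions, doubling the image size doubles the number of pixels across the bird, at the cost of more memory and compute per image.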
Conclusion
The fine-tuning presented in the previous section is only for the purpose of illustration.
While YOLOv8 represents a step forward for real-time object detection and segmentation models, out-of-the-box it’s aimed at general purpose uses. Before deploying the model, it is essential to understand how it performs on your data. Only then can you effectively fine-tune the YOLOv8 architecture to suit your specific needs.
In this series, we have shown you how you can use FiftyOne to visualize, evaluate, and better understand YOLOv8 model predictions.
While YOLO may only look once, a conscientious computer vision engineer or researcher certainly looks twice (or more)!