Editor’s note – This is the third and final article in our three-part series on YOLOv8.
Fine-tune YOLOv8 models for custom computer vision applications
Welcome to the third and final installment of our three-part series on YOLOv8! In this series, we’ll show you how to work with YOLOv8, from downloading the off-the-shelf models, to fine-tuning these models for specific use cases, and everything in between.
Throughout the series, we will be using two libraries:
- FiftyOne, the open source computer vision toolkit
- Ultralytics, the library that gives us access to YOLOv8
Here in Part 3, we’ll demonstrate how to fine-tune a YOLOv8 model for your specific use case.
This post is organized as follows:
- Parts 1 and 2 recap
- Detecting birds with YOLOv8
- Choosing the training data for your YOLOv8 model
- Fine-tuning YOLOv8 for a custom use case
- Assessing YOLOv8 model performance improvement
- Conclusion
Continue reading to learn how you can incorporate YOLOv8 models into your computer vision workflows!
Parts 1 and 2 recap
In Part 1, we loaded the validation split of the COCO 2017 dataset into FiftyOne with ground truth object detections. We then generated predictions with the YOLOv8n detection model from the Ultralytics YOLOv8 GitHub repository and added them to our dataset in the yolov8n label field on our samples.
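For reference, the core of that workflow looked roughly like the sketch below. The exact helper code from Part 1 differs; in particular, the conversion from Ultralytics results to FiftyOne labels is written out inline here purely for illustration.

```python
import fiftyone as fo
import fiftyone.zoo as foz
from ultralytics import YOLO

# Load the COCO 2017 validation split with ground truth detections
dataset = foz.load_zoo_dataset("coco-2017", split="validation")

# Off-the-shelf YOLOv8n detection model
model = YOLO("yolov8n.pt")

for sample in dataset.iter_samples(autosave=True):
    result = model(sample.filepath)[0]
    detections = []
    for box, conf, cls in zip(
        result.boxes.xyxyn.tolist(),
        result.boxes.conf.tolist(),
        result.boxes.cls.tolist(),
    ):
        x1, y1, x2, y2 = box
        detections.append(
            fo.Detection(
                label=result.names[int(cls)],
                # FiftyOne expects normalized [top-left-x, top-left-y, w, h]
                bounding_box=[x1, y1, x2 - x1, y2 - y1],
                confidence=conf,
            )
        )
    sample["yolov8n"] = fo.Detections(detections=detections)
```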
In Part 2, we used FiftyOne’s Evaluation API to evaluate the quality of these detections. We found that this base model had relatively good precision, but struggled with recall for certain classes. When we dug into the details, we saw that for certain classes like book, imperfections in the COCO ground truth labels were at least part of the problem, but for other classes like bird, the model could benefit from fine-tuning on a broader set of examples.
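A minimal version of that evaluation, assuming the ground truth detections live in a ground_truth field, looks like this:

```python
# Evaluate the YOLOv8n predictions against the COCO ground truth
results = dataset.evaluate_detections(
    "yolov8n",
    gt_field="ground_truth",
    eval_key="eval",
    compute_mAP=True,
)

# Per-class report for the 20 most common ground truth classes
counts = dataset.count_values("ground_truth.detections.label")
top20 = sorted(counts, key=counts.get, reverse=True)[:20]
results.print_report(classes=top20)
```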
Detecting birds with YOLOv8
precision recall f1-score support
person 0.85 0.68 0.76 11573
car 0.71 0.52 0.60 1971
chair 0.62 0.34 0.44 1806
book 0.61 0.12 0.20 1182
bottle 0.68 0.39 0.50 1051
cup 0.61 0.44 0.51 907
dining table 0.54 0.42 0.47 697
traffic light 0.66 0.36 0.46 638
bowl 0.63 0.49 0.55 636
handbag 0.48 0.12 0.19 540
bird 0.79 0.39 0.52 451
boat 0.58 0.29 0.39 430
truck 0.57 0.35 0.44 415
bench 0.58 0.27 0.37 413
umbrella 0.65 0.52 0.58 423
cow 0.81 0.61 0.70 397
banana 0.68 0.34 0.45 397
carrot 0.56 0.29 0.38 384
motorcycle 0.77 0.58 0.66 379
backpack 0.51 0.16 0.24 371
micro avg 0.76 0.52 0.61 25061
macro avg 0.64 0.38 0.47 25061
weighted avg 0.74 0.52 0.60 25061
As we saw in the previous section, while YOLOv8 has decent performance out of the box, it may not be suitable for specific use cases without some modification. For the bird class, for example, the base YOLOv8n model only achieved 39% recall.
Suppose you’re working for a bird conservancy group, putting computer vision models in the field to track and protect endangered species. Your goal is to detect, in real time, as many birds as possible.
Given its inference speed, the YOLOv8 architecture seems like the obvious choice. However, you are not satisfied with the 39% recall in the evaluation report on the COCO validation data. Because this is a high-stakes application, you want to squeeze every last bit of performance you can out of this real-time detection architecture.
If you wanted to, you could train a new YOLOv8 detection model from scratch, as illustrated in the YOLOv8 Quickstart guide, but ideally you would like to leverage the pretrained model’s existing knowledge. Fortunately, it is pretty straightforward to fine-tune an existing YOLOv8 model.
Before continuing, let’s pare down our task. The flexible query language built into FiftyOne makes it easy to slice and dice your datasets to find interesting views in just a line of code or the click of a button in the App.
At this point, we have a FiftyOne Dataset with our COCO validation images, ground truth detections, and YOLOv8n predictions in a yolov8n label field on each sample. Given that in our use case we are only concerned with detecting birds, let’s create a test set by filtering out all non-bird ground truth detections using filter_labels(). We will also filter out the non-bird predictions, but will pass the only_matches=False argument into filter_labels() to make sure we keep images that have ground truth bird detections without YOLOv8n bird predictions.
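A sketch of that filtering, assuming the same field names as above:

```python
from fiftyone import ViewField as F

# Keep only bird ground truth detections
test_view = dataset.filter_labels("ground_truth", F("label") == "bird")

# Keep only bird predictions, but retain images that have ground
# truth birds and no predicted birds
test_view = test_view.filter_labels(
    "yolov8n", F("label") == "bird", only_matches=False
)
```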
We then give the dataset a name, make it persistent, and save it to the underlying database. This test set has only 125 images, which we can visualize in the FiftyOne App.
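Roughly, with the dataset name here being a hypothetical choice:

```python
import fiftyone as fo

# Clone the filtered view into its own persistent dataset
test_dataset = test_view.clone("birds_test_dataset")
test_dataset.persistent = True
test_dataset.save()

print(len(test_dataset))  # 125

# Visualize the test set in the FiftyOne App
session = fo.launch_app(test_dataset)
```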
We can also run evaluate_detections() on this data to evaluate the YOLOv8n model’s performance on images with ground truth bird detections. We will store the results under the base evaluation key.
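With the field names used above, this looks like:

```python
base_results = test_dataset.evaluate_detections(
    "yolov8n",
    gt_field="ground_truth",
    eval_key="base",
)
base_results.print_report(classes=["bird"])
```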
precision recall f1-score support
bird 0.87 0.39 0.54 451
We note that while the recall is the same as in the initial evaluation report over the entire COCO validation split, the precision is higher. This means there are images that have YOLOv8n bird predictions but no ground truth bird detections.
The final step in preparing this test set is exporting the data into YOLOv8 format so we can run inference on just these samples with our fine-tuned model when we are done training. We will do so using the export_yolo_data() function we defined in Part 1.
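The call below assumes the signature of the Part 1 helper; adjust it to match your own definition:

```python
# Export images and bird labels in YOLO format using the
# export_yolo_data() helper defined in Part 1 (signature assumed)
export_yolo_data(test_dataset, "birds_test", classes=["bird"])
```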
Choosing the training data for your YOLOv8 model
The most important component in fine-tuning a model is the data on which the model is trained. If we want our model to exhibit high performance on a specific subset of data, then our goal should be to generate a high-quality training dataset whose examples cover all expected scenarios in that subset.
This is both an art and a science. It can involve pulling in data from other datasets, annotating more of the data you’ve already collected with ground truth labels, augmenting your data with tools like Albumentations, or generating synthetic data with diffusion models or GANs.
The COCO training data on which YOLOv8 was trained contains 3,237 images with bird detections. Open Images is more expansive, with the train, test, and validation splits together housing 20k+ images with Bird detections.
Let’s create our training dataset. First, we’ll create a dataset, train_dataset, by loading the bird detection labels from the COCO train split using the FiftyOne Dataset Zoo, and cloning the result into a new Dataset object.
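A sketch of that step:

```python
import fiftyone.zoo as foz
from fiftyone import ViewField as F

# Load COCO train images that contain birds
coco_birds = foz.load_zoo_dataset(
    "coco-2017",
    split="train",
    classes=["bird"],
)

# Keep only the bird labels and clone into a new, editable Dataset
train_dataset = coco_birds.filter_labels(
    "ground_truth", F("label") == "bird"
).clone()
```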
Then, we’ll load Open Images samples with Bird detection labels, passing in only_matching=True to load only the Bird labels. We then map these labels into COCO label format by changing Bird into bird, and add the new samples into our training dataset with merge_samples().
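Putting those steps together, roughly:

```python
# Load Open Images samples (all splits) that contain Bird detections,
# keeping only the Bird labels
oi_samples = foz.load_zoo_dataset(
    "open-images-v6",
    classes=["Bird"],
    only_matching=True,
    label_types="detections",
)

# Map the Open Images class name into COCO label format
oi_samples = oi_samples.map_labels("ground_truth", {"Bird": "bird"})

# Merge the new samples into the training dataset
train_dataset.merge_samples(oi_samples)
```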
This dataset contains 24,226 samples with bird labels, or more than seven times as many birds as the base YOLOv8n model was trained on. In the next section, we’ll demonstrate how to fine-tune the model on this data using the YOLO Trainer class.
Fine-tuning YOLOv8 for a custom use case
The final step in preparing our data is splitting it into training and validation sets and exporting it into YOLO format. We will use an 80-20 train-val split, which we will select randomly using FiftyOne’s random utils.
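A sketch of the split and export, again assuming the Part 1 helper’s signature for the split argument:

```python
import fiftyone.utils.random as four

# Randomly tag 80% of the samples "train" and 20% "val"
four.random_split(train_dataset, {"train": 0.8, "val": 0.2})

# Export both splits in YOLO format via the Part 1 helper
# (the split argument is assumed from that definition)
export_yolo_data(
    train_dataset,
    "birds_train",
    classes=["bird"],
    split=["train", "val"],
)
```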
Now all that is left to do is the fine-tuning! We will use the same YOLO command line syntax, but instead of setting mode=predict, we will set mode=train. We specify the initial weights as the starting point for training, the number of epochs, the image size, and the batch size.
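The command looks like the following, where the data path assumes the export directory used above:

```
yolo task=detect mode=train model=yolov8n.pt data=birds_train/dataset.yaml epochs=100 imgsz=640 batch=16
```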
For my fine-tuning, I used an NVIDIA TITAN GPU. I set the training to run for 100 epochs, but it had basically converged after 60 epochs, so I stopped there. You may find that your specific use case requires fewer or potentially more iterations to reach your desired performance.
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/train
Starting training for 100 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      1/100      6.65G      1.392      1.627      1.345         22        640
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all       4845      12487      0.677      0.524      0.581      0.339

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      2/100      9.58G      1.446      1.407      1.395         30        640
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all       4845      12487      0.669       0.47       0.54      0.316

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
      3/100      9.58G       1.54      1.493      1.462         29        640
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all       4845      12487      0.529      0.329      0.349      0.188

...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     58/100      9.59G      1.263     0.9489      1.277         47        640
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all       4845      12487      0.751      0.631      0.708      0.446

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     59/100      9.59G      1.264     0.9476      1.277         29        640
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all       4845      12487      0.752      0.631      0.708      0.446

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
     60/100      9.59G      1.257     0.9456      1.274         41        640
                 Class     Images  Instances      Box(P          R      mAP50  mAP50-95)
                   all       4845      12487      0.752      0.631      0.709      0.446
With fine-tuning complete, we can generate predictions on our test data with the “best” weights found during the training process, which are stored at runs/detect/train/weights/best.pt.
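For example, with the test images exported earlier (the source path is an assumption based on that export):

```
yolo task=detect mode=predict model=runs/detect/train/weights/best.pt source=birds_test/images/val save_txt=True save_conf=True
```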
And load these predictions onto our data and visualize them in the FiftyOne App.
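One way to do this is to parse the save_txt output back into FiftyOne labels; the yolov8n_bird field name and output directory below are illustrative:

```python
import os

import fiftyone as fo

def add_yolo_predictions(sample, txt_dir, label_field, classes):
    """Parse a YOLO save_txt/save_conf prediction file for a sample
    and store the boxes in `label_field`"""
    stem = os.path.splitext(os.path.basename(sample.filepath))[0]
    txt_path = os.path.join(txt_dir, f"{stem}.txt")

    detections = []
    if os.path.exists(txt_path):
        with open(txt_path) as f:
            for line in f:
                cls, xc, yc, w, h, conf = map(float, line.split())
                detections.append(
                    fo.Detection(
                        label=classes[int(cls)],
                        # YOLO uses normalized center coords; FiftyOne
                        # wants a normalized top-left [x, y, w, h] box
                        bounding_box=[xc - w / 2, yc - h / 2, w, h],
                        confidence=conf,
                    )
                )

    sample[label_field] = fo.Detections(detections=detections)
    sample.save()

for sample in test_dataset:
    add_yolo_predictions(
        sample, "runs/detect/predict/labels", "yolov8n_bird", ["bird"]
    )

session = fo.launch_app(test_dataset)
```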
Assessing YOLOv8 model performance improvement
On a holistic level, we can compare the performance of the fine-tuned model to the original, pretrained model by stacking their standard metrics against each other.
The easiest way to get these metrics is with FiftyOne’s Evaluation API.
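Assuming the fine-tuned predictions were loaded into the hypothetical yolov8n_bird field as sketched above:

```python
base_results = test_dataset.evaluate_detections(
    "yolov8n",
    gt_field="ground_truth",
    eval_key="base",
    compute_mAP=True,
)

finetune_results = test_dataset.evaluate_detections(
    "yolov8n_bird",
    gt_field="ground_truth",
    eval_key="finetune",
    compute_mAP=True,
)

print("yolov8n mAP: {}".format(base_results.mAP()))
print("fine-tuned mAP: {}".format(finetune_results.mAP()))
```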
From this, we can immediately see improvement in the mean average precision (mAP):
yolov8n mAP: 0.24897924786479841
fine-tuned mAP: 0.31339033693212076
Printing out a report, we can see that the recall has improved from 0.39 to 0.56. This major improvement offsets a minor dip in precision, giving an overall higher F1 score (0.67 compared to 0.54).
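For instance:

```python
finetune_results.print_report(classes=["bird"])
```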
precision recall f1-score support
bird 0.81 0.56 0.67 506
We can also look more closely at individual images to see where the fine-tuned model is having trouble. When we do so, we can see that the model struggles to correctly handle small features. This is true for both false positives and false negatives.
This poor performance could be due in part to the quality of the data, as many of these features are grainy. It could also be due to the training parameters: both the pretraining and the fine-tuning for this model used an image size of 640 pixels, which might not allow fine-grained details to be captured.
To further improve the model’s performance, we could try a variety of approaches, including:
- Using image augmentation to increase the proportion of images with small birds (see the sketch after this list)
- Gathering and annotating more images with small birds
- Increasing the image size during fine-tuning
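As a sketch of the first option, one could shrink images and pad them back onto a fixed canvas with Albumentations so that birds occupy a smaller fraction of the frame; the paths, box values, and parameters below are illustrative:

```python
import albumentations as A
import cv2

# Shrink the image, then pad back up to (at least) 640x640 so the
# birds cover a smaller fraction of the frame; the bounding boxes
# are transformed to match
transform = A.Compose(
    [
        A.RandomScale(scale_limit=(-0.5, -0.1), p=1.0),
        A.PadIfNeeded(
            min_height=640,
            min_width=640,
            border_mode=cv2.BORDER_CONSTANT,
            value=0,
        ),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = cv2.imread("birds_train/images/train/example.jpg")  # hypothetical path
augmented = transform(
    image=image,
    bboxes=[[0.5, 0.5, 0.1, 0.2]],  # hypothetical YOLO-format bird box
    class_labels=["bird"],
)
aug_image, aug_boxes = augmented["image"], augmented["bboxes"]
```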
Conclusion
The fine-tuning presented in the previous section is only for the purpose of illustration.
While YOLOv8 represents a step forward for real-time object detection and segmentation models, out of the box it is aimed at general-purpose use cases. Before deploying the model, it is essential to understand how it performs on your data. Only then can you effectively fine-tune the YOLOv8 architecture to suit your specific needs.
In this series, we have shown you how you can use FiftyOne to visualize, evaluate, and better understand YOLOv8 model predictions.
While YOLO may only look once, a conscientious computer vision engineer or researcher certainly looks twice (or more)!
Join the FiftyOne community!
Join the thousands of engineers and data scientists already using FiftyOne to solve some of the most challenging problems in computer vision today!