
Mastering Drone Data Training 

A Comprehensive Guide Using FiftyOne and Ultralytics YOLOv5

Drones are revolutionizing industries from agriculture to surveillance, and the ability to accurately detect and analyze objects in aerial imagery is becoming increasingly valuable. Harnessing the potential of drone data through modern object detection techniques is an exciting endeavor.

This tutorial dives deep into the powerful synergy between FiftyOne and Ultralytics YOLOv5, two cutting-edge tools that, when combined, offer a robust solution for training drone data. Whether you’re a seasoned machine learning practitioner or a drone enthusiast venturing into the realm of AI, this guide will equip you with the knowledge and tools to navigate the complexities of drone data training and object detection.

Wait, What’s FiftyOne?

FiftyOne is an open source machine learning toolset that enables data science teams to improve the performance of their computer vision models by helping them curate high quality datasets, evaluate models, find mistakes, visualize embeddings, and get to production faster.

Getting Started With Your Data

Drone data is particularly sensitive for computer vision, and it is important to keep all collection factors in mind; one small inconsistency can severely hurt model accuracy. When curating or collecting data, take into account the camera angle, flight height, lens type, and more, so that the data stays consistent with how you plan to use your drone for its computer vision task. That task may be search and rescue, surveying, agriculture, or traffic detection. Each of these tasks has different flight requirements, and the drone will be “seeing” its work differently.

Drone data at different angles like the above can lead to problematic training

One easy-to-picture issue is the changing height of the terrain below the drone, which can lead to troublesome data. If the use case expects a certain height above the target, that height should also be kept consistent during data collection. If the drone should be 50 feet above the ground but you are flying over a hill, the drone should raise its altitude accordingly. Neglecting to do so will not only change the appearance of the images, but can also leave gaps in your scan coverage. None of these issues will totally prevent you from training a model, but they can impact performance.

After the data has been collected, the only levers left are additional annotation and data curation. Setting some guidelines for what data is and is not acceptable for the use case is a great start.

For this walkthrough on using FiftyOne and Ultralytics to train an ML model on drone data, I will be using the Kaggle Roundabout Aerial Images dataset, but you can apply the same workflow to any annotated detection dataset, so feel free to use your own.

Training With YOLOv5

One of the most painful parts of training detection models is converting data from one format to another. Thankfully, the days of parsing multi-gigabyte COCO JSON files or scrambling together VOC XML files are over. FiftyOne lets you natively convert your dataset to any supported format quickly and easily. We can go from raw data all the way to training in a few steps.

Step 1: Load Your Data

Let’s start by loading our VOC dataset. 

import fiftyone as fo
import fiftyone.utils.random as four
from fiftyone import Dataset
from fiftyone.types import VOCDetectionDataset

# Path to the dataset directory
dataset_dir = "./original/original"

# Load the VOC-format annotations and images into a FiftyOne dataset
dataset = Dataset.from_dir(
    dataset_dir,
    dataset_type=VOCDetectionDataset,
    label_field="ground_truth",
    name="drone_original",
    overwrite=True,
)
dataset.persistent = True
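
As an optional sanity check, you can print the dataset and run a couple of aggregations to confirm that the import picked up the samples and labels you expect (count and distinct are standard FiftyOne aggregations):

# Confirm the import: sample count, schema, and the label classes present
print(dataset)
print(dataset.count("ground_truth.detections"))
print(dataset.distinct("ground_truth.detections.label"))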

Now that all of our data is loaded, it is time for us to convert it into the YOLO format.

Step 2: Convert Into YOLO Format

To do this, we first shuffle and split our data using FiftyOne’s random_split. With it, we can easily make an 85/15 train/val split and export each split to the YOLOv5 format. The export creates a YAML file that points to the proper directories and will be used during training. With these few steps, our data is ready to train with Ultralytics!

four.random_split(dataset, {"val": 0.15, "train": 0.85})
val_view = dataset.match_tags("val")
train_view = dataset.match_tags("train")

val_view.export(
    export_dir="yolo_drone/",
    split="val",
    dataset_type=fo.types.YOLOv5Dataset,
)
train_view.export(
    export_dir="yolo_drone/",
    split="train",
    dataset_type=fo.types.YOLOv5Dataset,
)
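
If you want to verify the export before training, the YOLOv5-format directory can be loaded straight back into FiftyOne. This is just a sanity-check sketch; the split argument mirrors the splits exported above:

# Re-import the exported train split to confirm the YOLO-format files are readable
check = fo.Dataset.from_dir(
    dataset_dir="yolo_drone",
    dataset_type=fo.types.YOLOv5Dataset,
    split="train",
)
print(check)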

Step 3: Training With Ultralytics

With our data ready to go, next up is training a model. With great tooling and an extremely capable model, Ultralytics is a great option for training object detection models. To get started, clone their YOLOv5 repository and install its requirements by running these commands in your terminal:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Ultralytics offers many training variations, such as multi-GPU training, hyperparameter search, and pruning. For the sake of this demo, we will stick to the default training. To achieve the highest accuracy, it is recommended that you tweak the training to best fit your use case and data. For tutorials and tips on how to take advantage of YOLOv5 features, hop on over to their docs. For the default training, follow along with:

python3 /path/to/yolov5/train.py --data ./yolo_drone/dataset.yaml --weights yolov5s.pt --img 640

As is the case with most training runs, this can take quite a while on most machines and realistically requires a GPU to finish in a reasonable amount of time. After the model has finished training and you are satisfied with the results, we can hop back into FiftyOne for some insights on how well the model trained.
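
Before kicking off a long run, it can be worth confirming that PyTorch actually sees a GPU; a minimal check:

import torch

# Training falls back to CPU (and becomes very slow) if no CUDA device is visible
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))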

Step 4: Post Training

Our model has been trained and our weights have been saved. Now, in order to deploy or run inference with our trained model, we need to export it to the format of our choosing. I chose TorchScript for its fast, compiled nature and ease of use.

First, locate your weights at /path/to/yolov5/runs/train/exp#/weights/best.pt. Then, to export to TorchScript, run the following command:

python /path/to/yolov5/export.py --weights YOUR_WEIGHTS.pt --include torchscript

This will save the TorchScript file in the same location. The Ultralytics training run should have already given us several evaluation metrics, such as loss and mAP. However, it is beneficial to take a deeper look and find exactly which images our model struggled with, so we can develop a strategy to address them in future experiments. Let’s hop into FiftyOne for the answers!

Prepping the Model for Inference

import numpy as np
import torch
import torchvision

# Load the exported TorchScript model (TorchScript archives are loaded with torch.jit.load)
model = torch.jit.load("/path/to/yolov5/runs/train/exp1/weights/best.torchscript")

# Define the postprocessing (NMS) function
def non_max_suppression(
    prediction,
    conf_thres=0.25,
    iou_thres=0.45,
    classes=None,
    agnostic=False,
    multi_label=False,
    labels=(),
    max_det=300,
    nm=0,  # number of masks
):
    """Non-Maximum Suppression (NMS) on inference results to reject overlapping detections

    Returns:
        list of detections, one (n,6) tensor per image [xyxy, conf, cls]
    """

    # Checks
    assert 0 <= conf_thres <= 1, f'Invalid Confidence threshold {conf_thres}, valid values are between 0.0 and 1.0'
    assert 0 <= iou_thres <= 1, f'Invalid IoU {iou_thres}, valid values are between 0.0 and 1.0'
    if isinstance(prediction, (list, tuple)):  # YOLOv5 model in validation mode, output = (inference_out, loss_out)
        prediction = prediction[0]  # select only inference output

    device = prediction.device
    mps = 'mps' in device.type  # Apple MPS
    if mps:  # MPS not fully supported yet, convert tensors to CPU before NMS
        prediction = prediction.cpu()
    bs = prediction.shape[0]  # batch size
    nc = prediction.shape[2] - nm - 5  # number of classes
    xc = prediction[..., 4] > conf_thres  # candidates

    # Settings
    # min_wh = 2  # (pixels) minimum box width and height
    max_wh = 7680  # (pixels) maximum box width and height
    max_nms = 30000  # maximum number of boxes into torchvision.ops.nms()
    time_limit = 0.5 + 0.05 * bs  # seconds to quit after
    redundant = True  # require redundant detections
    multi_label &= nc > 1  # multiple labels per box (adds 0.5ms/img)
    merge = False  # use merge-NMS

    mi = 5 + nc  # mask start index
    output = [torch.zeros((0, 6 + nm), device=prediction.device)] * bs
    for xi, x in enumerate(prediction):  # image index, image inference
        # Apply constraints
        # x[((x[..., 2:4] < min_wh) | (x[..., 2:4] > max_wh)).any(1), 4] = 0  # width-height
        x = x[xc[xi]]  # confidence

        # Cat apriori labels if autolabelling
        if labels and len(labels[xi]):
            lb = labels[xi]
            v = torch.zeros((len(lb), nc + nm + 5), device=x.device)
            v[:, :4] = lb[:, 1:5]  # box
            v[:, 4] = 1.0  # conf
            v[range(len(lb)), lb[:, 0].long() + 5] = 1.0  # cls
            x = torch.cat((x, v), 0)

        # If none remain process next image
        if not x.shape[0]:
            continue

        # Compute conf
        x[:, 5:] *= x[:, 4:5]  # conf = obj_conf * cls_conf

        # Box/Mask
        box = xywh2xyxy(x[:, :4])  # (center_x, center_y, width, height) to (x1, y1, x2, y2)
        mask = x[:, mi:]  # zero columns if no masks

        # Detections matrix nx6 (xyxy, conf, cls)
        if multi_label:
            i, j = (x[:, 5:mi] > conf_thres).nonzero(as_tuple=False).T
            x = torch.cat((box[i], x[i, 5 + j, None], j[:, None].float(), mask[i]), 1)
        else:  # best class only
            conf, j = x[:, 5:mi].max(1, keepdim=True)
            x = torch.cat((box, conf, j.float(), mask), 1)[conf.view(-1) > conf_thres]

        # Filter by class
        if classes is not None:
            x = x[(x[:, 5:6] == torch.tensor(classes, device=x.device)).any(1)]

        # Apply finite constraint
        # if not torch.isfinite(x).all():
        #     x = x[torch.isfinite(x).all(1)]

        # Check shape
        n = x.shape[0]  # number of boxes
        if not n:  # no boxes
            continue
        x = x[x[:, 4].argsort(descending=True)[:max_nms]]  # sort by confidence and remove excess boxes

        # Batched NMS
        c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes
        boxes, scores = x[:, :4] + c, x[:, 4]  # boxes (offset by class), scores
        i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS
        i = i[:max_det]  # limit detections
        if merge and (1 < n < 3E3):  # Merge NMS (boxes merged using weighted mean)
            # update boxes as boxes(i,4) = weights(i,n) * boxes(n,4)
            iou = box_iou(boxes[i], boxes) > iou_thres  # iou matrix
            weights = iou * scores[None]  # box weights
            x[i, :4] = torch.mm(weights, x[:, :4]).float() / weights.sum(1, keepdim=True)  # merged boxes
            if redundant:
                i = i[iou.sum(1) > 1]  # require redundancy

        output[xi] = x[i]
        if mps:
            output[xi] = output[xi].to(device)

    return output

def xywh2xyxy(x):
    # Convert nx4 boxes from [x, y, w, h] to [x1, y1, x2, y2] where xy1=top-left, xy2=bottom-right
    y = x.clone() if isinstance(x, torch.Tensor) else np.copy(x)
    y[..., 0] = x[..., 0] - x[..., 2] / 2  # top left x
    y[..., 1] = x[..., 1] - x[..., 3] / 2  # top left y
    y[..., 2] = x[..., 0] + x[..., 2] / 2  # bottom right x
    y[..., 3] = x[..., 1] + x[..., 3] / 2  # bottom right y
    return y

def format_detections(preds):
    # Convert NMS output rows [x1, y1, x2, y2, conf, cls] into FiftyOne Detections.
    # Note: `classes`, `w`, and `h` are globals defined in the inference loop below.
    detections = []
    for x in preds:
        label = x[5].cpu().detach().numpy()
        score = x[4].cpu().detach().numpy()
        box = x[:4].cpu().detach().numpy()
        x1, y1, x2, y2 = box
        # FiftyOne expects relative [top-left-x, top-left-y, width, height] coordinates
        rel_box = [x1 / w, y1 / h, (x2 - x1) / w, (y2 - y1) / h]
        detections.append(
            fo.Detection(
                label=classes[int(label)],
                bounding_box=rel_box,
                confidence=float(score),
            )
        )
    return detections

With our model loaded and ready to go, let’s take a slice of our dataset and introspect a bit. We start by creating a prediction view of 100 samples:

predictions_view = dataset.take(100, seed=51)

We follow up with an inference loop that runs the model on the 100 images and adds the predicted detections to the samples. This will allow us to compare predictions against ground truth in the FiftyOne App later.

from PIL import Image
import cv2
from torchvision.transforms import functional as func

device = "cuda"
classes = ["vehicle", "cycle", "truck", "bus", "van"]
model = model.to(device)

with fo.ProgressBar() as pb:
    for sample in pb(predictions_view):
        # Load image (OpenCV reads BGR; convert to RGB to match training)
        image = cv2.imread(sample.filepath)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = cv2.resize(image, (640, 640))
        image = func.to_tensor(image).to(device)
        c, h, w = image.shape

        # Perform inference
        preds = model(image.unsqueeze(0))
        out = non_max_suppression(preds)
        detections = format_detections(out[0])

        # Save predictions to the dataset
        sample["yolov5"] = fo.Detections(detections=detections)
        sample.save()

session = fo.launch_app(dataset)
session.view = predictions_view

With the App open and our new view in place, we can extract fresh insights from our data. Immediately, we can see our predicted detections overlaid on the ground truth, which makes a quick qualitative performance check easy. Additionally, hiding the ground truths gives a clear picture of missed detections. Another valuable tip is to experiment with the label confidence sliders, which offer a glimpse of our high- and low-confidence predictions.
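
The confidence sliders live in the App, but the same idea can be expressed programmatically with a view stage; the 0.75 cutoff below is just an arbitrary example threshold:

from fiftyone import ViewField as F

# Keep only reasonably confident YOLOv5 predictions for a focused look
high_conf_view = predictions_view.filter_labels("yolov5", F("confidence") > 0.75)
session.view = high_conf_view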

Upon analyzing the data, I can draw conclusions that were previously inaccessible without a thorough examination:

  1. The model performs strongly, with high confidence, on cars going around the roundabout
  2. The model struggles with closely bunched cars, such as parked cars
  3. The model sometimes mistakes rectangular objects for vehicles

Investigating the Embeddings

With the FiftyOne Brain, we can take an even deeper look at our data and predictions by using embeddings. By running a couple of commands beforehand, we can open a powerful visualization tool that shows the groups within your data and how your model performs on them. Take a look below:

import fiftyone.brain as fob

# Grab the mAP and IoUs
eval_results = predictions_view.evaluate_detections(
    "yolov5",
    gt_field="ground_truth",
    eval_key="eval",
    compute_mAP=True,
)

# Compute ground_truth embeddings
gt_results = fob.compute_visualization(
    predictions_view, patches_field="ground_truth", brain_key="gt_viz"
)

# Compute yolov5 embeddings
yolo_results = fob.compute_visualization(
    predictions_view, patches_field="yolov5", brain_key="yolo_viz"
)

session.view = predictions_view
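
Because we passed compute_mAP=True, the eval_results object above also exposes aggregate metrics, and the eval key populates per-sample true/false positive counts; a small sketch of pulling these out:

# Overall mAP and a per-class precision/recall report
print(eval_results.mAP())
eval_results.print_report()

# Surface the samples with the most false positives to inspect failure modes
session.view = predictions_view.sort_by("eval_fp", reverse=True)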

From here we can grab great insights into where we may need to add to our dataset. We could add more views of parked cars, add street scenes other than roundabouts, or even consider lowering the drone’s altitude to capture more detail about each car. Comparing our highest- and lowest-confidence predictions against the ground truth teaches us more about our data than a simple training run or aggregate evaluation ever could.

If you’d like to learn more, check out the talk I gave in August at the Computer Vision Meetup, which covers many of the topics in this blog.

In summary, this tutorial has explored the exciting realm of harnessing the potential of drone data, spotlighting the synergy between FiftyOne and Ultralytics YOLOv5. By meticulously curating drone data and leveraging these powerful tools, this guide equips both AI enthusiasts and machine learning practitioners with the expertise to navigate the complexities of object detection training. The provided steps, from data preparation and model training to post-training analysis, underscore the importance of tailored approaches for specific use cases of working with drone data. By offering insights into model performance and utilizing embeddings for deeper analysis, this tutorial empowers users to drive accuracy and make informed decisions in the dynamic intersection of drone technology and artificial intelligence.