Using Slicing Aided Hyper Inference

Object detection is one of the fundamental tasks in computer vision. At a high level, it involves predicting the locations and classes of objects in an image. State-of-the-art (SOTA) deep learning models like those in the You-Only-Look-Once (YOLO) family have reached remarkable levels of accuracy. However, one notoriously challenging frontier in object detection is small objects.

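In this post, you will learn how to detect small objects in your dataset using Slicing Aided Hyper Inference (SAHI). We’ll cover why small objects are hard to detect, how SAHI works, how to apply SAHI to your dataset, and how to evaluate the quality of the resulting predictions.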

Why Is Detecting Small Objects Hard?

They Are Small

First and foremost, detecting small objects is hard because small objects are, well, small. The smaller the object, the less information the detection model has to work with. If a car is far off in the distance, it might only occupy a few pixels in our image. In much the same way humans have trouble making out distant objects, our model has a harder time identifying cars without visually discernible features like wheels and license plates!

Training Data

Models are only as good as the data they are trained on. Most of the standard object detection datasets and benchmarks focus on medium-to-large objects, which means that most off-the-shelf object detection models are not optimized for small object detection.

Fixed Input Sizes

Object detection models typically take inputs of fixed sizes. For instance, YOLOv8 is trained on images with a maximum side length of 640 pixels. This means that when we feed it an image of size 1920x1080, the model will downsample the image to 640x360 before making predictions, decreasing the resolution and discarding important information for small objects.
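To make the arithmetic concrete, here is a minimal sketch of what that resize does to a small object. It assumes a simple aspect-preserving resize to a 640-pixel longest side and ignores any padding a real preprocessing pipeline may add; the helper names are ours, not part of any library.

```python
def resized_shape(width, height, max_side=640):
    """Scale (width, height) so the longer side equals max_side, keeping aspect ratio."""
    scale = max_side / max(width, height)
    return round(width * scale), round(height * scale), scale


def resized_object_size(obj_w, obj_h, scale):
    """Size in pixels of an object after the image has been scaled by `scale`."""
    return obj_w * scale, obj_h * scale


new_w, new_h, scale = resized_shape(1920, 1080)
print(new_w, new_h)

# A 30x30 pixel object in the original image shrinks to roughly 10x10 pixels
print(resized_object_size(30, 30, scale))
```

An object that was already small loses about 90% of its pixel area before the model ever sees it.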

How SAHI Works

Illustration of Slicing Aided Hyper Inference. Image courtesy of SAHI GitHub Repo.

Theoretically, you could train a model on larger images to improve the detection of small objects. Practically, however, this would require more memory, more computational power, and datasets that are more labor-intensive to create.

An alternative is to take an existing object detection model, apply it to fixed-size patches or slices of our image, and then stitch the results together. This is the idea behind Slicing Aided Hyper Inference!

SAHI works by dividing an image into slices that completely cover it and running inference on each of these slices with a specified detection model. The predictions across all of these slices are then merged together to generate one list of detections across the entire image. The “hyper” in SAHI comes from the fact that SAHI’s output is not the result of a single model inference but of computations involving multiple model inferences.
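The slicing step can be sketched in a few lines of plain Python. This is an illustration of the idea, not SAHI's actual implementation: `slice_boxes` is a hypothetical helper, and the choice to shift the last row and column inward so every slice keeps the full size is an assumption of this sketch.

```python
def slice_boxes(image_w, image_h, slice_w, slice_h, overlap_w=0.0, overlap_h=0.0):
    """Return [x_min, y_min, x_max, y_max] windows that together cover the image."""
    # Step between consecutive slices; overlapping slices take smaller steps
    step_x = max(1, slice_w - int(slice_w * overlap_w))
    step_y = max(1, slice_h - int(slice_h * overlap_h))
    boxes = []
    y = 0
    while True:
        y_max = min(y + slice_h, image_h)
        y_min = max(0, y_max - slice_h)  # shift the last row up instead of shrinking it
        x = 0
        while True:
            x_max = min(x + slice_w, image_w)
            x_min = max(0, x_max - slice_w)  # same for the last column
            boxes.append([x_min, y_min, x_max, y_max])
            if x_max >= image_w:
                break
            x += step_x
        if y_max >= image_h:
            break
        y += step_y
    return boxes


no_overlap = slice_boxes(1920, 1080, 640, 640)
with_overlap = slice_boxes(1920, 1080, 640, 640, overlap_w=0.2, overlap_h=0.2)
print(len(no_overlap), len(with_overlap))
```

Each window is then cropped from the image and passed to the detector on its own, so a distant car fills a much larger share of the model's input.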

💡SAHI slices are allowed to overlap (as illustrated in the GIF above), which can help ensure that enough of an object is in at least one slice to be detected.
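Overlap has a side effect: the same object can be detected in two neighboring slices, so the merge step has to move every box back into full-image coordinates and drop duplicates. The sketch below uses plain greedy non-maximum suppression as a stand-in; SAHI's own post-processing may differ, and all function names here are hypothetical.

```python
def to_full_image(box, slice_origin):
    """Shift a slice-local [x_min, y_min, x_max, y_max] box by the slice's top-left corner."""
    ox, oy = slice_origin
    x1, y1, x2, y2 = box
    return [x1 + ox, y1 + oy, x2 + ox, y2 + oy]


def iou(a, b):
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_detections(per_slice, iou_threshold=0.5):
    """per_slice: list of (slice_origin, [(box, label, score), ...]) pairs."""
    shifted = [
        (to_full_image(box, origin), label, score)
        for origin, detections in per_slice
        for box, label, score in detections
    ]
    # Visit the most confident detections first and keep one per object
    shifted.sort(key=lambda d: d[2], reverse=True)
    kept = []
    for box, label, score in shifted:
        is_duplicate = any(
            kept_label == label and iou(box, kept_box) >= iou_threshold
            for kept_box, kept_label, _ in kept
        )
        if not is_duplicate:
            kept.append((box, label, score))
    return kept


per_slice = [
    ((0, 0), [([500, 100, 560, 160], "car", 0.9)]),
    ((512, 0), [([0, 100, 50, 160], "car", 0.8), ([200, 300, 230, 330], "person", 0.7)]),
]
for detection in merge_detections(per_slice):
    print(detection)
```

In this toy input, the car straddles the boundary between two slices and is reported twice; only the higher-scoring copy survives the merge.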

The key advantage of using SAHI is that it is model-agnostic. SAHI can leverage today's SOTA object detection models and whatever the SOTA model happens to be tomorrow!

Of course, there is no such thing as a free lunch. In exchange for “hyper inference,” you run many more forward passes of your detection model per image, in addition to the processing required to stitch the results together.
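A back-of-the-envelope estimate of that cost is easy to write down. The sketch below counts slices per image under the assumption that slices advance by the slice size minus the overlap; it does not count any extra full-image pass a library might also run, and the helper names are hypothetical.

```python
import math


def slices_per_axis(image_len, slice_len, overlap_ratio):
    """Number of slices needed to cover one image dimension."""
    if image_len <= slice_len:
        return 1
    step = slice_len - int(slice_len * overlap_ratio)
    return math.ceil((image_len - slice_len) / step) + 1


def forward_passes(image_w, image_h, slice_w, slice_h, overlap=0.2):
    """Forward passes per image: one per slice."""
    return slices_per_axis(image_w, slice_w, overlap) * slices_per_axis(image_h, slice_h, overlap)


for size in (640, 320):
    print(size, forward_passes(1920, 1080, size, size))
```

Halving the slice side roughly quadruples the number of forward passes, which is worth keeping in mind when choosing slice sizes.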

Setup

To illustrate how SAHI can be applied to detect small objects, we will use the VisDrone detection dataset from the AISKYEYE team at the Lab of Machine Learning and Data Mining, Tianjin University, China. This dataset consists of 8,629 images with side lengths ranging from 360 pixels to 2,000 pixels, making it an ideal testing ground for SAHI. Ultralytics’ YOLOv8l will serve as our base object detection model.

We will be utilizing the following libraries:

fiftyone
huggingface_hub
ultralytics
sahi

If you haven't already, install the latest versions of these libraries. You will need fiftyone>=0.23.8 to load VisDrone from the Hugging Face Hub:

Now in a Python process, let’s import the FiftyOne modules we will use to query and manage our data:

And just like that, we are ready to load our data! We’ll use the load_from_hub() function from FiftyOne’s Hugging Face utils to load part of the VisDrone dataset directly from the Hugging Face Hub via its repo_id. For demonstration and to keep code execution as fast as possible, we will only take the first 100 images from the dataset. We will also give this new dataset we are creating the name “sahi-test”:

Before adding any predictions, let’s take a look at our dataset in the FiftyOne App:

💡Check out FiftyOne’s Hugging Face Integration for more information.

Standard Inference with YOLOv8

In the next section, we will run hyper-inference on our data using SAHI. Before we bring SAHI into the picture, let’s run standard object detection inference on our data with the large variant of Ultralytics’ YOLOv8 model.

First, we create an ultralytics.YOLO model instance, downloading the model checkpoint if necessary. Then, we apply this model to our dataset and store the results in the field “base_model” on our samples:

💡Check out FiftyOne’s Ultralytics Integration for more information.

We can see a few things by looking at the model's predictions next to the ground truth labels. First and foremost, the classes detected by our YOLOv8l model are different from the ground truth classes in the VisDrone dataset. Our YOLO model was trained on the COCO dataset, which has 80 classes, while the VisDrone dataset has 12 classes, including an ignore_regions class.

To simplify the comparison, we'll focus on just the few most common classes in the dataset, and will map the VisDrone classes to the COCO classes as follows:

And then filter our labels only to include the classes we're interested in:
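The logic of these two steps, remapping and then filtering, can be sketched on plain lists of label strings. In FiftyOne itself this kind of work is typically done with view stages such as map_labels() and filter_labels(); the mapping and the set of classes below are hypothetical placeholders for illustration, so check the class names in your copy of the dataset before reusing them.

```python
# Hypothetical mapping and class list, for illustration only
VISDRONE_TO_COCO = {"pedestrian": "person", "people": "person", "van": "car"}
CLASSES_OF_INTEREST = {"person", "car", "truck"}


def remap_and_filter(labels, mapping=VISDRONE_TO_COCO, keep=CLASSES_OF_INTEREST):
    """Rename labels via `mapping`, then drop anything outside `keep`."""
    remapped = [mapping.get(label, label) for label in labels]
    return [label for label in remapped if label in keep]


print(remap_and_filter(["pedestrian", "van", "bicycle", "car"]))
```

After this step, ground truth and predictions share one vocabulary, so a “person” prediction can be compared directly against what VisDrone annotates under a different name.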

Now that we have our base model predictions, let’s use SAHI to slice and dice our images 💪.

Using SAHI for Hyper Inference

The SAHI technique is implemented in the sahi Python package we installed earlier. SAHI is a framework compatible with many object detection models, including YOLOv8. We can choose the detection model we want to use and create an instance of any class that subclasses sahi.models.DetectionModel, with support for YOLOv8, YOLOv5, and even Hugging Face Transformers models.

We will create our model object using SAHI's AutoDetectionModel class, specifying the model type and the path to the checkpoint file:

Before we generate sliced predictions, let's inspect the model's predictions on a trial image using SAHI's get_prediction() function:

Fortunately, SAHI results objects have a to_fiftyone_detections() method, which converts the results to a list of FiftyOne Detection objects:

This makes our lives easy, so we can focus on the data rather than the nitty-gritty details of format conversions.

SAHI's get_sliced_prediction() function works the same way as get_prediction(), with a few additional hyperparameters that let us configure how the image is sliced. In particular, we can specify the slice height and width, and the overlap between slices. Here's an example:

As a preliminary check, we can compare the number of detections in the sliced predictions to the number of detections in the original predictions:

We can see that the number of predictions increased substantially! We have yet to determine if the additional predictions are valid or if we just have more false positives. We'll do this using FiftyOne's Evaluation API shortly. We also want to find a good set of hyperparameters for our slicing. We will need to apply SAHI to the entire dataset to do all of these things. Let's do that now!

To simplify the process, we'll define a function that adds predictions to a sample in a specified label field, and then we will iterate over the dataset, applying the function to each sample. This function will pass the sample's filepath and slicing hyperparameters to get_sliced_prediction(), and then add the predictions to the sample in the specified label field:

We'll keep the slice overlap fixed at 0.2, and see how the slice height and width affect the quality of the predictions:
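The shape of that loop can be sketched without any of the libraries involved. Below, plain dicts stand in for FiftyOne samples and a stub callable stands in for get_sliced_prediction(); the keyword names match the slicing hyperparameters discussed above, while the field names and slice sizes are hypothetical choices for illustration.

```python
def predict_with_slicing(sample, label_field, predictor, **slicing_kwargs):
    """Run a sliced predictor on the sample's image and store the result on the sample."""
    sample[label_field] = predictor(sample["filepath"], **slicing_kwargs)


def stub_predictor(filepath, slice_height, slice_width,
                   overlap_height_ratio, overlap_width_ratio):
    # Placeholder: echoes its configuration instead of returning real detections
    return {
        "slice": (slice_height, slice_width),
        "overlap": (overlap_height_ratio, overlap_width_ratio),
    }


# Overlap stays fixed at 0.2; only the slice size changes between runs
overlap = {"overlap_height_ratio": 0.2, "overlap_width_ratio": 0.2}
configs = {"small_slices": 320, "large_slices": 480}  # hypothetical field names and sizes

samples = [{"filepath": "img_0.jpg"}, {"filepath": "img_1.jpg"}]
for sample in samples:
    for field, size in configs.items():
        predict_with_slicing(
            sample, field, stub_predictor,
            slice_height=size, slice_width=size, **overlap,
        )

print(samples[0])
```

Storing each configuration in its own label field keeps the runs side by side on every sample, which is what makes the later comparison straightforward.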

Note how these inference times are much longer than the original inference time. This is because we're running the model on multiple slices per image, which increases the number of forward passes the model has to make. We're making a trade-off to improve the detection of small objects.

Now let's once again filter our labels only to include the classes we're interested in and visualize the results in the FiftyOne App:

The results certainly look promising! From a few visual examples, slicing seems to improve the coverage of ground truth detections, and smaller slices, in particular, seem to lead to more of the person detections being captured. But how can we know for sure? Let's run an evaluation routine to mark the detections as true positives, false positives, or false negatives to compare the sliced predictions to the ground truth. We'll use our filtered view's evaluate_detections() method.

Evaluating SAHI Predictions

Sticking with our filtered view of the dataset, let's run an evaluation routine comparing our predictions from each prediction label field to the ground truth labels. Here, we use the default IoU threshold of 0.5, but you can adjust this as needed:
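To see what such an evaluation does under the hood, here is a simplified sketch of IoU-based matching. It is not FiftyOne's evaluation protocol, which handles more cases; it greedily matches each prediction, most confident first, to the best unmatched ground truth box of the same class. Boxes use [x, y, width, height] format and the function names are hypothetical.

```python
def iou_xywh(a, b):
    """Intersection over union of two [x, y, width, height] boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def match_detections(predictions, ground_truth, iou_threshold=0.5):
    """predictions: (box, label, confidence) tuples; ground_truth: (box, label) tuples.

    Returns counts of (true positives, false positives, false negatives).
    """
    unmatched = list(range(len(ground_truth)))
    tp = fp = 0
    for box, label, _ in sorted(predictions, key=lambda p: p[2], reverse=True):
        best_idx, best_iou = None, iou_threshold
        for idx in unmatched:
            gt_box, gt_label = ground_truth[idx]
            overlap = iou_xywh(box, gt_box)
            if gt_label == label and overlap >= best_iou:
                best_idx, best_iou = idx, overlap
        if best_idx is None:
            fp += 1  # no ground truth object explains this prediction
        else:
            tp += 1
            unmatched.remove(best_idx)
    return tp, fp, len(unmatched)  # leftover ground truth boxes are false negatives


ground_truth = [([0.1, 0.1, 0.2, 0.2], "car"), ([0.6, 0.6, 0.1, 0.1], "person")]
predictions = [([0.11, 0.1, 0.2, 0.2], "car", 0.9), ([0.3, 0.3, 0.1, 0.1], "car", 0.8)]
print(match_detections(predictions, ground_truth))
```

Raising the IoU threshold makes matching stricter, which turns loosely localized predictions into false positives and their targets into false negatives.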

Let's print a report for each:

We can see that as we introduce more slices, the number of false positives increases, while the number of false negatives decreases. This is expected, as the model is able to detect more objects with more slices, but also makes more mistakes! You could apply more aggressive confidence thresholding to combat this increase in false positives, but even without doing this the F1-score has significantly improved.
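The interplay between these counts and the F1-score is worth spelling out. The sketch below computes precision, recall, and F1 from true positive, false positive, and false negative counts; the two sets of counts are made-up numbers chosen to show the pattern, not results from this dataset.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up counts: slicing adds false positives but recovers many missed objects
base = precision_recall_f1(tp=40, fp=10, fn=60)
sliced = precision_recall_f1(tp=70, fp=30, fn=30)
print(base)
print(sliced)
```

In this invented example, precision drops while recall rises sharply, and because F1 is the harmonic mean of the two, the large recall gain outweighs the modest precision loss.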

Let's dive a little bit deeper into these results. We noted earlier that the model struggles with small objects, so let's see how these three approaches fare on objects smaller than 32x32 pixels. We can perform this filtering using FiftyOne's ViewField:
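The arithmetic behind such a filter can be sketched in plain Python. FiftyOne stores bounding boxes as relative [x, y, width, height] values in [0, 1], so the absolute pixel area is the relative area multiplied by the image width and height. The helper names and the dict-based detections below are stand-ins for illustration.

```python
def absolute_area(bounding_box, image_w, image_h):
    """Pixel area of a relative [x, y, width, height] box with values in [0, 1]."""
    _, _, rel_w, rel_h = bounding_box
    return (rel_w * image_w) * (rel_h * image_h)


def small_boxes(detections, image_w, image_h, max_side=32):
    """Keep detections whose pixel area is below max_side x max_side."""
    return [
        d for d in detections
        if absolute_area(d["bounding_box"], image_w, image_h) < max_side ** 2
    ]


detections = [
    {"label": "person", "bounding_box": [0.5, 0.5, 0.01, 0.02]},  # about 19x22 pixels
    {"label": "car", "bounding_box": [0.1, 0.1, 0.1, 0.1]},       # about 192x108 pixels
]
print([d["label"] for d in small_boxes(detections, 1920, 1080)])
```

Note that the threshold is on area, so a long thin box can count as small even if one of its sides exceeds 32 pixels.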

If we evaluate our models on these views and print reports as before, we can clearly see the value that SAHI provides! The recall when using SAHI is much higher for small objects without a significant drop-off in precision, leading to an improved F1-score. This is especially pronounced for person detections, where the F1-score is tripled!

What’s Next

In this walkthrough, we've covered how to add SAHI predictions to your data and how to rigorously evaluate the impact of slicing on prediction quality. We've seen how Slicing Aided Hyper Inference (SAHI) can improve the recall and F1-score for detection, especially for small objects, without needing to train a model on larger images.

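To maximize the effectiveness of SAHI, you may want to experiment with the slicing hyperparameters (slice height, slice width, and overlap), the base object detection model, and the confidence threshold.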

Regardless of which knobs you want to turn, it is important to look beyond the one-number metrics. When working on small object detection tasks, the more small objects there are in your images, the more likely it is that some “ground truth” labels are missing. SAHI can help you find potential errors, which you can correct with human-in-the-loop (HITL) workflows.

If you found this helpful, here are some additional resources you may find useful:

Tutorial on Evaluating Object Detections
Tutorial on Finding Object Detection Mistakes
FiftyOne Plugin for Comparing Models on Specific Detections
