
Why Are Image Segmentation Maps Superior to Bounding Boxes?

A segmentation map is a visual representation created by dividing an image into many parts, or segments, to obtain a pixel-level understanding. Segmenting is crucial to our daily lives, from helping children recognize sounds to helping businesses understand their customers’ preferences. It helps us make sense of chaos, whether in sounds, markets, or pixels.

The evolution of image segmentation began with early milestones in the 1950s–1970s and accelerated in 2012 with the rise of deep neural networks. It has significantly advanced medical imaging (see the ISBI’12 Challenge) and formed the backbone of modern computer vision. Today, segmentation provides a detailed lens for visual analysis.

Paper: U-Net – ArXiv

In computer vision, traditional object detection methods can struggle with nuanced boundaries. Segmentation maps address this by identifying both “what” an object is and “where” it begins and ends, providing pixel-level granularity for better scene interpretation. A segmentation map is a visual representation of an image in which each pixel is labeled with a class.

In this article, we’ll explore the power of segmentation maps in autonomous driving with the help of FiftyOne. We will explore segmentation tools (using FiftyOne and the KITTI dataset) as well as provide practical code snippets in-line here and in the linked notebook. Here’s what you’ll learn:

  • Understand the basics of segmentation maps, their types, and practical segmentation scenarios.
  • Use FiftyOne’s tools to visualize segmentation maps and gain insights from segmented scenes.
  • Learn data annotation and augmentation strategies to generate high-quality, precise segmentation maps with FiftyOne.

The story behind image segmentation

Image segmentation has progressed steadily from its original deterministic methods. Earlier segmentation methods included classical techniques such as Support Vector Machines (SVMs), k-means clustering, edge detection, and graph-based approaches. These foundational methods paved the way for today’s practices.

The year 2015 marked a turning point with the rise of deep learning-based techniques like Fully Convolutional Networks (FCNs). The success of these models relies on their ability to learn intricate feature representations through feature maps. Feature maps are essentially sets of arrays generated by layers within these models. Each array highlights specific features detected in the input image, such as edges, textures, or shapes. These feature maps collectively form a rich feature representation, which is a transformed and more abstract encoding of the original image, optimized for the segmentation task.
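To make feature maps concrete, here’s a minimal sketch (assuming PyTorch and torchvision, which this article doesn’t otherwise require) that pulls the intermediate feature maps out of a pretrained FCN backbone:

import torch
import torchvision.models as models

# Load a pretrained FCN and keep only its feature-extracting backbone
backbone = models.segmentation.fcn_resnet50(weights="DEFAULT").backbone
backbone.eval()

image = torch.randn(1, 3, 224, 224)  # dummy RGB input
with torch.no_grad():
    features = backbone(image)  # an OrderedDict of feature maps

print(features["out"].shape)  # e.g., torch.Size([1, 2048, 28, 28])

Each entry in features is one of those arrays of detected patterns; a segmentation head then turns them back into a per-pixel prediction.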

Later, the introduction of U-Net architecture revolutionized the field with its encoder-decoder architecture and efficient upsampling. This made segmentation more robust and suitable for real-time applications. U-Net’s architecture is particularly effective in segmentation because its encoder path progressively extracts feature maps at different scales. These feature maps are then upsampled and combined in the decoder path to generate precise segmentation masks.
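To illustrate the pattern (a toy sketch, not the actual U-Net architecture), here is the encoder-decoder-with-skip-connection idea in a few lines of PyTorch:

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal encoder-decoder with one skip connection."""

    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.head = nn.Conv2d(32, num_classes, 1)  # 32 = 16 (skip) + 16 (upsampled)

    def forward(self, x):
        e = self.enc(x)                    # encoder feature map
        m = self.mid(self.down(e))         # bottleneck at half resolution
        u = self.up(m)                     # upsample back to input size
        return self.head(torch.cat([u, e], dim=1))  # fuse skip + decode

logits = TinyUNet()(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 2, 64, 64]): one score per class per pixel

The skip connection is what preserves fine spatial detail: the decoder sees both the abstract bottleneck features and the high-resolution encoder features.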

Fast-forward to 2023, and the Segment Anything Model (SAM) has emerged as a pinnacle segmentation model. SAM uses vision transformers to enable zero-shot segmentation across diverse datasets, and it sets a new standard for segmentation capabilities. By learning highly generalizable feature maps, SAM can effectively segment objects in diverse and unseen images.

Let’s now briefly discuss segmentation maps before exploring SAM on the KITTI dataset.

Understanding segmentation maps

Segmentation maps label each pixel in an image based on objects or categories, capturing more precise boundaries and offering clarity beyond simpler object detection techniques.

  • Precise boundaries: Segmentation maps accurately represent contours and edges, making them essential for surgical planning or pathfinding in autonomous systems.
  • Detailed overlap analysis: In cluttered scenes, such as satellite imagery, segmentation maps enable detailed analysis of individual, overlapping objects.
  • Enhanced interpretability: By providing a pixel-level view of objects, segmentation maps make interpretation more transparent, which is crucial in sensitive fields such as healthcare and defense.

Types of image segmentation: Choosing the right segmentation technique

Segmentation techniques vary in complexity, and the right choice depends on the level of detail a task requires. Here are the key approaches:

  • Semantic segmentation assigns every pixel a class label without distinguishing between individual instances (e.g., all cars in a scene share the single label “car”). It is well suited to tasks where class-level understanding is sufficient.

PapersWithCode Semantic Segmentation
  • Instance segmentation differentiates between individual instances of the same class (e.g., multiple sheep in a scene). It is necessary when object-level granularity is crucial.

PapersWithCode Instance Segmentation
  • Panoptic segmentation combines the strengths of semantic and instance segmentation, providing a holistic view of the scene. Every pixel receives a class label, and countable objects additionally receive unique instance IDs. This works well in complex scenes that mix class-level and instance-level understanding (e.g., identifying each person as a unique entity that shares the class “person”).

PapersWithCode Panoptic Segmentation

To create and visualize such segmentation maps, we need to prepare a high-quality annotated dataset and train and evaluate a model for our custom use case. Here, we apply SAM to the KITTI dataset using FiftyOne.

Harnessing FiftyOne for segmentation map visualization

FiftyOne is a tool that enables you to visualize datasets and interpret model performance faster and more effectively (for example, by curating, visualizing, and analyzing segmentation maps). The key value proposition is that better data ultimately leads to better-performing models. For image segmentation specifically, FiftyOne offers a few key capabilities:

  • Streamlined workflows: FiftyOne provides an intuitive UI for labeling, dataset exploration, and model evaluation.
  • Powerful visualization tools: Overlay segmentation maps on images and compute detailed analytics.
  • Targeted data insights: Filter views and identify outliers for focused evaluation.

Dataset loading and exploration

In our example, we can start by loading the KITTI dataset from the FiftyOne Dataset Zoo. The KITTI dataset provides left-camera images and 2D object detections, ideal for autonomous driving tasks.

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

# Load the KITTI training split from the FiftyOne Dataset Zoo
kitti_data = foz.load_zoo_dataset("kitti", split="train")

We can isolate specific classes (e.g., “van”), creating subsets of data for targeted analysis.

# Restrict the ground truth labels to the "Van" class
van_frames = kitti_data.filter_labels("ground_truth", F("label") == "Van")
van_frames

Then we can isolate images with more than ten segmented objects detected above 80% confidence.

# Keep only confident mask predictions, then match images with
# more than ten remaining segmented objects
view = sub_data.filter_labels(
        "segmentations",
        F("confidence") > 0.8
).match(
        F("segmentations.detections").length() > 10
)

This filtering is only possible once we have applied a segmentation model and generated predictions. Let’s do that now.

Running predictions with SAM

Note that applying SAM to generate segmentation masks requires the transformers and segment-anything libraries.

In the following code, we’ll select 500 random images on which to run predictions, apply the model, and save the segmentation results in a label_field for further analysis.

# Load SAM from the FiftyOne Model Zoo
model = foz.load_zoo_model("segment-anything-vitb-torch")

# Segment inside the ground truth boxes of 500 random images
sub_data = kitti_data.take(500)
sub_data.apply_model(
        model,
        label_field="segmentations",
        prompt_field="ground_truth",
)

Unlike many other models, SAM supports location-specific segmentation: by prompting with the ground truth boxes, segmentation runs only within the specified bounding box areas, which avoids clutter. Semantic segmentation classifies each pixel in an image into a category; here, all cars are labeled “car” and all people “pedestrian.”

A Patches View allows close inspection of the bounding box regions in an image.

Instance Segmentation goes beyond classification to differentiate between individual objects. Here, every car in a scene is labeled as a unique entity.

And then we can create an equivalent Patches View of the same instance segmentation task.
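As a sketch (reusing the view and field names from the snippets above), a Patches View can be created and opened in the App like this:

# One "patch" per segmented object, for close-up inspection
patches = view.to_patches("segmentations")
session = fo.launch_app(patches)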

By harnessing FiftyOne and SAM, image segmentation becomes more targeted and efficient, and yields actionable insights (such as evaluating model performance vs. the ground truth).

Creating high-quality segmentation maps

Producing high-quality segmentation maps requires well-annotated data, diverse perspectives, and balanced classes. Here’s how to ensure your segmentation maps are accurate, robust, and ready for real-world applications.

Efficient data annotation

High-quality annotations are the foundation of any segmentation model. Even though annotating every object is expensive, it is essential: poor labeling can mislead the model’s predictions and skew the results.

Best practices

  • Use tools like CVAT that support pixel-level annotation.
  • Collaborate with domain experts to capture task-specific nuances, such as distinguishing similar classes.
  • Conduct quality checks with peers to spot inconsistencies in annotations, like missing a leg of a chair in the corner (trust me, it’s hard to annotate a chair).


Data augmentation

Real-world images often show variability due to lighting, perspective, and spatial resolution changes. Introducing diversity through data augmentation enhances model robustness and generalization across varied conditions. Key techniques include:

  • Applying geometric transformations (rotations, scaling, etc.) to simulate different perspectives of the same input image.
  • Altering photometric features (brightness, saturation, etc.) to account for lighting variations for a robust model.
  • Focusing on specific regions (cropping) or standardizing sizes (resizing).

Tools like FiftyOne’s integration with Albumentations can simplify these processes.
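As a minimal sketch using Albumentations directly (the parameter values here are illustrative, not prescriptive), the same transform can be applied to an image and its mask together so the labels stay aligned:

import numpy as np
import albumentations as A

# Dummy image and mask; replace with real data
image = np.zeros((256, 256, 3), dtype=np.uint8)
mask = np.zeros((256, 256), dtype=np.uint8)

# Geometric and photometric augmentations, applied jointly
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=15, p=0.5),
    A.RandomBrightnessContrast(p=0.5),
])

augmented = transform(image=image, mask=mask)
aug_image, aug_mask = augmented["image"], augmented["mask"]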


Balancing the dataset

A class imbalance can bias the model’s decisions. For example, training solely on daytime urban images may harm performance in other settings. It’s best to ensure that your dataset has diverse representation with roughly equal proportions of classes. If not, the underrepresented classes can be augmented to achieve balance.
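A quick way to check balance in FiftyOne is to count label occurrences (reusing the dataset loaded earlier):

# How often does each class appear in the ground truth?
counts = kitti_data.count_values("ground_truth.detections.label")
print(counts)  # e.g., {"Car": ..., "Van": ..., "Pedestrian": ...}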

Interpreting segmentation maps

Segmentation maps provide a significant amount of information, but their true value lies in how we act on them. Let’s explore a few practical workflows.

Overlaying segmentation maps on original images

A raw segmentation mask can appear abstract, especially to non-technical audiences: on its own, it is just an array of class labels rendered as colors. Overlaying the mask onto the original image combines that pixel-level information with visual context, making the relationships between objects and their surroundings easy to see.

For example, overlaying a segmentation mask on an image of a van shows not only its position but also its precise boundaries, size, and shape.
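One way to render such overlays to disk in FiftyOne is draw_labels (the output directory below is hypothetical):

# Write annotated copies of the images with masks drawn on top
sub_data.draw_labels("/tmp/kitti-annotated", label_fields=["segmentations"])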

Quantifying object areas

It doesn’t stop there: because each pixel is associated with a class label, masks allow accurate measurement of object areas. This is helpful in various applications, such as estimating the proportion of road space occupied by vehicles in a scene to monitor traffic flow.
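As a sketch (assuming the SAM predictions above stored per-object instance masks), pixel areas can be tallied directly:

import numpy as np

# Sum mask pixels per object in the first sample
sample = sub_data.first()
for det in sample["segmentations"].detections:
    if det.mask is not None:
        area_px = int(np.count_nonzero(det.mask))
        print(det.label, area_px)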

By computing the Manhattan distance of each bounding box’s top-left corner from the image origin (coordinates are relative, in [0, 1]) and filtering on a threshold (e.g., keeping only boxes with a distance greater than 1), we can distinguish occupied from free space, which is beneficial for urban planning and autonomous driving.

# Bboxes are in [top-left-x, top-left-y, width, height] format
manhattan_dist = F("bounding_box")[0] + F("bounding_box")[1]

# Only contains predictions whose bounding boxes' top-left corner
# is a Manhattan distance of more than 1 from the origin
view = sub_data.filter_labels(
        "segmentations",
        manhattan_dist > 1
)

Identifying object instances and locations

Instance segmentation identifies and localizes individual objects, aiding analysis in areas such as:

  • Autonomous vehicles: Detecting pedestrians and vehicles for safe navigation.
  • Agriculture: Locating diseased crops for targeted treatment.
  • Retail: Optimizing store layouts by mapping product locations.

Here is an example of the dataset’s segmented masks of different “pedestrian” objects.
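Such a view can be built with a label filter (a sketch that assumes the prompted SAM predictions retain the ground truth labels):

# Isolate pedestrian masks and inspect them as patches
peds = sub_data.filter_labels("segmentations", F("label") == "Pedestrian")
session = fo.launch_app(peds.to_patches("segmentations"))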

Why segmentation is superior to bounding boxes

And now, cutting to the chase: why is segmentation superior to bounding boxes? While bounding boxes estimate object locations by enclosing them in a rectangle, segmentation maps offer pixel-level precision. They capture clearer shapes than bounding boxes and eliminate irrelevant background elements. Minor imperfections (e.g., a missing part of a foot in a mask) can have a cascading effect on model accuracy. Overall, segmentation maps deliver the granularity needed for detailed analyses.

Another example shows segmented objects from the cyclist class with their masks overlaid. Bounding boxes can be a poor geometric fit for this sort of shape; segmentation mapping solves this with pixel-level labeling.

Real-world applications of segmentation maps

Segmentation maps are revolutionizing how we analyze satellite data for environmental monitoring and urban planning. Planet Labs is leading this space with key use cases, including regular updates on building infrastructure changes, insights into new and existing road developments, and early identification of construction activities.

Segmentation maps also empower advanced robotic capabilities across various industries. For example, in cooking, vision-guided segmentation helps identify food items and assists in food preparation. Imagine experiencing Michelin-star dining with robotic chefs delivering precision and artistry. Chef Robotics is pioneering this vision with its ChefOS platform, which features pick-and-place robots and other innovative tools.

The future of segmentation maps: Transformers

The transformer architecture introduced in the paper Attention Is All You Need has driven significant progress in segmentation, with deep learning models now built on attention mechanisms. Vision Transformers (ViTs) and Segmenter use self-attention for better global feature understanding, outperforming traditional convolutional methods. Hierarchical architectures such as Swin Transformers bring CNN-like locality and hierarchy to transformers, enhancing accuracy through hierarchical representations and finer spatial detail.

Other approaches, like weakly-supervised learning, reduce dependency on extensive labeled datasets. Key methods such as Class Activation Maps (CAMs) and Grad-CAM++ help identify object regions with minimal supervision. The CLIMS model integrates text and images, achieving robust segmentation without pixel-level annotations.

Segmentation models are increasingly tailored to specific industries:

  • Healthcare: nnU-Net (“no new net”) offers a self-configuring solution for medical image segmentation, setting new benchmarks in tumor detection.
  • Agriculture: DeepWeeds aids in identifying invasive plants, improving crop yield and pest control.
  • Autonomous Systems: BEVFormer enhances navigation for self-driving cars through bird’s-eye view perception and obstacle detection.
  • Retail: Improved segmentation maps enhance self-checkout systems by accurately identifying products and improving user experience.

These advancements promise to make segmentation more accessible and impactful in real-world applications.

Key takeaways

In this article, we explored segmentation as a computer vision task using the KITTI dataset for autonomous navigation. Segmentation allows for a deeper analysis of images, while object detection provides a high-level overview of the objects present. We covered:

  • Understanding Segmentation Maps: Segmentation maps classify every pixel in an image, providing a comprehensive view of scenes through semantic and instance segmentation.
  • Using FiftyOne for Visualization and Insights: FiftyOne’s tools facilitate the visualization of segmentation maps, allowing developers to overlay predictions, explore dataset patterns, and gain insights.
  • Generating High-Quality Segmentation Maps: Accurate segmentation maps require careful data annotation and refinement. FiftyOne simplifies this process, enabling efficient and reliable results.

Next steps

With tools like FiftyOne, you can truly harness the power of segmentation maps in your projects! We invite you to dive into FiftyOne and visualize your datasets—discover how engaging and insightful this can be.

Don’t forget to check out our related blogs, where you can find advanced features and inspiring real-world case studies!

We’re eager to hear your thoughts: how do you envision segmentation maps influencing the future of AI? Please share your insights or any challenges you face.

Keep in touch for more on using FiftyOne in advanced AI workflows. Whether you’re aiming to boost model performance or learn new techniques, we’re here to support your journey!

 
