If you’ve worked with visual data for machine learning, you’ve likely seen rectangles drawn around objects: bounding boxes. A bounding box in computer vision is a rectangle (2‑D) or cuboid (3‑D) that encloses an object so a model knows where it is. Think of it as the tightest frame that fully contains the target, defined by its coordinates. Labeling these boxes is usually the first step in teaching a model to detect objects.
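Coordinate conventions vary by dataset: COCO stores boxes as `[x, y, width, height]`, while Pascal VOC uses corner coordinates `[xmin, ymin, xmax, ymax]`. A minimal sketch of converting between the two:

```python
def xywh_to_xyxy(box):
    """Convert a COCO-style [x, y, width, height] box to [x1, y1, x2, y2] corners."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

def xyxy_to_xywh(box):
    """Convert [x1, y1, x2, y2] corners to COCO-style [x, y, width, height]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

print(xywh_to_xyxy([10, 20, 30, 40]))  # [10, 20, 40, 60]
```

Getting this conversion wrong is one of the most common sources of silently bad labels, so it is worth checking early.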
Bounding boxes provide ground truth for object‑detection models and a yardstick for evaluation. Metrics such as Intersection over Union (IoU) measure overlap between predicted and true boxes, and datasets like COCO compute mAP from these overlaps to rank models. Without accurate boxes, a detector can’t learn where things are.
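IoU itself is simple to compute for axis-aligned boxes: the area of the intersection rectangle divided by the area of the union. A small sketch, using corner-format `[x1, y1, x2, y2]` boxes:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned [x1, y1, x2, y2] boxes."""
    # Intersection rectangle: the overlap of the two boxes (may be empty)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction shifted 5 px against a 10x10 ground-truth box
print(round(iou([0, 0, 10, 10], [5, 0, 15, 10]), 3))  # 0.333
```

COCO-style mAP sweeps a range of IoU thresholds (0.50 to 0.95), so a prediction counts as a true positive only when it overlaps the ground truth tightly enough.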
Most people start with 2‑D axis‑aligned rectangles in images or video. In multimodal stacks — e.g., autonomous driving — we extend the idea to 3‑D cuboids in LiDAR point clouds. A 3‑D box captures length, width, height, and rotation, letting downstream planners reason in real‑world space.
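A common parameterization for such a cuboid is a center point, three dimensions, and a yaw angle (rotation about the vertical axis). As a sketch of the geometry, assuming that parameterization, the eight world-space corners can be recovered like this:

```python
import math

def cuboid_corners(cx, cy, cz, length, width, height, yaw):
    """Eight corners of a 3-D box centered at (cx, cy, cz), rotated by
    yaw radians about the vertical z-axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-length / 2, length / 2):
        for dy in (-width / 2, width / 2):
            for dz in (-height / 2, height / 2):
                # Rotate the local offset in the ground plane, then translate
                corners.append((cx + dx * c - dy * s,
                                cy + dx * s + dy * c,
                                cz + dz))
    return corners
```

Conventions differ between datasets (axis order, where the box origin sits, the zero direction of yaw), so always check the spec of the dataset you are labeling against.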
Voxel51’s FiftyOne includes a browser‑based App that lets you overlay detections and ground truth. Filter by box area, confidence, or label to surface edge cases instantly. Need more labels? Pipe a view to CVAT or Labelbox via the annotation API, collect fresh boxes, then re‑evaluate — all without leaving your workflow. If you’re working with point clouds, the 3‑D viewer renders cuboids in WebGL so you can orbit around scenes and inspect every angle.
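FiftyOne performs this filtering through its own view and query API; the snippet below is only a framework-agnostic sketch of the triage idea, flagging suspicious boxes by area and confidence, with detections represented as hypothetical plain dicts:

```python
def surface_edge_cases(detections, min_area=32 * 32, min_conf=0.5):
    """Flag detections that are tiny or low-confidence -- the boxes most
    worth a second look or re-annotation. Each detection is a hypothetical
    dict with 'bbox' = [x, y, w, h] and 'confidence'."""
    flagged = []
    for det in detections:
        _, _, w, h = det["bbox"]
        if w * h < min_area or det["confidence"] < min_conf:
            flagged.append(det)
    return flagged
```

Reviewing the flagged subset first concentrates annotation effort where the model (or the labels) are most likely to be wrong.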
Whether you’re boxing stop signs in dash‑cam footage or drawing cuboids around drones in LiDAR, bounding boxes are the scaffolding that lets vision models learn and localize. Clean, well‑curated boxes lead to better‑performing models.