One API to Rule Them All: py123d + FiftyOne for Autonomous Driving Data

Jun 8, 2026
From fragmented AV datasets to a single explorable FiftyOne collection in under 200 lines of Python
Autonomous driving research has a dataset problem — not a shortage of data, but a fragmentation of it. Argoverse 2, nuScenes, PandaSet, Waymo, nuPlan, KITTI-360: each one ships with its own file format, its own coordinate conventions, its own label taxonomy, and its own bespoke Python devkit that refuses to coexist with the others in the same environment. If you want to train across datasets, you could spend weeks writing glue code before you write a single line of model code.
If you can relate, check out py123d, a recently updated open-source library (Apache 2.0) that addresses this problem head-on.

What py123d Does

py123d converts raw data from Argoverse 2, nuScenes, nuPlan, KITTI-360, PandaSet, and Waymo into a unified Apache Arrow format, then gives you a single API to read cameras, lidar, HD maps, and labels across all of them. One ArrowSceneAPI object. Same method calls. Every dataset.
Under the hood it uses columnar, memory-mapped, zero-copy Arrow reads for fast and memory-efficient access, supports multiple sensor codecs (MP4/JPEG/PNG for cameras, LAZ/Draco/Arrow IPC for lidar), and avoids storing sensors twice by referencing source files via relative paths.
The motivation is real: as the authors describe in their arXiv paper, each dataset adopts different 2D and 3D modalities with different rates and synchronization schemes, comes in fragmented formats requiring complex dependencies that cannot natively coexist in the same environment, and carries major inconsistencies in annotation conventions that prevent training or measuring generalization across multiple datasets. py123d absorbs all of that complexity so you don't have to.

A Quick Note on py123d's Built-in Viewer

py123d ships with a built-in Viser-based 3D viewer, and credit where it's due — it works, it's fast, and it's a thoughtful addition to the library. For quick sanity checks directly after conversion it does the job.
That said, if you want to actually explore your data — filter by label, compare annotation distributions across datasets, step through frames with synchronized camera and LiDAR views, and build intuition about what's in your training set — Viser isn't designed for those workflows. FiftyOne is.

Why FiftyOne Makes It Better

py123d solves the ingestion problem. FiftyOne is an open-source library that solves the understanding problem. Once your data is loaded, FiftyOne gives you an interactive 3D viewer, label filtering, dataset statistics, and cross-dataset comparison, all in the browser and without writing any visualization code.
  1. Dataset-scale querying — FiftyOne lets you filter and slice across thousands of scenes by label type, class distribution, or metadata. py123d's Viser viewer is great for a single scene walkthrough, but FiftyOne gives you a dataset-level lens.
  2. Cross-dataset comparison — Since py123d normalizes Waymo, nuScenes, KITTI-360, and the rest into the same schema, you can load all of them into one FiftyOne dataset and visually compare how each dataset handles the same object categories, sensor configurations, and edge cases.
  3. Label quality and model evaluation — FiftyOne's uniqueness, hardness, and evaluation tools let you find underrepresented or mislabeled samples across the unified dataset, which is hard to do browsing scene-by-scene.
  4. 2D camera + 3D point cloud co-visualization — FiftyOne supports synchronized camera and LiDAR views side-by-side, which pairs naturally with py123d's unified sensor model.
  5. Embedding visualization — Run CLIP over camera frames and use FiftyOne's t-SNE views to cluster scenes by visual similarity across datasets — useful for finding redundant or underrepresented training samples.
Together they form a tight loop: py123d normalizes the data, FiftyOne makes it explorable. You can load multiple datasets into a single FiftyOne collection tagged by source and instantly filter between them, inspect individual frames with synchronized camera and LiDAR views, and verify that your annotations actually land where they should.

Highlights from the Notebook

We put together a Jupyter notebook that walks through the full pipeline end to end, pulling ~900 MB of data across 3 AV2 logs and 3 PandaSet logs. A few things worth calling out:
One function, two datasets. Because py123d normalizes everything to ArrowSceneAPI, the same load_log_samples() function works for both Argoverse 2 and PandaSet without modification. Swap in nuScenes or Waymo and nothing changes except the conversion command.
World-to-ego transform matters. AV2 stores box coordinates in a global world frame — coordinates like (6711, 1703, 61) — while FiftyOne's 3D viewer expects positions relative to the ego vehicle. Without the transform, boxes appear thousands of meters from the point cloud. The fix is a single matrix multiplication using ego.center_se3.inverse.transformation_matrix (note: inverse is a property, not a method — calling it with () raises a TypeError), but it's the kind of thing that burns hours if you don't know to look for it.
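As a rough illustration of that transform, here is a minimal NumPy sketch. It does not call py123d: the ego pose is built by hand as a 4x4 ego-to-world matrix, and se3_matrix and world_to_ego are hypothetical helper names. In the notebook, the inverse matrix would come from ego.center_se3.inverse.transformation_matrix instead.

```python
import numpy as np


def se3_matrix(yaw: float, translation) -> np.ndarray:
    """Build a 4x4 ego-to-world pose from a yaw angle (radians) and a position.

    Hypothetical stand-in for the pose py123d provides; roll and pitch are ignored.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose


def world_to_ego(points_world: np.ndarray, ego_to_world: np.ndarray) -> np.ndarray:
    """Map (N, 3) world-frame points into the ego frame."""
    inverse_pose = np.linalg.inv(ego_to_world)  # world-to-ego matrix
    ones = np.ones((len(points_world), 1))
    homogeneous = np.hstack([points_world, ones])  # (N, 4)
    return (homogeneous @ inverse_pose.T)[:, :3]


# Ego vehicle sits at (6700, 1700, 60) in the world frame, rotated 90 degrees.
ego_pose = se3_matrix(np.pi / 2, [6700.0, 1700.0, 60.0])

# A box center given in world coordinates, like the AV2 example above.
box_centers_world = np.array([[6711.0, 1703.0, 61.0]])

box_centers_ego = world_to_ego(box_centers_world, ego_pose)
print(box_centers_ego)  # a few meters from the origin instead of thousands
```

Only box centers are handled here. For a yaw-only pose like this one, box headings would also need the ego yaw subtracted, which the sketch leaves out.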
Height-colored LiDAR. Raw point clouds are hard to read in monochrome. Coloring each point by its Z height using a jet colormap — blue at ground level, red at rooftop height — makes structure immediately legible in the viewer.
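One way to get that effect without extra dependencies is a piecewise-linear approximation of jet. The sketch below assumes an (N, 3) array of XYZ points; jet_like and color_by_height are illustrative names, and the notebook may well use matplotlib's actual jet colormap instead.

```python
import numpy as np


def jet_like(t: np.ndarray) -> np.ndarray:
    """Approximate the jet colormap for values in [0, 1]; returns (N, 3) RGB in [0, 1]."""
    t = np.clip(t, 0.0, 1.0)
    r = np.clip(1.5 - np.abs(4.0 * t - 3.0), 0.0, 1.0)
    g = np.clip(1.5 - np.abs(4.0 * t - 2.0), 0.0, 1.0)
    b = np.clip(1.5 - np.abs(4.0 * t - 1.0), 0.0, 1.0)
    return np.stack([r, g, b], axis=-1)


def color_by_height(points: np.ndarray, z_min=None, z_max=None) -> np.ndarray:
    """Assign each point an RGB color based on its Z coordinate."""
    z = points[:, 2]
    low = z.min() if z_min is None else z_min
    high = z.max() if z_max is None else z_max
    span = max(high - low, 1e-9)  # avoid dividing by zero on flat clouds
    return jet_like((z - low) / span)


# Three points: ground level, mid height, rooftop height.
points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 5.0], [2.0, 0.0, 10.0]])
colors = color_by_height(points)
print(colors)  # blue-ish for the lowest point, red-ish for the highest
```

Passing a fixed z_min and z_max keeps colors comparable from frame to frame, since otherwise each cloud is normalized to its own height range.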
Tagged by source. Every sample gets a source field ("av2" or "pandaset") so you can filter the combined dataset down to a single origin with one line: dataset.match(fo.ViewField("source") == "av2"). Useful baseline for cross-dataset label distribution analysis.
Per-direction CLIP embeddings. The notebook normalizes AV2's 9 camera names and PandaSet's 6 camera names to shared directional labels (front, side_left, rear, etc.) so both datasets land in the same FiftyOne slices. This enables computing separate CLIP embeddings for each camera direction — front, side, and rear — each with its own brain_key visible simultaneously in the Embeddings panel.
Color by source to see whether AV2 and PandaSet cluster separately per direction (a clean separation means a domain gap that would hurt cross-dataset generalization), or lasso outliers to find your rarest and most valuable frames.
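The camera-name normalization behind those slices can be as simple as a lookup table. The camera names below are the ones used by the raw AV2 and PandaSet releases as we understand them. The identifiers py123d exposes may differ, and any target label beyond front, side_left, and rear is a guess at the notebook's scheme rather than a copy of it.

```python
# (source, raw camera name) -> shared directional label.
# Raw names follow the upstream datasets; the labels are illustrative assumptions.
CAMERA_DIRECTION = {
    # Argoverse 2: 7 ring cameras + 2 stereo cameras
    ("av2", "ring_front_center"): "front",
    ("av2", "ring_front_left"): "front_left",
    ("av2", "ring_front_right"): "front_right",
    ("av2", "ring_side_left"): "side_left",
    ("av2", "ring_side_right"): "side_right",
    ("av2", "ring_rear_left"): "rear_left",
    ("av2", "ring_rear_right"): "rear_right",
    ("av2", "stereo_front_left"): "stereo_left",
    ("av2", "stereo_front_right"): "stereo_right",
    # PandaSet: 6 cameras
    ("pandaset", "front_camera"): "front",
    ("pandaset", "front_left_camera"): "front_left",
    ("pandaset", "front_right_camera"): "front_right",
    ("pandaset", "left_camera"): "side_left",
    ("pandaset", "right_camera"): "side_right",
    ("pandaset", "back_camera"): "rear",
}


def normalize_camera(source: str, camera_name: str) -> str:
    """Return the shared directional label for a dataset-specific camera name."""
    try:
        return CAMERA_DIRECTION[(source, camera_name)]
    except KeyError:
        raise KeyError(f"No direction mapping for {camera_name!r} in {source!r}") from None


def shared_labels() -> set:
    """Labels that both datasets populate, i.e. slices where they can be compared."""
    av2 = {label for (src, _), label in CAMERA_DIRECTION.items() if src == "av2"}
    panda = {label for (src, _), label in CAMERA_DIRECTION.items() if src == "pandaset"}
    return av2 & panda


print(normalize_camera("av2", "ring_front_center"))
print(sorted(shared_labels()))
```

Failing loudly on unknown names is deliberate: a silently unmapped camera would drop out of the shared slices and skew any per-direction comparison.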

Get Started

The full notebook handles the venv setup, both dataset downloads, the ego-frame transform, per-direction embeddings, and FiftyOne launch.

Thanks to the Authors

A genuine thank you to Daniel Dauner, Valentin Charraut, Bastian Berle, Tianyu Li, Long Nguyen, Jiabao Wang, Changhui Jing, Maximilian Igl, Holger Caesar, Boris Ivanovic, Yiyi Liao, Andreas Geiger, and Kashyap Chitta at the University of Tübingen, Tübingen AI Center, NVIDIA Research, and KE:SAI for building py123d and releasing it under Apache 2.0. The amount of dataset-specific parsing, coordinate convention reconciliation, and dependency isolation work that went into this library is enormous, and making it freely available accelerates the whole field. If you use it in your research, cite their paper.
