Voxel51, Nebius, and NVIDIA Power Porsche's Synthetic AV Data Pipeline
Mar 16, 2026
4 min read
Autonomous vehicle teams face a structural problem with training data. Real-world data collection is slow and expensive, but more fundamentally, it cannot cover all the rare scenarios that determine whether a model is safe. Synthetic data has long promised a way out for the physical AI space: it lets development teams generate scenarios that cannot be reliably captured in the real world and use them to close gaps in training.
Most synthetic data pipelines produce outputs that appear plausible on the surface but reveal errors on close inspection: incorrect traffic signal states, lighting inconsistencies, or even hallucinated scene elements. The pipelines that do generate accurate data have simply not existed outside the largest research labs. Building a large-scale synthetic data pipeline requires operating across simulation, large-scale GPU training, and deployment environments, a significant investment for every team building physical AI systems, including robotics and autonomous driving.
Voxel51, NVIDIA, and Nebius are working to make this possible. The NVIDIA Physical AI Data Factory Blueprint is an open reference architecture that unifies the generation, augmentation, and evaluation of physical AI training data at scale. Paired with Voxel51's data curation capabilities and Nebius's GPU cloud infrastructure, it enables development teams to move from raw data to model-ready training sets at the scale physical AI requires.
Voxel51 is delivering this synthetic data generation pipeline for its customer Porsche Research, part of Porsche, a global leader in high-performance vehicles and advanced engineering, to accelerate its autonomous driving data augmentation workflows.

Physical AI Data Factory Blueprint: Enabling data augmentation and analysis workflows

The Physical AI Data Factory Blueprint enables physical AI teams to quickly and reliably augment raw data into model-ready training sets. The blueprint offers modular workflows that enable developers to take real-world driving footage, identify the scenes that matter most for model performance, and generate hundreds or thousands of high-quality variations—different weather, traffic conditions, time of day, and more. Every output is automatically graded for quality before it ever touches a training set.
The goal is to close long-tail distribution gaps: the rare, high-stakes scenarios that real-world data collection can't reliably cover and that models consistently fail on.
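At its core, the augmentation step described above turns one curated scene into many parameterized variants. A minimal sketch of that idea, using hypothetical variation axes (the blueprint's actual parameter space is far richer):

```python
from itertools import product

# Hypothetical variation axes; stand-ins for the blueprint's real parameters.
WEATHER = ["clear", "rain", "fog", "snow"]
TIME_OF_DAY = ["dawn", "noon", "dusk", "night"]
TRAFFIC = ["light", "heavy"]

def variation_specs(scene_id: str) -> list[dict]:
    """Enumerate one augmentation spec per combination of variation axes."""
    return [
        {"scene": scene_id, "weather": w, "time_of_day": t, "traffic": d}
        for w, t, d in product(WEATHER, TIME_OF_DAY, TRAFFIC)
    ]

specs = variation_specs("scene_0042")
print(len(specs))  # 4 * 4 * 2 = 32 variants from a single source scene
```

Enumerating specs up front, rather than generating ad hoc, is what lets every variant be tracked through generation and grading.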
The pipeline has two entry points, both running through Voxel51. The first is a path for raw-data visualization and curation that supports data exploration and audit checks before generation. The second takes outputs that pass generation and grading back into FiftyOne for review.
The Physical AI Data Factory: From data acquisition to generation, curation, and evaluation

Porsche Research accelerates synthetic data generation pipeline with NVIDIA, Nebius, and Voxel51

The Porsche Research team is working on scenarios that appear rarely in production data: exactly the kind of edge cases that determine whether a model is safe. Manually collecting and labeling footage for every permutation isn't feasible, so Porsche adopted a data generation workflow to accelerate automation and reduce pipeline complexity.
Porsche’s team uses embedding search and visual dataset exploration in Voxel51 to identify which scenes have the most impact (positive or negative) on model performance. Their data agent automates much of this discovery, surfacing the scenes worth augmenting before a human ever has to review them. Once a scene is identified, it moves into the generation pipeline.
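The embedding search step boils down to ranking scenes by similarity to a query embedding. A toy, self-contained illustration of that mechanism (FiftyOne's actual embedding search operates on real vision-model embeddings over full datasets; the vectors and scene names here are invented):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings standing in for scene embeddings from a vision model.
scenes = {
    "night_rain": [0.9, 0.1, 0.3],
    "day_clear": [0.1, 0.9, 0.2],
    "dusk_fog": [0.8, 0.2, 0.4],
}

def nearest(query, k=2):
    """Return the k scene IDs most similar to the query embedding."""
    return sorted(scenes, key=lambda s: cosine(query, scenes[s]), reverse=True)[:k]

print(nearest([1.0, 0.0, 0.3]))  # ['night_rain', 'dusk_fog']
```

The same ranking, run against embeddings of scenes a model performs poorly on, is how visually similar candidates for augmentation get surfaced automatically.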
From there, Nebius handles the compute. Running on Nebius’s infrastructure, the pipeline uses NVIDIA Cosmos Reason or Qwen3-VL to auto-label the source footage and generate a starting prompt describing the scene. The user then specifies the augmentations they want—such as different weather scenes, lighting changes, traffic signal states, and more.
The NVIDIA Nemotron 3 Nano open model then jitters those prompts to introduce variation across the generated batch, ensuring outputs don't all look identical. The Cosmos Transfer model runs multiple passes to produce the synthetic video, with the entire process orchestrated through NVIDIA OSMO. Finally, Cosmos Evaluator scores each output across multiple axes: hallucination detection, traffic signal validity, and scene coherence. Only outputs that pass this evaluation are surfaced back in FiftyOne for review.
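The jitter-then-gate pattern above can be sketched in a few lines. This is a simplified stand-in, not the Nemotron or Cosmos Evaluator APIs: the modifier list, score axes, and threshold are all hypothetical.

```python
import random

def jitter_prompt(base: str, n: int, seed: int = 0) -> list[str]:
    """Produce n prompt variants by appending randomized modifiers
    (a toy stand-in for the Nemotron jitter step)."""
    rng = random.Random(seed)  # seeded for reproducible batches
    modifiers = ["light drizzle", "heavy rain", "wet asphalt", "low sun glare"]
    return [f"{base}, {rng.choice(modifiers)}" for _ in range(n)]

def quality_gate(outputs: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep only outputs whose scores clear the threshold on every axis,
    mirroring the multi-axis evaluator gating described above."""
    axes = ("hallucination", "signal_validity", "coherence")
    return [o for o in outputs
            if all(o["scores"][a] >= threshold for a in axes)]

prompts = jitter_prompt("urban intersection at dusk", 3)
outputs = [
    {"prompt": p,
     "scores": {"hallucination": 0.9, "signal_validity": 0.85, "coherence": 0.95}}
    for p in prompts
]
print(len(quality_gate(outputs)))  # all 3 toy outputs clear the gate
```

The key design point is that the gate sits between generation and review: nothing reaches a human, or a training set, without passing every evaluation axis.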
The result is a closed loop: curate in FiftyOne, generate on Nebius using NVIDIA models, and review the results back in FiftyOne.
Porsche Research is also investing in what comes next. Their data agent is an early signal of where physical AI development is heading: agentic workflows that surface insights and trigger pipelines without constant human intervention. The Data Factory was designed to support that direction.

Physical AI: from lab to reality

The infrastructure for serious synthetic data work is now accessible without having to build it from scratch. Teams that have been blocked by the cost and complexity of standing up this kind of pipeline have a blueprint and a concrete reference in Porsche.
The physical AI stack (data curation and quality management, data generation and grading, and model training), backed by GPU-powered compute infrastructure, reduces this complexity and helps bring these systems to reality.
Join us to see the pipeline live with Porsche at the Voxel51 GTC 2026 booth #1645.
