How Voxel51 is Powering Physical AI with Databricks
Aug 20, 2025

Why Physical AI starts with better data

In autonomous vehicle (AV) and advanced driver-assistance system (ADAS) development, the hardest problems often hide in the long tail: rare events such as a pedestrian crossing in the rain at night or a cyclist partially obscured in a crosswalk. These edge cases are exactly where models struggle most, yet they’re buried across petabytes of unstructured sensor data.
Too often, teams spend months writing custom queries, trawling through metadata, or hand-labeling samples just to uncover a handful of the moments that matter. Even then, they’re rarely sure they’ve found all the right examples.
Voxel51 changes that. FiftyOne, Voxel51’s data engine for visual and multimodal AI, pairs its discovery and curation tools with the scalable data infrastructure of Databricks, so teams can search, slice, and surface critical AV/ADAS scenarios in hours instead of weeks. This joint stack brings structure to unstructured data and makes it possible to iteratively build the datasets that move model performance forward.

Powering the AV data pipeline

This Databricks + FiftyOne integration lays the groundwork for something even more powerful: the AV/ADAS data engine. Instead of treating data discovery and annotation as one-off tasks, teams can now build continuous pipelines that identify edge cases, validate quality, and trigger labeling workflows, all with human-in-the-loop oversight.
This isn’t just about finding better data once. It’s about creating a feedback loop where model insights guide data curation, the curated data fuels better models, and the cycle repeats. With the scalable compute of Databricks and the visual intelligence of FiftyOne, we're enabling a new kind of infrastructure, one that automates the search for signal in the noise and accelerates the development of safer, smarter vehicles.

The Physical AI stack: Databricks + FiftyOne

The Databricks + FiftyOne integration forms a pipeline that connects scalable data infrastructure with visual-first discovery tools:
  • Databricks provides the scalable backbone for storage, governance, and indexing
  • FiftyOne builds on top of that foundation and provides the data engine for visual and multimodal AI to:
    • Explore, curate, and analyze visual datasets and models
    • Leverage Data Lens to visually query and filter events at scale
    • Discover unstructured sensor data with embeddings, similarity search, and scenario filters
    • Identify rare conditions and curate high-value subsets for model development
Together, they enable AV/ADAS teams to work fluidly across structured and unstructured worlds. Databricks powers the storage, governance, and indexing, while FiftyOne brings those indexed volumes to life in a rich, interactive interface. Let’s take a look at how.

Use case: Surfacing rare events in AV datasets

Autonomous vehicle (AV) datasets such as nuScenes, BDD100K, or internal fleet collections are massive, complex, and filled with edge cases that directly impact model performance. The challenge isn’t collecting data; it’s finding the moments that matter most buried within millions of frames.

Starting with Databricks Volumes

The first step is staging AV datasets in Unity Catalog-managed Volumes, which centralizes all your sensor data (camera images, LiDAR sweeps, radar, and labels) in a governed storage layer.
We then structure key metadata (e.g., weather, time of day, object counts per frame) into a Delta table. This structured index provides the foundation for efficient queries without combing through raw files.
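
To make the idea of a structured index concrete, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for the Delta table. The table name, column names, and file paths are all hypothetical, and a real deployment would query the Delta table through Databricks rather than SQLite. The point is only that a metadata query returns file paths without opening any raw sensor files.

```python
import sqlite3

# Hypothetical per-frame metadata rows: (filepath, weather, time_of_day, pedestrian_count).
# The paths only mimic a Volumes-style layout; they are not real locations.
FRAME_ROWS = [
    ("/Volumes/av/raw/frames/cam_front/000001.jpg", "rain", "night", 2),
    ("/Volumes/av/raw/frames/cam_front/000002.jpg", "clear", "day", 0),
    ("/Volumes/av/raw/frames/cam_front/000003.jpg", "fog", "night", 1),
    ("/Volumes/av/raw/frames/cam_front/000004.jpg", "rain", "day", 3),
]


def build_index(rows):
    """Create an in-memory table that plays the role of the metadata index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE frame_index ("
        "filepath TEXT PRIMARY KEY, weather TEXT, "
        "time_of_day TEXT, pedestrian_count INTEGER)"
    )
    conn.executemany("INSERT INTO frame_index VALUES (?, ?, ?, ?)", rows)
    return conn


def find_frames(conn, weather, time_of_day, min_pedestrians=1):
    """Return file paths matching the scenario, using metadata only."""
    cursor = conn.execute(
        "SELECT filepath FROM frame_index "
        "WHERE weather = ? AND time_of_day = ? AND pedestrian_count >= ? "
        "ORDER BY filepath",
        (weather, time_of_day, min_pedestrians),
    )
    return [row[0] for row in cursor]


conn = build_index(FRAME_ROWS)
rainy_night_hits = find_frames(conn, "rain", "night")
print(rainy_night_hits)
```

Only the matching paths need to be loaded for visual inspection afterwards, which is what keeps the search cheap as the raw data grows.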

Searching with Data Lens + Databricks Vector Search

With the dataset indexed, connect it to FiftyOne Data Lens. Powered by Databricks Vector Search, Data Lens lets you visually query for combinations of attributes and embeddings that define key events, such as:
  • Adverse weather conditions (e.g., rain, fog, snow)
  • Pedestrian or cyclist presence in crosswalks
  • Nighttime scenarios with multiple overlapping objects
These queries return focused subsets of frames in seconds, even across millions of samples.
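
Conceptually, a query that combines attributes and embeddings is a metadata filter followed by a similarity ranking. The sketch below illustrates that idea in plain Python with made-up three-dimensional embeddings. It does not use the Data Lens or Databricks Vector Search APIs, and the `search` function and its field names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(frames, query_embedding, weather=None, time_of_day=None, top_k=2):
    """Filter frames by attributes, then rank the survivors by similarity."""
    candidates = [
        f for f in frames
        if (weather is None or f["weather"] == weather)
        and (time_of_day is None or f["time_of_day"] == time_of_day)
    ]
    ranked = sorted(
        candidates,
        key=lambda f: cosine_similarity(f["embedding"], query_embedding),
        reverse=True,
    )
    return [f["id"] for f in ranked[:top_k]]


# Toy data: real embeddings would come from a vision model and have many more dimensions.
FRAMES = [
    {"id": "f1", "weather": "rain", "time_of_day": "night", "embedding": [0.9, 0.1, 0.0]},
    {"id": "f2", "weather": "rain", "time_of_day": "night", "embedding": [0.1, 0.9, 0.0]},
    {"id": "f3", "weather": "clear", "time_of_day": "day", "embedding": [1.0, 0.0, 0.0]},
    {"id": "f4", "weather": "rain", "time_of_day": "night", "embedding": [0.7, 0.3, 0.1]},
]

# Hypothetical query vector, e.g. the embedding of a reference "pedestrian in crosswalk" frame.
query = [1.0, 0.0, 0.0]
results = search(FRAMES, query, weather="rain", time_of_day="night", top_k=2)
print(results)
```

Frame `f3` is the closest match to the query vector but is excluded by the attribute filter, which shows why the two kinds of condition are combined rather than applied separately.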

Drilling down with scenario analysis and embedding view

It's not enough to find a scenario; you need to know exactly what your dataset is missing. Voxel51’s Model Evaluation Panel, which includes scenario-level analysis, shows where your model is failing and where you need to bolster your dataset. You can then search both your current dataset and your larger data lake, using similarity search powered by Databricks, to find the data that fills those gaps.
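
The underlying idea of scenario analysis can be sketched as a per-scenario metric breakdown: group evaluation results by scenario and flag the slices where the model underperforms. The example below is a plain-Python illustration with invented counts and a hypothetical 0.8 recall threshold. It is not the Model Evaluation Panel's implementation.

```python
from collections import defaultdict


def recall_by_scenario(records):
    """Aggregate true positives and false negatives per scenario, then compute recall."""
    totals = defaultdict(lambda: [0, 0])  # scenario -> [true positives, false negatives]
    for record in records:
        totals[record["scenario"]][0] += record["tp"]
        totals[record["scenario"]][1] += record["fn"]
    return {
        scenario: tp / (tp + fn)
        for scenario, (tp, fn) in totals.items()
        if tp + fn > 0
    }


def weakest_scenarios(records, threshold=0.8):
    """Return scenarios whose recall falls below the threshold, worst first."""
    recalls = recall_by_scenario(records)
    below = [scenario for scenario, recall in recalls.items() if recall < threshold]
    return sorted(below, key=recalls.get)


# Invented evaluation counts, one record per evaluated batch.
EVAL_RECORDS = [
    {"scenario": "day_clear", "tp": 95, "fn": 5},
    {"scenario": "night_rain", "tp": 30, "fn": 20},
    {"scenario": "night_rain", "tp": 12, "fn": 8},
    {"scenario": "fog", "tp": 21, "fn": 9},
]

recalls = recall_by_scenario(EVAL_RECORDS)
gaps = weakest_scenarios(EVAL_RECORDS)
print(gaps)
```

The scenarios returned by `weakest_scenarios` are the natural targets for the similarity search described above.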

Results: finding what matters

In just a few minutes, this workflow can isolate the AV scenarios crucial to training. The curated datasets become the foundation for improved model training and continuous evaluation. Instead of spending hours searching across petabytes of data, you get what you need in minutes. The right data creates the best model.

Finding rare events and edge cases

In AV/ADAS, the most important data is often the hardest to find. A single missed pedestrian in a crosswalk or a rare combination of bad weather and traffic conditions can have an outsized impact on model performance and on safety.
Finding that data is exactly what Voxel51 enables with FiftyOne on Databricks. Together, they bridge the gap between managing visual data and making sense of it. By bringing together Databricks’ scalable storage and indexing with FiftyOne’s visual-first discovery and curation, teams can now surface these critical events in ways that simply weren’t possible before:
  • Pinpoint long-tail scenarios instantly: Search millions of frames for combinations of conditions, objects, and behaviors that define your edge cases.
  • See your data in new ways: Embeddings and similarity search let you uncover patterns and outliers you might not even know to look for.
  • Continuously close the loop: As new data flows into Databricks, FiftyOne makes it easy to expand and refine curated datasets without starting over.
This isn’t just an incremental improvement; it’s a step change. For the first time, AV/ADAS teams can see their entire dataset, find what truly matters, and act on it in real time. That’s the promise of Physical AI, and it’s being enabled by Voxel51.
Join our upcoming webinar on Sept 4, 2025, at 9 a.m. PT to learn how Porsche is advancing its autonomous vehicle (AV) development by leveraging the power of Voxel51 and Databricks.
