Vision-Language-Action models represent a fundamental shift in robotics—combining visual perception, natural language understanding, and motor control into unified end-to-end systems. But while much attention focuses on architectural innovations, the field's progress depends on something more fundamental: how we organize, structure, and collect training data.
What are VLAs?
A Vision-Language-Action model integrates three capabilities that traditional robotic systems kept separate:
- Vision: Processing camera feeds and sensor data through pre-trained encoders (CLIP, SigLIP, DINOv2) or Vision-Language Model backbones that extract visual features from the environment.
- Language: Understanding natural language instructions via language encoders based on large language models, providing world knowledge and commonsense reasoning.
- Action: Generating executable robot commands—joint angles, end-effector velocities, gripper states—through action decoders that translate semantic understanding into physical motor control.
The architecture typically consists of five components:
- Vision encoders that process camera images and extract visual features
- Language encoders that process natural language instructions
- Proprioceptive encoders that process the robot’s internal state (e.g., joint positions, velocities, torques, and more)
- Multimodal fusion layers that combine vision, language, and proprioceptive tokens using transformer architectures with cross-modal attention
- Action decoders that translate this fused understanding into executable robot commands
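The flow through these five components can be sketched in a few lines of PyTorch. The module below is a minimal, illustrative example with hypothetical dimensions and layer choices, not any particular published VLA architecture:

```python
# Minimal VLA forward-pass sketch (hypothetical module names and dimensions)
import torch
import torch.nn as nn

class MiniVLA(nn.Module):
    def __init__(self, vision_dim=768, lang_dim=768, proprio_dim=14,
                 d_model=512, action_dim=7):
        super().__init__()
        # Project each modality into a shared token space
        self.vision_proj = nn.Linear(vision_dim, d_model)    # vision encoder features
        self.lang_proj = nn.Linear(lang_dim, d_model)        # language encoder features
        self.proprio_proj = nn.Linear(proprio_dim, d_model)  # joint positions, gripper state, ...
        # Multimodal fusion via a small transformer encoder
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Action decoder: fused tokens -> continuous action vector
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, vision_tokens, lang_tokens, proprio_state):
        # vision_tokens: (B, Nv, vision_dim), lang_tokens: (B, Nl, lang_dim),
        # proprio_state: (B, proprio_dim)
        tokens = torch.cat([
            self.vision_proj(vision_tokens),
            self.lang_proj(lang_tokens),
            self.proprio_proj(proprio_state).unsqueeze(1),
        ], dim=1)
        fused = self.fusion(tokens)
        # Pool the fused tokens and decode an action (e.g., end-effector delta + gripper)
        return self.action_head(fused.mean(dim=1))
```

In a real system, the vision and language tokens would come from pre-trained encoders such as those listed above, and the linear action head is often replaced with an autoregressive or diffusion-based decoder.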
The promise is compelling: general-purpose robots that understand natural language and adapt to new tasks without extensive reprogramming. But whether VLA models deliver depends fundamentally on their training data.
Why VLA robotics data is fundamentally different
Unlike large language models and vision-language models that leverage internet-scale data, VLA robotics faces unique constraints that architecture alone cannot solve. This is why a data-centric approach is critical for VLA development. Five fundamental differences distinguish VLA robotics data from the web-scale datasets that power LLMs and VLMs:
The action grounding gap
Unlike VLMs that can leverage web data to understand commands like "pick up the cup" conceptually, VLAs face a fundamental constraint. Videos of people picking up a cup lack the robot-specific data needed for execution—joint angles, end-effector poses, or motor commands.
This fundamental difference in data availability creates a bottleneck that architecture alone cannot solve. It’s why the field has turned to cross-domain data utilization: using simulation to generate synthetic data, leveraging human videos by bridging the human-robot embodiment gap, and exploring self-exploration methods where robots generate their own training data.
However, simulation quality depends entirely on the input data used to build those virtual environments. Over 50% of simulations end up unusable due to corrupted sensor data, wasting millions in compute costs before a single VLA model trains.
FiftyOne Physical AI Workbench addresses this by sitting at the beginning of the pipeline to ensure every simulation starts with trustworthy data.
Temporal structure
In image classification, samples can be treated independently. They can be shuffled or sampled randomly. However, VLA robotic data is inherently sequential. Each episode is a variable-length ordered sequence where actions at time T depend on states and actions at T-1, T-2, and beyond. Models must learn causality, state transitions, and long-range dependencies. Treating robot data like independent samples destroys the temporal structure algorithms need to learn from sequences.
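What respecting that structure looks like in practice can be sketched with a simple NumPy episode container (field names here are illustrative): training windows are sampled contiguously within an episode, never by shuffling frames across episodes.

```python
# Sketch: sample temporally contiguous windows from an episode rather than
# shuffling individual frames (hypothetical field names and shapes)
from dataclasses import dataclass
import numpy as np

@dataclass
class Episode:
    observations: np.ndarray  # (T, H, W, 3) camera frames
    proprio: np.ndarray       # (T, D) joint states
    actions: np.ndarray       # (T, A) commanded actions

def sample_window(episode: Episode, horizon: int, rng: np.random.Generator):
    """Return a contiguous (state, action) window so causality is preserved."""
    t0 = rng.integers(0, len(episode.actions) - horizon + 1)
    sl = slice(t0, t0 + horizon)
    return episode.observations[sl], episode.proprio[sl], episode.actions[sl]
```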
Proprioceptive signals
Robots need to have internal state awareness (e.g., joint positions, velocities, torques, end-effector poses, gripper status). These low-dimensional structured vectors are essential for precise manipulation. Vision tells you where objects are in the world, but not where your own arm is relative to your body. Fusing proprioception with high-dimensional visual inputs creates unique multimodal fusion challenges absent in pure vision-language models.
Embodiment heterogeneity
Different robots have fundamentally different action spaces, sensor configurations, and kinematic structures. A single robot arm might have 7 degrees of freedom, while a dual-arm system has 14. The same action vector induces very different motions across robots. Camera viewpoints, intrinsics, and coordinate frames vary substantially. This makes directly combining datasets from different robots impossible without standardization.
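One possible standardization strategy is to pad every robot's action vector into a shared, masked action space so datasets can at least be combined. The sketch below is illustrative only, with made-up dimensions:

```python
# Sketch: map heterogeneous robot actions into one fixed-size, masked action
# space so datasets can be combined (dimensions are illustrative)
import numpy as np

UNIFIED_DIM = 14  # large enough for a dual-arm system in this example

def to_unified_action(action: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Zero-pad a robot-specific action vector and return a validity mask."""
    unified = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    n = action.shape[0]
    unified[:n] = action
    mask[:n] = True  # downstream models can ignore the padded dimensions
    return unified, mask

# A 7-DoF single-arm action fills the first 7 slots; a 14-DoF dual-arm action fills all of them
single_arm_action, single_arm_mask = to_unified_action(np.random.uniform(-1, 1, size=7))
```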
Data scarcity
While LLMs train on trillions of tokens and VLMs on billions of images, VLA robotics datasets measure in thousands or millions of trajectories. The primary method for collecting high-quality robot data is human teleoperation, which is expensive and severely limited in scalability. This fundamental bottleneck cannot be solved by architectural innovation alone: better transformers, fusion mechanisms, or action decoders simply can't compensate for a lack of diverse training data.
How to measure VLA performance: Limitations of current benchmarks
Current VLA benchmarks measure model performance—manipulation skills, language conditioning, robustness across environments, cross-embodiment transfer. But they fail to reveal the data requirements that enable these capabilities. When benchmarks can't probe systematic generalization or test diverse scenarios, it signals that our training data may not be diverse enough to enable generalization in the first place.
This gap matters because understanding benchmark limitations reveals what data requirements VLA models actually need. Current benchmarks show limitations in:
- Lack of unified standards: Different benchmarks use different metrics and experimental setups, making fair comparison difficult across models and datasets.
- Limited task complexity: Most focus on simple, short-horizon tasks rather than advanced cognitive reasoning, multi-step planning, or handling unexpected situations.
- Insufficient generalization probes: Benchmarks test fixed sets of tasks, objects, and environments without systematically probing whether models handle truly novel scenarios outside their training distribution.
- Binary success focus: Measuring success/failure rates misses fine-grained metrics revealing how data quality and organization affect model behavior—action smoothness, number of attempts, edge case handling.
These limitations suggest models need more diverse training data than current benchmarks can evaluate. Better benchmarks measuring data impact require the standardization and organization that enable comprehensive evaluation frameworks.
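As one example of the fine-grained metrics mentioned above, action smoothness can be quantified directly from logged trajectories. The helper below computes mean squared jerk over a commanded action sequence; it is an illustrative measure, not a metric prescribed by any existing benchmark.

```python
# Sketch: one fine-grained metric beyond binary success, the smoothness of a
# commanded action trajectory measured as mean squared jerk (illustrative)
import numpy as np

def mean_squared_jerk(actions: np.ndarray, dt: float) -> float:
    """actions: (T, A) action trajectory sampled at a fixed interval dt (seconds)."""
    jerk = np.diff(actions, n=3, axis=0) / dt**3  # third finite difference
    return float(np.mean(jerk**2))
```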
How to measure VLA performance: The push to standardize benchmarks
Three major initiatives address these limitations by standardizing how VLA data is stored, combined, and shared, laying the groundwork for measuring VLA performance consistently:
RLDS: Preserving temporal structure
RLDS (Reinforcement Learning Datasets) provides a lossless format that preserves sequential decision-making information. Its hierarchical structure maintains temporal relationships through SAR (State-Action-Reward) alignment. Each step shows what the world looked like, what action was taken, and what resulted.
RLDS provides transformation tools enabling algorithms to consume data in different shapes while maintaining the underlying temporal structure. This enables reproducibility, easy dataset generation, and algorithms that exploit temporal information—capabilities lost when data is stored as independent state-action pairs.
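For concreteness, this is roughly what consuming an RLDS-formatted dataset looks like with TensorFlow Datasets. The dataset name below is a placeholder and the exact observation keys vary per dataset, but the episode-of-steps nesting is the part RLDS standardizes:

```python
# Sketch: iterating an RLDS-formatted dataset with TensorFlow Datasets
import tensorflow_datasets as tfds

ds = tfds.load("some_rlds_dataset", split="train")  # placeholder dataset name
for episode in ds.take(1):
    # Each episode nests a `steps` sub-dataset whose ordering is preserved
    for step in episode["steps"]:
        obs = step["observation"]   # e.g., camera image(s) + proprioceptive state
        act = step["action"]        # robot command taken at this step
        rew = step["reward"]        # outcome signal, aligned with (obs, act)
        done = step["is_last"]      # episode boundaries kept, not shuffled away
```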
Open X-Embodiment: Unifying diverse datasets
Open X-Embodiment unifies over 1 million real robot trajectories spanning 22 robot embodiments from 60 datasets collected by 34 research labs. Models trained on diverse multi-embodiment datasets significantly outperform those trained on single-embodiment data, even when evaluated on that same single embodiment. Diversity provides complementary information that improves generalization, a direct demonstration of why data organization matters.
LeRobot: Practical deployment
LeRobot offers a lightweight alternative using lossy video compression to reduce dataset size and bandwidth requirements. This makes datasets easier to share and distribute, prioritizing accessibility over complete information preservation. Different formats serve different needs—RLDS for research requiring complete information, LeRobot for deployment and widespread adoption.
Why data matters more than architecture
VLA robotics training data is fundamentally different from web-scale data, from temporal dependencies to embodiment heterogeneity, and current benchmarks struggle to measure those differences. The question becomes: when faced with these constraints, what actually drives VLA performance—architectural innovation or data strategy?
The evidence points decisively toward data. While the field invests heavily in better vision encoders, more sophisticated fusion mechanisms, and novel action decoders, these architectural improvements remain incremental. Data scarcity remains the primary bottleneck, and no amount of architectural innovation can overcome the fundamental constraint of limited, low-quality training data.
Here's what the research demonstrates:
- Scale and diversity drive performance: Open X-Embodiment demonstrates that large-scale general-purpose models trained on diverse datasets consistently outperform narrowly targeted counterparts on smaller, task-specific data.
- Cross-embodiment training improves single-embodiment performance: Models trained on multiple robot embodiments perform better even on individual embodiments they were trained on. Diversity itself improves learning.
- Quality over quantity: SmolVLA achieves high performance with datasets an order of magnitude smaller than those used by state-of-the-art models, by maximizing the quality and diversity of its real-world data. Better curation matters more than raw dataset size.
- Architecture can't compensate for missing data: Pre-trained VLMs don't transfer effectively to robotic tasks without robot-specific demonstration data. The action grounding gap cannot be bridged by architecture alone. Models demonstrate proficiency in familiar scenarios but degrade significantly on novel tasks or environments—a data limitation, not an architectural one.
- Generalization requires diverse training data: Performance drops significantly when models encounter tasks or environments outside their training distribution. Without varied scenarios in the training data, models won't generalize to the diversity they meet in deployment.
The path forward for VLA models
Data scarcity remains the primary bottleneck for VLA models. No amount of architectural innovation overcomes the fundamental constraint of limited, low-quality training data. Progress requires:
- Cross-domain data utilization: Leveraging simulation, human videos, and self-exploration methods to expand available data sources
- Better data curation: Maximizing value of limited data through filtering, augmentation, and optimization
- Standardized formats: Enabling dataset combination from multiple sources and cross-embodiment learning
- Comprehensive benchmarks: Measuring how data strategies affect performance, not just final success rates
Bridging the gap: From simulation to real-world VLAs
High-fidelity simulation offers a solution to robot data scarcity—but only when the input data meets quality standards. Over 50% of simulations end up unusable due to corrupted sensor data, wasting millions in compute costs before a single VLA model trains.
Physical AI systems like VLA-powered robots require multimodal sensor data—camera feeds, LiDAR point clouds, proprioceptive signals—that must be temporally synchronized and spatially calibrated. When a camera captures a scene at timestamp 10:23:45.127, the LiDAR must capture that exact same moment. Misalignment by even milliseconds places objects in wrong locations during 3D reconstruction, producing neural reconstructions that train VLA models incorrectly.
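A simple pre-flight check for this kind of misalignment can be expressed in a few lines. The function below is a hedged sketch, assuming per-frame timestamps have already been extracted in milliseconds and that a fixed tolerance is acceptable:

```python
# Sketch: flag camera/LiDAR frame pairs whose timestamps drift beyond a
# tolerance before they enter 3D reconstruction (threshold is illustrative)
def find_misaligned_pairs(cam_ts, lidar_ts, tol_ms=5.0):
    """cam_ts, lidar_ts: per-frame timestamps in milliseconds, same length."""
    return [
        (i, abs(c - l))                      # frame index and drift in ms
        for i, (c, l) in enumerate(zip(cam_ts, lidar_ts))
        if abs(c - l) > tol_ms
    ]
```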
[Figure: side-by-side comparison of a good neural reconstruction and a poor-quality neural reconstruction]
FiftyOne Physical AI Workbench addresses this by ensuring data quality before reconstruction begins. Built by Voxel51 and integrated with NVIDIA Omniverse NuRec and Cosmos, the workbench sits at the beginning of the pipeline to ensure every reconstruction starts with trustworthy data.
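With open-source FiftyOne, the metadata needed for these checks can be attached to each sensor frame and inspected visually before reconstruction begins. The snippet below uses core FiftyOne APIs with an illustrative schema; it is a sketch, not the Physical AI Workbench workflow itself:

```python
# Sketch: organize multi-sensor robot captures in FiftyOne so timestamp and
# episode metadata can be reviewed before reconstruction (illustrative schema)
import fiftyone as fo

dataset = fo.Dataset("robot-captures")
sample = fo.Sample(filepath="/data/episode_0001/cam_front/000127.png")
sample["timestamp_ms"] = 37425127.0   # 10:23:45.127 expressed as ms since midnight
sample["sensor"] = "cam_front"
sample["episode_id"] = "episode_0001"
dataset.add_sample(sample)

# Filter to a single episode's frames for visual QA in the FiftyOne App
view = dataset.match(fo.ViewField("episode_id") == "episode_0001")
session = fo.launch_app(view)
```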
This data-first approach directly addresses VLA model bottlenecks: it expands limited robot demonstration data through validated synthetic generation, maintains the temporal alignment and multimodal fusion requirements VLAs need, and enables cross-domain utilization while preserving data quality standards.
When every simulation starts with trustworthy data, VLA models train on accurate representations of physical interactions—bridging the action grounding gap from semantic understanding to reliable motor control.
Ready to build VLA models with better data infrastructure? Explore FiftyOne for open-source robotics dataset management or FiftyOne Physical AI Workbench for enterprise-grade simulation pipelines.