2026 State of Visual and Physical AI

May 27, 2026

Introduction

The last decade of AI progress was built on text. Billions of documents, books, web pages, and conversations were scraped and fed into models until language became commoditized. As text data yields increasingly incremental returns, the frontier has shifted toward data in the physical world.
Video, sensor streams, LiDAR point clouds, time series from industrial equipment, satellite imagery, medical scans, and other high-dimensional multimodal data carry enormous untapped potential. This is the visual layer that physical AI is built on — the perception channel through which machines see, reason, and act in the physical space. But much of this data sits locked inside enterprises that have spent years accumulating it without the infrastructure to make it usable.
Physical AI is where the hardest problems are, and where the most important work will be done. It's also new territory in the truest sense: as of 2026, there are no established benchmarks, no standardized infrastructure, and no consensus on what good looks like at each stage of the pipeline.
This report, conducted by Dimensional Research and commissioned by Voxel51, draws on a 2026 online survey of over 700 professionals working at the intersection of visual AI and the physical world. It documents how teams are actually building physical AI — so practitioners can benchmark their own work, and leaders can invest in the capabilities that will define the next decade of AI. Where relevant, data is broken out by team maturity and deployment stage so readers can locate their own work within the broader field — or use the Physical AI Pipeline Audit to get a quick read on where they stand.

Key findings

Visual and physical AI are inherently multimodal

  • 68% of visual and physical AI teams work across three or more data modalities, with top physical AI applications including tracking, 3D reconstruction, pose estimation, and VLA models
  • 92% expect world models and spatial intelligence to break into mainstream awareness

Value is proven, but investment lags behind the opportunity

  • 78% of teams are already seeing value from their visual and physical AI investments, and 86% expect its importance to grow over the next three years
  • Yet 74% say the industry is underinvested

Model failures are ubiquitous, and the causes trace back to data

  • 100% report underperforming models
  • The top five causes of model failures are all data issues: insufficient training data (57%), data quality problems (48%), domain shifts (43%), annotation errors (32%), and class imbalance (31%)
  • 95% say human involvement is regularly required throughout model development, with 67% diagnosing failures through manual review

Data work is the bottleneck, not data collection

  • 89% say data is the primary driver of visual and physical AI success
  • 97% struggle to iterate on datasets, with the top pain points including bad labels (59%), difficulty in identifying samples hurting model performance (47%), and coverage gaps (43%)
  • 58% of exceptional teams spend more than half their project time on data work, compared to just 21% of struggling teams — a nearly 3x gap
  • 63% agree that synthetic data will become the primary source of training data

Annotation is expensive, wasteful, and painful

  • 99% of teams describe their annotation process as painful, and only 34% are satisfied with it
  • 36% report that more than half of their annotated data never reaches production
  • 44% expect annotation costs will increase in the coming year

Visual and physical AI are inherently multimodal

Physical AI — the ability to combine images, video, 3D, sensor, and time-series data into unified systems that perceive and reason — is no longer an experimental pursuit. It's how teams are building today.
Images dominate the field at 92%, followed by video at 63% and language/text at 55%. But practitioners are working with far more than 2D frames.
Time-series data from inertial measurement units (IMUs) and sensors (42%), 3D point clouds, meshes, and LiDAR (38%), audio (25%), and multispectral imagery (18%) are all active parts of workflows today.
Moreover, most teams aren't specializing in one data type; they're fusing several. Physical AI is inherently multimodal: 68% of teams work across three or more modalities, and only 6% work with just one.
Horizontal bar chart titled “Data modalities.” Images are the most common data modality at 92%, followed by video at 63%, language/text at 55%, time-series data such as IMU or sonar at 42%, and 3D data such as point clouds, meshes, or LiDAR at 38%. Lower responses include audio at 25%, multi/hyperspectral data at 18%, and other at 3%.
Source: 2026 State of Visual and Physical AI Report | Voxel51
Horizontal bar chart titled “Count of Different Modalities.” The most common number of data modalities used is three at 30%, followed by two at 25% and four at 21%. Fewer respondents use five modalities at 8%, one modality at 6%, seven modalities at 5%, or six modalities at 4%.
Source: 2026 State of Visual and Physical AI Report | Voxel51
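The three-or-more figure can be recomputed from the distribution in the chart above. The snippet below is a minimal sketch for readers who want to check the arithmetic; the variable names are illustrative assumptions, and the inputs are the rounded percentages reported in the chart, so totals may not sum to exactly 100.

```python
# Share of respondents (rounded %) by number of data modalities used,
# copied from the "Count of Different Modalities" chart.
modality_count_share = {1: 6, 2: 25, 3: 30, 4: 21, 5: 8, 6: 4, 7: 5}

# Add up the shares for every count of three or more modalities.
three_or_more = sum(
    share for count, share in modality_count_share.items() if count >= 3
)

print(f"Three or more modalities: {three_or_more}%")
```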
The infrastructure and tooling required to support physical AI at scale — multimodal data management, 3D visualization, sensor fusion — is still catching up to practitioner intent. For teams moving toward physical AI, building fluency with these data modalities now is one of the highest-leverage investments available.
References to "visual AI" throughout this report encompass the full range of data modalities surveyed: images, video, time-series and sensor data, 3D point clouds and LiDAR, audio, and multispectral imagery.

Physical AI is moving beyond 2D perception into spatial and embodied systems

Applications of these data modalities span a wide range of maturity. Object detection and classification remain the dominant use cases at 62% and 61%, respectively, reflecting the mature foundation of visual AI.
But one cluster of applications — vision-language-action models for robotics (20%), 3D reconstruction and world models (27%), pose estimation and keypoint detection (25%), and tracking (40%), alongside the growing use of time-series and 3D sensor data — points to a distinct frontier: systems that don't just interpret visual data but act on it in the physical world.
Horizontal bar chart showing responses to “Tasks visual AI is being used for.” The top responses are object detection at 62% and classification at 61%, followed by tracking at 40%, scene understanding/reasoning at 37%, image or video generation/synthesis/augmentation at 36%, and instance/semantic segmentation at 35%. Lower responses include 3D reconstruction, world models, and digital twins at 27%, vision-language tasks at 27%, action recognition at 25%, pose estimation/keypoint detection at 25%, robotics-related vision-language-action models at 20%, and others at 3%.
Source: 2026 State of Visual and Physical AI Report | Voxel51
The write-in responses further underscore how broadly physical AI has already taken root. Teams reported working on seismic auto-interpretation, hyperspectral imaging, stereo depth estimation, and autonomous navigation — applications spanning energy, defense, agriculture, and logistics.
"It started with perception AI. Then generative AI. Now, we're entering the era of physical AI." – Jensen Huang, CEO, NVIDIA
Practitioners are clear-eyed about where the field is headed: 92% agree that world models and spatial intelligence represent the next frontier in visual AI and will break into mainstream awareness in the near future.
Optimism extends to what physical AI systems will ultimately be capable of: 95% of practitioners believe AI will eventually replicate human performance reliably across perceptual tasks.
They’re largely aligned on which industries will get there first — autonomous vehicles, manufacturing, and security and surveillance lead the ranking, while healthcare, defense, and aerospace sit near the bottom, likely reflecting the regulatory complexity on the path to real-world deployment.
Table ranking industries. Autonomous vehicles or driving systems rank first, followed by manufacturing second, security/surveillance third, robotics fourth, retail or e-commerce fifth, healthcare/medical sixth, agriculture seventh, defense eighth, and aerospace ninth.
Source: 2026 State of Visual and Physical AI Report | Voxel51

Visual and physical AI deliver value, but investment lags behind the opportunity

The business case is clear: 78% of teams are already seeing value from their visual AI investments, and 86% expect its importance to grow over the next three years. Leadership skepticism is rare, with only 6% reporting that leadership is questioning the investment at all.
Horizontal bar chart titled “Current state of visual AI at companies.” Most companies are in advanced or active adoption stages: 28% are in growth mode, 26% are in proving grounds, and 24% consider visual AI mission-critical, totaling 78%. Smaller shares report being in a steady state at 13%, on the fence at 6%, or struggling at 2%.
Source: 2026 State of Visual and Physical AI Report | Voxel51
Bar chart titled “Expected growth in visual AI importance over the next 3 years.” Most respondents expect visual AI to become more important: 48% say it will be much more important and 38% say it will be a bit more important. Another 11% expect no change, while 1% expect it to become a bit less important and 1% expect it to become much less important.
Source: 2026 State of Visual and Physical AI Report | Voxel51
“I am optimistic about the potential of visual AI to revolutionize real-world decision-making. I eagerly anticipate the development of more transparent, explainable, and ethically designed systems in the future.” – Survey respondent working as a Data Scientist in the United States
What isn't settled is the level of investment required to realize that value. 74% of practitioners say the industry is underinvested, and only 34% of deployments have reached production. The rest remain in pre-production, prototyping, or early research.
Bar chart titled “Feelings about industry’s current investment in visual AI and computer vision.” Most respondents, 74%, say the industry is underinvested. Another 21% say it is overinvested, and 6% selected other.
Source: 2026 State of Visual and Physical AI Report | Voxel51
Horizontal bar chart titled “Maturity of visual AI deployments.” The most common stages are research and exploration at 35% and production and maintenance at 34%. Prototyping and experimentation accounts for 22%, while pre-production is the least common stage at 10%.
Source: 2026 State of Visual and Physical AI Report | Voxel51
“It still requires a lot of experience and know-how to deliver visual AI solutions from inception to production.” – Survey respondent working as an AI/ML Engineer in Europe
The barriers to investment are real but largely temporary. Attention is elsewhere (63% cite GenAI hype absorbing budget), the technology is seen as early (45% view world models and embodied AI as research rather than product-ready), and use cases are still coming into focus (45%). None of these are permanent conditions — they're the natural friction of an emerging field finding its footing, and every one of them will resolve as the category matures.
Horizontal bar chart titled “What’s driving underinvestment in visual AI.” The top reason is GenAI/LLM hype absorbing budget and attention at 63%. Two reasons are tied at 45%: next-gen visual AI being seen as research rather than product-ready, and use cases being less understood. Other reasons include visual AI ROI being harder to demonstrate to leadership at 35%, talent being harder to find at 32%, visual AI being perceived as solved or mature at 23%, and other at 3%.
Source: 2026 State of Visual and Physical AI Report | Voxel51
The organizations investing now will have the data infrastructure, tooling maturity, and institutional knowledge to move fast when the broader market catches up. Which raises the harder question the rest of this report addresses: why, despite proven value and broad commitment, does progress stall? The answer, consistently, comes back to data.

Model failures are ubiquitous, and the causes trace back to data

The top five causes of model failure are all data problems

Issues with model performance and failure are a fact of life for visual and physical AI teams. Every single participant in this study (100%) indicated that they had experienced underperforming models. For a third (34%), failures are constant or frequent. Even the most mature teams aren't immune: among organizations with models in production, 76% experience model failures at least occasionally.
Stacked horizontal bar chart titled “Frequency of AI model failure and underperformance,” broken down by visual AI maturity. Overall, 3% of teams report constant failures, 31% frequent failures, 51% occasional failures, 13% rare failures, and 2% almost never. Production and maintenance teams are highlighted: 76% experience failures at least occasionally, including 1% constant, 21% frequent, and 54% occasional failures. Another 23% of production and maintenance teams report rare failures, and only 1% say failures almost never happen. Pre-production teams report 6% constant, 26% frequent, 56% occasional, and 12% rare failures. Prototyping and experimentation teams report 4% constant, 33% frequent, 53% occasional, and 10% rare failures. Research and exploration teams report 3% constant, 44% frequent, 47% occasional, 4% rare, and 2% almost never.
Source: 2026 State of Visual and Physical AI Report | Voxel51
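The two headline figures in this section are sums over the chart's response categories. The following sketch shows that arithmetic; the dictionary layout and the helper name `share` are assumptions made for illustration, and the values are the rounded percentages from the chart.

```python
# Failure frequency (rounded %), from the chart
# "Frequency of AI model failure and underperformance".
failure_frequency = {
    "overall": {"constant": 3, "frequent": 31, "occasional": 51,
                "rare": 13, "almost_never": 2},
    "production": {"constant": 1, "frequent": 21, "occasional": 54,
                   "rare": 23, "almost_never": 1},
}


def share(group, levels):
    """Sum the percentages of the given response levels for one group."""
    return sum(failure_frequency[group][level] for level in levels)


# Constant or frequent failures across all respondents.
persistent_overall = share("overall", ["constant", "frequent"])

# Failures at least occasionally among production and maintenance teams.
at_least_occasional_production = share(
    "production", ["constant", "frequent", "occasional"]
)

print(persistent_overall, at_least_occasional_production)
```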

Top causes of model failures

Data issues account for all five of the top causes of model failures:
  • Insufficient training data (57%)
  • Quality issues such as occlusions or artifacts (48%)
  • Domain shifts between training data and real-world data (43%)
  • Data annotation errors or inconsistencies (32%)
  • Class imbalance in training data (31%)
Horizontal bar chart titled “Top causes of underperforming models.” The leading cause is insufficient training data for specific scenarios at 57%, followed by data quality issues such as occlusions, lighting, and artifacts at 48%, and domain shift between training and real-world data at 43%. Other causes include annotation errors or inconsistencies at 32%, class imbalance in training data at 31%, training process issues at 24%, model architecture limitations at 19%, and other at 2%. No respondents said their visual AI models never underperform.
Source: 2026 State of Visual and Physical AI Report | Voxel51
Training process issues like overfitting (24%) and model architecture limitations (19%) also factor in but rank lower. Perhaps most striking: some teams couldn't answer the question at all, writing in that their models are black boxes and the root cause is simply unknown.
“In our experience the hard part isn’t building the model, it’s keeping it reliable in real factory conditions. Data quality and maintenance matter more than chasing bigger architectures.” – Survey respondent working as an AI/ML Engineer in manufacturing
These failure modes are especially costly in physical AI deployments, where a domain shift between training data and real-world sensor conditions, or an unlabeled edge case, can translate directly into a robot, vehicle, or inspection system failing in the field.
If the leading causes of model failure all trace back to data, then the question is why data work is so hard. The rest of this report looks at where teams actually lose ground.

Humans are heavily involved in model development, especially for troubleshooting

Despite advances in automation, human involvement in model development remains extensive. Participants were asked to characterize the degree of human involvement in their workflow, and the data is unambiguous.
95% say human involvement is regularly required throughout model development. Even among the most mature teams with models in production, that figure shifts only marginally, with 9% reporting minimal involvement. The vast majority report extensive (19%), significant (53%), or moderate (23%) human participation. The numbers tell a consistent story across every maturity level: automation has not displaced human judgment but redistributed it.
Horizontal bar chart titled “Human involvement required in vision model development.” Most respondents report needing human involvement: 53% say significant involvement is required, 23% say moderate involvement, and 19% say extensive involvement, totaling 95%. Only 6% say minimal human involvement is required.
Source: 2026 State of Visual and Physical AI Report | Voxel51
We also see a need for humans in the process of troubleshooting underperforming models. When asked about the process taken to diagnose issues, the top approach by far was manual review of failures (67%).
Horizontal bar chart titled “Ways teams diagnose underperforming models.” The most common method is manually reviewing failure cases at 67%, followed by searching for specific edge cases or failure modes at 50%, comparing data or predictions across distributions or model versions at 48%, and analyzing performance across slices or metadata segments at 43%. Fewer teams use embeddings or similarity search to find error patterns at 35%, while 2% say they do not have a way to diagnose issues.
Source: 2026 State of Visual and Physical AI Report | Voxel51

Data work is the bottleneck, not data collection

If data problems cause the majority of model failures, then the real question is why data work is so hard to get right. The answer isn't collection — teams have plenty of data. It's what happens after: how that data is curated, managed, and validated against real-world performance. This is where progress stalls.
The vast majority of practitioners (89%) agree: as models and compute become commodities, data is now the primary driver of visual and physical AI success.

Dataset iteration is a universal pain point and physical AI makes it worse

97% of teams report that they face dataset iteration challenges. The issues cluster around a common theme: lack of visibility. In physical AI, these visibility gaps tend to multiply.
Teams struggle to identify bad labels (59%), pinpoint which samples are hurting model performance (47%), and detect gaps in coverage (43%) before they become production problems. The result is a reactive workflow that is heavy on data rework, a problem that only compounds as datasets grow.
Horizontal bar chart titled “Biggest challenges when iterating on datasets.” The top challenge is bad labels or annotation rework being very time consuming at 59%. Other major challenges include difficulty identifying which samples hurt model performance and time-consuming filtering or searching through large datasets, both at 47%, followed by difficulty identifying dataset coverage gaps at 43%. Additional challenges include measuring the impact of data changes on model performance at 37%, manual or brittle data pipeline updates at 35%, and difficulty tracking dataset versions and changes over time at 34%. Only 3% say they do not face challenges when iterating on datasets.
Source: 2026 State of Visual and Physical AI Report | Voxel51
Dataset challenges don't diminish with experience. Among teams with models in production, 99% report dataset challenges, compared to 96% of teams still in the research phase. While some friction points ease over time, annotation rework actually gets worse: 66% of production teams cite it as a top challenge, versus 56% of teams still in research. Maturity brings scale, and scale makes data quality harder to ignore. The Physical AI Pipeline Audit benchmarks where your team stands against 700+ practitioners in this study.
"Competitive advantage in AI goes not so much to those with data but those with a data engine: iterated data acquisition, re-training, evaluation, deployment, telemetry. And whoever can spin it up fastest." – Andrej Karpathy, founding member of OpenAI and former Director of AI at Tesla
In physical AI, dataset iteration compounds further. A single training run might draw on video, LiDAR, time-series sensor data, and 3D point clouds — each with its own labeling conventions, failure modes, and coverage gaps. A bad label in a 2D bounding box is one problem; a misaligned LiDAR return, a dropped IMU reading, or a temporal gap across synchronized streams is another class of problem entirely.
Teams aren't just iterating on more data — they're iterating across more kinds of data, and the visibility gaps multiply with every modality added to the stack.
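To make one of these multimodal failure modes concrete, the sketch below flags temporal gaps in a single sensor stream by comparing consecutive timestamps against the expected sampling period. It is an illustrative example only: the function name `find_temporal_gaps`, the tolerance value, and the sample timestamps are all assumptions, not something reported by survey respondents.

```python
def find_temporal_gaps(timestamps_ms, expected_period_ms, tolerance=0.5):
    """Return (start, end) pairs where consecutive readings are too far apart.

    A gap is flagged when the spacing between two consecutive timestamps
    exceeds the expected period by more than the given fractional tolerance.
    """
    max_allowed = expected_period_ms * (1 + tolerance)
    gaps = []
    for earlier, later in zip(timestamps_ms, timestamps_ms[1:]):
        if later - earlier > max_allowed:
            gaps.append((earlier, later))
    return gaps


# Hypothetical 100 Hz IMU stream (10 ms period) with readings dropped
# between 20 ms and 60 ms.
imu_timestamps_ms = [0, 10, 20, 60, 70]
gaps = find_temporal_gaps(imu_timestamps_ms, expected_period_ms=10)

print(gaps)
```

Running a check like this per stream before annotation is one way to surface dropped readings early; aligning gaps across synchronized streams would need additional logic not shown here.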

The teams that ship successfully spend more time on data

It's no surprise that data work consumes a significant share of project time. Across all project types, a third of teams (34%) spend more than half their time on exploration, curation, cleaning, and quality checks.
But time spent on data isn't a cost — it separates the teams that ship successfully from the ones that stall.
Among self-described exceptional teams, 58% spend more than half their project time on data work. Among struggling teams, just 21% do, a nearly 3x gap. The pattern holds across every tier: the better teams rate their own track record, the more of their time goes into data. If you’re wondering where your team sits, take the Physical AI Pipeline Audit, which benchmarks your data practices against the 700+ practitioners in this study.
Stacked horizontal bar chart showing the percentage of total time teams spend on data work, including exploration, curation, cleaning, and quality checking, grouped by an organization’s track record with visual AI projects. Across all respondents, 73% spend at least 25% of their time on data work: 39% spend 25–50% and 34% spend more than 50%. Exceptional teams spend the most time on data work, with 58% spending more than half their time and another 26% spending 25–50%. Strong and mixed teams show similar patterns, with 36% and 34% respectively spending more than half their time. Struggling teams spend less time at the highest level, with 21% spending more than half their time, while 18% spend less than 10% and 18% spend 10–25%.
Exceptional is defined as teams whose projects consistently reach production and meet goals. Source: 2026 State of Visual and Physical AI Report | Voxel51
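The "nearly 3x" comparison is the ratio of the two shares quoted above. A minimal sketch of that calculation follows; the variable names are illustrative assumptions, and the inputs are the rounded survey percentages.

```python
# Share of teams (rounded %) spending more than half their project time
# on data work, by self-reported track record.
exceptional_share = 58
struggling_share = 21

# Ratio between the two groups.
gap = exceptional_share / struggling_share

print(f"Exceptional vs. struggling: {gap:.1f}x")
```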
"80% of AI is the dirty work of data engineering." — RAND National Security Research Division

Training datasets are getting bigger, more proprietary, and more industry-specific

Training datasets for a project can easily include millions of data points, both labeled and unlabeled. Across our entire study, almost a quarter (23%) reported over a million data points per project, including 3% that have over one billion.
"A four-year-old child has seen 50x more information than the biggest LLMs that we have." — Yann LeCun, Chief AI Scientist, Meta
Dataset size varies significantly by industry. Individuals working on autonomous vehicles are among the most data-intensive, with 47% reporting more than one million data points per project. Manufacturing sits at the opposite end — 31% have fewer than 1,000 data points per project. Often, this reflects the more controlled and constrained environments in which manufacturing systems operate.
Stacked horizontal bar chart showing the typical total size of training datasets per project, including labeled and unlabeled data, by industry. Overall, 33% of respondents use datasets of 10,000 to 1 million samples, 26% use 1,000 to 10,000 samples, 23% use more than 1 million samples, and 18% use fewer than 1,000 samples. Autonomous vehicles have the largest datasets, with 47% using more than 1 million samples and 31% using 10,000 to 1 million. Retail/e-commerce and robotics most commonly use 10,000 to 1 million samples, at 43% and 40%. Healthcare/medical is split across dataset sizes: 28% use fewer than 1,000 samples, 30% use 1,000 to 10,000, 25% use 10,000 to 1 million, and 17% use more than 1 million. Manufacturing shows a similar spread, with 31% using fewer than 1,000 samples, 23% using 1,000 to 10,000, 33% using 10,000 to 1 million, and 13% using more than 1 million.
Source: 2026 State of Visual and Physical AI Report | Voxel51
In addition to variation by industry, we see that the size of datasets grows with maturity. Companies whose models have been deployed to production are more likely to have project datasets with more than 10,000 data points (70%) compared to organizations at the research stage (43%).
Stacked horizontal bar chart showing the typical total size of training datasets per project, including labeled and unlabeled data, by visual AI maturity. Production and maintenance teams use the largest datasets: 31% use more than 1 million samples and 39% use 10,000 to 1 million, totaling 70% with datasets above 10,000 samples. Pre-production teams report 19% using more than 1 million, 38% using 10,000 to 1 million, 19% using 1,000 to 10,000, and 24% using fewer than 1,000. Prototyping and experimentation teams are more spread out, with 20% using more than 1 million, 24% using 10,000 to 1 million, 37% using 1,000 to 10,000, and 20% using fewer than 1,000. Research and exploration teams report 19% using more than 1 million, 24% using 10,000 to 1 million, 31% using 1,000 to 10,000, and 26% using fewer than 1,000.
Source: 2026 State of Visual and Physical AI Report | Voxel51
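The 70% and 43% figures for datasets above 10,000 data points come from adding the two largest size buckets in the chart. The sketch below reproduces that sum; the bucket keys and the helper name `above_10k` are assumptions for illustration, and only the buckets needed for the sum are included.

```python
# Dataset size shares (rounded %) by maturity stage, from the chart above.
dataset_size_share = {
    "production": {"over_1m": 31, "10k_to_1m": 39},
    "research": {"over_1m": 19, "10k_to_1m": 24},
}


def above_10k(stage):
    """Share of teams at a stage with more than 10,000 data points per project."""
    buckets = dataset_size_share[stage]
    return buckets["over_1m"] + buckets["10k_to_1m"]


production_above_10k = above_10k("production")
research_above_10k = above_10k("research")

print(production_above_10k, research_above_10k)
```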
Larger datasets generally produce more reliable models — provided quality keeps pace with volume. In pursuit of more training data, organizations are drawing from multiple sources simultaneously.
Proprietary data remains the foundation, with 72% citing it as a source of training data — a number that climbs to 91% among teams with models in production, up from 59% at the research stage.
The pattern suggests that public datasets serve as a starting point for experimentation, but as teams mature and models move closer to real-world deployment, proprietary data becomes the dominant source.
This concentration has significant downstream implications: unlike public datasets (used by 50% of teams for training data), proprietary data requires organizations to build or acquire the infrastructure to store, curate, label, and continuously validate it at scale.
Horizontal bar chart titled “Where visual AI training data comes from.” The most common source is proprietary data at 72%, followed by public datasets such as COCO, ImageNet, and Open Images at 50%. Synthetic data is used by 40% of respondents, while 32% use web-scraped or publicly available visual data. Less common sources include paid or licensed data at 23% and other sources at 3%.
Source: 2026 State of Visual and Physical AI Report | Voxel51

Synthetic data unlocks the edge cases that real-world data misses

Coverage gaps were one of the top-cited dataset iteration challenges (43%), and synthetic data is how a growing share of teams expect to close them.
While only 40% use synthetic data today, 63% believe it will become the primary source of training data for their projects in the near future — a significant gap between current practice and where practitioners expect the field to land.
Synthetic data — digitally generated images, video, and sensor outputs that simulate real-world conditions — offers something real-world data collection cannot: the ability to produce diverse, labeled examples on demand. This allows developers to test systems against a wide range of scenarios, including edge cases that rarely appear in real-world datasets.

Annotation is expensive, wasteful, and painful

Teams need less labeled data each year

The landscape for training data is shifting rapidly, and two signals stand out the most. More than half (57%) expect to need less labeled data each year to achieve the same model performance, suggesting that efficiency gains are outpacing data volume requirements. And 68% believe vision-language model growth will soon stall as available training data runs out — a challenge already playing out in LLM development.

Annotation is expensive, wasteful, and painful

Data annotation — the process of labeling raw images and video so models can learn to recognize patterns, objects, and actions — is foundational to AI. The quality and consistency of those labels directly determine what a model learns and how well it performs in the real world.
Nearly all teams (96%) label their own data rather than relying on pre-labeled datasets. Most teams keep labeling in-house (65%), increasingly augmented by model-assisted labeling (51%). This mix reflects both the sensitivity of proprietary data and the growing maturity of annotation tooling.
Horizontal bar chart titled “How visual AI data is labeled.” The most common labeling method is using an in-house team at 65%, followed by model-assisted labeling at 51% and domain or subject matter experts at 46%. Other methods include active learning to select what to label at 29% and external vendors or services at 27%. A small share use other methods at 2%, while 4% say they do not annotate data and only use pre-labeled or benchmark datasets.
Source: 2026 State of Visual and Physical AI Report | Voxel51
Approaches to labeling also vary by industry. Teams working with autonomous vehicles and driving were more likely to report using model-assisted labeling (67%), while healthcare and medical technology companies were more likely to say they leveraged subject matter experts (59%).
“It's literally me by hand on the weekend cranking through data!” – Survey respondent working as a Software Engineer in retail
Despite the proliferation of external annotation vendors, only 27% of teams use outside services. But adoption grows with organizational size.
Among organizations with 10,000+ employees, 37% rely on external annotation services. Similarly, among projects with more than a million data points, that share rises to 41%. This likely reflects both a willingness to pay for services and the practical reality that larger, more complex projects demand external resources.
Horizontal bar chart titled “Use of external annotation services,” broken down by company size. Larger companies are more likely to use external annotation services: 37% of companies with more than 10,000 employees use them, followed by 30% of companies with 1,001 to 10,000 employees and 28% of companies with 200 to 1,000 employees. Smaller companies are least likely to use them, with 22% of companies with fewer than 200 employees reporting usage.
Source: 2026 State of Visual and Physical AI Report | Voxel51
Given the effort and cost required to annotate data, it is not surprising that teams are gaining experience with model-assisted labeling as their projects mature. Companies in production are almost twice as likely to use model-assisted labeling (70%) compared to those in the research phase (37%).
Horizontal bar chart titled “Use of model-assisted labelling,” broken down by visual AI maturity. Teams in later deployment stages are more likely to use model-assisted labelling: 70% of production and maintenance teams use it, followed by 63% of pre-production teams. Usage is lower among earlier-stage teams, with 44% of prototyping and experimentation teams and 37% of research and exploration teams using model-assisted labelling.
Source: 2026 State of Visual and Physical AI Report | Voxel51

Teams label too much of the wrong data

Data annotation has a significant operational cost, particularly when done manually by internal teams or subject matter experts. Despite the investment, more than a third (36%) report that less than half their annotated data ever reaches production, with 15% saying less than a quarter makes it through.
Pie chart titled “Percentage of annotated data that ends up in a production training dataset.” The largest share of respondents, 38%, say more than 75% of annotated data makes it into production training datasets. Another 26% say 51–75% is used, meaning 64% of respondents use more than half of their annotated data in production training. Lower usage is less common: 21% say 25–50% of annotated data is used, and 15% say less than 25% is used.
Source: 2026 State of Visual and Physical AI Report | Voxel51
The financial impact compounds quickly in physical AI. At a conservative $1.50 per sensor-fused 3D cuboid, annotating one million objects costs $1.5M. When 50% of that data never reaches production, you’re looking at $750,000 in sunk costs. In reality, a single second of driving data can contain dozens of objects across synchronized streams, meaning teams are spending thousands of dollars on single sequences that may ultimately be discarded.
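The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `annotation_waste` and its parameters are illustrative assumptions, and the inputs are the figures quoted in the paragraph ($1.50 per cuboid, one million objects, 50% unused).

```python
def annotation_waste(num_objects: int, cost_per_object: float,
                     unused_fraction: float) -> tuple[float, float]:
    """Return (total annotation cost, cost of labels that never reach production)."""
    if not 0.0 <= unused_fraction <= 1.0:
        raise ValueError("unused_fraction must be between 0 and 1")
    total = num_objects * cost_per_object
    return total, total * unused_fraction


# Figures from the paragraph above: 1M objects at $1.50 each, 50% never used.
total, wasted = annotation_waste(1_000_000, 1.50, 0.50)
print(f"Total: ${total:,.0f}  Sunk: ${wasted:,.0f}")
# Total: $1,500,000  Sunk: $750,000
```

Swapping in a team's own per-object price and utilization rate gives a rough first estimate of how much annotation spend is at risk.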
Without the ability to identify high-value samples before annotation begins, teams default to labeling everything and discarding what doesn't make the cut. The waste isn't an annotation problem. It's a data curation problem.

Existing annotation processes are painful

Teams are frequently unhappy with their approach to annotating data. Only a third (34%) report they are satisfied with their annotation process, a very low number considering how important this effort is to visual AI outcomes. This number is even lower among data scientists (22%) and individuals working at manufacturing companies (25%). While annotation pain is nearly universal in this data, it's not evenly distributed. See where your process stands by taking the Physical AI Pipeline Audit.
Bar chart titled “Satisfaction with current annotation process.” The largest group of respondents is neutral at 39%. Satisfaction is lower overall: 4% are very satisfied and 30% are somewhat satisfied, totaling 34%. Dissatisfaction accounts for 27%, with 20% somewhat dissatisfied and 7% very dissatisfied.
Source: 2026 State of Visual and Physical AI Report | Voxel51
99% say annotation process is painful
The pain is nearly universal: 99% describe their annotation process as painful. Cost and speed top the list (65%), but two deeper issues stand out: teams struggle to identify which samples are worth labeling in the first place (46%), and when they do label, quality and consistency are difficult to maintain (58%). Too much of the wrong data gets labeled, and too much of the right data gets labeled incorrectly.
Horizontal bar chart titled “Biggest pain points with the annotation process.” The top pain point is cost or speed at 65%, followed by annotator quality or consistency at 58%. Other common pain points include identifying high-value samples to label at 46%, tooling limitations at 35%, and security or privacy concerns at 24%. Only 2% selected other, and 1% say their annotation process is not painful.
Write-in answers included too few people, the cost of building tools, lack of expertise, and difficulties specifying requirements in a clear, consistent, and standardized way.
Source: 2026 State of Visual and Physical AI Report | Voxel51

Investments in annotation are expected to grow

For most teams (78%), annotation spend is stable or growing — 44% anticipate an increase, 34% expect it to hold steady. Only 22% project a decrease.
Bar chart titled “Expected changes in annotation costs over the next year.” Most respondents, 78%, expect annotation costs to either increase or stay about the same: 9% expect costs to increase significantly, 35% expect them to increase somewhat, and 34% expect them to stay about the same. Fewer respondents expect costs to decrease, with 15% expecting them to decrease somewhat and 7% expecting a significant decrease.
Source: 2026 State of Visual and Physical AI Report | Voxel51
AI-assisted automation (auto-labeling, foundation models, synthetic data) is the #1 reason teams expect annotation costs to fall, and the #3 reason they expect costs to rise. Scope changes (#1) and budget or resourcing shifts (#2) round out the reasons for cost increases, but automation is the only factor showing up prominently in both directions.
Two side-by-side ranking tables showing reasons annotation costs may increase or decrease. For cost increases, the top reason is project scope changes, followed by budget or resourcing changes, and AI-assisted automation such as auto-labeling, foundation models, or synthetic data. For cost decreases, the top reason is AI-assisted automation, followed by smarter data selection such as active learning or prioritization, and tooling or workflow improvements.
Source: 2026 State of Visual and Physical AI Report | Voxel51

Physical AI still demands custom model development

The model development landscape reflects an industry still finding its footing, and a domain where foundation models haven't yet taken hold the way they have in language. Fine-tuning publicly available models dominates at 72%, but the more telling finding is that 57% of teams are still training custom models from scratch. Only 29% use models off-the-shelf without any modification.
This pattern is specific to visual and physical AI. Foundation models are trained overwhelmingly on internet text and 2D images. They've seen very little LiDAR, few synchronized multi-camera rigs, almost no IMU or industrial sensor streams, and none of the domain-specific edge cases that define real-world deployments.
Fine-tuning and post-training can close some of that gap, and for many teams, they do. But when 57% still conclude it isn't enough and build from scratch, it's a signal that off-the-shelf capabilities have real limits in production physical AI deployments.
For teams evaluating what purpose-built infrastructure should actually include, check out the 2026 Physical AI Data Platform Guide.
Horizontal bar chart titled “Method for developing visual AI models.” The most common approach is fine-tuning publicly available models at 72%, followed by training custom models from scratch at 57%. Other methods include fine-tuning licensed proprietary models at 38% and using off-the-shelf models without modification at 29%.
Source: 2026 State of Visual and Physical AI Report | Voxel51

Final thoughts on the state of visual and physical AI

Physical AI is scaling fast. Robotics, autonomous systems, and industrial applications are moving from pilots into production, and the amount of data generated is measured in petabytes, not gigabytes.
The companies that win this decade will be the ones that treat their data stack as core to their physical AI strategy – on par with investments in models and compute – not as a downstream cost. As models become increasingly commoditized, the ability to find, curate, and act on the right data becomes the primary driver of model performance.
"What enables each wave and each phase of AI, three fundamental matters are involved. The first is how do you solve the data problem." — Jensen Huang, CEO, NVIDIA
The data challenges in physical AI are distinct, and they compound at scale:
  • Multimodal complexity emerges across inputs. Camera, LiDAR, radar, and depth streams have to be synchronized and spatially aligned before they can be analyzed together — a problem that doesn’t exist in text or tabular ML.
  • Value concentrates in a fraction of the data. System safety depends on rare edge cases such as unusual pedestrian poses, lighting conditions, or manufacturing defects that are difficult to find in petabyte-scale datasets.
  • Data quality degrades at scale. As datasets grow, errors accumulate across labels, sensors, and distribution shifts, turning isolated mistakes into widespread quality issues hidden throughout the data.
  • Annotation spend is wasted on low-value data. Teams pay to label large volumes of data, even when additional samples provide no meaningful signal once common patterns have already been learned.
  • Aggregate metrics misrepresent real-world performance. Metrics like mAP and IoU guide training, but fail under real-world conditions where edge cases, long-tail events, and safety-critical scenarios determine performance.
  • Traceability breaks down in production. When models fail, most stacks cannot reliably connect failures back to specific data, labels, or training runs.
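The point about aggregate metrics can be made concrete with a small sketch. It uses plain accuracy as a simpler stand-in for mAP or IoU, and the evaluation records are hypothetical counts invented for illustration, not survey data. The function name `accuracy_by_slice` and the "day"/"night" conditions are likewise assumptions.

```python
from collections import defaultdict

# Hypothetical evaluation records: (condition, prediction_correct).
# The counts are made up for illustration; they are not survey data.
records = (
    [("day", True)] * 930 + [("day", False)] * 20
    + [("night", True)] * 20 + [("night", False)] * 30
)


def accuracy_by_slice(rows):
    """Return overall accuracy and a dict of per-condition accuracy."""
    hits, totals = defaultdict(int), defaultdict(int)
    for condition, correct in rows:
        totals[condition] += 1
        hits[condition] += int(correct)
    overall = sum(hits.values()) / sum(totals.values())
    return overall, {c: hits[c] / totals[c] for c in totals}


overall, per_slice = accuracy_by_slice(records)
print(f"overall: {overall:.2f}")
for condition, acc in per_slice.items():
    print(f"{condition}: {acc:.2f}")
# overall: 0.95
# day: 0.98
# night: 0.40
```

In this toy case the headline number looks healthy because the rare condition contributes only 5% of the samples, even though the model is wrong on most of them.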
What teams should look for in a physical AI data platform:
  • Unified view across sensors. Infrastructure that ingests and visualizes camera, LiDAR, radar, and depth data in a single interface, with synchronization and alignment handled as a property of the data layer rather than as a one-off engineering effort for each project. Engineers need to debug fusion failures where they actually occur: across modalities, in the same view.
  • Scenario mining at petabyte scale. The ability to surface rare, high-value moments from massive archives using semantic search, similarity, and metadata filtering — without manual review. Edge cases can't improve model performance if teams can't find them.
  • Continuous, multi-dimensional quality workflows. Tooling that detects annotation errors, flags sensor anomalies, and identifies coverage gaps and distribution drift as first-class signals. Quality has to be monitored continuously across dimensions, not audited once at handoff.
  • Curation that targets what the model is weakest on. The ability to identify underrepresented classes, low-confidence predictions, and failure clusters, then prioritize annotation spend against them. Annotation budgets should reinforce where models fail, not where data is easiest to collect.
  • Scenario-based, sliced evaluation. Per-condition, per-class, per-geography performance breakdowns that go beyond aggregate metrics. Teams need to know where a model fails before deployment tells them — and regulators increasingly require it.
  • Dataset and model lineage. Every prediction, label, and model version traceable back to the exact data that produced it. When a failure surfaces in production, teams need to connect it to the training examples, annotation revisions, and model run that caused it — not guess.
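As one illustration of curation that targets model weaknesses, the sketch below ranks unlabeled samples by model confidence and spends a fixed annotation budget on the least confident ones. This is simple uncertainty sampling, chosen here for brevity; it is not presented as the method any particular platform uses. The `Sample` class, `select_for_annotation` function, and confidence values are hypothetical, and the $1.50 cost per label reuses the per-cuboid figure cited earlier in this report.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    confidence: float  # model's top prediction confidence, in [0, 1]


def select_for_annotation(samples, budget, cost_per_label):
    """Pick the least-confident samples that fit within the budget."""
    max_labels = int(budget // cost_per_label)
    ranked = sorted(samples, key=lambda s: s.confidence)
    return ranked[:max_labels]


# Hypothetical pool of unlabeled samples with made-up confidence scores.
pool = [
    Sample("a", 0.98), Sample("b", 0.41), Sample("c", 0.87),
    Sample("d", 0.55), Sample("e", 0.93),
]
chosen = select_for_annotation(pool, budget=3.00, cost_per_label=1.50)
print([s.sample_id for s in chosen])
# ['b', 'd']
```

A production workflow would likely combine confidence with other signals named in the list above, such as class coverage and failure clusters, rather than rely on a single score.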
Not sure where your data practice stands relative to the field? Take a few minutes to find out.
Physical AI Pipeline Audit →
Model accuracy, robustness, and deployability are downstream of how well a team can see into its data and act on what it finds. The organizations that close that loop fastest will ship physical AI systems that work in production. The ones that do not will watch their programs stall in the accuracy gap between demo and deployment. For a detailed breakdown of what to evaluate, see the 2026 Physical AI Data Platform Guide.

Survey methodology and participant demographics

This research was conducted by Dimensional Research and commissioned by Voxel51. An online survey was sent to independent databases of professionals working in physical AI, visual AI, and computer vision. A total of 709 qualified individuals participated in the survey. All participants had professional experience working with visual or multimodal data. Participants included a mix of roles, responsibilities, company sizes, industries, and regions. Responses were captured in 2026. Due to rounding, certain graph options may not add up to exactly 100%.
Two pie charts summarizing respondent demographics. The first chart shows years of visual AI experience: 30% have less than 3 years, 29% have 3–5 years, 22% have 6–10 years, and 19% have more than 10 years. The second chart shows respondent roles: 34% are senior individual contributors, 18% are academic researchers, 16% are team managers, 12% are graduate students, 11% are frontline staff, and 9% are executives such as directors, VPs, or C-level leaders.
Source: 2026 State of Visual and Physical AI Report | Voxel51
Horizontal bar chart titled “Primary responsibility.” The largest group of respondents work in AI/ML engineering at 32%, followed by AI/ML research at 27%. Other responsibilities include software engineering at 13%, data science at 10%, robotics at 8%, data platform at 3%, and other at 7%.
Source: 2026 State of Visual and Physical AI Report | Voxel51
Two pie charts showing company demographics. The first chart shows company size by number of employees: 22% have 10,000 or more employees, 22% have 2–10 employees, 17% have 11–50 employees, 14% have 1,001–10,000 employees, 13% have 201–1,000 employees, and 13% have 51–200 employees. The second chart shows company region: 58% are in the United States or Canada, 26% are in Europe, 10% are in Asia Pacific, 3% are in the Middle East or Africa, and 2% are in Mexico, Central, or South America.
Source: 2026 State of Visual and Physical AI Report | Voxel51
Horizontal bar chart titled “Data modalities.” Images are the most common data modality at 92%, followed by video at 63% and language/text at 55%. Other modalities include time-series data such as IMU or sonar at 42%, 3D data such as point clouds, meshes, or LiDAR at 38%, audio at 25%, multi/hyperspectral data at 18%, and other at 3%.
Source: 2026 State of Visual and Physical AI Report | Voxel51
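The rounding caveat in the methodology can be checked against the company demographics reported above. The short sketch below sums the published company-size and region percentages; the dictionary names are illustrative, and the values are copied from the chart descriptions in this section.

```python
# Percentages as reported in the company demographics charts above.
company_size = {
    "10,000+": 22, "2-10": 22, "11-50": 17,
    "1,001-10,000": 14, "201-1,000": 13, "51-200": 13,
}
region = {
    "United States or Canada": 58, "Europe": 26, "Asia Pacific": 10,
    "Middle East or Africa": 3, "Mexico, Central, or South America": 2,
}

size_total = sum(company_size.values())
region_total = sum(region.values())
print(size_total, region_total)
# 101 99
```

Both totals land within one point of 100, which is consistent with each share having been rounded to a whole percentage independently.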

About Dimensional Research

Dimensional Research® provides practical market research for technology companies. We partner with our clients to deliver actionable information that reduces risks, increases customer satisfaction, and grows the business. Our researchers are experts in the applications, devices, and infrastructure used by modern businesses and their customers.
For more information, visit www.dimensionalresearch.com.

About Voxel51

Voxel51 is the leading data platform for physical AI. The company's flagship product, FiftyOne, combines open-source flexibility with enterprise-grade capabilities to help teams understand and analyze their multimodal data, annotate the right samples, close quality and coverage gaps, and build models that perform reliably in the real world. Trusted by millions of developers and thousands of enterprises—including Porsche, Vivint, Berkshire Grey, and Microsoft—FiftyOne is how the world's leading AI teams build the data foundation physical AI demands. Learn more at voxel51.com.
