Over the past few years, I’ve had the privilege of working closely with manufacturing teams that are deploying visual AI to solve real problems on the factory floor. From detecting subtle defects in circuit boards to identifying structural anomalies in large-scale assemblies, the use of computer vision in manufacturing is no longer at the experimental stage—it’s operational, as highlighted in our 2025 AI in Manufacturing Landscape.
For all the excitement around foundation models and automated visual inspection, the reality of deploying these systems at scale is far from plug-and-play. Most of the cost, complexity, and risk doesn’t lie in GPUs or code—it lies in the data. And in manufacturing, where defects are rare by design, that data is especially hard to come by.
Here’s what we’ve learned at Voxel51 about building reliable AI defect detection systems in the real world, and how manufacturers are rethinking their approach to AI.
The data imbalance no one talks about
When people think of AI in manufacturing, they often picture high-speed vision systems spotting flaws with superhuman precision. But behind that glossy outcome is a deeply asymmetric data problem.
On a typical production line, more than 99.99% of parts are perfectly fine. That’s exactly what you want from a manufacturing system. But from a machine learning standpoint, this is a nightmare. We’re trying to train a model to detect rare and often subtle anomalies—hairline cracks, misalignments, poor solder joints—when there might be only one true defect in every 200,000 examples. That extreme imbalance is a bottleneck not just for modeling, but for data collection and labeling. It’s also why so many efforts to use off-the-shelf models fail.
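To make that imbalance concrete, here is a minimal, illustrative sketch (not from our production pipelines) of one common mitigation: oversampling the handful of defect images with a weighted sampler so the model actually sees them in every batch. The dataset, image sizes, and counts below are toy placeholders.

```python
# Illustrative only: cope with extreme class imbalance by oversampling the
# rare defect class with a weighted sampler (toy data, not real inspection images).
from collections import Counter

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in for an inspection dataset: 0 = good part, 1 = defect.
labels = torch.tensor([0] * 1_995 + [1] * 5)
images = torch.randn(len(labels), 3, 32, 32)  # placeholder image tensors
dataset = TensorDataset(images, labels)

# Weight each sample inversely to its class frequency, so the handful of
# defect images are drawn roughly as often as the sea of good parts.
counts = Counter(labels.tolist())
weights = torch.tensor([1.0 / counts[int(y)] for y in labels], dtype=torch.double)
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

loader = DataLoader(dataset, batch_size=32, sampler=sampler)
xb, yb = next(iter(loader))
print(f"defects in first batch: {int(yb.sum())} / {len(yb)}")
```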
Real-world use cases: from micro to macro
At Voxel51, we’ve worked with customers detecting defects at multiple scales. On the macro end, we’ve seen computer vision used to catch panel misalignments on vehicle assembly lines and identify damage to large mechanical components. On the micro scale, manufacturers are inspecting PCB boards for imperfect solder joints, missing components, or subtle wear that could lead to failure.
In all of these cases, the process starts with a visual media pipeline—RGB, depth, sometimes infrared. That data flows into a visual dataset platform like FiftyOne, where QA teams or machine learning engineers inspect, sort, and flag anomalies. These aren’t giant labeling farms. They’re small, focused teams identifying the few meaningful edge cases that make all the difference in training.
And those hard examples? They become the gold nuggets—the foundation for model improvement.
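As a rough illustration of that curation loop, the sketch below uses FiftyOne to load a directory of inspection images, surface low-confidence predictions, and tag the hard examples for the next training run. The directory path, the `predictions` field, and the 0.6 threshold are all hypothetical, not part of any customer workflow.

```python
# A hedged FiftyOne sketch: load inspection images, pull up uncertain
# predictions for review, and tag the hard examples for retraining.
import fiftyone as fo
from fiftyone import ViewField as F

# Load a directory of line-scan images (path is illustrative).
dataset = fo.Dataset.from_dir(
    dataset_dir="/data/line7/captures",
    dataset_type=fo.types.ImageDirectory,
    name="line7-inspection",
)

# Assume a "predictions" classification field has already been added by a model;
# surface the uncertain samples for a human to review.
uncertain = dataset.match(F("predictions.confidence") < 0.6)

# Reviewers tag the genuinely hard examples so they feed the next training run.
for sample in uncertain:
    sample.tags.append("hard_example")
    sample.save()

session = fo.launch_app(uncertain)  # browse and curate in the FiftyOne App
```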
False positives aren’t the enemy
One of the most surprising insights from our research on auto-labeling is that false positives are often less harmful than we think, especially in early-stage model training.
With our Verified Auto Labeling work, we’ve studied the performance of models when trained on noisy labels produced by foundation models. What we found is that it’s better to have a wide range of data with some incorrect labels than a small, perfectly labeled dataset. Bigger is often better than better, so long as you have mechanisms to verify and improve iteratively.
That’s particularly relevant in manufacturing, where getting more of the right data (i.e., real defect cases) is often infeasible. Instead, teams are increasingly augmenting rare examples or simulating variability in appearance, lighting, or viewpoint. The goal isn’t perfection—it’s coverage.
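For illustration, here is one way that kind of augmentation might look, using torchvision transforms to simulate lighting, viewpoint, and orientation changes on a single rare defect image. The file name and parameter values are placeholders, not a recommended recipe.

```python
# Illustrative augmentation of a rare defect image: synthesize lighting,
# viewpoint, and orientation variation from the few real examples you have.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4),       # lighting changes
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),  # camera viewpoint
    transforms.RandomRotation(degrees=10),                       # part orientation
    transforms.RandomHorizontalFlip(p=0.5),
])

defect = Image.open("rare_defect.png").convert("RGB")  # hypothetical sample
synthetic_variants = [augment(defect) for _ in range(20)]
```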
From prototypes to production
Many teams see early success with their first AI prototypes. A model achieves 80% accuracy after just a few weeks of training, and the proof-of-concept demo looks great. But 80% isn’t good enough when you’re shipping products at scale. If your system flags one in every five parts incorrectly, or worse, misses real defects altogether, you still need humans reviewing every frame.
So how do you move beyond 80%?
One key strategy we’ve seen is scenario and failure mode analysis. Instead of just evaluating a model’s top-line accuracy, teams inspect how it performs across different edge cases. Where is it weak? Which conditions consistently trip it up? Should we collect more data, refine the labels, or change the model architecture?
This isn’t glamorous work. It’s data-centric, engineering-heavy, and often iterative. But it’s the work that separates toy models from real production systems.
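Here is a simple sketch of what that slicing can look like, with made-up scenario names and records: instead of reporting one top-line number, compute defect recall per condition and let the weak slices drive the next round of data collection or label refinement.

```python
# Illustrative failure-mode analysis: group evaluation results by scenario
# metadata (surface finish, lighting, etc.) and report defect recall per slice.
from collections import defaultdict

# Hypothetical evaluation records: (scenario, ground_truth, prediction)
records = [
    ("matte_finish", "defect", "defect"),
    ("matte_finish", "defect", "good"),
    ("glossy_finish", "defect", "good"),
    ("glossy_finish", "good", "good"),
    ("low_light", "defect", "defect"),
]

hits, totals = defaultdict(int), defaultdict(int)
for scenario, truth, pred in records:
    if truth == "defect":            # recall on the defect class, per scenario
        totals[scenario] += 1
        hits[scenario] += int(pred == "defect")

for scenario in totals:
    print(f"{scenario}: defect recall = {hits[scenario] / totals[scenario]:.2f}")
```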
The role of human-in-the-loop
In manufacturing, we often hear the concern: “Will AI replace human inspectors?” But in practice, the most successful systems are built around human-in-the-loop design.
Models can be used to triage incoming data—automatically passing along the confident “green” cases, flagging the uncertain “yellow” ones for review, and discarding or quarantining the obvious errors. This workflow saves time, reduces QA fatigue, and ensures that humans focus their attention where it matters most.
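A minimal sketch of that green/yellow/red routing logic is below; the thresholds are purely illustrative and would be tuned per line and per defect type in practice.

```python
# Illustrative confidence-based triage: thresholds are placeholders.
def triage(defect_probability: float) -> str:
    """Route a part based on the model's estimated defect probability."""
    if defect_probability < 0.05:
        return "green"    # confident pass: no human review needed
    if defect_probability < 0.60:
        return "yellow"   # uncertain: queue for human inspection
    return "red"          # likely defect: quarantine the part

for p in (0.01, 0.30, 0.95):
    print(p, "->", triage(p))
```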
Over time, those reviewed edge cases improve the model, which leads to fewer yellow flags, and more automation. But the human role doesn’t disappear—it becomes more strategic.
Defect detection is one of the most compelling applications of computer vision in manufacturing, but it’s also one of the hardest to get right. The challenges aren’t algorithmic—they’re data-driven. You’re not just building a model; you’re building a system that can learn from the rare, the subtle, and the complex.
If there’s one takeaway from the work we’ve done at Voxel51, it’s this: the teams that succeed treat data not as a byproduct, but as a product. They invest in the right tools, build feedback loops, and design with humans in mind.
Because at the end of the day, good AI doesn’t just see the world—it learns from it.
I recently joined Manufacturing Tomorrow, an Ohio State University podcast, to discuss this in detail.
Listen to the full episode if you’re bringing visual AI applications from concept to the production floor.