What Makes ‘Good’ Data? A View from the Front Lines of AI
Jul 17, 2025
6 min read
For much of the last decade, the prevailing narrative in AI has been that scale wins. More data, more compute, larger models—that has been the formula. And to be fair, we’ve gone pretty far with it. But in my experience—both in academia and industry—this emphasis on quantity has come at the cost of a more nuanced truth: it’s not just how much data you have. It’s how well you understand and curate that data.
As a computer vision and machine learning researcher, I’ve spent years working on models to interpret the visual world. But over time, I kept running into the same friction point: the data itself. Was it representative? Was it biased? Was it even usable? And perhaps most importantly—how would I even know?

From open source code to open source data

The open source mindset has always been part of my work—long before it became mainstream in machine learning. When I was a graduate student, most of us spent countless hours re-implementing algorithms from papers, line by line, because authors rarely shared their code. That made replication slow and sometimes frustrating. But it also reinforced how important transparency and reproducibility were if we wanted the field to move forward.
In the early 2000s, that started to change. More researchers began publishing MATLAB code alongside their papers, and suddenly it became easier to reproduce, test, critique, and build on each other’s work. It wasn’t just more efficient—it helped advance the broader field faster and more dynamically. That shift convinced me to make open source code a requirement once I became a professor with my own research lab. If we published a paper, we released the code and instructions to reproduce the results. It was a simple rule with a big impact.
But after years of doing that, something still felt incomplete. We were sharing our code, but the data behind the results—and the decisions we made in collecting, filtering, labeling, and cleaning it—were rarely as visible. Around 2003, more open computer vision datasets, such as Caltech 101 and the KTH action dataset, began to appear, setting a precedent that sharing data was also viable.
As datasets grew larger and more widely available, understanding and analyzing them became a real challenge. The tools to inspect or interrogate them simply didn’t exist. In a world where models were only getting more complex and data more central, that seemed like a blind spot we couldn’t afford to ignore.

Why we built FiftyOne for data understanding

Around the time that large-scale deep learning started reshaping the field, something else changed: the role of data shifted from supporting cast to co-star. It became clear that the performance of a model wasn’t just about architecture or training tricks—it was fundamentally tied to the quality and characteristics of the data itself.
That shift was both exciting and disorienting. We were training increasingly powerful models, but often without understanding why they worked—or didn’t. I saw it again and again in my own research and in conversations with other engineers and scientists. The data might be noisy, imbalanced, redundant, or just a poor fit for the task, but we didn’t have tools to diagnose that. It felt like flying blind.
This lack of visibility sparked the idea that maybe we needed to rethink the interface between humans and machine learning systems—not at the level of models, but at the level of data. What if you could ask questions of your dataset the way you’d inspect model logs or weight distributions? What if developers had the same observability for their training data that they have for their model architectures?
That’s what we set out to build with the founding of Voxel51: tools like FiftyOne that allow machine learning engineers to inspect, slice, visualize, and experiment with image and video datasets in meaningful ways. Not because it’s trendy, but because understanding your data is essential if you want your model to generalize, behave ethically, or even just to work at all.
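To make that concrete, here is a minimal sketch of the kind of data slicing FiftyOne supports. It uses the small “quickstart” dataset from the FiftyOne zoo and its bundled “predictions” field as stand-ins for your own data and labels—swap in whatever dataset and fields you actually have.

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

# Load a small sample dataset from the FiftyOne dataset zoo
dataset = foz.load_zoo_dataset("quickstart")

# Slice the dataset: keep only predicted objects with low confidence,
# a quick way to surface samples worth a closer look
view = dataset.filter_labels("predictions", F("confidence") < 0.3)

# Visually inspect the resulting view in the FiftyOne App
session = fo.launch_app(view)
```

The point isn’t this particular query—it’s that questions about your data become one-liners you can iterate on, the same way you’d iterate on a model.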

What makes data “good”?

There’s no universal checklist for “good” data—it depends entirely on context. But in practice, we’ve found that a few patterns come up again and again: redundancy, class imbalance, mislabeled examples, and poor edge-case coverage, to name a few. These aren’t just academic concerns—they’re the reason models fail in production.
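Class imbalance, for example, is often visible before you ever train a model. A quick sketch using FiftyOne’s aggregation API—the dataset name and the “ground_truth” detections field are placeholders for your own:

```python
import fiftyone as fo

# Load an existing dataset (name is a placeholder)
dataset = fo.load_dataset("my-dataset")

# Per-class counts expose imbalance at a glance
print(dataset.count_values("ground_truth.detections.label"))
```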
What makes data “good” is whether it’s right for the problem you’re trying to solve. That might mean reducing duplicates, balancing class distributions, or specifically not balancing them if your use case demands it. Sometimes it means discovering the rare, hard-to-label examples that matter most. The point is, you can’t fix what you can’t see. Data observability is the prerequisite for actionable improvement.
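As one illustration of that observability, here is a sketch of using FiftyOne Brain’s uniqueness scores to surface both near-duplicates and rare samples—again, the dataset name is a placeholder:

```python
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset("my-dataset")  # placeholder name

# Score how visually unique each sample is relative to the rest;
# this populates a "uniqueness" field on every sample
fob.compute_uniqueness(dataset)

# Lowest-uniqueness samples are near-duplicate candidates to prune
near_dupes = dataset.sort_by("uniqueness").limit(50)

# Highest-uniqueness samples often surface rare edge cases worth labeling
edge_cases = dataset.sort_by("uniqueness", reverse=True).limit(50)
```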
In machine learning, we’ve made incredible progress on modeling techniques, architectures, and tooling. But models are only as good as the data they learn from. And understanding that data—truly analyzing it, questioning it, improving it—is still one of the most underdeveloped parts of the workflow.
We don’t need more data. We need better ways to work with the data we already have, and that starts with data-centric AI practices.
If we want to build models that are not just accurate, but robust, fair, and reliable, then investing in data understanding isn’t optional—it’s foundational.

Ready to see your data clearly?

Try FiftyOne today and discover how faster dataset visualization and curation can unlock the next level of performance for your visual-AI models.

Talk to a computer vision expert
