Build Better Visual AI Datasets with the FiftyOne Data Quality Workflow
April 9, 2025 – Written by Nick Lotz
TL;DR: This post introduces a new workflow that proactively identifies issues in your datasets so you can quickly improve data quality, identify the sources of failure modes, and build higher-performing models with confidence.

As organizations accelerate their visual AI development, data quality remains one of the biggest drivers of success. Today’s datasets are vast and complex, often combining millions of images, videos, or sensor readings. Yet it only takes a small percentage of problematic data – such as edge case failures, duplicate samples, or inconsistent annotations – to compromise a model’s performance in critical scenarios.
Lacking dedicated tools, many teams rely on manual, DIY approaches to assess dataset quality. For example, data scientists often still manually scroll through samples or write one-off scripts to find duplicate images or blank frames. These approaches don’t scale, and enterprises therefore tend to address data problems retroactively.
Yet proactively focusing on upstream data quality is a strategic imperative. The evidence increasingly shows that better data yields better models. Conversely, hidden flaws in data act like landmines that can cause a model to fail spectacularly when it encounters outliers that were present (but unnoticed) in training data.
We’re excited to announce new Data Quality workflows in FiftyOne Enterprise. These workflows are purpose-built to make data quality assessment easy, intuitive, and actionable.
How Does It Work?
The FiftyOne App provides a powerful interface that exposes operations, workflows, and dashboards right alongside your datasets. We’ve developed a dedicated Data Quality UI so that you can easily visualize, interact with, and take action on affected samples as you analyze several supported data quality issues.

Out of the box, you can scan for a range of common data issues that plague AI projects.
- Brightness: find images that are unusually dark or bright
- Blurriness: flag scenes that are too blurry or, conversely, abnormally sharp
- Aspect Ratio: catch images with extreme aspect ratios that might indicate improper scaling or padding
- Entropy: measure visual information content to surface scenes that are almost empty or overly noisy
- Near Duplicates: use embedding similarity to identify images that are nearly identical, such as the same scene captured from two angles
- Exact Duplicates: detect identical samples that appear multiple times in the dataset
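To build intuition for the per-sample metrics above, here is a minimal NumPy sketch of how brightness, blurriness, aspect ratio, and entropy can be scored. This is an illustrative approximation, not FiftyOne's actual implementation, and the specific formulas (mean intensity, Laplacian variance, histogram entropy) are common choices assumed here for clarity.

```python
import numpy as np

def brightness(img: np.ndarray) -> float:
    """Mean pixel intensity in [0, 1]; very low or high values flag dark/bright images."""
    return float(img.mean())

def blurriness(img: np.ndarray) -> float:
    """Variance of a simple Laplacian response; low variance suggests a blurry image."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    return float(lap.var())

def aspect_ratio(img: np.ndarray) -> float:
    """Width / height; extreme values may indicate improper scaling or padding."""
    h, w = img.shape[:2]
    return w / h

def entropy(img: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (bits) of the intensity histogram; near 0 for almost-empty scenes."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# A flat gray frame scores mid brightness but zero entropy and zero blur response,
# exactly the kind of near-empty sample these scans are designed to surface
flat = np.full((100, 200), 0.5)
print(brightness(flat), blurriness(flat), aspect_ratio(flat), entropy(flat))
```

A noisy or texture-rich image would instead score high on both entropy and the Laplacian variance, which is why thresholding these distributions separates empty frames from informative ones.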
Understanding how these issues manifest in your data is critical to building high-quality datasets. For example, autonomous vehicle failures have been traced to critical scenarios that were underrepresented in the training data (e.g., low-visibility or high-entropy environments). Conversely, significant numbers of poor-quality images, as well as near or exact duplicates, are known to “poison the well,” introducing model bias and unpredictable outputs we neither intend nor want.
If your dataset is relatively small, you can execute and monitor scans directly within the Data Quality interface. For larger datasets, you can delegate the operation to an external compute source; the task then runs as a scheduled job in the background, allowing you to continue your work uninterrupted.
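The delegation pattern can be sketched with Python's standard library alone. The `run_scan` function and its result fields below are illustrative assumptions, not the FiftyOne Enterprise scheduler API; the point is simply that the scan is submitted as a background job you poll later rather than block on.

```python
from concurrent.futures import ThreadPoolExecutor

def run_scan(sample_ids):
    # Placeholder for a data quality scan over a batch of samples;
    # a real scan would compute per-sample metrics here
    return {"scanned": len(sample_ids), "flagged": 0}

executor = ThreadPoolExecutor(max_workers=1)

# Submit the scan as a background job and keep working in the app
future = executor.submit(run_scan, ["sample_%d" % i for i in range(10_000)])
# ... continue interactive work uninterrupted ...
result = future.result()  # blocks only when you finally need the output
print(result["scanned"])
```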

Each scan outputs the distribution of samples across the data quality metric. You can then set a threshold to identify possible issues and outliers, and the dataset automatically filters to that range. You can also save the threshold; the output of the next scan will then report any net change in the number of samples falling within it.
Finally, the interface lets you add new tags to any samples within the threshold that might need attention or further review. Any team member with access to the dataset can then bring up those same samples with a single click.
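The threshold-and-tag loop described above might look like the following plain-Python sketch. The sample records, metric values, and tag name are hypothetical; in practice FiftyOne exposes this through its UI and dataset query language rather than raw dicts.

```python
# Hypothetical per-sample records from a brightness scan
samples = [
    {"id": "img_001", "brightness": 0.04, "tags": []},
    {"id": "img_002", "brightness": 0.52, "tags": []},
    {"id": "img_003", "brightness": 0.97, "tags": []},
]

LOW, HIGH = 0.1, 0.9  # saved thresholds defining the acceptable range

# Filter to samples outside the range and tag them for team review
flagged = [s for s in samples if not (LOW <= s["brightness"] <= HIGH)]
for s in flagged:
    s["tags"].append("needs_review")

print([s["id"] for s in flagged])
```

Because the tag lives on the sample itself, any teammate can later pull up exactly the flagged subset, which is what makes the workflow collaborative rather than a one-off script.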

Better Data, Better Models
Continuously improving your machine learning models means getting your data right. The Data Quality workflow removes guesswork and manual toil from dataset improvement and transforms a tedious process into an automated, visual, and collaborative experience.
Data scientists, engineers, and team leads gain unprecedented clarity into dataset health: they can see the quality distribution of their data and take corrective action on the fly. The result is higher confidence in the data going into model training and in the models coming out.
Prioritizing data quality de-risks AI projects, accelerates model development, and ultimately delivers solutions that perform more reliably in the real world. We invite you to experience the Data Quality workflow and other FiftyOne features. Our ML experts will demonstrate how FiftyOne will work with your data and ML pipelines to drive successful AI outcomes for your business. The path to reliable and scalable visual AI starts here.
Getting Started With the Data Quality Workflow
Analyzing data quality is available now with FiftyOne Enterprise. Check out our documentation for easy steps to get started.
Already an enterprise user? Upgrade to FiftyOne Enterprise 2.7.1 and give it a try!
Happy modeling! 🚀