Two data-centric AI providers integrate, enabling AI engineers to get the most out of their data
Nothing limits computer vision model performance more than bad data. But datasets today are huge, reaching hundreds of millions or even billions of samples, so it’s impossible to review every sample by hand to catch errors. Improving a dataset largely comes down to two things: high-quality ground truth annotations, and the ability to curate data with class balance and representative coverage of your data distribution. The good news is that there are tools to help you combat bad data by improving and optimizing your datasets so you can deliver exceptional AI products into production.
Who are Voxel51 and V7?
If you’re not yet familiar, we are Voxel51, the company behind FiftyOne, the leading toolkit for building high-quality datasets and computer vision models. FiftyOne is where real AI work happens. AI teams around the world rely on FiftyOne to visualize, curate, manage, and QA data, and automate the workflows that make enterprise machine learning possible. Plus, FiftyOne was designed with extreme flexibility and extensibility in mind, which includes the ability to integrate naturally with other AI/ML tools you know and love.
V7 is a powerful AI data engine enabling better AI products to reach the market faster. Used by enterprise customers worldwide, including Continental, Wanzl, and Boston Scientific, V7’s unique workflows enable 10x faster labeling. Features such as auto-annotation, model visualization, advanced video labeling, bespoke workflow design, intelligent QA, and elite labeling task forces converge to offer a scalable solution that prioritizes impactful AI development.
At the heart of the partnership is the integration between Voxel51’s FiftyOne and V7 Darwin. The integration is currently in beta. Continue reading to learn how these two platforms provide customers with cutting-edge solutions primed to deliver top-tier AI products.
Dataset curation for smarter annotation
For most machine learning projects, the first step is to collect a suitable dataset for the task. That data then needs to be annotated before it can move through the rest of the ML pipeline. With large collections containing millions or billions of samples, annotation can quickly become cost-prohibitive. The question is: how can you create smaller, carefully curated subsets for annotation that have the most impact on your ML project, so you get the most out of your annotation budget while boosting model performance?
FiftyOne provides a variety of cutting-edge tools and workflows that enable you to:
- explore and balance your datasets by class and metadata distribution
- visualize, de-duplicate, sample, and pre-label your data using embeddings
- perform automated pre-labeling with off-the-shelf or custom models
- and more!
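To make the de-duplication idea concrete, here is a minimal, library-free sketch of greedy near-duplicate removal based on the cosine similarity of embeddings. It is an illustration of the general technique, not FiftyOne’s implementation: the `deduplicate` helper, the `0.98` threshold, the sample ids, and the toy three-dimensional vectors are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(embeddings, threshold=0.98):
    """Greedily keep a sample only if it is not too similar to any kept sample.

    embeddings: dict mapping sample id -> embedding vector (hypothetical layout)
    Returns the ids to keep, in input order.
    """
    kept = []
    for sample_id, vec in embeddings.items():
        if all(cosine_similarity(vec, embeddings[k]) < threshold for k in kept):
            kept.append(sample_id)
    return kept


# Toy embeddings; real ones would come from an off-the-shelf or custom model
embeddings = {
    "img_001": [1.0, 0.0, 0.0],
    "img_002": [0.999, 0.01, 0.0],  # near-duplicate of img_001
    "img_003": [0.0, 1.0, 0.0],
    "img_004": [0.0, 0.0, 1.0],
}

unique_ids = deduplicate(embeddings)
print(unique_ids)
```

This pairwise comparison is quadratic in the number of samples, so it only suits small collections. At the scale discussed above you would rely on an approximate nearest-neighbor index instead.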
The new integration between FiftyOne and V7 Darwin allows users to send subsets of their datasets from FiftyOne to V7 Darwin for annotation. The annotated data from V7 can then be imported back into FiftyOne for review and refinement, before ultimately being used to train your model.
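One simple way to decide which subset to send is to spend a fixed annotation budget evenly across classes, using pre-labels as a rough guide. The sketch below shows that idea in plain Python. It assumes each sample already carries a (possibly noisy) pre-label; the `balanced_subset` helper, the class names, and the budget of 12 are hypothetical and are not part of FiftyOne or V7 Darwin.

```python
import random
from collections import Counter, defaultdict


def balanced_subset(labels, budget, seed=51):
    """Pick up to `budget` sample ids, round-robin across classes.

    labels: dict mapping sample id -> pre-label (hypothetical layout)
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample_id, label in labels.items():
        by_class[label].append(sample_id)
    for ids in by_class.values():
        rng.shuffle(ids)

    selected = []
    # Take one sample per class per pass until the budget is spent
    # or every class has run out of samples
    while len(selected) < budget and any(by_class.values()):
        for label in sorted(by_class):
            if by_class[label] and len(selected) < budget:
                selected.append(by_class[label].pop())
    return selected


# A heavily imbalanced toy collection: 90 cars, 8 bicycles, 2 buses
labels = {f"car_{i}": "car" for i in range(90)}
labels.update({f"bicycle_{i}": "bicycle" for i in range(8)})
labels.update({f"bus_{i}": "bus" for i in range(2)})

subset = balanced_subset(labels, budget=12)
counts = Counter(labels[s] for s in subset)
print(counts)
```

Random sampling with the same budget would return mostly cars. The round-robin pass keeps every available bus and gives the remaining budget to bicycles and cars in equal shares.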
Annotation review & QA
In many ML projects, a dataset already exists and is being used to train models. In such cases, the best use of time is likely to improve the quality of the dataset, which often provides greater performance gains than similar effort put into optimizing the model architecture.
FiftyOne enables powerful image- and object-level annotation review and QA workflows. Use FiftyOne’s embeddings visualization, compatible with both off-the-shelf and custom models, to highlight likely annotation mistakes and outliers. Use FiftyOne’s sample- and label-level tags, as well as saved views, to easily mark samples for reannotation back in V7.
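As a rough illustration of how embeddings can surface likely annotation mistakes, the sketch below flags any sample whose label disagrees with the majority label of its nearest neighbors in embedding space. This is one common heuristic, not a description of how FiftyOne works internally. The `flag_suspect_labels` helper, the choice of `k=3`, and the toy two-dimensional embeddings are assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flag_suspect_labels(samples, k=3):
    """Return ids whose label differs from the majority label of their k nearest neighbors.

    samples: list of (sample_id, embedding, label) tuples (hypothetical layout)
    """
    suspects = []
    for sample_id, emb, label in samples:
        neighbors = sorted(
            (s for s in samples if s[0] != sample_id),
            key=lambda s: euclidean(emb, s[1]),
        )[:k]
        majority, _ = Counter(n[2] for n in neighbors).most_common(1)[0]
        if majority != label:
            suspects.append(sample_id)
    return suspects


samples = [
    ("a", [0.0, 0.0], "cat"),
    ("b", [0.1, 0.0], "cat"),
    ("c", [0.0, 0.1], "cat"),
    ("d", [0.1, 0.1], "dog"),  # sits inside the "cat" cluster
    ("e", [5.0, 5.0], "dog"),
    ("f", [5.1, 5.0], "dog"),
    ("g", [5.0, 5.1], "dog"),
    ("h", [5.1, 5.1], "dog"),
]

to_review = flag_suspect_labels(samples)
print(to_review)
```

The flagged ids are only candidates. A reviewer still needs to confirm each one before tagging it for reannotation.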
Once a model is trained, you can easily run inference, load the model predictions back into FiftyOne, and evaluate them (including regressions, classifications, detections, polygons, instance and semantic segmentations, on both image and video datasets) against the ground truth annotations. This makes it possible to highlight areas of improvement for your annotations, as well as identify classes of difficult samples for training set augmentation. For targeted dataset augmentation, FiftyOne’s built-in similarity search functionality can be leveraged with a variety of vector database backends.
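For readers who want to see what evaluating detections against ground truth involves, here is a simplified sketch that matches predictions to ground truth boxes of the same class using intersection over union (IoU), then counts true positives, false positives, and false negatives. It is a teaching example under stated assumptions, not FiftyOne’s evaluation code: the `evaluate` helper, the dictionary layout, the `[x, y, width, height]` box format, and the `0.5` IoU threshold are all choices made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes in [x, y, width, height] format."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def evaluate(ground_truth, predictions, iou_threshold=0.5):
    """Greedily match predictions (highest confidence first) to same-class ground truth."""
    unmatched_gt = list(range(len(ground_truth)))
    tp = fp = 0
    for pred in sorted(predictions, key=lambda p: p["confidence"], reverse=True):
        best_idx, best_iou = None, iou_threshold
        for idx in unmatched_gt:
            gt = ground_truth[idx]
            if gt["label"] != pred["label"]:
                continue
            overlap = iou(gt["box"], pred["box"])
            if overlap >= best_iou:
                best_idx, best_iou = idx, overlap
        if best_idx is None:
            fp += 1  # no ground truth box explains this prediction
        else:
            tp += 1
            unmatched_gt.remove(best_idx)
    fn = len(unmatched_gt)  # ground truth boxes the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


ground_truth = [
    {"label": "car", "box": [0.1, 0.1, 0.4, 0.4]},
    {"label": "person", "box": [0.6, 0.6, 0.2, 0.3]},
]
predictions = [
    {"label": "car", "box": [0.1, 0.1, 0.4, 0.4], "confidence": 0.9},
    {"label": "car", "box": [0.7, 0.1, 0.2, 0.2], "confidence": 0.6},
]

result = evaluate(ground_truth, predictions)
print(result)
```

In this toy case the missed person is a false negative and the spurious car is a false positive. Samples like these are the ones worth reviewing for annotation errors or collecting more examples of.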
Seamless data transfers
The integration makes it smooth and easy to send data back and forth between FiftyOne and V7 Darwin via an API. The integration also allows for seamless conversion of all data formats, retaining all existing annotations (including labels made in other tools).
Additional capabilities for FiftyOne Teams customers
FiftyOne Teams combines the features you know and love in open source FiftyOne with additional capabilities for secure, real-time multi-user collaboration—all backed by world-class customer support. With the Voxel51 and V7 partnership, there are three additional capabilities to highlight for FiftyOne Teams customers.
Dataset versioning
With dataset versioning in FiftyOne Teams, every annotation and model run can now be captured and versioned in a history of dataset snapshots. No more complex naming conventions or manual tracking of versions: dataset snapshots in FiftyOne Teams can be created, browsed, linked to, and re-materialized with ease in the App or SDK.
Loading cloud-backed media
If you’re a FiftyOne Teams customer and work with cloud-backed media, you will be able to connect your cloud-backed media to FiftyOne Teams and V7 Darwin in order to directly load items from your cloud storage.
Collaborating with humans in the loop
Because training datasets are generally too large for a single person to process, teams of data annotators and QA professionals often come together to do this work and ensure dataset quality. However, without the right tooling, it’s hard to visualize a large amount of data for different tasks of various shapes and sizes, and difficult to leave meaningful feedback that can easily be acted on.
FiftyOne Teams makes it possible to safely and securely collaborate both inside and outside your organization. Create and share datasets and dataset views within and across QA teams, instantly load up datasets in your browser, easily scroll through samples, and leave sample- and label-level tags on any aspects of the dataset that are not of sufficient quality (such as an incorrect annotation or a blurry image).
How to get started with FiftyOne and V7 Darwin
We collaborated with the amazing team over at V7 to create this getting started tutorial for the FiftyOne and V7 Darwin integration.
In the tutorial you’ll learn how to:
- Set up FiftyOne
- Configure V7 Darwin
- Load example data in FiftyOne
- Annotate the data in V7
- Continuously improve your training dataset to boost model performance
Conclusion and next steps
We’re excited to partner with V7 to bring value to joint customers, and we’d love to hear what you think!
Here are a few next steps to take if you’re interested in the integration:
- Check out open source FiftyOne on GitHub. If you like what you see, consider giving the project a star.
- Join the FiftyOne Slack community. It’s open to everyone, and we’re always happy to help address any questions, comments, or feedback you might have.
- Schedule a personalized demo. We’d love to show you FiftyOne in action, along with any integrations you’d like to see to help automate your AI workflows.