Meetup Recap: How to Build High-Quality Machine Learning Datasets and Computer Vision Models

Brian Moore, Co-Founder and CTO of Voxel51, recently presented at the Virtual MLOps and Kubeflow Meetup to share how Voxel51 helps computer vision and machine learning engineers and scientists train better models with measurably better data.

In his talk, Brian covers what data-centric ML is and why it’s important, and shows a live demo of FiftyOne, the open-source tool for building high-quality datasets and computer vision models.

In this blog post, we provide the playback recording, slides, and a recap of highlights from the presentation.

If you have additional questions about data-centric machine learning, FiftyOne, FiftyOne Teams, or other computer vision topics, join our Slack community to ask questions, get answers, or follow along with the discussion.

Video Replay

To dive into the meetup presentation, check out this recording or continue reading the highlights below:

Presentation Highlights

Introducing Voxel51

Brian opens the meetup presentation with a brief look back at the origins of Voxel51, which was conceived when he met Jason Corso at the University of Michigan. The idea was to fill what was, at the time, a gap in machine learning tooling for taking computer vision models into production. Thus Voxel51 and the open source FiftyOne project were born.

Introducing data-centric machine learning

At the core of production-ready models is data-centric machine learning. What is data-centric ML? Brian explains that the biggest challenge in getting a computer vision model into production today is not the model architecture, because there are plenty of architectures you can use and great tools to help you train them. Rather, the biggest challenge is how to improve the quality of your data.

Visual datasets today are huge, now reaching hundreds of millions of samples, and you don’t have time to sift through them all to catch every error. Maybe you can get to 80% accuracy in your dataset pretty easily, or even 90% with some additional work. But that’s not nearly enough: the remaining errors can cause serious problems downstream, such as biased predictions or real-world edge cases that just won’t work in a product you’re releasing to the world.

Training a model on poor-quality data can also significantly decrease that model’s performance. So how can you improve the quality of your visual datasets with the goal of getting to higher-performing models?

Where FiftyOne fits in

That’s where open source FiftyOne comes in. It integrates with the way you get data annotated and the way you train your models, so you can achieve higher-performing models through better data.

FiftyOne in action

Brian then shows a live demo (starting at ~12:00 in the playback video) of FiftyOne that walks you through:

How to install the latest stable version of FiftyOne via pip
How to load in a dataset, including:
– Common datasets like COCO and ActivityNet using the FiftyOne Dataset Zoo
– Datasets using standard data formats like the COCO format
– Custom datasets in your own format
– Image datasets, as well as video datasets
How to visualize the dataset in the GUI or code, including:
– How to filter and view specific data of interest
– How to flexibly interact with your data through the GUI and/or through code
How to import the FiftyOne Brain to explore data by visual similarity and uniqueness, compute your own embeddings, and more
How to work with FiftyOne in an interactive Python shell
How to work with FiftyOne in Jupyter notebooks
How to get hands-on with your data in Jupyter notebooks, including how to run an experiment using a dataset of handwritten digits to find annotation mistakes and automatically or semi-automatically annotate datasets
Another example of how to get hands-on with your data in Jupyter notebooks, including how to use model embeddings from the Model Zoo together with visualization capabilities in an experiment using the BDD100K dataset to find outliers and annotation mistakes
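
To make the embeddings-based ideas in the demo list above a bit more concrete, here is a minimal, library-free Python sketch. It is not how FiftyOne or the FiftyOne Brain implements uniqueness or mistake detection, and it does not use the FiftyOne API. The function names (`nearest_neighbors`, `outlier_scores`, `suspicious_labels`), the toy 2-D "embeddings", and the choice of k = 3 neighbors are all assumptions made for illustration. The sketch scores each sample by its mean distance to its nearest neighbors (a simple outlier signal) and flags samples whose label disagrees with most of their neighbors (a simple hint of a possible annotation mistake).

```python
import math
from collections import Counter


def nearest_neighbors(embeddings, index, k):
    """Return the indices of the k samples closest to embeddings[index]."""
    distances = [
        (math.dist(embeddings[index], other), j)
        for j, other in enumerate(embeddings)
        if j != index
    ]
    distances.sort()
    return [j for _, j in distances[:k]]


def outlier_scores(embeddings, k=3):
    """Mean distance to the k nearest neighbors; larger means more isolated.

    Assumes there are at least two samples.
    """
    scores = []
    for i in range(len(embeddings)):
        neighbors = nearest_neighbors(embeddings, i, k)
        total = sum(math.dist(embeddings[i], embeddings[j]) for j in neighbors)
        scores.append(total / len(neighbors))
    return scores


def suspicious_labels(embeddings, labels, k=3):
    """Indices whose label differs from the majority label of their neighbors."""
    flagged = []
    for i in range(len(embeddings)):
        neighbors = nearest_neighbors(embeddings, i, k)
        majority, _ = Counter(labels[j] for j in neighbors).most_common(1)[0]
        if labels[i] != majority:
            flagged.append(i)
    return flagged


# Hypothetical toy data: two tight clusters plus one far-away point.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # cluster A
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),  # cluster B
    (20.0, 20.0),                                    # isolated sample
]
# Sample 2 sits in cluster A but carries the label used in cluster B.
labels = ["cat", "cat", "dog", "cat", "dog", "dog", "dog", "dog", "dog"]

scores = outlier_scores(embeddings)
most_isolated = max(range(len(scores)), key=scores.__getitem__)
flagged = suspicious_labels(embeddings, labels)

print(most_isolated, flagged)  # prints: 8 [2]
```

In a real workflow the embeddings would come from a model rather than being typed in by hand, and flagged samples are candidates for human review rather than confirmed errors.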

FiftyOne: resources and next steps

Brian shares some resources and next steps to help you get started with and contribute to the open source FiftyOne project:

Light reading:

Next steps:

Like the project? Give us a star on GitHub
Want to get involved? Join our Slack community

A shoutout for FiftyOne Teams

For those of you wondering how we make money, Brian explains that we sell FiftyOne Teams, a version of open source FiftyOne designed for organizations that want to use FiftyOne as the single source of truth for their data. It’s a SaaS deployment of FiftyOne with a centralized database. You can have multiple workflows using Python in parallel to load in data both locally and in the cloud. You can also visualize your datasets through the web portal without using Python at all, which makes it more suitable for non-technical workflows. If you’re interested in learning more about Teams, simply fill out this form and we’ll be in touch.