Automatically parse your dataset for outliers with FiftyOne Plugins

Data is the heart of AI. As models continue to evolve at a rapid pace, it is more important than ever to have a high-quality dataset to train on. Lurking in our datasets are poor samples that are not representative of our problem and drag down the performance of our model. Data curation addresses this problem by giving an ML engineer the tools to find the outliers in a dataset and curate better data. Within FiftyOne, there are many different ways to find poor samples in your datasets. Just a few examples are:

Finding Classification and Detection Mistakes
Removing Duplicates
Image Quality Issues
Visualizing Embeddings

Today we will be showing how to do outlier detection in FiftyOne using embeddings and sklearn! To start, install the outlier detection plugin to gain access to the outlier_detection operator. Plugins are installed with the fiftyone plugins download command, given the URL of the plugin's GitHub repository.

Finding Outliers in the Entire Dataset

Once installed, we can open the FiftyOne app with the dataset of our choice. If you need help loading your dataset, check out the documentation on how to get started. We will be using the MSCOCO 2017 training split for our example.
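For reference, here is a minimal Python sketch of this step. It assumes the dataset comes from the FiftyOne Dataset Zoo under the name coco-2017; the helper name open_coco_train and the optional max_samples cap are illustrative choices, not part of the original walkthrough.

```python
def open_coco_train(max_samples=None):
    """Load the COCO 2017 training split from the FiftyOne zoo and open the app.

    Hypothetical helper for illustration. FiftyOne is imported inside the
    function so this file can be imported even where FiftyOne is not installed.
    """
    import fiftyone as fo
    import fiftyone.zoo as foz

    # Only pass max_samples when a cap was requested (None loads the full split).
    kwargs = {} if max_samples is None else {"max_samples": max_samples}

    # Assumption: the zoo copy of COCO 2017 is used; the first call downloads it.
    dataset = foz.load_zoo_dataset("coco-2017", split="train", **kwargs)

    # Launch the FiftyOne app on the loaded dataset and return the session.
    return fo.launch_app(dataset)
```

Calling open_coco_train(max_samples=1000) would load a 1,000-sample subset, which can be handy for a quick trial before working with the full split.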

Once you are in the app, hit the backtick key ( ` ) or the browse operations button to open the operators list. Search for the outlier detection operator and you will be met with its configuration menu.

From here, you can configure how you want to find your outliers. You can choose any of the FiftyOne embedding models, set the percentage of your dataset you think is contaminated with outliers, and use optional inputs such as restricting the search to a specific class or tagging the samples found as outliers! Let’s try an example using CLIP on the training set of MSCOCO, a dataset known to contain outliers.
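The contamination setting can be read as "flag this fraction of the samples, taking those with the highest outlier scores". Below is a minimal, standard-library-only sketch of that idea. It uses a simple k-nearest-neighbor distance score on toy 2-D vectors as a stand-in, not the plugin's sklearn-based detector or real CLIP embeddings, and the function names are hypothetical.

```python
import math


def knn_outlier_scores(embeddings, k=2):
    """Score each vector by its mean Euclidean distance to its k nearest neighbors."""
    scores = []
    for i, a in enumerate(embeddings):
        dists = sorted(math.dist(a, b) for j, b in enumerate(embeddings) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def flag_outliers(embeddings, contamination=0.01, k=2):
    """Return the indices of the top `contamination` fraction of samples by score."""
    scores = knn_outlier_scores(embeddings, k)
    # Always flag at least one sample so tiny datasets still return something.
    n_flagged = max(1, round(contamination * len(embeddings)))
    ranked = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_flagged])


# Toy "embeddings": a tight cluster plus one far-away point at index 4.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
flagged = flag_outliers(toy, contamination=0.2)
print(flagged)
```

With a contamination of 0.2 on five samples, exactly one sample is flagged, and it is the isolated point. Raising the contamination value widens the net; lowering it keeps only the most extreme samples.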

What’s particularly interesting about using the Outlier Detection Plugin is that you are met with so many unique and interesting samples. From over 100,000 images, we can very quickly view the 1% that are most relevant to data curation and decide what to keep and what to remove. Some quick observations from our detection lead us to these issues, which are common in datasets with outliers:

Black and white photos
Distorted or warped images
Images duplicated in an attempt to maintain aspect ratio
Backgrounds dominated by a single color (snow, ocean, sky, etc.)

Finding Outliers in a Single Class

Different problems require different data curation decisions, and FiftyOne brings the most important samples right in front of you. We can perform outlier detection on a single class as well! Let’s check out a few examples of “airplane” outliers!

Once again, we can easily find all these unusual edge cases in our dataset with the outlier detection plugin. Now we can be sure that we have no birthday cake airplanes in our training set!
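To illustrate why restricting the search to one class can surface different samples, here is a small standard-library sketch. It ranks the members of a single class by distance from that class's centroid. This is an assumption made for illustration only: the plugin's actual detector may work differently, and the sample IDs, labels, and 2-D vectors are made up.

```python
import math


def class_outliers(samples, class_name, top_n=1):
    """Return IDs of the `top_n` samples of `class_name` farthest from the class centroid.

    `samples` is a list of (sample_id, label, embedding) tuples.
    """
    members = [(sid, emb) for sid, label, emb in samples if label == class_name]
    if not members:
        return []
    dim = len(members[0][1])
    centroid = [sum(emb[d] for _, emb in members) / len(members) for d in range(dim)]
    ranked = sorted(members, key=lambda m: math.dist(m[1], centroid), reverse=True)
    return [sid for sid, _ in ranked[:top_n]]


# Toy data: img4 is labeled "airplane" but sits next to the "cake" samples.
toy_samples = [
    ("img1", "airplane", (1.0, 1.0)),
    ("img2", "airplane", (1.1, 0.9)),
    ("img3", "airplane", (0.9, 1.1)),
    ("img4", "airplane", (4.0, 4.0)),
    ("img5", "cake", (4.1, 3.9)),
    ("img6", "cake", (3.9, 4.1)),
]
unusual_airplanes = class_outliers(toy_samples, "airplane")
print(unusual_airplanes)
```

In this toy setup, img4 has close neighbors in the full dataset (the cakes), so a whole-dataset search might not rank it highly, but within the airplane class it clearly stands apart.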

Tackling Datasets with Outliers is Critical

Outliers are found in almost every dataset. Finding them, especially across hundreds of thousands if not millions of samples, can be a daunting task, but with FiftyOne the workflow can be made simple with the Outlier Detection plugin. If you are interested in finding more FiftyOne plugins, check out our community repo to optimize your workflows with plugins or contribute one of your own! Plugins are highly flexible and always open source, so you can customize them exactly to your needs! Have fun exploring!
