Automatically parse your dataset for outliers with FiftyOne Plugins
Data is the heart of AI. As models continue to evolve at a rapid pace, it is more important than ever to have a high-quality dataset to train on. Lurking in our datasets are poor samples that are not representative of our problem and drag down the performance of our model. Data curation helps solve this problem by giving ML engineers the tools they need to find those samples and take their datasets to the next level.
Within FiftyOne, there are many different ways to find poor samples in your datasets. Just a few examples are:
- Finding Classification and Detection Mistakes
- Removing Duplicates
- Image Quality Issues
- Visualizing Embeddings
Today we will show how to do outlier detection in FiftyOne using embeddings and sklearn!
To start, install the outlier detection plugin to gain access to the outlier_detection operator. It can be installed with:
fiftyone plugins download https://github.com/danielgural/outlier_detection
Finding Outliers in the Entire Dataset
Once installed, we can kick open the FiftyOne app with the dataset of our choice. If you need help loading your dataset, check out the documentation on how to get started. We will be using the MSCOCO 2017 training split for our example. We can get started with:
import fiftyone as fo
import fiftyone.zoo as foz
import numpy as np

dataset = foz.load_zoo_dataset(
    "coco-2017",
    shuffle=True,
    split="train",  # change to "validation" for a smaller split
)
session = fo.launch_app(dataset)
Once you are in the app, hit the backtick key ( ` ) or the browse operations button to open the operators list. Search for the outlier detection operator and you will be met with the following menu.
From here, you can configure how you want to find your outliers. You can choose any of the FiftyOne embedding models, set what percentage of your dataset you think is contaminated with outliers, and use optional inputs such as restricting the search to a specific class or tagging the samples found as outliers! Let’s try an example using CLIP on the training set of MSCOCO.
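To make the contamination setting concrete, here is a minimal sketch of embedding-based outlier detection with scikit-learn. It is not the plugin's code: the choice of LocalOutlierFactor, the synthetic embeddings, and the helper name find_outliers are all assumptions made for illustration. In practice the embeddings would come from a model such as CLIP.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def find_outliers(embeddings, contamination=0.01, n_neighbors=20):
    """Return a boolean mask marking the rows flagged as outliers.

    Hypothetical helper for illustration; not part of FiftyOne or the plugin.
    """
    detector = LocalOutlierFactor(
        n_neighbors=n_neighbors,
        contamination=contamination,  # expected fraction of outliers
    )
    # fit_predict returns -1 for outliers and 1 for inliers
    predictions = detector.fit_predict(embeddings)
    return predictions == -1


# Synthetic stand-in for image embeddings: one large cluster
# plus a handful of points far away from it
rng = np.random.default_rng(0)
typical = rng.normal(loc=0.0, scale=1.0, size=(990, 16))
unusual = rng.normal(loc=12.0, scale=1.0, size=(10, 16))
embeddings = np.vstack([typical, unusual])

mask = find_outliers(embeddings, contamination=0.01)
print(f"Flagged {mask.sum()} of {len(mask)} samples as outliers")
```

A higher contamination value flags more samples, so it is worth starting low and raising it only if the flagged samples all look like genuine problems.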
What’s particularly interesting about using the Outlier Detection Plugin is how many unique and interesting samples it surfaces. Out of over 100,000 images, we can quickly view the 1% that are most relevant to data curation and decide what to keep and what to remove. A quick look through our results reveals these issues:
- Black and white photos
- Distorted or warped images
- Duplicated images in an attempt to maintain aspect ratio
- Backgrounds dominated by a single color (snow, ocean, sky, etc.)
Finding Outliers in a Single Class
Different problems require different data curation decisions and FiftyOne brings the most important samples right in front of you. We can perform outlier detection on a single class as well! Let’s check out a few examples of “airplane” outliers!
Once again, we can easily find all these unusual edge cases in our dataset with the outlier detection plugin. Now we can be sure that we have no birthday cake airplanes in our training set!
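Why search within a single class at all? A sample can look perfectly ordinary against the whole dataset and still be unusual for its label. The sketch below illustrates this with synthetic data: a few "airplane" samples whose embeddings sit in the middle of a "cake" cluster. As before, LocalOutlierFactor, the made-up embeddings, and the helper name class_outliers are assumptions for illustration, not the plugin's implementation.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# Two synthetic classes with well-separated embedding clusters
airplane = rng.normal(loc=0.0, scale=1.0, size=(300, 8))
cake = rng.normal(loc=10.0, scale=1.0, size=(300, 8))
# Three samples labeled "airplane" whose embeddings look like cakes
odd_airplanes = rng.normal(loc=10.0, scale=1.0, size=(3, 8))

embeddings = np.vstack([airplane, odd_airplanes, cake])
labels = np.array(["airplane"] * 303 + ["cake"] * 300)


def class_outliers(embeddings, labels, target, contamination=0.01, n_neighbors=20):
    """Return dataset indices of outliers among samples labeled `target`.

    Hypothetical helper for illustration; not part of FiftyOne or the plugin.
    """
    idx = np.flatnonzero(labels == target)
    detector = LocalOutlierFactor(
        n_neighbors=n_neighbors, contamination=contamination
    )
    # Fit only on the target class so "normal" means "normal for this class"
    flagged = detector.fit_predict(embeddings[idx]) == -1
    return idx[flagged]


outlier_idx = class_outliers(embeddings, labels, "airplane")
print("Airplane outliers at indices:", outlier_idx.tolist())
```

Because the detector is fit only on the airplane embeddings, the three cake-like samples stand out, even though they would blend into the cake cluster in a dataset-wide search.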
Conclusion
Outliers are found in almost every dataset. Finding them, especially across hundreds of thousands if not millions of samples, can be a daunting task, but with FiftyOne and the Outlier Detection plugin, the workflow can be made simple. If you are interested in finding more FiftyOne plugins, check out our community repo to optimize your workflows with plugins or contribute one of your own! Plugins are highly flexible and always open source so that you can customize them exactly to your needs! Have fun exploring!