The FiftyOne Brain provides powerful machine learning techniques that are designed to transform how you curate your data from an art into a measurable science.
The FiftyOne Brain is a separate Python package that is bundled with FiftyOne. Although it is closed-source, it is licensed as freeware, and you have permission to use it for commercial or non-commercial purposes. See the license for more details.
The FiftyOne Brain methods are useful across the stages of the machine learning workflow:
Visualizing embeddings: Tired of combing through individual images/videos and staring at aggregate performance metrics while trying to figure out how to improve the performance of your model? Visualizing your dataset in a low-dimensional embedding space can reveal patterns and clusters in your data that help you answer many important questions, from identifying the most critical failure modes of your model, to isolating examples of critical scenarios, to recommending new samples to add to your training dataset, and more! The FiftyOne Brain provides a powerful compute_visualization() method that you can use to generate out-of-the-box or highly customized visualizations of your samples and labels.
Uniqueness: During the training loop for a model, the best results will be seen when training on unique data. The FiftyOne Brain provides a uniqueness measure for images that compares the content of every image in a dataset with all other images. Uniqueness operates on raw images and does not require any prior annotation on the data. It is therefore very useful in the early stages of the machine learning workflow, when you are likely asking "What data should I select to annotate?"
Mistakenness: Annotation mistakes create an artificial ceiling on the performance of your models. However, finding these mistakes by hand is at least as arduous as the original annotation was, especially for larger datasets. The FiftyOne Brain provides a quantitative mistakenness measure to identify possible label mistakes. Mistakenness operates on labeled images and requires the logit output of your model predictions in order to provide maximum efficacy. It also works on detection datasets to find missed objects, incorrect annotations, and localization issues.
Hardness: While a model is training, it will learn to understand attributes of certain samples faster than others. The FiftyOne Brain provides a hardness measure that calculates how easy or difficult it is for your model to understand any given sample. Mining hard samples is a tried-and-true practice in mature machine learning processes. For example, use your current model instance to compute predictions on unlabeled samples to determine which are the most valuable to have annotated and fed back into the system as training samples.
Check out the tutorials page for detailed examples demonstrating the use of each Brain capability.
The FiftyOne Brain provides a powerful compute_visualization() method that you can use to generate low-dimensional representations of the samples and/or individual objects in your datasets.
These representations can be visualized via interactive plots, which can be connected to the FiftyOne App so that when points of interest are selected in the plot, the corresponding samples/labels are automatically selected in the App, and vice versa.
Interactive plots are currently only supported in Jupyter notebooks. In the meantime, you can still use FiftyOne's plotting features in other environments, but you must manually call plot.show() to update the state of a plot to match the state of a connected Session, and any callbacks that would normally be triggered in response to interacting with a plot will not be triggered. See this section for more information.
There are two primary components to an embedding visualization: the method used to generate the embeddings, and the dimensionality reduction method used to compute a low-dimensional representation of the embeddings.
The embeddings and model parameters of compute_visualization() support a variety of ways to generate embeddings for your data:
Provide nothing, in which case a default general purpose model is used to embed your data
Compute your own embeddings and provide them in array form
Dimensionality reduction methods
The method parameter of compute_visualization() allows you to specify the dimensionality reduction method to use. The supported methods include UMAP, t-SNE, and PCA.
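To build intuition for what a dimensionality reduction method does, here is a minimal, self-contained sketch of the simplest of these techniques, PCA, reducing 2D points to 1D via power iteration. This is an illustration only, not the Brain's implementation; the function name and toy data are hypothetical.

```python
import math
import random

def pca_project_1d(points):
    """Project 2-D points onto their first principal component.

    Uses power iteration on the 2x2 covariance matrix to find the
    direction of maximum variance, then projects each point onto it.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    xs = [x - mx for x, _ in points]
    ys = [y - my for _, y in points]

    # Entries of the 2x2 covariance matrix
    cxx = sum(x * x for x in xs) / n
    cxy = sum(x * y for x, y in zip(xs, ys)) / n
    cyy = sum(y * y for y in ys) / n

    # Power iteration converges to the dominant eigenvector
    vx, vy = 1.0, 0.0
    for _ in range(100):
        wx = cxx * vx + cxy * vy
        wy = cxy * vx + cyy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm

    # Project each centered point onto the principal direction
    return [x * vx + y * vy for x, y in zip(xs, ys)]

random.seed(51)

# Points near the line y = 2x; most of the variance lies along that line
points = [(i, 2 * i + random.gauss(0, 0.1)) for i in range(10)]
coords = pca_project_1d(points)
```

Real reducers like UMAP and t-SNE are nonlinear and considerably more involved, but they share the same contract: high-dimensional embeddings in, low-dimensional coordinates out.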
How can embedding-based visualization of your data be used in practice? These visualizations often uncover hidden structure in your data that has important semantic meaning, depending on the data you use to color/size the points.
Here are a few of the many possible applications:
Identifying anomalous and/or visually similar examples
Uncovering patterns in incorrect/spurious predictions
Finding examples of target scenarios in your data lake
Mining hard examples for your evaluation pipeline
Recommending samples from your data lake for classes that need additional training data
Unsupervised pre-annotation of training data
The best part about embedding visualizations is that you will likely discover more applications specific to your use case when you try it out on your data!
Check out the image embeddings tutorial to see example uses of the Brain’s embeddings-powered visualization methods to uncover hidden structure in datasets.
The following example gives a taste of the powers of visual embeddings in FiftyOne using the BDD100K dataset from the dataset zoo, embeddings generated by a mobilenet model from the model zoo, and the default UMAP dimensionality reduction method.
In this setup, the scatterpoints correspond to images in the validation split, colored by the time of day labels provided by the BDD100K dataset. The plot is attached to an App instance, so when points are lassoed in the plot, the corresponding samples are automatically selected in the App.
Each block in the example code below denotes a separate cell in a Jupyter notebook:
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# The BDD dataset must be manually downloaded. See the zoo docs for details
source_dir = "/path/to/dir-with-bdd100k-files"

# Load dataset
dataset = foz.load_zoo_dataset(
    "bdd100k",
    split="validation",
    source_dir=source_dir,
)

# Compute embeddings
# You will likely want to run this on a machine with GPU, as this requires
# running inference on 10,000 images
model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")
embeddings = dataset.compute_embeddings(model)

# Compute visualization
results = fob.compute_visualization(dataset, embeddings=embeddings, seed=51)

# Launch App instance
session = fo.launch_app(dataset)
# Generate scatterplot
plot = results.visualize(
    labels="ground_truth_timeofday.label",
    labels_title="time of day",
    axis_equal=True,
)
plot.show(height=512)

# Connect to session
session.plots.attach(plot)
The GIF shows the variety of insights that are revealed by running this simple protocol:
The first cluster of points selected reveals a set of samples whose field of view is corrupted by hardware gradients at the top and bottom of the image.
The second cluster of points reveals a set of images in rainy conditions with water droplets on the windshield.
Hiding the primary cluster of daytime points and selecting the remaining night points reveals that the night points have incorrect labels.
The FiftyOne Brain allows for the computation of the uniqueness of an image, in comparison with other images in a dataset; it does so without requiring any model from you. One good use of uniqueness is in the early stages of the machine learning workflow, when you are deciding which subset of data to use to bootstrap your models. Unique samples are vital in creating training batches that help your model learn as efficiently and effectively as possible.
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset(...)

fob.compute_uniqueness(dataset)
Input: An unlabeled (or labeled) image dataset. There are recipes for building datasets from a wide variety of image formats, ranging from a simple directory of images to complicated dataset structures like COCO.
Did you know? Instead of using FiftyOne's default model to generate embeddings, you can provide your own embeddings or specify a model from the Model Zoo to use to generate embeddings via the optional embeddings argument to compute_uniqueness().
Output: A scalar-valued uniqueness field is populated on each sample that ranks the uniqueness of that sample (higher value means more unique). The uniqueness values for a dataset are normalized to [0, 1], with the most unique sample in the collection having a uniqueness value of 1. You can customize the name of this field by passing the optional uniqueness_field argument to compute_uniqueness().
What to expect: Uniqueness uses a tuned algorithm that measures the distribution of each Sample in the Dataset. Using this distribution, it ranks each sample based on its relative similarity to other samples. Those that are close to other samples are not unique, whereas those that are far from most other samples are more unique.
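As a rough mental model of the behavior described above (not the Brain's actual algorithm), uniqueness behaves like a normalized distance from each sample's embedding to its nearest neighbors: isolated samples score high, samples in dense clusters score low. A toy sketch, with hypothetical names and data:

```python
import math

def toy_uniqueness(embeddings, k=2):
    """Toy uniqueness: mean distance to the k nearest neighbors,
    min-max normalized to [0, 1] so the most unique sample scores 1."""
    raw = []
    for i, e in enumerate(embeddings):
        # Distances to every other sample, smallest first
        dists = sorted(
            math.dist(e, f) for j, f in enumerate(embeddings) if j != i
        )
        raw.append(sum(dists[:k]) / k)

    lo, hi = min(raw), max(raw)
    return [(r - lo) / (hi - lo) for r in raw]

# Three clustered points and one far-away outlier
embs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
scores = toy_uniqueness(embs)
```

The outlier receives the maximum score of 1, mirroring how the Brain's normalized uniqueness field flags samples that are far from the rest of the dataset.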
Did you know? You can specify a region of interest within each image to use to compute uniqueness by providing the optional roi_field argument to compute_uniqueness(), which can contain either Detections or Polylines that define the ROI for each sample.
Check out the uniqueness tutorial to see an example use case of the Brain’s uniqueness method to detect near-duplicate images in a dataset.
Label mistakes can be calculated for both classification and detection datasets.
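As a simplified illustration of the idea (not the Brain's actual formula), a classification sample looks "mistaken" when the model confidently assigns probability mass away from the ground-truth label. The function names and logits below are hypothetical:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_mistakenness(logits, label_idx):
    """Toy mistakenness: probability mass the model assigns to classes
    other than the ground-truth label; high values flag possible
    annotation mistakes."""
    return 1.0 - softmax(logits)[label_idx]

# The model strongly predicts class 0, but the annotation says class 1
suspect = toy_mistakenness([8.0, 0.0], label_idx=1)

# Same prediction, but the annotation agrees with the model
clean = toy_mistakenness([8.0, 0.0], label_idx=0)
```

This is why the Brain's mistakenness measure works best with logit outputs: the full distribution, not just the argmax, conveys how strongly the model disagrees with the annotation.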
During training, it is useful to identify samples that are more difficult for a model to learn so that training can be more focused around these hard samples. These hard samples are also useful as seeds when considering what other new samples to add to a training dataset.
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset(...)

fob.compute_hardness(dataset, "predictions")
Output: A scalar-valued hardness field is populated on each sample that ranks the hardness of the sample. You can customize the name of this field via the hardness_field argument of compute_hardness().
What to expect: Hardness is computed in the context of a prediction model. The FiftyOne Brain hardness measure defines hard samples as those for which the prediction model is unsure about what label to assign. This measure incorporates prediction confidence and logits in a tuned model that has demonstrated empirical value in many model training exercises.
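One common way to operationalize "the model is unsure," and a simplification of what such a measure can compute from logits, is the entropy of the softmax distribution over the model's predicted classes. A self-contained sketch with hypothetical names and logits:

```python
import math

def toy_hardness(logits):
    """Toy hardness: entropy of the softmax distribution over logits.

    A near-uniform distribution (an unsure model) yields high entropy,
    while a peaked distribution (a confident model) yields low entropy.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = toy_hardness([10.0, 0.0, 0.0])  # model is sure -> low hardness
unsure = toy_hardness([1.0, 1.1, 0.9])      # model is torn -> high hardness
```

Samples whose predictions look like the second case are the "hard" ones worth prioritizing for annotation.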
Check out the classification evaluation tutorial to see example uses of the Brain’s hardness method to uncover annotation mistakes in a dataset.
Managing brain runs
When you run a brain method on a dataset, the run is recorded on the dataset, allowing you to retrieve information about it later, delete it (along with any modifications to your dataset that were performed by it), or even retrieve the view into your dataset that you processed.
Brain method runs can be accessed later by their brain_key.
The example below demonstrates the basic interface:
import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

view = dataset.take(100)

# Run a brain method that returns results
results = fob.compute_visualization(view, brain_key="visualization")

# Run a brain method that populates a new sample field on the dataset
fob.compute_uniqueness(view)

# List the brain methods that have been run
print(dataset.list_brain_runs())
# ['visualization', 'uniqueness']

# Print information about a brain run
print(dataset.get_brain_info("visualization"))

# Load the results of a previous brain run
also_results = dataset.load_brain_results("visualization")

# Load the view on which a brain run was performed
same_view = dataset.load_brain_view("visualization")

# Delete brain runs
# This will delete any stored results and fields that were populated
dataset.delete_brain_run("visualization")
dataset.delete_brain_run("uniqueness")