FiftyOne helps you curate high-quality datasets by combining code-based analysis with visual exploration. With it we can filter samples, remove duplicates, fix annotations, and add metadata programmatically. Then we can launch an interactive app to explore our results visually and slice or aggregate the data as we’d like.
What You Will Need
- A virtual environment with Python 3.9-3.11 installed (or willingness to use a Google Colab notebook)
- Kaggle account with API access
- HuggingFace account
- Basic familiarity with Pandas and PIL
Time Required
90 minutes start-to-finish
Installing FiftyOne
Quick Start: Load ASL‑MNIST into FiftyOne
If you just want to
get the dataset and explore it in FiftyOne (and already have a HuggingFace account set up), you can simply execute:
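Here is a minimal sketch of that quick start. It assumes FiftyOne and the huggingface_hub package are installed (see Step 1), and the repo ID below is a placeholder: replace it with the ID shown on the dataset card linked later in this post.

```python
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

# Placeholder repo ID: replace with the ID from the dataset card for this tutorial
dataset = fouh.load_from_hub("<hub-username>/asl-mnist")

print(dataset)
```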
This is the artifact that we will produce by the end of this tutorial (Steps 7 and 8): a FiftyOne dataset published on the Hugging Face Hub with all the images from the training and test sets.
To visualize it, we can launch the FiftyOne App:
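This is a one-liner, assuming the `dataset` and import from the quick-start snippet above:

```python
# Open the FiftyOne App on the dataset we just loaded
session = fo.launch_app(dataset)
```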
Try filtering the samples by label; here we get a view of the images that share the label “V”.
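You can do this from the App’s sidebar, or programmatically. A sketch of the code equivalent, assuming the label field is called ground_truth (as it will be when we build the dataset in Step 4):

```python
from fiftyone import ViewField as F

# View containing only the samples labeled "V"
v_view = dataset.match(F("ground_truth.label") == "V")
session.view = v_view
```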
However, if what you want is to learn how to get your data from Kaggle to FiftyOne and HuggingFace Hub, please keep reading. There are only eight steps ahead :)
Even though this dataset is already available on HuggingFace, this tutorial is valuable if you want to upload your own datasets or understand the underlying steps.
Step 1: Set Up Your Python Environment
Begin by installing the necessary libraries:
We pin versions to ensure the code in this tutorial runs exactly as shown. For your own projects, you may be able to use more recent versions. These versions are known to be compatible with each other as of July 2025.
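The exact version pins are not reproduced here; as a sketch, the packages used throughout this tutorial can be installed like this (or run the equivalent pip command in your terminal), adding whichever version pins you have verified:

```python
import subprocess
import sys

# Libraries used in this tutorial; add ==<version> pins as needed
packages = ["fiftyone", "kaggle", "pandas", "pillow", "huggingface_hub"]
subprocess.check_call([sys.executable, "-m", "pip", "install", *packages])
```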
Step 2: Configure the Kaggle API Credentials
To access datasets from Kaggle:
- Log into Kaggle and navigate to your account settings.
- Generate a new API token (kaggle.json).
- Move the downloaded file to ~/.kaggle/ and set permissions:
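For example, on Linux/macOS (a sketch that assumes the token was downloaded to ~/Downloads; adjust the source path to wherever your browser saved kaggle.json):

```python
import shutil
from pathlib import Path

# Create ~/.kaggle and move the downloaded token into it
kaggle_dir = Path.home() / ".kaggle"
kaggle_dir.mkdir(exist_ok=True)

token = kaggle_dir / "kaggle.json"
shutil.move(Path.home() / "Downloads" / "kaggle.json", token)

# The Kaggle client rejects tokens that other users can read
token.chmod(0o600)
```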
Step 3: Download & Process the ASL‑MNIST Dataset
Download and extract the dataset:
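A sketch using the Kaggle Python client. The dataset slug below (datamunge/sign-language-mnist) is the one this dataset is commonly published under on Kaggle; verify it against the page you are downloading from:

```python
import kaggle  # authenticates automatically using ~/.kaggle/kaggle.json

# Download and unzip the CSV files into ./asl_mnist/
kaggle.api.dataset_download_files(
    "datamunge/sign-language-mnist",  # verify this slug on the Kaggle page
    path="asl_mnist",
    unzip=True,
)
```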
Quirks of ASL-MNIST
The images are not stored as JPEG or PNG files but as rows inside two CSV files. I found this format unintuitive, but it is common among variants of the MNIST dataset. Each row of the training CSV contains 784 pixel intensity values (uint8) plus a column with the ground truth label: the letter that the pictured gesture represents (the gestures in this dataset do not represent numbers). The images are also really tiny: only 28x28 pixels, which is where the “MNIST” part of the name comes from, since they share the dimensions of the original MNIST dataset of grayscale handwritten digits. FiftyOne supports importing datasets in more standard formats, such as COCO, or from directory structures where the folder name maps to the label. In this case, we need to do some manual processing to work around the quirks of ASL-MNIST.
In our processed dataset, the test images will carry the label “unknown”.
Knowing all this, we process the CSV files into images with pandas and PIL:
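Here is a sketch of that conversion. It assumes the extracted files are named sign_mnist_train.csv and sign_mnist_test.csv (adjust the paths if your archive nests them in subfolders), that each row holds a label column followed by the 784 pixel columns, and, per the note above, that the test images are written with the label “unknown”:

```python
from pathlib import Path

import numpy as np
import pandas as pd
from PIL import Image

def csv_to_images(csv_path, out_dir, use_labels=True):
    """Write each 784-pixel row of the CSV as a 28x28 grayscale JPG."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    df = pd.read_csv(csv_path)
    pixel_cols = [c for c in df.columns if c != "label"]

    for idx, row in df.iterrows():
        # Per the note above, we keep "unknown" for the test split
        label = int(row["label"]) if use_labels else "unknown"
        pixels = row[pixel_cols].to_numpy(dtype=np.uint8).reshape(28, 28)
        Image.fromarray(pixels, mode="L").save(out_dir / f"{label}_{idx}.jpg")

# Adjust these paths to wherever Step 3 extracted the CSVs
csv_to_images("asl_mnist/sign_mnist_train.csv", "asl_images/train", use_labels=True)
csv_to_images("asl_mnist/sign_mnist_test.csv", "asl_images/test", use_labels=False)
```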
Step 4: Build a FiftyOne Dataset from ASL‑MNIST Images
We import the processed jpg images into a
FiftyOne Dataset and map numerical labels to their corresponding letters.
Note: The letters 'J' (9) and 'Z' (25) are not included in this dataset as they involve motion, and the images that we have are static (single frames).
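A sketch of this step, assuming the images were written by the Step 3 snippet above (numeric label as the filename prefix, “unknown” for the test split):

```python
import string
from pathlib import Path

import fiftyone as fo

# Map 0-25 to A-Z; labels 9 ('J') and 25 ('Z') simply never occur in this dataset
label_map = {i: letter for i, letter in enumerate(string.ascii_uppercase)}

asl_dataset = fo.Dataset("asl-mnist")
asl_dataset.persistent = True  # keep the dataset in the local FiftyOne database

samples = []
for split in ("train", "test"):
    for path in Path(f"asl_images/{split}").glob("*.jpg"):
        prefix = path.stem.split("_")[0]  # numeric label, or "unknown" for test
        label = label_map[int(prefix)] if prefix.isdigit() else "unknown"

        sample = fo.Sample(filepath=str(path), tags=[split])
        sample["ground_truth"] = fo.Classification(label=label)
        samples.append(sample)

asl_dataset.add_samples(samples)
print(asl_dataset)
```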
Each image is a Sample within the FiftyOne dataset. We can attach metadata, labels, and tags to each of them. Samples are initialized with a filepath to the corresponding data on disk, which is why we had to save our images to the local hard drive before creating the FiftyOne Dataset (a collection of samples).
Note that asl_dataset.persistent = True only keeps the dataset in your local FiftyOne database across Python sessions; it does not create a portable, self-contained copy of the images and labels. This is why we also export the dataset to disk in Step 6.
Step 5: Explore & Visualize the Dataset in FiftyOne
We can now visualize, query, and analyze our dataset interactively using FiftyOne. The FiftyOne app is a powerful graphical user interface that allows us to browse, tag, aggregate, and interact directly with the dataset.
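For example, with the asl_dataset from Step 4:

```python
import fiftyone as fo

# Launch the App; it opens in your browser, or inline if you are in a notebook
session = fo.launch_app(asl_dataset)
```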
Producing Histograms
After getting the dataset into FiftyOne, try producing an aggregation of its ground_truth.label field by going to the Histograms panel and selecting the field.
You can then click on the “Split Horizontally” button to see the Samples next to the Histogram panel.
Here is a short video demoing this.
I encourage you to try creating histograms on other fields, such as metadata.size_bytes.
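If the metadata fields appear empty in the Histograms panel, populate them first:

```python
# Compute size_bytes, width, height, etc. for every sample
asl_dataset.compute_metadata()
```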
Step 6: Save the FiftyOne Dataset Locally
To save our FiftyOne dataset for future use or for sharing, we can export it locally. The dataset.export() method creates a portable, self-contained archive and allows us to save the dataset in various formats. In this case, we will use the FiftyOneDataset format, which preserves the full FiftyOne dataset structure. The persistence we enabled in Step 4 (asl_dataset.persistent = True) only keeps the dataset in our local FiftyOne database; exporting with export_media=True copies the images into a self-contained folder that we can move to another machine or share with others.
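A sketch of the export call, using the directory name described below:

```python
asl_dataset.export(
    export_dir="asl-mnist-fiftyone-dataset",
    dataset_type=fo.types.FiftyOneDataset,
    export_media=True,  # copy the JPG files into the export folder
)
```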
This will create a new directory named asl-mnist-fiftyone-dataset containing the JSON definition of your dataset along with copies of all the original JPG images. The result is a self-contained dataset that you can share or move without worrying about broken file paths.
This exported dataset can be easily reloaded into FiftyOne later using fo.Dataset.from_dir().
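For example:

```python
# Reload the exported dataset into FiftyOne under a new name
reloaded = fo.Dataset.from_dir(
    dataset_dir="asl-mnist-fiftyone-dataset",
    dataset_type=fo.types.FiftyOneDataset,
    name="asl-mnist-reloaded",
)
```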
For more information on loading and using datasets, see the
documentation.
Following these steps, we have moved from a raw Kaggle dataset stored in CSV files to a portable, visual, and queryable dataset ready for interpretable computer vision workflows.
Step 7: Publish the Dataset to Hugging Face Hub
We can share the dataset with others through the HuggingFace Hub, which is a great resource for both open models and open data. When we push our FiftyOne dataset to the HF Hub, a fiftyone.yml file is generated and the dataset remains in FiftyOne format, with all its fields: annotations, tags, and metadata.
If you are new to HuggingFace, you can
follow this guide to set up your authentication token. You will need to have it set up with write permissions to publish your dataset.
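With the token in place, pushing is a single call; a sketch, assuming the asl_dataset from Step 4 and a repo name of your choosing:

```python
import fiftyone.utils.huggingface as fouh

# Creates <your-username>/asl-mnist on the Hub and uploads the dataset
fouh.push_to_hub(asl_dataset, "asl-mnist")
```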
Be sure to check the data's licensing rights. The American Sign Language dataset has a CC0 License, meaning that it is in the public domain. The original content is CC0, and we apply an MIT license only to the packaging and any additional code or annotations.
We can modify a public domain dataset and redistribute it under a compatible license (such as the
MIT license). Public domain works have no copyright restrictions, so anyone can use, modify, and redistribute them without permission. When we modify public domain content, we “create” a new work that we own the copyright to. We can license our modifications under MIT, but the original public domain portions remain
public domain and cannot be relicensed. Remember to follow best practices to avoid complications:
- Document your changes: Clearly state which portions are your modifications versus the original public domain content.
- Apply the license correctly: Ensure your MIT license applies only to your contributions.
- Verify the source: Double-check that the original dataset is truly in the public domain.
After uploading the dataset, be sure to fill in its dataset card with all the details on the data collection process. Dataset cards serve as documentation for how your dataset was collected, cleaned, and used. You can use
the one that we have created for this example as a template.
Step 8: Verify & Reload the Dataset from Hugging Face Hub
Finally, we can check that the dataset is available on our HuggingFace user account.
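To verify, we can reload it from the Hub into a fresh dataset; replace the placeholder username with your own:

```python
import fiftyone.utils.huggingface as fouh

# Placeholder repo ID: use your own Hub username and repo name
reloaded = fouh.load_from_hub("<hub-username>/asl-mnist")
print(reloaded)
```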
Pro Tip: Move the Dataset to a Hugging Face Organization
A detail that might be important to you: push_to_hub only lets you push to a personal account, not to an organization. To transfer the dataset to an organization, first push it to your personal account and then transfer ownership through the dataset’s page.
Full Google Colab Notebook and Source Code
Key Takeaways
Congratulations! You have successfully taken a dataset from Kaggle, processed it into a usable format, curated it within FiftyOne, and shared it with the community on Hugging Face Hub. This workflow is a powerful pattern for any computer vision project, enabling better data understanding, collaboration, and reproducibility.
Try applying these steps to your own dataset and share it on HuggingFace Hub!
Next steps
In the following blog posts, we will go into training neural networks using the integration of FiftyOne and PyTorch. Be sure to try those techniques on this data!