Exploring the World’s Largest Insect Dataset with a Modern Toolkit for Visual AI

BIOSCAN in FiftyOne

A new, comprehensive dataset called BIOSCAN-5M was introduced to the machine learning community at NeurIPS 2024, and it is a wealth of multi-modal information on over 5 million arthropod specimens, 98% of which are insects. Look, I get it; I’m as creeped out by bugs as the next guy. But it turns out that insect populations are declining globally, which could threaten ecosystems and human well-being, so monitoring these populations is necessary for maintaining ecosystem stability. With deep learning and large, comprehensive datasets, researchers can help conservation efforts by automating species identification and providing ecological insights.

This is where datasets like BIOSCAN-5M come into play, bridging the gap between deep learning and biodiversity research and enhancing our understanding and preservation of the natural world. BIOSCAN-5M builds upon the earlier BIOSCAN-1M dataset, offering increased data volume, greater diversity, and more thoroughly cleaned taxonomic labels. It combines specimen images, DNA barcodes, and taxonomic classifications, addressing the critical need for automated species identification and discovery.

The power of multi-modal data

At the heart of BIOSCAN-5M lies its multi-modal nature, which combines several data types to give a fuller picture of insect biodiversity. Taxonomic labels, structured according to the Linnaean hierarchy (from phylum to species), provide a standardized classification system. A hybrid validation approach that combines AI tools with human expertise ensures accuracy, particularly at deeper taxonomic levels such as genus and species. DNA barcodes and Barcode Index Numbers (BINs) complement these labels, facilitating rapid species-level identification and the mapping of evolutionary relationships. This offers a genetic blueprint for biodiversity research.

Geographical context helps track species distribution patterns, revealing ecological hotspots and potential sampling biases. Specimen size metrics (such as pixel area and scale factors) serve as proxies for studying environmental impacts, such as the effect of climate change on insect populations. High-resolution microscopic images capture intricate morphological details, which support visual identification and the development of robust deep learning models for image-based classification.

The true strength of BIOSCAN-5M comes from integrating these modalities. Multi-modal learning boosts classification accuracy by cross-referencing genetic, visual, and ecological data; supports real-world scenarios like handling incomplete labels or identifying novel species; and enables cross-modal queries, such as retrieving specimen images using DNA sequences.

Key features of the BIOSCAN-5M dataset

Images: 5.15M high-resolution microscope images (1024×768px) with cropped/resized versions.
Genetic Data: Raw DNA barcode sequences (COI gene) and BIN clusters.
Taxonomy: Labels for 7 taxonomic ranks (phylum, class, order, family, subfamily, genus, species).
Geographical Metadata: Collection country, province/state, latitude/longitude.
Size Metadata: Pixel count, area fraction, and scale factor for specimens.

Together, these features create a holistic toolkit for advancing biodiversity science, conservation, and AI-driven ecological monitoring.

Building a robust dataset

The thing about BIOSCAN-5M that stood out to me was its approach to data collection, processing, and curation. The dataset underwent a rigorous data cleaning pipeline to ensure the reliability of taxonomic labels. This involved identifying and resolving inconsistencies in the original data, such as typos and disagreements in taxonomic naming conventions. The labels were checked and cleaned to ensure consistency across DNA barcodes.

Conflict Resolution: Differences in annotations for the same barcode were investigated and resolved. The process included correcting stylistic differences, misspellings, and standardizing placeholder names.
Invalid Image Removal: Invalid images were removed during the dataset curation process.
High-Resolution Microscopy: Specimen images were captured using a Keyence VHX-7000 microscope at 1024×768 pixels.
Cropping: Images were cropped to focus on the region of interest containing the organism, eliminating unnecessary background. A DETR model was fine-tuned to improve the cropping tool. The bounding box of the cropped region is provided as part of the dataset release.
Resizing: Cropped images were resized to 341×256 pixels to facilitate model training and standardize the resolution across samples. Images were also resized to 256 pixels on their shorter side.
DNA Barcode Standardization: The genetic information is represented as raw nucleotide barcode sequences under the dna_barcode field. The dataset utilizes the COI gene sequence for species-level identification in Animalia.
BIN Clustering: DNA barcodes are grouped based on sequence similarity into clusters called Operational Taxonomic Units (OTUs), each assigned a Barcode Index Number (BIN).
Geographical Information: The dataset includes the country and province/state where each specimen was collected, along with latitude and longitude coordinates.
Specimen Size: The dataset provides specimen size information, including the number of pixels occupied by the organism, the area fraction, and the scale factor.
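To make the resizing step above concrete, here is a small worked example of scaling an image so that its shorter side is 256 pixels. The helper name and the rounding rule are my own assumptions, not the authors' code, but it shows why a full 1024×768 frame ends up at 341×256.

```python
def resize_shorter_side(width, height, target=256):
    """Return the (width, height) that puts the shorter side at `target` pixels.

    The aspect ratio is preserved; rounding to the nearest pixel is an
    assumption, not necessarily what the dataset authors used.
    """
    short = min(width, height)
    return round(width * target / short), round(height * target / short)


print(resize_shorter_side(1024, 768))  # (341, 256)
```

A cropped image with a different aspect ratio would get a different long side, since only the shorter side is pinned to 256.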

Exploring BIOSCAN-5M in FiftyOne

I randomly sampled a subset of 30,000 samples across all splits from the Cropped 256 dataset, parsed it into FiftyOne format, and pushed it to the Voxel51 Hugging Face org. It’s just a tiny, tiny fraction of the entire dataset, but it's enough to give us an appreciation of what’s in it and how we can use it. Let’s begin by installing some dependencies and downloading the dataset from the Voxel51 org on Hugging Face.
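Here is a minimal sketch of the download step. It assumes you have already run something like pip install fiftyone huggingface_hub, and that the subset lives under the repo id Voxel51/BIOSCAN-30k. Both the repo id and the helper name are my assumptions, so check the Voxel51 org page for the exact name.

```python
def load_bioscan_subset(repo_id="Voxel51/BIOSCAN-30k", **kwargs):
    """Download a FiftyOne-format dataset from the Hugging Face Hub.

    The default repo id is an assumption. Extra keyword arguments are passed
    straight to FiftyOne's load_from_hub (for example max_samples=1000 or
    persistent=True).
    """
    # Imported lazily so that defining the helper does not require FiftyOne
    from fiftyone.utils.huggingface import load_from_hub

    return load_from_hub(repo_id, **kwargs)


# In a notebook cell:
# dataset = load_bioscan_subset(persistent=True)
# print(dataset)
```

Passing persistent=True keeps the dataset in FiftyOne's database between sessions, which is handy if you want to reopen it later without downloading it again.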

This dataset has geolocation data, and to visualize it in the FiftyOne app you’ll need a Mapbox key. You can sign up for one on the Mapbox website; it’s free, and you get 50,000 free map loads. Once you have a Mapbox account and API key, set the following environment variable: export MAPBOX_TOKEN=xxxxxxx. Alternatively, if you’re running this in a Jupyter Notebook, you can set the same variable from Python through os.environ before launching the app.
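In a notebook, that might look like the cell below. The value is the same placeholder as above, not a real token, and I'm assuming the cell runs before the app is launched.

```python
import os

# Replace the placeholder with your own Mapbox token before launching the app
os.environ["MAPBOX_TOKEN"] = "xxxxxxx"
```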

Now let’s install a plugin that allows us to create custom dashboards and glean more insight into our dataset:
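One way to do this from Python is FiftyOne's plugin downloader. This sketch assumes the dashboard plugin is the one published as @voxel51/dashboard in the voxel51/fiftyone-plugins GitHub repo; the helper name is mine.

```python
def install_dashboard_plugin():
    """Download the dashboard plugin from the FiftyOne plugins repository."""
    # Imported lazily so that defining the helper does not require FiftyOne
    import fiftyone.plugins as fop

    fop.download_plugin(
        "https://github.com/voxel51/fiftyone-plugins",
        plugin_names=["@voxel51/dashboard"],  # assumed plugin name
    )


# In a notebook cell:
# install_dashboard_plugin()
```

The fiftyone plugins download command in the terminal does the same job if you prefer the CLI.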

Once the dataset has been downloaded, you can begin exploring it by launching the FiftyOne app. There are two ways to use the app:

1. As a cell in your notebook, by calling fo.launch_app on the dataset.
2. In a separate browser window, by running fiftyone app launch in your terminal.
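For the notebook route, a minimal sketch looks like this. It assumes dataset is the dataset you loaded earlier; the wrapper function is only there to keep the snippet self-contained.

```python
def launch_in_notebook(dataset):
    """Open the FiftyOne App for an already-loaded dataset."""
    # Imported lazily so that defining the helper does not require FiftyOne
    import fiftyone as fo

    # launch_app returns a Session object that you can use to control the App
    return fo.launch_app(dataset)


# In a notebook cell:
# session = launch_in_notebook(dataset)
```

Keep a reference to the returned session around; it is the handle you use to interact with the App from code.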

Once the app is launched, you can explore the dataset by:

Scrolling through the images for a visual vibe check of its contents
Filtering based on the labels (the various taxonomic classifications, geographic information, or size measurements)
Opening the map panel and exploring based on geographic location
Creating a dashboard of plots for the various information fields of the dataset
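If you'd rather pull the same kind of summary from Python than from a dashboard, FiftyOne's count_values aggregation is one option. This is a rough sketch: the helper is mine, and the field path in the usage comment is hypothetical, so swap in whatever field names the dataset actually exposes.

```python
def top_values(dataset, field_path, k=10):
    """Return the k most frequent values of a field as (value, count) pairs."""
    # count_values returns a dict that maps each distinct value to its count
    counts = dataset.count_values(field_path)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical field path; check the dataset's schema for the real names:
# print(top_values(dataset, "country", k=5))
```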

🐛 Warning: You’re about to see some creepy crawly insects

Below is an example of using the map panel:

Initial exploration of the dataset in FiftyOne

Deeper analysis with FiftyOne

You can take your analysis to a deeper level by using embeddings-based workflows. The authors of the paper mentioned they trained a CLIP-like model. This model, built using the CLIBD (Contrastive Learning for Image-Barcode-Description) framework, learns a shared embedding space across the three modalities (image, barcode, and description), enabling cross-modal queries and improving performance in taxonomic classification tasks. However, I was unable to find the model weights on Hugging Face or through the project’s GitHub repo. Instead, I will make use of some other models which were mentioned in the paper.

Note: I’m not an expert in biology, genomics, or insects. I’m just a hacker. I apologize in advance to the community of practitioners working in this space if I’m not using the models as intended. My goal is to show you what’s possible when you use the open source FiftyOne library.

Let’s start by computing embeddings for the images using BioCLIP. BioCLIP extends the CLIP framework to create a vision foundation model specialized for biological imagery, focusing on taxonomic relationships across the tree of life. Trained on TreeOfLife-10M, a dataset of 10M biological images spanning 454K taxa, BioCLIP learns hierarchical representations aligned with taxonomic ranks (kingdom to species). Unlike standard CLIP, it treats species as interconnected nodes in a biological hierarchy rather than isolated classes. BioCLIP is part of the Open CLIP ecosystem, so you can load it through FiftyOne’s integration with Open CLIP.
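Here is a sketch of loading BioCLIP through the FiftyOne model zoo. It assumes open_clip_torch is installed and that the weights are published on the Hugging Face Hub as imageomics/bioclip; that hub id is my assumption, so double-check it against the BioCLIP model card.

```python
def load_bioclip():
    """Load BioCLIP via FiftyOne's Open CLIP zoo model."""
    # Imported lazily so that defining the helper does not require FiftyOne
    import fiftyone.zoo as foz

    return foz.load_zoo_model(
        "open-clip-torch",
        clip_model="hf-hub:imageomics/bioclip",  # assumed hub id
        pretrained="",  # weights come from the hub checkpoint itself
    )


# In a notebook cell:
# bioclip_model = load_bioclip()
```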

Once the model is downloaded, you can compute embeddings as follows:
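A minimal sketch of that step, using the dataset's compute_embeddings method. The field name bioclip_embeddings and the batch size are placeholders I picked for illustration.

```python
def add_image_embeddings(dataset, model, field="bioclip_embeddings", batch_size=32):
    """Embed every image with `model` and store the vectors in `field`."""
    # compute_embeddings writes one embedding per sample into the given field
    dataset.compute_embeddings(
        model,
        embeddings_field=field,
        batch_size=batch_size,
    )
    return field


# In a notebook cell, with the model loaded through the zoo:
# add_image_embeddings(dataset, bioclip_model)
```

Storing the embeddings on the dataset means the later steps can refer to them by field name instead of recomputing them.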

We’ll visualize these embeddings shortly, but first let’s compute embeddings for the DNA sequences using BarcodeBERT. BarcodeBERT is a specialized transformer model designed for biodiversity analysis through DNA barcode sequences. Built on the BERT architecture, it adapts self-supervised pretraining to the unique demands of taxonomic classification, particularly for invertebrates.
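BarcodeBERT isn't in the FiftyOne model zoo as far as I know, so one approach is to embed the sequences yourself and write the vectors back to the dataset. The sketch below keeps the model out of the picture: encode is any callable you supply (for example a thin wrapper around BarcodeBERT built from its model card) that maps a nucleotide string to a 1D NumPy array. I'm assuming the FiftyOne dataset keeps the dna_barcode field name; barcode_embeddings is a placeholder.

```python
def add_barcode_embeddings(
    dataset,
    encode,
    barcode_field="dna_barcode",
    embeddings_field="barcode_embeddings",
):
    """Embed each DNA barcode with `encode` and store the result per sample.

    `encode` is a user-supplied callable: nucleotide string -> 1D vector.
    """
    # values() returns the field's value for every sample, in dataset order
    barcodes = dataset.values(barcode_field)

    # Samples without a barcode keep a None embedding
    embeddings = [None if seq is None else encode(seq) for seq in barcodes]

    # set_values() writes the list back, one entry per sample, in the same order
    dataset.set_values(embeddings_field, embeddings)
    return embeddings_field


# In a notebook cell, with your own BarcodeBERT wrapper:
# add_barcode_embeddings(dataset, encode=my_barcodebert_encode)
```

Separating the encoder from the bookkeeping also makes it easy to swap in a different DNA model later.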

Now we can compute a 2D representation of our high-dimensional embeddings using UMAP. This will allow us to visualize how different specimens are related to each other in the embedding space while preserving the important relationships between data points.
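With FiftyOne Brain, the UMAP step can be sketched as below. It assumes umap-learn is installed and that the embeddings were stored under the field names used in the usage comments, which are placeholders of mine along with the brain keys.

```python
def add_umap_visualization(dataset, embeddings_field, brain_key):
    """Project stored embeddings to 2D with UMAP and save them under `brain_key`."""
    # Imported lazily so that defining the helper does not require FiftyOne
    import fiftyone.brain as fob

    return fob.compute_visualization(
        dataset,
        embeddings=embeddings_field,  # name of the field holding the vectors
        method="umap",
        num_dims=2,
        brain_key=brain_key,
    )


# In a notebook cell, one run per modality:
# add_umap_visualization(dataset, "bioclip_embeddings", "bioclip_umap")
# add_umap_visualization(dataset, "barcode_embeddings", "barcode_umap")
```

Giving each modality its own brain key lets you switch between the image and DNA views of the same specimens.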

And we can visualize our results in the app’s Embeddings panel:

Visualizing embeddings in FiftyOne

Using embeddings to gain deeper insights

You can use the embeddings you’ve just computed to compute scores for uniqueness and representativeness.

Uniqueness will compute a scalar value for each sample that ranks the uniqueness of that sample (a higher value means more unique). The uniqueness values for a dataset are normalized to [0, 1], with the most unique sample in the collection having a uniqueness value of 1.

Representativeness will compute a scalar value for each sample that ranks the representativeness of that sample (a higher value means more representative). The representativeness values for a dataset are normalized to [0, 1], with the most representative samples in the collection having a representativeness value of 1.

Since we already have embeddings computed for images and the DNA barcode, you can compute these values for either field. We’ll compute them for the BioCLIP embeddings.
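Both scores are available through FiftyOne Brain. Here is a sketch that reuses the stored BioCLIP embeddings; all three field names are placeholders I chose.

```python
def score_samples(dataset, embeddings_field="bioclip_embeddings"):
    """Compute uniqueness and representativeness from stored embeddings."""
    # Imported lazily so that defining the helper does not require FiftyOne
    import fiftyone.brain as fob

    uniqueness_field = "bioclip_uniqueness"
    representativeness_field = "bioclip_representativeness"

    fob.compute_uniqueness(
        dataset,
        embeddings=embeddings_field,
        uniqueness_field=uniqueness_field,
    )
    fob.compute_representativeness(
        dataset,
        embeddings=embeddings_field,
        representativeness_field=representativeness_field,
    )
    return uniqueness_field, representativeness_field


# In a notebook cell:
# score_samples(dataset)
# most_unique = dataset.sort_by("bioclip_uniqueness", reverse=True)
```

Pointing embeddings_field at the barcode embeddings instead would score the specimens by their DNA rather than their appearance.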

Once these fields have been computed, you can interact with them in the app as you normally would:

Filtering by uniqueness and representativeness

Conclusion

In this exploration of BIOSCAN-30k (a small subset of the full BIOSCAN-5M dataset), we’ve demonstrated several powerful ways to analyze biodiversity data using FiftyOne:

Data Exploration: We explored the dataset’s rich taxonomic, geographic, and specimen metadata through FiftyOne’s visualization tools and custom dashboards.
Multimodal Analysis: We leveraged two state-of-the-art models:
- BioCLIP for visual embeddings of specimen images
- BarcodeBERT for DNA barcode sequence embeddings
Advanced Analytics: We used these embeddings to:
- Visualize relationships between specimens using UMAP dimensionality reduction.
- Identify unique and representative samples in the dataset.
- Create interactive visualizations for exploring specimen similarities.

This workflow showcases how modern ML tools can help researchers and practitioners analyze large-scale biodiversity datasets, potentially accelerating species identification and ecological research.

Next Steps

Some potential directions for further analysis:

Clustering analysis to identify specimen groups
Cross-modal similarity search (finding similar specimens across images and DNA sequences)
Geographic distribution analysis of specific taxonomic groups
Training custom models for automated species identification

For more information about the BIOSCAN-5M project and dataset, visit:

If you have any questions or want to stay up-to-date with us at FiftyOne, feel free to join our Discord community!
