New models in the FiftyOne model zoo: Broader coverage, same simple workflow
Oct 21, 2025
7 min read
We’ve added a ton of new model families and plugins to the FiftyOne Model Zoo, so you can try modern classification, detection, segmentation, and zero‑shot models on your own data with just a few lines of code. Load a model through our Transformers integration, run it on a dataset or view, and review results in the App with the same evaluation and visualization tools you already know and love.
This update adds widely used families across classification, detection, segmentation, and zero‑shot. They are available through our Hugging Face Transformers integration, so you can pull popular checkpoints and keep evaluation, comparison, and curation in the same FiftyOne workflow. In practice, you get the reach of Hugging Face and the dataset‑centric analysis of FiftyOne working together.
If you’re new to the Zoo, the pattern below repeats in every section: load → apply → evaluate → explore in FiftyOne.

Quick setup

FiftyOne’s Torch‑based zoo models rely on PyTorch, and the Hugging Face Transformers integration additionally needs the transformers package. Once those packages are installed, models you load from the FiftyOne Model Zoo with foz.load_zoo_model will use your graphics card if one is available: Torch models use CUDA on NVIDIA GPUs by default and fall back to the CPU when no GPU is present, and on a Mac with Apple Silicon you can run on the built‑in GPU by passing device="mps". The same device handling applies when you load models through the Transformers integration, so your load → apply → evaluate → explore scripts work the same on a laptop and on a server.
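As a concrete starting point, here’s a minimal sketch of that setup. The package list and the convnext-base-224-torch checkpoint are just examples, and the device keyword is assumed to be accepted at load time as described above; swap in whatever model you actually plan to run.

```python
# Assumed install (adjust to your environment):
#   pip install fiftyone torch torchvision transformers

import torch

import fiftyone.zoo as foz

# Pick a device: CUDA on NVIDIA GPUs, MPS on Apple Silicon, otherwise CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# Any Torch model in the zoo works here; this checkpoint is just an example
model = foz.load_zoo_model("convnext-base-224-torch", device=device)
```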

ConvNeXt (image classification)

ConvNeXt is a modern convolutional family that pairs the simplicity of a CNN with practical refinements inspired by recent transformer work. Offered in multiple sizes, it’s a dependable choice for image classification and a strong backbone for transfer learning when you want something fast, accurate, and easy to use.
Available checkpoints: convnext-tiny-224-torch, convnext-small-224-torch, convnext-base-224-torch, convnext-large-224-torch, convnext-xlarge-224-torch.
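Here’s what load → apply → explore looks like with one of these checkpoints. This is a minimal sketch in which the zoo’s quickstart dataset and the convnext_tiny field name are placeholders for your own data.

```python
import fiftyone as fo
import fiftyone.zoo as foz

# Any dataset works; the zoo's quickstart dataset is handy for a first try
dataset = foz.load_zoo_dataset("quickstart")

# One of the ConvNeXt checkpoints listed above
model = foz.load_zoo_model("convnext-tiny-224-torch")

# Write classifications to a new field, then browse them in the App
dataset.apply_model(model, label_field="convnext_tiny")

session = fo.launch_app(dataset)
```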

EfficientNet B0–B7 (image classification)

EfficientNet is a scaled family that runs from B0 for lightweight devices up to B7 for higher accuracy, designed to grow depth, width, and input resolution in a balanced way so you get strong results for the compute you have. In the FiftyOne Model Zoo you can load any B0–B7 checkpoint, run it on your own dataset, and quickly compare speed and accuracy to choose the right tradeoff for your classes and hardware.
Available checkpoints: efficientnet-b0/b1/b2/b3/b4/b5/b6/b7-imagenet-torch.
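To find your point on that tradeoff curve, you can sweep a few sizes over the same samples and compare the predictions side by side. This sketch continues with the dataset from the ConvNeXt example; the field names are just a convention.

```python
import fiftyone.zoo as foz

# Each size writes to its own field so the runs can be compared later
for size in ("b0", "b4", "b7"):
    model = foz.load_zoo_model(f"efficientnet-{size}-imagenet-torch")
    dataset.apply_model(model, label_field=f"effnet_{size}")
```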

Swin V2 (image classification)

Swin V2 is a hierarchical vision transformer that uses windowed self‑attention for efficiency and scales cleanly to higher capacities and image resolutions. The v2 updates stabilize large models and improve transfer to higher input sizes, and ready‑to‑use Tiny, Small, Base, and Large checkpoints are available. That makes it a strong general‑purpose backbone to compare alongside ConvNeXt and EfficientNet within the same evaluation flow.
Available checkpoints: swin-v2-tiny/small/base/large-torch.

Slice, compare, and iterate (views + eval keys)

Use dataset views to focus on the subset you care about, then run evaluation on that view and save the results with an eval key so multiple runs stay separate and easy to compare. Evaluations can be run on a Dataset or a DatasetView, and you can later reload the exact view that was evaluated. In the App you get confusion matrices, PR curves, and TP/FP/FN counts, and you can click cells in the confusion matrix to load the matching samples for fast error inspection. This makes it simple to compare ConvNeXt, EfficientNet, and Swin V2 on the same slice without overwriting results.
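A minimal sketch of that flow, assuming classification ground truth in a ground_truth field, the prediction fields from the sketches above, and a placeholder class name for the slice:

```python
from fiftyone import ViewField as F

# Focus on the slice you care about
view = dataset.match(F("ground_truth.label") == "mountain")

# Evaluate each model on that view under its own eval key
view.evaluate_classifications(
    "convnext_tiny", gt_field="ground_truth", eval_key="eval_convnext"
)
view.evaluate_classifications(
    "effnet_b4", gt_field="ground_truth", eval_key="eval_effnet"
)

# Later, reload exactly the view that was evaluated
same_view = dataset.load_evaluation_view("eval_convnext")
```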

RT‑DETR v2 (object detection)

RT‑DETR v2 is a real‑time detection transformer that keeps the end‑to‑end DETR design while adding practical upgrades like scale‑aware sampling in deformable attention, an optional discrete sampling operator that removes grid_sample deployment friction, and a stronger “bag‑of‑freebies” training recipe for better accuracy at the same speed. In the Model Zoo, use rtdetr-v2-s-coco-torch when latency is the priority and switch to rtdetr-v2-m-coco-torch when you want more accuracy while staying real‑time.
Available checkpoints: rtdetr-v2-s-coco-torch, rtdetr-v2-m-coco-torch.
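Detection follows the same pattern; this sketch assumes your dataset stores ground truth detections in a ground_truth field.

```python
import fiftyone.zoo as foz

# Small variant for latency; swap in rtdetr-v2-m-coco-torch for more accuracy
model = foz.load_zoo_model("rtdetr-v2-s-coco-torch")
dataset.apply_model(model, label_field="rtdetr_v2_s")

# COCO-style evaluation saved under its own eval key
rtdetr_results = dataset.evaluate_detections(
    "rtdetr_v2_s",
    gt_field="ground_truth",
    eval_key="eval_rtdetr",
    compute_mAP=True,
)
print(rtdetr_results.mAP())
```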

D‑FINE (object detection)

D‑FINE is a real‑time DETR‑style detector that replaces direct box regression with fine‑grained distribution refinement and adds a localization self‑distillation step, yielding strong accuracy without giving up speed. The family spans Nano, Small, Medium, Large, and X‑Large on COCO, with the larger variants reported around mid‑50s AP while still running in real time on a T4 GPU. That makes it a useful counterpoint to RT‑DETR v2 when you want to compare error patterns on the same view.
Available checkpoints: dfine-nano/small/medium/large/xlarge-coco-torch.
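To use D‑FINE as that counterpoint, run it over the same samples and put the two eval runs side by side; this continues from the RT‑DETR v2 sketch above.

```python
import fiftyone.zoo as foz

model = foz.load_zoo_model("dfine-medium-coco-torch")
dataset.apply_model(model, label_field="dfine_m")

dfine_results = dataset.evaluate_detections(
    "dfine_m",
    gt_field="ground_truth",
    eval_key="eval_dfine",
    compute_mAP=True,
)

# Compare overall mAP, then open the App to compare error patterns per sample
print("RT-DETR v2:", rtdetr_results.mAP())
print("D-FINE:    ", dfine_results.mAP())
```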

SegFormer (semantic segmentation)

SegFormer (B0 through B5) combines a hierarchical transformer encoder with a lightweight MLP decoder to deliver efficient, accurate semantic segmentation. We include ADE20K‑finetuned checkpoints, including B5 at 640×640, that work well for scene parsing across 150 classes such as road, building, and sky, so you can generate per‑pixel overlays and explore class‑wise behavior in FiftyOne.
Available checkpoints: segformer-b0/b1/b2/b3/b4-ade20k-torch, segformer-b5-ade20k-torch (640×640).
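Applying a segmentation checkpoint looks the same; predictions land as per‑pixel masks you can overlay and filter by class in the App. The field name is again just a convention.

```python
import fiftyone as fo
import fiftyone.zoo as foz

model = foz.load_zoo_model("segformer-b0-ade20k-torch")
dataset.apply_model(model, label_field="segformer_b0")

# Toggle the mask overlays and individual classes in the App
session = fo.launch_app(dataset)
```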

Medical zero‑shot VLMs (label‑free concepts and embeddings)

MedSigLIP, MONET, and PubMed‑CLIP are vision‑language models that map medical images and text into the same embedding space, so you can search for concepts and run zero‑shot classification without labeled training data. MedSigLIP focuses on clinical imagery at 448×448 resolution, and its weights are distributed under Google’s Health AI Developer Foundations terms for research use. MONET is a dermatology‑focused CLIP model trained on 100k+ skin images with literature‑derived captions and expert verification. PubMed‑CLIP is a CLIP variant trained on the ROCO radiology image‑caption corpus for biomedical zero‑shot work.
Available checkpoints: medsiglip-448-zero-torch, monet-zero-torch, pubmed-clip-vit-base-patch32.
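Here’s a sketch of zero‑shot classification with one of these checkpoints, assuming they accept a classes argument at load time like other zero‑shot zoo models; the class names are placeholders for your own concepts, and the similarity index at the end is optional.

```python
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Placeholder concepts; use labels that match your imagery
classes = ["melanoma", "benign nevus", "normal skin"]

model = foz.load_zoo_model("medsiglip-448-zero-torch", classes=classes)
dataset.apply_model(model, label_field="zero_shot")

# The same checkpoint can back an embeddings index for concept search
fob.compute_similarity(
    dataset, model="medsiglip-448-zero-torch", brain_key="med_sim"
)
```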

Plugins: run models from the App

FiftyOne Plugins let you run model‑driven operators (captioning, open‑vocab detection, OCR, VQA, embeddings/search) from the Operators Browser or Python, writing results to fields you can slice, evaluate, and compare like any Zoo run. The workflow is unchanged—load → apply → evaluate → explore—just triggered as an operator. The catalog has grown quickly; below we’re highlighting only two of the many recent additions to illustrate what’s now possible.

Plugin spotlights (try these first)

Florence‑2 (VLM plugin)

A unified VLM that exposes multiple tasks as operators, so you can caption, detect with open vocabulary, run OCR, and do phrase grounding/RES on any dataset slice—then analyze results with the same views, eval keys, and overlays you already use.
Tasks covered: caption, detect (open‑vocab), OCR, ground, RES.

Moondream‑2 (lightweight VLM plugin)

A small, fast VLM that’s ideal for quick iterations on captioning, VQA, and lightweight detection/point‑localization before you scale up, great for idea testing and label bootstrapping on laptops or small servers.
Tasks covered: caption, VQA, detect, point_localize.
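Once a plugin is installed, its operators can also be triggered from Python. The sketch below is only illustrative: the download URL, operator URI, and parameter names are placeholders, so check each plugin’s README for the real ones.

```python
# Download the plugin once from its repo (URL is a placeholder):
#   fiftyone plugins download https://github.com/<org>/<florence2-plugin-repo>

import fiftyone.operators as foo

# Operator URI and params below are hypothetical; see the plugin's README
foo.execute_operator(
    "@<org>/florence2/caption",
    ctx={"view": dataset.view(), "params": {"output_field": "florence_caption"}},
)
```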

How this compares to Hugging Face and other tools

Hugging Face gives a simple way to run models with Transformers pipelines and to compute metrics with Evaluate, both of which are code-first utilities. With FiftyOne you pull the same checkpoints through the Model Zoo or the Transformers integration, apply them to your own dataset in a line or two, and then explore results in an interactive App with confusion matrices, PR curves, filters, and saved dataset slices across classification, detection, segmentation, and zero‑shot. This keeps evaluation, comparison, and curation in one place while you continue to fine‑tune or serve models with your preferred training stack. Compared with training‑centric tools like Ultralytics YOLO, which focus on validation runs and saved plots, FiftyOne is dataset‑centric and model‑agnostic so you can compare families side by side and curate the data that moves the needle.

Bringing it together

The FiftyOne Model Zoo now spans modern families for classification, detection, segmentation, and medical zero‑shot. Pick a dataset slice that reflects your use case, run a family, and use the App to study where it succeeds and where it misses. Save those slices and let them drive what you train, what you tune, and what you fix next.


TL;DR

We added new models for:
  • Classification: ConvNeXt (Tiny → X-Large), EfficientNet (B0 → B7), Swin V2 (Tiny → Large)
  • Detection: RT-DETR v2 (Small, Medium), D-FINE (Nano → X-Large)
  • Segmentation: SegFormer (B0 → B5 on ADE20K; B5 supports 640×640)
  • Medical zero‑shot: MedSigLIP, MONET, PubMed-CLIP
  • Plugins: Florence-2 & Moondream-2 (Vision-Language Models)
