Apply transformer models directly to your computer vision datasets

Transformer models may have begun with language modeling, but over the past few years, the vision transformer (ViT) has become a crucial tool in the computer vision toolbox. Whether you're working on traditional vision tasks like image classification or semantic segmentation, or on more du jour zero-shot tasks, transformer models are either competitive with the state of the art or setting it themselves. Hugging Face's transformers library makes it easy to load, apply, and manipulate these models.

Now, with the integration between Hugging Face transformers and the open source FiftyOne library for data curation and visualization, it is easier than ever to integrate Transformer models directly into your computer vision workflows.

In this post, we'll show you how to seamlessly connect your visual data and transformer models.

Setup
What is FiftyOne?
Transformers Integration Overview
Inference with Transformer Models
- Classification, Detection, Semantic Segmentation, and Monocular Depth Estimation
- Zero-Shot Tasks
- Videos for Free
Embeddings with Transformers
- Image & Patch Embeddings
- Embeddings Visualization
- Semantic and Similarity Searches
Conclusion

Setup

For this walkthrough, you'll need Hugging Face's transformers library, Voxel51's fiftyone library, and `torch` and `torchvision` installed, for example with `pip install -U transformers fiftyone torch torchvision`.

What is FiftyOne?

FiftyOne is the leading open source library for curation and visualization of computer vision data. The core data structure in FiftyOne is the fiftyone.Dataset, which logically represents the metadata, labels, and any other information associated with media files like images, videos, and point clouds.

You can load datasets directly from the FiftyOne Dataset Zoo, or load in your own data — there's built-in support for loading from directories, glob patterns, or common formats like COCO.
For this walkthrough, we'll be using the Quickstart dataset, which is a subset of the COCO 2017 validation split.
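As a minimal sketch, loading from the Dataset Zoo could look like the following. The helper name `load_quickstart` is hypothetical, and "quickstart" is assumed to be the zoo name of the Quickstart dataset; the import sits inside the function so that nothing is downloaded until you call it.

```python
def load_quickstart(name="quickstart"):
    """Load a dataset from the FiftyOne Dataset Zoo (hypothetical helper)."""
    # Imported lazily so that defining this helper has no side effects
    import fiftyone.zoo as foz

    # Assumption: "quickstart" is the zoo name of the Quickstart dataset.
    # The data is downloaded the first time this is called.
    return foz.load_zoo_dataset(name)
```

Calling `dataset = load_quickstart()` should then give you a dataset to use in the rest of the walkthrough.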

💡 To get started working with videos, try out the Quickstart Video dataset.

Once you have a fiftyone.Dataset, you can filter it with pandas-like syntax.
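For illustration, here is one way such a filter might be written with FiftyOne's `ViewField` expressions. The helper name `confident_predictions` is hypothetical, and the "predictions" field name and 0.9 threshold are assumptions chosen for the Quickstart dataset.

```python
def confident_predictions(dataset, label_field="predictions", min_conf=0.9):
    """Return a view keeping only labels above a confidence threshold."""
    # Imported lazily so that defining this helper has no side effects
    from fiftyone import ViewField as F

    # F("confidence") refers to the `confidence` attribute of each label
    # in `label_field`; the comparison builds a filter expression
    return dataset.filter_labels(label_field, F("confidence") > min_conf)
```

The result is a view into the dataset rather than a copy, so the underlying samples are left unchanged.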

You can also visually inspect your data in the FiftyOne App.
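Launching the App from Python can be sketched as below; `open_in_app` is a hypothetical wrapper around `fo.launch_app()`.

```python
def open_in_app(dataset):
    """Open a dataset or view in the FiftyOne App and return the session."""
    # Imported lazily so that defining this helper has no side effects
    import fiftyone as fo

    # launch_app() returns a session object connected to the App
    return fo.launch_app(dataset)
```

In a standalone script (as opposed to a notebook), you would typically call `session.wait()` on the returned session so the process stays alive while you browse.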

Why use FiftyOne? FiftyOne is built from the ground up for computer vision. It puts all of your labels, features, and associated information in one place, so you can compare apples to apples, stay organized, and treat your data as a living, breathing object!

Transformers Integration Overview

With the integration between fiftyone and Hugging Face transformers, you can apply Transformer models directly to your data — either the entire dataset, or whatever filtered subset you choose — without writing any custom code.

For inference, the integration supports:

Image Classification: any of the models listed in the Transformers image classification task guide, including BeiT, BiT, DeiT, DINOv2, ViT, and more
Object Detection: any of the models listed in the Transformers object detection task guide, including DETA, DETR, Table Transformer, YOLOS, and more
Semantic Segmentation: DPT, MaskFormer, Mask2Former, and Segformer
Monocular Depth Estimation: DPT, GLPN, and Depth Anything
Zero-Shot Image Classification: any Transformer model which exposes both text and image features (get_text_features() and get_image_features()), such as ALIGN, AltCLIP, and CLIP, or which supports image-text matching (XYZForImageAndTextRetrieval), such as BridgeTower.
Zero-Shot Object Detection: OWL-ViT and OWLv2

Additionally, the integration supports computing embeddings directly, and using Transformer models directly in any downstream application that leverages embeddings, such as dimensionality-reduced visualization and semantic/similarity search.

For embedding computation/utilization, all Image Classification and Object Detection models that expose the last_hidden_state attribute, and all Zero-Shot Image Classification/Object Detection models that expose image features via get_image_features(), are supported.

For semantic similarity search, only Zero-Shot Classification/Detection models that expose text and image features are supported.

Inference with Transformer Models

In FiftyOne, sample collections (fiftyone.Dataset and fiftyone.DatasetView instances) have an apply_model() method, which takes a model as input. This model can be any model from the FiftyOne Model Zoo, any fiftyone.Model instance, or a Hugging Face transformers model!

Traditional Image Inference Tasks

For Image Classification, you can load a model from the Transformers library, either with the architecture-specific constructor (for BeiT, BeitForImageClassification) or with AutoModelForImageClassification, using from_pretrained() to specify the checkpoint.

Once the model has been loaded, you can apply the model directly to your dataset, specifying the name of the field in which to store the classification labels via the label_field argument.
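The load-then-apply pattern might look like the sketch below. The helper name `classify_images`, the BeiT checkpoint "microsoft/beit-base-patch16-224", and the "beit_predictions" field name are all assumptions made for illustration.

```python
def classify_images(dataset,
                    checkpoint="microsoft/beit-base-patch16-224",
                    label_field="beit_predictions"):
    """Load an image classifier from a checkpoint and apply it to `dataset`."""
    # Imported lazily so that no weights are downloaded until this is called
    from transformers import AutoModelForImageClassification

    # Assumption: the default checkpoint is a BeiT classification checkpoint
    model = AutoModelForImageClassification.from_pretrained(checkpoint)

    # Predictions are stored on each sample under `label_field`
    dataset.apply_model(model, label_field=label_field)
    return model
```

Because `dataset` can be a full dataset or any filtered view, the same call works on whatever subset you choose.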

Object Detection, Semantic Segmentation, and Depth Estimation tasks work in analogous fashion. For Object Detection, instantiate a model with AutoModelForObjectDetection or the architecture-specific constructor, and apply it with the same syntax.
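A comparable sketch for detection follows. `detect_objects` is a hypothetical helper, and the DETR checkpoint "facebook/detr-resnet-50" and the "detr_predictions" field name are assumptions; some DETR checkpoints may also need the `timm` package installed.

```python
def detect_objects(dataset,
                   checkpoint="facebook/detr-resnet-50",
                   label_field="detr_predictions"):
    """Load an object detector from a checkpoint and apply it to `dataset`."""
    # Imported lazily so that no weights are downloaded until this is called
    from transformers import AutoModelForObjectDetection

    # Assumption: the default checkpoint is a DETR detection checkpoint
    model = AutoModelForObjectDetection.from_pretrained(checkpoint)

    # Same apply_model() syntax as for classification
    dataset.apply_model(model, label_field=label_field)
    return model
```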

For Semantic Segmentation, you can load and apply models that have ForInstanceSegmentation or ForUniversalSegmentation in their constructors, so long as the image processor for the model has a post_process_semantic_segmentation() method.
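As one possible example, a Mask2Former model could be applied as sketched here. The helper name `segment_images`, the checkpoint "facebook/mask2former-swin-small-coco-instance", and the "mask2former" field name are assumptions.

```python
def segment_images(dataset,
                   checkpoint="facebook/mask2former-swin-small-coco-instance",
                   label_field="mask2former"):
    """Load a Mask2Former model and store its segmentations on `dataset`."""
    # Imported lazily so that no weights are downloaded until this is called
    from transformers import Mask2FormerForUniversalSegmentation

    # Assumption: this checkpoint's image processor supports semantic
    # segmentation post-processing, as the integration requires
    model = Mask2FormerForUniversalSegmentation.from_pretrained(checkpoint)

    dataset.apply_model(model, label_field=label_field)
    return model
```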

And for Monocular Depth Estimation, you can load and apply models that have ForDepthEstimation in their constructors, such as DPTForDepthEstimation for DPT.
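For DPT, a sketch under the same pattern is shown below. `estimate_depth` is a hypothetical helper, and the checkpoint "Intel/dpt-large" and the "dpt_depth" field name are assumptions.

```python
def estimate_depth(dataset,
                   checkpoint="Intel/dpt-large",
                   label_field="dpt_depth"):
    """Load a DPT depth estimation model and apply it to `dataset`."""
    # Imported lazily so that no weights are downloaded until this is called
    from transformers import DPTForDepthEstimation

    # Assumption: "Intel/dpt-large" is used as the DPT checkpoint
    model = DPTForDepthEstimation.from_pretrained(checkpoint)

    # Depth predictions are stored on each sample under `label_field`
    dataset.apply_model(model, label_field=label_field)
    return model
```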

💡 Once you have generated predictions, you can filter by label class and prediction confidence in the app, and by arbitrary properties in Python. For instance, you can filter for bounding boxes that take up less than 1/4 of the image by thresholding the product of each box's relative width and height.
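The bounding box filter could be sketched as follows, relying on FiftyOne storing boxes as `[top-left-x, top-left-y, width, height]` in relative coordinates. The helper name `small_boxes` is hypothetical, and the label field is whatever name you stored your detections under.

```python
def small_boxes(dataset, label_field, max_area=0.25):
    """Keep only detections whose box covers less than `max_area` of the image."""
    # Imported lazily so that defining this helper has no side effects
    from fiftyone import ViewField as F

    # Boxes are [x, y, width, height] in relative [0, 1] coordinates, so
    # width * height is the fraction of the image covered by the box
    rel_area = F("bounding_box")[2] * F("bounding_box")[3]

    return dataset.filter_labels(label_field, rel_area < max_area)
```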

💡 You can numerically evaluate predictions for any of these tasks with FiftyOne's Evaluation API.

Zero-Shot Inference Tasks

For zero-shot tasks, it is recommended to load the Hugging Face transformers model from the FiftyOne Model Zoo. Transformer models for Zero-Shot Image Classification can be loaded with the load_zoo_model() method, passing the model type "zero-shot-classification-transformer-torch" as the first argument and the checkpoint as name_or_path=<hf-name-or-path>. You can pass the list of classes in at model initialization time, or set the model's classes later.

You can then apply the model for image classification just as you did in the traditional image classification setting with apply_model().
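Put together, zero-shot classification might look like this sketch. The helper name `zero_shot_classify`, the CLIP checkpoint "openai/clip-vit-base-patch32", and the "zero_shot_classification" field name are assumptions; the class list is whatever you supply.

```python
def zero_shot_classify(dataset, classes,
                       name_or_path="openai/clip-vit-base-patch32",
                       label_field="zero_shot_classification"):
    """Classify images into `classes` with a zero-shot zoo model."""
    # Imported lazily so that no weights are downloaded until this is called
    import fiftyone.zoo as foz

    # The first argument is the model type; the checkpoint and the
    # candidate classes are passed at initialization time
    model = foz.load_zoo_model(
        "zero-shot-classification-transformer-torch",
        name_or_path=name_or_path,
        classes=classes,
    )

    dataset.apply_model(model, label_field=label_field)
    return model
```

For example, `zero_shot_classify(dataset, ["cat", "dog", "bird"])` would label each image with one of those three classes.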

Zero-Shot Object Detection works the same way, but with model type "zero-shot-detection-transformer-torch".
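A detection counterpart, sketched with OWL-ViT, is shown below. `zero_shot_detect` is a hypothetical helper, and the checkpoint "google/owlvit-base-patch32" and the "owlvit" field name are assumptions.

```python
def zero_shot_detect(dataset, classes,
                     name_or_path="google/owlvit-base-patch32",
                     label_field="owlvit"):
    """Detect objects from `classes` with a zero-shot zoo model."""
    # Imported lazily so that no weights are downloaded until this is called
    import fiftyone.zoo as foz

    # Only the model type differs from the zero-shot classification case
    model = foz.load_zoo_model(
        "zero-shot-detection-transformer-torch",
        name_or_path=name_or_path,
        classes=classes,
    )

    dataset.apply_model(model, label_field=label_field)
    return model
```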

Video Inference Tasks

One of the coolest parts of this integration is that the flexibility intrinsic to FiftyOne's datasets and to Hugging Face's Transformer models is preserved. Without any additional work, you can apply any of the models from the image tasks above to (the frames in) a video dataset, and it will just work!

Applying YOLOS from the transformers library to a video dataset takes the same apply_model() call as for images.
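A sketch of the video case follows. The helper name `run_yolos_on_video`, the zoo name "quickstart-video", the checkpoint "hustvl/yolos-tiny", the `max_samples` limit, and the 0.5 confidence threshold are all assumptions chosen to keep the example small.

```python
def run_yolos_on_video(max_samples=2,
                       checkpoint="hustvl/yolos-tiny",
                       label_field="yolos"):
    """Apply a YOLOS detector to the frames of a small video dataset."""
    # Imported lazily so that nothing is downloaded until this is called
    import fiftyone.zoo as foz
    from transformers import YolosForObjectDetection

    # Assumption: "quickstart-video" is the zoo name of the Quickstart
    # Video dataset; max_samples keeps the run short
    video_dataset = foz.load_zoo_dataset("quickstart-video", max_samples=max_samples)

    model = YolosForObjectDetection.from_pretrained(checkpoint)

    # Same call as for images; for video collections the predictions
    # are stored per frame
    video_dataset.apply_model(model, label_field=label_field, confidence_thresh=0.5)
    return video_dataset
```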

Embeddings with Transformers

Image and Patch Embeddings

Just as we could pass a Hugging Face transformers model directly into a FiftyOne sample collection's apply_model() method for inference, we can pass a transformers model directly into a sample collection's compute_embeddings() method. For instance, you can use a BeiT model to compute embeddings for all images and store them in a field "beit_embeddings" on the samples.
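One way this could look is sketched below. `embed_images` is a hypothetical helper, and the checkpoint "microsoft/beit-base-patch16-224-pt22k" is an assumption.

```python
def embed_images(dataset,
                 checkpoint="microsoft/beit-base-patch16-224-pt22k",
                 embeddings_field="beit_embeddings"):
    """Compute one embedding per image and store it on the samples."""
    # Imported lazily so that no weights are downloaded until this is called
    from transformers import BeitModel

    # Assumption: the base BeiT model exposes last_hidden_state, which is
    # what the integration uses for embeddings
    model = BeitModel.from_pretrained(checkpoint)

    dataset.compute_embeddings(model, embeddings_field=embeddings_field)
    return model
```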

You can also use compute_patch_embeddings() to compute and store embeddings for each of the object patches in a certain label field on your dataset. For example, you can use CLIP to compute embeddings for the ground truth object patches.
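A patch-level sketch with CLIP follows. The helper name `embed_patches`, the checkpoint "openai/clip-vit-base-patch32", and the "ground_truth" and "clip_embeddings" field names are assumptions.

```python
def embed_patches(dataset,
                  patches_field="ground_truth",
                  checkpoint="openai/clip-vit-base-patch32",
                  embeddings_field="clip_embeddings"):
    """Compute one embedding per object patch in `patches_field`."""
    # Imported lazily so that no weights are downloaded until this is called
    from transformers import CLIPModel

    model = CLIPModel.from_pretrained(checkpoint)

    # Each object in `patches_field` is cropped and embedded; the result
    # is stored under `embeddings_field` on the corresponding label
    dataset.compute_patch_embeddings(
        model, patches_field, embeddings_field=embeddings_field
    )
    return model
```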

Visualizing Embeddings with Dimensionality Reduction

The way that Hugging Face Transformer models plug into FiftyOne datasets for embedding computations also makes them directly applicable for dataset-wide computations that utilize embeddings. One such application is dimensionality reduction. By embedding our images (or patches) and then reducing the embeddings down to 2D with t-SNE, UMAP, or PCA, we can visually inspect hidden structure in our data, and interact with our data in new ways.

In FiftyOne, dimensionality reduction is performed via the FiftyOne Brain's compute_visualization() method, which has built-in support for t-SNE, UMAP, and PCA.

Just pass any Hugging Face transformers model that exposes image embeddings — either via last_hidden_state or get_image_features() — into the method, along with:

a brain_key specifying where to save the results, and
the dimensionality reduction technique to use
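The call described above can be sketched as follows. `visualize_embeddings` is a hypothetical helper, the brain key "clip_umap" is an assumption, and `model` is any already-loaded transformers model that exposes image embeddings (for example a CLIP model). Note that the UMAP method needs the `umap-learn` package installed.

```python
def visualize_embeddings(dataset, model, method="umap", brain_key="clip_umap"):
    """Embed `dataset` with `model` and reduce the embeddings to 2D."""
    # Imported lazily so that defining this helper has no side effects
    import fiftyone.brain as fob

    # `method` selects the dimensionality reduction technique
    # ("umap", "tsne", or "pca"); results are saved under `brain_key`
    return fob.compute_visualization(
        dataset,
        model=model,
        method=method,
        brain_key=brain_key,
    )
```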

Then you can visualize the dimensionally-reduced embeddings along with samples in the app.

This is a great way to compare embedding models and dimensionality reduction techniques!

Searching by Similarity

Another dataset-level application of embeddings is indexing unstructured or semi-structured data. In FiftyOne, this is accomplished via the FiftyOne Brain's compute_similarity() method — and Hugging Face transformers models directly plug into these workflows as well!

Just pass the transformers model directly into the compute_similarity() call, and you will be able to query your dataset to find similar images.
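As a sketch, building an index and querying it with the first sample might look like this. The helper name `most_similar_to_first`, the brain key "img_sim", and the choice of `k=25` are assumptions; `model` is an already-loaded transformers model that exposes image embeddings.

```python
def most_similar_to_first(dataset, model, brain_key="img_sim", k=25):
    """Index `dataset` by similarity and return the k nearest samples
    to its first sample."""
    # Imported lazily so that defining this helper has no side effects
    import fiftyone.brain as fob

    # Build (and save) a similarity index over the image embeddings
    fob.compute_similarity(dataset, model=model, brain_key=brain_key)

    # Use the first sample's ID as the query image
    query_id = dataset.first().id
    return dataset.sort_by_similarity(query_id, k=k, brain_key=brain_key)
```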

💡 You can also create a similarity index over object patches in the dataset by passing the name of the field containing the object patches in with the patches_field argument.

If you want to semantically search your images with natural language, you can leverage a multimodal Transformer model that exposes both image and text features. To enable natural language querying, pass in the model type for the model argument, along with the name_or_path for the model via model_kwargs.
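A natural language search could then be sketched as below. `search_by_text` is a hypothetical helper, and the CLIP checkpoint "openai/clip-vit-base-patch32", the brain key "text_sim", and `k=25` are assumptions.

```python
def search_by_text(dataset, prompt,
                   name_or_path="openai/clip-vit-base-patch32",
                   brain_key="text_sim", k=25):
    """Index `dataset` with a multimodal model and query it with text."""
    # Imported lazily so that no weights are downloaded until this is called
    import fiftyone.brain as fob

    # Passing the model type (rather than a model instance) plus the
    # checkpoint via model_kwargs enables text queries against the index
    fob.compute_similarity(
        dataset,
        model="zero-shot-classification-transformer-torch",
        model_kwargs={"name_or_path": name_or_path},
        brain_key=brain_key,
    )

    # The query is a text prompt instead of a sample ID
    return dataset.sort_by_similarity(prompt, k=k, brain_key=brain_key)
```

For example, `search_by_text(dataset, "kites flying in the sky")` would return a view of the 25 images judged closest to that prompt.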

💡 For larger datasets, you can index the data using a purpose-built vector search engine — check out our native integrations with Pinecone, Qdrant, Milvus, LanceDB, MongoDB, and Redis!

Conclusion

Transformer models have become a mainstay for those of us working in computer vision or multimodal machine learning, and their impact only appears to be increasing. With the variety and versatility of Transformer models at an all-time high, seamlessly connecting these models with computer vision datasets is absolutely critical.

I hope this integration between FiftyOne and Hugging Face Transformers helps you reduce boilerplate, readily compare and contrast model checkpoints and architectures, and understand both your data and models better!
