Visual Understanding with AIMv2

February 11, 2025 – Written by Harpreet Sahota

Computer Vision

Move over, CLIP — you’ve been dethroned!

Source: AIMv2 technical blog

 

Released in late 2024, AIMv2 is a family of open-vision encoders that has quietly revolutionized multimodal learning yet has received surprisingly little fanfare given its capabilities.

What Is AIMv2?

AIMv2 is a family of pre-trained vision encoders that uses a novel multimodal autoregressive method. Its key innovation lies in treating image patches and text tokens as part of a unified sequence, using a causal multimodal decoder to predict elements sequentially. This approach provides dense supervision, as the model extracts training signals from every image patch and text token. AIMv2’s autoregressive modeling predicts each element (image patch or text token) based on all preceding elements in the sequence.

This unified approach means the model:

  • Processes both modalities in a single, coherent framework
  • Learns from dense supervision at every step
  • Develops rich contextual understanding across modalities
  • Achieves efficient training with fewer samples
  • Achieves better multimodal synergy

How AIMv2 Differs from CLIP

AIMv2 processes data as one continuous sequence, predicting the next element in the series. It deliberately puts image information first, followed by text, creating a specific sequence (image patches → text tokens):

  1. Image Prediction: It looks at the beginning of an image and tries to predict what comes next, patch by patch, like looking at the top half of a photo and trying to guess what’s in the bottom half.
  2. Text Prediction: After seeing the image, it predicts the caption one word at a time, like completing a sentence. If it sees “A dog playing in the”, it tries to predict “park.”

 

Source: Figure 1 from the AIMv2 Paper

 

This is critically important because:

  1. Order Matters: While it’s technically possible to process text first and then images, AIMv2 deliberately chooses to process images first. This allows the model to build a strong visual understanding before tackling the text.
  2. Visual Foundation: By processing image patches first, the model can access the complete visual context for every text prediction. Consider how you would describe a photo — you’d likely want to see the whole image before starting your description and not try to describe it while still uncovering parts of it.
  3. Unified Processing: Once ordered, everything is treated as one continuous sequence.

This sequential, image-first approach differs fundamentally from CLIP’s parallel processing of image and text and ensures that the visual context always fully informs the text generation.
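
To make the image-first, next-element objective concrete, here is a minimal conceptual sketch in PyTorch. Everything in it (module sizes, the single decoder, the unweighted sum of losses) is illustrative rather than AIMv2’s actual architecture; the point is simply that patches and tokens share one causal sequence and every position contributes a training signal:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Conceptual sketch only, not AIMv2's implementation
patch_dim, vocab_size, d_model = 768, 1000, 512

patch_proj = nn.Linear(patch_dim, d_model)      # embed raw image patches
text_embed = nn.Embedding(vocab_size, d_model)  # embed caption tokens
decoder = nn.TransformerEncoder(                # made causal via the mask below
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
    num_layers=2,
)
patch_head = nn.Linear(d_model, patch_dim)      # regress the next patch
text_head = nn.Linear(d_model, vocab_size)      # predict the next caption token

patches = torch.randn(1, 196, patch_dim)        # dummy image patches
tokens = torch.randint(0, vocab_size, (1, 20))  # dummy caption tokens
n_patches = patches.shape[1]

# Unified sequence: image patches first, then text tokens
x = torch.cat([patch_proj(patches), text_embed(tokens)], dim=1)
seq_len = x.shape[1]
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
h = decoder(x, mask=causal_mask)

# Dense supervision: every patch position predicts the next patch,
# and every later position predicts the next caption token
patch_loss = F.mse_loss(patch_head(h[:, : n_patches - 1]), patches[:, 1:])
text_loss = F.cross_entropy(
    text_head(h[:, n_patches - 1 : -1]).flatten(0, 1), tokens.flatten()
)
loss = patch_loss + text_loss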

In a nutshell, this is what CLIP does:

  • Takes a batch of images and their captions
  • Tries to match each image with its correct caption from the batch
  • Learns by trying to make matched pairs score high and mismatched pairs score low

Source: Figure 1 from the CLIP paper

 

The fundamental difference is that CLIP focuses on aligning existing image and text pairs, while AIMv2 tries to actively predict and reconstruct both the images and text. CLIP needs large batches to have enough positive and negative examples for its matching game, while AIMv2 can learn from each image-text pair independently since it extracts more information from each sample through its prediction task.
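
For contrast, here is an equally minimal sketch of CLIP-style contrastive matching (again illustrative, not CLIP’s actual code). Notice that the loss is only meaningful relative to the other pairs in the batch, which is why CLIP needs large batches:

import torch
import torch.nn.functional as F

# Sketch of the CLIP-style "matching game" with dummy embeddings
batch_size, dim = 8, 512
image_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_emb = F.normalize(torch.randn(batch_size, dim), dim=-1)

logits = image_emb @ text_emb.T / 0.07            # similarity of every image-text pair
targets = torch.arange(batch_size)                # matched pairs sit on the diagonal
loss = (F.cross_entropy(logits, targets) +        # image -> text direction
        F.cross_entropy(logits.T, targets)) / 2   # text -> image direction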

AIMv2’s architectural choice specifically aims to strengthen the vision encoder. The model must build robust visual representations to support image reconstruction and subsequent text prediction.

Technical Architecture

As mentioned, AIMv2 processes vision and text tokens symmetrically within a unified architecture, avoiding modality-specific prioritization.

This differs from models using cross-attention mechanisms. Furthermore, AIMv2 uses a prefix attention mask to constrain the self-attention mechanism within the vision encoder. The vision encoder and multimodal decoder incorporate SwiGLU activations and replace all normalization layers with RMSNorm.

But what does that even mean? It took me a while to understand the significance of this, and here’s my understanding:

  • Cross-Attention Mechanisms: Many models use cross-attention, where the image and text are processed separately but “look” at each other to find connections. AIMv2 avoids this, processing both symmetrically.
  • Prefix Attention Mask: AIMv2 uses a “prefix attention mask” that constrains the self-attention within the vision encoder.
  • Attention is a mechanism that allows the model to focus on the most relevant parts of the input when processing information.
  • Self-attention means that the model is attending to different parts of the same input (in this case, the image).
  • A mask prevents the model from attending to certain parts of the input. In AIMv2, a prefix attention mask is applied to the vision encoder.
  • Prefix attention helps the model focus on initial parts of the image (the “prefix”). This helps the model encode more informative contexts, even from partial images, which are then used by subsequent visual and textual tokens.
  • SwiGLU: The vision encoder and multimodal decoder incorporate the SwiGLU activation function to introduce non-linearity into the model. Why SwiGLU? Good question; it just works. According to the SwiGLU paper: “We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.”
  • RMSNorm: AIMv2 replaces all normalization layers with RMSNorm, which has proven effective in language modeling. Normalization layers help stabilize the training process and improve the model’s performance.
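
To ground those three ingredients, here are small, generic reference implementations of a prefix attention mask, SwiGLU, and RMSNorm. These follow the standard formulations from the literature and are not taken from AIMv2’s codebase:

import torch
import torch.nn as nn

def prefix_attention_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean mask (True = may attend): the prefix attends bidirectionally
    to itself, while later positions attend causally."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    mask[:prefix_len, :prefix_len] = True  # every prefix position sees the whole prefix
    return mask

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w2 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w3 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(nn.functional.silu(self.w1(x)) * self.w2(x))

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

print(prefix_attention_mask(seq_len=6, prefix_len=3).int())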

Training Data

The model is trained on 12 billion image-text samples, balancing human-written alt-text and synthetically generated captions from diverse sources.

The public DFN dataset contributes significantly with 1.9 billion alt-text pairs, and its private counterpart adds another 3.8 billion synthetic caption pairs. Both were sampled at a 30% probability during training. The public COYO dataset contributes 560 million alt-text pairs, sampled at 9% probability. Additionally, a proprietary dataset, the High Quality Image-Text Pairs (HQITP), provides 565 million alt-text pairs and 432 million synthetic caption pairs, sampled at 28% and 3% probability, respectively.

This sampling strategy ensures comprehensive coverage while prioritizing quality by assigning higher sampling rates to presumably more reliable sources.
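
Written out as a mixture, the sampling probabilities above look like this (a toy illustration of how the source of each training sample could be drawn):

import random

# The sampling mixture described above (source -> probability)
mixture = {
    "DFN alt-text": 0.30,
    "DFN synthetic captions": 0.30,
    "COYO alt-text": 0.09,
    "HQITP alt-text": 0.28,
    "HQITP synthetic captions": 0.03,
}

# Draw the sources of the next 5 training samples according to the mixture weights
sources = random.choices(list(mixture), weights=list(mixture.values()), k=5)
print(sources)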

The synthetic captions are generated based on the methodology by Lai et al., which uses a two-stage human-aligned captioning pipeline.

Source: Figure 5 from Lai et al.

 

Stage 1 transforms an MLLM into a customized captioner using high-quality human-annotated image-text pairs, OCR detection, and LLM-based summarization with strict prompting. Stage 2 fine-tunes the model using expert-annotated detailed captions and LLM-based reformatting under strict constraints.

With carefully balanced, high-quality data and a sophisticated two-stage captioning pipeline, AIMv2 achieves comprehensive coverage and reliability in its pre-training.

Practical Applications with FiftyOne

I’m excited to announce AIMv2’s integration into FiftyOne, enabling you to:

  • Extract insightful features from visual data
  • Visualize high-dimensional embeddings
  • Perform zero-shot classification on diverse datasets
  • Streamline multimodal analysis workflows

I’ll show you how you can get started with the model for zero-shot classification and feature extraction with just a few lines of code!

First, let’s install FiftyOne and a couple of dependencies, set an environment variable, and download the quickstart dataset.

!pip install fiftyone transformers umap-learn

import os

import fiftyone.zoo as foz

# Allow delegated operations to run locally
os.environ["FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS"] = "true"

# Download the quickstart dataset from the FiftyOne Dataset Zoo
dataset = foz.load_zoo_dataset("quickstart")

Next, you’ll need to install two plugins:

  1. Zero-shot prediction plugin
  2. AIMv2 embeddings plugin
!fiftyone plugins download https://github.com/jacobmarks/zero-shot-prediction-plugin

!fiftyone plugins download https://github.com/harpreetsahota204/aim-embeddings-plugin

The plugin framework lets you extend and customize the functionality of FiftyOne to suit your needs. If you’d like to learn more about plugins, you might be interested in attending one of our monthly workshops. You can see the full schedule here and look for the Advanced Computer Vision Data Curation and Model Evaluation workshop.

Let’s begin with embeddings.

Feature Extraction and Embedding Visualization with AIMv2 in FiftyOne

With a dataset and plugins downloaded, we’re ready to rock. First, let’s instantiate an operator:

import fiftyone.operators as foo

aim_embeddings = foo.get_operator("@harpreetsahota/aimv2_embeddings/compute_aimv2_embeddings")

The plugin supports two types of embeddings:

  1. Class Token Embedding (cls): A single embedding vector derived from the special classification token. This represents the global semantic context of an image.
  2. Mean Pooling Embedding (mean): An embedding vector computed by averaging the representations of all image patches. This captures distributed contextual information across the entire input.
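
To make the distinction concrete, here is a tensor-level illustration with dummy values (assuming the encoder output places a classification token at position 0, followed by the patch tokens):

import torch

# Dummy encoder output of shape (batch, 1 + num_patches, dim)
hidden_states = torch.randn(4, 1 + 256, 1024)

cls_embedding = hidden_states[:, 0, :]                 # "cls": the classification token
mean_embedding = hidden_states[:, 1:, :].mean(dim=1)   # "mean": average over patch tokens

print(cls_embedding.shape, mean_embedding.shape)  # both: torch.Size([4, 1024])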

You can also choose any model from the AIMv2 collection. I’ll assume that you’re running this in a Jupyter notebook, in which case you can run the model on your entire dataset as follows:

# Run the operator on your dataset
await aim_embeddings(
    dataset,
    model_name="apple/aimv2-large-patch14-224",  # Choose a supported model
    embedding_types="cls",  # Either "cls" or "mean"
    emb_field="aimv2_embeddings",  # Name of the field in which to store the embeddings
)

Alternatively, you can use the app and fill out the operator form. Just hit the backtick button (`) to open the operator menu. Type in “AIMv2” and click on it. You’ll see the following form, which you will fill out:

 

And that’s it! You’ve now computed embeddings for every sample in your dataset. Now you can visualize the embeddings via:

import fiftyone.brain as fob

# Compute a 2D UMAP visualization of the stored embeddings
results = fob.compute_visualization(
    dataset,
    embeddings="aimv2_embeddings",  # The field where the embeddings were stored
    method="umap",
    brain_key="aim_emb_viz",
    num_dims=2,
)

 

Below is an example of how you can interact with the embeddings once they’ve been computed:


Visualizing AIMv2 embeddings in the FiftyOne App

 

Zero-Shot Classification using AIMv2 in FiftyOne

Let’s instantiate the operator for zero-shot classification:

import fiftyone.operators as foo

zsc = foo.get_operator("@jacobmarks/zero_shot_prediction/zero_shot_classify")

Next, you will need to define a list of the classes that you are interested in:

list_of_classes = ["class1", "class2", …, "classn"]

Once you’ve done that, you can run the operator as follows. I’ll assume you are working in a notebook:

await zsc(
    dataset,
    labels=list_of_classes,
    model_name="AIMv2",
    label_field="AIMv2_predictions",
)

Alternatively, you can do this directly in the app. Just hit the backtick button (`) to open the operator menu. Type in “zero-shot classification” and click on it. You’ll see the following form, which you will fill out:

 

You can upload a .txt file with your classes (one class per line) or pass in a comma-separated list of classes you’re interested in (you don’t need to wrap the strings in quotes).
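
If you keep your label set in a .txt file, a quick way to build list_of_classes for the notebook workflow is the snippet below (classes.txt is a hypothetical filename):

# Build the class list from a text file with one class per line
with open("classes.txt") as f:
    list_of_classes = [line.strip() for line in f if line.strip()]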

And just like that, you’ve used AIMv2 for zero-shot classification!

Conclusion

AIMv2 offers a fresh perspective on how visual and textual information can be processed together.

Its autoregressive approach, treating both modalities as a unified sequence, addresses the fundamental limitations of previous models like CLIP. By processing image patches first and building a strong visual foundation before text generation, AIMv2 achieves more nuanced understanding and efficient training.

The model’s integration with FiftyOne makes it accessible to practitioners, enabling straightforward implementation of feature extraction and zero-shot classification tasks. Whether you’re working with embeddings for visualization or performing zero-shot predictions, AIMv2’s capabilities are now just a few lines of code away.

AIMv2 will play an increasingly important role in computer vision applications. Its ability to learn effectively from smaller datasets while maintaining robust performance makes it valuable for real-world applications where data efficiency matters.

While CLIP has dominated the multimodal landscape for years, AIMv2 opens a new chapter in visual-language learning — and seriously, why isn’t anyone reading this chapter ‘cuz this model is hella good! — one in which unified sequence processing and autoregressive prediction may become the new standard.