When humans encounter optical illusions, our brains often see things that aren’t physically present in the image. This perceptual phenomenon, known as pareidolia, has long fascinated neuroscientists and psychologists. Now, researchers are turning these visual puzzles toward Vision-Language Models (VLM) to test their perceptual capabilities.

I recently came across a paper, Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions, introducing a novel task called Illusory VQA.

The core challenge presented in the Illusory VQA task is deceptively complex: given an image containing both a “Real Concept” (RC) and potentially an “Illusory Concept” (IC), can a VLM detect if an illusion is present and correctly answer questions about that illusory element?

This task requires perception beyond standard image recognition and assesses how well models can mimic human-like visual understanding. It is interestingly challenging because the model must simultaneously recognize what’s in the image while also perceiving what appears to be there due to the illusion—much like our visual system does.

The Illusory Datasets

For this task, the authors created four benchmark datasets, each targeting different aspects of visual illusion processing:

IllusionMNIST: Built using the classic MNIST handwritten digit dataset as the source material, this dataset contains 3,960 training samples and 1,219 test samples. The researchers added a “No illusion” class to make the task more challenging, requiring models to determine whether an illusion is present.
IllusionFashionMNIST: Based on the Fashion-MNIST dataset, which contains clothing items rather than digits, this collection includes 3,300 training samples and 1,267 test samples. Like its MNIST counterpart, it includes a “No illusion” class to test discrimination abilities further.
IllusionAnimals: This dataset features animal images generated using SDXL-Lightning and transformed with ControlNet to create illusory versions. It comprises 3,300 training and 1,100 test samples, with the additional “No illusion” class.
IllusionChar: This unique dataset focuses on reading characters in images, with 3 to 5 characters per image sequence. It includes 9,900 training and 3,300 test samples to test how well models can interpret text within illusory contexts.

What I found particularly interesting is how these datasets were constructed:

The research team:

Generated scene descriptions using large language models
Combined these descriptions with raw images
Used a variant of ControlNet to create the final illusory images
Conducted human evaluations to validate the quality of the generated images
Asked participants to identify what they perceived in each picture
Filtered out inappropriate content using NSFW detectors

This approach ensures that the illusions in the datasets genuinely challenge perceptual abilities in ways that mirror human visual processing.

Testing Leading Multimodal Models

The study evaluated several state-of-the-art models:

The research team focused on zero-shot performance (how well models perform without specific training on illusions) and performance after fine-tuning.

All models showed a performance drop when dealing with illusions compared to standard images — mirroring the human experience of being “fooled” by optical illusions. Different models demonstrated varying levels of robustness to different illusions, suggesting that architectural differences influence how these systems process visual information.

A Simple Yet Effective Solution

An interesting finding from the research is their straightforward solution for improving model performance on illusory images. The technique:

Apply a Gaussian and blur low-pass filter to the illusory images.
Convert the images to grayscale.

This simple preprocessing approach yielded significant performance improvements across all tested models.

For example, in the IllusionAnimals dataset:

CLIP initially showed the highest performance on illusory images.
After applying the filter, BLIP-2 achieved the best results — even outperforming humans.

What We’re Going to Do in This Tutorial

This tutorial will explore the IllusionAnimals dataset and evaluate how different AI models perceive visual illusions. We’ll:

1. Load and explore the IllusionAnimals dataset using FiftyOne to see if we can reproduce the paper’s results but only focus on the CLIP model.

2. Compute embeddings using multiple models:

CLIP
SigLIP 2 (a new model released by Google)
AIMv2 (in my opinion, a highly slept-on contender to CLIP, released in late 2024 by Apple)

3. Visualize these embeddings using UMAP dimensionality reduction

4. Perform zero-shot classification using the models mentioned above

5. Test Visual Question-Answering (VQA) capabilities using:

Janus-Pro
Moondream2
Compare how models perform with and without hints about potential illusions.

💻 Note: You can access the notebook on GitHub

Let’s start by installing some dependencies and downloading the dataset from the Hugging Face Hub.

Let’s start with a visual vibe check in this dataset.

Note this is a Grouped Dataset. Grouped datasets allow us to represent multiple slices of the same data point. This way, data from multiple perspectives of the same scene can be stored, visualized, and queried in ways that respect the relationships between the slices of data.

For the IllusionAnimals subset, the dataset includes different slices representing variations of the images with and without illusions, as well as with and without filters. The specific slices available in the IllusionAnimals dataset are:

Raw Images: These are the original images of animals without any illusions applied. They serve as a baseline for evaluating the models’ performance on standard image recognition tasks. The models should accurately identify the animal in the image.
Illusory Images: These images incorporate visual illusions. The illusions are designed to make the images appear as one animal while subtly containing elements of another. The goal is to test whether the models can detect the presence of the illusory concept, even with the presence of the real concept.
Filtered Images: These are the illusory images processed with a Gaussian and blur low-pass filter. This filter is applied to enhance the models’ ability to detect the illusions. The idea is that the filter helps to reduce noise and highlight the illusory elements, making it easier for the models to identify and interpret the content. Applying the filter generally improves model performance.
Illusionless Class: In addition to the above, an extra class called “illusionless” is added to enhance the models’ capabilities. This class enables the models to detect instances where no illusion images are present in the picture.

In this tutorial, we’ll only work with two slices: “main” (the illusion and no illusion images) and “filtered” (the images after the Gaussian Blur and grayscaling).

FiftyOne datasets are logical datasets pointing to media files on disk rather than storing the media file contents directly. So, by cloning the dataset, we are not duplicating the images on disk, only the schema.

Note that when I originally parsed the dataset, I mapped the images without illusion to the illusionless class. To be consistent with the paper, I will map these to the no illusion class. I was too lazy to reparse the dataset, but luckily, this is easy to do in FiftyOne using the map_labels method of the dataset.

We’ll also need the labels so that we can grab them now. It doesn’t matter which slice we grab them from, as they are both the same:

Using Embeddings for Deeper Dataset Understanding

The first thing I want to do is gain a deeper understanding of the images in this dataset; we can use embeddings to do that.

Visual embeddings are high-dimensional vector representations of images that capture semantic and visual features. For the IllusionAnimals dataset, embeddings are particularly valuable because they can help us:

Visualize Relationships: Reducing these high-dimensional embeddings to 2D using UMAP helps visualize how different images cluster together and potentially identify patterns in the dataset.
Compare Model Perspectives: Different models may encode visual information differently. By comparing embeddings from multiple models (SigLIP and AIM-v2 in our case), we can understand how their “perception” of illusions differs.
Analyze Illusion Effects: We can examine whether illusory versions of images cluster closer to their “real” concept or their “illusory” concept, giving us insights into how effectively the illusions work from a model’s perspective.

For this analysis, we’ll use three models:

CLIP
SigLIP 2
AIM-v2

Let’s start by instantiating the models and then computing embeddings.

CLIP

Much has been written about the CLIP model, so I won’t repeat anything here. However, if you’re interested in going deep into CLIP and its history, then check out this blog.

We can use the model via FiftyOne’s integration with Hugging Face. In the paper, the authors use CLIP-ViT-base-patch32 in their experiments, which is the checkpoint we will also use. We can instantiate the model with classes, but it will not affect the embeddings.

SigLIP 2

SigLIP 2 is a family of multilingual vision-language encoders that improves upon the original SigLIP model. It incorporates techniques, including captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation, into a unified training recipe.

I don’t want to get too deep into the technical details of the SigLIP 2 model; feel free to check out the paper for more information about the model, its performance, and how it was trained.

SigLIP 2 Data Curation

The SigLIP 2 model uses Active Curation as Implicit Distillation (ACID) to enhance training efficiency:

Teacher-Student Framework:

The smaller student model learns from the pre-trained teacher model.
The teacher is a fine-tuned SigLIP 2 So400m model.

2. “Learnability” Scoring System:

Training examples were scored by comparing loss values between students and teacher.
High scores given to examples are easy for the teacher but challenging for student.

3. Strategic Sample Selection:

Training batches were chosen using two criteria:
Easy-reference scoring: prioritizes batches the teacher handles well
Learnability scoring: prioritizes batches with the greatest teacher-student performance gap
This approach implicitly distills teacher knowledge through data selection

4. Filtering Implementation:

The filtering ratio balances curation benefits with computational costs.
Example: 0.5 ratio means selecting the best 50% from double-sized super-batch

5. Computational Efficiency:

Knowledge transfers from teacher to student without explicit distillation mechanisms
Focuses training on most valuable examples, improving performance while saving resources

This method ensures smaller models benefit from high-quality, diverse data without the full computational burden of larger models.

I’m using this particular checkpoint because I’m trying to be as “apples-to-apples” with the CLIP model as possible. And, of course, these are two completely different model architectures trained on completely different datasets. For this tutorial, “apples-to-apples” means picking a model that uses ViT/B 32 as the vision encoder…that’s good enough for us.

AIMv2

AIMv2 is a family of vision encoders released in late 2024 that uses a novel multimodal autoregressive method.

It processes image patches and text tokens as a unified sequence, using a causal multimodal decoder to predict elements sequentially. AIMv2 processes data as one continuous sequence, predicting the next step in the series. It deliberately puts image information first, followed by text, creating a specific sequence: image patches → text tokens. This differs from CLIP’s parallel processing of images and text and strengthens the vision encoder. AIMv2 is trained on 12 billion image-text samples, balancing human-written alt-text and synthetically generated captions from diverse sources. I’ve written about the AIMv2 models in great detail in two blog posts, which you can read here and here.

To use AIMv2 for embeddings, we need to install a plugin.

We’ll need to set an environment variable as well:

You’ll also need to start a delegated service by running fiftyone delegated launch in the terminal.

Exploring Embeddings

Now, we can use UMAP to reduce the dimensionality of the embeddings and explore them in the FiftyOne app.

Looking at the embeddings between filtered and non-filtered images, it’s clear that Gaussian blur and grayscale preprocessing create more effective image embeddings:

Improved Clustering: Filtered images form tighter, more distinct clusters in AIMv2 and CLIP embedding spaces (SigLIP 2 remains an exception with uniform distribution).
Reduced Noise: Preprocessing reduces irrelevant visual variations, creating cleaner representations.
Clearer Class Boundaries: Class distinctions become more defined, especially for “no illusion” categories.
Better Conceptual Relationships: Filtered embeddings capture more logical spatial relationships between real and illusory concepts.

Computing Uniqueness Values

We can use the embeddings to compute uniqueness values for the images in our dataset.

We can use the compute_uniqueness method from the FiftyOne Brain measures how dissimilar each sample is from its neighbours in an embedding space. It finds each sample’s three nearest neighbours, weights their distances (60%-30%-10%), and normalizes these weighted distances to produce scores between 0-1. Higher scores indicate samples more “isolated” or distinct in the embedding space, while lower scores indicate samples with many close neighbours.

I’ll compute these values for the CLIP embeddings on the Illusion Animals subset and leave computing uniqueness values for the other embeddings to the reader:

You can then filter on the uniqueness values to see the most and least unique images in the dataset:

Reproducing CLIP Results from the Paper

We’ve already instantiated the CLIP model above. To use it for zero-shot classification, we can make use of the apply_model method of the dataset:

We can use the Model Evaluation panel in the FiftyOne app to see the model performance:

We can also print the classification reports programmatically, as shown below:

The paper reported an accuracy of 42.64 for the illusion images and 85.45 for the non-illusion images.

I’m not able to reproduce their results, and I suspect that’s because of how they implement their inference function:

What their code does is:

Feature Extraction vs End-to-End

Paper implementation explicitly extracts features using model.extract_features() and then computes similarities manually
The standard implementation uses the model’s built-in forward pass (model(**inputs)) which handles this internally

2. Temperature Scaling

The paper implementation uses a custom temperature value of 0.01: sims = (image_features @ text_features.t())[0] / 0.01
The standard implementation uses CLIP’s default temperature scaling built into the model

3. Feature Projection

The paper specifically uses projected embeddings: image_features = clip_features.image_embeds_proj
Standard implementation lets the model handle the projection internally

These differences, particularly the custom temperature value of 0.01, likely explain why we couldn’t exactly reproduce their results. The temperature parameter significantly affects how “sharp” or “soft” the probability distribution becomes after softmax — a lower value like 0.01 makes the model more confident in its predictions than CLIP’s default temperature.

Testing SigLIP 2 and AIMv2

We can use the AIMv2 model for zero-shot classification directly with FiftyOne’s integration with Hugging Face (it’s just the embeddings for which we need a plugin). Let’s go ahead and instantiate the model:

And apply it to our datasets:

Now for the filtered images:

Now, let’s apply our already instantiated SigLIP 2 model to our datasets as well:

And for the filtered images:

Summary of Findings

I encourage you to dig into the results yourself, and if you find anything interesting, please reach out.

Given this isn’t a research paper, and we’ve already covered a lot of ground, I’ll just share my high-level observation: CLIP is crushing it! I had high hopes for the SigLIP 2 model, but it doesn’t perform as well on this particular task as CLIP or AIMv2. To be fair, I’m a huge AIMv2 fanboy…so I was hoping it would beat both models, but it let me down here.

Can VLMs Do Better?

Now, let’s test VLMs to see how well they perform. I’m using two models, Moondream2 and Janus-Pro. These are implemented via plugins for FiftyOne. You can read more about the plugins here and here.

Start by downloading the plugins and installing their dependencies:

Next, we need to instantiate them via operators.

In Appendix G of the paper, the authors outline the prompt they use for VLMs. I’ve created a modified version of each:

Running the VLMs Using the “No Hint” Prompt

Just like above, you will need to have a delegated service running. To do that, just open the terminal and run fiftyone delegated launch.

For this blog, I will only run the “No Hint Prompt” using the Janus Pro and Moondream2 models on the datasets. I leave the other permutations to the reader as a next step.

We’ll need to parse the results as FiftyOne Classification types so we can run the evaluation methods, I’ll be generous to the VLMs and strip leading and trailing white space:

My initial impression is that both VLMs perform poorly in this task, and the model comparison panel confirms that as well.

Neither Janus Pro nor Moondream2 gets any of the illusion animal classifications correct. Janus doesn’t stick to the prompt that I provide and gives more than a one-word answer (it tends to present “Class:” or “Class of” to classifications, which is not what I asked for). Both models also assign classifications for classes that don’t exist. For example, there are samples with classes assigned to zebra, deer, lion, etc.

Overall, Janus Pro performs worse than the zero-shot classification models we tested before. However, Moondream2 does an excellent job. Let’s look at the results in more detail:

Moondream2 does an excellent job of classifying the filtered images. It is close to the highest-performing model from the paper, BLIP 2.

What I Discovered

Reproducing results is harder than it seems:

My attempts to match the paper’s CLIP results revealed critical implementation details that were not mentioned (especially the 0.01 temperature setting).
This highlights how seemingly small implementation choices dramatically impact model performance.

2. Model hierarchy isn’t what I expected:

Despite being a self-proclaimed AIMv2 fan, I was surprised to see CLIP outperforming AIMv2 and SigLIP 2.
The newest models aren’t automatically the best for specialized perceptual tasks.

3. VLM performance varies dramatically:

Janus Pro failed, with 0% accuracy on unfiltered images
Moondream2 achieved an impressive 90.2% accuracy on filtered images, approaching BLIP-2’s benchmark-leading performance

4. Embedding visualization proves illuminating:

The UMAP visualizations demonstrate how filtering creates more meaningful semantic spaces.
This explains why models perform better after preprocessing.

Why This Matters

Though IllusionAnimals might seem like a toy benchmark, it tests something fundamental: whether our AI systems see the world as we do. Visual illusions expose the gap between human and machine perception that standard benchmarks miss.

If an AI system can’t properly process visual illusions — something human brains handle naturally (though imperfectly) — how can we trust it to make critical decisions based on visual input in autonomous driving, medical diagnosis, or surveillance?

This investigation demonstrates that simple preprocessing techniques can dramatically improve AI perception of challenging visual content. As we deploy these systems in increasingly complex environments, understanding and addressing their perceptual limitations becomes academically interesting and practically essential.

Talk to a computer vision expert