Memes Are the VLM Benchmark We Deserve
February 20, 2025 – Written by Harpreet Sahota

The idea for this project came to me when the Janus Pro model from DeepSeek AI dropped.
I was reading through the paper, and in my hyped frame of mind, I thought I read that the model excels at MEME Perception…what the paper actually said was that the model excels at MME Perception. It turns out that MME Perception evaluates perception across a total of 10 subtasks spanning coarse-grained recognition, fine-grained recognition, and OCR…which actually isn't all that different from meme perception.
Memes are, arguably, the best proving ground for Vision Language Models (VLMs) because they combine multiple aspects of vision and language understanding:
- Memes require understanding both visual elements and text and how they interact.
- They often rely on shared cultural knowledge or references.
- Humour often emerges from subtle interactions between images and text.
- They appear in different formats, templates, and styles.
- Text can appear in various fonts, sizes, and positions.
In this highly rigorous, academic, conference-quality blog post — or, at the very least, a small project I hope you find entertaining — we will see how well VLMs perform on meme understanding.
But on a half-serious note, if you want to collaborate with me on the MEME Bench (Multimodal Evaluation of Memetic Erudition), hit me up. Let’s make a benchmark, write a paper, and submit it to ECCV/NeurIPS (or whatever conference is needed for posters). If you end up stealing my idea after reading this blog…then at least throw a citation my way!
I’ll put two VLMs (Janus-Pro and Moondream2) through their paces on distinct tasks:
- OCR: Can they accurately extract text from memes?
- Meme understanding: Can they explain what makes a meme funny and relevant?
- Fine-grained visual cues: I'll also test their attention to detail by seeing if they can spot subtle watermarks, giving us insight into their visual processing capabilities.
- Caption generation: Can they generate contextually appropriate, humorous captions?
Note: You can find the repo for this blog here.
We'll start by setting up our environment and downloading the necessary plugins:
!pip install fiftyone
I’ve created plugins which allow you to easily use 🌔Moondream2 and 🐋Janus-Pro with your FiftyOne dataset.
Let's download the plugins and install their dependencies.
The plugin framework lets you extend and customize the functionality of FiftyOne to suit your needs. If you’re interested in learning more about plugins, you might be interested in attending one of our monthly workshops. You can see the full schedule here and look for the Advanced Computer Vision Data Curation and Model Evaluation workshop.
!fiftyone plugins download https://github.com/harpreetsahota204/janus-vqa-fiftyone
!fiftyone plugins requirements @harpreetsahota/janus_vqa --install
!fiftyone plugins download https://github.com/harpreetsahota204/moondream2-plugin
!fiftyone plugins requirements @harpreetsahota/moondream2 --install
We also need to set an environment variable.
import os

os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'
I found this website — scott.ai from Scott Penberthy — which had some awesome machine learning memes on it. I parsed these memes into a FiftyOne dataset. You can download the dataset from Hugging Face as well. I also recommend checking out the Voxel51 org on Hugging Face to see the other datasets we have uploaded.
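If you'd rather pull the dataset straight from Hugging Face, a minimal sketch looks like the following. Note that the repo id below is a placeholder I'm using for illustration — grab the real one from the dataset card on the Hub:

import fiftyone as fo
import fiftyone.utils.huggingface as fouh

# Placeholder repo id — replace with the actual ML memes dataset from the Hub
ml_memes_dataset = fouh.load_from_hub(
    "harpreetsahota/ml-memes",  # hypothetical; check the dataset card for the real id
    name="ml-memes",
    overwrite=True,
)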
Let’s quickly explore the dataset:
fo.launch_app(ml_memes_dataset)
Now, let’s instantiate our plugins as operators via the FiftyOne SDK.
Alternatively, you can use the app and fill out the operator form. Just hit the backtick button (`) to open the operator menu. Type in “Moondream” or “Janus” and click on it. You’ll be presented with a form to fill out, which takes the same information as what we will pass in via the SDK.
import fiftyone.operators as foo

janus_vqa = foo.get_operator("@harpreetsahota/janus_vqa/janus_vqa")
moondream = foo.get_operator("@harpreetsahota/moondream2/moondream")
Now let's kick off a delegated service by opening the terminal and running fiftyone delegated launch.
OCR
Optical Character Recognition (OCR) is a fundamental task in Computer Vision.
And memes, I think, make for a good test bed! They typically combine both visual elements and text. While traditional OCR systems are trained specifically for text extraction, it's interesting to see how well general-purpose Vision Language Models (VLMs) can perform this task.
Testing VLMs on OCR helps us understand:
- Their ability to perceive and accurately read text in various fonts, orientations, and styles common in memes.
- How well they can distinguish between text and visual elements.
- Their robustness in handling text integrated into images rather than presented as plain text.
Let’s test Janus Pro and Moondream2 on this task using the plugins we downloaded earlier.
First, let’s run Janus Pro:
QUESTION = "What does the text on this image say? Respond only with the text on the image and nothing else."

await janus_vqa(
    ml_memes_dataset,
    model_path="deepseek-ai/Janus-Pro-1B",
    question=QUESTION,
    question_field="ocr_questions",
    answer_field="janus_ocr",
    delegate=True
)
And now, Moondream2:
await moondream(
    ml_memes_dataset,
    revision="2025-01-09",
    operation="query",
    output_field="moondream_ocr",
    query_text=QUESTION,
    delegate=True
)
Since we don’t have ground truth annotations for the text in these memes, we’ll do a qualitative evaluation — a “visual vibe check” — of how well each model performs.
We can visually inspect the results in the FiftyOne App by comparing the model outputs against the actual meme images to assess the accuracy and completeness of text extraction.
fo.launch_app(ml_memes_dataset)
If we had ground truth annotations for the meme text, we could evaluate OCR quality using standard text similarity metrics.
Common approaches include Character Error Rate (CER) and Word Error Rate (WER), which measure the minimum number of character/word edits needed to transform the predicted text into the ground truth. We could also use exact match rates for strict evaluation or BLEU scores for more flexible matching that accounts for partial correctness. These metrics would give us quantitative measures of how accurately each VLM extracts text from memes, allowing for direct model comparisons and helping identify specific errors (like case sensitivity issues or problems with special characters).
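For illustration, here's a minimal sketch of how that evaluation could look with the jiwer library, assuming each sample had a hypothetical ground-truth gt_text field and that the model outputs are stored as plain strings:

import jiwer

# Hypothetical: assumes each sample has a ground-truth `gt_text` field
for sample in ml_memes_dataset:
    reference = sample["gt_text"]
    hypothesis = sample["janus_ocr"]

    # Word Error Rate and Character Error Rate against the reference text
    sample["janus_wer"] = jiwer.wer(reference, hypothesis)
    sample["janus_cer"] = jiwer.cer(reference, hypothesis)
    sample.save()

# Aggregate across the dataset, e.g. mean WER
print(ml_memes_dataset.mean("janus_wer"))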
But we don’t have ground truth, so we use a visual vibe check, which is highly subjective.
🏆 Based on my initial vibe check, I’m giving this to Moondream2. If you feel otherwise, then leave a comment below!
Meme understanding
Understanding memes is more complex than OCR because there are multiple levels of comprehension:
- Recognizing the scene, characters, and their expressions
- Understanding the reference or template being used
- Connecting how the text relates to the visual elements
- Grasping why the combination is meant to be humorous
This means we can test a VLM’s ability to:
- Integrate multimodal information (text and visuals)
- Understand cultural references and context.
- Grasp abstract concepts and humour.
- Explain complex social/cultural phenomena in natural language.
Let’s see how our models handle this deeper level of understanding. It’s the same pattern as above but a different prompt. An interesting point of comparison and experimentation that I leave to the reader is assessing the impact of prompts on model performance.
First, we can run Janus:
MEME_UNDERSTANDING_QUESTION = """This image is a meme. Describe the scene of the meme, its characters, what they are saying, and what the target audience of this meme might find funny about it."""

await janus_vqa(
    ml_memes_dataset,
    model_path="deepseek-ai/Janus-Pro-1B",
    question=MEME_UNDERSTANDING_QUESTION,
    question_field="meme_understanding_question",
    answer_field="janus_meme_understanding",
    delegate=True
)
Next, we can run Moondream:
await moondream(
    ml_memes_dataset,
    revision="2025-01-09",
    operation="query",
    output_field="moondream_meme_understanding",
    query_text=MEME_UNDERSTANDING_QUESTION,
    delegate=True
)
fo.launch_app(ml_memes_dataset)
Again, no ground truth here.
But if we were to build out the actual MEME benchmark, creating ground truth annotations for meme understanding would require human experts to write detailed explanations for each meme. I don’t know who these human experts are, but a good place to start is sourcing meme lords from Reddit and asking them to pitch in for the cause.
These ground truth annotations would cover key aspects like the visual scene, the reference being made, and why it’s humorous. With these expert-written explanations as ground truth, we could evaluate VLM responses using natural language similarity metrics like ROUGE or BERTScore, which can capture semantic similarity beyond exact word matches. We might also use structured templates for the ground truth annotations (e.g., separate fields for scene description, cultural reference, and humour explanation) to enable a more fine-grained evaluation of how well VLMs understand each component of meme interpretation.
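As a rough sketch of what that scoring could look like — assuming a hypothetical expert_explanation field holding the expert-written ground truth — ROUGE-L can be computed with the rouge-score package:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

# Hypothetical: assumes each sample has an expert-written `expert_explanation` field
for sample in ml_memes_dataset:
    scores = scorer.score(
        target=sample["expert_explanation"],
        prediction=sample["janus_meme_understanding"],
    )
    sample["janus_rougeL_f1"] = scores["rougeL"].fmeasure
    sample.save()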
🏆 Based on my initial vibe check, I’m giving this to Janus Pro. If you feel otherwise, then leave a comment below!
Can the VLMs find the attribution tag?
Each meme has a small attribution in the left corner, which reads @scott.ai. This presents an interesting test case for VLMs’ visual capabilities because:
- The attribution is intentionally subtle — a small watermark that could be easily missed even by human viewers.
- It tests the model’s ability to detect and read fine details in images.
- It evaluates whether VLMs can distinguish between the main meme content and metadata like attributions.
- It helps us understand if VLMs can maintain attention to small details while processing the broader image.
This kind of test is particularly relevant for real-world applications where models might need to:
- Detect watermarks or copyright information.
- Read small print or disclaimers.
- Identify subtle branding elements.
Let’s see if the VLMs can pick up on this subtle detail, starting with Janus Pro:
ATTR_QUESTION = """The creator of this meme has tagged themselves for self-attribution. Who can we attribute as the creator of this meme? Respond with just the author's name"""

await janus_vqa(
    ml_memes_dataset,
    model_path="deepseek-ai/Janus-Pro-1B",
    question=ATTR_QUESTION,
    question_field="attr_question",
    answer_field="janus_attr",
    delegate=True
)
Next, Moondream2:
await moondream(
    ml_memes_dataset,
    revision="2025-01-09",
    operation="query",
    output_field="moondream_attr",
    query_text=ATTR_QUESTION,
    delegate=True
)
fo.launch_app(ml_memes_dataset)
If we were to build out the official MEME Bench, we could treat this as a simple classification task where the model either correctly identifies the attribution or doesn’t.
Performance could be measured using standard binary classification metrics like accuracy, precision, and recall. Additionally, we might want to track false positives where models “hallucinate” attributions that aren’t present, as this gives insight into their reliability for detecting fine-grained details. Since the task involves extracting specific text, we could also use fuzzy string matching to account for minor variations in how models might format their responses (e.g., “scott.ai” vs “@scott.ai” vs “www.scott.ai“).
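As a rough sketch — using Python's built-in difflib and treating anything above an arbitrary similarity threshold as a correct attribution — the scoring could look like this:

from difflib import SequenceMatcher

def is_correct_attribution(response, target="scott.ai", threshold=0.8):
    """Fuzzy-match a model response against the expected attribution."""
    # Normalize common variations like "@scott.ai" or "www.scott.ai"
    cleaned = response.lower().strip().lstrip("@").removeprefix("www.")
    return SequenceMatcher(None, cleaned, target).ratio() >= threshold

# Per-model accuracy over the dataset
correct = sum(
    is_correct_attribution(sample["moondream_attr"] or "")
    for sample in ml_memes_dataset
)
print(f"Moondream2 attribution accuracy: {correct / len(ml_memes_dataset):.2%}")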
🏆 Based on my initial vibe check, the clear winner here is Moondream2. If you feel otherwise, leave a comment below, but I think you’d be hard pressed to counter me.
Meme Captioning
Meme captioning is a generative task that’s distinct from our previous experiments:
- OCR extracts text that already exists in the image.
- Meme understanding interprets the combined meaning.
- Captioning requires the model to create novel, contextually appropriate text.
This is challenging because good meme captions:
- Match the visual template’s intended use.
- Are culturally relevant to the target audience (in our case, the ML/AI community)
- Strike the right balance of humour and relatability.
- Follow an established format of the meme template.
You could use metrics like BLEU or ROUGE to evaluate captions against references, but they often miss the aspects of humour and cultural relevance. Like the previous tasks, a qualitative “vibe check” is probably the most reliable way to assess the quality of the captions.
Let's download another dataset (a grouped dataset with a captioned and an un-captioned slice for each meme…we'll work only with the un-captioned slice) and then see what our models generate.
import fiftyone.utils.huggingface as fouh

memes_dataset = fouh.load_from_hub(
    "harpreetsahota/memes-dataset",
    name="meme-captioning",
    overwrite=True
)

uncaptioned_memes = memes_dataset.select_group_slices("template")
uncaptioned_memes = uncaptioned_memes.clone(name="vlm-captioned-memes")
fo.launch_app(uncaptioned_memes)
We can follow the same pattern as before, starting with Janus Pro:
MEME_GENERATE = """This image is a meme. Write a caption for this meme related to deep learning and artificial intelligence. Respond only with the caption and nothing else."""

await janus_vqa(
    uncaptioned_memes,
    model_path="deepseek-ai/Janus-Pro-1B",
    question=MEME_GENERATE,
    question_field="caption_prompt",
    answer_field="janus_caption",
    delegate=True
)
And for Moondream2:
await moondream(
    uncaptioned_memes,
    revision="2025-01-09",
    operation="query",
    query_text=MEME_GENERATE,
    output_field="moondream_caption",
    delegate=True
)
fo.launch_app(uncaptioned_memes)
Like LMSYS’s Chatbot Arena, we could create a “Meme Arena” to evaluate caption generation.
Users would be shown the same meme template with different VLM-generated captions (without knowing which VLM produced each) and vote on which one is funnier or more contextually appropriate. This crowd-sourced evaluation would naturally capture subjective aspects like humour and cultural relevance that traditional metrics miss. We could use an Elo rating system or TrueSkill to maintain a dynamic leaderboard where VLMs gain or lose points based on head-to-head caption competitions.
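For the curious, a bare-bones Elo update for such head-to-head caption votes might look like the sketch below (the K-factor of 32 is just a common default, not a prescription):

def expected_score(rating_a, rating_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, a_won, k=32):
    """Return updated (rating_a, rating_b) after one caption face-off."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: both models start at 1500; a user votes for Janus Pro's caption
janus_elo, moondream_elo = update_elo(1500, 1500, a_won=True)
print(janus_elo, moondream_elo)  # 1516.0, 1484.0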
We could even add topic-constrained captioning to make the task more challenging and practical.
Users could specify domains like “quantum computing,” “cybersecurity,” or “data cleaning,” and VLMs would need to generate captions that not only fit the meme template but also cleverly relate to that field. This tests the model’s understanding of the meme format and its ability to draw meaningful connections to specific domains. The evaluation arena could include domain experts voting on humour and technical accuracy — did the model understand the field-specific concepts it referenced?
This would create a comprehensive multi-dimensional leaderboard where models might excel at certain domains but struggle with others, providing insight into their knowledge breadth and ability to create contextually appropriate humour across different technical spaces.
👨🏽‍⚖️ Based on my initial vibe check, both models failed at this task.
Conclusion
While one might argue I’ve spent an unreasonable amount of time thinking about how to scientifically evaluate AI’s understanding of internet jokes (and leadership at Voxel51 are probably questioning what I’m doing with my time right now), there’s a method to this meme madness.
Memes represent a unique intersection of visual understanding, cultural knowledge, and contextual humour, making them a surprisingly robust benchmark for VLM capabilities. The tasks we’ve explored — OCR, understanding, attribution detection, and contextual caption generation — test different aspects of these models’ abilities in ways that traditional benchmarks often miss.
The proposed “Meme Arena” with topic-specific challenges could provide valuable insights into how well VLMs can:
- Process fine-grained visual details.
- Understand and generate contextual humour.
- Connect abstract concepts across domains.
- Interpret cultural and technical references.
So, while I may have gone down a rabbit hole typically reserved for 3 AM thoughts, perhaps the real benchmark was the memes we made.
(I’ll see myself out now… 🚪)