Composed Image Retrieval at CVPR 2025
Jun 2, 2025
35 min read

The evolution of image search

If you’ve ever struggled to find the perfect car online, thinking, “I want this exact model, but in blue,” you’ve encountered the limitations of traditional image searches.
While standard search lets you look for “blue sedans” or find similar vehicles to a reference image, it doesn’t easily combine these approaches. Enter Composed Image Retrieval (CIR), one of the coolest (and most useful) areas of Visual AI research showcased at CVPR 2025.

Why it matters

Beyond academic interest, CIR has transformative potential for e-commerce, creative applications, and everyday search experiences. Imagine finding products matching your preferences by saying “like this but
more formal” or “this couch in a different fabric.” It bridges the gap between how humans naturally communicate their visual preferences and how search systems operate.
In this blog post, I’ll highlight some of the interesting CIR research presented at CVPR 2025, showcasing how researchers are tackling these challenges and pushing the boundaries of what’s possible in visual search technology.

A primer on composed image retrieval

Composed Image Retrieval operates at the intersection of vision and language, allowing users to search using a multimodal query: a reference image combined with a text modification. The reference image provides the visual foundation, while the text specifies desired changes to particular attributes.
In formal terms, CIR involves three key elements:
  • A reference image that establishes the visual starting point
  • A modification text that specifies the desired changes (e.g., “make it red,” “with a sunroof,” “in the sport trim”)
  • A gallery of candidate images from which the system retrieves results
The task requires the system to understand what’s in the reference image and how the textual instruction should transform it. For example, if you show a sedan and ask for “the same model but as a convertible in red,” the system must:
  1. Parse the visual content of the sedan (make, model, features, design elements, etc.)
  2. Identify which attributes should change (body type, colour) and which should remain (make, model, other features)
  3. Apply these specific modifications conceptually
  4. Retrieve images that match this mental transformation
Under the hood, CIR systems typically learn an embedding function that maps the combination of reference image and modification text to the same vector space as potential target images. The mathematical goal is to make this composed representation similar to the true target images and dissimilar to irrelevant ones.
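To make this concrete, here is a minimal PyTorch sketch of that objective. The encoders are toy stand-ins (random tensors rather than a real pretrained vision-language model), and the fusion module and loss are illustrative, not any specific paper’s architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposedQueryEncoder(nn.Module):
    """Toy composition module: fuses a reference-image embedding with a
    modification-text embedding into a single query vector."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, ref_img_emb: torch.Tensor, mod_txt_emb: torch.Tensor) -> torch.Tensor:
        q = self.fuse(torch.cat([ref_img_emb, mod_txt_emb], dim=-1))
        return F.normalize(q, dim=-1)

def contrastive_loss(query_emb: torch.Tensor, target_emb: torch.Tensor, tau: float = 0.07):
    """In-batch contrastive loss: each composed query should be closest to
    its own target image and far from the other targets in the batch."""
    logits = query_emb @ F.normalize(target_emb, dim=-1).T / tau
    labels = torch.arange(len(query_emb))
    return F.cross_entropy(logits, labels)

# Dummy embeddings in place of real CLIP features (batch of 8, dim 512).
ref, mod, tgt = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
model = ComposedQueryEncoder()
loss = contrastive_loss(model(ref, mod), tgt)
print(loss.item())
```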
What makes CIR distinct from other retrieval approaches is its compositional nature. Instead of finding exact matches to a query, it performs a semantic transformation first, then searches based on the result of that transformation. This bridges the gap between how humans naturally communicate their visual preferences (“like this, but different in that specific way”) and how retrieval systems operate.

Beyond traditional search

In traditional retrieval, the system directly encodes text (“blue SUV”) or an image into a representation matched against a database.
The query itself represents what you want to find.
In CIR, the system must first understand the reference image’s attributes, interpret how the text modification should alter these attributes, and then combine this information to create a new representation that doesn’t directly match either input. This composed representation describes an image that may not even exist in the exact form imagined.
CIR does not match the raw multimodal query directly to gallery images. It processes (or “semantically transforms”) the reference image and modification text into a unified representation that embodies the desired target state.
It then uses this transformed representation to retrieve images from the gallery that are semantically closest to that desired state.

Different embedding space operations

Traditional retrieval systems typically operate through:
  • Text-to-image matching: Text and images are projected into a shared embedding space where semantically similar items cluster together
  • Image-to-image matching: Finding database images closest to a query image in feature space
CIR requires more complex operations:
  • Attribute disentanglement: Breaking down the reference image into modifiable properties
  • Selective feature modification: Applying the text modification to only relevant dimensions of the image embedding
  • Feature preservation: Maintaining unmentioned attributes at their original values
  • Compositional reasoning: Understanding how multiple modifications might interact
For instance, when requesting “the same car but with a panoramic roof and in metallic gray,” the system must understand these are independent modifications that can co-exist, rather than treating it as a single atomic change.
These techniques go beyond traditional contrastive learning approaches in image-text retrieval systems like CLIP.

Two approaches to composed image retrieval

Composed Image Retrieval (CIR) research typically follows one of two distinct paths, each with its own philosophy about how these systems should learn and operate.
The traditional approach, Supervised CIR, relies heavily on carefully annotated training data.
These systems learn from triplets consisting of a reference image, modification text, and the corresponding target image that satisfies the modification. While this approach often yields impressive results, it comes with significant drawbacks. Creating these annotated datasets is extremely labour-intensive and expensive, inherently limiting the size and scope of what these models can learn. Supervised models excel within their training domains but may struggle with novel modifications or image types.
In contrast, Zero-shot CIR (ZS-CIR) takes a fundamentally different approach by eliminating the need for task-specific annotated triplets.
Instead, these systems leverage the knowledge embedded in large vision-language models pre-trained on vast amounts of general image-text data. Some ZS-CIR methods convert reference images into textual representations that can be combined with the modification text, while others generate synthetic training examples or cleverly combine existing pre-trained models without additional training. The key advantage is that these systems can often generalize to new domains and modification types without requiring domain-specific annotations.
What’s particularly exciting is how the gap between these approaches has begun to narrow.
While supervised methods historically held the performance edge, recent advances in large-scale vision-language models have enabled some zero-shot approaches to match or even exceed traditional supervised methods. This evolution suggests a future where CIR systems can offer strong performance and the flexibility to work across diverse domains without requiring extensive manual annotation for each new application.

Evaluation complexity

The evaluation of traditional retrieval systems is relatively straightforward: given a query, is the ground truth image ranked highly in the results?
CIR introduces multiple layers of complexity:
  • Multiple valid targets may exist for a single query
  • The degree of modification matters (how metallic should “metallic gray” be?)
  • Attribute preservation needs evaluation (did unmentioned attributes remain unchanged?)
  • The same modification applied to different reference images should produce consistent changes
Despite the complexity of CIR systems themselves, the evaluation frameworks are relatively straightforward, focusing on retrieval effectiveness through a few key metrics:

Primary metrics:

  • Recall@K (R@K): Measures how often the correct target appears in the top K results. Common reporting includes R@1, R@5, R@10, and R@50.
  • Recall_subset@K: Addresses the “false negative” problem in CIRR by evaluating against visually similar image subsets, focusing on the model’s ability to distinguish subtle text-specified differences.
  • Mean Average Precision@K (mAP@K): Essential for CIRCO evaluation where multiple ground-truth images may satisfy a single query.
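As a rough illustration of how these metrics are computed, here is a small NumPy sketch of Recall@K and a simplified average precision@K for the multi-target setting; it is not the official evaluation code of any benchmark:

```python
import numpy as np

def recall_at_k(ranked: np.ndarray, targets: set, k: int) -> float:
    """1.0 if any ground-truth target appears in the top-k results, else 0.0."""
    return float(any(idx in targets for idx in ranked[:k]))

def average_precision_at_k(ranked: np.ndarray, targets: set, k: int) -> float:
    """Simplified average precision over the top-k results when multiple
    targets are valid (the CIRCO-style setting)."""
    hits, precisions = 0, []
    for rank, idx in enumerate(ranked[:k], start=1):
        if idx in targets:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

# Toy example: the gallery is ranked by similarity; images 3 and 7 are both valid targets.
ranking = np.array([5, 3, 9, 7, 1, 0, 2, 4, 6, 8])
gt = {3, 7}
print(recall_at_k(ranking, gt, k=5))            # 1.0
print(average_precision_at_k(ranking, gt, k=5)) # (1/2 + 2/4) / 2 = 0.5
```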

Benchmark datasets:

  • FashionIQ: Fashion-focused with clothing attribute modifications
  • CIRR: Features image clustering to mitigate false negatives
  • CIRCO: First dataset with multiple ground-truth targets per query
Current evaluation focuses exclusively on retrieval outcomes rather than intermediate reasoning processes. While CIR systems must handle complex operations internally (attribute understanding, transformation application, selective feature modification), success is measured purely by whether the correct image(s) were retrieved.
The challenge for CIR evaluation is addressing several inherent complexities: multiple valid targets may exist, the degree of modification is partly subjective, unmentioned attributes should remain preserved, and the same modification should produce consistent results across different reference images.

Why it’s challenging (and interesting)

The field has seen rapid growth since its emergence around 2019, with approaches ranging from supervised learning using annotated triplets to zero-shot methods leveraging large vision-language models. Several factors make this a rich research area:
  • Multimodal Understanding: Systems must integrate and align information from images and text, where each provides complementary signals.
  • Selective Attribute Modification: The model must identify which visual elements to change while preserving everything else.
  • Data Complexity: Training requires triplets (reference image, modification text, target image), which are much harder to collect than simple image-text pairs.
  • Multiple Valid Interpretations: Multiple target images might be equally valid for a single query, complicating training and evaluation.

Generative zero-shot composed image retrieval

This paper introduces a novel approach to Composed Image Retrieval (CIR), specifically focusing on improving Zero-Shot CIR (ZS-CIR). Its framework adds a generative step, Composed Image Generation (CIG), to the ZS-CIR pipeline: creating a visual representation (a pseudo-target image) that attempts to capture the desired result of the composition provides additional visual information that helps bridge the “representation gap” between composed embeddings in the language space and target image embeddings in the image space. The CIG component is an effective add-on that can be integrated with existing CIR methods to boost performance.
One key element discussed and utilized by the paper is Textual Inversion, which maps image features into a semantic token embedding space.
Textual inversion essentially learns a “pseudo token” or word embedding that represents the visual content of a specific image. This mapping aims to create representations compatible with the text encoder of a pretrained vision-language model, such as CLIP. Once the reference image is transformed into textual pseudo-tokens, these tokens are combined with tokens from the modification text. This forms a “unified query” that consists entirely of text-like tokens. This unified query is then encoded using the text encoder of the VLP model, allowing the image’s information to be integrated into textual prompts or sentences, enabling multimodal composition.
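A hedged sketch of that unified-query idea is below. The `textual_inversion` mapper, token embedder, and text encoder here are small toy modules standing in for the frozen CLIP components, and tokenization is simplified to pre-computed token IDs:

```python
import torch
import torch.nn as nn

DIM = 512  # token / embedding width of the (hypothetical) text encoder

# Placeholder components standing in for a frozen CLIP image encoder,
# a learned textual-inversion mapper, and a frozen CLIP text encoder.
image_encoder = nn.Linear(3 * 224 * 224, DIM)           # image -> global feature
textual_inversion = nn.Linear(DIM, DIM)                 # image feature -> pseudo-token S*
token_embedder = nn.Embedding(10000, DIM)               # word ids -> token embeddings
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True), num_layers=2
)

def unified_query(ref_image: torch.Tensor, mod_token_ids: torch.Tensor) -> torch.Tensor:
    """Build the unified query: [pseudo-token for the reference image] followed by
    the modification-text tokens, then encode the whole sequence as text."""
    img_feat = image_encoder(ref_image.flatten(1))           # (B, DIM)
    pseudo_token = textual_inversion(img_feat).unsqueeze(1)   # (B, 1, DIM), plays the role of S*
    mod_tokens = token_embedder(mod_token_ids)                # (B, T, DIM)
    sequence = torch.cat([pseudo_token, mod_tokens], dim=1)   # "S* <modification text>"
    return text_encoder(sequence).mean(dim=1)                 # pooled composed query embedding

query = unified_query(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 6)))
print(query.shape)  # torch.Size([2, 512])
```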
The method generates pseudo-target images that visually represent what the reference image should look like after being modified by the delta caption (the modification text). These generated images serve as additional visual information to enhance retrieval performance.

Training process

During the training phase of their Composed Image Generation (CIG) model, textual inversion is used to map the image latent embedding of a training image into the token embedding space. This pseudo-token embedding is combined with the image’s caption to compose a prompt embedding. This composed prompt serves as a textual condition for training a latent diffusion model.
At a high level, the model training approach is:
  • Self-supervised training using only standard image-caption pairs, not requiring expensive CIR triplet datasets.
  • A pre-trained textual inversion network maps image embeddings into the token embedding space.
  • Composed prompt embeddings are constructed by combining pseudo-tokens with the image’s caption.
  • A latent diffusion model (Stable Diffusion variants) is fine-tuned to reconstruct original images using these composed prompts as conditioning.
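A toy version of that training step is sketched below. It swaps the real components (frozen CLIP encoders, a VAE, a Stable Diffusion U-Net) for tiny stand-in modules so the structure of the composed-prompt conditioning and the reconstruction objective is visible; it is not the paper’s actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the real components (frozen CLIP encoders, a VAE, a U-Net).
DIM, LATENT = 64, 16
image_encoder     = nn.Linear(3 * 32 * 32, DIM)
textual_inversion = nn.Linear(DIM, DIM)
text_encoder      = nn.GRU(DIM, DIM, batch_first=True)
vae_encoder       = nn.Linear(3 * 32 * 32, LATENT)
denoiser          = nn.Linear(LATENT + DIM + 1, LATENT)    # stands in for the U-Net

def cig_training_step(image, caption_token_embs):
    """One self-supervised CIG step: build a composed prompt (pseudo-token + caption)
    and train the 'diffusion' model to reconstruct the image's noised latent."""
    # 1. Pseudo-token for the image, spliced in front of the caption tokens.
    pseudo = textual_inversion(image_encoder(image.flatten(1))).unsqueeze(1)
    prompt, _ = text_encoder(torch.cat([pseudo, caption_token_embs], dim=1))
    cond = prompt[:, -1]                                    # pooled composed prompt embedding

    # 2. Simplified denoising objective conditioned on the composed prompt.
    latents = vae_encoder(image.flatten(1))
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], 1)                     # continuous "timestep"
    noisy = latents + t * noise
    pred = denoiser(torch.cat([noisy, cond, t], dim=-1))
    return F.mse_loss(pred, noise)

loss = cig_training_step(torch.randn(4, 3, 32, 32), torch.randn(4, 7, DIM))
print(loss.item())
```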

Inference workflow

During inference, the model addresses the challenge of retrieving images based on a reference image and text modification.
The key innovation is using a generative component to create a visual preview of what the target image should look like after applying the requested modifications. This pseudo-target image helps bridge the modality gap between language-space embeddings and image-space embeddings. After generation, the pseudo-target image is processed to extract complementary information that enhances the retrieval process.
This approach effectively provides two perspectives on the query: the original text-based representation and a visually informed representation that better aligns with the target image space.
At a high level, the model inference approach is:
  • Generate a pseudo-target image which visually represents how the modified image should look.
  • The pseudo-target image is mapped back to the token embedding space.
  • A second composed text embedding is created from the pseudo-target image and the delta caption.
  • The original and pseudo-target-based embeddings are combined with a weighting hyperparameter.
  • Images are retrieved by computing cosine similarity with the combined embedding.
This process leverages generated visual representations to bridge the gap between query and target image spaces.
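A minimal sketch of that final fusion-and-retrieval step, assuming the two composed embeddings have already been computed; the weighting hyperparameter `alpha` and the dimensions are illustrative:

```python
import torch
import torch.nn.functional as F

def retrieve(original_query_emb, pseudo_target_query_emb, gallery_embs, alpha=0.5, k=10):
    """Blend the original composed embedding with the pseudo-target-based one,
    then rank the gallery by cosine similarity to the blended query."""
    query = F.normalize(alpha * original_query_emb + (1 - alpha) * pseudo_target_query_emb, dim=-1)
    gallery = F.normalize(gallery_embs, dim=-1)
    scores = gallery @ query          # cosine similarity per gallery image
    return scores.topk(k).indices     # indices of the top-k retrieved images

top_k = retrieve(torch.randn(512), torch.randn(512), torch.randn(1000, 512))
print(top_k)
```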

Key insights from generative CIR research

  • Generating pseudo-target images effectively bridges the representation gap between language-space embeddings and image-space embeddings, providing visual information that better aligns with target images.
  • Self-supervised training is sufficient: Fine-tuning a diffusion model on image reconstruction using composed prompts induces the ability to generate useful pseudo-target images without requiring expensive triplet datasets.
  • Textual embedding fusion is superior: Combining the original composed embedding with the pseudo-target-derived embedding at the textual level yields better performance than image-level or token-level fusion.
  • Visual detail preservation: Unlike single-token representations, pseudo-target images maintain rich visual content from the reference while incorporating requested modifications, creating more effective retrieval queries.

Missing target-relevant information prediction with world model for accurate zero-shot composed image retrieval

A major challenge in ZS-CIR is accurately modifying a reference image according to manipulation text, especially when the text specifies visual content that is missing from the reference image.
Existing methods typically map the reference image into a pseudo-token in CLIP’s language space, but they struggle in this setting because they ignore the missing target content: the CLIP embedding is coarse-grained and loses the visual details needed for CIR tasks.
This paper introduces PrediCIR (Predict target image feature before retrieval for zero-shot Composed Image Retrieval), which explicitly predicts missing target visual content before performing image-to-word mapping. Rather than directly mapping existing features to pseudo-tokens, PrediCIR first predicts what visual elements are needed to fulfill the manipulation instruction, then adaptively combines this predicted content with the reference image’s existing features. It does this through three interconnected modules that work together during pre-training and inference:
  1. World view generation
  2. Target content predictor
  3. Predictive cross-modal architecture

World view generation

World view generation is a fundamental component and the initial step in the pre-training process of the PrediCIR model.
The main goal of this module is to construct source and target views, along with corresponding actions, from existing image-caption pairs, without requiring extra supervision. The module generates pseudo-triplets of the form <source view, action, target view>.

How it works:

  • An original image from an image-caption pair is designated as the target view.
  • A corrupted version of that original image is created by randomly cropping certain visual content. This corrupted image serves as the source view. Random cropping is preferred over masking to align with the frozen CLIP model and preserve coherent regional context. The crop size and aspect ratios were analyzed for their influence on performance.
  • The caption associated with the original image is used as the action, representing the intent to transform the source view into the target view. The caption is embedded using the frozen CLIP language encoder to obtain an action embedding.
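A small sketch of this triplet construction is shown below, using torchvision’s random resized crop and a dummy callable in place of the frozen CLIP text encoder; the crop scale range is illustrative, not the paper’s setting:

```python
import torch
import torchvision.transforms as T

random_crop = T.RandomResizedCrop(size=224, scale=(0.3, 0.8))  # crop ratios are illustrative

def make_world_view_triplet(image: torch.Tensor, caption: str, clip_text_encoder):
    """Build one <source view, action, target view> pseudo-triplet from a
    single image-caption pair, as in the world-view generation step."""
    target_view = image                      # the original image is the target view
    source_view = random_crop(image)         # a cropped copy plays the reference image
    action = clip_text_encoder(caption)      # the caption acts as the 'action'
    return source_view, action, target_view

# Toy usage with a dummy text encoder standing in for frozen CLIP.
dummy_text_encoder = lambda text: torch.randn(512)
src, act, tgt = make_world_view_triplet(
    torch.randn(3, 256, 256), "a red sedan parked outside", dummy_text_encoder
)
print(src.shape, act.shape, tgt.shape)
```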

Why this data structure?

This specific structure <cropped image, caption, original image> simulates the ZS-CIR problem where a reference image (like the cropped image) needs to be modified according to a text instruction (the caption/action) to become a target image (the original image). It teaches the model to understand what content is “missing” in the source view relative to the target view, based on the action.

Target content predictor

The triplets generated by the World View Generation module are then used to train the Target Content Predictor (TCP) module, which functions as a world model.
Acting as a world model predictor (similar to a JEPA framework), the TCP takes the latent features of the source view (the cropped image patches), the action embedding (derived from the caption via the CLIP language encoder), and mask tokens representing the locations of missing content in the target view as input. Guided by the action, the TCP learns to predict the latent representation of the missing target visual content. This prediction occurs in the latent space. The TCP’s output includes the predicted latent features for the missing content and enhanced latent features for the source content.
These predicted missing features and enhanced source features from the TCP are then passed to the Predictive Cross-Modal Alignment module.
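Below is a hedged sketch of the predictor’s interface: a small transformer that takes source-patch features, an action embedding, and learned mask tokens, and returns predicted missing-content features plus enhanced source features. The real TCP’s architecture and masking scheme come from the paper; this only mirrors its inputs and outputs:

```python
import torch
import torch.nn as nn

class TargetContentPredictor(nn.Module):
    """Toy JEPA-style predictor: given source-view patch features, an action
    embedding, and mask tokens marking missing regions, predict latent features
    for the missing target content."""
    def __init__(self, dim: int = 512, n_missing: int = 16):
        super().__init__()
        self.mask_tokens = nn.Parameter(torch.zeros(1, n_missing, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, source_patches: torch.Tensor, action_emb: torch.Tensor):
        b = source_patches.shape[0]
        masks = self.mask_tokens.expand(b, -1, -1)
        # Condition on the action by prepending it to the sequence.
        seq = torch.cat([action_emb.unsqueeze(1), source_patches, masks], dim=1)
        out = self.predictor(seq)
        n_src = source_patches.shape[1]
        enhanced_source = out[:, 1:1 + n_src]        # refined source-content features
        predicted_missing = out[:, 1 + n_src:]       # predicted missing-content features
        return predicted_missing, enhanced_source

tcp = TargetContentPredictor()
missing, source = tcp(torch.randn(2, 49, 512), torch.randn(2, 512))
print(missing.shape, source.shape)  # (2, 16, 512) (2, 49, 512)
```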

Predictive cross-modal alignment

The Predictive Cross-Modal Alignment (PMA) module bridges the gap between the visual features predicted by the Target Content Predictor (TCP) and the CLIP language space used for retrieval.
It takes the features representing the predicted missing visual content and the enhanced source content from the TCP, along with the global source feature, and combines them adaptively. This combined representation is mapped into the word token space to create a pseudo-word token.
The pseudo-word token represents the potential visual content of the target image, including the elements predicted by the TCP. This token is appended to a simple prompt like “a photo of S*”. The PMA is trained with a contrastive loss that encourages the embedding of this prompt sentence to align closely with the actual global feature embedding of the target image in the CLIP vision-language space.
This is the final step in the prediction-based image-to-word mapping that takes the predictor’s output and transforms it into a format (the pseudo-token) suitable for composing a query within the CLIP language space for Zero-Shot Composed Image Retrieval.
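A toy sketch of this alignment step follows, assuming the TCP outputs have been pooled to single vectors. The gated fusion, the single learned “prompt context” vector standing in for “a photo of”, and the loss are simplified illustrations rather than the paper’s exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveCrossModalAlignment(nn.Module):
    """Toy PMA: adaptively fuse predicted-missing, enhanced-source, and global
    source features into a pseudo-word token, then build a prompt embedding
    to align with the target image's global feature."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)                  # adaptive fusion weights
        self.to_token = nn.Linear(dim, dim)                # maps fused feature to word-token space
        self.prompt_ctx = nn.Parameter(torch.randn(dim))   # stands in for "a photo of"

    def forward(self, predicted_missing, enhanced_source, global_source):
        feats = torch.stack([predicted_missing, enhanced_source, global_source], dim=1)
        w = self.gate(feats.flatten(1)).softmax(dim=-1)            # (B, 3) fusion weights
        fused = (w.unsqueeze(-1) * feats).sum(dim=1)               # adaptive combination
        pseudo_token = self.to_token(fused)                        # S*
        return F.normalize(self.prompt_ctx + pseudo_token, dim=-1) # "a photo of S*"

def alignment_loss(prompt_emb, target_image_emb, tau=0.07):
    """Contrastive loss pulling each prompt toward its own target image."""
    logits = prompt_emb @ F.normalize(target_image_emb, dim=-1).T / tau
    return F.cross_entropy(logits, torch.arange(len(prompt_emb)))

pma = PredictiveCrossModalAlignment()
prompt = pma(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
print(alignment_loss(prompt, torch.randn(4, 512)).item())
```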

Key lessons and takeaways for practitioners

The PrediCIR paper strongly suggests that explicitly predicting missing content is a powerful paradigm for ZS-CIR.
Practitioners should consider leveraging prediction-based world models, carefully generating training data that simulates missing information, adaptively fusing original and predicted features, and aligning these predictive outputs with established multimodal spaces like CLIP to improve performance and generalization ability in CIR tasks.
Here are some of the key insights from the paper:
  1. Prediction is powerful: Explicitly predicting the visual content needed to fulfill a manipulation instruction, especially when it is missing from the reference image, improves ZS-CIR performance. This is particularly true for “missing content” manipulations such as changing domains (e.g., to origami) or adding objects.
  2. World models for visual transformations: Training a world model predictor using a JEPA-like framework on synthetic “world views” (source, action, target) generated from image-caption data learns visual prediction capabilities without needing extensive, hand-annotated ZS-CIR triplets.
  3. Data generation matters: Randomly cropping images to create source views creates diverse scenarios of missing content and aligns well with the frozen CLIP architecture used. Trying to predict the entire target image is less effective than predicting the missing parts, suggesting partial prediction is better for managing computation and avoiding overfitting.
  4. Improved pseudo-token quality: The prediction process yields a higher-quality pseudo-token that better represents the intended target image’s potential content and fine-grained details, which are important for accurate retrieval.

Imagine and seek: Improving composed image retrieval with an imagined proxy

Traditional ZS-CIR methods often rely on projecting query images into the text feature space and combining them with query text features for retrieval.
This approach can suffer from a natural gap between images and text, making it difficult to guarantee detailed alignment and often overlooking important semantic information present in the image but not explicitly captured by text features. This is particularly challenging with complex captions. Existing methods that leverage large language models (LLMs) to generate descriptions also tend to focus on the text side, neglecting the potential for direct imagination on the image side.
This paper introduces Imagined Proxy for CIR (IP-CIR), a training-free method that leverages the power of imagination from generative models to address these limitations.
IP-CIR harnesses the power of imagination from generative models to create an “imagined proxy” image aligned with the query image and the relative text description.
This proxy image aims to provide additional details like style, instance attributes, and spatial relationships that text-based retrieval might miss. This contrasts with methods that rely solely on text modifications or text-space projection. The method then carefully integrates this visual proxy information with the original query image and textual guidance through robust features and a balanced retrieval metric.
The framework involves three main conceptual steps:
  1. Imagined retrieval proxy generation
  2. Constructing robust proxy features
  3. Balancing retrieval results
The innovation here is the direct use of image generation to create a visual “imagination” of the target image, which should provide a rich, image-side feature representation to complement and enhance traditional text-based retrieval in ZS-CIR.

Imagined retrieval proxy generation

The first stage focuses on generating a visual representation of what the user is looking for — an “imagined proxy image” that combines elements from the reference photo and the text description.
IP-CIR starts by analyzing the user’s query image using BLIP2, which automatically generates detailed captions describing what’s in the image. These captions are then combined with the user’s text modifications and fed into Qwen1.5-32B, a large language model that serves as the reasoning engine. Qwen analyzes this information and creates a detailed spatial layout for the target image.
This layout includes precise descriptions of each object and its position, and decides which elements should come from the original image versus the text description. This essentially creates a blueprint that says, “Keep the dog from the photo, but add a red hat as described in the text.”
IP-CIR then uses controllable image generation techniques to create an actual proxy image following this layout. The generation process incorporates visual features from the original query image while adding new elements described in the text. The result is a concrete visual representation of the search target.
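The pipeline can be summarized with the hedged outline below, where every callable (`caption_model`, `llm`, `layout_to_image`) is a hypothetical placeholder for the corresponding model (BLIP2, Qwen1.5-32B, a layout-conditioned generator), not a real API:

```python
def generate_imagined_proxy(query_image, modification_text,
                            caption_model, llm, layout_to_image):
    # 1. Describe the reference image in words.
    reference_captions = caption_model(query_image)

    # 2. Ask the LLM to reason out a target layout: per-object descriptions,
    #    bounding boxes, and whether each element comes from the image or the text.
    layout = llm(
        f"Reference image: {reference_captions}\n"
        f"Requested modification: {modification_text}\n"
        "Produce a spatial layout for the target image."
    )

    # 3. Render a proxy image that follows the layout while reusing visual
    #    features from the reference image.
    return layout_to_image(layout, reference=query_image)

# Toy usage with dummy callables in place of the real models.
proxy = generate_imagined_proxy(
    query_image="<image tensor>",
    modification_text="add a red hat to the dog",
    caption_model=lambda img: "a brown dog sitting on grass",
    llm=lambda prompt: {"objects": [
        {"name": "dog", "box": [0.2, 0.3, 0.6, 0.9], "source": "image"},
        {"name": "red hat", "box": [0.3, 0.1, 0.5, 0.3], "source": "text"},
    ]},
    layout_to_image=lambda layout, reference: f"proxy rendered from {len(layout['objects'])} objects",
)
print(proxy)
```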

Constructing robust proxy features

Raw proxy images sometimes emphasize the wrong details or miss subtle textual requirements. This stage creates a more reliable search representation by combining multiple information sources.
Qwen generates detailed captions describing what the ideal target image should contain after applying all the requested modifications. IP-CIR compares these target descriptions with the original BLIP2 captions to identify exactly what should change. This creates a “semantic perturbation” that captures the direction and nature of the requested modifications.
The final search feature combines three components:
  • Features from the generated proxy image (the visual imagination)
  • Features from the original query image (preserving important context)
  • The semantic perturbation (ensuring text requirements are met)
This multi-source approach compensates for potential weaknesses in any individual component while preserving the most important information from visual and textual inputs.
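A toy formulation of that combination might look like the following; the additive blend and weights are illustrative assumptions, not the paper’s exact construction of the semantic perturbation and fusion:

```python
import torch
import torch.nn.functional as F

def robust_proxy_feature(proxy_img_emb, query_img_emb, target_caption_emb, source_caption_emb,
                         w_proxy=0.5, w_query=0.3, w_shift=0.2):
    """Toy robust proxy feature: blend the generated proxy's embedding, the
    original query image's embedding, and a 'semantic perturbation' (target
    caption minus source caption) capturing the edit direction."""
    semantic_perturbation = target_caption_emb - source_caption_emb
    fused = (w_proxy * proxy_img_emb
             + w_query * query_img_emb
             + w_shift * semantic_perturbation)
    return F.normalize(fused, dim=-1)

feat = robust_proxy_feature(torch.randn(512), torch.randn(512), torch.randn(512), torch.randn(512))
print(feat.shape)
```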

Balancing retrieval results

The final stage addresses how to effectively combine results from traditional text-based search with the new image-based proxy approach.
IP-CIR now has two ways to search: using traditional text-based methods and using the new proxy-based visual approach. Each method produces similarity scores, but simply averaging them can lead to poor results where images excel in one area but fail in another.
IP-CIR uses “balanced similarity” by multiplying the text-based score with the proxy-based score. This multiplication is key — it ensures that only images performing well in both approaches receive high, balanced scores. An image that completely fails the text or visual test will have a very low balanced score.
The final search ranking combines the original text-based similarity with this balanced similarity using a weighted average. A lambda parameter controls how much weight to give each component, allowing the system to adapt to different types of content and datasets.
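A compact sketch of that scoring scheme, with made-up similarity values to show why the multiplication rewards images that satisfy both modalities:

```python
import torch

def balanced_retrieval_scores(text_sim: torch.Tensor, proxy_sim: torch.Tensor, lam: float = 0.5):
    """Toy version of the score fusion: the 'balanced' term multiplies the two
    similarities, so only gallery images scoring well under BOTH the text-based
    and the proxy-based query stay high; lambda trades off the original
    text similarity against the balanced term."""
    balanced = text_sim * proxy_sim
    return lam * text_sim + (1 - lam) * balanced

text_sim = torch.tensor([0.9, 0.8, 0.2])    # similarity under the text-based query
proxy_sim = torch.tensor([0.7, 0.1, 0.9])   # similarity under the imagined-proxy query
scores = balanced_retrieval_scores(text_sim, proxy_sim)
print(scores, scores.argsort(descending=True))  # image 0 ranks first: strong on both
```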

Why this approach works

This three-stage process creates a search system that truly understands complex visual-textual queries.
By generating actual proxy images, the system can capture spatial relationships and visual details that pure text descriptions might miss. Combining multiple information sources in the feature construction makes it more robust against any single component’s limitations. And by carefully balancing different similarity measures, it ensures that results satisfy both visual and textual requirements rather than excelling in just one area.
The result is a search system that can handle requests like “find me a living room like this one, but with a blue couch instead of brown” — understanding both the visual context of the reference image and the specific textual modifications requested.

Key lessons and takeaways for practitioners

  1. Text-only retrieval has limitations: Relying solely on text features for CIR overlooks important semantic information and detailed visual alignment due to the inherent gap between images and text. Text features can be coarse and easily confusable in fine-grained, complex imagination scenes.
  2. Visual imagination enhances retrieval: Creating an “imagined proxy” provides valuable additional information, such as style, instance attributes, and spatial relationships, that are often difficult for text-based methods to capture accurately, especially with complex captions.
  3. Harnessing generative models is feasible: Controllable generative models make it possible to create high-quality images that align with both query images and relative captions.
  4. LLMs for reasoning and layout: LLMs can understand the complex relationship between the query image (represented by automatically generated captions like those from BLIP2) and the relative text. They can reason out detailed spatial layouts for the imagined target image, including object descriptions, bounding box coordinates, and even determine which attributes should come from the original image versus the text. They can also help derive semantic features representing the direction of the textual edit.
  5. Raw proxy features aren’t enough; robust features are key: Simply using the raw features of a generated proxy image for retrieval can introduce noise or cause the system to overemphasize irrelevant details, such as a background not specified in the text. A more effective approach is to construct a robust proxy feature by combining information from the proxy image, the original query image (to compensate for lost details), and a “semantic perturbation” feature derived from inferred target captions (to capture the edit direction and mitigate focus on irrelevant details).
  6. Balancing modalities improves accuracy: Combining the retrieval similarities obtained from image-based proxy features and traditional text-based baseline methods is critical for achieving accurate results. A balanced metric, such as the one used in IP-CIR which involves multiplying the similarities, ensures that a high final score is only achieved if an image has reasonably high similarity in both modalities. This prevents retrieving images that only match one aspect (image or text) but not the other.

The future of retrieval is compositional

The three approaches showcased at CVPR 2025 represent distinct yet complementary philosophies in advancing Composed Image Retrieval.
Each tackles the fundamental challenge of bridging vision and language from different angles. Generative CIR creates visual previews of target images, excelling at complex transformations. PrediCIR analytically predicts missing content, proving powerful for domain transfers and object additions. IP-CIR combines multiple AI systems for holistic reasoning across visual and textual modalities.
All three methods recognize a fundamental truth: traditional text-image matching isn’t sufficient for compositional queries.
They require more sophisticated mechanisms to bridge the gap between how humans describe visual modifications and how systems process them.
The shift toward zero-shot approaches signals progress toward truly generalizable Visual AI systems that understand and manipulate visual concepts without exhaustive training on every scenario. As these methods mature and potentially converge, we’re approaching search systems that don’t just find what exists, but understand what we imagine.
The implications extend far beyond academic research.
From e-commerce platforms enabling intuitive product discovery to creative tools that understand artistic intent, CIR is positioning itself to transform how we interact with visual information. The question isn’t whether these systems will become mainstream, but how quickly they’ll integrate into the tools we use every day.
The future of search isn’t just retrieval — it’s visual reasoning.