Review of a Data-Centric AI Paper from NeurIPS 2024 — The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better
This post is part of a five-part series examining notable data-centric AI papers from NeurIPS 2024. For brief summaries of all five papers, check out my overview post, where you’ll find links to each detailed analysis.
The quality and relevance of training data directly impact the performance of deep learning models; this is especially true in Visual AI.
While recent advancements in text-to-image generation have spurred interest in using synthetic data for training vision models, a new research paper challenges this trend. The study, which focuses on fine-tuning a pre-trained CLIP model for various visual recognition tasks, makes a compelling argument for the continued dominance of real data. The researchers demonstrate that retrieving targeted real images from the LAION-2B dataset, the same dataset used to train Stable Diffusion, consistently outperforms using synthetic images generated by Stable Diffusion.
This finding underscores a crucial point for data-centric AI: while synthetic data holds promise, we must carefully evaluate its effectiveness against a robust baseline of curated real data.
The authors of this paper begin by highlighting the increasing demand for large amounts of high-quality data to train machine learning systems. They point out the challenges and costs associated with collecting and annotating real-world data, which have led to the exploration of synthetic data as a potential solution.
- One promising approach is to leverage conditional generative models to create synthetic training data.
- This has gained traction in fields like natural language processing (NLP), where large language models are used to generate synthetic datasets for tasks like instruction tuning.
Similarly, there’s growing interest in using synthetic images from text-to-image generators to train models for visual recognition tasks in computer vision.
However, the authors raise a critical question: Given that synthetic images originate from the real-world data used to train the generative models, what additional value does the intermediate generation step provide? Wouldn’t it be more effective to directly utilize the relevant portions of the original real-world data?
To investigate this, the paper focuses on task adaptation, which aims to collect targeted images to fine-tune a pre-trained vision model for a specific downstream task. They compare the effectiveness of fine-tuning on:
- Targeted synthetic images generated by Stable Diffusion (trained on the LAION-2B dataset).
- Targeted real images retrieved directly from the LAION-2B dataset.
By contrasting these two approaches, the research aims to isolate and evaluate the true value added by using synthetic data generated from a model, compared to directly using the real-world data the model was trained on.
Adapting Pre-Trained Vision Models Using Synthetic or Retrieved Data
In task adaptation, the goal is to enhance the performance of a pre-trained vision model on a specific downstream visual classification task.
This adaptation is achieved by fine-tuning the model using a targeted dataset curated specifically for the task. The research compares the effectiveness of two distinct approaches for creating this adaptation dataset: generating synthetic images and retrieving real images.
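To make the setup concrete, here is a minimal sketch of what task adaptation by fine-tuning might look like, assuming an open_clip ViT-B/32 checkpoint and a simple recipe that tunes the image encoder against fixed class-name text embeddings. The class names, hyperparameters, and loss here are illustrative assumptions; the paper's exact training protocol may differ.

```python
# Minimal sketch of task adaptation: fine-tune CLIP's image encoder so that images
# align with fixed class-name text embeddings (illustrative; the paper's exact
# training recipe, hyperparameters, and loss may differ).
import torch
import torch.nn.functional as F
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
class_names = ["daisy", "rose", "sunflower"]  # example downstream classes (assumption)

# Build a fixed zero-shot classification head from class-name text embeddings.
with torch.no_grad():
    text_feats = model.encode_text(tokenizer([f"a photo of a {c}" for c in class_names]))
    text_feats = F.normalize(text_feats, dim=-1)

optimizer = torch.optim.AdamW(model.visual.parameters(), lr=1e-5)

def train_step(images, labels):
    """images: a preprocessed batch tensor; labels: integer class indices."""
    img_feats = F.normalize(model.encode_image(images), dim=-1)
    logits = 100.0 * img_feats @ text_feats.T  # scaled cosine similarities
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The only ingredient that changes between the two approaches compared below is the curated dataset fed to a loop like this one.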
Generating Synthetic Images
This method leverages a text-to-image generative model, specifically Stable Diffusion 1.5, pre-trained on the large-scale LAION-2B image-text dataset. The process starts by synthesizing image captions corresponding to the target task’s class names. This is done by prompting a large language model (LLaMA-2 7B).
These generated captions are then fed to Stable Diffusion to synthesize targeted images. Each synthetic image is assigned a class label corresponding to the class name that produced its caption.
This collection of synthetic images and labels forms the targeted synthetic dataset.
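A rough sketch of this caption-then-generate pipeline is shown below, using Hugging Face transformers for LLaMA-2 and diffusers for Stable Diffusion 1.5. The prompts, example class names, and parsing logic are illustrative assumptions, not the authors' actual code.

```python
# Sketch of the caption-then-generate pipeline (illustrative, not the authors' code).
# Assumes Hugging Face `transformers` and `diffusers` are installed and that you
# have access to the LLaMA-2 7B and Stable Diffusion 1.5 checkpoints.
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

class_names = ["Boeing 737-800", "Airbus A320"]  # example target classes (assumption)

# Step 1: prompt an LLM to write varied captions for each class name.
captioner = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf",
                     torch_dtype=torch.float16, device_map="auto")

def make_captions(class_name, n=3):
    prompt = f"Write {n} short, varied photo captions describing a {class_name}:"
    out = captioner(prompt, max_new_tokens=128, do_sample=True)[0]["generated_text"]
    # Naive parsing: keep non-empty lines after the prompt (a real pipeline needs more care).
    lines = [l.strip("-• ").strip() for l in out[len(prompt):].splitlines() if l.strip()]
    return lines[:n]

# Step 2: feed each caption to Stable Diffusion and label the image with its class.
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5",
                                             torch_dtype=torch.float16).to("cuda")

synthetic_dataset = []
for label, name in enumerate(class_names):
    for caption in make_captions(name):
        image = sd(caption, num_inference_steps=30).images[0]
        synthetic_dataset.append({"image": image, "caption": caption, "label": label})
```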
Retrieving Real Images
This approach does not generate new images; instead, it directly retrieves relevant images from the generative model’s pre-training dataset, LAION-2B.
Two retrieval strategies are used:
- Hard Substring Matching: This simple strategy involves retrieving images whose corresponding captions contain at least one of the target class names as a substring. This method is effective when the target concepts are concrete entities likely to be explicitly mentioned in the captions.
- Semantic k-NN Retrieval: This strategy uses semantic similarity in the CLIP image-text embedding space for abstract concepts that might not be directly named in captions. Multiple natural language search queries are created based on the target class names. Using these queries, an approximate k-NN search is performed to retrieve the k-nearest image-text pairs from LAION-2B based on their CLIP similarity to the query.
Retrieved images are assigned labels based on the class names they are matched with. This collection of retrieved images and labels forms the targeted retrieved dataset.
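The sketch below illustrates both retrieval strategies in simplified form. It assumes LAION captions and URLs are available locally as records, and that CLIP image embeddings have been precomputed into a FAISS index; the data layout, query templates, and helper names are assumptions for illustration.

```python
# Sketch of the two retrieval strategies over LAION-2B metadata (illustrative).
# Assumes captions/URLs are available locally and that image CLIP embeddings have
# been precomputed into a FAISS index; names and data layout are assumptions.
import faiss
import torch
import open_clip

# --- Strategy 1: hard substring matching over captions -----------------------
def substring_retrieve(records, class_names):
    """records: iterable of dicts like {"caption": str, "url": str}."""
    hits = []
    for r in records:
        cap = r["caption"].lower()
        for label, name in enumerate(class_names):
            if name.lower() in cap:
                hits.append({**r, "label": label})
                break
    return hits

# --- Strategy 2: semantic k-NN retrieval in CLIP embedding space -------------
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def knn_retrieve(index: faiss.Index, class_names, k=1000):
    """index: FAISS index over L2-normalized CLIP image embeddings (precomputed)."""
    hits = []
    for label, name in enumerate(class_names):
        queries = [f"a photo of a {name}", f"a close-up photo of a {name}"]
        with torch.no_grad():
            q = model.encode_text(tokenizer(queries))
            q = (q / q.norm(dim=-1, keepdim=True)).numpy().astype("float32")
        scores, ids = index.search(q, k)  # approximate k-NN by inner product
        for row in ids:
            hits.extend({"laion_id": int(i), "label": label} for i in row)
    return hits
```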
Data Filtering and Post-Processing
The curated datasets, both synthetic and retrieved, undergo further refinement to enhance their quality:
- Filtering: This step removes images with content misaligned with their assigned class labels. Both datasets are filtered by measuring the CLIP similarity of each image to text that represents its corresponding label. The top 30% of images with the highest similarity scores are retained.
- Post-processing: While synthetic datasets are inherently class-balanced due to the uniform generation process, retrieved datasets might exhibit class imbalance. A global threshold (M) is set to address this, and the retrieved dataset is truncated to ensure that each class label occurs at most M times.
By employing these methods for data curation, the study aims to create targeted adaptation datasets that are both relevant to the downstream task and balanced across classes. This allows for a fair and rigorous comparison of the effectiveness of synthetic and real data in fine-tuning pre-trained vision models for specific tasks.
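A compact sketch of these two refinement steps is given below. The 30% retention fraction and per-class cap M follow the description above, while the data layout and the similarity_fn helper are assumptions for illustration.

```python
# Sketch of the filtering and class-balancing steps (illustrative; the thresholds
# follow the paper's description, but the data layout and helpers are assumptions).
import random
from collections import defaultdict

def clip_filter(samples, similarity_fn, keep_frac=0.30):
    """Keep the top 30% of samples by CLIP image-text similarity to their label.

    samples: list of dicts with keys "image" and "label_text".
    similarity_fn(image, text) -> float, e.g. cosine similarity of CLIP embeddings.
    """
    scored = [(similarity_fn(s["image"], s["label_text"]), s) for s in samples]
    scored.sort(key=lambda t: t[0], reverse=True)
    keep = int(len(scored) * keep_frac)
    return [s for _, s in scored[:keep]]

def balance_classes(samples, max_per_class):
    """Truncate the retrieved set so each label occurs at most M times."""
    by_label = defaultdict(list)
    for s in samples:
        by_label[s["label"]].append(s)
    balanced = []
    for label, group in by_label.items():
        random.shuffle(group)
        balanced.extend(group[:max_per_class])
    return balanced
```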
Key Findings
The authors conducted a series of experiments comparing fine-tuning a pre-trained vision model on targeted synthetic images versus targeted real images retrieved from the generative model’s training data. They focused on five downstream tasks:
- ImageNet-1K: A large-scale image classification benchmark encompassing many object categories.
- Describable Textures (DTD): A dataset for recognizing various texture categories.
- FGVC-Aircraft: A fine-grained dataset for classifying different aircraft models.
- Stanford Cars: A fine-grained dataset for classifying different car models.
- Oxford Flowers-102: A fine-grained dataset for classifying different flower species.
The key results were as follows:
- Retrieved real images consistently outperformed or matched synthetic images across all benchmarks and data scales. This indicates that training directly on the relevant portions of the generative model’s training data was more effective than using synthetic images derived from that same data.
- Synthetic data did exhibit some positive scaling in certain cases, but it generally lagged behind retrieved data. For instance, on the FGVC-Aircraft benchmark, increasing the size of the synthetic dataset led to improved performance, but it still required a much larger synthetic dataset to achieve the same level of accuracy as a smaller dataset of retrieved images.
- Training on synthetic data could sometimes improve a model’s task representation without significantly improving task performance. In some cases, linear-probe (LP) accuracy, a measure of representation quality, improved when training on synthetic data, while the corresponding zero-shot (ZS) accuracy remained low (the sketch at the end of this post contrasts these two evaluation protocols). This suggests that while the model may have learned some general features relevant to the task, it struggled to apply that knowledge directly to classify new images accurately.
These findings highlight the limitations of using synthetic data generated by current text-to-image models for fine-tuning pre-trained vision models. The researchers conclude that further improvements in the quality and fidelity of synthetic image generation are needed to surpass the effectiveness of training directly on relevant real-world data.
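For readers unfamiliar with the two evaluation protocols referenced above, here is a minimal sketch of how zero-shot (ZS) and linear-probe (LP) accuracy might be computed from frozen features. The array shapes and the use of scikit-learn's LogisticRegression are assumptions for illustration, not the paper's exact evaluation code.

```python
# Sketch of the two evaluation protocols (illustrative).
# Assumes CLIP image features and class-name text features are precomputed and
# L2-normalized as NumPy arrays; shapes and names are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def zero_shot_accuracy(image_feats, text_feats, labels):
    """ZS: classify each image by its nearest class-name text embedding."""
    preds = (image_feats @ text_feats.T).argmax(axis=1)
    return (preds == labels).mean()

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """LP: fit a linear classifier on frozen image features (representation quality)."""
    clf = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```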