Review of a Data-Centric AI Paper from NeurIPS 2024 - SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification
This post is part of a five-part series examining notable data-centric AI papers from NeurIPS 2024. For brief summaries of all five papers, check out my overview post, where you’ll find links to each detailed analysis.
While much of the machine learning discourse focuses on models, algorithms, and architectures, the critical role of data curation often remains in the shadows.
Data curation, which involves carefully selecting and organizing data to create a dataset, profoundly impacts the performance and robustness of machine learning models, particularly in image classification. Despite growing awareness of data curation’s significance, many studies fall short of best practices, often providing minimal information about their training data and its curation process.
This lack of transparency obscures the vital connection between data quality and model performance, hindering progress toward more robust and reliable machine learning systems.
While data curation has historically been an implicit consideration in machine learning research, it’s recently gained prominence as a research topic in its own right. In this paper, the authors bring data curation into sharper focus and establish it as a distinct research area by formalizing the task as a rational choice problem whose goal is maximizing the utility of the resulting dataset within specific cost constraints.
The paper formalizes a data curation strategy as a function that takes a cost as input and produces a set of samples drawn from a distribution over a set of plausible images.
In data curation, costs can arise from various sources:
- Data Acquisition: Gathering images or image-text pairs from the web, specialized databases, or through synthetic generation can be computationally expensive and time-consuming.
- Labeling: Obtaining accurate labels for the data, whether through expert annotation, crowdsourcing, or automated methods, incurs costs.
- Filtering: Selecting the most informative and relevant samples from a large pool of data often requires human effort or sophisticated algorithms, both of which have associated costs.
Data curation strategies can be viewed as a series of choices by curators to maximize the dataset’s utility within a given cost constraint. In this paper, the authors discuss five data curation strategies:
- Expert Curation: Considered the gold standard, this strategy keeps humans in the loop at all stages, including selecting the label set, prefiltering images, and assigning labels with expert oversight. This method results in high-quality datasets, such as the original ImageNet, but is costly due to extensive human effort.
- Crowdsourced Labeling: This approach reduces labeling costs by using a wider pool of annotators. Experts define the label set, but image prefiltering is omitted. Annotators can apply multiple labels per image, potentially leading to class imbalances.
- Schema Matching: This strategy leverages the existence of well-curated datasets by mapping their label sets to create new datasets. A schema is created to connect labels across datasets, often requiring expert input. While schema creation is relatively low-cost, the quality and balance of the resulting dataset depend heavily on the source datasets.
- Synthetic Data Generation: This strategy bypasses the need for real images by using generative models to create synthetic images and labels. The models are trained on existing datasets and can be conditioned on various factors, such as label sets, text captions, or images. However, synthetic images often lack fidelity compared to real images, presenting a challenge for this approach.
- Embedding-Based Search: This method utilizes pre-trained computer vision models, often vision-language models like CLIP, to search large, unlabeled datasets for images relevant to target classes. This technique can efficiently retrieve images semantically similar to those in a reference dataset or matching specific text prompts by comparing image embeddings. However, this approach can introduce label noise, requiring further filtering or correction techniques.
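To make the embedding-based search idea concrete, here is a minimal sketch. It assumes the image and query embeddings have already been computed by a vision-language model such as CLIP; the arrays below are toy stand-ins, and the names `search_by_embedding` and `top_k` are hypothetical, not taken from the paper.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def search_by_embedding(pool: np.ndarray, queries: np.ndarray, top_k: int) -> np.ndarray:
    """For each query (one per target class), return the indices of the
    top_k most similar pool images, most similar first.

    pool:    (n_images, dim) embeddings of the unlabeled image pool
    queries: (n_classes, dim) embeddings of text prompts or reference images
    """
    sims = l2_normalize(queries) @ l2_normalize(pool).T  # (n_classes, n_images)
    # Sorting the negated similarities ascending yields descending similarity.
    return np.argsort(-sims, axis=1)[:, :top_k]


# Toy stand-ins for real embeddings (assumption: 2-D vectors for readability).
pool = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]])
queries = np.array([[1.0, 0.0],   # hypothetical query for class 0
                    [0.0, 1.0]])  # hypothetical query for class 1

retrieved = search_by_embedding(pool, queries, top_k=2)
print(retrieved)  # row i lists the pool indices retrieved for class i
```

Each retrieved image inherits the label of the query that found it, which is exactly where label noise can creep in: an image that is merely similar to a prompt is not guaranteed to belong to that class.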
The underlying principle is that curators make rational choices, selecting the curation strategy that maximizes the utility of the set of samples while staying within the given cost constraint. A larger budget typically permits a larger and potentially more diverse set of samples, which is expected to lead to higher utility.
In essence, their formalization casts data curation as an optimization problem:
- Objective: Maximize the utility of the curated dataset.
- Decision Variables: The curation strategy, which encompasses choices about data sources, labeling methods, filtering techniques, and more.
- Constraint: The total cost of curation must not exceed the allowed budget.
Curators must carefully consider the costs and benefits of different approaches to arrive at a dataset that effectively balances utility and resource constraints.
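The optimization view above can be sketched in a few lines. This is only an illustration of "maximize utility subject to a budget": the `Strategy` class, the `best_within_budget` helper, and every cost and utility number below are invented for the example and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Strategy:
    name: str
    cost: float     # total curation cost, in arbitrary units
    utility: float  # estimated utility of the resulting dataset


def best_within_budget(strategies: Sequence[Strategy], budget: float) -> Optional[Strategy]:
    """Return the highest-utility strategy whose cost fits the budget,
    or None when no strategy is affordable."""
    feasible = [s for s in strategies if s.cost <= budget]
    return max(feasible, key=lambda s: s.utility, default=None)


# Placeholder candidates with made-up numbers (not measurements).
candidates = [
    Strategy("A", cost=100.0, utility=0.80),
    Strategy("B", cost=40.0, utility=0.70),
    Strategy("C", cost=10.0, utility=0.55),
]

choice = best_within_budget(candidates, budget=50.0)
print(choice.name if choice else "nothing affordable")
```

In practice the hard part is not the selection step but estimating utility before paying for a strategy, which is the gap a benchmark of curation strategies tries to fill.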
Exploring Utility and Analytic Metrics in Data Curation
The core idea is that a dataset possesses a certain level of utility, which reflects its effectiveness for the intended task (in this paper, the focus is image classification). The paper assesses datasets with metrics that fall into two broad categories: utility metrics and analytic metrics.
These metrics play distinct but complementary roles in assessing the effectiveness of different curation methods.
Utility Metrics: Measuring Dataset Usefulness Through Model Training
Utility metrics focus on measuring the practical usefulness of a curated dataset for training image classification models. They involve training models on the dataset and evaluating their performance on various tasks.
The paper discusses the following key utility metrics:
- Base Accuracy: This metric measures the model’s performance on a holdout set drawn from the same distribution as the baseline dataset. In this research, the baseline is the original ImageNet-train dataset, and the holdout set is ImageNet-val. Base accuracy directly measures how well a model trained on a particular dataset generalizes to unseen data from the same distribution.
- OOD Robustness: This metric assesses the model’s ability to generalize to out-of-distribution (OOD) datasets, which differ in some way from the training distribution. This includes synthetic OOD shifts (e.g., ImageNet-C, which introduces image corruptions) and natural OOD shifts (e.g., ImageNet-Sketch, which uses sketches of objects). OOD robustness is crucial for evaluating a model’s ability to handle real-world scenarios where the data may not perfectly match the training distribution.
- Fine-tuning: This metric evaluates the model’s ability to adapt to new, unseen tasks after being pretrained on the curated dataset. Strong fine-tuning performance indicates that the pretrained model has learned generalizable features that transfer well to new domains.
- Self-Supervised Guidance: This metric uses a self-supervised learning method (specifically, DINO) to pretrain a model on the curated dataset without using any labels. The pretrained model is then evaluated on the ImageNet-val set using k-NN classification. This approach measures the dataset’s usefulness for learning representations without relying on explicit labels.
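As a rough illustration of the k-NN evaluation step, the sketch below classifies held-out features by majority vote among their nearest training features and reports accuracy. It assumes features have already been extracted by a pretrained encoder; the toy arrays, the `knn_predict` and `accuracy` names, and the plain majority vote are simplifications of my own, and published k-NN protocols may differ (for example by weighting votes).

```python
import numpy as np


def knn_predict(train_feats: np.ndarray, train_labels: np.ndarray,
                query_feats: np.ndarray, k: int = 3) -> np.ndarray:
    """Predict a label for each query by majority vote among its k nearest
    training features under cosine similarity."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    query = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = query @ train.T                       # (n_query, n_train)
    nearest = np.argsort(-sims, axis=1)[:, :k]   # indices of k most similar
    neighbor_labels = train_labels[nearest]      # (n_query, k)
    n_classes = int(train_labels.max()) + 1
    # Count votes per class for each query and take the most common label.
    return np.array([np.bincount(row, minlength=n_classes).argmax()
                     for row in neighbor_labels])


def accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    """Fraction of predictions that match the target labels."""
    return float((pred == target).mean())


# Toy features standing in for encoder outputs (assumption).
train_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                        [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
train_labels = np.array([0, 0, 0, 1, 1, 1])
val_feats = np.array([[1.0, 0.05], [0.05, 1.0]])
val_labels = np.array([0, 1])

pred = knn_predict(train_feats, train_labels, val_feats, k=3)
print(accuracy(pred, val_labels))
```

The same `accuracy` helper applies to base accuracy and OOD robustness: only the evaluation set changes (a holdout from the baseline distribution versus a shifted one).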
Analytic Metrics: Characterizing Datasets Without Training
In contrast to utility metrics, analytic metrics aim to capture the essential characteristics of a dataset without requiring model training. They offer insights into potential factors influencing model performance and can be used for rapid evaluation and comparison of different datasets.
The paper categorizes analytic metrics as follows:
Summary Statistics
These metrics provide a basic overview of the dataset, including:
- Dataset Size: The number of unique samples in the dataset.
- Class Coverage: The number of classes in the label set represented in the dataset.
- Imbalance Metrics: These metrics capture the distribution of samples across classes, highlighting potential issues with class imbalance. The authors introduce two specific metrics:
  - Left-Skewedness: This measures the concentration of samples in a few dominant classes. High left-skewedness indicates that a small number of classes account for a large proportion of the samples, which can bias the model towards those classes.
  - Long-Tailedness: This measures the proportion of classes with very few samples. A highly long-tailed dataset has many classes with limited representation, making it difficult for the model to learn effectively on those classes.
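The summary statistics are cheap to compute from the labels alone. The sketch below uses two simple proxies for the imbalance ideas: the share of samples held by the largest classes, and the fraction of classes with very few samples. These proxies, the `summarize` function, and the `top_k` and `min_count` parameters are my own illustration and are not the paper's exact definitions of left-skewedness or long-tailedness.

```python
from collections import Counter
from typing import Dict, Sequence


def summarize(labels: Sequence[str], label_set: Sequence[str],
              top_k: int = 1, min_count: int = 2) -> Dict[str, float]:
    """Compute basic summary statistics from one label per unique sample.

    top_share:     share of samples in the top_k largest classes
                   (illustrative proxy for concentration in dominant classes)
    tail_fraction: fraction of label-set classes with fewer than min_count
                   samples (illustrative proxy for a long tail)
    """
    counts = Counter(labels)
    per_class = [counts.get(c, 0) for c in label_set]
    size = sum(per_class)
    largest = sorted(per_class, reverse=True)[:top_k]
    return {
        "size": size,
        "class_coverage": sum(1 for n in per_class if n > 0),
        "top_share": sum(largest) / size if size else 0.0,
        "tail_fraction": sum(1 for n in per_class if n < min_count) / len(label_set),
    }


# Toy labels (assumption): three of four target classes are present.
labels = ["cat"] * 6 + ["dog"] * 3 + ["fox"] * 1
label_set = ["cat", "dog", "fox", "owl"]

stats = summarize(labels, label_set)
print(stats)
```

Here one class holds 60% of the samples and half of the label set has fewer than two samples, the kind of pattern these metrics are meant to surface before any model is trained.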
Quality Metrics
These metrics aim to assess the quality of the images and labels in the dataset. The authors consider several metrics, including:
- CLIPScore: This metric uses a CLIP model to evaluate the similarity between the images and their corresponding text labels, serving as a joint measure of image and label quality.
- CLIP-IQA: This metric uses CLIP and generic semantic opposite pairs (e.g., “good/bad”, “bright/dark”) to assess the quality of the images alone.
- Inception Score: This widely used metric measures the diversity and recognizability of generated images using a pretrained Inception v3 model.
- CMMD (CLIP Maximum Mean Discrepancy): This recent metric utilizes richer CLIP embeddings and the maximum mean discrepancy distance to evaluate image quality.
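Of these, the Inception Score has the most compact definition: the exponential of the average KL divergence between each image's predicted class distribution and the marginal distribution over all images. The sketch below computes it from a matrix of class probabilities; in the standard metric those come from a pretrained Inception v3 model, whereas here they are toy arrays, and the `inception_score` name and `eps` smoothing constant are my own choices.

```python
import numpy as np


def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Compute exp(mean_x KL(p(y|x) || p(y))) from class probabilities.

    probs: (n_images, n_classes), each row a predicted distribution p(y|x).
    eps:   small constant to avoid log(0); an implementation assumption.
    """
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), shape (1, n_classes)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))


# Confident predictions spread evenly over 4 classes: recognizable and diverse.
confident = np.eye(4)
# Uniform predictions for every image: nothing is recognizable.
uncertain = np.full((4, 4), 0.25)

print(inception_score(confident), inception_score(uncertain))
```

The score ranges from 1 (every image gets the same prediction as the marginal) up to the number of classes (confident predictions spread evenly across classes), which is why it is read as a combined signal of recognizability and diversity.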
Correlational Metrics
These metrics examine the relationships between various dataset properties, such as:
- Correlation between precision and class count (indicating potential label noise in larger classes).
- Correlation between accuracy and confusion skewness (how concentrated model errors are on certain classes).
- Correlation between the accuracy of the ImageNet-1k model and the model trained on the shift dataset.
- Correlation between precision and recall.
- Correlation between class availability in ImageNet-1k and the shift dataset.
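A correlational metric of this kind reduces to a correlation coefficient over per-class quantities. The sketch below uses Pearson correlation via NumPy; whether the paper uses Pearson, a rank correlation, or something else is not stated here, so treat that choice as an assumption. The per-class numbers are invented for illustration.

```python
import numpy as np
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    x_arr = np.asarray(x, dtype=float)
    y_arr = np.asarray(y, dtype=float)
    return float(np.corrcoef(x_arr, y_arr)[0, 1])


# Made-up per-class values (not from the paper): larger classes are
# given lower precision to mimic label noise growing with class size.
class_counts = [100, 200, 400, 800]
class_precision = [0.9, 0.8, 0.7, 0.5]

r = pearson(class_counts, class_precision)
print(r)
```

A strongly negative value in this toy setup would suggest that the biggest classes are also the noisiest, which is the pattern the precision versus class count correlation is meant to detect.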
In essence, utility metrics answer the “what” question (which datasets lead to better model performance), while analytic metrics help to answer the “why” question (what characteristics of the datasets contribute to those performance differences).
The Focus is on Benchmarking, Not New Curation Methods
While the paper introduces a framework for evaluating data curation strategies, it doesn’t propose a novel method for data curation itself.
The paper’s primary goals are to:
- Bring attention to the importance of data curation.
- Establish a standardized way to assess and compare different data curation strategies.
- Provide insights into the strengths and limitations of existing curation methods.
The paper does, however, introduce SELECT and IMAGENET++.
- SELECT is a benchmark: a set of criteria and metrics for evaluating data curation strategies.
- IMAGENET++ is a dataset, a collection of curated image sets, used to test and compare different curation strategies using the SELECT benchmark.
While closely related, they serve distinct purposes.
SELECT: A Framework for Evaluation
- Purpose: Provide a standardized and comprehensive way to assess the quality and utility of datasets created using various data curation methods. It helps researchers compare different approaches and understand their strengths and weaknesses.
- Focus: Evaluate how well datasets support efficient learning for image classification tasks.
- Metrics: The utility and analytic metrics discussed above.
- Goal: Encourage a more systematic and rigorous evaluation of data curation strategies, moving beyond relying solely on base accuracy. Providing a diverse set of metrics highlights the importance of considering factors like robustness, generalization, and dataset properties when assessing the quality of curated data.
IMAGENET++: A Testbed for Data Curation Strategies
- Purpose: A large-scale dataset specifically designed for evaluating curation strategies with the SELECT benchmark. It provides a collection of datasets (referred to as “shifts”), each curated using one of the strategies previously discussed, allowing for direct comparison of their performance.
- Focus: Image classification, building upon the widely studied ImageNet dataset.
- Datasets: It includes the original ImageNet training set (the baseline representing expert curation) and five shifts (refer to Table 1 in the paper).
- Goal: Enable researchers to empirically test and compare different data curation strategies in a controlled setting. By training models on these shifts and evaluating them using the SELECT benchmark, the authors gain insights into how various curation methods impact model performance across various tasks and metrics.
Key Findings
The central finding of the research is that while no single reduced-cost data curation strategy outperforms the original expert-curated ImageNet dataset across all metrics, some methods, particularly embedding-based search techniques, exhibit promising results and are worthy of further exploration and refinement.
Here are the key takeaways regarding the performance of each curation strategy:
- Expert Curation: As expected, the original ImageNet dataset, created through meticulous human effort, remains the best-performing dataset across most utility metrics, highlighting the enduring value of expert knowledge in crafting high-quality datasets.
- Crowdsourced Labeling: While less expensive than expert curation, this approach results in significant class imbalance, leading to subpar performance on most tasks. Surprisingly, even with human annotators, this method often underperforms compared to some less expensive methods.
- Embedding-Based Search: This strategy, utilizing CLIP embeddings to select images, emerges as the most promising reduced-cost method, consistently outperforming other techniques like synthetic image generation. However, it suffers from label noise, which hinders its ability to fully match expert-curated datasets.
- Synthetic Data Generation: While offering potential cost savings, this method, relying on Stable Diffusion, struggles to generate images of high enough quality to compete with real image datasets. The authors note that current image quality metrics fail to accurately predict the utility of synthetic datasets, suggesting a need for better evaluation tools for this approach.
Ultimately, choosing a data curation strategy involves a trade-off between cost and performance. This research underscores data curation’s critical role in the success of machine learning models. The authors argue that continued research and development of more efficient and effective data curation strategies are crucial for unlocking the full potential of machine learning across various domains and applications.