Review of a Data-Centric AI Paper from NeurIPS 2024 — Understanding Bias in Large-Scale Visual Datasets
This post is part of a five-part series examining notable data-centric AI papers from NeurIPS 2024. For brief summaries of all five papers, check out my overview post, where you’ll find links to each detailed analysis.
A decade ago, researchers highlighted the issue of bias in visual datasets, demonstrating that models could easily predict the dataset origin of an image.
Despite efforts to create more diverse and comprehensive datasets, this problem persists. In this paper, researchers explored the specific forms of bias present in large-scale datasets like YFCC, CC, and DataComp (which the authors collectively call YCD).
Using a novel framework that applies various transformations to isolate different types of information, they discovered that semantic and structural biases are major contributors to the ease with which models can classify datasets.
There is no doubt that bias exists in large-scale visual datasets.
While previous research focused on identifying social biases (e.g., gender, race, geographical representation) in datasets, this paper goes further by pinpointing the specific visual attributes contributing to this bias. To this end, the authors attempt to answer the following question: What concrete forms of visual bias are present in large-scale visual datasets?
As described in the paper, visual bias refers to the distinctive characteristics of images from different datasets that allow machine learning models to accurately predict their dataset of origin. The fact that images can be classified by dataset suggests a lack of diversity and representativeness in these datasets, potentially limiting the generalizability of models trained on them.
Types of Visual Biases
Using a framework based on image transformations and dataset classification, this paper focuses on understanding the specific forms of bias that differentiate large-scale visual datasets, leading to their easy classification by neural networks.
It outlines several types of visual bias, which can be categorized as:
Semantic Bias
This refers to biases related to the content and meaning represented in the images. The paper investigates two key aspects:
Object-level Bias
Two distinct patterns of object-level bias emerge when examining large-scale visual datasets, each affecting how models learn to understand our visual world.
- Uneven Distribution: The presence and frequency of specific object categories might vary drastically between datasets. For example, one dataset might overrepresent images with “cars” while another might have a much higher proportion of “household items.”
- Limited Object Diversity: Datasets could differ in the average number of unique object categories present per image. This suggests that some datasets might be more object-centric, focusing on images with a single or few dominant objects. In contrast, others might capture scenes with greater object variety.
Theme and Scene Bias
Datasets might exhibit distinct thematic focuses, evident in the scenes and activities they depict. This bias can be explored by analyzing:
- High-level Scene Categories
- Depiction of Human Activities
- Presence of Artistic or Stylized Content
Structural Bias
This pertains to biases related to the spatial arrangement and composition of visual elements within images. The paper explores:
- Object Shape and Geometric Layout: Even without considering objects’ semantic meaning, their shapes and spatial configurations can indicate the dataset’s origin.
- Local vs. Global Spatial Structure: The paper investigates the role of spatial information at different scales:
- Local Structure: The arrangement of visual elements within small image patches can contribute to dataset-specific patterns.
- Global Structure: The overall composition and layout of elements across the entire image might also be biased.
Color Bias
This concerns biases related to the color palettes and distributions in the images. Datasets might have characteristic color profiles, even when considering only the average color values of images.
Frequency Bias
This focuses on biases present in different frequency components of the images, which capture information about both texture and structure. Datasets may contain distinctive patterns in both frequency bands.
The paper analyzes these forms of bias using a variety of image transformations that isolate or emphasize specific visual attributes. By observing how these transformations affect a model's ability to classify datasets, the authors aim to reveal the concrete forms of bias that make the datasets visually distinct.
Experimental Setup to Identify Dataset Bias
The paper runs an experimental setup designed to understand the specific forms of bias present in three large-scale visual datasets: YFCC, CC, and DataComp.
The core idea is to apply various image transformations to isolate different visual attributes and then assess how well a neural network can classify the images based on their dataset of origin after these transformations. The effectiveness of dataset classification on transformed images indicates the presence and strength of bias within the specific visual attribute targeted by the transformation.
For each dataset, 1 million images are randomly sampled for training and 10,000 images for validation. The primary task is dataset classification, where a neural network is trained to predict the dataset origin (YFCC, CC, or DataComp) of an input image. The classification accuracy on this task serves as a measure of the overall bias present in the datasets.
The authors then employ a variety of image transformations to isolate and analyze different visual attributes.
Semantic Transformations
- Semantic Segmentation: Transforming images into semantic segmentation maps, where each pixel is labeled with an object class, to assess bias in fine-grained semantic information.
- Object Detection: Extracting object bounding boxes with class labels to evaluate bias in coarse-grained object information.
- Image Captioning: Generating textual descriptions of images to represent semantic content without visual details, allowing analysis of bias solely in semantic concepts (a captioning sketch follows this list).
- Variational Autoencoder (VAE): Encoding and reconstructing images using a VAE to potentially reduce low-level image artifacts while preserving semantic information.
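As a concrete illustration of the captioning transformation, here is a minimal sketch assuming a BLIP captioning model served through the Hugging Face transformers pipeline; the paper's exact captioning model may differ.

```python
# Sketch: replace each image with a short caption so only semantic content remains.
# The model name and pipeline choice are illustrative assumptions, not the paper's setup.
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def caption_image(path: str) -> str:
    """Return a one-sentence textual description of the image at `path`."""
    image = Image.open(path).convert("RGB")
    return captioner(image)[0]["generated_text"]

# The resulting captions can then be fed to a text classifier instead of the raw pixels.
```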
Structural Transformations
- Edge Detection: Using the Canny edge detector to highlight object boundaries, focusing on bias in object shape (an edge-detection sketch follows this list).
- Contour Extraction (SAM): Employing the Segment Anything Model (SAM) to delineate object contours, providing cleaner shape representations.
- Depth Estimation: Generating depth maps to capture the spatial geometry and relative object positions, examining bias in 3D spatial arrangements.
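For the edge-detection transformation above, a minimal sketch using OpenCV's Canny detector; the thresholds and file names are illustrative, not the paper's settings.

```python
# Sketch: reduce an image to its edge map so only shape information remains.
import cv2

def to_edge_map(path: str):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.Canny(gray, threshold1=100, threshold2=200)  # binary edge image

cv2.imwrite("example_edges.png", to_edge_map("example.jpg"))  # "example.jpg" is a placeholder
```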
Spatial Permutations
- Pixel Shuffling: Randomly rearranging pixels to completely disrupt spatial structure.
- Patch Shuffling: Rearranging image patches to preserve some local spatial information while disrupting global structure.
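A minimal sketch of patch shuffling; the patch size is an illustrative choice, and a patch size of 1 degenerates to pixel shuffling.

```python
# Sketch: shuffle non-overlapping patches to destroy global layout while keeping
# the local structure inside each patch intact.
import numpy as np

def shuffle_patches(img: np.ndarray, patch: int = 32, seed: int = 0) -> np.ndarray:
    """img: HxWxC array with H and W divisible by `patch`."""
    h, w, c = img.shape
    grid = img.reshape(h // patch, patch, w // patch, patch, c)
    patches = grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)
    rng = np.random.default_rng(seed)
    rng.shuffle(patches)                       # permute the order of the patches
    grid = patches.reshape(h // patch, w // patch, patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(h, w, c)
```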
Color Transformations
Reducing each image to a single color representing its average RGB value to isolate bias in color statistics.
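This transformation reduces to a few lines; a minimal sketch:

```python
# Sketch: collapse an image to a single flat color equal to its mean RGB value,
# so only global color statistics survive.
import numpy as np
from PIL import Image

def to_mean_color(path: str) -> Image.Image:
    img = Image.open(path).convert("RGB")
    arr = np.asarray(img, dtype=np.float32)
    mean_rgb = tuple(int(v) for v in arr.reshape(-1, 3).mean(axis=0))
    return Image.new("RGB", img.size, mean_rgb)
```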
Frequency Transformations
- High-pass Filtering: Retaining high-frequency components to analyze bias in textures and sharp transitions.
- Low-pass Filtering: Keeping low-frequency components to examine bias in overall structure and smooth variations.
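A minimal sketch of the frequency split. The paper filters in frequency space; the Gaussian-blur approximation in the spatial domain below is an assumption made for brevity, and the cutoff (sigma) is illustrative.

```python
# Sketch: split an image into low- and high-frequency components with a Gaussian blur.
import cv2
import numpy as np

def frequency_split(path: str, sigma: float = 5.0):
    img = cv2.imread(path).astype(np.float32)
    low = cv2.GaussianBlur(img, ksize=(0, 0), sigmaX=sigma)  # low-pass: smooth structure
    high = img - low                                         # high-pass: texture and edges
    high = cv2.normalize(high, None, 0, 255, cv2.NORM_MINMAX)
    return low.astype(np.uint8), high.astype(np.uint8)
```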
Synthetic Image Generation
The researchers explored unconditional and text-to-image generation methods to understand whether dataset bias extends to synthetic data. The goal was to see if diffusion models, trained on these biased datasets, would produce synthetic images that also reflect the original biases.
- Unconditional Generation: Training a diffusion model on each dataset and generating synthetic images to see if the model inherits and reflects the original dataset bias.
- Text-to-Image Generation: Creating synthetic images conditioned on image captions to assess whether semantic bias is preserved in the generation process.
Classification Model and Training
A ConvNeXt-Tiny model is used as the base classifier for the dataset classification task. The model is trained separately on each set of transformed images.
The primary evaluation metric is the model’s classification accuracy on the validation set of transformed images. High accuracy indicates that the specific visual attribute targeted by the transformation contributes significantly to dataset bias.
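A minimal sketch of this setup, assuming the timm implementation of ConvNeXt-Tiny and an illustrative optimizer; the paper's exact training recipe is not reproduced here.

```python
# Sketch: a three-way dataset classifier (YFCC vs. CC vs. DataComp) over transformed images.
import timm
import torch
import torch.nn as nn

model = timm.create_model("convnext_tiny", pretrained=False, num_classes=3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

def train_step(images: torch.Tensor, dataset_labels: torch.Tensor) -> float:
    """images: (B, 3, H, W) transformed images; dataset_labels: (B,) in {0, 1, 2}."""
    optimizer.zero_grad()
    loss = criterion(model(images), dataset_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```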
Beyond classification accuracy, the authors perform further analyses to understand semantic bias:
- Object-Level Queries: Using object detectors pretrained on ImageNet, LVIS, and ADE20K to identify objects in each dataset and analyze their distribution and diversity.
- Open-Ended Language Analysis: Applying topic modeling (LDA) and prompting a large language model (GPT-4o) to extract and summarize semantic themes from image captions.
By systematically evaluating dataset classification performance across a wide range of image transformations, the authors aim to provide a comprehensive picture of the types and extent of bias present in the YCD datasets. This experimental setup allows them to draw conclusions about the role of specific visual attributes in dataset bias and to discuss potential implications for dataset curation and model training.
Forms of Bias in the YCD Datasets
The analysis of various image transformations reveals several types of bias present in the YFCC, CC, and DataComp datasets:
Semantic Bias
The research finds that semantic bias plays a major role in distinguishing the datasets.
Even when images are transformed to retain only semantic information (through semantic segmentation, object detection, or captions), the model can still predict their dataset origin with accuracy well above chance. This suggests that the datasets have substantial differences in the types of objects, scenes, and themes they represent.
Analysis of object distributions reveals a stark imbalance in the presence and frequency of specific objects across the datasets. For instance, YFCC is heavily populated with images containing “poles,” “stages,” and “parachutes,” while CC has a higher proportion of “sweatshirts,” “lampposts,” and “lanyards.” DataComp, in turn, is characterized by a preponderance of “vases,” “armchairs,” and “beds.”
YFCC exhibits a notably higher average number of unique objects per image than CC and DataComp. DataComp has the fewest unique objects per image, likely due to its filtering process, which prioritizes images with content similar to ImageNet.
Distinct Thematic Focuses
Open-ended language analysis, using topic modeling and LLM (GPT-4o) summarization of image captions, uncovers distinct thematic focuses for each dataset.
- YFCC: Strong emphasis on outdoor and natural scenes, human interactions, and social events. Captions frequently mention elements like “people,” “group,” “wearing,” “field,” “game,” “water,” “sky,” and “trees.”
- CC: A blend of YFCC’s dynamic scenes with a greater focus on indoor settings and household items. Captions often describe “rooms,” “dining tables,” “chairs,” and “designs.”
- DataComp: Concentrates on static objects, products, and digital graphics, with a prevalence of clean backgrounds and minimal human presence. Keywords like “logo,” “background,” “design,” “book,” “box,” and “bottle” are prominent.
Structural Bias
The research found that the model can classify datasets with even higher accuracy when using object contours (extracted through edge detection or SAM) and depth maps compared to semantic information alone. This highlights that object shapes and spatial configurations are strong indicators of dataset origin.
Surprisingly, shuffling image patches while maintaining local structure within each patch has minimal impact on dataset classification accuracy, especially with larger patch sizes. This indicates that local spatial information is a potent source of bias and sufficient for the model to learn dataset-specific patterns.
Color Bias
Even when reducing each image to its average RGB value, the model achieves a classification accuracy significantly higher than chance. This suggests that the datasets exhibit differences in overall color palettes and distributions.
- YFCC Notably Darker: Analysis of mean RGB values reveals that YFCC images are generally darker than those in CC and DataComp.
- Confusion Between CC and DataComp: While the model easily classifies YFCC images based on color alone, it has more difficulty distinguishing between CC and DataComp, which have similar color distributions.
Frequency Bias
The model retains close-to-reference accuracy when trained on images with high-frequency or low-frequency components filtered out. This indicates that dataset bias exists across both frequency bands, implying that texture and structure contribute to the datasets’ visual distinctiveness.
These findings suggest that despite efforts to improve diversity, large-scale datasets still exhibit significant biases across various visual attributes.
While these biases may be subtle to human observers, neural networks readily exploit them, raising concerns about the generalizability and robustness of models trained on such data. The authors argue that understanding these biases is crucial for creating more representative datasets and developing models that can perform reliably in diverse real-world scenarios.
Understanding Your Dataset with Transformations
While the paper focuses on classifying datasets to identify bias, you can adapt their transformation methodology to better understand your single dataset. Here’s how you can apply the transformations and interpret the results without relying on dataset classification accuracy:
Focus on Transformation Outputs as Representations
Instead of viewing transformations as a preprocessing step for classification, treat their outputs as new representations of your data. Each transformation emphasizes specific visual attributes while suppressing others.
Analyze the Transformed Data
Examine the transformed images directly. For example, look at the semantic segmentation maps to see if certain object classes are more prominent or spatially clustered. Analyze edge maps or contours to understand the prevalence and distribution of different object shapes. Inspect color histograms derived from averaging RGB values to see if your dataset is skewed towards particular color palettes. Calculate statistics on the transformed data, for example:
- Measure the average number of unique objects detected per image to assess object diversity.
- Compute the distribution of edge densities to understand the complexity of shapes.
- Analyze the frequency spectrum after applying high-pass and low-pass filters to determine the amount of information contained in different frequency bands.
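Two of these statistics are easy to compute per image; a minimal sketch, where `class_labels` is a placeholder for whatever object-detector output you have.

```python
# Sketch: per-image statistics over transformed data.
import cv2
import numpy as np

def unique_object_count(class_labels: list) -> int:
    """class_labels: detector output for one image, e.g. ['car', 'person', 'car']."""
    return len(set(class_labels))

def edge_density(path: str) -> float:
    """Fraction of pixels marked as edges (a rough proxy for shape complexity)."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, 100, 200)
    return float(np.count_nonzero(edges)) / edges.size
```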
Unsupervised Learning
Apply clustering algorithms to the transformed data to see if natural groupings emerge.
- For example, clustering images based on their semantic segmentation maps can reveal groups of images with similar object compositions.
- Cluster images based on their HOG features to see if distinct shape-based categories arise.
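A minimal sketch of the shape-based clustering, assuming HOG features from scikit-image and k-means from scikit-learn; the feature settings, cluster count, and image directory are illustrative.

```python
# Sketch: cluster images by HOG (shape) features to see whether shape-based groups emerge.
import glob
import numpy as np
from skimage.feature import hog
from skimage.io import imread
from skimage.transform import resize
from sklearn.cluster import KMeans

def hog_feature(path: str) -> np.ndarray:
    img = resize(imread(path, as_gray=True), (128, 128))
    return hog(img, pixels_per_cell=(16, 16), cells_per_block=(2, 2))

paths = sorted(glob.glob("images/*.jpg"))  # placeholder: your dataset's images
features = np.stack([hog_feature(p) for p in paths])
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
```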
Natural Language Analysis
If captions are associated with your images, analyze them using techniques like topic modeling or LLM summarization to uncover prevalent themes and potential biases. For example, the authors used LDA to identify topics related to "outdoor scenes" in YFCC and "digital graphics" in DataComp.
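A minimal sketch of the caption analysis, assuming scikit-learn's LDA implementation; the placeholder captions, vocabulary size, and topic count are illustrative.

```python
# Sketch: LDA topic modeling over image captions to surface recurring themes.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

captions = [
    "a group of people playing soccer on a grassy field",
    "a white logo on a plain blue background",
]  # placeholder: your dataset's captions

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
counts = vectorizer.fit_transform(captions)

lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(counts)
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-8:][::-1]]
    print(f"topic {k}: {', '.join(top_words)}")
```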
Key Points
- Transformation is Key: The transformations are not about removing bias but about creating alternative representations highlighting specific visual aspects of your data.
- Focus on Interpretation: The goal is to gain insights into your dataset, not to achieve high classification accuracy.
- Context Matters: The meaning of the findings depends on how your data was collected and how it will be used.
By adopting the transformation approach from the paper, you can better understand your own dataset's visual characteristics, potential biases, and underlying patterns.