ImageNet-D: New Synthetic Test Set Designed to Rigorously Evaluate the Robustness of Neural Networks
February 11, 2025 – Written by Harpreet Sahota
Neural networks are achieving incredible feats in zero-shot image classification, but how well do they really see?
Existing datasets for benchmarking the robustness of these models are limited to images that already exist on the web or that are created through time-consuming, resource-intensive manual collection. This makes it difficult to systematically evaluate how well these models generalize to unseen data and real-world conditions, including variations in background, texture, and material. One viable solution is to evaluate models on synthetically altered or generated images, as in ImageNet-C, ImageNet-9, or Stylized-ImageNet. However, these datasets rely on a fixed set of synthetic corruptions, backgrounds, and textures, offer limited variation, and lack realistic image quality.
There’s also the added challenge that models have become so capable that they achieve remarkably high accuracy on these synthetic datasets, leaving little room to differentiate their robustness.
ImageNet-D, a new benchmark generated with diffusion models, addresses these limitations, pushing models to their breaking points with challenging images and revealing critical failures in model robustness.
- It’s composed of 4,835 “hard images.”
- ImageNet-D spans 113 overlapping categories between ImageNet and ObjectNet.
- The dataset incorporates 547 nuisance variations spanning backgrounds, textures, and materials; of the final 4,835 images, 3,764 vary the background, 498 vary the texture, and 573 vary the material, making it far more diverse than previous benchmarks. By systematically varying these factors, ImageNet-D comprehensively assesses how well a model can truly “see” beyond superficial image features.
The jump from the “complexities of real-world data” to a synthetic dataset like ImageNet-D might seem counterintuitive, but it addresses key limitations in how neural network robustness is evaluated.
Some reasons why synthetic datasets are advantageous include:
- Need for Systematic Control: Real-world data is inherently uncontrolled. If you want to test how a neural network responds to changes in background, texture, or material, it is hard to systematically create or find real-world data with all the combinations you need.
- Synthetic Data Offers Control & Scalability: ImageNet-D leverages diffusion models to generate synthetic images, overcoming the limitations of real-world data. This approach allows researchers to systematically control and efficiently scale the dataset, exploring a much wider range of variations than would be feasible with real images alone. Using diffusion models, ImageNet-D can generate images with more diversified backgrounds, textures, and materials than existing datasets.
- Focus on “Hard” Examples: ImageNet-D uses a hard image mining process to selectively retain images that cause failures in multiple vision models. By focusing on the weaknesses of current models, ImageNet-D provides a more informative evaluation.
- Quality Control via Human Verification: While synthetic, ImageNet-D doesn’t sacrifice quality. A rigorous quality control process involving human annotators ensures that the generated images are valid, single-class, and of high quality.
Image Generation by Diffusion Models
The algorithm for generating images for ImageNet-D leverages Stable Diffusion together with a hard image mining strategy. Image generation is formulated as Image(C, N) = StableDiffusion(Prompt(C, N)), where C is the object category and N represents nuisances such as background, material, and texture (a minimal sketch of this prompt construction follows the list below).
- Images are generated by pairing each object with all nuisances in diffusion model prompts, using 468 backgrounds, 47 textures, and 32 materials from the Broden dataset.
- Each image is labeled with its prompt category C as the ground truth for classification.
- An image is considered misclassified if the model’s predicted label doesn’t match the ground truth C.
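To make the formulation concrete, here is a minimal sketch of the prompt pairing using the Hugging Face diffusers library. The prompt templates, checkpoint name, and category/nuisance lists are illustrative assumptions, not the authors’ exact configuration.

```python
import itertools

import torch
from diffusers import StableDiffusionPipeline

# Illustrative subsets; the real pipeline pairs 113 categories with
# 468 backgrounds, 47 textures, and 32 materials from the Broden dataset
categories = ["backpack", "banana", "wheelbarrow"]
nuisances = {
    "background": ["snowfield", "parking lot"],
    "texture": ["woven"],
    "material": ["glass"],
}

def build_prompt(category: str, nuisance_type: str, nuisance: str) -> str:
    # Hypothetical prompt templates approximating Prompt(C, N)
    templates = {
        "background": f"a photo of a {category} in a {nuisance}",
        "texture": f"a photo of a {category} with a {nuisance} texture",
        "material": f"a photo of a {category} made of {nuisance}",
    }
    return templates[nuisance_type]

# Assumed checkpoint; any Stable Diffusion checkpoint works for this sketch
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Image(C, N) = StableDiffusion(Prompt(C, N)) for every (category, nuisance) pair
for category, (ntype, values) in itertools.product(categories, nuisances.items()):
    for value in values:
        image = pipe(build_prompt(category, ntype, value)).images[0]
        image.save(f"{category}_{ntype}_{value.replace(' ', '_')}.png")
```

In the full pipeline, every one of the 113 categories is paired with all 547 nuisances, so the candidate pool is far larger than the 4,835 hard images that survive the later filtering steps.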
Hard Image Mining with Shared Perception Failures
The hard image mining strategy of the ImageNet-D creation process identifies and selects the most challenging images for evaluating neural network robustness.
The goal is to create a test set that pushes the limits of vision models, exposing their weaknesses and failure points. This is achieved by focusing on images that are difficult for multiple models to classify correctly.
- Shared Perception Failure: The core concept is a “shared failure,” which occurs when an image causes multiple models to incorrectly predict the object’s label. The rationale is that images causing shared failures across different models will likely be intrinsically more challenging and informative for evaluating robustness.
- Surrogate Models: To identify these hard images, a set of pre-existing, well-established vision models is used as “surrogate models”. These models act as proxies to estimate the difficulty of images for other, potentially unknown “target models”. The surrogate models include CLIP (ViT-L/14, ViT-L/14@336px, and ResNet50), ResNet50, ViT-L/16, VGG16, and others.
The Mining Process
- Generate a large pool of synthetic images using diffusion models, as described earlier.
- Run each surrogate model on the generated images and record its predictions.
- Identify images where multiple surrogate models fail to predict the correct object label. These images are flagged as potential “hard” examples.
- Construct the ImageNet-D test set from these shared failures. The final dataset was built using the shared failures of 4 surrogate models.
The result is a carefully designed process to create a challenging and informative benchmark by selecting synthetic images that expose shared weaknesses across multiple vision models.
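To illustrate the shared-failure criterion, the sketch below uses a few ImageNet-pretrained torchvision classifiers as stand-ins for the surrogate models and keeps an image only if at least a chosen number of them mispredict its label. The specific models, threshold, and helper names are assumptions for illustration, not the paper’s exact setup.

```python
import torch
from PIL import Image
from torchvision import models

# Stand-in surrogate classifiers (ImageNet-pretrained); the paper also uses CLIP variants
WEIGHTS = {
    "resnet50": models.ResNet50_Weights.IMAGENET1K_V2,
    "vgg16": models.VGG16_Weights.IMAGENET1K_V1,
    "vit_l_16": models.ViT_L_16_Weights.IMAGENET1K_V1,
}
SURROGATES = {
    name: (models.get_model(name, weights=w).eval(), w.transforms())
    for name, w in WEIGHTS.items()
}

@torch.no_grad()
def is_shared_failure(image_path: str, true_class_idx: int, min_failures: int = 2) -> bool:
    """Return True if at least `min_failures` surrogates mispredict the image's label."""
    image = Image.open(image_path).convert("RGB")
    failures = 0
    for model, preprocess in SURROGATES.values():
        logits = model(preprocess(image).unsqueeze(0))
        if logits.argmax(dim=1).item() != true_class_idx:
            failures += 1
    return failures >= min_failures

# Generated images that trigger shared failures become candidate "hard" examples:
# hard_pool = [path for path, label in candidates if is_shared_failure(path, label)]
```

Images that survive this filter are the candidates that move on to human verification, described next.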
Quality Control by Human-in-the-Loop
The human-in-the-loop component is essential for verifying the quality and accuracy of the ImageNet-D dataset, ensuring that the images are correctly labeled and suitable for evaluating the robustness of neural networks.
While diffusion models and hard image mining generate and select challenging images, human annotation is essential to refining the dataset. Human annotation ensures that the ImageNet-D images are valid, single-class, and high-quality. Because ImageNet-D contains diverse object and nuisance pairings that may be uncommon, the labeling criteria consider the main object’s appearance and functionality.
679 qualified Amazon Mechanical Turk workers participated in 1,540 labeling tasks, achieving 91.09% agreement on sampled images from ImageNet-D. Workers were asked to consider the following questions:
- Can you recognize the desired object ([ground truth category]) in the image?
- Can the object in the image be used as the desired object ([ground truth category])?
Sentinels were incorporated into each labeling task to maintain high-quality annotations. These include:
- Positive Sentinel: Images that clearly belong to the desired category and are correctly classified by multiple models.
- Negative Sentinel: Images that do not belong to the desired category.
- Consistent Sentinel: Images repeated at random to check the consistency of a worker’s responses.
- Responses from workers who fail the sentinel checks are discarded (a simplified sketch of this filtering appears below).
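The sentinel logic amounts to a simple filter over a worker’s responses. The sketch below is a hypothetical implementation with made-up field names and data structures; it is not the actual annotation pipeline, only an illustration of the three checks.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Response:
    worker_id: str
    image_id: str
    answer: bool              # worker's yes/no answer to the labeling questions
    sentinel: Optional[str]   # "positive", "negative", "consistent", or None for real items

def passes_sentinels(task_responses: list) -> bool:
    """Keep a worker's task only if every sentinel check is satisfied."""
    repeated = {}
    for r in task_responses:
        if r.sentinel == "positive" and not r.answer:
            return False  # rejected an image that clearly shows the category
        if r.sentinel == "negative" and r.answer:
            return False  # accepted an image that clearly does not show the category
        if r.sentinel == "consistent":
            repeated.setdefault(r.image_id, []).append(r.answer)
    # Randomly repeated images must receive identical answers
    return all(len(set(answers)) == 1 for answers in repeated.values())

# Tasks that fail any check are discarded before the final labels are aggregated
```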
How to Use and Interpret Results
So, you’ve tested your model on ImageNet-D — now what? Here’s how to interpret the results and gain valuable insights into your model’s strengths and weaknesses:
- Lower accuracy on ImageNet-D indicates a lack of robustness. If your model performs significantly worse on ImageNet-D than on standard benchmarks like ImageNet, it likely struggles to generalize when faced with variations in background, texture, and material, and probably relies on superficial features rather than truly “understanding” the object.
- Compare against other models. A single accuracy score is only so informative. To gauge your model’s robustness, compare its performance against other models tested on ImageNet-D. This will give you a sense of its relative standing and highlight areas where it excels or lags behind.
- Analyze failure cases. Don’t just look at the overall accuracy; analyze the specific images where your model fails. Are there particular backgrounds that consistently cause misclassifications? Is the model easily fooled by unusual textures or materials? By analyzing these failure cases, you can identify your model’s specific weaknesses and target your efforts for improvement (see the sketch below for one way to do this in FiftyOne).
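One way to run this kind of failure analysis is directly in FiftyOne. The sketch below assumes you have already added your model’s predictions to the dataset in a Classification field named predictions and that the ground truth lives in a field named ground_truth; both field names are assumptions about how you store results, not guarantees about the Hugging Face export’s schema.

```python
import fiftyone as fo
import fiftyone.utils.huggingface as fouh
from fiftyone import ViewField as F

# Load the FiftyOne export of ImageNet-D (see the Next Steps section below)
dataset = fouh.load_from_hub("Voxel51/ImageNet-D")

# Assumes you have already added your model's predictions in a Classification
# field called "predictions"; the ground-truth field name is also an assumption
results = dataset.evaluate_classifications(
    "predictions", gt_field="ground_truth", eval_key="eval"
)
results.print_report()

# Drill into the failures to look for recurring backgrounds, textures, or materials
failures = dataset.match(F("eval") == False)
session = fo.launch_app(failures)
```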
Next Steps
If you’re interested in exploring the dataset, I’ve parsed it into FiftyOne format and uploaded it to Hugging Face. With a few lines of code, you can download and start exploring the dataset:
```python
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

dataset = fouh.load_from_hub("Voxel51/ImageNet-D")

# Launch the App
session = fo.launch_app(dataset)
```
Once the app is launched, you can explore what’s in the dataset!
Conclusion
Combining synthetic image generation through diffusion models, systematic hard image mining, and rigorous human verification, ImageNet-D offers a more comprehensive and challenging benchmark than previous datasets.
The results from ImageNet-D testing can reveal critical insights about a model’s true understanding of visual concepts beyond mere surface-level pattern matching.
As vision models advance, reliable ways to assess their limitations become increasingly important. ImageNet-D helps identify these limitations and provides a pathway for developing more robust models that better handle real-world variations in appearance, background, and context. For researchers and practitioners in computer vision, ImageNet-D is more than just another benchmark — it’s a valuable tool for understanding and improving how artificial neural networks see and interpret the visual world.