Exploring Amazon's Kaputt Dataset with FiftyOne

Congratulations to the winners and sponsors of the VAND 4.0 Kaputt Challenge at CVPR 2026.

The results are in. After six weeks of competition, hundreds of model submissions, and no shortage of creative solutions, the VAND 4.0 Retail Track has crowned its winners at CVPR 2026 in Honolulu. Powered by Amazon's Kaputt dataset, the track pushed teams to tackle defect detection under the messy, unpredictable conditions of real retail logistics. Before we dive into what made this challenge so compelling, we want to take a moment to recognize the people and organizations that made it possible.

Congratulations to the Winners 🏆

To every team that competed in the VAND 4.0 Kaputt Retail Track: well done. You tackled one of the genuinely hard problems in computer vision — anomaly detection under real-world conditions — and pushed the state of the art further than it has been before. The winning submissions demonstrated that closing the gap between controlled benchmarks and the messiness of the real world is possible, and the field is better for your work. Here’s a short video highlighting Xian Tao’s submission that won the “Best Performance” category.

What was the VAND 4.0 Challenge?

VAND — Visual Anomaly and Novelty Detection — is a long-running workshop series at CVPR that tracks progress on one of computer vision's most practically important problems: detecting when something is wrong with an object, without necessarily knowing in advance what "wrong" looks like.

The fourth edition, VAND 4.0 at CVPR 2026, expanded the challenge for the first time to include both manufacturing and retail logistics settings, across two independent tracks:

Industrial Track — using the MVTec AD 2 dataset, evaluating robustness under real-world distribution shifts like changing lighting conditions.
Retail Track — using Kaputt 2, an extension of the Kaputt dataset, evaluating defect detection under the chaos of retail logistics: wildly varying object poses, diverse appearances, and no controlled capture setup.

The Kaputt Dataset: Why It Matters

The word kaputt is German for broken, and the dataset lives up to its name — it is a benchmark built around things that have gone wrong, at a scale and realism that the field has not seen before.

The Problem with Existing Benchmarks

For years, anomaly detection research has been anchored to datasets like MVTec AD and VisA. These are excellent controlled benchmarks, but that very control has become a limitation. Objects are photographed from fixed viewpoints, under consistent lighting, with predictable backgrounds. State-of-the-art methods now score as high as 99.9% AUROC on MVTec AD, which sounds great until you deploy those same models in a real warehouse and watch the numbers collapse.

Retail logistics is nothing like a controlled photo studio. Items arrive in arbitrary orientations. The same product can look completely different depending on how it was packed, handled, or stored. Damage can be subtle — a small dent, a torn label corner, a scuffed surface — against a background of enormous normal variation.

What Kaputt Brings to the Table

Kaputt was introduced at ICCV 2025 by researchers at Amazon, and it was designed from the ground up for this harder problem. The numbers alone make it striking:

230,000+ images — roughly 40 times larger than MVTec
29,000+ defective instances — not a handful of defect examples per category, but a deep, varied collection
48,000+ distinct objects — covering diverse product types across a retail logistics setting
Heavy pose and appearance variation — the defining challenge that separates Kaputt from prior art

To validate just how hard it is, the Kaputt authors benchmarked multiple state-of-the-art anomaly detection methods on the dataset. The best any of them managed was 56.96% AUROC — barely better than a coin flip. That is not a failure of those methods. That is an honest measurement of how far the field still has to go when the real world refuses to cooperate.
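To make the metric concrete: AUROC is the probability that a randomly chosen defective sample receives a higher anomaly score than a randomly chosen normal one, so 50% is chance and 100% is a perfect ranking. Here is a minimal plain-Python sketch of that definition. The scores are made up for illustration and are not Kaputt results.

```python
from itertools import product


def auroc(normal_scores, defect_scores):
    """Probability that a defective sample outscores a normal one (ties count half)."""
    if not normal_scores or not defect_scores:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for normal, defect in product(normal_scores, defect_scores):
        if defect > normal:
            wins += 1.0
        elif defect == normal:
            wins += 0.5
    return wins / (len(normal_scores) * len(defect_scores))


# Made-up anomaly scores, for illustration only.
normal = [0.10, 0.20, 0.30, 0.40]
separated = [0.60, 0.70, 0.80, 0.90]    # every defect outscores every normal sample
overlapping = [0.05, 0.25, 0.35, 0.45]  # defects interleaved with normal scores

perfect = auroc(normal, separated)        # 1.0
chance_like = auroc(normal, overlapping)  # 0.5625
```

The interleaved case lands at 0.5625, in the same neighborhood as the best baseline reported above: the detector ranks a defective item above a normal one only slightly more often than not.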

Kaputt 2: Raising the Stakes

The VAND 4.0 challenge used Kaputt 2, an extension of the original dataset, released at the start of the challenge. It builds on the same foundation while pushing the difficulty further, ensuring that solutions developed for the competition represent genuine progress rather than overfitting to a familiar benchmark.

Explore Kaputt with FiftyOne

A dataset with 230,000 images, dozens of defect types, multiple image modalities (full images, crops, segmentation masks), and rich per-sample metadata is not something you can understand by reading a README. You have to look at it. And looking at it (really looking, at the right samples, from the right angles, with the right context) is exactly what the open-source FiftyOne library is built for.

Grouped datasets for multimodal data. Kaputt is not just a pile of images. Each sample has up to three related views: the original image, a tightly cropped region of interest, and a segmentation mask.

FiftyOne grouped dataset view of a Kaputt book sample showing the cropped image slice and its segmentation mask, with label fields in the sidebar

FiftyOne's grouped dataset structure handles this naturally, letting you browse all three views in sync in the same UI panel rather than trying to mentally correlate files across directories.

Interactive filtering without writing SQL. With compound indexes on fields like defect_types, item_material, and split, you can filter hundreds of thousands of samples to the exact subset you care about in milliseconds. Want to see plastic items that have major spillage?

FiftyOne App filtering Kaputt samples to hard plastic items with spillage defects, showing leaked product across the box floors

Two lines of Python, one filter expression. The FiftyOne App surfaces the same filters visually, with a sidebar that updates in real time as you drill down.
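Here is a sketch of what such a filter might look like. The field names defect_types and item_material come from this post, and the dataset name kaputt matches the flat dataset described in the notebook section. The values "hard_plastic" and "spillage" are assumptions about how the labels are spelled, so check them against your copy of the data, for example with dataset.distinct("item_material"). The sketch does not filter on severity, because the name of that field is not given here.

```python
def spillage_on_hard_plastic(dataset, material="hard_plastic", defect="spillage"):
    """Return a view of samples of one material that list a given defect type.

    `dataset` is a FiftyOne dataset or view. The default values are assumed
    spellings and may differ from the labels in your copy of Kaputt.
    """
    # Imported lazily: FiftyOne is only needed when the helper is called.
    from fiftyone import ViewField as F

    is_material = F("item_material") == material
    has_defect = F("defect_types").contains(defect)  # defect_types is a list field
    return dataset.match(is_material & has_defect)


# Usage (requires FiftyOne and an already-loaded "kaputt" dataset):
#   import fiftyone as fo
#   view = spillage_on_hard_plastic(fo.load_dataset("kaputt"))
#   session = fo.launch_app(view)
```

Wrapping the expression in a small helper keeps the assumed label values in one place, so they are easy to correct once you have inspected the real field values.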

Embedding-based exploration. Raw browsing only gets you so far. FiftyOne Brain's compute_visualization takes CLIP embeddings and projects them down to two dimensions with UMAP, producing an interactive scatter plot where similar-looking samples cluster together.

FiftyOne Embeddings panel showing a UMAP plot of Kaputt samples colored by defect severity, beside a grid of shrink-wrapped retail items

Suddenly you can see the shape of the dataset — which defect types look alike, which materials form tight clusters, where the outliers live.

Similarity search. Found an interesting defective sample? FiftyOne's similarity index lets you instantly retrieve the N most visually similar samples across the whole dataset.

FiftyOne similarity search results for "books" returning 25 visually similar Kaputt samples, several tagged with defects

This is not just useful for exploration — it is the foundation of quality-control workflows, hard example mining, and dataset curation.
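A similarity query of this kind can be thought of as ranking stored embeddings by how close they are to a query embedding. The plain-Python sketch below illustrates that idea with cosine similarity. It is not FiftyOne's implementation, and the three-dimensional vectors and sample IDs are invented stand-ins for real CLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, k):
    """Return the IDs of the k samples most similar to `query_id`, best first."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vector), sample_id)
        for sample_id, vector in embeddings.items()
        if sample_id != query_id  # do not return the query itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sample_id for _, sample_id in scored[:k]]


# Invented toy embeddings; real CLIP vectors have hundreds of dimensions.
embeddings = {
    "book_torn_cover": [0.9, 0.1, 0.0],
    "book_dented_corner": [0.8, 0.2, 0.1],
    "bottle_leaking": [0.1, 0.9, 0.2],
    "box_crushed": [0.0, 0.2, 0.9],
}

neighbors = most_similar("book_torn_cover", embeddings, k=2)
```

With these toy vectors the other book comes back first, which is the behavior hard example mining relies on: start from one interesting defect and pull in its nearest neighbors for review.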

VLM integration. FiftyOne's plugin ecosystem brings modern vision-language models directly into your dataset workflow. Running FastVLM on the images adds natural-language captions to every sample, making it possible to search and filter the dataset by semantic content and to quickly spot what the model understands (and misunderstands) about each image.

FiftyOne sample with a FastVLM-generated caption describing the contents of a Kaputt cardboard box

Notebook and Tutorial Highlights: Exploring Kaputt in FiftyOne

This notebook is based on the official Voxel51 Kaputt tutorial, adapted to run end to end in a self-contained virtual environment. Here’s what the notebook does that the documented tutorial doesn’t:

Reads Parquet, not JSON — it parses the query-*.parquet index tables directly.
Both shapes available — it builds the flat kaputt dataset first (image per sample, crop/mask as fields, mask as a segmentation overlay), then a kaputt_grouped dataset with synchronized image/crop/mask slices, so you can use whichever fits the task.
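To picture the difference between the two shapes, here is a plain-Python sketch (not FiftyOne code) that turns flat records, with crop and mask stored as fields, into groups of named slices. The field names crop_path and mask_path and the file paths are hypothetical; the slice names image, crop, and mask follow the description above.

```python
def to_group(flat_record):
    """Split one flat record into named slices that share its metadata.

    Slices whose path is missing are skipped, since a sample may have
    fewer than three views.
    """
    paths = {
        "image": flat_record["filepath"],
        "crop": flat_record.get("crop_path"),
        "mask": flat_record.get("mask_path"),
    }
    shared = {
        key: value
        for key, value in flat_record.items()
        if key not in ("filepath", "crop_path", "mask_path")
    }
    return {
        name: {"filepath": path, **shared}
        for name, path in paths.items()
        if path is not None
    }


# Hypothetical flat records: one image per sample, crop and mask as fields.
flat = [
    {"filepath": "img/0001.jpg", "crop_path": "crop/0001.jpg",
     "mask_path": "mask/0001.png", "split": "train"},
    {"filepath": "img/0002.jpg", "crop_path": "crop/0002.jpg",
     "mask_path": None, "split": "test"},
]

grouped = [to_group(record) for record in flat]
```

The flat shape is convenient when you want one row per item with everything attached; the grouped shape is convenient when you want to step through image, crop, and mask side by side.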

Everything else from the original tutorial — compound/field indexes, CLIP embeddings with similarity search, UMAP visualization, and FastVLM captioning — is covered.

Next Steps

Everything you need to dive deeper:

The Dataset

kaputt-dataset.com — request access to the dataset (free for research use): https://www.kaputt-dataset.com/
Kaputt arXiv paper — the ICCV 2025 paper with full benchmark results and methodology: https://arxiv.org/abs/2510.05903

The FiftyOne + Kaputt Tutorial

FiftyOne Kaputt tutorial — the official Voxel51 tutorial this notebook is based on: https://docs.voxel51.com/tutorials/kaputt_dataset.html
Jupyter notebook — kaputt_explore.ipynb, the step-by-step notebook from this post

FiftyOne Resources

FiftyOne documentation — full API reference and user guide: https://docs.voxel51.com/
Grouped datasets guide — deep dive into multimodal grouped datasets: https://docs.voxel51.com/user_guide/groups.html
FiftyOne Brain — embeddings, similarity, and visualization: https://docs.voxel51.com/brain.html
FiftyOne Model Zoo — CLIP, FastVLM, and dozens of other models ready to use: https://docs.voxel51.com/model_zoo/index.html

Thank You to the Sponsors

None of this happens without the organizations willing to fund, organize, and champion a challenge this ambitious.

Amazon sponsored the $3,000 Best Performance prize and the $3,000 Best Zero-Shot/Off-the-shelf VLM prize, and provided the Kaputt dataset itself, a massive undertaking representing years of data collection and annotation work in real retail logistics environments.

Voxel51 sponsored the $750 Best Performance Runner-Up prize, and provided the FiftyOne tooling that makes exploring datasets like Kaputt tractable for researchers and practitioners alike.

Intel sponsored the $6,000 Best Efficiency prize — a recognition that a model that can't run at scale isn't a real solution — and contributed challenge organizers Samet Akcay and Ashwin Vaidya.

MVTec sponsored the $400 Best Paper jury prize and provided the parallel Industrial Track benchmark (MVTec AD 2), along with organizers Lars Heckler-Kram, Jan-Hendrik Neudeck, and Ulla Scheler.

A special thank you to the challenge organizers from Amazon — Sebastian Höfer, Dorian Henning, and Anton Milan — who designed the competition, wrote the technical guidelines, and shepherded hundreds of participants through a six-week sprint.
