Zero-Shot Data Reduction Techniques for Efficient Robotics & Visual AI Development
I’m amazed by the massive amount of data deployed vision systems can generate. Two robots I recently worked with, HSR and Digit, each have four or more cameras that can collect data at 15 Hz or faster. Using back-of-the-envelope calculations, you realize that a single robot (not even the fleet) can collect ImageNet-scale data in just a few hours and Florence-2-scale data in a few weeks (1.3M and 126M images, respectively). What’s more, all of that data is relevant to the environment and tasks that the robot is performing and can be used to improve its own perception capabilities. In an age of data-hungry visual AI, “data is the new oil,” but robot and AV systems are its renewable source.
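For anyone who wants to check that arithmetic, here is a minimal sketch. It assumes the lower bounds quoted above (four cameras at 15 Hz) and continuous recording with no dropped frames; the function name is just for illustration.

```python
# Back-of-the-envelope data rates for a single robot.
CAMERAS = 4                     # "four or more cameras" (lower bound)
FRAME_RATE_HZ = 15              # "15 Hz or faster" (lower bound)
IMAGENET_IMAGES = 1_300_000     # 1.3M images
FLORENCE2_IMAGES = 126_000_000  # 126M images


def seconds_to_collect(num_images, cameras=CAMERAS, rate_hz=FRAME_RATE_HZ):
    """Seconds of continuous recording needed to capture num_images frames."""
    return num_images / (cameras * rate_hz)


imagenet_hours = seconds_to_collect(IMAGENET_IMAGES) / 3600
florence2_weeks = seconds_to_collect(FLORENCE2_IMAGES) / (3600 * 24 * 7)

print(f"ImageNet scale:   {imagenet_hours:.1f} hours")    # 6.0 hours
print(f"Florence-2 scale: {florence2_weeks:.1f} weeks")   # 3.5 weeks
```

More cameras or a higher frame rate only shorten these times.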
“Data is the new oil” — Clive Robert Humby, Mathematician & Entrepreneur
However, refining and operationalizing all of this data is cost-prohibitive. The computational cost to train a single state-of-the-art deep learning model in various fields doubles every 3.4 months due to increasingly large models and datasets, requiring up to hundreds of thousands of GPU hours. Also, despite progress in weak- and self-supervision, many production models still train with full supervision on labeled data, which adds annotation costs. And we can’t possibly label ImageNet-scale data every few hours on a per-robot basis (although annotation service providers would love that contract).
Rest assured, we can “tame” this influx of data. Typically, domain experts determine which data are most valuable, but this takes effort away from other development (yet another cost). Alternatively, coreset selection algorithms find a representative subset of data for training a model at lower cost and with minimal impact on performance. However, previous state-of-the-art coreset methods can only select from data that are already labeled. Voxel51 recently announced an open source solution, Zero-Shot Coreset Selection (ZCore), which selects valuable data without any labels, data-specific training, or domain expertise. ZCore offers an efficient, automated way to sift through massive datasets and find hidden gems.
How ZCore Selection Works (Try it!)
State-of-the-art coreset methods use carefully designed criteria to quantify the importance of each data example using ground-truth labels and dataset-specific training (e.g., gradient dynamics from training on all of the data). Alternatively, ZCore uses existing foundation models to generate a zero-shot embedding space for unlabeled data and quantifies the relative importance of each example based on overall coverage and redundancy within the embedding distribution (see our paper for full details). On ImageNet, ZCore coreset-trained models are more accurate than previous label-based coresets at a 10% data rate, effectively removing annotation costs for 1.15 million images.
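To build some intuition for "coverage and redundancy" in an embedding space, here is a toy sketch. To be clear, this is not the ZCore algorithm (see the paper and repo for that). It swaps in a classic stand-in, greedy k-center (farthest-point) selection, and runs it on made-up 2D points instead of real foundation-model embeddings. The function name and the toy data are assumptions for illustration only.

```python
import numpy as np


def greedy_k_center(embeddings, k, seed=0):
    """Pick up to k indices so that every point lies near a selected point.

    Stand-in coverage heuristic (farthest-point sampling), NOT ZCore itself.
    """
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary starting point
    # Distance from every point to its nearest selected point so far.
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < min(k, n):
        nxt = int(np.argmax(min_dist))  # the least-covered point
        selected.append(nxt)
        dist = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return selected


# Toy "embeddings": three tight clusters of 50 near-duplicate points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
embeddings = np.vstack([c + 0.1 * rng.standard_normal((50, 2)) for c in centers])

picked = greedy_k_center(embeddings, k=3)
clusters = sorted(i // 50 for i in picked)  # cluster id of each pick
print(clusters)  # [0, 1, 2]
```

With a budget of three, the heuristic spends one pick per cluster and skips the 147 near-duplicates. That is the same flavor of behavior described above: cover the distinct regions of the embedding space and avoid redundant samples.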
Our paper provides many results on image-based datasets but does not explore datasets from robots, so let’s do that here. We provide a ZCore demo below using HSR data from the lab combined with ODMD, ODMS, and TFOD Benchmark examples. To follow along on your own data, install FiftyOne and then run the visualization script from our repo:
pip install fiftyone
python visualize_image_folder.py --image_dir <path to your image folder>
This customizable script handles the entire embedding generation, ZCore selection, and the FiftyOne visualization process we use here.
In the demo above, image data varies across multiple robot cameras, environments, and tasks. In the embeddings view, images within a cluster are similar or redundant data samples. At the same time, each cluster represents a unique concept or setting in the broader dataset (i.e., objects, environments, and/or viewpoints).
By running ZCore on this data, the individual value of each image within the context of the entire dataset is determined automatically. For example, the HSR grasp camera sequence approaching the pan is unique relative to the rest of the dataset, but only two images are needed to represent such a simple concept. Thus, the best image in that cluster is given a high ZCore score, the second best image is found at the opposite end of the sequence, and the remaining, redundant data have a low score. Across the entire robot dataset, ZCore automatically makes similar determinations without any labels, training, or domain expertise. Notably, in experiments using several public datasets, we find that ZCore selections lead to reliable model performance across a variety of dataset scales and applications (e.g., satellite image classification).
We threshold the ZCore score across all data to determine which images are included in our core set of data – either for model training, fine-tuning, or evaluation. In the demo above, we show threshold results for the top 1, 2, 4, 20, and 85 images. Notably, the top 85 ZCore images cover the dataset’s unique concepts without redundancy. Quickly optimizing the data volume means less manual review, lower annotation effort, less compute for model training or evaluation, and huge cost savings as a result.
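The thresholding step itself is simple once every image has a score. Here is a minimal sketch, assuming the scores are available as a plain array; the scores below are made up, and `top_k_indices` is a hypothetical helper rather than part of the ZCore repo.

```python
import numpy as np


def top_k_indices(scores, k):
    """Indices of the k highest-scoring images, best first."""
    order = np.argsort(scores)[::-1]  # sort descending by score
    return order[:k].tolist()


# Hypothetical per-image scores for a six-image dataset.
scores = np.array([0.91, 0.05, 0.40, 0.88, 0.02, 0.10])

for k in (1, 2, 4):
    keep = top_k_indices(scores, k)
    reduction = 1 - k / len(scores)
    print(f"top {k}: keep images {keep} ({reduction:.0%} data reduction)")
# top 1: keep images [0] (83% data reduction)
# top 2: keep images [0, 3] (67% data reduction)
# top 4: keep images [0, 3, 2, 5] (33% data reduction)
```

Choosing k (or an equivalent score cutoff) is the only knob here, so it is easy to sweep a few values and see how much of the dataset each budget covers.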
Summary
Robotics, AVs, and other deployment platforms provide a renewable data supply for visual AI development. However, the sheer volume of data produced by these systems makes it cost-prohibitive to use all of it. Our research team at Voxel51 developed a tool to automatically select valuable subsets of data to efficiently manage scale without any data labels, training, or domain expertise. Using the ZCore technique, we quickly reduced our robot data by 95% while still covering all the settings of the initial, full dataset.
This tool is open source, available on GitHub, and will be added to the Enterprise version of FiftyOne soon. Give it a try and let us know what you think: https://github.com/voxel51/zcore.
Acknowledgments
Thank you to Jason Corso, Kirti Joshi, Manushree Gangwar, and Michelle Brinich for their feedback and suggestions on this blog!