Visualizing Model Certainty in the Unknown

April 1, 2025 – Written by Brent Griffin

ML@Voxel51

An Efficient Technique to Quickly Understand Visual AI Model Performance on Unlabeled Data at Scale

Model Certainty Visualization on Image Data using ZCore & FiftyOne

Predicting how a deployed model will perform on new, unknown data can be a harrowing experience. Even when taking precautions like developing a comprehensive training set, the world changes fast and so does its data. What’s more, most of that expanding data is unlabeled, so AI developers either need to manually review model performance on new data or pay the cost of labeling it for a quantitative evaluation.

Clever AI developers use embeddings-based workflows to manage data at scale. Data embeddings are generated from the learned features at the penultimate layer of foundation or application-specific models. Embeddings simplify input data’s representation so we can quickly visualize datasets, select valuable and efficient core sets, and discover out-of-distribution examples. While embeddings-based workflows are incredibly useful, embeddings do not necessarily reveal how a model will perform on new data. Deployed models can still fail within the training-set embedding distribution and, on the other hand, are often accurate on out-of-distribution examples.
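For illustration, here is a minimal sketch of how penultimate-layer embeddings can be extracted, assuming a stock torchvision ResNet18 and a random stand-in batch rather than the specific models or data used in this post:

import torch
import torch.nn as nn
import torchvision.models as models

# Stand-in backbone; drop the final classification layer to expose penultimate features
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
embedder = nn.Sequential(*list(model.children())[:-1])
embedder.eval()

images = torch.randn(8, 3, 224, 224)  # placeholder batch of images
with torch.no_grad():
    embeddings = embedder(images).flatten(1)  # shape (8, 512): one embedding per image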

At the end of the day, the question we really care about is:

Will our deployed model fail or not?

And, if so, on which data?

To answer that question, we’ve developed a new open source technique to visualize model certainty on unlabeled data. Similar to embeddings, we use models to generate a lower-dimensional data representation that is efficient to operate on and easily visualizable. However, instead of using feature-based embeddings that are only meaningful to the model itself, we represent data using the actual full model outputs. In this way, we go beyond typical embeddings to quickly understand the actual model output reliability, where the rubber meets the road for deployed models and subsequent downstream tasks. As we will show, our model certainty-based approach enables us to understand how models will perform on new, unlabeled data.

Model Certainty Visualization across 190 Models with Increasing Train Images.

Model Certainty Visualization

To visualize model certainty on unlabeled data, we first process all new data through our deployment model. Here, we demonstrate our approach using image classification, but this technique is also applicable to other Visual AI problems like object detection and segmentation. For our “deployment model,” we use 190 ResNet18 models, each trained on a unique variation of the CIFAR100 training set. To determine the training sets, we use ZCore-selected core sets ranging from 50 to 50,000 training examples, which enables us to visualize and understand the evolution of model certainty with increasing training data and performance. For our “new data,” we use the CIFAR100 test set, which enables us to quantify the actual model accuracy at the end of our experiment.

After processing data through our model, we visualize all of the model’s output logits. Our model generates 100 logits per image (one for each class), resulting in a 10,000 × 100 output space (10,000 test images, each with 100 logits). To visualize this space, we use UMAP to generate a lower-dimensional but locally-connected manifold surface that is fully observable in 2D. In this way, we directly see how all new data is structured in terms of our deployment model’s outputs (see video above).
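As a rough sketch of this step, using a stock torchvision ResNet18 and random images as stand-ins for the actual experiment models and data, plus the umap-learn package:

import numpy as np
import torch
import torchvision.models as models
import umap  # pip install umap-learn

# Stand-in classifier with 100 outputs; the post uses ResNet18s trained on CIFAR100
model = models.resnet18(num_classes=100)
model.eval()

images = torch.randn(256, 3, 32, 32)  # placeholder for unlabeled test images
with torch.no_grad():
    logits = model(images).numpy()  # shape (256, 100): one logit per class per image

points_2d = umap.UMAP(n_components=2).fit_transform(logits)  # 2D points for plotting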

We use model certainty to add “color” to our model output-based visualization. To do this, we first take the ratio of the highest and second-highest logit values for each image. We call this the first-to-second ratio (f2s). If f2s is high, the model is certain of a single output classification. If f2s is low, the model is uncertain between multiple possible output classifications. Finally, f2s values vary widely across different models and data, so we use its inverse, 1/f2s ∈ [0, 1], to simplify visualization and analysis.
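A minimal sketch of this score, assuming positive logits so that 1/f2s lands in [0, 1] (the released tool may handle edge cases differently):

import numpy as np

def inverse_f2s(logits):
    # First-to-second ratio (f2s): highest logit divided by second-highest logit per image.
    # We return its inverse, 1/f2s, so low values indicate high certainty.
    top2 = np.sort(logits, axis=1)[:, -2:]  # two largest logits per row (ascending)
    f2s = top2[:, 1] / top2[:, 0]           # highest / second-highest
    return 1.0 / f2s

logits = np.array([[9.0, 1.0, 0.5, 0.2],   # confident: 1/f2s ≈ 0.11
                   [3.0, 2.9, 0.1, 0.1]])  # uncertain: 1/f2s ≈ 0.97
print(inverse_f2s(logits))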

Notably, our model certainty visualization process does not require labels for data.

Try It!

We provide detailed experiment results in the next section. If you would like to replicate our results on your own data or try our demo, simply download our demonstration model and run the following commands from our repo:

pip install fiftyone
python visualize_model_certainty.py --model_weights <path to model checkpoint>

Experiment Results

We compare model certainty, number of training examples used, and test accuracy for all 190 experiment models in the figure below. We calculate model certainty on a per-model basis as the mean 1/f2s score across all test images and calculate accuracy as the percent of test images that are classified correctly by the highest logit of each model’s output. Overall, model certainty and test accuracy have a -0.979 Pearson correlation coefficient, validating that model certainty can predict model accuracy.

Model Certainty vs. Accuracy (left), Training Examples vs. Accuracy (middle) & Model Certainty (right).
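As an illustrative sketch of this comparison (reusing the inverse_f2s helper from the earlier sketch; the summarize_models name and its arguments are hypothetical, not from the released tool):

import numpy as np
from scipy.stats import pearsonr

def summarize_models(model_logits, labels):
    # model_logits: list of (num_test_images, num_classes) logit arrays, one per trained model
    # labels: ground-truth test labels, needed only for accuracy, not for certainty
    certainties = [inverse_f2s(logits).mean() for logits in model_logits]
    accuracies = [(logits.argmax(axis=1) == labels).mean() for logits in model_logits]
    r, _ = pearsonr(certainties, accuracies)  # reported as -0.979 across the 190 models
    return np.array(certainties), np.array(accuracies), r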

Both model accuracy and certainty increase with training data. ZCore-selected training sets achieve 60% accuracy with 10,000 examples and nearly full-training performance with about half the data. Model certainty convergence is also visible in our earlier video from about 20,000 training examples onward.

We can also compare model certainty across individual test images for a single model. For this analysis, we use the model trained on 20,000 images (20K-Model), which is partially trained but has room for improvement. First, we provide the “big picture” model certainty visualization across all test images in the figure below. Notably, high-certainty outputs (left, purple) fan out into increasingly isolated regions that correspond to individual classes (right). On the other hand, low-certainty outputs (left, yellow) collect in large clusters with several classes mixed together (right).

20K-Model Test Certainty Visualization (left) & with Class Labels Individually Colored (right).

Digging deeper, we use FiftyOne to understand what these model certainty clusters mean in terms of actual test data. Let’s start by comparing apples to oranges:

Apples & Oranges Cluster (right) with Corresponding Images and 1/f2s Scores (left).

In this cluster, we find a mix of fruit images along with a few outlier images belonging to other classes (e.g., “baby” in row 1). On the other hand, focusing on individual spikes where our model has high certainty, we find that all images belong to the same class:

This insight enables us to quickly separate high-certainty examples from the mixed-certainty, mixed-class examples that our model will likely benefit from most in training.

Let’s look at another cluster of primarily bicycles, lawn mowers, and motorcycles:

And now focus on the interior, low-certainty area of this cluster:

In this interior space, we find a higher concentration of 1) difficult images and 2) classes that clearly do not belong in this region of the model output space, including a tractor, clock, butterfly (row 1), and lobster (row 4). As AI developers, we can quickly identify these high-value examples and update our training set to improve model performance.
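As a sketch of that workflow in FiftyOne (the image directory, dataset name, field name, and 0.5 threshold are illustrative assumptions, and inv_f2s_scores is assumed to be a per-sample array of 1/f2s values in dataset order):

import fiftyone as fo
from fiftyone import ViewField as F

# Build a dataset from the test images and attach the per-sample 1/f2s scores
dataset = fo.Dataset.from_images_dir("/path/to/test/images", name="certainty-demo")
dataset.set_values("inv_f2s", list(inv_f2s_scores))

# Surface low-certainty samples (high 1/f2s) as candidates to review and add to training
uncertain_view = dataset.match(F("inv_f2s") > 0.5).sort_by("inv_f2s", reverse=True)
session = fo.launch_app(uncertain_view)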

Summary

Predicting how a deployed model will perform on new, unlabeled data is a difficult task, often requiring AI developers to manually review model performance or label new data for quantitative evaluation. As an extension of ZCore, our research team at Voxel51 developed a model certainty visualization tool that operates across entire datasets without labels and enables AI developers to quickly understand model performance. In experiments spanning 190 uniquely trained models, we find that our quantitative measure for model certainty tracks with ground-truth accuracy, validating that our model certainty tool can predict model performance. Finally, we show how our tool can quickly identify particularly difficult, high-value data that can be added to training to improve model performance.

This tool is open source and available on GitHub. Give it a try and let us know what you think: https://github.com/voxel51/zcore.