
On Leaky Datasets and a Clever Horse

There was once a horse that could do arithmetic. Well, at least, it seemed like the horse could do arithmetic. The story of Clever Hans is not only an interesting tale and an important experiment in animal cognition, but also… a cautionary tale for machine learning practitioners!

If you are not familiar with the story, here is a recap. Wilhelm von Osten was a German mathematics teacher and hobbyist horse trainer. He decided to combine his passions by attempting to teach arithmetic to a horse named Hans. To Wilhelm’s delight, Hans was able to learn. Wilhelm would ask Hans a question with a numerical answer. For example, “If the eighth day of the month comes on a Tuesday, what is the date of the following Friday?” Then Hans would tap his hoof on the floor eleven times. This, of course, shocked the world, and a panel of experts was assembled to study the horse.

Unfortunately for Hans and Wilhelm, it was concluded that Hans was not, in fact, a mathematical prodigy but was clever in an unexpected way. The biologist and psychologist Oskar Pfungst conducted an array of experiments to test Hans’s intellect. The results showed that Hans was able to correctly answer questions only when two conditions were met: the questioner knew the answer, and Hans could see the questioner. This led to the conclusion that Hans had not learned mathematics but, rather, the involuntary body-language cues the questioner gave off once Hans had tapped the correct number of times.

What does this have to do with machine learning? Hans, a clever beast (perhaps even educated in the principles of Occam’s razor and Minimum Description Length), learned the simplest explanation for the data he was given: the high correlation between the questioner’s body language and the correct response to the query. Like Hans, machine learning algorithms find simple ways to describe the data they are presented with. This is typically a good thing and is what allows for generalization. However, sometimes the simple answer, like in Hans’s case, isn’t what we are interested in.

To verify our models’ generalization, we test on new queries not yet seen during training. However, what if, like in the case of Hans and Wilhelm, we are providing information during testing that gives away the answer due to some unforeseen connection with the train data? Then, we come to incorrect conclusions regarding the validity of our model. It was only when Oskar entirely removed all the superfluous information in the questions asked to test Hans (by hiding the questioner and ensuring they didn’t know the answer) that a valid test was conducted. Like Oskar, we must make sure that such superfluous information in our train set does not appear in our test set. We refer to such cases as leaks in the dataset.

That’s it, strange analogy over. Let’s talk about data leakage!

Leaks in Dataset Splits

The problem of data leakage is, at its core, simple: a machine learning algorithm is validated on data that is very similar to, or even an exact duplicate of, samples seen during training. A more general and formal way of expressing this is that the samples in the different splits are not IID. This distinction matters because it reframes leakage from a question of exact duplicates into a more nuanced issue (and allows for a convoluted analogy with a horse). For example, it is common practice in the tracking literature to hold out entire videos for the test set, rather than a segment of each video. Even though different frames of the same video are not exact copies (and thus don’t fit neatly into the first definition), a model may learn features that trivially decide the answer for a given video yet fail to generalize to other videos where those features are absent.
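
As a concrete illustration of this kind of group-aware splitting, here is a minimal sketch using scikit-learn’s GroupShuffleSplit. The frame_paths and video_ids lists are hypothetical placeholders for your own data; the point is simply that every frame of a given video lands on the same side of the split.

from sklearn.model_selection import GroupShuffleSplit

def split_by_video(frame_paths, video_ids, test_size=0.2, seed=51):
    """Assign entire videos to train or test so that no video straddles the split."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(frame_paths, groups=video_ids))
    train_frames = [frame_paths[i] for i in train_idx]
    test_frames = [frame_paths[i] for i in test_idx]
    return train_frames, test_frames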

With this notion in mind, let’s look at two cases where data leakage can give a false sense of confidence in a model. We will visualize and analyze these datasets with the open-source library FiftyOne, using the leaky-splits module to easily find and address leaks in the dataset. If you’re not yet familiar, hundreds of thousands of AI builders use FiftyOne to build visual AI applications by refining data and models in one place.

A Case of Leaks in the Wild

The first case of leaky dataset splits that we will explore is ImageNet, specifically a subset of ImageNet called the Stanford Dogs dataset. This dataset contains 20,580 images of dogs from 120 different breeds. Labels and splits are provided in the original dataset. As a starting point, a model is trained on the dataset with the given splits. Its performance on the test set provided by the authors of the dataset is a modest top-1 accuracy of 76.8%; nothing to write home about, but certainly far better than the measly 0.83% accuracy of random guessing (1 in 120). This means that the model is reasonably well trained.

But how well trained is it really? Checking for leaks in the dataset with FiftyOne’s new leaky-splits module takes just a few lines of code. Before running it, make sure to pip install --upgrade fiftyone so that you have access to the module.

import fiftyone as fo
from fiftyone.brain import compute_leaky_splits

# Scan for near-duplicate samples across the splits stored in the "split"
# field, using the specified model's embeddings to measure similarity
leaks_index = compute_leaky_splits(
    dogs_dataset, splits="split", model="clip-vit-base32-torch"
)

# View containing only the samples flagged as leaks
leaks = leaks_index.leaks_view()

fo.launch_app(leaks)

If the code doesn’t give the intended results, try adjusting the threshold parameter of the function, which controls the sensitivity of the algorithm. I’ve found that values in the range of 0.1 – 0.25 tend to give good results.
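
For example, a rerun with an explicit threshold (the value here is just a starting point; tune it for your own data) would look something like this:

leaks_index = compute_leaky_splits(
    dogs_dataset,
    splits="split",
    model="clip-vit-base32-torch",
    threshold=0.2,  # detection sensitivity; 0.1 - 0.25 worked well in these experiments
)
leaks = leaks_index.leaks_view()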

Launching the app immediately shows us a list of proposed leaks in the dataset:

That’s a lot of leaks. In this run, 975 leaks were detected in the test set. That is, 975 samples out of 8,580 (over 11%) were very similar to a sample from the train set. A look at these cases shows that these leaks are not simple cases of carelessly sending exact duplicates of an image to both the train and test splits, but rather a more difficult and insidious form of leakage, where multiple images of the same subject were taken in quick succession. Here are some examples:

Fig 1. Dataset leaks detected by FiftyOne. Labels on images correspond to the split the image belonged to in the original ImageNet splits.

This means that around 10% of the test set gives us a false, overly optimistic impression of model performance! This claim can be corroborated by checking the performance of the model on the test set once these leaks are removed. Rerunning the algorithm at a few thresholds (allowing for more or less sensitivity to leaks) shows a progressive decline in the model’s performance as data leaks are removed.
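
As a rough sketch of that check (not the exact evaluation script behind the numbers below), assuming the model’s predictions are stored in a predictions field and the labels in a ground_truth field, the cleaned test set can be built and re-evaluated like this:

from fiftyone import ViewField as F

# IDs of the samples flagged as leaks
leak_ids = leaks_index.leaks_view().values("id")

# The test split with all flagged leaks removed
clean_test = dogs_dataset.match(F("split") == "test").exclude(leak_ids)

# Re-evaluate the stored predictions on the cleaned test set only
results = clean_test.evaluate_classifications(
    "predictions", gt_field="ground_truth", eval_key="eval_clean"
)
results.print_report()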


Fig 2. The plot shows the number of leaks detected with respect to the dataset’s size.

Fig 3. Performance of the model as more leaks are removed. Performance on the clean test set is 1.2% lower than on the full test set (threshold 0.25).

It is important to emphasize that this dataset has not been altered in any way. This is a subset of ImageNet with the official splits. Even well-established, thoroughly used datasets can have errors. With competition on leaderboards over which model is 0.1% more accurate, it is important to pause and ask if our testing methods are sound.

How Bad Can It Get?

After examining a real dataset, let’s look at how damaging data leakage can be in an extreme case. For this section, we will modify the test set of the Food-101 dataset to contain many duplicates from the training set. Specifically, Food-101 has 25,250 images in its test set. This set was progressively polluted with up to 20,000 randomly selected images from the train set. Then, a model was trained on the original dataset, and its performance was measured on increasingly leaky test sets. The results can be seen below:

Fig 4. Performance of the model on the Food-101 dataset with progressively more leaks. The y-axis shows accuracy, while the x-axis shows the percentage of the test set made up of added duplicates from the train set. Train refers to performance on the train set. Full group is the original test set plus the added leaks. Detected leaks is the subset of the test set flagged as leaks by the FiftyOne algorithm. Cleaned group is the test set after the detected leaks are removed. Finally, original test is the original test set from Food-101.

The graph clearly shows performance increasing as duplicates are added to the test set. When ~44% of the test set consists of leaks, there is an almost 7-point difference in performance between the original and the leaky test set. When these leaks are removed using the method in the FiftyOne library, performance drops by over 10 points, which indicates that the original test set already contained leaks. Further evidence can be seen at the 0% point, where the test set is the original, unmodified one: the detected-leaks performance is almost identical to the train set performance, and the cleaned-group performance is the same as it is for higher percentages of leaks.

While this example is contrived and exaggerated to prove the point, it can still offer valuable insight. Leaks can give a very optimistic view of performance. Even when the fake duplicates are not considered, there is about a 3-point difference between the measured performance and the “actual” performance of the model.
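
For readers who want to run a similar experiment, here is a rough sketch of the pollution step (not the exact script used above), assuming a FiftyOne dataset whose split membership is stored in a split field:

from fiftyone import ViewField as F

def pollute_test_split(dataset, num_leaks, seed=51):
    """Copy randomly chosen train samples into the test split as artificial leaks."""
    train_view = dataset.match(F("split") == "train")
    duplicates = []
    for sample in train_view.take(num_leaks, seed=seed):
        dup = sample.copy()    # same image filepath and label as the train sample
        dup["split"] = "test"  # the copy now "leaks" into the test split
        duplicates.append(dup)
    dataset.add_samples(duplicates)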

Closing Remarks

In this blog post, two datasets were analyzed for leaks using FiftyOne. One is a subset of ImageNet, one of the most widely used benchmarks for image classification in the last decade, and the other is Food-101, a noisy and difficult dataset, which was further modified by adding duplicates from the train set to the test set. For both of these datasets, accuracy on the leaks was significantly higher than average test set performance. In both of the datasets, removing the leaks gave a more realistic (and lower) measure of model performance.

The important takeaway I want to emphasize is that while we all push to make our models better through newer architectures, bigger datasets, better priors for our losses, and highly optimized training regimes, we must remember that our insights are only as good as our measuring tools. A lack of confidence in our evaluations makes any decision based on them a house built on shaky foundations. Wilhelm, the horse trainer from the story, really did believe that Hans had learned math; he just didn’t realize he was using a faulty evaluation method. Remaining vigilant to these issues can help us draw sound conclusions from our data. I’d like to conclude by reminding the reader of one of the first things everyone learns in their first lab: every measurement has error, and error lowers confidence in results.

That is to say, you may have leaks in your own datasets. Fortunately, we have a solution. Check out the new compute_leaky_splits() method in FiftyOne to uncover and correct leaky splits in your data. To get started, try installing FiftyOne and playing around with it yourself. To stay up-to-date with the latest FiftyOne releases and news, come join our Discord community!