Find Exact and Approximate Duplicate Images with This Plugin
Welcome to week four of Ten Weeks of Plugins. During these ten weeks, we will be building a FiftyOne Plugin (or multiple!) each week and sharing the lessons learned!
If you’re new to them, FiftyOne Plugins provide a flexible mechanism for anyone to extend the functionality of their FiftyOne App. You may find the following resources helpful:
What we’ve built so far:
- Week 0: Image Quality Issues & Concept Interpolation
- Week 1: AI Art Gallery & Twilio Automation
- Week 2: Visual Question Answering
- Week 3: YouTube Player Panel
Ok, let’s dive into this week’s FiftyOne Plugin — Image Deduplication!
Image Deduplication 🖼️🪞🧹
The biggest challenge in training machine learning models is curating a high quality dataset. Duplicate (or very similar) data is a major roadblock to building such a dataset. Multiple copies of the same (or approximately the same) samples can lead to longer training times, higher training costs, and lower overall performance. On the flip side, you likely want a diverse dataset with good coverage over the data domain.
Duplicates come in two flavors:
- Exact duplicates: pixel-perfect matches, where one image is literally a down-to-the-bit copy of another
- Approximate duplicates: images (or other data) that are highly similar. Similarity is typically measured by computing the distance between samples under some similarity metric, and any pair closer than a chosen threshold is treated as duplicates.
Deduplication is the task of removing these exact and approximate duplicates from a dataset.
Typically, deduplication involves writing a lot of code to find, visualize, and remove all of the duplicates in your dataset. With this FiftyOne plugin, that all changes. Now you can deduplicate your entire dataset from within the FiftyOne App, without writing a single line of code!
Plugin Overview & Functionality
For the fourth week of 10 Weeks of Plugins, I built an Image Deduplication Plugin. This plugin allows you to:
- Find both exact and approximate duplicate images in your dataset
- Visualize these groups of duplicates
- Delete all duplicates OR Keep a representative from each set of duplicates
The plugin has eight (!) operators. Operators are a powerful feature in FiftyOne that allows plugin developers to define custom operations that users of the FiftyOne App can execute. Don’t get overwhelmed, though: really it’s just two sets of analogous operators, for exact and approximate deduplication workflows.
After you install the plugin, when you open the operators list (by pressing “`” in the FiftyOne App) you should see these operators. Search for “dedup” to narrow down the list!
🔍 Finding Duplicates
The first pair of operators helps you to find duplicate images in your dataset.
find_approximate_duplicate_images: uses a similarity index to find approximate duplicates.
You can specify either a distance threshold (how close the images need to be according to the similarity metric to be considered near duplicates) or a fraction of the dataset to mark as near duplicates.
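To make the two options concrete, here is a minimal pure-Python sketch of how a distance threshold or a dataset fraction could each be used to flag near duplicates from embedding vectors. This is not the plugin’s or FiftyOne Brain’s implementation: the function names (`cosine_distance`, `nearest_neighbor_distances`, `flag_by_threshold`, `flag_by_fraction`) and the toy embeddings are all hypothetical, chosen only to illustrate the idea.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbor_distances(embeddings):
    """Map each sample id to the distance to its closest other sample."""
    nearest = {}
    for sid, vec in embeddings.items():
        nearest[sid] = min(
            cosine_distance(vec, other)
            for oid, other in embeddings.items()
            if oid != sid
        )
    return nearest


def flag_by_threshold(embeddings, threshold):
    """Flag samples whose nearest neighbor is within `threshold`."""
    nearest = nearest_neighbor_distances(embeddings)
    return {sid for sid, dist in nearest.items() if dist <= threshold}


def flag_by_fraction(embeddings, fraction):
    """Flag the `fraction` of samples with the smallest neighbor distances."""
    nearest = nearest_neighbor_distances(embeddings)
    count = int(len(nearest) * fraction)
    ranked = sorted(nearest, key=nearest.get)
    return set(ranked[:count])


# Toy 2D embeddings (hypothetical): "a" and "b" point in nearly the same direction
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.99, 0.05],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.2],
}

by_threshold = flag_by_threshold(embeddings, threshold=0.01)
by_fraction = flag_by_fraction(embeddings, fraction=0.5)
```

With these toy vectors, both approaches flag "a" and "b". The threshold is the more direct control when you know what distance counts as "too similar", while the fraction is handy when you only know roughly how much of the dataset you are willing to mark.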
If you haven’t computed a similarity index on your dataset, you can do so by running:
import fiftyone.brain as fob

fob.compute_similarity(dataset, brain_key="sim", metric="cosine")
When the operation finishes, it will have created two saved views: approx_dup_view and approx_dup_groups_view. You can access these by clicking on the saved views selector in the FiftyOne App, or programmatically via Python:
approx_dup_view = dataset.load_saved_view("approx_dup_view")
approx_dup_groups_view = dataset.load_saved_view("approx_dup_groups_view")
find_exact_duplicate_images: uses file hashes to find exact duplicates
Essentially, a hash function computes a short signature for each sample based on the binary data stored in the image file. The operator then checks for repeated signatures and marks the corresponding samples as duplicates.
The operator adds a filehash field to each sample, and creates a saved view exact_dup_view, which contains just the images with duplicate filehashes.
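If you are curious what hash-based duplicate detection looks like outside of FiftyOne, here is a small standalone sketch using only the Python standard library. The choice of SHA-256 and the helper names (`compute_filehash`, `group_exact_duplicates`) are my assumptions for illustration; the plugin may use a different hash function and a different code path.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def compute_filehash(path, chunk_size=1 << 20):
    """Hash a file's raw bytes, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_exact_duplicates(paths):
    """Return {filehash: [paths]} for hashes shared by two or more files."""
    groups = defaultdict(list)
    for path in paths:
        groups[compute_filehash(path)].append(path)
    return {h: ps for h, ps in groups.items() if len(ps) > 1}


# Tiny demo with stand-in "image" files written to a temp directory
with tempfile.TemporaryDirectory() as tmp:
    contents = {"cat.jpg": b"cat", "cat_copy.jpg": b"cat", "dog.jpg": b"dog"}
    paths = []
    for name, data in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)

    duplicate_groups = group_exact_duplicates(paths)
    duplicate_names = sorted(
        os.path.basename(p) for ps in duplicate_groups.values() for p in ps
    )
```

Because the hash depends only on the bytes in the file, two files with different names but identical contents land in the same group, while a re-encoded or resized copy would not. That second case is what the approximate workflow is for.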
Once you have found exact and/or approximate duplicates in your dataset, you may want to view these duplicates. For approximate duplicates, for instance, you may want to verify that the distance threshold you set was rigorous enough.
The Image Deduplication plugin makes it easy to do this with a pair of display operators, one for approximate duplicates and one for exact duplicates (display_exact_duplicate_groups). The names are pretty self-explanatory: the former loads the approx_dup_groups_view view we saved earlier, and the latter displays the samples in exact_dup_view, arranged into groups of duplicates.
Once you have viewed your identified duplicates, it is time to clean your dataset. At this point, you have two options:
- Remove ALL duplicates: delete all samples marked as an exact or approximate duplicate
- Keep a representative: remove all but one duplicate from each set of exact or approximate duplicates
As always, there are sister operators for working with approximate and exact duplicates:
remove_all_approximate_duplicates: removes all near-duplicate images from a dataset
remove_all_exact_duplicates: removes all exact duplicate images from a dataset
deduplicate_approximate_duplicates: removes near-duplicate images from a dataset, keeping a representative image from each duplicate set
deduplicate_exact_duplicates: removes exact duplicate images from a dataset, keeping a representative image from each duplicate set
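The difference between "remove all" and "keep a representative" comes down to which sample IDs you select for deletion. Here is a rough sketch of that selection logic, assuming the duplicates have already been arranged into groups. The helper names and sample IDs are hypothetical, and keeping the first sample in each group is just one simple way to pick a representative, not necessarily the plugin’s.

```python
def ids_to_remove_all(duplicate_groups):
    """Every sample that belongs to any duplicate group."""
    return [sid for group in duplicate_groups for sid in group]


def ids_to_deduplicate(duplicate_groups):
    """All but the first sample in each group; the first is kept."""
    return [sid for group in duplicate_groups for sid in group[1:]]


# Hypothetical sample IDs; each inner list is one set of duplicates
duplicate_groups = [["img_01", "img_07", "img_12"], ["img_03", "img_09"]]

remove_all = ids_to_remove_all(duplicate_groups)
deduplicate = ids_to_deduplicate(duplicate_groups)
```

In a FiftyOne workflow, a list of IDs like either of these could then be passed to `dataset.delete_samples()`.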
Installing the Plugin
If you haven’t already done so, install FiftyOne:
pip install fiftyone
Then you can download this plugin from the command line with:
fiftyone plugins download https://github.com/jacobmarks/image-dedup-plugin
Refresh the FiftyOne App, and you should see the eight operators in your operators list when you press the “`” key.
The Image Deduplication plugin is a Python Plugin with the usual structure (the standard plugin files, including README.md). Additionally, it has the following:
- An assets folder for storing icons
- A Python file exact_dups.py for handling the logic and computations involved for exact duplicates
- A Python file approx_dups.py for handling the logic and computations involved for approximate duplicates
Splitting Code into Submodules
It’s typically good practice in software development to make code modular, splitting self-contained pieces of logic into separate functions or files. This is known as separation of concerns.
The Image Deduplication plugin was a good exercise in applying this principle to FiftyOne’s plugin system. To utilize functions or variables you define in another file in the FiftyOne plugin’s directory, you need to add the path to that file to your system path.
Here’s an example where we import find_exact_duplicates from the exact_dups.py file:
import os

from fiftyone.core.utils import add_sys_path

with add_sys_path(os.path.dirname(os.path.abspath(__file__))):
    # pylint: disable=no-name-in-module,import-error
    from exact_dups import find_exact_duplicates
Starting from the innermost part of this expression:
__file__ is a variable containing the path to the current module, in this case the plugin file that performs the import
os.path.abspath gets the absolute path for this file
os.path.dirname extracts the directory name of this absolute path
add_sys_path is a FiftyOne utility function that adds this directory to our system path; used in a with statement, it does so only for the duration of the block
The second to last line, # pylint: disable=no-name-in-module,import-error, tells our linter not to throw an error when linting the file.
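To see the mechanics in isolation, here is a small standalone sketch of the same pattern built only from the standard library. The `temporary_sys_path` helper below is my own stand-in written for illustration (it is not FiftyOne’s code), and `demo_helpers.py` is a hypothetical sibling module created on the fly so the example can run anywhere.

```python
import contextlib
import importlib
import os
import sys
import tempfile


@contextlib.contextmanager
def temporary_sys_path(path):
    """Put `path` at the front of sys.path, then remove it on exit."""
    sys.path.insert(0, path)
    try:
        yield
    finally:
        try:
            sys.path.remove(path)
        except ValueError:
            pass  # something else already removed it


with tempfile.TemporaryDirectory() as plugin_dir:
    # Stand-in for a helper file sitting next to the plugin's main file
    with open(os.path.join(plugin_dir, "demo_helpers.py"), "w") as f:
        f.write("def greet():\n    return 'hello from a sibling module'\n")

    importlib.invalidate_caches()  # make sure the new file is discoverable
    with temporary_sys_path(plugin_dir):
        from demo_helpers import greet

    path_restored = plugin_dir not in sys.path
    message = greet()
```

The imported function keeps working after the block exits, even though the directory is no longer on the system path. That is the appeal of the pattern: the import succeeds without permanently changing where Python looks for modules.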
Loading a View
When executed, the display operators (such as display_exact_duplicate_groups) each trigger the loading of specific views. Doing this is pretty straightforward, but it is worth noting that the data passed into params in the ctx.trigger() call needs to be serialized. In fact, all data passed into parameter dictionaries for FiftyOne operators needs to be serialized.
DatasetView objects are easy to serialize!
import json
from bson import json_util

def serialize_view(view):
    return json.loads(json_util.dumps(view._serialize()))
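Why does this matter? A params dictionary holding a live Python object cannot be encoded as JSON, whereas the same information expressed as plain lists and dicts can. The sketch below illustrates that with the standard library only. `FakeView` and its stage contents are hypothetical stand-ins that I made up for the demo, not FiftyOne classes or real view stages.

```python
import json


class FakeView:
    """Hypothetical stand-in for an object that can serialize itself."""

    def __init__(self, stages):
        self._stages = stages

    def _serialize(self):
        # Plain lists, dicts, and strings only, so JSON can encode it
        return [dict(stage) for stage in self._stages]


def is_json_serializable(value):
    """True if `value` can be encoded by the standard json module."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


view = FakeView([{"name": "match", "field": "filehash"}])

raw_params = {"view": view}  # holds the object itself
serialized_params = {"view": view._serialize()}  # holds plain data

raw_ok = is_json_serializable(raw_params)
serialized_ok = is_json_serializable(serialized_params)
```

Here `raw_ok` is False and `serialized_ok` is True, which is the whole reason for converting the view before handing it to a parameter dictionary.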
Icons for Each Operator
The last tip is a simple but fun one: By utilizing the icon argument in the operator config, you can specify a unique icon for each operator. This is the icon that will then show up in the operators list when you hit “`”.
For example, here’s the start of the operator definition for find_exact_duplicate_images:
import fiftyone.operators as foo

class FindExactDuplicates(foo.Operator):
    @property
    def config(self):
        return foo.OperatorConfig(
            name="find_exact_duplicate_images",
            label="Dedup: Find exact duplicates",
            description="Find exact duplicates in the dataset",
            icon="/assets/exact_duplicates.svg",
            dynamic=True,
        )
I like to put all of the SVGs I use as icons in an assets folder to stay organized 📁.
Building a high quality dataset doesn’t have to be a hassle. With our Image Quality Issues Plugin from week 0, you can find a variety of common issues potentially plaguing images in your dataset, from peculiar aspect ratios to oversaturation. Now with the Image Deduplication Plugin (this post) you can also find and eliminate duplicates from your dataset in mere minutes!
Stay tuned over the remaining weeks in the Ten Weeks of FiftyOne Plugins while we continue to pump out a killer lineup of plugins! You can track our journey in our ten-weeks-of-plugins repo — and I encourage you to fork the repo and join me on this journey!