Double Trouble: Eliminate Image Duplicates with FiftyOne

Find Exact and Approximate Duplicate Images with This Plugin

Welcome to week four of Ten Weeks of Plugins. During these ten weeks, we will be building a FiftyOne Plugin (or multiple!) each week and sharing the lessons learned!

If you’re new to them, FiftyOne Plugins provide a flexible mechanism for anyone to extend the functionality of their FiftyOne App. You may find the following resources helpful:

What we’ve built so far:

Ok, let’s dive into this week’s FiftyOne Plugin — Image Deduplication!

Image Deduplication 🖼️🪞🧹

The biggest challenge in training machine learning models is curating a high quality dataset. Duplicate (or very similar) data is a major roadblock to building such a dataset. Multiple copies of the same (or approximately the same) samples can lead to longer training times, higher training costs, and lower overall performance. On the flip side, you likely want a diverse dataset with good coverage over the data domain.

Duplicates come in two flavors:

  1. Exact duplicates: pixel-perfect matches, where one image is literally a down-to-the-bit copy of another
  2. Approximate duplicates: images (or other data) that are highly similar. Similarity is typically evaluated by computing the closeness between samples with some similarity metric and setting a threshold on that metric.

Deduplication is the task of removing these exact and approximate duplicates from a dataset. 

Typically, deduplication involves writing a lot of code to find, visualize, and remove all of the duplicates in your dataset. With this FiftyOne plugin, that all changes. Now you can deduplicate your entire dataset from within the FiftyOne App, without writing a single line of code!

Plugin Overview & Functionality

For the fourth week of Ten Weeks of Plugins, I built an Image Deduplication Plugin. This plugin allows you to:

  • Find both exact and approximate duplicate images in your dataset
  • Visualize these groups of duplicates
  • Delete all duplicates OR Keep a representative from each set of duplicates

The plugin has eight (!) operators. Operators are a powerful FiftyOne feature that allows plugin developers to define custom operations that users of the FiftyOne App can execute. Don’t get overwhelmed, though: really it’s just two sets of analogous operators, for the exact and approximate deduplication workflows.

After you install the plugin, when you open the operators list (pressing “`” in the FiftyOne App) you should see these operators. Search for “dedup” to narrow down the list!

🔍 Finding Duplicates

The first pair of operators helps you to find duplicate images in your dataset.

  • find_approximate_duplicate_images: uses a similarity index to find approximate duplicates.

You can specify either a distance threshold (how close the images need to be according to the similarity metric to be considered near duplicates) or a fraction of the dataset to mark as near duplicates. 
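To make the threshold idea concrete, here is a minimal pure-Python sketch. It is not the plugin's implementation: it assumes each image has already been reduced to an embedding vector, and the `embeddings` dict, the `cosine_distance` and `near_duplicate_pairs` helpers, and the threshold value are all hypothetical names chosen for illustration.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def near_duplicate_pairs(embeddings, thresh):
    """Return pairs of ids whose cosine distance is at most `thresh`."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        if cosine_distance(vec_a, vec_b) <= thresh:
            pairs.append((id_a, id_b))
    return pairs


# Toy embeddings (hypothetical): "a" and "b" point in almost the same direction
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.99, 0.05],
    "c": [0.0, 1.0],
}

pairs = near_duplicate_pairs(embeddings, thresh=0.01)
print(pairs)
```

Raising the threshold marks more images as near duplicates, and lowering it makes the match stricter. Comparing every pair like this scales quadratically with dataset size, which is one reason to rely on a precomputed similarity index for real datasets.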

If you haven’t computed a similarity index on your dataset, you can do so by running:

import fiftyone.brain as fob

# `dataset` is your already-loaded FiftyOne dataset
fob.compute_similarity(dataset, brain_key="sim", metric="cosine")

For a large dataset, you may want to use a vector database. In this case, check out our native integrations with Pinecone, Qdrant, Milvus, and LanceDB!

When the operation finishes, it will have created two saved views: approx_dup_view and approx_dup_groups_view. You can access these by clicking on the saved views selector in the FiftyOne App, or programmatically via Python:

approx_dup_view = dataset.load_saved_view("approx_dup_view")
approx_dup_groups_view = dataset.load_saved_view("approx_dup_groups_view")

  • find_exact_duplicate_images: uses file hashes to find exact duplicates

Essentially, a file hash is a short signature computed for each sample from the binary data stored in the image file. The operator then checks for repeated signature values and marks the corresponding samples as duplicates.

The operator adds a filehash field to each sample, and creates a saved view exact_dup_view, which contains just the images with duplicate filehashes.
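As a rough illustration of the idea, and not the plugin's actual code, the sketch below hashes each file's raw bytes with Python's standard hashlib module and collects the hashes that occur more than once. The choice of SHA-256 and the helper names `compute_filehash` and `find_duplicate_hashes` are assumptions made for this example.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def compute_filehash(filepath):
    """Return a hex digest of the file's raw bytes (SHA-256 is an assumed choice)."""
    with open(filepath, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def find_duplicate_hashes(filepaths):
    """Map each hash shared by two or more files to the list of those files."""
    by_hash = defaultdict(list)
    for path in filepaths:
        by_hash[compute_filehash(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}


# Demo with throwaway files standing in for images
tmpdir = tempfile.mkdtemp()
contents = {"a.jpg": b"same bytes", "b.jpg": b"same bytes", "c.jpg": b"different"}
paths = []
for name, data in contents.items():
    path = os.path.join(tmpdir, name)
    with open(path, "wb") as f:
        f.write(data)
    paths.append(path)

dups = find_duplicate_hashes(paths)
for filehash, group in dups.items():
    print(filehash[:8], [os.path.basename(p) for p in group])
```

Because the hash depends only on the bytes, two files with identical content land in the same group regardless of their names, while a re-encoded or resized copy would get a different hash and needs the approximate workflow instead.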

🪟 Viewing Duplicates

Once you have found exact and/or approximate duplicates in your dataset, you may want to view these duplicates. For approximate duplicates, for instance, you may want to verify that the distance threshold you set was rigorous enough.

The Image Deduplication plugin makes it easy to do this with the display_approximate_duplicate_groups and display_exact_duplicate_groups operators. The names are pretty self-explanatory, but the former loads the approx_dup_groups_view view we saved earlier, and the latter displays the samples in exact_dup_view, grouped by filehash.

🗑️ Removing Duplicates

Once you have viewed your identified duplicates, it is time to clean your dataset. At this point, you have two options:

  1. Remove ALL duplicates: delete all samples marked as an exact or approximate duplicate
  2. Keep a representative: remove all but one duplicate from each set of exact or approximate duplicates
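The difference between the two options can be sketched on plain lists of sample IDs. This is only an illustration under stated assumptions: `remove_all`, `keep_representative`, and the `duplicate_groups` structure are hypothetical, and choosing the first member of each group as the representative is an arbitrary choice here, not necessarily what the plugin does.

```python
def remove_all(sample_ids, duplicate_groups):
    """Drop every sample that appears in any duplicate group."""
    flagged = {sid for group in duplicate_groups for sid in group}
    return [sid for sid in sample_ids if sid not in flagged]


def keep_representative(sample_ids, duplicate_groups):
    """Drop all but the first sample of each duplicate group."""
    to_drop = {sid for group in duplicate_groups for sid in group[1:]}
    return [sid for sid in sample_ids if sid not in to_drop]


sample_ids = ["s1", "s2", "s3", "s4", "s5"]
duplicate_groups = [["s1", "s2"], ["s3", "s4"]]  # hypothetical duplicate sets

after_remove_all = remove_all(sample_ids, duplicate_groups)
after_keep_rep = keep_representative(sample_ids, duplicate_groups)

print(after_remove_all)
print(after_keep_rep)
```

Removing all duplicates leaves only samples that were never flagged, whereas keeping a representative preserves one image per duplicate set so the content itself is not lost from the dataset.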

As always, there are sister operators for working with approximate and exact duplicates:

  • remove_all_approximate_duplicates: removes all near-duplicate images from a dataset
  • remove_all_exact_duplicates: removes all exact duplicate images from a dataset
  • deduplicate_approximate_duplicates: removes near-duplicate images from a dataset, keeping a representative image from each duplicate set
  • deduplicate_exact_duplicates: removes exact duplicate images from a dataset, keeping a representative image from each duplicate set

Here’s an example of each:

Installing the Plugin

If you haven’t already done so, install FiftyOne:

pip install fiftyone

Then you can download this plugin from the command line with:

fiftyone plugins download https://github.com/jacobmarks/image-dedup-plugin

Refresh the FiftyOne App, and you should see the eight operators in your operators list when you press the “`” key.

Lessons Learned

The Image Deduplication plugin is a Python Plugin with the usual structure (__init__.py, fiftyone.yml, and README.md files). Additionally, it has the following:

  • An assets folder for storing icons
  • A Python file exact_dups.py for handling the logic and computations involved for exact duplicates
  • A Python file approx_dups.py for handling the logic and computations involved for approximate duplicates

Splitting Code into Submodules

It’s typically good practice in software development to make code modular, splitting self-contained pieces of logic into separate functions or files. This is known as separation of concerns.

The Image Deduplication plugin was a good exercise in applying this principle to FiftyOne’s plugin system. To utilize functions or variables you define in another file in the FiftyOne plugin’s directory, you need to add the path to that file to your system path.

Here’s an example where we import find_exact_duplicates from the exact_dups file:

import os

from fiftyone.core.utils import add_sys_path

# Make sibling modules in the plugin directory importable for this block
with add_sys_path(os.path.dirname(os.path.abspath(__file__))):
    # pylint: disable=no-name-in-module,import-error
    from exact_dups import find_exact_duplicates

Starting from the innermost part of this expression:

  • __file__ is a variable containing the path to the current module, in this case the __init__.py file
  • os.path.abspath gets the absolute path for this file
  • os.path.dirname extracts the directory name of this absolute path
  • add_sys_path is a FiftyOne utility, used here as a context manager, that adds this directory to our system path for the duration of the with block

The second to last line, # pylint: disable=no-name-in-module,import-error, tells our linter not to report an error for the import that follows it.
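To see what a path-adding context manager does, here is a simplified stand-in built only from the standard library. It is not FiftyOne's implementation of add_sys_path: the `temporary_sys_path` helper and the `demo_dups` module are hypothetical, created just for this demonstration.

```python
import contextlib
import os
import sys
import tempfile


@contextlib.contextmanager
def temporary_sys_path(path):
    """Put `path` at the front of sys.path for the duration of the block."""
    sys.path.insert(0, path)
    try:
        yield
    finally:
        # Undo the change even if the import inside the block fails
        try:
            sys.path.remove(path)
        except ValueError:
            pass


# Demo: a throwaway directory containing a tiny module (hypothetical name)
plugin_dir = tempfile.mkdtemp()
with open(os.path.join(plugin_dir, "demo_dups.py"), "w") as f:
    f.write("def find_exact_duplicates():\n    return 'found'\n")

with temporary_sys_path(plugin_dir):
    from demo_dups import find_exact_duplicates

result = find_exact_duplicates()
print(result)
```

The imported function stays usable after the block exits, while the directory itself is no longer on the system path, so the plugin's helper modules do not leak into unrelated imports.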

Loading a View

When executed, the display_approximate_duplicate_groups and display_exact_duplicate_groups operators each trigger the loading of specific views. Doing this is pretty straightforward, but it is worth noting that the data passed into params in the ctx.trigger() call needs to be serializable. In fact, all data passed into parameter dictionaries for FiftyOne operators needs to be serializable.

Fortunately, FiftyOne DatasetView objects are easy to serialize!

import json
from bson import json_util

def serialize_view(view):
    return json.loads(json_util.dumps(view._serialize()))

Icons for Each Operator

The last tip is a simple but fun one: By utilizing the icon argument in the operator config, you can specify a unique icon for each operator. This is the icon that will then show up in the operators list when you hit “`”.

For example, here’s the start of the operator definition for FindExactDuplicates:

import fiftyone.operators as foo


class FindExactDuplicates(foo.Operator):
    @property
    def config(self):
        return foo.OperatorConfig(
            name="find_exact_duplicate_images",
            label="Dedup: Find exact duplicates",
            description="Find exact duplicates in the dataset",
            # Path to the SVG icon, relative to the plugin directory
            icon="/assets/exact_duplicates.svg",
            dynamic=True,
        )

I like to put all of the SVGs I use as icons in an assets folder to stay organized 📁.

Conclusion

Building a high quality dataset doesn’t have to be a hassle. With our Image Quality Issues Plugin from week 0, you can find a variety of common issues potentially plaguing images in your dataset, from peculiar aspect ratios to oversaturation. Now with the Image Deduplication Plugin (this post) you can also find and eliminate duplicates from your dataset in mere minutes!

Stay tuned over the remaining weeks in the Ten Weeks of FiftyOne Plugins while we continue to pump out a killer lineup of plugins! You can track our journey in our ten-weeks-of-plugins repo — and I encourage you to fork the repo and join me on this journey!