
State-of-the-art deep learning models are often trained on datasets with millions of images. Collecting your own datasets of that size is a difficult enough task in and of itself. You will likely need to automate the process of raw data curation.
Once you have collected your raw data, you need to ensure that the samples are of high quality in order to properly train and test your model. One requirement of a high-quality dataset is the absence of duplicate samples. Duplicates in your training set bias your model towards the duplicated samples, leaving it with a harder time generalizing to new data. Duplicates in your test set, especially ones that also appear in your training set, lead to incorrect performance evaluation, masking how the model will actually perform on unseen data.
Finding duplicate images manually in a dataset with millions of images is prohibitively expensive. It is easy enough to write an algorithm that finds exact copies of an image, but what if two images are only slightly different, varying by a small lighting change or a handful of pixel values? These samples can still hurt performance but are much more difficult to find.
This post will show you how to find and remove duplicate and near-duplicate images in your dataset automatically while also visualizing the images to ensure the right images are removed.
We will be looking at the CIFAR-100 dataset, which we find to contain nearly 4,500 duplicate images, at times duplicated between the test and train splits.
I will be using FiftyOne, an open-source ML developer tool I have been working on, in this post. It lets you easily manipulate and visualize image datasets, as well as generate embeddings on your data using its model zoo. The installation is as easy as:
pip install fiftyone

Overview
- Load dataset
- Compute embeddings
- Calculate similarity
- Visualize and remove duplicates
It should be noted that there are multiple ways that this can be done. The example here is useful for finding near-duplicates pairwise for every image in the dataset.
FiftyOne also provides a uniqueness function that computes a scalar property over the dataset indicating how unique each sample is in relation to the rest of the data. It can also be used to find near-duplicates manually, with low uniqueness indicating likely duplicate or near-duplicate images. You can see an example of it at the end of this post.
Alternatively, if you are only interested in exact duplicates, you can compute a hash over your files to quickly find matches. However, if images vary by only small pixel values, this method will fail to find the duplicates.
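For reference, a minimal sketch of the hashing approach, assuming filepaths is a list of your image paths:

import hashlib
from collections import defaultdict

# Group files by the MD5 digest of their raw bytes; any group with
# more than one path contains byte-identical duplicates
hashes = defaultdict(list)
for filepath in filepaths:
    with open(filepath, "rb") as f:
        hashes[hashlib.md5(f.read()).hexdigest()].append(filepath)

exact_dups = {h: paths for h, paths in hashes.items() if len(paths) > 1}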
1) Load Dataset
The first step will be to load your dataset into FiftyOne.
In this example, I will be using the image recognition dataset, CIFAR-100. This dataset is fairly old (2009) but still relevant enough that papers submitted to ICLR 2021 are using it as a baseline.
CIFAR-100 contains 60,000 images across the train and test splits, annotated with 100 label classes grouped into 20 "superclasses". This dataset also exists in the FiftyOne Dataset Zoo and can be easily loaded.
If you want to follow along with a lightweight version of this guide, you can load a subset of the dataset containing 1,000 samples (or however many you specify).
The loaded dataset can then be visualized in the FiftyOne App.
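For reference, loading the dataset and launching the App looks something like this (pass max_samples=1000 to load the lightweight subset mentioned above):

import fiftyone as fo
import fiftyone.zoo as foz

# Load both splits of CIFAR-100 from the FiftyOne Dataset Zoo
dataset = foz.load_zoo_dataset("cifar100", splits=["train", "test"])

# Launch the FiftyOne App to browse the samples
session = fo.launch_app(dataset)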

1a) Loading your own dataset
FiftyOne supports a host of label types. If you are using a classification dataset to look for duplicates, you can import it as an ImageClassificationDirectoryTree dataset.
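A minimal sketch, assuming your images are organized into one folder per class:

import fiftyone as fo

# Expects a tree of the form /path/to/dataset/<class_name>/<image>.jpg;
# replace the path below with the location of your own dataset
dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/dataset",
    dataset_type=fo.types.ImageClassificationDirectoryTree,
)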
2) Generate Embeddings
Images store a lot of information in their pixel values. Comparing images pixel-by-pixel would be an expensive operation and would produce poor results.
Instead, we can use a pretrained computer vision model to generate an embedding for each image: a vector of a few thousand values, produced by passing the image through a deep model, that distills the information stored in the image's raw pixels.
The FiftyOne Model Zoo contains a host of different pretrained models that we can use for this task. In this example, we will use MobileNet trained to classify images on the ImageNet dataset. This model provides relatively high performance, but most importantly it is lightweight and can process our dataset more quickly than larger models. Any off-the-shelf model will be informative, but you can easily experiment with other models that may be more useful for particular datasets.
We can easily load the model and compute embeddings on our dataset.
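For example, using the mobilenet-v2-imagenet-torch model from the Model Zoo:

import fiftyone.zoo as foz

# Load MobileNet (pretrained on ImageNet) from the FiftyOne Model Zoo
model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")

# Compute an embedding for every sample; returns a num_samples x num_dims array
embeddings = dataset.compute_embeddings(model)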
3) Calculate Similarity
Now that we have significantly reduced the dimensionality of our images, we can use a classical similarity algorithm to compute how similar every image embedding is to every other image embedding.
In this case, we will use the cosine similarity provided by scikit-learn, since the algorithm is simple and works fairly well in high-dimensional spaces.
The N x N similarity matrix provides a value between 0 (low similarity) and 1 (identical) for each pair of your N samples.
[1, 0.532, 0.624, 0.461, ...]
[0.422, 1, 0.125, 0.031, ...]
...
[..., 0.236, 0.942, 0.831, 1]
As you can see, all diagonal values are 1, since every image is identical to itself. We can subtract the identity matrix (the N x N matrix with 1’s on the diagonal and 0’s elsewhere) to zero out the diagonal so those values don’t show up when we look for samples with maximum similarity.
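Putting those two steps together with scikit-learn:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# N x N matrix of pairwise cosine similarities between the embeddings
similarity_matrix = cosine_similarity(embeddings)

# Subtract the identity to zero out each image's similarity to itself
similarity_matrix -= np.identity(len(similarity_matrix))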
Note: Computing cosine similarity on datasets with more than 100,000 images can be time and memory intensive. It is recommended to split the embeddings into batches and parallelize the process to speed up this computation.
4) Visualize and Remove Duplicates
We can now iterate through every sample and find which other samples are the most similar to it.
Visualizing the results and sorting by the samples with the highest similarity shows us the duplicates in the dataset.
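A simple way to sketch this is to store each sample's maximum similarity as a new field, relying on the fact that compute_embeddings() returns embeddings in the dataset's iteration order:

# Record each sample's highest similarity to any other sample;
# the matrix rows are in the same order that the dataset iterates
for idx, sample in enumerate(dataset):
    sample["max_similarity"] = similarity_matrix[idx].max()
    sample.save()

# Sort by max_similarity in the App to surface likely duplicates
session.view = dataset.sort_by("max_similarity", reverse=True)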

Right off the bat, we can see a lot of duplicates, and something even more problematic: two of the images are duplicates, but one is in the train split and one is in the test split… and they are labeled differently, as seal vs otter! There are a couple of glaring things wrong with this:
- It can’t be both a seal and an otter, so one of the labels is wrong. Additionally, providing different labels for the train and test versions of the same image will undoubtedly cause the model to fail.
- Test sets that contain duplicates of the training set will lead to false confidence in the generalizability of your model. If your test set is not truly independent of your training set, the apparent performance of your model will likely drop off when applied to production data.
Removing Duplicates
Labeling and split mishaps aside, we still want to automatically remove the duplicate images in the dataset.
By looking through the results, we can find a threshold to use as a cutoff for when two images are determined to be duplicates. This threshold will be different for every dataset/model used in this process, so the visualization step is crucial. The range slider in the FiftyOne App provides an easy way to reduce the max_similarity score until we stop seeing duplicates.

It seems as though around 0.94 max_similarity, we are getting images that are duplicated but augmented by rotation, flipping, and color changes. Data augmentation is a valuable tool for increasing the diversity of your training dataset and the generalizability of your model in production; however, if the "atomic" training set has duplicates in it, then your model’s training may still be biased towards the offending classes, so we also want to remove the near-duplicates.

Further inspection puts a good threshold for guaranteed duplicates around 0.92. Lower values likely also include duplicates but should be verified manually so that we do not remove useful data. We can also filter the dataset in code to see just how many samples have a max_similarity greater than 0.92.
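Using FiftyOne's match() with a ViewField expression:

from fiftyone import ViewField as F

# View containing only samples whose most similar neighbor exceeds the cutoff
dup_view = dataset.match(F("max_similarity") > 0.92)
print("%d of %d samples marked as duplicates" % (len(dup_view), len(dataset)))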
4,345 out of 60,000 are marked as duplicates!
We can now iterate through our dataset again and find all duplicates with greater than 0.92 similarity for each sample, either tagging them as "duplicate" or removing them, as sketched below.
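One way to sketch this is to tag only the second image of each high-similarity pair, which keeps one representative of every duplicate group (the threshold and field names follow from the steps above):

import numpy as np

# Indices (i, j) with j > i of pairs above the duplicate threshold;
# tagging only the j side keeps one copy of each duplicate group
pairs = np.argwhere(np.triu(similarity_matrix, k=1) > 0.92)

id_list = dataset.values("id")  # same order as the embeddings
dup_ids = {id_list[j] for _, j in pairs}

# Tag the duplicates; dataset.delete_samples(list(dup_ids)) removes them instead
for sample in dataset.select(list(dup_ids)):
    sample.tags.append("duplicate")
    sample.save()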
We can also find the number of duplicates that exist across the train and test splits, as well as ones that are labeled differently.
The result is that 1,621 of the 4,345 duplicates appear in both the train and test splits, and 427 are labeled differently!
(Optional) Find unique images
On a related note, FiftyOne also provides a more advanced method to compute the uniqueness of every image in a dataset. This results in a score for every image indicating how unique the contents of the image are with respect to all other images. "Uniqueness" has the opposite polarity to "similarity": images with a low uniqueness value are potential duplicates that you should explore.
Uniqueness can be helpful when deciding which samples to send to annotators. If you are only going to spend money to get the best subset of your data annotated, then you will want the most unique samples to train/test your model on.
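Computing uniqueness is a one-liner with the FiftyOne Brain:

import fiftyone.brain as fob

# Assign a uniqueness score in [0, 1] to every sample in the dataset
fob.compute_uniqueness(dataset)

# The least unique images are the most likely duplicates
session.view = dataset.sort_by("uniqueness")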
