Build an Image Duplicate Finder System: A Guide

Your zero-to-hero guide to detecting duplicate and near-duplicate images

Orjuwan Zaafarani
Towards Data Science


Image by Manu9899 from Pixabay

Do you want to identify duplicate or near-duplicate images? Or calculate the number of copies per image in a dataset?

If yes, then this article is for you.

The objective of this article is five-fold:

  1. Comprehend the difference between image duplicate finder and content-based image retrieval systems
  2. Walk through the concepts of five different methods for comparing similar images
  3. Learn each method's implementation in Python
  4. Determine the sensitivity of said algorithms to image transformations
  5. Pave the way for selecting the method that best suits your application's needs in terms of speed and accuracy (experiments included)

The Basic Architecture

To start with, I need to define an important term. A query image is an image the user enters to obtain information. With the help of a similarity block, which computes how close images are to each other, the system searches a dataset for similar images. Image 1 illustrates the steps.

Image 1 — The Basic Structure of an Image Duplicate Finder System (image by author)

In section 3, we will be looking into this similarity block and exploring the most common methods of achieving this functionality.

Image Duplicate Finder vs. Content-Based Image Retrieval Systems

The main difference between both systems is that an image duplicate/near-duplicate finder detects only identical and near-identical images (Image 2). On the other hand, a content-based image retrieval (CBIR) system searches for similar regions of interest and displays the most similar images with those regions (Image 3).

Image 2 — An Example of the Ins and Outs of the Image Duplicate Finder System (image by author)
Image 3 — An Example of the Ins and Outs of the Content-Based Image Retrieval System (image by author)

Notice how the content-based image retrieval system recognizes the apple and outputs images of different scenes.

Five Common Methods for Comparing Similar Images

The article will visit five main methods:

  1. Euclidean Distance
  2. Structural Similarity Index Measure (SSIM)
  3. Image Hashing
  4. Cosine Similarity
  5. The Similarity of Features (using CNN)

1. Euclidean Distance

Moving on to the first method: as shown in Image 4, the Euclidean distance is the straight-line distance between two data points in a plane [8]. It is also known as the L2 norm distance metric.

Image 4 — Euclidean Distance (image by author)

We can represent images as vectors. A vector is a quantity that has an initial point and a terminal point [4]. Those two points define a vector's two features: magnitude and orientation.

In a vector space, let's assume we have two pictures to compare, x = [x1, x2, x3] and y = [y1, y2, y3]. Image 5 shows the general formula [8], and Image 6 shows an example of its use.

Image 5 — The General Formula of Euclidean Distance (image by author)
Image 6 — An Example of applying the Euclidean Distance Formula (image by author)

The formula is straightforward. Since it resembles the Pythagorean theorem, the metric is also known as the Pythagorean distance [8].

In Python, the implementation is pretty simple:

  • Implementation 1: using the Scipy library
  • Implementation 2: using NumPy's linalg.norm (reference)
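
Both implementations boil down to the same computation. A minimal sketch, with hypothetical three-element vectors standing in for flattened images:

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical example: two images flattened into 1-D vectors
x = np.array([10.0, 20.0, 30.0])
y = np.array([14.0, 25.0, 42.0])

# Implementation 1: Scipy
d_scipy = distance.euclidean(x, y)

# Implementation 2: NumPy's linalg.norm applied to the difference vector
d_numpy = np.linalg.norm(x - y)

print(d_scipy, d_numpy)  # both print ~13.601, the square root of 185
```

For real images, flatten the pixel array (e.g. `np.asarray(img).ravel()`) before computing the distance.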

2. Structural Similarity Index Measure (SSIM)

The paper Image Quality Assessment: From Error Visibility to Structural Similarity [1] introduced SSIM in 2004. It calculates the similarity between two given images as a value between -1 and 1 (in practice usually between 0 and 1), where 1 indicates identical images.

Other than finding duplicates, one of its many applications is measuring how compression affects image quality [2]. It can also estimate how strongly data transmission losses degrade quality [2].

As per the authors, the main three factors that affect the index are luminance, contrast, and structure [3]. Hence, if one of these factors changes, so will the index.

As for the implementation, it is as follows:
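
A minimal sketch using scikit-image's `structural_similarity` (an assumption on my part; any SSIM implementation behaves the same way):

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

# Hypothetical example: a random grayscale image and a slightly noisy copy
rng = np.random.default_rng(0)
img1 = rng.random((64, 64))
img2 = np.clip(img1 + rng.normal(0, 0.05, img1.shape), 0, 1)

# data_range is the span of the pixel values (1.0 for floats in [0, 1])
score = ssim(img1, img2, data_range=1.0)
print(score)  # close to 1 for near-identical images
```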

3. Image Hashing

Another way to compute the similarity between two images is image hashing (also known as digital fingerprinting), the process of assigning a unique hash value to every image; identical images receive the same value. Average hashing is one of the many types of hashing. It works in the following way [6] (refer to Image 7 for clarification):

  1. Reduce the size of the image (for instance: 8x8)
  2. Convert it to grayscale
  3. Take the mean of it
  4. Compare each pixel with the average. Assign the pixel a value of 1 if it is above the average and 0 otherwise
  5. Construct the hash: set the 64 bits into a 64-bit integer
Image 7 — Average Hashing Steps (image by author)
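
The five steps above can be sketched by hand with NumPy and Pillow (an illustration, not the library code used in the experiments):

```python
import numpy as np
from PIL import Image

def average_hash(image: Image.Image, hash_size: int = 8) -> int:
    # Steps 1 and 2: shrink to hash_size x hash_size and convert to grayscale
    small = image.convert("L").resize((hash_size, hash_size), Image.LANCZOS)
    pixels = np.asarray(small, dtype=np.float64)
    # Step 3: take the mean pixel value
    mean = pixels.mean()
    # Step 4: 1 if a pixel is above the mean, 0 otherwise
    bits = (pixels > mean).flatten()
    # Step 5: pack the 64 bits into a 64-bit integer, top-left bit first
    return int("".join("1" if b else "0" for b in bits), 2)

# A toy 8x8 image whose top half is white and bottom half black
img = Image.new("L", (8, 8), 0)
img.paste(255, (0, 0, 8, 4))
print(format(average_hash(img), "064b"))  # 32 ones followed by 32 zeros
```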

The resulting 64-bit integer could look something like this:

1011111101100001110001110000111101101111100001110000001100001001

Note the word could: we can represent the image in different ways, listing the 0 and 1 bits starting from the top-left corner (as in the example above), from the top-right corner, et cetera [6].

Best of all, if we change the aspect ratio, increase or decrease the brightness or contrast, or even alter the color of an image, its hash value will remain the same [7], making it one of the best ways to detect identical images.

Comparing two images is done as follows [7]:

  1. Construct the hash of each image (by following the above 5-steps)
  2. Calculate the Hamming distance by counting the number of bit positions that differ between the two hashes. A distance of zero indicates identical images.
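
A minimal sketch of step 2, reusing the 64-bit hash shown earlier (the flipped bits are hypothetical):

```python
def hamming_distance(h1: int, h2: int) -> int:
    # XOR leaves a 1 at every bit position where the hashes differ
    return bin(h1 ^ h2).count("1")

# The 64-bit hash from the example above
h1 = int("1011111101100001110001110000111101101111100001110000001100001001", 2)
h2 = h1 ^ 0b1001  # hypothetical near-duplicate: two bits flipped

print(hamming_distance(h1, h1))  # 0 -> identical images
print(hamming_distance(h1, h2))  # 2 -> near-duplicate
```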

4. Cosine Similarity

Cosine similarity is a method of calculating the similarity of two vectors (which can represent images) by taking their dot product and dividing it by the product of their magnitudes [9], as shown below in Image 8.

Image 8 — Cosine Similarity Equation (image by author)

As the angle between two vectors gets smaller, the similarity gets stronger [9]. As depicted in Image 9, vectors C and B share a high similarity, in contrast to A and B, since their angle is significantly smaller.

Image 9 — Cosine Similarity Illustration (image by author)

The following is the code for calculating the metric for two PIL images using torch.

5. The Similarity of Features (using CNN)

One last way to compare images is by computing the similarity of features. As is well known, a convolutional neural network (CNN) can pick up the patterns of an image and make sense of them. Convolutional layers have filters that detect these patterns, which can be edges, shapes, or textures. Such patterns are called features.

We can extract those features from the convolutional layers of a CNN. Image 10 clearly illustrates a sample architecture. Usually, we designate the last convolutional layer of a network for feature extraction.

Image 10 — A Simple CNN Architecture (image by author)

One great state-of-the-art CNN architecture is EfficientNet. Its scaling method uniformly scales all dimensions (depth, width, and resolution) using a compound coefficient. I will not dig deep into it, as that is out of the scope of this article; however, I will utilize it in the following sections.

The data science community widely uses the similarity of features in content-based image retrieval (CBIR) systems. The experiments section will explain why.

5.1. EfficientNet-b0 and Euclidean Distance

After extracting the features from EfficientNet, I apply Euclidean distance to measure the similarity between the features of the query and dataset images to find the most similar ones.

5.2. EfficientNet-b0 and Cosine Similarity

Computing the cosine similarity of features is pretty much like the previous method, though cosine similarity is applied instead of Euclidean distance.
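
The cosine step itself works on feature vectors however they were obtained; the vectors below are hypothetical stand-ins for EfficientNet features:

```python
import numpy as np

def cosine_similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    # Dot product divided by the product of the magnitudes
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

q = np.array([0.2, 0.8, 0.1])
c = np.array([0.4, 1.6, 0.2])  # same direction, scaled by 2

print(cosine_similarity(q, q))  # ~1.0: identical feature vectors
print(cosine_similarity(q, c))  # ~1.0: cosine ignores magnitude
```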

Before ending this section: what if the resulting similarity were 250, 0.04, or 10809? What number makes a pair of images similar? The answer: you have to define this threshold yourself, based on research or dedicated testing on your chosen dataset.

Datasets

Throughout the experiments, I used two different datasets for evaluation:

I designated the first dataset, Fruits360, to evaluate the duplicate/near-duplicate image finder, since it consists of images of each fruit category taken in 360 degrees, as depicted in Image 11. Those frames are slightly different; Image 12 shows how the brown scratch shifted clockwise.

Image 11 — A Sample of the Fruits360 Dataset (image by author)
Image 12 — The Differences Between Three Frames of the Fruits360 Dataset (image by author)

Testing on this dataset will give us great feedback on how well the image duplicate/near-duplicate finder performs, since all the images are adjacent frames; within each category, the pictures are largely similar.

Secondly, SFBench is a dataset of 40 images gathered to evaluate Content-Based Image Retrieval (CBIR) systems.

Note that it is not the article's goal to build or evaluate a CBIR system. We used this dataset only to test how image transformations such as 3D projection and rotation can affect the performance of the methods.

A few sample images of the dataset are presented below in Image 13. Like the first dataset, it is composed of 4 images per scene.

Image 13 — A Sample of the SFBench dataset (image by author)

Experiments

I tested each method using both datasets in the following manner:

  • Experiment 1: Speed and Accuracy
  • Experiment 2: Resilience to Image Transformation
  • Experiment 3: Scipy distance.euclidean vs. Numpy linalg.norm Speed

Note: I ran all the tests on a 2019 MacBook Pro CPU. You can also find all the tests in the GitHub repository.

Experiment 1: Speed and Accuracy

This test identifies the best method in terms of speed and accuracy for the image duplicate finder system. The steps are as follows:

  • Read the images of the Fruits360 dataset
  • Convert them to RGB
  • Resize the images to a fixed size
  • Apply the five methods
  • Get the three most similar images for each reference image
  • Calculate the average time in seconds each method took to compare a pair of images
  • Calculate the accuracy (for each reference image, if all three duplicates/near-duplicates are detected, the accuracy is 100%)

The results (shown in Table 1) clearly show that cosine similarity dominates. Computing the similarity of features using CNNs is over-engineering for finding duplicates/near-duplicates: its running time is around 250x slower than cosine similarity's, with only comparable accuracy. Image hashing, meanwhile, is a great candidate if speed is a huge factor.

Visit this article [5] to learn more about when Euclidean distance is better than cosine similarity and vice versa.

Table 1 — Experiment 1 Results

Experiment 2: Resilience to Image Transformations

This test follows the same steps as Experiment 1; the only differences are the dataset (SFBench) and the resizing factor. Note that it is not the purpose of an image duplicate finder to detect and recognize transformed images; I am only assessing the resilience of the methods for their potential use in a CBIR system.

Logically, the similarity of features method performed best, as a CNN holds spatial information within its layers. Table 2 summarizes the results:

Table 2 — Experiment 2 Results

Experiment 3: Scipy distance.euclidean vs. Numpy linalg.norm Speed (Extra)

The last experiment examines how the Scipy and NumPy implementations compare by repeating the same operation ~2300 times. This test is an extra step that does not affect the functionality of the image duplicate/near-duplicate finder system.
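
Such a micro-benchmark can be sketched with timeit; the array size is illustrative (e.g. a flattened 224x224 grayscale image):

```python
import timeit

import numpy as np
from scipy.spatial import distance

# Two flattened "images"
x = np.random.rand(224 * 224)
y = np.random.rand(224 * 224)

n = 2300  # roughly the repetition count used in the experiment
t_scipy = timeit.timeit(lambda: distance.euclidean(x, y), number=n)
t_numpy = timeit.timeit(lambda: np.linalg.norm(x - y), number=n)

print(f"scipy: {t_scipy:.4f}s, numpy: {t_numpy:.4f}s")
```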

The results show that they both share similar performance (Table 3).

Table 3 — Experiment 3 Result

Conclusion

In conclusion, we walked through the concepts and Python code of Euclidean distance, SSIM, image hashing, cosine similarity, and the similarity of features. We also determined the sensitivity of those algorithms to image transformations. Finally, through experiments, we identified the best methods according to the requirements of speed and accuracy.

The Article’s Material

You can find all the datasets, experiments, and results in the GitHub repository. Moreover, you can test out datasets of your choice as long as you follow the same naming conventions.

Resources

[1] Wang, et al, Image Quality Assessment: From Error Visibility to Structural Similarity, 2004

[2] Imatest LLC, SSIM: Structural Similarity Index, v.22.1

[3] Datta, All about Structural Similarity Index (SSIM): Theory + Code in PyTorch, 2020

[4] Mathematics LibreTexts, Further Applications of Trigonometry: Vectors, 2021

[5] Nagella, Cosine Similarity Vs Euclidean Distance, 2019

[6] The Content Blockchain Project, Testing Different Image Hash Functions, 2019

[7] Krawetz, The Hacker Factor Blog, Looks Like It, 2011

[8] Gohrani, Different Types of Distance Metrics used in Machine Learning, 2019

[9] Clay, How to calculate Cosine Similarity (With code), 2020
