Detection of Duplicate Images Using Image Hash Functions
Automate the search for (near-)identical photos with the Python library undouble
Searching for (near-)identical photos across your entire system can be tedious: you need to click through thousands of photos and then decide for each one whether it is "similar". The most straightforward approach to detect duplicates would be to compare file size or filename. However, photos are usually derived from different sources, such as mobile devices and social media apps, which leads to differences not only in file size and name but also in resolution, scaling, compression, and brightness. Hash functions are ideal for detecting (near-)identical photos because of their robustness against minor changes. I will summarize the concepts of hash functions, followed by a hands-on tutorial demonstrating the steps of detecting duplicates while minimizing the number of false-positive collisions. All results are derived using the Python library undouble.
If you found this article helpful, use my referral link to continue learning without limits and sign up for a Medium membership. Plus, follow me to stay up-to-date with my latest content!
Visually similar but numerically different.
Two images can be visually similar but numerically different. Numerical differences can have various causes, such as social media apps that change the brightness, contrast, gamma correction, compression, resolution, and/or scaling. As an example, sending an image via WhatsApp results in a reduced resolution (Figure 1). Note that the amount of reduction may differ across devices and user-defined settings.
From a visual perspective, it is hard to see any changes between the source and the WhatsApp image, but when we subtract the two images, the differences become clear (Figure 1C). If you only have a few images on disk that need to be undoubled, it is easy to pick those with the best resolution. However, when you occasionally, or perhaps yearly, dump all your images to disk, it becomes a time-consuming task, and even more challenging when family, friends, and co-workers also share their almost-similar moments with you.
The question is not whether you have duplicate photos, but where they are.
Approaches such as sorting on file size or image subtraction would thus fail. There is a solution, though: hash functions! Hash functions are robust against small changes in brightness, contrast, gamma correction, compression, scaling, and/or resolution, and are therefore ideal for detecting (near-)identical images. Hash functions have many applications, such as digital forensics and copyright enforcement, and more generically, disk space reduction and thus undoubling.
The undouble library searches for duplicate images.
The undouble library can be used to detect identical images across an entire system or any input directory. Given the input data, the images are preprocessed, the hashes are computed, and images are grouped based on the image hash. To organize the images on disk without manual actions, the move functionality moves all identical images except the one with the largest resolution, which is copied. A schematic overview of the steps is shown below.
In the following sections, I will describe in more detail the pre-processing steps, hash functions, grouping results, and plotting functionalities, such as the image hash.
Image Hash Functions.
A hash function transforms or maps the input data to a fixed-length string, which can be seen as the "fingerprint" or "signature" of the input data: the image hash. A good hash function should thus be fully determined by the input data, or in our case, the image. There are various hash functions, such as average hash, perceptual hash, difference hash, Haar wavelet hash, Daubechies wavelet hash, HSV color hash, and crop-resistant hash. Each hash function has specific properties that make it robust against certain changes, such as brightness, contrast, gamma correction, watermarking, compression, scaling, and grayscaling.
An image hash function maps an image to a short string, called the image hash, which can be used for image authentication or as a digital fingerprint.
Nevertheless, it can occur that two visually different images get the same image hash; this is called a collision. More than one example is shown with the 101 objects dataset, but let's start by evaluating the robustness of the most well-known hash functions. The robustness is evaluated by changing the brightness (-50%, -20%, +20%, +50%), contrast (-50%, -20%, +20%, +50%), scaling, and compression (png, jpg) of a single image. In total, 10 different alterations of the "cat and dog" image (Figure 2) are created. Note that this does not evaluate collisions but the influence of brightness and contrast across different hash functions. All hash functions are evaluated with the Python library undouble, which in its turn uses functionalities of the imagehash library [3].
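The number of "changes" between two image hashes is simply the Hamming distance: the count of bit positions where the two hashes differ. A minimal sketch (the function name is my own, not part of undouble):

```python
import numpy as np

def hamming(hash1, hash2):
    """Count the number of differing bits between two binary hash vectors."""
    return int(np.count_nonzero(np.asarray(hash1) != np.asarray(hash2)))

# Two 64-bit hashes that differ in exactly three positions
h1 = np.zeros(64, dtype=np.uint8)
h2 = h1.copy()
h2[[3, 17, 42]] = 1
print(hamming(h1, h2))  # → 3
```

A distance of 0 means the hashes are identical; a small distance indicates a near-identical image.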
Pre-processing before computing the hash.
Before we can determine the image hash, three pre-processing steps are required: 1. decolorize, 2. normalize the pixel values, and 3. scale the image. The reasoning behind decolorizing is that the information required to "recognize" an image is readily present in the grayscale channel. In addition, the reduction from 24 bits per pixel (RGB) to 8 bits per pixel is computationally more attractive, both time-wise and memory-wise. The next step is to downsample/scale the image to a smaller size. Most often, a 64-bit hash is chosen, which simply means that the image is downsampled to 8 x 8 pixels. Below is an example of how the image hash is derived using Python for various hash functions.
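As a rough sketch of these three steps in NumPy (my own illustration, not the library's internal code; the luminance weights are the common Rec. 601 coefficients, an assumption on my part):

```python
import numpy as np

def to_grayscale(rgb):
    """Step 1: decolorize with the common Rec. 601 luminance weights."""
    return rgb[..., :3] @ np.array([0.299, 0.587, 0.114])

def normalize(gray):
    """Step 2: normalize pixel values to the range [0, 1]."""
    rng = np.ptp(gray)
    return (gray - gray.min()) / rng if rng else np.zeros_like(gray, dtype=float)

def downscale(gray, size=8):
    """Step 3: block-average down to size x size pixels
    (assumes both dimensions are divisible by `size`)."""
    h, w = gray.shape
    return gray.reshape(size, h // size, size, w // size).mean(axis=(1, 3))

# A 16 x 16 toy image: black left half, white right half
img = np.zeros((16, 16, 3))
img[:, 8:, :] = 255
small = downscale(normalize(to_grayscale(img)))
print(small.shape)  # → (8, 8)
```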
pip install undouble

Average hash.
After the decolorizing and scaling steps, each pixel block is compared to the average (as the name suggests) of all pixel values of the image. In the example below, we generate a 64-bit hash, which means that the image is scaled to 8×8 pixels. If the value in a pixel block is larger than the average, it gets the value 1 (white), otherwise 0 (black). The image hash is created by flattening the binary array into a vector.
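The description above can be sketched in a few lines of NumPy (my own illustrative implementation, not the undouble or imagehash source):

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """Average hash of a 2-D grayscale array (illustrative sketch)."""
    # Downsample to hash_size x hash_size by block averaging
    # (assumes both dimensions are divisible by hash_size)
    h, w = gray.shape
    small = gray.reshape(hash_size, h // hash_size,
                         hash_size, w // hash_size).mean(axis=(1, 3))
    # Bit is 1 where the pixel is brighter than the overall average
    return (small > small.mean()).astype(np.uint8).ravel()

# Toy image: dark left half, bright right half
img = np.zeros((64, 64))
img[:, 32:] = 200
h = average_hash(img)
print(h.sum())  # → 32 (the bright half)

# Uniformly scaling the brightness leaves the hash unchanged
assert np.array_equal(h, average_hash(img * 0.5))
```

The final assertion illustrates why this hash is robust against uniform brightness changes: scaling all pixels also scales the average, so the comparisons are unaffected.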
If we run the average hash function on the 10 altered images, we see that the image hash is quite stable, with only minor differences (Figure 3). Only in the case of a 50% increase in brightness does the image hash start to deviate. On average, there are 2.1 changes in the image hash across all combinations.
Perceptual hash.
The perceptual hash function starts by only decolorizing the image, after which a Discrete Cosine Transform (DCT) is applied; first per row and afterward per column. Only the top-left 8 x 8 block, containing the lowest frequencies, is kept. Each DCT coefficient is then compared to the median of these coefficients. If the coefficient is larger than the median, it gets the value 1, otherwise 0. The final image hash is obtained by flattening the binary array into a vector.
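A simplified NumPy sketch of this procedure (my own illustration; the real imagehash implementation uses a library DCT and differs in details such as the resize step):

```python
import numpy as np

def dct_matrix(n):
    """DCT-II basis matrix: entry [u, x] is cos(pi * (2x + 1) * u / (2n))."""
    u = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    return np.cos(np.pi * (2 * x + 1) * u / (2 * n))

def phash(gray, hash_size=8):
    """Perceptual hash sketch; assumes a square grayscale array (e.g. 32 x 32)."""
    n = gray.shape[0]
    C = dct_matrix(n)
    d = C @ gray @ C.T                 # 2-D DCT: first rows, then columns
    low = d[:hash_size, :hash_size]    # keep only the lowest frequencies
    return (low > np.median(low)).astype(np.uint8).ravel()

rng = np.random.default_rng(0)
img = rng.random((32, 32)) * 255
h = phash(img)

# Uniform brightness scaling scales every DCT coefficient equally,
# so the comparisons against the median, and thus the hash, are unchanged
assert np.array_equal(h, phash(img * 0.8))
```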
If we run the perceptual hash function on the 10 altered images, various small differences in the image hashes are detected (Figure 5). In the case of a 50% increase in brightness, the image hash deviates even more. On average, there are 4 changes in the image hash across all combinations.
Difference hash.
After the decolorizing and scaling steps, the pixels are serially compared (from left to right, per row) to their neighbor to the right. If the pixel value at position x is less than the pixel value at position (x+1), it gets the value 1, otherwise 0. The final image hash is obtained by flattening the binary array into a vector.
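A minimal sketch of this comparison (my own illustration), assuming the image is already scaled to 8 rows by 9 columns so that the row-wise comparisons yield 64 bits:

```python
import numpy as np

def dhash(gray):
    """Difference hash: bit is 1 where a pixel value is smaller than
    its right-hand neighbor (assumes an 8 x 9 grayscale array)."""
    return (gray[:, :-1] < gray[:, 1:]).astype(np.uint8).ravel()

# Rows that increase from left to right give all ones
img = np.tile(np.arange(9), (8, 1))
h = dhash(img)
print(h.sum(), h.size)  # → 64 64
```

Because only the gradient direction between neighboring pixels matters, uniform changes in brightness or contrast leave these comparisons, and thus the hash, largely intact.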
If we run the difference hash function on the 10 altered images, the image hash shows some small variations (Figure 7). The 50% increase in brightness showed the largest differences. On average, there are 3.8 changes in the image hash across all combinations.
Haar wavelet hash.
After the decolorizing and scaling steps, a two-dimensional Haar wavelet transform is applied to the image. Each resulting coefficient is then compared to the median of all coefficients. If the value is larger than the median, it gets the value 1, otherwise 0. The final image hash is obtained by flattening the binary array into a vector.
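A simplified sketch of the idea: repeatedly averaging 2 x 2 blocks yields, up to a constant factor, the Haar approximation coefficients, which are then thresholded on their median. This is my own illustration; the real imagehash whash uses a full wavelet decomposition via PyWavelets and differs in details.

```python
import numpy as np

def whash(gray, hash_size=8):
    """Simplified Haar wavelet hash of a square grayscale array whose
    side length is hash_size times a power of two."""
    a = gray.astype(float)
    # Each pass halves the resolution; averaging 2 x 2 blocks approximates
    # the Haar low-pass (approximation) coefficients up to scaling
    while a.shape[0] > hash_size:
        a = (a[0::2, 0::2] + a[0::2, 1::2] +
             a[1::2, 0::2] + a[1::2, 1::2]) / 4.0
    return (a > np.median(a)).astype(np.uint8).ravel()

# Toy image: dark left half, bright right half
img = np.zeros((64, 64))
img[:, 32:] = 255
print(whash(img).sum())  # → 32
```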
If we compare the image hashes of the 10 altered images, we see that the Haar wavelet hash is quite stable, despite some minor changes across the altered images. On average, there are 2.5 changes in the image hash across all combinations.
Finding identical images in a real-world dataset.
To demonstrate the performance of the various hash functions on a real-world dataset, I used the Caltech 101 dataset [2] and saved it to my local disk. The detection of duplicates is examined for the average, perceptual, difference, and wavelet hash. The Caltech dataset contains 9144 real-world images belonging to 101 categories, with about 40 to 800 images per category. The size of each image is roughly 300 x 200 pixels. The input for the undouble library can simply be the directory where all images are stored; all subdirectories are recursively analyzed. Note that this dataset does not contain ground-truth labels of identical images. Therefore, I will visually examine the results of the groupings and describe them for each of the hash functions.
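As a sketch of how such a run looks, the snippet below follows the usage pattern from the undouble documentation; the directory path is a placeholder, and method or parameter names may differ between library versions, so check your installed version.

```python
# Sketch of the duplicate-detection pipeline with undouble, following the
# pattern in the library's documentation (names may vary across versions).
from undouble import Undouble

# Initialize with the perceptual hash and a 64-bit hash (8 x 8)
model = Undouble(method='phash', hash_size=8)

# Recursively collect images from the target directory (placeholder path)
model.import_data(r'./caltech101/')

# Compute the image hash for every image
model.compute_hash()

# Group images with identical hashes (threshold=0)
model.group(threshold=0)

# Inspect the groups; the move functionality relocates duplicates and
# copies the highest-resolution image of each group
model.plot()
model.move()
```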
The average hash function detected 135 groups that could be linked to 335 images with an identical hash (threshold=0), based on a hash size of 8 (64-bit). Although identical images were detected, most of the groups showed collisions, such as the top and bottom left, and/or near-identical images, such as the motorbikes (Figure 11). By increasing the hash size to 16 (256-bit), 28 groups covering 64 images were detected. No collisions were present, but near-identical images, such as the motorbikes, were still detected.
The wavelet hash function detected 141 groups that could be linked to 513 images with an identical hash (threshold=0), based on a hash size of 8 (64-bit). A visual inspection showed that almost all groups contained either collisions or near-identical images (Figure 12). Who would have known that a strawberry could have a similar image hash to a motorbike? By increasing the hash size to 16 (256-bit), 25 groups covering 51 images were detected. No collisions were present, but near-identical images, such as the motorbikes, were still detected.
The difference hash function detected 28 groups that could be linked to 31 images with an identical hash (threshold=0). A visual inspection showed no collisions, but near-identical images (two motorbikes) were detected. By increasing the hash size to 16 (256-bit), 8 groups covering 16 images were detected, with no collisions and no near-identical images; all grouped images are visually similar.
The perceptual hash function detected 38 groups that could be linked to 41 images with an identical hash (threshold=0). A visual inspection showed no collisions, but near-identical images, such as the motorbikes, were detected (Figure 14). By increasing the hash size to 16 (256-bit), 10 groups covering 20 images were detected, with no collisions and no near-identical images; all grouped images are visually similar.
Conclusions.
In the cat and dog experiment, the results were quite stable for each of the hash functions. However, on the real-world dataset, it becomes clear that a hash size of 8 (64-bit) results in many collisions for both the average hash and the wavelet hash; in addition, many near-identical images are grouped. The results of the difference hash were accurate but also the most conservative, while the perceptual hash showed accurate results and is less conservative. When the hash size is increased to 16 (256-bit), none of the hash functions resulted in collisions, but near-identical images were still detected by the average hash and the wavelet hash. Both the perceptual hash and the difference hash showed no collisions nor near-identical images. Here again, the perceptual hash is less conservative.
Perceptual hash is most accurate for the detection of duplicate images.
Furthermore, the results may also depend on the type of input images. When images have a solid background, as for the motorbikes and traffic signs, collisions can happen more often. One of the reasons is that the binary pixel information that characterizes the image and forms the image hash becomes less unique. As an example, images with a solid background that contain any darker object in the center can easily get an identical image hash using the average hash, and also the wavelet hash.
Hash Functions or Clustering?
Hash functions and clustering approaches both aim to group similar images, but there is a difference. A hash function creates a "fingerprint" or "signature" of the image: the image hash. Two identical image hashes should represent two identical images. In the case of clustering, features are extracted, distance metrics and linkage types are chosen, and finally, the images are clustered. Taking different steps and/or methods will yield different groupings, since each choice implicitly imposes a structure on the data, and thus on the partitioning of the samples. In other words, there are more degrees of freedom in defining "similar" images.
A hash function aims to create a unique signature that can be used to detect identical images whereas a clustering approach aims to detect groups of similar images.
To compare the results derived by hash functions vs. clustering, I used the same 101 objects dataset and performed a clustering approach. The clustimage library [5] is used to cluster the images, for which simply the path location was given as input; all images in subdirectories are recursively collected. With the default settings and the PCA method, an optimum of 63 clusters is detected (Figure 15), for which the centroid image of each cluster is shown. Although some clusters contain images with high similarity (Figure 15D), the images are not necessarily identical or near-identical. Read this blog [4] in case you want to learn more about clustering images.
Final words.
I touched on the concepts of hash functions for the detection of (near-)identical images. The perceptual hash function is recommended for the detection of duplicate images. With the undouble library, it is relatively straightforward to detect (near-)identical images across an entire system or directory. It pipelines the process of pre-processing the images (grayscaling, normalizing, and scaling), computing the image hash, and grouping images based on image-hash differences (Figure 10). A threshold of 0 groups images with an identical image hash. However, a threshold of 10 showed the best results when undoubling my personal photo deck, because photos from, for example, bursts were also grouped. The results can easily be explored with the plot functionality, and images can be undoubled with the move functionality. When moving images, the image in the group with the highest resolution is copied, and all other images are moved to the "undouble" subdirectory. In case you need more flexibility and freedom in grouping images, I recommend reading this blog about clustering images [4], which describes the usage of the clustimage library [5].
Be Safe. Stay Frosty.
Cheers E.
References
- [1] Taskesen, E. (2022). undouble — Python library to detect (near-)identical images. [Computer software].
- [2] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
- [3] Buchner, J. imagehash — image hashing library for Python. https://github.com/JohannesBuchner/imagehash
- [4] Taskesen, E. A step-by-step guide for clustering images. Towards Data Science, 2021. https://towardsdatascience.com/a-step-by-step-guide-for-clustering-images-4b45f9906128
- [5] Taskesen, E. (2021). clustimage — Python package for unsupervised clustering of images. [Computer software].