
Finding Duplicate Images with Python

Automate the search for duplicate images on your computer

Image by KissCC0.

Have you ever gone through hundreds, maybe even thousands, of images, only to realize that some look a bit "too similar"? Could they be duplicates? You then probably compared the resolutions of both images and deleted the one with the lower resolution.

I have found myself in this scenario quite a few times. It’s painful, time-consuming and tiring, especially if you have to go through thousands of images. Furthermore, some images are easy to recognize as duplicates at first glance, while others need careful checking, and you may end up deleting images that were in fact not duplicates.

That’s when I thought: let’s automate this process.

In today’s article, I will go through the process of writing a Python 3.8 script that automatically searches for duplicate images in a folder on your local computer.

View the difPy project on GitHub and on PyPI. 🎉

1 | In Theory: Translating Images to Numeric Data

In order to compare images to one another, we need to somehow translate them into comparable, computer-readable, i.e. numeric, data. Those of you who are familiar with machine learning and computer vision may already know that images can be translated into matrices, or more precisely into tensors. A tensor is a container which can hold data in N dimensions. A matrix is therefore a 2-dimensional tensor.

Image by Mukesh Mithrakumar from Dev.to

Every pixel in a color image can be represented by a combination of red, green and blue values. Together, these three values define the color of that pixel.

If we have one matrix for red, one for green and one for blue for all pixels in our image, we can layer the 3 matrices over each other to then end up with: a tensor. 🎉

Image by Berton Earnshaw from Slideshare.net

This is awesome, because it means we can represent our images by numbers. This should already ring a bell: numbers can easily be compared with each other. If one image has exactly the same tensor as another image, we can conclude: these are duplicates!
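To make this concrete, here is a minimal sketch using NumPy. The tiny 2 × 2 channel matrices are made-up values for illustration only; a real image simply has more rows and columns.

```python
import numpy as np

# Three hypothetical 2x2 channel matrices (values 0-255), one per color.
red = np.array([[255, 0], [0, 255]], dtype=np.uint8)
green = np.array([[0, 255], [0, 255]], dtype=np.uint8)
blue = np.array([[0, 0], [255, 255]], dtype=np.uint8)

# Layer the matrices along a new last axis: height x width x 3.
image = np.stack([red, green, blue], axis=-1)
print(image.shape)  # (2, 2, 3)

# An exact copy has the same tensor; changing a single value breaks equality.
copy = image.copy()
altered = image.copy()
altered[0, 0, 0] = 254

print(np.array_equal(image, copy))     # True
print(np.array_equal(image, altered))  # False
```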

Great, now let’s move on and look at how we can make the theory work in practice.

2 | In Practice: Translating Images to Numeric Data

Let’s first of all import the Python libraries we will be using for this script: skimage, matplotlib, numpy, OpenCV (cv2), os and imghdr.

We will start by writing our function create_imgs_matrix. A detailed explanation of what it does follows below.

This function will create a tensor for each image our algorithm finds in a specific folder. We read the directory, iterate over all the files and make sure that the files in our folder are actually images (e.g. JPG, PNG, etc.). There is a very handy library called imghdr which can help us in this process.

We first check that the entry is a file and not a subfolder, i.e. that the following returns False:

os.path.isdir(directory + filename)

and then we check whether it is an image:

imghdr.what(directory + filename)

If the file is not an image, imghdr.what returns None and we move on to the next file in the directory. If both checks pass, we move on to the next line, where we use the OpenCV library to decode our image file and convert it to a tensor:
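The two checks can be sketched as follows. Note that imghdr was deprecated in Python 3.11 and removed in Python 3.13, so this sketch swaps in a simple check of the file's leading bytes instead of calling imghdr.what. The helper names looks_like_image and list_image_files, and the short list of signatures, are assumptions made for illustration and are not part of the original script.

```python
import os

# Leading bytes ("magic numbers") of a few common image formats.
IMAGE_SIGNATURES = (
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\xff\xd8\xff",       # JPEG
    b"GIF87a",             # GIF
    b"GIF89a",             # GIF
)


def looks_like_image(path):
    """Return True if the file starts with one of the known signatures."""
    with open(path, "rb") as f:
        header = f.read(8)
    return header.startswith(IMAGE_SIGNATURES)


def list_image_files(directory):
    """Return the names of all entries in `directory` that pass both checks."""
    image_files = []
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if os.path.isdir(path):         # check 1: skip subfolders
            continue
        if not looks_like_image(path):  # check 2: skip non-image files
            continue
        image_files.append(filename)
    return image_files
```

Using os.path.join rather than directory + filename avoids having to remember the trailing slash on the folder path.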

img = cv2.imdecode(np.fromfile(directory + filename, dtype=np.uint8), cv2.IMREAD_UNCHANGED)

We then check whether the file has been successfully converted to a numpy n-dimensional array, and make sure our tensor has at most 3 layers (color channels).

In our last step, we resize our image tensor to a pixel size of 50, i.e. we want it to have a width and height of 50 pixels. We do this to speed up the comparison of our images. High-resolution images have very large tensors, which results in longer computation times for the one-to-one comparison with other image tensors.

Finally, we add the resulting resized tensor to our imgs_matrix list, and go on with the next file in our directory (if present).
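The last three steps (array check, channel limit, resize) can be sketched without OpenCV. This is an illustration under stated assumptions, not the original function: the helper name prepare_tensor is hypothetical, grayscale handling is an added assumption, and the resize uses plain nearest-neighbor sampling in NumPy where the script would call OpenCV's resize.

```python
import numpy as np


def prepare_tensor(img, px_size=50):
    """Return a px_size x px_size x 3 array, or None if img is not an array."""
    if not isinstance(img, np.ndarray):
        return None  # decoding failed, nothing to add
    if img.ndim == 2:
        # Grayscale input: repeat the single channel three times (assumption).
        img = np.stack([img] * 3, axis=-1)
    img = img[..., :3]  # keep at most 3 channels, dropping e.g. alpha

    # Nearest-neighbor resize: pick px_size evenly spaced rows and columns.
    rows = np.arange(px_size) * img.shape[0] // px_size
    cols = np.arange(px_size) * img.shape[1] // px_size
    return img[rows][:, cols]


# Collect the tensors of all inputs that could be decoded.
imgs_matrix = []
for raw in (np.zeros((400, 600, 4), dtype=np.uint8), None):
    tensor = prepare_tensor(raw)
    if tensor is not None:
        imgs_matrix.append(tensor)

print(len(imgs_matrix), imgs_matrix[0].shape)  # 1 (50, 50, 3)
```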

Cool – after we have written this function, we are left with a list of tensors for all images in our directory. Now, how are we going to compare these to each other? For this we will make use of a metric called MSE.

3 | MSE (Mean Squared Error) for Image Similarity

The MSE or Mean Squared Error is frequently used in the field of statistics. For our use case, we will use it to compare how similar two images are to each other.

The Mean Squared Error between two images is the average of the squared differences between their corresponding pixel values. The lower the error, the more "similar" the images are.

The MSE is calculated as follows:

Image from Wikipedia
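Assuming the figure shows the usual pixel-wise definition, the MSE of two images I and K, each m pixels high and n pixels wide, can be written as follows. The symbols I, K, m and n are chosen here for illustration; for color images the squared differences of all three channels are added up as well.

```latex
\mathrm{MSE} = \frac{1}{m\,n} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \bigl[ I(i,j) - K(i,j) \bigr]^2
```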

I will not go into much detail on what this formula does – instead let me give you the equivalent of what this looks like in Python code:
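A minimal sketch of such a function, assuming both inputs are NumPy arrays of the same shape. Dividing by height × width (rather than by every array element) is an assumption of this sketch; it changes the scale of the result, not the ranking of image pairs.

```python
import numpy as np


def mse(imageA, imageB):
    """Mean squared error between two image arrays of identical shape."""
    # Cast to float first: subtracting uint8 arrays directly would wrap around.
    diff = imageA.astype("float") - imageB.astype("float")
    err = np.sum(diff ** 2)
    # Normalize by the number of pixels (height x width).
    err /= float(imageA.shape[0] * imageA.shape[1])
    return err


a = np.zeros((50, 50, 3), dtype=np.uint8)
b = a.copy()
b[0, 0, 0] = 10  # change a single value by 10

print(mse(a, a))  # 0.0
print(mse(a, b))  # 0.04, i.e. 10**2 / (50 * 50)
```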

This function calculates the MSE between imageA and imageB, and returns a floating point value which we can interpret as a measure of how much the two images differ. The lower this number, the more similar the images are.

4 | Last Tips & Considerations for Finding Duplicates

Before moving on to our last chapter, Putting it all Together, let me share some final tips and considerations based on what I learned during this project.

A. What target MSE to choose?

Since we are looking for duplicate images, and not only for similar images, can we assume that an MSE of 0 is our target for being confident the images are duplicates?

In general, yes. But remember that we resized our images to 50 × 50 pixels. Images with a different original resolution or size can therefore end up with slightly different tensors. While writing this function, I saw that some images that were indeed duplicates did not have an MSE of exactly 0. It is therefore better practice to set the MSE threshold a bit higher, so that we capture all duplicates even if their tensors are not exactly identical.

For this use case, I chose an MSE of 200 to be the maximum threshold. As soon as two images have an MSE lower than 200, we consider them as being duplicates.
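As a rough illustration of how the threshold behaves, the sketch below compares a flat gray test image with a copy whose values are all off by 1, and with a completely different image. The helper name is_duplicate and the synthetic images are assumptions for demonstration; the mse helper normalizes by height × width, which is also an assumption.

```python
import numpy as np

MSE_THRESHOLD = 200  # the maximum MSE chosen in this article


def mse(imageA, imageB):
    """Sum of squared differences divided by the number of pixels."""
    diff = imageA.astype("float") - imageB.astype("float")
    return np.sum(diff ** 2) / float(imageA.shape[0] * imageA.shape[1])


def is_duplicate(imageA, imageB, threshold=MSE_THRESHOLD):
    """Treat two images as duplicates if their MSE is below the threshold."""
    return bool(mse(imageA, imageB) < threshold)


base = np.full((50, 50, 3), 100, dtype=np.uint8)
slightly_off = base + 1  # every value differs by 1, so the MSE is 3.0
very_different = np.full((50, 50, 3), 255, dtype=np.uint8)

print(mse(base, slightly_off))             # 3.0
print(is_duplicate(base, slightly_off))    # True
print(is_duplicate(base, very_different))  # False
```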

B. What’s up with those rotated images?

This is something I realized only after writing the first version of my function. The tensor of an image looks different depending on how the image is rotated. We therefore need to compare our first image to all possible rotations of the second image.

Image from KissCC0 modified by the author

The numpy library easily lets us perform this with the help of the built-in rot90 function, which rotates matrices by 90 degrees.
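A minimal sketch of the rotation check, assuming square image tensors (such as our 50 × 50 ones) so that every rotation keeps the same shape. The helper name min_mse_over_rotations and the random test image are assumptions for illustration.

```python
import numpy as np


def mse(imageA, imageB):
    """Sum of squared differences divided by the number of pixels."""
    diff = imageA.astype("float") - imageB.astype("float")
    return np.sum(diff ** 2) / float(imageA.shape[0] * imageA.shape[1])


def min_mse_over_rotations(imageA, imageB):
    """Smallest MSE between imageA and imageB rotated by 0, 90, 180, 270 degrees."""
    # np.rot90(array, k) rotates the first two axes by k * 90 degrees.
    return min(mse(imageA, np.rot90(imageB, k)) for k in range(4))


rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(50, 50, 3), dtype=np.uint8)
rotated = np.rot90(original)  # the same picture, turned by 90 degrees

print(mse(original, rotated) > 200)               # True: looks like a different image
print(min_mse_over_rotations(original, rotated))  # 0.0: recognized as a duplicate
```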

C. Comparing the image resolutions

Apart from finding duplicates, we also want our algorithm to help us decide which ones we can delete. We therefore want to compare the original resolution of the images and let our algorithm output the name of the file with the lower one.

We will approximate this with the st_size attribute returned by os.stat, which gives the file size in bytes. The file with the smaller size will be added to a separate list.
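A minimal sketch of this comparison. The function name lower_quality_file is hypothetical, and returning the second file when both sizes are equal is an arbitrary choice made here.

```python
import os


def lower_quality_file(path_a, path_b):
    """Return the path of the smaller file, measured in bytes."""
    size_a = os.stat(path_a).st_size
    size_b = os.stat(path_b).st_size
    # On a tie, path_b is returned (arbitrary choice for this sketch).
    return path_a if size_a < size_b else path_b
```

Calling lower_quality_file on each duplicate pair and collecting the results yields the list of files that are candidates for deletion.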

5 | Putting it all Together

First of all, congrats on sticking with me and making it this far. We have arrived at the final chapter, where we put it all together!

To summarize our steps:

  1. Compute image tensors of all images in a folder.
  2. Go through all image tensors one by one and compute their MSE. During this process we rotate our images in 90-degree steps, so we also find duplicates that did not have the same initial orientation.
  3. If the MSE of two images is below 200, classify them as duplicates.
  4. Check the file size of the original two files. The one having the lower size will be added to a list of images that can be deleted.
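The steps above can be sketched as one function. This is a simplified illustration and not the full DIF script: it assumes step 1 is already done (the tensors are passed in as a dictionary), that the tensors are square, and that the file sizes are supplied by the caller. The name find_duplicates and the synthetic test data are assumptions.

```python
from itertools import combinations

import numpy as np


def mse(imageA, imageB):
    """Sum of squared differences divided by the number of pixels."""
    diff = imageA.astype("float") - imageB.astype("float")
    return np.sum(diff ** 2) / float(imageA.shape[0] * imageA.shape[1])


def find_duplicates(tensors, sizes, threshold=200):
    """Return (duplicate_pairs, lower_quality) for a set of image tensors.

    tensors: dict mapping filename -> square image array (result of step 1)
    sizes:   dict mapping filename -> file size in bytes
    """
    duplicate_pairs = []
    lower_quality = set()
    for name_a, name_b in combinations(tensors, 2):
        # Step 2: compare against all four 90-degree rotations.
        best = min(
            mse(tensors[name_a], np.rot90(tensors[name_b], k)) for k in range(4)
        )
        # Step 3: classify as duplicates if the MSE is below the threshold.
        if best < threshold:
            duplicate_pairs.append((name_a, name_b))
            # Step 4: remember the file with the smaller size.
            smaller = name_a if sizes[name_a] < sizes[name_b] else name_b
            lower_quality.add(smaller)
    return duplicate_pairs, sorted(lower_quality)


rng = np.random.default_rng(1)
cat = rng.integers(0, 256, size=(50, 50, 3), dtype=np.uint8)
dog = rng.integers(0, 256, size=(50, 50, 3), dtype=np.uint8)
tensors = {"cat.jpg": cat, "cat_rotated.jpg": np.rot90(cat), "dog.jpg": dog}
sizes = {"cat.jpg": 2000, "cat_rotated.jpg": 1500, "dog.jpg": 1800}

pairs, deletable = find_duplicates(tensors, sizes)
print(pairs)      # [('cat.jpg', 'cat_rotated.jpg')]
print(deletable)  # ['cat_rotated.jpg']
```

Comparing every pair of images scales quadratically with the number of files, which is one more reason the small 50 × 50 tensors pay off.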

Instead of pasting the full code here, I will share with you the link to my GitHub repository where I have uploaded the full Duplicate Image finder (DIF) script.

elisemercury/Duplicate-Image-Finder

* | Update October 2021

The DIF is now also available as the Python package difPy, which you can install via pip.

difPy

You can now just run pip install difPy; see the difPy project page for how to use the library.


I hope this article has been informative for you, and that I could help you in saving some valuable hours of image comparison and deduplication! At least for me personally, it did. 😜

Feel free to ping me if you have any questions or feedback, or post a comment below to let me know what you think.


References:

[1] A. Rosebrock, How-To: Python Compare Two Images (2014)

[2] J. Brownlee, A Gentle Introduction to Tensors for Machine Learning with NumPy (2018)

