How to Create a Duplicate Image Detection System

Learn how to use perceptual hash functions and nearest neighbor searches to detect duplicate images

Matt Podolak
Towards Data Science

--

Photo by USGS on Unsplash

Table of Contents

  1. Motivation
  2. Implementation Details
  3. Building the System
  4. Testing
  5. Future Improvements
  6. References

Motivation

You might be wondering, “what’s the point of making a duplicate image detection system?”, well there are a few reasons why you might want to. These reasons include preventing copyright infringement to removing illegal or unwanted images from a web application [1]. Identifying illegal images has been a hot topic recently, with Apple announcing its plan to roll out a feature to detect CSAM in iOS 15 [2].

Apple CSAM Detection by Author, inspired by Apple

While Apple’s CSAM detection feature has now been delayed due to privacy backlash, there is still a lot for us to learn from the technology behind it[3]. The CSAM detection system developed by Apple uses a perceptual hashing function called NeuralHash to generate unique numerical fingerprints (hashes) for each image. Once encrypted, private set intersection (PSI) is used to determine if the image hash matches against a hash database of known CSAM while maintaining privacy[4]. It’s worth noting that this is an oversimplified description, if you’re interested in diving deep into the details, I’d recommend reading Apple’s technical summary.

Implementation Details

Duplicate image detection system (Image by author)

The system we will be building in this article will use a perceptual hashing function, similar to Apple’s CSAM Detection. However, instead of generating image hashes using NeuralHash, we will be using difference hash (dHash), which is simpler and less computationally intensive as it doesn’t require neural networks. Since we don’t have the same privacy constraints as Apple, we will be using nearest neighbor searches to identify duplicate images.

DIFFERENCE HASH

dHash is a perceptual hashing function that produces hash values that are resilient to image scaling, as well as changes in color, brightness, and aspect ratio [5]. There are four main steps for creating a difference hash for an image:

dHash procedure (Image by author)
  1. Convert to greyscale*
  2. Resize image to (hash_size+1, hash_size)
  3. Calculate horizontal gradient, reducing image size to (hash_size, hash_size)
  4. Assign bits based on horizontal gradient values

*We convert the image to greyscale before resizing for optimal performance

NEAREST NEIGHBORS

Image hashes that we want to check for duplicates against will be stored in a binary index for fast and efficient nearest neighbor searches. We will use Hamming distance as a metric to determine the similarity between image hashes, for dHash, distances less than 10 (96.09% similarity) likely indicate similar/duplicate images [5].

Building the System

In the rest of the article, we are going to build a simple RESTful API service using Python, that allows users to add image hashes to an image database and perform image lookups to see if they’re duplicates.

If you don’t have time to go through the rest of this article, you can find a copy of the code on GitHub, with a few changes to make it easier to expand upon.

GETTING STARTED

We are going to use fastapi and uvicornto create the API service, and to accept images uploads we need python-multipart.

Start by installing the required packages: pip install fastapi python-multipart uvicorn[standard]

Once installed, create the project repository with the following folder structure:

Folder structure

In app/api.py create an instance of FastAPI and assign it to app

In the main.py file import app and run it with uvicorn, now we can run the service with python main.py and connect to it at http://localhost:8080

API ROUTES

There are two main API endpoints that we need to define, one to add images to the database, and another to lookup images in the database.

In the app/api.py file we update the imports and add two routes:

  • Both routes accept an image through the file parameter, and the /api/check route accepts a dist parameter that will be used to specify the Hamming distance threshold when searching for duplicate images

HASHING

We are going to use dHashfrom the imagehash library, Pillow to load the images, and numpy to convert hashes into binary hash vectors.

First, install the required packages for hashing: pip install Pillow imagehash numpy

In app/utils.py we create a hash utility function using dHash with hash_size=16, the resulting hash has 256-bits, reducing the chance of a random collision:

  • To use the resulting hash in our binary index, we need to convert the array to a 1x32 vector of uint8 values, to do this we use numpy to resize our array and pack the binary values to uint8

Now we can use this utility function in app/api.pyto hash the images received by each endpoint:

INDEXING

Index creation and searching will be done using faiss, a library developed by Facebook for efficient similarity search.

Install the required packages: pip install faiss-cpu pytorch

To start using faiss we will create a couple utility functions in app/utils.pyto load/create the index and save it to disk

  • We are using the IndexBinaryFlat index, it may not be as fast as the other Binaryindices as it performs an exhaustive search, but this search has been optimized for 256-bit vectors.
  • IndexBinaryFlat utilizes Hamming distance as the vector similarity metric since it is one of the faiss binary indices.

Import the newly defined utility methods, and update the add_image handler to load the index. We use the index add method to insert the hash vector, afterward saving the updated index to the disk for future use:

Now we can update check_image to perform a nearest neighbors search against our index using range_search. This method searches for vectors within a Hamming distance of dist and returns a tuple

  • lims is an array of start and end indices for parsing D and I
  • D is an array of distances for the nearest neighbors
  • I is an array of the integer ids for the nearest neighbors

Testing

Now that both API endpoints are complete, we will test each endpoint using Postman.

ADD IMAGE

  • Add a file KEY and change the type to File
  • Select an image as the VALUE for file and send a request, you should get the following response:
{"msg": "added image"}

CHECK IMAGE

  • To test the duplicate image check endpoint, change the URL to http://localhost/api/check
  • Add a dist KEY with a VALUE of 10 and send the request
  • Keeping the same image for the file value, you should get the following response:
{"duplicate": true}

Now try changing the image to see what response you get!

After reaching this point in the article, you should now have a working duplicate image detection system using nearest neighbors search and dHash. If you encountered any issues with the code in this article, please open an issue on the GitHub, otherwise feel free to leave a star! To continue adding to your service, consider some of the future improvements mentioned in the next section.

Future Improvements

  • Error handling and input validation
  • Authentication
  • Front end user interface

In addition to the improvements mentioned above, I think it could be worthwhile to evaluate speed, memory usage, and accuracy for your image dataset using different hashing functions and binary indices.

References

[1] N. Dolhansky, C. Canton Ferrer, Adversarial collision attacks on image hashing functions (2020), Computing Research Repository

[2] Z. Whittaker, Apple confirms it will begin scanning iCloud Photos for child abuse images (2021), TechCrunch

[3] Z. Whittaker, Apple delays plans to roll out CSAM detection in iOS 15 after privacy backlash (2021), TechCrunch

[4] Apple Inc., CSAM Detection Technical Summary (2021), Apple

[5] N. Krawetz, Kind of Like That (2013), Hacker Factor Blog

--

--