
Table of Contents
Motivation
You might be wondering, "what’s the point of making a duplicate image detection system?", well there are a few reasons why you might want to. These reasons include preventing copyright infringement to removing illegal or unwanted images from a web application [1]. Identifying illegal images has been a hot topic recently, with Apple announcing its plan to roll out a feature to detect CSAM in iOS 15 [2].

While Apple’s CSAM detection feature has now been delayed due to privacy backlash, there is still a lot for us to learn from the technology behind it[3]. The CSAM detection system developed by Apple uses a perceptual hashing function called NeuralHash to generate unique numerical fingerprints (hashes) for each image. Once encrypted, private set intersection (PSI) is used to determine if the image hash matches against a hash database of known CSAM while maintaining privacy[4]. It’s worth noting that this is an oversimplified description, if you’re interested in diving deep into the details, I’d recommend reading Apple’s technical summary.
Implementation Details

The system we will be building in this article will use a perceptual hashing function, similar to Apple’s CSAM Detection. However, instead of generating image hashes using NeuralHash, we will be using difference hash (dHash), which is simpler and less computationally intensive as it doesn’t require neural networks. Since we don’t have the same privacy constraints as Apple, we will be using nearest neighbor searches to identify duplicate images.
DIFFERENCE HASH
dHash is a perceptual hashing function that produces hash values that are resilient to image scaling, as well as changes in color, brightness, and aspect ratio [5]. There are four main steps for creating a difference hash for an image:

- Convert to greyscale*
- Resize image to (hash_size+1, hash_size)
- Calculate horizontal gradient, reducing image size to (hash_size, hash_size)
- Assign bits based on horizontal gradient values
*We convert the image to greyscale before resizing for optimal performance
NEAREST NEIGHBORS
Image hashes that we want to check for duplicates against will be stored in a binary index for fast and efficient nearest neighbor searches. We will use Hamming distance as a metric to determine the similarity between image hashes, for dHash, distances less than 10 (96.09% similarity) likely indicate similar/duplicate images [5].
Building the System
In the rest of the article, we are going to build a simple RESTful API service using Python, that allows users to add image hashes to an image database and perform image lookups to see if they’re duplicates.
If you don’t have time to go through the rest of this article, you can find a copy of the code on GitHub, with a few changes to make it easier to expand upon.
GETTING STARTED
We are going to use fastapi
and uvicorn
to create the API service, and to accept images uploads we need python-multipart
.
Start by installing the required packages: pip install fastapi python-multipart uvicorn[standard]
Once installed, create the project repository with the following folder structure:

In app/api.py
create an instance of FastAPI
and assign it to app
In the main.py
file import app
and run it with uvicorn
, now we can run the service with python main.py
and connect to it at http://localhost:8080
API ROUTES
There are two main API endpoints that we need to define, one to add images to the database, and another to lookup images in the database.
In the app/api.py
file we update the imports and add two routes:
- Both routes accept an image through the
file
parameter, and the/api/check
route accepts adist
parameter that will be used to specify the Hamming distance threshold when searching for duplicate images
HASHING
We are going to use dHash
from the imagehash
library, Pillow
to load the images, and numpy
to convert hashes into binary hash vectors.
First, install the required packages for hashing: pip install Pillow imagehash numpy
In app/utils.py
we create a hash utility function using dHash
with hash_size=16
, the resulting hash has 256-bits, reducing the chance of a random collision:
- To use the resulting hash in our binary index, we need to convert the array to a 1×32 vector of
uint8
values, to do this we usenumpy
to resize our array and pack the binary values touint8
Now we can use this utility function in app/api.py
to hash the images received by each endpoint:
INDEXING
Index creation and searching will be done using faiss
, a library developed by Facebook for efficient similarity search.
Install the required packages: pip install faiss-cpu pytorch
To start using faiss
we will create a couple utility functions in app/utils.py
to load/create the index and save it to disk
- We are using the
IndexBinaryFlat
index, it may not be as fast as the otherBinary
indices as it performs an exhaustive search, but this search has been optimized for 256-bit vectors. IndexBinaryFlat
utilizes Hamming distance as the vector similarity metric since it is one of thefaiss
binary indices.
Import the newly defined utility methods, and update the add_image
handler to load the index. We use the index add
method to insert the hash vector, afterward saving the updated index to the disk for future use:
Now we can update check_image
to perform a Nearest Neighbors search against our index using range_search
. This method searches for vectors within a Hamming distance of dist
and returns a tuple
lims
is an array of start and end indices for parsingD
andI
D
is an array of distances for the nearest neighborsI
is an array of the integer ids for the nearest neighbors
Testing
Now that both API endpoints are complete, we will test each endpoint using Postman.
ADD IMAGE
- Create a new POST request with the following URL:
[http://localhost:8080/api/add/image](http://localhost:8080/api/add/image)
- Change the Body type to
form-data

- Add a
file
KEY and change the type toFile
- Select an image as the VALUE for
file
and send a request, you should get the following response:
{
"msg": "added image"
}
CHECK IMAGE
- To test the duplicate image check endpoint, change the URL to
[http://localhost/api/check](http://localhost/api/check)
- Add a
dist
KEY with a VALUE of10
and send the request - Keeping the same image for the
file
value, you should get the following response:
{
"duplicate": true
}
Now try changing the image to see what response you get!
After reaching this point in the article, you should now have a working duplicate image detection system using nearest neighbors search and dHash. If you encountered any issues with the code in this article, please open an issue on the GitHub, otherwise feel free to leave a star! To continue adding to your service, consider some of the future improvements mentioned in the next section.
Future Improvements
- Error handling and input validation
- Authentication
- Front end user interface
In addition to the improvements mentioned above, I think it could be worthwhile to evaluate speed, memory usage, and accuracy for your image dataset using different hashing functions and binary indices.
References
[1] N. Dolhansky, C. Canton Ferrer, Adversarial collision attacks on image hashing functions (2020), Computing Research Repository
[2] Z. Whittaker, Apple confirms it will begin scanning iCloud Photos for child abuse images (2021), TechCrunch
[3] Z. Whittaker, Apple delays plans to roll out CSAM detection in iOS 15 after privacy backlash (2021), TechCrunch
[4] Apple Inc., CSAM Detection Technical Summary (2021), Apple
[5] N. Krawetz, Kind of Like That (2013), Hacker Factor Blog