Unlocking the Power of Facial Blurring in Media: A Comprehensive Exploration and Model Comparison

Comparison of various face detection and blurring algorithms

Danilo Najkov
Towards Data Science


Processed photo by OSPAN ALI on Unsplash

In today’s data-driven world, ensuring the privacy and anonymity of individuals is of paramount importance. From protecting personal identities to complying with stringent regulations like GDPR, the need for efficient and reliable solutions to anonymize faces in various media formats has never been greater.

Contents

  • Introduction
  • Face Detection
    - Haar Cascade
    - MTCNN
    - YOLO
  • Face Blurring
    - Gaussian Blur
    - Pixelization
  • Results and Discussion
    - Real-Time performance
    - Scenario-based evaluation
    - Privacy
  • Usage in videos
  • Web Application
  • Conclusion

Introduction

In this project, we explore and compare several face-blurring solutions and develop a web application that makes them easy to evaluate. Let’s explore the diverse applications driving the demand for such a system:

  • Preserving Privacy
  • Navigating Regulatory Landscapes: With the regulatory landscape evolving rapidly, industries and regions worldwide are enforcing stricter norms to safeguard individuals’ identities.
  • Training Data Confidentiality: Machine learning models thrive on diverse and well-prepared training data. However, sharing such data often requires careful anonymization.

This solution can be distilled into two essential components:

  • Face Detection
  • Face Blurring Techniques

Face detection

To address the anonymization challenge, the first step is to locate the area of the image where a face is present. For this purpose, I tested three face detection models.

Haar Cascade

Figure 1. Haar-like features (source — original paper)

Haar Cascade is a machine learning method used for object detection, such as faces, in images or videos. It operates by utilizing a set of trained features called ‘Haar-like features’ (Figure 1), which are simple rectangular filters that respond to variations in pixel intensity within regions of the image. These features can capture edges, lines, and other characteristics commonly found in faces.
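In practice, these rectangular filters are cheap to evaluate because the detector precomputes an integral image, where the pixel sum over any rectangle costs only four lookups. Purely as an illustration (the file path and feature coordinates below are made up), a minimal sketch of evaluating a two-rectangle feature:

import cv2

# Integral image: each entry holds the sum of all pixels above and to the
# left of it, so the sum over any rectangle costs just four lookups
gray = cv2.imread('face.jpg', cv2.IMREAD_GRAYSCALE)  # placeholder path
ii = cv2.integral(gray)  # shape (H+1, W+1)

def rect_sum(x, y, w, h):
    # Sum of pixel intensities inside the rectangle at (x, y) of size (w, h)
    return int(ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x])

# A two-rectangle "edge" feature: bright region on top minus dark region below
feature = rect_sum(50, 50, 24, 12) - rect_sum(50, 62, 24, 12)
print(feature)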

The training process involves providing the algorithm with positive examples (images containing faces) and negative examples (images without faces). The algorithm then learns to differentiate between these examples by adjusting the weights of the features. After training, the Haar Cascade essentially becomes a hierarchy of classifiers, with each stage progressively refining the detection process.

For face detection, I utilized a pre-trained Haar Cascade model trained on forward-facing images of faces.

import cv2

face_cascade = cv2.CascadeClassifier('./configs/haarcascade_frontalface_default.xml')

def haar(image):
    # Haar cascades operate on grayscale images
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
    print(f"{len(faces)} total faces detected.")
    for (x, y, w, h) in faces:
        print(f"Face detected in the box {x} {y} {x+w} {y+h}")

MTCNN

Figure 2. Face detection process in MTCNN (source — original paper)

MTCNN (Multi-Task Cascaded Convolutional Networks) stands as a sophisticated and highly accurate algorithm for face detection, surpassing the capabilities of Haar Cascades. Designed to excel in scenarios with diverse face sizes, orientations, and lighting conditions, MTCNN leverages a series of neural networks, each tailored to execute specific tasks within the face detection process.

  • Phase One — Proposal Generation: MTCNN initiates the process by generating a multitude of potential face regions (bounding boxes) through a small neural network.
  • Phase Two — Refinement: Candidates generated in the first phase undergo filtering in this step. A second neural network evaluates the proposed bounding boxes, adjusting their positions for a more precise alignment with the true face boundaries. This aids in enhancing accuracy.
  • Phase Three — Facial Feature Points: This stage identifies facial landmarks, such as the eye corners, nose, and mouth. A third neural network accurately pinpoints these features.

MTCNN’s cascaded architecture allows it to swiftly discard regions devoid of faces early in the process, concentrating computation on areas with a higher probability of containing faces. Its ability to handle different scales (zoom levels) of faces and rotations makes it far more suitable for intricate scenarios than Haar Cascades. However, its sequential, multi-network design makes it computationally intensive.

For the implementation of MTCNN, I utilized the mtcnn library.

import cv2
from mtcnn import MTCNN

detector = MTCNN()

def mtcnn_detector(image):
    # detect_faces expects an RGB image array
    faces = detector.detect_faces(image)
    print(f"{len(faces)} total faces detected.")
    for face in faces:
        x, y, w, h = face['box']
        print(f"Face detected in the box {x} {y} {x+w} {y+h}")

YOLOv5

Figure 3. YOLO Object Detection Process (source — original paper)

YOLO (You Only Look Once) is an algorithm employed for detecting a multitude of objects, including faces. Unlike its predecessors, YOLO performs detection in a single pass through a neural network, making it faster and more suitable for real-time applications and videos. The process of detecting faces with YOLO can be distilled into four parts:

  • Image Grid Division: The input image is divided into a grid of cells. Each cell is responsible for predicting objects located within its boundaries. For every cell, YOLO predicts bounding boxes, object probabilities, and class probabilities.
  • Bounding Box Prediction: Within each cell, YOLO predicts one or more bounding boxes along with their corresponding probabilities. These bounding boxes represent potential object locations. Each bounding box is defined by its center coordinates, width, height, and the probability that an object exists within that bounding box.
  • Class Prediction: For each bounding box, YOLO predicts the probabilities for various classes (e.g., ‘face,’ ‘car,’ ‘dog’) to which the object may belong.
  • Non-Maximum Suppression (NMS): To eliminate duplicate bounding boxes, YOLO applies NMS. This process discards redundant bounding boxes by evaluating their confidence scores and overlap with other boxes, retaining only the most confident, non-overlapping ones (a minimal sketch follows this list).
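For intuition, greedy NMS can be sketched in a few lines of NumPy. This is an illustration of the idea rather than YOLO’s exact internal implementation; boxes are assumed to be in (x, y, w, h) format:

import numpy as np

def iou(a, b):
    # Intersection-over-union of two (x, y, w, h) boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    # Greedily keep the most confident box, then drop every remaining box
    # that overlaps it too heavily; repeat until no candidates are left
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        overlaps = np.array([iou(boxes[best], boxes[i]) for i in rest])
        order = rest[overlaps <= iou_threshold]
    return keep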

The key advantage of YOLO lies in its speed. Since it processes the entire image in a single forward pass through the neural network, it’s significantly faster than algorithms involving sliding windows or region proposals. However, this speed might come at a slight trade-off with precision, especially for smaller objects or crowded scenes.

YOLO can be adapted for face detection by training it on face-specific data and modifying its output classes to include only one class (‘face’). For this, I utilized the ‘yoloface’ library, built upon YOLOv5.

import cv2
from yoloface import face_analysis

face = face_analysis()

def yolo_face_detection(image):
    # 'tiny' is the smaller, faster YOLO variant
    img, box, conf = face.face_detection(image, model='tiny')
    print(f"{len(box)} total faces detected.")
    for i in range(len(box)):
        x, y, h, w = box[i]
        print(f"Face detected in the box {x} {y} {x+w} {y+h}")

Face blurring

After identifying the bounding boxes around potential faces in the image, the next step is to blur them to conceal the identities behind them. For this task, I developed two implementations. A reference image for demonstration is provided in Figure 4.

Figure 4. Reference image by Ethan Hoover on Unsplash

Gaussian Blur

Figure 5. Blurred reference image (Figure 4) with Gaussian Blur

Gaussian blur is an image processing technique used to reduce image noise and soften detail. This is particularly useful for face blurring, as it erases the specifics of that portion of the image. It computes a weighted average of the pixel values in the neighborhood around each pixel. The average is centered on the pixel being blurred and weighted by a Gaussian distribution, giving more weight to nearby pixels and less to distant ones. The result is a softened image with reduced high-frequency noise and fine detail. The outcome of applying Gaussian blur is depicted in Figure 5.

Gaussian Blur takes three parameters:

  1. Image portion to be blurred.
  2. Kernel size: the matrix used for the blurring operation. A larger kernel size leads to stronger blurring.
  3. Standard deviation: A higher value enhances the blurring effect.

# (x, y, w, h) is a face bounding box from the detection step
f = image[y:y + h, x:x + w]
blurred_face = cv2.GaussianBlur(f, (99, 99), 15)  # kernel size and sigma can be tuned
image[y:y + h, x:x + w] = blurred_face
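To see the “more weight to nearby pixels” behavior concretely, OpenCV can be asked for the 1-D kernel it builds internally (the 2-D blur is the outer product of two such kernels). A quick sketch:

import cv2

# Coefficients of a 9-tap Gaussian kernel with sigma = 2; weights peak
# at the center pixel and decay symmetrically toward the edges
kernel = cv2.getGaussianKernel(ksize=9, sigma=2)
print(kernel.ravel().round(3))
# Roughly: [0.028 0.066 0.124 0.18 0.204 0.18 0.124 0.066 0.028]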

Pixelization

Figure 6. Blurred reference image (Figure 4) with Pixelization

Pixelization is an image processing technique where the pixels in an image are replaced with larger blocks of a single color. This effect is achieved by dividing the image into a grid of cells, where each cell corresponds to a group of pixels. The color or intensity of all pixels in the cell is then taken as the average value of the colors of all pixels in that cell, and this average value is applied to all pixels in the cell. This process creates a simplified appearance, reducing the level of fine details in the image. The result of applying pixelization is shown in Figure 6. As you can observe, pixelization significantly complicates the identification of a person’s identity.

Pixelization takes one main parameter: the size of the pixel grid that will represent the blurred region. For instance, the region containing the face can be reduced to a 10x10 grid of pixel blocks, regardless of its original size. A smaller grid leads to stronger blurring.

# Downscale the face region to a 10x10 grid, then upscale it back to its
# original size, so the whole region becomes 100 uniform blocks
f = image[y:y + h, x:x + w]
f = cv2.resize(f, (10, 10), interpolation=cv2.INTER_NEAREST)
image[y:y + h, x:x + w] = cv2.resize(f, (w, h), interpolation=cv2.INTER_NEAREST)

Results and discussion

I will evaluate the algorithms from two perspectives: real-time performance and specific image scenarios.

Real-Time performance

Using the same reference image (Figure 4), the time required for each face detection algorithm to locate the bounding box of the face in the image was measured. The results are based on the average value of 10 measurements for each algorithm. The time needed for the blurring algorithms is negligible and will not be considered in the evaluation process.
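The harness itself is straightforward; a sketch of the kind of measurement used (the detector functions are the ones defined earlier):

import time

def benchmark(detect_fn, image, runs=10):
    # Average wall-clock time of a detection function over several runs
    start = time.perf_counter()
    for _ in range(runs):
        detect_fn(image)
    return (time.perf_counter() - start) / runs

# e.g. print(f"Haar: {benchmark(haar, image):.3f}s per image")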

Figure 7. Average time in seconds needed for each algorithm to detect a face

It can be observed that YOLOv5 achieves the best performance (speed) thanks to its single pass through the neural network. In contrast, methods like MTCNN must traverse multiple neural networks sequentially, which also makes the algorithm harder to parallelize.

Scenario-based evaluation

To evaluate the performance of the aforementioned algorithms, in addition to the reference image (Figure 4), I selected several images that test the algorithms in various scenarios:

  1. Reference image (Figure 4)
  2. Group of people close together — to assess the algorithms’ ability to capture different face sizes, some closer and some farther away (Figure 8)
  3. Multiple faces — an image with several distinct, clearly visible faces (Figure 9)
  4. Side-view faces — testing the algorithms’ capability to detect faces not looking directly at the camera (Figure 10)
  5. Flipped face, 180 degrees — testing the algorithms’ ability to detect a face rotated by 180 degrees (Figure 11)
  6. Flipped face, 90 degrees — testing the algorithms’ ability to detect a face rotated by 90 degrees, sideways (Figure 12)

Figure 8. Group of people by Nicholas Green on Unsplash
Figure 9. Multiple faces by Naassom Azevedo on Unsplash
Figure 10. Side-view faces by Kraken Images on Unsplash
Figure 11. Flipped face 180 degrees from Figure 4.
Figure 12. Flipped face 90 degrees from Figure 4.

Haar Cascade

The Haar Cascade algorithm generally performs well at anonymizing faces, with a few exceptions. It handles the reference image (Figure 4) and the ‘Multiple faces’ scenario (Figure 9) excellently. In the ‘Group of people’ scenario (Figure 8), it copes decently, though some faces, particularly the more distant ones, are only partially detected or missed. Haar Cascade struggles with faces not directly facing the camera (Figure 10) and with rotated faces (Figures 11 and 12), where it fails to recognize the faces entirely.

Figure 13. Results with Haar Cascade

MTCNN

MTCNN achieves very similar results to Haar Cascade, with the same strengths and weaknesses. Additionally, MTCNN struggles to detect the face with a darker skin tone in Figure 9.

Figure 14. Results with MTCNN

YOLOv5

YOLOv5 yields slightly different results from Haar Cascade and MTCNN. It successfully detects one of the faces where people are not looking directly at the camera (Figure 10), as well as the face rotated by 180 degrees (Figure 11). However, in the ‘Group of people’ image (Figure 8), it doesn’t detect the faces farther away as effectively as the previously mentioned algorithms.

Figure 15. Results with YOLOv5

Privacy

When addressing the challenge of privacy in image processing, a crucial aspect to consider is the delicate balance between rendering faces unrecognizable while maintaining the natural appearance of the images.

Gaussian Blur

Gaussian blur effectively blurs the facial region of an image (as depicted in Figure 5). Nevertheless, its effectiveness depends on the parameters of the Gaussian distribution used for the blurring. In Figure 5, facial features remain discernible, suggesting that a higher standard deviation and a larger kernel are needed for optimal results.

Pixelization

Pixelization (as illustrated in Figure 6) often appears more visually pleasing to the human eye, owing to its familiarity as a face-blurring method compared to Gaussian blur. The number of pixels used plays a pivotal role here: a smaller pixel count renders the face less recognizable but may result in a less natural appearance.

Overall, pixelization tends to be preferred over Gaussian blur: its familiarity and contextual naturalness strike a balance between privacy and aesthetics.

Reverse Engineering

With the rise of AI tools, it becomes imperative to anticipate reverse engineering techniques aimed at removing privacy filters from blurred images. Nevertheless, the very act of blurring a face irreversibly replaces specific facial details with more generalized ones. As of now, AI tools can only reverse engineer a blurred face when presented with clear reference images of the same person. Paradoxically, this defeats the purpose of the attack, as it presupposes knowledge of the individual’s identity. Thus, face blurring stands as an efficient and necessary means of safeguarding privacy in the face of evolving AI capabilities.

Usage in videos

Since videos are essentially a sequence of images, it is relatively straightforward to adapt each algorithm to anonymize videos. Here, however, processing time becomes crucial: a 30-second video recorded at 60 frames (individual images) per second requires processing 1,800 frames. In this context, algorithms like MTCNN are not feasible, despite their advantages in certain scenarios, so I decided to implement video anonymization using the YOLO model.

import time

import cv2
from yoloface import face_analysis

face = face_analysis()

def yolo_face_detection_video(video_path, output_path, pixelate):
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise ValueError("Could not open video file")

    # Get video properties
    fps = int(cap.get(cv2.CAP_PROP_FPS))
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    # Define the codec and create a VideoWriter object for the output video
    fourcc = cv2.VideoWriter_fourcc(*'H264')
    out = cv2.VideoWriter(output_path, fourcc, fps, (frame_width, frame_height))

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        tm = time.time()
        img, box, conf = face.face_detection(frame_arr=frame, frame_status=True, model='tiny')

        for i in range(len(box)):
            x, y, h, w = box[i]
            if pixelate:
                # Downscale and upscale the face region to pixelate it
                f = img[y:y + h, x:x + w]
                f = cv2.resize(f, (10, 10), interpolation=cv2.INTER_NEAREST)
                img[y:y + h, x:x + w] = cv2.resize(f, (w, h), interpolation=cv2.INTER_NEAREST)
            else:
                blurred_face = cv2.GaussianBlur(img[y:y + h, x:x + w], (99, 99), 30)  # blur parameters can be tuned
                img[y:y + h, x:x + w] = blurred_face

        print(f"Frame processed in {time.time() - tm:.3f}s")
        out.write(img)

    cap.release()
    out.release()
    cv2.destroyAllWindows()
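Invoking it is then a one-liner (the file paths here are placeholders):

# Anonymize a clip with pixelization; pass pixelate=False for Gaussian blur
yolo_face_detection_video('input.mp4', 'output.mp4', pixelate=True)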

Web application

For a simplified evaluation of the different algorithms, I created a web application where users can upload any image or video, select the face detection and blurring algorithm, and after processing, the result is returned to the user. The implementation was done using Flask with Python on the backend, utilizing the mentioned libraries as well as OpenCV, and React.js on the frontend for user interaction with the models. The complete code is available at this link.
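The full backend lives in the linked repository; purely to illustrate the shape of such an endpoint, here is a minimal Flask sketch. The /anonymize route and the anonymize_image helper are hypothetical stand-ins, not the project’s actual API:

import io

import cv2
import numpy as np
from flask import Flask, request, send_file

app = Flask(__name__)

@app.route('/anonymize', methods=['POST'])  # hypothetical route
def anonymize():
    # Decode the uploaded image into an OpenCV array
    data = np.frombuffer(request.files['image'].read(), np.uint8)
    image = cv2.imdecode(data, cv2.IMREAD_COLOR)

    # anonymize_image is a hypothetical helper that runs the selected
    # detector and blurring technique over the image
    result = anonymize_image(image,
                             detector=request.form.get('detector', 'yolo'),
                             blur=request.form.get('blur', 'pixelate'))

    # Re-encode and send the processed image back to the client
    ok, buf = cv2.imencode('.jpg', result)
    return send_file(io.BytesIO(buf.tobytes()), mimetype='image/jpeg')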

Conclusion

Within the scope of this post, various face detection algorithms, including Haar Cascade, MTCNN, and YOLOv5, were explored, compared, and analyzed across different aspects. The project also focused on image-blurring techniques.

Haar Cascade proved to be an efficient method in certain scenarios, exhibiting generally good runtime performance. MTCNN stood out as an algorithm with solid face detection capabilities in various conditions, although it struggled with faces in unconventional orientations. YOLOv5, with its real-time detection capability, emerged as an excellent choice for scenarios where time is critical (such as videos), albeit with slightly reduced accuracy in group settings.

All algorithms and techniques were integrated into a single web application. This application provides easy access and utilization of all face detection and blurring methods, along with the ability to process videos using blurring techniques.

This post is a conclusion of my work for the “Digital Processing of Images” course at the Faculty of Computer Science and Engineering in Skopje. Thanks for reading!
