
Robust face detection with MTCNN

Upgrade from old Viola-Jones detector

Introduction

Face detection is the task of finding all human faces in a given image. This is typically done by extracting a list of bounding boxes, i.e. the coordinates of the smallest possible rectangles around the faces. A perfect detector should be:

  • fast – ideally real-time (at least 1 FPS)
  • accurate – it should only detect faces (no false positives) and detect all faces (no false negatives)
  • robust – faces should be detected in different poses, rotations, lighting conditions etc.
  • utilizing all available resources – using GPU if possible, using color (RGB) input etc.

If you already know how the Viola-Jones detector works, you can skip the next part and go directly to the MTCNN section.

Viola-Jones detector

One of the oldest approaches is the Viola-Jones detector (paper). It works on grayscale images and interprets the image as a collection of Haar features, i.e. adjacent lighter and darker rectangles. There are many types of Haar features, differing in how the light and dark regions are positioned within the rectangle. They can be computed very quickly using a technique called integral images.

Haar features (source)
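To make the integral image idea concrete, here is a minimal pure-Python sketch. It is not the OpenCV implementation; the function names and the toy 4x4 patch are illustrative assumptions. Once the table is built, the sum of any rectangle costs four lookups, so a Haar feature (a difference of rectangle sums) is cheap regardless of its size.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column.

    ii[y][x] holds the sum of all pixels above and to the left of (x, y),
    so rectangle lookups need no special handling at the image border.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle with top-left corner (x, y): 4 lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_haar(ii, x, y, w, h):
    """Horizontal two-rectangle feature: left half minus right half."""
    half = w // 2
    left = rect_sum(ii, x, y, half, h)
    right = rect_sum(ii, x + half, y, half, h)
    return left - right


# Toy grayscale patch (hypothetical values): bright left half, dark right half.
patch = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]
ii = integral_image(patch)
feature = two_rect_haar(ii, 0, 0, 4, 4)  # large value: strong vertical edge
```

A large positive or negative feature value means the patch matches the light/dark pattern well; a value near zero means it does not.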

Such features are then fed to a cascade of AdaBoost classifiers. These are boosted ensembles of very simple decision trees, arranged in a fixed order. Each image window is passed to the first classifier: if the classifier rejects it (i.e. decides there is no face at this location), the window is discarded immediately; if it accepts it (i.e. decides it may contain a face), the window is passed on to the next classifier. In this way the Viola-Jones detector turns the task of detecting faces into the task of rejecting non-faces. As it turns out, a cascade can filter out non-faces very quickly, which results in very fast detection.

Viola-Jones cascade (source)
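The early-rejection logic of the cascade can be sketched in a few lines. This is a simplified illustration, not the trained OpenCV cascade: the stage functions, thresholds and feature names below are hypothetical stand-ins for boosted classifiers.

```python
def run_cascade(stages, window):
    """Return True only if every stage accepts the window.

    Each stage is a (score_fn, threshold) pair. Evaluation stops at the
    first stage whose score falls below its threshold.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # rejected early, later stages never run
    return True


# Hypothetical stages operating on a toy "window" of precomputed features.
stages = [
    (lambda w: w["eye_band_contrast"], 0.2),  # cheap first check
    (lambda w: w["nose_bridge_contrast"], 0.3),
    (lambda w: w["eye_band_contrast"] + w["nose_bridge_contrast"], 1.0),
]

face_like = {"eye_band_contrast": 0.8, "nose_bridge_contrast": 0.6}
background = {"eye_band_contrast": 0.05, "nose_bridge_contrast": 0.9}

face_accepted = run_cascade(stages, face_like)         # passes all stages
background_accepted = run_cascade(stages, background)  # fails the first stage
```

Since most windows in a real image are background, most of them are discarded after one or two cheap stages, which is where the speed comes from.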

However, it has its problems. To handle faces of different sizes, the image is resized multiple times and the classifier runs on every resized copy. The accuracy in such cases is not great, and the detector can also fail in quite surprising circumstances (see image below). The main disadvantage is the lack of robustness. The Viola-Jones classifier assumes that the faces are looking straight at the camera (or almost straight), that the lighting is uniform and that the face is clearly visible. Such conditions are often met, e.g. when a smartphone camera detects faces for filters, but not always. Prime examples are security cameras (which are placed near the ceiling) and group photos (where lighting conditions vary).

MTCNN to the rescue

Multi-task Cascaded Convolutional Neural Network (MTCNN, paper) is a modern tool for face detection that uses a 3-stage neural network detector.

MTCNN work visualization (source)

First, the image is resized multiple times to detect faces of different sizes. Then the P-network (Proposal) scans the resized images, performing the first detection. It has a low detection threshold and therefore produces many false positives, even after NMS (Non-Maximum Suppression), but this is intentional.
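The resizing step can be sketched as computing a list of scale factors for an image pyramid. The values below (minimum face size of 20 pixels, scale step of 0.709, a 12x12 input window for the first network) are assumptions taken from common MTCNN implementations, not from this article.

```python
def pyramid_scales(height, width, min_face_size=20, factor=0.709, net_input=12):
    """Scale factors at which the image is resized before the first network.

    The first scale maps a face of `min_face_size` pixels onto the
    `net_input`-pixel window; each following scale shrinks the image by
    `factor` until the shorter side no longer fits the window.
    """
    scale = net_input / min_face_size
    min_side = min(height, width) * scale
    scales = []
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


# Full HD frame: every entry is one resized copy scanned by the P-network.
scales = pyramid_scales(1080, 1920)
```

Larger `min_face_size` values shorten the list, which is one simple way to trade the ability to find small faces for speed.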

The proposed regions (containing many false positives) are the input for the second network, the R-network (Refine), which, as the name suggests, filters the detections (also with NMS) to obtain fairly precise bounding boxes. The final stage, the O-network (Output), performs the final refinement of the bounding boxes. This way not only are faces detected, but the bounding boxes are also accurate and precise.
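Non-Maximum Suppression, used after the proposal and refinement stages, can be sketched as a greedy filter on overlapping boxes. This is a generic pure-Python version with an assumed IoU threshold of 0.5 and made-up boxes; actual implementations are vectorized and may use different thresholds per stage.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than `iou_threshold`.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes around one face and one separate face.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.90, 0.80, 0.95]
kept = nms(boxes, scores)  # the lower-scoring duplicate (index 1) is dropped
```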

An optional feature of MTCNN is the detection of facial landmarks, i.e. the eyes, the nose and the corners of the mouth. It comes at almost no extra cost, since the landmarks are computed anyway during face detection, which is an additional advantage if you need them (e.g. for face alignment).
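As a small illustration of how landmarks help with face alignment, the sketch below computes the in-plane tilt of a face from the two eye positions. The function name and the coordinates are hypothetical; the points are assumed to be (x, y) pixel positions, with "left" meaning the eye that appears on the left side of the image.

```python
import math


def roll_angle_degrees(left_eye, right_eye):
    """Angle of the line joining the eyes, in degrees, in image coordinates
    (x grows to the right, y grows downwards)."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


level = roll_angle_degrees((100.0, 120.0), (160.0, 120.0))   # eyes on one row
tilted = roll_angle_degrees((100.0, 100.0), (150.0, 150.0))  # right eye lower
```

Rotating the face crop so that this angle becomes zero puts the eyes on a horizontal line. The sign convention of the rotation depends on the image library used.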

The official TensorFlow implementation of MTCNN works well, but the PyTorch one is faster (link). It achieves about 13 FPS on full HD video, and up to 45 FPS on rescaled video, using a few tricks (see the documentation). It’s also very easy to install and use. I’ve also achieved 6–8 FPS on the CPU for full HD, so real-time processing is very much possible with MTCNN.
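The article does not spell out the speed-up tricks, but running detection on a rescaled frame and mapping the boxes back is one that the numbers above hint at. Here is a minimal sketch under these assumptions: the detector exposes a `detect(image)` method returning `(boxes, probs)` with `boxes` set to `None` when nothing is found (as facenet-pytorch's `MTCNN` does), and the image behaves like a Pillow image. The helper name is hypothetical.

```python
def detect_downscaled(detector, image, scale=0.5):
    """Detect faces on a smaller copy of the image, then map boxes back.

    `detector` needs a detect(image) -> (boxes, probs) method and `image`
    needs .size and .resize((width, height)), as Pillow images have.
    """
    width, height = image.size
    small = image.resize((int(width * scale), int(height * scale)))
    boxes, probs = detector.detect(small)
    if boxes is None:  # no faces found
        return [], []
    # Coordinates found on the small image are divided by the scale
    # to express them in the original image's pixels.
    restored = [[float(v) / scale for v in box] for box in boxes]
    return restored, list(probs)


# Usage, assuming facenet-pytorch and Pillow are installed:
#   from facenet_pytorch import MTCNN
#   from PIL import Image
#   boxes, probs = detect_downscaled(MTCNN(), Image.open("frame.jpg"))
```

The trade-off is that very small faces may fall below the detector's minimum face size after downscaling.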

MTCNN is very accurate and robust. It properly detects faces of different sizes, under varied lighting and with strong rotations. It’s a bit slower than the Viola-Jones detector, but on a GPU the difference is small. It also uses color information, since CNNs take RGB images as input.

Comparison

Comparison of Viola-Jones and MTCNN detectors (image by author)

Code and examples

To see how these two detectors compare on real-world data, I used the code below and a few popular images.

The Viola-Jones code is based on this OpenCV tutorial. The pre-trained weights for face detection can be found here.
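The snippet below is a minimal sketch of such a comparison, not the exact code used for the images in this article. It assumes the `opencv-python` package with its bundled `haarcascade_frontalface_default.xml` model, and the `facenet-pytorch` package as the PyTorch MTCNN implementation. The function names and the `scaleFactor` and `minNeighbors` values are illustrative choices.

```python
def xywh_to_xyxy(box):
    """Convert OpenCV's (x, y, width, height) to corners (x1, y1, x2, y2),
    so that both detectors report boxes in the same format."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def detect_viola_jones(image_path):
    """Haar cascade detection with OpenCV's bundled frontal-face model."""
    import cv2  # imported lazily so the helper above works without OpenCV

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    # The cascade works on grayscale input.
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    found = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [xywh_to_xyxy(tuple(int(v) for v in box)) for box in found]


def detect_mtcnn(image_path):
    """MTCNN detection with default settings (assumes facenet-pytorch)."""
    from facenet_pytorch import MTCNN
    from PIL import Image

    detector = MTCNN()
    boxes, _probs = detector.detect(Image.open(image_path).convert("RGB"))
    if boxes is None:  # no faces found
        return []
    return [tuple(float(v) for v in box) for box in boxes]
```

Calling both functions on the same file path returns two lists of corner-format boxes that can be drawn side by side.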

The first image (source) is a typical family photo. Face detection is often used on such images, e.g. for automated image gallery management on iPhones.

Left: Viola-Jones, right: MTCNN (original image source)

This is a typical use case for Viola-Jones detection: a family photo in great conditions, with faces approximately facing the camera, so it works really well. MTCNN, since it’s more sensitive by default, produces a false positive. In such situations we can easily change the detection thresholds of MTCNN with the thresholds attribute, which is very useful, but here we stick to the defaults.
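As a sketch of how the thresholds could be adjusted, assuming the facenet-pytorch implementation: its `MTCNN` constructor takes a `thresholds` argument with one value per stage, and the defaults assumed here are `[0.6, 0.7, 0.7]`. The `stricter` and `build_detector` helpers are hypothetical names for illustration.

```python
# Assumed facenet-pytorch defaults, one threshold per stage (P, R, O).
DEFAULT_THRESHOLDS = [0.6, 0.7, 0.7]


def stricter(thresholds, delta=0.1, cap=0.99):
    """Raise every stage threshold by `delta`, capped below 1."""
    return [min(round(t + delta, 2), cap) for t in thresholds]


def build_detector(thresholds):
    """Create an MTCNN detector with custom stage thresholds."""
    from facenet_pytorch import MTCNN  # lazy import, needs facenet-pytorch

    return MTCNN(thresholds=thresholds)


strict_thresholds = stricter(DEFAULT_THRESHOLDS)
# detector = build_detector(strict_thresholds)
```

Higher thresholds make the detector stricter: fewer spurious boxes, at the risk of missing faint or heavily occluded faces.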

The second image is a bit more challenging: a famous Solvay conference group photo (source). It’s similar to large family group photos, which are also a popular target for face detection in social media.

Upper: Viola-Jones, lower: MTCNN (original image source)

Both detectors work well. The only face not detected by Viola-Jones is the rotated one. Here MTCNN shows its robustness: it detects this face. As you can see, even seemingly "trivial" photos can be problematic for simple detectors.

The last photo is a typical street photo (source). Such conditions are closer to security camera footage, where ML is used for face detection, e.g. to speed up surveillance video analysis.

Upper: Viola-Jones, lower: MTCNN (original image source)

Images like this have many different light levels, so the Viola-Jones detector does not work well: it generates a lot of false positives and misses smaller, rotated or partially obstructed faces. MTCNN, on the other hand, achieves perfect detection, even for heavily obstructed faces.

Summary

You have learned about MTCNN, a robust and accurate alternative to the Viola-Jones detector. In real-life conditions, the assumptions of the Viola-Jones framework often fail, but cleverly constructed neural networks can handle such tasks with ease.

If you want to know more about MTCNN and its technical details, see the excellent documentation (with tutorials and notebooks linked there).

