Face Recognition on Jetson Nano using TensorRT

mtCNN and Google FaceNet in FP16 precision on Jetson Nano

Niclas Wesemann
Towards Data Science


When it comes to face recognition, there are many options to choose from. While most of them are cloud-based, I decided to build a hardware-based face recognition system that does not need an internet connection, which makes it particularly attractive for robotics, embedded systems, and automotive applications.

Link to Github repository

Update

The repository was updated (September 18th, 2020) and everything now works with JetPack 4.4, CUDA 10.2, cuDNN 8.0, TensorRT 7.x, and OpenCV 4.1.1.

NVIDIA Jetson Nano

Source: https://developer.nvidia.com/embedded/jetson-nano-developer-kit

NVIDIA Jetson Nano is a single-board computer for computation-intensive embedded applications that includes a 128-core Maxwell GPU and a quad-core 64-bit ARM A57 CPU. It is also well suited to deploying neural networks from the computer vision domain, since it provides 472 GFLOPS of FP16 compute performance at 5–10 W of power consumption [Source]. There are many tutorials that make it easy to get started with the Jetson platform, such as the Hello AI World tutorials or JetBot, a small DIY robot based on the Jetson Nano.

NVIDIA TensorRT

TensorRT enables the optimization of machine learning models trained in one of your favorite ML frameworks (TensorFlow, Keras, PyTorch, …) by merging layers and tensors, picking the best kernels for a specific GPU, and reducing the precision (FP16, INT8) of matrix multiplications while preserving their accuracy. Note that for INT8 precision an extra calibration step is needed to preserve accuracy. Since this significantly reduces the inference time (at least in most cases) and increases resource efficiency, it is the natural final step for deploying a machine learning model in robotics, embedded systems (with a GPU), autonomous driving, and data centers.

Source: https://developer.nvidia.com/tensorrt

Now, what does FP16 mean? It is also known as half precision. Machine learning models in most frameworks are trained using 32 bits of precision. With FP16, the precision of a trained model is reduced to 16 bits, and on the NVIDIA Jetson AGX Xavier you can even convert a model to INT8 (8 bits). According to NVIDIA, this can accelerate the inference of a trained model by up to 100+ times.
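To give an idea of what this precision reduction looks like in code, here is a minimal sketch of building an FP16 engine with the TensorRT 7.x C++ API. The ONNX file name "facenet.onnx" and the workspace size are illustrative placeholders, not the exact build code used in the repository.

```cpp
#include <iostream>
#include "NvInfer.h"
#include "NvOnnxParser.h"

// Minimal logger required by the TensorRT C++ API.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) override {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;

int main() {
    // Create builder, network (explicit batch) and ONNX parser.
    auto builder = nvinfer1::createInferBuilder(gLogger);
    const auto flags = 1U << static_cast<uint32_t>(
        nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = builder->createNetworkV2(flags);
    auto parser = nvonnxparser::createParser(*network, gLogger);

    // "facenet.onnx" is a placeholder model file for this sketch.
    if (!parser->parseFromFile("facenet.onnx",
            static_cast<int>(nvinfer1::ILogger::Severity::kWARNING))) {
        std::cerr << "Failed to parse ONNX model" << std::endl;
        return 1;
    }

    auto config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1 << 28);  // 256 MiB of scratch space

    // Reduce precision to FP16 if the GPU supports it (the Nano does).
    if (builder->platformHasFastFp16())
        config->setFlag(nvinfer1::BuilderFlag::kFP16);

    // Build the optimized engine (layer fusion, kernel selection, FP16).
    auto engine = builder->buildEngineWithConfig(*network, *config);
    if (!engine) {
        std::cerr << "Engine build failed" << std::endl;
        return 1;
    }
    std::cout << "FP16 engine built" << std::endl;

    // TensorRT 7 objects are released with destroy().
    engine->destroy();
    config->destroy();
    parser->destroy();
    network->destroy();
    builder->destroy();
    return 0;
}
```

Building the engine once and serializing it to disk is the usual pattern on the Nano, since the optimization step itself can take several minutes.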

mtCNN

The Multi-task Cascaded Convolutional Networks (mtCNN) is a deep-learning-based approach for face and landmark detection that is invariant to head pose, illumination, and occlusion. Face and landmark locations are computed by a three-stage process in a coarse-to-fine manner while keeping real-time capability, which is particularly important in the face recognition scenario.

Face and Landmark Detection using mtCNN (Source)
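The three stages are commonly referred to as P-Net (proposals), R-Net (refinement), and O-Net (output). The sketch below only illustrates the coarse-to-fine control flow; the types and stage functions are hypothetical stubs, not the interfaces of the actual repository, where each stage runs as a TensorRT engine.

```cpp
#include <vector>

// Hypothetical types and stubs to illustrate the cascade.
struct BoundingBox { float x1, y1, x2, y2, score; };
struct Face { BoundingBox box; float landmarks[10]; };  // 5 (x, y) landmarks
struct Image {};

// Stage stubs (placeholders, not the repository's API).
std::vector<BoundingBox> runPNet(const Image&) { return {}; }
std::vector<BoundingBox> runRNet(const Image&, const std::vector<BoundingBox>&) { return {}; }
std::vector<Face> runONet(const Image&, const std::vector<BoundingBox>&) { return {}; }
std::vector<BoundingBox> nonMaxSuppression(const std::vector<BoundingBox>& b) { return b; }

std::vector<Face> detectFaces(const Image& frame) {
    // Stage 1: P-Net proposes many candidate windows on an image pyramid (coarse).
    auto candidates = nonMaxSuppression(runPNet(frame));
    // Stage 2: R-Net rejects false positives and refines the remaining boxes.
    auto refined = nonMaxSuppression(runRNet(frame, candidates));
    // Stage 3: O-Net outputs the final boxes plus facial landmarks (fine).
    return runONet(frame, refined);
}

int main() {
    Image frame;  // would be a camera frame in practice
    std::vector<Face> faces = detectFaces(frame);
    (void)faces;
    return 0;
}
```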

Google FaceNet

Google’s FaceNet is a deep convolutional network that embeds people’s faces from a 160x160 RGB image into a 128-dimensional latent space and allows feature matching of the embedded faces. By saving embeddings of people’s faces in a database, you can perform feature matching that recognizes a face, since the Euclidean distance between a currently visible face’s embedding and its stored embedding will be much smaller than its distance to the other stored embeddings.

(Source)
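As a rough illustration of the matching step, the sketch below compares a query embedding against a small database of known embeddings using the squared Euclidean distance and accepts the nearest match below a threshold. The threshold value, names, and data structures are illustrative assumptions, not the repository's actual code.

```cpp
#include <array>
#include <iostream>
#include <limits>
#include <string>
#include <utility>
#include <vector>

using Embedding = std::array<float, 128>;  // FaceNet output dimensionality

// Squared Euclidean distance between two 128-d embeddings.
float squaredDistance(const Embedding& a, const Embedding& b) {
    float sum = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Returns the name of the closest known face, or "unknown" if no
// stored embedding is within the (illustrative) threshold.
std::string recognize(const Embedding& query,
                      const std::vector<std::pair<std::string, Embedding>>& database,
                      float threshold = 1.0f) {
    std::string bestName = "unknown";
    float bestDist = std::numeric_limits<float>::max();
    for (const auto& entry : database) {
        const float dist = squaredDistance(query, entry.second);
        if (dist < threshold && dist < bestDist) {
            bestDist = dist;
            bestName = entry.first;
        }
    }
    return bestName;
}

int main() {
    Embedding known{};  // would come from FaceNet on an enrollment image
    Embedding query{};  // would come from FaceNet on the current frame
    std::vector<std::pair<std::string, Embedding>> database = {{"alice", known}};
    std::cout << recognize(query, database) << std::endl;  // prints "alice" here
    return 0;
}
```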

Setup & the Code

This project can be set up and built in less than 5 minutes since JetPack comes with CUDA, cuDNN, TensorRT, and OpenCV. So, don’t be scared to try it.

Link to Github repository

To my knowledge, this is the first open-source C++ implementation that combines mtCNN and Google FaceNet in TensorRT, and I invite you to collaborate to improve the implementation in terms of its efficiency and features.

The setup for this project involves:
- Jetson Nano Developer Kit
- Raspberry Pi Camera v2 (or any USB-Camera supported by Jetson Nano)
- Optional: PWM fan for appropriate cooling of CPU and GPU
- Optional: Some kind of camera mount for the camera

Performance

Lastly, here is an overview of the face recognition performance on the Jetson Nano at 640x480 (480p):
- ~60ms +/- 20ms for face detection using mtCNN
- ~22ms +/- 2ms for FaceNet inference
