Face Recognition

Anas Cherradi
Towards Data Science
13 min read · Feb 25, 2020


Face Recognition using PCA vs Deep Learning

We will learn how to perform face recognition using a pre-trained neural network with the triplet loss function, and compare it against a classical PCA approach.

Introduction

Image content analysis and pattern recognition are rapidly expanding application areas today, thanks to the ever-increasing computational power available.

Even though the systems proposed in the literature are becoming more robust, reliable and efficient at face recognition tasks, several real technical and application aspects of the field are often omitted or greatly simplified, so a formal, complete solution is still out of reach.

In this work, several scientific contributions that have demonstrated good results in the field of face recognition are considered, with some modifications, in order to pick the most effective solution and ensure that the recognition performance is competitive with the state of the art.

We would like to develop a face recognition system to be used within a class as an attendance system, marking the presence of lecturers and students.

Data set

We will be using two different datasets: one for the PCA method, and a custom dataset of faces for the CNN approach.

Olivetti dataset

https://scikit-learn.org/0.19/datasets/olivetti_faces.html

Olivetti is a dataset of face images collected between 1992 and 1994 at AT&T Laboratories Cambridge. It contains 400 face images: 10 different images of each of 40 distinct subjects.

All images share the same background and size; they were converted to gray levels and the pixel values were scaled to the range 0 to 1.

This dataset will be our main reference for the rest of this study; the training and testing subsets will be drawn from it.

distinct faces of Olivetti dataset

When displaying the 10 pictures of each subject, we can see that the 10 images contain different facial expressions and lighting conditions.

10 faces for each subject in Olivetti dataset

As our machine learning models need vectors, we use NumPy's reshape function to transform our data from a (400, 64, 64) array of images into a (400, 4096) matrix of flattened vectors.

Then we split the data into training and testing sets: 70% (7 images per subject) for training and 30% (3 images per subject) for testing.
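As a minimal sketch, this preparation step can be written with scikit-learn's built-in loader (the library ships the Olivetti dataset; the random seed is an arbitrary choice):

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split

faces = fetch_olivetti_faces()            # 400 grayscale 64x64 images, scaled to [0, 1]
X = faces.images.reshape((400, 64 * 64))  # flatten each image into a 4096-d vector
y = faces.target                          # subject labels 0..39

# Stratified 70/30 split: 7 training and 3 testing images per subject
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
```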

Custom dataset

I used pictures of myself and 3 of my colleagues, along with some well-known actors from my favorite series “La Casa De Papel”, to make a dataset of 10 subjects with about 10 pictures each.

A custom dataset of 10 subjects with about 10 pictures each

Methodology

Our purpose is to build a facial recognition system that needs as little training data as possible. The main reason behind this constraint is that it is more practical for a supervisor to train the model with one or a few pictures per student rather than having to build a large dataset with many images of the same person.

We will be comparing two main face classification models, PCA dimensionality reduction, and pretrained CNNs.

To perform face recognition, the following steps will be followed:

  • Detecting all faces included in the image (face detection).
  • Cropping the faces and extracting their features.
  • Applying a suitable facial recognition algorithm to compare faces with the database of students and lecturers.
  • Providing a file recording the identified attendees.

Using PCA and different classifiers to recognize faces

The purpose of the following study will be to perform facial recognition using six different classification models to see which one can be the best candidate to be used as an attendance system.

Principal Components Analysis

The first step is to normalize all faces of the training set by removing the features common to all of them, so that every face is left with only its unique features. This is done by subtracting the average face (the pixel-wise mean over the dataset) from each face.

Our image vectors will have 64 x 64 = 4096 components each. These vectors are created by flattening the 2-dimensional image into a single vector, row by row.

From a numerical point of view, this large number of components is excessive for representing such images. In order to reduce the size of the data, we apply the PCA method to select only the main components of our images.

As our face data has many dimensions, PCA lets us merge or remove the most correlated components and look for the directions capturing the maximum variance, so that we keep only the m most representative components.

The number of components m will be chosen according to the best accuracy obtained when running classifiers on our data. The procedure consists of looping over several candidate values, constructing a PCA model for each number of principal components, then building a classifier and computing the accuracy from the confusion matrix, to produce a plot from which the best number of components can be read off.
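A sketch of that selection loop, using the SVM with RBF kernel discussed below as the reference classifier (whitening the PCA output is an assumption, not stated above):

```python
from sklearn.decomposition import PCA
from sklearn.svm import SVC

components_range = range(10, 110, 10)
accuracies = []
for m in components_range:
    pca = PCA(n_components=m, whiten=True).fit(X_train)
    clf = SVC(kernel="rbf").fit(pca.transform(X_train), y_train)
    accuracies.append(clf.score(pca.transform(X_test), y_test))
# Plot components_range against accuracies and pick the smallest m
# that reaches the accuracy plateau.
```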

Once the m eigenfaces are chosen, we can reproduce any face from the training set using these eigenfaces, as shown in the picture.

After removing the average face from the training set, we convert the 2D images into vectors. Then we apply PCA to pick the m most representative components. Once the m eigenfaces are chosen, it is possible to reproduce any face from the training set using them.

Each face from the training set can be expressed as the mean face plus a weighted sum of the m selected eigenfaces, which is the representation of that face in the eigenvector space. The weight associated with each eigenface represents its contribution to the reconstruction of the original face.

Once all faces of the training set are converted to their corresponding weight vectors, we are able to reproduce the training faces by representing them in the eigenspace.
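With scikit-learn, projection and reconstruction reduce to two calls; here `pca` is the model fitted above, `pca.mean_` is the average face and the rows of `pca.components_` are the eigenfaces:

```python
weights = pca.transform(X_train)                # one m-d weight vector per face
reconstructed = pca.inverse_transform(weights)  # mean face + weighted sum of eigenfaces

# Equivalently, written out by hand:
# weights = (X_train - pca.mean_) @ pca.components_.T
# reconstructed = pca.mean_ + weights @ pca.components_
```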

But what if we would like to recognize a new face and match it with the corresponding subject from the training set?

Recognizing an unknown face

In order to recognize an unknown face, we perform the same steps that were applied to the training images. We start by normalizing that face (removing the average face) and converting it to a vector. Then we project the normalized face vector onto the eigenspace computed before with PCA, which means representing the unknown face as a combination of the m eigenfaces.

Once we get the weight vector of that unknown face, the next step would be to compare it with all the weight vectors of our training set using the Euclidean distance as a metric.

If the distance to every training vector is above a certain threshold value, we declare the face unknown; otherwise, we take the subject corresponding to the smallest distance and identify the face as that person.

Recognizing an unknown face using PCA
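A sketch of this matching rule; the threshold value is a placeholder that must be tuned on held-out data (note that `pca.transform` already subtracts the average face):

```python
import numpy as np

THRESHOLD = 10.0  # placeholder value, to be tuned

def identify(face_vector, pca, train_weights, train_labels):
    """Match an unknown face against the training weight vectors."""
    w = pca.transform(face_vector.reshape(1, -1))[0]       # project onto eigenspace
    distances = np.linalg.norm(train_weights - w, axis=1)  # Euclidean distances
    best = int(np.argmin(distances))
    if distances[best] > THRESHOLD:
        return None                                        # unknown face
    return train_labels[best]
```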

Using different classifiers

A more robust technique will be to use a classifier (like SVM or KNN) instead of matching the input with the closest face from the training set.

The classifiers that will be applied after reducing the image dimensionality are:

Linear Discriminant Analysis: It is a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

Logistic Regression (One-vs-All): As we have many different classes (subjects) in the dataset, we build one model per class, treating the data belonging to that class as positive and the rest as negative. We repeat this process until we have as many models as classes.

Gaussian NB: This is a learning algorithm based on applying Bayes’ theorem with the “naive” assumption of conditional independence. Without checking whether the likelihood of the features is actually Gaussian, we take this assumption and see whether the results are acceptable in terms of accuracy.

KNN: This is a non-parametric learning algorithm known as “lazy” because it builds no model during training; all the work happens in the classification phase, where the algorithm simply looks for feature similarity with the training points. K is the number of neighbours; it is often chosen odd to reduce the chance of tied votes.

Decision Tree: As image classification is just a particular case within Pattern Recognition, Decision Trees can be used for this purpose.

SVM: Support vector machines are designed to perform two-class classification. They can be adapted to K-class classification (for example with one-vs-one or one-vs-rest schemes) in a very efficient way.

Classification workflow

The choice of number of components for PCA

The following figure shows the number of principal components against the accuracy of the SVM classifier with a Radial Basis Function (RBF) kernel.

Accuracy against the number of PCA components for the SVM (RBF kernel) classifier

It seems that with only 40 components we obtain the same accuracy as with 100 or more components, which means we can reduce the computational time while keeping the same model performance. In the following, we will use 40 components.

Classifier comparison

After the PCA phase with the chosen number of components, we feed our data into the six classifiers listed above and compute the accuracy of each.

Comparing different classifiers’ accuracies
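A sketch of the comparison, with each classifier trained on the 40-component PCA projections (the hyperparameters below are common defaults, not necessarily those behind the reported figures):

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

pca = PCA(n_components=40, whiten=True).fit(X_train)
Xtr, Xte = pca.transform(X_train), pca.transform(X_test)

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic Regression (OvA)": LogisticRegression(max_iter=1000),
    "Gaussian NB": GaussianNB(),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM (RBF)": SVC(kernel="rbf"),
}
for name, clf in classifiers.items():
    print(f"{name}: {clf.fit(Xtr, y_train).score(Xte, y_test):.3f}")
```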

According to the above results, Linear Discriminant Analysis and Logistic Regression have the best performance and can be considered the best options for our classification problem. However, many data points are needed in order to predict on different pictures with different lighting.

Conclusion

A maximal accuracy of 93% is obtained on the Olivetti dataset, even though its images are relatively well positioned and illuminated. Such performance cannot be tolerated for an attendance system where errors are not allowed. Moreover, this technique needed many images (around 10 per subject) to train the model.

Let’s also not forget that our attendance system must be able to serve a big university, where the number of students changes over time, yet the PCA model would need to be retrained each time a new student is added to the database.

Facial recognition using Deep Metric Learning

Another approach to facial recognition consists of using a deep convolutional neural network architecture named Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

The Inception architecture derives its name from the “Network in Network” paper by Lin et al., combined with the famous “we need to go deeper” internet meme from the Inception movie.

In this work, we will be using a variant of the Inception architecture. The model consists of a fully connected layer with 128 hidden units followed by an L2 normalization layer on top of the convolutional base.

The two top layers are the ones responsible for creating 128-dimensional embeddings from images.

Deep metric architecture: the CNN is trained to convert images into 128-d vectors.

Deep Metric Learning Architecture

The aim is to make the CNN learn to convert each image into a vector such that the Euclidean distance between all faces of the same identity is small, and the distance between any pair of faces from different identities is large.

More precisely, given a pair of images of different identities we want the distance to be larger by at least a certain margin α, while given two images of the same identity we want their encodings to be similar, as they both represent the same person.

The margin α must satisfy the following inequality, where A is an anchor image, P a positive example (same identity) and N a negative example (different identity):

d(A, P) + α ≤ d(A, N)

To train the neural network, we generate triplets of images from our dataset. Then, after defining the triplet loss function we would like to minimize, we use gradient descent to tune the parameters of the CNN so that it learns an encoding that gives small distances between images of the same class and large distances between images of different classes.

Triplet technique for training the CNN

The triplet loss function to be minimized is, in the standard FaceNet formulation (with f the embedding function):
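L(A, P, N) = max(‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α, 0)

A minimal NumPy sketch of this loss over a batch of triplet embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """FaceNet triplet loss over batches of 128-d embeddings."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)  # squared d(A, P)
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)  # squared d(A, N)
    return float(np.mean(np.maximum(pos_dist - neg_dist + alpha, 0.0)))
```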

Instead of training the CNN model from scratch, which would need millions of images, we will use the pretrained model available in the Keras-OpenFace project and tweak the weights of the neural network so that our margin (α = 0.2) is satisfied.

Facial detection

Before feeding an image to the neural network, the faces it contains must be found, cropped and aligned.

The dlib library provides a HOG+SVM face detector and also a pretrained CNN face detector that can be run on either GPU or CPU.

As shown in the following figure, for the same sample image containing 9 faces, we tried both face detectors on the CPU. Both gave the same result, but the CNN took about 2 minutes while the HOG detector took only 6 seconds. After installing CUDA and compiling dlib to run on the GPU, the CNN ran about 10 times faster.

CNN facial detector performance

Face detection will be performed using dlib’s CNN model, as the documentation emphasizes its higher accuracy compared to the HOG face detector.
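A sketch of both detectors through dlib’s Python API (mmod_human_face_detector.dat is the CNN model file distributed by dlib; the image path is a placeholder):

```python
import dlib

hog_detector = dlib.get_frontal_face_detector()  # HOG + linear SVM
cnn_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

img = dlib.load_rgb_image("sample.jpg")          # placeholder image path
hog_faces = hog_detector(img, 1)                 # 1 = upsample the image once
cnn_faces = cnn_detector(img, 1)                 # results carry .rect and .confidence
```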

The pretrained model was trained on aligned face images, but faces in real pictures are usually not aligned properly. Therefore, the cropped face images must be aligned before being fed to the neural network to achieve high accuracy in the face recognition task.

Again, dlib has a pre-trained model for predicting and locating the facial landmarks, which can then be transformed to reference coordinates.

Image pre-processing
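A sketch of landmark prediction and alignment, using dlib’s pretrained 68-landmark shape predictor and its built-in face-chip extraction (the 96x96 output size is an assumption matching the OpenFace input):

```python
import dlib

cnn_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = dlib.load_rgb_image("sample.jpg")          # placeholder image path
aligned_faces = []
for det in cnn_detector(img, 1):
    shape = predictor(img, det.rect)             # locate the 68 facial landmarks
    aligned_faces.append(dlib.get_face_chip(img, shape, size=96))  # cropped + aligned
```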

Using the CNN and distance metric to compare faces

Finally, we can use the CNN to extract 128-dimensional vectors from the aligned face images. In this 128-d space, the Euclidean distance corresponds directly to a measure of face similarity.

After transforming all dataset images with the pre-trained CNN, we run the image we would like to recognize through the same process to get its own 128-d vector (embedding). Face similarity can then be measured using the Euclidean distance.
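A sketch of both steps, assuming `model` is the Keras-OpenFace network (96x96 RGB input, L2-normalized 128-d output; the pixel scaling shown is an assumption):

```python
import numpy as np

def embed(aligned_face, model):
    """Map an aligned 96x96 face image to its 128-d embedding."""
    x = aligned_face.astype("float32") / 255.0  # assumed preprocessing: scale to [0, 1]
    return model.predict(x[np.newaxis])[0]

def face_distance(emb_a, emb_b):
    """Euclidean distance in the 128-d embedding space."""
    return float(np.linalg.norm(emb_a - emb_b))
```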

The following figure shows graphically how this process works and illustrates an example of distances measured by the algorithm between similar and different faces.

The distance between the anchor-positive pair is smaller than the distance between its anchor-negative pair (0.59 < 1.43)

Optimal distance threshold

To be able to tell whether two faces belong to the same person or not, a distance threshold must be determined. To find the optimal value of the threshold, different values will be tested using our database images. We will plot accuracy versus different values of the threshold distance and pick the best value.

From the plot below, the value d=0.72 will be chosen as it gives the best accuracy (96.8%).

F1 score and Accuracy of facial recognition using different threshold distances
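A sketch of the sweep, assuming `distances` holds the pairwise distances between labelled faces and `same_identity` the corresponding boolean ground truth:

```python
import numpy as np

def threshold_accuracy(distances, same_identity, threshold):
    """Fraction of pairs correctly classified as same/different identity."""
    return float(np.mean((distances <= threshold) == same_identity))

thresholds = np.arange(0.3, 1.2, 0.01)
accuracies = [threshold_accuracy(distances, same_identity, t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(accuracies))]
```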

Using classifiers

Instead of using the smallest distance to decide which face was detected, it is more effective to use a KNN or SVM classification approach, where K is taken equal to 5.

The classifier was trained on 50% of the labelled images from the dataset, tested on the remaining images, and compared with an SVM classifier.

Accuracies using KNN and SVM classifiers
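A sketch of that setup on the embeddings, where `embeddings` and `labels` are the 128-d vectors and subject labels computed above (the linear SVM kernel is an assumption):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

emb_train, emb_test, lab_train, lab_test = train_test_split(
    embeddings, labels, test_size=0.5, stratify=labels, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(emb_train, lab_train)
svm = SVC(kernel="linear").fit(emb_train, lab_train)
print("KNN accuracy:", knn.score(emb_test, lab_test))
print("SVM accuracy:", svm.score(emb_test, lab_test))
```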

For the rest of the study, KNN classifier will be used.

Data visualization

As visualizing the embedded images with 128 dimensions is not an easy task, we will use t-distributed stochastic neighbour embedding (t-SNE), a machine learning algorithm for nonlinear dimensionality reduction developed by Laurens van der Maaten and Geoffrey Hinton. This method allows us to see the distribution of our images in a low-dimensional space of two dimensions.

In the following figure, t-SNE is applied to the 128-dimensional embedding vectors in order to summarize the dataset in 2D space. Apart from a few outliers, the identity clusters are well separated.

t-SNE visualization of subjects
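A sketch with scikit-learn’s t-SNE implementation (coloring by label assumes integer subject labels):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=15)
plt.show()
```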

Conclusion and comparison with other state-of-the-art models

We have seen that PCA has too low an accuracy to be used as an attendance system, and that it needs many faces as input training data. Besides, each time a new student arrives, the model must be retrained from scratch and the parameters (the number of PCA components) must be determined again, which doesn’t suit the requirements of the project.

On the other side, the pre-trained Inception CNN model for facial recognition, tuned with our dataset, gave very good performance (97% accuracy on our custom dataset), especially when we added the KNN approach. Also, this CNN doesn’t necessarily need many pictures, as we can measure distances between faces in the 128-dimensional space with only one reference (training) picture and one test picture. Furthermore, once trained, there is no need to refit the parameters of the CNN when new students enroll at the school.

For the purpose of this project, several other CNN models were trained and tested on the same dataset, but only the Inception model was described in the previous sections. The best accuracy was obtained with a ResNet network (a pretrained model with 29 convolutional layers), and it is the model that was chosen, as it was able to identify all faces correctly in our testing dataset.

This ResNet is a much deeper pretrained network than the Inception model; it achieved 99% accuracy on “Labelled Faces in the Wild”, a publicly available dataset containing more than 13,000 images.

The following table shows some of the best facial recognition models of the last 5 years. We can see that our model is comparable with the state-of-the-art verification methods tested on the same dataset.

Comparing the results of different verification methods tested on the Labelled Faces in the Wild dataset. The last row was added; the first part is taken from “Deep Face Recognition: A Survey” by Mei Wang and Weihong Deng (2019).

Code

Project: Face Recognition using CNN for attendance system:

Face Recognition Using PCA on Olivetti dataset:
