Real-time facial recognition application to identify Computer Science students at the University of Sonora using PyTorch and Multi-task Cascaded Convolutional Neural Networks (MTCNN)

An explanation of how we completed our work, and how to use it

Martín Vega
Towards Data Science


Diana Laura Ballesteros Valenzuela, Eliud Gilberto Rodríguez Martínez, Hugo Francisco Cano Beyliss, Martín José Vega Noriega

Universidad de Sonora, Licenciatura, Hermosillo, Sonora, México

mjvnor@outlook.com, dballesterosvalenzuela@gmail.com, slabxcobra@gmail.com, eliud.giroma@gmail.com

GitHub: https://github.com/MJVNOR/Real-time-facial-recognition-MTCNN/tree/main

Abstract. This project was made with the objective of detecting the face of each computer science student at the University of Sonora in Hermosillo, Sonora, so that other projects can later be built on top of this recognition capability. For this project we use image processing with neural networks for facial recognition. We implemented MTCNN, a neural network that detects faces and facial landmarks in images and is one of the most accurate tools available today.

Keywords: facial recognition, neural networks, PyTorch

1. Introduction

The number of institutions that use facial recognition increases every day, because the technology has many benefits: identifying lost people, identifying possible thieves, recording which workers came to work on a given day, and so on. Facial recognition is: given an image of an “unknown” face, finding an image of the same face in a set of “known” images.

Facial recognition has gone through several stages of progress. What was once only a thing of science fiction can today be found in many applications and places.

Woodrow Wilson Bledsoe is considered the pioneer of this technology: in 1960 he worked on a system to classify the features of the human face. The procedure was very manual; he used a stylus and coordinates to precisely locate the eyes, nose, and mouth of each person.

Today we can see how facial recognition is part of the artificial intelligence that allows us to detect and identify human faces.

In this work, tools from the artificial intelligence area are used to identify human faces in real time and to determine whether a detected face belongs to our dataset of students, thus recognizing the person.

2. Proposed Artifacts

Real-time facial recognition application.

2.1 Content modeling

The first thing we did was use an MTCNN network to generate a dataset of the students’ faces. For each image it returns the detected face as a tensor, together with the probability that the detection is indeed a face (only one face per image).

from facenet_pytorch import MTCNN  # single-face detector (keep_all=False)
mtcnn0 = MTCNN(image_size=240, margin=0, keep_all=False, min_face_size=40)
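
The loader and idx_to_class used in the next snippet come from a torchvision ImageFolder dataset. A minimal sketch, assuming the student photos live in a photos/ folder with one subfolder per student (the folder name is our own illustration, not part of the original code):

from torchvision import datasets
from torch.utils.data import DataLoader

# Hypothetical layout: photos/<student_name>/<image>.jpg
dataset = datasets.ImageFolder('photos')
idx_to_class = {i: c for c, i in dataset.class_to_idx.items()}

# Yield (PIL image, class index) pairs one at a time
loader = DataLoader(dataset, collate_fn=lambda x: x[0])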

We then pass each face tensor through a pretrained model (an Inception-ResNet trained on the VGGFace2 dataset), which returns the most important data of each face in vector form; this vector is called an embedding.

import torch
from facenet_pytorch import InceptionResnetV1

resnet = InceptionResnetV1(pretrained='vggface2').eval()
embedding_list = []
name_list = []

for img, idx in loader:
    face, prob = mtcnn0(img, return_prob=True)
    if face is not None and prob > 0.92:
        emb = resnet(face.unsqueeze(0))
        embedding_list.append(emb.detach())
        name_list.append(idx_to_class[idx])

data = [embedding_list, name_list]
torch.save(data, 'data.pt')

In a loop we access the device’s camera to begin the recognition. We use a second MTCNN network to detect whether there is a face in each frame (if it detects one, it draws a box around each face) and to give us the probability that it is indeed a face. This time the input of the network is the frames from our camera, and, unlike the first one, this network can detect more than one face per frame. We use the vggface2 model again to embed each detected face.
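
In facenet_pytorch, multi-face detection is controlled by the keep_all flag; a minimal sketch of this second detector, reusing the settings of mtcnn0:

from facenet_pytorch import MTCNN

# keep_all=True returns every detected face per frame, not just the best one
mtcnn = MTCNN(image_size=240, margin=0, keep_all=True, min_face_size=40)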

With the saved embeddings of the students’ faces and the embeddings generated from our camera frames, we calculate the distance between each detected face and every face in the database; we then take the smallest distance and say that the detected face belongs to that student.

import cv2
from PIL import Image

load_data = torch.load('data.pt')
embedding_list = load_data[0]
name_list = load_data[1]

cam = cv2.VideoCapture(0)

while True:
    ret, frame = cam.read()
    if not ret:
        break
    img = Image.fromarray(frame)
    # Detect every face in the frame (keep_all=True) with its probability
    img_cropped_list, prob_list = mtcnn(img, return_prob=True)
    if img_cropped_list is not None:
        boxes, _ = mtcnn.detect(img)
        for i, prob in enumerate(prob_list):
            if prob > 0.90:
                emb = resnet(img_cropped_list[i].unsqueeze(0)).detach()
                # Distance to every embedding in the database
                dist_list = []
                for idx, emb_db in enumerate(embedding_list):
                    dist = torch.dist(emb, emb_db).item()
                    dist_list.append(dist)
                min_dist = min(dist_list)
                min_dist_idx = dist_list.index(min_dist)
                name = name_list[min_dist_idx]
                box = boxes[i]
                original_frame = frame.copy()
                if min_dist < 0.90:
                    frame = cv2.putText(frame, name + ' ' + str(min_dist), (int(box[0]), int(box[1])), cv2.FONT_HERSHEY_SIMPLEX, 1, (63, 0, 252), 1, cv2.LINE_AA)
                frame = cv2.rectangle(frame, (int(box[0]), int(box[1])), (int(box[2]), int(box[3])), (13, 214, 53), 2)
    cv2.imshow("IMG", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # press Esc to quit
        break

3. How the MTCNN works

The MTCNN model was proposed in the paper “Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks” by Zhang, Zhang, Li, and Qiao. Its main goal is to detect faces and their facial landmarks, and it consists of three stages of convolutional networks.

The model makes use of multi-task learning to combine the tasks of detecting faces and detecting their features. Training in this multi-task way improves generalization by taking advantage of the domain-specific information contained in the training signals of the related tasks, which is achieved by training the tasks in parallel while using a shared representation of the information.
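
As a rough illustration of this idea (our own sketch, not the architecture from the paper; the layer sizes are arbitrary), each stage can be thought of as one shared trunk whose features feed separate heads for face classification, bounding-box regression, and landmark regression, trained in parallel:

import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Illustrative only: one shared representation, three task heads."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 10, kernel_size=3), nn.PReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.face_cls = nn.Linear(10, 2)       # face / not a face
        self.box_reg = nn.Linear(10, 4)        # bounding-box offsets
        self.landmark_reg = nn.Linear(10, 10)  # five (x, y) landmarks

    def forward(self, x):
        shared = self.trunk(x)  # one representation shared by all tasks
        return self.face_cls(shared), self.box_reg(shared), self.landmark_reg(shared)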

The first step, before the three stages, is to take the image and resize it to different scales to build an image pyramid, which is the input to the first stage of the network.
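
A minimal sketch of how such a pyramid can be built (the 12 px minimum and the 0.709 scale factor match common MTCNN implementations, but the function itself is our illustration):

from PIL import Image

def build_pyramid(img, min_size=12, factor=0.709):
    """Shrink the image by `factor` repeatedly, collecting every scale
    until the smaller side would drop below `min_size`."""
    pyramid = []
    w, h = img.size
    scale = 1.0
    while min(w, h) * scale >= min_size:
        pyramid.append(img.resize((int(w * scale), int(h * scale))))
        scale *= factor
    return pyramid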

The first stage is a fully convolutional network (FCN), used instead of an ordinary CNN because it has no dense layer in its architecture. This Proposal Network (P-Net) is used to obtain candidate faces and their respective bounding boxes; its output is the set of all possible face candidates.

In the second stage, the output of the first stage is fed into a CNN called the Refine Network (R-Net), which reduces the number of candidates, calibrates the bounding boxes, and performs non-maximum suppression (NMS) to merge overlapping candidates. Its output tells us whether each input is a face or not.
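
Non-maximum suppression keeps the highest-scoring box and discards candidates that overlap it too much; a small self-contained sketch of the greedy version (our own illustration, with an arbitrary IoU threshold):

import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: repeatedly keep the best box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]  # candidate indices, best first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of box i with all remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]
    return keep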

Finally, the third stage, the Output Network (O-Net), is very similar to the R-Net; in addition, it gives us the positions of the eyes, nose, and mouth of each remaining candidate.
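
In the facenet_pytorch implementation we used, this landmark output is exposed through MTCNN.detect with landmarks=True; a minimal sketch (the image path is hypothetical):

from facenet_pytorch import MTCNN
from PIL import Image

detector = MTCNN(keep_all=True)
img = Image.open('some_photo.jpg')  # hypothetical path

# boxes: one [x1, y1, x2, y2] per face; points: five (x, y)
# landmarks per face (eyes, nose, and mouth corners)
boxes, probs, points = detector.detect(img, landmarks=True)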

4. How we calculated the distance

We used the Euclidean distance, which represents the minimum distance between two points in a plane: a straight line between them.

d(p, q) = √( (p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)² )

Image by author

This distance is calculated as the square root of the sum of the squared differences between the elements of the two vectors, as the formula above indicates.

dist = torch.dist(emb, emb_db).item()

We give this function the embedding of the face we are detecting and compute its distance to each face in our database; we save all the distances and take the shortest one.
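
For reference, torch.dist(a, b) with its default p=2 computes exactly the Euclidean formula above; a minimal hand-spelled equivalent, using illustrative 512-dimensional embeddings (the size InceptionResnetV1 produces):

import torch

emb = torch.randn(1, 512)     # illustrative embeddings
emb_db = torch.randn(1, 512)

by_hand = torch.sqrt(torch.sum((emb - emb_db) ** 2)).item()
built_in = torch.dist(emb, emb_db).item()
assert abs(by_hand - built_in) < 1e-4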

4.1. Detection results

The person below does not exist; we took the image from thispersondoesnotexist.com, put it in our database, and called him Andrew to show an example of the program working.

“Andrew” (non-existent person)

As we can see, the program detects “Andrew” at a distance of approximately 0.6237. The lower the distance, the more likely the detected face belongs to that person.

5. Mistakes we made and errors we found

We found that when we created the embeddings of the faces of the colleagues in our database, one colleague’s computer ran out of memory while another’s did not; in other words, you need to have more than 1 GB of RAM available.

We also tried to implement multithreading and failed miserably.

6. Conclusions and Future Work

We can say that using an MTCNN and computing distances between embeddings gave us very precise results (even with a small number of photos per person). The downside is that the FPS drops when a face is detected, and drops even more when that face is matched to a specific person.

We plan to change certain things in our application to minimize the FPS drop (such as using another method in addition to calculating distances) and to add more functionality to the project, in order to improve it and put the technology to use with our own colleagues.

References

  1. Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499–1503.
  2. Caruana, R. Multitask Learning. Machine Learning 28, 41–75 (1997). https://doi.org/10.1023/A:1007379606734
