
Snapchat was made popular by putting funny dog ears on people’s heads, swapping faces and other tricks that, beyond being funny, look impossible, even magical. I work in the digital visual effects industry, so I am familiar with that magic… and with the desire to understand how it works behind the scenes.
Behind the magic
Modifying people’s faces is routine work in Hollywood visual effects. It’s a well-understood craft nowadays, but it typically requires dozens of digital artists to achieve a photorealistic face transformation. How can we automate that?
Here’s a simplified breakdown of the steps these artists follow:
- Tracking the position, shape and movement of the face relative to the camera in 3D
- Animation of the 3D models to snap on the tracked face (e.g. a dog nose)
- Lighting and rendering of the 3D models into 2D images
- Compositing of the rendered CGI images with the live action footage
Automating steps 2 and 3 is not very different from what happens in video games; it’s relatively straightforward. Compositing can be simplified to putting the 3D foreground over the live background: easy. The challenge is the tracking: how can a program ‘see’ the complex motion of a human head?
Tracking faces with Artificial Intelligence
The Computer Science community has been trying to track faces automatically for a long time, and it’s hard. In recent years, Machine Learning came to the rescue, and many Deep Learning papers are published on the topic every year. I spent a while looking for the "state of the art" and realised that doing this in real time is VERY HARD! A good reason to try and tackle the challenge (and it would work nicely with the AR beauty mode I have implemented).
"trying to track faces.. it’s hard.. doing this in real-time is VERY HARD!"
Here’s how I did it.
Designing the network
Convolutional Neural Networks are popular for visual analysis of images and commonly used for applications such as object detection and image recognition.

For a deep neural network to be evaluated in real time (at least 30 times per second), a compact network is desired¹. With the popularity of Machine Learning on smartphones, new models are published every year that push the limits of efficiency, offering a trade-off between precision and computational overhead. Among such models, MobileNet, SqueezeNet and ShuffleNet are popular for applications on mobile devices, thanks to their compactness.

ShuffleNet V2¹ was recently introduced and offers state-of-the-art performance, coming in various sizes to balance speed and accuracy. It also ships with PyTorch, one more reason to pick that model.
Choosing the features to learn

Now I need to decide what features the CNN should learn. A common approach is to define a list of anchor points for the key parts of the face, also called ‘facial landmarks’.
The points are numbered and placed strategically around the eyes, eyebrows, nose, mouth and jawline. I want to train the network to predict the coordinates of each point, so I can later reconstruct masks or geometric meshes from them.
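To make this concrete, here is a minimal sketch of how the torchvision ShuffleNet V2 can be repurposed as a landmark regressor. The 68-point markup with 3 coordinates per point and the 224×224 input size are my assumptions, and the head below is shorthand rather than a final architecture:

```python
import torch
from torchvision import models

# Minimal sketch: reuse torchvision's ShuffleNet V2 as a backbone and turn
# its classification head into a landmark regression head.
# Assumes a 68-point markup with 3 coordinates (x, y, z) per point.
NUM_POINTS, DIMS = 68, 3
model = models.shufflenet_v2_x2_0(num_classes=NUM_POINTS * DIMS)

# A forward pass on a batch of face crops returns flat coordinates that can
# be reshaped to (batch, 68, 3). The 224x224 input size is also an assumption.
dummy = torch.randn(4, 3, 224, 224)
landmarks = model(dummy).view(-1, NUM_POINTS, DIMS)
```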
Building a training dataset
Because I want to augment videos with 3D effects, I looked for a dataset with 3D landmark coordinates. 300W-LP is one of the few datasets that come with 3D positions; it’s pretty large and, as a bonus, offers a good diversity of face angles. I also want to benchmark my solution against the state of the art, and recent publications test their models on AFLW2000–3D, so I train on 300W-LP and test on AFLW2000–3D for comparison.



A note on these datasets: they are meant for the research community and are generally not free for commercial use.
Augmenting the dataset
Dataset augmentation improves training accuracy by adding even more variation to the existing set. I apply the following transformations, by a random amount, to each image and its landmarks to create new ones: rotation up to ±40° around the centre, translation and scale up to 10%, and horizontal flip. For additional augmentation, I apply a different random transformation in memory to each image on every learning pass (epoch).
It’s also necessary to crop the input image close to the bounding box of the landmarks, so the CNN sees the landmarks at consistent relative locations. That’s done as a preprocessing step to save on load time from disk during training.
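For illustration, here is a rough sketch of such a random transform applied consistently to an image and its landmarks with OpenCV. The function name and parameters are mine, and the horizontal flip is simplified (a real implementation must also mirror the landmark ordering):

```python
import cv2
import numpy as np

def random_augment(image, landmarks, max_rot=40, max_shift=0.1, max_scale=0.1):
    """Randomly rotate, translate, scale and flip an image together with its
    (N, 3) landmark array (only x/y are transformed here). Sketch only."""
    h, w = image.shape[:2]
    angle = np.random.uniform(-max_rot, max_rot)
    scale = 1.0 + np.random.uniform(-max_scale, max_scale)
    tx = np.random.uniform(-max_shift, max_shift) * w
    ty = np.random.uniform(-max_shift, max_shift) * h

    # Rotate and scale around the centre, then translate.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[:, 2] += (tx, ty)
    image = cv2.warpAffine(image, M, (w, h))

    # Apply the same affine transform to the landmark x/y coordinates.
    pts = np.hstack([landmarks[:, :2], np.ones((len(landmarks), 1))])
    landmarks = landmarks.copy()
    landmarks[:, :2] = pts @ M.T

    if np.random.rand() < 0.5:                 # horizontal flip
        image = image[:, ::-1].copy()
        landmarks[:, 0] = w - 1 - landmarks[:, 0]
        # NOTE: the left/right landmark indices should also be swapped here.
    return image, landmarks
```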
Designing the loss function

Typically, an L2 loss function is used to measure the prediction error for landmark positions. A recent publication⁴ describes a so-called Wing loss function that performs better for this application, which I could verify. I parametrise it with w = 10 and ε = 2, as suggested by the authors, and sum the result over all landmark coordinates.
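For reference, here is a minimal PyTorch version of that loss with the w = 10 and ε = 2 values mentioned above (my own transcription of the formula from the paper):

```python
import math
import torch

def wing_loss(pred, target, w=10.0, eps=2.0):
    """Wing loss (Feng et al., 2018) summed over all landmark coordinates.
    `pred` and `target` have shape (batch, n_points, dims)."""
    x = (pred - target).abs()
    # C keeps the logarithmic and linear pieces continuous at |x| = w.
    C = w - w * math.log(1.0 + w / eps)
    loss = torch.where(x < w, w * torch.log(1.0 + x / eps), x - C)
    return loss.sum()
```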
Training the network
Training a deep neural network is a very expensive operation that requires powerful computers. Using my laptop would literally have taken weeks for one training phase, and building a decent setup costs thousands of dollars. I decided to leverage the cloud so I could pay for just the compute power I need.
I chose Genesis Cloud, which offers very competitive prices and $50 of free credit to get started. I build a Linux VM with a GeForce GTX 1080 Ti, prepare an OS and storage image where I set up PyTorch, and upload my code and the datasets, all through ssh. Once the system is set up, it can be started and shut down on demand, and creating a snapshot lets me resume the work where I left it.

The inner training loop processes mini-batches of 32 images to maximise parallel computation on the GPU. A learning pass (epoch) processes the entire set of about 60,000 images and takes about 4 minutes. The training converges around 70 epochs, so I let it run overnight for 100 epochs to be safe.
I use the popular Adam optimiser, which automatically adapts the learning rate, starting with a rate of 0.001. I found that setting the initial learning rate right is critical: if it’s too small, the training converges too early on a sub-optimal solution; if it’s too large, it has difficulty converging at all. I found the value through trial and error, which is time consuming… and actually costly when paying for cloud compute per use!
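Put together, the inner loop looks roughly like this. It’s a sketch: `model`, `train_set` and `wing_loss` stand for the ShuffleNet V2 regressor, the augmented 300W-LP data and the loss sketched earlier:

```python
import torch
from torch.utils.data import DataLoader

loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # initial rate 0.001
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(100):                      # converges around epoch 70
    model.train()
    for images, landmarks in loader:          # mini-batches of 32 images
        images, landmarks = images.to(device), landmarks.to(device)
        optimizer.zero_grad()
        pred = model(images).view_as(landmarks)   # (batch, 68, 3)
        loss = wing_loss(pred, landmarks)
        loss.backward()
        optimizer.step()
```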
Evaluation
All these efforts paid off: with the bigger network, ShuffleNet V2 2x, I obtain a Normalised Mean Error (NME) of 2.796 on AFLW2000–3D. That beats the state-of-the-art model⁵ on that dataset, with its NME of 3.07, by a good margin, despite that model being much heavier! 💪
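For context, the NME is simply the average point-to-point error divided by a normalisation term; for AFLW2000–3D that term is, to my knowledge, commonly the size of the ground-truth bounding box:

```python
import numpy as np

def nme(pred, gt):
    """Normalised Mean Error for one image. `pred` and `gt` are (68, 2) or
    (68, 3) landmark arrays; the error and the normalising box use x/y only,
    which is the usual protocol as far as I can tell."""
    w, h = gt[:, :2].max(axis=0) - gt[:, :2].min(axis=0)
    err = np.linalg.norm(pred[:, :2] - gt[:, :2], axis=1)
    return err.mean() / np.sqrt(w * h)
```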
A comparison of the predicted landmarks with the ground truth confirms the numbers: the landmarks find their way with precision even at large angles (though AFLW2000–3D contains challenging cases with extreme angles where my model fails).

I don’t have a big CUDA GPU on my MacBook Pro for inference (evaluating the model). To make the most of the hardware, I convert the model to the portable ONNX format and use Microsoft’s ONNX Runtime library, which runs inference much more efficiently on my machine (yes, it runs on OSX!). The inference time is under 100 ms on the CPU, which is pretty good, though not real time. But I keep in mind that I am using the biggest version of ShuffleNet V2 (2x) here for maximal precision, and I can opt for a smaller one for speed (e.g. the 1x version is 4 times faster). I could also get better numbers running on a GPU.
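The conversion and CPU inference boil down to a few lines. The tensor names and the 224×224 input size are assumptions on my part:

```python
import numpy as np
import torch
import onnxruntime as ort

# Export the trained PyTorch model to the portable ONNX format.
dummy = torch.randn(1, 3, 224, 224)          # assumed input size
torch.onnx.export(model, dummy, "landmarks.onnx",
                  input_names=["image"], output_names=["landmarks"])

# Run inference on the CPU with ONNX Runtime.
session = ort.InferenceSession("landmarks.onnx",
                               providers=["CPUExecutionProvider"])
crop = np.random.rand(1, 3, 224, 224).astype(np.float32)   # normalised face crop
pred = session.run(None, {"image": crop})[0]                # (1, 68 * 3)
```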
All was set for a robust facial tracking system.. or so I thought.
Limitations with videos
I finally get to plug the trained network into a video stream. An additional step is required: a separate model detects the face bounding box, so we can crop the image close to where the landmarks will be, as in the training dataset. After a good amount of research, I opt for this lightweight face detector, which is very fast to evaluate yet precise (I did try others).
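The per-frame loop is conceptually simple. This is only a sketch: `detect_face` and `predict_landmarks` are hypothetical wrappers around the face detector and the ONNX landmark model:

```python
import cv2

cap = cv2.VideoCapture(0)                    # webcam stream
while True:
    ok, frame = cap.read()
    if not ok:
        break
    x, y, w, h = detect_face(frame)          # hypothetical detector call
    crop = frame[y:y + h, x:x + w]           # crop close to the face, as in training
    landmarks = predict_landmarks(crop)      # hypothetical, (68, 3) in crop coordinates
    landmarks[:, 0] += x                     # map x/y back to full-frame coordinates
    landmarks[:, 1] += y
```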
Despite all my efforts, with great sadness, I find that the result on videos is not robust at all: the markers shake a lot, drift over time and don’t follow eye blinks. 😱
What did I do wrong?
Shaky markers
After digging further, I realised what it is: the training dataset consists of still photographs, not video frames! The model does seem to work well on stills, but videos are different. They bring an additional challenge: temporal consistency (and motion blur).

As noted in this work⁶ from the University of Technology Sydney and Facebook Reality Labs, the ground-truth markers in the training dataset are annotated manually on each photo, which is not very precise. Different annotators place the landmarks at slightly different locations, as illustrated above.
Our network actually learns these imprecise landmark locations, which results in random jittering within a small zone at prediction time rather than an exact position.

A common approach to address this issue is to stabilise the predicted markers after the fact. To that effect, as suggested by this other publication⁷ from Google Research, I use the 1€ filter⁸ to smooth the motion noise. I found it gives better results than the traditional Kalman filter and is simpler to implement. I also use it to stabilise the box returned by the face detector, because it shakes a lot as well, which does not help.
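Here is a minimal, self-contained version of the filter, one instance per landmark coordinate, fed once per frame. The default parameters are illustrative and need tuning:

```python
import math

class OneEuroFilter:
    """Minimal 1€ filter (Casiez et al., 2012) for a single scalar signal."""
    def __init__(self, freq=30.0, min_cutoff=1.0, beta=0.01, d_cutoff=1.0):
        self.freq, self.min_cutoff = freq, min_cutoff
        self.beta, self.d_cutoff = beta, d_cutoff
        self.x_prev, self.dx_prev = None, 0.0

    @staticmethod
    def _alpha(cutoff, freq):
        # Smoothing factor of an exponential filter for a given cutoff frequency.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * freq)

    def __call__(self, x):
        if self.x_prev is None:
            self.x_prev = x
            return x
        # Smooth the derivative, then adapt the cutoff to the speed: slow motion
        # gets a low cutoff (less jitter), fast motion a high cutoff (less lag).
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff, self.freq)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, self.freq)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat

# Usage: keep one filter per coordinate and feed it the raw prediction each frame.
filter_x = OneEuroFilter(freq=30.0)
smoothed_x = [filter_x(v) for v in (100.0, 101.7, 99.6, 100.3)]
```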
The motion filter makes a big difference. But it’s not perfect, especially with a low-quality webcam like my laptop’s, which produces noisy videos that the model doesn’t seem to like very much.
The paper from UTS and Facebook RL mentioned above suggests fine-tuning the trained model, essentially as follows:

- Predict the landmark positions by computing the optical flow forward across frames (OpenCV can do that)
- Compute the optical flow backward from that result
- Compute a loss function based on the distance between the prediction from optical flow and the prediction from pre-trained model
I haven’t tried it myself but the demo video looks promising.
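A rough sketch of that forward/backward tracking step with OpenCV, where `prev_gray`/`next_gray` are consecutive greyscale frames and `pts_prev` the landmarks predicted on the first frame as a (N, 1, 2) float32 array; the names and the 1-pixel reliability threshold are my assumptions:

```python
import cv2
import numpy as np

# Track landmarks forward to the next frame, then back to the first one.
pts_fwd, st_f, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts_prev, None)
pts_bwd, st_b, _ = cv2.calcOpticalFlowPyrLK(next_gray, prev_gray, pts_fwd, None)

# Points that do not flow back to where they started are unreliable; the
# registration loss then compares the network prediction on the next frame
# with pts_fwd for the reliable points only.
fb_error = np.linalg.norm((pts_prev - pts_bwd).reshape(-1, 2), axis=1)
reliable = (st_f.ravel() == 1) & (st_b.ravel() == 1) & (fb_error < 1.0)
```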
Stiff eyes and eyebrows
The trained network struggles with eye motion: it doesn’t see eye blinks and can’t fully ‘close’ the eye landmarks. The eyebrow landmarks also seem stiff in their motion compared to what they are tracking.

My model became pretty good at predicting profiles, because there is a ton of them in the dataset (it’s actually perfectly balanced in terms of face angles), but it did not learn the eyes that well. A CNN performs better when the features it learns tend to sit around a consistent location; that’s why we have to detect the face bounding box before inference and crop around the landmarks during training: it puts the face at the centre every time.
As the face angle varies, the features inside the face (eyes, mouth and nose) change shape and position in the frame and can even disappear at wide angles. This makes it more difficult for the model to recognise them consistently. I tried augmenting the dataset with extreme eye poses (closed, wide open), but that had little effect.
Next steps
This project was fun but costly in time and money ($15 from my pocket on top of the initial $50 of free credit). It was a great opportunity to learn, but I don’t intend to reinvent the wheel. Since I started this journey, Google has open-sourced MediaPipe, which offers a cross-platform implementation of their model. That said, initial demos of MediaPipe show some drifting, stiffness and shaking, which is consistent with my findings.
Still, I have some ideas to improve my approach (starting with the optical flow based fine-tuning mentioned above).
3D Morphable Face Models

To build an app like Snapchat, we actually need more than a handful of 3D points. A common approach is to use a 3D Morphable Face Model (3DMM) and fit it to the annotated 3D points. The 300W-LP dataset actually comes with such data. The network could learn all the vertices of the 3D model instead of a small number of landmarks, so that a full 3D face is predicted directly by the model, ready for some AR fun!
Multiple Networks
To improve overall precision, the aforementioned Wing loss paper, as well as the work from Google Research, suggests predicting a few landmarks with a first, lightweight model. With that initial information, we can crop and align the face vertically, so the second model has to deal with less pose variation.
I believe we can push that concept of specialised models further to involve multiple networks. We could train different CNNs for specific face angles, say one network for large angles (60°+), one for intermediate angles (30° to 60°) and one for frontal angles (0° to 30°), and somehow combine them.
Alternatively, different networks could learn different face parts, one for the eyes, one for the mouth, etc., with the hope that simplifying the work of each model will increase precision. A first pass could detect the rough position and orientation of the head, as above, then crop and align the area of interest for the relevant expert networks.
Conclusion
Applying deep learning to face tracking is an active area of research where a lot of progress has been made in the last couple of years. There is still room for improvement, in particular for tracking videos. The temporal coherence of predicted facial features and the precision for key areas like the eyes and mouth remain a challenge.
I am excited to see what the future holds on that topic!
[1]: N. Ma et al., ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design (2018), ECCV
[2]: X. Dong et al., Style Aggregated Network for Facial Landmark Detection (2018), CVPR
[3]: X. Zhu et al., Face Alignment Across Large Poses: A 3D Solution (2016), CVPR
[4]: Z. Feng et al., Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks (2018), CVPR
[5]: J. Guo et al., Stacked Dense U-Nets with Dual Transformers for Robust Face Alignment (2018), BMVC
[6]: X. Dong et al., Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors (2018), CVPR
[7]: Y. Kartynnik et al., Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs (2019), CVPR
[8]: G. Casiez et al., 1 € filter: a simple speed-based low-pass filter for noisy input in interactive systems (2012), CHI
[9]: V.H. Phung et al., A High-Accuracy Model Average Ensemble of Convolutional Neural Networks for Classification of Cloud Image Patches on Small Datasets (2019), Applied Sciences