The world’s leading publication for data science, AI, and ML professionals.

FaceMath: An Algebra of Ideas in A.I.

Transforming Faces, Words, and Other Cool Stuff

UNSUPERVISED LEARNING WITH ARTIFICIAL INTELLIGENCE

Close your eyes and picture a doll. Now imagine that same doll can talk. Watch the doll talking. What you just did there in your head – the thing you saw talking – is the merger of the idea of a doll with the idea of talking. As it turns out, we can use Artificial Intelligence to add concepts together using algebra, and that’s what this article is about. I like to call it "an algebra of ideas", because we are using basic math to add concepts together, forming compound concepts.

Let’s talk about ideas and turning them into numbers. Back in this cartoon AI article, we saw how blocks of text could be turned into special fixed-length lists of numbers (embedded into vectors), with the clustering of those vectors revealing information about the topics of groups of comics. Individual words can also be embedded into vectors, with their meaning given through their relationships to other words. The traditional example is the vector equation "king" – "man" + "woman" = "queen": we can learn a representation such that the vectors for king, man, and woman can be used to find the vector for queen. This is algebra on words, and it turns out you can do the same thing with images!
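To make the word algebra concrete, here is a minimal sketch with made-up toy vectors (not from a real word2vec or GloVe model): compute king – man + woman and find the nearest remaining word by cosine similarity.

```python
import numpy as np

# Toy 3-d word vectors -- made-up values purely for illustration,
# not taken from any trained embedding model.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.0]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 1.0]),
    "apple": np.array([0.0, 0.9, 0.3]),
}

def nearest(target, exclude):
    """Return the word whose vector has the highest cosine similarity to target."""
    best, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

result = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # -> queen
```

With a real embedding model the vectors are hundreds of dimensions and learned from text, but the arithmetic is exactly this simple.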

Interpolating between young and old faces (between faces, not ages). More on how we did this below.

By finding good encodings (i.e., embeddings) for a set of images, you can start to do the same kind of "feature math" we did on words, to modify the pictures in cool ways. With faces, for example, a model can be trained to capture features such as age, gender, or whether or not the face is wearing glasses. However, the neural network used to model faces (e.g., an autoencoder or GAN) is different from the type of model used to embed words and their relationships to each other (word embeddings like word2vec, GloVe, fastText, ELMo, BERT, XLNet, etc.).

Motivating Examples from the Literature

Applying this algebra of ideas concept to images has many practical applications. There are use cases ranging from Snapchat’s gender swap filter to face matching with image feature extraction. A famous example of this kind of technology is thispersondoesnotexist.com, which generates realistic-looking high-resolution photos of faces that are not real.

My motivation for researching this topic was some work by Alec Radford a few years back on DCGAN, in the paper "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks". Open the PDF and have a look at page 10. This is the point where the light in my head really turned on. They show how you can use a neural network to generate faces conditionally, by adding ideas together. We see on page 10 that the vector for "man with glasses" minus "man without glasses" plus "woman without glasses" gives us the face of a woman with glasses. It’s insane that this works. Those plus and minus signs motivated me to stay informed on progress in this area. It’s just super cool.

I’ll get to the stuff we built for this article in a bit. First, more on what’s already out there, starting with the latest and greatest work. Have a look at whichfaceisreal.com and see if you can detect the real versus fake photos. The cheat is to pay close attention to the background of the image. Although the fake images are super realistic, the models are not as well trained to generate photorealistic backgrounds. The generator does understand stuff like the optics of glasses, reflections, shadows, etc., but it sometimes fills in the background poorly.

Three recent amazing projects to come out are embedded below as videos. In the first, "Text-based Editing of Talking-head Video", we see that video editing is going to be massively disrupted as we can now use Deep Learning to edit what people say in video. The motivating example is a report on the stock market. In the second video, you see "Few-Shot Adversarial Learning of Realistic Neural Talking Head Models", where the model learns to take a few pictures (frames) and makes a model that can talk (a talking head model). This paper uses quite a few beautiful tricks like meta-learning and a realism score. Finally, the third video is about learning and generating faces. The author used his own dataset. This is similar to the approach we followed below, using an autoencoder. We adapted a variational autoencoder (VAE) in our project, but that’s just a detail. He nicely explains the concept of having a model learn a low dimensional face model.

Having seen a sample of what’s out there, let’s do our own thing.

Plan o’ Action

For this article, the strategy we used for creating face embeddings was to train a convolutional autoencoder; we eventually settled on a variational autoencoder. It’s basically the machine learning version of a zip file: you take an image, compress it down to a small amount of data, and then try to use that data to redraw the original image. By modifying that compressed data, you can adjust how the picture is reconstructed by the model. That compressed data becomes the image embedding. The code we used for this was modified from Thomas Dehaene’s excellent article here.
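The two ingredients that make an autoencoder "variational" are easy to sketch in plain numpy: the encoder predicts a mean and log-variance per latent dimension, a latent code is sampled with the reparameterization trick, and a KL-divergence term keeps those codes close to a standard normal. This is a toy illustration with hypothetical dimensions (32-d latent, batch of 4), not the actual convolutional Keras model we trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """VAE reparameterization trick: sample z = mu + sigma * eps with
    eps ~ N(0, I), so sampling stays differentiable w.r.t. mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """Per-sample KL(q(z|x) || N(0, I)) term of the VAE training loss."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

# A hypothetical 32-d latent code for a batch of 4 face images.
mu = 0.1 * rng.standard_normal((4, 32))      # encoder's predicted means
log_var = np.full((4, 32), -2.0)             # encoder's predicted log-variances
z = reparameterize(mu, log_var, rng)         # one latent code per face
print(z.shape)  # (4, 32)
```

The decoder then maps each z back to pixels; the full loss is reconstruction error plus this KL term.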

Instead of working with fruits as in the original article, we made use of the UTKFace Dataset (see also this paper). It’s licensed for research use, and so here we are, doing the research. The main advantage of this dataset is that it contains aligned faces cropped correctly and tagged by race, age, and gender.

With autoencoders, the input to and output from the neural network are the same during training. This basically means that the neural network needs to learn to compress and then decompress the image, and in doing so, it learns the essential features that make up faces. If it can learn about faces, then we can use the trained model to do cool things like generate new faces, morph between faces, and add concepts to faces, like adding glasses, or subtracting one gender and adding another, or reducing age, and so on.
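The "input is also the target" idea can be shown with the simplest possible autoencoder: a linear one trained by gradient descent on toy data. This is a stand-in for the real convolutional model, with invented sizes (16-d "images" that secretly live on a 2-d subspace, so a 2-d bottleneck suffices); constant factors are folded into the learning rate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 200 "images" flattened to 16-d vectors that actually live on a
# 2-d subspace, so a 2-d bottleneck can reconstruct them almost perfectly.
basis = rng.standard_normal((2, 16))
x = rng.standard_normal((200, 2)) @ basis

# Linear autoencoder: encoder W_e (16 -> 2), decoder W_d (2 -> 16).
W_e = 0.1 * rng.standard_normal((16, 2))
W_d = 0.1 * rng.standard_normal((2, 16))

def loss(x, W_e, W_d):
    """Mean squared reconstruction error."""
    return np.mean((x @ W_e @ W_d - x) ** 2)

initial = loss(x, W_e, W_d)
lr = 0.01
for _ in range(500):
    z = x @ W_e                # compress
    err = z @ W_d - x          # the input is also the training target
    W_d -= lr * z.T @ err / len(x)
    W_e -= lr * x.T @ (err @ W_d.T) / len(x)

final = loss(x, W_e, W_d)
print(initial, final)  # reconstruction error drops as the bottleneck learns
```

The real model swaps the matrices for convolutional layers and the toy vectors for face images, but the training signal is the same: reconstruct your own input.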

You will notice in this article that we are not trying to classify stuff by putting a label on it. We are instead interested in modelling the distribution of the data (faces) in the dataset and using that model to "do stuff" with math. This is the idea in unsupervised learning: you learn from the data without forcing the model to learn labels for the data.

Training the model to learn about faces

I worked on this project with Mary Kate MacPherson, and our first step was to train the autoencoder on the faces dataset. The structure is pretty much unchanged from the original article, other than increasing the number of epochs, as this had better results with the faces dataset. I spent about 2 whole days trying to write my own convolutional autoencoder based on my past work on audio stuff, but the results were just not as good as Tom’s VAE. The reconstructed faces turned out fairly well, though there’s definitely still some blur happening, as we see in the figures below. We will see later on that a sharpening filter can help with that.

Face reconstruction using variational autoencoder. The first row is the input and the second is the resulting output.
Another example of reconstructed faces.
And even more…

Algebra! Experiments on faces from the validation dataset

With the model trained, it was time to have some fun experimenting with the embeddings of pictures the model had not seen during training. The first experiment was a simple interpolation between two faces. The equation generating this was just a weighted average of the two face embeddings, with each iteration weighting the second face a bit more and the first face a bit less. This got some fun results!
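The interpolation itself is one line of math per frame. Here is a sketch with hypothetical 32-d embeddings standing in for the encoder's output; decoding each blended vector would render one frame of the morph.

```python
import numpy as np

def interpolate(z_a, z_b, steps=8):
    """Linearly blend two face embeddings: frame i weights face B a bit
    more and face A a bit less. Decoding each blend renders the morph."""
    weights = np.linspace(0.0, 1.0, steps)
    return [(1.0 - w) * z_a + w * z_b for w in weights]

# Hypothetical 32-d embeddings for two faces (stand-ins for encoder output).
rng = np.random.default_rng(0)
z_face1, z_face2 = rng.standard_normal(32), rng.standard_normal(32)
frames = interpolate(z_face1, z_face2, steps=8)
# frames[0] is face 1, frames[-1] is face 2; decoder(frames[i]) is frame i.
```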

The CEO to CTO morph.
And the gif, for your viewing pleasure.

And we can pull in celebrities!

Me morphing into Keanu Reeves. Because awesome!

Here is a gif of the transition from me to Keanu. It’s neat to see how the face morphs. That moustache growing is pretty mesmerising. It’s kind of choppy in places, but it’s a good start.

The Daniel Shapiro, PhD to Keanu morph as a gif.

And here is the next logical step:

Mathieu Lemay to Keanu morph.

And we had to do a few more, just for fun.

The Sam to Daniel Shapiro, PhD morph. The beard gets weird as it grows in from the outside in and the inside out.

The next experiment was messing around with the age of the face. The overall equation for this was:

"original face" – "average face of original face’s age" + "average face of new age"

And so to turn someone aged 35 into a 5-year-old, you would go:

i = embedding of the image of the 35-year-old

a = average of the embeddings of 35-year-old faces

n = average of the embeddings of 5-year-old faces

and when you push the vector i – a + n into the autoencoder’s decoder, you get a baby-face version of the original face.
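As code, the attribute shift is a single vector expression. Everything here is a hypothetical stand-in: the random vectors play the role of the encoder's output and the tagged group averages from the dataset.

```python
import numpy as np

def shift_attribute(z_image, z_from_group, z_to_group):
    """Move an embedding from one attribute group's mean to another's,
    e.g. from the average 35-year-old face to the average 5-year-old face."""
    return z_image - z_from_group + z_to_group

# Hypothetical 32-d embeddings (stand-ins for the trained encoder's output
# and for averages over embeddings of age-tagged UTKFace images).
rng = np.random.default_rng(0)
z_face_35 = rng.standard_normal(32)        # i: embedding of the 35-year-old
mean_35 = 0.5 * rng.standard_normal(32)    # a: mean embedding, 35-year-olds
mean_5 = 0.5 * rng.standard_normal(32)     # n: mean embedding, 5-year-olds

z_baby = shift_attribute(z_face_35, mean_35, mean_5)
# decoder(z_baby) would render the baby-face version of the original.
```

The glasses and gender experiments below use the same pattern with different group means.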

Baby face of me. It didn't quite delete the beard.

This turned out to work a lot better for very young faces than very old faces, since the blurriness of the images made it hard to show wrinkles and that kind of thing. The following image shows Mathieu Lemay turning into a baby.

Mathieu Lemay morphing into a devil baby!

And here is the gif of this magical process:

Baby eyes do not look good on an adult's face.

Adding Glasses worked pretty well:

Adding glasses to Mathieu Lemay.

And it worked on me as well:

Adding glasses to me.

The following is a morph of Sam’s face maximizing feminine features:

And now let’s talk about generating brand new faces. We can do a random walk in the embedding space to get faces like this:
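A random walk in the embedding space is just repeated small random steps from a starting code; decoding each point along the path yields novel but smoothly varying faces. A sketch, with an invented 32-d latent size and step scale:

```python
import numpy as np

def random_walk(z_start, n_steps=16, step_size=0.2, rng=None):
    """Wander through the latent space by adding small Gaussian steps;
    decoding each point gives a sequence of smoothly varying faces."""
    rng = rng if rng is not None else np.random.default_rng()
    path = [z_start]
    for _ in range(n_steps):
        path.append(path[-1] + step_size * rng.standard_normal(z_start.shape))
    return np.stack(path)

rng = np.random.default_rng(0)
walk = random_walk(np.zeros(32), n_steps=16, step_size=0.2, rng=rng)
# walk has shape (17, 32); decoder(walk[i]) renders the i-th generated face.
```

Small steps keep consecutive faces similar, which is why nearby points in the embedding space look alike.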

And the model seems to understand lighting:

Animation of generated faces. You can see that the lighting is a concept the neural network seems to understand.

And faces nearby in the embedding space look similar:

A smaller subset of the same pictures to show the similarity between images.

The generated faces can be pretty detailed when you blow them up:

Adding a post-processing step to sharpen the output from our face generator makes it look even more realistic! Below are sharpened versions of imagined faces (one male; one female):

Before
After
Before
After
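The sharpening itself is classical image processing, not deep learning. A minimal numpy sketch of an unsharp mask (subtract a blurred copy to boost local contrast), using a 3x3 box blur and a grayscale array; the real pipeline would apply this per color channel:

```python
import numpy as np

def sharpen(img, amount=1.0):
    """Simple unsharp mask: add back the difference between the image and
    a blurred copy of itself. img is a 2-d grayscale array in [0, 1]."""
    # 3x3 box blur via edge padding + neighborhood average.
    padded = np.pad(img, 1, mode="edge")
    blurred = sum(padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
                  for dy in range(3) for dx in range(3)) / 9.0
    return np.clip(img + amount * (img - blurred), 0.0, 1.0)

# A soft test "image" with a vertical edge: sharpening boosts edge contrast.
img = np.full((8, 8), 0.25)
img[:, 4:] = 0.75
out = sharpen(img, amount=1.0)
print(out[0, 4] - out[0, 3])  # larger than the original 0.5 edge contrast
```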

What about other datasets?

Recall that we wrote an article on waifu generation. Mary Kate MacPherson wanted to try this model on the dataset she curated for that project. Here is an early look at the results after training on the waifu dataset, morphing between anime features:

Waifu morph

A generated waifu:

Waifu morph as a gif:

Showing each of the frames that the gif is composed from:

Zoomed in gif to give you a clearer view of the details:

Conclusions?

We have a bunch more material that will probably go into a second article. I think the conclusion for this article is that we can do math in the image space and operate on images like ideas: face morphing, age modification, adding features like glasses (as we saw in the DCGAN paper), adding feminine features, generating faces, learning lighting, and more. We saw that this idea works on anime characters as well, and that the output realism can be improved with classical image processing techniques like a sharpening filter. In this article you followed along to see some background and applications of unsupervised learning on face datasets. If you liked this article, then have a look at some of my other articles, like "How to Price an AI Project" and "How to Hire an AI Consultant." And hey, check out our newsletter!

Until next time!

-Daniel Lemay.ai [email protected]

