Pareidolia — Teaching Art to AI

Pareidolia is our first AI & Art project under the Alien Intelligence umbrella.

Yariv Adan
Towards Data Science


At Alien Intelligence, we explore our ability to teach art to AI: have it generate evidence of its understanding, and then analyse and interpret its response. We start with a simple “lesson,” and plan to gradually develop its content and complexity in an iterative process. The quality of the interactions and of the respective outcomes will depend both on the technical capabilities of the AI and on our own human ingenuity and limitations in communicating with it.

Pareidolia — it’s human to see humans

Pareidolia is the tendency for incorrect perception of a stimulus as an object, pattern or meaning known to the observer, such as seeing shapes in clouds, seeing faces in inanimate objects or abstract patterns, or hearing hidden messages in music. (Wikipedia)

Pareidolia — what do you see? (Google Image Search)

In this project, our goal is to communicate to the AI that certain works of art are perceptions or interpretations of objects and ideas from the real world. As the first class of such objects, we chose the most human object out there — the human face.

More specifically, we start by showing the AI “real” human faces captured in photos. Next, we show it artistic depictions of human faces as expressed in portrait paintings — ranging from realistic all the way to abstract representations.

We then probe the AI’s “understanding”: we show it new portrait paintings that it hasn’t seen before, and ask it to generate a realistic photo that captures the essence of the face in that art piece (yes, the opposite direction). We were curious to see what it would produce. We have started with realistic paintings, but our intention is to further expand to abstract, cubist, surrealist, as well as 3D works. Finally, in true “Pareidolia” fashion, we will give it photos of objects that are not human faces, and explore how it projects them as a realistic photo of a human face.

We aim to explore a junction that stretches not only the AI’s capability to understand and express art, but also our own human limits in communicating our goals to the AI and collaborating with it.

All magic comes with a price

What makes teaching AI art magical, at this introductory level, is that there is no need to provide it with precise definitions and complex explanations of what a face is, what a portrait painting is, and how they relate to each other. Instead, we just give it (many) examples of both, and it somehow learns. Sounds exciting? Well, beware! All magic comes with a price.

In our case, this price originates from the AI’s lack of any prior knowledge about faces, portraits, or art. In fact, it lacks almost any prior knowledge about us, our histories, and our aspirations, or about the world we live in. The only information it has is whatever is stored in the images it is shown. In particular, it does not have access to the many concepts and facts we take for granted when WE look at portraits and photos.

For example, the fact that faces are part of the human body, and that there are certain universal commonalities, such as the general shape of the head and the existence and positions of the eyes, ears, nose, and mouth. The fact that humans come in different genders, races, and ages, as well as in a spectrum of genetic variability — and that all of these are visible attributes of the face. There are also more nuanced facts, like the variability of hair, facial hair, and facial expressions. Furthermore, the knowledge of what the “natural” orientation of a human face is, and how it looks from above or in profile. It is this prior knowledge that allows us humans to effortlessly identify and analyse a human face, as well as to distinguish between a real face and something that just looks like it.

The ability to distinguish between a real face, and something that just looks like it

Similarly, there are concepts and facts relating to portraits. For example, the nuanced understanding that a portrait attempts to capture a face, but not necessarily in a direct and accurate manner, like a mirror does. Rather, there are built-in constraints as well as intended adaptations — the technique used, the artistic statement and agenda, the time and location of the execution, and the composition and setup of the artwork.

These are all pieces of knowledge we take for granted, and which are critical to the task of learning the relationship between photos and portraits. Pieces of knowledge to which the AI has no access.

Admittedly, one could think of elaborate ways of communicating these to the AI. For example, providing labels for the photos and portraits that explicitly detail gender, race, age, expressions, and other facial attributes (bald, with beard, moustache, blonde hair, long nose, thick eyebrows, …). However, we made a conscious decision not to use these, to see how far we could get with merely the unlabelled photos and portraits. We wanted to keep our dialogue with the AI simple.

Portraits from Mars and photos from Venus

By now, the importance of the images we use as training examples should be obvious: they encapsulate all the information the AI has access to. Let’s take a closer look at them, then.

In an ideal world, we would provide the AI with pairs of images: a photo of a person’s face, and a matching portrait of that same person. Unfortunately, such datasets don’t exist. Most of the portraits we have are from before the camera was invented, and most of the photos we have are of people who didn’t feel a need to get their portrait painted.

Moreover, the publicly available datasets of photos of human faces are often based on images of celebrities from across the web. These are heavily biased towards young, white, good-looking, fashionable, smiling faces, captured from an optimally aligned frontal position. This is in great contrast to the distribution of ages, expressions, positions, and textures that we find in portraits (except that those, too, are mostly of white subjects). This point is best illustrated by examples:

Portraits
Photos

This seemingly simple difference introduces a HUGE challenge for our AI, since the two datasets (photos vs portraits) actually represent two very different views of the human population. This indeed had a very clear (and expected) effect on the learning, understanding, and output of the AI.

Again, there are ways to try and address this issue, ranging from using a more representative dataset of photos (easier said than done, and often at the expense of quality) all the way to creating “synthetic portraits”. That is, algorithmically manufacturing “artistic” portraits from photos, and using these as pairs.

However, as before, we decided to stick with simplicity at this stage, and not to use synthetic portraits.

Synthetically generated portraits (https://deepart.io/latest/)
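For the curious, here is a minimal sketch of how such synthetic portraits are typically produced, using Gatys-style neural style transfer. This is our assumption about what services like deepart.io do, not part of our pipeline; the file names, layer choices, and loss weights below are illustrative.

```python
# Sketch: stylise a face photo with a painting's "style" (Gatys et al. approach).
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms
from torchvision.utils import save_image

device = "cuda" if torch.cuda.is_available() else "cpu"
mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)

def load(path, size=256):
    img = Image.open(path).convert("RGB")
    x = transforms.functional.to_tensor(
        transforms.functional.resize(img, (size, size)))
    return (x.unsqueeze(0).to(device) - mean) / std  # ImageNet normalisation

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval().to(device)
for p in vgg.parameters():
    p.requires_grad_(False)

CONTENT_LAYERS = {21}               # conv4_2: carries the image's semantics
STYLE_LAYERS = {0, 5, 10, 19, 28}   # conv1_1..conv5_1: carry texture/brushwork

def features(x):
    content, grams = [], []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in CONTENT_LAYERS:
            content.append(x)
        if i in STYLE_LAYERS:
            _, c, h, w = x.shape
            f = x.reshape(c, h * w)
            grams.append(f @ f.t() / (c * h * w))  # Gram matrix = style statistics
    return content, grams

photo = load("face_photo.jpg")         # hypothetical input paths
painting = load("portrait_style.jpg")
target = photo.clone().requires_grad_(True)

content_ref, _ = features(photo)
_, style_ref = features(painting)

opt = torch.optim.Adam([target], lr=0.02)
for step in range(300):
    opt.zero_grad()
    content_out, style_out = features(target)
    loss = sum(F.mse_loss(a, b) for a, b in zip(content_out, content_ref))
    loss = loss + 1e4 * sum(F.mse_loss(a, b) for a, b in zip(style_out, style_ref))
    loss.backward()
    opt.step()

save_image((target * std + mean).clamp(0, 1), "synthetic_portrait.jpg")
```

The optimisation pushes one image to match the semantics of the photo and the brushwork statistics of the painting; pairing each photo with its stylised output would yield the kind of synthetic photo–portrait pairs described above.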

Now that we are finally done with this long introduction, let’s have a look at our project, and what we actually produced.

Project Pareidolia

We thought it would be insightful and fun to reverse the artistic process, and have our outputs be synthesised “realistic” photos based on a portrait, rather than the other way around. Moreover, we didn’t want the AI to “simply” apply a stylistic filter over the original portrait and make it look more like a photo. Instead, we wanted to capture the semantics of the portrait, and recreate an “artistic projection” of it into what looks like a realistic photo.

In order to achieve that, we set out to train the AI to separate the semantics of the photos and portraits from their style, and to map those semantics to a shared “face” space.

We then probed how well the AI succeeded in this task by asking it to generate a new synthetic photo based on a portrait it had never seen before.

We thought this would be an insightful demonstration of “understanding” — both of the portrait, as well as of the human face.

Light technical interlude (read at your own risk)

While, as noted above, we wanted to keep the inputs as simple and as authentic as possible, we still had to apply some simple modifications, specifically face cropping. As humans, we are naturally drawn to the face in a portrait. However, as can be seen below, in reality the face occupies only a small portion of the portrait. The rest is filled with the body, the mise-en-scène, and mostly background. While we may not be bothered by it, it provides an abundance of distracting information for an AI that lacks any context and prior knowledge. So, in order to focus on what’s important, we cropped the faces in both collections.

Portraits — the face occupies only a small portion from the overall painting
Portraits — cropped faces
Photos — cropped faces
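For the technically curious, here is a minimal sketch of what this cropping step might look like, using OpenCV’s stock Haar-cascade face detector as an illustrative choice (we don’t claim it was the exact tool used; classical detectors trained on photos often miss painted faces, so failures have to be handled):

```python
# Sketch: detect the largest face in an image and crop it with some context.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(path, margin=0.25, size=256):
    """Return the largest detected face, padded and resized, or None."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                   # no face found; skip image
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest box
    pad = int(margin * max(w, h))                     # retain some hair/chin context
    x0, y0 = max(x - pad, 0), max(y - pad, 0)
    crop = img[y0:y + h + pad, x0:x + w + pad]
    return cv2.resize(crop, (size, size))

face = crop_face("portrait_01.jpg")                   # hypothetical file name
if face is not None:
    cv2.imwrite("portrait_01_face.jpg", face)
```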

Now that we had our input examples sorted, we had to choose the AI method that matched our goals.

Since we didn’t want to apply a “simple” filter, we decided not to go with a style-transfer approach. Nor did we want to use a machine learning model that looks at the pixel-level similarity between the source portrait and the target synthetic photo. So we looked for GAN models that operate on the semantics of the image (GAN stands for Generative Adversarial Network — a machine learning technique that builds on ideas from game theory to generate outputs that match a desired criterion). Since we had the additional constraint of two independent distributions (meaning, we didn’t have photo–portrait pairs to train on, but rather two separate collections), we experimented with different members of the broader CycleGAN family, trying various options and modifications, and eventually landed on a slightly modified version of MUNIT (10 epochs × 100k iterations).
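To make the approach more concrete, here is a heavily simplified, self-contained sketch of the MUNIT idea, with illustrative layer sizes (this is not our actual training code): every image is decomposed into a spatial “content” code and a small global “style” vector, and translating a portrait into the photo domain amounts to re-rendering its content with a style code from the photo domain. The real MUNIT trains per-domain encoders and decoders with adversarial and reconstruction losses, all omitted here.

```python
# Sketch of the MUNIT decomposition: content (semantics) + style (rendering).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentEncoder(nn.Module):
    """Extracts a spatial content code: the semantics of the face."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 7, 1, 3), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, 2, 1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class StyleEncoder(nn.Module):
    """Extracts a global style vector: the domain's rendering style."""
    def __init__(self, style_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 7, 1, 3), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, style_dim))
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Renders a content code in a given style via AdaIN-like modulation."""
    def __init__(self, style_dim=8):
        super().__init__()
        self.affine = nn.Linear(style_dim, 256 * 2)   # per-channel scale/shift
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(256, 128, 5, 1, 2), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 5, 1, 2), nn.ReLU(),
            nn.Conv2d(64, 3, 7, 1, 3), nn.Tanh())
    def forward(self, content, style):
        gamma, beta = self.affine(style).chunk(2, dim=1)
        c = F.instance_norm(content)                  # strip the source style
        c = c * (1 + gamma[..., None, None]) + beta[..., None, None]
        return self.net(c)

# Translating a portrait into the photo domain (shapes only; untrained nets):
portrait = torch.randn(1, 3, 256, 256)     # stand-in for a cropped portrait
content = ContentEncoder()(portrait)       # what the face *is*
photo_style = torch.randn(1, 8)            # a style sampled from the photo domain
fake_photo = Decoder()(content, photo_style)  # (1, 3, 256, 256) "photo"
```

During training, the reconstruction and adversarial losses are what force the content space to be shared between portraits and photos; at inference time, a never-before-seen portrait is encoded and re-rendered exactly as in the last lines above.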

Drum roll: results and concluding remarks

Here are the results! These are 3 palettes, each containing 24 pairs (4 per row × 6 rows). Each pair is made of the original portrait on the left and the synthesised photo on the right.

Some photos are surprisingly good, and some are awfully bad. One thing is clear: the AI is a victim of the bias we imposed on it through our “celebrity” photo training set. In its universe of photos, people are 20–30 years old, looking straight at the camera, smiling, with perfect skin and straight hair. Too old or too young, facial hair, curly hair, or slightly unexpected angles — these don’t pan out well. Still, it is thought-provoking to consider that all of this was done with no context or explanations.

Each palette contains 24 pairs of an original portrait and the generated “photo”

Does the AI understand art? Ours definitely doesn’t. Certainly not in the way humans do. However, what it produced is definitely interesting and encouraging. Some of the results seem to point at fundamental gaps in understanding, yet others are surprisingly good and exciting. Moreover, as we experimented with the project, we floated many ideas on how to make it even better. The key question, though, is whether this is a worthwhile journey: can we gain new insights about our world, and about our perception of it, through this dialogue — by trying to teach art to the AI? We plan to continue exploring this question.

Enjoyed? Check out our next project – Rosy AI
