
How Computers Play the Imitation Game: From Autoencoders to StyleGAN2s in Less Than 10 Minutes

A tourist guide to (novel) image generation for non-geeks

Photo by Wilhelm Gunkel on Unsplash

It all started in 1950 with a game played by three subjects. C would sit in a separate room asking open questions to A and B, receiving written answers and trying to discern from those whether A and B were humans or machines. The aim of C was to get it right, while that of both A and B was to fool C.

Seventy years have passed since machines started to study how to pass the Turing test, and they have come a long way.

I am an investor with some level of understanding of financial services after 15 years in the game, and a machine learning and TensorFlow free hitter, but in no way a machine learning engineer. Still, I think that content generation techniques tell us a lot about how good computers have become at mimicking humans – and can teach us some humility and self-awareness in the process.

This post has the ambition of summarising in less than 10 minutes the development of the most recent computer techniques for playing the Imitation Game. I agree, the fact that Turing associated success in this simple game with answering the deeper question "Can machines think?" is a big assumption, but one for another time. The 10 minutes start now.

Step 1: Autoencoders and Dimensionality Reduction

In order to replicate fairly what humans produce (and what we perceive as humanly produced), a computer should first learn to understand, or at least (to get ourselves as far as possible from the hairy concepts of consciousness and self-consciousness) to generalise content by synthesising its core dimensions.

In real life, it is typically possible to considerably reduce the number of features of a set without losing much information. Look at the following sequence of natural numbers.

Sequence = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

Although the sequence includes 10 numbers, it is obvious that the list represents the first 10 instances of the sequence 2ⁿ, with n starting at 0. It would be trivial for a person, equipped with at most pen and paper, to replicate the first 10, 20, 50 members of the sequence without memorising them all. We have, in other words, identified the inherent lower-dimensional characteristics of the sequence that are sufficient for its perfect replication. Or, more simply, we found a pattern. But can machines do this too?
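For the geeks in a hurry, here is what that pattern-finding amounts to in a couple of lines of Python – a toy illustration, not machine learning yet: once the rule 2ⁿ is spotted, the whole sequence can be regenerated from the rule itself instead of being memorised.

```python
# Toy illustration: once the rule 2**n is spotted, the whole sequence can be
# regenerated from the rule itself instead of being memorised.
def powers_of_two(count):
    return [2 ** n for n in range(count)]

print(powers_of_two(10))  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
print(powers_of_two(20))  # 20 members, still nothing memorised
```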

You won’t be surprised to know they can, and via the same test-and-learn methodology that a human would apply to this and more complicated sequences. This is exactly what autoencoders do.

What is an autoencoder? An autoencoder is a model that looks at the input, extracts an efficient latent representation (the encoder), and spits out something that hopefully looks as similar as possible to the input (the decoder).

Graphical representation of an autoencoder: from inputs to reconstructions

Let’s take a few low-resolution images. The job of the encoder is to squeeze the information present in the (32 height x 32 width x 3 RGB channels) pixels into lower dimensionality; that of the decoder is to take that representation and reproduce something as close as possible to the original images. The first time around our model does poorly, essentially guessing at random; then it does a bit better, then a bit better still, until hopefully, after many iterations, it becomes pretty good at it – how good depends on how much we want to squeeze the original information.

Original images vs. reconstructed images using 1 intermediate layer with 10 neurons – practically squeezing info included in 32x32x3 pixels into 10 neurons. Results are unsurprisingly poor (selu activation and 20 epochs for the geeks)
Reconstructed images using 1 intermediate layer with 100 neurons – results are obviously much better (architecture and other parameters remain)
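For the curious, here is roughly what the setup behind the images above could look like in Keras. This is a minimal sketch, not the exact code used for the figures: the layer sizes, the selu activation and the 20 epochs mirror the captions, while everything else (the loss, the optimiser, the variable names X_train and X_valid) is my own assumption.

```python
from tensorflow import keras

CODING_SIZE = 100  # try 10 to get the much blurrier reconstructions above

# Encoder: flattens the 32x32x3 pixels and squeezes them into CODING_SIZE numbers
encoder = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    keras.layers.Flatten(),
    keras.layers.Dense(CODING_SIZE, activation="selu"),
])

# Decoder: tries to rebuild the original image from those few numbers
decoder = keras.Sequential([
    keras.Input(shape=(CODING_SIZE,)),
    keras.layers.Dense(32 * 32 * 3, activation="sigmoid"),
    keras.layers.Reshape((32, 32, 3)),
])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(loss="mse", optimizer="adam")

# X_train: images scaled to [0, 1]; note that the target is the input itself
# autoencoder.fit(X_train, X_train, epochs=20, validation_data=(X_valid, X_valid))
```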

It won’t come as a surprise that if the information you want to encode is much (much) larger, or messier, and a pattern is more difficult to identify, our simple architecture won’t perform satisfactorily and will need to be expanded. But it’s always the same stuff. You expand the number of units in the encoding/decoding layers, you stack those layers on top of each other, you play with things called activation functions or hyperparameters, and you keep going from the inputs to the reconstructions. You get deep autoencoders (which include many layers), convolutional autoencoders (whose layers filter the information via special filters), recurrent autoencoders (which keep some kind of memory of what they have done), etc. But if you got the idea of the original architecture, it’s basically the same.
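As an example of that "same stuff, more layers" idea, here is a hedged sketch of a small convolutional autoencoder: the same inputs-to-reconstructions loop, but the Dense layers are replaced by filters that scan the image. All sizes and names below are illustrative assumptions.

```python
from tensorflow import keras

# Convolutional encoder: small filters scan the image and shrink it step by step
conv_encoder = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    keras.layers.Conv2D(16, kernel_size=3, padding="same", activation="selu"),
    keras.layers.MaxPool2D(),   # 32x32 -> 16x16
    keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="selu"),
    keras.layers.MaxPool2D(),   # 16x16 -> 8x8
])

# Convolutional decoder: transposed convolutions grow the image back to 32x32
conv_decoder = keras.Sequential([
    keras.Input(shape=(8, 8, 32)),
    keras.layers.Conv2DTranspose(16, kernel_size=3, strides=2, padding="same",
                                 activation="selu"),
    keras.layers.Conv2DTranspose(3, kernel_size=3, strides=2, padding="same",
                                 activation="sigmoid"),
])

conv_autoencoder = keras.Sequential([conv_encoder, conv_decoder])
conv_autoencoder.compile(loss="mse", optimizer="adam")
```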

Step 2: Variational Autoencoders and the Creation of New Stuff

But there’s a class of autoencoders, the so-called variational autoencoders, that doesn’t behave like the others: instead of approximating as closely as possible what you fed it, it generates something completely new that resembles the original input. You give the model 1,000,000 pictures of umbrellas, and (after A LOT of learning) it will produce a picture of something that truly looks like an umbrella but wasn’t one of the pictures you shared. It is the first step towards winning the Imitation Game.

A variational autoencoder does this by introducing a bit of randomness in the process. After having studied the input, instead of producing a coding for it, the encoder generates the mean μ and standard deviation σ of the (theoretical) distribution of the inputs – i.e. the mean and standard deviation that the inputs would have under the assumption that they were normally distributed. If it sounds complicated, let’s analyse it step by step:

  1. The encoder receives a list of inputs and guesses two values for μ and σ, let’s say 0.5 for both
  2. It then proceeds to randomly sample a list of numbers from a normal distribution with μ = σ = 0.5
  3. Subsequently the decoder tries to reproduce the input (our picture) based on those random numbers, and presents it to the jury (an objective function)
  4. The jury, most likely, finds the result extremely poor (it is just random noise) and sends it back
  5. The result is bad for two reasons: the output didn’t really look like the input (i.e. the reconstruction loss was high), and the sampled pixels didn’t really look like they were sampled randomly from a normal distribution (i.e. the latent loss was high) – if the inputs were truly normally distributed with mean μ and standard deviation σ, then those parameters would most probably be different from 0.5
  6. The encoder updates its estimates of μ and σ and samples a new set of random numbers trying to reduce the latent loss
  7. Then the decoder produces a new estimate trying to reduce the reconstruction loss, and sends it again to the jury

After a LONG while, we will have model parameters good enough to produce brand new images that resemble the inputs the model has been trained on. We simply have to feed the decoder random noise (normally distributed) and let it decode that noise using the learned parameters.
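For the geeks, here is a minimal sketch of a variational autoencoder in Keras that follows the steps above: the encoder guesses μ and σ (stored as a log-variance), a sampling layer draws random codings and adds the latent loss, the decoder is judged on the reconstruction loss, and brand new images come from feeding the decoder normally distributed noise. The sizes, names and loss weighting are illustrative assumptions, not the code behind the anime faces above.

```python
import tensorflow as tf
from tensorflow import keras

CODING_SIZE = 32  # size of the latent space, an illustrative assumption

class Sampling(keras.layers.Layer):
    """Steps 2 and 5 above: draw random codings from N(mean, std) and add the
    latent loss, i.e. the penalty for codings that do not look normally distributed."""
    def call(self, inputs):
        mean, log_var = inputs
        kl = -0.5 * tf.reduce_mean(
            tf.reduce_sum(1 + log_var - tf.square(mean) - tf.exp(log_var), axis=-1))
        self.add_loss(kl)  # in practice this term is often rescaled vs. the reconstruction loss
        return mean + tf.exp(0.5 * log_var) * tf.random.normal(tf.shape(mean))

# Encoder: guesses mu and sigma (here as a log-variance) for each input image
inputs = keras.Input(shape=(32, 32, 3))
x = keras.layers.Flatten()(inputs)
x = keras.layers.Dense(256, activation="selu")(x)
codings_mean = keras.layers.Dense(CODING_SIZE)(x)
codings_log_var = keras.layers.Dense(CODING_SIZE)(x)
codings = Sampling()([codings_mean, codings_log_var])

# Decoder: tries to rebuild the image from the sampled codings
decoder = keras.Sequential([
    keras.Input(shape=(CODING_SIZE,)),
    keras.layers.Dense(256, activation="selu"),
    keras.layers.Dense(32 * 32 * 3, activation="sigmoid"),
    keras.layers.Reshape((32, 32, 3)),
])

vae = keras.Model(inputs, decoder(codings))
vae.compile(loss="mse", optimizer="adam")  # the reconstruction loss (the other jury)
# vae.fit(X_train, X_train, epochs=50)

# Brand new images: feed the trained decoder normally distributed random noise
# new_images = decoder(tf.random.normal(shape=(12, CODING_SIZE)))
```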

New anime faces generated by a variational autoencoder – images are blurry and take a long time to train, but the result is quite good: they truly remind us of anime faces drawn by humans

Step 3: The Idea Behind Generative Adversarial Networks (GANs)

In 2014 a group of AI researchers realised that if they designed two different models trying to fool each other, instead of having them play for the same team, results would dramatically improve. The idea was simple and ingenious.

We have a generator and a discriminator working against each other. The generator takes random noise as input and outputs an image (in the same way the decoder worked in the variational autoencoder architecture described above). The discriminator, instead, receives a set of images and tries to identify whether each image is real or counterfeit. It is, de facto, an Imitation Game reconstructed within a machine learning environment.

Phase 1: get yourself a good discriminator. We first train the discriminator, with the aim of making it reasonably good at telling real images from counterfeit ones. At the beginning this is quite easy, because the discriminator sees only the real images and the initial fake images produced by the generator (which really look like noise).

Phase 2: teach the generator to get good at cheating. We then train the generator to produce images that the discriminator will wrongly classify as real. The generator never sees any real image; it simply receives indirect feedback on its work based on how good the discriminator is at spotting its counterfeit images. It is the discriminator that has the benefit of seeing what true images look like.
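Here is what those two phases could look like in a hedged TensorFlow sketch: a toy, fully connected generator and discriminator and a hand-written training loop, with every size and name being my own assumption. Note how the generator only ever receives gradients through the discriminator’s verdict – it never sees a real image.

```python
import tensorflow as tf
from tensorflow import keras

NOISE_DIM = 100  # size of the random noise the generator starts from (assumption)

# Generator: random noise in, 32x32x3 image out (just like a decoder)
generator = keras.Sequential([
    keras.Input(shape=(NOISE_DIM,)),
    keras.layers.Dense(256, activation="selu"),
    keras.layers.Dense(32 * 32 * 3, activation="sigmoid"),
    keras.layers.Reshape((32, 32, 3)),
])

# Discriminator: image in, estimated probability that the image is real out
discriminator = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="selu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

gen_opt, disc_opt = keras.optimizers.Adam(), keras.optimizers.Adam()
bce = keras.losses.BinaryCrossentropy()

def train_gan(real_images, epochs=1, batch_size=32):
    """real_images: float32 array of shape (N, 32, 32, 3), scaled to [0, 1]."""
    for _ in range(epochs):
        for i in range(0, len(real_images), batch_size):
            real_batch = real_images[i:i + batch_size]
            n = len(real_batch)
            # Phase 1: the discriminator learns to label real images 1 and fakes 0
            fake_batch = generator(tf.random.normal((n, NOISE_DIM)))
            with tf.GradientTape() as tape:
                pred_real = discriminator(real_batch)
                pred_fake = discriminator(fake_batch)
                d_loss = (bce(tf.ones_like(pred_real), pred_real)
                          + bce(tf.zeros_like(pred_fake), pred_fake))
            grads = tape.gradient(d_loss, discriminator.trainable_variables)
            disc_opt.apply_gradients(zip(grads, discriminator.trainable_variables))
            # Phase 2: the generator learns to make the discriminator say "real";
            # it only ever sees the discriminator's verdict, never a real image
            with tf.GradientTape() as tape:
                pred = discriminator(generator(tf.random.normal((n, NOISE_DIM))))
                g_loss = bce(tf.ones_like(pred), pred)
            grads = tape.gradient(g_loss, generator.trainable_variables)
            gen_opt.apply_gradients(zip(grads, generator.trainable_variables))
```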

Generated images get reasonably OK quickly (much more quickly than in the case of variational autoencoders) but stop improving almost as quickly. The reason is that, in continuously trying to outsmart each other, the two models settle on whatever technique is reasonably successful at the moment.

Photo by Coach Edwin Indarto on Unsplash

Let’s assume, for example, that we are training a GAN to generate fake rock-paper-scissors images. At the start the generator is very bad, then it starts improving, and then it most probably realises that tricking the discriminator with rock images (a fist) is much easier than developing a full hand with fingers – so it keeps generating a lot of fists, which probably don’t improve much, and some very bad papers and scissors. Eventually the discriminator catches up and starts distinguishing the fake rocks from the real ones, so well that at some point the generator tries another strategy, i.e. showing papers or scissors. At the beginning these are so bad that the discriminator spots them all, but then it is the discriminator that forgets how to identify fake papers and scissors (remember, it has been training itself on rocks for a while in order to outsmart the generator), and the generator starts winning. A GAN can get stuck in this rotational strategy for a while without really improving. Computers, too, are lazy.

Images of sign language gestures generated using a StyleGAN architecture – after 50 epochs the trend towards fists or fist-like images is evident

Deep Convolutional GANs (DCGANs) came to help, stacking layer upon layer and adding filters, stochastic dropouts, and other techniques to limit the instability of GANs. Results improved for small images, but they weren’t perfect. In the case of larger images, the reconstructions might look locally convincing around certain details while offering a poor result overall.
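For completeness, here is a hedged sketch of what a DCGAN-style pair could look like – illustrative sizes, not the architecture from the paper: the generator upsamples the noise with transposed convolutions, the discriminator scans the image with convolutional filters and the stochastic dropouts mentioned above.

```python
from tensorflow import keras

# Generator: upsamples the noise with transposed convolutions instead of Dense layers
dcgan_generator = keras.Sequential([
    keras.Input(shape=(100,)),
    keras.layers.Dense(8 * 8 * 128),
    keras.layers.Reshape((8, 8, 128)),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2DTranspose(64, kernel_size=5, strides=2, padding="same",
                                 activation="relu"),    # 8x8 -> 16x16
    keras.layers.BatchNormalization(),
    keras.layers.Conv2DTranspose(3, kernel_size=5, strides=2, padding="same",
                                 activation="tanh"),    # 16x16 -> 32x32
])

# Discriminator: convolutional filters plus the stochastic dropouts mentioned above
dcgan_discriminator = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    keras.layers.Conv2D(64, kernel_size=5, strides=2, padding="same"),
    keras.layers.LeakyReLU(0.2),
    keras.layers.Dropout(0.4),
    keras.layers.Conv2D(128, kernel_size=5, strides=2, padding="same"),
    keras.layers.LeakyReLU(0.2),
    keras.layers.Dropout(0.4),
    keras.layers.Flatten(),
    keras.layers.Dense(1, activation="sigmoid"),
])
```

The tanh output assumes images rescaled to [-1, 1]; the pair could replace the fully connected generator and discriminator in the training-loop sketch above.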

In 2018 an Nvidia team proposed using DCGANs to produce small images, and then expanding them progressively via specific filters (convolutional layers) in a process called progressive growing (ProGANs). A few other innovations were introduced in order to avoid mode collapse (i.e. the laziness/instability problem described above for GANs) and increase diversity. The combination of all those techniques gave extremely good results for high-resolution images.

Progressive Growing of GANs for Improved Quality, Stability, and Variation: 1024×1024 images generated using the CELEBA-HQ dataset (after many days of training)

The team analysed how "well" a generated picture resembled an original one by comparing their styles at each level. This comparison procedure led to the idea of StyleGANs, currently the most advanced technique in novel image generation.

Step 4: StyleGANs and The Person With No Name

It was the same Nvidia team that introduced the idea of StyleGANs, i.e. GANs where the generator is modified to use style transfer techniques so that the generated image has a structure similar to the original images both locally and globally. The results for high-quality images were astonishing.

Rani Horev’s post on StyleGANs provides a wonderful explanation, but here below is a summary of their key characteristics (a toy sketch of some of these ideas follows the list):

  • A mapping network encodes the inputs into an intermediate vector that controls different visual features
  • A style module transfers the features encoded by the mapping network into the generated image at each resolution level (i.e. for both low-level and high-level image features)
  • The addition of stochastic variation (random noise) to generate features that are not deterministic or that cannot be mapped by the mapping network (such as hair, freckles, wrinkles, etc.)
  • Style mixing at each level of the generated image in order to avoid correlation among levels that might result in generated images following specific (and not necessarily realistic) patterns
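To make the first three bullets a bit more concrete, here is a toy sketch – in no way Nvidia’s implementation; every size, name and weight below is an illustrative assumption – of a mapping network, of style injection via adaptive instance normalisation, and of per-pixel noise.

```python
import tensorflow as tf
from tensorflow import keras

LATENT_DIM = 64  # illustrative size, much smaller than in the paper

# Mapping network: turns the raw latent vector z into an intermediate vector w
# that controls the visual features (first bullet above)
mapping_network = keras.Sequential(
    [keras.Input(shape=(LATENT_DIM,))]
    + [keras.layers.Dense(LATENT_DIM, activation="selu") for _ in range(4)]
)

def adain(feature_maps, w, eps=1e-5):
    """Toy adaptive instance normalisation (second bullet): normalise each feature
    map, then rescale and shift it with a scale and bias derived from the style
    vector w. In the real architecture these affine maps are learned per level."""
    channels = feature_maps.shape[-1]
    mean, var = tf.nn.moments(feature_maps, axes=[1, 2], keepdims=True)
    normalised = (feature_maps - mean) / tf.sqrt(var + eps)
    scale = keras.layers.Dense(channels)(w)[:, None, None, :]
    bias = keras.layers.Dense(channels)(w)[:, None, None, :]
    return normalised * (1.0 + scale) + bias

def add_noise(feature_maps, weight=0.1):
    """Stochastic variation (third bullet): per-pixel noise for details such as
    hair or freckles; the weight is arbitrary here, learned in the real model."""
    return feature_maps + weight * tf.random.normal(tf.shape(feature_maps))

# Usage sketch for a single resolution level of the generator
z = tf.random.normal((1, LATENT_DIM))
w = mapping_network(z)
feature_maps = tf.random.normal((1, 8, 8, 32))  # stand-in for the synthesis network
styled = adain(add_noise(feature_maps), w)
```

Style mixing (the fourth bullet) would then simply mean feeding a different w to different resolution levels.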

The approach, groundbreaking in the generation of novel realistic images, was further improved by another paper in 2020 (which introduced StyleGAN2s). The results? Judge for yourself.

Image generated by a StyleGAN2 architecture – credits to thispersondoesnotexist.com

If you, like me, found yourself imagining the life of the man in the picture, his childhood, his youth, his loves and fears and pain and happiness, only to then remind yourself that it is actually just a collection of pixels generated from random noise, then computers really have come a long way in the Imitation Game.

I hope you enjoyed this brief but intense journey. The number of possible applications is enormous, and whether to be frightened or energised by them is your choice. We are only at the start.

