The Variational Autoencoder as a Two-Player Game — Part I

Alice and Bob at the Autoencoding Olympics

Max Frenzel, PhD
Towards Data Science
17 min read · Apr 2, 2018


Illustrations by KITTYZILLA

[Disclaimer: The aim of this article series is to make the basic ideas behind variational autoencoders and the encoding of natural language as accessible as possible, as well as encourage people already familiar with them to view them from a new perspective. In order to do so, I have made some use of artistic freedom for the sake of a nicer narrative, maybe sometimes sacrificing a bit of technical accuracy. I advise readers without any technical background to take everything with a grain of salt.]

Part II, Part III

Meet Alice and Bob (and Charlie)

The field of AI, and particularly the sub-field of Deep Learning, has been exploding with progress in the past few years. One particular approach, generative models, has been responsible for much of this progress.

An intuitive argument for why generative models are useful on the way to real artificial intelligence is that a system that can generate realistic data must have gained at least some understanding of the real world.

Generative models come in various flavors. One of them is the so called Variational Autoencoder (VAE), first introduced by Diederik Kingma and Max Welling in 2013.

VAEs have many practical applications, and many more are being discovered constantly.

They can be used to compress data, or reconstruct noisy or corrupted data. They allow us to smoothly interpolate between real data, e.g. taking a photo of one face and then gradually morphing it into another face. They allow for sophisticated data manipulation, for example realistically varying the hair length on the image of a person, or smoothly changing a voice recording from male to female without varying any other sound characteristics.

More subtly, but in practice often of most interest, they can uncover hidden concepts and relations in large amounts of unlabelled data. This puts them in the class of unsupervised learning algorithms (as opposed to supervised algorithms which require labelled data).

There is also a group of related models that follow very similar principles to the VAE. The sequence-to-sequence model behind Google Translate, for example, is one of them. If you understand this series of articles, you are basically also ready to understand how Google Translate works under the hood.

There are many great blog posts that explain in technical detail and with code how VAEs work (e.g. this one and that one), and the academic literature is full of countless explanations, applications, and extensions to the original idea.

My goal here is neither to give you the technical understanding to actually implement a VAE, nor to comment on any particular recent development in the field.

Instead I want to provide a new way of viewing what a VAE is actually doing. A way that I hope is simple enough that you could explain it to your grandma or an elementary school student, while at the same time not leaving out too much detail or being too fluffy.

And even if you are an experienced practitioner or researcher in the field, I hope that this slightly quirky interpretation can maybe stimulate some new creative insights.

In this three part series, we are first going to explore the foundations of autoencoders in Part I. In Part II we will take a look at why it makes sense to make them variational (and what that even means). Finally, in Part III we will discover why encoding text is particularly challenging. Parts I and II are really just a build-up for Part III.

The basic concepts of (variational) autoencoders have seen extensive coverage, from simple introductions all the way to academic papers. However, for the concepts in Part III, I have so far not come across any good non-academic discussions. I hope this series can fill this gap and also teach you a lot of other things along the way.

My own background is in Quantum Information Theory. Both quantum physicists and information theorists love to understand and explain the world by boiling complex scenarios down into simple games.

For example many cryptographic problems can be phrased as a game between a sender and a receiver, with a malicious player who acts as an eavesdropper. Or check out the great little book “Q is for Quantum” by my former PhD advisor Terry Rudolph for many examples of easily understandable quantum games.

Having shifted my own research focus from physics to AI, I have spent many hours thinking about and working with VAEs. Given my quantum background, the idea of viewing a VAE as a game came quite naturally.

In particular, I see the VAE as a two-player cooperative game.

But before getting into the details of the game, let’s introduce the players. Meet Alice the Encoder

and Bob the Decoder (not to be confused with Bob the Builder).

Alice and Bob are very ambitious. Their goal is to compete in the newly established Autoencoding Olympics, where the world’s best encoder-decoder pairs meet to show off their skills.

To prepare for the Games, Alice and Bob have enlisted their friend Charlie the Critic, who will be judging the performance of our two contestants and act as their coach.

Now let’s take a look at the rules of the game.

The Autoencoding Game

As we will see, the game comes in various flavors and disciplines, each with its own set of rules and objectives.

We will first consider the simplest version of the game, on which all later versions will build. In deep learning speak, this simple version corresponds to a normal (i.e. non-variational) autoencoder.

The basic idea is this: Alice gets some kind of data, for example an image, a text, an audio clip, etc. She then has to communicate this to Bob, who doesn’t get to look at the data, in a way so that he can reconstruct what Alice saw or heard.

Charlie (who knows exactly what Alice saw) then evaluates how accurate Bob’s reconstruction was and gives him a score based on his performance.

The goal is for Alice and Bob to achieve the highest possible score, i.e. for Bob to perfectly reproduce the data Alice was given.

The catch is that Alice and Bob are separated from each other and can only communicate in a very limited way through a set of special devices.

In particular, Alice can’t just directly explain what she saw. The only information she can pass on to Bob is a bunch of numbers, a “code”.

Alice needs to encode the data.

How many numbers she is allowed to send is called the “code dimension” or “code size”. In real VAEs the code dimension can often be in the hundreds. But for simplicity and ease of visualization let us assume it is just two. This is not just a simplification for the sake of this game, but is also done in practice when people want to visualize the code.

If the code size is two, we can directly interpret the code as an (x, y)-coordinate in a two-dimensional coordinate system.
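For readers who like code, here is a minimal sketch of such an encoder-decoder pair in PyTorch. The image size, layer sizes, and framework are my own assumptions for illustration, not part of the story or of any particular implementation.

```python
import torch
import torch.nn as nn

IMAGE_PIXELS = 64 * 64   # assumed image size (64x64 grayscale, flattened)
CODE_SIZE = 2            # the two numbers Alice is allowed to send

class Alice(nn.Module):
    """The encoder: turns an image into a two-number code."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMAGE_PIXELS, 256), nn.ReLU(),
            nn.Linear(256, CODE_SIZE),
        )

    def forward(self, image):
        return self.net(image)

class Bob(nn.Module):
    """The decoder: tries to repaint the image from the two-number code alone."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CODE_SIZE, 256), nn.ReLU(),
            nn.Linear(256, IMAGE_PIXELS), nn.Sigmoid(),  # pixel values between 0 and 1
        )

    def forward(self, code):
        return self.net(code)

alice, bob = Alice(), Bob()
photo = torch.rand(1, IMAGE_PIXELS)   # a stand-in for one training photo
code = alice(photo)                   # e.g. something like tensor([[ 2.846, -5.049]])
reconstruction = bob(code)            # Bob's "painting", same shape as the photo
```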

For this initial version of the game, we will also assume that the data Alice is shown are photos of cats and dogs. And there is lots of this data, which we will call the training data. Again for simplicity, let’s just say we have one million training images. That’s a fairly average to low number for realistic datasets.

Some representative examples could be

So given one of these images, Alice has to enter two numbers into her machine, say (2.846, -5.049), and send this code to Bob.

Bob now has to try and paint what he thinks Alice saw, given nothing but those two numbers.

Sounds hard?

The truth is, it’s much, much worse!

We all have a notion of what dogs and cats are, what they look like, how they behave, in what environment we usually see them, and so on. But AI has to start literally from scratch, with no preconception of the real world or what images are likely or not.

It’s as if both Bob and Alice grew up in an isolated room with their eyes closed and are now seeing the real world for the first time, through the photos they are shown. They have absolutely no notion of “dog” or “cat”, and at the beginning of the game, a photo of a cat would look to them just as probable and realistic as would random noise or an abstract painting.

They are not even told that the photos contain mostly these things we call “cat” and “dog”.

And the difficulty doesn’t even end there. While Alice actually gets to look at real photos, Bob never sees a real photo. All he ever sees is the code coming from Alice. He can only make random paintings based on these cryptic numbers.

But this is where the importance of our coach Charlie comes in. Every time Bob finishes a painting based on one of Alice’s codes, Charlie compares it with the original and assigns a score to it (the technical term for this is the “loss function”, which measures how far off the reconstruction is, so a high score corresponds to a low loss). The more accurate the painting, the higher the score.

And more than just providing a single score to tell Bob how good (or bad) he was, Charlie also tells him which of his decisions contributed to this score and how. In technical terms, he provides Bob with “gradients”. Based on this information, Bob can tweak his process towards gaining a higher score next time.

And because Bob knows how Alice’s code influenced his process and final output, he can also tell Alice how to improve her encoding. What kind of code he would have liked to get in order to receive a higher score in this particular case. For example he could think “Number 1 should be 0.043 smaller, Number 2 should be 4.956 larger”.

This is the only communication allowed from Bob to Alice. We call this process “back-propagation”, since we start from the final score and then work backwards, adjusting the process that led to that score based on the feedback.

This method of back-propagation currently forms the basis of training the majority of deep neural networks, not just VAEs or generative models.
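As a sketch of what one round of the game might look like in code, here is a possible training step for the toy Alice and Bob networks from the earlier snippet. Charlie is played by a standard mean-squared-error loss, and loss.backward() is the back-propagation step that sends gradients to both Bob and Alice; the choice of optimizer and learning rate is purely illustrative.

```python
import torch
import torch.nn as nn

charlie = nn.MSELoss()   # Charlie: compares Bob's painting with the original, pixel by pixel

# One optimizer updates both players; the learning rate here is an arbitrary example value.
optimizer = torch.optim.SGD(list(alice.parameters()) + list(bob.parameters()), lr=0.001)

def play_one_round(photo):
    optimizer.zero_grad()
    code = alice(photo)               # Alice encodes the photo into two numbers
    painting = bob(code)              # Bob paints, given nothing but the code
    loss = charlie(painting, photo)   # Charlie's verdict: low loss = high score
    loss.backward()                   # back-propagation: gradients flow to Bob AND to Alice
    optimizer.step()                  # both players adjust their process a little
    return loss.item()

loss_this_round = play_one_round(photo)   # one photo, one painting, one round of feedback
```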

While adjusting their process, Alice and Bob need to be careful not to tweak it too much based on a single photo. If they do, they only ever become better at the previous image they were given, and then “overwrite” that progress by over-adjusting to the next one.

Instead they need to make small updates to their process based on each piece of feedback, and hope that over time those small adjustments add up to a process that gives them a good score across ALL images, not just the current one.

How much they adjust their process based on the feedback for each individual image is determined by what’s called the learning rate. The higher the learning rate, the more quickly they adjust their process. But if the learning rate is too high, they risk overcompensating for each new image.
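To see what the learning rate does mechanically, here is the update that an optimizer performs for a single parameter, written out by hand with made-up numbers:

```python
# The same gradient, three different learning rates (all numbers made up).
weight = 1.5       # one tiny part of Bob's painting process
gradient = 0.8     # "this weight contributed this much to the error on this photo"

for learning_rate in (0.001, 0.1, 10.0):
    print(learning_rate, weight - learning_rate * gradient)
# 0.001 -> 1.4992   a small, careful adjustment
# 0.1   -> 1.42     a larger step
# 10.0  -> -6.5     wild overcompensation for a single photo
```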

Initially, since Alice has never seen real photos and also has no idea what codes Bob would like, all she can do is send Bob random numbers.

Also, Bob has no idea what these numbers mean, nor what kind of paintings he is expected to produce. Literally any painting is as likely to him as any other. It’s complete randomness throughout.

But after producing many random paintings which all get terrible scores from Charlie, Bob starts to notice something.

When he uses certain colors in the centre of his painting, and others on the edge, he gets slightly higher scores. Particularly when he paints a gray or brown blob in the centre that has two round shapes in it, he usually gets a higher average score than if he just paints completely at random.

Bob, without having a concept of “fur” or “eyes”, just discovered that most of the pictures contain animals with fur, that usually have two eyes.

Remember this for later. What might seem like great initial progress here will actually come back to haunt Alice and Bob later in Part III when they are trying to encode text.

Also, this approach of completely ignoring Alice’s input and just painting a brown blob with two circles only gets him so far. He’s stuck and can’t increase his score. He needs to start trying to make use of Alice’s code, which at this point is still random since Alice has not gotten any useful feedback from Bob. He needs to start incorporating the code in his creative process and give Alice clues to steer her encoding in the right direction.

The process that ensues is a joint effort between Alice and Bob. Bob first learns what images are at all realistic or likely, and Alice, aided by Bob’s feedback, can then steer him in the right direction for a given input. Without Bob having any knowledge of the world, like at the beginning of the game, she would literally have to specify every single pixel of the image for him. Later she can get away with much less information (cat or dog? Brown or gray? Sitting, standing, or running? Etc.)

As a short aside on data compression, let us consider a trivial version of the game. In this version the code dimension is exactly the same as the number of pixels of the image.

With a machine that has as many input slots as the image has pixels, Alice can simply transfer the exact image through the machine. (Assuming it’s black and white. For a color photo we actually need three times the number of pixels, to encode the red, green, and blue values for every pixel.)

Bob simply needs to paint each pixel exactly as Alice instructs him and gets an absolutely perfect replica, 100% of the time.

A slightly less trivial version is where the code size is half the number of pixels.

Alice can now learn to specify every second pixel, a downsampled version of the photo. Bob just needs to fill in the remaining pixels. Assuming he has a good notion of how the real world looks, this becomes a fairly trivial task. The images won’t all be 100% accurate, but close enough.
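Here is a sketch of these two trivial “pixel codes” in NumPy. The image size and Bob’s simple fill-in rule (copying the neighbouring pixel) are assumptions made purely for illustration.

```python
import numpy as np

image = np.random.rand(64, 64)   # a stand-in black-and-white photo (values 0 to 1)

# Trivial game: code size equals the number of pixels.
# Alice sends every pixel, Bob copies them, the replica is perfect.
code_full = image.flatten()                   # 4096 numbers
perfect_copy = code_full.reshape(64, 64)

# Slightly less trivial game: code size is half the number of pixels.
# Alice sends every second column; Bob fills the gaps by repeating his neighbour.
code_half = image[:, ::2]                     # 64 x 32 = 2048 numbers
reconstruction = np.repeat(code_half, 2, axis=1)

error = np.abs(reconstruction - image).mean() # close to the original, but no longer perfect
print(error)
```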

But Alice and Bob won’t get away with this “pixel encoding” forever. As the code size gets smaller, say down to two like in our original game, they have to change their strategy if they want to have any hope of being successful.

Instead of encoding pixels, they need to come up with a smarter, more “information dense” code. A code that captures more abstract concepts, like what animal, what pose, what camera angle, etc.

This is the basis of data compression. The more we want to compress a piece of data, the more efficient a code we need to devise (and the more we risk it not being reconstructed completely faithfully).

The trick is to break down complex information into a few concepts that are as simple and universal as possible, but still allow for a fairly faithful reconstruction.

Back to our original game, Alice and Bob have now played for a while (and I really mean a LONG while), and have played the game with each of the photos in the training dataset maybe ten times or so. Bob has become a prolific painter, having painted millions of paintings, and Alice has become an expert information encoder and learned to provide Bob with codes that help him figure out what photo she saw.

Bob’s idea of the real world has dramatically improved compared to the original randomness.

However, note that Alice and Bob only know things they have encountered in their training images. Another animal will be almost as unlikely to them as random noise (although they might for example find such concepts as eyes, legs, and fur familiar from their cat/dog centred worldview).

Alice and Bob have become pretty good at the game and start scoring fairly high.

The Learned Code

Let’s take a brief look at the kind of code Alice might have figured out during training.

Since the code consists of two numbers, we can easily visualize them as x- and y-coordinates in a plane.

In one possible code, Alice could have decided to use a negative x-value for cats and a positive one for dogs. The y-axis could have been used for fur color. Positive values being darker, negative ones lighter.

With this code, if Alice gets a photo of an extremely “dogly” black dog, she would send Bob two large positive numbers. If she sees an extremely “catly” white cat, she sends two large negative numbers. For a photo where she’s barely sure if it’s a dog or a cat, and the fur color is of medium darkness, she sends two values close to zero. And so on.
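As a toy illustration of this hypothetical convention (real learned codes are rarely this clean), one could write Alice’s and Bob’s shared code down like this; all names and numbers below are invented for the example:

```python
# Hypothetical convention: x encodes cat (negative) vs. dog (positive),
# y encodes fur darkness (positive = darker, negative = lighter).
example_codes = {
    "very dogly black dog":             ( 4.5,  4.0),
    "very catly white cat":             (-4.5, -4.0),
    "barely-sure, medium-dark animal":  ( 0.1,  0.2),
}

def bobs_guess(code):
    x, y = code
    animal = "dog" if x > 0 else "cat"
    fur = "dark" if y > 0 else "light"
    return f"probably a {fur}-furred {animal}"

for description, code in example_codes.items():
    print(description, "->", code, "->", bobs_guess(code))
```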

If Bob has learned that this is the code Alice is using, it will allow him to make much better guesses for his paintings.

However, while being a huge improvement over completely random paintings, his new paintings will still contain a lot of randomness and guessing. He knows if it’s a cat or a dog and how dark the fur is, but there is a lot of information this particular encoding cannot transmit.

For example, Bob has no idea what the fur length and pattern are, what pose the animal is in and in which part of the image, what’s in the background, what is the lighting of the scene, ...

He needs much more information to fully specify the problem. That’s why real neural networks usually tend to learn very complicated codes that in most cases don’t have an easy human interpretation of the axes. That way they can cram much more information into these two numbers.

If you think about it, there is really a lot of information even in a simple image. If you are asked to look for cats or dogs, you will view images in a particular, very binary way. Cat or dog?

If, on the other hand, you’re asked whether the cats or dogs are in an urban or natural scene, you’ll consider the images very differently. Alice and Bob are not asked to look for anything specific. They don’t have this human context. They only know the task of reconstructing the entire image.

However, as they learn, they might naturally figure out concepts like “dog” and “cat”. But there is also a large chance that an AI will come up with concepts that are completely meaningless to us humans. Concepts that are much more abstract and “efficient”.

So if Alice and Bob are playing for long enough, they will eventually agree on one of these more complicated codes that capture highly abstract and efficient concepts.

Seems like they mastered the game, right?

What happens if Alice and Bob get too “smart”

Well, not so fast. They need to be careful. There is a risk of becoming “too good”.

Alice and Bob, with the help of Charlie, are currently training for the Autoencoding Olympics. During training they practice with the same set of photos over and over again.

But the photos that will be used at the Olympics are a well-kept secret. Alice and Bob won’t see them until they actually have to encode and decode them at competition time.

At that point, they might realize they are in trouble.

Instead of understanding the “general world of dog and cat photos”, they just learned to perfectly memorize all the photos in their training set.

It turns out they actually did not discover any of the concepts we discussed in the previous section. They basically just figured out a way to cheat to get a perfect score on their training images, without actually having any real understanding.

This problem, extremely common in all of deep learning, is known as overfitting.

Much like a student who steals the answers to a test and then memorizes those instead of actually studying the subject, Alice and Bob, once overfitting sets in, are 100% confident on examples they have encountered, but have absolutely no clue about anything else.

Taking this to the extreme, if Alice and Bob are very “smart”, have large memory, and are allowed to train for a VERY long time, they can actually learn to perfectly encode any image in their training dataset into even just a single number, no matter how many images there are.

How could they do this?

There are an infinite number of codes that would allow them to do this, but let’s consider a particularly simple one.

We earlier assumed that there are exactly one million training images. Given that knowledge, they could just agree to encode them starting with 0.000001 for the first one, 0.000002 for the second, and so on up to 1.000000 for the millionth one. For each image, they simply decide on a unique code number.

In this code, consecutively numbered images are very close in terms of their code, but the images don’t have to be similar in the least. Image 156 could be a black dog playing with a ball in a living room, and image 157 could be a white cat chasing a mouse through grass.

Whenever Alice sees image 156, she just sends 0.000156. And Bob has learned exactly what he needs to paint to get a perfect score. And similarly he has figured out what exactly to paint when he sees 0.000157 flashing up on his machine.

They get a perfect score, 100% of the time.
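Written as code, this “cheating” strategy is nothing but a lookup table. The function names below are of course purely illustrative:

```python
NUM_TRAINING_IMAGES = 1_000_000

def alice_memorized(image_number):
    """Alice's 'code': just the image's position in the training set."""
    return image_number / NUM_TRAINING_IMAGES    # image 156 -> 0.000156

memorized_paintings = {}   # filled in during training: image number -> exact painting

def bob_memorized(code):
    """Bob's 'decoding': a pure lookup, no understanding involved."""
    image_number = round(code * NUM_TRAINING_IMAGES)
    painting = memorized_paintings.get(image_number)
    if painting is None:
        raise ValueError("Never saw this code before. Absolutely no idea what to paint.")
    return painting
```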

But what if Bob sees 0.0001565? Or -0.000001? Or 1.000001?

Absolutely no idea.

Similarly if Alice is suddenly shown a new image that she never encountered before, how would she fit that into the existing code so that Bob could make an informed guess straight away?

Again, absolutely no idea.

This is not learning or understanding. It’s pure memorization.

Every little change, every tiny unfamiliarity, completely throws them off.

We need to prevent them from reaching this stage, and instead enforce understanding.

If done correctly, they might still remember some of the examples they previously encountered, but they will also have learned general principles that allow them to reason about things they haven’t seen before.

This is called generalization, and is one of the core criteria for almost any deep learning algorithm.

One way to prevent the problem, which is the approach people relied on for a long time (and still use alongside more sophisticated approaches), is simply to stop the training early, before a basic understanding gives way to memorization. This is known as “early stopping”.
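Here is a minimal sketch of early stopping, reusing the toy networks and the play_one_round function from the earlier snippets. The held-out validation photos (random stand-ins here) and the patience value are illustrative assumptions.

```python
import torch

# Photos that Alice and Bob never train on are kept aside to measure real understanding.
training_photos = [torch.rand(1, IMAGE_PIXELS) for _ in range(100)]
validation_photos = [torch.rand(1, IMAGE_PIXELS) for _ in range(20)]

best_validation_loss = float("inf")
epochs_without_improvement = 0
patience = 5   # how many stagnant epochs we tolerate before stopping

for epoch in range(1000):
    for photo in training_photos:
        play_one_round(photo)         # normal training, as sketched earlier

    with torch.no_grad():             # just evaluating, no feedback to the players
        validation_loss = sum(
            charlie(bob(alice(photo)), photo).item() for photo in validation_photos
        ) / len(validation_photos)

    if validation_loss < best_validation_loss:
        best_validation_loss = validation_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break   # memorization is setting in; stop before it gets worse
```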

But in the case of autoencoders, there is a better way of ensuring actual understanding and meaningful codes.

In Part II, we will take a look at how Alice and Bob deal with their defeat, and see how a new training method, using so called variational machines, can help them improve their performance and come up with meaningful codes that generalize to previously unseen data.

