PointGAN: A breakdown of the simplest GAN possible

How hard can it be to teach a machine to generate a single point

Vitalii Pogribnyi
Towards Data Science



Introduction

This article describes the creation of a Generator-Discriminator pair that is consistent and robust (more or less). What’s more, I aimed to get a specific Discriminator function, one that would ensure the Generator’s consistent learning (see plot below).

The task for the simplest GAN will be to generate a point: a single number, say “1.” The Generator will be a single neuron with one weight and one bias value. The Discriminator thus has to be bell-shaped, like this:

The x-axis denotes the input to the Discriminator, and the y-axis the probability of this input being authentic.

The value “1” is considered real; values close to “1” are somewhat likely to be real; values further away are probably fake.

1. Baseline training

1.1 Common terms

I should start this section by describing what the Generator and Discriminator should look like, and why.

First, let’s think again about what a Generator is. It is a function that transforms a random input into a real-looking signal. As described in the introduction, the realistic signal will be a single value “1.” For the input, I will take a random value between -1 and +1.

The function itself will be linear, since this example aims to be simple:

y = w * x + b

Where y is the output, x is the input, w is the network’s weight, and b is its bias. The solution to the problem would be:

w = 0
b = 1

In this case, no matter what “x” is, “y” is always “1.” Graphically it is represented by a vertical line, as shown in the plot below.

There are a lot of plots in this article that show training processes, titled “weight” and “bias.” These values are the “w” and “b” I’ve just described. So when looking at the plots, we should feel happy when “weight” converges to 0 and “bias” converges to 1.

Second, the Discriminator. This is another function, which outputs a probability of its input being authentic. In our example, it should output a value close to “1” given the input “1,” and otherwise should output “0.” For the GAN network, it also acts as a loss function for the Generator, so it should be smooth. Here is a plot of both Generator and Discriminator I expect to get:

The Discriminator function (orange) is the same as above, with the Generator function (blue) in the form “y = w * x + b”.

In terms of formulas, the Discriminator would be as simple as possible. Yet it can’t be linear, since we aim to get a bell shape. So the simplest possible solution would be a multilayer perceptron with two hidden nodes:
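Written out, such a two-hidden-node perceptron might look like this (the choice of sigmoid activations is my assumption, not something the article states):

```latex
D(x) = \sigma\!\left( v_1\,\sigma(w_1 x + b_1) + v_2\,\sigma(w_2 x + b_2) + c \right),
\qquad \sigma(t) = \frac{1}{1 + e^{-t}}
```

Two saturating hidden units, pointed in opposite directions, are enough to produce a single bump.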

By the way, if I want to see what my Generator and Discriminator functions look like, I would simply pass a range of numbers through them:
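Something like the following sketch, where the helper name is mine and the models are assumed to be PyTorch modules mapping shape (N, 1) to (N, 1):

```python
import torch
import matplotlib.pyplot as plt

def model_curve(model, lo=-3.0, hi=3.0, steps=200):
    # Pass a dense range of inputs through the model; no gradients needed.
    xs = torch.linspace(lo, hi, steps).unsqueeze(1)
    with torch.no_grad():
        ys = model(xs)
    return xs.squeeze(1).numpy(), ys.squeeze(1).numpy()

# Usage, assuming `generator` and `discriminator` are the models described above:
# plt.plot(*model_curve(generator), label="Generator")
# plt.plot(*model_curve(discriminator), label="Discriminator")
# plt.legend(); plt.show()
```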

This is in fact how I got the plots above.

1.2 GAN training

Before we get to the first training, I’d like to discuss how the model will be trained and how I’m going to visualize it. Take a look at the following image:

This image shows the Generator and Discriminator at the very beginning of the training, so their parameters are random.

The function represented by the blue line is meant to transform a random input into a realistic value (now it outputs random values as well, since it’s untrained). So for example, if we generated random values [-2.05, 0.1, 2.5] this function would transform them into (approximately) [-0.2, 0.3, 0.49]:

These values are then passed to the Discriminator (note the scale difference, 2 squares per value on the vertical axis vs 1 square per value on the horizontal):

The scores outputted by the Discriminator are then collected. The procedure is repeated with the real value:

The average output would be 0.48 for the fake values and 0.56 for the real ones. After that, both Generator and Discriminator will take different actions in order to train.

Discriminator’s point of view.

The Discriminator tries to make the real and fake values more distinguishable by their scores. In this case, it can be achieved by making the function steeper:

You may notice that I’m not defining what “better” means in the image above. Later I will use binary cross-entropy for this purpose, which doesn’t simply average the raw outputs. Don’t pay much attention to it now, since this is a rough illustration of the process. Even the numbers in this whole section are made up.

Generator’s point of view.

The Generator, in turn, tries to update its function so that when the generated values are passed through the Discriminator, their scores are higher. This can be achieved by simply generating larger values:

Note that for the same input [-2.05, 0.1, 2.5], the output is larger: [0.2, 0.4, 0.75]. For the new values, the Discriminator outputs 0.51 on average, which is better.

Note that the inputs in the lower image are shifted to the right, compared to the top image.

Combining the training step for both models, we get the following transition:

The process is then repeated with new random values as the Generator input. The overall training passes the stages shown in the image below.

Training progress: a) Initial state. b) The Generator tries to generate larger values; the Discriminator rates larger values as more realistic. c), d), e) The Generator starts to output values larger than “1”; the Discriminator starts to rate the large values as fake, too. f) The Generator outputs “1” no matter what the input is; the Discriminator rates only “1” as the most realistic value.

Since we will have a lot of small steps like these, it is convenient to combine them into an animation, which would look like the following:

Evolution of Generator (blue) and Discriminator (orange) models during the training.

I will use such animations a lot, for demonstration purposes. The code that generates them will be described below.

1.3 Baseline code

We start with the models’ code, since I’ve described them in the previous section. This is essentially the formulas above, translated into PyTorch code:
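A minimal sketch of what that models.py might contain; the layer sizes follow the article, while the exact activation functions are my assumption:

```python
# models.py — a single-neuron Generator and a two-hidden-node Discriminator.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # A single neuron: y = w * x + b
        self.layer = nn.Linear(1, 1)

    def forward(self, x):
        return self.layer(x)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # Two hidden nodes are enough to produce a bell shape.
        self.hidden = nn.Linear(1, 2)
        self.out = nn.Linear(2, 1)

    def forward(self, x):
        x = torch.sigmoid(self.hidden(x))
        return torch.sigmoid(self.out(x))
```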

This code will be saved in the models.py file. So whenever I import the models, I mean this code.

The training code follows next, and this is a bit tricky. We need to make two passes through the data: one for the Generator training and one for the Discriminator. I will do it in a simple way: they will take turns. The Generator will wait while the Discriminator is training, and vice versa:

As for the loss function, I won’t use the conventional GAN loss, just to keep things simpler to understand and to show that it doesn’t have to be the conventional one. Nevertheless, the idea behind the loss remains the same. The Generator wants the Discriminator to output 1 given its output:
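In code, that objective can be sketched as follows (the helper name is mine; the models are assumed to be the ones described above, and binary cross-entropy is used against a target of all ones):

```python
import torch
from torch import nn

bce = nn.BCELoss()

# The Generator is rewarded when the Discriminator scores its output as real.
def generator_loss(generator, discriminator, random_input):
    fake = generator(random_input)
    scores = discriminator(fake)
    return bce(scores, torch.ones_like(scores))
```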

For random_input, I will take values from 0 to 1, generated the following way:
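A sketch of that generation; the batch size here is my choice, not fixed by the article:

```python
import torch

BATCH_SIZE = 32  # an assumed value

# Uniform values in [0, 1), one per generated sample.
random_input = torch.rand(BATCH_SIZE, 1)
```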

The Discriminator collects the output from the Generator and the real values, and tries to separate them:
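A sketch of that separation with binary cross-entropy (the function and variable names are mine):

```python
import torch
from torch import nn

bce = nn.BCELoss()

# The "fake" batch is detached so this loss only updates the Discriminator;
# the real value is "1", as defined in the introduction.
def discriminator_loss(generator, discriminator, random_input, real_value=1.0):
    fake = generator(random_input).detach()
    real = torch.full_like(fake, real_value)
    loss_fake = bce(discriminator(fake), torch.zeros_like(fake))
    loss_real = bce(discriminator(real), torch.ones_like(real))
    return loss_fake + loss_real
```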

The rest of the code is a regular routine:
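Put together, the routine might look like this minimal sketch. The learning rates, batch size, and step count are my guesses, logging is omitted, and the models are written inline for self-containedness:

```python
import torch
from torch import nn, optim

# Models as described above: a single neuron vs. a two-hidden-node MLP.
generator = nn.Linear(1, 1)
discriminator = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid(),
                              nn.Linear(2, 1), nn.Sigmoid())
opt_g = optim.Adam(generator.parameters(), lr=1e-3)
opt_d = optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    noise = torch.rand(32, 1)
    real = torch.ones(32, 1)

    # Discriminator's turn: the Generator only supplies (detached) fakes.
    fake = generator(noise).detach()
    loss_d = bce(discriminator(fake), torch.zeros(32, 1)) \
           + bce(discriminator(real), torch.ones(32, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator's turn: the Discriminator only scores, its weights stay put.
    fake = generator(noise)
    loss_g = bce(discriminator(fake), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```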

The code already contains logging, which is described below.

1.4 Logging

There is one thing I’d like to add to the training: make it generate an animation like the one described in section “1.2 GAN training.” Besides looking cool, it gives a good understanding of what is going on inside. And that is what is good about such a simple model: we can afford this type of visualization.

The type of animation I will be using is matplotlib’s FuncAnimation. The idea is that I create a figure and a function that updates the figure; the library will call this function for every frame, and generate an animation:
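A sketch of that setup; here frames_data stands for the per-step snapshots of both curves collected during training, filled with placeholder data so the example is self-contained:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

x = np.linspace(-3, 3, 200)
# Placeholder snapshots: (generator_curve, discriminator_curve) per frame.
frames_data = [(np.sin(x + i / 5), np.cos(x + i / 5)) for i in range(30)]

fig, ax = plt.subplots()
gen_line, = ax.plot(x, frames_data[0][0], label="Generator")
disc_line, = ax.plot(x, frames_data[0][1], label="Discriminator")
ax.set_ylim(-2, 2)
ax.legend()

def update(frame):
    # Called by the library once per frame.
    gen_line.set_ydata(frames_data[frame][0])
    disc_line.set_ydata(frames_data[frame][1])
    return gen_line, disc_line

anim = FuncAnimation(fig, update, frames=len(frames_data), interval=50)
# anim.save("training.mp4")  # requires ffmpeg to be installed
```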

Some systems may lack the libraries needed for generating a video. In this case, one may try generating GIFs instead:
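For example, matplotlib’s PillowWriter only needs the Pillow package, no external video codecs. A self-contained toy example (the animated curve here is a stand-in, not the training data):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

fig, ax = plt.subplots()
x = np.linspace(-3, 3, 50)
line, = ax.plot(x, np.sin(x))

def update(frame):
    line.set_ydata(np.sin(x + frame / 5))
    return (line,)

anim = FuncAnimation(fig, update, frames=10)
anim.save("training.gif", writer=PillowWriter(fps=10))
```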

Apart from the animation logs, I want to watch how the weight and bias of the Generator change, to see if the training moves in the right direction. I use Weights & Biases in this example, but other choices, like MLflow, are also acceptable.

The training code produces the following training process (I varied the learning rates for the models and ran the code with different random seeds):

Mean value and standard deviation for the weight and bias of the Generator.

Note that the correct process would end up with the weight set to “0” and the bias to “1.”

The trainings were mostly successful; some of them are depicted below:

But there are a couple of bad examples:

As for the visual representation, here are some example GIFs:

Evolution of Generator (blue) and Discriminator (orange) models during the training.

Since most of the trainings reached their target, this code may be considered “working,” but one might disagree. Let’s see what is wrong and how we can fix it.

2. Problems of the Baseline

There are only two things that make me question whether my training is correct. First, it does fail sometimes, so it must be somewhat unstable. Second is the Discriminator function. There are only a handful of cases where it looks the way I expected. Let’s examine the problems I encountered.

2.1 Generator is fine

At some point, the Generator starts to output realistic values (because we train it to do so), and there is no way for the Discriminator to tell the difference. But the network does not know this, so it continues training. Since the “fake” input is already realistic, its labels are invalid. And we all know that networks trained on invalid data are inadequate.

2.2 Loss function doesn’t work

Trying to identify the problem, I took a Generator from one of the failed cases and plotted some of the Discriminator parameters along with the loss value it produces.

The Generator has already produced values close to real, something in the range [-0.59, +0.62].

What I found was rather surprising: a better Discriminator function (one that I, as a human, know is better) in reality gave worse loss values:

Loss values for different Discriminator functions

This was because the generated values were close to the real one. So the correct Discriminator would score them roughly the same as the real value, with only a minor difference. The incorrect Discriminator, on the other hand, could slightly improve its loss by making radical changes like in the plot above.

What first came to mind was that this Discriminator evaluates the real examples as real with only a 0.5 probability. That looked easy to fix by giving weights to the loss function. This experiment is described in the section below, along with all the other experiments. Long story short, it didn’t work.

2.3 Setting up experiments

Before moving on to the solution that worked, I’d like to share the results of the experiments, based on the assumptions I made above.

1. Loss function easing

The current cross-entropy loss wants the network output to be 0 or 1, which pushes the values before the activation (sigmoid) toward very large positive or negative values. Since our Discriminator is meant to stay flexible, large weights are not something we want. One solution is to make the targets less extreme: I will set them to 0.1 and 0.9 instead of 0 and 1, so that the weights are not forced to be large. In the end, all we need from the Discriminator are gradients.

In code, I will change the targets for the Discriminator to this (with the easing parameter being varied):
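A sketch of the eased targets (the batch size is my choice):

```python
import torch

EASING = 0.1  # the easing parameter that is varied in the experiments
batch_size = 32

# Instead of hard 0/1 targets, pull them in by the easing amount.
real_target = torch.full((batch_size, 1), 1.0 - EASING)  # 0.9 instead of 1
fake_target = torch.full((batch_size, 1), EASING)        # 0.1 instead of 0
```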

After training again, this yields the following curves:

Mean value and standard deviation for the weight and bias of the Generator.

This looks more stable, but I still don’t like the GIFs:

Evolution of Generator (blue) and Discriminator (orange) models during the training.

2. Weighting the real examples

In code, I will give less weight to the fake examples (here weight is a small value):
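A sketch of that weighting, using the functional form of binary cross-entropy; the function name and the particular weight value are mine:

```python
import torch
import torch.nn.functional as F

WEIGHT = 0.3  # a small value; the exact number was varied

def weighted_disc_loss(scores_real, scores_fake):
    # Real examples keep full weight, fake ones are down-weighted.
    loss_real = F.binary_cross_entropy(scores_real, torch.ones_like(scores_real))
    loss_fake = F.binary_cross_entropy(scores_fake, torch.zeros_like(scores_fake))
    return loss_real + WEIGHT * loss_fake
```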

I ran the training again and obtained the following image. In short, it didn’t work:

Mean value and standard deviation for the weight and bias of the Generator.

3. Weight decay

One of the reasons the training may fail is that the Discriminator may go too far trying to classify the output from an untrained Generator. Especially if the Discriminator learns significantly faster than the Generator.

So, given a starting point like this:

Generator (blue) and Discriminator (orange) function at the beginning of the training

The Discriminator will quickly learn a function like this:

Generator (blue) and Discriminator (orange) functions at the end of the training

Note the sharpness of the Discriminator function. This means that its weights are large, and the gradients outside the (small) curvy area are close to zero; so the model learns very slowly (if at all).

So the idea was to add a weight decay to prevent the model from reaching this state:
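In PyTorch, weight decay can be attached directly to the optimizer. The Discriminator below is a stand-in with the same shape as in the article, and the learning rate is my guess; the decay value is what gets swept:

```python
import torch
from torch import nn, optim

discriminator = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid(),
                              nn.Linear(2, 1), nn.Sigmoid())

# L2 weight decay keeps the weights, and hence the function, from
# getting too sharp.
opt_d = optim.Adam(discriminator.parameters(), lr=1e-3, weight_decay=1e-3)
```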

Which gives the following training stats:

Mean value and standard deviation for the weight and bias of the Generator.

And is visualized as follows:

Weight decay = 0.1
Weight decay = 1e-3
Weight decay = 1e-5

One may see that this method is capable of improving the training, but I’m still not satisfied with the training process.

3. The fix

3.1 Basic idea

Let’s take another look at the situation when the Generator outputs realistic answers but the Discriminator still has something to learn. One solution to this problem would be to make the Generator’s output invalid. Only during the Discriminator training, of course. This would resemble some kind of dropout. Actual dropout should work in principle, but the Generator has too few parameters: if we zero out one of them, the output changes too much.

The solution I’ve come up with involves adding Gaussian noise to the Generator’s parameters (to the weight and bias). In that way, even when the Generator is perfectly correct, it will generate slightly invalid data for the Discriminator, so that it can learn:
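A sketch of that trick (the helper name is mine; the noise scale of 0.5 is the baseline value mentioned later in the article). The parameters are perturbed, the forward pass is run, and the originals are restored, so only the Discriminator ever sees the noise:

```python
import torch

NOISE_SCALE = 0.5

def noisy_generator_output(generator, x, noise_scale=NOISE_SCALE):
    # Back up the parameters, add Gaussian noise, evaluate, restore.
    backup = [p.detach().clone() for p in generator.parameters()]
    with torch.no_grad():
        for p in generator.parameters():
            p.add_(torch.randn_like(p) * noise_scale)
        out = generator(x)  # no graph: this output is for the Discriminator
        for p, saved in zip(generator.parameters(), backup):
            p.copy_(saved)
    return out
```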

The remaining problem is that the training becomes too noisy, since the gradients change rapidly with each optimization step. So I decided to make a series of evaluations with different noise before each optimization step, some kind of a batch inside the batch:
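A self-contained sketch of that inner batch; the number of noise re-draws per step is my choice:

```python
import torch

INNER_BATCH = 8   # noise re-draws per optimization step (an assumed value)
NOISE_SCALE = 0.5

def inner_batch_scores(generator, discriminator, random_input):
    # Score several noisy copies of the Generator before a single
    # optimizer step: a batch inside the batch.
    scores = []
    for _ in range(INNER_BATCH):
        with torch.no_grad():
            backup = [p.detach().clone() for p in generator.parameters()]
            for p in generator.parameters():
                p.add_(torch.randn_like(p) * NOISE_SCALE)
            fake = generator(random_input)
            for p, saved in zip(generator.parameters(), backup):
                p.copy_(saved)
        scores.append(discriminator(fake))
    return torch.cat(scores)
```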

This improves the training process:

Mean value and standard deviation for the weight and bias of the Generator.

And at the end generates a beautiful Generator function:

Evolution of Generator (blue) and Discriminator (orange) models during the training.

But still, some trainings didn’t go well. Here are the training logs:

3.2 The improvement

I have to mention that the situation I described in the “Loss function doesn’t work” section may also appear here if the noise on the weights is too small: it just doesn’t push the model far enough to generate an obviously wrong example. So I decided to increase the noise level:

NOISE_SCALE = 1.5  # Instead of 0.5

This step improved the stats, but some failures remained:

Mean value and standard deviation for the weight and bias of the Generator.

The next improvement I applied was the weight decay, for the reasons I described above in the “Setting up experiments” section under “3. Weight decay.” This yielded the following stats:

Mean value and standard deviation for the weight and bias of the Generator.

Which means no failing runs at all. The weight decay has a side effect: the Discriminator function becomes smoother, so the GIFs look gorgeous:

Evolution of Generator (blue) and Discriminator (orange) models during the training.

For my example, I took a weight decay value of 1e-1:
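In code, only the optimizer line changes (the Discriminator here is a stand-in with the article’s shape, and the learning rate is my assumption):

```python
import torch
from torch import nn, optim

discriminator = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid(),
                              nn.Linear(2, 1), nn.Sigmoid())
opt_d = optim.Adam(discriminator.parameters(), lr=1e-3, weight_decay=1e-1)
```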

But keep in mind that it depends on the weight noise level. If the noise is small, you may need to decrease the weight decay; otherwise, the Discriminator will go flat.

The full updated code would look as follows:
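A consolidated sketch of the final version, combining the parameter noise, the inner batch, and the weight decay. The model shapes and the values NOISE_SCALE = 1.5 and weight decay = 1e-1 follow the article; the learning rates, batch size, inner-batch size, and step count are my guesses, and logging is omitted:

```python
import torch
from torch import nn, optim

NOISE_SCALE = 1.5
INNER_BATCH = 8
WEIGHT_DECAY = 1e-1

generator = nn.Linear(1, 1)  # y = w * x + b
discriminator = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid(),
                              nn.Linear(2, 1), nn.Sigmoid())
opt_g = optim.Adam(generator.parameters(), lr=1e-3)
opt_d = optim.Adam(discriminator.parameters(), lr=1e-3,
                   weight_decay=WEIGHT_DECAY)
bce = nn.BCELoss()

def noisy_fake(x):
    # Perturb the Generator's parameters, evaluate, then restore them.
    with torch.no_grad():
        backup = [p.detach().clone() for p in generator.parameters()]
        for p in generator.parameters():
            p.add_(torch.randn_like(p) * NOISE_SCALE)
        out = generator(x)
        for p, saved in zip(generator.parameters(), backup):
            p.copy_(saved)
    return out

for step in range(2000):
    x = torch.rand(32, 1)
    real = torch.ones(32, 1)

    # Discriminator's turn: average over several noisy Generator copies.
    loss_fake = 0.0
    for _ in range(INNER_BATCH):
        loss_fake = loss_fake + bce(discriminator(noisy_fake(x)),
                                    torch.zeros(32, 1))
    loss_d = bce(discriminator(real), torch.ones(32, 1)) \
           + loss_fake / INNER_BATCH
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator's turn: no noise, standard update.
    fake = generator(x)
    loss_g = bce(discriminator(fake), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```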

Conclusion

This code seems to be consistent and robust, at least for such a simple task. Some parameters still need to be tuned, like the inner batch size or the weight decay value, but overall, I’m satisfied with the result. The final version of the code does not require as much struggle as the first one, so I consider it a success.

Hope it was helpful, happy coding!
