Unsupervised Classification with Generative Models

A simple test of machine intelligence

Svetlana Rakhmanova Shchegrova
Towards Data Science

--

Image by the author

It has been my impression that in the immense space of Artificial Intelligence (AI) concepts and tools, Generative Adversarial Networks (GANs) stand apart as an untamed beast. Everybody realizes how powerful and cool they are, few know how to train them, and even fewer can actually find a use for them in a practical task. I might be wrong, so feel free to correct me. Meanwhile, I would like to take another look at this wonderful machinery and investigate its possible use for classification and embedding. For this purpose, we will stick to the following plan:

  • Refresh our knowledge of GANs and the arguments for using WGANs as far as the training strategy is concerned
  • Review a custom implementation of WGAN
  • Train the WGAN on a set of objects with designed properties
  • Use the trained model to classify another set of objects and see if we can interpret this classification

Introduction to GANs

GANs were introduced in reference [1]. They consist of two parts — a discriminator and a generator.

A discriminator is a function that takes in an object and converts it into a number.

Schematic representation of a discriminator part of a GAN. Image by the author

Of course, depending on the complexity of the object, it might be a formidable task to turn it into a number. For that reason, we might employ a pretty sophisticated function for the discriminator, such as a deep Convolutional Neural Network (CNN).

A generator performs the opposite task, in a sense. It takes some random data as input and generates an object out of it. And by random data, I really mean random data having nothing to do with our object. Usually, it is a vector of a certain length drawn from a random distribution. The vector length and the distribution are fixed, non-trainable parameters of the model that don't really matter much.

Schematic representation of a generator part of a GAN. Image by the author.

As in the case of the discriminator, the generator function had better have some complexity. Usually, a generator and a discriminator have similar architectures, as we shall see in an example shortly.

I purposely use the word “object” instead of “image” because I want us to think bigger. It could be any digital representation of a physical entity, process, or piece of information. Of course, the key here is that we are interested in a particular type of object, and the discriminator and the generator deal with the same type of object. So, apart from our model (generator + discriminator), we must have the data. The data gives examples of the objects that we want to work with. The objective is to teach the model to recognize these objects.

The training goes like this. The generator starts generating objects with the goal of making something similar to our data set. In the beginning, these objects look nothing like it. We feed them to the discriminator and collect the scores, and by scores we mean the output of the discriminator. In addition, the discriminator scores the real objects. And here the animosity between the generator and the discriminator begins, which gives the name “Adversarial” to the approach. The discriminator is trained to score the real and the generated objects as differently as possible. The generator is trained to generate objects so similar to the real ones that the discriminator scores them similarly. The trick now is to turn this strategy into an objective function, or two, to be optimized.

WGAN approach to training

The narrative here usually goes something like this: in machine learning, discriminative models learn the probability of the labels conditional on the data, while generative models learn the joint probability of the data and labels. Intuitively, this sounds about right. Personally, I have not been able to find either a compelling strict mathematical proof or a comprehensive illustrative discussion of the subject to provide a good reference for this statement. It has, however, been repeated so many times in the literature that we can take it for granted now, especially considering that we are going to explore this property later.

Since we have a generative model on our hands, we are going to learn the full probability, and, during training, we need some measure of how close we are to the real thing. When talking about comparing probabilities, the standard measures like cross entropy and Kullback-Leibler divergence come up. Suppose we want to quantify the difference between two distributions p(x) and q(x). It can be done with the help of cross entropy defined as

H(p, q) = -\int p(x) \log q(x)\, dx = H(p) + D_{KL}(p \,\|\, q)

with the last term being the Kullback-Leibler divergence

D_{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx

which can be used as a measure of the difference on its own. That works well when there is at least some overlap between the probability densities.

An example of a situation where the difference between two distributions can be successfully assessed by the cross entropy metric. Image by the author.

But if the distributions are too far from each other, these measures fail. For example, the situations in the figure below would look identical to a cross entropy objective function, and no meaningful quantification would be produced.

Examples of situations when cross entropy is not helpful. Image by the author.

Most importantly though, the optimization of the objective function is based on gradient techniques, and the gradients will all vanish or be undefined when the distributions are too far apart. Not only would our cross entropy objective function not tell us how far off we are, it would not even indicate in which direction to go!

For that reason, the authors of reference [2] suggested a different measure: the Wasserstein distance. Expressed in words, it is defined as something like “the minimum work required to transform the distribution q(x) into p(x)”. That is, literally, if our q(x) and p(x) were piles of dirt, how much work would it take to transport q(x) to the place where p(x) sits and build exactly the same pile as p(x) out of q(x)? Sounds funny, but if we forget for the moment about all the trouble we have to go through to express it mathematically, we realize that this solves our problem conceptually, as it takes into account both the difference in the shape of the piles and the distance between them.
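A quick one-dimensional illustration (my own example, not taken from the original papers) shows why this behaves better than the measures above. For two unit point masses sitting at positions a and b, the only possible transport plan moves the whole mass across the gap, so

W(\delta_a, \delta_b) = |a - b|

which grows smoothly with the separation and therefore carries a useful gradient, while the KL divergence between these two non-overlapping distributions is infinite no matter how close or far apart they are.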

So, how do we convert that process definition into a mathematical expression? Unfortunately, the answer is not simple. I can try to sketch it here, but it would take a couple of good-sized posts to go into all the details. A well-written blog on the subject can be found in [3]. First, to transform our word definition of the Wasserstein distance W(p,q) into something resembling math, we can write

W(p, q) = \inf_{\gamma \in \Pi(p, q)} \mathbb{E}_{(x, y) \sim \gamma} \big[\, \|x - y\| \,\big]

What it says here is that we are looking for a minimum (inf), over all the transport plans γ defined such that

\int \gamma(x, y)\, dy = p(x), \qquad \int \gamma(x, y)\, dx = q(y)

of the work required for the transport. Here x runs over the support of p(x), that is, all the values where p is not zero, or the location of the p dirt pile. And y runs over the support of q(y), that is, the values where q is not zero, or the location of the q dirt pile. And the work is defined as a chunk of mass (density γ) multiplied by the distance it needs to be transported, ||x-y||. That is still not very helpful. But we can see now that it is an optimization problem. Some people studied similar problems and showed that there is an equivalent formulation (the Kantorovich-Rubinstein duality) that allows rewriting our definition as

W(p, q) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)]

which says that we are now looking for a maximum (sup) of this expression over all functions whose growth is bounded by 1 (the Lipschitz constraint). There are a few steps skipped here. They can be found in [3] and the references therein. I highly recommend reading up on the derivation and the useful properties of the Wasserstein distance; we, however, will stipulate them here and move forward.

This is already something that we can use. We can say that f(x) is our discriminator function and train the model in a way that maximizes this last expression for W(p,q). The only thing we have to do is figure out how to make the function f(x) satisfy the Lipschitz constraint. The authors of reference [4] suggested imposing a gradient constraint forcing it to be close to 1. This can be done by adding a penalty term to W(p,q), giving the discriminator loss to be minimized:

L = \mathbb{E}_{y \sim q}[f(y)] - \mathbb{E}_{x \sim p}[f(x)] + \lambda\, \mathbb{E}_{\hat{x}} \Big[ \big( \|\nabla_{\hat{x}} f(\hat{x})\|_2 - 1 \big)^2 \Big]

where the gradient is evaluated on x̂, a set of points sampled along the straight lines between p and q. Alright, it is time to see this in practice.

The model

First let’s take a look at my simple implementation of a WGAN in the code below.
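What follows is a minimal sketch of such a class in the spirit of the standard Keras WGAN-GP recipe; the class name, the default hyperparameters, and the optimizer wiring are my illustrative choices, not necessarily the exact original code.

import tensorflow as tf
from tensorflow import keras

class WGAN(keras.Model):
    def __init__(self, generator, discriminator, latent_dim=128,
                 d_steps=5, gp_weight=10.0):
        super().__init__()
        self.generator = generator          # built outside and passed in
        self.discriminator = discriminator  # built outside and passed in
        self.latent_dim = latent_dim
        self.d_steps = d_steps              # discriminator steps per generator step
        self.gp_weight = gp_weight          # lambda in the gradient penalty term

    def compile(self, g_optimizer, d_optimizer):
        super().compile()
        self.g_optimizer = g_optimizer
        self.d_optimizer = d_optimizer

    def gradient_penalty(self, real, fake):
        # Gradient norm evaluated on points sampled along straight lines
        # between real and generated objects
        eps = tf.random.uniform([tf.shape(real)[0], 1, 1, 1], 0.0, 1.0)
        mixed = eps * real + (1.0 - eps) * fake
        with tf.GradientTape() as tape:
            tape.watch(mixed)
            scores = self.discriminator(mixed, training=True)
        grads = tape.gradient(scores, mixed)
        norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
        return tf.reduce_mean((norm - 1.0) ** 2)

    def train_step(self, real):
        batch_size = tf.shape(real)[0]
        # Several discriminator steps per generator step
        for _ in range(self.d_steps):
            z = tf.random.normal([batch_size, self.latent_dim])
            with tf.GradientTape() as tape:
                fake = self.generator(z, training=True)
                # Negative of the real-minus-fake score difference,
                # plus the gradient penalty
                d_loss = (tf.reduce_mean(self.discriminator(fake, training=True))
                          - tf.reduce_mean(self.discriminator(real, training=True))
                          + self.gp_weight * self.gradient_penalty(real, fake))
            grads = tape.gradient(d_loss, self.discriminator.trainable_variables)
            self.d_optimizer.apply_gradients(
                zip(grads, self.discriminator.trainable_variables))
        # One generator step: maximize the score of the generated objects,
        # i.e. minimize its negative
        z = tf.random.normal([batch_size, self.latent_dim])
        with tf.GradientTape() as tape:
            fake = self.generator(z, training=True)
            g_loss = -tf.reduce_mean(self.discriminator(fake, training=True))
        grads = tape.gradient(g_loss, self.generator.trainable_variables)
        self.g_optimizer.apply_gradients(
            zip(grads, self.generator.trainable_variables))
        return {"d_loss": d_loss, "g_loss": g_loss}

Training then reduces to the usual Keras workflow; the Adam settings here are common WGAN-GP defaults, again an assumption rather than the exact values used for the results below:

wgan = WGAN(build_generator(), build_discriminator())  # builders sketched below
wgan.compile(
    g_optimizer=keras.optimizers.Adam(1e-4, beta_1=0.5, beta_2=0.9),
    d_optimizer=keras.optimizers.Adam(1e-4, beta_1=0.5, beta_2=0.9))
wgan.fit(train_images, batch_size=8, epochs=200)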

It is a class that extends the keras Model and defines the generator and discriminator members, also as keras Models. The actual structure of the generator and the discriminator has to be built outside of the WGAN model class. Instead, the class implements the training procedure. The training is done separately for the generator and the discriminator. The model defines one step for each, and, usually, there are multiple discriminator steps for every generator step. The generator step simply maximizes the score of the generated, fake, objects, or rather minimizes the negative of the score. The discriminator step, in line with our Wasserstein distance strategy, maximizes the difference between the real and the fake objects' scores, or rather minimizes the negative of the difference, keeping the gradient penalty in mind.

Here are the details of the generator and the discriminator functions that I used for the results provided in this post:
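In spirit, they can look something like the sketch below for 64x64 RGB images; the layer sizes, the latent vector length, and the final sigmoid (matching images with values in [0, 1]) are my assumptions, and residual_block is defined in the next snippet:

from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 128  # length of the random input vector (an assumption)

def build_generator():
    z = keras.Input(shape=(LATENT_DIM,))
    x = layers.Dense(8 * 8 * 128)(z)
    x = layers.Reshape((8, 8, 128))(x)
    for filters in (128, 64, 32):      # 8x8 -> 16x16 -> 32x32 -> 64x64
        x = layers.UpSampling2D()(x)
        x = residual_block(x, filters)
    out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)
    return keras.Model(z, out, name="generator")

def build_discriminator():
    img = keras.Input(shape=(64, 64, 3))
    x = img
    for filters in (32, 64, 128):      # 64x64 -> 32x32 -> 16x16 -> 8x8
        x = residual_block(x, filters)
        x = layers.AveragePooling2D()(x)
    score = layers.Dense(1)(layers.Flatten()(x))  # unconstrained scalar score
    return keras.Model(img, score, name="discriminator")

Note the mirrored structure: the generator upsamples from the latent vector to the image, while the discriminator downsamples from the image to a single unconstrained score, as the Wasserstein formulation requires.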

They both use a so-called residual block, which encourages the model to focus its attention on fitting whatever did not fit well in the previous layers and promotes robust convergence:
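A typical version of such a block adds the input back to the output of a pair of convolutions, so the convolutional branch only has to fit the residual; the filter sizes and activations below are again illustrative assumptions:

def residual_block(x, filters):
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.LeakyReLU(0.2)(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    if shortcut.shape[-1] != filters:
        # Match the channel count so the addition is well defined
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.LeakyReLU(0.2)(layers.Add()([shortcut, y]))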

I would like to point out that for the purposes of this post we do not need the perfect model, we just need a model that more or less works. We are going to check how it worked later, but now we will look at the training set.

Training objects

For the training objects I have created two classes of images. The images are 3 channel 64x64 pixels. Here is the code that generated them:

These are two different shapes — a square and a circle — each of a different color, as exemplified in the picture below:

Two classes of objects used for training. Image by the author

The normalized color is defined as a real number between 0 and 1 for the first channel, with offsets of 0.1 and 0.2 in the other two channels. The sign of the offset can be changed and is “+” for the square and “-” for the circle. There is no particular meaning in the way the color is generated; I just wanted an easily changeable and recordable set of parameters. With that, the color is recorded as a pair of (base color, sign of the offset), as marked on the picture. The values of the offsets never change. Additionally, random normal noise of a predefined variance is added to each channel. The standard deviation of the noise is the last number in the image label.

The training set contains 300 copies of each class. The copies of the same class are not identical due to the random noise. Note that the addition of the noise does not serve as some kind of regularization. Rather, it is meant to represent a real continuous feature of our data. We thus have three features of different types: shape is categorical; color is semi-categorical, since, in a digital image, it has many levels but they are still discrete (3 channels times 256 levels in each); and noise is continuous, as defined by its variance.

The training

I trained the model in batches of 8 for a maximum of 200 epochs. I will use this run for the benchmark results. The plot below shows the values of the discriminator objective function over the training iterations.

Trace plot of the discriminator loss function during 200-epoch training. Image by the author.

This is our Wasserstein distance, and it is going in the right direction. There is a noticeable jump in the last 1000 iterations. I am not sure why the convergence is not more monotonic, but GANs are tricky, and our training set is a bit too artificial. Overall, we can be quite satisfied with this training.

As an additional check, I plot the gradient penalty term here. It is not exactly zero, but it is well bounded from above and is going down. All good again.

Trace plot of the gradient penalty term during 200-epoch training. Image by the author.

Now that our model is trained to a reasonable extent, we are going to test it out.

Generating objects

After a GAN model is trained, people usually discard the discriminator part and use the generator to produce new data with the desired properties. We will do that as well, but mostly as another check that the training worked sensibly. Below are examples of the images our trained model has generated. They are the same size, 64x64, just rendered smaller.
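Producing such samples amounts to a single call to the generator; a minimal sketch, assuming the wgan model and LATENT_DIM from the earlier snippets:

z = tf.random.normal([16, LATENT_DIM])    # 16 random latent vectors
generated = wgan.generator.predict(z)     # array of shape (16, 64, 64, 3)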

Objects generated by the model after 200-epoch training. Image by the author

One can see that the model has kind of gotten the hang of it — at least it did not generate an image of a cat! It got the colors right. The shapes, even though recognizable, are sometimes mixed together. In general, there is potential for improvement in the quality of the generated data, but the model was on its way. Let us see if it is enough for our purposes.

Scoring objects

And now we are getting to what we were after to begin with — scoring the objects the model has never seen. My purpose all along was to see if the trained discriminator scores have any pattern to them. Can they tell us what the model learnt about the underlying data?

Here is the test set for the discriminator:

10 classes of objects used for discriminator testing. Image by the author.

I have added more variety in shape and color, but otherwise the images are generated by the same scheme and are of the same size, 64x64. The new shape is a triangle, and the two new colors, (0.85,+1) and (0.7,-1), are created from the old colors, (0.85,-1) and (0.7,+1), by switching the offset sign. Overall, there are 10 classes of objects in the test set, two of which are the same as in the training set as far as our object features (shape, color, and noise) are concerned. Each class has 40 replicates in the test set.

I ran all the test images through the trained discriminator and collected the scores. Here is the table summarizing the results and providing the legend for the score histogram below. Highlighted in red are the objects used for training; the rest of the objects the model has not seen before.

Summary of scores for the 10 classes in the test set. The legend colors correspond to the histogram image below. Image by the author.
Histogram of discriminator scores for the test set colored by object class. Image by the author.
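Collecting these scores is just a forward pass through the trained discriminator. A sketch, assuming test_images stacks the 10 classes, 40 replicates each, in order, and class_labels holds the matching labels (both names are mine):

test_scores = wgan.discriminator.predict(test_images).ravel()
scores_by_class = test_scores.reshape(10, 40)   # one row per class
for label, s in zip(class_labels, scores_by_class):
    print(f"{label}: mean = {s.mean():.1f}, std = {s.std():.1f}")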

Amazingly, there is good separation between almost all of the classes tested. The model has recognized that they belong to different regions of the distribution.

As I would have expected, the shape and the color are the easiest features for the model to recognize. There was no confusion there. The scores for the new shape, the triangle, are very well separated from those of the other shapes (the two greenish bars in the middle). Even within the triangle shape, the model was able to separate the objects by color.

The noise level seems to be a less distinctive feature. The two squares of the same color were well separated by noise (see the two left-most groups on the histogram). However, the two circle classes of the same color but different noise levels were scored very closely. Nonetheless, if we zoom in on the score distribution, shown below, we can see that there is a tiny but clean separation. There is a chance that the gap would have grown had we trained the model longer.

A zoomed-in view of the scores for the two closely scored classes “circle (0.85, -1) 0.03” (bright yellow) and “circle (0.85, -1) 0.01” (dark golden). Image by the author.

The only unresolved case remaining so far is the two triangles of the same color but different noise levels.

So, if we wanted, we could use the discriminator trained this way as a pretty good classifier. I would like to remind you that there were no labels provided during the training. The model learnt to separate the two training classes by itself. And, in addition, it also learnt to correctly separate classes it had never seen!

Scoring with partially trained models

At this point, I asked myself a question: how early in the training process does the separation of the scores begin? And how is it related to the apparent quality of the generated images? To answer this, I ran the training for the probably sub-optimal durations of 10, 20, and 100 epochs. In each case, the training run was started from scratch, as opposed to continuing the previous run. That resulted in the absolute values of the scores being on different scales, but that was also one of the aspects I wanted to test. The obvious and expected conclusion is that the absolute values of the scores have no meaning.

The picture below shows, on the left, the images that the model was able to generate at each of these training levels and, on the right, the histograms of the discriminator scores colored by test image class, using the same color scheme as above.

Generated objects, on the left, and corresponding test objects’ score histograms, on the right, for different extents of the training. Image by the author.

To make them a little more comparable, I used the “square(0.85,+1)0.03” training class as a reference and subtracted the mean score of this class from all the scores. That is, the scores of the “square(0.85,+1)0.03” class are always centered at 0. Also, if you have not noticed already, the scores of the same shapes are united by a color scheme to facilitate visual comparison. That is, all circle scores on the histogram are in yellowish colors, all squares in purplish, and all triangles in greenish. Moreover, the intensity of the score color is related to the noise level of the object group: the more intense, darker shade indicates the lower, 0.01, noise level. Ironically, the actual color of the object is not color-coded.
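The per-run centering itself is a one-liner; a sketch, assuming the scores_by_class array and class_labels list from the earlier snippet:

ref = scores_by_class[class_labels.index("square(0.85,+1)0.03")].mean()
centered_scores = scores_by_class - ref   # reference class now centered at 0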

One thing that is immediately obvious from this picture is that the score clustering begins very early in the training process. After only 10 epochs, the training is obviously not sufficient to generate any decent images. Nevertheless, the model already scores the shapes differently. The score clustering becomes more granular after 20 epochs, while the data generation is still not going too well. It is worth pointing out that the new, triangle, shape is consistently scored well outside of the other, already seen, shapes.

All of that, of course, is an assessment from the human point of view. But, in the spirit of our times, we should be wondering what the machine thinks. Let us take a look at that in the next section.

Using WGAN scores for embedding

Employing the discriminator part of a properly trained GAN model as a classifier is already a respectable strategy. We can, however, get more creative and feed these scores into another model, combine them with additional data and metadata, and perform some sophisticated analysis. This situation is known in machine learning as embedding. More precisely, an embedding means converting a diverse set of data formats, or reducing high-dimensional data, into more computationally friendly data types. An example would be converting strings representing words into real numbers.

We are already doing this with our shape images, producing a number for each, but let us be more critical and explore how suitable these numbers are for embedding. What constitutes a good embedding? I am not aware of a universally accepted answer to this question, so I am going to state my point of view. The key feature of numbers is comparability, and with that comes the concept of distance. That is, if you have two numbers, you can tell whether they are the same or different, and, in the latter case, by how much. I need to apologize here to mathematicians, who, at this point, would start talking about fields and manifolds and the measures defined thereon, for using such pedestrian language.

When we convert seemingly non-comparable objects into numbers, we automatically get this additional benefit of comparability. This can be a double-edged sword when we perform further analysis. On one hand, the machine might find some patterns that we have missed; on the other hand, it can put meaning into something that does not have any. The famous case from word embedding where you can perform arithmetic operations with embedded words and get something intuitively correct, like king - man + woman = queen, is an example of a good embedding. However, since people keep using only this particular example, I suspect there are many other situations in the same field of Natural Language Processing (NLP) where a counter-intuitive meaning gets assigned or any intuitive meaning is lost. To summarize, a good embedding, in my mind, is one that a) clusters objects in some logical way; b) defines the distance between objects such that it allows interpretation; c) does it consistently.

What do we have with our scores? We are good on item a), as we clearly observe some meaningful grouping. Let us examine a couple of quantitative comparisons that the model suggests. Based on the differences between the scores, the machine arranges the “circle(0.85,+1)0.03” and the “square(0.7,+1)0.03” objects in these positions relative to the “triangle(0.7,-1)0.01”:

Relative scores for different test classes. Image by the author.

If I had to guess, I would say that the deciding factor here might have been the area, which is part of the shape property quantification. In another example, the circles of different colors and noisiness are placed at the following distances:

Relative scores for different test classes. Image by the author.

That can be nicely explained by color being the predominant consideration. There is another interesting observation: the width of the score distribution seems to be related to the noise level of the image group. This could be interpreted as a measure of interaction between the color and the noise features. In any event, item b) might also be satisfied. But, if the results with the partially trained models are any indication, item c) is the weakest part of our approach. Every time we retrain the model, the relative scores of the classes will be different.

What I have discussed in this last section of the post are mostly qualitative musings, even though they are related to quantitative estimates. I have not done a comprehensive study of the embedding application. So far, the answer to the question of whether our discriminator produces scores suitable for embedding would probably be “no”, at least not in its current form. But I have an idea, for which you will have to wait for another post :-)

The conclusion

In summary, we went through the basic principles of GANs. We also reviewed the arguments for using the Wasserstein distance as an objective function while training them. All of that was used to illustrate how GANs work internally on a toy data set. But the main point of the post was to draw attention to the underappreciated part of a GAN: the discriminator. I hope I was able to convince the reader that it is doing some wonderful work under the hood. I really think it is time for GANs to take a more commanding role in the machine learning world.

References

[1] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio (2014), Generative Adversarial Networks, arXiv:1406.2661

[2] M. Arjovsky, S. Chintala, and L. Bottou, Wasserstein GAN (2017), arXiv:1701.07875

[3] V. Herrmann, Wasserstein GAN and the Kantorovich-Rubinstein Duality

[4] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. Courville, Improved Training of Wasserstein GANs (2017), arXiv:1704.00028
