DEEP LEARNING FOR BEGINNERS

Keywords to know before you start reading papers on GANs

Important recurring keywords/concepts explained in plain English

Dr. Varshita Sher
Towards Data Science
12 min read · Mar 22, 2021


There is no denying the fact that GANs are awesome! If you don’t know what they are, check out this article where I explain GANs from scratch to a 5-year-old and how to implement GANs in PyTorch! In a nutshell, GANs belong to a category of generative models that let us generate incredibly realistic synthetic data with the same qualities as the underlying training data. That means if you feed the model images of a few bedroom decors, after a few hours of training it can generate never-seen-before ideas for your interior design.

Bedroom designs generated by a StyleGAN [Made available under Creative Commons BY-NC 4.0 license]

Over the past few weeks, I have probably read a dozen papers on GANs (and their variants) and tinkered around with their code on custom images (courtesy of open-source GitHub repos). While most of these papers are brilliantly written, I wish there were a few keywords I had known before I plunged into these academically written manuscripts. Below I will discuss a few of them and hope it saves you some time (and frustration) when you encounter them in papers. Just to be clear, this is not an article to explain these papers in depth or even how to code them, but simply to explain what certain keywords mean in specific contexts. Rest assured, as and when I read more, I will make sure to keep this list up-to-date.

As for the prerequisites, I am assuming most of you already know what Discriminator and Generator networks are with regard to GANs. And that's about it! For those of you who might need a recap:

A Generator network’s aim is to produce fake images that look real. It takes as input a random vector (say, a 100-dimensional array of numbers drawn from a Gaussian distribution) and outputs a realistic-enough image that looks like it could belong to our training set!

A Discriminator network’s aim is to correctly guess whether an image is fake (i.e. generated by the Generator) or real (i.e. coming directly from the input source).

Let’s begin!

Latent Representation of an image

To understand latent representation, think of it this way: any colored image of mine with height and width of 100 would probably be stored in an array of shape (100, 100, 3). To represent or visualize this image in some form, I would need 100*100*3 = 30,000, i.e. roughly 30k, dimensions. Ouch, that’s a lot!

So now we try to find a compressed representation of my image such that it requires far fewer than 30k dimensions. Let’s say we somehow find a representation, using some dimensionality reduction technique, that uses only 5 dimensions. That means my image can now be represented using a (hypothetical) vector v1 = [.1, .56, .89, .34, .90] (where .1, .56, .89, etc. are the values along each of the five axes) and my friend’s image can be represented using vector v2 = [.20, .45, .86, .21, .32]. These vectors are known as the latent representations of the images.

Of course, visualizing them would still be a challenge because even 5-dimensional representations are hard to parse. In reality, however, we use much bigger representations (on the order of hundreds of dimensions), not just 5.
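To make this a bit more concrete, here is a minimal sketch of the idea, using scikit-learn's PCA purely as a stand-in for “some dimensionality reduction technique” and random arrays as stand-ins for real images:

    import numpy as np
    from sklearn.decomposition import PCA

    # Stand-in for a small dataset of 100x100 RGB images, each flattened to 30,000 values
    images = np.random.rand(50, 100 * 100 * 3)

    # "Some dimensionality reduction technique" that keeps only 5 dimensions
    pca = PCA(n_components=5)
    latents = pca.fit_transform(images)

    print(latents.shape)  # (50, 5) -> one 5-dimensional latent vector per image
    print(latents[0])     # something along the lines of [.1, .56, .89, .34, .90]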

Latent space

Both the vectors described above (along with many others of my friends, colleagues, family members, etc.) constitute the latent space. In this latent space, images that are similar (say, images of two cats) will be bundled up closer, and images that are strikingly different (cats vs. cars) will be farther apart.

In short, latent space can be thought of as space where all the latent representations of an image live. This space could be 2D if each image is represented using a two-element vector; 3D if each image is represented using a three-element vector; and so on.

This space is known as “latent”, meaning hidden, because the representation is not something we observe directly and is quite hard to visualize in reality. Can you imagine visualizing anything beyond 3-D space in your head, let alone a 100-D space?

Latent space is simply any hypothetical space that contains points (representing images) in such a way that a Generator knows how to convert a point from the latent space into an image (preferably one that looks similar to the dataset it was trained on).
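As a tiny illustration of “closer” and “farther apart”, the two hypothetical latent vectors v1 and v2 from earlier can be compared with plain vector distance (NumPy here, nothing GAN-specific):

    import numpy as np

    v1 = np.array([0.10, 0.56, 0.89, 0.34, 0.90])  # my image's latent representation
    v2 = np.array([0.20, 0.45, 0.86, 0.21, 0.32])  # my friend's image's latent representation

    # Euclidean distance: small for similar images (two cats), large for dissimilar ones (cats vs. cars)
    print(np.linalg.norm(v1 - v2))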

P.S. It would be a shame if I didn’t link this awesome article by Ekin Tiu which explains the intuition behind latent space in much more detail. Also, don’t forget to check out his exceptional visual representation of latent spaces containing digits from 0–9.

Z-space

Following the earlier definition of latent space, Z-space can be defined as the space where all the z-vectors live. A z-vector is nothing but a vector containing random values drawn from a Gaussian (normal) distribution.

The z-vector is often passed as an input into a fully trained GAN generator model following which the model spits out a real-looking fake image.

If you come across something like “sample a point from the Z-space” or “sample a latent code from the Z-space” in one of the GAN papers, think of it as picking a point from the Z-space, i.e. a vector of real numbers drawn from a Normal distribution.

P.S.: In the original GAN and DCGAN papers, the z-vector is 100-dimensional!
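In PyTorch, “sampling a point from the Z-space” really is a one-liner. A minimal sketch, where the generator (not defined here) stands in for any fully trained DCGAN-style generator you may have loaded:

    import torch

    z = torch.randn(1, 100)      # one 100-dimensional z-vector drawn from a Gaussian
    # fake_image = generator(z)  # a pre-trained generator would map z to a real-looking fake image
    #                            # (some implementations expect the shape (1, 100, 1, 1) instead)
    print(z.shape)               # torch.Size([1, 100])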

Style codes/Style vectors

Soon after you finish learning about vanilla GANs, you will come across a new kind of GAN, i.e. StyleGANs. While vanilla GANs, at best, can faithfully replicate the training data and produce more data that looks just like it, the cool thing about StyleGANs is that they allow high-fidelity images to be generated with much more variation in them: varied backgrounds, freckles, spectacles, hairstyles, etc.

To do this, the authors implemented various architectural improvements. One of them is as follows: instead of passing the z-vector directly into the generator (which, FYI, is sometimes also called the synthesis network in the StyleGAN paper), it is first passed through a mapping network to produce a w-vector, AKA style code, AKA style vector. This is then injected into the synthesis network at various layers (after undergoing some layer-specific transformations) and the output we get is an awesome high-fidelity image.

P.S. Both the Z and W spaces in the StyleGAN architecture are 512-dimensional. Also, the distribution of Z is Gaussian, but the W space does not follow any specific distribution.
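Here is a minimal sketch of that flow in PyTorch; the two-layer MLP below is just a toy stand-in for StyleGAN's mapping network (the real one is an 8-layer MLP), but it shows how a z-vector becomes a style code w:

    import torch
    import torch.nn as nn

    # Toy stand-in for StyleGAN's mapping network (the real one is an 8-layer MLP)
    mapping = nn.Sequential(
        nn.Linear(512, 512),
        nn.LeakyReLU(0.2),
        nn.Linear(512, 512),
    )

    z = torch.randn(1, 512)  # sampled from Z-space (Gaussian)
    w = mapping(z)           # style code / style vector in W-space (no fixed distribution)
    print(w.shape)           # torch.Size([1, 512])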

W-space & extended w-space (W+)

By now, hopefully, you understand how spaces can be defined. So, naturally, W-space is some hypothetical residence of all the style vectors w defined above, such that if we were to pick a vector at random from this space and feed it to the StyleGAN generator, it would produce a realistic-looking fake image (say, I).

The latent space W is quite an important concept in StyleGANs as it holds the key to controlling various attributes/features of an image. This is because the W-space is disentangled, meaning each of the 512 dimensions encodes unique information about the image; for instance, the first dimension might control the expression, the second the pose, the third the illumination, etc. This knowledge allows us to make certain modifications to the image. For instance, if I were to somehow know the right values to change in the vector w to generate w’, and then feed w’ to a StyleGAN generator, it can produce a smiling version of the original image I. (Note: We will see later in the tutorial how to find these “right” values to change in the latent codes.)

Many a time, to increase the expressiveness (i.e. the ability to generate uniquely different faces that look hella different from an “average” face) of a StyleGAN generator, instead of using one style vector for all the layers, we use a unique style code for each layer in the synthesis/generator network. This is known as the extension of W-space and is usually represented as W+.

Encoding/Projecting/Embedding image in latent space

While it’s cool to be able to make modifications to the facial features of fake images spewed out by the StyleGAN Generator, what would be 100x cooler is if we could do all that on real images of you and me.

To do this, the very first step is to find a latent representation for my image in the W-space of StyleGAN (such that I can then modify the right values in there to generate a smiling/frowning pic of mine). That is, find a vector in W-space such that when this vector is fed to a StyleGAN generator, it will spew out my exact image. This is what is known as embedding/projecting/encoding an image in the latent space.

Research suggests that embedding a real input image works best when it is mapped into the extended latent space (W+) of a pre-trained StyleGAN. That means the latent representation will have shape (18, 512), i.e. 18 unique style codes, each a 512-element embedding.

Note: A StyleGAN generator capable of synthesizing images at a resolution of 1024 × 1024 has 18 style input layers. That is why the latent code in W+ takes the shape (18, 512). If your StyleGAN synthesizes images at a higher or lower resolution than this, the number of style inputs might differ and so would the shape of your latent representation.
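A quick way to see the shapes involved; in this minimal sketch a single 512-D style code is simply repeated once per style input layer to form a W+ code (real encoders, of course, predict 18 distinct codes):

    import torch

    w = torch.randn(512)                   # a single style code in W
    w_plus = w.unsqueeze(0).repeat(18, 1)  # one copy per style input layer -> a W+ code
    print(w_plus.shape)                    # torch.Size([18, 512])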

Now coming back to the main question: How do I find this vector/latent representation of my image? That’s what GAN inversion is all about!

GAN inversion

GAN inversion is the process of obtaining the latent code for a given image such that when the code is fed to a generator we can easily reconstruct our original image.

I do not know if that’s mind = blown kinda information for you, but if not, here’s another way to think about the usefulness of GAN inversions (P.S. I cannot take credit for the following, I read it somewhere on the Internet):

In a way then, any human ever born or yet to be born is present in the latent space. (You simply need to find the right inversion).

There are two main methods defined in the literature for inverting an image:

  • Method 1: Pick a random latent representation and work towards optimizing it until you have minimized the reconstruction error for the given image (a minimal sketch of this approach follows this list). This method takes longer but allows superior reconstruction.
  • Method 2: Train an encoder such that it can map a given image to its corresponding latent code. This method is quicker compared to its counterpart, but the recreated images exhibit higher distortion, although they turn out to be highly editable.
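Here is a minimal sketch of Method 1 in PyTorch. The generator below is a toy stand-in for a pre-trained (and frozen) StyleGAN generator and the target a stand-in for the real image you want to invert; real pipelines also add a perceptual (LPIPS) loss on top of the plain pixel loss used here:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy stand-in for a pre-trained StyleGAN generator: maps an (18, 512) latent code
    # to a flattened 3x64x64 "image". In practice you would load the real, frozen generator.
    generator = nn.Sequential(nn.Flatten(), nn.Linear(18 * 512, 3 * 64 * 64))
    for p in generator.parameters():
        p.requires_grad_(False)  # the generator stays frozen; only the latent code is optimized

    target = torch.rand(1, 3 * 64 * 64)              # stand-in for the real image to invert
    w = torch.randn(1, 18, 512, requires_grad=True)  # start from a random latent code

    optimizer = torch.optim.Adam([w], lr=0.01)
    for step in range(500):
        optimizer.zero_grad()
        loss = F.mse_loss(generator(w), target)  # pixel reconstruction error
        loss.backward()
        optimizer.step()

    # w is now (approximately) the latent code whose generated image matches the target.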

Using either of these two methods (or a combination of the two), it is possible to obtain a GAN-inverted image that looks reasonably similar to the original image, and the distortion, if any, is hardly noticeable. Here’s an example of a fictional character, Chidi Anagonye, from The Good Place, played by William Jackson Harper, and his GAN-inverted image. Quite remarkable, isn't it!

Original (on the left) and GAN inverted image (on the right) of William Jackson Harper. [Image generated by Author using the recently released e4e encoder].

However, in my honest opinion, getting a perfect recreation shouldn’t be your end goal. What’s more important is what you do with the GAN-inverted image once you have it! More importantly, can we use the inversion to perform meaningful image editing? Let’s look at that next!

Semantic editing

One of the most common steps succeeding GAN inversion involves editing the latent code such that one can successfully manipulate some facial features in the image. As an example, here’s a smiling Chidi Anagonye, obtained by manipulating the latent code from GAN inversion:

Smile editing applied on the GAN inverted image of William Jackson Harper. [Image generated by Author]

Semantic editing encapsulates such edits with one important consideration: only the intended features must be modified while the remaining features remain unchanged. For instance, in the above example, changing the person’s expression did not change their gender or pose.

For this very reason, we should aim for a highly disentangled latent space, and as research has pointed out, the W-space of a StyleGAN shows much higher disentanglement compared to the Z-space, mainly because “W is not restricted to any certain distribution and can better model the underlying character of real data”. That is why most existing research papers you come across will try to find a latent representation for a new image in W-space rather than Z-space.

Latent Space Interpolation

Another interesting feat to achieve with the projected latent code is to use it in combination with another latent code. How? you may ask!

Simply take two latent codes, which could be the codes for images of you and your favorite celeb. Now, in a well-developed latent space, these two points would be far apart because, chances are, you look nothing like your favorite celebrity. However, you can pick a point (in space) between these two points, feed it to the Generator, and create an intermediate output. Sort of like a mashup of you and your celeb crush (or a love-child), if you may! This is what latent space interpolation is all about: smooth transitions between two latent codes in the latent space.

Here’s a short video clip released by MIT researchers studying 3D GANs which can help you visualize the concept of interpolation. In it we can see how a broad chair with arms is morphed into a tall, armless chair.

Interpolation in linear space [Video available in public domain]

The simplest linear interpolation can be achieved using straightforward vector arithmetic. That is, given two latent vectors a and b (both having the shape (18, 512)), corresponding to the latent representations for you and your celeb crush, respectively:

  • If we interpolate exactly halfway, we would get a new point using the formula
    a_b_half = a * 0.5 + b * 0.5
    Upon feeding a_b_half to the Generator, the resultant image would look 50% like you and 50% like your crush.
  • If we interpolate three-quarters of the way from your latent code towards your crush’s (i.e. only a quarter of the way back from theirs), the new point can be obtained using the formula
    a_b_quarter = a * 0.25 + b * 0.75
    Upon feeding a_b_quarter to the Generator, the resultant image would look more like your crush (we took 75% of b’s latent code) and a little less like you.

Linear interpolation is a nice way to show a transition between two images and explore the GAN’s latent space. The exploration helps develop an intuition and ensures that the latent space learned by the GAN is not something wacky.
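For completeness, here is a minimal sketch of such an interpolation in PyTorch; a and b are random stand-ins for the two (18, 512) latent codes, and the commented-out generator call stands in for a pre-trained StyleGAN generator that would render each frame:

    import torch

    a = torch.randn(18, 512)  # latent code for "you"
    b = torch.randn(18, 512)  # latent code for your favorite celeb

    # 8 evenly spaced points on the straight line from a to b (t = 0 gives a, t = 1 gives b)
    for t in torch.linspace(0, 1, 8):
        interpolated = a * (1 - t) + b * t
        # frame = generator(interpolated.unsqueeze(0))  # a pre-trained generator would render this frame
        print(round(t.item(), 2), interpolated.shape)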

Latent Directions/Semantic Boundary

Coming back to the topic of manipulating facial attributes, I promised to explain how to find the “right” values to modify within the latent representation of an image. An important thing to note is that the values that need to be modified will depend on the attribute you are aiming to change, such as smile, pose, illumination, etc.

Now finding these values is analogous to finding a direction in which to move in the latent space (much like doing some freestyle interpolation) to see what makes a face go from smiling to frowning, or eyes-open to eyes-closed, etc.

A latent direction/semantic boundary for an attribute A is a vector which, when added to the latent code of an image, generates a new image with the attribute A added to the original image.

There are multiple ways (both supervised and unsupervised) to learn these latent directions, but luckily for us, these directions are regularly open-sourced by many researchers (here you can find the directions for StyleGAN and StyleGAN2 models).
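Applying one of these directions is plain vector arithmetic. A minimal sketch, where the normalised random vector stands in for a learned direction (in practice you would load one of the open-sourced .npy/.pt files) and the commented-out generator call for a pre-trained StyleGAN generator:

    import torch

    w = torch.randn(1, 18, 512)  # latent code of the GAN-inverted image (in W+)

    # Stand-in for a learned latent direction; in practice, load an open-sourced file instead,
    # e.g. direction = torch.from_numpy(np.load("smile.npy")).float()  (hypothetical file name)
    direction = torch.randn(512)
    direction = direction / direction.norm()

    alpha = 3.0                           # edit strength: positive adds the attribute, negative removes it
    w_edited = w + alpha * direction      # move the latent code along the semantic boundary
    # edited_image = generator(w_edited)  # a pre-trained generator would render the edited face
    print(w_edited.shape)                 # torch.Size([1, 18, 512])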

In case you want to generate these directions yourself (say for the ‘age’ attribute), the implementation details have been provided in the InterfaceGAN paper:

  • Sample latent codes from the W-space of a pre-trained GAN model (could be PGGAN, StyleGAN, StyleGAN2, etc.) and generate a few images using the codes.
  • Pass the generated images through a pre-trained age classifier such that for each image you have a label indicating whether the face is old or young.
  • Now train a linear SVM classifier with the latent codes as inputs and the labels as outputs.
  • The weights learned by the SVM correspond to the latent direction for age (see the sketch after this list).
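A minimal sketch of those steps using scikit-learn; the random latent codes and labels below are stand-ins for codes sampled from W and for the old/young labels that a pre-trained age classifier would produce:

    import numpy as np
    from sklearn.svm import LinearSVC

    # Stand-ins: 1,000 latent codes sampled from W (512-D) and their old-vs-young labels
    # (in practice the labels come from running the generated images through an age classifier)
    latent_codes = np.random.randn(1000, 512)
    age_labels = np.random.randint(0, 2, size=1000)

    # Train a linear SVM to separate "old" from "young" in latent space
    svm = LinearSVC(max_iter=5000)
    svm.fit(latent_codes, age_labels)

    # The (unit-normalised) weight vector is the latent direction / semantic boundary for age
    direction = svm.coef_[0] / np.linalg.norm(svm.coef_[0])
    print(direction.shape)  # (512,)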

P.S.: Usually, the learned directions are stored as .npy or .pt files within a GitHub repo.

Conclusion

In this tutorial, we deciphered the meaning of a few commonly occurring terms/concepts/keywords in the GAN domain, such as latent space, interpolation, inversion, extended latent space, etc. Hopefully, when you stumble across these terms in the literature, you will have a fair idea of how to interpret them.

From here on, I think you are all set to tackle most GAN papers head-on. I highly encourage you to read some of these papers (my current favorites are this, this, and this) as the level of implementational and architectural detail in there is beyond any article or blog’s coverage capacity. As always, if I skipped over something crucial or you have an even simpler explanation, please feel free to bring it to my attention.

Until next time :)
