
Analyzing how StyleGAN works: style incorporation in high-quality image generation

In a previous post, we discussed 2K image-to-image translation, video-to-video synthesis, and large-scale class-conditional image generation, namely pix2pixHD, vid-to-vid, and BigGAN.

Nikolas Adaloglou
Towards Data Science
11 min read · May 9, 2020


But how far are we from generating realistic style-based images? Take a quick glance at how stylish a real photo can be:

Photo by anabelle carite on Unsplash

To this end, in this part, we will focus on style incorporation via adaptive instance normalization. To do so, we will revisit the concepts of in-layer normalization, which will prove quite useful for our understanding of GANs.

No style, no party!

StyleGAN (A Style-Based Generator Architecture for Generative Adversarial Networks, 2018)

Building on our understanding of GANs, instead of just generating images, we will now be able to control their style! How cool is that? But, wait a minute. We already saw that InfoGAN can control image generation through disentangled representations.

This work is heavily dependent on Progressive GANs, Adaptive Instance Normalization (AdaIN), and style transfer. We have already covered Progressive GANs in the previous part, so let’s dive into the rest of them before we focus on understanding this work.

1. Understanding feature space normalization and style transfer

The human visual system is strongly attuned to image statistics. It is known that spatially invariant statistics such as channel-wise mean and variance reliably encode the style of an image. Meanwhile, spatially varying features encode a specific instance.

Batch normalization

Image by Author, originally written in Latex
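For reference, the batch normalization statistics and per-channel affine transform depicted above are the standard ones from [2] and can be written as:

\mu_c(x) = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw}

\sigma_c(x) = \sqrt{ \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \big( x_{nchw} - \mu_c(x) \big)^2 + \epsilon }

\mathrm{BN}(x)_{nchw} = \gamma_c \, \frac{x_{nchw} - \mu_c(x)}{\sigma_c(x)} + \beta_c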

In the depicted equations, N is the batch size, H the height, and W the width. The Greek letter μ() refers to the mean and σ() to the standard deviation. Similarly, γ and β are the trainable parameters of the linear/affine transformation, which is different for each channel. Specifically, γ and β are vectors with the channel dimensionality. The batch features x have a shape of [N, C, H, W], and the index c indicates that the statistics are computed per channel. Notably, the averaging is performed over the spatial dimensions as well as the batch dimension. This way, we concentrate our features in a compact space, which is usually beneficial.

However, in terms of style and global characteristics, all samples in the batch share the same learned parameters, namely γ and β. Therefore, BN can be intuitively understood as normalizing a batch of images to be centered around a single style. Still, the convolutional layers are able to learn some intra-batch style differences, so every single sample may still have a different style. This, for example, is undesirable if you want to transfer all images to the same shared style (e.g., Van Gogh’s style).

But what if we don’t mix the feature batch characteristics?

Instance normalization

Different from the BN layer, Instance Normalization (IN) is computed only across the spatial dimensions of the features, independently for each channel (and each sample). Literally, we just remove the sum over N in the previous equation. Surprisingly, it has been experimentally validated that the affine parameters in IN can completely change the style of the output image. As opposed to BN, IN can normalize the style of each individual sample to a target style (modeled by γ and β). For this reason, training a model to transfer to a specific style is easier: the rest of the network can focus its learning capacity on content manipulation and local details, while discarding the original global characteristics (i.e. the style information).
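As a rough sketch (not an official implementation), instance normalization of a [N, C, H, W] feature tensor could look like this in PyTorch:

import torch

def instance_norm(x, gamma, beta, eps=1e-5):
    # x: [N, C, H, W]; gamma, beta: [C] learnable per-channel style parameters
    # Statistics are computed per sample and per channel,
    # i.e. only over the spatial dimensions H and W.
    mu = x.mean(dim=(2, 3), keepdim=True)          # [N, C, 1, 1]
    sigma = x.std(dim=(2, 3), keepdim=True) + eps  # [N, C, 1, 1]
    x_hat = (x - mu) / sigma
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

Adding the batch dimension to the reduction (dim=(0, 2, 3)) would give batch normalization instead; that is exactly the "sum over N" difference mentioned above.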

In this manner, by introducing multiple sets of (γ, β), one can design a network that models a finite number of styles, which is exactly what conditional instance normalization does.

Adaptive Instance Normalization (AdaIN)

The idea of transferring the style of another image now starts to feel natural. What if γ and β were injected from the feature statistics of another image? In this way, we would be able to model any arbitrary style by simply supplying the desired feature map’s mean as β and standard deviation as γ. AdaIN does exactly that: it receives a content input x and a style input y, and simply aligns the channel-wise mean and variance of x to match those of y. Mathematically:

Image by Author, originally written in Latex
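The depicted equation is the AdaIN operation as defined by Xun Huang et al.:

\mathrm{AdaIN}(x, y) = \sigma(y) \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \mu(y)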

That’s all! So what can we do with just a single layer with this minor modification? Let us see!

Image by Xun Huang et al. Source: https://arxiv.org/abs/1703.06868

In the upper part, you see a simple encoder-decoder network architecture with an extra AdaIN layer for style alignment. In the lower part, you see some results of this amazing idea! To summarize, AdaIN performs style transfer (in the feature space) by aligning the channel-wise feature statistics (μ and σ), at no additional cost in terms of complexity. If you want to play around with this idea, code is available here (official) and here (unofficial).
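For intuition, a minimal AdaIN layer can be sketched in a few lines (a simplified sketch, not the linked implementations):

import torch

def adain(content, style, eps=1e-5):
    # content, style: [N, C, H, W] feature maps
    # Align the channel-wise mean/std of the content features
    # to those of the style features.
    c_mu = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mu = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mu) / c_std + s_mu

Note that there are no learnable parameters here: the scale and bias come entirely from the style features.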

2. The style-based generator

Let’s go back to our original goal of understanding Style-GAN. Basically, with this work, NVIDIA fundamentally reshaped how we understand and design the majority of GAN generators. Let’s see how.

In a common GAN generator, the sampled latent vector z is projected and reshaped, and then further processed by transposed convolutions or upsampling layers (with or without convolutions). Here, the latent vector is first transformed by a series of fully connected layers, the so-called mapping network f! This results in another learned vector w, which lives in the intermediate latent space W. But why would somebody do that?

Image by StyleGAN paper, Source: https://arxiv.org/abs/1812.04948

Mapping network f

The major reason for this choice is that the intermediate latent space W does not have to support sampling according to any fixed distribution; its sampling density is instead induced by the learned continuous mapping f(z). This mapping “unwraps” the space W, so it implicitly enforces a sort of disentangled representation. This means that the factors of variation become more linear. The authors argue that it is easier to generate realistic images based on a disentangled representation than on an entangled one. With this totally unsupervised trick, we at least expect W to be less entangled than the Z space. Let’s move on to the depicted A in the figure.
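As a rough sketch (the dimensionality and activation follow the paper, everything else is simplified), the mapping network is just a stack of fully connected layers:

import torch.nn as nn

class MappingNetwork(nn.Module):
    # Maps a latent code z (512-d in the paper) to an intermediate
    # latent code w via 8 fully connected layers with leaky ReLU.
    # (The official code also normalizes z first; omitted here.)
    def __init__(self, dim=512, n_layers=8):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)  # w, a vector in the intermediate space W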

A block: the style features

A blocks are learned affine transformations that convert w into styles y = (y_s, y_b), i.e. the per-channel scale and bias that are fed to each AdaIN layer of the synthesis network.

B block: noise

Furthermore, the authors provide G with explicit noise inputs as a direct way to model stochastic details. The noise inputs (B blocks) are single-channel images of uncorrelated Gaussian noise, fed as additive noise to each layer of the synthesis network. The single noise image is broadcast to all feature maps of the layer, scaled by learned per-feature factors.

Synthesis network g

Except for the initial block, each layer starts with an upsampling operation that doubles the spatial resolution, followed by a convolution block. After each convolution, 2D per-pixel noise is added to model stochasticity. Finally, with the magic AdaIN layer, the style derived from the intermediate latent vector w is injected.
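Putting the pieces together, one style block of the synthesis network can be sketched roughly as follows (a simplified sketch: the layer shapes and the style projection are assumptions for illustration, and the real network uses two conv/noise/AdaIN stages per resolution):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SynthesisBlock(nn.Module):
    # upsample -> conv -> add scaled noise (B) -> AdaIN with style from w (A)
    def __init__(self, in_ch, out_ch, w_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.noise_scale = nn.Parameter(torch.zeros(1, out_ch, 1, 1))  # B: learned per-channel noise scaling
        self.to_style = nn.Linear(w_dim, 2 * out_ch)                   # A: affine map w -> (scale, bias)

    def forward(self, x, w):
        x = F.interpolate(x, scale_factor=2, mode='nearest')           # double the spatial resolution
        x = self.conv(x)
        noise = torch.randn(x.size(0), 1, x.size(2), x.size(3), device=x.device)
        x = x + self.noise_scale * noise                               # per-pixel noise, broadcast over channels
        y_s, y_b = self.to_style(w).chunk(2, dim=1)                    # per-channel style scale and bias
        mu = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True) + 1e-5
        x = (x - mu) / std                                             # instance-normalize
        return y_s[:, :, None, None] * x + y_b[:, :, None, None]      # inject the style (AdaIN)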

Style mixing and truncation tricks

Instead of truncating the latent vector z as in BigGAN, they apply the truncation in the intermediate latent space W. This is implemented as w' = E(w) + ψ (w − E(w)), where E(w) = E(f(z)) denotes the expected (average) intermediate latent vector. The important parameter that controls sample quality is ψ: as it gets closer to 0, the sampled faces converge to the mean face of the dataset. Truncation in W space seems to work well, as illustrated in the image below:
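In code, the truncation trick is just a couple of lines (a sketch; w_avg would be a running average of f(z) over many random samples):

def truncate(w, w_avg, psi=0.7):
    # w: [N, 512] sampled intermediate latents, w_avg: [512] average of f(z)
    # psi = 0 collapses every sample to the "mean" face; psi = 1 leaves w unchanged.
    return w_avg + psi * (w - w_avg)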

Image by StyleGAN paper, Source: https://arxiv.org/abs/1812.04948

As perfectly described by the original paper:

“It is interesting that various high-level attributes often flip between the opposites, including viewpoint, glasses, age, coloring, hair length, and often gender. “ ~ Tero Karras et al

Another trick that was introduced is style mixing. Two latent codes z1 and z2 are sampled from Z and mapped to two intermediate latent vectors w1 and w2, corresponding to two styles. w1 is applied to the layers before a randomly chosen crossover point in the synthesis network and w2 to the layers after it. This regularization trick prevents the network from assuming that adjacent styles are correlated. A percentage of generated images (usually 90%) use this trick, and the crossover point is chosen at a different place in the network every time.
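A sketch of how style mixing could be applied per layer during training (the indexing scheme is an assumption for illustration):

import random

def mix_styles(w1, w2, n_layers, prob=0.9):
    # Returns one intermediate latent per synthesis layer. With probability
    # `prob`, layers before a random crossover point use w1, the rest use w2.
    if random.random() < prob:
        crossover = random.randint(1, n_layers - 1)
        return [w1 if i < crossover else w2 for i in range(n_layers)]
    return [w1] * n_layers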

3. Looking through the design choices of the style-based generator

Each injected style and noise input (blocks A and B) is localized in the network. This means that modifying a specific subset of styles/noises is expected to affect only certain aspects of the image.

Style: Let’s see the reason for this localization, starting with style. As we saw extensively, the AdaIN operation first normalizes each channel to zero mean and unit variance, and only then applies the style-based scales and biases. In this way, the style modifies the feature statistics for the subsequent convolution operation, while the previous statistics/styles are discarded at the next AdaIN layer. Thus, each style controls only one convolution before being overridden by the next AdaIN operation.

Noise: In a conventional generator, the latent vector z is fed only at the input of the network. This is considered sub-optimal, as it consumes the generator’s learning capacity: the network has to invent a way to generate spatially varying pseudorandom numbers from earlier activations whenever it needs them.

By adding per-pixel noise after each convolution, the authors experimentally validate that the effect of noise appears localized in the network. New noise is available for every layer, similar to BigGAN, and thus there is no motivation to generate the stochastic effects from previous activations.

All the above can be illustrated below:

Image by StyleGAN paper, Source: https://arxiv.org/abs/1812.04948

On the left, we have the generated image. In the middle, four different noise realizations are applied to a selected sub-region. On the right, the standard deviation over a large set of samples with different noise inputs.

4. Quantifying disentanglement of spaces

The awesomeness arises from the fact that they were also able to quantify the disentanglement of spaces for the first time! Because if you can’t measure it, it doesn’t exist! To this end, they introduce two new ways of quantifying the disentanglement of spaces.

Perceptual path length

In case you don’t feel comfortable with entangled and disentangled representations, you can revisit InfoGAN. In very simple terms, entangled means mixed, while disentangled means encoded but separable in some way. I like to think of a disentangled representation as a kind of decoded, lower-dimensional description of the data.

Let us suppose that we have two latent-space vectors and we would like to interpolate between them. How could we tell whether this “walk” through the latent space corresponds to an entangled or a disentangled representation? Intuitively, a less curved latent space should result in a perceptually smooth transition, as observed in the image.

Interpolation of latent-space vectors can tell us a lot. For instance, non-linear, non-smooth, and sharp changes may appear in the image. What would you call such a representation? As an example, features that are absent at either endpoint can appear in the middle of a linear interpolation path. This is a sign of a messy world, namely an entangled representation.

The quantification comes in terms of a small step ε in the latent space. If we subdivide a latent-space interpolation path into small segments, we can measure the distance between two consecutive steps, specifically t and t + ε, where t ∈ [0, 1]. However, it is more meaningful to measure this distance on the generated images. So, one can aggregate the distances over all steps of the walk between two random latent samples z1 and z2. Note that the distance is measured in image space; in this work, it is the pairwise distance between two VGG network embeddings. Mathematically, this can be described as (slerp denotes spherical interpolation):

Image by Author, originally written in Latex
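The depicted formula, following the paper’s notation, is the perceptual path length in Z:

l_Z = \mathbb{E}\left[ \frac{1}{\epsilon^2} \, d\Big( G\big(\mathrm{slerp}(z_1, z_2; t)\big), \; G\big(\mathrm{slerp}(z_1, z_2; t + \epsilon)\big) \Big) \right], \quad t \sim U(0, 1)

where d is the perceptual (VGG-embedding) distance. For the intermediate space W, slerp is replaced by linear interpolation of w = f(z).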

Interestingly, they found that adding noise roughly halves the path length (average distance), while mixing styles increases it a little (about +10%). Furthermore, this measurement shows that the 8-layer fully connected architecture clearly produces an intermediate latent space W that is more disentangled than Z.

Linear separability

Let’s see how this works. Hold tight!

  1. We start by generating 200K images with z ∼ P(z) and classify them using an auxiliary classification network, producing labels Y.
  2. We keep 50% of samples with the highest confidence score. This results in 100k high-score auto-labeled (Y) latent-space vectors z for progressive GAN and w for the Style-GAN.
  3. We fit a linear SVM to predict the label Y based only on the latent-space point (z for the Progressive GAN and w for the Style-GAN) and classify the points by this hyperplane.
  4. We compute the conditional entropy H( Y | X) where X represents the classes predicted by the SVM and Y are the classes determined by the classifier.
  5. We calculate the separability score as exp(Σ H(Y|X)), summing over all the given attributes of the dataset; we basically fit a separate SVM for each attribute (see the sketch right after this list). Note that the CelebA dataset contains 40 attributes, such as gender.
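A rough sketch of steps 3 to 5 for a single attribute, using scikit-learn (the variable names and the entropy helper are assumptions for illustration, not the paper’s code):

import numpy as np
from sklearn.svm import LinearSVC

def conditional_entropy_term(latents, labels):
    # latents: [100000, 512] array of kept z (or w) vectors
    # labels:  [100000] integer auto-labels Y for one binary attribute
    svm = LinearSVC().fit(latents, labels)
    X = svm.predict(latents)                      # classes predicted by the SVM
    h = 0.0
    for x_val in np.unique(X):                    # H(Y | X) in bits
        mask = X == x_val
        p_x = mask.mean()
        p_y = np.bincount(labels[mask]) / mask.sum()
        p_y = p_y[p_y > 0]
        h -= p_x * (p_y * np.log2(p_y)).sum()
    return h

# Separability score: exponentiate the sum over all 40 attributes
# score = np.exp(sum(conditional_entropy_term(latents, y) for y in attribute_labels))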

Quantitative results can be observed in the following table:

Image by StyleGAN paper, Source: https://arxiv.org/abs/1812.04948

In essence, the conditional entropy H tells us how much additional information is required to determine the true class of a sample, given the SVM label X. An ideal linear SVM would let us know Y with full certainty, resulting in an entropy of 0. A high entropy would suggest high uncertainty, meaning that the label given by the linear SVM is not informative at all. Less literally, the lower the entropy H, the better.

Conclusion

The engineering solutions that are proposed for GANs never cease to amaze me. In this article, we saw an exciting design that injects the style of a reference image via Adaptive Instance Normalization. Style-GAN is definitely one of the most revolutionary works in the field. Finally, we highlighted the proposed disentanglement metrics, namely perceptual path length and linear separability, which let us dive into more and more advanced concepts in this series.

As always, we focus on intuitions, and we hope that you are not discouraged from starting to experiment with GANs. If you would like to start experimenting with a bunch of models to reproduce the state-of-the-art results, you should definitely check this open-source library in TensorFlow or this one in PyTorch.

The next part is available on AI Summer!

References

[1] Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4401–4410).

[2] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

[3] Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.

[4] Miyato, T., Kataoka, T., Koyama, M., & Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.

Originally published at https://theaisummer.com on May 9, 2020.

